I filed an issue about a problem I see with PrunedScan that causes sub-optimal performance in ML pipelines. Sorry if the issue is already known.

Having tried a few approaches to working with large binary files in Spark ML, I prefer loading the data into a vector-type column from a relation that supports pruned scans. I think this is better than a lazy-loading scheme based on binaryFiles/PortableDataStream. SPARK-8794 undermines this approach.

Eron