I filed an issue about a problem I see with PrunedScan that causes suboptimal
performance in ML pipelines.
Apologies if it is already known.
Having tried a few approaches to working with large binary files in Spark ML,
I prefer loading the data into a vector-type column from a relation that
supports pruned scan. I think this is better than a lazy-loading scheme based
on binaryFiles/PortableDataStream. SPARK-8794 undermines this approach.
Eron
