Hi folks,

Spark SQL can already do its partition discovery and file listing in parallel on the worker nodes, and it can skip locality lookups while doing so. I'd like to expose this in core, but given the Hadoop APIs it's a bit more complicated to do right. I put together a quick POC with two possible implementation paths and wanted to see if anyone had thoughts: https://github.com/apache/spark/pull/29179
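For anyone who has not opened the PR, the general idea can be sketched in plain Python. This is only an illustration and makes several assumptions: it is not the Spark or Hadoop API and not the code in the PR, a thread pool stands in for the worker nodes, the local filesystem stands in for a Hadoop filesystem, and the names `list_leaf_files` and `parallel_listing` are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def list_leaf_files(root):
    """Recursively list every file under one root directory."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.append(os.path.join(dirpath, name))
    return found


def parallel_listing(roots, max_workers=4):
    """List several roots concurrently and merge the results.

    Each root is handed to a separate worker (hypothetical stand-in for
    a Spark executor). Only paths are collected; nothing here asks the
    filesystem where the data blocks live, which mirrors skipping the
    locality lookup.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_root = pool.map(list_leaf_files, roots)
        # Consume the results while the pool is still open.
        return sorted(path for files in per_root for path in files)
```

Calling `parallel_listing(["/data/a", "/data/b"])` (hypothetical paths) would return one sorted list of every file under both roots. The sketch leaves out the parts that make the real problem harder, such as serializing the Hadoop configuration to the workers and handling filesystems that return locality information by default.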

Cheers,
Holden

--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau