On Wed, 22 Jul 2020 at 00:51, Holden Karau <hol...@pigscanfly.ca> wrote:

> Hi Folks,
>
> In Spark SQL there is the ability to have Spark do its partition
> discovery/file listing in parallel on the worker nodes and also avoid
> locality lookups. I'd like to expose this in core, but given the Hadoop
> APIs it's a bit more complicated to do right. I
>

That's ultimately fixable, if we can sort out what's good from the app side
and reconcile that with "what is not pathologically bad across both HDFS
and object stores".

Bad: globStatus, and anything which returns an array rather than a remote
iterator or encourages a treewalk.
Good: deep recursive listings, and remote iterator results for
incremental/async fetch of the next page of the listing. Soon: the option
for the iterator, if cast to IOStatisticsSource, to actually serve up stats
on IO performance during the listing (e.g. number of list calls, mean time
to get a list response back, store throttle events).
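A minimal sketch of that "good" pattern, for the archives: a deep recursive
listing through the Hadoop FileSystem API, consuming the remote iterator page
by page. The IOStatisticsSource part assumes Hadoop 3.3.1+ on the classpath
(earlier releases don't ship the statistics API, so that block would just be
dropped), and the stats may be null on stores that don't publish them yet.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, LocatedFileStatus, Path, RemoteIterator}
    import org.apache.hadoop.fs.statistics.IOStatisticsSource

    object DeepListingSketch {
      // Deep recursive listing via listFiles(path, recursive = true): the
      // RemoteIterator fetches listing pages incrementally rather than
      // materialising one big array up front.
      def listAll(pathStr: String): Seq[LocatedFileStatus] = {
        val path = new Path(pathStr)
        val fs = path.getFileSystem(new Configuration())
        val it: RemoteIterator[LocatedFileStatus] = fs.listFiles(path, true)
        val results = scala.collection.mutable.ArrayBuffer[LocatedFileStatus]()
        while (it.hasNext) {
          results += it.next()
        }
        // Newer Hadoop: the iterator may expose IO statistics (list calls,
        // mean list latency, throttle events); may be null if the store
        // doesn't provide them.
        it match {
          case src: IOStatisticsSource => println(src.getIOStatistics)
          case _ => // older iterator: no stats available
        }
        results.toSeq
      }
    }
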

Also look at LocatedFileStatusFetcher to see how it parallelises its work.
It's not perfect because wildcards are supported, which means globStatus
gets used.
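For reference, a hedged sketch of switching that parallel listing on from the
MapReduce input-format side; the list-status thread-count key is what makes
FileInputFormat hand the work to LocatedFileStatusFetcher, and the value 16
and the input path here are purely illustrative:

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.mapreduce.Job
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

    object ParallelListingConfigSketch {
      def configure(): Job = {
        val job = Job.getInstance()
        // With num-threads > 1, FileInputFormat delegates its listing to
        // LocatedFileStatusFetcher, which fans the per-directory listings
        // out across a thread pool instead of walking them serially.
        job.getConfiguration.setInt(
          "mapreduce.input.fileinputformat.list-status.num-threads", 16) // illustrative value
        // hypothetical input path, purely for illustration
        FileInputFormat.setInputPaths(job, new Path("hdfs://namenode/data/input"))
        job
      }
    }
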

happy to talk about this some more, and I'll review the patch

-steve


> made a quick POC and two potential different paths we could do for
> implementation and wanted to see if anyone had thoughts -
> https://github.com/apache/spark/pull/29179.
>
> Cheers,
>
> Holden
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
