Dandandan commented on issue #470:
URL: https://github.com/apache/arrow-ballista/issues/470#issuecomment-1294446774

   I think async listing/planning tasks are a good solution. Listing 
implementations already support returning 1000 files in one call; that should 
be enough to keep a cluster busy (with 1000 tasks) until the next page arrives.
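
   As a rough illustration of the idea (not Ballista's actual scheduler), 
here is a minimal Python sketch in which tasks for one page of listed files are 
started while the next page is still being fetched. `list_pages`, `scan_file` 
and `plan_and_run` are hypothetical names, and the listing is simulated from an 
in-memory list rather than a real object store.

```python
import asyncio
from typing import AsyncIterator, List

PAGE_SIZE = 1000  # page size from the comment above; real stores may differ


async def list_pages(all_files: List[str],
                     page_size: int = PAGE_SIZE) -> AsyncIterator[List[str]]:
    """Hypothetical paged listing: yields one page of file paths at a time."""
    for start in range(0, len(all_files), page_size):
        await asyncio.sleep(0)  # stand-in for the object-store round trip
        yield all_files[start:start + page_size]


async def scan_file(path: str) -> str:
    """Hypothetical per-file task that an executor would run."""
    await asyncio.sleep(0)  # stand-in for real work
    return f"scanned:{path}"


async def plan_and_run(all_files: List[str]) -> List[str]:
    tasks = []
    async for page in list_pages(all_files):
        # Start this page's tasks immediately; they make progress while
        # the next page of the listing is being awaited.
        tasks.extend(asyncio.create_task(scan_file(p)) for p in page)
    # Results come back in the order the tasks were created.
    return await asyncio.gather(*tasks)


files = [f"part-{i:05d}.parquet" for i in range(2500)]
results = asyncio.run(plan_and_run(files))
```

   With 2500 simulated files this goes through three pages (1000, 1000, 500), 
and planning never waits for the full listing before handing out work.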
   
   Any real inspection of data (gathering parquet metadata / stats) should 
preferably move to executors for larger tables.
   
   For partitioned data it is also possible to parallelize the listing (for 
each partition) if all partition values are known.
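
   A minimal sketch of that, assuming hive-style `date=<value>/` prefixes and 
a hypothetical in-memory `STORE` standing in for the object store: one listing 
call per known partition value, all issued concurrently.

```python
import asyncio
from typing import Dict, List

# Hypothetical in-memory "object store": partition prefix -> files under it.
STORE: Dict[str, List[str]] = {
    "date=2022-10-01/": ["a.parquet", "b.parquet"],
    "date=2022-10-02/": ["c.parquet"],
    "date=2022-10-03/": [],
}


async def list_prefix(prefix: str) -> List[str]:
    """Hypothetical listing of a single partition prefix."""
    await asyncio.sleep(0)  # stand-in for the object-store round trip
    return [prefix + name for name in STORE.get(prefix, [])]


async def list_partitions(values: List[str]) -> List[str]:
    # The partition values are known up front, so every prefix can be
    # listed concurrently instead of walking the table root page by page.
    prefixes = [f"date={v}/" for v in values]
    per_partition = await asyncio.gather(*(list_prefix(p) for p in prefixes))
    # Flatten, keeping the order of the requested partitions.
    return [path for files in per_partition for path in files]


listed = asyncio.run(
    list_partitions(["2022-10-01", "2022-10-02", "2022-10-03"])
)
```

   This only works when the partition values are known ahead of time (for 
example from a catalog); otherwise the prefixes themselves have to be 
discovered by listing first.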
   
   Additionally, formats like delta / iceberg avoid this problem by already 
having the files/metadata/stats available in the format.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

