Dandandan commented on issue #470: URL: https://github.com/apache/arrow-ballista/issues/470#issuecomment-1294446774
I think async listing/planning of tasks is a good solution. Listing implementations already support returning 1000 files in one call, which should be enough to keep a cluster busy (with 1000 tasks) until the next page arrives. Any real inspection of the data (gathering Parquet metadata/stats) should preferably move to the executors for larger tables. For partitioned data, the listing can also be parallelized (one listing per partition) if all partition values are known. Additionally, formats like Delta/Iceberg avoid this problem because the files, metadata, and stats are already available in the table format.
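
To make the async listing idea concrete, here is a minimal sketch in Python with `asyncio`. It is not Ballista code: `list_pages`, `run_task`, `plan_and_run`, and `list_partitioned` are hypothetical names, the listing is simulated from an in-memory list, and the page size of 1000 is taken from the comment rather than from any particular object store. The point is only that tasks for one page can start while the next page is still being fetched, and that partitions with known values can be listed concurrently.

```python
import asyncio
from typing import AsyncIterator, Dict, List

PAGE_SIZE = 1000  # assumed page size, as mentioned in the comment


async def list_pages(files: List[str], page_size: int = PAGE_SIZE) -> AsyncIterator[List[str]]:
    """Hypothetical stand-in for a paginated object-store listing call."""
    for start in range(0, len(files), page_size):
        await asyncio.sleep(0)  # simulate waiting for the next page
        yield files[start:start + page_size]


async def run_task(path: str) -> str:
    """Hypothetical executor task: one task per listed file."""
    await asyncio.sleep(0)  # simulate work on an executor
    return f"scanned {path}"


async def plan_and_run(files: List[str]) -> List[str]:
    """Schedule tasks page by page instead of waiting for the full listing."""
    running: List[asyncio.Task] = []
    async for page in list_pages(files):
        # Tasks for this page start now and run while the next page is fetched.
        running.extend(asyncio.create_task(run_task(path)) for path in page)
    return await asyncio.gather(*running)


async def list_partitioned(partitions: Dict[str, List[str]]) -> List[str]:
    """List every known partition concurrently and merge the results."""

    async def list_one(files: List[str]) -> List[str]:
        return [path async for page in list_pages(files) for path in page]

    per_partition = await asyncio.gather(*(list_one(f) for f in partitions.values()))
    return [path for listed in per_partition for path in listed]
```

Calling `asyncio.run(plan_and_run(files))` with 2500 file names would go through three listing pages, with the tasks for each page scheduled as soon as that page arrives. A real scheduler would also bound the number of in-flight tasks, which this sketch leaves out.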
