Dandandan commented on PR #21351: URL: https://github.com/apache/datafusion/pull/21351#issuecomment-4250603405
> Awesome, so the PR changes who reads which file at runtime using morselizer, would be extremely interesting to try this on many small files environments.
>
> Do we expect improvements for even partitions(partition have the similar number of files with similar sizes)?
>
> Is it planned to morselize deeper to process row groups in parallel?
>
> This activity actually reminds me of #19815 benchmark.

> Do we expect improvements for even partitions(partition have the similar number of files with similar sizes)

In my experience, there is always some partition skew, even for very balanced scans on a local FS. So such scans will still benefit from the morsel-based scan (though only by something like 5-10%), as long as there are enough morsels to spread the work across (at least more morsels than cores). Scans against an object store will almost always benefit, because response times vary hugely. OTOH, the relative improvement will be smaller for larger / more evenly balanced datasets.

> Is it planned to morselize deeper to process row groups in parallel?

Yes, the plan is to split morsels into sub-row-group morsels, so that smaller datasets (e.g. TPC-DS at SF=1, which has single-row-group files) and machines with many cores will benefit more as well. Currently, parallelism is limited for datasets with few row groups, because a morsel can't be smaller than a row group, which leaves too little parallel work for the available cores.
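To make the skew argument concrete, here is a toy model of the two scheduling strategies. It is only a sketch and not DataFusion code: `static_makespan`, `morsel_makespan`, and the per-file costs are hypothetical names and numbers chosen for illustration, and each "cost" stands in for the time to read one file or morsel.

```python
import heapq


def static_makespan(costs, workers):
    """Files are pre-assigned round-robin; each worker reads only its own files."""
    loads = [0] * workers
    for i, cost in enumerate(costs):
        loads[i % workers] += cost
    # The scan finishes when the most heavily loaded worker finishes.
    return max(loads)


def morsel_makespan(costs, workers):
    """Idle workers pull the next morsel from a shared queue."""
    free_at = [0] * workers  # min-heap of times at which each worker is idle
    heapq.heapify(free_at)
    for cost in costs:
        start = heapq.heappop(free_at)  # earliest idle worker takes the morsel
        heapq.heappush(free_at, start + cost)
    return max(free_at)


# Hypothetical read costs: every fourth file is slow, so round-robin
# assignment puts all slow files on the same worker.
costs = [4, 1, 1, 1, 4, 1, 1, 1, 4, 1, 1, 1]

print(static_makespan(costs, 4))  # 12: one worker gets all three slow files
print(morsel_makespan(costs, 4))  # 6: slow files are spread across workers

# A single row-group-sized morsel cannot be shared, however many cores exist.
print(morsel_makespan([8], 4))           # 8
# Splitting it into four hypothetical sub-row-group morsels restores parallelism.
print(morsel_makespan([2, 2, 2, 2], 4))  # 2
```

In this model the morsel-based schedule only helps when there are more morsels than workers, and a file that cannot be split bounds the total time from below, which is the motivation for going below row-group granularity.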
