alamb commented on issue #17169: URL: https://github.com/apache/datafusion/issues/17169#issuecomment-3191899056
I tried to get DataFusion to use the sort order and avoid using much memory but was not successful I can resort the file quite efficiently with minimal RAM no problem: ```shell $ datafusion-cli -m 4GB ``` ``` > COPY ( SELECT * FROM 'https://digitalsociety-public.fsn1.your-objectstorage.com/f94d0c87-8798-4bf6-9c98-8d89971e2539.parquet' ORDER BY "ADDRESS1", "ADDRESS2", "ADDRESS3", "POSTCODE", "INSPECTION_DATE" DESC ) to 'f94d0c87-8798-4bf6-9c98-8d89971e2539.reordered.parquet'; +---------+ | count | +---------+ | 1523389 | +---------+ 1 row(s) fetched. Elapsed 41.416 seconds. ``` Then I can tell DataFusion it is sorted: ```sql > CREATE EXTERNAL TABLE t STORED AS PARQUET LOCATION 'f94d0c87-8798-4bf6-9c98-8d89971e2539.reordered.parquet' WITH ORDER ( "ADDRESS1", "ADDRESS2", "ADDRESS3", "POSTCODE", "INSPECTION_DATE" DESC); 0 row(s) fetched. Elapsed 0.004 seconds. ``` But when I try to deduplicate this I still run out of memory: ```sql SELECT DISTINCT ON ("ADDRESS1", "ADDRESS2", "ADDRESS3", "POSTCODE") * FROM t ORDER BY "ADDRESS1", "ADDRESS2", "ADDRESS3", "POSTCODE", "INSPECTION_DATE" DESC; Resources exhausted: Additional allocation failed with top memory consumers (across reservations) as: ExternalSorter[7]#306(can spill: true) consumed 227.7 MB, ExternalSorter[13]#324(can spill: true) consumed 227.5 MB, ExternalSorter[14]#327(can spill: true) consumed 227.2 MB. Error: Failed to allocate additional 679.8 MB for GroupedHashAggregateStream[11] (, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ) with 6.8 MB already allocated for this reservation - 448.6 MB remain available for the total pool ``` I would have expected that this could use `GroupOrderingFull` and avoid large amounts of memory usage, but that does not appear to be the case. Maybe it is a bug we should look into 🤔 https://github.com/apache/datafusion/blob/1daa5ed5cc51546904d45e23cc148601d973942a/datafusion/physical-plan/src/aggregates/order/mod.rs#L38-L39 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org