alamb commented on issue #17169:
URL: https://github.com/apache/datafusion/issues/17169#issuecomment-3191899056

   I tried to get DataFusion to use the sort order and avoid using much memory  
but was not successful
   
   I can resort the file quite efficiently with minimal RAM no problem:
   
   ```shell
   $ datafusion-cli -m 4GB
   ```
   
   ```
   > COPY (
     SELECT *
       FROM
         
'https://digitalsociety-public.fsn1.your-objectstorage.com/f94d0c87-8798-4bf6-9c98-8d89971e2539.parquet'
       ORDER BY
         "ADDRESS1",
         "ADDRESS2",
         "ADDRESS3",
         "POSTCODE",
         "INSPECTION_DATE" DESC
     )
   to 'f94d0c87-8798-4bf6-9c98-8d89971e2539.reordered.parquet';
   +---------+
   | count   |
   +---------+
   | 1523389 |
   +---------+
   1 row(s) fetched.
   Elapsed 41.416 seconds.
   ```
   
   
   Then I can tell DataFusion it is sorted:
   
   ```sql
   > CREATE EXTERNAL TABLE t
   STORED AS PARQUET
   LOCATION 'f94d0c87-8798-4bf6-9c98-8d89971e2539.reordered.parquet'
   WITH ORDER ( "ADDRESS1", "ADDRESS2", "ADDRESS3", "POSTCODE", 
"INSPECTION_DATE" DESC);
   0 row(s) fetched.
   Elapsed 0.004 seconds.
   ```
   
   But when I try to deduplicate this I still run out of memory:
   
   ```sql
    SELECT DISTINCT
     ON ("ADDRESS1", "ADDRESS2", "ADDRESS3", "POSTCODE") *
   FROM
     t
   ORDER BY
     "ADDRESS1",
     "ADDRESS2",
     "ADDRESS3",
     "POSTCODE",
     "INSPECTION_DATE" DESC;
   
   Resources exhausted: Additional allocation failed with top memory consumers 
(across reservations) as:
     ExternalSorter[7]#306(can spill: true) consumed 227.7 MB,
     ExternalSorter[13]#324(can spill: true) consumed 227.5 MB,
     ExternalSorter[14]#327(can spill: true) consumed 227.2 MB.
   Error: Failed to allocate additional 679.8 MB for 
GroupedHashAggregateStream[11] (, , , , , , , , , , , , , , , , , , , , , , , , 
, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , 
, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , 
, ) with 6.8 MB already allocated for this reservation - 448.6 MB remain 
available for the total pool
   ```
   
   I would have expected that this could use `GroupOrderingFull` and avoid 
large amounts of memory usage, but that does not appear to be the case. Maybe 
it is a bug we should look into 🤔 
   
   
https://github.com/apache/datafusion/blob/1daa5ed5cc51546904d45e23cc148601d973942a/datafusion/physical-plan/src/aggregates/order/mod.rs#L38-L39


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to