Re: [I] Select DISTINCT with LIMIT 10 is doing a full-scan of the database [arrow-datafusion]

via GitHub Mon, 09 Oct 2023 13:44:48 -0700


alamb commented on issue #7781:
URL: https://github.com/apache/arrow-datafusion/issues/7781#issuecomment-1753800877


   > I then looked at the query plan, and it seems like it's actually doing 
GROUP BY my_column which causes a full-scan, 
   
   Yes, that is what I would expect for this kind of query
   
   > what makes it even worse, is that all 10 values returned are present in 
the first parquet file in the dataset (pyarrow.Dataset.files[0]), so it 
could've just stopped scanning after the first file immediately.
   
   
   While it happens to be the case for your particular dataset that all values 
are present in the first file, I don't think there is any way for DataFusion 
to know that. To answer the query faithfully it needs to check all the files 
(what if there was a new value in the last file?)
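
As a rough illustration of the plan shape described above (DISTINCT evaluated as a GROUP BY, with the LIMIT applied afterwards), here is a toy sketch in plain Python. It is not DataFusion code: the function `distinct_then_limit` and the in-memory `files` list are hypothetical stand-ins for the aggregation and the parquet files.

```python
from typing import Dict, Iterable, List, Tuple


def distinct_then_limit(files: Iterable[List[str]], limit: int) -> Tuple[List[str], int]:
    """Toy model: build all groups first, apply LIMIT only at the end."""
    groups: Dict[str, None] = {}  # hash table of group keys (insertion-ordered)
    files_scanned = 0
    for rows in files:
        files_scanned += 1
        for value in rows:
            # Every row is fed to the aggregation, even if enough
            # distinct values have already been seen.
            groups.setdefault(value, None)
    # The limit is applied only after the aggregation has consumed all input.
    return list(groups)[:limit], files_scanned


# Hypothetical data: each inner list stands in for one parquet file.
files = [["a", "b", "c"], ["a", "b"], ["c", "d"]]
values, scanned = distinct_then_limit(files, limit=3)
print(values, scanned)
```

In this toy model all three returned values come from the first "file", yet every file is still read, because the limit sits above the aggregation rather than inside it.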
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
