a-campbell opened a new issue #8646:
URL: https://github.com/apache/arrow/issues/8646


   Hi Arrow community,
   
   I'm new to the project and am trying to understand exactly what is happening 
under the hood when I run a filter-collect query on an Arrow Dataset (backed by 
Parquet).
   
   Let's say I created a Parquet dataset with no file-level partitions. I just 
wrote a bunch of separate files to a dataset. Now I want to run a query that 
returns the rows corresponding to a specific range of datetimes in the 
dataset's dt column.
   
   My understanding is that the Dataset API will push this query down to the 
file level, checking each file's footer for the min/max values of dt to 
determine whether that file's rows need to be read.
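
To make that understanding concrete, here is a minimal sketch in plain Python of the pruning I have in mind. It does not use the Arrow API and is not a description of how Arrow implements it: `RowGroupStats`, `may_contain`, and `prune` are hypothetical names, and the statistics are made up for illustration. The one assumption carried over from Parquet is that the footer records min/max statistics per row group.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class RowGroupStats:
    """Hypothetical stand-in for the min/max statistics kept in a Parquet footer."""
    file: str
    row_group: int
    dt_min: datetime
    dt_max: datetime


def may_contain(stats: RowGroupStats, start: datetime, end: datetime) -> bool:
    """Return True if the row group could hold rows with start <= dt < end."""
    return stats.dt_max >= start and stats.dt_min < end


def prune(all_stats, start, end):
    """Keep only the row groups whose [min, max] range overlaps the query range."""
    return [s for s in all_stats if may_contain(s, start, end)]


# Made-up footer statistics for two files (illustrative values only).
footers = [
    RowGroupStats("a.parquet", 0, datetime(2020, 1, 1), datetime(2020, 1, 31)),
    RowGroupStats("a.parquet", 1, datetime(2020, 2, 1), datetime(2020, 2, 29)),
    RowGroupStats("b.parquet", 0, datetime(2020, 3, 1), datetime(2020, 3, 31)),
]

# Query: 2020-02-10 <= dt < 2020-03-05
selected = prune(footers, datetime(2020, 2, 10), datetime(2020, 3, 5))
print([(s.file, s.row_group) for s in selected])
# → [('a.parquet', 1), ('b.parquet', 0)]
```

In this sketch every footer still has to be consulted once to obtain the statistics, which is what prompts the caching question below.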
   
   Assuming this is correct, a few questions:
   
   Will every query result in reading all of the file footers? Is there any 
caching of these min/max values?
   
   Is there a way to profile query performance, or to view a query plan 
before it is executed?
   
   I appreciate your time in helping me better understand.
   
   Andrew
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

