alamb opened a new issue, #18118:
URL: https://github.com/apache/datafusion/issues/18118

   ### Is your feature request related to a problem or challenge?
   
   Using the tooling that @BlakeOrth has been working on for instrumenting 
datafusion-cli, we can see what requests are actually being made when we query 
remote files:
   - https://github.com/apache/datafusion/issues/17207
   
   ```sql
   DataFusion CLI v50.2.0
   > \object_store_profiling trace
   ObjectStore Profile mode set to Trace
   > SELECT COUNT(*) from 
'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet'
 where "SearchPhrase" <> '';
   +----------+
   | count(*) |
   +----------+
   | 131559   |
   +----------+
   1 row(s) fetched.
   Elapsed 0.606 seconds.
   
   Object Store Profiling
   Instrumented Object Store: instrument_mode: Trace, inner: HttpStore
   2025-10-17T08:55:22.202041+00:00 operation=Get duration=0.025994s size=8 
range: bytes=174965036-174965043 
path=hits_compatible/athena_partitioned/hits_1.parquet
   2025-10-17T08:55:22.228064+00:00 operation=Get duration=0.028127s size=34322 
range: bytes=174930714-174965035 
path=hits_compatible/athena_partitioned/hits_1.parquet
   2025-10-17T08:55:22.295696+00:00 operation=Get duration=0.032303s size=15503 
range: bytes=5120273-5135775 
path=hits_compatible/athena_partitioned/hits_1.parquet
   2025-10-17T08:55:22.296663+00:00 operation=Get duration=0.060797s 
size=3895852 range: bytes=145483536-149379387 
path=hits_compatible/athena_partitioned/hits_1.parquet
   2025-10-17T08:55:22.330266+00:00 operation=Get duration=0.040970s size=61815 
range: bytes=46392516-46454330 
path=hits_compatible/athena_partitioned/hits_1.parquet
   ```
   
   There are 5!! requests made to read this file, annotated:
   ```
   * operation=Get size=8 range: bytes=174965036-174965043  <-- reads the last 
8 bytes to find metadata size
   * operation=Get size=34322 range: bytes=174930714-174965035   <-- Footer 
Metadata
   * operation=Get size=15503 range: bytes=5120273-5135775 <-- "SearchPhrase" 
data pages
   * operation=Get size=3895852 range: bytes=145483536-149379387  <-- 
"SearchPhrase" data pages
   * operation=Get size=61815 range: bytes=46392516-46454330 
   ```
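
As a rough illustration of why the first two requests exist: a Parquet file ends with a 4-byte little-endian metadata length followed by the 4-byte magic `PAR1`, so a reader that knows only the file size must fetch that 8-byte tail before it can locate the footer metadata. The sketch below is not DataFusion code; `footer_metadata_range` is a hypothetical helper, and the file size used in the usage note (174965044) is inferred from the last byte offset in the trace.

```python
import struct

PARQUET_MAGIC = b"PAR1"
FOOTER_TAIL_LEN = 8  # 4-byte little-endian metadata length + 4-byte magic


def footer_metadata_range(file_size: int, tail: bytes) -> tuple:
    """Return the inclusive (start, end) byte range of the footer metadata.

    `tail` is the last 8 bytes of the file. Hypothetical helper for
    illustration only, not part of DataFusion or the parquet crate.
    """
    if len(tail) != FOOTER_TAIL_LEN or tail[4:] != PARQUET_MAGIC:
        raise ValueError("not a Parquet footer tail")
    (metadata_len,) = struct.unpack("<I", tail[:4])
    end = file_size - FOOTER_TAIL_LEN - 1  # last byte before the 8-byte tail
    start = end - metadata_len + 1
    if start < 0:
        raise ValueError("metadata length exceeds file size")
    return start, end
```

With a tail that encodes a metadata length of 34322 and a file size of 174965044, this yields the range `174930714-174965035`, which matches the second request in the trace.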
   
   
   
   ### Describe the solution you'd like
   
   I would like to avoid the first 8-byte request, which adds an entire extra 
object store request (and thus additional latency and additional cost).
   
   ### Describe alternatives you've considered
   
   I recommend changing the default of 
`datafusion.execution.parquet.metadata_size_hint` to 512k or 1MB.
   
   It turns out there is an existing setting that already avoids this first 
8-byte request, `datafusion.execution.parquet.metadata_size_hint`, which 
prefetches a larger initial range from the end of the file (and only makes a 
second request if the first request does not contain enough bytes).
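
The effect of the size hint on the number of footer requests can be sketched as follows. This is a simplified model, not the actual reader: `plan_footer_requests` is a hypothetical function, and fetching only the missing prefix on the second request is an assumption about how a reader might behave.

```python
from typing import List, Optional, Tuple

FOOTER_TAIL_LEN = 8  # 4-byte metadata length + 4-byte "PAR1" magic


def plan_footer_requests(
    file_size: int, metadata_len: int, size_hint: Optional[int] = None
) -> List[Tuple[int, int]]:
    """Return the inclusive byte ranges needed to read the footer.

    Hypothetical model for illustration: with no hint, the first request
    covers only the 8-byte tail; with a hint, it covers the last
    `size_hint` bytes of the file.
    """
    footer_len = metadata_len + FOOTER_TAIL_LEN
    if footer_len > file_size:
        raise ValueError("footer larger than file")
    first_len = FOOTER_TAIL_LEN if size_hint is None else max(size_hint, FOOTER_TAIL_LEN)
    first_len = min(first_len, file_size)
    first_start = file_size - first_len
    requests = [(first_start, file_size - 1)]
    metadata_start = file_size - footer_len
    if metadata_start < first_start:
        # Assumption: fetch only the part of the metadata not yet read.
        requests.append((metadata_start, first_start - 1))
    return requests
```

For the file in this issue (file size 174965044 inferred from the trace, metadata length 34322), the model yields two requests without a hint and a single request, `174465044-174965043`, with a hint of 500000.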
   
   
   Here is an example of using `metadata_size_hint` and reducing the number of 
requests to 4:
   
   ```sql
   DataFusion CLI v50.2.0
   > \object_store_profiling trace
   ObjectStore Profile mode set to Trace
   > set datafusion.execution.parquet.metadata_size_hint = 500000;
   0 row(s) fetched.
   Elapsed 0.001 seconds.
   
   Object Store Profiling
   > SELECT COUNT(*) from 
'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet'
 where "SearchPhrase" <> '';
   +----------+
   | count(*) |
   +----------+
   | 131559   |
   +----------+
   1 row(s) fetched.
   Elapsed 0.573 seconds.
   
   Object Store Profiling
   Instrumented Object Store: instrument_mode: Trace, inner: HttpStore
   2025-10-17T09:11:51.870079+00:00 operation=Get duration=0.031349s 
size=500000 range: bytes=174465044-174965043 
path=hits_compatible/athena_partitioned/hits_1.parquet
   2025-10-17T09:11:51.986178+00:00 operation=Get duration=0.025578s size=15503 
range: bytes=5120273-5135775 
path=hits_compatible/athena_partitioned/hits_1.parquet
   2025-10-17T09:11:51.986345+00:00 operation=Get duration=0.064672s 
size=3895852 range: bytes=145483536-149379387 
path=hits_compatible/athena_partitioned/hits_1.parquet
   2025-10-17T09:11:52.012529+00:00 operation=Get duration=0.064541s size=61815 
range: bytes=46392516-46454330 
path=hits_compatible/athena_partitioned/hits_1.parquet
   ```
   
   ### Additional context
   
   _No response_

