alamb opened a new issue, #18118: URL: https://github.com/apache/datafusion/issues/18118
### Is your feature request related to a problem or challenge?

Using the tooling that @BlakeOrth has been working on for instrumenting datafusion-cli, we can see what requests are actually being made when we query remote files:
- https://github.com/apache/datafusion/issues/17207

```sql
DataFusion CLI v50.2.0
> \object_store_profiling trace
ObjectStore Profile mode set to Trace
> SELECT COUNT(*) from 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet' where "SearchPhrase" <> '';
+----------+
| count(*) |
+----------+
| 131559   |
+----------+
1 row(s) fetched.
Elapsed 0.606 seconds.

Object Store Profiling
Instrumented Object Store: instrument_mode: Trace, inner: HttpStore
2025-10-17T08:55:22.202041+00:00 operation=Get duration=0.025994s size=8 range: bytes=174965036-174965043 path=hits_compatible/athena_partitioned/hits_1.parquet
2025-10-17T08:55:22.228064+00:00 operation=Get duration=0.028127s size=34322 range: bytes=174930714-174965035 path=hits_compatible/athena_partitioned/hits_1.parquet
2025-10-17T08:55:22.295696+00:00 operation=Get duration=0.032303s size=15503 range: bytes=5120273-5135775 path=hits_compatible/athena_partitioned/hits_1.parquet
2025-10-17T08:55:22.296663+00:00 operation=Get duration=0.060797s size=3895852 range: bytes=145483536-149379387 path=hits_compatible/athena_partitioned/hits_1.parquet
2025-10-17T08:55:22.330266+00:00 operation=Get duration=0.040970s size=61815 range: bytes=46392516-46454330 path=hits_compatible/athena_partitioned/hits_1.parquet
```

There are 5!!
requests made to read this file, annotated:

```
* operation=Get size=8 range: bytes=174965036-174965043        <-- reads the last 8 bytes to find metadata size
* operation=Get size=34322 range: bytes=174930714-174965035    <-- Footer Metadata
* operation=Get size=15503 range: bytes=5120273-5135775        <-- "SearchPhrase" data pages
* operation=Get size=3895852 range: bytes=145483536-149379387  <-- "SearchPhrase" data pages
* operation=Get size=61815 range: bytes=46392516-46454330
```

### Describe the solution you'd like

I would like to avoid the first 8-byte request, which adds an entire extra object store request (and thus additional latency and cost).

### Describe alternatives you've considered

I recommend changing the default of `datafusion.execution.parquet.metadata_size_hint` to 512k or 1MB.

It turns out that there is already a setting that avoids this first 8-byte request, called `datafusion.execution.parquet.metadata_size_hint`, which prefetches a larger initial request (and only makes a second request if the first one does not contain enough bytes).

Here is an example of using `metadata_size_hint` to reduce the number of requests to 4:

```sql
DataFusion CLI v50.2.0
> \object_store_profiling trace
ObjectStore Profile mode set to Trace
> set datafusion.execution.parquet.metadata_size_hint = 500000;
0 row(s) fetched.
Elapsed 0.001 seconds.

Object Store Profiling
> SELECT COUNT(*) from 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet' where "SearchPhrase" <> '';
+----------+
| count(*) |
+----------+
| 131559   |
+----------+
1 row(s) fetched.
Elapsed 0.573 seconds.
Object Store Profiling
Instrumented Object Store: instrument_mode: Trace, inner: HttpStore
2025-10-17T09:11:51.870079+00:00 operation=Get duration=0.031349s size=500000 range: bytes=174465044-174965043 path=hits_compatible/athena_partitioned/hits_1.parquet
2025-10-17T09:11:51.986178+00:00 operation=Get duration=0.025578s size=15503 range: bytes=5120273-5135775 path=hits_compatible/athena_partitioned/hits_1.parquet
2025-10-17T09:11:51.986345+00:00 operation=Get duration=0.064672s size=3895852 range: bytes=145483536-149379387 path=hits_compatible/athena_partitioned/hits_1.parquet
2025-10-17T09:11:52.012529+00:00 operation=Get duration=0.064541s size=61815 range: bytes=46392516-46454330 path=hits_compatible/athena_partitioned/hits_1.parquet
```

### Additional context

_No response_
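To illustrate why the size hint saves a request, here is a minimal Python sketch of the footer-read planning. It is not DataFusion's actual implementation: `plan_footer_reads` is a hypothetical helper, and the file size of 174965044 bytes is inferred from the last byte offset in the traces (174965043). It assumes the standard Parquet layout, where the file ends with a 4-byte metadata length followed by the 4-byte `PAR1` magic, and the metadata sits immediately before those 8 bytes.

```python
from typing import List, Tuple

FOOTER_SIZE = 8  # 4-byte little-endian metadata length + 4-byte "PAR1" magic


def plan_footer_reads(
    file_size: int, metadata_len: int, size_hint: int = 0
) -> List[Tuple[int, int]]:
    """Return the inclusive (start, end) byte ranges needed to load the footer metadata.

    Hypothetical helper for illustration only; a real reader discovers
    metadata_len from the first response rather than taking it as an argument.
    """
    # First request: the trailing 8 bytes, or a larger suffix when a hint is set.
    first_len = min(max(size_hint, FOOTER_SIZE), file_size)
    first_start = file_size - first_len
    reads = [(first_start, file_size - 1)]

    # The metadata sits immediately before the 8-byte footer.
    metadata_start = file_size - FOOTER_SIZE - metadata_len
    if metadata_start < first_start:
        # The first request did not cover the metadata, so a second one is needed.
        # (This sketch fetches only the missing bytes; a real reader may differ.)
        reads.append((metadata_start, first_start - 1))
    return reads


# Values taken from the traces: 34322 bytes of footer metadata.
print(plan_footer_reads(174965044, 34322))          # no hint: two requests
print(plan_footer_reads(174965044, 34322, 500000))  # 500000-byte hint: one request
```

With these inputs the sketch produces the same ranges as the traces: `174965036-174965043` plus `174930714-174965035` without a hint, and the single range `174465044-174965043` with the hint. The hint only helps when it is at least as large as the metadata plus the 8-byte footer, which is why a default in the 512k to 1MB range is suggested.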
