thinkharderdev commented on code in PR #3885:
URL: https://github.com/apache/arrow-datafusion/pull/3885#discussion_r1032437474
##########
datafusion/core/src/config.rs:
##########
@@ -237,6 +247,29 @@ impl BuiltInConfigs {
             to reduce the number of rows decoded.",
             false,
         ),
+        ConfigDefinition::new_bool(
+            OPT_PARQUET_ENABLE_PRUNING,
+            "If true, the parquet reader attempts to skip entire row groups based \
+            on the predicate in the query and the metadata (min/max values) stored in \
+            the parquet file.",
+            true,
+        ),
+        ConfigDefinition::new_bool(
+            OPT_PARQUET_SKIP_METADATA,
+            "If true, the parquet reader skips the optional embedded metadata that may be in \
+            the file Schema. This setting can help avoid schema conflicts when querying \
+            multiple parquet files with schemas containing compatible types but different metadata.",
+            true,
+        ),
+        ConfigDefinition::new(
+            OPT_PARQUET_METADATA_SIZE_HINT,
+            "If specified, the parquet reader will try to fetch the last `size_hint` \
+            bytes of the parquet file optimistically. If not specified, two reads are required: \
+            one read to fetch the 8-byte parquet footer and \
+            another to fetch the metadata length encoded in the footer.",
+            DataType::UInt64,
+            ScalarValue::UInt64(None),
Review Comment:
Do we want to default this to something? Back in the day we would default to
reading the last 64k to try to capture the entire footer in a single read,
which seems like a sensible default (especially for object storage, where an
additional round trip is very expensive relative to reading a bit more data
in the first request).
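
For illustration, a hypothetical sketch of what that default could look like,
reusing the `ConfigDefinition::new` call from the diff above (the 64 KiB
figure is just my suggestion here, not something this PR currently sets):

```rust
ConfigDefinition::new(
    OPT_PARQUET_METADATA_SIZE_HINT,
    "If specified, the parquet reader will try to fetch the last \
    `size_hint` bytes of the parquet file optimistically.",
    DataType::UInt64,
    // Hypothetical default: 64 KiB usually covers the footer plus the
    // metadata, so the whole thing lands in a single read.
    ScalarValue::UInt64(Some(64 * 1024)),
),
```

Users with unusually large metadata could still override it per session,
e.g. `SessionConfig::new().set_u64(OPT_PARQUET_METADATA_SIZE_HINT, 1024 * 1024)`,
assuming the existing `set_u64` setter on `SessionConfig`.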