parthchandra commented on code in PR #20370:
URL: https://github.com/apache/datafusion/pull/20370#discussion_r2836995332


##########
datafusion/datasource-parquet/src/metadata.rs:
##########
@@ -68,6 +68,55 @@ pub struct DFParquetMetadata<'a> {
     file_metadata_cache: Option<Arc<dyn FileMetadataCache>>,
     /// timeunit to coerce INT96 timestamps to
     pub coerce_int96: Option<TimeUnit>,
+    /// Whether to extract and use Parquet field IDs for column resolution
+    pub enable_field_ids: bool,
+}
+
+/// Extracts Parquet field IDs and stores them in Arrow field metadata
+/// under the key "PARQUET:field_id"
+///
+/// # Limitations
+///
+/// TODO: Currently only supports flat schemas (top-level primitive fields).
+/// Nested field IDs within structs, lists, and maps are not yet supported.
+/// This requires recursive traversal of the Parquet schema tree to extract
+/// field IDs at all nesting levels. See PARQUET_FIELD_ID_IMPLEMENTATION.md
+/// for details on nested schema support.
+fn add_field_ids_to_arrow_schema(
+    arrow_schema: &Schema,
+    parquet_schema: &SchemaDescriptor,
+) -> Result<Schema> {
+    use arrow::datatypes::Field;
+
+    let fields_with_ids: Vec<Arc<Field>> = arrow_schema
+        .fields()
+        .iter()
+        .enumerate()
+        .map(|(idx, field)| {
+            // Get the corresponding Parquet column descriptor
+            // TODO: This only works for flat schemas - 
parquet_schema.column(idx)
+            // returns leaf columns only, missing nested struct fields
+            let col_desc = parquet_schema.column(idx);
+
+            // Extract field ID from the schema type
+            // Field IDs are optional in Parquet; if not set, they may be 0 or 
negative
+            let field_id = col_desc.self_type().get_basic_info().id();
+
+            if field_id > 0 {
+                // Add field ID to field metadata
+                let mut metadata = field.metadata().clone();
+                metadata.insert("PARQUET:field_id".to_string(), 
field_id.to_string());

Review Comment:
   Should we define a constant for the field id's fieldname? Maybe at some 
point even allow it to be configurable. 



##########
datafusion/common/src/file_options/parquet_writer.rs:
##########
@@ -209,6 +209,7 @@ impl ParquetOptions {
             coerce_int96: _,     // not used for writer props
             skip_arrow_metadata: _,
             max_predicate_cache_size: _,
+            field_id_read_enabled: _, // not used for writer props

Review Comment:
   Why have this for the writer at all if it is not supported yet (or maybe 
just call it `field_id_enabled`)? 



##########
datafusion/datasource-parquet/src/metadata.rs:
##########
@@ -68,6 +68,55 @@ pub struct DFParquetMetadata<'a> {
     file_metadata_cache: Option<Arc<dyn FileMetadataCache>>,
     /// timeunit to coerce INT96 timestamps to
     pub coerce_int96: Option<TimeUnit>,
+    /// Whether to extract and use Parquet field IDs for column resolution
+    pub enable_field_ids: bool,
+}
+
+/// Extracts Parquet field IDs and stores them in Arrow field metadata
+/// under the key "PARQUET:field_id"
+///
+/// # Limitations
+///
+/// TODO: Currently only supports flat schemas (top-level primitive fields).
+/// Nested field IDs within structs, lists, and maps are not yet supported.
+/// This requires recursive traversal of the Parquet schema tree to extract
+/// field IDs at all nesting levels. See PARQUET_FIELD_ID_IMPLEMENTATION.md
+/// for details on nested schema support.
+fn add_field_ids_to_arrow_schema(
+    arrow_schema: &Schema,
+    parquet_schema: &SchemaDescriptor,
+) -> Result<Schema> {
+    use arrow::datatypes::Field;
+

Review Comment:
   To be safe we should probably assert/check that this is a flat schema before 
proceeding.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to