cbmixx opened a new pull request, #10113:
URL: https://github.com/apache/arrow-rs/pull/10113

   # Which issue does this PR close?
   
   - Closes #10112.
   
   # Rationale for this change
   
   `#[derive(ParquetRecordReader)]` and `#[derive(ParquetRecordWriter)]` could not
   handle a Parquet column whose name is a Rust keyword (e.g. `type`). The only way
   to spell such a field in Rust is a raw identifier (`r#type`), but the derives
   stringified the identifier including the `r#` prefix:
   
   - The reader's column-index lookup used `name_to_index.get(stringify!(#field_names))`,
     and `stringify!(r#type)` yields `"r#type"`, so reading failed with
     `ParquetError::General("column name 'r#type' is not found in parquet file!")`.
   - The writer's `Field::parquet_type()` used `self.ident.to_string()`, which keeps
     the `r#` prefix, so the written schema got a column literally named `r#type`.
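
The reader-side failure described above can be reproduced with the standard library alone. This is a minimal sketch, not the derive's generated code: `name_to_index` here is a hypothetical stand-in for the map built from the file schema, and `raw_lookup_key` is a hypothetical helper that mirrors the key the old lookup would have used for a field declared as `r#type`.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the column-name -> index map that the reader
/// derive builds from the Parquet file schema.
fn name_to_index() -> HashMap<&'static str, usize> {
    HashMap::from([("type", 0), ("count", 1)])
}

/// The key the old lookup effectively used for a field declared as `r#type`:
/// `stringify!` keeps the `r#` prefix of a raw identifier.
fn raw_lookup_key() -> &'static str {
    stringify!(r#type)
}

/// Look up a column index by name, returning an error message shaped like the
/// one quoted in this PR (the error type here is a plain `String`, an
/// assumption made to keep the sketch dependency-free).
fn column_index(name: &str) -> Result<usize, String> {
    name_to_index()
        .get(name)
        .copied()
        .ok_or_else(|| format!("column name '{}' is not found in parquet file!", name))
}
```

Calling `column_index(raw_lookup_key())` returns the "not found" error because the file's column is named `type`, while `column_index("type")` succeeds.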
   
   This made it impossible to read or write Parquet columns whose names are Rust
   keywords, e.g. files produced by other Parquet writers with a column named `type`.
   
   # What changes are included in this PR?
   
   Unraw the identifier (via `syn::ext::IdentExt::unraw`, already available through
   the existing `syn` dependency) wherever it is used as a column name, while keeping
   the raw identifier for field access in the generated code:
   
   - `parquet_derive/src/lib.rs`: the reader derive builds a parallel list of unrawed
     field-name strings for the `name_to_index` lookup and its error message.
   - `parquet_derive/src/parquet_field.rs`: `Field::parquet_type()` uses
     `self.ident.unraw().to_string()` for the schema column name.
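
The split between "column name" and "field access" can be illustrated without `syn`. The sketch below is an assumption-laden stand-in: `column_name` is a hypothetical string-based helper that imitates what `unraw` does to an identifier, and `Record` and `describe` are illustrative names that do not appear in the PR. The actual change calls `unraw()` on a `syn::Ident` inside the proc macro.

```rust
/// Hypothetical std-only stand-in for `syn::ext::IdentExt::unraw`:
/// drop a leading `r#` if present, otherwise return the name unchanged.
fn column_name(ident: &str) -> &str {
    ident.strip_prefix("r#").unwrap_or(ident)
}

/// Illustrative struct with one raw-identifier field and one normal field.
struct Record {
    r#type: String,
    count: i64,
}

/// Field access still needs the raw identifier (`r.r#type`), while the
/// column name shown to Parquet is the unrawed string (`type`).
fn describe(r: &Record) -> String {
    format!(
        "{}={}, {}={}",
        column_name(stringify!(r#type)),
        r.r#type,
        column_name(stringify!(count)),
        r.count
    )
}
```

For a `Record` whose `r#type` is `"a"` and whose `count` is `2`, `describe` produces `type=a, count=2`: the prefix is gone from the name, but the generated code can still reach the field.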
   
   # Are these changes tested?
   
   Yes. Added a unit test (`test_parquet_type_with_raw_identifier`) and an
   integration round-trip test (`test_parquet_derive_raw_identifiers`) covering a
   struct with a raw-identifier field (`r#type`) alongside a normal field, asserting
   the schema columns are named `type`/`count`. I verified both tests fail without
   the fix (the writer emits a column named `r#type`) and pass with it.
   
   # Are there any user-facing changes?
   
   Structs with raw-identifier fields now read and write columns named without the
   `r#` prefix. This is a bug fix; there are no public API changes. Code that somehow
   relied on the previous `r#`-prefixed column names would change behavior, but such
   names could not be produced by any other Parquet writer.
   
   ---
   
   *AI disclosure (per CONTRIBUTING.md): this change was developed with the
   assistance of an AI coding tool. I reviewed every line, verified the fix against
   the failing/passing tests described above, and own the change.*

