cbmixx opened a new issue, #10112:
URL: https://github.com/apache/arrow-rs/issues/10112

   ### Describe the bug
   
   `#[derive(ParquetRecordReader)]` and `#[derive(ParquetRecordWriter)]` cannot 
handle a Parquet column whose name is a Rust keyword (e.g. `type`). The only 
way to spell such a field in Rust is a raw identifier (`r#type`), but the 
derives stringify the identifier including the `r#` prefix:
   
   - **Reader**: the generated column-index lookup uses 
`name_to_index.get(stringify!(#field_names))`, and `stringify!(r#type)` yields 
`"r#type"`, so reading fails with `ParquetError::General("column name 'r#type' 
is not found in parquet file!")`.
   - **Writer**: `Field::parquet_type()` in 
`parquet_derive/src/parquet_field.rs` uses `self.ident.to_string()`, and 
`syn::Ident::to_string()` keeps the `r#` prefix, so the written file's schema 
gets a column literally named `r#type`.
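   
   The reader-side mismatch can be sketched with std only. The helper names below (`stringified_field`, `name_to_index`) are hypothetical stand-ins for illustration, not the derive's actual generated code; the map is assumed to come from a file with a single column named `type`.
   
   ```rust
   use std::collections::HashMap;
   
   /// The lookup key the reader ends up with: the field identifier
   /// stringified exactly as written, raw prefix included.
   pub fn stringified_field() -> &'static str {
       stringify!(r#type)
   }
   
   /// Hypothetical column-name -> index map for a file whose only
   /// column is named "type".
   pub fn name_to_index() -> HashMap<&'static str, usize> {
       HashMap::from([("type", 0)])
   }
   
   /// Looks up the stringified field name, mirroring the shape of the
   /// failing `name_to_index.get(stringify!(...))` call.
   pub fn lookup_field() -> Option<usize> {
       name_to_index().get(stringified_field()).copied()
   }
   ```
   
   Here `lookup_field()` misses even though the column exists, because the key carries the `r#` prefix.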
   
   As a result, the derives cannot read or round-trip files produced by other 
Parquet writers when a column name is a Rust keyword: a column named `type` 
cannot be referenced at all.
   
   ### To Reproduce
   
   ```rust
   use parquet::file::{reader::FileReader, serialized_reader::SerializedFileReader};
   use parquet::record::RecordReader;
   use parquet_derive::ParquetRecordReader;
   
   #[derive(ParquetRecordReader, Default, Debug)]
   struct ARecord {
       r#type: i32, // parquet column is named "type"
   }
   
   fn main() {
       // any parquet file with an INT32 column named "type"
       let file = std::fs::File::open("a_file.parquet").unwrap();
       let reader = SerializedFileReader::new(file).unwrap();
       let mut rg = reader.get_row_group(0).unwrap();
   
       let mut rows: Vec<ARecord> = Vec::new();
       rows.read_from_row_group(&mut *rg, 1).unwrap();
       // fails: ParquetError::General(
       //     "column name 'r#type' is not found in parquet file!")
   }
   ```
   
   The writer side is similarly affected: deriving `ParquetRecordWriter` on the 
same struct produces a schema whose column is named `r#type` instead of `type` 
(visible via `schema()` or by inspecting the written file).
   
   ### Expected behavior
   
   A struct field declared with a raw identifier should map to the column name 
without the `r#` prefix — `r#type: i32` should read from and write to a column 
named `type`, matching how other derive ecosystems (e.g. serde) treat raw 
identifiers.
   
   ### Additional context
   
   Observed on `parquet_derive = "59.0.0"`; the relevant code is unchanged on 
`main`. The fix is to unraw the identifier (`syn::ext::IdentExt::unraw`, 
available via the existing `syn` dependency) wherever it is used as a column 
name — the reader's lookup in `parquet_derive/src/lib.rs` and the schema name 
in `Field::parquet_type()` — while keeping the raw identifier for field access 
in the generated code. I have a patch with tests and will submit a PR.
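   
   The intended name mapping can be sketched at the string level with std only. `column_name` is a hypothetical helper used purely for illustration; the fix described above would instead call `unraw()` on the `syn::Ident` inside the derive rather than manipulate strings.
   
   ```rust
   /// Hypothetical string-level analogue of unrawing an identifier:
   /// drops a leading `r#` if present, otherwise returns the input unchanged.
   pub fn column_name(field_ident: &str) -> &str {
       field_ident.strip_prefix("r#").unwrap_or(field_ident)
   }
   
   /// The column name a field declared as `r#type` should map to.
   pub fn type_column() -> &'static str {
       column_name(stringify!(r#type))
   }
   ```
   
   Only the column name is unrawed; the generated code would still access the struct field through the raw identifier `r#type`.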


