cbmixx opened a new issue, #10112:
URL: https://github.com/apache/arrow-rs/issues/10112
### Describe the bug
`#[derive(ParquetRecordReader)]` and `#[derive(ParquetRecordWriter)]` cannot
handle a Parquet column whose name is a Rust keyword (e.g. `type`). The only
way to spell such a field in Rust is a raw identifier (`r#type`), but the
derives stringify the identifier including the `r#` prefix:
- **Reader**: the generated column-index lookup uses
`name_to_index.get(stringify!(#field_names))`, and `stringify!(r#type)` yields
`"r#type"`, so reading fails with `ParquetError::General("column name 'r#type'
is not found in parquet file!")`.
- **Writer**: `Field::parquet_type()` in
`parquet_derive/src/parquet_field.rs` uses `self.ident.to_string()`, and
`syn::Ident::to_string()` keeps the `r#` prefix, so the written file's schema
gets a column literally named `r#type`.
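
The reader-side failure can be reproduced without the derive macro. The sketch below uses only the standard library; the `name_to_index` function is a hypothetical stand-in for the map the generated code builds from the file schema, not the actual generated code.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the column-name map built from a file
/// whose only column is named "type".
fn name_to_index() -> HashMap<String, usize> {
    let mut map = HashMap::new();
    map.insert("type".to_string(), 0);
    map
}

fn main() {
    let columns = name_to_index();

    // `stringify!` keeps the raw-identifier prefix, so the key is "r#type".
    let key = stringify!(r#type);

    // The lookup misses because the file's column is "type", not "r#type".
    println!("{key:?} -> {:?}", columns.get(key));
}
```

The lookup with the stringified raw identifier returns `None`, which is the condition the derive reports as a missing column.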
This makes it impossible to round-trip files produced by other Parquet
writers, since a column named `type` cannot be referenced at all.
### To Reproduce
```rust
use parquet::file::{reader::FileReader, serialized_reader::SerializedFileReader};
use parquet::record::RecordReader;
use parquet_derive::ParquetRecordReader;

#[derive(ParquetRecordReader, Default, Debug)]
struct ARecord {
    r#type: i32, // parquet column is named "type"
}

fn main() {
    // any parquet file with an INT32 column named "type"
    let file = std::fs::File::open("a_file.parquet").unwrap();
    let reader = SerializedFileReader::new(file).unwrap();
    let mut rg = reader.get_row_group(0).unwrap();
    let mut rows: Vec<ARecord> = Vec::new();
    rows.read_from_row_group(&mut *rg, 1).unwrap();
    // fails: ParquetError::General("column name 'r#type' is not found in parquet file!")
}
```
The writer side is similarly affected: deriving `ParquetRecordWriter` on the
same struct produces a schema whose column is named `r#type` instead of `type`
(visible via `schema()` or by inspecting the written file).
### Expected behavior
A struct field declared with a raw identifier should map to the column name
without the `r#` prefix: `r#type: i32` should read from and write to a column
named `type`, matching how other derive ecosystems (e.g. serde) treat raw
identifiers.
### Additional context
Observed on `parquet_derive = "59.0.0"`; the relevant code is unchanged on
`main`. The fix is to unraw the identifier (`syn::ext::IdentExt::unraw`,
available via the existing `syn` dependency) wherever it is used as a column
name — the reader's lookup in `parquet_derive/src/lib.rs` and the schema name
in `Field::parquet_type()` — while keeping the raw identifier for field access
in the generated code. I have a patch with tests and will submit a PR.
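
The split between "column name" and "field access" can be illustrated with a string-level sketch. This is not the patch itself: `column_name` is a hypothetical helper that mimics on plain strings what `unraw` would do on a `syn::Ident` at macro-expansion time, and it uses only the standard library.

```rust
/// Hypothetical helper: drop a leading `r#` so the result can be used as
/// a column name. Identifiers without the prefix are returned unchanged.
fn column_name(ident: &str) -> &str {
    ident.strip_prefix("r#").unwrap_or(ident)
}

struct ARecord {
    r#type: i32,
}

fn main() {
    let record = ARecord { r#type: 7 };

    // Column name: the prefix is stripped before it reaches the schema or lookup.
    let name = column_name(stringify!(r#type));

    // Field access: the raw identifier is still required in Rust source.
    println!("{name} = {}", record.r#type);
}
```

Only the string used for the schema and the lookup changes; the generated field access keeps the `r#` form, since `record.type` would not parse.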