double-free opened a new issue, #5716: URL: https://github.com/apache/arrow-rs/issues/5716
## Problem Description

I'm working with Parquet files written by `pyarrow` (via `pandas`). I came across `parquet_derive`, and it removes a lot of boilerplate from my project. The problem is that it does not work on Parquet files written by `pandas` with the default setup; it fails with:

```text
Parquet error: must specify definition levels
```

After digging into this, I found that the Parquet files generated by `pyarrow` have `def_level = 1`: every column, even one without any null values, is OPTIONAL.

<img width="677" alt="image" src="https://github.com/apache/arrow-rs/assets/27212391/b6b4cc96-8c53-4d41-9c66-4f802476dd7a">

However, the macro generates code that passes no definition levels, so it fails to read OPTIONAL columns even when they contain no actual NULL values:

```rust
typed.read_records(num_records, None, None, &mut vals)?;
```

The API it calls is https://docs.rs/parquet/latest/parquet/column/reader/struct.GenericColumnReader.html#method.read_records.

## My Solution

The solution is straightforward. I have fixed the problem locally and am willing to contribute a pull request, but I don't know whether this solution is reasonable in the scope of the whole `arrow` project. Basically, I think we need to provide definition levels in the `read_records` call:

```rust
typed.read_records(num_records, None /* should be Some(&mut Vec<i16>) */, None, &mut vals)?;
```

In short, with this change `parquet_derive` can handle:

1. (already supported) Parquet files with all columns REQUIRED
2. (newly introduced) Parquet files with OPTIONAL columns that are guaranteed to contain no NULLs

### Pros

- This change does not break any current features.
- It makes `parquet_derive` more general in the Parquet files it can handle.
The change passes the tests in `parquet_derive_tests`. I have also added checks comparing the number of parsed records against the number of valid records, to prevent this from being abused on columns that actually contain NULLs.

### Cons

- It is slightly slower, since it allocates an extra `Vec<i16>` per column when invoking `read_from_row_group`. Compared to the inconvenience of not supporting OPTIONAL columns, I don't think this is a big deal. Moreover, we can use the column's `max_def_level` (which is 0 for a REQUIRED column) to skip allocating the Vec.
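To illustrate the validity check and the `max_def_level` optimization described above, here is a minimal, stdlib-only sketch. The function name and error handling are illustrative assumptions, not the actual `parquet_derive` code or `parquet` crate API; the underlying rule it encodes (a value is non-null exactly when its definition level equals the column's max definition level) is standard Parquet semantics.

```rust
/// Hypothetical helper: given the definition levels returned by
/// `read_records` for one column, return Err if any record is actually
/// NULL, i.e. its definition level is below `max_def_level`.
fn check_no_nulls(def_levels: &[i16], max_def_level: i16) -> Result<(), String> {
    // For a REQUIRED column (max_def_level == 0) every value is present
    // by construction; this is also the case where the Vec<i16>
    // allocation could be skipped entirely.
    if max_def_level == 0 {
        return Ok(());
    }
    match def_levels.iter().position(|&l| l < max_def_level) {
        Some(i) => Err(format!(
            "record {i} is NULL; parquet_derive cannot materialize it into a non-Option field"
        )),
        None => Ok(()),
    }
}

fn main() {
    // pyarrow-style file: column is OPTIONAL (max_def_level = 1),
    // but every value is present, so reading succeeds.
    assert!(check_no_nulls(&[1, 1, 1], 1).is_ok());
    // Same schema with one actual NULL: rejected.
    assert!(check_no_nulls(&[1, 0, 1], 1).is_err());
    // REQUIRED column: nothing to check.
    assert!(check_no_nulls(&[], 0).is_ok());
}
```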
