double-free opened a new issue, #5716:
URL: https://github.com/apache/arrow-rs/issues/5716

   ## Problem Description
   <!--
   A clear and concise description of what the problem is. Ex. I'm always 
frustrated when [...] 
   (This section helps Arrow developers understand the context and *why* for 
this feature, in addition to  the *what*)
   -->
   I'm working on parquet files written by `pyarrow` (embedded in `pandas`). I 
came across `parquet_derive` and it avoids boilerplates in my project.
   The problem is, it doesn't work on the parquet files that is written by 
`pandas` with default setup, it throws error information:
   
   ```text
   Parquet error: must specify definition levels
   ```
   
   After digging into this, I found that the problem is the parquet file 
generated by `pyarrow` has def_level=1, i.e., every column, even without a null 
value, is OPTIONAL.
   
   <img width="677" alt="image" 
src="https://github.com/apache/arrow-rs/assets/27212391/b6b4cc96-8c53-4d41-9c66-4f802476dd7a";>
   
   However, the macro generate code that does not allow definition level, thus 
it fails to parsing columns with OPTIONAL value, even there is no actual NULL 
values:
   
   ```rust
   typed.read_records(num_records, None, None, &mut vals)?;
   ```
   
   The API it calls is: 
https://docs.rs/parquet/latest/parquet/column/reader/struct.GenericColumnReader.html#method.read_records
 .
   
   ## My Solution
   
   The solution is straight-forward. I have fixed the problem locally, I'm 
willing to contribute a pull request, but I don't know if this solution is 
reasonable in the scope of the whole `arrow` project.
   
   Basically, I think we need to provide definition level in `read_record`:
   
   ```rust
   typed.read_records(num_records, None /*should use a Some(&mut Vec<i16>)*/, 
None, &mut vals)?;
   ```
   
   In one word, with this solution, `parquet_derive` can now handle:
   1. (already supported) parquet file with all columns REQUIRED
   2. (new introduced) parquet file with OPTIONAL columns but are always 
guaranteed to be valid.
   
   ### Pros
   
   - This solution does not break current features
   - This solution makes parquet_derive more general in handling parquet files.
   
   It can pass the tests in `parquet_derive_tests`. I also add checks against 
the parsed records and valid records, to avoid abusing it for columns with 
NULLs.
   
   ### Cons
   
   - It will be slightly slower since it allocates an extra `Vec<i16>` for each 
column when invoking `read_from_row_group`.
   
   I don't think it is a big deal, though, compared to the inconvenience of not 
supporting OPTIONAL columns. Moreover, we can make use of the max_def_levels 
(for REQUIRED column, it is 0) to skip creating the Vec.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to