[I] [parquet_derive] support OPTIONAL (def_level = 1) columns by default [arrow-rs]

via GitHub Sat, 04 May 2024 03:53:15 -0700


double-free opened a new issue, #5716:
URL: https://github.com/apache/arrow-rs/issues/5716

## Problem Description

I'm working on parquet files written by `pyarrow` (embedded in `pandas`). I
came across `parquet_derive` and it avoids boilerplates in my project.
The problem is, it doesn't work on the parquet files that is written by
`pandas` with default setup, it throws error information:

```text
Parquet error: must specify definition levels
```

After digging into this, I found that the problem is the parquet file
generated by `pyarrow` has def_level=1, i.e., every column, even without a null
value, is OPTIONAL.

However, the macro generate code that does not allow definition level, thus
it fails to parsing columns with OPTIONAL value, even there is no actual NULL
values:

```rust
typed.read_records(num_records, None, None, &mut vals)?;
```

The API it calls is:
https://docs.rs/parquet/latest/parquet/column/reader/struct.GenericColumnReader.html#method.read_records
.

## My Solution

The solution is straight-forward. I have fixed the problem locally, I'm
willing to contribute a pull request, but I don't know if this solution is
reasonable in the scope of the whole `arrow` project.

Basically, I think we need to provide definition level in `read_record`:

```rust
typed.read_records(num_records, None /*should use a Some(&mut Vec<i16>)*/,
None, &mut vals)?;
```

In one word, with this solution, `parquet_derive` can now handle:
1. (already supported) parquet file with all columns REQUIRED
2. (new introduced) parquet file with OPTIONAL columns but are always
guaranteed to be valid.

### Pros

- This solution does not break current features
- This solution makes parquet_derive more general in handling parquet files.

It can pass the tests in `parquet_derive_tests`. I also add checks against
the parsed records and valid records, to avoid abusing it for columns with
NULLs.

### Cons

- It will be slightly slower since it allocates an extra `Vec<i16>` for each
column when invoking `read_from_row_group`.

I don't think it is a big deal, though, compared to the inconvenience of not
supporting OPTIONAL columns. Moreover, we can make use of the max_def_levels
(for REQUIRED column, it is 0) to skip creating the Vec.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[I] [parquet_derive] support OPTIONAL (def_level = 1) columns by default [arrow-rs]

Reply via email to