Jim Pivarski created ARROW-14770:
------------------------------------
Summary: Direct (individualized) access to definition levels,
repetition levels, and numeric data of a column
Key: ARROW-14770
URL: https://issues.apache.org/jira/browse/ARROW-14770
Project: Apache Arrow
Issue Type: New Feature
Components: C++, Parquet, Python
Reporter: Jim Pivarski
It would be useful to have more low-level access to the three components of a
Parquet column in Python: the definition levels, the repetition levels, and the
numeric data, {_}individually{_}.
The particular use-case we have in Awkward Array is that users will sometimes
lazily read an array of lists of structs without reading any of the fields of
those structs. To build the data structure, we need the lengths of the lists
independently of the columns (which users can then use in functions like
{{{}ak.num{}}}; the number of structs without their field values is useful
information).
What we're doing right now is reading a column, converting it to Arrow
({{{}pa.Array{}}}), and getting the list lengths from that Arrow array. We have
been using the schema to try to pick the smallest column (booleans are best!),
but that's because we really just want the definition and repetition levels
without the numeric data.
I've heard that the Parquet metadata includes offsets to select just the
definition levels, just the repetition levels, or just the numeric data
(pre-decompression?). Exposing those in Python as {{pa.Buffer}} objects would
be ideal.
Beyond our use case, such a feature could also help with wide structs in lists:
all of the non-nullable fields of the struct would share the same definition
and repetition levels, so they don't need to be re-read. For that use-case, the
ability to pick out definition, repetition, and numeric data separately would
still be useful, but the purpose would be to read the numeric data without the
structural integers (opposite of ours).
The desired interface would be like {{{}ParquetFile.read_row_group{}}}, but
would return one, two, or three {{pa.Buffer}} objects depending on three
boolean arguments, {{{}definition{}}}, {{{}repetition{}}}, and {{{}numeric{}}}.
The {{pa.Buffer}} would be unpacked, with all run-length encodings and
fixed-width encodings converted into integers of at least one byte each. It may
make more sense for the output to be {{{}np.ndarray{}}}, to carry {{dtype}}
information if that can depend on the maximum level (though levels larger than
255 are likely rare!). This information must be available at some level in
Arrow's C++ code; the request is to expose it to Python.
I've labeled this minor because it is for optimizations, but it would be really
nice to have!
--
This message was sent by Atlassian Jira
(v8.20.1#820001)