[ https://issues.apache.org/jira/browse/ARROW-13151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17369517#comment-17369517 ]

Jim Pivarski commented on ARROW-13151:
--------------------------------------

I hope reading a single field of a struct column is supported! It's an 
important use case for us.

In particle physics, our data consist of many collision events, each with a 
variable-length list of particles, and each particle is a struct with many 
fields. Often there's even deeper structure than that, but this is the basic 
shape. These structs are very wide, with as many as a hundred fields, because 
the same input dataset is used by 3000 authors, all doing different analyses. 
Most individual data analysts don't access more than 10% of these struct 
fields.

Therefore, it's important to be able to read the data lazily (in interactive 
analysis) or at least selectively (in high-throughput applications). Reading 
and decompressing data are often bottlenecks, so restricting data-loading to 
just the data we use is by itself a 10× improvement. We have a custom file 
format (ROOT) that is designed to provide exactly this selective reading, but 
we've been looking at Parquet as a more cross-language and non-domain-specific 
alternative.

The bug that Angus reported arose in a framework that provides lazy-reading, 
Awkward Array's 
[ak.from_parquet|https://awkward-array.readthedocs.io/en/latest/_auto/ak.from_parquet.html]
 function, which uses pyarrow.parquet.ParquetFile to read the data and convert 
it to Arrow, then converts the Arrow into Awkward Arrays (which are highly 
interchangeable with Arrow Arrays; conversion in both directions is usually 
zero-copy). [This whole 
feature|https://github.com/scikit-hep/awkward-1.0/blob/1ecfc3e29aaf1b79cd7e0e8fa1598452f3827c64/src/awkward/operations/convert.py#L3122-L3959]
 was designed around the idea that you can read individual struct fields, just 
as you can read individual columns. Just today, I found out that's not true, 
even in our basic case that does not trigger errors like Angus's:

{code:python}
>>> pq.write_table(pa.Table.from_pydict({"events": [{"muons": [{"pt": 10.5, "eta": -1.5, "phi": 0.1}]}]}), "/tmp/testy.parquet")
>>> pq.ParquetFile("/tmp/testy.parquet").read(["events.muons.list.item.pt"])   # reads all three
pyarrow.Table
events: struct<muons: list<item: struct<eta: double, phi: double, pt: double>>>
  child 0, muons: list<item: struct<eta: double, phi: double, pt: double>>
    child 0, item: struct<eta: double, phi: double, pt: double>
      child 0, eta: double
      child 1, phi: double
      child 2, pt: double
>>> pq.ParquetFile("/tmp/testy.parquet").read(["events.muons.list.item.eta"])   # reads all three
pyarrow.Table
events: struct<muons: list<item: struct<eta: double, phi: double, pt: double>>>
  child 0, muons: list<item: struct<eta: double, phi: double, pt: double>>
    child 0, item: struct<eta: double, phi: double, pt: double>
      child 0, eta: double
      child 1, phi: double
      child 2, pt: double
>>> pq.ParquetFile("/tmp/testy.parquet").read(["events.muons.list.item.phi"])   # reads all three
pyarrow.Table
events: struct<muons: list<item: struct<eta: double, phi: double, pt: double>>>
  child 0, muons: list<item: struct<eta: double, phi: double, pt: double>>
    child 0, item: struct<eta: double, phi: double, pt: double>
      child 0, eta: double
      child 1, phi: double
      child 2, pt: double
{code}

I hadn't realized that our attempts to read only "muon pt" or only "muon eta" 
were, in fact, reading all muon fields. (In the real datasets, muons have 32 
fields, electrons have 47, taus have 37, jets have 30, photons have 27...)

We could try to rearrange data to something shallower:

{code:python}
>>> pq.write_table(pa.Table.from_pydict({"muons": [{"pt": 10.5, "eta": -1.5, "phi": 0.1}]}), "/tmp/testy.parquet")
>>> pq.ParquetFile("/tmp/testy.parquet").read(["muons.pt"])
pyarrow.Table
muons: struct<pt: double>
  child 0, pt: double
>>> pq.ParquetFile("/tmp/testy.parquet").read(["muons.eta"])
pyarrow.Table
muons: struct<eta: double>
  child 0, eta: double
>>> pq.ParquetFile("/tmp/testy.parquet").read(["muons.phi"])
pyarrow.Table
muons: struct<phi: double>
  child 0, phi: double
{code}

but that puts a hard-to-predict constraint on data structures. In the example 
above, aren't we reading "a single child field of a struct column"? (I 
probably saw this behavior and assumed it would extend to deeper structures, 
which is why I never noticed that deeper reads sometimes pull in all struct 
fields.)
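The shallow arrangement above amounts to flattening the tree into dotted top-level names before writing. A rough sketch of that naming convention on plain Python dicts (`flatten_record` is a hypothetical helper of ours, not part of pyarrow):

```python
def flatten_record(record, prefix=""):
    """Flatten a nested dict into top-level keys whose dotted names encode
    the original tree, e.g. {"muons": {"pt": 1.0}} -> {"muons.pt": 1.0}.
    Lists are kept as leaf values, so variable-length data survives."""
    out = {}
    for key, value in record.items():
        name = prefix + key
        if isinstance(value, dict):
            out.update(flatten_record(value, name + "."))
        else:
            out[name] = value
    return out

flat = flatten_record({"muons": {"pt": 10.5, "eta": -1.5, "phi": 0.1}})
print(flat)  # {'muons.pt': 10.5, 'muons.eta': -1.5, 'muons.phi': 0.1}
```

The relationships between fields then live only in the dots of the column names, which is exactly the naming-convention bookkeeping we'd rather avoid.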

As a real-world case, here's a dataset that naturally has a structure that 
suffers from over-reading. It's not physics-related: it's a translation of the 
[Million Song Dataset|http://millionsongdataset.com/] into Parquet (side-note: 
it's losslessly 3× smaller than the original HDF5 files because of all the 
variable-length data): s3://pivarski-princeton/millionsongs/ . Lazily loading 
it has odd performance characteristics that I hadn't measured in detail until 
now:

{code:python}
In [1]: import awkward as ak

In [2]: songs = ak.from_parquet("/home/jpivarski/storage/data/million-song-datas
 ...: et/full/millionsongs/millionsongs-A-zstd.parquet", lazy=True)

In [3]: %time songs.analysis.segments.loudness_start
CPU times: user 19.1 ms, sys: 0 ns, total: 19.1 ms
Wall time: 18.8 ms
Out[3]: <Array [[-60, -22.7, -23, ... -38, -37.5]] type='39100 * var * float64'>

In [4]: %time songs.analysis.segments.loudness_max
CPU times: user 3.97 ms, sys: 14 µs, total: 3.98 ms
Wall time: 4.03 ms
Out[4]: <Array [[-20.6, -21.1, ... -35.6, -35.6]] type='39100 * var * float64'>

In [5]: %time songs.analysis.segments.pitches
CPU times: user 4.2 ms, sys: 0 ns, total: 4.2 ms
Wall time: 4.2 ms
Out[5]: <Array [[[0.294, 0.158, ... 0.437, 0.065]]] type='39100 * var * var * float64'>

In [6]: %time songs.analysis.segments.timbre
CPU times: user 4.05 ms, sys: 53 µs, total: 4.1 ms
Wall time: 4.11 ms
Out[6]: <Array [[[18, 71.3, 193, ... -34.8, 2.66]]] type='39100 * var * var * float64'>
{code}

The "pitches" and "timbre" fields are much larger than "loudness_start" and 
"loudness_max" (note the deeper nesting), yet "loudness_start" takes 4× 
longer to load because it triggers a read of everything in the whole 
"segments" struct. This depth of nesting is necessary because "segments" has 
to be differentiated from "bars," "beats," "sections," and "tatums," and the 
"analysis" level has to be differentiated from "metadata" (song names, artist 
names, tags), all of which are nested with different cardinalities. 
Restructuring this so that individually readable fields become separate 
top-level table columns would mean managing a lot of relationships through 
naming conventions, rather than the tree structure that naturally fits the 
data.

So—the point of this long comment is that reading individual struct fields 
would be a very, very, very nice feature to have. Without it, we'll have to 
detect the unsupported cases and tell our users to rearrange their data to fit 
the supported ones through naming conventions.

> [Python] Unable to read single child field of struct column from Parquet
> ------------------------------------------------------------------------
>
>                 Key: ARROW-13151
>                 URL: https://issues.apache.org/jira/browse/ARROW-13151
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Parquet, Python
>            Reporter: Angus Hollands
>            Priority: Major
>
> Given the following table
> {code:python}
> data = {"root": [[{"addr": {"this": 3, "that": 3}}]]}
> table = pa.Table.from_pydict(data)
> {code}
> reading the nested column leads to a `pyarrow.lib.ArrowInvalid` error:
> {code}
> pq.write_table(table, "/tmp/table.parquet")
> file = pq.ParquetFile("/tmp/table.parquet")
> array = file.read(["root.list.item.addr.that"])
> {code}
> Traceback:
> {code}
> Traceback (most recent call last):
>   File "....", line 21, in <module>
>     array = file.read(["root.list.item.addr.that"])
>   File 
> "/home/angus/.mambaforge/envs/awkward/lib/python3.9/site-packages/pyarrow/parquet.py",
>  line 383, in read
>     return self.reader.read_all(column_indices=column_indices,
>   File "pyarrow/_parquet.pyx", line 1097, in 
> pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 97, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: List child array invalid: Invalid: Struct child 
> array #0 does not match type field: struct<that: int64> vs struct<that: 
> int64, this: int64>
> {code}
> It's possible that I don't quite understand this properly - am I doing 
> something wrong?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
