Re: A couple of questions about pyarrow.parquet

Wes McKinney Fri, 17 May 2019 12:47:41 -0700

Please see also

https://docs.google.com/document/d/1bVhzifD38qDypnSjtf8exvpP3sSB5x_Kw9m-n66FB2c/edit?usp=drivesdk


And prior mailing list discussion. I will comment in more detail on the
other items later

On Fri, May 17, 2019, 2:44 PM Micah Kornfield <[email protected]> wrote:

> I can't help on the first question.
>
> Regarding push-down predicates, there is an open JIRA [1] to do just that
>
> [1] https://issues.apache.org/jira/browse/PARQUET-473
> <
> https://issues.apache.org/jira/browse/PARQUET-473?jql=project%20in%20(PARQUET%2C%20ARROW)%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20text%20~%20%22pushdown%22
> >
>
> On Fri, May 17, 2019 at 11:48 AM Ted Gooch <[email protected]> wrote:
>
> > Hi,
> >
> > I've been doing some work trying to get the parquet read path going for
> the
> > python iceberg <https://github.com/apache/incubator-iceberg> library.  I
> > have two questions that I couldn't get figured out, and was hoping I
> could
> > get some guidance from the list here.
> >
> > First, I'd like to create a ParquetSchema->IcebergSchema converter, but
> it
> > appears that only limited information is available in the ColumnSchema
> > passed back to the python client[2]:
> >
> > <ParquetColumnSchema>
> >   name: key
> >   path: m.map.key
> >   max_definition_level: 2
> >   max_repetition_level: 1
> >   physical_type: BYTE_ARRAY
> >   logical_type: UTF8
> > <ParquetColumnSchema>
> >   name: key
> >   path: m.map.value.map.key
> >   max_definition_level: 4
> >   max_repetition_level: 2
> >   physical_type: BYTE_ARRAY
> >   logical_type: UTF8
> > <ParquetColumnSchema>
> >   name: value
> >   path: m.map.value.map.value
> >   max_definition_level: 5
> >   max_repetition_level: 2
> >   physical_type: BYTE_ARRAY
> >   logical_type: UTF8
> >
> >
> > where physical_type and logical_type are both strings[1].  The arrow
> schema
> > I can get from *to_arrow_schema *looks to be more expressive(although may
> > be I just don't understand the parquet format well enough):
> >
> > m: struct<map: list<map: struct<key: string, value: struct<map: list<map:
> > struct<key: string, value: string> not null>>> not null>>
> >   child 0, map: list<map: struct<key: string, value: struct<map:
> list<map:
> > struct<key: string, value: string> not null>>> not null>
> >       child 0, map: struct<key: string, value: struct<map: list<map:
> > struct<key: string, value: string> not null>>>
> >           child 0, key: string
> >           child 1, value: struct<map: list<map: struct<key: string,
> value:
> > string> not null>>
> >               child 0, map: list<map: struct<key: string, value: string>
> > not null>
> >                   child 0, map: struct<key: string, value: string>
> >                       child 0, key: string
> >                       child 1, value: string
> >
> >
> > It seems like I can infer the info from the name/path, but is there a
> more
> > direct way of getting the detailed parquet schema information?
> >
> > Second question, is there a way to push record level filtering into the
> > parquet reader, so that the parquet reader only reads in values that
> match
> > a given predicate expression? Predicate expressions would be simple
> > field-to-literal comparisons(>,>=,==,<=,<, !=, is null, is not null)
> > connected with logical operators(AND, OR, NOT).
> >
> > I've seen that after reading-in I can use the filtering language in
> > gandiva[3] to get filtered record-batches, but was looking for somewhere
> > lower in the stack if possible.
> >
> >
> >
> > [1]
> >
> >
> https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pyx#L661-L667
> > [2] Spark/Hive Table DDL for this parquet file looks like:
> > CREATE TABLE `iceberg`.`nested_map` (
> > m map<string,map<string,string>>)
> > [3]
> >
> >
> https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_gandiva.py#L86-L100
> >
>

Re: A couple of questions about pyarrow.parquet

Reply via email to