Re: A couple of questions about pyarrow.parquet

Micah Kornfield Fri, 17 May 2019 12:44:27 -0700

I can't help on the first question.

Regarding push-down predicates, there is an open JIRA [1] to do just that.


[1] https://issues.apache.org/jira/browse/PARQUET-473
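
Independent of that ticket, the predicate shape described in the quoted
message below (field-to-literal comparisons joined by AND, OR, NOT) can be
applied to records after they have been read. What follows is only a rough
sketch in plain Python; it does not use any pyarrow API, and every name in it
(Cmp, And, Or, Not, evaluate) is hypothetical. It also assumes that a
comparison against a null value is simply False, which is not SQL's
three-valued logic.

```python
import operator
from dataclasses import dataclass
from typing import Any, Mapping

# Comparison operators listed in the question. "is null" and "is not null"
# take no literal and are handled separately in evaluate().
_OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
    "<=": operator.le,
    "<": operator.lt,
    "!=": operator.ne,
}


@dataclass(frozen=True)
class Cmp:
    """Hypothetical field-to-literal comparison node."""
    name: str
    op: str
    literal: Any = None


@dataclass(frozen=True)
class And:
    left: Any
    right: Any


@dataclass(frozen=True)
class Or:
    left: Any
    right: Any


@dataclass(frozen=True)
class Not:
    child: Any


def evaluate(expr: Any, record: Mapping[str, Any]) -> bool:
    """Return True if the record (a dict of field name -> value) matches."""
    if isinstance(expr, And):
        return evaluate(expr.left, record) and evaluate(expr.right, record)
    if isinstance(expr, Or):
        return evaluate(expr.left, record) or evaluate(expr.right, record)
    if isinstance(expr, Not):
        return not evaluate(expr.child, record)
    value = record.get(expr.name)
    if expr.op == "is null":
        return value is None
    if expr.op == "is not null":
        return value is not None
    # Assumption: comparing a null value with a literal never matches.
    if value is None:
        return False
    return _OPS[expr.op](value, expr.literal)


# Made-up records, purely for illustration.
records = [
    {"id": 1, "name": "a"},
    {"id": 5, "name": None},
    {"id": 9, "name": "c"},
]
predicate = And(Cmp("id", ">", 1), Not(Cmp("name", "is null")))
matches = [r for r in records if evaluate(predicate, r)]
```

This filters after the read, so it saves no I/O; a real push-down would have
to evaluate the same kind of expression inside the reader.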

On Fri, May 17, 2019 at 11:48 AM Ted Gooch <tedgo...@gmail.com> wrote:

> Hi,
>
> I've been working on getting the Parquet read path going for the
> Python Iceberg <https://github.com/apache/incubator-iceberg> library.  I
> have two questions that I couldn't figure out, and was hoping to get
> some guidance from the list.
>
> First, I'd like to create a ParquetSchema->IcebergSchema converter, but it
> appears that only limited information is available in the ColumnSchema
> passed back to the python client[2]:
>
> <ParquetColumnSchema>
>   name: key
>   path: m.map.key
>   max_definition_level: 2
>   max_repetition_level: 1
>   physical_type: BYTE_ARRAY
>   logical_type: UTF8
> <ParquetColumnSchema>
>   name: key
>   path: m.map.value.map.key
>   max_definition_level: 4
>   max_repetition_level: 2
>   physical_type: BYTE_ARRAY
>   logical_type: UTF8
> <ParquetColumnSchema>
>   name: value
>   path: m.map.value.map.value
>   max_definition_level: 5
>   max_repetition_level: 2
>   physical_type: BYTE_ARRAY
>   logical_type: UTF8
>
>
> where physical_type and logical_type are both strings[1].  The arrow schema
> I can get from *to_arrow_schema* looks to be more expressive (although
> maybe I just don't understand the Parquet format well enough):
>
> m: struct<map: list<map: struct<key: string, value: struct<map: list<map:
> struct<key: string, value: string> not null>>> not null>>
>   child 0, map: list<map: struct<key: string, value: struct<map: list<map:
> struct<key: string, value: string> not null>>> not null>
>       child 0, map: struct<key: string, value: struct<map: list<map:
> struct<key: string, value: string> not null>>>
>           child 0, key: string
>           child 1, value: struct<map: list<map: struct<key: string, value:
> string> not null>>
>               child 0, map: list<map: struct<key: string, value: string>
> not null>
>                   child 0, map: struct<key: string, value: string>
>                       child 0, key: string
>                       child 1, value: string
>
>
> It seems like I can infer the info from the name/path, but is there a more
> direct way of getting the detailed parquet schema information?
>
> Second question: is there a way to push record-level filtering into the
> Parquet reader, so that the reader only reads in values that match
> a given predicate expression? Predicate expressions would be simple
> field-to-literal comparisons (>, >=, ==, <=, <, !=, is null, is not null)
> connected with logical operators (AND, OR, NOT).
>
> I've seen that after reading the data in I can use the filtering language
> in gandiva[3] to get filtered record batches, but I was looking for
> something lower in the stack if possible.
>
>
>
> [1] https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pyx#L661-L667
> [2] Spark/Hive Table DDL for this parquet file looks like:
> CREATE TABLE `iceberg`.`nested_map` (
> m map<string,map<string,string>>)
> [3] https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_gandiva.py#L86-L100
>

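On the first question in the quoted message, inferring the nesting from the
dotted column paths can be sketched as below. This is plain Python over the
three paths shown in the quoted output, not a pyarrow API; build_tree is a
hypothetical helper, and it assumes field names contain no dots. It recovers
only the group nesting: the paths carry no repetition or logical-type
annotations, so map versus list versus struct cannot be told apart this way.

```python
from typing import Dict, Iterable


def build_tree(paths: Iterable[str]) -> Dict[str, dict]:
    """Turn dotted leaf-column paths into a nested dict of field names.

    Leaf columns end up as empty dicts. Assumes "." only ever separates
    nesting levels, i.e. no field name contains a dot itself.
    """
    tree: Dict[str, dict] = {}
    for path in paths:
        node = tree
        for part in path.split("."):
            # Descend into the child, creating it on first sight.
            node = node.setdefault(part, {})
    return tree


# Leaf column paths copied from the ParquetColumnSchema output quoted above.
paths = [
    "m.map.key",
    "m.map.value.map.key",
    "m.map.value.map.value",
]
tree = build_tree(paths)
```

The resulting shape mirrors the struct nesting in the arrow schema dump, which
is why the arrow schema remains the richer source for a converter.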