Please see also https://docs.google.com/document/d/1bVhzifD38qDypnSjtf8exvpP3sSB5x_Kw9m-n66FB2c/edit?usp=drivesdk
And prior mailing list discussion. I will comment in more detail on the other items later.

On Fri, May 17, 2019, 2:44 PM Micah Kornfield <emkornfi...@gmail.com> wrote:

> I can't help on the first question.
>
> Regarding push-down predicates, there is an open JIRA [1] to do just that
>
> [1] https://issues.apache.org/jira/browse/PARQUET-473
> <https://issues.apache.org/jira/browse/PARQUET-473?jql=project%20in%20(PARQUET%2C%20ARROW)%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20text%20~%20%22pushdown%22>
>
> On Fri, May 17, 2019 at 11:48 AM Ted Gooch <tedgo...@gmail.com> wrote:
>
> > Hi,
> >
> > I've been doing some work trying to get the parquet read path going for the
> > python iceberg <https://github.com/apache/incubator-iceberg> library. I
> > have two questions that I couldn't get figured out, and was hoping I could
> > get some guidance from the list here.
> >
> > First, I'd like to create a ParquetSchema->IcebergSchema converter, but it
> > appears that only limited information is available in the ColumnSchema
> > passed back to the python client[2]:
> >
> > <ParquetColumnSchema>
> >   name: key
> >   path: m.map.key
> >   max_definition_level: 2
> >   max_repetition_level: 1
> >   physical_type: BYTE_ARRAY
> >   logical_type: UTF8
> > <ParquetColumnSchema>
> >   name: key
> >   path: m.map.value.map.key
> >   max_definition_level: 4
> >   max_repetition_level: 2
> >   physical_type: BYTE_ARRAY
> >   logical_type: UTF8
> > <ParquetColumnSchema>
> >   name: value
> >   path: m.map.value.map.value
> >   max_definition_level: 5
> >   max_repetition_level: 2
> >   physical_type: BYTE_ARRAY
> >   logical_type: UTF8
> >
> > where physical_type and logical_type are both strings[1]. The arrow schema
> > I can get from *to_arrow_schema* looks to be more expressive (although
> > maybe I just don't understand the parquet format well enough):
> >
> > m: struct<map: list<map: struct<key: string, value: struct<map: list<map: struct<key: string, value: string> not null>>> not null>>
> >   child 0, map: list<map: struct<key: string, value: struct<map: list<map: struct<key: string, value: string> not null>>> not null>
> >     child 0, map: struct<key: string, value: struct<map: list<map: struct<key: string, value: string> not null>>>
> >       child 0, key: string
> >       child 1, value: struct<map: list<map: struct<key: string, value: string> not null>>
> >         child 0, map: list<map: struct<key: string, value: string> not null>
> >           child 0, map: struct<key: string, value: string>
> >             child 0, key: string
> >             child 1, value: string
> >
> > It seems like I can infer the info from the name/path, but is there a more
> > direct way of getting the detailed parquet schema information?
> >
> > Second question, is there a way to push record level filtering into the
> > parquet reader, so that the parquet reader only reads in values that match
> > a given predicate expression? Predicate expressions would be simple
> > field-to-literal comparisons (>, >=, ==, <=, <, !=, is null, is not null)
> > connected with logical operators (AND, OR, NOT).
> >
> > I've seen that after reading-in I can use the filtering language in
> > gandiva[3] to get filtered record-batches, but was looking for somewhere
> > lower in the stack if possible.
> >
> > [1]
> > https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pyx#L661-L667
> > [2] Spark/Hive Table DDL for this parquet file looks like:
> > CREATE TABLE `iceberg`.`nested_map` (
> > m map<string,map<string,string>>)
> > [3]
> > https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_gandiva.py#L86-L100
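
For reference, the kind of predicate expression described in the second question (field-to-literal comparisons joined by AND, OR, NOT) could be modelled roughly as below. This is only a sketch in plain Python that filters already-decoded records held as dicts. It is not a pyarrow or parquet-cpp API and does no push-down into the reader. The names Compare, IsNull, And, Or, Not and filter_records are made up for illustration.

```python
import operator
from dataclasses import dataclass
from typing import Any, Dict, List

Record = Dict[str, Any]

# Comparison operators listed in the question, mapped to Python callables.
_OPS = {
    ">": operator.gt, ">=": operator.ge, "==": operator.eq,
    "<=": operator.le, "<": operator.lt, "!=": operator.ne,
}


@dataclass(frozen=True)
class Compare:
    """Hypothetical field-to-literal comparison, e.g. Compare("id", ">=", 5)."""
    field: str
    op: str
    literal: Any

    def matches(self, record: Record) -> bool:
        value = record.get(self.field)
        if value is None:
            # Simplification: a comparison against a null value never matches.
            # (Not(Compare(...)) therefore matches nulls, unlike SQL semantics.)
            return False
        return _OPS[self.op](value, self.literal)


@dataclass(frozen=True)
class IsNull:
    """Hypothetical 'is null' test; 'is not null' is Not(IsNull(field))."""
    field: str

    def matches(self, record: Record) -> bool:
        return record.get(self.field) is None


@dataclass(frozen=True)
class Not:
    child: Any

    def matches(self, record: Record) -> bool:
        return not self.child.matches(record)


@dataclass(frozen=True)
class And:
    left: Any
    right: Any

    def matches(self, record: Record) -> bool:
        return self.left.matches(record) and self.right.matches(record)


@dataclass(frozen=True)
class Or:
    left: Any
    right: Any

    def matches(self, record: Record) -> bool:
        return self.left.matches(record) or self.right.matches(record)


def filter_records(records: List[Record], predicate: Any) -> List[Record]:
    """Keep only the records for which the predicate tree evaluates to True."""
    return [r for r in records if predicate.matches(r)]
```

With this shape, "id >= 5 AND name is not null" would be written as And(Compare("id", ">=", 5), Not(IsNull("name"))). A reader-level implementation would presumably evaluate the same tree against column statistics or decoded values before materializing rows, but that part is outside this sketch.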