I can't help with the first question. Regarding push-down predicates, there is an open JIRA [1] to do just that.
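
In the meantime, here is a rough sketch of the kind of predicate described in the question: field-to-literal comparisons combined with AND, OR and NOT. It is plain Python, not an Arrow or Parquet API. The names Cmp, IsNull, And, Or, Not, matches, might_match and scan are all made up for illustration, and the per-row-group min/max statistics are assumed to be handed in as plain (min, max) tuples.

```python
import operator
from dataclasses import dataclass
from typing import Any

OPS = {">": operator.gt, ">=": operator.ge, "==": operator.eq,
       "<=": operator.le, "<": operator.lt, "!=": operator.ne}

# Hypothetical expression nodes (not part of pyarrow or parquet-cpp).
@dataclass(frozen=True)
class Cmp:
    field: str
    op: str          # one of the keys of OPS
    value: Any

@dataclass(frozen=True)
class IsNull:
    field: str

@dataclass(frozen=True)
class And:
    left: Any
    right: Any

@dataclass(frozen=True)
class Or:
    left: Any
    right: Any

@dataclass(frozen=True)
class Not:
    child: Any

def matches(expr, row):
    """Record-level check: does this row (a dict) satisfy the predicate?"""
    if isinstance(expr, Cmp):
        v = row.get(expr.field)
        # Two-valued logic: a comparison against a null simply fails.
        return v is not None and OPS[expr.op](v, expr.value)
    if isinstance(expr, IsNull):
        return row.get(expr.field) is None
    if isinstance(expr, And):
        return matches(expr.left, row) and matches(expr.right, row)
    if isinstance(expr, Or):
        return matches(expr.left, row) or matches(expr.right, row)
    if isinstance(expr, Not):
        return not matches(expr.child, row)
    raise TypeError(f"unknown expression: {expr!r}")

def might_match(expr, stats):
    """Row-group check: returns False only if no row in the group can match."""
    if isinstance(expr, Cmp) and expr.field in stats:
        lo, hi = stats[expr.field]
        if expr.op == ">":
            return hi > expr.value
        if expr.op == ">=":
            return hi >= expr.value
        if expr.op == "<":
            return lo < expr.value
        if expr.op == "<=":
            return lo <= expr.value
        if expr.op == "==":
            return lo <= expr.value <= hi
        if expr.op == "!=":
            return not (lo == hi == expr.value)
    if isinstance(expr, And):
        return might_match(expr.left, stats) and might_match(expr.right, stats)
    if isinstance(expr, Or):
        return might_match(expr.left, stats) or might_match(expr.right, stats)
    return True  # Not, IsNull, fields without stats: stay conservative

def scan(row_groups, expr):
    """Yield matching rows, skipping whole groups that the stats rule out."""
    for stats, rows in row_groups:
        if not might_match(expr, stats):
            continue  # nothing in this group can match, so skip it entirely
        for row in rows:
            if matches(expr, row):
                yield row
```

"is not null" is written as Not(IsNull("field")). The row-group check is deliberately conservative: it may keep a group that turns out to contain no matching rows, but it never drops one that does.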

[1] https://issues.apache.org/jira/browse/PARQUET-473

On Fri, May 17, 2019 at 11:48 AM Ted Gooch <tedgo...@gmail.com> wrote:

> Hi,
>
> I've been doing some work trying to get the parquet read path going for the
> python iceberg <https://github.com/apache/incubator-iceberg> library. I
> have two questions that I couldn't get figured out, and was hoping I could
> get some guidance from the list here.
>
> First, I'd like to create a ParquetSchema->IcebergSchema converter, but it
> appears that only limited information is available in the ColumnSchema
> passed back to the python client[2]:
>
> <ParquetColumnSchema>
>   name: key
>   path: m.map.key
>   max_definition_level: 2
>   max_repetition_level: 1
>   physical_type: BYTE_ARRAY
>   logical_type: UTF8
> <ParquetColumnSchema>
>   name: key
>   path: m.map.value.map.key
>   max_definition_level: 4
>   max_repetition_level: 2
>   physical_type: BYTE_ARRAY
>   logical_type: UTF8
> <ParquetColumnSchema>
>   name: value
>   path: m.map.value.map.value
>   max_definition_level: 5
>   max_repetition_level: 2
>   physical_type: BYTE_ARRAY
>   logical_type: UTF8
>
> where physical_type and logical_type are both strings[1]. The arrow schema
> I can get from *to_arrow_schema* looks to be more expressive(although may
> be I just don't understand the parquet format well enough):
>
> m: struct<map: list<map: struct<key: string, value: struct<map: list<map: struct<key: string, value: string> not null>>> not null>>
>   child 0, map: list<map: struct<key: string, value: struct<map: list<map: struct<key: string, value: string> not null>>> not null>
>     child 0, map: struct<key: string, value: struct<map: list<map: struct<key: string, value: string> not null>>>
>       child 0, key: string
>       child 1, value: struct<map: list<map: struct<key: string, value: string> not null>>
>         child 0, map: list<map: struct<key: string, value: string> not null>
>           child 0, map: struct<key: string, value: string>
>             child 0, key: string
>             child 1, value: string
>
> It seems like I can infer the info from the name/path, but is there a more
> direct way of getting the detailed parquet schema information?
>
> Second question, is there a way to push record level filtering into the
> parquet reader, so that the parquet reader only reads in values that match
> a given predicate expression? Predicate expressions would be simple
> field-to-literal comparisons(>,>=,==,<=,<, !=, is null, is not null)
> connected with logical operators(AND, OR, NOT).
>
> I've seen that after reading-in I can use the filtering language in
> gandiva[3] to get filtered record-batches, but was looking for somewhere
> lower in the stack if possible.
>
> [1] https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pyx#L661-L667
> [2] Spark/Hive Table DDL for this parquet file looks like:
>     CREATE TABLE `iceberg`.`nested_map` (
>       m map<string,map<string,string>>)
> [3] https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_gandiva.py#L86-L100
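
On the name/path inference mentioned in the quoted message: the sketch below shows one way it might look, using only the dotted paths and type strings printed above. It is plain Python with no Arrow calls. build_tree and COLUMNS are hypothetical names, and splitting on "." assumes no field name itself contains a dot. It recovers the group nesting only. Whether a group is optional or repeated, or annotated as a MAP, is not recoverable from the paths alone.

```python
# Column descriptions copied from the ParquetColumnSchema output quoted above.
COLUMNS = [
    {"path": "m.map.key", "physical_type": "BYTE_ARRAY", "logical_type": "UTF8"},
    {"path": "m.map.value.map.key", "physical_type": "BYTE_ARRAY", "logical_type": "UTF8"},
    {"path": "m.map.value.map.value", "physical_type": "BYTE_ARRAY", "logical_type": "UTF8"},
]

def build_tree(columns):
    """Rebuild the nesting of groups from dotted leaf-column paths.

    Groups become nested dicts; each leaf column becomes a
    (physical_type, logical_type) tuple.
    """
    root = {}
    for col in columns:
        # Everything before the last path component names an enclosing group.
        *groups, leaf = col["path"].split(".")
        node = root
        for name in groups:
            node = node.setdefault(name, {})
        node[leaf] = (col["physical_type"], col["logical_type"])
    return root

tree = build_tree(COLUMNS)
```

Here tree["m"]["map"] holds a "key" leaf and a "value" group, and that "value" group holds a nested "map" group with its own "key" and "value" leaves, matching the map<string,map<string,string>> DDL in footnote [2].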