Hi there,

there is an ongoing effort in Spark to support filter push down for nested
columns while reading Parquet. I was wondering if this work could be
extended to predicates on elements of arrays, keys/values of maps. After my
investigation, it seems impossible. I would appreciate your comments on my
thoughts below.

After exploring footers of some random Parquet files with arrays, I had an
idea to push down filters on elements of arrays by referring to those
elements via dots. A sample query is presented below:


spark.sql("SELECT arr[0] FROM t WHERE arr[0] = 0").explain(true)


The footer in this case might look like:


file schema: spark_schema

--------------------------------------------------------------------------------

a:           REQUIRED INT32 R:0 D:0

arr:         OPTIONAL F:1

.list:       REPEATED F:1

..element:   REQUIRED INT32 R:1 D:2


row group 1: RC:2 TS:135 OFFSET:4

--------------------------------------------------------------------------------

a:            INT32 SNAPPY DO:0 FPO:4 SZ:55/53/0.96 VC:2
ENC:BIT_PACKED,PLAIN ST:[min: 1, max: 2, num_nulls: 0]

arr:

.list:

..element:    INT32 SNAPPY DO:0 FPO:59 SZ:86/82/0.95 VC:6
ENC:PLAIN_DICTIONARY,RLE ST:[min: 1, max: 2, num_nulls: 0]


Basically, the idea was to reference array elements in the query above as
"arr.list.element", just as is already done for nested struct fields, so that
row groups irrelevant to the query can be filtered out.

However, the problem is that Parquet does not support filter predicates on
repeated columns. A call like FilterApi.lt(intColumn("arr.list.element"),
0.asInstanceOf[Integer]) fails with "FilterPredicates do not currently
support repeated columns. Column arr.list.element is repeated".
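For reference, here is a minimal sketch of the call in question (assuming
parquet-column is on the classpath; the column name and threshold are just
the values from the example above):

```scala
import org.apache.parquet.filter2.predicate.FilterApi
import org.apache.parquet.filter2.predicate.FilterApi.intColumn

// Build a predicate on the repeated path from the footer above.
// Constructing the predicate object itself succeeds; the exception is
// thrown later, when the predicate is validated against the file schema
// by SchemaCompatibilityValidator during row-group filtering.
val pred = FilterApi.lt(intColumn("arr.list.element"), Integer.valueOf(0))
```

(The equivalent call on a non-repeated nested struct field, e.g.
intColumn("s.x"), passes validation, which is what makes pushdown for
nested struct columns feasible today.)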

Right now, I have these questions:

- Am I correct that we cannot rely on all arrays being represented as in
the file above? There might be legacy files where repeated values are
defined as `repeated int32 element`, for example, rather than with the
three-level list structure.
- If we somehow solve the first point, how hard would it be to extend
Parquet to support filters on repeated values? Right now, such predicates
are rejected by SchemaCompatibilityValidator [1].

[1] -
https://github.com/apache/parquet-mr/blob/0569f5128d5e529b5114ba05db9b853625918b43/parquet-column/src/main/java/org/apache/parquet/filter2/predicate/SchemaCompatibilityValidator.java#L176

Thanks in advance,
Anton
