[GitHub] [iceberg] anthonysgro opened a new issue, #7162: Missing vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support

via GitHub Tue, 21 Mar 2023 11:31:02 -0700


anthonysgro opened a new issue, #7162:
URL: https://github.com/apache/iceberg/issues/7162

### Feature Request / Improvement

As it stands today, if you want to use both Spark and Athena with your
Iceberg tables in v1.1.0, you must disable the vectorized reader. Athena
writes some fields with delta encodings (DELTA_BYTE_ARRAY and
DELTA_LENGTH_BYTE_ARRAY), which Iceberg's vectorized reader does not support.
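For context, here is a minimal sketch of what decoding DELTA_BYTE_ARRAY involves, based on the Parquet format's description of the encoding: each value is stored as the length of the prefix it shares with the previous value, plus the remaining suffix. In a real data page the prefix lengths and suffix lengths are themselves DELTA_BINARY_PACKED; this sketch assumes they have already been decoded into plain Python lists. The function name `decode_delta_byte_array` is hypothetical and is not part of Iceberg or Parquet.

```python
from typing import List


def decode_delta_byte_array(prefix_lengths: List[int], suffixes: List[bytes]) -> List[bytes]:
    """Rebuild values from (shared prefix length, suffix) pairs.

    Assumption: both inputs are already decoded from the page; a real
    reader would first unpack them from DELTA_BINARY_PACKED blocks.
    """
    if len(prefix_lengths) != len(suffixes):
        raise ValueError("prefix_lengths and suffixes must have the same length")

    values: List[bytes] = []
    previous = b""
    for prefix_len, suffix in zip(prefix_lengths, suffixes):
        # The prefix is taken from the previously decoded value.
        if prefix_len < 0 or prefix_len > len(previous):
            raise ValueError("prefix length exceeds previous value length")
        current = previous[:prefix_len] + suffix
        values.append(current)
        previous = current
    return values


if __name__ == "__main__":
    # Illustrative data only, not taken from a real Athena-written file.
    print(decode_delta_byte_array([0, 2, 2], [b"alice@x.io", b"an@x.io", b"ex@x.io"]))
```

Because each value depends on the one before it, a vectorized reader has to carry the previous value across batch boundaries, which is part of why these encodings need dedicated reader support.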

If you have ever hit the following error, you have probably been impacted by
this issue:
`
java.lang.UnsupportedOperationException: Cannot support vectorized reads for column [email] optional binary email (STRING) = 1 with encoding DELTA_BYTE_ARRAY. Disable vectorized reads to read this table/file
    at org.apache.iceberg.arrow.vectorized.parquet.VectorizedPageIterator.initDataReader(VectorizedPageIterator.java:96)
`
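Until the encodings are supported, the workaround the error message points to can be sketched as follows. This assumes the Iceberg table property `read.parquet.vectorization.enabled` controls vectorized Parquet reads (please verify the name against the configuration docs for your Iceberg version). The table identifier `my_catalog.db.events` and the helper name `disable_vectorization_sql` are hypothetical.

```python
def disable_vectorization_sql(table: str) -> str:
    """Build a Spark SQL statement that turns off vectorized Parquet reads.

    Assumption: 'read.parquet.vectorization.enabled' is the relevant
    Iceberg table property for the version in use.
    """
    return (
        f"ALTER TABLE {table} "
        "SET TBLPROPERTIES ('read.parquet.vectorization.enabled' = 'false')"
    )


if __name__ == "__main__":
    statement = disable_vectorization_sql("my_catalog.db.events")  # hypothetical table
    print(statement)
    # With an active SparkSession named `spark`, this could be run as:
    # spark.sql(statement)
```

This trades read performance for compatibility, so it is a stopgap and not a substitute for reader support of the delta encodings.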

Spark added this support to its own vectorized reader in 2022:
https://github.com/apache/spark/pull/35262
However, Iceberg uses its own vectorized reader, so it does not pick up
that change.

Would it be possible to implement support for these encodings in Iceberg's
vectorized reader? It would solve a significant interoperability problem
between Athena, Spark, and possibly other query engines that read the same
tables.

### Query engine

None
