[ https://issues.apache.org/jira/browse/PARQUET-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jan Finis updated PARQUET-2238:
-------------------------------
    Description:
The spec in parquet-format specifies that [DELTA_BYTE_ARRAY is only supported for the physical type BYTE_ARRAY|https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array--6]. Yet, [parquet-mr also uses it to encode FIXED_LEN_BYTE_ARRAY|https://github.com/apache/parquet-mr/blob/fd1326a8a56174320ea2f36d7c6c4e78105ab108/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV2ValuesWriterFactory.java#L83].

So, I guess the spec should be updated to include FIXED_LEN_BYTE_ARRAY in the supported types of DELTA_BYTE_ARRAY encoding, or the code should be changed to no longer write this encoding for FIXED_LEN_BYTE_ARRAY.

I guess changing the spec is more prudent, given that
a) the encoding can make sense for FIXED_LEN_BYTE_ARRAY and
b) there might already be countless files written with this encoding / type combination.

  was:
The spec in parquet-format specifies that [DELTA_BYTE_ARRAY is only supported for the physical type BYTE_ARRAY|https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array--6]. Yet, [parquet-mr also uses it to encode FIXED_LEN_BYTE_ARRAY|https://github.com/apache/parquet-mr/blob/fd1326a8a56174320ea2f36d7c6c4e78105ab108/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV2ValuesWriterFactory.java#L83].

So, I guess the spec should be updated to include FIXED_LEN_BYTE_ARRAY in the supported types of DELTA_BYTE_ARRAY encoding, or the code should be changed to no longer write this encoding for FIXED_LEN_BYTE_ARRAY.

> Spec and parquet-mr disagree on DELTA_BYTE_ARRAY encoding
> ---------------------------------------------------------
>
>                 Key: PARQUET-2238
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2238
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format, parquet-mr
>            Reporter: Jan Finis
>            Priority: Minor
>
> The spec in parquet-format specifies that
> [DELTA_BYTE_ARRAY is only supported for the physical type BYTE_ARRAY|https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array--6].
> Yet,
> [parquet-mr also uses it to encode FIXED_LEN_BYTE_ARRAY|https://github.com/apache/parquet-mr/blob/fd1326a8a56174320ea2f36d7c6c4e78105ab108/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV2ValuesWriterFactory.java#L83].
> So, I guess the spec should be updated to include FIXED_LEN_BYTE_ARRAY in the
> supported types of DELTA_BYTE_ARRAY encoding, or the code should be changed
> to no longer write this encoding for FIXED_LEN_BYTE_ARRAY.
> I guess changing the spec is more prudent, given that
> a) the encoding can make sense for FIXED_LEN_BYTE_ARRAY
> and
> b) there might already be countless files written with this encoding / type
> combination.
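
To illustrate point a), here is a minimal Python sketch of the prefix/suffix split (front compression) that DELTA_BYTE_ARRAY is built on, applied to fixed-length values. This is not parquet-mr's implementation: the function names and the sample 8-byte keys are hypothetical, and the sketch leaves out the actual on-disk step, where the prefix lengths are written with DELTA_BINARY_PACKED and the suffixes with DELTA_LENGTH_BYTE_ARRAY. It only shows that nothing in the technique depends on the values having variable length.

```python
from typing import List, Tuple


def front_compress(values: List[bytes]) -> Tuple[List[int], List[bytes]]:
    """Split each value into (length of prefix shared with the previous value, suffix)."""
    prefix_lengths: List[int] = []
    suffixes: List[bytes] = []
    previous = b""
    for value in values:
        limit = min(len(previous), len(value))
        shared = 0
        # Count leading bytes that match the previous value.
        while shared < limit and previous[shared] == value[shared]:
            shared += 1
        prefix_lengths.append(shared)
        suffixes.append(value[shared:])
        previous = value
    return prefix_lengths, suffixes


def front_decompress(prefix_lengths: List[int], suffixes: List[bytes]) -> List[bytes]:
    """Rebuild the original values from prefix lengths and suffixes."""
    values: List[bytes] = []
    previous = b""
    for shared, suffix in zip(prefix_lengths, suffixes):
        value = previous[:shared] + suffix
        values.append(value)
        previous = value
    return values


# Hypothetical FIXED_LEN_BYTE_ARRAY(8) column: sorted 8-byte big-endian keys.
fixed_len_values = [n.to_bytes(8, "big") for n in (1000, 1001, 1002, 1300)]
prefix_lengths, suffixes = front_compress(fixed_len_values)

print(prefix_lengths)                    # [0, 7, 7, 6]
print(sum(len(s) for s in suffixes))     # 12 suffix bytes instead of 32 raw bytes
```

Fixed-length values that share long prefixes (sorted keys, UUIDs with a common head, zero-padded decimals) shrink in the same way variable-length strings do, so a reader that accepts this encoding for BYTE_ARRAY needs no different decoding logic for FIXED_LEN_BYTE_ARRAY beyond checking the reconstructed length.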