[ 
https://issues.apache.org/jira/browse/PARQUET-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Finis updated PARQUET-2238:
-------------------------------
    Description: 
The spec in parquet-format specifies that [DELTA_BYTE_ARRAY is only supported 
for the physical type 
BYTE_ARRAY|https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array--6].
 Yet, [parquet-mr also uses it to encode 
FIXED_LEN_BYTE_ARRAY|https://github.com/apache/parquet-mr/blob/fd1326a8a56174320ea2f36d7c6c4e78105ab108/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV2ValuesWriterFactory.java#L83].

So either the spec should be updated to include FIXED_LEN_BYTE_ARRAY in the 
supported types of the DELTA_BYTE_ARRAY encoding, or the code should be changed to 
no longer write this encoding for FIXED_LEN_BYTE_ARRAY.

I think changing the spec is more prudent, given that 
a) the encoding can make sense for FIXED_LEN_BYTE_ARRAY, and
b) many files may already have been written with this encoding / type 
combination.
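
To illustrate point a), here is a minimal sketch of the prefix/suffix split that DELTA_BYTE_ARRAY is built on, applied to values that all have the same length. It is not parquet-mr's implementation: the function names and the sample values are hypothetical, and the sketch stops at the split, leaving out the further encoding of the prefix lengths and suffixes that the real format applies. It only suggests that nothing in the split depends on the values having variable length.

```python
from typing import List, Tuple


def split_prefix_suffix(values: List[bytes]) -> Tuple[List[int], List[bytes]]:
    """Split each value into (length of prefix shared with the previous value, suffix).

    Hypothetical helper for illustration only.
    """
    prefix_lengths: List[int] = []
    suffixes: List[bytes] = []
    previous = b""
    for value in values:
        limit = min(len(previous), len(value))
        shared = 0
        # Count the leading bytes this value has in common with the previous one.
        while shared < limit and previous[shared] == value[shared]:
            shared += 1
        prefix_lengths.append(shared)
        suffixes.append(value[shared:])
        previous = value
    return prefix_lengths, suffixes


def join_prefix_suffix(prefix_lengths: List[int], suffixes: List[bytes]) -> List[bytes]:
    """Inverse of split_prefix_suffix: rebuild the original values."""
    values: List[bytes] = []
    previous = b""
    for shared, suffix in zip(prefix_lengths, suffixes):
        value = previous[:shared] + suffix
        values.append(value)
        previous = value
    return values


# Assumed sample data: three 10-byte values, as a FIXED_LEN_BYTE_ARRAY(10) column might hold.
fixed_width_values = [b"2023-01-01", b"2023-01-02", b"2023-02-15"]
prefix_lengths, suffixes = split_prefix_suffix(fixed_width_values)
restored = join_prefix_suffix(prefix_lengths, suffixes)
```

For sorted or otherwise similar fixed-length values, most bytes end up in the shared prefix, so only short suffixes remain to be stored.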

  was:
The spec in parquet-format specifies that [DELTA_BYTE_ARRAY is only supported 
for the physical type 
BYTE_ARRAY|https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array--6].
 Yet, [parquet-mr also uses it to encode 
FIXED_LEN_BYTE_ARRAY|https://github.com/apache/parquet-mr/blob/fd1326a8a56174320ea2f36d7c6c4e78105ab108/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV2ValuesWriterFactory.java#L83].

So, I guess the spec should be updated to include FIXED_LEN_BYTE_ARRAY in the 
supported types of DELTA_BYTE_ARRAY encoding, or the code should be changed to 
no longer write this encoding for FIXED_LEN_BYTE_ARRAY.


> Spec and parquet-mr disagree on DELTA_BYTE_ARRAY encoding
> ---------------------------------------------------------
>
>                 Key: PARQUET-2238
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2238
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format, parquet-mr
>            Reporter: Jan Finis
>            Priority: Minor
>
> The spec in parquet-format specifies that [DELTA_BYTE_ARRAY is only supported 
> for the physical type 
> BYTE_ARRAY|https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array--6].
>  Yet, [parquet-mr also uses it to encode 
> FIXED_LEN_BYTE_ARRAY|https://github.com/apache/parquet-mr/blob/fd1326a8a56174320ea2f36d7c6c4e78105ab108/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV2ValuesWriterFactory.java#L83].
> So either the spec should be updated to include FIXED_LEN_BYTE_ARRAY in the 
> supported types of the DELTA_BYTE_ARRAY encoding, or the code should be changed 
> to no longer write this encoding for FIXED_LEN_BYTE_ARRAY.
> I think changing the spec is more prudent, given that 
> a) the encoding can make sense for FIXED_LEN_BYTE_ARRAY, and
> b) many files may already have been written with this encoding / type 
> combination.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
