[GitHub] [arrow] wgtmac commented on pull request #36955: GH-36882: [C++][Parquet] Default RLE for bool values in v2 pages

via GitHub Thu, 10 Aug 2023 19:38:05 -0700


wgtmac commented on PR #36955:
URL: https://github.com/apache/arrow/pull/36955#issuecomment-1674148021


   > My concern is mentioned in #36882
   > 
   > The code looks good, but maybe we would change it when we support more 
TYPES and default in 2.0?
   
   I have simply checked the parquet impl in the arrow-rs. It has two 
distinctions compared to parquet-cpp:
   - There isn't DataPageVersion in the arrow-rs. It depends on WriterVersion 
to decide the data page version. For example, PARQUET_1_0 uses data page V1 and 
PARQUET_2_0 uses V2. This seems to be more aligned with parquet-mr.
   - WriterVersion in the arrow-rs only has two values: PARQUET_1_0 and 
PARQUET_2_0. In the parquet-cpp we have more fine-grained versions, so we'd 
better not enable an encoding introduced by 2_8 (e.g. BYTE_STREAM_SPLIT) when 
writer version is PARQUET_2_6 or less. Unfortunately parquet-mr can enable 
BYTE_STREAM_SPLIT even when the writer version is V1.
   
   Now I have changed the code to use `Encoding::UNKNOWN` as the default. 
Before proceeding, I need to do more investigation on the relationship between 
encoding and version.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] wgtmac commented on pull request #36955: GH-36882: [C++][Parquet] Default RLE for bool values in v2 pages

Reply via email to