[jira] [Assigned] (PARQUET-1878) [C++] lz4 codec is not compatible with Hadoop Lz4Codec

Wes McKinney (Jira) Tue, 22 Sep 2020 12:07:53 -0700


     [ 
https://issues.apache.org/jira/browse/PARQUET-1878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Wes McKinney reassigned PARQUET-1878:
-------------------------------------

    Assignee: Patrick Pai

> [C++] lz4 codec is not compatible with Hadoop Lz4Codec
> ------------------------------------------------------
>
>                 Key: PARQUET-1878
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1878
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-cpp
>            Reporter: Steve M. Kim
>            Assignee: Patrick Pai
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: cpp-1.6.0
>
>          Time Spent: 9h 20m
>  Remaining Estimate: 0h
>
> As described in HADOOP-12990, the Hadoop {{Lz4Codec}} uses the lz4 block 
> format, and it prepends 8 extra bytes before the compressed data. I believe 
> that the lz4 implementation in parquet-cpp also uses the lz4 block format, 
> but it does not prepend these 8 extra bytes.
>  
> Using Java parquet-mr, I wrote a Parquet file with lz4 compression:
> {code:java}
> $ parquet-tools meta 
> /tmp/f4a1c7f57cb1c98c2b9da3b25b16d027df5d2f1cf55adb79374c154fbd79011f
> file:        
> file:/tmp/f4a1c7f57cb1c98c2b9da3b25b16d027df5d2f1cf55adb79374c154fbd79011f
> creator:     parquet-mr version 1.10.1 (build 
> a89df8f9932b6ef6633d06069e50c9b7970bebd1)
> file schema:
> --------------------------------------------------------------------------------
> c1:          REQUIRED INT64 R:0 D:0
> c0:          REQUIRED BINARY R:0 D:0
> v0:          REQUIRED INT64 R:0 D:0
> row group 1: RC:5007 TS:28028 OFFSET:4
> --------------------------------------------------------------------------------
> c1:           INT64 LZ4 DO:0 FPO:4 SZ:24797/25694/1.04 VC:5007 
> ENC:DELTA_BINARY_PACKED ST:[min: 1566330126476659000, max: 
> 1571211622650188000, num_nulls: 0]
> c0:           BINARY LZ4 DO:0 FPO:24801 SZ:279/260/0.93 VC:5007 
> ENC:PLAIN,RLE_DICTIONARY ST:[min: 
> 0x7471732F62656566616C6F2F746F6D6163636F2D66782D6D6B74646174612D6C69766573747265616D,
>  max: 
> 0x7471732F62656566616C6F2F746F6D6163636F2D66782D6D6B74646174612D6C69766573747265616D,
>  num_nulls: 0]
> v0:           INT64 LZ4 DO:0 FPO:25080 SZ:1348/2074/1.54 VC:5007 
> ENC:PLAIN,RLE_DICTIONARY ST:[min: 0, max: 9, num_nulls: 0] {code}
> When I attempted to read this file with parquet-cpp, I got the following 
> error:
> {code:java}
> >>> import pyarrow.parquet as pq
> >>> pq.read_table('/tmp/f4a1c7f57cb1c98c2b9da3b25b16d027df5d2f1cf55adb79374c154fbd79011f')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File 
> "/home/skim/miniconda3/envs/arrow/lib/python3.8/site-packages/pyarrow/parquet.py",
>  line 1536, in read_table
>     return pf.read(columns=columns, use_threads=use_threads,
>   File 
> "/home/skim/miniconda3/envs/arrow/lib/python3.8/site-packages/pyarrow/parquet.py",
>  line 1260, in read
>     table = piece.read(columns=columns, use_threads=use_threads,
>   File 
> "/home/skim/miniconda3/envs/arrow/lib/python3.8/site-packages/pyarrow/parquet.py",
>  line 707, in read
>     table = reader.read(**options)
>   File 
> "/home/skim/miniconda3/envs/arrow/lib/python3.8/site-packages/pyarrow/parquet.py",
>  line 336, in read
>     return self.reader.read_all(column_indices=column_indices,
>   File "pyarrow/_parquet.pyx", line 1130, in 
> pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> OSError: IOError: Corrupt Lz4 compressed data. {code}
>  
> [https://github.com/apache/arrow/issues/3491] reported incompatibility in the 
> other direction, using Spark (which uses the Hadoop lz4 codec) to read a 
> parquet file that was written with parquet-cpp.
>  
> Given that the Hadoop lz4 codec has long been in use, and users have 
> accumulated Parquet files that were written with this implementation, I 
> propose changing parquet-cpp to match the Hadoop implementation.
>  
> See also:
>  * 
> https://issues.apache.org/jira/browse/PARQUET-1241?focusedCommentId=16574328#comment-16574328
>  * 
> https://issues.apache.org/jira/browse/PARQUET-1241?focusedCommentId=16585288#comment-16585288
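
To make the 8-byte prefix from the description concrete, here is a minimal Python sketch of the framing. The issue only says that 8 extra bytes precede the compressed data. The layout used below (a 4-byte big-endian uncompressed length followed by a 4-byte big-endian compressed length) is an assumption based on how Hadoop's block compressor stream is commonly described, and the helper names are hypothetical. The sketch handles a single frame per buffer and does not perform any lz4 compression or decompression itself.

```python
import struct

# Assumed layout: two 4-byte big-endian unsigned integers,
# (uncompressed_size, compressed_size), followed by the raw lz4 block.
HADOOP_PREFIX_LEN = 8


def add_hadoop_prefix(raw_block: bytes, uncompressed_size: int) -> bytes:
    """Prepend the assumed 8-byte Hadoop-style header to a raw lz4 block."""
    header = struct.pack(">II", uncompressed_size, len(raw_block))
    return header + raw_block


def split_hadoop_prefix(buf: bytes):
    """Return (uncompressed_size, raw_block) if buf looks Hadoop-framed.

    Returns None when the buffer is too short or when the second header
    field does not match the number of bytes that follow the header.
    This is only a plausibility check, not proof of the format.
    """
    if len(buf) < HADOOP_PREFIX_LEN:
        return None
    uncompressed_size, compressed_size = struct.unpack_from(">II", buf, 0)
    if compressed_size != len(buf) - HADOOP_PREFIX_LEN:
        return None
    return uncompressed_size, buf[HADOOP_PREFIX_LEN:]
```

A reader that expects a bare lz4 block would treat the 8 header bytes as compressed data, which is consistent with the "Corrupt Lz4 compressed data" error above. A reader that wants to accept both layouts could call `split_hadoop_prefix` first and fall back to treating the whole buffer as a raw block when it returns None.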



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
