[ https://issues.apache.org/jira/browse/PARQUET-1878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279168#comment-17279168 ]
mario luzi edited comment on PARQUET-1878 at 2/4/21, 9:25 PM:
--------------------------------------------------------------
Hello, we just tried the latest apache-arrow version 3.0.0 with the write example from the low-level API examples, but lz4 still seems to be incompatible with Hadoop. We got this error:
org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/home/leal/parquet/20210202130000-20210202130100-EOXDR_SRV_84-eoIUPS+1.lz4.parquet
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:255)
    at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:132)
    at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136)
    at org.apache.parquet.tools.command.HeadCommand.execute(HeadCommand.java:87)
    at org.apache.parquet.tools.Main.main(Main.java:223)
Caused by: java.lang.RuntimeException: native lz4 library not available
    at org.apache.hadoop.io.compress.Lz4Codec.getDecompressorType(Lz4Codec.java:195)
    at org.apache.hadoop.io.compress.CodecPool.getDecompressor(CodecPool.java:178)
    at org.apache.parquet.hadoop.CodecFactory$HeapBytesDecompressor.<init>(CodecFactory.java:98)
    at org.apache.parquet.hadoop.CodecFactory.createDecompressor(CodecFactory.java:210)

[leal@sulu parquet]$ ./hadoop-3.2.2/bin/hadoop jar apache-parquet-1.11.1/parquet-tools/target/parquet-tools-1.11.1.jar head --debug parquet_2_0_example2.parquet
2021-02-04 21:24:36,354 INFO hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 1500001 records.
2021-02-04 21:24:36,355 INFO hadoop.InternalParquetRecordReader: at row 0. reading next block
2021-02-04 21:24:36,397 INFO compress.CodecPool: Got brand-new decompressor [.lz4]
2021-02-04 21:24:36,410 INFO hadoop.InternalParquetRecordReader: block read in memory in 55 ms. row count = 434436
org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/home/leal/parquet/parquet_2_0_example2.parquet
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:255)
    at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:132)
    at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136)
    at org.apache.parquet.tools.command.HeadCommand.execute(HeadCommand.java:87)
    at org.apache.parquet.tools.Main.main(Main.java:223)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
Caused by: java.lang.IllegalArgumentException
    at java.nio.Buffer.limit(Buffer.java:275)
    at org.apache.hadoop.io.compress.lz4.Lz4Decompressor.decompress(Lz4Decompressor.java:232)
    at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:88)
    at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:105)
    at java.io.DataInputStream.readFully(DataInputStream.java:195)
Any advice? We need to write LZ4 files from C++ and read them in Hadoop jobs, but we are still stuck on this problem.
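
For reference, this is roughly how we select LZ4 in the low-level writer API. The file name and schema below are placeholders for illustration, not our real ones:
{code:c++}
// Minimal sketch modeled on the low-level API write example: one required
// INT64 column, with LZ4 requested through WriterProperties.
#include <arrow/io/file.h>
#include <parquet/api/writer.h>

#include <cstdint>
#include <memory>

int main() {
  using parquet::schema::GroupNode;
  using parquet::schema::PrimitiveNode;

  // Output file name is a placeholder.
  std::shared_ptr<arrow::io::FileOutputStream> out_file;
  PARQUET_ASSIGN_OR_THROW(
      out_file, arrow::io::FileOutputStream::Open("example.lz4.parquet"));

  // Placeholder schema: a single required INT64 column.
  parquet::schema::NodeVector fields;
  fields.push_back(PrimitiveNode::Make("c1", parquet::Repetition::REQUIRED,
                                       parquet::Type::INT64));
  auto schema = std::static_pointer_cast<GroupNode>(
      GroupNode::Make("schema", parquet::Repetition::REQUIRED, fields));

  // The codec in question.
  parquet::WriterProperties::Builder builder;
  builder.compression(parquet::Compression::LZ4);

  auto writer =
      parquet::ParquetFileWriter::Open(out_file, schema, builder.build());
  auto* rg = writer->AppendRowGroup();
  auto* col = static_cast<parquet::Int64Writer*>(rg->NextColumn());
  for (int64_t i = 0; i < 1000; ++i) {
    col->WriteBatch(1, nullptr, nullptr, &i);
  }
  writer->Close();
  return 0;
}
{code}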
> [C++] lz4 codec is not compatible with Hadoop Lz4Codec
> ------------------------------------------------------
>
> Key: PARQUET-1878
> URL: https://issues.apache.org/jira/browse/PARQUET-1878
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-cpp
> Reporter: Steve M. Kim
> Assignee: Patrick Pai
> Priority: Major
> Labels: pull-request-available
> Fix For: cpp-1.6.0
>
> Time Spent: 10h 20m
> Remaining Estimate: 0h
>
> As described in HADOOP-12990, the Hadoop {{Lz4Codec}} uses the lz4 block format, and it prepends 8 extra bytes before the compressed data. I believe that the lz4 implementation in parquet-cpp also uses the lz4 block format, but it does not prepend these 8 extra bytes.
>
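> A sketch of the framing difference as I understand it: the Hadoop codec writes a 4-byte big-endian uncompressed length and a 4-byte big-endian compressed length before the raw lz4 block, while parquet-cpp writes only the raw block. The code below is illustrative only; the names and buffer handling are mine, not the actual Hadoop or parquet-cpp code:
> {code:c++}
> // Sketch of the Hadoop-style lz4 framing (illustrative, not the real code).
> // Java's DataOutputStream writes ints big-endian, hence the manual byte order.
> #include <lz4.h>
>
> #include <cstdint>
> #include <string>
> #include <vector>
>
> // Append a 32-bit value in big-endian order.
> static void PutUint32BE(std::vector<uint8_t>* out, uint32_t v) {
>   out->push_back(static_cast<uint8_t>(v >> 24));
>   out->push_back(static_cast<uint8_t>(v >> 16));
>   out->push_back(static_cast<uint8_t>(v >> 8));
>   out->push_back(static_cast<uint8_t>(v));
> }
>
> std::vector<uint8_t> HadoopStyleLz4Compress(const std::string& input) {
>   std::vector<char> block(LZ4_compressBound(static_cast<int>(input.size())));
>   int block_size = LZ4_compress_default(input.data(), block.data(),
>                                         static_cast<int>(input.size()),
>                                         static_cast<int>(block.size()));
>   std::vector<uint8_t> framed;
>   PutUint32BE(&framed, static_cast<uint32_t>(input.size()));  // uncompressed length
>   PutUint32BE(&framed, static_cast<uint32_t>(block_size));    // compressed length
>   framed.insert(framed.end(), block.data(), block.data() + block_size);
>   return framed;  // parquet-cpp emits only the raw block, with no 8-byte prefix
> }
> {code}
>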
> Using Java parquet-mr, I wrote a Parquet file with lz4 compression:
> {code:java}
> $ parquet-tools meta
> /tmp/f4a1c7f57cb1c98c2b9da3b25b16d027df5d2f1cf55adb79374c154fbd79011f
> file:
> file:/tmp/f4a1c7f57cb1c98c2b9da3b25b16d027df5d2f1cf55adb79374c154fbd79011f
> creator: parquet-mr version 1.10.1 (build a89df8f9932b6ef6633d06069e50c9b7970bebd1)
>
> file schema:
> --------------------------------------------------------------------------------
> c1: REQUIRED INT64 R:0 D:0
> c0: REQUIRED BINARY R:0 D:0
> v0: REQUIRED INT64 R:0 D:0
>
> row group 1: RC:5007 TS:28028 OFFSET:4
> --------------------------------------------------------------------------------
> c1: INT64 LZ4 DO:0 FPO:4 SZ:24797/25694/1.04 VC:5007 ENC:DELTA_BINARY_PACKED ST:[min: 1566330126476659000, max: 1571211622650188000, num_nulls: 0]
> c0: BINARY LZ4 DO:0 FPO:24801 SZ:279/260/0.93 VC:5007 ENC:PLAIN,RLE_DICTIONARY ST:[min: 0x7471732F62656566616C6F2F746F6D6163636F2D66782D6D6B74646174612D6C69766573747265616D, max: 0x7471732F62656566616C6F2F746F6D6163636F2D66782D6D6B74646174612D6C69766573747265616D, num_nulls: 0]
> v0: INT64 LZ4 DO:0 FPO:25080 SZ:1348/2074/1.54 VC:5007 ENC:PLAIN,RLE_DICTIONARY ST:[min: 0, max: 9, num_nulls: 0]
> {code}
> When I attempted to read this file with parquet-cpp, I got the following
> error:
> {code:java}
> >>> import pyarrow.parquet as pq
> >>> pq.read_table('/tmp/f4a1c7f57cb1c98c2b9da3b25b16d027df5d2f1cf55adb79374c154fbd79011f')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/skim/miniconda3/envs/arrow/lib/python3.8/site-packages/pyarrow/parquet.py", line 1536, in read_table
>     return pf.read(columns=columns, use_threads=use_threads,
>   File "/home/skim/miniconda3/envs/arrow/lib/python3.8/site-packages/pyarrow/parquet.py", line 1260, in read
>     table = piece.read(columns=columns, use_threads=use_threads,
>   File "/home/skim/miniconda3/envs/arrow/lib/python3.8/site-packages/pyarrow/parquet.py", line 707, in read
>     table = reader.read(**options)
>   File "/home/skim/miniconda3/envs/arrow/lib/python3.8/site-packages/pyarrow/parquet.py", line 336, in read
>     return self.reader.read_all(column_indices=column_indices,
>   File "pyarrow/_parquet.pyx", line 1130, in pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> OSError: IOError: Corrupt Lz4 compressed data.
> {code}
>
> [https://github.com/apache/arrow/issues/3491] reported incompatibility in the
> other direction, using Spark (which uses the Hadoop lz4 codec) to read a
> parquet file that was written with parquet-cpp.
>
> Given that the Hadoop lz4 codec has long been in use, and users have
> accumulated Parquet files that were written with this implementation, I
> propose changing parquet-cpp to match the Hadoop implementation.
>
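> One possible shape for that change, sketched only as an illustration: since the Parquet page header already carries the expected uncompressed size, a reader could accept both layouts by checking the 8-byte prefix first and falling back to a raw lz4 block. This is my sketch, not an actual patch:
> {code:c++}
> // Illustrative dual-format lz4 decompression. expected_out would come from
> // the Parquet page header. Not the real parquet-cpp implementation.
> #include <lz4.h>
>
> #include <cstdint>
> #include <vector>
>
> // Read a 32-bit big-endian value.
> static uint32_t GetUint32BE(const uint8_t* p) {
>   return (static_cast<uint32_t>(p[0]) << 24) |
>          (static_cast<uint32_t>(p[1]) << 16) |
>          (static_cast<uint32_t>(p[2]) << 8) | static_cast<uint32_t>(p[3]);
> }
>
> bool Lz4DecompressEitherFormat(const uint8_t* data, size_t size,
>                                size_t expected_out, std::vector<char>* out) {
>   out->resize(expected_out);
>   if (size >= 8 && GetUint32BE(data) == static_cast<uint32_t>(expected_out) &&
>       GetUint32BE(data + 4) == static_cast<uint32_t>(size - 8)) {
>     // Prefix plausibly describes the buffer: treat as Hadoop framing.
>     return LZ4_decompress_safe(reinterpret_cast<const char*>(data) + 8,
>                                out->data(), static_cast<int>(size - 8),
>                                static_cast<int>(expected_out)) ==
>            static_cast<int>(expected_out);
>   }
>   // Otherwise treat the whole buffer as one raw lz4 block (parquet-cpp style).
>   return LZ4_decompress_safe(reinterpret_cast<const char*>(data), out->data(),
>                              static_cast<int>(size),
>                              static_cast<int>(expected_out)) ==
>          static_cast<int>(expected_out);
> }
> {code}
>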
> See also:
> * https://issues.apache.org/jira/browse/PARQUET-1241?focusedCommentId=16574328#comment-16574328
> * https://issues.apache.org/jira/browse/PARQUET-1241?focusedCommentId=16585288#comment-16585288