marioloko opened a new issue, #2988:
URL: https://github.com/apache/arrow-rs/issues/2988
The algorithm used to read and write Parquet files with CompressionCodec LZ4
differs between the C++ and Rust implementations. The C++ implementation uses
the `LZ4Hadoop` algorithm, while the Rust implementation uses `LZ4Frame` for
the same CompressionCodec.
When trying to read a Parquet file generated with the C++ Arrow library and
LZ4 compression, I get a panic due to the following error:
```
thread 'arrow::arrow_reader::tests::test_read_lz4_hadoop_compressed'
panicked at 'called `Result::unwrap()` on an `Err` value: ParquetError("Parquet
error: underlying IO error: LZ4 error: ERROR_frameType_unknown")',
parquet/src/arrow/arrow_reader/mod.rs:2434:14
stack backtrace:
0: 0x108e96042 -
std::backtrace_rs::backtrace::libunwind::trace::hc51e76fb8889c4c7
at
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
1: 0x108e96042 -
std::backtrace_rs::backtrace::trace_unsynchronized::h6636ebb4dbdfddda
at
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
2: 0x108e96042 -
std::sys_common::backtrace::_print_fmt::h1fa8f79b68aa8a0a
at
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/sys_common/backtrace.rs:66:5
3: 0x108e96042 -
<std::sys_common::backtrace::_print::DisplayBacktrace as
core::fmt::Display>::fmt::hed16e04ffb615208
at
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/sys_common/backtrace.rs:45:22
4: 0x108eb596a - core::fmt::write::haeb35e341082f6bd
at
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/core/src/fmt/mod.rs:1209:17
5: 0x108e9278c - std::io::Write::write_fmt::h369182a997830c5c
at
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/io/mod.rs:1679:15
6: 0x108e9780b -
std::sys_common::backtrace::_print::h02fcfc3ff2a6bde7
at
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/sys_common/backtrace.rs:48:5
7: 0x108e9780b -
std::sys_common::backtrace::print::h5bd14db72e4fe5de
at
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/sys_common/backtrace.rs:35:9
8: 0x108e9780b -
std::panicking::default_hook::{{closure}}::h0dcad13e9fa12765
at
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/panicking.rs:267:22
9: 0x108e974a6 - std::panicking::default_hook::h981e72b615f3b097
at
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/panicking.rs:283:9
10: 0x108e97e60 -
std::panicking::rust_panic_with_hook::hc2eb58a19ca82803
at
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/panicking.rs:669:13
11: 0x108e97d73 -
std::panicking::begin_panic_handler::{{closure}}::h54264dcd7b850967
at
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/panicking.rs:560:13
12: 0x108e964d8 -
std::sys_common::backtrace::__rust_end_short_backtrace::hb6f4936878be10a3
at
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/sys_common/backtrace.rs:138:18
13: 0x108e97a3d - rust_begin_unwind
at
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/panicking.rs:556:5
14: 0x108f1ca43 - core::panicking::panic_fmt::h2903bb0e76f10197
at
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/core/src/panicking.rs:142:14
15: 0x108f1cba5 - core::result::unwrap_failed::h18af68091210073e
at
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/core/src/result.rs:1791:5
16: 0x107cfbc89 -
core::result::Result<T,E>::unwrap::hf00cf6a95fafacda
at
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/core/src/result.rs:1113:23
17: 0x1077d9c49 -
parquet::arrow::arrow_reader::tests::test_read_lz4_hadoop_compressed::habb822e6e3580a4a
at
/Users/adriangc/arrow-rs/parquet/src/arrow/arrow_reader/mod.rs:2431:23
18: 0x10826f9c9 -
parquet::arrow::arrow_reader::tests::test_read_lz4_hadoop_compressed::{{closure}}::hcca0ca14496f9a08
at
/Users/adriangc/arrow-rs/parquet/src/arrow/arrow_reader/mod.rs:2426:5
19: 0x10815e7b8 -
core::ops::function::FnOnce::call_once::h725a8a1a5e94f92d
at
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/core/src/ops/function.rs:251:5
20: 0x108637382 -
core::ops::function::FnOnce::call_once::h7dbeda91e6aead6b
at
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/core/src/ops/function.rs:251:5
21: 0x108637382 -
test::__rust_begin_short_backtrace::h2d7dab6490f59483
at
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/test/src/lib.rs:576:18
22: 0x108607471 - test::run_test::{{closure}}::hdc386f68383c6886
at
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/test/src/lib.rs:567:30
23: 0x108607471 -
core::ops::function::FnOnce::call_once{{vtable.shim}}::h6e841d27c8e848a9
at
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/core/src/ops/function.rs:251:5
24: 0x1086360f5 - <alloc::boxed::Box<F,A> as
core::ops::function::FnOnce<Args>>::call_once::h6b3722b347c9df52
at
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/alloc/src/boxed.rs:1938:9
25: 0x1086360f5 - <core::panic::unwind_safe::AssertUnwindSafe<F> as
core::ops::function::FnOnce<()>>::call_once::hce401ebfe14ef605
at
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/core/src/panic/unwind_safe.rs:271:9
26: 0x1086360f5 - std::panicking::try::do_call::h43148f9663b6c691
at
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/panicking.rs:464:40
27: 0x1086360f5 - std::panicking::try::h66bef5e978c7d8ca
at
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/panicking.rs:428:19
28: 0x1086360f5 - std::panic::catch_unwind::h06d495074f8fe9c9
at
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/panic.rs:137:14
29: 0x1086360f5 - test::run_test_in_process::h77027c7da4dcf222
at
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/test/src/lib.rs:599:27
30: 0x1086360f5 -
test::run_test::run_test_inner::{{closure}}::hd39abb8dc2feae55
at
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/test/src/lib.rs:493:39
31: 0x108601580 -
test::run_test::run_test_inner::{{closure}}::h1c34355cd155b093
at
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/test/src/lib.rs:520:37
32: 0x108601580 -
std::sys_common::backtrace::__rust_begin_short_backtrace::he0b6405111093efc
at
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/sys_common/backtrace.rs:122:18
33: 0x1086074ec -
std::thread::Builder::spawn_unchecked_::{{closure}}::{{closure}}::h2d528071181592f0
at
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/thread/mod.rs:514:17
34: 0x1086074ec - <core::panic::unwind_safe::AssertUnwindSafe<F> as
core::ops::function::FnOnce<()>>::call_once::h5e0af44f5ef1962b
at
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/core/src/panic/unwind_safe.rs:271:9
35: 0x1086074ec - std::panicking::try::do_call::h95320ad6aee10be9
at
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/panicking.rs:464:40
36: 0x1086074ec - std::panicking::try::hf5ca42fcabdfecee
at
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/panicking.rs:428:19
37: 0x1086074ec - std::panic::catch_unwind::hc7cc0dc7e028b195
at
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/panic.rs:137:14
38: 0x1086074ec -
std::thread::Builder::spawn_unchecked_::{{closure}}::h1cf9ab97df7fb558
at
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/thread/mod.rs:513:30
39: 0x1086074ec -
core::ops::function::FnOnce::call_once{{vtable.shim}}::h7f89265263d96ccd
at
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/core/src/ops/function.rs:251:5
40: 0x108e9c267 - <alloc::boxed::Box<F,A> as
core::ops::function::FnOnce<Args>>::call_once::h267a3b8c1ab382aa
at
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/alloc/src/boxed.rs:1938:9
41: 0x108e9c267 - <alloc::boxed::Box<F,A> as
core::ops::function::FnOnce<Args>>::call_once::h6a449eed197d47c4
at
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/alloc/src/boxed.rs:1938:9
42: 0x108e9c267 -
std::sys::unix::thread::Thread::new::thread_start::h5e4754f0c4a6d6ad
at
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/sys/unix/thread.rs:108:17
43: 0x7ff81337f4e1 - __pthread_start
```
**To Reproduce**
I uploaded the failing test, which produces the error above, to the
`lz4_hadoop_test` branch of my arrow-rs fork. I did not merge the test into
this repository because it would fail.
To reproduce, clone my fork and run the test:
```
git clone [email protected]:marioloko/arrow-rs.git
cd arrow-rs
git checkout lz4_hadoop_test
cargo test arrow::arrow_reader::tests::test_read_lz4_hadoop_compressed
```
**Expected behavior**
The expected behavior would be to read the file and show its contents. The
test in the previous section should succeed in reading the data.
**Additional context**
After some digging into the problem, comparing the C++ and Rust libraries, I
found that they use different algorithms for the same CompressionCodec.
C++:
| CompressionCodec | Compression | Codec |
|------------------|-------------|-------|
| LZ4 = 5 | LZ4_HADOOP | Lz4HadoopRawCodec (Hadoop, with fallback to raw if it fails) |
| LZ4_RAW = 7 | LZ4 | Lz4Codec (Raw) |
| | LZ4_FRAME | Lz4FrameCodec |
Rust:
| CompressionCodec | Compression | Codec |
|------------------|-------------|-------|
| LZ4 = 5 | LZ4 | Lz4Codec (Frame) |
| LZ4_RAW = 7 | LZ4_RAW | Lz4RawCodec (Raw) |
As we can observe, the two libraries use different algorithms for the LZ4
CompressionCodec, for both compression and decompression. This makes the
libraries incompatible: files generated with codec LZ4 by either library
cannot be read by the other.
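For context on why the decoders reject each other's output: the Hadoop LZ4
format prefixes each compressed block with two big-endian 32-bit lengths (the
uncompressed size, then the compressed size) followed by raw LZ4 data, whereas
the LZ4 frame format begins with a magic number, so a frame decoder fed
Hadoop-framed bytes fails immediately (hence `ERROR_frameType_unknown`). A
minimal sketch of parsing the Hadoop-style block header; the helper name is
hypothetical and not arrow-rs API:

```rust
/// Hypothetical helper (not part of arrow-rs): read the two big-endian
/// u32 length fields that the Hadoop LZ4 codec writes before each block.
fn parse_hadoop_block_header(buf: &[u8]) -> Option<(u32, u32)> {
    let decompressed = u32::from_be_bytes(buf.get(0..4)?.try_into().ok()?);
    let compressed = u32::from_be_bytes(buf.get(4..8)?.try_into().ok()?);
    Some((decompressed, compressed))
}

fn main() {
    // A header claiming 100 decompressed bytes stored in 64 compressed bytes.
    let header = [0u8, 0, 0, 100, 0, 0, 0, 64];
    assert_eq!(parse_hadoop_block_header(&header), Some((100, 64)));
    // Too short to contain a header at all.
    assert_eq!(parse_hadoop_block_header(&[0, 1]), None);
}
```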
Moreover, changing the algorithm in this library from `Lz4Codec (Frame)` to
`Lz4HadoopRawCodec` would make files generated by older versions of this
library unreadable with the new version, which is not desirable for people
using this library in production.
So I think there are two possible solutions:
1. Break backward compatibility with previous versions of this library by
changing the algorithm from `Lz4Codec (Frame)` to `Lz4HadoopRawCodec`, and
show a panic error pointing to this thread. I am not very happy with this
solution because users of this library may experience problems after updating
and would be forced to regenerate their Parquet files with the new version.
2. Implement a fallback mechanism, that is:
   i. On `CompressionCodec = LZ4`, try `Lz4HadoopRawCodec`.
   ii. On error, try `Lz4Codec (Frame)`.
   iii. On error, try `Lz4RawCodec` (because the C++ library falls back to
this codec).
The drawback of option 2 is a bit of overhead due to the try-and-fail
procedure, but it would be compatible with both the C++ library and older
versions of this library.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]