marioloko opened a new issue, #2988:
URL: https://github.com/apache/arrow-rs/issues/2988

   The algorithm used to read and write Parquet files with CompressionCodec LZ4 
differs between the C++ and Rust implementations. The C++ implementation uses 
`LZ4Hadoop`, while the Rust implementation uses `LZ4Frame` for the same 
CompressionCodec.
   
   When I try to read a Parquet file generated with the C++ Arrow library using 
LZ4 compression, I get a panic with the following error:
   
   ```
   thread 'arrow::arrow_reader::tests::test_read_lz4_hadoop_compressed' 
panicked at 'called `Result::unwrap()` on an `Err` value: ParquetError("Parquet 
error: underlying IO error: LZ4 error: ERROR_frameType_unknown")', 
parquet/src/arrow/arrow_reader/mod.rs:2434:14
   stack backtrace:
      0:        0x108e96042 - 
std::backtrace_rs::backtrace::libunwind::trace::hc51e76fb8889c4c7
                                  at 
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
      1:        0x108e96042 - 
std::backtrace_rs::backtrace::trace_unsynchronized::h6636ebb4dbdfddda
                                  at 
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
      2:        0x108e96042 - 
std::sys_common::backtrace::_print_fmt::h1fa8f79b68aa8a0a
                                  at 
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/sys_common/backtrace.rs:66:5
      3:        0x108e96042 - 
<std::sys_common::backtrace::_print::DisplayBacktrace as 
core::fmt::Display>::fmt::hed16e04ffb615208
                                  at 
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/sys_common/backtrace.rs:45:22
      4:        0x108eb596a - core::fmt::write::haeb35e341082f6bd
                                  at 
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/core/src/fmt/mod.rs:1209:17
      5:        0x108e9278c - std::io::Write::write_fmt::h369182a997830c5c
                                  at 
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/io/mod.rs:1679:15
      6:        0x108e9780b - 
std::sys_common::backtrace::_print::h02fcfc3ff2a6bde7
                                  at 
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/sys_common/backtrace.rs:48:5
      7:        0x108e9780b - 
std::sys_common::backtrace::print::h5bd14db72e4fe5de
                                  at 
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/sys_common/backtrace.rs:35:9
      8:        0x108e9780b - 
std::panicking::default_hook::{{closure}}::h0dcad13e9fa12765
                                  at 
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/panicking.rs:267:22
      9:        0x108e974a6 - std::panicking::default_hook::h981e72b615f3b097
                                  at 
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/panicking.rs:283:9
     10:        0x108e97e60 - 
std::panicking::rust_panic_with_hook::hc2eb58a19ca82803
                                  at 
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/panicking.rs:669:13
     11:        0x108e97d73 - 
std::panicking::begin_panic_handler::{{closure}}::h54264dcd7b850967
                                  at 
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/panicking.rs:560:13
     12:        0x108e964d8 - 
std::sys_common::backtrace::__rust_end_short_backtrace::hb6f4936878be10a3
                                  at 
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/sys_common/backtrace.rs:138:18
     13:        0x108e97a3d - rust_begin_unwind
                                  at 
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/panicking.rs:556:5
     14:        0x108f1ca43 - core::panicking::panic_fmt::h2903bb0e76f10197
                                  at 
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/core/src/panicking.rs:142:14
     15:        0x108f1cba5 - core::result::unwrap_failed::h18af68091210073e
                                  at 
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/core/src/result.rs:1791:5
     16:        0x107cfbc89 - 
core::result::Result<T,E>::unwrap::hf00cf6a95fafacda
                                  at 
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/core/src/result.rs:1113:23
     17:        0x1077d9c49 - 
parquet::arrow::arrow_reader::tests::test_read_lz4_hadoop_compressed::habb822e6e3580a4a
                                  at 
/Users/adriangc/arrow-rs/parquet/src/arrow/arrow_reader/mod.rs:2431:23
     18:        0x10826f9c9 - 
parquet::arrow::arrow_reader::tests::test_read_lz4_hadoop_compressed::{{closure}}::hcca0ca14496f9a08
                                  at 
/Users/adriangc/arrow-rs/parquet/src/arrow/arrow_reader/mod.rs:2426:5
     19:        0x10815e7b8 - 
core::ops::function::FnOnce::call_once::h725a8a1a5e94f92d
                                  at 
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/core/src/ops/function.rs:251:5
     20:        0x108637382 - 
core::ops::function::FnOnce::call_once::h7dbeda91e6aead6b
                                  at 
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/core/src/ops/function.rs:251:5
     21:        0x108637382 - 
test::__rust_begin_short_backtrace::h2d7dab6490f59483
                                  at 
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/test/src/lib.rs:576:18
     22:        0x108607471 - test::run_test::{{closure}}::hdc386f68383c6886
                                  at 
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/test/src/lib.rs:567:30
     23:        0x108607471 - 
core::ops::function::FnOnce::call_once{{vtable.shim}}::h6e841d27c8e848a9
                                  at 
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/core/src/ops/function.rs:251:5
     24:        0x1086360f5 - <alloc::boxed::Box<F,A> as 
core::ops::function::FnOnce<Args>>::call_once::h6b3722b347c9df52
                                  at 
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/alloc/src/boxed.rs:1938:9
     25:        0x1086360f5 - <core::panic::unwind_safe::AssertUnwindSafe<F> as 
core::ops::function::FnOnce<()>>::call_once::hce401ebfe14ef605
                                  at 
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/core/src/panic/unwind_safe.rs:271:9
     26:        0x1086360f5 - std::panicking::try::do_call::h43148f9663b6c691
                                  at 
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/panicking.rs:464:40
     27:        0x1086360f5 - std::panicking::try::h66bef5e978c7d8ca
                                  at 
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/panicking.rs:428:19
     28:        0x1086360f5 - std::panic::catch_unwind::h06d495074f8fe9c9
                                  at 
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/panic.rs:137:14
     29:        0x1086360f5 - test::run_test_in_process::h77027c7da4dcf222
                                  at 
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/test/src/lib.rs:599:27
     30:        0x1086360f5 - 
test::run_test::run_test_inner::{{closure}}::hd39abb8dc2feae55
                                  at 
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/test/src/lib.rs:493:39
     31:        0x108601580 - 
test::run_test::run_test_inner::{{closure}}::h1c34355cd155b093
                                  at 
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/test/src/lib.rs:520:37
     32:        0x108601580 - 
std::sys_common::backtrace::__rust_begin_short_backtrace::he0b6405111093efc
                                  at 
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/sys_common/backtrace.rs:122:18
     33:        0x1086074ec - 
std::thread::Builder::spawn_unchecked_::{{closure}}::{{closure}}::h2d528071181592f0
                                  at 
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/thread/mod.rs:514:17
     34:        0x1086074ec - <core::panic::unwind_safe::AssertUnwindSafe<F> as 
core::ops::function::FnOnce<()>>::call_once::h5e0af44f5ef1962b
                                  at 
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/core/src/panic/unwind_safe.rs:271:9
     35:        0x1086074ec - std::panicking::try::do_call::h95320ad6aee10be9
                                  at 
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/panicking.rs:464:40
     36:        0x1086074ec - std::panicking::try::hf5ca42fcabdfecee
                                  at 
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/panicking.rs:428:19
     37:        0x1086074ec - std::panic::catch_unwind::hc7cc0dc7e028b195
                                  at 
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/panic.rs:137:14
     38:        0x1086074ec - 
std::thread::Builder::spawn_unchecked_::{{closure}}::h1cf9ab97df7fb558
                                  at 
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/thread/mod.rs:513:30
     39:        0x1086074ec - 
core::ops::function::FnOnce::call_once{{vtable.shim}}::h7f89265263d96ccd
                                  at 
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/core/src/ops/function.rs:251:5
     40:        0x108e9c267 - <alloc::boxed::Box<F,A> as 
core::ops::function::FnOnce<Args>>::call_once::h267a3b8c1ab382aa
                                  at 
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/alloc/src/boxed.rs:1938:9
     41:        0x108e9c267 - <alloc::boxed::Box<F,A> as 
core::ops::function::FnOnce<Args>>::call_once::h6a449eed197d47c4
                                  at 
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/alloc/src/boxed.rs:1938:9
     42:        0x108e9c267 - 
std::sys::unix::thread::Thread::new::thread_start::h5e4754f0c4a6d6ad
                                  at 
/rustc/0ca356586fed56002b10920fd21ddf6fb12de797/library/std/src/sys/unix/thread.rs:108:17
     43:     0x7ff81337f4e1 - __pthread_start
   ```
   
   
   **To Reproduce**
   
   I pushed the failing test, which produces the error above, to the 
`lz4_hadoop_test` branch of my arrow-rs fork. I have not submitted the test to 
this repository because it would fail.
   
   To test it just clone my git fork and execute the test:
   ```
   git clone git@github.com:marioloko/arrow-rs.git
   cd arrow-rs
   git checkout lz4_hadoop_test
   cargo test arrow::arrow_reader::tests::test_read_lz4_hadoop_compressed
   ```
   
   **Expected behavior**
   
   The file should be read successfully and its contents returned. The test in 
the previous section should pass.
   
   **Additional context**
   
   After comparing the C++ and Rust libraries, I found that they use different 
algorithms for the same CompressionCodec.
   
   C++:
   | CompressionCodec | Compression | Codec |
   |------------------|-------------|-------|
   | LZ4 = 5 | LZ4_HADOOP | Lz4HadoopRawCodec (Hadoop + fallback to raw if it fails) |
   | LZ4_RAW = 7 | LZ4 | Lz4Codec (Raw) |
   | | LZ4_FRAME | Lz4FrameCodec |
   
   Rust:
   | CompressionCodec | Compression | Codec |
   |------------------|-------------|-------|
   | LZ4 = 5 | LZ4 | Lz4Codec (Frame) |
   | LZ4_RAW = 7 | LZ4_RAW | Lz4RawCodec (Raw) |
   
   As the tables show, the two libraries use different algorithms for the LZ4 
CompressionCodec, for both compression and decompression. This makes them 
incompatible: a file written with codec LZ4 by one library cannot be read by 
the other.
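
The `ERROR_frameType_unknown` failure is what one would expect if the frame 
decoder does not find the LZ4 frame magic number at the start of the page. The 
sketch below illustrates the difference between the two layouts. It assumes, 
based on my understanding of the Hadoop codec rather than anything confirmed 
in this issue, that Hadoop-framed data begins with two big-endian `u32` values 
(decompressed size, then compressed size) followed by a raw LZ4 block. The 
names `Lz4Layout` and `sniff_lz4_layout` are hypothetical, and the size check 
is only a plausibility heuristic, not a validation.

```rust
/// LZ4 frame magic number 0x184D2204, stored little-endian on disk.
const LZ4_FRAME_MAGIC: [u8; 4] = [0x04, 0x22, 0x4D, 0x18];

/// Hypothetical classification of a compressed buffer's layout.
#[derive(Debug, PartialEq)]
enum Lz4Layout {
    /// Buffer starts with the LZ4 frame magic number.
    Frame,
    /// Buffer starts with a plausible Hadoop-style size prefix (assumed layout).
    HadoopLike { decompressed_len: u32, compressed_len: u32 },
    /// Neither of the above; could be a raw LZ4 block or something else.
    Unknown,
}

/// Guess the layout from the first bytes only. Nothing is decompressed.
fn sniff_lz4_layout(buf: &[u8]) -> Lz4Layout {
    if buf.starts_with(&LZ4_FRAME_MAGIC) {
        return Lz4Layout::Frame;
    }
    if buf.len() >= 8 {
        // Assumed prefix: big-endian decompressed size, then compressed size.
        let decompressed_len = u32::from_be_bytes([buf[0], buf[1], buf[2], buf[3]]);
        let compressed_len = u32::from_be_bytes([buf[4], buf[5], buf[6], buf[7]]);
        // Heuristic: the declared block must fit in the bytes that follow.
        if compressed_len as usize <= buf.len() - 8 {
            return Lz4Layout::HadoopLike { decompressed_len, compressed_len };
        }
    }
    Lz4Layout::Unknown
}
```

A prefix check like this cannot prove which codec wrote the data, since a raw 
LZ4 block may happen to start with bytes that look like a valid size prefix.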
   
   Moreover, changing the algorithm in this library from `Lz4Codec (Frame)` to 
`Lz4HadoopRawCodec` would make files generated by older versions of this 
library unreadable by the new version, which is undesirable for people using 
this library in production.
   
   So I think there are two possible solutions:
   1. Break backward compatibility with previous versions of this library 
by changing the algorithm from `Lz4Codec (Frame)` to `Lz4HadoopRawCodec`, and 
show a panic error pointing to this thread. I am not very happy with this 
solution because users of this library may experience problems after updating 
and would be forced to regenerate their Parquet files with the new version.
   2. Implement a fallback mechanism, that is:
     i. On `CompressionCodec = LZ4`, try to use `Lz4HadoopRawCodec`.
     ii. On error, try to use `Lz4Codec (Frame)`.
     iii. On error, try to use `Lz4RawCodec` (because the C++ library falls 
back to this codec).
     The drawback of option 2 is some overhead from the try-and-fail 
procedure, but it would be compatible with both the C++ library and older 
versions of this library.
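
The fallback order in option 2 could be sketched roughly as follows. This is 
not the parquet crate's API: `Decoder`, `decompress_with_fallback`, and the 
`stub_*` functions are hypothetical names, and the stubs merely recognize a 
one-byte tag so the control flow can be run without an LZ4 dependency. A real 
implementation would plug the actual Hadoop, frame, and raw decoders into the 
same slots.

```rust
/// One decoding attempt: decompressed bytes on success, a message on failure.
type Decoder = fn(&[u8]) -> Result<Vec<u8>, String>;

/// Try each decoder in order and return the name and output of the first one
/// that succeeds. If all fail, return every error message joined together.
fn decompress_with_fallback(
    input: &[u8],
    decoders: &[(&'static str, Decoder)],
) -> Result<(&'static str, Vec<u8>), String> {
    let mut errors = Vec::new();
    for (name, decode) in decoders {
        match decode(input) {
            Ok(out) => return Ok((*name, out)),
            Err(e) => errors.push(format!("{}: {}", name, e)),
        }
    }
    Err(errors.join("; "))
}

/// Placeholder decoder: accepts input whose first byte equals `tag` and
/// returns the remaining bytes. Stands in for a real LZ4 decoder.
fn stub_decode(tag: u8, input: &[u8]) -> Result<Vec<u8>, String> {
    if input.first() == Some(&tag) {
        Ok(input[1..].to_vec())
    } else {
        Err("unrecognized layout".to_string())
    }
}

fn stub_hadoop(input: &[u8]) -> Result<Vec<u8>, String> {
    stub_decode(b'H', input)
}

fn stub_frame(input: &[u8]) -> Result<Vec<u8>, String> {
    stub_decode(b'F', input)
}

fn stub_raw(input: &[u8]) -> Result<Vec<u8>, String> {
    stub_decode(b'R', input)
}

/// Order proposed above: Hadoop first, then frame, then raw.
const LZ4_FALLBACK_ORDER: [(&'static str, Decoder); 3] = [
    ("hadoop", stub_hadoop as Decoder),
    ("frame", stub_frame as Decoder),
    ("raw", stub_raw as Decoder),
];
```

Calling `decompress_with_fallback(page_bytes, &LZ4_FALLBACK_ORDER)` walks the 
list once per page, so the extra cost is paid only when an earlier decoder 
rejects the input.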
   

