[jira] [Resolved] (PARQUET-1438) [C++] corrupted files produced on 32-bit architecture (i686)

2019-02-17 Thread Dmitry Kalinkin (JIRA)


 [ https://issues.apache.org/jira/browse/PARQUET-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitry Kalinkin resolved PARQUET-1438.
--
   Resolution: Fixed
Fix Version/s: 1.12.0

This got resolved after the underlying issue was fixed in arrow-cpp (see ARROW-3477).
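For context on why this class of bug bites only on i686: on ILP32 targets a C++ long is 32 bits wide, while on LP64 it is 64 bits, so size and bit-packing arithmetic written against long can silently truncate on 32-bit builds. The sketch below is only a hypothetical illustration of that failure mode, not the actual arrow-cpp change (see ARROW-3477 for that):

{code:cpp}
// Hypothetical sketch of the ILP32/LP64 pitfall only; NOT the actual
// arrow-cpp change (see ARROW-3477 for that).
// On x86_64 (LP64) sizeof(long) == 8; on i686 (ILP32) sizeof(long) == 4,
// so arithmetic written against `long` can silently truncate on i686.
#include <cstdint>
#include <cstdio>

// Fragile: a buffer-size computation written against `long`.
long fragile_bytes(long num_values, long bit_width) {
  // num_values * bit_width overflows a 32-bit long for large inputs.
  return (num_values * bit_width + 7) / 8;
}

// Portable: the same computation pinned to a fixed-width type.
int64_t portable_bytes(int64_t num_values, int64_t bit_width) {
  return (num_values * bit_width + 7) / 8;
}

int main() {
  long n = 200000000;  // 2e8 values at 32 bits each: ~6.4e9 bits
  std::printf("fragile:  %ld\n", fragile_bytes(n, 32));               // wrong on i686
  std::printf("portable: %lld\n", (long long)portable_bytes(n, 32));  // 800000000
  return 0;
}
{code}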



[jira] [Commented] (PARQUET-1438) [C++] corrupted files produced on 32-bit architecture (i686)

2018-10-09 Thread Dmitry Kalinkin (JIRA)


[ https://issues.apache.org/jira/browse/PARQUET-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16644052#comment-16644052 ]

Dmitry Kalinkin commented on PARQUET-1438:
--

Opened ARROW-3477



[jira] [Commented] (PARQUET-1438) [C++] corrupted files produced on 32-bit architecture (i686)

2018-10-09 Thread Dmitry Kalinkin (JIRA)


[ https://issues.apache.org/jira/browse/PARQUET-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16643970#comment-16643970 ]

Dmitry Kalinkin commented on PARQUET-1438:
--

Perhaps this is an arrow issue then?



[jira] [Commented] (PARQUET-1438) [C++] corrupted files produced on 32-bit architecture (i686)

2018-10-09 Thread Dmitry Kalinkin (JIRA)


[ https://issues.apache.org/jira/browse/PARQUET-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16643963#comment-16643963 ]

Dmitry Kalinkin commented on PARQUET-1438:
--

Running the test suite was a great suggestion!

I've tested arrow-cpp 0.10.0, parquet-cpp 1.5.0, and arrow-cpp 0.11.0 and found that 
all tests pass on x86_64. As for the tests on i686: *1* test fails on arrow-cpp 
0.10.0, there are *0* failures for parquet-cpp 1.5.0 (against arrow-cpp 0.10.0), and 
arrow-cpp 0.11.0 has *11* failing tests. I'm attaching the log files to the ticket.



[jira] [Updated] (PARQUET-1438) [C++] corrupted files produced on 32-bit architecture (i686)

2018-10-09 Thread Dmitry Kalinkin (JIRA)


 [ https://issues.apache.org/jira/browse/PARQUET-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitry Kalinkin updated PARQUET-1438:
-
Attachment: arrow_0.10.0_i686_test_fail.log



[jira] [Updated] (PARQUET-1438) [C++] corrupted files produced on 32-bit architecture (i686)

2018-10-09 Thread Dmitry Kalinkin (JIRA)


 [ https://issues.apache.org/jira/browse/PARQUET-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitry Kalinkin updated PARQUET-1438:
-
Attachment: parquet_1.5.0_i686_test_success.log



[jira] [Updated] (PARQUET-1438) [C++] corrupted files produced on 32-bit architecture (i686)

2018-10-09 Thread Dmitry Kalinkin (JIRA)


 [ https://issues.apache.org/jira/browse/PARQUET-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitry Kalinkin updated PARQUET-1438:
-
Attachment: arrow_0.11.0_i686_test_fail.log



[jira] [Commented] (PARQUET-1438) [C++] corrupted files produced on 32-bit architecture (i686)

2018-10-09 Thread Dmitry Kalinkin (JIRA)


[ https://issues.apache.org/jira/browse/PARQUET-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16643608#comment-16643608 ]

Dmitry Kalinkin commented on PARQUET-1438:
--

Thank you for providing the diff. I looked, and it doesn't seem very drastic to me 
either.

I don't think conflicting libraries are the problem: I do all of my builds in a 
sandbox, and writing the files does succeed, with the resulting files being grossly 
different for 0.11.0 on 32 bits.

Unfortunately, 3545186d6, 3545186d6~ and 9b4cd9c03 all reproduce the bug.
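For quantifying "grossly different", a throwaway comparison along these lines (file names refer to this ticket's attachments) prints both sizes and the first offset at which the two files diverge:

{code:cpp}
// Throwaway comparison: print both file sizes and the first offset at which
// the two attachments diverge. File names refer to this ticket's attachments.
#include <cstdio>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

static std::vector<char> slurp(const std::string& path) {
  std::ifstream f(path, std::ios::binary);
  return std::vector<char>(std::istreambuf_iterator<char>(f),
                           std::istreambuf_iterator<char>());
}

int main() {
  const std::vector<char> a = slurp("32.parquet");
  const std::vector<char> b = slurp("64.parquet");
  std::printf("32.parquet: %zu bytes, 64.parquet: %zu bytes\n", a.size(), b.size());
  const size_t n = a.size() < b.size() ? a.size() : b.size();
  for (size_t i = 0; i < n; ++i) {
    if (a[i] != b[i]) {
      std::printf("first divergence at offset 0x%zx\n", i);
      return 0;
    }
  }
  std::printf("the common prefix of %zu bytes is identical\n", n);
  return 0;
}
{code}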



[jira] [Commented] (PARQUET-1438) [C++] corrupted files produced on 32-bit architecture (i686)

2018-10-09 Thread Dmitry Kalinkin (JIRA)


[ https://issues.apache.org/jira/browse/PARQUET-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16643462#comment-16643462 ]

Dmitry Kalinkin commented on PARQUET-1438:
--

Yes. The setup with arrow-cpp 0.10.0 and parquet-cpp 1.5.0 uses the tarball 
from 
https://github.com/apache/parquet-cpp/archive/apache-parquet-cpp-1.5.0.tar.gz



[jira] [Commented] (PARQUET-1438) [C++] corrupted files produced on 32-bit architecture (i686)

2018-10-09 Thread Dmitry Kalinkin (JIRA)


[ https://issues.apache.org/jira/browse/PARQUET-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16643415#comment-16643415 ]

Dmitry Kalinkin commented on PARQUET-1438:
--

I have now checked files that were produced with the previous version, parquet-cpp 
1.5.0, on 32 bit, and they mostly match what I get on 64 bit with arrow-cpp 0.11.0. 
I also tried to bisect the arrow-cpp repository, but could not find any good commit: 
they all either have the bug or don't build. I guess I could try to bisect the 
parquet-cpp repository against arrow-cpp 0.10.0.

I was hoping someone with knowledge of the format could take a look at the files and 
see which part of the structure blows up. It seems like it is the schema that blows 
up. Does that mean I need to look at the Thrift-related code?
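One way to check whether it really is the footer (the Thrift-serialized FileMetaData) that blows up, without touching any page data, is to dump the high-level footer fields with the parquet-cpp reader and diff the output for 32.parquet and 64.parquet. A sketch along these lines should work (accessor names follow parquet-cpp 1.5.0-era headers; exact signatures may differ between versions):

{code:cpp}
// Dump high-level footer (Thrift FileMetaData) fields without reading any
// page data. Accessor names follow parquet-cpp 1.5.0-era headers and may
// differ between versions.
#include <iostream>
#include <memory>

#include "parquet/api/reader.h"

int main(int argc, char** argv) {
  const char* path = argc > 1 ? argv[1] : "32.parquet";
  std::unique_ptr<parquet::ParquetFileReader> reader =
      parquet::ParquetFileReader::OpenFile(path);
  std::shared_ptr<parquet::FileMetaData> md = reader->metadata();
  std::cout << path << ": " << md->num_rows() << " rows, "
            << md->num_columns() << " columns, "
            << md->num_row_groups() << " row groups, created_by: "
            << md->created_by() << "\n";
  for (int i = 0; i < md->num_row_groups(); ++i) {
    std::unique_ptr<parquet::RowGroupMetaData> rg = md->RowGroup(i);
    std::cout << "  row group " << i << ": " << rg->num_rows() << " rows, "
              << rg->total_byte_size() << " bytes\n";
  }
  return 0;
}
{code}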



[jira] [Comment Edited] (PARQUET-1438) [C++] corrupted files produced on 32-bit architecture (i686)

2018-10-09 Thread Dmitry Kalinkin (JIRA)


[ https://issues.apache.org/jira/browse/PARQUET-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16643415#comment-16643415 ]

Dmitry Kalinkin edited comment on PARQUET-1438 at 10/9/18 1:20 PM:
---

I have now checked files that were produced with the previous version, parquet-cpp 
1.5.0, on 32 bit, and they mostly match what I get on 64 bit with arrow-cpp 0.11.0. 
I also tried to bisect the arrow-cpp repository, but could not find any good commit: 
they all either have the bug or don't build. I guess I could try to bisect the 
parquet-cpp repository against arrow-cpp 0.10.0.

I was hoping someone with knowledge of the format could take a look at the files and 
see which part of the structure blows up. It seems like it is the schema that blows 
up. Does that mean I need to look at the Thrift-related code?





[jira] [Created] (PARQUET-1438) [C++] corrupted files produced on 32-bit architecture (i686)

2018-10-08 Thread Dmitry Kalinkin (JIRA)
Dmitry Kalinkin created PARQUET-1438:


 Summary: [C++] corrupted files produced on 32-bit architecture (i686)
 Key: PARQUET-1438
 URL: https://issues.apache.org/jira/browse/PARQUET-1438
 Project: Parquet
  Issue Type: Bug
Reporter: Dmitry Kalinkin
 Attachments: 32.parquet, 64.parquet

I'm using the C++ API to convert some data to parquet files. I've noticed a 
regression when upgrading from arrow-cpp 0.10.0 + parquet-cpp 1.5.0 to arrow-cpp 
0.11.0. The issue is that I can write parquet files without an error, but when I try 
to read those using pyarrow I get a segfault:

{noformat}
#0  0x7fffd17c7f0f in int arrow::util::RleDecoder::GetBatchWithDictSpaced<float>(float const*, float*, int, int, unsigned char const*, long) ()
   from /nix/store/k6sy2ncjnkn5wnb2dq9m5f0qh446kjhg-arrow-cpp-0.11.0/lib/libparquet.so.11
#1  0x7fffd17c8025 in parquet::DictionaryDecoder<parquet::DataType<(parquet::Type::type)4> >::DecodeSpaced(float*, int, int, unsigned char const*, long) ()
   from /nix/store/k6sy2ncjnkn5wnb2dq9m5f0qh446kjhg-arrow-cpp-0.11.0/lib/libparquet.so.11
#2  0x7fffd17bcf0f in parquet::internal::TypedRecordReader<parquet::DataType<(parquet::Type::type)4> >::ReadRecordData(long) ()
   from /nix/store/k6sy2ncjnkn5wnb2dq9m5f0qh446kjhg-arrow-cpp-0.11.0/lib/libparquet.so.11
#3  0x7fffd17bfbea in parquet::internal::TypedRecordReader<parquet::DataType<(parquet::Type::type)4> >::ReadRecords(long) ()
   from /nix/store/k6sy2ncjnkn5wnb2dq9m5f0qh446kjhg-arrow-cpp-0.11.0/lib/libparquet.so.11
#4  0x7fffd179d2f7 in parquet::arrow::PrimitiveImpl::NextBatch(long, std::shared_ptr<arrow::Array>*) ()
   from /nix/store/k6sy2ncjnkn5wnb2dq9m5f0qh446kjhg-arrow-cpp-0.11.0/lib/libparquet.so.11
#5  0x7fffd1797162 in parquet::arrow::ColumnReader::NextBatch(long, std::shared_ptr<arrow::Array>*) ()
   from /nix/store/k6sy2ncjnkn5wnb2dq9m5f0qh446kjhg-arrow-cpp-0.11.0/lib/libparquet.so.11
#6  0x7fffd179a6e5 in parquet::arrow::FileReader::Impl::ReadSchemaField(int, std::vector<int, std::allocator<int> > const&, std::shared_ptr<arrow::Array>*) ()
   from /nix/store/k6sy2ncjnkn5wnb2dq9m5f0qh446kjhg-arrow-cpp-0.11.0/lib/libparquet.so.11
#7  0x7fffd179aaad in parquet::arrow::FileReader::Impl::ReadTable(std::vector<int, std::allocator<int> > const&, std::shared_ptr<arrow::Table>*)::{lambda(int)#1}::operator()(int) const ()
   from /nix/store/k6sy2ncjnkn5wnb2dq9m5f0qh446kjhg-arrow-cpp-0.11.0/lib/libparquet.so.11
{noformat}

I have not been able to dig to the bottom of the issue, but it seems like the 
problem reproduces only when I run 32-bit binaries. After I learned that, I found 
that 32-bit and 64-bit code produces very different parquet files for the same data. 
The sizes of the structures are clearly different if I look at their hexdumps. I'm 
attaching example files: reading "32.parquet" (produced using i686 binaries) will 
cause a segfault on macOS and Linux, while "64.parquet" reads just fine.


