[jira] [Created] (PARQUET-1441) SchemaParseException: Can't redefine: list in AvroIndexedRecordConverter
Michael Heuer created PARQUET-1441:
--------------------------------------

             Summary: SchemaParseException: Can't redefine: list in AvroIndexedRecordConverter
                 Key: PARQUET-1441
                 URL: https://issues.apache.org/jira/browse/PARQUET-1441
             Project: Parquet
          Issue Type: Bug
          Components: parquet-avro
            Reporter: Michael Heuer

The following unit test added to TestAvroSchemaConverter fails:

{code:java}
@Test
public void testConvertedSchemaToStringCantRedefineList() throws Exception {
  String parquet = "message spark_schema {\n" +
      "  optional group annotation {\n" +
      "    optional group transcriptEffects (LIST) {\n" +
      "      repeated group list {\n" +
      "        optional group element {\n" +
      "          optional group effects (LIST) {\n" +
      "            repeated group list {\n" +
      "              optional binary element (UTF8);\n" +
      "            }\n" +
      "          }\n" +
      "        }\n" +
      "      }\n" +
      "    }\n" +
      "  }\n" +
      "}\n";
  Configuration conf = new Configuration(false);
  AvroSchemaConverter avroSchemaConverter = new AvroSchemaConverter(conf);
  Schema schema = avroSchemaConverter.convert(MessageTypeParser.parseMessageType(parquet));
  schema.toString();
}
{code}

while this one succeeds

{code:java}
@Test
public void testConvertedSchemaToStringCantRedefineList() throws Exception {
  String parquet = "message spark_schema {\n" +
      "  optional group annotation {\n" +
      "    optional group transcriptEffects (LIST) {\n" +
      "      repeated group list {\n" +
      "        optional group element {\n" +
      "          optional group effects (LIST) {\n" +
      "            repeated group list {\n" +
      "              optional binary element (UTF8);\n" +
      "            }\n" +
      "          }\n" +
      "        }\n" +
      "      }\n" +
      "    }\n" +
      "  }\n" +
      "}\n";
  Configuration conf = new Configuration(false);
  conf.setBoolean("parquet.avro.add-list-element-records", false);
  AvroSchemaConverter avroSchemaConverter = new AvroSchemaConverter(conf);
  Schema schema = avroSchemaConverter.convert(MessageTypeParser.parseMessageType(parquet));
  schema.toString();
}
{code}

I don't see a way to influence the code path in AvroIndexedRecordConverter to respect this configuration, resulting in the following stack trace downstream:

{noformat}
Cause: org.apache.avro.SchemaParseException: Can't redefine: list
  at org.apache.avro.Schema$Names.put(Schema.java:1128)
  at org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:562)
  at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:690)
  at org.apache.avro.Schema$ArraySchema.toJson(Schema.java:805)
  at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
  at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)
  at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)
  at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
  at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)
  at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)
  at org.apache.avro.Schema.toString(Schema.java:324)
  at org.apache.avro.SchemaCompatibility.checkReaderWriterCompatibility(SchemaCompatibility.java:68)
  at org.apache.parquet.avro.AvroRecordConverter.isElementType(AvroRecordConverter.java:866)
  at org.apache.parquet.avro.AvroIndexedRecordConverter$AvroArrayConverter.<init>(AvroIndexedRecordConverter.java:333)
  at org.apache.parquet.avro.AvroIndexedRecordConverter.newConverter(AvroIndexedRecordConverter.java:172)
  at org.apache.parquet.avro.AvroIndexedRecordConverter.<init>(AvroIndexedRecordConverter.java:94)
  at org.apache.parquet.avro.AvroIndexedRecordConverter.newConverter(AvroIndexedRecordConverter.java:168)
  at org.apache.parquet.avro.AvroIndexedRecordConverter.<init>(AvroIndexedRecordConverter.java:94)
  at org.apache.parquet.avro.AvroIndexedRecordConverter.<init>(AvroIndexedRecordConverter.java:66)
  at org.apache.parquet.avro.AvroCompatRecordMaterializer.<init>(AvroCompatRecordMaterializer.java:34)
  at org.apache.parquet.avro.AvroReadSupport.newCompatMaterializer(AvroReadSupport.java:144)
  at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:136)
  at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:204)
  at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:182)
  at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
  ...
{noformat}

See also downstream issues:
https://issues.apache.org/jira/browse/SPARK-25588
https://github.com/bigdatagenomics/adam/issues/2058
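(For context: a minimal standalone sketch, not part of the ticket, of why the converted schema is rejected. Both nested LIST levels become Avro records named "list", and Avro refuses to define the same full name twice; the class name below is illustrative.)

{code:java}
import org.apache.avro.Schema;
import org.apache.avro.SchemaParseException;

public class RedefineListDemo {
  public static void main(String[] args) {
    // A record named "list" nesting another record also named "list",
    // mirroring what the converter produces for nested LIST groups
    // when list element records are enabled.
    String json =
        "{\"type\": \"record\", \"name\": \"list\", \"fields\": ["
            + "{\"name\": \"element\", \"type\": "
            + "{\"type\": \"record\", \"name\": \"list\", \"fields\": ["
            + "{\"name\": \"element\", \"type\": \"string\"}]}}]}";
    try {
      new Schema.Parser().parse(json);
    } catch (SchemaParseException e) {
      System.out.println(e.getMessage()); // Can't redefine: list
    }
  }
}
{code}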
[jira] [Commented] (PARQUET-1438) [C++] corrupted files produced on 32-bit architecture (i686)
[ https://issues.apache.org/jira/browse/PARQUET-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16644052#comment-16644052 ]

Dmitry Kalinkin commented on PARQUET-1438:
------------------------------------------

Opened ARROW-3477

> [C++] corrupted files produced on 32-bit architecture (i686)
> ------------------------------------------------------------
>
>                 Key: PARQUET-1438
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1438
>             Project: Parquet
>          Issue Type: Bug
>            Reporter: Dmitry Kalinkin
>            Priority: Major
>         Attachments: 32.parquet, 64.parquet, arrow_0.10.0_i686_test_fail.log, arrow_0.11.0_i686_test_fail.log, parquet_1.5.0_i686_test_success.log
>
> I'm using the C++ API to convert some data to Parquet files. I've noticed a regression when upgrading from arrow-cpp 0.10.0 + parquet-cpp 1.5.0 to arrow-cpp 0.11.0. The issue is that I can write Parquet files without an error, but when I try to read those using pyarrow I get a segfault:
> {noformat}
> #0 0x7fffd17c7f0f in int arrow::util::RleDecoder::GetBatchWithDictSpaced(float const*, float*, int, int, unsigned char const*, long) ()
>    from /nix/store/k6sy2ncjnkn5wnb2dq9m5f0qh446kjhg-arrow-cpp-0.11.0/lib/libparquet.so.11
> #1 0x7fffd17c8025 in parquet::DictionaryDecoder >::DecodeSpaced(float*, int, int, unsigned char const*, long) ()
>    from /nix/store/k6sy2ncjnkn5wnb2dq9m5f0qh446kjhg-arrow-cpp-0.11.0/lib/libparquet.so.11
> #2 0x7fffd17bcf0f in parquet::internal::TypedRecordReader >::ReadRecordData(long) ()
>    from /nix/store/k6sy2ncjnkn5wnb2dq9m5f0qh446kjhg-arrow-cpp-0.11.0/lib/libparquet.so.11
> #3 0x7fffd17bfbea in parquet::internal::TypedRecordReader >::ReadRecords(long) ()
>    from /nix/store/k6sy2ncjnkn5wnb2dq9m5f0qh446kjhg-arrow-cpp-0.11.0/lib/libparquet.so.11
> #4 0x7fffd179d2f7 in parquet::arrow::PrimitiveImpl::NextBatch(long, std::shared_ptr*) ()
>    from /nix/store/k6sy2ncjnkn5wnb2dq9m5f0qh446kjhg-arrow-cpp-0.11.0/lib/libparquet.so.11
> #5 0x7fffd1797162 in parquet::arrow::ColumnReader::NextBatch(long, std::shared_ptr*) ()
>    from /nix/store/k6sy2ncjnkn5wnb2dq9m5f0qh446kjhg-arrow-cpp-0.11.0/lib/libparquet.so.11
> #6 0x7fffd179a6e5 in parquet::arrow::FileReader::Impl::ReadSchemaField(int, std::vector std::allocator > const&, std::shared_ptr*) ()
>    from /nix/store/k6sy2ncjnkn5wnb2dq9m5f0qh446kjhg-arrow-cpp-0.11.0/lib/libparquet.so.11
> #7 0x7fffd179aaad in parquet::arrow::FileReader::Impl::ReadTable(std::vector std::allocator > const&, std::shared_ptr*)::{lambda(int)#1}::operator()(int) const ()
>    from /nix/store/k6sy2ncjnkn5wnb2dq9m5f0qh446kjhg-arrow-cpp-0.11.0/lib/libparquet.so.11
> {noformat}
> I have not been able to dig to the bottom of the issue, but it seems like the problem reproduces only when I run 32-bit binaries. After I learned that, I found that 32-bit and 64-bit builds produce very different Parquet files for the same data. The sizes of the structures are clearly different if I look at their hexdumps. I'm attaching the example files. Reading "32.parquet" (produced using i686 binaries) will cause a segfault on macOS and Linux; "64.parquet" will read just fine.
[jira] [Commented] (PARQUET-1420) [C++] Thrift-generated symbols not exported in DLL
[ https://issues.apache.org/jira/browse/PARQUET-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16644012#comment-16644012 ]

Antoine Pitrou commented on PARQUET-1420:
-----------------------------------------

(note you can see the work in progress here: https://github.com/apache/arrow/compare/master...pitrou:ARROW-3442-tests-linking-shared)

> [C++] Thrift-generated symbols not exported in DLL
> --------------------------------------------------
>
>                 Key: PARQUET-1420
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1420
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>            Reporter: Antoine Pitrou
>            Priority: Major
>              Labels: windows
>
> Thrift-generated symbols don't have any {{PARQUET_EXPORT}}-like annotation, so they are not reachable from the Parquet DLL. In turn this makes it impossible to link the Parquet unit tests with the Parquet DLL (instead of the Parquet static lib). I'm not sure whether it can impact other applications.
> Example linking error:
> {code}
> column_writer-test.cc.obj : error LNK2019: unresolved external symbol "public: virtual unsigned int __cdecl parquet::format::Statistics::read(class apache::thrift::protocol::TProtocol *)" (?read@Statistics@format@parquet@@UEAAIPEAVTProtocol@protocol@thrift@apache@@@Z) referenced in function "[thunk]:public: virtual unsigned int __cdecl parquet::format::Statistics::read`vtordisp{4294967292,0}' (class apache::thrift::protocol::TProtocol *)" (?read@Statistics@format@parquet@@$4PPPM@A@EAAIPEAVTProtocol@protocol@thrift@apache@@@Z)
> {code}
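(For background: a hedged sketch of the symbol-visibility pattern the ticket refers to. The {{PARQUET_EXPORT}} name comes from the ticket; the {{PARQUET_EXPORTING}} guard and the example class are illustrative assumptions, not parquet-cpp's actual headers.)

{code}
// Illustrative only: the dllexport/dllimport annotation that the
// Thrift-generated classes lack. PARQUET_EXPORTING is a hypothetical
// define set while building the DLL itself.
#if defined(_WIN32)
  #ifdef PARQUET_EXPORTING
    #define PARQUET_EXPORT __declspec(dllexport)
  #else
    #define PARQUET_EXPORT __declspec(dllimport)
  #endif
#else
  #define PARQUET_EXPORT __attribute__((visibility("default")))
#endif

// A class annotated this way is reachable through the DLL; the generated
// parquet::format::* classes carry no such annotation, hence LNK2019 when
// the unit tests link against the DLL instead of the static lib.
class PARQUET_EXPORT Example {
 public:
  virtual unsigned int read();
};
{code}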
[jira] [Commented] (PARQUET-1438) [C++] corrupted files produced on 32-bit architecture (i686)
[ https://issues.apache.org/jira/browse/PARQUET-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16643970#comment-16643970 ]

Dmitry Kalinkin commented on PARQUET-1438:
------------------------------------------

Perhaps this is an arrow issue then?
[jira] [Commented] (PARQUET-1438) [C++] corrupted files produced on 32-bit architecture (i686)
[ https://issues.apache.org/jira/browse/PARQUET-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16643963#comment-16643963 ]

Dmitry Kalinkin commented on PARQUET-1438:
------------------------------------------

Running the test suite was a great suggestion! I've tested arrow-cpp 0.10.0, parquet-cpp 1.5.0, and arrow-cpp 0.11.0 and found that all tests pass on x86_64. As for tests on i686: *1* test fails on arrow-cpp 0.10.0, *0* failures for parquet-cpp 1.5.0 (against arrow-cpp 0.10.0), and arrow-cpp 0.11.0 has *11* failing tests. I'm attaching the log files to the ticket.
[jira] [Updated] (PARQUET-1438) [C++] corrupted files produced on 32-bit architecture (i686)
[ https://issues.apache.org/jira/browse/PARQUET-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitry Kalinkin updated PARQUET-1438:
-------------------------------------
    Attachment: arrow_0.10.0_i686_test_fail.log
[jira] [Updated] (PARQUET-1438) [C++] corrupted files produced on 32-bit architecture (i686)
[ https://issues.apache.org/jira/browse/PARQUET-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitry Kalinkin updated PARQUET-1438:
-------------------------------------
    Attachment: parquet_1.5.0_i686_test_success.log
[jira] [Updated] (PARQUET-1438) [C++] corrupted files produced on 32-bit architecture (i686)
[ https://issues.apache.org/jira/browse/PARQUET-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitry Kalinkin updated PARQUET-1438:
-------------------------------------
    Attachment: arrow_0.11.0_i686_test_fail.log
[jira] [Commented] (PARQUET-1420) [C++] Thrift-generated symbols not exported in DLL
[ https://issues.apache.org/jira/browse/PARQUET-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16643878#comment-16643878 ]

Antoine Pitrou commented on PARQUET-1420:
-----------------------------------------

I've started looking into this. Two main pain points seem to stick out:
* {{schema-test.cc}} invokes many Thrift-generated APIs, e.g. for creating schema elements
* {{file-deserialize-test.cc}} seems to test implementation details (PageHeader and DataPageHeader serialization)
parquet sync notes
Attendees:
- Gabor (Cloudera): column index, benchmark, nested types (filter, indexes)
- Anna (Cloudera): process, feature branches, etiquette of waiting for someone? Blocked
- Zoltan (Cloudera): feature branches? When to review them?
- Nandor (Cloudera): parquet file with multiple row groups, schema evolution
- Zoltan (Cloudera): column index
- Junjie (Tencent): listening
- Gidon (IBM): encryption next steps
- Jim: bloom filter, Bit weaving
- Xinli (Uber): encryption
- Julien (WeWork): encryption

Bloom filter:
- PR for doc. Parquet-format feature branch.
  - To be reviewed by: Deepak, Jim, Ryan.

Encryption:
- Another encryption effort exists; Julien to send intros: Xinli, Gidon, Zoltan
- New requirements, updated doc, implement code changes.

Process:
- Feature branches:
  - Julien to follow up with Ryan
- Feature branches are considered like master:
  - Every change is reviewed individually through a PR
  - Every change has a jira
  - Only difference is that it's ok to make incompatible changes
- Squash merge vs merge commit:
  - Merge commit keeps the history but clutters
  - 3 options:
    - Merge commit
      - Clutters history (not linear anymore)
      - But if each commit in the branch has a jira it seems fine
    - Squash
      - Loses the detailed commits of the feature
      - Keeps history linear
    - Rebase feature branch
      - Keeps history linear and keeps history
      - But need to address conflicts for each commit in the branch
      - Commits in the branch are now disconnected from the PR (modified after the fact)
- When is it appropriate to wait:
  - Balance:
    - Making sure we don't make incompatible changes to the format and we have final features
    - Making it easier for people to contribute
  - Anna to start a conversation around our etiquette:
    - How long is it appropriate to wait on feedback
    - How to know who's the best committer to drive a PR to conclusion

Filtering nested types support:
- We should store stats for nested types

Page index benchmark:
- Nice results, comparing random to sorted files:
  - https://jmh.morethan.io/?gist=2388d962d6380f74a78ad0d97b4353a2/benchmarkWithOrWithoutColumnIndex.json
  - https://jmh.morethan.io/?gist=2388d962d6380f74a78ad0d97b4353a2/benchmarkPageSize.json
- Need to compare the effect of page size on compression and file size

Appending to a parquet file:
- The type of a column chunk should be consistent with the schema in the footer.
[jira] [Resolved] (PARQUET-1354) [C++] Fix deprecated Arrow builder API usages
[ https://issues.apache.org/jira/browse/PARQUET-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved PARQUET-1354.
-----------------------------------
    Resolution: Fixed

Yes this is fixed

> [C++] Fix deprecated Arrow builder API usages
> ---------------------------------------------
>
>                 Key: PARQUET-1354
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1354
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>            Reporter: Wes McKinney
>            Assignee: Wes McKinney
>            Priority: Blocker
>             Fix For: cpp-1.5.0
>
> I see warnings like the following:
> {code}
> [64/65] Building CXX object src/parquet/arrow/CMakeF...reader-writer-test.dir/arrow-reader-writer-test.cc.o
> In file included from ../src/parquet/arrow/test-util.h:23:0,
>                  from ../src/parquet/arrow/arrow-reader-writer-test.cc:37:
> ../src/parquet/arrow/test-util.h: In function 'void parquet::arrow::ExpectArrayT(void*, arrow::Array*) [with ArrowType = arrow::BooleanType]':
> ../src/parquet/arrow/test-util.h:467:82: warning: 'arrow::Status arrow::BooleanBuilder::Append(const uint8_t*, int64_t, const uint8_t*)' is deprecated (declared at /opt/conda/envs/pyarrow-dev/include/arrow/builder.h:711): Use AppendValues instead [-Wdeprecated-declarations]
>    EXPECT_OK(builder.Append(reinterpret_cast(expected), result->length()));
> In file included from /opt/conda/envs/pyarrow-dev/include/arrow/compute/context.h:24:0,
>                  from /opt/conda/envs/pyarrow-dev/include/arrow/compute/api.h:21,
>                  from ../src/parquet/arrow/arrow-reader-writer-test.cc:26:
> ../src/parquet/arrow/test-util.h: In instantiation of 'typename std::enable_if parquet::arrow::DecimalWithPrecisionAndScale >::value, arrow::Status>::type parquet::arrow::NullableArray(size_t, size_t, uint32_t, std::shared_ptr*) [with ArrowType = parquet::arrow::DecimalWithPrecisionAndScale<38>; int precision = 38; typename std::enable_if parquet::arrow::DecimalWithPrecisionAndScale >::value, arrow::Status>::type = arrow::Status; size_t = long unsigned int; uint32_t = unsigned int]':
> ../src/parquet/arrow/arrow-reader-writer-test.cc:845:3: required from 'void parquet::arrow::TestParquetIO_SingleColumnTableOptionalChunkedWrite_Test::TestBody() [with gtest_TypeParam_ = parquet::arrow::DecimalWithPrecisionAndScale<38>]'
> /opt/conda/envs/pyarrow-dev/include/arrow/builder.h:1042:20: required from here
> ../src/parquet/arrow/test-util.h:331:73: warning: 'arrow::Status arrow::FixedSizeBinaryBuilder::Append(const uint8_t*, int64_t, const uint8_t*)' is deprecated (declared at /opt/conda/envs/pyarrow-dev/include/arrow/builder.h:1017): Use AppendValues instead [-Wdeprecated-declarations]
>    RETURN_NOT_OK(builder.Append(out_buf->data(), size, valid_bytes.data()));
> ../src/parquet/arrow/test-util.h: In instantiation of 'typename std::enable_if parquet::arrow::DecimalWithPrecisionAndScale >::value, arrow::Status>::type parquet::arrow::NonNullArray(size_t, std::shared_ptr*) [with ArrowType = parquet::arrow::DecimalWithPrecisionAndScale<38>; int precision = 38; typename std::enable_if parquet::arrow::DecimalWithPrecisionAndScale >::value, arrow::Status>::type = arrow::Status; size_t = long unsigned int]':
> ../src/parquet/arrow/arrow-reader-writer-test.cc:791:3: required from 'void parquet::arrow::TestParquetIO_SingleColumnTableRequiredChunkedWriteArrowIO_Test::TestBody() [with gtest_TypeParam_ = parquet::arrow::DecimalWithPrecisionAndScale<38>]'
> /opt/conda/envs/pyarrow-dev/include/arrow/builder.h:1042:20: required from here
> ../src/parquet/arrow/test-util.h:170:53: warning: 'arrow::Status arrow::FixedSizeBinaryBuilder::Append(const uint8_t*, int64_t, const uint8_t*)' is deprecated (declared at /opt/conda/envs/pyarrow-dev/include/arrow/builder.h:1017): Use AppendValues instead [-Wdeprecated-declarations]
>    RETURN_NOT_OK(builder.Append(out_buf->data(), size));
> {code}
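(As a side note: a small hedged sketch of the migration these warnings ask for — {{AppendValues}} replaced the deprecated pointer-plus-length {{Append}} overloads on the Arrow builders. The wrapper function below is illustrative, not parquet-cpp code.)

{code}
#include <arrow/builder.h>
#include <arrow/status.h>

// Fill a BooleanBuilder from a raw byte array (nonzero = true).
arrow::Status FillBooleans(const uint8_t* values, int64_t length) {
  arrow::BooleanBuilder builder;
  // Deprecated spelling flagged in the log:
  //   builder.Append(values, length);
  // Replacement suggested by the warning:
  return builder.AppendValues(values, length);
}
{code}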
[jira] [Commented] (PARQUET-1438) [C++] corrupted files produced on 32-bit architecture (i686)
[ https://issues.apache.org/jira/browse/PARQUET-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16643608#comment-16643608 ]

Dmitry Kalinkin commented on PARQUET-1438:
------------------------------------------

Thank you for providing the diff. I looked, and it doesn't seem very drastic to me either. I don't think there is a conflicting-libraries problem: I do all of my builds in a sandbox, and the writing of files does succeed, with the resulting files being grossly different for 0.11.0 on 32 bits. Unfortunately, all of 3545186d6, 3545186d6~ and 9b4cd9c03 do reproduce the bug.
[jira] [Commented] (PARQUET-1438) [C++] corrupted files produced on 32-bit architecture (i686)
[ https://issues.apache.org/jira/browse/PARQUET-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16643525#comment-16643525 ]

Wes McKinney commented on PARQUET-1438:
---------------------------------------

Here's the effective diff on the codebases: https://gist.github.com/wesm/e8e43aba036db747fb9c021d590be938

Is it possible you have a conflicting libparquet.so lying around? The only thing that looks possibly concerning are some changes to the metadata introduced in PARQUET-1369. If you build with the commit 3545186d6 right before that, do you still get the issue?

If you want to git bisect, the place to start is 9b4cd9c03 ARROW-3075
[jira] [Created] (PARQUET-1439) [C++] Parquet build fails when PARQUET_ARROW_LINKAGE is static
Deepak Majeti created PARQUET-1439:
--------------------------------------

             Summary: [C++] Parquet build fails when PARQUET_ARROW_LINKAGE is static
                 Key: PARQUET-1439
                 URL: https://issues.apache.org/jira/browse/PARQUET-1439
             Project: Parquet
          Issue Type: Improvement
          Components: parquet-cpp
            Reporter: Deepak Majeti
            Assignee: Deepak Majeti
             Fix For: cpp-1.6.0

The error is as follows:

{noformat}
CMake Error at cmake_modules/BuildUtils.cmake:145 (add_dependencies):
  The dependency target "/usr/lib/x86_64-linux-gnu/libpthread.so" of target
  "parquet_objlib" does not exist.
Call Stack (most recent call first):
  src/parquet/CMakeLists.txt:183 (ADD_ARROW_LIB)
{noformat}
[jira] [Commented] (PARQUET-1438) [C++] corrupted files produced on 32-bit architecture (i686)
[ https://issues.apache.org/jira/browse/PARQUET-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16643462#comment-16643462 ]

Dmitry Kalinkin commented on PARQUET-1438:
------------------------------------------

Yes. The setup with arrow-cpp 0.10.0 and parquet-cpp 1.5.0 uses the tarball from https://github.com/apache/parquet-cpp/archive/apache-parquet-cpp-1.5.0.tar.gz
[jira] [Commented] (PARQUET-1438) [C++] corrupted files produced on 32-bit architecture (i686)
[ https://issues.apache.org/jira/browse/PARQUET-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16643440#comment-16643440 ]

Wes McKinney commented on PARQUET-1438:
---------------------------------------

Are you using the _released_ version of 1.5.0 or some other version? There should be little discrepancy between the code in parquet-cpp 1.5.0 and what's in master now
[jira] [Commented] (PARQUET-1438) [C++] corrupted files produced on 32-bit architecture (i686)
[ https://issues.apache.org/jira/browse/PARQUET-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16643415#comment-16643415 ]

Dmitry Kalinkin commented on PARQUET-1438:
------------------------------------------

I now checked the files that were produced with the previous version, parquet-cpp 1.5.0, on 32 bit, and they mostly match what I get on 64 bit with arrow-cpp 0.11.0. I also tried to do a bisect on the arrow-cpp repository, but could not find any good commit: they all either have the bug or don't build. I guess I could try to bisect the parquet-cpp repository against arrow-cpp 0.10.0. I was hoping someone with knowledge of the format could take a look at the files and see which part of the structure blows up. It seems like it is the schema that blows up. Does that mean I need to look at Thrift-related stuff?
[jira] [Comment Edited] (PARQUET-1438) [C++] corrupted files produced on 32-bit architecture (i686)
[ https://issues.apache.org/jira/browse/PARQUET-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16643415#comment-16643415 ]

Dmitry Kalinkin edited comment on PARQUET-1438 at 10/9/18 1:20 PM:
-------------------------------------------------------------------

I now checked files that were produced with previous version of the parquet-cpp 1.5.0 on 32 bit and they mostly match what I get on 64 bit arrow-cpp 0.11.0. I also tried to do a bisect on arrow-cpp repository, but could not find any good commit. They all either have the bug or don't build. I guess, I could try to bisect parquet-cpp repository against arrow-cpp 0.10.0. I was hoping someone with the knowledge of the format could take a look at files and see which part of the structure blows up. It seems like it is the schema that blows up. That means I need to look at thrift related stuff?

was (Author: veprbl):
I now checked files that were produced with previous version of the parquet-cpp 1.5.0 on 32 bit and they mostly match what I get on 64 bit arrow-cpp 0.11.0. I also tried to do a bisect on arrow-cpp repository, but could not find any good commit. They all either have a bug or don't build. I guess I could try to bisect parquet-cpp repository against arrow-cpp 0.10.0. I was hoping someone with the knowledge of the format could take a look at files and see which part of the structure blows up. It seems like it is the schema that blows up. That means I need to look at thrift related stuff?
[jira] [Commented] (PARQUET-1438) [C++] corrupted files produced on 32-bit architecture (i686)
[ https://issues.apache.org/jira/browse/PARQUET-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16642990#comment-16642990 ]

Wes McKinney commented on PARQUET-1438:
---------------------------------------

Since we do not test or develop on 32-bit arch, I would guess that it's not very well supported in general. We would appreciate some help with this