[jira] [Created] (PARQUET-1441) SchemaParseException: Can't redefine: list in AvroIndexedRecordConverter
Michael Heuer created PARQUET-1441:
--------------------------------------

             Summary: SchemaParseException: Can't redefine: list in AvroIndexedRecordConverter
                 Key: PARQUET-1441
                 URL: https://issues.apache.org/jira/browse/PARQUET-1441
             Project: Parquet
          Issue Type: Bug
          Components: parquet-avro
            Reporter: Michael Heuer

The following unit test added to TestAvroSchemaConverter fails:

{code:java}
@Test
public void testConvertedSchemaToStringCantRedefineList() throws Exception {
  String parquet = "message spark_schema {\n" +
      "  optional group annotation {\n" +
      "    optional group transcriptEffects (LIST) {\n" +
      "      repeated group list {\n" +
      "        optional group element {\n" +
      "          optional group effects (LIST) {\n" +
      "            repeated group list {\n" +
      "              optional binary element (UTF8);\n" +
      "            }\n" +
      "          }\n" +
      "        }\n" +
      "      }\n" +
      "    }\n" +
      "  }\n" +
      "}\n";
  Configuration conf = new Configuration(false);
  AvroSchemaConverter avroSchemaConverter = new AvroSchemaConverter(conf);
  Schema schema = avroSchemaConverter.convert(MessageTypeParser.parseMessageType(parquet));
  schema.toString();
}
{code}

while this one succeeds

{code:java}
@Test
public void testConvertedSchemaToStringCantRedefineList() throws Exception {
  String parquet = "message spark_schema {\n" +
      "  optional group annotation {\n" +
      "    optional group transcriptEffects (LIST) {\n" +
      "      repeated group list {\n" +
      "        optional group element {\n" +
      "          optional group effects (LIST) {\n" +
      "            repeated group list {\n" +
      "              optional binary element (UTF8);\n" +
      "            }\n" +
      "          }\n" +
      "        }\n" +
      "      }\n" +
      "    }\n" +
      "  }\n" +
      "}\n";
  Configuration conf = new Configuration(false);
  conf.setBoolean("parquet.avro.add-list-element-records", false);
  AvroSchemaConverter avroSchemaConverter = new AvroSchemaConverter(conf);
  Schema schema = avroSchemaConverter.convert(MessageTypeParser.parseMessageType(parquet));
  schema.toString();
}
{code}

I don't see a way to influence the code path in AvroIndexedRecordConverter to respect this configuration, resulting in the following stack trace downstream:

{noformat}
Cause: org.apache.avro.SchemaParseException: Can't redefine: list
  at org.apache.avro.Schema$Names.put(Schema.java:1128)
  at org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:562)
  at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:690)
  at org.apache.avro.Schema$ArraySchema.toJson(Schema.java:805)
  at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
  at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)
  at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)
  at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
  at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)
  at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)
  at org.apache.avro.Schema.toString(Schema.java:324)
  at org.apache.avro.SchemaCompatibility.checkReaderWriterCompatibility(SchemaCompatibility.java:68)
  at org.apache.parquet.avro.AvroRecordConverter.isElementType(AvroRecordConverter.java:866)
  at org.apache.parquet.avro.AvroIndexedRecordConverter$AvroArrayConverter.<init>(AvroIndexedRecordConverter.java:333)
  at org.apache.parquet.avro.AvroIndexedRecordConverter.newConverter(AvroIndexedRecordConverter.java:172)
  at org.apache.parquet.avro.AvroIndexedRecordConverter.<init>(AvroIndexedRecordConverter.java:94)
  at org.apache.parquet.avro.AvroIndexedRecordConverter.newConverter(AvroIndexedRecordConverter.java:168)
  at org.apache.parquet.avro.AvroIndexedRecordConverter.<init>(AvroIndexedRecordConverter.java:94)
  at org.apache.parquet.avro.AvroIndexedRecordConverter.<init>(AvroIndexedRecordConverter.java:66)
  at org.apache.parquet.avro.AvroCompatRecordMaterializer.<init>(AvroCompatRecordMaterializer.java:34)
  at org.apache.parquet.avro.AvroReadSupport.newCompatMaterializer(AvroReadSupport.java:144)
  at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:136)
  at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:204)
  at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:182)
  at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
  ...
{noformat}

See also downstream issues:
https://issues.apache.org/jira/browse/SPARK-25588
https://github.com/bigdatagenomics/adam/issues/2058
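(For context: a minimal standalone sketch, not part of the ticket, of why the converted schema is rejected. Both nested LIST levels become Avro records named "list", and Avro refuses to define the same full name twice; the class name below is illustrative.)

{code:java}
import org.apache.avro.Schema;
import org.apache.avro.SchemaParseException;

public class RedefineListDemo {
  public static void main(String[] args) {
    // A record named "list" nesting another record also named "list",
    // mirroring what the converter produces for nested LIST groups
    // when list element records are enabled.
    String json =
        "{\"type\": \"record\", \"name\": \"list\", \"fields\": ["
            + "{\"name\": \"element\", \"type\": "
            + "{\"type\": \"record\", \"name\": \"list\", \"fields\": ["
            + "{\"name\": \"element\", \"type\": \"string\"}]}}]}";
    try {
      new Schema.Parser().parse(json);
    } catch (SchemaParseException e) {
      System.out.println(e.getMessage()); // Can't redefine: list
    }
  }
}
{code}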
[jira] [Commented] (PARQUET-1438) [C++] corrupted files produced on 32-bit architecture (i686)
[ https://issues.apache.org/jira/browse/PARQUET-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16644052#comment-16644052 ]

Dmitry Kalinkin commented on PARQUET-1438:
------------------------------------------

Opened ARROW-3477

> [C++] corrupted files produced on 32-bit architecture (i686)
> ------------------------------------------------------------
>
>                 Key: PARQUET-1438
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1438
>             Project: Parquet
>          Issue Type: Bug
>            Reporter: Dmitry Kalinkin
>            Priority: Major
>         Attachments: 32.parquet, 64.parquet, arrow_0.10.0_i686_test_fail.log, arrow_0.11.0_i686_test_fail.log, parquet_1.5.0_i686_test_success.log
>
> I'm using the C++ API to convert some data to Parquet files. I've noticed a regression when upgrading from arrow-cpp 0.10.0 + parquet-cpp 1.5.0 to arrow-cpp 0.11.0. The issue is that I can write Parquet files without an error, but when I try to read those using pyarrow I get a segfault:
> {noformat}
> #0 0x7fffd17c7f0f in int arrow::util::RleDecoder::GetBatchWithDictSpaced(float const*, float*, int, int, unsigned char const*, long) ()
>    from /nix/store/k6sy2ncjnkn5wnb2dq9m5f0qh446kjhg-arrow-cpp-0.11.0/lib/libparquet.so.11
> #1 0x7fffd17c8025 in parquet::DictionaryDecoder >::DecodeSpaced(float*, int, int, unsigned char const*, long) ()
>    from /nix/store/k6sy2ncjnkn5wnb2dq9m5f0qh446kjhg-arrow-cpp-0.11.0/lib/libparquet.so.11
> #2 0x7fffd17bcf0f in parquet::internal::TypedRecordReader >::ReadRecordData(long) ()
>    from /nix/store/k6sy2ncjnkn5wnb2dq9m5f0qh446kjhg-arrow-cpp-0.11.0/lib/libparquet.so.11
> #3 0x7fffd17bfbea in parquet::internal::TypedRecordReader >::ReadRecords(long) ()
>    from /nix/store/k6sy2ncjnkn5wnb2dq9m5f0qh446kjhg-arrow-cpp-0.11.0/lib/libparquet.so.11
> #4 0x7fffd179d2f7 in parquet::arrow::PrimitiveImpl::NextBatch(long, std::shared_ptr*) ()
>    from /nix/store/k6sy2ncjnkn5wnb2dq9m5f0qh446kjhg-arrow-cpp-0.11.0/lib/libparquet.so.11
> #5 0x7fffd1797162 in parquet::arrow::ColumnReader::NextBatch(long, std::shared_ptr*) ()
>    from /nix/store/k6sy2ncjnkn5wnb2dq9m5f0qh446kjhg-arrow-cpp-0.11.0/lib/libparquet.so.11
> #6 0x7fffd179a6e5 in parquet::arrow::FileReader::Impl::ReadSchemaField(int, std::vector std::allocator > const&, std::shared_ptr*) ()
>    from /nix/store/k6sy2ncjnkn5wnb2dq9m5f0qh446kjhg-arrow-cpp-0.11.0/lib/libparquet.so.11
> #7 0x7fffd179aaad in parquet::arrow::FileReader::Impl::ReadTable(std::vector std::allocator > const&, std::shared_ptr*)::{lambda(int)#1}::operator()(int) const ()
>    from /nix/store/k6sy2ncjnkn5wnb2dq9m5f0qh446kjhg-arrow-cpp-0.11.0/lib/libparquet.so.11
> {noformat}
> I have not been able to dig to the bottom of the issue, but it seems like the problem reproduces only when I run 32-bit binaries. After I learned that, I found that 32-bit and 64-bit builds produce very different Parquet files for the same data. The sizes of the structures are clearly different if I look at their hexdumps. I'm attaching the example files. Reading "32.parquet" (produced using i686 binaries) will cause a segfault on macOS and Linux; "64.parquet" will read just fine.
[jira] [Commented] (PARQUET-1420) [C++] Thrift-generated symbols not exported in DLL
[ https://issues.apache.org/jira/browse/PARQUET-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16644012#comment-16644012 ]

Antoine Pitrou commented on PARQUET-1420:
-----------------------------------------

(note you can see the work in progress here: https://github.com/apache/arrow/compare/master...pitrou:ARROW-3442-tests-linking-shared)

> [C++] Thrift-generated symbols not exported in DLL
> --------------------------------------------------
>
>                 Key: PARQUET-1420
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1420
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>            Reporter: Antoine Pitrou
>            Priority: Major
>              Labels: windows
>
> Thrift-generated symbols don't have any {{PARQUET_EXPORT}}-like annotation, so they are not reachable from the Parquet DLL. In turn this makes it impossible to link the Parquet unit tests with the Parquet DLL (instead of the Parquet static lib). I'm not sure whether it can impact other applications.
> Example linking error:
> {code}
> column_writer-test.cc.obj : error LNK2019: unresolved external symbol "public: virtual unsigned int __cdecl parquet::format::Statistics::read(class apache::thrift::protocol::TProtocol *)" (?read@Statistics@format@parquet@@UEAAIPEAVTProtocol@protocol@thrift@apache@@@Z) referenced in function "[thunk]:public: virtual unsigned int __cdecl parquet::format::Statistics::read`vtordisp{4294967292,0}' (class apache::thrift::protocol::TProtocol *)" (?read@Statistics@format@parquet@@$4PPPM@A@EAAIPEAVTProtocol@protocol@thrift@apache@@@Z)
> {code}
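(For background: a hedged sketch of the symbol-visibility pattern the ticket refers to. The {{PARQUET_EXPORT}} name comes from the ticket; the {{PARQUET_EXPORTING}} guard and the example class are illustrative assumptions, not parquet-cpp's actual headers.)

{code}
// Illustrative only: the dllexport/dllimport annotation that the
// Thrift-generated classes lack. PARQUET_EXPORTING is a hypothetical
// define set while building the DLL itself.
#if defined(_WIN32)
  #ifdef PARQUET_EXPORTING
    #define PARQUET_EXPORT __declspec(dllexport)
  #else
    #define PARQUET_EXPORT __declspec(dllimport)
  #endif
#else
  #define PARQUET_EXPORT __attribute__((visibility("default")))
#endif

// A class annotated this way is reachable through the DLL; the generated
// parquet::format::* classes carry no such annotation, hence LNK2019 when
// the unit tests link against the DLL instead of the static lib.
class PARQUET_EXPORT Example {
 public:
  virtual unsigned int read();
};
{code}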
[jira] [Commented] (PARQUET-1438) [C++] corrupted files produced on 32-bit architecture (i686)
[ https://issues.apache.org/jira/browse/PARQUET-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16643970#comment-16643970 ]

Dmitry Kalinkin commented on PARQUET-1438:
------------------------------------------

Perhaps this is an arrow issue then?
[jira] [Commented] (PARQUET-1438) [C++] corrupted files produced on 32-bit architecture (i686)
[ https://issues.apache.org/jira/browse/PARQUET-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16643963#comment-16643963 ]

Dmitry Kalinkin commented on PARQUET-1438:
------------------------------------------

Running the test suite was a great suggestion! I've tested arrow-cpp 0.10.0, parquet-cpp 1.5.0, and arrow-cpp 0.11.0 and found that all tests pass on x86_64. As for tests on i686: *1* test fails on arrow-cpp 0.10.0, *0* failures for parquet-cpp 1.5.0 (against arrow-cpp 0.10.0), and arrow-cpp 0.11.0 has *11* failing tests. I'm attaching the log files to the ticket.
[jira] [Updated] (PARQUET-1438) [C++] corrupted files produced on 32-bit architecture (i686)
[ https://issues.apache.org/jira/browse/PARQUET-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitry Kalinkin updated PARQUET-1438:
-------------------------------------
    Attachment: arrow_0.10.0_i686_test_fail.log
[jira] [Updated] (PARQUET-1438) [C++] corrupted files produced on 32-bit architecture (i686)
[ https://issues.apache.org/jira/browse/PARQUET-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitry Kalinkin updated PARQUET-1438:
-------------------------------------
    Attachment: parquet_1.5.0_i686_test_success.log
[jira] [Updated] (PARQUET-1438) [C++] corrupted files produced on 32-bit architecture (i686)
[ https://issues.apache.org/jira/browse/PARQUET-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitry Kalinkin updated PARQUET-1438:
-------------------------------------
    Attachment: arrow_0.11.0_i686_test_fail.log
[jira] [Commented] (PARQUET-1420) [C++] Thrift-generated symbols not exported in DLL
[ https://issues.apache.org/jira/browse/PARQUET-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16643878#comment-16643878 ]

Antoine Pitrou commented on PARQUET-1420:
-----------------------------------------

I've started looking into this. Two main pain points seem to stick out:
* {{schema-test.cc}} invokes many Thrift-generated APIs, e.g. for creating schema elements
* {{file-deserialize-test.cc}} seems to test implementation details (PageHeader and DataPageHeader serialization)
parquet sync notes
Attendees:
- Gabor (Cloudera): column index, benchmark, nested types (filter, indexes)
- Anna (Cloudera): process, feature branches, etiquette of waiting for someone? Blocked
- Zoltan (Cloudera): feature branches? When to review them?
- Nandor (Cloudera): parquet file with multiple row groups, schema evolution
- Zoltan (Cloudera): column index
- Junjie (Tencent): listening
- Gidon (IBM): encryption next steps
- Jim: bloom filter, Bit weaving
- Xinli (Uber): encryption
- Julien (WeWork): encryption

Bloom filter:
- PR for doc. Parquet-format feature branch.
  - To be reviewed by: Deepak, Jim, Ryan.

Encryption:
- Another encryption effort exists; Julien to send intros: Xinli, Gidon, Zoltan
- New requirements, updated doc, implement code changes.

Process:
- Feature branches:
  - Julien to follow up with Ryan
- Feature branches are considered like master:
  - Every change is reviewed individually through a PR
  - Every change has a jira
  - Only difference is that it's ok to make incompatible changes
- Squash merge vs merge commit:
  - Merge commit keeps the history but clutters
  - 3 options:
    - Merge commit
      - Clutters history (not linear anymore)
      - But if each commit in the branch has a jira it seems fine
    - Squash
      - Loses the detailed commits of the feature
      - Keeps history linear
    - Rebase feature branch
      - Keeps history linear and keeps history
      - But need to address conflicts for each commit in the branch
      - Commits in the branch are now disconnected from the PR (modified after the fact)
- When is it appropriate to wait:
  - Balance:
    - Making sure we don't make incompatible changes to the format and we have final features
    - Making it easier for people to contribute
  - Anna to start a conversation around our etiquette:
    - How long is it appropriate to wait on feedback
    - How to know who's the best committer to drive a PR to conclusion

Filtering nested types support:
- We should store stats for nested types

Page index benchmark:
- Nice results, comparing random to sorted files:
  - https://jmh.morethan.io/?gist=2388d962d6380f74a78ad0d97b4353a2/benchmarkWithOrWithoutColumnIndex.json
  - https://jmh.morethan.io/?gist=2388d962d6380f74a78ad0d97b4353a2/benchmarkPageSize.json
- Need to compare the effect of page size on compression and file size

Appending to a parquet file:
- The type of a column chunk should be consistent with the schema in the footer.
[jira] [Resolved] (PARQUET-1354) [C++] Fix deprecated Arrow builder API usages
[ https://issues.apache.org/jira/browse/PARQUET-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved PARQUET-1354.
-----------------------------------
    Resolution: Fixed

Yes this is fixed

> [C++] Fix deprecated Arrow builder API usages
> ---------------------------------------------
>
>                 Key: PARQUET-1354
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1354
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>            Reporter: Wes McKinney
>            Assignee: Wes McKinney
>            Priority: Blocker
>             Fix For: cpp-1.5.0
>
> I see warnings like the following:
> {code}
> [64/65] Building CXX object src/parquet/arrow/CMakeF...reader-writer-test.dir/arrow-reader-writer-test.cc.o
> In file included from ../src/parquet/arrow/test-util.h:23:0,
>                  from ../src/parquet/arrow/arrow-reader-writer-test.cc:37:
> ../src/parquet/arrow/test-util.h: In function 'void parquet::arrow::ExpectArrayT(void*, arrow::Array*) [with ArrowType = arrow::BooleanType]':
> ../src/parquet/arrow/test-util.h:467:82: warning: 'arrow::Status arrow::BooleanBuilder::Append(const uint8_t*, int64_t, const uint8_t*)' is deprecated (declared at /opt/conda/envs/pyarrow-dev/include/arrow/builder.h:711): Use AppendValues instead [-Wdeprecated-declarations]
>    EXPECT_OK(builder.Append(reinterpret_cast(expected), result->length()));
> In file included from /opt/conda/envs/pyarrow-dev/include/arrow/compute/context.h:24:0,
>                  from /opt/conda/envs/pyarrow-dev/include/arrow/compute/api.h:21,
>                  from ../src/parquet/arrow/arrow-reader-writer-test.cc:26:
> ../src/parquet/arrow/test-util.h: In instantiation of 'typename std::enable_if parquet::arrow::DecimalWithPrecisionAndScale >::value, arrow::Status>::type parquet::arrow::NullableArray(size_t, size_t, uint32_t, std::shared_ptr*) [with ArrowType = parquet::arrow::DecimalWithPrecisionAndScale<38>; int precision = 38; typename std::enable_if parquet::arrow::DecimalWithPrecisionAndScale >::value, arrow::Status>::type = arrow::Status; size_t = long unsigned int; uint32_t = unsigned int]':
> ../src/parquet/arrow/arrow-reader-writer-test.cc:845:3: required from 'void parquet::arrow::TestParquetIO_SingleColumnTableOptionalChunkedWrite_Test::TestBody() [with gtest_TypeParam_ = parquet::arrow::DecimalWithPrecisionAndScale<38>]'
> /opt/conda/envs/pyarrow-dev/include/arrow/builder.h:1042:20: required from here
> ../src/parquet/arrow/test-util.h:331:73: warning: 'arrow::Status arrow::FixedSizeBinaryBuilder::Append(const uint8_t*, int64_t, const uint8_t*)' is deprecated (declared at /opt/conda/envs/pyarrow-dev/include/arrow/builder.h:1017): Use AppendValues instead [-Wdeprecated-declarations]
>    RETURN_NOT_OK(builder.Append(out_buf->data(), size, valid_bytes.data()));
> ../src/parquet/arrow/test-util.h: In instantiation of 'typename std::enable_if parquet::arrow::DecimalWithPrecisionAndScale >::value, arrow::Status>::type parquet::arrow::NonNullArray(size_t, std::shared_ptr*) [with ArrowType = parquet::arrow::DecimalWithPrecisionAndScale<38>; int precision = 38; typename std::enable_if parquet::arrow::DecimalWithPrecisionAndScale >::value, arrow::Status>::type = arrow::Status; size_t = long unsigned int]':
> ../src/parquet/arrow/arrow-reader-writer-test.cc:791:3: required from 'void parquet::arrow::TestParquetIO_SingleColumnTableRequiredChunkedWriteArrowIO_Test::TestBody() [with gtest_TypeParam_ = parquet::arrow::DecimalWithPrecisionAndScale<38>]'
> /opt/conda/envs/pyarrow-dev/include/arrow/builder.h:1042:20: required from here
> ../src/parquet/arrow/test-util.h:170:53: warning: 'arrow::Status arrow::FixedSizeBinaryBuilder::Append(const uint8_t*, int64_t, const uint8_t*)' is deprecated (declared at /opt/conda/envs/pyarrow-dev/include/arrow/builder.h:1017): Use AppendValues instead [-Wdeprecated-declarations]
>    RETURN_NOT_OK(builder.Append(out_buf->data(), size));
> {code}
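(As a side note: a small hedged sketch of the migration these warnings ask for — {{AppendValues}} replaced the deprecated pointer-plus-length {{Append}} overloads on the Arrow builders. The wrapper function below is illustrative, not parquet-cpp code.)

{code}
#include <arrow/builder.h>
#include <arrow/status.h>

// Fill a BooleanBuilder from a raw byte array (nonzero = true).
arrow::Status FillBooleans(const uint8_t* values, int64_t length) {
  arrow::BooleanBuilder builder;
  // Deprecated spelling flagged in the log:
  //   builder.Append(values, length);
  // Replacement suggested by the warning:
  return builder.AppendValues(values, length);
}
{code}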
[jira] [Commented] (PARQUET-1438) [C++] corrupted files produced on 32-bit architecture (i686)
[ https://issues.apache.org/jira/browse/PARQUET-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16643608#comment-16643608 ]

Dmitry Kalinkin commented on PARQUET-1438:
------------------------------------------

Thank you for providing the diff. I looked, and it doesn't seem very drastic to me either. I don't think there is a conflicting-libraries problem: I do all of my builds in a sandbox, and the writing of files does succeed, with the resulting files being grossly different for 0.11.0 on 32 bits. Unfortunately, all of 3545186d6, 3545186d6~ and 9b4cd9c03 do reproduce the bug.
[jira] [Commented] (PARQUET-1438) [C++] corrupted files produced on 32-bit architecture (i686)
[ https://issues.apache.org/jira/browse/PARQUET-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16643525#comment-16643525 ]

Wes McKinney commented on PARQUET-1438:
---------------------------------------

Here's the effective diff on the codebases: https://gist.github.com/wesm/e8e43aba036db747fb9c021d590be938

Is it possible you have a conflicting libparquet.so lying around? The only thing that looks possibly concerning are some changes to the metadata introduced in PARQUET-1369. If you build with the commit 3545186d6 right before that, do you still get the issue?

If you want to git bisect, the place to start is 9b4cd9c03 ARROW-3075
[jira] [Created] (PARQUET-1439) [C++] Parquet build fails when PARQUET_ARROW_LINKAGE is static
Deepak Majeti created PARQUET-1439:
--------------------------------------

             Summary: [C++] Parquet build fails when PARQUET_ARROW_LINKAGE is static
                 Key: PARQUET-1439
                 URL: https://issues.apache.org/jira/browse/PARQUET-1439
             Project: Parquet
          Issue Type: Improvement
          Components: parquet-cpp
            Reporter: Deepak Majeti
            Assignee: Deepak Majeti
             Fix For: cpp-1.6.0

The error is as follows:

{noformat}
CMake Error at cmake_modules/BuildUtils.cmake:145 (add_dependencies):
  The dependency target "/usr/lib/x86_64-linux-gnu/libpthread.so" of target
  "parquet_objlib" does not exist.
Call Stack (most recent call first):
  src/parquet/CMakeLists.txt:183 (ADD_ARROW_LIB)
{noformat}
[jira] [Commented] (PARQUET-1438) [C++] corrupted files produced on 32-bit architecture (i686)
[ https://issues.apache.org/jira/browse/PARQUET-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16643462#comment-16643462 ]

Dmitry Kalinkin commented on PARQUET-1438:
------------------------------------------

Yes. The setup with arrow-cpp 0.10.0 and parquet-cpp 1.5.0 uses the tarball from https://github.com/apache/parquet-cpp/archive/apache-parquet-cpp-1.5.0.tar.gz
[jira] [Commented] (PARQUET-1438) [C++] corrupted files produced on 32-bit architecture (i686)
[ https://issues.apache.org/jira/browse/PARQUET-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16643440#comment-16643440 ]

Wes McKinney commented on PARQUET-1438:
---------------------------------------

Are you using the _released_ version of 1.5.0 or some other version? There should be little discrepancy between the code in parquet-cpp 1.5.0 and what's in master now
[jira] [Commented] (PARQUET-1438) [C++] corrupted files produced on 32-bit architecture (i686)
[ https://issues.apache.org/jira/browse/PARQUET-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16643415#comment-16643415 ]

Dmitry Kalinkin commented on PARQUET-1438:
------------------------------------------

I now checked the files that were produced with the previous version, parquet-cpp 1.5.0, on 32 bit, and they mostly match what I get on 64 bit with arrow-cpp 0.11.0. I also tried to do a bisect on the arrow-cpp repository, but could not find any good commit: they all either have the bug or don't build. I guess I could try to bisect the parquet-cpp repository against arrow-cpp 0.10.0. I was hoping someone with knowledge of the format could take a look at the files and see which part of the structure blows up. It seems like it is the schema that blows up. Does that mean I need to look at Thrift-related stuff?
[jira] [Comment Edited] (PARQUET-1438) [C++] corrupted files produced on 32-bit architecture (i686)
[ https://issues.apache.org/jira/browse/PARQUET-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16643415#comment-16643415 ]

Dmitry Kalinkin edited comment on PARQUET-1438 at 10/9/18 1:20 PM:
-------------------------------------------------------------------

I now checked files that were produced with previous version of the parquet-cpp 1.5.0 on 32 bit and they mostly match what I get on 64 bit arrow-cpp 0.11.0. I also tried to do a bisect on arrow-cpp repository, but could not find any good commit. They all either have the bug or don't build. I guess, I could try to bisect parquet-cpp repository against arrow-cpp 0.10.0. I was hoping someone with the knowledge of the format could take a look at files and see which part of the structure blows up. It seems like it is the schema that blows up. That means I need to look at thrift related stuff?

was (Author: veprbl):
I now checked files that were produced with previous version of the parquet-cpp 1.5.0 on 32 bit and they mostly match what I get on 64 bit arrow-cpp 0.11.0. I also tried to do a bisect on arrow-cpp repository, but could not find any good commit. They all either have a bug or don't build. I guess I could try to bisect parquet-cpp repository against arrow-cpp 0.10.0. I was hoping someone with the knowledge of the format could take a look at files and see which part of the structure blows up. It seems like it is the schema that blows up. That means I need to look at thrift related stuff?
[jira] [Commented] (PARQUET-1438) [C++] corrupted files produced on 32-bit architecture (i686)
[ https://issues.apache.org/jira/browse/PARQUET-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16642990#comment-16642990 ]

Wes McKinney commented on PARQUET-1438:
---------------------------------------

Since we do not test or develop on 32-bit arch, I would guess that it's not very well supported in general. We would appreciate some help with this