Re: Interpretation of PageHeader uncompressed_page_size
Thanks Gabor, that is very helpful to know.

Best wishes,
Hatem

On Wed, Mar 25, 2020 at 2:15 PM Gabor Szadovszky wrote:
> Hi Hatem,
>
> I agree that the levels should be included, as per the specification. I
> checked the implementation in parquet-mr as well, and it also includes the
> levels in both the uncompressed and compressed values.
>
> Cheers,
> Gabor
>
> On Wed, Mar 25, 2020 at 1:02 PM Hatem Helal wrote:
> >
> > I've recently done some work on adding support for DataPageV2 to the C++
> > code base [1]. A question came up as to whether uncompressed_page_size
> > includes the levels, which are not compressed in the V2 format anyway.
> >
> > My understanding of the thrift specification [2] is that the levels are
> > included in this size. Can someone help confirm whether this
> > interpretation is correct?
> >
> > Thanks,
> >
> > Hatem
> >
> > [1] https://github.com/apache/arrow/pull/6481
> > [2] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L623
Interpretation of PageHeader uncompressed_page_size
I've recently done some work on adding support for DataPageV2 to the C++ code base [1]. A question came up as to whether uncompressed_page_size includes the levels, which are not compressed in the V2 format anyway. My understanding of the thrift specification [2] is that the levels are included in this size. Can someone help confirm whether this interpretation is correct?

Thanks,

Hatem

[1] https://github.com/apache/arrow/pull/6481
[2] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L623
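As a back-of-the-envelope check of this interpretation, the arithmetic can be sketched as follows (illustrative only: the field names echo the thrift spec, but this is not the parquet-cpp implementation):

```python
# DataPageV2 stores repetition/definition levels uncompressed, but the spec's
# uncompressed_page_size still counts them alongside the uncompressed values.
def uncompressed_page_size(rep_levels_len, def_levels_len, values_len):
    # Levels are never compressed in V2, yet they are included in this total.
    return rep_levels_len + def_levels_len + values_len

def compressed_page_size(rep_levels_len, def_levels_len, compressed_values_len):
    # Only the values section is compressed; the raw level bytes are added as-is.
    return rep_levels_len + def_levels_len + compressed_values_len

# Example: 40 + 60 bytes of levels, 4096 bytes of values compressed to 1024.
assert uncompressed_page_size(40, 60, 4096) == 4196
assert compressed_page_size(40, 60, 1024) == 1124
```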
[jira] [Assigned] (PARQUET-458) [C++] Implement support for DataPageV2
[ https://issues.apache.org/jira/browse/PARQUET-458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hatem Helal reassigned PARQUET-458:
-----------------------------------

Assignee: Hatem Helal

> [C++] Implement support for DataPageV2
> --------------------------------------
>
> Key: PARQUET-458
> URL: https://issues.apache.org/jira/browse/PARQUET-458
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-cpp
> Reporter: Wes McKinney
> Assignee: Hatem Helal
> Priority: Minor
> Labels: pull-request-available
> Fix For: cpp-1.6.0
>
> Time Spent: 10m
> Remaining Estimate: 0h

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-1639) [C++] Remove regex dependency for parsing ApplicationVersion
Hatem Helal created PARQUET-1639:

Summary: [C++] Remove regex dependency for parsing ApplicationVersion
Key: PARQUET-1639
URL: https://issues.apache.org/jira/browse/PARQUET-1639
Project: Parquet
Issue Type: Improvement
Components: parquet-cpp
Reporter: Hatem Helal

This is a follow-up task to ARROW-6096. As [~fsaintjacques] points out, the parsing can be done in a single pass without using the regex library. See discussion: https://github.com/apache/arrow/pull/4985#issuecomment-517393619

-- This message was sent by Atlassian JIRA (v7.6.14#76016)
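A rough sketch of the single-pass idea (the created_by format shown is an assumption based on typical parquet-mr strings, and parse_created_by is a hypothetical helper, not parquet-cpp's ApplicationVersion API):

```python
def parse_created_by(created_by):
    """Parse strings like 'parquet-mr version 1.8.0 (build abcd123)'
    with plain left-to-right string scanning -- no regex library needed."""
    app, sep, rest = created_by.partition(" version ")
    if not sep:
        return app.strip(), None, None
    rest = rest.strip()
    build = None
    if "(build " in rest:
        version_part, _, build_part = rest.partition("(build ")
        build = build_part.rstrip(") ")
        rest = version_part.strip()
    # Tolerate pre-release suffixes such as "1.8.0-SNAPSHOT".
    parts = (rest.split("-")[0].split(".") + ["0", "0"])[:3]
    return app, tuple(int(p) for p in parts), build

assert parse_created_by("parquet-mr version 1.8.0 (build abcd123)") == \
    ("parquet-mr", (1, 8, 0), "abcd123")
```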
[jira] [Resolved] (PARQUET-1623) [C++] Invalid memory access with a magic number of records
[ https://issues.apache.org/jira/browse/PARQUET-1623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hatem Helal resolved PARQUET-1623. -- Resolution: Fixed Issue resolved by pull request 4857 [https://github.com/apache/arrow/pull/4857] > [C++] Invalid memory access with a magic number of records > -- > > Key: PARQUET-1623 > URL: https://issues.apache.org/jira/browse/PARQUET-1623 > Project: Parquet > Issue Type: Bug > Components: parquet-cpp > Reporter: Hatem Helal > Assignee: Hatem Helal >Priority: Minor > Labels: pull-request-available > Fix For: cpp-1.6.0 > > Time Spent: 50m > Remaining Estimate: 0h > > I've observed a crash due to an invalid memory access when trying to read a > parquet file that I created with a single column of double-precision values > that occupies a fixed amount of memory. After some experimentation I found > that the following unittest added to {{arrow-reader-writer-test.cc}} will > fail when run in an ASAN build. > {code:java} > TEST(TestArrowReadWrite, MultiDataPageMagicNumber) { > const int num_rows = 262144; // 2^18 > std::shared_ptr table; > ASSERT_NO_FATAL_FAILURE(MakeDoubleTable(1, num_rows, 1, )); > std::shared_ptr result; > ASSERT_NO_FATAL_FAILURE( > DoSimpleRoundtrip(table, false, table->num_rows(), {}, )); > ASSERT_NO_FATAL_FAILURE(::arrow::AssertTablesEqual(*table, *result)); > }{code} > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (PARQUET-1623) [C++] Invalid memory access with a magic number of records
[ https://issues.apache.org/jira/browse/PARQUET-1623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883150#comment-16883150 ] Hatem Helal commented on PARQUET-1623: -- Yes, will post one soon. Working on a unittest that doesn't need the full machinery of the parquet-arrow-test > [C++] Invalid memory access with a magic number of records > -- > > Key: PARQUET-1623 > URL: https://issues.apache.org/jira/browse/PARQUET-1623 > Project: Parquet > Issue Type: Bug > Components: parquet-cpp > Reporter: Hatem Helal > Assignee: Hatem Helal >Priority: Minor > Fix For: cpp-1.6.0 > > > I've observed a crash due to an invalid memory access when trying to read a > parquet file that I created with a single column of double-precision values > that occupies a fixed amount of memory. After some experimentation I found > that the following unittest added to {{arrow-reader-writer-test.cc}} will > fail when run in an ASAN build. > {code:java} > TEST(TestArrowReadWrite, MultiDataPageMagicNumber) { > const int num_rows = 262144; // 2^18 > std::shared_ptr table; > ASSERT_NO_FATAL_FAILURE(MakeDoubleTable(1, num_rows, 1, )); > std::shared_ptr result; > ASSERT_NO_FATAL_FAILURE( > DoSimpleRoundtrip(table, false, table->num_rows(), {}, )); > ASSERT_NO_FATAL_FAILURE(::arrow::AssertTablesEqual(*table, *result)); > }{code} > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (PARQUET-1623) [C++] Invalid memory access with a magic number of records
[ https://issues.apache.org/jira/browse/PARQUET-1623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883126#comment-16883126 ] Hatem Helal commented on PARQUET-1623: -- I think I might understand what is happening here: when there is exactly a power of two number of rows we end up not having any padding in the bit-packed validity vector. After some experimenting, I found that this problem is only present when a column is serialized as multiple data pages. The default page size is specified here: [https://github.com/apache/arrow/blob/master/cpp/src/parquet/properties.h#L71] I think the problem lies in how the {{BitmapWriter}} is initialized here: [https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_reader.h#L188] I think the length of {{valid_bits_writer}} should be initialized to the current number of definition levels that the reader is trying to read from the current page. > [C++] Invalid memory access with a magic number of records > -- > > Key: PARQUET-1623 > URL: https://issues.apache.org/jira/browse/PARQUET-1623 > Project: Parquet > Issue Type: Bug > Components: parquet-cpp > Reporter: Hatem Helal > Assignee: Hatem Helal >Priority: Minor > Fix For: cpp-1.6.0 > > > I've observed a crash due to an invalid memory access when trying to read a > parquet file that I created with a single column of double-precision values > that occupies a fixed amount of memory. After some experimentation I found > that the following unittest added to {{arrow-reader-writer-test.cc}} will > fail when run in an ASAN build. 
> {code:java} > TEST(TestArrowReadWrite, MultiDataPageMagicNumber) { > const int num_rows = 262144; // 2^18 > std::shared_ptr table; > ASSERT_NO_FATAL_FAILURE(MakeDoubleTable(1, num_rows, 1, )); > std::shared_ptr result; > ASSERT_NO_FATAL_FAILURE( > DoSimpleRoundtrip(table, false, table->num_rows(), {}, )); > ASSERT_NO_FATAL_FAILURE(::arrow::AssertTablesEqual(*table, *result)); > }{code} > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
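The arithmetic behind this diagnosis can be sketched in a few lines (illustrative only, not parquet-cpp code; the 1 MiB default data page size is an assumption taken from the properties.h link above):

```python
# Sketch of why 2^18 rows of doubles is a "magic number".
num_rows = 262144            # 2^18
bytes_per_value = 8          # double precision
page_size = 1024 * 1024      # assumed default data page size (1 MiB)

# The column spans multiple data pages...
num_pages = -(-num_rows * bytes_per_value // page_size)  # ceiling division
assert num_pages == 2

# ...and the bit-packed validity bitmap for the full column fits *exactly*,
# with no padding bits left in the final byte, so a BitmapWriter sized for
# the whole column rather than the current page steps past the allocation.
bitmap_bytes = (num_rows + 7) // 8
assert bitmap_bytes == 32768 and num_rows % 8 == 0
```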
[jira] [Commented] (PARQUET-1623) [C++] Invalid memory access with a magic number of records
[ https://issues.apache.org/jira/browse/PARQUET-1623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883112#comment-16883112 ] Hatem Helal commented on PARQUET-1623: -- Here is the ASAN stack for the test: [https://gist.github.com/hatemhelal/ca0f6ef21f7aee0ff71afe18fbd52f92] > [C++] Invalid memory access with a magic number of records > -- > > Key: PARQUET-1623 > URL: https://issues.apache.org/jira/browse/PARQUET-1623 > Project: Parquet > Issue Type: Bug > Components: parquet-cpp > Reporter: Hatem Helal > Assignee: Hatem Helal >Priority: Minor > > I've observed a crash due to an invalid memory access when trying to read a > parquet file that I created with a single column of double-precision values > that occupies a fixed amount of memory. After some experimentation I found > that the following unittest added to {{arrow-reader-writer-test.cc}} will > fail when run in an ASAN build. > {code:java} > TEST(TestArrowReadWrite, MultiDataPageMagicNumber) { > const int num_rows = 262144; // 2^18 > std::shared_ptr table; > ASSERT_NO_FATAL_FAILURE(MakeDoubleTable(1, num_rows, 1, )); > std::shared_ptr result; > ASSERT_NO_FATAL_FAILURE( > DoSimpleRoundtrip(table, false, table->num_rows(), {}, )); > ASSERT_NO_FATAL_FAILURE(::arrow::AssertTablesEqual(*table, *result)); > }{code} > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (PARQUET-1623) [C++] Invalid memory access with a magic number of records
Hatem Helal created PARQUET-1623:

Summary: [C++] Invalid memory access with a magic number of records
Key: PARQUET-1623
URL: https://issues.apache.org/jira/browse/PARQUET-1623
Project: Parquet
Issue Type: Bug
Components: parquet-cpp
Reporter: Hatem Helal
Assignee: Hatem Helal

I've observed a crash due to an invalid memory access when trying to read a parquet file that I created with a single column of double-precision values that occupies a fixed amount of memory. After some experimentation I found that the following unittest added to {{arrow-reader-writer-test.cc}} will fail when run in an ASAN build.

{code:java}
TEST(TestArrowReadWrite, MultiDataPageMagicNumber) {
  const int num_rows = 262144;  // 2^18
  std::shared_ptr<Table> table;
  ASSERT_NO_FATAL_FAILURE(MakeDoubleTable(1, num_rows, 1, &table));
  std::shared_ptr<Table> result;
  ASSERT_NO_FATAL_FAILURE(
      DoSimpleRoundtrip(table, false, table->num_rows(), {}, &result));
  ASSERT_NO_FATAL_FAILURE(::arrow::AssertTablesEqual(*table, *result));
}{code}

-- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (PARQUET-1169) [C++] Segment fault when using NextBatch of parquet::arrow::ColumnReader in parquet-cpp
[ https://issues.apache.org/jira/browse/PARQUET-1169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16876821#comment-16876821 ]

Hatem Helal commented on PARQUET-1169:
--------------------------------------

[~frankfang], could you try this again using arrow master? I think this might have been resolved by ARROW-5608.

> [C++] Segment fault when using NextBatch of parquet::arrow::ColumnReader in parquet-cpp
> ---------------------------------------------------------------------------------------
>
> Key: PARQUET-1169
> URL: https://issues.apache.org/jira/browse/PARQUET-1169
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cpp
> Reporter: Jian Fang
> Priority: Major
> Fix For: cpp-1.5.0
>
> Attachments: test.parquet
>
> When I run the code below, I consistently get a segment fault; not sure whether this is a bug or I did something wrong. Could anyone here help me take a look?
> {code:c++}
> #include <iostream>
> #include <memory>
> #include "arrow/array.h"
> #include "arrow/io/file.h"
> #include "arrow/test-util.h"
> #include "parquet/arrow/reader.h"
> using arrow::Array;
> using arrow::default_memory_pool;
> using arrow::io::FileMode;
> using arrow::io::MemoryMappedFile;
> using parquet::arrow::ColumnReader;
> using parquet::arrow::FileReader;
> using parquet::arrow::OpenFile;
> int main(int argc, char** argv) {
>   if (argc > 1) {
>     std::string file_name = argv[1];
>     std::shared_ptr<MemoryMappedFile> file;
>     ABORT_NOT_OK(MemoryMappedFile::Open(file_name, FileMode::READ, &file));
>     std::unique_ptr<FileReader> file_reader;
>     ABORT_NOT_OK(OpenFile(file, default_memory_pool(), &file_reader));
>     std::unique_ptr<ColumnReader> column_reader;
>     ABORT_NOT_OK(file_reader->GetColumn(0, &column_reader));
>     std::shared_ptr<Array> array1;
>     ABORT_NOT_OK(column_reader->NextBatch(1, &array1));
>     std::cout << "length " << array1->length() << std::endl;
>     std::shared_ptr<Array> array2;
>     // segment fault
>     ABORT_NOT_OK(column_reader->NextBatch(1, &array2));
>     std::cout << "length " << array2->length() << std::endl;
>   }
>   return 0;
> }
> {code}
> Command to compile this program:
> {code}
> g++ test.c -I/usr/local/include/arrow -I/usr/local/include/parquet --std=c++11 -lparquet -larrow -lgtest -o parquet_test
> {code}
> Command to run the program:
> {code}
> ./parquet_test test.parquet
> {code}

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-1565) [C++] SEGV in FromParquetSchema with corrupt file from PARQUET-1481
[ https://issues.apache.org/jira/browse/PARQUET-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16821152#comment-16821152 ]

Hatem Helal commented on PARQUET-1565:
--------------------------------------

This is a somewhat esoteric problem, but the fix seems to be to extend [this switch case|https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/schema.cc#L174] to handle the corrupted thrift metadata.

> [C++] SEGV in FromParquetSchema with corrupt file from PARQUET-1481
> -------------------------------------------------------------------
>
> Key: PARQUET-1565
> URL: https://issues.apache.org/jira/browse/PARQUET-1565
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cpp
> Affects Versions: cpp-1.6.0
> Reporter: Hatem Helal
> Assignee: Hatem Helal
> Priority: Minor
>
> Calling {{parquet::arrow::FromParquetSchema}} when reading the corrupt file attached to PARQUET-1481 results in a SEGV. I'm not sure when this was introduced, but I didn't observe this problem with our app that uses parquet-cpp v1.4.0. Our team caught this while integrating Arrow 0.12.1 into MATLAB.
> To reproduce this, add the following lines to [parquet-reader.cc|https://github.com/apache/arrow/blob/master/cpp/tools/parquet/parquet-reader.cc#L66], build, and try to read the corrupt file attached to PARQUET-1481.
> {code:java}
> const auto parquet_schema = reader->metadata()->schema();
> std::shared_ptr<::arrow::Schema> arrow_schema;
> PARQUET_THROW_NOT_OK(parquet::arrow::FromParquetSchema(parquet_schema, &arrow_schema));{code}

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
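The shape of such a fix can be sketched as follows (a hypothetical Python stand-in for the C++ switch; the numeric codes follow parquet.thrift's physical Type enum, but ParquetError and from_thrift_type are illustrative names, not parquet-cpp API):

```python
class ParquetError(Exception):
    """Stand-in for a catchable error status."""

# Physical type codes as declared in parquet.thrift's Type enum.
KNOWN_TYPES = {0: "BOOLEAN", 1: "INT32", 2: "INT64", 3: "INT96",
               4: "FLOAT", 5: "DOUBLE", 6: "BYTE_ARRAY",
               7: "FIXED_LEN_BYTE_ARRAY"}

def from_thrift_type(type_code):
    # Dispatch on the metadata value; a corrupted file can carry a code
    # outside the enum, which should fail cleanly rather than SEGV.
    try:
        return KNOWN_TYPES[type_code]
    except KeyError:
        raise ParquetError(f"invalid physical type in metadata: {type_code}")

assert from_thrift_type(2) == "INT64"
```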
[jira] [Created] (PARQUET-1565) [C++] SEGV in FromParquetSchema with corrupt file from PARQUET-1481
Hatem Helal created PARQUET-1565: Summary: [C++] SEGV in FromParquetSchema with corrupt file from PARQUET-1481 Key: PARQUET-1565 URL: https://issues.apache.org/jira/browse/PARQUET-1565 Project: Parquet Issue Type: Bug Components: parquet-cpp Affects Versions: cpp-1.6.0 Reporter: Hatem Helal Assignee: Hatem Helal Calling {{parquet::arrow::FromParquetSchema}} when reading the corrupt file attached to PARQUET-1481 results in a SEGV. I'm not sure when this was introduced but I didn't observe this problem with our app that uses parquet-cpp v1.4.0. Our team caught this while integrating Arrow 0.12.1 into MATLAB. To reproduce this, add the following lines to [parquet-reader.cc|https://github.com/apache/arrow/blob/master/cpp/tools/parquet/parquet-reader.cc#L66], build, and try to read the corrupt file attached to PARQUET-1481. {code:java} const auto parquet_schema = reader->metadata()->schema(); std::shared_ptr<::arrow::Schema> arrow_schema; PARQUET_THROW_NOT_OK(parquet::arrow::FromParquetSchema(parquet_schema, _schema));{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-1540) [C++] Set shared library version for linux and mac builds
[ https://issues.apache.org/jira/browse/PARQUET-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16785475#comment-16785475 ] Hatem Helal commented on PARQUET-1540: -- This is a duplicate of ARROW-3185 > [C++] Set shared library version for linux and mac builds > - > > Key: PARQUET-1540 > URL: https://issues.apache.org/jira/browse/PARQUET-1540 > Project: Parquet > Issue Type: Improvement > Components: parquet-cpp > Reporter: Hatem Helal > Assignee: Hatem Helal >Priority: Minor > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > It looks like this was previously implemented when parquet-cpp was managed as > a separate repo (PARQUET-935). It would be good to add this back now that > parquet-cpp was incorporated into the arrow project. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (PARQUET-1540) [C++] Set shared library version for linux and mac builds
[ https://issues.apache.org/jira/browse/PARQUET-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hatem Helal resolved PARQUET-1540. -- Resolution: Duplicate > [C++] Set shared library version for linux and mac builds > - > > Key: PARQUET-1540 > URL: https://issues.apache.org/jira/browse/PARQUET-1540 > Project: Parquet > Issue Type: Improvement > Components: parquet-cpp > Reporter: Hatem Helal > Assignee: Hatem Helal >Priority: Minor > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > It looks like this was previously implemented when parquet-cpp was managed as > a separate repo (PARQUET-935). It would be good to add this back now that > parquet-cpp was incorporated into the arrow project. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-1540) [C++] Set shared library version for linux and mac builds
[ https://issues.apache.org/jira/browse/PARQUET-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783267#comment-16783267 ] Hatem Helal commented on PARQUET-1540: -- This was discussed on the [mailing list|https://lists.apache.org/thread.html/420bd7b5b4a4bad62bf7d874c998c99204e1633a7d0cf47c00541c61@%3Cdev.arrow.apache.org%3E] and it makes sense for the SO versions to match up for arrow and parquet until an independent parquet C++ release is prepared. > [C++] Set shared library version for linux and mac builds > - > > Key: PARQUET-1540 > URL: https://issues.apache.org/jira/browse/PARQUET-1540 > Project: Parquet > Issue Type: Improvement > Components: parquet-cpp > Reporter: Hatem Helal > Assignee: Hatem Helal >Priority: Minor > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > It looks like this was previously implemented when parquet-cpp was managed as > a separate repo (PARQUET-935). It would be good to add this back now that > parquet-cpp was incorporated into the arrow project. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (PARQUET-1540) [C++] Set shared library version for linux and mac builds
Hatem Helal created PARQUET-1540: Summary: [C++] Set shared library version for linux and mac builds Key: PARQUET-1540 URL: https://issues.apache.org/jira/browse/PARQUET-1540 Project: Parquet Issue Type: Improvement Components: parquet-cpp Reporter: Hatem Helal Assignee: Hatem Helal It looks like this was previously implemented when parquet-cpp was managed as a separate repo (PARQUET-935). It would be good to add this back now that parquet-cpp was incorporated into the arrow project. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-1482) [C++] Unable to read data from parquet file generated with parquetjs
[ https://issues.apache.org/jira/browse/PARQUET-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16733248#comment-16733248 ] Hatem Helal commented on PARQUET-1482: -- [~wesmckinn], my colleague [~rdmello] is working on a fix for this. Could you help us out by adding him as a contributor on this project? Thanks! > [C++] Unable to read data from parquet file generated with parquetjs > > > Key: PARQUET-1482 > URL: https://issues.apache.org/jira/browse/PARQUET-1482 > Project: Parquet > Issue Type: Bug > Components: parquet-cpp > Reporter: Hatem Helal > Assignee: Hatem Helal >Priority: Major > Attachments: feeds1kMicros.parquet > > > See attached file, when I debug: > {{% ./parquet-reader feed1kMicros.parquet}} > I see that the {{scanner->HasNext()}} always returns false. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-1482) [C++] Unable to read data from parquet file generated with parquetjs
[ https://issues.apache.org/jira/browse/PARQUET-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16726884#comment-16726884 ] Hatem Helal commented on PARQUET-1482: -- I think this is a problem in parquet-cpp since I've confirmed that parquet-tools can read this file. > [C++] Unable to read data from parquet file generated with parquetjs > > > Key: PARQUET-1482 > URL: https://issues.apache.org/jira/browse/PARQUET-1482 > Project: Parquet > Issue Type: Bug > Reporter: Hatem Helal > Assignee: Hatem Helal >Priority: Major > Attachments: feeds1kMicros.parquet > > > See attached file, when I debug: > {{% ./parquet-reader feed1kMicros.parquet}} > I see that the {{scanner->HasNext()}} always returns false. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-1481) [C++] SEGV when reading corrupt parquet file
[ https://issues.apache.org/jira/browse/PARQUET-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16726840#comment-16726840 ] Hatem Helal commented on PARQUET-1481: -- Great, thanks for that [~wesmckinn]! > [C++] SEGV when reading corrupt parquet file > > > Key: PARQUET-1481 > URL: https://issues.apache.org/jira/browse/PARQUET-1481 > Project: Parquet > Issue Type: Bug > Reporter: Hatem Helal >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Attachments: corrupt.parquet > > Time Spent: 20m > Remaining Estimate: 0h > > >>> import pyarrow.parquet as pq > >>> pq.read_table('corrupt.parquet') > fish: 'python' terminated by signal SIGSEGV (Address boundary error) > > Stack report from macOS: > > 0 libsystem_kernel.dylib 0x7fff51164cee __psynch_cvwait + 10 > 1 libsystem_pthread.dylib 0x7fff512a1662 _pthread_cond_wait + 732 > 2 libc++.1.dylib 0x7fff4f04acb0 > std::__1::condition_variable::wait(std::__1::unique_lock&) + > 18 > 3 libc++.1.dylib 0x7fff4f04b728 > std::__1::__assoc_sub_state::__sub_wait(std::__1::unique_lock&) > + 46 > 4 libparquet.11.dylib 0x000115512d00 > std::__1::__assoc_state::move() + 48 > 5 libparquet.11.dylib 0x0001154faa15 > parquet::arrow::FileReader::Impl::ReadTable(std::__1::vector std::__1::allocator > const&, std::__1::shared_ptr*) + 1093 > 6 libparquet.11.dylib 0x0001154fb6fe > parquet::arrow::FileReader::Impl::ReadTable(std::__1::shared_ptr*) > + 350 > 7 libparquet.11.dylib 0x0001154fce47 > parquet::arrow::FileReader::ReadTable(std::__1::shared_ptr*) + > 23 > 8 _parquet.so 0x00011598d97b > __pyx_pw_7pyarrow_8_parquet_13ParquetReader_9read_all(_object*, _object*, > _object*) + 1035 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-1481) [C++] SEGV when reading corrupt parquet file
[ https://issues.apache.org/jira/browse/PARQUET-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16726795#comment-16726795 ] Hatem Helal commented on PARQUET-1481: -- Sure, a colleague used a text editor to make a random change in the file that was originally written using parquet-cpp. I'm looking at making this throw an exception / not-ok status code. Does that sound reasonable? > [C++] SEGV when reading corrupt parquet file > > > Key: PARQUET-1481 > URL: https://issues.apache.org/jira/browse/PARQUET-1481 > Project: Parquet > Issue Type: Bug > Reporter: Hatem Helal > Assignee: Hatem Helal >Priority: Major > Attachments: corrupt.parquet > > > >>> import pyarrow.parquet as pq > >>> pq.read_table('corrupt.parquet') > fish: 'python' terminated by signal SIGSEGV (Address boundary error) > > Stack report from macOS: > > 0 libsystem_kernel.dylib 0x7fff51164cee __psynch_cvwait + 10 > 1 libsystem_pthread.dylib 0x7fff512a1662 _pthread_cond_wait + 732 > 2 libc++.1.dylib 0x7fff4f04acb0 > std::__1::condition_variable::wait(std::__1::unique_lock&) + > 18 > 3 libc++.1.dylib 0x7fff4f04b728 > std::__1::__assoc_sub_state::__sub_wait(std::__1::unique_lock&) > + 46 > 4 libparquet.11.dylib 0x000115512d00 > std::__1::__assoc_state::move() + 48 > 5 libparquet.11.dylib 0x0001154faa15 > parquet::arrow::FileReader::Impl::ReadTable(std::__1::vector std::__1::allocator > const&, std::__1::shared_ptr*) + 1093 > 6 libparquet.11.dylib 0x0001154fb6fe > parquet::arrow::FileReader::Impl::ReadTable(std::__1::shared_ptr*) > + 350 > 7 libparquet.11.dylib 0x0001154fce47 > parquet::arrow::FileReader::ReadTable(std::__1::shared_ptr*) + > 23 > 8 _parquet.so 0x00011598d97b > __pyx_pw_7pyarrow_8_parquet_13ParquetReader_9read_all(_object*, _object*, > _object*) + 1035 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-1481) [C++] SEGV when reading corrupt parquet file
[ https://issues.apache.org/jira/browse/PARQUET-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16726757#comment-16726757 ]

Hatem Helal commented on PARQUET-1481:
--------------------------------------

Managed to reproduce this with a simple test using the latest apache arrow. Slightly nicer stack trace:

{{F1220 13:29:51.966117 2315707200 record_reader.cc:854] Check failed: false}}
{{*** Check failure stack trace: ***}}
{{ @ 0x1083c217a google::LogMessage::Fail()}}
{{ @ 0x1083c01de google::LogMessage::SendToLog()}}
{{ @ 0x1083c0e1f google::LogMessage::Flush()}}
{{ @ 0x1083c0c59 google::LogMessage::~LogMessage()}}
{{ @ 0x1083c0f15 google::LogMessage::~LogMessage()}}
{{ @ 0x10825d45c arrow::util::ArrowLog::~ArrowLog()}}
{{ @ 0x10825d4a5 arrow::util::ArrowLog::~ArrowLog()}}
{{ @ 0x107d5d936 parquet::internal::RecordReader::Make()}}
{{ @ 0x107cf8abd parquet::arrow::PrimitiveImpl::PrimitiveImpl()}}
{{ @ 0x107c69acd parquet::arrow::PrimitiveImpl::PrimitiveImpl()}}
{{ @ 0x107c68ba8 parquet::arrow::FileReader::Impl::GetColumn()}}
{{ @ 0x107c6b790 parquet::arrow::FileReader::Impl::GetReaderForNode()}}
{{ @ 0x107c6cb3d parquet::arrow::FileReader::Impl::ReadSchemaField()}}
{{ @ 0x107c79d60 parquet::arrow::FileReader::Impl::ReadTable()::$_1::operator()()}}
{{ @ 0x107c764ef parquet::arrow::FileReader::Impl::ReadTable()}}
{{ @ 0x107c7a9f5 parquet::arrow::FileReader::Impl::ReadTable()}}
{{ @ 0x107c7f5f7 parquet::arrow::FileReader::ReadTable()}}
{{ @ 0x107c6176c main}}

> [C++] SEGV when reading corrupt parquet file
> --------------------------------------------
>
> Key: PARQUET-1481
> URL: https://issues.apache.org/jira/browse/PARQUET-1481
> Project: Parquet
> Issue Type: Bug
> Reporter: Hatem Helal
> Assignee: Hatem Helal
> Priority: Major
> Attachments: corrupt.parquet
>
> >>> import pyarrow.parquet as pq
> >>> pq.read_table('corrupt.parquet')
> fish: 'python' terminated by signal SIGSEGV (Address boundary error)
>
> Stack report from macOS:
>
> 0 libsystem_kernel.dylib 0x7fff51164cee __psynch_cvwait + 10
> 1 libsystem_pthread.dylib 0x7fff512a1662 _pthread_cond_wait + 732
> 2 libc++.1.dylib 0x7fff4f04acb0 std::__1::condition_variable::wait(std::__1::unique_lock&) + 18
> 3 libc++.1.dylib 0x7fff4f04b728 std::__1::__assoc_sub_state::__sub_wait(std::__1::unique_lock&) + 46
> 4 libparquet.11.dylib 0x000115512d00 std::__1::__assoc_state::move() + 48
> 5 libparquet.11.dylib 0x0001154faa15 parquet::arrow::FileReader::Impl::ReadTable(std::__1::vector std::__1::allocator const&, std::__1::shared_ptr*) + 1093
> 6 libparquet.11.dylib 0x0001154fb6fe parquet::arrow::FileReader::Impl::ReadTable(std::__1::shared_ptr*) + 350
> 7 libparquet.11.dylib 0x0001154fce47 parquet::arrow::FileReader::ReadTable(std::__1::shared_ptr*) + 23
> 8 _parquet.so 0x00011598d97b __pyx_pw_7pyarrow_8_parquet_13ParquetReader_9read_all(_object*, _object*, _object*) + 1035

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-1481) [C++] SEGV when reading corrupt parquet file
[ https://issues.apache.org/jira/browse/PARQUET-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hatem Helal updated PARQUET-1481: - Attachment: corrupt.parquet > [C++] SEGV when reading corrupt parquet file > > > Key: PARQUET-1481 > URL: https://issues.apache.org/jira/browse/PARQUET-1481 > Project: Parquet > Issue Type: Bug > Reporter: Hatem Helal > Assignee: Hatem Helal >Priority: Major > Attachments: corrupt.parquet > > > >>> import pyarrow.parquet as pq > >>> pq.read_table('corrupt.parquet') > fish: 'python' terminated by signal SIGSEGV (Address boundary error) > > Stack report from macOS: > > 0 libsystem_kernel.dylib 0x7fff51164cee __psynch_cvwait + 10 > 1 libsystem_pthread.dylib 0x7fff512a1662 _pthread_cond_wait + 732 > 2 libc++.1.dylib 0x7fff4f04acb0 > std::__1::condition_variable::wait(std::__1::unique_lock&) + > 18 > 3 libc++.1.dylib 0x7fff4f04b728 > std::__1::__assoc_sub_state::__sub_wait(std::__1::unique_lock&) > + 46 > 4 libparquet.11.dylib 0x000115512d00 > std::__1::__assoc_state::move() + 48 > 5 libparquet.11.dylib 0x0001154faa15 > parquet::arrow::FileReader::Impl::ReadTable(std::__1::vector std::__1::allocator > const&, std::__1::shared_ptr*) + 1093 > 6 libparquet.11.dylib 0x0001154fb6fe > parquet::arrow::FileReader::Impl::ReadTable(std::__1::shared_ptr*) > + 350 > 7 libparquet.11.dylib 0x0001154fce47 > parquet::arrow::FileReader::ReadTable(std::__1::shared_ptr*) + > 23 > 8 _parquet.so 0x00011598d97b > __pyx_pw_7pyarrow_8_parquet_13ParquetReader_9read_all(_object*, _object*, > _object*) + 1035 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (PARQUET-1481) [C++] SEGV when reading corrupt parquet file
Hatem Helal created PARQUET-1481: Summary: [C++] SEGV when reading corrupt parquet file Key: PARQUET-1481 URL: https://issues.apache.org/jira/browse/PARQUET-1481 Project: Parquet Issue Type: Bug Reporter: Hatem Helal Assignee: Hatem Helal >>> import pyarrow.parquet as pq >>> pq.read_table('corrupt.parquet') fish: 'python' terminated by signal SIGSEGV (Address boundary error) Stack report from macOS: 0 libsystem_kernel.dylib 0x7fff51164cee __psynch_cvwait + 10 1 libsystem_pthread.dylib 0x7fff512a1662 _pthread_cond_wait + 732 2 libc++.1.dylib 0x7fff4f04acb0 std::__1::condition_variable::wait(std::__1::unique_lock&) + 18 3 libc++.1.dylib 0x7fff4f04b728 std::__1::__assoc_sub_state::__sub_wait(std::__1::unique_lock&) + 46 4 libparquet.11.dylib 0x000115512d00 std::__1::__assoc_state::move() + 48 5 libparquet.11.dylib 0x0001154faa15 parquet::arrow::FileReader::Impl::ReadTable(std::__1::vector > const&, std::__1::shared_ptr*) + 1093 6 libparquet.11.dylib 0x0001154fb6fe parquet::arrow::FileReader::Impl::ReadTable(std::__1::shared_ptr*) + 350 7 libparquet.11.dylib 0x0001154fce47 parquet::arrow::FileReader::ReadTable(std::__1::shared_ptr*) + 23 8 _parquet.so 0x00011598d97b __pyx_pw_7pyarrow_8_parquet_13ParquetReader_9read_all(_object*, _object*, _object*) + 1035 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
Re: parquet-arrow estimate file size
I think, if I've understood the problem correctly, you could use the parquet::arrow::FileWriter:

https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.h#L128

The basic pattern is to use an object to manage the FileWriter lifetime, call the WriteTable method for each row group, and close it when you are done. My understanding is that each call to WriteTable will append a new row group, which should allow you to incrementally write a larger-than-memory dataset. I realize now that I haven't tested this myself, so it would be good to double-check this with someone more experienced with the parquet-cpp APIs.

On 12/11/18, 12:54 AM, "Jiayuan Chen" wrote:

    Thanks for the suggestion, will do.

    Since such a high-level API is not yet implemented in the parquet-cpp project, I have turned back to the newly introduced low-level API that calculates the Parquet file size when adding data to the column writers. I have another question on that part: is there any sample code or advice I can follow to stream a Parquet file on a per-row-group basis? In other words, to restrict memory usage but still create a big enough Parquet file, I would like to create relatively small row groups in memory using InMemoryOutputStream(), and dump the buffer contents to my external stream after completing each row group, until a big file with several row groups is finished. However, my attempts to manipulate the underlying arrow::Buffer have failed: the pages starting from the second row group are unreadable.

    Thanks!

    On Mon, Dec 10, 2018 at 3:53 PM Wes McKinney wrote:
    > hi Jiayuan,
    >
    > To your question
    >
    > > Would this be in the roadmap?
    >
    > I doubt there would be any objections to adding this feature to the
    > Arrow writer API -- please feel free to open a JIRA issue to describe
    > how the API might work in C++. Note there is no formal roadmap in this
    > project.
    >
    > - Wes
    > On Mon, Dec 10, 2018 at 5:31 PM Jiayuan Chen wrote:
    > >
    > > Thanks for the Python solution.
However, is there a solution in C++ that > I > > can create such Parquet file with only in-memory buffer, using > parquet-cpp > > library? > > > > On Mon, Dec 10, 2018 at 3:23 PM Lee, David > wrote: > > > > > Resending.. Somehow I lost some line feeds in the previous reply.. > > > > > > import os > > > import pyarrow.parquet as pq > > > import glob as glob > > > > > > max_target_size = 134217728 > > > target_size = max_target_size * .95 > > > # Directory where parquet files are saved > > > working_directory = '/tmp/test' > > > files_dict = dict() > > > files = glob.glob(os.path.join(working_directory, "*.parquet")) > > > files.sort() > > > for file in files: > > > files_dict[file] = os.path.getsize(file) > > > print("Merging parquet files") > > > temp_file = os.path.join(working_directory, "temp.parquet") > > > file_no = 0 > > > for file in files: > > > if file in files_dict: > > > file_no = file_no + 1 > > > file_name = os.path.join(working_directory, > str(file_no).zfill(4) > > > + ".parquet") > > > print("Saving to parquet file " + file_name) > > > # Just rename file if the file size is in target range > > > if files_dict[file] > target_size: > > > del files_dict[file] > > > os.rename(file, file_name) > > > continue > > > merge_list = list() > > > file_size = 0 > > > # Find files to merge together which add up to less than 128 > megs > > > for k, v in files_dict.items(): > > > if file_size + v <= max_target_size: > > > print("Adding file " + k + " to merge list") > > > merge_list.append(k) > > > file_size = file_size + v > > > # Just rename file if there is only one file to merge > > > if len(merge_list) == 1: > > > del files_dict[merge_list[0]] > > > os.rename(merge_list[0], file_name) > > > continue > > > # Merge smaller files into one large file. Read row groups from > > > each file and add them to the new file. 
> > > schema = pq.read_schema(file) > > > print("Saving to new parquet file") > > > writer = pq.ParquetWriter(temp_file, schema=schema, > > > use_dictionary=True, compression='snappy') > > > for merge in merge_list: > > > parquet_file = pq.ParquetFile(merge) > > > print("Writing " + merge + " to new parquet file") > > > for i in range(parquet_file.num_row_groups): > > >
Re: Regarding Apache Parquet Project
Hi Arjit, I'm new around here too but interested to hear what the others on this list have to say. For C++ development, I'd recommend reading through the examples: https://github.com/apache/arrow/tree/master/cpp/examples/parquet and the command-line tools: https://github.com/apache/arrow/tree/master/cpp/tools/parquet Both were helpful for getting up to speed on the main APIs. I use an IDE (Xcode, but it doesn't matter which) to debug and step through the code and try to understand the internal dependencies. The setup for Xcode was a bit manual, but let me know if there is interest and I can investigate automating it so that I can share it with others. Hope this helps, Hatem On 12/11/18, 5:39 AM, "Arjit Yadav" wrote: Hi all, I am new to this project. While I have used Parquet in the past, I want to know how it works internally and look up relevant documentation and code in order to start contributing to the project. Please let me know of any available resources in this regard. Regards, Arjit Yadav
[jira] [Created] (PARQUET-1473) [C++] Add helper function that converts ParquetVersion to human-friendly string
Hatem Helal created PARQUET-1473: Summary: [C++] Add helper function that converts ParquetVersion to human-friendly string Key: PARQUET-1473 URL: https://issues.apache.org/jira/browse/PARQUET-1473 Project: Parquet Issue Type: Improvement Reporter: Hatem Helal Assignee: Hatem Helal I noticed this while working on ARROW-3564: the parquet-reader utility prints a line like: *Version: 0* which corresponds to the PARQUET_1_0 enum value. I couldn't find anything that obviously did this in the parquet-cpp code base. It would be good if there were a function that could map from the ParquetVersion enum to a human-friendly string. e.g: Version: 1.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
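For illustration, the mapping this ticket requests could be as small as a lookup table. The following is a hypothetical Python sketch only (the real helper would live in the C++ code base; the constants mirror the ParquetVersion enum values mentioned in the ticket):

```python
# Hypothetical sketch of the requested helper: map ParquetVersion enum
# values (mirrored here as plain constants) to human-friendly version
# strings, falling back to the raw value for unknown versions.
PARQUET_1_0 = 0
PARQUET_2_0 = 1

_VERSION_NAMES = {PARQUET_1_0: "1.0", PARQUET_2_0: "2.0"}

def parquet_version_to_string(version):
    return _VERSION_NAMES.get(version, str(version))

print(parquet_version_to_string(PARQUET_1_0))  # "1.0" instead of "0"
```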
[jira] [Updated] (PARQUET-1458) parquet::CompressionToString not recognizing brotli compression
[ https://issues.apache.org/jira/browse/PARQUET-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hatem Helal updated PARQUET-1458: - Labels: (was: pull) > parquet::CompressionToString not recognizing brotli compression > --- > > Key: PARQUET-1458 > URL: https://issues.apache.org/jira/browse/PARQUET-1458 > Project: Parquet > Issue Type: Bug > Components: parquet-cpp > Reporter: Hatem Helal >Priority: Trivial > > It looks like we just need to add a case to handle the brotli codec > [here|[https://github.com/apache/arrow/blob/9b4cd9c03ed9365f8e235f296caa166ea692c98f/cpp/src/parquet/types.cc#L122]] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-1458) parquet::CompressionToString not recognizing brotli compression
[ https://issues.apache.org/jira/browse/PARQUET-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hatem Helal updated PARQUET-1458: - Labels: pull (was: ) > parquet::CompressionToString not recognizing brotli compression > --- > > Key: PARQUET-1458 > URL: https://issues.apache.org/jira/browse/PARQUET-1458 > Project: Parquet > Issue Type: Bug > Components: parquet-cpp > Reporter: Hatem Helal >Priority: Trivial > > It looks like we just need to add a case to handle the brotli codec > [here|[https://github.com/apache/arrow/blob/9b4cd9c03ed9365f8e235f296caa166ea692c98f/cpp/src/parquet/types.cc#L122]] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-1458) parquet::CompressionToString not recognizing brotli compression
[ https://issues.apache.org/jira/browse/PARQUET-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hatem Helal updated PARQUET-1458: - Priority: Trivial (was: Major) > parquet::CompressionToString not recognizing brotli compression > --- > > Key: PARQUET-1458 > URL: https://issues.apache.org/jira/browse/PARQUET-1458 > Project: Parquet > Issue Type: Bug > Components: parquet-cpp > Reporter: Hatem Helal >Priority: Trivial > > It looks like we just need to add a case to handle the brotli codec > [here|[https://github.com/apache/arrow/blob/9b4cd9c03ed9365f8e235f296caa166ea692c98f/cpp/src/parquet/types.cc#L122]] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-1458) parquet::CompressionToString not recognizing brotli compression
[ https://issues.apache.org/jira/browse/PARQUET-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hatem Helal updated PARQUET-1458: - Description: It looks like we just need to add a case to handle the brotli codec [here|[https://github.com/apache/arrow/blob/9b4cd9c03ed9365f8e235f296caa166ea692c98f/cpp/src/parquet/types.cc#L122]] (was: It looks like we just need to add a case to handle the brotli codec [here|]https://github.com/apache/arrow/blob/9b4cd9c03ed9365f8e235f296caa166ea692c98f/cpp/src/parquet/types.cc#L122]) > parquet::CompressionToString not recognizing brotli compression > --- > > Key: PARQUET-1458 > URL: https://issues.apache.org/jira/browse/PARQUET-1458 > Project: Parquet > Issue Type: Bug > Components: parquet-cpp > Reporter: Hatem Helal >Priority: Major > > It looks like we just need to add a case to handle the brotli codec > [here|[https://github.com/apache/arrow/blob/9b4cd9c03ed9365f8e235f296caa166ea692c98f/cpp/src/parquet/types.cc#L122]] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-1458) parquet::CompressionToString not recognizing brotli compression
[ https://issues.apache.org/jira/browse/PARQUET-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686773#comment-16686773 ] Hatem Helal commented on PARQUET-1458: -- Looking into fixing this. > parquet::CompressionToString not recognizing brotli compression > --- > > Key: PARQUET-1458 > URL: https://issues.apache.org/jira/browse/PARQUET-1458 > Project: Parquet > Issue Type: Bug > Components: parquet-cpp > Reporter: Hatem Helal >Priority: Major > > It looks like we just need to add a case to handle the brotli codec > [here|]https://github.com/apache/arrow/blob/9b4cd9c03ed9365f8e235f296caa166ea692c98f/cpp/src/parquet/types.cc#L122] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (PARQUET-1458) parquet::CompressionToString not recognizing brotli compression
Hatem Helal created PARQUET-1458: Summary: parquet::CompressionToString not recognizing brotli compression Key: PARQUET-1458 URL: https://issues.apache.org/jira/browse/PARQUET-1458 Project: Parquet Issue Type: Bug Components: parquet-cpp Reporter: Hatem Helal It looks like we just need to add a case to handle the brotli codec [here|]https://github.com/apache/arrow/blob/9b4cd9c03ed9365f8e235f296caa166ea692c98f/cpp/src/parquet/types.cc#L122] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
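The fix the ticket describes is adding a case for the brotli codec to the codec-to-string switch in types.cc. A hypothetical Python sketch of that logic (enum values mirrored as plain constants; not the actual C++ code):

```python
# Hypothetical sketch of parquet::CompressionToString's logic as
# described in PARQUET-1458: a codec-to-name mapping that previously
# fell through to "UNKNOWN" for brotli.
UNCOMPRESSED, SNAPPY, GZIP, LZO, BROTLI = range(5)

def compression_to_string(codec):
    names = {
        UNCOMPRESSED: "UNCOMPRESSED",
        SNAPPY: "SNAPPY",
        GZIP: "GZIP",
        LZO: "LZO",
        BROTLI: "BROTLI",  # the previously missing case
    }
    return names.get(codec, "UNKNOWN")

print(compression_to_string(BROTLI))  # "BROTLI" rather than "UNKNOWN"
```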