Re: Interpretation of PageHeader uncompressed_page_size
Thanks Gabor, that is very helpful to know.

Best wishes,
Hatem

On Wed, Mar 25, 2020 at 2:15 PM Gabor Szadovszky wrote:
> Hi Hatem,
>
> I agree that the levels should be included, as per the specification. I
> checked the implementation in parquet-mr as well, and it also includes the
> levels in both the uncompressed and compressed values.
>
> Cheers,
> Gabor
>
> On Wed, Mar 25, 2020 at 1:02 PM Hatem Helal wrote:
> >
> > I've recently done some work on adding support for DataPageV2 to the C++
> > code base [1]. A question came up as to whether uncompressed_page_size
> > includes the levels, which are not compressed in the V2 format anyway.
> >
> > My understanding of the thrift specification [2] is that the levels are
> > included in this size. Can someone help confirm whether this
> > interpretation is correct?
> >
> > Thanks,
> >
> > Hatem
> >
> > [1] https://github.com/apache/arrow/pull/6481
> > [2] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L623
Interpretation of PageHeader uncompressed_page_size
I've recently done some work on adding support for DataPageV2 to the C++ code base [1]. A question came up as to whether uncompressed_page_size includes the levels, which are not compressed in the V2 format anyway. My understanding of the thrift specification [2] is that the levels are included in this size. Can someone help confirm whether this interpretation is correct?

Thanks,

Hatem

[1] https://github.com/apache/arrow/pull/6481
[2] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L623
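As a back-of-the-envelope check of this interpretation, the arithmetic can be sketched as follows (illustrative only: the field names echo the thrift spec, but this is not the parquet-cpp implementation):

```python
# DataPageV2 stores repetition/definition levels uncompressed, but the spec's
# uncompressed_page_size still counts them alongside the uncompressed values.
def uncompressed_page_size(rep_levels_len, def_levels_len, values_len):
    # Levels are never compressed in V2, yet they are included in this total.
    return rep_levels_len + def_levels_len + values_len

def compressed_page_size(rep_levels_len, def_levels_len, compressed_values_len):
    # Only the values section is compressed; the raw level bytes are added as-is.
    return rep_levels_len + def_levels_len + compressed_values_len

# Example: 40 + 60 bytes of levels, 4096 bytes of values compressed to 1024.
assert uncompressed_page_size(40, 60, 4096) == 4196
assert compressed_page_size(40, 60, 1024) == 1124
```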
[jira] [Assigned] (PARQUET-458) [C++] Implement support for DataPageV2
[ https://issues.apache.org/jira/browse/PARQUET-458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hatem Helal reassigned PARQUET-458:
-----------------------------------

Assignee: Hatem Helal

> [C++] Implement support for DataPageV2
> --------------------------------------
>
> Key: PARQUET-458
> URL: https://issues.apache.org/jira/browse/PARQUET-458
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-cpp
> Reporter: Wes McKinney
> Assignee: Hatem Helal
> Priority: Minor
> Labels: pull-request-available
> Fix For: cpp-1.6.0
>
> Time Spent: 10m
> Remaining Estimate: 0h

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-1639) [C++] Remove regex dependency for parsing ApplicationVersion
Hatem Helal created PARQUET-1639:

Summary: [C++] Remove regex dependency for parsing ApplicationVersion
Key: PARQUET-1639
URL: https://issues.apache.org/jira/browse/PARQUET-1639
Project: Parquet
Issue Type: Improvement
Components: parquet-cpp
Reporter: Hatem Helal

This is a follow-up task to ARROW-6096. As [~fsaintjacques] points out, the parsing can be done in a single pass without using the regex library. See discussion: https://github.com/apache/arrow/pull/4985#issuecomment-517393619

-- This message was sent by Atlassian JIRA (v7.6.14#76016)
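A rough sketch of the single-pass idea (the created_by format shown is an assumption based on typical parquet-mr strings, and parse_created_by is a hypothetical helper, not parquet-cpp's ApplicationVersion API):

```python
def parse_created_by(created_by):
    """Parse strings like 'parquet-mr version 1.8.0 (build abcd123)'
    with plain left-to-right string scanning -- no regex library needed."""
    app, sep, rest = created_by.partition(" version ")
    if not sep:
        return app.strip(), None, None
    rest = rest.strip()
    build = None
    if "(build " in rest:
        version_part, _, build_part = rest.partition("(build ")
        build = build_part.rstrip(") ")
        rest = version_part.strip()
    # Tolerate pre-release suffixes such as "1.8.0-SNAPSHOT".
    parts = (rest.split("-")[0].split(".") + ["0", "0"])[:3]
    return app, tuple(int(p) for p in parts), build

assert parse_created_by("parquet-mr version 1.8.0 (build abcd123)") == \
    ("parquet-mr", (1, 8, 0), "abcd123")
```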
[jira] [Resolved] (PARQUET-1623) [C++] Invalid memory access with a magic number of records
[ https://issues.apache.org/jira/browse/PARQUET-1623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hatem Helal resolved PARQUET-1623. -- Resolution: Fixed Issue resolved by pull request 4857 [https://github.com/apache/arrow/pull/4857] > [C++] Invalid memory access with a magic number of records > -- > > Key: PARQUET-1623 > URL: https://issues.apache.org/jira/browse/PARQUET-1623 > Project: Parquet > Issue Type: Bug > Components: parquet-cpp > Reporter: Hatem Helal > Assignee: Hatem Helal >Priority: Minor > Labels: pull-request-available > Fix For: cpp-1.6.0 > > Time Spent: 50m > Remaining Estimate: 0h > > I've observed a crash due to an invalid memory access when trying to read a > parquet file that I created with a single column of double-precision values > that occupies a fixed amount of memory. After some experimentation I found > that the following unittest added to {{arrow-reader-writer-test.cc}} will > fail when run in an ASAN build. > {code:java} > TEST(TestArrowReadWrite, MultiDataPageMagicNumber) { > const int num_rows = 262144; // 2^18 > std::shared_ptr table; > ASSERT_NO_FATAL_FAILURE(MakeDoubleTable(1, num_rows, 1, )); > std::shared_ptr result; > ASSERT_NO_FATAL_FAILURE( > DoSimpleRoundtrip(table, false, table->num_rows(), {}, )); > ASSERT_NO_FATAL_FAILURE(::arrow::AssertTablesEqual(*table, *result)); > }{code} > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (PARQUET-1623) [C++] Invalid memory access with a magic number of records
[ https://issues.apache.org/jira/browse/PARQUET-1623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883150#comment-16883150 ] Hatem Helal commented on PARQUET-1623: -- Yes, will post one soon. Working on a unittest that doesn't need the full machinery of the parquet-arrow-test > [C++] Invalid memory access with a magic number of records > -- > > Key: PARQUET-1623 > URL: https://issues.apache.org/jira/browse/PARQUET-1623 > Project: Parquet > Issue Type: Bug > Components: parquet-cpp > Reporter: Hatem Helal > Assignee: Hatem Helal >Priority: Minor > Fix For: cpp-1.6.0 > > > I've observed a crash due to an invalid memory access when trying to read a > parquet file that I created with a single column of double-precision values > that occupies a fixed amount of memory. After some experimentation I found > that the following unittest added to {{arrow-reader-writer-test.cc}} will > fail when run in an ASAN build. > {code:java} > TEST(TestArrowReadWrite, MultiDataPageMagicNumber) { > const int num_rows = 262144; // 2^18 > std::shared_ptr table; > ASSERT_NO_FATAL_FAILURE(MakeDoubleTable(1, num_rows, 1, )); > std::shared_ptr result; > ASSERT_NO_FATAL_FAILURE( > DoSimpleRoundtrip(table, false, table->num_rows(), {}, )); > ASSERT_NO_FATAL_FAILURE(::arrow::AssertTablesEqual(*table, *result)); > }{code} > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (PARQUET-1623) [C++] Invalid memory access with a magic number of records
[ https://issues.apache.org/jira/browse/PARQUET-1623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883126#comment-16883126 ] Hatem Helal commented on PARQUET-1623: -- I think I might understand what is happening here: when there is exactly a power of two number of rows we end up not having any padding in the bit-packed validity vector. After some experimenting, I found that this problem is only present when a column is serialized as multiple data pages. The default page size is specified here: [https://github.com/apache/arrow/blob/master/cpp/src/parquet/properties.h#L71] I think the problem lies in how the {{BitmapWriter}} is initialized here: [https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_reader.h#L188] I think the length of {{valid_bits_writer}} should be initialized to the current number of definition levels that the reader is trying to read from the current page. > [C++] Invalid memory access with a magic number of records > -- > > Key: PARQUET-1623 > URL: https://issues.apache.org/jira/browse/PARQUET-1623 > Project: Parquet > Issue Type: Bug > Components: parquet-cpp > Reporter: Hatem Helal > Assignee: Hatem Helal >Priority: Minor > Fix For: cpp-1.6.0 > > > I've observed a crash due to an invalid memory access when trying to read a > parquet file that I created with a single column of double-precision values > that occupies a fixed amount of memory. After some experimentation I found > that the following unittest added to {{arrow-reader-writer-test.cc}} will > fail when run in an ASAN build. 
> {code:java} > TEST(TestArrowReadWrite, MultiDataPageMagicNumber) { > const int num_rows = 262144; // 2^18 > std::shared_ptr table; > ASSERT_NO_FATAL_FAILURE(MakeDoubleTable(1, num_rows, 1, )); > std::shared_ptr result; > ASSERT_NO_FATAL_FAILURE( > DoSimpleRoundtrip(table, false, table->num_rows(), {}, )); > ASSERT_NO_FATAL_FAILURE(::arrow::AssertTablesEqual(*table, *result)); > }{code} > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
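The arithmetic behind this diagnosis can be sketched in a few lines (illustrative only, not parquet-cpp code; the 1 MiB default data page size is an assumption taken from the properties.h link above):

```python
# Sketch of why 2^18 rows of doubles is a "magic number".
num_rows = 262144            # 2^18
bytes_per_value = 8          # double precision
page_size = 1024 * 1024      # assumed default data page size (1 MiB)

# The column spans multiple data pages...
num_pages = -(-num_rows * bytes_per_value // page_size)  # ceiling division
assert num_pages == 2

# ...and the bit-packed validity bitmap for the full column fits *exactly*,
# with no padding bits left in the final byte, so a BitmapWriter sized for
# the whole column rather than the current page steps past the allocation.
bitmap_bytes = (num_rows + 7) // 8
assert bitmap_bytes == 32768 and num_rows % 8 == 0
```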
[jira] [Commented] (PARQUET-1623) [C++] Invalid memory access with a magic number of records
[ https://issues.apache.org/jira/browse/PARQUET-1623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883112#comment-16883112 ] Hatem Helal commented on PARQUET-1623: -- Here is the ASAN stack for the test: [https://gist.github.com/hatemhelal/ca0f6ef21f7aee0ff71afe18fbd52f92] > [C++] Invalid memory access with a magic number of records > -- > > Key: PARQUET-1623 > URL: https://issues.apache.org/jira/browse/PARQUET-1623 > Project: Parquet > Issue Type: Bug > Components: parquet-cpp > Reporter: Hatem Helal > Assignee: Hatem Helal >Priority: Minor > > I've observed a crash due to an invalid memory access when trying to read a > parquet file that I created with a single column of double-precision values > that occupies a fixed amount of memory. After some experimentation I found > that the following unittest added to {{arrow-reader-writer-test.cc}} will > fail when run in an ASAN build. > {code:java} > TEST(TestArrowReadWrite, MultiDataPageMagicNumber) { > const int num_rows = 262144; // 2^18 > std::shared_ptr table; > ASSERT_NO_FATAL_FAILURE(MakeDoubleTable(1, num_rows, 1, )); > std::shared_ptr result; > ASSERT_NO_FATAL_FAILURE( > DoSimpleRoundtrip(table, false, table->num_rows(), {}, )); > ASSERT_NO_FATAL_FAILURE(::arrow::AssertTablesEqual(*table, *result)); > }{code} > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (PARQUET-1623) [C++] Invalid memory access with a magic number of records
Hatem Helal created PARQUET-1623:

Summary: [C++] Invalid memory access with a magic number of records
Key: PARQUET-1623
URL: https://issues.apache.org/jira/browse/PARQUET-1623
Project: Parquet
Issue Type: Bug
Components: parquet-cpp
Reporter: Hatem Helal
Assignee: Hatem Helal

I've observed a crash due to an invalid memory access when trying to read a parquet file that I created with a single column of double-precision values that occupies a fixed amount of memory. After some experimentation I found that the following unittest added to {{arrow-reader-writer-test.cc}} will fail when run in an ASAN build.

{code:java}
TEST(TestArrowReadWrite, MultiDataPageMagicNumber) {
  const int num_rows = 262144;  // 2^18
  std::shared_ptr<Table> table;
  ASSERT_NO_FATAL_FAILURE(MakeDoubleTable(1, num_rows, 1, &table));
  std::shared_ptr<Table> result;
  ASSERT_NO_FATAL_FAILURE(
      DoSimpleRoundtrip(table, false, table->num_rows(), {}, &result));
  ASSERT_NO_FATAL_FAILURE(::arrow::AssertTablesEqual(*table, *result));
}{code}

-- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (PARQUET-1169) [C++] Segment fault when using NextBatch of parquet::arrow::ColumnReader in parquet-cpp
[ https://issues.apache.org/jira/browse/PARQUET-1169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16876821#comment-16876821 ]

Hatem Helal commented on PARQUET-1169:
--------------------------------------

[~frankfang], could you try this again using arrow master? I think this might have been resolved by ARROW-5608.

> [C++] Segment fault when using NextBatch of parquet::arrow::ColumnReader in parquet-cpp
> ---------------------------------------------------------------------------------------
>
> Key: PARQUET-1169
> URL: https://issues.apache.org/jira/browse/PARQUET-1169
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cpp
> Reporter: Jian Fang
> Priority: Major
> Fix For: cpp-1.5.0
>
> Attachments: test.parquet
>
> When I run the code below, I consistently get a segment fault; not sure whether this is a bug or I did something wrong. Could anyone here help me take a look?
> {code:c++}
> #include <iostream>
> #include <memory>
> #include "arrow/array.h"
> #include "arrow/io/file.h"
> #include "arrow/test-util.h"
> #include "parquet/arrow/reader.h"
> using arrow::Array;
> using arrow::default_memory_pool;
> using arrow::io::FileMode;
> using arrow::io::MemoryMappedFile;
> using parquet::arrow::ColumnReader;
> using parquet::arrow::FileReader;
> using parquet::arrow::OpenFile;
> int main(int argc, char** argv) {
>   if (argc > 1) {
>     std::string file_name = argv[1];
>     std::shared_ptr<MemoryMappedFile> file;
>     ABORT_NOT_OK(MemoryMappedFile::Open(file_name, FileMode::READ, &file));
>     std::unique_ptr<FileReader> file_reader;
>     ABORT_NOT_OK(OpenFile(file, default_memory_pool(), &file_reader));
>     std::unique_ptr<ColumnReader> column_reader;
>     ABORT_NOT_OK(file_reader->GetColumn(0, &column_reader));
>     std::shared_ptr<Array> array1;
>     ABORT_NOT_OK(column_reader->NextBatch(1, &array1));
>     std::cout << "length " << array1->length() << std::endl;
>     std::shared_ptr<Array> array2;
>     // segment fault
>     ABORT_NOT_OK(column_reader->NextBatch(1, &array2));
>     std::cout << "length " << array2->length() << std::endl;
>   }
>   return 0;
> }
> {code}
> Command to compile this program:
> {code}
> g++ test.c -I/usr/local/include/arrow -I/usr/local/include/parquet --std=c++11 -lparquet -larrow -lgtest -o parquet_test
> {code}
> Command to run the program:
> {code}
> ./parquet_test test.parquet
> {code}

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-1565) [C++] SEGV in FromParquetSchema with corrupt file from PARQUET-1481
[ https://issues.apache.org/jira/browse/PARQUET-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16821152#comment-16821152 ]

Hatem Helal commented on PARQUET-1565:
--------------------------------------

This is a somewhat esoteric problem, but the fix seems to be to extend [this switch case|https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/schema.cc#L174] to handle the corrupted thrift metadata.

> [C++] SEGV in FromParquetSchema with corrupt file from PARQUET-1481
> -------------------------------------------------------------------
>
> Key: PARQUET-1565
> URL: https://issues.apache.org/jira/browse/PARQUET-1565
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cpp
> Affects Versions: cpp-1.6.0
> Reporter: Hatem Helal
> Assignee: Hatem Helal
> Priority: Minor
>
> Calling {{parquet::arrow::FromParquetSchema}} when reading the corrupt file attached to PARQUET-1481 results in a SEGV. I'm not sure when this was introduced, but I didn't observe this problem with our app that uses parquet-cpp v1.4.0. Our team caught this while integrating Arrow 0.12.1 into MATLAB.
> To reproduce this, add the following lines to [parquet-reader.cc|https://github.com/apache/arrow/blob/master/cpp/tools/parquet/parquet-reader.cc#L66], build, and try to read the corrupt file attached to PARQUET-1481.
> {code:java}
> const auto parquet_schema = reader->metadata()->schema();
> std::shared_ptr<::arrow::Schema> arrow_schema;
> PARQUET_THROW_NOT_OK(parquet::arrow::FromParquetSchema(parquet_schema, &arrow_schema));{code}

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
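The shape of such a fix can be sketched as follows (a hypothetical Python stand-in for the C++ switch; the numeric codes follow parquet.thrift's physical Type enum, but ParquetError and from_thrift_type are illustrative names, not parquet-cpp API):

```python
class ParquetError(Exception):
    """Stand-in for a catchable error status."""

# Physical type codes as declared in parquet.thrift's Type enum.
KNOWN_TYPES = {0: "BOOLEAN", 1: "INT32", 2: "INT64", 3: "INT96",
               4: "FLOAT", 5: "DOUBLE", 6: "BYTE_ARRAY",
               7: "FIXED_LEN_BYTE_ARRAY"}

def from_thrift_type(type_code):
    # Dispatch on the metadata value; a corrupted file can carry a code
    # outside the enum, which should fail cleanly rather than SEGV.
    try:
        return KNOWN_TYPES[type_code]
    except KeyError:
        raise ParquetError(f"invalid physical type in metadata: {type_code}")

assert from_thrift_type(2) == "INT64"
```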
[jira] [Created] (PARQUET-1565) [C++] SEGV in FromParquetSchema with corrupt file from PARQUET-1481
Hatem Helal created PARQUET-1565: Summary: [C++] SEGV in FromParquetSchema with corrupt file from PARQUET-1481 Key: PARQUET-1565 URL: https://issues.apache.org/jira/browse/PARQUET-1565 Project: Parquet Issue Type: Bug Components: parquet-cpp Affects Versions: cpp-1.6.0 Reporter: Hatem Helal Assignee: Hatem Helal Calling {{parquet::arrow::FromParquetSchema}} when reading the corrupt file attached to PARQUET-1481 results in a SEGV. I'm not sure when this was introduced but I didn't observe this problem with our app that uses parquet-cpp v1.4.0. Our team caught this while integrating Arrow 0.12.1 into MATLAB. To reproduce this, add the following lines to [parquet-reader.cc|https://github.com/apache/arrow/blob/master/cpp/tools/parquet/parquet-reader.cc#L66], build, and try to read the corrupt file attached to PARQUET-1481. {code:java} const auto parquet_schema = reader->metadata()->schema(); std::shared_ptr<::arrow::Schema> arrow_schema; PARQUET_THROW_NOT_OK(parquet::arrow::FromParquetSchema(parquet_schema, _schema));{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-1540) [C++] Set shared library version for linux and mac builds
[ https://issues.apache.org/jira/browse/PARQUET-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16785475#comment-16785475 ] Hatem Helal commented on PARQUET-1540: -- This is a duplicate of ARROW-3185 > [C++] Set shared library version for linux and mac builds > - > > Key: PARQUET-1540 > URL: https://issues.apache.org/jira/browse/PARQUET-1540 > Project: Parquet > Issue Type: Improvement > Components: parquet-cpp > Reporter: Hatem Helal > Assignee: Hatem Helal >Priority: Minor > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > It looks like this was previously implemented when parquet-cpp was managed as > a separate repo (PARQUET-935). It would be good to add this back now that > parquet-cpp was incorporated into the arrow project. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (PARQUET-1540) [C++] Set shared library version for linux and mac builds
[ https://issues.apache.org/jira/browse/PARQUET-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hatem Helal resolved PARQUET-1540. -- Resolution: Duplicate > [C++] Set shared library version for linux and mac builds > - > > Key: PARQUET-1540 > URL: https://issues.apache.org/jira/browse/PARQUET-1540 > Project: Parquet > Issue Type: Improvement > Components: parquet-cpp > Reporter: Hatem Helal > Assignee: Hatem Helal >Priority: Minor > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > It looks like this was previously implemented when parquet-cpp was managed as > a separate repo (PARQUET-935). It would be good to add this back now that > parquet-cpp was incorporated into the arrow project. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-1540) [C++] Set shared library version for linux and mac builds
[ https://issues.apache.org/jira/browse/PARQUET-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783267#comment-16783267 ] Hatem Helal commented on PARQUET-1540: -- This was discussed on the [mailing list|https://lists.apache.org/thread.html/420bd7b5b4a4bad62bf7d874c998c99204e1633a7d0cf47c00541c61@%3Cdev.arrow.apache.org%3E] and it makes sense for the SO versions to match up for arrow and parquet until an independent parquet C++ release is prepared. > [C++] Set shared library version for linux and mac builds > - > > Key: PARQUET-1540 > URL: https://issues.apache.org/jira/browse/PARQUET-1540 > Project: Parquet > Issue Type: Improvement > Components: parquet-cpp > Reporter: Hatem Helal > Assignee: Hatem Helal >Priority: Minor > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > It looks like this was previously implemented when parquet-cpp was managed as > a separate repo (PARQUET-935). It would be good to add this back now that > parquet-cpp was incorporated into the arrow project. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (PARQUET-1540) [C++] Set shared library version for linux and mac builds
Hatem Helal created PARQUET-1540: Summary: [C++] Set shared library version for linux and mac builds Key: PARQUET-1540 URL: https://issues.apache.org/jira/browse/PARQUET-1540 Project: Parquet Issue Type: Improvement Components: parquet-cpp Reporter: Hatem Helal Assignee: Hatem Helal It looks like this was previously implemented when parquet-cpp was managed as a separate repo (PARQUET-935). It would be good to add this back now that parquet-cpp was incorporated into the arrow project. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-1482) [C++] Unable to read data from parquet file generated with parquetjs
[ https://issues.apache.org/jira/browse/PARQUET-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16733248#comment-16733248 ] Hatem Helal commented on PARQUET-1482: -- [~wesmckinn], my colleague [~rdmello] is working on a fix for this. Could you help us out by adding him as a contributor on this project? Thanks! > [C++] Unable to read data from parquet file generated with parquetjs > > > Key: PARQUET-1482 > URL: https://issues.apache.org/jira/browse/PARQUET-1482 > Project: Parquet > Issue Type: Bug > Components: parquet-cpp > Reporter: Hatem Helal > Assignee: Hatem Helal >Priority: Major > Attachments: feeds1kMicros.parquet > > > See attached file, when I debug: > {{% ./parquet-reader feed1kMicros.parquet}} > I see that the {{scanner->HasNext()}} always returns false. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-1482) [C++] Unable to read data from parquet file generated with parquetjs
[ https://issues.apache.org/jira/browse/PARQUET-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16726884#comment-16726884 ] Hatem Helal commented on PARQUET-1482: -- I think this is a problem in parquet-cpp since I've confirmed that parquet-tools can read this file. > [C++] Unable to read data from parquet file generated with parquetjs > > > Key: PARQUET-1482 > URL: https://issues.apache.org/jira/browse/PARQUET-1482 > Project: Parquet > Issue Type: Bug > Reporter: Hatem Helal > Assignee: Hatem Helal >Priority: Major > Attachments: feeds1kMicros.parquet > > > See attached file, when I debug: > {{% ./parquet-reader feed1kMicros.parquet}} > I see that the {{scanner->HasNext()}} always returns false. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-1481) [C++] SEGV when reading corrupt parquet file
[ https://issues.apache.org/jira/browse/PARQUET-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16726840#comment-16726840 ] Hatem Helal commented on PARQUET-1481: -- Great, thanks for that [~wesmckinn]! > [C++] SEGV when reading corrupt parquet file > > > Key: PARQUET-1481 > URL: https://issues.apache.org/jira/browse/PARQUET-1481 > Project: Parquet > Issue Type: Bug > Reporter: Hatem Helal >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Attachments: corrupt.parquet > > Time Spent: 20m > Remaining Estimate: 0h > > >>> import pyarrow.parquet as pq > >>> pq.read_table('corrupt.parquet') > fish: 'python' terminated by signal SIGSEGV (Address boundary error) > > Stack report from macOS: > > 0 libsystem_kernel.dylib 0x7fff51164cee __psynch_cvwait + 10 > 1 libsystem_pthread.dylib 0x7fff512a1662 _pthread_cond_wait + 732 > 2 libc++.1.dylib 0x7fff4f04acb0 > std::__1::condition_variable::wait(std::__1::unique_lock&) + > 18 > 3 libc++.1.dylib 0x7fff4f04b728 > std::__1::__assoc_sub_state::__sub_wait(std::__1::unique_lock&) > + 46 > 4 libparquet.11.dylib 0x000115512d00 > std::__1::__assoc_state::move() + 48 > 5 libparquet.11.dylib 0x0001154faa15 > parquet::arrow::FileReader::Impl::ReadTable(std::__1::vector std::__1::allocator > const&, std::__1::shared_ptr*) + 1093 > 6 libparquet.11.dylib 0x0001154fb6fe > parquet::arrow::FileReader::Impl::ReadTable(std::__1::shared_ptr*) > + 350 > 7 libparquet.11.dylib 0x0001154fce47 > parquet::arrow::FileReader::ReadTable(std::__1::shared_ptr*) + > 23 > 8 _parquet.so 0x00011598d97b > __pyx_pw_7pyarrow_8_parquet_13ParquetReader_9read_all(_object*, _object*, > _object*) + 1035 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-1481) [C++] SEGV when reading corrupt parquet file
[ https://issues.apache.org/jira/browse/PARQUET-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16726795#comment-16726795 ] Hatem Helal commented on PARQUET-1481: -- Sure, a colleague used a text editor to make a random change in the file that was originally written using parquet-cpp. I'm looking at making this throw an exception / not-ok status code. Does that sound reasonable? > [C++] SEGV when reading corrupt parquet file > > > Key: PARQUET-1481 > URL: https://issues.apache.org/jira/browse/PARQUET-1481 > Project: Parquet > Issue Type: Bug > Reporter: Hatem Helal > Assignee: Hatem Helal >Priority: Major > Attachments: corrupt.parquet > > > >>> import pyarrow.parquet as pq > >>> pq.read_table('corrupt.parquet') > fish: 'python' terminated by signal SIGSEGV (Address boundary error) > > Stack report from macOS: > > 0 libsystem_kernel.dylib 0x7fff51164cee __psynch_cvwait + 10 > 1 libsystem_pthread.dylib 0x7fff512a1662 _pthread_cond_wait + 732 > 2 libc++.1.dylib 0x7fff4f04acb0 > std::__1::condition_variable::wait(std::__1::unique_lock&) + > 18 > 3 libc++.1.dylib 0x7fff4f04b728 > std::__1::__assoc_sub_state::__sub_wait(std::__1::unique_lock&) > + 46 > 4 libparquet.11.dylib 0x000115512d00 > std::__1::__assoc_state::move() + 48 > 5 libparquet.11.dylib 0x0001154faa15 > parquet::arrow::FileReader::Impl::ReadTable(std::__1::vector std::__1::allocator > const&, std::__1::shared_ptr*) + 1093 > 6 libparquet.11.dylib 0x0001154fb6fe > parquet::arrow::FileReader::Impl::ReadTable(std::__1::shared_ptr*) > + 350 > 7 libparquet.11.dylib 0x0001154fce47 > parquet::arrow::FileReader::ReadTable(std::__1::shared_ptr*) + > 23 > 8 _parquet.so 0x00011598d97b > __pyx_pw_7pyarrow_8_parquet_13ParquetReader_9read_all(_object*, _object*, > _object*) + 1035 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-1481) [C++] SEGV when reading corrupt parquet file
[ https://issues.apache.org/jira/browse/PARQUET-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16726757#comment-16726757 ]

Hatem Helal commented on PARQUET-1481:
--------------------------------------

Managed to reproduce this with a simple test using the latest apache arrow. Slightly nicer stack trace:

{{F1220 13:29:51.966117 2315707200 record_reader.cc:854] Check failed: false}}
{{*** Check failure stack trace: ***}}
{{ @ 0x1083c217a google::LogMessage::Fail()}}
{{ @ 0x1083c01de google::LogMessage::SendToLog()}}
{{ @ 0x1083c0e1f google::LogMessage::Flush()}}
{{ @ 0x1083c0c59 google::LogMessage::~LogMessage()}}
{{ @ 0x1083c0f15 google::LogMessage::~LogMessage()}}
{{ @ 0x10825d45c arrow::util::ArrowLog::~ArrowLog()}}
{{ @ 0x10825d4a5 arrow::util::ArrowLog::~ArrowLog()}}
{{ @ 0x107d5d936 parquet::internal::RecordReader::Make()}}
{{ @ 0x107cf8abd parquet::arrow::PrimitiveImpl::PrimitiveImpl()}}
{{ @ 0x107c69acd parquet::arrow::PrimitiveImpl::PrimitiveImpl()}}
{{ @ 0x107c68ba8 parquet::arrow::FileReader::Impl::GetColumn()}}
{{ @ 0x107c6b790 parquet::arrow::FileReader::Impl::GetReaderForNode()}}
{{ @ 0x107c6cb3d parquet::arrow::FileReader::Impl::ReadSchemaField()}}
{{ @ 0x107c79d60 parquet::arrow::FileReader::Impl::ReadTable()::$_1::operator()()}}
{{ @ 0x107c764ef parquet::arrow::FileReader::Impl::ReadTable()}}
{{ @ 0x107c7a9f5 parquet::arrow::FileReader::Impl::ReadTable()}}
{{ @ 0x107c7f5f7 parquet::arrow::FileReader::ReadTable()}}
{{ @ 0x107c6176c main}}

> [C++] SEGV when reading corrupt parquet file
> --------------------------------------------
>
> Key: PARQUET-1481
> URL: https://issues.apache.org/jira/browse/PARQUET-1481
> Project: Parquet
> Issue Type: Bug
> Reporter: Hatem Helal
> Assignee: Hatem Helal
> Priority: Major
> Attachments: corrupt.parquet
>
> >>> import pyarrow.parquet as pq
> >>> pq.read_table('corrupt.parquet')
> fish: 'python' terminated by signal SIGSEGV (Address boundary error)
>
> Stack report from macOS:
>
> 0 libsystem_kernel.dylib 0x7fff51164cee __psynch_cvwait + 10
> 1 libsystem_pthread.dylib 0x7fff512a1662 _pthread_cond_wait + 732
> 2 libc++.1.dylib 0x7fff4f04acb0 std::__1::condition_variable::wait(std::__1::unique_lock&) + 18
> 3 libc++.1.dylib 0x7fff4f04b728 std::__1::__assoc_sub_state::__sub_wait(std::__1::unique_lock&) + 46
> 4 libparquet.11.dylib 0x000115512d00 std::__1::__assoc_state::move() + 48
> 5 libparquet.11.dylib 0x0001154faa15 parquet::arrow::FileReader::Impl::ReadTable(std::__1::vector std::__1::allocator const&, std::__1::shared_ptr*) + 1093
> 6 libparquet.11.dylib 0x0001154fb6fe parquet::arrow::FileReader::Impl::ReadTable(std::__1::shared_ptr*) + 350
> 7 libparquet.11.dylib 0x0001154fce47 parquet::arrow::FileReader::ReadTable(std::__1::shared_ptr*) + 23
> 8 _parquet.so 0x00011598d97b __pyx_pw_7pyarrow_8_parquet_13ParquetReader_9read_all(_object*, _object*, _object*) + 1035

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-1481) [C++] SEGV when reading corrupt parquet file
[ https://issues.apache.org/jira/browse/PARQUET-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hatem Helal updated PARQUET-1481: - Attachment: corrupt.parquet > [C++] SEGV when reading corrupt parquet file > > > Key: PARQUET-1481 > URL: https://issues.apache.org/jira/browse/PARQUET-1481 > Project: Parquet > Issue Type: Bug > Reporter: Hatem Helal > Assignee: Hatem Helal >Priority: Major > Attachments: corrupt.parquet > > > >>> import pyarrow.parquet as pq > >>> pq.read_table('corrupt.parquet') > fish: 'python' terminated by signal SIGSEGV (Address boundary error) > > Stack report from macOS: > > 0 libsystem_kernel.dylib 0x7fff51164cee __psynch_cvwait + 10 > 1 libsystem_pthread.dylib 0x7fff512a1662 _pthread_cond_wait + 732 > 2 libc++.1.dylib 0x7fff4f04acb0 > std::__1::condition_variable::wait(std::__1::unique_lock&) + > 18 > 3 libc++.1.dylib 0x7fff4f04b728 > std::__1::__assoc_sub_state::__sub_wait(std::__1::unique_lock&) > + 46 > 4 libparquet.11.dylib 0x000115512d00 > std::__1::__assoc_state::move() + 48 > 5 libparquet.11.dylib 0x0001154faa15 > parquet::arrow::FileReader::Impl::ReadTable(std::__1::vector std::__1::allocator > const&, std::__1::shared_ptr*) + 1093 > 6 libparquet.11.dylib 0x0001154fb6fe > parquet::arrow::FileReader::Impl::ReadTable(std::__1::shared_ptr*) > + 350 > 7 libparquet.11.dylib 0x0001154fce47 > parquet::arrow::FileReader::ReadTable(std::__1::shared_ptr*) + > 23 > 8 _parquet.so 0x00011598d97b > __pyx_pw_7pyarrow_8_parquet_13ParquetReader_9read_all(_object*, _object*, > _object*) + 1035 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (PARQUET-1481) [C++] SEGV when reading corrupt parquet file
Hatem Helal created PARQUET-1481: Summary: [C++] SEGV when reading corrupt parquet file Key: PARQUET-1481 URL: https://issues.apache.org/jira/browse/PARQUET-1481 Project: Parquet Issue Type: Bug Reporter: Hatem Helal Assignee: Hatem Helal >>> import pyarrow.parquet as pq >>> pq.read_table('corrupt.parquet') fish: 'python' terminated by signal SIGSEGV (Address boundary error) Stack report from macOS: 0 libsystem_kernel.dylib 0x7fff51164cee __psynch_cvwait + 10 1 libsystem_pthread.dylib 0x7fff512a1662 _pthread_cond_wait + 732 2 libc++.1.dylib 0x7fff4f04acb0 std::__1::condition_variable::wait(std::__1::unique_lock&) + 18 3 libc++.1.dylib 0x7fff4f04b728 std::__1::__assoc_sub_state::__sub_wait(std::__1::unique_lock&) + 46 4 libparquet.11.dylib 0x000115512d00 std::__1::__assoc_state::move() + 48 5 libparquet.11.dylib 0x0001154faa15 parquet::arrow::FileReader::Impl::ReadTable(std::__1::vector > const&, std::__1::shared_ptr*) + 1093 6 libparquet.11.dylib 0x0001154fb6fe parquet::arrow::FileReader::Impl::ReadTable(std::__1::shared_ptr*) + 350 7 libparquet.11.dylib 0x0001154fce47 parquet::arrow::FileReader::ReadTable(std::__1::shared_ptr*) + 23 8 _parquet.so 0x00011598d97b __pyx_pw_7pyarrow_8_parquet_13ParquetReader_9read_all(_object*, _object*, _object*) + 1035 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
Re: parquet-arrow estimate file size
I think, if I've understood the problem correctly, you could use the parquet::arrow::FileWriter:

https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.h#L128

The basic pattern is to use an object to manage the FileWriter lifetime, call the WriteTable method for each row group, and close it when you are done. My understanding is that each call to WriteTable will append a new row group, which should allow you to incrementally write a larger-than-memory dataset. I realize now that I haven't tested this myself, so it would be good to double-check this with someone more experienced with the parquet-cpp APIs.

On 12/11/18, 12:54 AM, "Jiayuan Chen" wrote:

    Thanks for the suggestion, will do.

    Since such a high-level API is not yet implemented in the parquet-cpp project, I have turned back to the newly introduced low-level API that calculates the Parquet file size when adding data to the column writers. I have another question on that part: is there any sample code or advice I can follow to stream a Parquet file on a per-row-group basis? In other words, to restrict memory usage but still create a big enough Parquet file, I would like to create relatively small row groups in memory using InMemoryOutputStream(), and dump the buffer contents to my external stream after completing each row group, until a big file with several row groups is finished. However, my attempts to manipulate the underlying arrow::Buffer have failed: the pages starting from the second row group are unreadable.

    Thanks!

    On Mon, Dec 10, 2018 at 3:53 PM Wes McKinney wrote:
    > hi Jiayuan,
    >
    > To your question
    >
    > > Would this be in the roadmap?
    >
    > I doubt there would be any objections to adding this feature to the
    > Arrow writer API -- please feel free to open a JIRA issue to describe
    > how the API might work in C++. Note there is no formal roadmap in this
    > project.
    >
    > - Wes
    > On Mon, Dec 10, 2018 at 5:31 PM Jiayuan Chen wrote:
    > >
    > > Thanks for the Python solution.
However, is there a solution in C++ that > I > > can create such Parquet file with only in-memory buffer, using > parquet-cpp > > library? > > > > On Mon, Dec 10, 2018 at 3:23 PM Lee, David > wrote: > > > > > Resending.. Somehow I lost some line feeds in the previous reply.. > > > > > > import os > > > import pyarrow.parquet as pq > > > import glob as glob > > > > > > max_target_size = 134217728 > > > target_size = max_target_size * .95 > > > # Directory where parquet files are saved > > > working_directory = '/tmp/test' > > > files_dict = dict() > > > files = glob.glob(os.path.join(working_directory, "*.parquet")) > > > files.sort() > > > for file in files: > > > files_dict[file] = os.path.getsize(file) > > > print("Merging parquet files") > > > temp_file = os.path.join(working_directory, "temp.parquet") > > > file_no = 0 > > > for file in files: > > > if file in files_dict: > > > file_no = file_no + 1 > > > file_name = os.path.join(working_directory, > str(file_no).zfill(4) > > > + ".parquet") > > > print("Saving to parquet file " + file_name) > > > # Just rename file if the file size is in target range > > > if files_dict[file] > target_size: > > > del files_dict[file] > > > os.rename(file, file_name) > > > continue > > > merge_list = list() > > > file_size = 0 > > > # Find files to merge together which add up to less than 128 > megs > > > for k, v in files_dict.items(): > > > if file_size + v <= max_target_size: > > > print("Adding file " + k + " to merge list") > > > merge_list.append(k) > > > file_size = file_size + v > > > # Just rename file if there is only one file to merge > > > if len(merge_list) == 1: > > > del files_dict[merge_list[0]] > > > os.rename(merge_list[0], file_name) > > > continue > > > # Merge smaller files into one large file. Read row groups from > > > each file and add them to the new file. 
> > > schema = pq.read_schema(file) > > > print("Saving to new parquet file") > > > writer = pq.ParquetWriter(temp_file, schema=schema, > > > use_dictionary=True, compression='snappy') > > > for merge in merge_list: > > > parquet_file = pq.ParquetFile(merge) > > > print("Writing " + merge + " to new parquet file") > > > for i in range(parquet_file.num_row_groups): > > >
Re: Regarding Apache Parquet Project
Hi Arjit, I'm new around here too but interested to hear what the others on this list have to say. For C++ development, I'd recommend reading through the examples: https://github.com/apache/arrow/tree/master/cpp/examples/parquet and the command-line tools: https://github.com/apache/arrow/tree/master/cpp/tools/parquet Both were helpful for getting up to speed on the main APIs. I use an IDE (Xcode, but it doesn't matter which) to debug and step through the code and try to understand the internal dependencies. The setup for Xcode was a bit manual, but let me know if there is interest and I can investigate automating it so that I can share it with others. Hope this helps, Hatem On 12/11/18, 5:39 AM, "Arjit Yadav" wrote: Hi all, I am new to this project. While I have used Parquet in the past, I want to know how it works internally and look up relevant documentation and code in order to start contributing to the project. Please let me know of any available resources in this regard. Regards, Arjit Yadav
[jira] [Created] (PARQUET-1473) [C++] Add helper function that converts ParquetVersion to human-friendly string
Hatem Helal created PARQUET-1473: Summary: [C++] Add helper function that converts ParquetVersion to human-friendly string Key: PARQUET-1473 URL: https://issues.apache.org/jira/browse/PARQUET-1473 Project: Parquet Issue Type: Improvement Reporter: Hatem Helal Assignee: Hatem Helal I noticed this while working on ARROW-3564: the parquet-reader utility prints a line like: *Version: 0* which corresponds to the PARQUET_1_0 enum value. I couldn't find anything that obviously did this in the parquet-cpp code base. It would be good if there were a function that could map from the ParquetVersion enum to a human-friendly string. e.g: Version: 1.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
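For illustration, the mapping this ticket requests could be as small as a lookup table. The following is a hypothetical Python sketch only (the real helper would live in the C++ code base; the constants mirror the ParquetVersion enum values mentioned in the ticket):

```python
# Hypothetical sketch of the requested helper: map ParquetVersion enum
# values (mirrored here as plain constants) to human-friendly version
# strings, falling back to the raw value for unknown versions.
PARQUET_1_0 = 0
PARQUET_2_0 = 1

_VERSION_NAMES = {PARQUET_1_0: "1.0", PARQUET_2_0: "2.0"}

def parquet_version_to_string(version):
    return _VERSION_NAMES.get(version, str(version))

print(parquet_version_to_string(PARQUET_1_0))  # "1.0" instead of "0"
```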
[jira] [Updated] (PARQUET-1458) parquet::CompressionToString not recognizing brotli compression
[ https://issues.apache.org/jira/browse/PARQUET-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hatem Helal updated PARQUET-1458: - Labels: (was: pull) > parquet::CompressionToString not recognizing brotli compression > --- > > Key: PARQUET-1458 > URL: https://issues.apache.org/jira/browse/PARQUET-1458 > Project: Parquet > Issue Type: Bug > Components: parquet-cpp > Reporter: Hatem Helal >Priority: Trivial > > It looks like we just need to add a case to handle the brotli codec > [here|[https://github.com/apache/arrow/blob/9b4cd9c03ed9365f8e235f296caa166ea692c98f/cpp/src/parquet/types.cc#L122]] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-1458) parquet::CompressionToString not recognizing brotli compression
[ https://issues.apache.org/jira/browse/PARQUET-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hatem Helal updated PARQUET-1458: - Labels: pull (was: ) > parquet::CompressionToString not recognizing brotli compression > --- > > Key: PARQUET-1458 > URL: https://issues.apache.org/jira/browse/PARQUET-1458 > Project: Parquet > Issue Type: Bug > Components: parquet-cpp > Reporter: Hatem Helal >Priority: Trivial > > It looks like we just need to add a case to handle the brotli codec > [here|[https://github.com/apache/arrow/blob/9b4cd9c03ed9365f8e235f296caa166ea692c98f/cpp/src/parquet/types.cc#L122]] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-1458) parquet::CompressionToString not recognizing brotli compression
[ https://issues.apache.org/jira/browse/PARQUET-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hatem Helal updated PARQUET-1458: - Priority: Trivial (was: Major) > parquet::CompressionToString not recognizing brotli compression > --- > > Key: PARQUET-1458 > URL: https://issues.apache.org/jira/browse/PARQUET-1458 > Project: Parquet > Issue Type: Bug > Components: parquet-cpp > Reporter: Hatem Helal >Priority: Trivial > > It looks like we just need to add a case to handle the brotli codec > [here|[https://github.com/apache/arrow/blob/9b4cd9c03ed9365f8e235f296caa166ea692c98f/cpp/src/parquet/types.cc#L122]] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-1458) parquet::CompressionToString not recognizing brotli compression
[ https://issues.apache.org/jira/browse/PARQUET-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hatem Helal updated PARQUET-1458: - Description: It looks like we just need to add a case to handle the brotli codec [here|[https://github.com/apache/arrow/blob/9b4cd9c03ed9365f8e235f296caa166ea692c98f/cpp/src/parquet/types.cc#L122]] (was: It looks like we just need to add a case to handle the brotli codec [here|]https://github.com/apache/arrow/blob/9b4cd9c03ed9365f8e235f296caa166ea692c98f/cpp/src/parquet/types.cc#L122]) > parquet::CompressionToString not recognizing brotli compression > --- > > Key: PARQUET-1458 > URL: https://issues.apache.org/jira/browse/PARQUET-1458 > Project: Parquet > Issue Type: Bug > Components: parquet-cpp > Reporter: Hatem Helal >Priority: Major > > It looks like we just need to add a case to handle the brotli codec > [here|[https://github.com/apache/arrow/blob/9b4cd9c03ed9365f8e235f296caa166ea692c98f/cpp/src/parquet/types.cc#L122]] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-1458) parquet::CompressionToString not recognizing brotli compression
[ https://issues.apache.org/jira/browse/PARQUET-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686773#comment-16686773 ] Hatem Helal commented on PARQUET-1458: -- Looking into fixing this. > parquet::CompressionToString not recognizing brotli compression > --- > > Key: PARQUET-1458 > URL: https://issues.apache.org/jira/browse/PARQUET-1458 > Project: Parquet > Issue Type: Bug > Components: parquet-cpp > Reporter: Hatem Helal >Priority: Major > > It looks like we just need to add a case to handle the brotli codec > [here|]https://github.com/apache/arrow/blob/9b4cd9c03ed9365f8e235f296caa166ea692c98f/cpp/src/parquet/types.cc#L122] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (PARQUET-1458) parquet::CompressionToString not recognizing brotli compression
Hatem Helal created PARQUET-1458: Summary: parquet::CompressionToString not recognizing brotli compression Key: PARQUET-1458 URL: https://issues.apache.org/jira/browse/PARQUET-1458 Project: Parquet Issue Type: Bug Components: parquet-cpp Reporter: Hatem Helal It looks like we just need to add a case to handle the brotli codec [here|]https://github.com/apache/arrow/blob/9b4cd9c03ed9365f8e235f296caa166ea692c98f/cpp/src/parquet/types.cc#L122] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
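The fix the ticket describes is adding a case for the brotli codec to the codec-to-string switch in types.cc. A hypothetical Python sketch of that logic (enum values mirrored as plain constants; not the actual C++ code):

```python
# Hypothetical sketch of parquet::CompressionToString's logic as
# described in PARQUET-1458: a codec-to-name mapping that previously
# fell through to "UNKNOWN" for brotli.
UNCOMPRESSED, SNAPPY, GZIP, LZO, BROTLI = range(5)

def compression_to_string(codec):
    names = {
        UNCOMPRESSED: "UNCOMPRESSED",
        SNAPPY: "SNAPPY",
        GZIP: "GZIP",
        LZO: "LZO",
        BROTLI: "BROTLI",  # the previously missing case
    }
    return names.get(codec, "UNKNOWN")

print(compression_to_string(BROTLI))  # "BROTLI" rather than "UNKNOWN"
```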