[jira] [Updated] (PARQUET-2110) Fix Typos in LogicalTypes.md

2022-01-19 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated PARQUET-2110:
--
Fix Version/s: format-2.10.0

> Fix Typos in LogicalTypes.md
> 
>
> Key: PARQUET-2110
> URL: https://issues.apache.org/jira/browse/PARQUET-2110
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: jincongho
>Assignee: jincongho
>Priority: Trivial
> Fix For: format-2.10.0
>
>
> interpertations -> interpretations
> regadless -> regardless
> unambigously -> unambiguously



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (PARQUET-2110) Fix Typos in LogicalTypes.md

2022-01-19 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-2110.
---
Resolution: Fixed

Resolved in PR https://github.com/apache/parquet-format/pull/181

> Fix Typos in LogicalTypes.md
> 
>
> Key: PARQUET-2110
> URL: https://issues.apache.org/jira/browse/PARQUET-2110
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: jincongho
>Priority: Trivial
>
> interpertations -> interpretations
> regadless -> regardless
> unambigously -> unambiguously



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (PARQUET-1345) [C++] It is possible to overflow a TMemoryBuffer when serializing the file metadata

2020-10-01 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17205494#comment-17205494
 ] 

Wes McKinney commented on PARQUET-1345:
---

Can you make a repro? Seems like something we should see if we can fix

> [C++] It is possible to overflow a TMemoryBuffer when serializing the file 
> metadata
> ---
>
> Key: PARQUET-1345
> URL: https://issues.apache.org/jira/browse/PARQUET-1345
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Major
>
> I'm not sure if this is fixable, but see issue reported to Arrow:
> https://github.com/apache/arrow/issues/2077



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
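
Context for the overflow: Thrift's TMemoryBuffer historically sized its buffer
with a 32-bit integer, so a footer whose serialized metadata approaches 2 GiB
can overflow it. A minimal sketch (not a full repro, which would need
multi-gigabyte metadata) showing how the Thrift-serialized footer grows with
schema width, assuming pyarrow's read_metadata and serialized_size:

{code:python}
# Sketch only: illustrates footer growth; the actual overflow would require
# the serialized metadata to exceed TMemoryBuffer's 32-bit limit (~2 GiB).
import pyarrow as pa
import pyarrow.parquet as pq

ncols = 5000  # hypothetical width; a real overflow needs vastly more metadata
table = pa.table({f"c{i}": pa.array([0]) for i in range(ncols)})
pq.write_table(table, "wide.parquet")

md = pq.read_metadata("wide.parquet")
print(md.serialized_size)  # Thrift-serialized footer size in bytes
{code}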


[jira] [Assigned] (PARQUET-1878) [C++] lz4 codec is not compatible with Hadoop Lz4Codec

2020-09-22 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned PARQUET-1878:
-

Assignee: Patrick Pai

> [C++] lz4 codec is not compatible with Hadoop Lz4Codec
> --
>
> Key: PARQUET-1878
> URL: https://issues.apache.org/jira/browse/PARQUET-1878
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Steve M. Kim
>Assignee: Patrick Pai
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 9h 20m
>  Remaining Estimate: 0h
>
> As described in HADOOP-12990, the Hadoop {{Lz4Codec}} uses the lz4 block 
> format, and it prepends 8 extra bytes before the compressed data. I believe 
> that the lz4 implementation in parquet-cpp also uses the lz4 block format, but 
> it does not prepend these 8 extra bytes.
>  
> Using Java parquet-mr, I wrote a Parquet file with lz4 compression:
> {code:java}
> $ parquet-tools meta 
> /tmp/f4a1c7f57cb1c98c2b9da3b25b16d027df5d2f1cf55adb79374c154fbd79011f
> file:
> file:/tmp/f4a1c7f57cb1c98c2b9da3b25b16d027df5d2f1cf55adb79374c154fbd79011f
> creator: parquet-mr version 1.10.1 (build 
> a89df8f9932b6ef6633d06069e50c9b7970bebd1)
> file schema:
> 
> c1:  REQUIRED INT64 R:0 D:0
> c0:  REQUIRED BINARY R:0 D:0
> v0:  REQUIRED INT64 R:0 D:0
> row group 1: RC:5007 TS:28028 OFFSET:4
> 
> c1:   INT64 LZ4 DO:0 FPO:4 SZ:24797/25694/1.04 VC:5007 
> ENC:DELTA_BINARY_PACKED ST:[min: 1566330126476659000, max: 
> 1571211622650188000, num_nulls: 0]
> c0:   BINARY LZ4 DO:0 FPO:24801 SZ:279/260/0.93 VC:5007 
> ENC:PLAIN,RLE_DICTIONARY ST:[min: 
> 0x7471732F62656566616C6F2F746F6D6163636F2D66782D6D6B74646174612D6C69766573747265616D,
>  max: 
> 0x7471732F62656566616C6F2F746F6D6163636F2D66782D6D6B74646174612D6C69766573747265616D,
>  num_nulls: 0]
> v0:   INT64 LZ4 DO:0 FPO:25080 SZ:1348/2074/1.54 VC:5007 
> ENC:PLAIN,RLE_DICTIONARY ST:[min: 0, max: 9, num_nulls: 0] {code}
> When I attempted to read this file with parquet-cpp, I got the following 
> error:
> {code:java}
> >>> import pyarrow.parquet as pq
> >>> pq.read_table('/tmp/f4a1c7f57cb1c98c2b9da3b25b16d027df5d2f1cf55adb79374c154fbd79011f')
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/home/skim/miniconda3/envs/arrow/lib/python3.8/site-packages/pyarrow/parquet.py",
>  line 1536, in read_table
> return pf.read(columns=columns, use_threads=use_threads,
>   File 
> "/home/skim/miniconda3/envs/arrow/lib/python3.8/site-packages/pyarrow/parquet.py",
>  line 1260, in read
> table = piece.read(columns=columns, use_threads=use_threads,
>   File 
> "/home/skim/miniconda3/envs/arrow/lib/python3.8/site-packages/pyarrow/parquet.py",
>  line 707, in read
> table = reader.read(**options)
>   File 
> "/home/skim/miniconda3/envs/arrow/lib/python3.8/site-packages/pyarrow/parquet.py",
>  line 336, in read
> return self.reader.read_all(column_indices=column_indices,
>   File "pyarrow/_parquet.pyx", line 1130, in 
> pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> OSError: IOError: Corrupt Lz4 compressed data. {code}
>  
> [https://github.com/apache/arrow/issues/3491] reported incompatibility in the 
> other direction, using Spark (which uses the Hadoop lz4 codec) to read a 
> parquet file that was written with parquet-cpp.
>  
> Given that the Hadoop lz4 codec has long been in use, and users have 
> accumulated Parquet files that were written with this implementation, I 
> propose changing parquet-cpp to match the Hadoop implementation.
>  
> See also:
>  * 
> https://issues.apache.org/jira/browse/PARQUET-1241?focusedCommentId=16574328#comment-16574328
>  * 
> https://issues.apache.org/jira/browse/PARQUET-1241?focusedCommentId=16585288#comment-16585288



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
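
The 8 extra bytes described in the report are the Hadoop block frame from
HADOOP-12990: a 4-byte big-endian uncompressed length followed by a 4-byte
big-endian compressed length, ahead of the raw lz4 block. A minimal sketch of
that framing, assuming the third-party python lz4 package (a real Hadoop
stream may additionally split the payload into several such chunks):

{code:python}
import struct

import lz4.block

def hadoop_lz4_frame(data: bytes) -> bytes:
    # [uncompressed size][compressed size][raw lz4 block], sizes big-endian
    compressed = lz4.block.compress(data, store_size=False)
    return struct.pack(">II", len(data), len(compressed)) + compressed

def hadoop_lz4_unframe(framed: bytes) -> bytes:
    raw_len, comp_len = struct.unpack(">II", framed[:8])
    return lz4.block.decompress(framed[8:8 + comp_len], uncompressed_size=raw_len)

payload = b"parquet" * 100
assert hadoop_lz4_unframe(hadoop_lz4_frame(payload)) == payload
{code}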


[jira] [Resolved] (PARQUET-1878) [C++] lz4 codec is not compatible with Hadoop Lz4Codec

2020-09-22 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1878.
---
Fix Version/s: cpp-1.6.0
   Resolution: Fixed

Issue resolved by pull request 7789
[https://github.com/apache/arrow/pull/7789]

> [C++] lz4 codec is not compatible with Hadoop Lz4Codec
> --
>
> Key: PARQUET-1878
> URL: https://issues.apache.org/jira/browse/PARQUET-1878
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Steve M. Kim
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 9h 10m
>  Remaining Estimate: 0h
>
> As described in HADOOP-12990, the Hadoop {{Lz4Codec}} uses the lz4 block 
> format, and it prepends 8 extra bytes before the compressed data. I believe 
> that the lz4 implementation in parquet-cpp also uses the lz4 block format, but 
> it does not prepend these 8 extra bytes.
>  
> Using Java parquet-mr, I wrote a Parquet file with lz4 compression:
> {code:java}
> $ parquet-tools meta 
> /tmp/f4a1c7f57cb1c98c2b9da3b25b16d027df5d2f1cf55adb79374c154fbd79011f
> file:
> file:/tmp/f4a1c7f57cb1c98c2b9da3b25b16d027df5d2f1cf55adb79374c154fbd79011f
> creator: parquet-mr version 1.10.1 (build 
> a89df8f9932b6ef6633d06069e50c9b7970bebd1)
> file schema:
> 
> c1:  REQUIRED INT64 R:0 D:0
> c0:  REQUIRED BINARY R:0 D:0
> v0:  REQUIRED INT64 R:0 D:0
> row group 1: RC:5007 TS:28028 OFFSET:4
> 
> c1:   INT64 LZ4 DO:0 FPO:4 SZ:24797/25694/1.04 VC:5007 
> ENC:DELTA_BINARY_PACKED ST:[min: 1566330126476659000, max: 
> 1571211622650188000, num_nulls: 0]
> c0:   BINARY LZ4 DO:0 FPO:24801 SZ:279/260/0.93 VC:5007 
> ENC:PLAIN,RLE_DICTIONARY ST:[min: 
> 0x7471732F62656566616C6F2F746F6D6163636F2D66782D6D6B74646174612D6C69766573747265616D,
>  max: 
> 0x7471732F62656566616C6F2F746F6D6163636F2D66782D6D6B74646174612D6C69766573747265616D,
>  num_nulls: 0]
> v0:   INT64 LZ4 DO:0 FPO:25080 SZ:1348/2074/1.54 VC:5007 
> ENC:PLAIN,RLE_DICTIONARY ST:[min: 0, max: 9, num_nulls: 0] {code}
> When I attempted to read this file with parquet-cpp, I got the following 
> error:
> {code:java}
> >>> import pyarrow.parquet as pq
> >>> pq.read_table('/tmp/f4a1c7f57cb1c98c2b9da3b25b16d027df5d2f1cf55adb79374c154fbd79011f')
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/home/skim/miniconda3/envs/arrow/lib/python3.8/site-packages/pyarrow/parquet.py",
>  line 1536, in read_table
> return pf.read(columns=columns, use_threads=use_threads,
>   File 
> "/home/skim/miniconda3/envs/arrow/lib/python3.8/site-packages/pyarrow/parquet.py",
>  line 1260, in read
> table = piece.read(columns=columns, use_threads=use_threads,
>   File 
> "/home/skim/miniconda3/envs/arrow/lib/python3.8/site-packages/pyarrow/parquet.py",
>  line 707, in read
> table = reader.read(**options)
>   File 
> "/home/skim/miniconda3/envs/arrow/lib/python3.8/site-packages/pyarrow/parquet.py",
>  line 336, in read
> return self.reader.read_all(column_indices=column_indices,
>   File "pyarrow/_parquet.pyx", line 1130, in 
> pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> OSError: IOError: Corrupt Lz4 compressed data. {code}
>  
> [https://github.com/apache/arrow/issues/3491] reported incompatibility in the 
> other direction, using Spark (which uses the Hadoop lz4 codec) to read a 
> parquet file that was written with parquet-cpp.
>  
> Given that the Hadoop lz4 codec has long been in use, and users have 
> accumulated Parquet files that were written with this implementation, I 
> propose changing parquet-cpp to match the Hadoop implementation.
>  
> See also:
>  * 
> https://issues.apache.org/jira/browse/PARQUET-1241?focusedCommentId=16574328#comment-16574328
>  * 
> https://issues.apache.org/jira/browse/PARQUET-1241?focusedCommentId=16585288#comment-16585288



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1904) [C++] Export file_offset in RowGroupMetaData

2020-08-27 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186124#comment-17186124
 ] 

Wes McKinney commented on PARQUET-1904:
---

Done. I also made you an administrator so you can do this in the future

> [C++] Export file_offset in RowGroupMetaData
> 
>
> Key: PARQUET-1904
> URL: https://issues.apache.org/jira/browse/PARQUET-1904
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Simon Bertron
>Assignee: Simon Bertron
>Priority: Trivial
>  Labels: parquet, pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> In the C++ row group metadata object, the offset of the row group in the file 
> is stored, but not exposed to users. RowGroupMetaDataImpl has a field 
> file_offset and a method file_offset() that exposes it. But RowGroupMetaData 
> does not have a file_offset() method. This seems odd, as most other fields in 
> RowGroupMetaDataImpl are exposed by RowGroupMetaData.
>  
> This issue is similar to ARROW-3590, but that issue seems pretty stale and is 
> requesting a python feature. I think this issue is more focused and detailed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
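
For reference, the column-chunk analogue of this accessor is already visible
from Python; a quick sketch of inspecting row-group placement via pyarrow
(assuming the ColumnChunkMetaData.file_offset property, which is distinct from
the C++ RowGroupMetaData::file_offset() requested here):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

pq.write_table(pa.table({"x": list(range(10))}), "t.parquet")
md = pq.read_metadata("t.parquet")

col = md.row_group(0).column(0)  # ColumnChunkMetaData
print(col.file_offset, col.total_compressed_size)
{code}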


[jira] [Updated] (PARQUET-1904) [C++] Export file_offset in RowGroupMetaData

2020-08-27 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated PARQUET-1904:
--
Fix Version/s: cpp-1.6.0

> [C++] Export file_offset in RowGroupMetaData
> 
>
> Key: PARQUET-1904
> URL: https://issues.apache.org/jira/browse/PARQUET-1904
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Simon Bertron
>Assignee: Simon Bertron
>Priority: Trivial
>  Labels: parquet, pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> In the C++ row group metadata object, the offset of the row group in the file 
> is stored, but not exposed to users. RowGroupMetaDataImpl has a field 
> file_offset and a method file_offset() that exposes it. But RowGroupMetaData 
> does not have a file_offset() method. This seems odd, as most other fields in 
> RowGroupMetaDataImpl are exposed by RowGroupMetaData.
>  
> This issue is similar to ARROW-3590, but that issue seems pretty stale and is 
> requesting a python feature. I think this issue is more focused and detailed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1904) [C++] Export file_offset in RowGroupMetaData

2020-08-27 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned PARQUET-1904:
-

Assignee: Simon Bertron

> [C++] Export file_offset in RowGroupMetaData
> 
>
> Key: PARQUET-1904
> URL: https://issues.apache.org/jira/browse/PARQUET-1904
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Simon Bertron
>Assignee: Simon Bertron
>Priority: Trivial
>  Labels: parquet, pull-request-available
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> In the C++ row group metadata object, the offset of the row group in the file 
> is stored, but not exposed to users. RowGroupMetaDataImpl has a field 
> file_offset and a method file_offset() that exposes it. But RowGroupMetaData 
> does not have a file_offset() method. This seems odd, as most other fields in 
> RowGroupMetaDataImpl are exposed by RowGroupMetaData.
>  
> This issue is similar to ARROW-3590, but that issue seems pretty stale and is 
> requesting a python feature. I think this issue is more focused and detailed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1845) [C++] Int96 memory images in test cases assume only little-endian

2020-08-03 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1845.
---
Fix Version/s: cpp-1.6.0
   Resolution: Fixed

Issue resolved by pull request 6981
[https://github.com/apache/arrow/pull/6981]

> [C++] Int96 memory images in test cases assume only little-endian
> -
>
> Key: PARQUET-1845
> URL: https://issues.apache.org/jira/browse/PARQUET-1845
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Minor
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Int96 is used as a pair of a uint64 and a uint32. Both elements can be handled 
> in the native endianness for efficiency.
> The Int96 memory images in parquet-internal-tests assume little-endian only.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
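
For orientation, the layout under discussion: an Int96 timestamp is 12 bytes,
an 8-byte nanoseconds-within-day field followed by a 4-byte Julian day, both
little-endian on disk. A small decoding sketch (the sample value is
hypothetical):

{code:python}
import struct

def decode_int96(raw: bytes):
    # uint64 nanoseconds-within-day, then uint32 Julian day, little-endian
    nanos, julian_day = struct.unpack("<QI", raw)
    return julian_day, nanos

raw = struct.pack("<QI", 0, 2459065)  # midnight on Julian day 2459065 (2020-08-03)
print(decode_int96(raw))  # (2459065, 0)
{code}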


[jira] [Assigned] (PARQUET-1845) [C++] Int96 memory images in test cases assume only little-endian

2020-08-03 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned PARQUET-1845:
-

Assignee: Kazuaki Ishizaki

> [C++] Int96 memory images in test cases assume only little-endian
> -
>
> Key: PARQUET-1845
> URL: https://issues.apache.org/jira/browse/PARQUET-1845
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Int96 is used as a pair of a uint64 and a uint32. Both elements can be handled 
> in the native endianness for efficiency.
> The Int96 memory images in parquet-internal-tests assume little-endian only.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1882) [C++] Writing an all-null column and then reading it with buffered_stream aborts the process

2020-07-12 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1882.
---
Fix Version/s: cpp-1.6.0
   Resolution: Fixed

Issue resolved by pull request 7718
[https://github.com/apache/arrow/pull/7718]

> [C++] Writing an all-null column and then reading it with buffered_stream 
> aborts the process
> 
>
> Key: PARQUET-1882
> URL: https://issues.apache.org/jira/browse/PARQUET-1882
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
> Environment: Windows 10 64-bit, MSVC
>Reporter: Eric Gorelik
>Assignee: Micah Kornfield
>Priority: Critical
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> When writing a column unbuffered that contains only nulls, a 0-byte 
> dictionary page gets written. When then reading the resulting file with 
> buffered_stream enabled, the column reader gets the length of the page (which 
> is 0), and then tries to read that many bytes from the underlying input 
> stream.
> parquet/column_reader.cc, SerializedPageReader::NextPage
>  
> {code:java}
> int compressed_len = current_page_header_.compressed_page_size;
> int uncompressed_len = current_page_header_.uncompressed_page_size;
> // Read the compressed data page.
> std::shared_ptr<Buffer> page_buffer;
> PARQUET_THROW_NOT_OK(stream_->Read(compressed_len, &page_buffer));{code}
>  
> BufferedInputStream::Read, however, has an assertion that the bytes to read 
> is strictly positive, so the assertion fails and aborts the process.
> arrow/io/buffered.cc, BufferedInputStream::Impl
>  
> {code:java}
> Status Read(int64_t nbytes, int64_t* bytes_read, void* out) {
>   ARROW_CHECK_GT(nbytes, 0);
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
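
A minimal candidate reproducer, assuming pyarrow's buffer_size read option maps
onto buffered_stream; the all-null column produces the 0-byte dictionary page
described in the report, and on builds without this fix the read aborted the
process:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# an all-null column yields the 0-byte dictionary page described above
table = pa.table({"a": pa.array([None] * 10, type=pa.int64())})
pq.write_table(table, "all_null.parquet")

# buffer_size > 0 enables buffered reads; this tripped ARROW_CHECK_GT(nbytes, 0)
result = pq.read_table("all_null.parquet", buffer_size=4096)
print(result)
{code}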


[jira] [Updated] (PARQUET-1882) [C++] Writing an all-null column and then reading it with buffered_stream aborts the process

2020-07-12 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated PARQUET-1882:
--
Summary: [C++] Writing an all-null column and then reading it with 
buffered_stream aborts the process  (was: Writing an all-null column and then 
reading it with buffered_stream aborts the process)

> [C++] Writing an all-null column and then reading it with buffered_stream 
> aborts the process
> 
>
> Key: PARQUET-1882
> URL: https://issues.apache.org/jira/browse/PARQUET-1882
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
> Environment: Windows 10 64-bit, MSVC
>Reporter: Eric Gorelik
>Assignee: Micah Kornfield
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> When writing a column unbuffered that contains only nulls, a 0-byte 
> dictionary page gets written. When then reading the resulting file with 
> buffered_stream enabled, the column reader gets the length of the page (which 
> is 0), and then tries to read that many bytes from the underlying input 
> stream.
> parquet/column_reader.cc, SerializedPageReader::NextPage
>  
> {code:java}
> int compressed_len = current_page_header_.compressed_page_size;
> int uncompressed_len = current_page_header_.uncompressed_page_size;
> // Read the compressed data page.
> std::shared_ptr<Buffer> page_buffer;
> PARQUET_THROW_NOT_OK(stream_->Read(compressed_len, &page_buffer));{code}
>  
> BufferedInputStream::Read, however, has an assertion that the bytes to read 
> is strictly positive, so the assertion fails and aborts the process.
> arrow/io/buffered.cc, BufferedInputStream::Impl
>  
> {code:java}
> Status Read(int64_t nbytes, int64_t* bytes_read, void* out) {
>   ARROW_CHECK_GT(nbytes, 0);
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1839) [C++] values_read not updated in ReadBatchSpaced

2020-07-12 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1839.
---
Fix Version/s: cpp-1.6.0
   Resolution: Fixed

Issue resolved by pull request 7717
[https://github.com/apache/arrow/pull/7717]

> [C++] values_read not updated in ReadBatchSpaced 
> -
>
> Key: PARQUET-1839
> URL: https://issues.apache.org/jira/browse/PARQUET-1839
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Nileema Shingte
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> values_read is not updated in some cases in the 
> `TypedColumnReaderImpl::ReadBatchSpaced` API.
> We probably need to add 
> {code:java}
> *values_read = total_values;{code}
> after 
> [https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_reader.cc#L906]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1839) [C++] values_read not updated in ReadBatchSpaced

2020-07-12 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated PARQUET-1839:
--
Summary: [C++] values_read not updated in ReadBatchSpaced   (was: 
values_read not updated in ReadBatchSpaced )

> [C++] values_read not updated in ReadBatchSpaced 
> -
>
> Key: PARQUET-1839
> URL: https://issues.apache.org/jira/browse/PARQUET-1839
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Nileema Shingte
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> values_read is not updated in some cases in the 
> `TypedColumnReaderImpl::ReadBatchSpaced` API.
> We probably need to add 
> {code:java}
> *values_read = total_values;{code}
> after 
> [https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_reader.cc#L906]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1882) Writing an all-null column and then reading it with buffered_stream aborts the process

2020-07-09 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17154724#comment-17154724
 ] 

Wes McKinney commented on PARQUET-1882:
---

Can you provide a reproducible code example?

> Writing an all-null column and then reading it with buffered_stream aborts 
> the process
> --
>
> Key: PARQUET-1882
> URL: https://issues.apache.org/jira/browse/PARQUET-1882
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
> Environment: Windows 10 64-bit, MSVC
>Reporter: Eric Gorelik
>Priority: Critical
>
> When writing a column unbuffered that contains only nulls, a 0-byte 
> dictionary page gets written. When then reading the resulting file with 
> buffered_stream enabled, the column reader gets the length of the page (which 
> is 0), and then tries to read that many bytes from the underlying input 
> stream.
> parquet/column_reader.cc, SerializedPageReader::NextPage
>  
> {code:java}
> int compressed_len = current_page_header_.compressed_page_size;
> int uncompressed_len = current_page_header_.uncompressed_page_size;
> // Read the compressed data page.
> std::shared_ptr<Buffer> page_buffer;
> PARQUET_THROW_NOT_OK(stream_->Read(compressed_len, &page_buffer));{code}
>  
> BufferedInputStream::Read, however, has an assertion that the bytes to read 
> is strictly positive, so the assertion fails and aborts the process.
> arrow/io/buffered.cc, BufferedInputStream::Impl
>  
> {code:java}
> Status Read(int64_t nbytes, int64_t* bytes_read, void* out) {
>   ARROW_CHECK_GT(nbytes, 0);
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1878) [C++] lz4 codec is not compatible with Hadoop Lz4Codec

2020-06-18 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17139954#comment-17139954
 ] 

Wes McKinney commented on PARQUET-1878:
---

[~chairmank] can you also send an e-mail to dev@parquet.apache.org about this? 
We've been going around in circles on this LZ4 stuff and I think it's time that 
we fix this up once and for all across the implementations

cc [~apitrou] [~fsaintjacques] [~uwe] 

> [C++] lz4 codec is not compatible with Hadoop Lz4Codec
> --
>
> Key: PARQUET-1878
> URL: https://issues.apache.org/jira/browse/PARQUET-1878
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Steve M. Kim
>Priority: Major
>
> As described in HADOOP-12990, the Hadoop {{Lz4Codec}} uses the lz4 block 
> format, and it prepends 8 extra bytes before the compressed data. I believe 
> that the lz4 implementation in parquet-cpp also uses the lz4 block format, but 
> it does not prepend these 8 extra bytes.
>  
> Using Java parquet-mr, I wrote a Parquet file with lz4 compression:
> {code:java}
> $ parquet-tools meta 
> /tmp/f4a1c7f57cb1c98c2b9da3b25b16d027df5d2f1cf55adb79374c154fbd79011f
> file:
> file:/tmp/f4a1c7f57cb1c98c2b9da3b25b16d027df5d2f1cf55adb79374c154fbd79011f
> creator: parquet-mr version 1.10.1 (build 
> a89df8f9932b6ef6633d06069e50c9b7970bebd1)
> file schema:
> 
> c1:  REQUIRED INT64 R:0 D:0
> c0:  REQUIRED BINARY R:0 D:0
> v0:  REQUIRED INT64 R:0 D:0
> row group 1: RC:5007 TS:28028 OFFSET:4
> 
> c1:   INT64 LZ4 DO:0 FPO:4 SZ:24797/25694/1.04 VC:5007 
> ENC:DELTA_BINARY_PACKED ST:[min: 1566330126476659000, max: 
> 1571211622650188000, num_nulls: 0]
> c0:   BINARY LZ4 DO:0 FPO:24801 SZ:279/260/0.93 VC:5007 
> ENC:PLAIN,RLE_DICTIONARY ST:[min: 
> 0x7471732F62656566616C6F2F746F6D6163636F2D66782D6D6B74646174612D6C69766573747265616D,
>  max: 
> 0x7471732F62656566616C6F2F746F6D6163636F2D66782D6D6B74646174612D6C69766573747265616D,
>  num_nulls: 0]
> v0:   INT64 LZ4 DO:0 FPO:25080 SZ:1348/2074/1.54 VC:5007 
> ENC:PLAIN,RLE_DICTIONARY ST:[min: 0, max: 9, num_nulls: 0] {code}
> When I attempted to read this file with parquet-cpp, I got the following 
> error:
> {code:java}
> >>> import pyarrow.parquet as pq
> >>> pq.read_table('/tmp/f4a1c7f57cb1c98c2b9da3b25b16d027df5d2f1cf55adb79374c154fbd79011f')
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/home/skim/miniconda3/envs/arrow/lib/python3.8/site-packages/pyarrow/parquet.py",
>  line 1536, in read_table
> return pf.read(columns=columns, use_threads=use_threads,
>   File 
> "/home/skim/miniconda3/envs/arrow/lib/python3.8/site-packages/pyarrow/parquet.py",
>  line 1260, in read
> table = piece.read(columns=columns, use_threads=use_threads,
>   File 
> "/home/skim/miniconda3/envs/arrow/lib/python3.8/site-packages/pyarrow/parquet.py",
>  line 707, in read
> table = reader.read(**options)
>   File 
> "/home/skim/miniconda3/envs/arrow/lib/python3.8/site-packages/pyarrow/parquet.py",
>  line 336, in read
> return self.reader.read_all(column_indices=column_indices,
>   File "pyarrow/_parquet.pyx", line 1130, in 
> pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> OSError: IOError: Corrupt Lz4 compressed data. {code}
>  
> [https://github.com/apache/arrow/issues/3491] reported incompatibility in the 
> other direction, using Spark (which uses the Hadoop lz4 codec) to read a 
> parquet file that was written with parquet-cpp.
>  
> Given that the Hadoop lz4 codec has long been in use, and users have 
> accumulated Parquet files that were written with this implementation, I 
> propose changing parquet-cpp to match the Hadoop implementation.
>  
> See also:
>  * 
> https://issues.apache.org/jira/browse/PARQUET-1241?focusedCommentId=16574328#comment-16574328
>  * 
> https://issues.apache.org/jira/browse/PARQUET-1241?focusedCommentId=16585288#comment-16585288



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1241) [C++] Use LZ4 frame format

2020-06-18 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated PARQUET-1241:
--
Fix Version/s: cpp-1.6.0

> [C++] Use LZ4 frame format
> --
>
> Key: PARQUET-1241
> URL: https://issues.apache.org/jira/browse/PARQUET-1241
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp, parquet-format
>Reporter: Lawrence Chan
>Priority: Major
> Fix For: cpp-1.6.0
>
>
> The parquet-format spec doesn't currently specify whether lz4-compressed data 
> should be framed or not. We should choose one and make it explicit in the 
> spec, as the two are not interoperable. After some discussions with others [1], 
> we think it would be beneficial to use the framed format, which adds a small 
> header in exchange for more self-contained decompression as well as a richer 
> feature set (checksums, parallel decompression, etc).
> The current arrow implementation compresses using the lz4 block format, and 
> this would need to be updated when we add the spec clarification.
> If backwards compatibility is a concern, I would suggest adding an additional 
> LZ4_FRAMED compression type, but that may be more noise than anything.
> [1] https://github.com/dask/fastparquet/issues/314



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
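
The distinction in the description, sketched with the third-party python lz4
package: the frame format is self-describing (magic number, descriptor,
checksums), while a bare block cannot even be decompressed without knowing the
original size out of band:

{code:python}
import lz4.block
import lz4.frame

data = b"parquet" * 100
blk = lz4.block.compress(data, store_size=False)  # bare block: no header at all
frm = lz4.frame.compress(data)                    # frame: magic + descriptor + checksum

print(frm[:4].hex())                              # 04224d18: frame magic 0x184D2204
assert lz4.frame.decompress(frm) == data          # self-contained
# the block format needs the uncompressed size supplied externally
assert lz4.block.decompress(blk, uncompressed_size=len(data)) == data
{code}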


[jira] [Resolved] (PARQUET-1877) [C++] Reconcile container size with string size for memory issues

2020-06-17 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1877.
---
Fix Version/s: cpp-1.6.0
   Resolution: Fixed

Issue resolved by pull request 7465
[https://github.com/apache/arrow/pull/7465]

> [C++] Reconcile container size with string size for memory issues
> -
>
> Key: PARQUET-1877
> URL: https://issues.apache.org/jira/browse/PARQUET-1877
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Right now the size can cause allocations an order of magnitude larger than 
> the string size limits.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1859) [C++] Require error message when using ParquetException::EofException

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned PARQUET-1859:
-

Assignee: (was: Wes McKinney)

> [C++] Require error message when using ParquetException::EofException
> -
>
> Key: PARQUET-1859
> URL: https://issues.apache.org/jira/browse/PARQUET-1859
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Major
> Fix For: cpp-1.6.0
>
>
> "Unexpected end of stream" (the defaults) gives no clue where the failure 
> occurred



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1385) [C++] bloom_filter-test is very slow under valgrind

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned PARQUET-1385:
-

Assignee: (was: Wes McKinney)

> [C++] bloom_filter-test is very slow under valgrind
> ---
>
> Key: PARQUET-1385
> URL: https://issues.apache.org/jira/browse/PARQUET-1385
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>
> This test takes ~5 minutes to run under valgrind in Travis CI
> {code}
> 1: [==] Running 6 tests from 6 test cases.
> 1: [--] Global test environment set-up.
> 1: [--] 1 test from Murmur3Test
> 1: [ RUN  ] Murmur3Test.TestBloomFilter
> 1: [   OK ] Murmur3Test.TestBloomFilter (19 ms)
> 1: [--] 1 test from Murmur3Test (34 ms total)
> 1: 
> 1: [--] 1 test from ConstructorTest
> 1: [ RUN  ] ConstructorTest.TestBloomFilter
> 1: [   OK ] ConstructorTest.TestBloomFilter (101 ms)
> 1: [--] 1 test from ConstructorTest (101 ms total)
> 1: 
> 1: [--] 1 test from BasicTest
> 1: [ RUN  ] BasicTest.TestBloomFilter
> 1: [   OK ] BasicTest.TestBloomFilter (49 ms)
> 1: [--] 1 test from BasicTest (49 ms total)
> 1: 
> 1: [--] 1 test from FPPTest
> 1: [ RUN  ] FPPTest.TestBloomFilter
> 1: [   OK ] FPPTest.TestBloomFilter (308731 ms)
> 1: [--] 1 test from FPPTest (308741 ms total)
> 1: 
> 1: [--] 1 test from CompatibilityTest
> 1: [ RUN  ] CompatibilityTest.TestBloomFilter
> 1: [   OK ] CompatibilityTest.TestBloomFilter (62 ms)
> 1: [--] 1 test from CompatibilityTest (62 ms total)
> 1: 
> 1: [--] 1 test from OptimalValueTest
> 1: [ RUN  ] OptimalValueTest.TestBloomFilter
> 1: [   OK ] OptimalValueTest.TestBloomFilter (27 ms)
> 1: [--] 1 test from OptimalValueTest (27 ms total)
> 1: 
> 1: [--] Global test environment tear-down
> 1: [==] 6 tests from 6 test cases ran. (309081 ms total)
> 1: [  PASSED  ] 6 tests.
> {code}
> Either we should change the FPPTest parameters to be faster, or we should not 
> run that test when using valgrind.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1352) [CPP] Trying to write an arrow table with structs to a parquet file

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned PARQUET-1352:
-

Assignee: (was: Wes McKinney)

> [CPP] Trying to write an arrow table with structs to a parquet file
> ---
>
> Key: PARQUET-1352
> URL: https://issues.apache.org/jira/browse/PARQUET-1352
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Affects Versions: cpp-1.4.0
>Reporter: Dragan Markovic
>Priority: Major
>
> Relevant issue:[https://github.com/apache/arrow/issues/2287]
>  
> I'm creating a struct with the following schema in arrow: 
> https://pastebin.com/Cc8nreBP
>  
> When I try to convert that table to a .parquet file, the file gets created 
> with a valid schema (the one I posted above) and then throws this exception: 
> "NotImplemented: Level generation for Struct not supported yet".
>  
> Here's the code: [https://ideone.com/DJkKUF]
>  
> Is there any way to write an arrow table of structs to a .parquet file in cpp? 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
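
For readers hitting this today: struct writing landed in later releases, so the
equivalent round trip now works from Python (a sketch; at cpp-1.4.0 this raised
the NotImplemented error quoted above):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

struct_type = pa.struct([("x", pa.int64()), ("y", pa.string())])
arr = pa.array([{"x": 1, "y": "a"}, {"x": 2, "y": "b"}], type=struct_type)

pq.write_table(pa.table({"s": arr}), "structs.parquet")
print(pq.read_table("structs.parquet").column("s"))
{code}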


[jira] [Assigned] (PARQUET-1385) [C++] bloom_filter-test is very slow under valgrind

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned PARQUET-1385:
-

Assignee: (was: Wes McKinney)

> [C++] bloom_filter-test is very slow under valgrind
> ---
>
> Key: PARQUET-1385
> URL: https://issues.apache.org/jira/browse/PARQUET-1385
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>
> This test takes ~5 minutes to run under valgrind in Travis CI
> {code}
> 1: [==] Running 6 tests from 6 test cases.
> 1: [--] Global test environment set-up.
> 1: [--] 1 test from Murmur3Test
> 1: [ RUN  ] Murmur3Test.TestBloomFilter
> 1: [   OK ] Murmur3Test.TestBloomFilter (19 ms)
> 1: [--] 1 test from Murmur3Test (34 ms total)
> 1: 
> 1: [--] 1 test from ConstructorTest
> 1: [ RUN  ] ConstructorTest.TestBloomFilter
> 1: [   OK ] ConstructorTest.TestBloomFilter (101 ms)
> 1: [--] 1 test from ConstructorTest (101 ms total)
> 1: 
> 1: [--] 1 test from BasicTest
> 1: [ RUN  ] BasicTest.TestBloomFilter
> 1: [   OK ] BasicTest.TestBloomFilter (49 ms)
> 1: [--] 1 test from BasicTest (49 ms total)
> 1: 
> 1: [--] 1 test from FPPTest
> 1: [ RUN  ] FPPTest.TestBloomFilter
> 1: [   OK ] FPPTest.TestBloomFilter (308731 ms)
> 1: [--] 1 test from FPPTest (308741 ms total)
> 1: 
> 1: [--] 1 test from CompatibilityTest
> 1: [ RUN  ] CompatibilityTest.TestBloomFilter
> 1: [   OK ] CompatibilityTest.TestBloomFilter (62 ms)
> 1: [--] 1 test from CompatibilityTest (62 ms total)
> 1: 
> 1: [--] 1 test from OptimalValueTest
> 1: [ RUN  ] OptimalValueTest.TestBloomFilter
> 1: [   OK ] OptimalValueTest.TestBloomFilter (27 ms)
> 1: [--] 1 test from OptimalValueTest (27 ms total)
> 1: 
> 1: [--] Global test environment tear-down
> 1: [==] 6 tests from 6 test cases ran. (309081 ms total)
> 1: [  PASSED  ] 6 tests.
> {code}
> Either we should change the FPPTest parameters to be faster, or we should not 
> run that test when using valgrind.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-838) [CPP] Unable to read files written by parquet-cpp from parquet-tools

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned PARQUET-838:


Assignee: (was: Wes McKinney)

> [CPP] Unable to read files written by parquet-cpp from parquet-tools
> 
>
> Key: PARQUET-838
> URL: https://issues.apache.org/jira/browse/PARQUET-838
> Project: Parquet
>  Issue Type: Bug
>Reporter: Deepak Majeti
>Priority: Major
> Attachments: parquet_cpp_example.parquet
>
>
> I could not read files written by parquet-cpp from parquet-tools and Hive.
> Setting field ids in the schema metadata seems to be the problem. We should 
> make setting the field_id optional.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1385) [C++] bloom_filter-test is very slow under valgrind

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned PARQUET-1385:
-

Assignee: Wes McKinney

> [C++] bloom_filter-test is very slow under valgrind
> ---
>
> Key: PARQUET-1385
> URL: https://issues.apache.org/jira/browse/PARQUET-1385
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>
> This test takes ~5 minutes to run under valgrind in Travis CI
> {code}
> 1: [==] Running 6 tests from 6 test cases.
> 1: [--] Global test environment set-up.
> 1: [--] 1 test from Murmur3Test
> 1: [ RUN  ] Murmur3Test.TestBloomFilter
> 1: [   OK ] Murmur3Test.TestBloomFilter (19 ms)
> 1: [--] 1 test from Murmur3Test (34 ms total)
> 1: 
> 1: [--] 1 test from ConstructorTest
> 1: [ RUN  ] ConstructorTest.TestBloomFilter
> 1: [   OK ] ConstructorTest.TestBloomFilter (101 ms)
> 1: [--] 1 test from ConstructorTest (101 ms total)
> 1: 
> 1: [--] 1 test from BasicTest
> 1: [ RUN  ] BasicTest.TestBloomFilter
> 1: [   OK ] BasicTest.TestBloomFilter (49 ms)
> 1: [--] 1 test from BasicTest (49 ms total)
> 1: 
> 1: [--] 1 test from FPPTest
> 1: [ RUN  ] FPPTest.TestBloomFilter
> 1: [   OK ] FPPTest.TestBloomFilter (308731 ms)
> 1: [--] 1 test from FPPTest (308741 ms total)
> 1: 
> 1: [--] 1 test from CompatibilityTest
> 1: [ RUN  ] CompatibilityTest.TestBloomFilter
> 1: [   OK ] CompatibilityTest.TestBloomFilter (62 ms)
> 1: [--] 1 test from CompatibilityTest (62 ms total)
> 1: 
> 1: [--] 1 test from OptimalValueTest
> 1: [ RUN  ] OptimalValueTest.TestBloomFilter
> 1: [   OK ] OptimalValueTest.TestBloomFilter (27 ms)
> 1: [--] 1 test from OptimalValueTest (27 ms total)
> 1: 
> 1: [--] Global test environment tear-down
> 1: [==] 6 tests from 6 test cases ran. (309081 ms total)
> 1: [  PASSED  ] 6 tests.
> {code}
> Either we should change the FPPTest parameters to be faster, or we should not 
> run that test when using valgrind.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-443) Schema resolution: map encoding

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned PARQUET-443:


Assignee: (was: Wes McKinney)

> Schema resolution: map encoding
> ---
>
> Key: PARQUET-443
> URL: https://issues.apache.org/jira/browse/PARQUET-443
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Major
>
> Related: PARQUET-441 and PARQUET-442



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-441) Schema resolution: one, two, and three-level array encoding

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned PARQUET-441:


Assignee: (was: Wes McKinney)

> Schema resolution: one, two, and three-level array encoding
> ---
>
> Key: PARQUET-441
> URL: https://issues.apache.org/jira/browse/PARQUET-441
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Major
>
> While the Parquet spec recommends the "three-level" array encoding, two other 
> styles are possible in the wild, see for example:
> https://github.com/cloudera/Impala/blob/cdh5-trunk/be/src/exec/hdfs-parquet-scanner.cc#L1986



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
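
For orientation, the recommended three-level form wraps each list in a repeated
group. Writing a list column from pyarrow and printing the resulting Parquet
schema shows the shape (output abbreviated):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

pq.write_table(pa.table({"a": pa.array([[1, 2], [3]])}), "lists.parquet")
print(pq.read_metadata("lists.parquet").schema)
# level 1: optional group a (List)
# level 2:   repeated group list
# level 3:     optional int64 element
{code}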


[jira] [Commented] (PARQUET-1869) [C++] Large decimal values don't roundtrip correctly

2020-06-02 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17123699#comment-17123699
 ] 

Wes McKinney commented on PARQUET-1869:
---

I'm pretty sure this is a problem with conversion from Arrow format to the 
Parquet fixed-size-binary storage representation, so we might move this issue to 
the ARROW issue tracker. Either way we should definitely try to fix this before 
the next major Arrow release.

> [C++] Large decimal values don't roundtrip correctly
> 
>
> Key: PARQUET-1869
> URL: https://issues.apache.org/jira/browse/PARQUET-1869
> Project: Parquet
>  Issue Type: Test
>  Components: parquet-cpp
>Reporter: Joris Van den Bossche
>Priority: Major
>
> Reproducer with python:
> {code}
> import decimal
> import pyarrow as pa
> import pyarrow.parquet as pq
> arr = pa.array([decimal.Decimal('9223372036854775808'), 
> decimal.Decimal('1.111')])
> print(arr)
> pq.write_table(pa.table({'a': arr}), "test_decimal.parquet") 
> result = pq.read_table("test_decimal.parquet")
> print(result.column('a'))
> {code}
> gives
> {code}
> # before writing
> 
> [
>   9223372036854775808.000,
>   1.111
> ]
> # after reading
> 
> [
>   [
>     -221360928884514619.392,
>     1.111
>   ]
> ]
> {code}
> I tried reading the file with a different parquet implementation (fastparquet 
> python package), and that gives the same values on read, so the issue is more 
> likely on the write side.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
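
The corrupted value is consistent with high-order bytes being dropped when the
unscaled integer is packed into the fixed-size-binary width. A worked
illustration of that hypothesis (the exact truncation point is an assumption,
not a confirmed diagnosis):

{code:python}
# decimal128(22, 3): the unscaled value of 9223372036854775808.000 is
# 2**63 * 1000 == 500 * 2**64, which needs 10 bytes of two's complement.
unscaled = 9223372036854775808 * 1000
width = 10    # bytes required for precision 22
dropped = 1   # hypothetical: one high-order byte lost on the write path

b = unscaled.to_bytes(width, "big", signed=True)
wrapped = int.from_bytes(b[dropped:], "big", signed=True)
print(wrapped / 1000)  # -2.2136092888451462e+17, i.e. -221360928884514619.392
{code}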


[jira] [Assigned] (PARQUET-1855) [C++] Improve documentation on MetaData ownership

2020-05-24 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned PARQUET-1855:
-

Assignee: Francois Saint-Jacques

> [C++] Improve documentation on MetaData ownership
> -
>
> Key: PARQUET-1855
> URL: https://issues.apache.org/jira/browse/PARQUET-1855
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> I had to look at the implementation to understand what the lifetime 
> relationships are for the following objects:
> * FileMetaData
> * RowGroupMetaData
> * ColumnChunkMetaData
> From what I gather, a reference to the top-level FileMetaData must be held 
> for the lifetime of any of the child objects (RowGroupMetaData and 
> ColumnChunkMetaData). It is unclear whether the original buffer from which the 
> metadata was deserialized must be held for the lifetime of the FileMetaData 
> object; I suspect it does not need to be kept.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1855) [C++] Improve documentation on MetaData ownership

2020-05-24 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1855.
---
Fix Version/s: cpp-1.6.0
   Resolution: Fixed

Issue resolved by pull request 7244
[https://github.com/apache/arrow/pull/7244]

> [C++] Improve documentation on MetaData ownership
> -
>
> Key: PARQUET-1855
> URL: https://issues.apache.org/jira/browse/PARQUET-1855
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> I had to look at the implementation to understand what the lifetime 
> relationships are for the following objects:
> * FileMetaData
> * RowGroupMetaData
> * ColumnChunkMetaData
> From what I gather, a reference to the top-level FileMetaData must be held 
> for the lifetime of any of the child objects (RowGroupMetaData and 
> ColumnChunkMetaData). It is unclear whether the original buffer from which the 
> metadata was deserialized must be held for the lifetime of the FileMetaData 
> object; I suspect it does not need to be kept.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1861) [Documentation][C++] Explain ReaderProperties.buffer_stream*

2020-05-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1861.
---
Fix Version/s: cpp-1.6.0
   Resolution: Fixed

Issue resolved by pull request 7221
[https://github.com/apache/arrow/pull/7221]

> [Documentation][C++] Explain ReaderProperties.buffer_stream*
> 
>
> Key: PARQUET-1861
> URL: https://issues.apache.org/jira/browse/PARQUET-1861
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1865) [C++] Failure from C++17 feature used in parquet/encoding_benchmark.cc

2020-05-20 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1865.
---
Fix Version/s: cpp-1.6.0
   Resolution: Fixed

Issue resolved by pull request 7237
[https://github.com/apache/arrow/pull/7237]

> [C++] Failure from C++17 feature used in parquet/encoding_benchmark.cc
> --
>
> Key: PARQUET-1865
> URL: https://issues.apache.org/jira/browse/PARQUET-1865
> Project: Parquet
>  Issue Type: Bug
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> {code}
> ir/encoding_benchmark.cc.o -c ../src/parquet/encoding_benchmark.cc
> ../src/parquet/encoding_benchmark.cc:242:53: error: static_assert with no 
> message is a C++17 extension [-Werror,-Wc++17-extensions]
>   static_assert(sizeof(CType) == sizeof(*raw_values));
> ^
> , ""
> ../src/parquet/encoding_benchmark.cc:286:53: error: static_assert with no 
> message is a C++17 extension [-Werror,-Wc++17-extensions]
>   static_assert(sizeof(CType) == sizeof(*raw_values));
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1865) [C++] Failure from C++17 feature used in parquet/encoding_benchmark.cc

2020-05-20 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned PARQUET-1865:
-

Assignee: Wes McKinney

> [C++] Failure from C++17 feature used in parquet/encoding_benchmark.cc
> --
>
> Key: PARQUET-1865
> URL: https://issues.apache.org/jira/browse/PARQUET-1865
> Project: Parquet
>  Issue Type: Bug
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>
> {code}
> ir/encoding_benchmark.cc.o -c ../src/parquet/encoding_benchmark.cc
> ../src/parquet/encoding_benchmark.cc:242:53: error: static_assert with no 
> message is a C++17 extension [-Werror,-Wc++17-extensions]
>   static_assert(sizeof(CType) == sizeof(*raw_values));
> ^
> , ""
> ../src/parquet/encoding_benchmark.cc:286:53: error: static_assert with no 
> message is a C++17 extension [-Werror,-Wc++17-extensions]
>   static_assert(sizeof(CType) == sizeof(*raw_values));
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-1865) [C++] Failure from C++17 feature used in parquet/encoding_benchmark.cc

2020-05-20 Thread Wes McKinney (Jira)
Wes McKinney created PARQUET-1865:
-

 Summary: [C++] Failure from C++17 feature used in 
parquet/encoding_benchmark.cc
 Key: PARQUET-1865
 URL: https://issues.apache.org/jira/browse/PARQUET-1865
 Project: Parquet
  Issue Type: Bug
Reporter: Wes McKinney


{code}
ir/encoding_benchmark.cc.o -c ../src/parquet/encoding_benchmark.cc
../src/parquet/encoding_benchmark.cc:242:53: error: static_assert with no 
message is a C++17 extension [-Werror,-Wc++17-extensions]
  static_assert(sizeof(CType) == sizeof(*raw_values));
^
, ""
../src/parquet/encoding_benchmark.cc:286:53: error: static_assert with no 
message is a C++17 extension [-Werror,-Wc++17-extensions]
  static_assert(sizeof(CType) == sizeof(*raw_values));
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1861) [Documentation][C++] Explain ReaderProperties.buffer_stream*

2020-05-08 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated PARQUET-1861:
--
Summary: [Documentation][C++] Explain ReaderProperties.buffer_stream*  
(was: [Documentation] Explain ReaderProperties.buffer_stream*)

> [Documentation][C++] Explain ReaderProperties.buffer_stream*
> 
>
> Key: PARQUET-1861
> URL: https://issues.apache.org/jira/browse/PARQUET-1861
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1857) [C++][Parquet] ParquetFileReader unable to read files with more than 32767 row groups

2020-05-06 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1857.
---
Fix Version/s: cpp-1.6.0
   Resolution: Fixed

Issue resolved by pull request 7108
[https://github.com/apache/arrow/pull/7108]

> [C++][Parquet] ParquetFileReader unable to read files with more than 32767 
> row groups
> -
>
> Key: PARQUET-1857
> URL: https://issues.apache.org/jira/browse/PARQUET-1857
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Novice
>Assignee: Wes McKinney
>Priority: Critical
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
> Attachments: test.parquet.tgz, test_2.parquet.tgz
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> I am using Rust to write a Parquet file and read it from Python.
> When calling write_batch with a batch size of 1, reading the Parquet file from 
> Python gives the error below:
> ```
> >>> pd.read_parquet("some.parquet", engine="pyarrow")
>  Traceback (most recent call last):
>  File "", line 1, in 
>  File "/home//.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 
> 296, in read_parquet
>  return impl.read(path, columns=columns, **kwargs)
>  File "/home//.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 
> 125, in read
>  path, columns=columns, **kwargs
>  File 
> "/home//miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 1537, in read_table
>  use_pandas_metadata=use_pandas_metadata)
>  File 
> "/home//miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 1262, in read
>  use_pandas_metadata=use_pandas_metadata)
>  File 
> "/home//miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 707, in read
>  table = reader.read(**options)
>  File 
> "/home//miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 337, in read
>  use_threads=use_threads)
>  File "pyarrow/_parquet.pyx", line 1130, in 
> pyarrow._parquet.ParquetReader.read_all
>  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
>  OSError: Unexpected end of stream
> ```
> Also, when using batch size 1 and then reading from Python, there is an error too: 
> ```
> >>> pd.read_parquet("some.parquet", engine="pyarrow")
>  Traceback (most recent call last):
>  File "", line 1, in 
>  File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 
> 296, in read_parquet
>  return impl.read(path, columns=columns, **kwargs)
>  File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 
> 125, in read
>  path, columns=columns, **kwargs
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 1537, in read_table
>  use_pandas_metadata=use_pandas_metadata)
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 1262, in read
>  use_pandas_metadata=use_pandas_metadata)
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 707, in read
>  table = reader.read(**options)
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 337, in read
>  use_threads=use_threads)
>  File "pyarrow/_parquet.pyx", line 1130, in 
> pyarrow._parquet.ParquetReader.read_all
>  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
>  OSError: The file only has 0 columns, requested metadata for column: 6
> ```
> Using a batch size of 1000 is fine.
> Note that my data has 450047 rows. Schema:
> ```
> message schema {
>   REQUIRED INT32 a;
>   REQUIRED INT32 b;
>   REQUIRED INT32 c;
>   REQUIRED INT64 d;
>   REQUIRED INT32 e;
>   REQUIRED BYTE_ARRAY f (UTF8);
>   REQUIRED BOOLEAN g;
> }
> ```
>  
> EDIT: as I add more rows (an estimated 80 million), using batch size 1000 does 
> not work either:
> ```
> >>> df = pd.read_parquet("data/ping_pong.parquet", engine="pyarrow")
>  Traceback (most recent call last):
>  File "", line 1, in 
>  File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 
> 296, in read_parquet
>  return impl.read(path, columns=columns, **kwargs)
>  File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 
> 125, in read
>  path, columns=columns, **kwargs
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 1537, in read_table
>  use_pandas_metadata=use_pandas_metadata)
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 1262, in read
>  use_pandas_metadata=use_pandas_metadata)
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 707, in read
>  table = reader.read(**options)
>  File 
> 
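
The 32767 in the title is INT16_MAX: at a batch size of 1 this 450047-row file
gets one row group per row, far more than a 16-bit row-group ordinal can count.
A quick sketch of the wraparound (illustrating the arithmetic, not the reader
code):

{code:python}
import ctypes

row_groups = 450047                 # one per row at batch size 1
print(row_groups > 32767)           # True: exceeds INT16_MAX
print(ctypes.c_int16(32768).value)  # -32768: a 16-bit ordinal wraps negative
{code}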

[jira] [Commented] (PARQUET-1858) [Python] [Rust] Parquet read file fails with batch size 1_000_000 and 41 row groups

2020-05-06 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17100758#comment-17100758
 ] 

Wes McKinney commented on PARQUET-1858:
---

Yes, it looks like the file written by Rust is malformed. That two independent 
implementations fail is good evidence of that.

> [Python] [Rust] Parquet read file fails with batch size 1_000_000 and 41 row 
> groups
> ---
>
> Key: PARQUET-1858
> URL: https://issues.apache.org/jira/browse/PARQUET-1858
> Project: Parquet
>  Issue Type: Bug
>Reporter: Novice
>Priority: Major
> Attachments: test_2.parquet.tgz
>
>
> Here is the error I got:
> Pyarrow:
> ```
> >>> df = pd.read_parquet("test.parquet", engine="pyarrow")
>  Traceback (most recent call last):
>  File "", line 1, in 
>  File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 
> 296, in read_parquet
>  return impl.read(path, columns=columns, **kwargs)
>  File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 
> 125, in read
>  path, columns=columns, **kwargs
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 1281, in read_table
>  use_pandas_metadata=use_pandas_metadata)
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 1137, in read
>  use_pandas_metadata=use_pandas_metadata)
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 605, in read
>  table = reader.read(**options)
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 253, in read
>  use_threads=use_threads)
>  File "pyarrow/_parquet.pyx", line 1136, in 
> pyarrow._parquet.ParquetReader.read_all
>  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
>  OSError: Unexpected end of stream
> ```
> fastparquet:
> ```
>  >>> df = pd.read_parquet("test.parquet", engine="fastparquet")
>  
> /home/miniconda3/envs/ds/lib/python3.7/site-packages/fastparquet/encoding.py:222:
>  NumbaDeprecationWarning: The 'numba.jitclass' decorator has moved to 
> 'numba.experimental.jitclass' to better reflect the experimental nature of 
> the functionality. Please update your imports to accommodate this change and 
> see 
> [http://numba.pydata.org/numba-doc/latest/reference/deprecation.html#change-of-jitclass-location]
>  for the time frame.
>  Numpy8 = numba.jitclass(spec8)(NumpyIO)
>  
> /home/miniconda3/envs/ds/lib/python3.7/site-packages/fastparquet/encoding.py:224:
>  NumbaDeprecationWarning: The 'numba.jitclass' decorator has moved to 
> 'numba.experimental.jitclass' to better reflect the experimental nature of 
> the functionality. Please update your imports to accommodate this change and 
> see 
> [http://numba.pydata.org/numba-doc/latest/reference/deprecation.html#change-of-jitclass-location]
>  for the time frame.
>  Numpy32 = numba.jitclass(spec32)(NumpyIO)
>  Traceback (most recent call last):
>  File "", line 1, in 
>  File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 
> 296, in read_parquet
>  return impl.read(path, columns=columns, **kwargs)
>  File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 
> 201, in read
>  return parquet_file.to_pandas(columns=columns, **kwargs)
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/fastparquet/api.py", 
> line 399, in to_pandas
>  index=index, assign=parts)
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/fastparquet/api.py", 
> line 228, in read_row_group
>  scheme=self.file_scheme)
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/fastparquet/core.py", 
> line 354, in read_row_group
>  cats, selfmade, assign=assign)
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/fastparquet/core.py", 
> line 331, in read_row_group_arrays
>  catdef=out.get(name+'-catdef', None))
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/fastparquet/core.py", 
> line 245, in read_col
>  skip_nulls, selfmade=selfmade)
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/fastparquet/core.py", 
> line 99, in read_data_page
>  raw_bytes = _read_page(f, header, metadata)
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/fastparquet/core.py", 
> line 31, in _read_page
>  page_header.uncompressed_page_size)
>  AssertionError: found 120016208 raw bytes (expected None)
> ```
> The corresponding Rust code is:
> ```
> use parquet::{
>     column::writer::ColumnWriter::BoolColumnWriter,
>     column::writer::ColumnWriter::Int32ColumnWriter,
>     file::{
>         properties::WriterProperties,
>         writer::{FileWriter, SerializedFileWriter},
>     },
>     schema::parser::parse_message_type,
> };
> use std::{fs, rc::Rc};
> fn main() {
>     let schema = "

[jira] [Assigned] (PARQUET-1859) [C++] Require error message when using ParquetException::EofException

2020-05-05 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned PARQUET-1859:
-

Assignee: Wes McKinney

> [C++] Require error message when using ParquetException::EofException
> -
>
> Key: PARQUET-1859
> URL: https://issues.apache.org/jira/browse/PARQUET-1859
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: cpp-1.6.0
>
>
> "Unexpected end of stream" (the defaults) gives no clue where the failure 
> occurred



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1858) [Python] [Rust] Parquet read file fails with batch size 1_000_000 and 41 row groups

2020-05-05 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17100260#comment-17100260
 ] 

Wes McKinney commented on PARQUET-1858:
---

The PLAIN encoding for the boolean type is possibly malformed. I opened 
PARQUET-1859 about providing better error messages, but here is what the 
failure looks like:

{code}
$ python test.py 
Traceback (most recent call last):
  File "test.py", line 7, in 
pq.read_table(path)
  File "/home/wesm/code/arrow/python/pyarrow/parquet.py", line 1539, in 
read_table
use_pandas_metadata=use_pandas_metadata)
  File "/home/wesm/code/arrow/python/pyarrow/parquet.py", line 1264, in read
use_pandas_metadata=use_pandas_metadata)
  File "/home/wesm/code/arrow/python/pyarrow/parquet.py", line 707, in read
table = reader.read(**options)
  File "/home/wesm/code/arrow/python/pyarrow/parquet.py", line 337, in read
use_threads=use_threads)
  File "pyarrow/_parquet.pyx", line 1130, in 
pyarrow._parquet.ParquetReader.read_all
check_status(self.reader.get()
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
raise IOError(message)
OSError: Unexpected end of stream: Failed to decode 100 bits for boolean 
PLAIN encoding only decoded 2048
In ../src/parquet/arrow/reader.cc, line 844, code: final_status
{code}

Can this file be read by the Java library?
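For context, PLAIN-encoded booleans are bit-packed one bit per value, so a page 
must carry at least ceil(num_values / 8) bytes of payload. A minimal sketch of 
the consistency check implied by the error above (names are illustrative, not 
the reader internals):

{code:c++}
#include <cstdint>
#include <stdexcept>
#include <string>

// Sketch: a PLAIN boolean page holding `available_bytes` of payload can
// supply at most available_bytes * 8 values. Asking for more means the page
// header and the payload disagree, i.e. the file is malformed.
void CheckPlainBooleanPage(int64_t num_values, int64_t available_bytes) {
  const int64_t required_bytes = (num_values + 7) / 8;  // one bit per value
  if (available_bytes < required_bytes) {
    throw std::runtime_error(
        "Unexpected end of stream: needed " + std::to_string(required_bytes) +
        " bytes for " + std::to_string(num_values) + " boolean values, have " +
        std::to_string(available_bytes));
  }
}
{code}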

> [Python] [Rust] Parquet read file fails with batch size 1_000_000 and 41 row 
> groups
> ---
>
> Key: PARQUET-1858
> URL: https://issues.apache.org/jira/browse/PARQUET-1858
> Project: Parquet
>  Issue Type: Bug
>Reporter: Novice
>Priority: Major
> Attachments: test_2.parquet.tgz
>
>
> Here is the error I got:
> Pyarrow:
> ```
> >>> df = pd.read_parquet("test.parquet", engine="pyarrow")
>  Traceback (most recent call last):
>  File "", line 1, in 
>  File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 
> 296, in read_parquet
>  return impl.read(path, columns=columns, **kwargs)
>  File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 
> 125, in read
>  path, columns=columns, **kwargs
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 1281, in read_table
>  use_pandas_metadata=use_pandas_metadata)
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 1137, in read
>  use_pandas_metadata=use_pandas_metadata)
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 605, in read
>  table = reader.read(**options)
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 253, in read
>  use_threads=use_threads)
>  File "pyarrow/_parquet.pyx", line 1136, in 
> pyarrow._parquet.ParquetReader.read_all
>  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
>  OSError: Unexpected end of stream
> ```
> fastparquet:
> ```
>  >>> df = pd.read_parquet("test.parquet", engine="fastparquet")
>  
> /home/miniconda3/envs/ds/lib/python3.7/site-packages/fastparquet/encoding.py:222:
>  NumbaDeprecationWarning: The 'numba.jitclass' decorator has moved to 
> 'numba.experimental.jitclass' to better reflect the experimental nature of 
> the functionality. Please update your imports to accommodate this change and 
> see 
> [http://numba.pydata.org/numba-doc/latest/reference/deprecation.html#change-of-jitclass-location]
>  for the time frame.
>  Numpy8 = numba.jitclass(spec8)(NumpyIO)
>  
> /home/miniconda3/envs/ds/lib/python3.7/site-packages/fastparquet/encoding.py:224:
>  NumbaDeprecationWarning: The 'numba.jitclass' decorator has moved to 
> 'numba.experimental.jitclass' to better reflect the experimental nature of 
> the functionality. Please update your imports to accommodate this change and 
> see 
> [http://numba.pydata.org/numba-doc/latest/reference/deprecation.html#change-of-jitclass-location]
>  for the time frame.
>  Numpy32 = numba.jitclass(spec32)(NumpyIO)
>  Traceback (most recent call last):
>  File "", line 1, in 
>  File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 
> 296, in read_parquet
>  return impl.read(path, columns=columns, **kwargs)
>  File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 
> 201, in read
>  return parquet_file.to_pandas(columns=columns, **kwargs)
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/fastparquet/api.py", 
> line 399, in to_pandas
>  index=index, assign=parts)
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/fastparquet/api.py", 
> line 228, in read_row_group
>  scheme=self.file_scheme)
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/fastparquet/core.py", 
> line 354, in read_row_group
>  cats, selfmade, assign=assign)
>  File 

[jira] [Created] (PARQUET-1859) [C++] Require error message when using ParquetException::EofException

2020-05-05 Thread Wes McKinney (Jira)
Wes McKinney created PARQUET-1859:
-

 Summary: [C++] Require error message when using 
ParquetException::EofException
 Key: PARQUET-1859
 URL: https://issues.apache.org/jira/browse/PARQUET-1859
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-cpp
Reporter: Wes McKinney
 Fix For: cpp-1.6.0


"Unexpected end of stream" (the defaults) gives no clue where the failure 
occurred
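
A minimal sketch of the intent, assuming a small helper that callers must feed 
context (names are illustrative, not the final API):

{code:c++}
#include <cstdint>
#include <sstream>
#include <stdexcept>
#include <string>

// Sketch: every end-of-stream throw should say what was being read and how
// much data was missing, instead of the bare default message.
[[noreturn]] void ThrowEof(const std::string& context, int64_t requested,
                           int64_t available) {
  std::stringstream ss;
  ss << "Unexpected end of stream while " << context << ": requested "
     << requested << " bytes, only " << available << " available";
  // Stands in for ParquetException::EofException(ss.str()).
  throw std::runtime_error(ss.str());
}
{code}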



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1857) [C++][Parquet] ParquetFileReader unable to read files with more than 32767 row groups

2020-05-05 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17100084#comment-17100084
 ] 

Wes McKinney commented on PARQUET-1857:
---

I put up a PR for the first problem you reported. If there are failures with < 
32768 row groups, then can you open a new JIRA and post the file since that 
will have to be investigated separately?
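
For reference, 32767 is INT16_MAX, which suggests the first problem was a 
signed 16-bit index on the row group path. A hypothetical illustration of the 
wraparound (not the actual reader code):

{code:c++}
#include <cstdint>
#include <iostream>

int main() {
  // Hypothetical: a 16-bit row group counter wraps negative once a file has
  // more than 32767 row groups, so later metadata lookups use a bogus index.
  int16_t index = 32767;  // INT16_MAX
  index = static_cast<int16_t>(index + 1);
  std::cout << index << std::endl;  // prints -32768 on two's-complement targets
  return 0;
}
{code}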

> [C++][Parquet] ParquetFileReader unable to read files with more than 32767 
> row groups
> -
>
> Key: PARQUET-1857
> URL: https://issues.apache.org/jira/browse/PARQUET-1857
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Novice
>Assignee: Wes McKinney
>Priority: Critical
>  Labels: pull-request-available
> Attachments: test.parquet.tgz, test_2.parquet.tgz
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> I am using Rust to write a Parquet file and read it from Python.
> When write_batch is used with batch size 1, reading the Parquet file from 
> Python gives the error below:
> ```
> >>> pd.read_parquet("some.parquet", engine="pyarrow")
>  Traceback (most recent call last):
>  File "", line 1, in 
>  File "/home//.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 
> 296, in read_parquet
>  return impl.read(path, columns=columns, **kwargs)
>  File "/home//.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 
> 125, in read
>  path, columns=columns, **kwargs
>  File 
> "/home//miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 1537, in read_table
>  use_pandas_metadata=use_pandas_metadata)
>  File 
> "/home//miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 1262, in read
>  use_pandas_metadata=use_pandas_metadata)
>  File 
> "/home//miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 707, in read
>  table = reader.read(**options)
>  File 
> "/home//miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 337, in read
>  use_threads=use_threads)
>  File "pyarrow/_parquet.pyx", line 1130, in 
> pyarrow._parquet.ParquetReader.read_all
>  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
>  OSError: Unexpected end of stream
> ```
> Also, when using batch size 1 and then reading from Python, there is an error too: 
> ```
> >>> pd.read_parquet("some.parquet", engine="pyarrow")
>  Traceback (most recent call last):
>  File "", line 1, in 
>  File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 
> 296, in read_parquet
>  return impl.read(path, columns=columns, **kwargs)
>  File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 
> 125, in read
>  path, columns=columns, **kwargs
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 1537, in read_table
>  use_pandas_metadata=use_pandas_metadata)
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 1262, in read
>  use_pandas_metadata=use_pandas_metadata)
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 707, in read
>  table = reader.read(**options)
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 337, in read
>  use_threads=use_threads)
>  File "pyarrow/_parquet.pyx", line 1130, in 
> pyarrow._parquet.ParquetReader.read_all
>  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
>  OSError: The file only has 0 columns, requested metadata for column: 6
> ```
> Using batch size 1000 is fine.
> Note that my data has 450047 rows. Schema:
> ```
> message schema
> { REQUIRED INT32 a; REQUIRED INT32 b; REQUIRED INT32 c; REQUIRED INT64 d; 
> REQUIRED INT32 e; REQUIRED BYTE_ARRAY f (UTF8); REQUIRED BOOLEAN g; }
> ```
>  
> EDIT: as I add more rows (estimated 80 million), using batch size 1000 does 
> not work either:
> ```
> >>> df = pd.read_parquet("data/ping_pong.parquet", engine="pyarrow")
>  Traceback (most recent call last):
>  File "", line 1, in 
>  File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 
> 296, in read_parquet
>  return impl.read(path, columns=columns, **kwargs)
>  File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 
> 125, in read
>  path, columns=columns, **kwargs
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 1537, in read_table
>  use_pandas_metadata=use_pandas_metadata)
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 1262, in read
>  use_pandas_metadata=use_pandas_metadata)
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 707, in read
>  table = reader.read(**options)
>  

[jira] [Moved] (PARQUET-1857) [C++][Parquet] ParquetFileReader unable to read files with more than 32767 row groups

2020-05-05 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney moved ARROW-8677 to PARQUET-1857:
--

  Component/s: (was: Rust)
   (was: Python)
   parquet-cpp
  Key: PARQUET-1857  (was: ARROW-8677)
Affects Version/s: (was: 0.17.0)
 Workflow: patch-available, re-open possible  (was: jira)
  Environment: (was: Linux debian
)
  Project: Parquet  (was: Apache Arrow)

> [C++][Parquet] ParquetFileReader unable to read files with more than 32767 
> row groups
> -
>
> Key: PARQUET-1857
> URL: https://issues.apache.org/jira/browse/PARQUET-1857
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Novice
>Assignee: Wes McKinney
>Priority: Critical
> Attachments: test.parquet.tgz
>
>
> I am using Rust to write a Parquet file and read it from Python.
> When write_batch is used with batch size 1, reading the Parquet file from 
> Python gives the error below:
> ```
> >>> pd.read_parquet("some.parquet", engine="pyarrow")
>  Traceback (most recent call last):
>  File "", line 1, in 
>  File "/home//.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 
> 296, in read_parquet
>  return impl.read(path, columns=columns, **kwargs)
>  File "/home//.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 
> 125, in read
>  path, columns=columns, **kwargs
>  File 
> "/home//miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 1537, in read_table
>  use_pandas_metadata=use_pandas_metadata)
>  File 
> "/home//miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 1262, in read
>  use_pandas_metadata=use_pandas_metadata)
>  File 
> "/home//miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 707, in read
>  table = reader.read(**options)
>  File 
> "/home//miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 337, in read
>  use_threads=use_threads)
>  File "pyarrow/_parquet.pyx", line 1130, in 
> pyarrow._parquet.ParquetReader.read_all
>  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
>  OSError: Unexpected end of stream
> ```
> Also, when using batch size 1 and then reading from Python, there is an error too: 
> ```
> >>> pd.read_parquet("some.parquet", engine="pyarrow")
>  Traceback (most recent call last):
>  File "", line 1, in 
>  File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 
> 296, in read_parquet
>  return impl.read(path, columns=columns, **kwargs)
>  File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 
> 125, in read
>  path, columns=columns, **kwargs
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 1537, in read_table
>  use_pandas_metadata=use_pandas_metadata)
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 1262, in read
>  use_pandas_metadata=use_pandas_metadata)
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 707, in read
>  table = reader.read(**options)
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 337, in read
>  use_threads=use_threads)
>  File "pyarrow/_parquet.pyx", line 1130, in 
> pyarrow._parquet.ParquetReader.read_all
>  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
>  OSError: The file only has 0 columns, requested metadata for column: 6
> ```
> Using batch size 1000 is fine.
> Note that my data has 450047 rows. Schema:
> ```
> message schema
> { REQUIRED INT32 a; REQUIRED INT32 b; REQUIRED INT32 c; REQUIRED INT64 d; 
> REQUIRED INT32 e; REQUIRED BYTE_ARRAY f (UTF8); REQUIRED BOOLEAN g; }
> ```
>  
> EDIT: as I add more rows (estimated 80 million), using batch size 1000 does 
> not work either:
> ```
> >>> df = pd.read_parquet("data/ping_pong.parquet", engine="pyarrow")
>  Traceback (most recent call last):
>  File "", line 1, in 
>  File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 
> 296, in read_parquet
>  return impl.read(path, columns=columns, **kwargs)
>  File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 
> 125, in read
>  path, columns=columns, **kwargs
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 1537, in read_table
>  use_pandas_metadata=use_pandas_metadata)
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 1262, in read
>  use_pandas_metadata=use_pandas_metadata)
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 707, 

[jira] [Created] (PARQUET-1856) [C++] Test suite assumes that Snappy support is built

2020-05-04 Thread Wes McKinney (Jira)
Wes McKinney created PARQUET-1856:
-

 Summary: [C++] Test suite assumes that Snappy support is built
 Key: PARQUET-1856
 URL: https://issues.apache.org/jira/browse/PARQUET-1856
 Project: Parquet
  Issue Type: Bug
  Components: parquet-cpp
Reporter: Wes McKinney
 Fix For: cpp-1.6.0


The test suite fails if {{-DARROW_WITH_SNAPPY=OFF}}

{code}
[--] 1 test from TestStatisticsSortOrder/0, where TypeParam = 
parquet::PhysicalType<(parquet::Type::type)1>
[ RUN  ] TestStatisticsSortOrder/0.MinMax
unknown file: Failure
C++ exception with description "NotImplemented: Snappy codec support not built" 
thrown in the test body.
{code}
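
A minimal sketch of one possible fix, assuming the tests guard on Arrow's codec 
availability check instead of hard-coding Snappy (Codec::IsAvailable is the 
Arrow C++ API; the test body is elided):

{code:c++}
#include <gtest/gtest.h>

#include "arrow/util/compression.h"

// Sketch: skip compression-dependent tests when the codec was not built in,
// rather than letting them throw NotImplemented.
TEST(TestStatisticsSortOrder, MinMaxWithCompression) {
  if (!arrow::util::Codec::IsAvailable(arrow::Compression::SNAPPY)) {
    GTEST_SKIP() << "Snappy codec support not built";
  }
  // ... exercise Snappy-compressed column chunks here ...
}
{code}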



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1820) [C++] Use a column filter hint to inform read prefetching in Arrow reads

2020-05-01 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1820.
---
Fix Version/s: cpp-1.6.0
   Resolution: Fixed

Issue resolved by pull request 6744
[https://github.com/apache/arrow/pull/6744]

> [C++] Use a column filter hint to inform read prefetching in Arrow reads
> 
>
> Key: PARQUET-1820
> URL: https://issues.apache.org/jira/browse/PARQUET-1820
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 10h 20m
>  Remaining Estimate: 0h
>
> As a follow up to PARQUET-1698 and ARROW-7995, we should use the I/O 
> coalescing facility (where available and enabled), in combination with a 
> column filter hint, to compute and prefetch the exact byte ranges we will be 
> reading (using the metadata). This should further improve performance on 
> remote object stores like Amazon S3. 
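
A rough sketch of the range computation on top of the existing 
ColumnChunkMetaData accessors (the coalescing and prefetch plumbing are 
elided; this is not the merged implementation):

{code:c++}
#include <algorithm>
#include <cstdint>
#include <vector>

#include "parquet/metadata.h"

struct ByteRange {
  int64_t offset;
  int64_t length;
};

// Sketch: for the columns the caller will actually read, collect the byte
// range of each column chunk from the footer metadata; the ranges can then
// be coalesced and handed to the prefetching layer.
std::vector<ByteRange> ColumnChunkRanges(const parquet::FileMetaData& metadata,
                                         const std::vector<int>& columns) {
  std::vector<ByteRange> ranges;
  for (int rg = 0; rg < metadata.num_row_groups(); ++rg) {
    auto row_group = metadata.RowGroup(rg);
    for (int col : columns) {
      auto chunk = row_group->ColumnChunk(col);
      int64_t start = chunk->data_page_offset();
      if (chunk->has_dictionary_page()) {
        start = std::min(start, chunk->dictionary_page_offset());
      }
      ranges.push_back({start, chunk->total_compressed_size()});
    }
  }
  return ranges;
}
{code}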



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1820) [C++] Use a column filter hint to inform read prefetching in Arrow reads

2020-05-01 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned PARQUET-1820:
-

Assignee: David Li

> [C++] Use a column filter hint to inform read prefetching in Arrow reads
> 
>
> Key: PARQUET-1820
> URL: https://issues.apache.org/jira/browse/PARQUET-1820
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10h 20m
>  Remaining Estimate: 0h
>
> As a follow up to PARQUET-1698 and ARROW-7995, we should use the I/O 
> coalescing facility (where available and enabled), in combination with a 
> column filter hint, to compute and prefetch the exact byte ranges we will be 
> reading (using the metadata). This should further improve performance on 
> remote object stores like Amazon S3. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1820) [C++] Use a column filter hint to inform read prefetching in Arrow reads

2020-05-01 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated PARQUET-1820:
--
Summary: [C++] Use a column filter hint to inform read prefetching in Arrow 
reads  (was: [C++] Use a column filter hint to inform read prefetching)

> [C++] Use a column filter hint to inform read prefetching in Arrow reads
> 
>
> Key: PARQUET-1820
> URL: https://issues.apache.org/jira/browse/PARQUET-1820
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: David Li
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10h 20m
>  Remaining Estimate: 0h
>
> As a follow up to PARQUET-1698 and ARROW-7995, we should use the I/O 
> coalescing facility (where available and enabled), in combination with a 
> column filter hint, to compute and prefetch the exact byte ranges we will be 
> reading (using the metadata). This should further improve performance on 
> remote object stores like Amazon S3. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1404) [C++] Add index pages to the format to support efficient page skipping to parquet-cpp

2020-04-23 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17090617#comment-17090617
 ] 

Wes McKinney commented on PARQUET-1404:
---

Do you want to keep the discussion in one place, i.e. on the mailing list?

> [C++] Add index pages to the format to support efficient page skipping to 
> parquet-cpp
> -
>
> Key: PARQUET-1404
> URL: https://issues.apache.org/jira/browse/PARQUET-1404
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Renato Javier Marroquín Mogrovejo
>Assignee: Deepak Majeti
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Once PARQUET-922 is completed we can port such implementation to parquet-cpp 
> as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1327) [C++] Bloom filter read/write implementation

2020-04-23 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated PARQUET-1327:
--
Summary: [C++] Bloom filter read/write implementation  (was: [C++]Bloom 
filter read/write implementation)

> [C++] Bloom filter read/write implementation
> 
>
> Key: PARQUET-1327
> URL: https://issues.apache.org/jira/browse/PARQUET-1327
> Project: Parquet
>  Issue Type: Sub-task
>Reporter: Junjie Chen
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1828) [C++] Add a SSE2 path for the ByteStreamSplit encoder implementation

2020-04-19 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated PARQUET-1828:
--
Summary: [C++] Add a SSE2 path for the ByteStreamSplit encoder 
implementation  (was: Add a SSE2 path for the ByteStreamSplit encoder 
implementation)

> [C++] Add a SSE2 path for the ByteStreamSplit encoder implementation
> 
>
> Key: PARQUET-1828
> URL: https://issues.apache.org/jira/browse/PARQUET-1828
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Martin Radev
>Assignee: Martin Radev
>Priority: Minor
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> The encode path for the byte stream split encoding can have better 
> performance if SSE2 intrinsics are used.
> The decode path already uses sse2 intrinsics.
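
For reference, a scalar version of the transform that the SSE2 path vectorizes 
(a standalone sketch, not the encoder class itself):

{code:c++}
#include <cstdint>

// Byte stream split for floats: byte k of value i lands at position i of
// stream k, so all first bytes are contiguous, then all second bytes, and so
// on. The SSE2 path computes the same permutation with vector shuffles.
void ByteStreamSplitEncodeScalar(const float* values, int64_t num_values,
                                 uint8_t* out) {
  const uint8_t* raw = reinterpret_cast<const uint8_t*>(values);
  for (int64_t i = 0; i < num_values; ++i) {
    for (int k = 0; k < 4; ++k) {
      out[k * num_values + i] = raw[i * 4 + k];
    }
  }
}
{code}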



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-1846) [C++] Remove deprecated IO classes and related functions

2020-04-19 Thread Wes McKinney (Jira)
Wes McKinney created PARQUET-1846:
-

 Summary: [C++] Remove deprecated IO classes and related functions
 Key: PARQUET-1846
 URL: https://issues.apache.org/jira/browse/PARQUET-1846
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-cpp
Reporter: Wes McKinney
 Fix For: cpp-1.6.0


These were deprecated almost a year ago, so there has been ample time for users 
to migrate.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1835) [C++] Fix crashes on invalid input (OSS-Fuzz)

2020-04-06 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1835.
---
Resolution: Fixed

Issue resolved by pull request 6848
[https://github.com/apache/arrow/pull/6848]

> [C++] Fix crashes on invalid input (OSS-Fuzz)
> -
>
> Key: PARQUET-1835
> URL: https://issues.apache.org/jira/browse/PARQUET-1835
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Critical
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Fix more issues found by OSS-Fuzz.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1834) Add Apache 2.0 license to README.md files in parquet-testing

2020-04-06 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated PARQUET-1834:
--
Fix Version/s: cpp-1.6.0

> Add Apache 2.0 license to README.md files in parquet-testing
> 
>
> Key: PARQUET-1834
> URL: https://issues.apache.org/jira/browse/PARQUET-1834
> Project: Parquet
>  Issue Type: Task
>Reporter: Maya Anderson
>Assignee: Maya Anderson
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> parquet-testing files can be used for interop tests in parquet-mr. 
> However, if it is added as a submodule, then the 3 README.md files fail the 
> license check and hence fail build of parquet-mr.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1834) Add Apache 2.0 license to README.md files in parquet-testing

2020-04-06 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1834.
---
Resolution: Fixed

Resolved by PR 
https://github.com/apache/parquet-testing/commit/bcd9ebcf9204a346df47204fe21b85c8d0498816

> Add Apache 2.0 license to README.md files in parquet-testing
> 
>
> Key: PARQUET-1834
> URL: https://issues.apache.org/jira/browse/PARQUET-1834
> Project: Parquet
>  Issue Type: Task
>Reporter: Maya Anderson
>Assignee: Maya Anderson
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> parquet-testing files can be used for interop tests in parquet-mr. 
> However, if it is added as a submodule, then the 3 README.md files fail the 
> license check and hence fail build of parquet-mr.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1829) [C++] Fix crashes on invalid input (OSS-Fuzz)

2020-03-26 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1829.
---
Fix Version/s: cpp-1.6.0
   Resolution: Fixed

Issue resolved by pull request 6728
[https://github.com/apache/arrow/pull/6728]

> [C++] Fix crashes on invalid input (OSS-Fuzz)
> -
>
> Key: PARQUET-1829
> URL: https://issues.apache.org/jira/browse/PARQUET-1829
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> There are remaining issues open in OSS-Fuzz. We should fix most of them 
> (except some out-of-memory conditions which may not easily be fixable).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-458) [C++] Implement support for DataPageV2

2020-03-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-458.
--
Resolution: Fixed

Issue resolved by pull request 6481
[https://github.com/apache/arrow/pull/6481]

> [C++] Implement support for DataPageV2
> --
>
> Key: PARQUET-458
> URL: https://issues.apache.org/jira/browse/PARQUET-458
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Hatem Helal
>Priority: Minor
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1786) [C++] Use simd to improve BYTE_STREAM_SPLIT decoding performance

2020-03-24 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17066085#comment-17066085
 ] 

Wes McKinney commented on PARQUET-1786:
---

Please leave resolved issues in the "Resolved" state; otherwise they will not 
show up in changelogs.

> [C++] Use simd to improve BYTE_STREAM_SPLIT decoding performance
> 
>
> Key: PARQUET-1786
> URL: https://issues.apache.org/jira/browse/PARQUET-1786
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Martin Radev
>Assignee: Martin Radev
>Priority: Minor
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 7h 10m
>  Remaining Estimate: 0h
>
> BYTE_STREAM_SPLIT essentially does a scatter/gather operation in the 
> encode/decode paths. Unfortunately, it is not as fast as memcpy when the 
> data is cached. That can be improved by using SIMD intrinsics.
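
The decode direction is the inverse gather; a scalar sketch of what the SIMD 
path accelerates:

{code:c++}
#include <cstdint>

// Byte stream split decode for floats: reassemble value i by pulling its
// k-th byte from position i of stream k.
void ByteStreamSplitDecodeScalar(const uint8_t* in, int64_t num_values,
                                 float* values) {
  uint8_t* raw = reinterpret_cast<uint8_t*>(values);
  for (int64_t i = 0; i < num_values; ++i) {
    for (int k = 0; k < 4; ++k) {
      raw[i * 4 + k] = in[k * num_values + i];
    }
  }
}
{code}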



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1823) [C++] Invalid RowGroup returned when reading with parquet::arrow::FileReader->RowGroup(i)->Column(j)

2020-03-20 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1823.
---
Fix Version/s: cpp-1.6.0
   Resolution: Fixed

Issue resolved by pull request 6674
[https://github.com/apache/arrow/pull/6674]

> [C++] Invalid RowGroup returned when reading with 
> parquet::arrow::FileReader->RowGroup(i)->Column(j)
> 
>
> Key: PARQUET-1823
> URL: https://issues.apache.org/jira/browse/PARQUET-1823
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Critical
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Originally reported as ARROW-8138



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1819) [C++] Fix crashes on corrupt IPC input (OSS-Fuzz)

2020-03-18 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1819.
---
Fix Version/s: cpp-1.6.0
   Resolution: Fixed

Issue resolved by pull request 6659
[https://github.com/apache/arrow/pull/6659]

> [C++] Fix crashes on corrupt IPC input (OSS-Fuzz)
> -
>
> Key: PARQUET-1819
> URL: https://issues.apache.org/jira/browse/PARQUET-1819
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1814) [C++] TestInt96ParquetIO failure on Windows

2020-03-13 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated PARQUET-1814:
--
Fix Version/s: cpp-1.6.0

> [C++] TestInt96ParquetIO failure on Windows
> ---
>
> Key: PARQUET-1814
> URL: https://issues.apache.org/jira/browse/PARQUET-1814
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: cpp-1.6.0
>
>
> {code}
> [ RUN  ] TestInt96ParquetIO.ReadIntoTimestamp
> C:/t/arrow/cpp/src/arrow/testing/gtest_util.cc(77): error: Failed
> @@ -0, +0 @@
> -1970-01-01 00:00:00.145738543
> +1970-01-02 11:35:00.145738543
> C:/t/arrow/cpp/src/parquet/arrow/arrow_reader_writer_test.cc(1034): error: 
> Expected: this->ReadAndCheckSingleColumnFile(*values) doesn't generate new 
> fatal failures in the current thread.
>   Actual: it does.
> [  FAILED  ] TestInt96ParquetIO.ReadIntoTimestamp (47 ms)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1813) [C++] Remove logging statement in unit test

2020-03-12 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated PARQUET-1813:
--
Summary: [C++] Remove logging statement in unit test  (was: [C++] Weird 
error output in tests)

> [C++] Remove logging statement in unit test
> ---
>
> Key: PARQUET-1813
> URL: https://issues.apache.org/jira/browse/PARQUET-1813
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Antoine Pitrou
>Assignee: Wes McKinney
>Priority: Major
>
> It doesn't appear to fail the test, but I still get this weird output on 
> Windows:
> {code}
> [ RUN  ] TestConvertArrowSchema.ParquetMaps
> C:/t/arrow/cpp/src/parquet/arrow/arrow_schema_test.cc:989: my_map: 
> map not null
> [   OK ] TestConvertArrowSchema.ParquetMaps (0 ms)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1813) [C++] Weird error output in tests

2020-03-12 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned PARQUET-1813:
-

Assignee: Wes McKinney

> [C++] Weird error output in tests
> -
>
> Key: PARQUET-1813
> URL: https://issues.apache.org/jira/browse/PARQUET-1813
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Antoine Pitrou
>Assignee: Wes McKinney
>Priority: Major
>
> It doesn't appear to fail the test, but I still get this weird output on 
> Windows:
> {code}
> [ RUN  ] TestConvertArrowSchema.ParquetMaps
> C:/t/arrow/cpp/src/parquet/arrow/arrow_schema_test.cc:989: my_map: 
> map not null
> [   OK ] TestConvertArrowSchema.ParquetMaps (0 ms)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1813) [C++] Weird error output in tests

2020-03-12 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17058286#comment-17058286
 ] 

Wes McKinney commented on PARQUET-1813:
---

I missed the debug output in my code review 
https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/arrow_schema_test.cc#L989.
 Will fix

> [C++] Weird error output in tests
> -
>
> Key: PARQUET-1813
> URL: https://issues.apache.org/jira/browse/PARQUET-1813
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Antoine Pitrou
>Priority: Major
>
> It doesn't appear to fail the test, but I still get this weird output on 
> Windows:
> {code}
> [ RUN  ] TestConvertArrowSchema.ParquetMaps
> C:/t/arrow/cpp/src/parquet/arrow/arrow_schema_test.cc:989: my_map: 
> map not null
> [   OK ] TestConvertArrowSchema.ParquetMaps (0 ms)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1663) [C++] Provide API to check the presence of complex data types

2020-03-10 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1663.
---
Fix Version/s: cpp-1.6.0
   Resolution: Fixed

Issue resolved by pull request 5490
[https://github.com/apache/arrow/pull/5490]

> [C++] Provide API to check the presence of complex data types
> -
>
> Key: PARQUET-1663
> URL: https://issues.apache.org/jira/browse/PARQUET-1663
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Zherui Cao
>Assignee: Zherui Cao
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> we need functions like
> hasMapType()
> hasArrayType()
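
A minimal sketch of such a check over the schema tree (the function name 
follows the request above; this is not the merged API):

{code:c++}
#include "parquet/schema.h"

// Sketch: recursively walk the schema looking for a MAP logical type; an
// analogous walk with is_list() would implement hasArrayType().
bool HasMapType(const parquet::schema::Node& node) {
  if (node.logical_type() != nullptr && node.logical_type()->is_map()) {
    return true;
  }
  if (node.is_group()) {
    const auto& group = static_cast<const parquet::schema::GroupNode&>(node);
    for (int i = 0; i < group.field_count(); ++i) {
      if (HasMapType(*group.field(i))) {
        return true;
      }
    }
  }
  return false;
}
{code}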



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1300) [C++] Parquet modular encryption

2020-03-06 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053455#comment-17053455
 ] 

Wes McKinney commented on PARQUET-1300:
---

Anyone interested in looking at packaging issues for encryption? I don't think 
it's being shipped in Arrow packages yet

> [C++] Parquet modular encryption
> 
>
> Key: PARQUET-1300
> URL: https://issues.apache.org/jira/browse/PARQUET-1300
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Gidon Gershinsky
>Assignee: Deepak Majeti
>Priority: Major
>  Labels: pull-request-available
> Attachments: column_reader.cc, column_writer.cc, file_reader.cc, 
> file_writer.cc, thrift.h
>
>  Time Spent: 34h
>  Remaining Estimate: 0h
>
> CPP version of a mechanism for modular encryption and decryption of Parquet 
> files. Allows to keep the data fully encrypted in the storage, while enabling 
> a client to extract a required subset (footer, column(s), pages) and to 
> authenticate / decrypt the extracted data.
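
For anyone picking up the packaging work, a minimal sketch of what enabling 
footer encryption looks like on the write path (header location and exact 
signatures may differ across versions; the key is a placeholder):

{code:c++}
#include <memory>
#include <string>

#include "parquet/encryption.h"
#include "parquet/properties.h"

// Sketch: encrypt the footer (and, by default, all columns) with one key.
std::shared_ptr<parquet::WriterProperties> MakeEncryptedWriterProperties() {
  const std::string footer_key = "0123456789012345";  // placeholder 128-bit key
  parquet::FileEncryptionProperties::Builder encryption_builder(footer_key);
  parquet::WriterProperties::Builder props_builder;
  props_builder.encryption(encryption_builder.build());
  return props_builder.build();
}
{code}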



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1810) [C++] Fix undefined behaviour on invalid enum values (OSS-Fuzz)

2020-03-05 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1810.
---
Fix Version/s: cpp-1.6.0
   Resolution: Fixed

Issue resolved by pull request 6537
[https://github.com/apache/arrow/pull/6537]

> [C++] Fix undefined behaviour on invalid enum values (OSS-Fuzz)
> ---
>
> Key: PARQUET-1810
> URL: https://issues.apache.org/jira/browse/PARQUET-1810
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1780) [C++] Set ColumnMetadata.encoding_stats field

2020-03-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1780.
---
Resolution: Fixed

Issue resolved by pull request 6370
[https://github.com/apache/arrow/pull/6370]

> [C++] Set ColumnMetadata.encoding_stats field
> -
>
> Key: PARQUET-1780
> URL: https://issues.apache.org/jira/browse/PARQUET-1780
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Gamage Omega Ishendra
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
> This metadata field is not set in the C++ library. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1780) [C++] Set ColumnMetadata.encoding_stats field

2020-03-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned PARQUET-1780:
-

Assignee: Gamage Omega Ishendra

> [C++] Set ColumnMetadata.encoding_stats field
> -
>
> Key: PARQUET-1780
> URL: https://issues.apache.org/jira/browse/PARQUET-1780
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Gamage Omega Ishendra
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> This metadata field is not set in the C++ library. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1797) [C++] Fix fuzzing errors

2020-02-18 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated PARQUET-1797:
--
Summary: [C++] Fix fuzzing errors  (was: Fix fuzzing errors)

> [C++] Fix fuzzing errors
> 
>
> Key: PARQUET-1797
> URL: https://issues.apache.org/jira/browse/PARQUET-1797
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Francois Saint-Jacques
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-1798) [C++] Review logic around automatic assignment of field_id's

2020-02-14 Thread Wes McKinney (Jira)
Wes McKinney created PARQUET-1798:
-

 Summary: [C++] Review logic around automatic assignment of 
field_id's
 Key: PARQUET-1798
 URL: https://issues.apache.org/jira/browse/PARQUET-1798
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-cpp
Reporter: Wes McKinney
 Fix For: cpp-1.6.0


At schema deserialization (from Thrift) time, we are assigning a default 
field_id to the Schema node based on a depth-first ordering of nodes. This 
means that a round trip (load, then write) will cause field_id's to be written 
that weren't there before. I'm not sure this is the desired behavior.

We should examine this in more detail and possibly change it. See also 
discussion in ARROW-7080 https://github.com/apache/arrow/pull/6408
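
For illustration, the behavior in question amounts to sequential depth-first 
numbering at load time (a standalone sketch over a hypothetical node type; the 
real schema nodes are immutable and receive ids at construction):

{code:c++}
#include <vector>

// Hypothetical node type, for illustration only.
struct Node {
  int field_id = -1;
  std::vector<Node> children;
};

// Assign sequential ids in depth-first order. A load/write round trip through
// logic like this emits field_id's for a schema that originally had none.
int AssignFieldIdsDepthFirst(Node* node, int next_id) {
  node->field_id = next_id++;
  for (Node& child : node->children) {
    next_id = AssignFieldIdsDepthFirst(&child, next_id);
  }
  return next_id;
}
{code}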



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1788) [C++] ColumnWriter has undefined behavior when writing arrow chunks

2020-02-10 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1788.
---
Fix Version/s: cpp-1.6.0
   Resolution: Fixed

Issue resolved by pull request 6378
[https://github.com/apache/arrow/pull/6378]

> [C++] ColumnWriter has undefined behavior when writing arrow chunks
> ---
>
> Key: PARQUET-1788
> URL: https://issues.apache.org/jira/browse/PARQUET-1788
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> We blindly add an offset to the def_level and rep_level pointers inside the 
> chunking callbacks; when these are nullptrs (I believe this occurs if the 
> schema is flat) we still apply the offset, which triggers UBSan.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1716) [C++] Add support for BYTE_STREAM_SPLIT encoding

2020-02-04 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1716.
---
Fix Version/s: cpp-1.6.0
   Resolution: Fixed

Issue resolved by pull request 6005
[https://github.com/apache/arrow/pull/6005]

> [C++] Add support for BYTE_STREAM_SPLIT encoding
> 
>
> Key: PARQUET-1716
> URL: https://issues.apache.org/jira/browse/PARQUET-1716
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Martin Radev
>Assignee: Martin Radev
>Priority: Minor
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>   Original Estimate: 72h
>  Time Spent: 14h
>  Remaining Estimate: 58h
>
> *From the Parquet issue ( https://issues.apache.org/jira/browse/PARQUET-1622 
> ):*
> Apache Parquet does not have any encodings suitable for FP data and the 
> available text compressors (zstd, gzip, etc) do not handle FP data very well.
> It is possible to apply a simple data transformation named "stream 
> splitting". Such could be "byte stream splitting" which creates K streams of 
> length N where K is the number of bytes in the data type (4 for floats, 8 for 
> doubles) and N is the number of elements in the sequence.
> The transformed data compresses significantly better on average than the 
> original data and for some cases there is a performance improvement in 
> compression and decompression speed.
> You can read a more detailed report here:
>  [https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view]
> *Apache Arrow can benefit from the reduced requirements for storing FP 
> parquet column data and improvements in decompression speed.*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1716) [C++] Add support for BYTE_STREAM_SPLIT encoding

2020-02-04 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned PARQUET-1716:
-

Assignee: Martin Radev

> [C++] Add support for BYTE_STREAM_SPLIT encoding
> 
>
> Key: PARQUET-1716
> URL: https://issues.apache.org/jira/browse/PARQUET-1716
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Martin Radev
>Assignee: Martin Radev
>Priority: Minor
>  Labels: pull-request-available
>   Original Estimate: 72h
>  Time Spent: 13h 50m
>  Remaining Estimate: 58h 10m
>
> *From the Parquet issue ( https://issues.apache.org/jira/browse/PARQUET-1622 
> ):*
> Apache Parquet does not have any encodings suitable for FP data and the 
> available text compressors (zstd, gzip, etc) do not handle FP data very well.
> It is possible to apply a simple data transformation named "stream 
> splitting". Such could be "byte stream splitting" which creates K streams of 
> length N where K is the number of bytes in the data type (4 for floats, 8 for 
> doubles) and N is the number of elements in the sequence.
> The transformed data compresses significantly better on average than the 
> original data and for some cases there is a performance improvement in 
> compression and decompression speed.
> You can read a more detailed report here:
>  [https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view]
> *Apache Arrow can benefit from the reduced requirements for storing FP 
> parquet column data and improvements in decompression speed.*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1783) [C++] Parquet statistics wrong for dictionary type

2020-02-04 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030199#comment-17030199
 ] 

Wes McKinney commented on PARQUET-1783:
---

I suppose it's good at least that the min/max are not "incorrect" when used for 
predicate pushdown, but yes this should be fixed. 

> [C++] Parquet statistics wrong for dictionary type
> --
>
> Key: PARQUET-1783
> URL: https://issues.apache.org/jira/browse/PARQUET-1783
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Affects Versions: cpp-1.6.0
>Reporter: Florian Jetter
>Priority: Major
>
> h3. Observed behaviour
> Statistics for categorical data are equivalent for all row groups and refer 
> to the entire {{CategoricalDtype}} instead of the data included in the row 
> group.
> h3. Expected behaviour
> The row group statistics should only include data which is part of the actual 
> row group, not the entire {{CategoricalDtype}}
> h3. Minimal example
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> test_df = pd.DataFrame({"categorical": pd.Categorical(["1", "42"])})
> table = pa.Table.from_pandas(test_df)
> pq.write_table(
> table,
> "test_parquet",
> chunk_size=1,
> )
> test_parquet = pq.ParquetFile("test_parquet")
> test_parquet.metadata.row_group(0).column(0).statistics
> {code}
> {code:java}
> Out[1]:
> 
>   has_min_max: True
>   min: 1
>   max: 42
>   null_count: 0
>   distinct_count: 0
>   num_values: 1
>   physical_type: BYTE_ARRAY
>   logical_type: String
>   converted_type (legacy): UTF8
> {code}
> Expected would be
> {{min:1}} {{max:1}} instead of {{max: 42}} for the first row group
>  
> Tested with 
>  pandas==1.0.0
>  pyarrow==bd08d0ecbe355b9e0de7d07e8b9ff6ccdb150e73 (current master / 
> essentially 0.16.0)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1783) [C++] Parquet statistics wrong for dictionary type

2020-02-04 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030200#comment-17030200
 ] 

Wes McKinney commented on PARQUET-1783:
---

Do we need to create a corresponding Arrow issue so this does not pass out of 
mind?

> [C++] Parquet statistics wrong for dictionary type
> --
>
> Key: PARQUET-1783
> URL: https://issues.apache.org/jira/browse/PARQUET-1783
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Affects Versions: cpp-1.6.0
>Reporter: Florian Jetter
>Priority: Major
>
> h3. Observed behaviour
> Statistics for categorical data are equivalent for all row groups and refer 
> to the entire {{CategoricalDtype}} instead of the data included in the row 
> group.
> h3. Expected behaviour
> The row group statistics should only include data which is part of the actual 
> row group, not the entire {{CategoricalDtype}}
> h3. Minimal example
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> test_df = pd.DataFrame({"categorical": pd.Categorical(["1", "42"])})
> table = pa.Table.from_pandas(test_df)
> pq.write_table(
> table,
> "test_parquet",
> chunk_size=1,
> )
> test_parquet = pq.ParquetFile("test_parquet")
> test_parquet.metadata.row_group(0).column(0).statistics
> {code}
> {code:java}
> Out[1]:
> 
>   has_min_max: True
>   min: 1
>   max: 42
>   null_count: 0
>   distinct_count: 0
>   num_values: 1
>   physical_type: BYTE_ARRAY
>   logical_type: String
>   converted_type (legacy): UTF8
> {code}
> Expected would be
> {{min:1}} {{max:1}} instead of {{max: 42}} for the first row group
>  
> Tested with 
>  pandas==1.0.0
>  pyarrow==bd08d0ecbe355b9e0de7d07e8b9ff6ccdb150e73 (current master / 
> essentially 0.16.0)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-1780) [C++] Set ColumnMetadata.encoding_stats field

2020-01-28 Thread Wes McKinney (Jira)
Wes McKinney created PARQUET-1780:
-

 Summary: [C++] Set ColumnMetadata.encoding_stats field
 Key: PARQUET-1780
 URL: https://issues.apache.org/jira/browse/PARQUET-1780
 Project: Parquet
  Issue Type: Bug
  Components: parquet-cpp
Reporter: Wes McKinney
 Fix For: cpp-1.6.0


This metadata field is not set in the C++ library. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1747) [C++] Access to ColumnChunkMetaData fails when encryption is on

2020-01-22 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1747.
---
Fix Version/s: cpp-1.6.0
   Resolution: Fixed

Issue resolved by pull request 6150
[https://github.com/apache/arrow/pull/6150]

> [C++] Access to ColumnChunkMetaData fails when encryption is on
> ---
>
> Key: PARQUET-1747
> URL: https://issues.apache.org/jira/browse/PARQUET-1747
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Gal Lushi
>Assignee: Gal Lushi
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> When encryption is on, can't access  ColumnChunkMetaData  from the 
> RowGroupMetaData.
> For example, this code won't work with encryption on.
> {code:c++}
> reader->metadata()
>  ->RowGroup(0)
>  ->ColumnChunk(0)
>  ->num_values();
> {code}
>  
>  One implication is that the Parquet Arrow API doesn't work with encryption 
> on.
> Tests for the Parquet Arrow API (with encryption) are soon to follow in a 
> separate PR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1772) [C++] ParquetFileWriter: Data overwritten when output stream opened in append mode

2020-01-22 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated PARQUET-1772:
--
Component/s: parquet-cpp

> [C++] ParquetFileWriter: Data overwritten when output stream opened in append 
> mode
> --
>
> Key: PARQUET-1772
> URL: https://issues.apache.org/jira/browse/PARQUET-1772
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Gawain BOLTON
>Assignee: Gawain BOLTON
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> An arrow::io::FileOutputStream can be opened in append mode.
> However, when the output stream is used by the ParquetFileWriter, the data 
> already present in the file is overwritten instead of being appended.
> From what I can see, Parquet does not currently have the functionality to 
> append data. As such, I suggest detecting when an append is attempted and 
> raising an error rather than overwriting existing data.
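A minimal sketch of the reported behaviour (editor's illustration, not code from the report; the file path and function name are invented, and the Result-returning Open overload is assumed):

{code:c++}
#include "arrow/io/file.h"
#include "arrow/result.h"
#include "parquet/api/writer.h"

arrow::Status AppendAttempt(const std::shared_ptr<parquet::schema::GroupNode>& schema) {
  // Open the file in append mode, expecting existing bytes to be preserved.
  ARROW_ASSIGN_OR_RAISE(
      auto sink, arrow::io::FileOutputStream::Open("/tmp/data.parquet", /*append=*/true));
  // The writer emits a fresh Parquet file into the stream, so per this
  // report the pre-existing data ends up overwritten rather than appended.
  auto writer = parquet::ParquetFileWriter::Open(sink, schema);
  writer->Close();
  return arrow::Status::OK();
}
{code}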



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1766) [C++] parquet NaN/null double statistics can result in endless loop

2020-01-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1766.
---
Resolution: Fixed

Issue resolved by pull request 6167
[https://github.com/apache/arrow/pull/6167]

> [C++] parquet NaN/null double statistics can result in endless loop
> ---
>
> Key: PARQUET-1766
> URL: https://issues.apache.org/jira/browse/PARQUET-1766
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Pierre Belzile
>Assignee: Francois Saint-Jacques
>Priority: Critical
>  Labels: parquet, pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> There is a bug in the double column statistics computation when writing to 
> parquet an array containing only NaNs and nulls. It loops endlessly if the 
> last cell of a write group is a null. The line in error is 
> [https://github.com/apache/arrow/blob/master/cpp/src/parquet/statistics.cc#L633]
>  which checks for NaN but not for null. The code then falls through, loops 
> endlessly, and causes the program to appear frozen.
> This code snippet reproduces the issue:
> {noformat}
> TEST(parquet, nans) {
>   /* Create a small parquet structure */
>   std::vector<std::shared_ptr<::arrow::Field>> fields;
>   fields.push_back(::arrow::field("doubles", ::arrow::float64()));
>   std::shared_ptr<::arrow::Schema> schema = ::arrow::schema(std::move(fields));
> 
>   std::unique_ptr<::arrow::RecordBatchBuilder> builder;
>   ::arrow::RecordBatchBuilder::Make(schema, ::arrow::default_memory_pool(), &builder);
>   builder->GetFieldAs<::arrow::DoubleBuilder>(0)->Append(std::numeric_limits<double>::quiet_NaN());
>   builder->GetFieldAs<::arrow::DoubleBuilder>(0)->AppendNull();
> 
>   std::shared_ptr<::arrow::RecordBatch> batch;
>   builder->Flush(&batch);
>   arrow::PrettyPrint(*batch, 0, &std::cout);
> 
>   std::shared_ptr<::arrow::Table> table;
>   arrow::Table::FromRecordBatches({batch}, &table);
> 
>   /* Attempt to write */
>   std::shared_ptr<::arrow::io::FileOutputStream> os;
>   arrow::io::FileOutputStream::Open("/tmp/test.parquet", &os);
>   parquet::WriterProperties::Builder writer_props_bld;
>   // writer_props_bld.disable_statistics("doubles");
>   std::shared_ptr<parquet::WriterProperties> writer_props = writer_props_bld.build();
>   std::shared_ptr<parquet::ArrowWriterProperties> arrow_props =
>       parquet::ArrowWriterProperties::Builder().store_schema()->build();
>   std::unique_ptr<parquet::arrow::FileWriter> writer;
>   parquet::arrow::FileWriter::Open(
>       *table->schema(), arrow::default_memory_pool(), os,
>       writer_props, arrow_props, &writer);
>   writer->WriteTable(*table, 1024);
> }{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1769) [C++] Update to parquet-format 2.8.0

2020-01-15 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1769.
---
Resolution: Fixed

Issue resolved by pull request 6200
[https://github.com/apache/arrow/pull/6200]

> [C++] Update to parquet-format 2.8.0
> 
>
> Key: PARQUET-1769
> URL: https://issues.apache.org/jira/browse/PARQUET-1769
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-1769) [C++] Update to parquet-format 2.8.0

2020-01-14 Thread Wes McKinney (Jira)
Wes McKinney created PARQUET-1769:
-

 Summary: [C++] Update to parquet-format 2.8.0
 Key: PARQUET-1769
 URL: https://issues.apache.org/jira/browse/PARQUET-1769
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-cpp
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: cpp-1.6.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1701) [C++] Stream API: Add support for optional fields

2020-01-14 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1701.
---
Fix Version/s: cpp-1.6.0
   Resolution: Fixed

Issue resolved by pull request 5928
[https://github.com/apache/arrow/pull/5928]

> [C++] Stream API: Add support for optional fields
> -
>
> Key: PARQUET-1701
> URL: https://issues.apache.org/jira/browse/PARQUET-1701
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Gawain BOLTON
>Assignee: Gawain BOLTON
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 7h
>  Remaining Estimate: 0h
>
> The parquet::StreamReader and parquet::StreamWriter classes currently only 
> support required fields.
> Support must be added to this API in order for it to be usable when the 
> schema has optional fields.
>  
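For illustration, a minimal sketch of what optional-field support could look like, assuming optional values map to std::optional as in the linked pull request (the function name is invented):

{code:c++}
#include <optional>

#include "parquet/stream_writer.h"

// Writing one row with two optional int32 columns: an engaged optional
// writes the value, an empty optional writes a null.
void WriteRow(parquet::StreamWriter& os) {
  os << std::optional<int32_t>(42)   // value present
     << std::optional<int32_t>()     // null
     << parquet::EndRow;
}
{code}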



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1701) [C++] Stream API: Add support for optional fields

2020-01-14 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated PARQUET-1701:
--
Component/s: parquet-cpp

> [C++] Stream API: Add support for optional fields
> -
>
> Key: PARQUET-1701
> URL: https://issues.apache.org/jira/browse/PARQUET-1701
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Gawain BOLTON
>Assignee: Gawain BOLTON
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 6h 50m
>  Remaining Estimate: 0h
>
> The parquet::StreamReader and parquet::StreamWriter classes currently only 
> support required fields.
> Support must be added to this API in order for it to be usable when the 
> schema has optional fields.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1698) [C++] Add reader option to pre-buffer entire serialized row group into memory

2020-01-13 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014781#comment-17014781
 ] 

Wes McKinney commented on PARQUET-1698:
---

Currently in the C++ library, IO calls are issued separately for each column. 
To make the problem concrete, consider:

* File with 100 columns
* Read columns 1 through 10 and 91 through 100

We have the option of readahead buffering, but the readahead only happens at 
the single-column level. So we need some strategy for coalescing the 20 column 
reads into roughly 2 IO calls, as in the sketch below.
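A rough sketch of this kind of range coalescing (editor's illustration; ReadRange and hole_size are invented names, and this is not the implementation that landed):

{code:c++}
#include <algorithm>
#include <cstdint>
#include <vector>

struct ReadRange {
  int64_t offset;
  int64_t length;
};

// Sort ranges by offset and merge neighbors whose gap is at most
// hole_size bytes, so e.g. 20 column reads collapse into ~2 IO calls.
std::vector<ReadRange> CoalesceRanges(std::vector<ReadRange> ranges,
                                      int64_t hole_size) {
  std::sort(ranges.begin(), ranges.end(),
            [](const ReadRange& a, const ReadRange& b) {
              return a.offset < b.offset;
            });
  std::vector<ReadRange> out;
  for (const auto& r : ranges) {
    if (!out.empty() &&
        r.offset - (out.back().offset + out.back().length) <= hole_size) {
      // Extend the previous range to cover this one (plus the small hole).
      int64_t end = std::max(out.back().offset + out.back().length,
                             r.offset + r.length);
      out.back().length = end - out.back().offset;
    } else {
      out.push_back(r);
    }
  }
  return out;
}
{code}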

> [C++] Add reader option to pre-buffer entire serialized row group into memory
> -
>
> Key: PARQUET-1698
> URL: https://issues.apache.org/jira/browse/PARQUET-1698
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Zherui Cao
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In some scenarios (example: reading datasets from Amazon S3), reading columns 
> independently and allowing unbridled {{Read}} calls to the underlying file 
> handle can yield suboptimal performance. In such cases, it may be preferable 
> to first read the entire serialized row group into memory, then deserialize 
> the constituent columns from it.
> Note that such an option would not be appropriate as a default behavior for 
> all file handle types, since low-selectivity reads (example: reading only 3 
> columns out of a file with 100 columns) will be suboptimal in some cases. I 
> think it would be better for "high latency" file systems to opt into this 
> option.
> cc [~fsaintjacques] [~bkietz] [~apitrou]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1698) [C++] Add reader option to pre-buffer entire serialized row group into memory

2020-01-13 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014744#comment-17014744
 ] 

Wes McKinney commented on PARQUET-1698:
---

I think the pre-buffering should probably be implemented at the RowGroupReader 
level. Something like:

{code}
rg_reader->PreBufferColumns(column_indices);
{code}

What do you think? Then we can provide this prebuffering as an option at the 
Arrow read and Datasets level. Another option would be to set the prebuffer 
column indices in {{ReaderProperties}} (tomay-to, tomah-to, I guess). 

cc [~npr] [~fsaintjacques] [~bkietz]
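Hypothetical usage of the proposed API (PreBufferColumns does not exist at this point; the name and signature are illustrative only, and file_reader is assumed to be an open parquet::ParquetFileReader):

{code:c++}
// One coalesced IO pass for the declared columns, then per-column reads
// are served from the prebuffered data.
std::vector<int> column_indices = {0, 3, 7};
std::shared_ptr<parquet::RowGroupReader> rg_reader = file_reader->RowGroup(0);
rg_reader->PreBufferColumns(column_indices);  // hypothetical call
for (int i : column_indices) {
  std::shared_ptr<parquet::ColumnReader> col = rg_reader->Column(i);
  // ... read values from col ...
}
{code}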

> [C++] Add reader option to pre-buffer entire serialized row group into memory
> -
>
> Key: PARQUET-1698
> URL: https://issues.apache.org/jira/browse/PARQUET-1698
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Zherui Cao
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In some scenarios (example: reading datasets from Amazon S3), reading columns 
> independently and allowing unbridled {{Read}} calls to the underlying file 
> handle can yield suboptimal performance. In such cases, it may be preferable 
> to first read the entire serialized row group into memory, then deserialize 
> the constituent columns from it.
> Note that such an option would not be appropriate as a default behavior for 
> all file handle types, since low-selectivity reads (example: reading only 3 
> columns out of a file with 100 columns) will be suboptimal in some cases. I 
> think it would be better for "high latency" file systems to opt into this 
> option.
> cc [~fsaintjacques] [~bkietz] [~apitrou]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Moved] (PARQUET-1766) [C++] parquet NaN/null double statistics can result in endless loop

2020-01-13 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney moved ARROW-7376 to PARQUET-1766:
--

  Component/s: (was: C++)
   parquet-cpp
Fix Version/s: (was: 0.16.0)
   cpp-1.6.0
  Key: PARQUET-1766  (was: ARROW-7376)
Affects Version/s: (was: 0.15.1)
 Workflow: patch-available, re-open possible  (was: jira)
  Project: Parquet  (was: Apache Arrow)

> [C++] parquet NaN/null double statistics can result in endless loop
> ---
>
> Key: PARQUET-1766
> URL: https://issues.apache.org/jira/browse/PARQUET-1766
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Pierre Belzile
>Assignee: Francois Saint-Jacques
>Priority: Critical
>  Labels: parquet, pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> There is a bug in the double column statistics computation when writing to 
> parquet an array containing only NaNs and nulls. It loops endlessly if the 
> last cell of a write group is a null. The line in error is 
> [https://github.com/apache/arrow/blob/master/cpp/src/parquet/statistics.cc#L633]
>  which checks for NaN but not for null. The code then falls through, loops 
> endlessly, and causes the program to appear frozen.
> This code snippet reproduces the issue:
> {noformat}
> TEST(parquet, nans) {
>   /* Create a small parquet structure */
>   std::vector<std::shared_ptr<::arrow::Field>> fields;
>   fields.push_back(::arrow::field("doubles", ::arrow::float64()));
>   std::shared_ptr<::arrow::Schema> schema = ::arrow::schema(std::move(fields));
> 
>   std::unique_ptr<::arrow::RecordBatchBuilder> builder;
>   ::arrow::RecordBatchBuilder::Make(schema, ::arrow::default_memory_pool(), &builder);
>   builder->GetFieldAs<::arrow::DoubleBuilder>(0)->Append(std::numeric_limits<double>::quiet_NaN());
>   builder->GetFieldAs<::arrow::DoubleBuilder>(0)->AppendNull();
> 
>   std::shared_ptr<::arrow::RecordBatch> batch;
>   builder->Flush(&batch);
>   arrow::PrettyPrint(*batch, 0, &std::cout);
> 
>   std::shared_ptr<::arrow::Table> table;
>   arrow::Table::FromRecordBatches({batch}, &table);
> 
>   /* Attempt to write */
>   std::shared_ptr<::arrow::io::FileOutputStream> os;
>   arrow::io::FileOutputStream::Open("/tmp/test.parquet", &os);
>   parquet::WriterProperties::Builder writer_props_bld;
>   // writer_props_bld.disable_statistics("doubles");
>   std::shared_ptr<parquet::WriterProperties> writer_props = writer_props_bld.build();
>   std::shared_ptr<parquet::ArrowWriterProperties> arrow_props =
>       parquet::ArrowWriterProperties::Builder().store_schema()->build();
>   std::unique_ptr<parquet::arrow::FileWriter> writer;
>   parquet::arrow::FileWriter::Open(
>       *table->schema(), arrow::default_memory_pool(), os,
>       writer_props, arrow_props, &writer);
>   writer->WriteTable(*table, 1024);
> }{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1698) [C++] Add reader option to pre-buffer entire serialized row group into memory

2020-01-13 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014637#comment-17014637
 ] 

Wes McKinney commented on PARQUET-1698:
---

[~lidavidm] I missed the part about "wide datasets".

I wonder if we can implement something general and not S3-specific that does 
read coalescing, basically _partial_ row group pre-buffering. This would 
require a declaration of intent up front about which columns you plan to read. 
Thoughts?

> [C++] Add reader option to pre-buffer entire serialized row group into memory
> -
>
> Key: PARQUET-1698
> URL: https://issues.apache.org/jira/browse/PARQUET-1698
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Zherui Cao
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In some scenarios (example: reading datasets from Amazon S3), reading columns 
> independently and allowing unbridled {{Read}} calls to the underlying file 
> handle can yield suboptimal performance. In such cases, it may be preferable 
> to first read the entire serialized row group into memory, then deserialize 
> the constituent columns from it.
> Note that such an option would not be appropriate as a default behavior for 
> all file handle types, since low-selectivity reads (example: reading only 3 
> columns out of a file with 100 columns) will be suboptimal in some cases. I 
> think it would be better for "high latency" file systems to opt into this 
> option.
> cc [~fsaintjacques] [~bkietz] [~apitrou]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1698) [C++] Add reader option to pre-buffer entire serialized row group into memory

2020-01-12 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013982#comment-17013982
 ] 

Wes McKinney commented on PARQUET-1698:
---

[~lidavidm] I'm quite interested to compare the rather complex optimization you 
have described with the very simple solution of pulling down the whole 
serialized row group in a single read from S3 up front, so there is effectively 
only a single IO call per row group. AFAIK this is the most common Parquet 
optimization when it comes to high-latency file systems like S3.

> [C++] Add reader option to pre-buffer entire serialized row group into memory
> -
>
> Key: PARQUET-1698
> URL: https://issues.apache.org/jira/browse/PARQUET-1698
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Zherui Cao
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In some scenarios (example: reading datasets from Amazon S3), reading columns 
> independently and allowing unbridled {{Read}} calls to the underlying file 
> handle can yield suboptimal performance. In such cases, it may be preferable 
> to first read the entire serialized row group into memory, then deserialize 
> the constituent columns from it.
> Note that such an option would not be appropriate as a default behavior for 
> all file handle types, since low-selectivity reads (example: reading only 3 
> columns out of a file with 100 columns) will be suboptimal in some cases. I 
> think it would be better for "high latency" file systems to opt into this 
> option.
> cc [~fsaintjacques] [~bkietz] [~apitrou]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1720) [C++] Parquet JSONPrint not showing version correctly

2019-12-19 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17000192#comment-17000192
 ] 

Wes McKinney commented on PARQUET-1720:
---

Assuming this is a C++ issue. Can you provide detail?

> [C++] Parquet JSONPrint not showing version correctly
> -
>
> Key: PARQUET-1720
> URL: https://issues.apache.org/jira/browse/PARQUET-1720
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Zherui Cao
>Assignee: Zherui Cao
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1720) [C++] Parquet JSONPrint not showing version correctly

2019-12-19 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated PARQUET-1720:
--
Summary: [C++] Parquet JSONPrint not showing version correctly  (was: 
Parquet JSONPrint not showing version correctly)

> [C++] Parquet JSONPrint not showing version correctly
> -
>
> Key: PARQUET-1720
> URL: https://issues.apache.org/jira/browse/PARQUET-1720
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Zherui Cao
>Assignee: Zherui Cao
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1720) Parquet JSONPrint not showing version correctly

2019-12-19 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated PARQUET-1720:
--
Component/s: parquet-cpp

> Parquet JSONPrint not showing version correctly
> ---
>
> Key: PARQUET-1720
> URL: https://issues.apache.org/jira/browse/PARQUET-1720
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Zherui Cao
>Assignee: Zherui Cao
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1715) [C++] Add the Parquet code samples to CI + Refactor Parquet Encryption Samples

2019-12-12 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16995175#comment-16995175
 ] 

Wes McKinney commented on PARQUET-1715:
---

Done

> [C++] Add the Parquet code samples to CI + Refactor Parquet Encryption Samples
> --
>
> Key: PARQUET-1715
> URL: https://issues.apache.org/jira/browse/PARQUET-1715
> Project: Parquet
>  Issue Type: Bug
>Reporter: Gal Lushi
>Assignee: Gal Lushi
>Priority: Minor
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> This is a follow-up fix for PARQUET-1712, refactoring the Parquet Encryption 
> code samples and adding the code samples to the CI (the previous issue added 
> only the tools, not the samples).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1715) [C++] Add the Parquet code samples to CI + Refactor Parquet Encryption Samples

2019-12-12 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned PARQUET-1715:
-

Assignee: Gal Lushi

> [C++] Add the Parquet code samples to CI + Refactor Parquet Encryption Samples
> --
>
> Key: PARQUET-1715
> URL: https://issues.apache.org/jira/browse/PARQUET-1715
> Project: Parquet
>  Issue Type: Bug
>Reporter: Gal Lushi
>Assignee: Gal Lushi
>Priority: Minor
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> This is a follow-up fix for PARQUET-1712, refactoring the Parquet Encryption 
> code samples and adding the code samples to the CI (the previous issue added 
> only the tools, not the samples).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (PARQUET-1718) Store int16 as int16

2019-12-11 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed PARQUET-1718.
-

> Store int16 as int16
> 
>
> Key: PARQUET-1718
> URL: https://issues.apache.org/jira/browse/PARQUET-1718
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Viacheslav Shalamov
>Priority: Major
>
> When writing a POJO with a `short` field, it ends up in the parquet file as a 
> 32-bit int because of:
> ??16-bit ints are not explicitly supported in the storage format since they 
> are covered by 32-bit ints with an efficient encoding.??
>  [https://github.com/apache/parquet-format#types]
> How about annotating it with the logical type `IntType (bitWidth = 16, 
> isSigned = true)`?
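The issue targets parquet-mr, but for illustration, roughly the same annotation expressed with the parquet-cpp schema API (a sketch; the field name is invented):

{code:c++}
#include "parquet/schema.h"
#include "parquet/types.h"

// Physically stored as INT32, but annotated so readers can round-trip
// the value as a signed 16-bit integer.
auto int16_field = parquet::schema::PrimitiveNode::Make(
    "short_field", parquet::Repetition::REQUIRED,
    parquet::LogicalType::Int(/*bit_width=*/16, /*is_signed=*/true),
    parquet::Type::INT32);
{code}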



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1716) [C++] Add support for BYTE_STREAM_SPLIT encoding

2019-12-10 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated PARQUET-1716:
--
Summary: [C++] Add support for BYTE_STREAM_SPLIT encoding  (was: 
[C++][Parquet] Add support for Parquet's BYTE_STREAM_SPLIT encoding)

> [C++] Add support for BYTE_STREAM_SPLIT encoding
> 
>
> Key: PARQUET-1716
> URL: https://issues.apache.org/jira/browse/PARQUET-1716
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Martin Radev
>Priority: Minor
>  Labels: pull-request-available
>   Original Estimate: 72h
>  Time Spent: 0.5h
>  Remaining Estimate: 71.5h
>
> *From the Parquet issue ( https://issues.apache.org/jira/browse/PARQUET-1622 
> ):*
> Apache Parquet does not have any encodings suitable for FP data and the 
> available text compressors (zstd, gzip, etc) do not handle FP data very well.
> It is possible to apply a simple data transformation named "stream 
> splitting". One such transformation is "byte stream splitting", which creates 
> K streams of length N, where K is the number of bytes in the data type (4 for 
> floats, 8 for doubles) and N is the number of elements in the sequence.
> The transformed data compresses significantly better on average than the 
> original data, and in some cases there is a performance improvement in 
> compression and decompression speed.
> You can read a more detailed report here:
>  [https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view]
> *Apache Arrow can benefit from the reduced requirements for storing FP 
> parquet column data and improvements in decompression speed.*
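For illustration, a minimal sketch of the byte stream splitting transformation for floats (editor's example, not the proposed patch):

{code:c++}
#include <cstdint>
#include <cstring>
#include <vector>

// Scatter the K=4 bytes of each float into 4 separate streams: all first
// bytes, then all second bytes, and so on. The output is the same size as
// the input but typically compresses better with a general-purpose codec.
std::vector<uint8_t> ByteStreamSplit(const std::vector<float>& values) {
  constexpr size_t K = sizeof(float);
  const size_t N = values.size();
  std::vector<uint8_t> out(N * K);
  for (size_t i = 0; i < N; ++i) {
    uint8_t bytes[K];
    std::memcpy(bytes, &values[i], K);
    for (size_t b = 0; b < K; ++b) {
      out[b * N + i] = bytes[b];  // stream b, element i
    }
  }
  return out;
}
{code}

Decoding simply reverses the scatter, reading element i of stream b back into byte b of value i.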



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Moved] (PARQUET-1716) [C++][Parquet] Add support for Parquet's BYTE_STREAM_SPLIT encoding

2019-12-10 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney moved ARROW-5913 to PARQUET-1716:
--

Component/s: (was: C++)
 parquet-cpp
Key: PARQUET-1716  (was: ARROW-5913)
   Workflow: patch-available, re-open possible  (was: jira)
 Issue Type: New Feature  (was: Wish)
Project: Parquet  (was: Apache Arrow)

> [C++][Parquet] Add support for Parquet's BYTE_STREAM_SPLIT encoding
> ---
>
> Key: PARQUET-1716
> URL: https://issues.apache.org/jira/browse/PARQUET-1716
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Martin Radev
>Priority: Minor
>  Labels: pull-request-available
>   Original Estimate: 72h
>  Time Spent: 0.5h
>  Remaining Estimate: 71.5h
>
> *From the Parquet issue ( https://issues.apache.org/jira/browse/PARQUET-1622 
> ):*
> Apache Parquet does not have any encodings suitable for FP data and the 
> available text compressors (zstd, gzip, etc) do not handle FP data very well.
> It is possible to apply a simple data transformation named "stream 
> splitting". One such transformation is "byte stream splitting", which creates 
> K streams of length N, where K is the number of bytes in the data type (4 for 
> floats, 8 for doubles) and N is the number of elements in the sequence.
> The transformed data compresses significantly better on average than the 
> original data, and in some cases there is a performance improvement in 
> compression and decompression speed.
> You can read a more detailed report here:
>  [https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view]
> *Apache Arrow can benefit from the reduced requirements for storing FP 
> parquet column data and improvements in decompression speed.*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1712) [C++] Stop using deprecated APIs in examples

2019-12-10 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992361#comment-16992361
 ] 

Wes McKinney commented on PARQUET-1712:
---

Done

> [C++] Stop using deprecated APIs in examples
> 
>
> Key: PARQUET-1712
> URL: https://issues.apache.org/jira/browse/PARQUET-1712
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Kenta Murata
>Assignee: Kenta Murata
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Some Status-returning APIs used in example files have been deprecated 
> recently.
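For context, the migration is from Status-plus-out-parameter factories to Result-returning ones; a sketch (exact call sites in the examples differ):

{code:c++}
#include "arrow/io/file.h"
#include "arrow/result.h"
#include "arrow/status.h"

// Old, deprecated style:
//   std::shared_ptr<arrow::io::OutputStream> outfile;
//   ARROW_RETURN_NOT_OK(arrow::io::FileOutputStream::Open("out.parquet", &outfile));

// Result-returning replacement:
arrow::Status WriteExample() {
  ARROW_ASSIGN_OR_RAISE(auto outfile,
                        arrow::io::FileOutputStream::Open("out.parquet"));
  return outfile->Close();
}
{code}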



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1712) [C++] Stop using deprecated APIs in examples

2019-12-10 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned PARQUET-1712:
-

Assignee: Kenta Murata

> [C++] Stop using deprecated APIs in examples
> 
>
> Key: PARQUET-1712
> URL: https://issues.apache.org/jira/browse/PARQUET-1712
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Kenta Murata
>Assignee: Kenta Murata
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Some Status-returning APIs used in example files have been deprecated 
> recently.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (PARQUET-1713) [C++] Refactor Parquet Code Samples to use Result APIs

2019-12-09 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed PARQUET-1713.
-

> [C++] Refactor Parquet Code Samples to use Result APIs
> -
>
> Key: PARQUET-1713
> URL: https://issues.apache.org/jira/browse/PARQUET-1713
> Project: Parquet
>  Issue Type: Bug
>Reporter: Gal Lushi
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Currently, the Parquet code samples use the (now deprecated by ARROW-7235) 
> `Status`-returning functions.
> See [https://github.com/apache/arrow/pull/5994].
> This also closes ARROW-7352, which was opened in the wrong JIRA project by 
> mistake.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1709) [C++] Avoid unnecessary temporary std::shared_ptr copies

2019-12-09 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1709.
---
Fix Version/s: cpp-1.6.0
   Resolution: Fixed

Issue resolved by pull request 5949
[https://github.com/apache/arrow/pull/5949]

> [C++] Avoid unnecessary temporary std::shared_ptr copies
> 
>
> Key: PARQUET-1709
> URL: https://issues.apache.org/jira/browse/PARQUET-1709
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Gawain BOLTON
>Assignee: Gawain BOLTON
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> There are several occurrences of copying of std::shared_ptr objects which are 
> easily avoided.
> Copying a std::shared_ptr object can be expensive due to the atomic 
> operations involved in incrementing/decrementing the reference counter.
>  
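To illustrate the cost (editor's example, not code from the patch):

{code:c++}
#include <memory>

struct ColumnChunk { int rows = 0; };

// Pass by value: copies the shared_ptr, atomically incrementing the
// reference count on entry and decrementing it on exit.
int NumRowsByValue(std::shared_ptr<ColumnChunk> c) { return c->rows; }

// Pass by const reference: no refcount traffic; appropriate whenever the
// callee only observes the object and does not retain ownership.
int NumRowsByRef(const std::shared_ptr<ColumnChunk>& c) { return c->rows; }
{code}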



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1709) [C++] Avoid unnecessary temporary std::shared_ptr copies

2019-12-09 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated PARQUET-1709:
--
Component/s: parquet-cpp

> [C++] Avoid unnecessary temporary std::shared_ptr copies
> 
>
> Key: PARQUET-1709
> URL: https://issues.apache.org/jira/browse/PARQUET-1709
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Gawain BOLTON
>Assignee: Gawain BOLTON
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> There are several occurrences of copying of std::shared_ptr objects which are 
> easily avoided.
> Copying a std::shared_ptr object can be expensive due to the atomic 
> operations involved in incrementing/decrementing the reference counter.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1702) [C++] Make BufferedRowGroupWriter compatible with parquet encryption

2019-12-04 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1702.
---
Fix Version/s: cpp-1.6.0
   Resolution: Fixed

Issue resolved by pull request 5903
[https://github.com/apache/arrow/pull/5903]

> [C++] Make BufferedRowGroupWriter compatible with parquet encryption
> 
>
> Key: PARQUET-1702
> URL: https://issues.apache.org/jira/browse/PARQUET-1702
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Affects Versions: cpp-1.6.0
>Reporter: Or Ozeri
>Assignee: Or Ozeri
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> The newly added parquet encryption feature currently works only with 
> SerializedRowGroupWriter.
> There are several issues preventing the use of BufferedRowGroupWriter with 
> encryption enabled:
> 1. Meta encryptor not passed on to ColumnChunkMetaDataBuilder::Finish. This 
> can trigger a null-pointer dereference (reported as segmentation fault).
> 2. UpdateEncryption not called on Close, resulting in an incorrect AAD string 
> when encrypting the column chunk metadata.
> 3. The column ordinal passed on to PageWriter::Open is always zero, resulting 
> in a wrong AAD string when encrypting the columns data (except for the first 
> column).
> 4. When decrypting a column chunk with no dictionary pages, PARQUET-1706 
> confuses the decryptor to think it is decrypting a dictionary page, which 
> again causes a wrong AAD string to be used when decrypting.
> We propose a patch (a few dozen lines) to fix the above issues.
> We also extend the current parquet-encryption-test unit test, which tests 
> SerializedRowGroupWriter, to test also with BufferedRowGroupWriter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

