[jira] [Updated] (PARQUET-2369) Clarify Support for Pages Compressed with Multiple GZIP Members

2023-11-15 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-2369:

Fix Version/s: format-2.10.0

> Clarify Support for Pages Compressed with Multiple GZIP Members
> ---
>
> Key: PARQUET-2369
> URL: https://issues.apache.org/jira/browse/PARQUET-2369
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Raphael Taylor-Davies
>Priority: Major
> Fix For: format-2.10.0
>
>
> https://github.com/apache/parquet-testing/pull/41
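The behavior at stake can be demonstrated with Python's gzip module, which transparently decompresses a stream made of several concatenated gzip members — a hypothetical sketch of the page layout under discussion, not Parquet reader code:

```python
import gzip

# A GZIP-compressed page body may consist of multiple concatenated members.
member1 = gzip.compress(b"hello ")
member2 = gzip.compress(b"world")
page_body = member1 + member2

# A conforming reader must decompress all members, not stop after the first.
assert gzip.decompress(page_body) == b"hello world"
```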



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (PARQUET-2369) Clarify Support for Pages Compressed with Multiple GZIP Members

2023-11-15 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-2369:

Component/s: parquet-format

> Clarify Support for Pages Compressed with Multiple GZIP Members
> ---
>
> Key: PARQUET-2369
> URL: https://issues.apache.org/jira/browse/PARQUET-2369
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Raphael Taylor-Davies
>Priority: Major
>
> https://github.com/apache/parquet-testing/pull/41





[jira] [Updated] (PARQUET-2369) Clarify Support for Pages Compressed with Multiple GZIP Members

2023-11-15 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-2369:

Priority: Major  (was: Trivial)

> Clarify Support for Pages Compressed with Multiple GZIP Members
> ---
>
> Key: PARQUET-2369
> URL: https://issues.apache.org/jira/browse/PARQUET-2369
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Raphael Taylor-Davies
>Priority: Major
>
> https://github.com/apache/parquet-testing/pull/41





[jira] [Updated] (PARQUET-1646) [C++] Use arrow::Buffer for buffered dictionary indices in DictEncoder instead of std::vector

2023-11-08 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-1646:

Fix Version/s: cpp-15.0.0
   (was: cpp-14.0.0)

> [C++] Use arrow::Buffer for buffered dictionary indices in DictEncoder 
> instead of std::vector
> -
>
> Key: PARQUET-1646
> URL: https://issues.apache.org/jira/browse/PARQUET-1646
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Major
> Fix For: cpp-15.0.0
>
>
> Follow up to ARROW-6411





[jira] [Updated] (PARQUET-2099) [C++] Statistics::num_values() is misleading

2023-11-08 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-2099:

Fix Version/s: cpp-15.0.0
   (was: cpp-14.0.0)

> [C++] Statistics::num_values() is misleading 
> -
>
> Key: PARQUET-2099
> URL: https://issues.apache.org/jira/browse/PARQUET-2099
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Micah Kornfield
>Priority: Major
> Fix For: cpp-15.0.0
>
>
> num_values() in statistics seems to capture the number of encoded values.  
> This is misleading because everywhere else in Parquet num_values() 
> indicates all values (null and not-null, i.e. the number of levels).  
> We should likely remove this field, rename it, or at the very least update the 
> documentation.
> CC [~zeroshade]
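The distinction can be illustrated with a plain-Python sketch (names here are illustrative, not the parquet-cpp API): elsewhere in Parquet, num_values counts levels, while the statistics object counts only encoded (non-null) values.

```python
# A nullable column chunk: one definition level per slot.
values = [1, None, 3, None, 5]

num_levels = len(values)                          # 5: "num_values" as used elsewhere
num_encoded = sum(v is not None for v in values)  # 3: what the statistics report

assert (num_levels, num_encoded) == (5, 3)
```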





[jira] [Updated] (PARQUET-2321) allow customized buffer size when creating ArrowInputStream for a column PageReader

2023-11-08 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-2321:

Fix Version/s: cpp-15.0.0
   (was: cpp-14.0.0)

> allow customized buffer size when creating ArrowInputStream for a column 
> PageReader
> ---
>
> Key: PARQUET-2321
> URL: https://issues.apache.org/jira/browse/PARQUET-2321
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Jinpeng Zhou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: cpp-15.0.0
>
>  Time Spent: 6h 10m
>  Remaining Estimate: 0h
>
> When buffered stream is enabled, all column chunks, regardless of their 
> actual sizes, currently share the same buffer size, which is stored in the 
> shared [read 
> properties](https://github.com/apache/arrow/blob/main/cpp/src/parquet/file_reader.cc#L213).
> Given a limited memory budget, one may want to customize the buffer size for 
> different column chunks based on their actual sizes, i.e., smaller chunks 
> would consume less of the memory budget for their buffers.
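A sizing policy of the kind the reporter suggests could look like this sketch (the function name, default, and floor are made up for illustration, not parquet-cpp API):

```python
def buffer_size_for(chunk_size: int, default: int = 1 << 20, floor: int = 8 << 10) -> int:
    """Pick a per-column-chunk buffer size: never larger than the chunk itself,
    capped at the shared default, with a small floor to avoid tiny reads."""
    return max(floor, min(default, chunk_size))

assert buffer_size_for(2 << 20) == 1 << 20   # big chunk: capped at the 1 MiB default
assert buffer_size_for(100_000) == 100_000   # mid-size chunk: buffer == chunk size
assert buffer_size_for(1_000) == 8 << 10     # tiny chunk: 8 KiB floor
```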





[jira] [Resolved] (PARQUET-2238) Spec and parquet-mr disagree on DELTA_BYTE_ARRAY encoding

2023-09-26 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved PARQUET-2238.
-
Resolution: Duplicate

> Spec and parquet-mr disagree on DELTA_BYTE_ARRAY encoding
> -
>
> Key: PARQUET-2238
> URL: https://issues.apache.org/jira/browse/PARQUET-2238
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format, parquet-mr
>Reporter: Jan Finis
>Priority: Minor
>
> The spec in parquet-format specifies that [DELTA_BYTE_ARRAY is only supported 
> for the physical type 
> BYTE_ARRAY|https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array--6].
>  Yet, [parquet-mr also uses it to encode 
> FIXED_LEN_BYTE_ARRAY|https://github.com/apache/parquet-mr/blob/fd1326a8a56174320ea2f36d7c6c4e78105ab108/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV2ValuesWriterFactory.java#L83].
> So, I guess the spec should be updated to include FIXED_LEN_BYTE_ARRAY in the 
> supported types of DELTA_BYTE_ARRAY encoding, or the code should be changed 
> to no longer write this encoding for FIXED_LEN_BYTE_ARRAY.
> I guess changing the spec is more prudent, given that 
> a) the encoding can make sense for FIXED_LEN_BYTE_ARRAY
> and
> b) there might already be countless files written with this encoding / type 
> combination.
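The encoding itself is type-agnostic, which supports option (a): a minimal sketch of DELTA_BYTE_ARRAY decoding (shared-prefix lengths plus suffixes, following the format spec; the helper name is ours) works unchanged when every value has the same fixed width:

```python
def decode_delta_byte_array(prefix_lengths, suffixes):
    """Each value shares prefix_lengths[i] leading bytes with the previous value."""
    out, prev = [], b""
    for plen, suffix in zip(prefix_lengths, suffixes):
        prev = prev[:plen] + suffix
        out.append(prev)
    return out

# FIXED_LEN_BYTE_ARRAY(4) values decode just as well as variable-length ones:
vals = decode_delta_byte_array([0, 3, 2], [b"abcd", b"e", b"xy"])
assert vals == [b"abcd", b"abce", b"abxy"]
```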





[jira] [Updated] (PARQUET-1646) [C++] Use arrow::Buffer for buffered dictionary indices in DictEncoder instead of std::vector

2023-08-24 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-1646:

Fix Version/s: cpp-14.0.0
   (was: cpp-13.0.0)

> [C++] Use arrow::Buffer for buffered dictionary indices in DictEncoder 
> instead of std::vector
> -
>
> Key: PARQUET-1646
> URL: https://issues.apache.org/jira/browse/PARQUET-1646
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Major
> Fix For: cpp-14.0.0
>
>
> Follow up to ARROW-6411





[jira] [Updated] (PARQUET-2321) allow customized buffer size when creating ArrowInputStream for a column PageReader

2023-08-24 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-2321:

Fix Version/s: cpp-14.0.0
   (was: cpp-13.0.0)

> allow customized buffer size when creating ArrowInputStream for a column 
> PageReader
> ---
>
> Key: PARQUET-2321
> URL: https://issues.apache.org/jira/browse/PARQUET-2321
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Jinpeng Zhou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: cpp-14.0.0
>
>  Time Spent: 6h 10m
>  Remaining Estimate: 0h
>
> When buffered stream is enabled, all column chunks, regardless of their 
> actual sizes, currently share the same buffer size, which is stored in the 
> shared [read 
> properties](https://github.com/apache/arrow/blob/main/cpp/src/parquet/file_reader.cc#L213).
> Given a limited memory budget, one may want to customize the buffer size for 
> different column chunks based on their actual sizes, i.e., smaller chunks 
> would consume less of the memory budget for their buffers.





[jira] [Updated] (PARQUET-2099) [C++] Statistics::num_values() is misleading

2023-08-24 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-2099:

Fix Version/s: cpp-14.0.0
   (was: cpp-13.0.0)

> [C++] Statistics::num_values() is misleading 
> -
>
> Key: PARQUET-2099
> URL: https://issues.apache.org/jira/browse/PARQUET-2099
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Micah Kornfield
>Priority: Major
> Fix For: cpp-14.0.0
>
>
> num_values() in statistics seems to capture the number of encoded values.  
> This is misleading because everywhere else in Parquet num_values() 
> indicates all values (null and not-null, i.e. the number of levels).  
> We should likely remove this field, rename it, or at the very least update the 
> documentation.
> CC [~zeroshade]





[jira] [Updated] (PARQUET-2323) Use bit vector to store Prebuffered column chunk index

2023-07-28 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-2323:

Fix Version/s: cpp-13.0.0
   (was: cpp-14.0.0)

> Use bit vector to store Prebuffered column chunk index
> --
>
> Key: PARQUET-2323
> URL: https://issues.apache.org/jira/browse/PARQUET-2323
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Jinpeng Zhou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: cpp-13.0.0
>
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> In https://issues.apache.org/jira/browse/PARQUET-2316 we allowed partial 
> buffering in the Parquet file reader by storing prebuffered column chunk 
> indices in a hash set, and making a copy of this hash set for each row group 
> reader.
> In extreme conditions where numerous columns are prebuffered and multiple 
> row group readers are created for the same row group, the hash set would 
> incur significant overhead.
> Using a bit vector would be a reasonable mitigation, taking 4 KB for 32K 
> columns.
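The space claim is easy to check with a toy bit set (illustrative Python, not the C++ implementation):

```python
class ColumnBitSet:
    """Membership set over column indices backed by one bit per column."""
    def __init__(self, num_columns: int):
        self.bits = bytearray((num_columns + 7) // 8)

    def add(self, i: int) -> None:
        self.bits[i >> 3] |= 1 << (i & 7)

    def __contains__(self, i: int) -> bool:
        return bool(self.bits[i >> 3] & (1 << (i & 7)))

s = ColumnBitSet(32 * 1024)
s.add(12345)
assert 12345 in s and 12346 not in s
assert len(s.bits) == 4096  # 32K columns fit in 4 KB, as stated above
```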





[jira] [Commented] (PARQUET-2222) [Format] RLE encoding spec incorrect for v2 data pages

2023-07-26 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17747306#comment-17747306
 ] 

Antoine Pitrou commented on PARQUET-2222:
-

bq. Should we just keep the specs as is and let the implementations decide 
which encoding to use for boolean values?

Makes sense. But can you please open an issue for these discussions? This is 
unrelated to the issue I originally reported, which has since been fixed.

> [Format] RLE encoding spec incorrect for v2 data pages
> --
>
> Key: PARQUET-2222
> URL: https://issues.apache.org/jira/browse/PARQUET-2222
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Antoine Pitrou
>Assignee: Xuwei Fu
>Priority: Critical
> Fix For: format-2.10.0
>
>
> The spec 
> (https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)
>  has this:
> {code}
> rle-bit-packed-hybrid: <length> <encoded-data>
> length := length of the <encoded-data> in bytes stored as 4 bytes little 
> endian (unsigned int32)
> {code}
> But the length is actually prepended only in v1 data pages, not in v2 data 
> pages.
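The difference in framing can be sketched in Python (helper names are ours): a v1 data page prepends the 4-byte little-endian length, while a v2 data page stores the same RLE payload bare, its length coming from the page header instead.

```python
import struct

def frame_rle_v1(encoded: bytes) -> bytes:
    # v1 data pages: <length> <encoded-data>, length as little-endian uint32
    return struct.pack("<I", len(encoded)) + encoded

def frame_rle_v2(encoded: bytes) -> bytes:
    # v2 data pages: bare <encoded-data>; the length lives in the page header
    return encoded

payload = b"\x08\x01"  # e.g. an RLE run: varint header (4 << 1), then the value byte
assert frame_rle_v1(payload) == b"\x02\x00\x00\x00" + payload
assert frame_rle_v2(payload) == payload
```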





[jira] [Commented] (PARQUET-2222) [Format] RLE encoding spec incorrect for v2 data pages

2023-07-25 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746834#comment-17746834
 ] 

Antoine Pitrou commented on PARQUET-2222:
-

There are other implementations around, so I would be a bit uneasy about 
changing the spec like this.
Perhaps we should simply switch to v2 data pages by default in parquet-cpp and 
parquet-mr at some point?

> [Format] RLE encoding spec incorrect for v2 data pages
> --
>
> Key: PARQUET-2222
> URL: https://issues.apache.org/jira/browse/PARQUET-2222
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Antoine Pitrou
>Assignee: Xuwei Fu
>Priority: Critical
> Fix For: format-2.10.0
>
>
> The spec 
> (https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)
>  has this:
> {code}
> rle-bit-packed-hybrid: <length> <encoded-data>
> length := length of the <encoded-data> in bytes stored as 4 bytes little 
> endian (unsigned int32)
> {code}
> But the length is actually prepended only in v1 data pages, not in v2 data 
> pages.





[jira] [Updated] (PARQUET-2323) Use bit vector to store Prebuffered column chunk index

2023-07-19 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-2323:

Fix Version/s: cpp-14.0.0
   (was: cpp-13.0.0)

> Use bit vector to store Prebuffered column chunk index
> --
>
> Key: PARQUET-2323
> URL: https://issues.apache.org/jira/browse/PARQUET-2323
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Jinpeng Zhou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: cpp-14.0.0
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> In https://issues.apache.org/jira/browse/PARQUET-2316 we allowed partial 
> buffering in the Parquet file reader by storing prebuffered column chunk 
> indices in a hash set, and making a copy of this hash set for each row group 
> reader.
> In extreme conditions where numerous columns are prebuffered and multiple 
> row group readers are created for the same row group, the hash set would 
> incur significant overhead.
> Using a bit vector would be a reasonable mitigation, taking 4 KB for 32K 
> columns.





[jira] [Resolved] (PARQUET-2323) Use bit vector to store Prebuffered column chunk index

2023-07-19 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved PARQUET-2323.
-
Resolution: Fixed

Issue resolved by pull request 36649
https://github.com/apache/arrow/pull/36649

> Use bit vector to store Prebuffered column chunk index
> --
>
> Key: PARQUET-2323
> URL: https://issues.apache.org/jira/browse/PARQUET-2323
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Jinpeng Zhou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: cpp-13.0.0
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> In https://issues.apache.org/jira/browse/PARQUET-2316 we allowed partial 
> buffering in the Parquet file reader by storing prebuffered column chunk 
> indices in a hash set, and making a copy of this hash set for each row group 
> reader.
> In extreme conditions where numerous columns are prebuffered and multiple 
> row group readers are created for the same row group, the hash set would 
> incur significant overhead.
> Using a bit vector would be a reasonable mitigation, taking 4 KB for 32K 
> columns.





[jira] [Comment Edited] (PARQUET-2222) [Format] RLE encoding spec incorrect for v2 data pages

2023-06-15 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17733129#comment-17733129
 ] 

Antoine Pitrou edited comment on PARQUET-2222 at 6/15/23 3:32 PM:
--

Resolved in https://github.com/apache/parquet-format/pull/193


was (Author: pitrou):
Resolved in https://github.com/apache/parquet-format/pull/211

> [Format] RLE encoding spec incorrect for v2 data pages
> --
>
> Key: PARQUET-2222
> URL: https://issues.apache.org/jira/browse/PARQUET-2222
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Antoine Pitrou
>Assignee: Gang Wu
>Priority: Critical
> Fix For: format-2.10.0
>
>
> The spec 
> (https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)
>  has this:
> {code}
> rle-bit-packed-hybrid: <length> <encoded-data>
> length := length of the <encoded-data> in bytes stored as 4 bytes little 
> endian (unsigned int32)
> {code}
> But the length is actually prepended only in v1 data pages, not in v2 data 
> pages.





[jira] [Resolved] (PARQUET-2222) [Format] RLE encoding spec incorrect for v2 data pages

2023-06-15 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved PARQUET-2222.
-
Resolution: Fixed

Resolved in https://github.com/apache/parquet-format/pull/211

> [Format] RLE encoding spec incorrect for v2 data pages
> --
>
> Key: PARQUET-2222
> URL: https://issues.apache.org/jira/browse/PARQUET-2222
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Antoine Pitrou
>Assignee: Gang Wu
>Priority: Critical
> Fix For: format-2.10.0
>
>
> The spec 
> (https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)
>  has this:
> {code}
> rle-bit-packed-hybrid: <length> <encoded-data>
> length := length of the <encoded-data> in bytes stored as 4 bytes little 
> endian (unsigned int32)
> {code}
> But the length is actually prepended only in v1 data pages, not in v2 data 
> pages.





[jira] [Commented] (PARQUET-2310) [Doc] Add implementation status / matrix

2023-06-15 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17733100#comment-17733100
 ] 

Antoine Pitrou commented on PARQUET-2310:
-

This was originally proposed in https://github.com/apache/arrow/pull/36027

> [Doc] Add implementation status / matrix
> 
>
> Key: PARQUET-2310
> URL: https://issues.apache.org/jira/browse/PARQUET-2310
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-format
>Reporter: Antoine Pitrou
>Priority: Major
>
> In Apache Arrow we have a documentation page listing the feature status for 
> various implementations of Arrow: https://arrow.apache.org/docs/status.html
> It could be nice to have a similar page for the main Parquet implementations 
> (at least Java, C++, Rust).
> The main downside is that it needs to be kept up to date.





[jira] [Commented] (PARQUET-2310) [Doc] Add implementation status / matrix

2023-06-15 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17733099#comment-17733099
 ] 

Antoine Pitrou commented on PARQUET-2310:
-

cc [~wgtmac] [~gszadovszky] [~alippai]

> [Doc] Add implementation status / matrix
> 
>
> Key: PARQUET-2310
> URL: https://issues.apache.org/jira/browse/PARQUET-2310
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-format
>Reporter: Antoine Pitrou
>Priority: Major
>
> In Apache Arrow we have a documentation page listing the feature status for 
> various implementations of Arrow: https://arrow.apache.org/docs/status.html
> It could be nice to have a similar page for the main Parquet implementations 
> (at least Java, C++, Rust).
> The main downside is that it needs to be kept up to date.





[jira] [Created] (PARQUET-2310) [Doc] Add implementation status / matrix

2023-06-15 Thread Antoine Pitrou (Jira)
Antoine Pitrou created PARQUET-2310:
---

 Summary: [Doc] Add implementation status / matrix
 Key: PARQUET-2310
 URL: https://issues.apache.org/jira/browse/PARQUET-2310
 Project: Parquet
  Issue Type: Task
  Components: parquet-format
Reporter: Antoine Pitrou


In Apache Arrow we have a documentation page listing the feature status for 
various implementations of Arrow: https://arrow.apache.org/docs/status.html

It could be nice to have a similar page for the main Parquet implementations 
(at least Java, C++, Rust).

The main downside is that it needs to be kept up to date.





[jira] [Commented] (PARQUET-2222) [Format] RLE encoding spec incorrect for v2 data pages

2023-02-27 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17693891#comment-17693891
 ] 

Antoine Pitrou commented on PARQUET-2222:
-

Yes, this is why I've filed this under parquet-format.

> [Format] RLE encoding spec incorrect for v2 data pages
> --
>
> Key: PARQUET-2222
> URL: https://issues.apache.org/jira/browse/PARQUET-2222
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Antoine Pitrou
>Priority: Critical
> Fix For: format-2.10.0
>
>
> The spec 
> (https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)
>  has this:
> {code}
> rle-bit-packed-hybrid: <length> <encoded-data>
> length := length of the <encoded-data> in bytes stored as 4 bytes little 
> endian (unsigned int32)
> {code}
> But the length is actually prepended only in v1 data pages, not in v2 data 
> pages.





[jira] [Commented] (PARQUET-2222) [Format] RLE encoding spec incorrect for v2 data pages

2023-02-27 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17693870#comment-17693870
 ] 

Antoine Pitrou commented on PARQUET-2222:
-

> I don't understand. Isn't length the part of encoding in spec?

What do you mean?

> And seems that DataPageV2 in parquet-mr is not in-prod?

What is that supposed to mean?

> [Format] RLE encoding spec incorrect for v2 data pages
> --
>
> Key: PARQUET-2222
> URL: https://issues.apache.org/jira/browse/PARQUET-2222
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Antoine Pitrou
>Priority: Critical
> Fix For: format-2.10.0
>
>
> The spec 
> (https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)
>  has this:
> {code}
> rle-bit-packed-hybrid: <length> <encoded-data>
> length := length of the <encoded-data> in bytes stored as 4 bytes little 
> endian (unsigned int32)
> {code}
> But the length is actually prepended only in v1 data pages, not in v2 data 
> pages.





[jira] [Resolved] (PARQUET-2231) [Format] Encoding spec incorrect for DELTA_BYTE_ARRAY

2023-01-19 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved PARQUET-2231.
-
Resolution: Fixed

Closed by PR https://github.com/apache/parquet-format/pull/189

> [Format] Encoding spec incorrect for DELTA_BYTE_ARRAY
> -
>
> Key: PARQUET-2231
> URL: https://issues.apache.org/jira/browse/PARQUET-2231
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Critical
> Fix For: format-2.10.0
>
>
> The spec says that DELTA_BYTE_ARRAY is only supported for BYTE_ARRAY, but in 
> parquet-mr it has been allowed for FIXED_LEN_BYTE_ARRAY as well since 2015.





[jira] [Updated] (PARQUET-152) Encoding issue with fixed length byte arrays

2023-01-16 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-152:
---
Component/s: parquet-mr

> Encoding issue with fixed length byte arrays
> 
>
> Key: PARQUET-152
> URL: https://issues.apache.org/jira/browse/PARQUET-152
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Reporter: Nezih Yigitbasi
>Assignee: Sergio Peña
>Priority: Minor
> Fix For: 1.8.0
>
>
> While running some tests against the master branch I hit an encoding issue 
> that seemed like a bug to me.
> I noticed that when writing a fixed length byte array and the array's size is 
> > dictionaryPageSize (in my test it was 512), the encoding falls back to 
> DELTA_BYTE_ARRAY as seen below:
> {noformat}
> Dec 17, 2014 3:41:10 PM INFO: parquet.hadoop.ColumnChunkPageWriteStore: 
> written 12,125B for [flba_field] FIXED_LEN_BYTE_ARRAY: 5,000 values, 1,710B 
> raw, 1,710B comp, 5 pages, encodings: [DELTA_BYTE_ARRAY]
> {noformat}
> But then read fails with the following exception:
> {noformat}
> Caused by: parquet.io.ParquetDecodingException: Encoding DELTA_BYTE_ARRAY is 
> only supported for type BINARY
>   at parquet.column.Encoding$7.getValuesReader(Encoding.java:193)
>   at 
> parquet.column.impl.ColumnReaderImpl.initDataReader(ColumnReaderImpl.java:534)
>   at 
> parquet.column.impl.ColumnReaderImpl.readPageV2(ColumnReaderImpl.java:574)
>   at 
> parquet.column.impl.ColumnReaderImpl.access$400(ColumnReaderImpl.java:54)
>   at 
> parquet.column.impl.ColumnReaderImpl$3.visit(ColumnReaderImpl.java:518)
>   at 
> parquet.column.impl.ColumnReaderImpl$3.visit(ColumnReaderImpl.java:510)
>   at parquet.column.page.DataPageV2.accept(DataPageV2.java:123)
>   at 
> parquet.column.impl.ColumnReaderImpl.readPage(ColumnReaderImpl.java:510)
>   at 
> parquet.column.impl.ColumnReaderImpl.checkRead(ColumnReaderImpl.java:502)
>   at 
> parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:604)
>   at 
> parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:348)
>   at 
> parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:63)
>   at 
> parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:58)
>   at 
> parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:267)
>   at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:131)
>   at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:96)
>   at 
> parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:136)
>   at parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:96)
>   at 
> parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:129)
>   at 
> parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:198)
>   ... 16 more
> {noformat}
> When the array's size is < dictionaryPageSize, RLE_DICTIONARY encoding is 
> used and read works fine:
> {noformat}
> Dec 17, 2014 3:39:50 PM INFO: parquet.hadoop.ColumnChunkPageWriteStore: 
> written 50B for [flba_field] FIXED_LEN_BYTE_ARRAY: 5,000 values, 3B raw, 3B 
> comp, 1 pages, encodings: [RLE_DICTIONARY, PLAIN], dic { 1 entries, 8B raw, 
> 1B comp}
> {noformat}
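The reported behavior amounts to a size-based encoding fallback, which can be sketched as follows (the function name and threshold semantics are our simplification of parquet-mr's writer, not its actual code):

```python
def choose_encoding(value_size: int, dictionary_page_size: int = 512) -> str:
    """Fall back from dictionary encoding when values are too large for it."""
    if value_size <= dictionary_page_size:
        return "RLE_DICTIONARY"   # reads back fine
    return "DELTA_BYTE_ARRAY"     # readers rejected this for FIXED_LEN_BYTE_ARRAY

assert choose_encoding(8) == "RLE_DICTIONARY"
assert choose_encoding(1024) == "DELTA_BYTE_ARRAY"
```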





[jira] [Commented] (PARQUET-2231) [Format] Encoding spec incorrect for DELTA_BYTE_ARRAY

2023-01-16 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17677300#comment-17677300
 ] 

Antoine Pitrou commented on PARQUET-2231:
-

[~rok] [~shanhuang] [~muthunagappan] [~jinshang] FYI

> [Format] Encoding spec incorrect for DELTA_BYTE_ARRAY
> -
>
> Key: PARQUET-2231
> URL: https://issues.apache.org/jira/browse/PARQUET-2231
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Critical
> Fix For: format-2.10.0
>
>
> The spec says that DELTA_BYTE_ARRAY is only supported for BYTE_ARRAY, but in 
> parquet-mr it has been allowed for FIXED_LEN_BYTE_ARRAY as well since 2015.





[jira] [Created] (PARQUET-2231) [Format] Encoding spec incorrect for DELTA_BYTE_ARRAY

2023-01-16 Thread Antoine Pitrou (Jira)
Antoine Pitrou created PARQUET-2231:
---

 Summary: [Format] Encoding spec incorrect for DELTA_BYTE_ARRAY
 Key: PARQUET-2231
 URL: https://issues.apache.org/jira/browse/PARQUET-2231
 Project: Parquet
  Issue Type: Bug
  Components: parquet-format
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou
 Fix For: format-2.10.0


The spec says that DELTA_BYTE_ARRAY is only supported for BYTE_ARRAY, but in 
parquet-mr it has been allowed for FIXED_LEN_BYTE_ARRAY as well since 2015.





[jira] [Commented] (PARQUET-152) Encoding issue with fixed length byte arrays

2023-01-16 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17677297#comment-17677297
 ] 

Antoine Pitrou commented on PARQUET-152:


It would be nice if the encodings spec had been updated as well: it still 
states that DELTA_BYTE_ARRAY is only supported for BYTE_ARRAY columns, not 
FIXED_LEN_BYTE_ARRAY. See PARQUET-2231.

> Encoding issue with fixed length byte arrays
> 
>
> Key: PARQUET-152
> URL: https://issues.apache.org/jira/browse/PARQUET-152
> Project: Parquet
>  Issue Type: Bug
>Reporter: Nezih Yigitbasi
>Assignee: Sergio Peña
>Priority: Minor
> Fix For: 1.8.0
>
>
> While running some tests against the master branch I hit an encoding issue 
> that seemed like a bug to me.
> I noticed that when writing a fixed length byte array and the array's size is 
> > dictionaryPageSize (in my test it was 512), the encoding falls back to 
> DELTA_BYTE_ARRAY as seen below:
> {noformat}
> Dec 17, 2014 3:41:10 PM INFO: parquet.hadoop.ColumnChunkPageWriteStore: 
> written 12,125B for [flba_field] FIXED_LEN_BYTE_ARRAY: 5,000 values, 1,710B 
> raw, 1,710B comp, 5 pages, encodings: [DELTA_BYTE_ARRAY]
> {noformat}
> But then read fails with the following exception:
> {noformat}
> Caused by: parquet.io.ParquetDecodingException: Encoding DELTA_BYTE_ARRAY is 
> only supported for type BINARY
>   at parquet.column.Encoding$7.getValuesReader(Encoding.java:193)
>   at 
> parquet.column.impl.ColumnReaderImpl.initDataReader(ColumnReaderImpl.java:534)
>   at 
> parquet.column.impl.ColumnReaderImpl.readPageV2(ColumnReaderImpl.java:574)
>   at 
> parquet.column.impl.ColumnReaderImpl.access$400(ColumnReaderImpl.java:54)
>   at 
> parquet.column.impl.ColumnReaderImpl$3.visit(ColumnReaderImpl.java:518)
>   at 
> parquet.column.impl.ColumnReaderImpl$3.visit(ColumnReaderImpl.java:510)
>   at parquet.column.page.DataPageV2.accept(DataPageV2.java:123)
>   at 
> parquet.column.impl.ColumnReaderImpl.readPage(ColumnReaderImpl.java:510)
>   at 
> parquet.column.impl.ColumnReaderImpl.checkRead(ColumnReaderImpl.java:502)
>   at 
> parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:604)
>   at 
> parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:348)
>   at 
> parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:63)
>   at 
> parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:58)
>   at 
> parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:267)
>   at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:131)
>   at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:96)
>   at 
> parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:136)
>   at parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:96)
>   at 
> parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:129)
>   at 
> parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:198)
>   ... 16 more
> {noformat}
> When the array's size is < dictionaryPageSize, RLE_DICTIONARY encoding is 
> used and read works fine:
> {noformat}
> Dec 17, 2014 3:39:50 PM INFO: parquet.hadoop.ColumnChunkPageWriteStore: 
> written 50B for [flba_field] FIXED_LEN_BYTE_ARRAY: 5,000 values, 3B raw, 3B 
> comp, 1 pages, encodings: [RLE_DICTIONARY, PLAIN], dic { 1 entries, 8B raw, 
> 1B comp}
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2222) [Format] RLE encoding spec incorrect for v2 data pages

2023-01-04 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17654524#comment-17654524
 ] 

Antoine Pitrou commented on PARQUET-2222:
-

cc [~julienledem] [~pnarang] [~rdblue] [~alexlevenson]

> [Format] RLE encoding spec incorrect for v2 data pages
> --
>
> Key: PARQUET-2222
> URL: https://issues.apache.org/jira/browse/PARQUET-2222
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Antoine Pitrou
>Priority: Critical
> Fix For: format-2.10.0
>
>
> The spec 
> (https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)
>  has this:
> {code}
> rle-bit-packed-hybrid: <length> <encoded-data>
> length := length of the <encoded-data> in bytes stored as 4 bytes little 
> endian (unsigned int32)
> {code}
> But the length is actually prepended only in v1 data pages, not in v2 data 
> pages.
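The difference is easy to see in code. A minimal sketch of a reader that handles both page layouts (pure Python, illustrative only; the function and parameter names are invented for the example):

```python
import struct

def read_rle_bit_packed_hybrid(buf, data_page_version, encoded_len=None):
    """Return the <encoded-data> bytes of an RLE/bit-packed hybrid run.
    In v1 data pages the length is prepended as a 4-byte little-endian
    uint32; in v2 data pages there is no prefix, so the length must come
    from the page header (encoded_len)."""
    if data_page_version == 1:
        (length,) = struct.unpack_from("<I", buf, 0)
        return buf[4:4 + length]
    assert encoded_len is not None, "v2: length comes from the page header"
    return buf[:encoded_len]
```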





[jira] [Created] (PARQUET-2222) [Format] RLE encoding spec incorrect for v2 data pages

2023-01-04 Thread Antoine Pitrou (Jira)
Antoine Pitrou created PARQUET-2222:
---

 Summary: [Format] RLE encoding spec incorrect for v2 data pages
 Key: PARQUET-2222
 URL: https://issues.apache.org/jira/browse/PARQUET-2222
 Project: Parquet
  Issue Type: Bug
  Components: parquet-format
Reporter: Antoine Pitrou
 Fix For: format-2.10.0


The spec 
(https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)
 has this:
{code}
rle-bit-packed-hybrid: <length> <encoded-data>
length := length of the <encoded-data> in bytes stored as 4 bytes little endian 
(unsigned int32)
{code}

But the length is actually prepended only in v1 data pages, not in v2 data 
pages.







[jira] [Resolved] (PARQUET-2218) [Format] Clarify CRC computation

2023-01-03 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved PARQUET-2218.
-
Resolution: Fixed

Fixed by PR https://github.com/apache/parquet-format/pull/188

> [Format] Clarify CRC computation
> 
>
> Key: PARQUET-2218
> URL: https://issues.apache.org/jira/browse/PARQUET-2218
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Minor
> Fix For: format-2.10.0
>
>
> The format spec on CRC checksumming felt ambiguous when trying to implement 
> it in Parquet C++, so we should make the wording clearer.
> (see discussion on 
> https://github.com/apache/parquet-format/pull/126#issuecomment-1348081137 and 
> below)
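As I understand the clarified wording, the checksum is a plain CRC-32 over the page's on-disk bytes, excluding the page header. A minimal sketch of that reading (names are mine, not from the spec):

```python
import zlib

def page_crc(page_bytes_on_disk: bytes) -> int:
    """Sketch of the clarified rule: the page 'crc' field is a standard
    CRC-32 (same polynomial as gzip/zlib), computed over the page's
    serialized bytes as written to disk -- i.e. after compression --
    and excluding the page header itself."""
    return zlib.crc32(page_bytes_on_disk) & 0xFFFFFFFF
```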





[jira] [Commented] (PARQUET-2221) [Format] Encoding spec incorrect for dictionary fallback

2023-01-03 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17654025#comment-17654025
 ] 

Antoine Pitrou commented on PARQUET-2221:
-

cc [~julienledem] [~pnarang] [~rdblue] [~alexlevenson]

> [Format] Encoding spec incorrect for dictionary fallback
> 
>
> Key: PARQUET-2221
> URL: https://issues.apache.org/jira/browse/PARQUET-2221
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Antoine Pitrou
>Priority: Critical
> Fix For: format-2.10.0
>
>
> The spec for DICTIONARY_ENCODING states that:
> bq. If the dictionary grows too big, whether in size or number of distinct 
> values, the encoding will fall back to the plain encoding. 
> https://github.com/apache/parquet-format/blob/master/Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8
> However, the parquet-mr implementation was deliberately changed to a 
> different fallback mechanism in 
> https://issues.apache.org/jira/browse/PARQUET-52
> I'm assuming the parquet-mr implementation is authoritative here. But then 
> the spec is incorrect and should be fixed to reflect expected behavior.





[jira] [Updated] (PARQUET-52) Improve the encoding fall back mechanism for Parquet 2.0

2023-01-03 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-52?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-52:
--
Description: 
https://github.com/apache/incubator-parquet-mr/pull/74

-> moved to https://github.com/apache/parquet-mr/pull/74

  was:https://github.com/apache/incubator-parquet-mr/pull/74


> Improve the encoding fall back mechanism for Parquet 2.0
> 
>
> Key: PARQUET-52
> URL: https://issues.apache.org/jira/browse/PARQUET-52
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Julien Le Dem
>Assignee: Julien Le Dem
>Priority: Major
> Fix For: 1.6.0
>
>
> https://github.com/apache/incubator-parquet-mr/pull/74
> -> moved to https://github.com/apache/parquet-mr/pull/74





[jira] [Created] (PARQUET-2221) [Format] Encoding spec incorrect for dictionary fallback

2023-01-03 Thread Antoine Pitrou (Jira)
Antoine Pitrou created PARQUET-2221:
---

 Summary: [Format] Encoding spec incorrect for dictionary fallback
 Key: PARQUET-2221
 URL: https://issues.apache.org/jira/browse/PARQUET-2221
 Project: Parquet
  Issue Type: Bug
  Components: parquet-format
Reporter: Antoine Pitrou
 Fix For: format-2.10.0


The spec for DICTIONARY_ENCODING states that:

bq. If the dictionary grows too big, whether in size or number of distinct 
values, the encoding will fall back to the plain encoding. 

https://github.com/apache/parquet-format/blob/master/Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8

However, the parquet-mr implementation was deliberately changed to a different 
fallback mechanism in https://issues.apache.org/jira/browse/PARQUET-52

I'm assuming the parquet-mr implementation is authoritative here. But then the 
spec is incorrect and should be fixed to reflect expected behavior.
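The general shape of the parquet-mr fallback can be sketched as follows. This is an illustrative model only (the class name, threshold, and buffering details are invented for the example), not the actual parquet-mr logic:

```python
class FallbackValuesWriter:
    """Sketch of a dictionary writer with fallback: values are written with
    dictionary encoding until the dictionary grows past a size threshold;
    from then on, *subsequent* values switch to a non-dictionary encoding,
    instead of everything being re-encoded with PLAIN as the old spec text
    suggested. (The real parquet-mr heuristics differ in the details.)"""

    def __init__(self, max_dict_bytes=1024):
        self.max_dict_bytes = max_dict_bytes
        self.dictionary = {}   # value -> dictionary index
        self.indices = []      # dictionary-encoded output so far
        self.fallback = None   # values written after falling back

    def write(self, value: bytes):
        if self.fallback is not None:
            self.fallback.append(value)   # already fell back
            return
        if value not in self.dictionary:
            self.dictionary[value] = len(self.dictionary)
        self.indices.append(self.dictionary[value])
        if sum(len(v) for v in self.dictionary) > self.max_dict_bytes:
            self.fallback = []            # dictionary too big: fall back
```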






[jira] [Updated] (PARQUET-796) Delta Encoding is not used when dictionary enabled

2023-01-03 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-796:
---
Priority: Major  (was: Critical)

> Delta Encoding is not used when dictionary enabled
> --
>
> Key: PARQUET-796
> URL: https://issues.apache.org/jira/browse/PARQUET-796
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.9.0
>Reporter: Jakub Liska
>Priority: Major
>
> Current code doesn't enable using both Delta Encoding and Dictionary 
> Encoding. If I instantiate ParquetWriter like this : 
> {code}
> val writer = new ParquetWriter[Group](outFile, new GroupWriteSupport, codec, 
> blockSize, pageSize, dictPageSize, enableDictionary = true, true, 
> ParquetProperties.WriterVersion.PARQUET_2_0, configuration)
> {code}
> Then this piece of code : 
> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultValuesWriterFactory.java#L78-L86
> Causes that DictionaryValuesWriter is used instead of the inferred 
> DeltaLongEncodingWriter. 
> The original issue is here : 
> https://github.com/apache/parquet-mr/pull/154#issuecomment-266489768





[jira] [Updated] (PARQUET-2218) [Format] Clarify CRC computation

2022-12-13 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-2218:

Description: 
The format spec on CRC checksumming felt ambiguous when trying to implement it 
in Parquet C++, so we should make the wording clearer.

(see discussion on 
https://github.com/apache/parquet-format/pull/126#issuecomment-1348081137 and 
below)

  was:The format spec on CRC checksumming felt ambiguous when trying to 
implement it in Parquet C++, so we should make the wording clearer.


> [Format] Clarify CRC computation
> 
>
> Key: PARQUET-2218
> URL: https://issues.apache.org/jira/browse/PARQUET-2218
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Minor
> Fix For: format-2.10.0
>
>
> The format spec on CRC checksumming felt ambiguous when trying to implement 
> it in Parquet C++, so we should make the wording clearer.
> (see discussion on 
> https://github.com/apache/parquet-format/pull/126#issuecomment-1348081137 and 
> below)





[jira] [Created] (PARQUET-2218) [Format] Clarify CRC computation

2022-12-13 Thread Antoine Pitrou (Jira)
Antoine Pitrou created PARQUET-2218:
---

 Summary: [Format] Clarify CRC computation
 Key: PARQUET-2218
 URL: https://issues.apache.org/jira/browse/PARQUET-2218
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-format
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou
 Fix For: format-2.10.0


The format spec on CRC checksumming felt ambiguous when trying to implement it 
in Parquet C++, so we should make the wording clearer.





[jira] [Commented] (PARQUET-1629) Page-level CRC checksum verification for DataPageV2

2022-12-13 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17646612#comment-17646612
 ] 

Antoine Pitrou commented on PARQUET-1629:
-

[~mwish] for the record. Perhaps you would be interested in doing this, if you 
can do some Java.

> Page-level CRC checksum verification for DataPageV2
> ---
>
> Key: PARQUET-1629
> URL: https://issues.apache.org/jira/browse/PARQUET-1629
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Boudewijn Braams
>Priority: Major
>
> In https://jira.apache.org/jira/browse/PARQUET-1580 (Github PR: 
> https://github.com/apache/parquet-mr/pull/647) we implemented page level CRC 
> checksum verification for DataPageV1. As a follow up, we should add support 
> for DataPageV2 that follows the spec (see 
> https://jira.apache.org/jira/browse/PARQUET-1539).
> What needs to be done:
> * Add writing out checksums for DataPageV2
> * Add checksum verification for DataPageV2
> * Create new test suite
> * Create new benchmarks





[jira] [Resolved] (PARQUET-2204) TypedColumnReaderImpl::Skip should reuse scratch space

2022-12-08 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved PARQUET-2204.
-
Fix Version/s: cpp-11.0.0
   Resolution: Fixed

Issue resolved by pull request 14509
https://github.com/apache/arrow/pull/14509

> TypedColumnReaderImpl::Skip should reuse scratch space
> --
>
> Key: PARQUET-2204
> URL: https://issues.apache.org/jira/browse/PARQUET-2204
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: fatemah
>Assignee: fatemah
>Priority: Minor
>  Labels: pull-request-available
> Fix For: cpp-11.0.0
>
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> TypedColumnReaderImpl::Skip allocates scratch space on every call. The 
> scratch space is used to read rep/def levels and values and throw them away. 
> The memory allocation slows down the skip based on microbenchmarks. The 
> scratch space can be allocated once and re-used.





[jira] [Assigned] (PARQUET-2204) TypedColumnReaderImpl::Skip should reuse scratch space

2022-12-08 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned PARQUET-2204:
---

Assignee: fatemah

> TypedColumnReaderImpl::Skip should reuse scratch space
> --
>
> Key: PARQUET-2204
> URL: https://issues.apache.org/jira/browse/PARQUET-2204
> Project: Parquet
>  Issue Type: Improvement
>Reporter: fatemah
>Assignee: fatemah
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> TypedColumnReaderImpl::Skip allocates scratch space on every call. The 
> scratch space is used to read rep/def levels and values and throw them away. 
> The memory allocation slows down the skip based on microbenchmarks. The 
> scratch space can be allocated once and re-used.





[jira] [Updated] (PARQUET-2204) TypedColumnReaderImpl::Skip should reuse scratch space

2022-12-08 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-2204:

Component/s: parquet-cpp

> TypedColumnReaderImpl::Skip should reuse scratch space
> --
>
> Key: PARQUET-2204
> URL: https://issues.apache.org/jira/browse/PARQUET-2204
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: fatemah
>Assignee: fatemah
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> TypedColumnReaderImpl::Skip allocates scratch space on every call. The 
> scratch space is used to read rep/def levels and values and throw them away. 
> The memory allocation slows down the skip based on microbenchmarks. The 
> scratch space can be allocated once and re-used.





[jira] [Updated] (PARQUET-1222) Specify a well-defined sorting order for float and double types

2022-12-07 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-1222:

Fix Version/s: format-2.10.0

> Specify a well-defined sorting order for float and double types
> ---
>
> Key: PARQUET-1222
> URL: https://issues.apache.org/jira/browse/PARQUET-1222
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Zoltan Ivanfi
>Assignee: Micah Kornfield
>Priority: Critical
> Fix For: format-2.10.0
>
>
> Currently parquet-format specifies the sort order for floating point numbers 
> as follows:
> {code:java}
>*   FLOAT - signed comparison of the represented value
>*   DOUBLE - signed comparison of the represented value
> {code}
> The problem is that the comparison of floating point numbers is only a 
> partial ordering with strange behaviour in specific corner cases. For 
> example, according to IEEE 754, -0 is neither less nor more than \+0 and 
> comparing NaN to anything always returns false. This ordering is not suitable 
> for statistics. Additionally, the Java implementation already uses a 
> different (total) ordering that handles these cases correctly but differently 
> than the C\+\+ implementations, which leads to interoperability problems.
> TypeDefinedOrder for doubles and floats should be deprecated and a new 
> TotalFloatingPointOrder should be introduced. The default for writing doubles 
> and floats would be the new TotalFloatingPointOrder. This ordering should be 
> effective and easy to implement in all programming languages.
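One standard construction for such a total order (not necessarily the exact TotalFloatingPointOrder the spec work settled on) maps the IEEE 754 bit pattern to an integer whose natural ordering is total:

```python
import struct

def total_order_key(x: float) -> int:
    """A common total order over doubles: negative values (sign bit set)
    have all bits inverted; non-negative values have the sign bit flipped.
    The resulting unsigned integers sort as
    -NaN < -inf < ... < -0 < +0 < ... < +inf < +NaN,
    so -0 and +0 are distinguished and NaN has a defined position."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    if bits & (1 << 63):
        return (~bits) & 0xFFFFFFFFFFFFFFFF   # negative: invert all bits
    return bits | (1 << 63)                   # non-negative: flip sign bit
```

This is cheap to implement in any language with access to the raw float bits, which is exactly the "effective and easy to implement" property the issue asks for.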





[jira] [Resolved] (PARQUET-1222) Specify a well-defined sorting order for float and double types

2022-12-07 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved PARQUET-1222.
-
Resolution: Fixed

> Specify a well-defined sorting order for float and double types
> ---
>
> Key: PARQUET-1222
> URL: https://issues.apache.org/jira/browse/PARQUET-1222
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Zoltan Ivanfi
>Assignee: Micah Kornfield
>Priority: Critical
>
> Currently parquet-format specifies the sort order for floating point numbers 
> as follows:
> {code:java}
>*   FLOAT - signed comparison of the represented value
>*   DOUBLE - signed comparison of the represented value
> {code}
> The problem is that the comparison of floating point numbers is only a 
> partial ordering with strange behaviour in specific corner cases. For 
> example, according to IEEE 754, -0 is neither less nor more than \+0 and 
> comparing NaN to anything always returns false. This ordering is not suitable 
> for statistics. Additionally, the Java implementation already uses a 
> different (total) ordering that handles these cases correctly but differently 
> than the C\+\+ implementations, which leads to interoperability problems.
> TypeDefinedOrder for doubles and floats should be deprecated and a new 
> TotalFloatingPointOrder should be introduced. The default for writing doubles 
> and floats would be the new TotalFloatingPointOrder. This ordering should be 
> effective and easy to implement in all programming languages.





[jira] [Assigned] (PARQUET-1222) Specify a well-defined sorting order for float and double types

2022-12-07 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned PARQUET-1222:
---

Assignee: Micah Kornfield

> Specify a well-defined sorting order for float and double types
> ---
>
> Key: PARQUET-1222
> URL: https://issues.apache.org/jira/browse/PARQUET-1222
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Zoltan Ivanfi
>Assignee: Micah Kornfield
>Priority: Critical
>
> Currently parquet-format specifies the sort order for floating point numbers 
> as follows:
> {code:java}
>*   FLOAT - signed comparison of the represented value
>*   DOUBLE - signed comparison of the represented value
> {code}
> The problem is that the comparison of floating point numbers is only a 
> partial ordering with strange behaviour in specific corner cases. For 
> example, according to IEEE 754, -0 is neither less nor more than \+0 and 
> comparing NaN to anything always returns false. This ordering is not suitable 
> for statistics. Additionally, the Java implementation already uses a 
> different (total) ordering that handles these cases correctly but differently 
> than the C\+\+ implementations, which leads to interoperability problems.
> TypeDefinedOrder for doubles and floats should be deprecated and a new 
> TotalFloatingPointOrder should be introduced. The default for writing doubles 
> and floats would be the new TotalFloatingPointOrder. This ordering should be 
> effective and easy to implement in all programming languages.





[jira] [Assigned] (PARQUET-2215) Document how DELTA_BINARY_PACKED handles overflow for deltas

2022-11-23 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned PARQUET-2215:
---

Assignee: Antoine Pitrou

> Document how DELTA_BINARY_PACKED handles overflow for deltas
> 
>
> Key: PARQUET-2215
> URL: https://issues.apache.org/jira/browse/PARQUET-2215
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-format
>Reporter: Rok Mihevc
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: docs
>
> [Current 
> docs|https://github.com/apache/parquet-format/blob/master/Encodings.md?plain=1#L160]
>  do not explicitly state how overflow is handled.
> [See 
> discussion|https://github.com/apache/arrow/pull/14191#discussion_r1028298973] 
> for more details.
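The question at issue is what a writer does when a delta does not fit in a signed 64-bit integer. Under the "wrap modulo 2^64" (two's-complement) interpretation discussed in the linked thread, the behavior can be sketched like this (an assumption for illustration, not the final spec wording):

```python
MASK64 = (1 << 64) - 1

def wrapping_deltas(values):
    """Compute consecutive deltas with wrapping 64-bit two's-complement
    arithmetic: e.g. INT64_MIN following INT64_MAX yields a well-defined
    wrapped delta of 1 instead of an overflow error."""
    deltas = []
    for prev, cur in zip(values, values[1:]):
        d = (cur - prev) & MASK64   # wrap to 64 bits
        if d >= 1 << 63:            # reinterpret as signed
            d -= 1 << 64
        deltas.append(d)
    return deltas
```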





[jira] [Resolved] (PARQUET-2206) Microbenchmark for ColumnReader ReadBatch and Skip

2022-11-15 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved PARQUET-2206.
-
Fix Version/s: cpp-11.0.0
   Resolution: Fixed

Issue resolved by pull request 14523
[https://github.com/apache/arrow/pull/14523]

> Microbenchmark for ColumnReader ReadBatch and Skip
> ---
>
> Key: PARQUET-2206
> URL: https://issues.apache.org/jira/browse/PARQUET-2206
> Project: Parquet
>  Issue Type: Improvement
>Reporter: fatemah
>Priority: Minor
>  Labels: pull-request-available
> Fix For: cpp-11.0.0
>
>  Time Spent: 5h 50m
>  Remaining Estimate: 0h
>
>  Adding a micro benchmark for column reader ReadBatch and Skip. Later, I will 
> add benchmarks for RecordReader's ReadRecords and SkipRecords.





[jira] [Updated] (PARQUET-2206) Microbenchmark for ColumnReader ReadBatch and Skip

2022-11-15 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-2206:

Component/s: parquet-cpp

> Microbenchmark for ColumnReader ReadBatch and Skip
> ---
>
> Key: PARQUET-2206
> URL: https://issues.apache.org/jira/browse/PARQUET-2206
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: fatemah
>Priority: Minor
>  Labels: pull-request-available
> Fix For: cpp-11.0.0
>
>  Time Spent: 5h 50m
>  Remaining Estimate: 0h
>
>  Adding a micro benchmark for column reader ReadBatch and Skip. Later, I will 
> add benchmarks for RecordReader's ReadRecords and SkipRecords.





[jira] [Assigned] (PARQUET-2206) Microbenchmark for ColumnReader ReadBatch and Skip

2022-11-15 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned PARQUET-2206:
---

Assignee: fatemah

> Microbenchmark for ColumnReader ReadBatch and Skip
> ---
>
> Key: PARQUET-2206
> URL: https://issues.apache.org/jira/browse/PARQUET-2206
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: fatemah
>Assignee: fatemah
>Priority: Minor
>  Labels: pull-request-available
> Fix For: cpp-11.0.0
>
>  Time Spent: 5h 50m
>  Remaining Estimate: 0h
>
>  Adding a micro benchmark for column reader ReadBatch and Skip. Later, I will 
> add benchmarks for RecordReader's ReadRecords and SkipRecords.





[jira] [Updated] (PARQUET-2210) Skip pages based on header metadata using a callback

2022-11-09 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-2210:

Component/s: parquet-cpp

> Skip pages based on header metadata using a callback
> 
>
> Key: PARQUET-2210
> URL: https://issues.apache.org/jira/browse/PARQUET-2210
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: fatemah
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Currently, we do not expose the page header metadata and they cannot be used 
> for skipping pages. I propose exposing the metadata through a callback that 
> would allow the caller to decide if they want to read or skip the page based 
> on the metadata. The signature of the callback would be the following: 
> std::function skip_page_callback)





[jira] [Updated] (PARQUET-2210) [C++] Skip pages based on header metadata using a callback

2022-11-09 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-2210:

Summary: [C++] Skip pages based on header metadata using a callback  (was: 
Skip pages based on header metadata using a callback)

> [C++] Skip pages based on header metadata using a callback
> --
>
> Key: PARQUET-2210
> URL: https://issues.apache.org/jira/browse/PARQUET-2210
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: fatemah
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Currently, we do not expose the page header metadata and they cannot be used 
> for skipping pages. I propose exposing the metadata through a callback that 
> would allow the caller to decide if they want to read or skip the page based 
> on the metadata. The signature of the callback would be the following: 
> std::function skip_page_callback)





[jira] [Assigned] (PARQUET-2211) [C++] Print ColumnMetaData.encoding_stats field

2022-11-06 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned PARQUET-2211:
---

Assignee: Gang Wu

> [C++] Print ColumnMetaData.encoding_stats field
> ---
>
> Key: PARQUET-2211
> URL: https://issues.apache.org/jira/browse/PARQUET-2211
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-11.0.0
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> The ParquetFilePrinter of parquet-cpp prints column chunk encodings solely 
> from the ColumnMetaData.encodings field. Since ColumnMetaData.encoding_stats 
> was introduced long ago and is a better source for obtaining encodings, the 
> printer should be aware of it.





[jira] [Resolved] (PARQUET-2211) [C++] Print ColumnMetaData.encoding_stats field

2022-11-06 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved PARQUET-2211.
-
Fix Version/s: cpp-11.0.0
   Resolution: Fixed

Issue resolved by pull request 14556
[https://github.com/apache/arrow/pull/14556]

> [C++] Print ColumnMetaData.encoding_stats field
> ---
>
> Key: PARQUET-2211
> URL: https://issues.apache.org/jira/browse/PARQUET-2211
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Gang Wu
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-11.0.0
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> The ParquetFilePrinter of parquet-cpp prints column chunk encodings solely 
> from the ColumnMetaData.encodings field. Since ColumnMetaData.encoding_stats 
> was introduced long ago and is a better source for obtaining encodings, the 
> printer should be aware of it.





[jira] [Assigned] (PARQUET-2209) [C++] Optimize skip for the case that number of values to skip equals page size

2022-11-02 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned PARQUET-2209:
---

Assignee: fatemah

> [C++] Optimize skip for the case that number of values to skip equals page 
> size
> ---
>
> Key: PARQUET-2209
> URL: https://issues.apache.org/jira/browse/PARQUET-2209
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: fatemah
>Assignee: fatemah
>Priority: Minor
>  Labels: pull-request-available
> Fix For: cpp-11.0.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Optimize skip for the case that the number of values to skip equals page 
> size. Right now, we end up reading to the end of the page and throwing away 
> the rep/defs and values that we have read, which is unnecessary.





[jira] [Resolved] (PARQUET-2209) [C++] Optimize skip for the case that number of values to skip equals page size

2022-11-02 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved PARQUET-2209.
-
Fix Version/s: cpp-11.0.0
   Resolution: Fixed

Issue resolved by pull request 14545
[https://github.com/apache/arrow/pull/14545]

> [C++] Optimize skip for the case that number of values to skip equals page 
> size
> ---
>
> Key: PARQUET-2209
> URL: https://issues.apache.org/jira/browse/PARQUET-2209
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: fatemah
>Priority: Minor
>  Labels: pull-request-available
> Fix For: cpp-11.0.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Optimize skip for the case that the number of values to skip equals page 
> size. Right now, we end up reading to the end of the page and throwing away 
> the rep/defs and values that we have read, which is unnecessary.





[jira] [Assigned] (PARQUET-2188) Add SkipRecords API to RecordReader

2022-10-31 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned PARQUET-2188:
---

Assignee: fatemah

> Add SkipRecords API to RecordReader
> ---
>
> Key: PARQUET-2188
> URL: https://issues.apache.org/jira/browse/PARQUET-2188
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: fatemah
>Assignee: fatemah
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-11.0.0
>
>  Time Spent: 16.5h
>  Remaining Estimate: 0h
>
> The RecordReader is missing an API to skip records. There is a Skip method in 
> the ColumnReader, but that skips based on the number of values/levels and not 
> records. For repeated fields, this SkipRecords API will detect the record 
> boundaries and correctly skip the right number of values for the requested 
> number of records.
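The boundary detection described above relies on the fact that, in Parquet's Dremel-style shredding, a value with repetition level 0 starts a new record. A small sketch of that counting step (function name and shape are mine, not the parquet-cpp API):

```python
def values_to_skip(rep_levels, num_records):
    """Given a page's repetition levels, return how many level entries
    make up the first `num_records` records: each rep level of 0 marks
    the start of a new record, so we advance until we have passed
    `num_records` record starts."""
    seen = 0
    for i, r in enumerate(rep_levels):
        if r == 0:                 # a new record begins here
            if seen == num_records:
                return i           # skip everything before this entry
            seen += 1
    return len(rep_levels)         # requested records reach the page end
```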



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (PARQUET-2188) Add SkipRecords API to RecordReader

2022-10-31 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved PARQUET-2188.
-
Fix Version/s: cpp-11.0.0
   Resolution: Fixed

Issue resolved by pull request 14142
[https://github.com/apache/arrow/pull/14142]

> Add SkipRecords API to RecordReader
> ---
>
> Key: PARQUET-2188
> URL: https://issues.apache.org/jira/browse/PARQUET-2188
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: fatemah
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-11.0.0
>
>  Time Spent: 16.5h
>  Remaining Estimate: 0h
>
> The RecordReader is missing an API to skip records. There is a Skip method in 
> the ColumnReader, but that skips based on the number of values/levels and not 
> records. For repeated fields, this SkipRecords API will detect the record 
> boundaries and correctly skip the right number of values for the requested 
> number of records.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (PARQUET-2209) [C++] Optimize skip for the case that number of values to skip equals page size

2022-10-31 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-2209:

Summary: [C++] Optimize skip for the case that number of values to skip 
equals page size  (was: Optimize skip for the case that number of values to 
skip equals page size)

> [C++] Optimize skip for the case that number of values to skip equals page 
> size
> ---
>
> Key: PARQUET-2209
> URL: https://issues.apache.org/jira/browse/PARQUET-2209
> Project: Parquet
>  Issue Type: Improvement
>Reporter: fatemah
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Optimize skip for the case that the number of values to skip equals page 
> size. Right now, we end up reading to the end of the page and throwing away 
> the rep/defs and values that we have read, which is unnecessary.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (PARQUET-2209) [C++] Optimize skip for the case that number of values to skip equals page size

2022-10-31 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-2209:

Component/s: parquet-cpp

> [C++] Optimize skip for the case that number of values to skip equals page 
> size
> ---
>
> Key: PARQUET-2209
> URL: https://issues.apache.org/jira/browse/PARQUET-2209
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: fatemah
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Optimize skip for the case that the number of values to skip equals page 
> size. Right now, we end up reading to the end of the page and throwing away 
> the rep/defs and values that we have read, which is unnecessary.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (PARQUET-1646) [C++] Use arrow::Buffer for buffered dictionary indices in DictEncoder instead of std::vector

2022-10-26 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-1646:

Fix Version/s: cpp-11.0.0
   (was: cpp-10.0.0)

> [C++] Use arrow::Buffer for buffered dictionary indices in DictEncoder 
> instead of std::vector
> -
>
> Key: PARQUET-1646
> URL: https://issues.apache.org/jira/browse/PARQUET-1646
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Major
> Fix For: cpp-11.0.0
>
>
> Follow-up to ARROW-6411



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (PARQUET-2099) [C++] Statistics::num_values() is misleading

2022-10-26 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-2099:

Fix Version/s: cpp-11.0.0
   (was: cpp-10.0.0)

> [C++] Statistics::num_values() is misleading 
> -
>
> Key: PARQUET-2099
> URL: https://issues.apache.org/jira/browse/PARQUET-2099
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Micah Kornfield
>Priority: Major
> Fix For: cpp-11.0.0
>
>
> num_values() in statistics seems to capture the number of encoded values.
> This is misleading, as everywhere else in Parquet num_values() really
> indicates all values (null and not-null, i.e. the number of levels).
> We should likely remove this field, rename it, or at the very least update
> the documentation.
> CC [~zeroshade]
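A tiny sketch of the two counts being conflated, with hypothetical names:

```cpp
#include <cstdint>

// Illustration of the ambiguity (hypothetical names): for an optional column,
// "number of values" can mean the number of levels (nulls included) or the
// number of encoded, non-null values. Per the ticket, Statistics::num_values()
// seems to report the encoded count, while num_values() elsewhere in Parquet
// means the level count.
struct ColumnChunkCounts {
  int64_t num_levels;  // null + non-null entries; num_values() elsewhere
  int64_t null_count;  // entries that are null
};

// What Statistics::num_values() apparently returns today.
int64_t EncodedValueCount(const ColumnChunkCounts& c) {
  return c.num_levels - c.null_count;
}
```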



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (PARQUET-2179) Add a test for skipping repeated fields

2022-10-18 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned PARQUET-2179:
---

Assignee: fatemah

> Add a test for skipping repeated fields
> ---
>
> Key: PARQUET-2179
> URL: https://issues.apache.org/jira/browse/PARQUET-2179
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: fatemah
>Assignee: fatemah
>Priority: Minor
>  Labels: pull-request-available
> Fix For: cpp-10.0.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> The existing test only covers non-repeated fields. This adds a test for
> repeated fields to make it clear that skipping operates on values, not records.





[jira] [Resolved] (PARQUET-2179) Add a test for skipping repeated fields

2022-10-18 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved PARQUET-2179.
-
Fix Version/s: cpp-10.0.0
   Resolution: Fixed

Issue resolved by pull request 14366
[https://github.com/apache/arrow/pull/14366]

> Add a test for skipping repeated fields
> ---
>
> Key: PARQUET-2179
> URL: https://issues.apache.org/jira/browse/PARQUET-2179
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: fatemah
>Priority: Minor
>  Labels: pull-request-available
> Fix For: cpp-10.0.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> The existing test only covers non-repeated fields. This adds a test for
> repeated fields to make it clear that skipping operates on values, not records.





[jira] [Commented] (PARQUET-1222) Specify a well-defined sorting order for float and double types

2022-09-30 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17611445#comment-17611445
 ] 

Antoine Pitrou commented on PARQUET-1222:
-

I agree with [~gszadovszky] about elevating these rules to the specification
level.

> Specify a well-defined sorting order for float and double types
> ---
>
> Key: PARQUET-1222
> URL: https://issues.apache.org/jira/browse/PARQUET-1222
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Zoltan Ivanfi
>Priority: Critical
>
> Currently parquet-format specifies the sort order for floating point numbers 
> as follows:
> {code:java}
>*   FLOAT - signed comparison of the represented value
>*   DOUBLE - signed comparison of the represented value
> {code}
> The problem is that the comparison of floating point numbers is only a 
> partial ordering with strange behaviour in specific corner cases. For 
> example, according to IEEE 754, -0 is neither less nor more than \+0 and 
> comparing NaN to anything always returns false. This ordering is not suitable 
> for statistics. Additionally, the Java implementation already uses a 
> different (total) ordering that handles these cases correctly but differently 
> than the C\+\+ implementations, which leads to interoperability problems.
> TypeDefinedOrder for doubles and floats should be deprecated and a new 
> TotalFloatingPointOrder should be introduced. The default for writing doubles 
> and floats would be the new TotalFloatingPointOrder. This ordering should be 
> effective and easy to implement in all programming languages.
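One common way to get such a total order, shown here only as an illustration of feasibility and not as proposed spec wording, is to map each double to an unsigned key whose plain integer order is the desired total order:

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Sketch of a total order for doubles in the spirit of the ticket: flip the
// sign bit for non-negative bit patterns and all bits for negative ones, so
// -NaN < -Inf < ... < -0 < +0 < ... < +Inf < +NaN, and comparison reduces to
// an unsigned integer compare. Not normative; just one workable construction.
uint64_t TotalOrderKey(double v) {
  uint64_t bits;
  std::memcpy(&bits, &v, sizeof(bits));
  // Negative patterns (sign bit set): invert everything to reverse their order.
  // Non-negative patterns: set the sign bit so they sort above all negatives.
  return (bits & (1ULL << 63)) ? ~bits : (bits | (1ULL << 63));
}

bool TotalLess(double a, double b) {
  return TotalOrderKey(a) < TotalOrderKey(b);
}
```

This handles exactly the corner cases listed above: -0 sorts strictly below +0, and NaN compares consistently instead of always returning false.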





[jira] [Commented] (PARQUET-1222) Specify a well-defined sorting order for float and double types

2022-09-30 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17611444#comment-17611444
 ] 

Antoine Pitrou commented on PARQUET-1222:
-

(side note: the ML is mostly a firehose of notifications nowadays, which 
doesn't make it easy to follow...)

> Specify a well-defined sorting order for float and double types
> ---
>
> Key: PARQUET-1222
> URL: https://issues.apache.org/jira/browse/PARQUET-1222
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Zoltan Ivanfi
>Priority: Critical
>
> Currently parquet-format specifies the sort order for floating point numbers 
> as follows:
> {code:java}
>*   FLOAT - signed comparison of the represented value
>*   DOUBLE - signed comparison of the represented value
> {code}
> The problem is that the comparison of floating point numbers is only a 
> partial ordering with strange behaviour in specific corner cases. For 
> example, according to IEEE 754, -0 is neither less nor more than \+0 and 
> comparing NaN to anything always returns false. This ordering is not suitable 
> for statistics. Additionally, the Java implementation already uses a 
> different (total) ordering that handles these cases correctly but differently 
> than the C\+\+ implementations, which leads to interoperability problems.
> TypeDefinedOrder for doubles and floats should be deprecated and a new 
> TotalFloatingPointOrder should be introduced. The default for writing doubles 
> and floats would be the new TotalFloatingPointOrder. This ordering should be 
> effective and easy to implement in all programming languages.





[jira] [Assigned] (PARQUET-2187) Add Parquet file containing a boolean column with RLE encoding to parquet-testing

2022-09-29 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned PARQUET-2187:
---

Assignee: Nishanth

> Add Parquet file containing a boolean column with RLE encoding to parquet-testing
> 
>
> Key: PARQUET-2187
> URL: https://issues.apache.org/jira/browse/PARQUET-2187
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-testing
>Reporter: Nishanth
>Assignee: Nishanth
>Priority: Minor
>  Labels: pull-request-available
>
> Precursor to https://issues.apache.org/jira/browse/ARROW-17450.
> Add a test file in parquet-testing containing a boolean column with RLE
> encoding.
> The test file will be used by Parquet implementations to validate that the
> encoding can be read.
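As background, Parquet's RLE/bit-packed hybrid encodes runs of identical values as (count, value) pairs, falling back to bit-packed groups for short runs. A toy run extractor, not the real encoder:

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Toy sketch of the run-length idea such a test file exercises: collapse a
// boolean column into (run length, value) pairs. The real RLE/bit-packed
// hybrid also emits bit-packed groups for short runs and length-prefixes the
// encoded buffer; this is illustration only.
std::vector<std::pair<uint32_t, bool>> BoolRuns(const std::vector<bool>& v) {
  std::vector<std::pair<uint32_t, bool>> runs;
  for (bool b : v) {
    if (!runs.empty() && runs.back().second == b) {
      ++runs.back().first;  // extend the current run
    } else {
      runs.push_back({1, b});  // start a new run
    }
  }
  return runs;
}
```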





[jira] [Resolved] (PARQUET-2187) Add Parquet file containing a boolean column with RLE encoding to parquet-testing

2022-09-29 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved PARQUET-2187.
-
Resolution: Fixed

> Add Parquet file containing a boolean column with RLE encoding to parquet-testing
> 
>
> Key: PARQUET-2187
> URL: https://issues.apache.org/jira/browse/PARQUET-2187
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-testing
>Reporter: Nishanth
>Priority: Minor
>  Labels: pull-request-available
>
> Precursor to https://issues.apache.org/jira/browse/ARROW-17450.
> Add a test file in parquet-testing containing a boolean column with RLE
> encoding.
> The test file will be used by Parquet implementations to validate that the
> encoding can be read.





[jira] [Created] (PARQUET-2186) [Java] parquet-mr fails compiling

2022-09-12 Thread Antoine Pitrou (Jira)
Antoine Pitrou created PARQUET-2186:
---

 Summary: [Java] parquet-mr fails compiling
 Key: PARQUET-2186
 URL: https://issues.apache.org/jira/browse/PARQUET-2186
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.13.0
Reporter: Antoine Pitrou


This is on git master:
{code}
[INFO] 
[INFO] Reactor Summary for Apache Parquet MR 1.13.0-SNAPSHOT:
[INFO] 
[INFO] Apache Parquet MR .. FAILURE [  0.958 s]
[INFO] Apache Parquet Format Structures ... SKIPPED
[INFO] Apache Parquet Generator ... SKIPPED
[INFO] Apache Parquet Common .. SKIPPED
[INFO] Apache Parquet Encodings ... SKIPPED
[INFO] Apache Parquet Column .. SKIPPED
[INFO] Apache Parquet Arrow ... SKIPPED
[INFO] Apache Parquet Jackson . SKIPPED
[INFO] Apache Parquet Hadoop .. SKIPPED
[INFO] Apache Parquet Avro  SKIPPED
[INFO] Apache Parquet Benchmarks .. SKIPPED
[INFO] Apache Parquet Command-line  SKIPPED
[INFO] Apache Parquet Pig . SKIPPED
[INFO] Apache Parquet Pig Bundle .. SKIPPED
[INFO] Apache Parquet Protobuf  SKIPPED
[INFO] Apache Parquet Scala ... SKIPPED
[INFO] Apache Parquet Thrift .. SKIPPED
[INFO] Apache Parquet Hadoop Bundle ... SKIPPED
[INFO] 
[INFO] BUILD FAILURE
[INFO] 
[INFO] Total time:  1.148 s
[INFO] Finished at: 2022-09-12T16:06:24+02:00
[INFO] 
[ERROR] Failed to execute goal org.apache.rat:apache-rat-plugin:0.13:check 
(default) on project parquet: Too many files with unapproved license: 1 See RAT 
report in: /home/antoine/parquet/mr/target/rat.txt -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please 
read the following articles:
[ERROR] [Help 1] 
http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
{code}

This is the "RAT" (sic) report:
{code}
*

Files with unapproved licenses:

  cli.sh

*

*
  Files with Apache License headers will be marked AL
  Binary files (which do not require any license headers) will be marked B
  Compressed archives will be marked A
  Notices, licenses etc. will be marked N
  N NOTICE
  AL.travis.yml
  ALCHANGES.md
  N LICENSE
  ALdev/ci-before_install.sh
  ALdev/prepare-release.sh
  ALdev/finalize-release
  ALdev/ci-before_install-master.sh
  ALdev/merge_parquet_pr.py
  ALdev/COMMITTERS.md
  ALdev/source-release.sh
  ALdev/README.md
  N src/license.txt
  AL.editorconfig
  ALchangelog.sh
  AL.github/workflows/test.yml
  B doc/dremel_paper/schema.png
  B doc/dremel_paper/dremel_example.png
  ALpom.xml
 !? cli.sh
  ALPoweredBy.md
  ALREADME.md
 
*
{code}

This is because I have a script file "cli.sh" at the base of the git checkout.

The "RAT" report shouldn't fail because of unrelated files that are not in the 
git repository...






[jira] [Updated] (PARQUET-2182) Handle unknown logical types

2022-09-07 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-2182:

Component/s: parquet-mr

> Handle unknown logical types
> 
>
> Key: PARQUET-2182
> URL: https://issues.apache.org/jira/browse/PARQUET-2182
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Reporter: Gabor Szadovszky
>Priority: Major
>
> New logical types introduced in parquet-format shall be properly handled in
> parquet-mr releases that are not aware of the new type. In this case we shall
> read the data as if only the primitive type were defined (without a logical
> type), with one exception: we shall not use min/max-based statistics
> (including column indexes), since we don't know the proper ordering of that
> type.
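The rule described above can be sketched as follows (hypothetical types, not parquet-mr's API):

```cpp
#include <string>

// Sketch of the fallback rule: when a reader meets a logical type it does not
// know, it still reads the column as its bare primitive type, but it must not
// trust min/max statistics, because the unknown type's sort order is unknown
// too. Illustrative names only.
struct ReadPlan {
  std::string read_as;    // primitive type the data is read as
  bool use_minmax_stats;  // stats / column-index pruning allowed?
};

ReadPlan PlanForLogicalType(const std::string& primitive,
                            bool logical_type_known) {
  if (logical_type_known) {
    return {primitive, true};
  }
  // Unknown logical type: readable as the primitive, but ordering is
  // undefined, so disable min/max-based filtering (including column indexes).
  return {primitive, false};
}
```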





[jira] [Updated] (PARQUET-758) [Format] HALF precision FLOAT Logical type

2022-08-29 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-758:
---
Summary: [Format] HALF precision FLOAT Logical type  (was: HALF precision 
FLOAT Logical type)

> [Format] HALF precision FLOAT Logical type
> --
>
> Key: PARQUET-758
> URL: https://issues.apache.org/jira/browse/PARQUET-758
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Julien Le Dem
>Priority: Minor
>






[jira] [Updated] (PARQUET-1158) [C++] Basic RowGroup filtering

2022-08-22 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-1158:

Fix Version/s: (was: cpp-9.0.0)

> [C++] Basic RowGroup filtering
> --
>
> Key: PARQUET-1158
> URL: https://issues.apache.org/jira/browse/PARQUET-1158
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Uwe Korn
>Assignee: Uwe Korn
>Priority: Major
>
> See 
> https://github.com/dask/fastparquet/blob/master/fastparquet/api.py#L296-L300
> We should be able to translate this into C++ enums and apply it in the Arrow
> read methods.





[jira] [Updated] (PARQUET-1430) [C++] Add tests for C++ tools

2022-08-22 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-1430:

Fix Version/s: (was: cpp-9.0.0)

> [C++] Add tests for C++ tools
> -
>
> Key: PARQUET-1430
> URL: https://issues.apache.org/jira/browse/PARQUET-1430
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Deepak Majeti
>Priority: Major
>
> We currently do not have any tests for the tools.





[jira] [Updated] (PARQUET-1199) [C++] Support writing (and test reading) boolean values with RLE encoding

2022-08-22 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-1199:

Fix Version/s: (was: cpp-9.0.0)

> [C++] Support writing (and test reading) boolean values with RLE encoding
> -
>
> Key: PARQUET-1199
> URL: https://issues.apache.org/jira/browse/PARQUET-1199
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Major
>
> This is supported by the Parquet specification; we should ensure that we are
> able to read such data.





[jira] [Updated] (PARQUET-1515) [C++] Disable LZ4 codec

2022-08-22 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-1515:

Fix Version/s: (was: cpp-9.0.0)

> [C++] Disable LZ4 codec
> ---
>
> Key: PARQUET-1515
> URL: https://issues.apache.org/jira/browse/PARQUET-1515
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Deepak Majeti
>Priority: Major
>
> As discussed in https://issues.apache.org/jira/browse/PARQUET-1241, the 
> parquet-cpp's LZ4 codec is not compatible with Hadoop and parquet-mr. We must 
> disable the codec until we resolve the compatibility issues.





[jira] [Updated] (PARQUET-1614) [C++] Reuse arrow::Buffer used as scratch space for decryption in Thrift deserialization hot path

2022-08-22 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-1614:

Fix Version/s: (was: cpp-9.0.0)

> [C++] Reuse arrow::Buffer used as scratch space for decryption in Thrift 
> deserialization hot path
> -
>
> Key: PARQUET-1614
> URL: https://issues.apache.org/jira/browse/PARQUET-1614
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Major
>
> If it is possible to reuse memory on the decrypt-deserialize hot path that 
> will improve performance





[jira] [Updated] (PARQUET-1634) [C++] Factor out data/dictionary page writes to allow for page buffering

2022-08-22 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-1634:

Fix Version/s: (was: cpp-9.0.0)

> [C++] Factor out data/dictionary page writes to allow for page buffering 
> -
>
> Key: PARQUET-1634
> URL: https://issues.apache.org/jira/browse/PARQUET-1634
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Major
>
> Logic that eagerly writes out data pages is hard-coded into the column writer 
> implementation
> https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc#L565
> For higher-latency file systems like Amazon S3, it makes more sense to buffer 
> pages in memory and write them in larger batches (and preferably 
> asynchronously). We should refactor this logic so we have the ability to 
> choose rather than have the behavior hard-coded
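The refactor could be sketched as a writer that either forwards each page eagerly (today's behavior) or buffers pages and flushes them as one batch; all names here are illustrative, not parquet-cpp's actual classes:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Counts how many writes reach the underlying sink (stand-in for a file
// system; on S3-like stores, fewer, larger writes are preferable).
class PageSink {
 public:
  void Write(const std::string& page) {
    bytes_written_ += static_cast<int64_t>(page.size());
    ++writes_;
  }
  int64_t writes() const { return writes_; }
 private:
  int64_t writes_ = 0;
  int64_t bytes_written_ = 0;
};

class PageWriter {
 public:
  PageWriter(PageSink* sink, bool buffered) : sink_(sink), buffered_(buffered) {}
  void WritePage(const std::string& page) {
    if (buffered_) buffer_.push_back(page);  // defer until Flush()
    else sink_->Write(page);                 // eager: current behavior
  }
  void Flush() {
    std::string batch;
    for (const auto& p : buffer_) batch += p;
    if (!batch.empty()) sink_->Write(batch);  // one large write
    buffer_.clear();
  }
 private:
  PageSink* sink_;
  bool buffered_;
  std::vector<std::string> buffer_;
};

// Demonstration helper: push n pages through a writer and report sink writes.
int64_t SinkWritesFor(int n_pages, bool buffered) {
  PageSink sink;
  PageWriter writer(&sink, buffered);
  for (int i = 0; i < n_pages; ++i) writer.WritePage("page");
  writer.Flush();
  return sink.writes();
}
```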





[jira] [Updated] (PARQUET-1646) [C++] Use arrow::Buffer for buffered dictionary indices in DictEncoder instead of std::vector

2022-08-22 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-1646:

Fix Version/s: cpp-10.0.0
   (was: cpp-9.0.0)

> [C++] Use arrow::Buffer for buffered dictionary indices in DictEncoder 
> instead of std::vector
> -
>
> Key: PARQUET-1646
> URL: https://issues.apache.org/jira/browse/PARQUET-1646
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Major
> Fix For: cpp-10.0.0
>
>
> Follow-up to ARROW-6411





[jira] [Updated] (PARQUET-1653) [C++] Deprecated BIT_PACKED level decoding is probably incorrect

2022-08-22 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-1653:

Fix Version/s: (was: cpp-9.0.0)

> [C++] Deprecated BIT_PACKED level decoding is probably incorrect
> 
>
> Key: PARQUET-1653
> URL: https://issues.apache.org/jira/browse/PARQUET-1653
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Major
>
> In working on PARQUET-1652, I noticed that our implementation of BIT_PACKED 
> almost certainly does not line up with apache/parquet-format. I'm going to 
> disable it in our tests until it can be validated





[jira] [Updated] (PARQUET-1657) [C++] Change Bloom filter implementation to use xxhash

2022-08-22 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-1657:

Fix Version/s: (was: cpp-9.0.0)

> [C++] Change Bloom filter implementation to use xxhash
> --
>
> Key: PARQUET-1657
> URL: https://issues.apache.org/jira/browse/PARQUET-1657
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Major
>
> I also strongly recommend doing away with the virtual function calls if
> possible. We have vendored xxhash in Apache Arrow, so we should also remove
> the murmur3 code while we are at it.





[jira] [Updated] (PARQUET-1814) [C++] TestInt96ParquetIO failure on Windows

2022-08-22 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-1814:

Fix Version/s: (was: cpp-9.0.0)

> [C++] TestInt96ParquetIO failure on Windows
> ---
>
> Key: PARQUET-1814
> URL: https://issues.apache.org/jira/browse/PARQUET-1814
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Antoine Pitrou
>Priority: Major
>
> {code}
> [ RUN  ] TestInt96ParquetIO.ReadIntoTimestamp
> C:/t/arrow/cpp/src/arrow/testing/gtest_util.cc(77): error: Failed
> @@ -0, +0 @@
> -1970-01-01 00:00:00.145738543
> +1970-01-02 11:35:00.145738543
> C:/t/arrow/cpp/src/parquet/arrow/arrow_reader_writer_test.cc(1034): error: 
> Expected: this->ReadAndCheckSingleColumnFile(*values) doesn't generate new 
> fatal failures in the current thread.
>   Actual: it does.
> [  FAILED  ] TestInt96ParquetIO.ReadIntoTimestamp (47 ms)
> {code}





[jira] [Updated] (PARQUET-1859) [C++] Require error message when using ParquetException::EofException

2022-08-22 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-1859:

Fix Version/s: (was: cpp-9.0.0)

> [C++] Require error message when using ParquetException::EofException
> -
>
> Key: PARQUET-1859
> URL: https://issues.apache.org/jira/browse/PARQUET-1859
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Major
>
> "Unexpected end of stream" (the default message) gives no clue where the
> failure occurred.





[jira] [Updated] (PARQUET-2099) [C++] Statistics::num_values() is misleading

2022-08-22 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-2099:

Fix Version/s: cpp-10.0.0
   (was: cpp-9.0.0)

> [C++] Statistics::num_values() is misleading 
> -
>
> Key: PARQUET-2099
> URL: https://issues.apache.org/jira/browse/PARQUET-2099
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Micah Kornfield
>Priority: Major
> Fix For: cpp-10.0.0
>
>
> num_values() in statistics seems to capture the number of encoded values.
> This is misleading, as everywhere else in Parquet num_values() really
> indicates all values (null and not-null, i.e. the number of levels).
> We should likely remove this field, rename it, or at the very least update
> the documentation.
> CC [~zeroshade]





[jira] [Updated] (PARQUET-1416) [C++] Deprecate parquet/api/* in favor of simpler public API "parquet/api.h"

2022-08-22 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-1416:

Fix Version/s: (was: cpp-9.0.0)

> [C++] Deprecate parquet/api/* in favor of simpler public API "parquet/api.h"
> 
>
> Key: PARQUET-1416
> URL: https://issues.apache.org/jira/browse/PARQUET-1416
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Minor
>
> The public API for this project is simple enough that I don't think we need 
> anything more complex than a single public API header file





[jira] [Resolved] (PARQUET-2124) Bad DCHECK For Intermixed Dictionary Encoding

2022-02-15 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved PARQUET-2124.
-
Fix Version/s: cpp-8.0.0
   Resolution: Fixed

Issue resolved by pull request 12427
[https://github.com/apache/arrow/pull/12427]

> Bad DCHECK For Intermixed Dictionary Encoding
> -
>
> Key: PARQUET-2124
> URL: https://issues.apache.org/jira/browse/PARQUET-2124
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: William Butler
>Assignee: William Butler
>Priority: Minor
>  Labels: pull-request-available
> Fix For: cpp-8.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Parquet CPP has a DCHECK for a dictionary encoded page coming after a 
> non-dictionary encoded page. This is bad because the DCHECK can be triggered 
> by Parquet files that have a column that has a dictionary page, then a 
> non-dictionary encoded page, then a page of dictionary encoded 
> values(indices). Fuzzing found such a file. While this could be turned into 
> an exception, I don't see anything in the Parquet specification that 
> prohibits such an occurrence of pages.
> This situation has been brought up on the mailing list before
> ([https://lists.apache.org/thread/3bzymmbxvmzj12km7cjz1150ndvy9bos]) and it
> seems like this is valid, but nobody is doing it.
> In the PR that added this check
> ([https://github.com/apache/parquet-cpp/pull/73]) it was noted that the
> check is probably not needed.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (PARQUET-2123) Invalid memory access in ScanFileContents

2022-02-15 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved PARQUET-2123.
-
Fix Version/s: cpp-8.0.0
   Resolution: Fixed

Issue resolved by pull request 12423
[https://github.com/apache/arrow/pull/12423]

> Invalid memory access in ScanFileContents
> -
>
> Key: PARQUET-2123
> URL: https://issues.apache.org/jira/browse/PARQUET-2123
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: William Butler
>Assignee: William Butler
>Priority: Minor
>  Labels: pull-request-available
> Fix For: cpp-8.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> When a Parquet file has 0 columns, ScanFileContents will try to access the 
> 0th element of a size 0 vector.
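The fix pattern is a guard on the column count before indexing, shown here as a minimal sketch rather than the actual tool code:

```cpp
#include <cstdint>
#include <vector>

// Minimal sketch of the bug class (illustrative names): ScanFileContents-style
// code indexed element 0 of a per-column vector unconditionally; with a
// zero-column schema that vector is empty and the access is out of bounds.
// Guarding on the column count avoids it.
int64_t ScanRows(const std::vector<int64_t>& per_column_row_counts) {
  if (per_column_row_counts.empty()) {
    return 0;  // no columns, nothing to scan; previously an invalid access
  }
  return per_column_row_counts[0];  // all columns agree on the row count
}
```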





[jira] [Resolved] (PARQUET-2119) Parquet CPP DeltaBitPackDecoder Check Failure

2022-02-08 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved PARQUET-2119.
-
Fix Version/s: cpp-8.0.0
   Resolution: Fixed

Issue resolved by pull request 12365
[https://github.com/apache/arrow/pull/12365]

> Parquet CPP DeltaBitPackDecoder Check Failure
> -
>
> Key: PARQUET-2119
> URL: https://issues.apache.org/jira/browse/PARQUET-2119
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: William Butler
>Assignee: William Butler
>Priority: Minor
>  Labels: pull-request-available
> Fix For: cpp-8.0.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> DeltaBitPackDecoder uses num_values_ instead of total_value_count_ when 
> computing batch size.
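In other words, the batch size must be clamped by the encoder-declared total value count. Sketched with names modeled on the ticket's description, not the actual decoder:

```cpp
#include <algorithm>
#include <cstdint>

// Sketch of the corrected computation: the number of values decoded in one
// batch is bounded by what the DELTA_BINARY_PACKED header declared
// (total_value_count), not by the page-level num_values, which can differ.
// Names are illustrative, based on the ticket.
int64_t BatchSizeToDecode(int64_t requested, int64_t total_value_count,
                          int64_t already_decoded) {
  return std::min(requested, total_value_count - already_decoded);
}
```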





[jira] [Updated] (PARQUET-1614) [C++] Reuse arrow::Buffer used as scratch space for decryption in Thrift deserialization hot path

2022-02-07 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-1614:

Fix Version/s: cpp-8.0.0
   (was: cpp-7.0.0)

> [C++] Reuse arrow::Buffer used as scratch space for decryption in Thrift 
> deserialization hot path
> -
>
> Key: PARQUET-1614
> URL: https://issues.apache.org/jira/browse/PARQUET-1614
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Major
> Fix For: cpp-8.0.0
>
>
> If it is possible to reuse memory on the decrypt-deserialize hot path, that 
> will improve performance.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (PARQUET-1657) [C++] Change Bloom filter implementation to use xxhash

2022-02-07 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-1657:

Fix Version/s: cpp-8.0.0
   (was: cpp-7.0.0)

> [C++] Change Bloom filter implementation to use xxhash
> --
>
> Key: PARQUET-1657
> URL: https://issues.apache.org/jira/browse/PARQUET-1657
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Major
> Fix For: cpp-8.0.0
>
>
> I also strongly recommend doing away with the virtual function calls if 
> possible. We have vendored xxhash in Apache Arrow, so we should also remove 
> the murmur3 code while we are at it.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (PARQUET-1199) [C++] Support writing (and test reading) boolean values with RLE encoding

2022-02-07 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-1199:

Fix Version/s: cpp-8.0.0
   (was: cpp-7.0.0)

> [C++] Support writing (and test reading) boolean values with RLE encoding
> -
>
> Key: PARQUET-1199
> URL: https://issues.apache.org/jira/browse/PARQUET-1199
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Major
> Fix For: cpp-8.0.0
>
>
> This is supported by the Parquet specification; we should ensure that we are 
> able to read such data.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (PARQUET-1814) [C++] TestInt96ParquetIO failure on Windows

2022-02-07 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-1814:

Fix Version/s: cpp-8.0.0
   (was: cpp-7.0.0)

> [C++] TestInt96ParquetIO failure on Windows
> ---
>
> Key: PARQUET-1814
> URL: https://issues.apache.org/jira/browse/PARQUET-1814
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: cpp-8.0.0
>
>
> {code}
> [ RUN  ] TestInt96ParquetIO.ReadIntoTimestamp
> C:/t/arrow/cpp/src/arrow/testing/gtest_util.cc(77): error: Failed
> @@ -0, +0 @@
> -1970-01-01 00:00:00.145738543
> +1970-01-02 11:35:00.145738543
> C:/t/arrow/cpp/src/parquet/arrow/arrow_reader_writer_test.cc(1034): error: 
> Expected: this->ReadAndCheckSingleColumnFile(*values) doesn't generate new 
> fatal failures in the current thread.
>   Actual: it does.
> [  FAILED  ] TestInt96ParquetIO.ReadIntoTimestamp (47 ms)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (PARQUET-1430) [C++] Add tests for C++ tools

2022-02-07 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-1430:

Fix Version/s: cpp-8.0.0
   (was: cpp-7.0.0)

> [C++] Add tests for C++ tools
> -
>
> Key: PARQUET-1430
> URL: https://issues.apache.org/jira/browse/PARQUET-1430
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Deepak Majeti
>Assignee: Deepak Majeti
>Priority: Major
> Fix For: cpp-8.0.0
>
>
> We currently do not have any tests for the tools.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (PARQUET-1646) [C++] Use arrow::Buffer for buffered dictionary indices in DictEncoder instead of std::vector

2022-02-07 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-1646:

Fix Version/s: cpp-8.0.0
   (was: cpp-7.0.0)

> [C++] Use arrow::Buffer for buffered dictionary indices in DictEncoder 
> instead of std::vector
> -
>
> Key: PARQUET-1646
> URL: https://issues.apache.org/jira/browse/PARQUET-1646
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Major
> Fix For: cpp-8.0.0
>
>
> Follow up to ARROW-6411



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (PARQUET-2099) [C++] Statistics::num_values() is misleading

2022-02-07 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-2099:

Fix Version/s: cpp-8.0.0
   (was: cpp-7.0.0)

> [C++] Statistics::num_values() is misleading 
> -
>
> Key: PARQUET-2099
> URL: https://issues.apache.org/jira/browse/PARQUET-2099
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Micah Kornfield
>Priority: Major
> Fix For: cpp-8.0.0
>
>
> num_values() in statistics seems to capture the number of encoded values. 
> This is misleading, as everywhere else in Parquet num_values() indicates all 
> values (null and not-null, i.e. the number of levels).
> We should likely remove this field, rename it, or at the very least update 
> the documentation.
> CC [~zeroshade]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (PARQUET-1634) [C++] Factor out data/dictionary page writes to allow for page buffering

2022-02-07 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-1634:

Fix Version/s: cpp-8.0.0
   (was: cpp-7.0.0)

> [C++] Factor out data/dictionary page writes to allow for page buffering 
> -
>
> Key: PARQUET-1634
> URL: https://issues.apache.org/jira/browse/PARQUET-1634
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Major
> Fix For: cpp-8.0.0
>
>
> Logic that eagerly writes out data pages is hard-coded into the column writer 
> implementation
> https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc#L565
> For higher-latency file systems like Amazon S3, it makes more sense to buffer 
> pages in memory and write them in larger batches (and preferably 
> asynchronously). We should refactor this logic so we have the ability to 
> choose rather than have the behavior hard-coded
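The refactoring idea above can be sketched as an abstract page sink that the column writer targets instead of writing eagerly. The class names (`PageSink`, `BufferedPageSink`) and interface are hypothetical, chosen only to illustrate the design; a real implementation would issue large, ideally asynchronous writes on flush.

```cpp
#include <string>
#include <utility>
#include <vector>

// Abstract destination for serialized pages, so buffering policy is a
// choice rather than hard-coded in the column writer.
class PageSink {
 public:
  virtual ~PageSink() = default;
  virtual void WritePage(std::string serialized_page) = 0;
  virtual void Flush() = 0;
};

// Buffers pages in memory and emits them in one batch on Flush(), which
// suits high-latency stores such as Amazon S3.
class BufferedPageSink : public PageSink {
 public:
  void WritePage(std::string serialized_page) override {
    pages_.push_back(std::move(serialized_page));
  }
  void Flush() override {
    for (auto& p : pages_) output_ += p;  // stand-in for one large write
    pages_.clear();
  }
  const std::string& output() const { return output_; }

 private:
  std::vector<std::string> pages_;
  std::string output_;
};
```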



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (PARQUET-1653) [C++] Deprecated BIT_PACKED level decoding is probably incorrect

2022-02-07 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-1653:

Fix Version/s: cpp-8.0.0
   (was: cpp-7.0.0)

> [C++] Deprecated BIT_PACKED level decoding is probably incorrect
> 
>
> Key: PARQUET-1653
> URL: https://issues.apache.org/jira/browse/PARQUET-1653
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Major
> Fix For: cpp-8.0.0
>
>
> In working on PARQUET-1652, I noticed that our implementation of BIT_PACKED 
> almost certainly does not line up with apache/parquet-format. I'm going to 
> disable it in our tests until it can be validated.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (PARQUET-1859) [C++] Require error message when using ParquetException::EofException

2022-02-07 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-1859:

Fix Version/s: cpp-8.0.0
   (was: cpp-7.0.0)

> [C++] Require error message when using ParquetException::EofException
> -
>
> Key: PARQUET-1859
> URL: https://issues.apache.org/jira/browse/PARQUET-1859
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Major
> Fix For: cpp-8.0.0
>
>
> "Unexpected end of stream" (the default) gives no clue where the failure 
> occurred.
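The proposed requirement can be sketched as an exception type whose constructor makes a context message mandatory. The class name and wording are illustrative assumptions, not parquet-cpp's actual `ParquetException::EofException` API.

```cpp
#include <stdexcept>
#include <string>

// Sketch: no default constructor, so every EOF error names its read site
// instead of a bare "Unexpected end of stream".
class ParquetEofException : public std::runtime_error {
 public:
  explicit ParquetEofException(const std::string& context)
      : std::runtime_error("Unexpected end of stream: " + context) {}
};
```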



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (PARQUET-2118) [C++] thift_internal.h assumes shared_ptr type in some cases

2022-02-07 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-2118:

Fix Version/s: cpp-8.0.0
   (was: cpp-7.0.0)

> [C++] thift_internal.h assumes shared_ptr type in some cases
> 
>
> Key: PARQUET-2118
> URL: https://issues.apache.org/jira/browse/PARQUET-2118
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-8.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Thrift can still be built with boost shared_ptrs, so we need to be 
> pointer-agnostic.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (PARQUET-2118) [C++] thift_internal.h assumes shared_ptr type in some cases

2022-02-07 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved PARQUET-2118.
-
Fix Version/s: cpp-7.0.0
   Resolution: Fixed

Issue resolved by pull request 12349
[https://github.com/apache/arrow/pull/12349]

> [C++] thift_internal.h assumes shared_ptr type in some cases
> 
>
> Key: PARQUET-2118
> URL: https://issues.apache.org/jira/browse/PARQUET-2118
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-7.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Thrift can still be built with boost shared_ptrs, so we need to be 
> pointer-agnostic.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (PARQUET-490) [C++] Incorporate DELTA_BINARY_PACKED value encoder into library and add unit tests

2022-01-31 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-490:
---
Fix Version/s: (was: cpp-6.0.0)

> [C++] Incorporate DELTA_BINARY_PACKED value encoder into library and add unit 
> tests
> ---
>
> Key: PARQUET-490
> URL: https://issues.apache.org/jira/browse/PARQUET-490
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 7h 20m
>  Remaining Estimate: 0h
>
> There is some code for this currently found in 
> {{examples/decode_benchmark.cc}}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Reopened] (PARQUET-490) [C++] Incorporate DELTA_BINARY_PACKED value encoder into library and add unit tests

2022-01-31 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reopened PARQUET-490:

  Assignee: (was: Shan Huang)

> [C++] Incorporate DELTA_BINARY_PACKED value encoder into library and add unit 
> tests
> ---
>
> Key: PARQUET-490
> URL: https://issues.apache.org/jira/browse/PARQUET-490
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-6.0.0
>
>  Time Spent: 7h 20m
>  Remaining Estimate: 0h
>
> There is some code for this currently found in 
> {{examples/decode_benchmark.cc}}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (PARQUET-490) [C++] Incorporate DELTA_BINARY_PACKED value encoder into library and add unit tests

2022-01-31 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17484609#comment-17484609
 ] 

Antoine Pitrou commented on PARQUET-490:


[~Bkief] Hmm, sorry. I went a bit overboard when closing this issue, as the 
title mentions encoding, not decoding.

> [C++] Incorporate DELTA_BINARY_PACKED value encoder into library and add unit 
> tests
> ---
>
> Key: PARQUET-490
> URL: https://issues.apache.org/jira/browse/PARQUET-490
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Shan Huang
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-6.0.0
>
>  Time Spent: 7h 20m
>  Remaining Estimate: 0h
>
> There is some code for this currently found in 
> {{examples/decode_benchmark.cc}}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (PARQUET-2115) Parquet Cpp Crash on Invalid Dictionary Bit Width

2022-01-31 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-2115:

Fix Version/s: cpp-8.0.0
   (was: cpp-7.0.0)

> Parquet Cpp Crash on Invalid Dictionary Bit Width
> -
>
> Key: PARQUET-2115
> URL: https://issues.apache.org/jira/browse/PARQUET-2115
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: William Butler
>Assignee: William Butler
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-8.0.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

