[jira] [Commented] (PARQUET-2237) Improve performance when filters in RowGroupFilter can match exactly

2023-02-20 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17691413#comment-17691413
 ] 

ASF GitHub Bot commented on PARQUET-2237:
-

yabola commented on PR #1023:
URL: https://github.com/apache/parquet-mr/pull/1023#issuecomment-1437950981

   @gszadovszky @shangxinli If you have time, please also take a look, thanks~




> Improve performance when filters in RowGroupFilter can match exactly
> 
>
> Key: PARQUET-2237
> URL: https://issues.apache.org/jira/browse/PARQUET-2237
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Mars
>Priority: Major
>
> If we can judge accurately by the min/max statistics, we no longer need to 
> load the dictionary from the filesystem and compare values one by one.
> Similarly, the Bloom filter needs to be loaded from the filesystem, which may 
> cost time and memory. If we can determine the existence/nonexistence of the 
> value exactly from the min/max or dictionary filters, then we can avoid using 
> the Bloom filter to improve performance.
> For example,
>  # when reading data greater than {{x1}} in the block, if the min/max 
> statistics are all greater than {{x1}}, then we don't need to read the 
> dictionary and compare one by one;
>  # if we already have page dictionaries and have compared one by one, we 
> don't need to read the Bloom filter and compare.
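A minimal Java sketch of the short-circuit idea described above (hypothetical names, not the actual parquet-mr API): the dictionary and the Bloom filter are loaded lazily, only when the cheap min/max check is inconclusive.

```java
import java.util.function.Supplier;

public class RowGroupPruneSketch {
  enum Decision { CANNOT_MATCH, MUST_MATCH, MIGHT_MATCH }

  // Returns true if the row group can be dropped. The dictionary and Bloom
  // filter suppliers represent filesystem reads and are invoked only when
  // the cheaper evidence before them cannot decide exactly.
  static boolean drop(Decision minMax,
                      Supplier<Decision> dictionary,
                      Supplier<Decision> bloomFilter) {
    if (minMax != Decision.MIGHT_MATCH) {
      return minMax == Decision.CANNOT_MATCH; // decided by min/max alone
    }
    Decision dict = dictionary.get();         // first filesystem read
    if (dict != Decision.MIGHT_MATCH) {
      return dict == Decision.CANNOT_MATCH;   // decided without Bloom filter
    }
    return bloomFilter.get() == Decision.CANNOT_MATCH; // last resort
  }

  public static void main(String[] args) {
    // Predicate "value > x1" where the page minimum is already greater than
    // x1: min/max alone decides, so neither supplier is ever invoked.
    boolean drop = drop(Decision.MUST_MATCH,
        () -> { throw new AssertionError("dictionary must not be loaded"); },
        () -> { throw new AssertionError("bloom filter must not be loaded"); });
    System.out.println("drop row group: " + drop); // false
  }
}
```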



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-mr] yabola commented on pull request #1023: PARQUET-2237 Improve performance when filters in RowGroupFilter can match exactly

2023-02-20 Thread via GitHub


yabola commented on PR #1023:
URL: https://github.com/apache/parquet-mr/pull/1023#issuecomment-1437950981

   @gszadovszky @shangxinli If you have time, please also take a look, thanks~


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-2164) CapacityByteArrayOutputStream overflow while writing causes negative row group sizes to be written

2023-02-20 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17691348#comment-17691348
 ] 

ASF GitHub Bot commented on PARQUET-2164:
-

wgtmac commented on code in PR #1032:
URL: https://github.com/apache/parquet-mr/pull/1032#discussion_r1112489810


##
parquet-common/src/main/java/org/apache/parquet/bytes/CapacityByteArrayOutputStream.java:
##
@@ -164,6 +164,12 @@ public CapacityByteArrayOutputStream(int initialSlabSize, 
int maxCapacityHint, B
   private void addSlab(int minimumSize) {
 int nextSlabSize;
 
+if (bytesUsed + minimumSize < 0) {

Review Comment:
   nit: replace it with `Math.addExact` to hand over the overflow check. Then we 
get the chance to check the 2GB hard limit before it becomes negative.
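A minimal sketch of that suggestion (hypothetical field and message, not the actual patch): `Math.addExact` throws `ArithmeticException` on int overflow, so the size can be rejected before it ever turns negative.

```java
public class SlabSizeCheckSketch {
  private int bytesUsed = Integer.MAX_VALUE - 10; // close to the 2GB limit

  void addSlab(int minimumSize) {
    final int newBytesUsed;
    try {
      // Fails fast on overflow instead of silently wrapping negative.
      newBytesUsed = Math.addExact(bytesUsed, minimumSize);
    } catch (ArithmeticException e) {
      throw new IllegalStateException(
          "Buffered data would exceed the 2GB limit of a byte array", e);
    }
    bytesUsed = newBytesUsed;
  }

  public static void main(String[] args) {
    SlabSizeCheckSketch s = new SlabSizeCheckSketch();
    s.addSlab(5);   // still within int range
    s.addSlab(100); // overflows: throws instead of going negative
  }
}
```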





> CapacityByteArrayOutputStream overflow while writing causes negative row 
> group sizes to be written
> --
>
> Key: PARQUET-2164
> URL: https://issues.apache.org/jira/browse/PARQUET-2164
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.2
>Reporter: Parth Chandra
>Priority: Major
> Fix For: 1.12.3
>
> Attachments: TestLargeDictionaryWriteParquet.java
>
>
> It is possible, while writing a parquet file, to cause 
> {{CapacityByteArrayOutputStream}} to overflow.
> This is an extreme case but it has been observed in a real world data set.
> The attached Spark program manages to reproduce the issue.
> Short summary of how this happens - 
> 1. After many small records possibly including nulls, the dictionary page 
> fills up and subsequent pages are written using plain encoding
> 2. The estimate of when to perform the page size check is based on the number 
> of values observed per page so far. Let's say this is about 100K
> 3. A sequence of very large records shows up. Let's say each of these records 
> is 200K. 
> 4. After 11K of these records the size of the page has grown beyond 2GB.
> 5. {{CapacityByteArrayOutputStream}} is capable of holding more than 2GB of 
> data, but it holds the size of the data in an int, which overflows.
> There are a couple of things to fix here -
> 1. The check for page size should consider both the number of values added 
> and the buffered size of the data
> 2. {{CapacityByteArrayOutputStream}} should throw an exception if the data 
> size increases beyond 2GB ({{java.io.ByteArrayOutputStream}} does exactly 
> that).
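A quick check of the arithmetic in the scenario above, using the sizes the description assumes: 11K records of roughly 200K bytes each add up to about 2.2 GB, which exceeds `Integer.MAX_VALUE` (about 2.147e9), so an int byte counter wraps negative before the next page-size check fires.

```java
public class PageOverflowArithmetic {
  public static void main(String[] args) {
    int bytesUsed = 0;
    for (int i = 0; i < 11_000; i++) {
      bytesUsed += 200_000; // ~200K bytes per record, tracked in an int
    }
    System.out.println(bytesUsed);          // negative: the counter wrapped
    System.out.println(11_000L * 200_000L); // 2200000000 > Integer.MAX_VALUE
  }
}
```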



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1032: PARQUET-2164: Check size of buffered data to prevent page data from overflowing

2023-02-20 Thread via GitHub


wgtmac commented on code in PR #1032:
URL: https://github.com/apache/parquet-mr/pull/1032#discussion_r1112489810


##
parquet-common/src/main/java/org/apache/parquet/bytes/CapacityByteArrayOutputStream.java:
##
@@ -164,6 +164,12 @@ public CapacityByteArrayOutputStream(int initialSlabSize, 
int maxCapacityHint, B
   private void addSlab(int minimumSize) {
 int nextSlabSize;
 
+if (bytesUsed + minimumSize < 0) {

Review Comment:
   nit: replace it with `Math.addExact` to hand over the overflow check. Then we 
get the chance to check the 2GB hard limit before it becomes negative.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-2164) CapacityByteArrayOutputStream overflow while writing causes negative row group sizes to be written

2023-02-20 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17691347#comment-17691347
 ] 

ASF GitHub Bot commented on PARQUET-2164:
-

cxzl25 commented on PR #1032:
URL: https://github.com/apache/parquet-mr/pull/1032#issuecomment-1437794593

   Is it also similar to #1031?




> CapacityByteArrayOutputStream overflow while writing causes negative row 
> group sizes to be written
> --
>
> Key: PARQUET-2164
> URL: https://issues.apache.org/jira/browse/PARQUET-2164
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.2
>Reporter: Parth Chandra
>Priority: Major
> Fix For: 1.12.3
>
> Attachments: TestLargeDictionaryWriteParquet.java
>
>
> It is possible, while writing a parquet file, to cause 
> {{CapacityByteArrayOutputStream}} to overflow.
> This is an extreme case but it has been observed in a real world data set.
> The attached Spark program manages to reproduce the issue.
> Short summary of how this happens - 
> 1. After many small records possibly including nulls, the dictionary page 
> fills up and subsequent pages are written using plain encoding
> 2. The estimate of when to perform the page size check is based on the number 
> of values observed per page so far. Let's say this is about 100K
> 3. A sequence of very large records shows up. Let's say each of these records 
> is 200K. 
> 4. After 11K of these records the size of the page has grown beyond 2GB.
> 5. {{CapacityByteArrayOutputStream}} is capable of holding more than 2GB of 
> data, but it holds the size of the data in an int, which overflows.
> There are a couple of things to fix here -
> 1. The check for page size should consider both the number of values added 
> and the buffered size of the data
> 2. {{CapacityByteArrayOutputStream}} should throw an exception if the data 
> size increases beyond 2GB ({{java.io.ByteArrayOutputStream}} does exactly 
> that).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-mr] cxzl25 commented on pull request #1032: PARQUET-2164: Check size of buffered data to prevent page data from overflowing

2023-02-20 Thread via GitHub


cxzl25 commented on PR #1032:
URL: https://github.com/apache/parquet-mr/pull/1032#issuecomment-1437794593

   Is it also similar to #1031?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-2164) CapacityByteArrayOutputStream overflow while writing causes negative row group sizes to be written

2023-02-20 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17691281#comment-17691281
 ] 

ASF GitHub Bot commented on PARQUET-2164:
-

parthchandra opened a new pull request, #1032:
URL: https://github.com/apache/parquet-mr/pull/1032

   This PR addresses the following 
[PARQUET-2164](https://issues.apache.org/jira/browse/PARQUET-2164) 
   The configuration parameters 
   ```
   parquet.page.size.check.estimate=false
   parquet.page.size.row.check.min=
   parquet.page.size.row.check.max=
   ```
   address the reported problem. However, the issue can still be hit because the 
default value of `parquet.page.size.check.estimate` is `true`.
   This PR simply adds a check to make sure that 
`CapacityByteArrayOutputStream` cannot overflow but will instead throw an 
exception.
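For reference, a hedged sketch of setting these properties on a Hadoop `Configuration` before writing; the property names come from the PR description, while the numeric values below are placeholders rather than recommendations:

```java
import org.apache.hadoop.conf.Configuration;

public class PageSizeCheckConfig {
  public static Configuration workaroundConf() {
    Configuration conf = new Configuration();
    // Disable the estimate-based check so the row-count bounds below apply.
    conf.setBoolean("parquet.page.size.check.estimate", false);
    conf.setInt("parquet.page.size.row.check.min", 100);    // placeholder
    conf.setInt("parquet.page.size.row.check.max", 10_000); // placeholder
    return conf;
  }
}
```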




> CapacityByteArrayOutputStream overflow while writing causes negative row 
> group sizes to be written
> --
>
> Key: PARQUET-2164
> URL: https://issues.apache.org/jira/browse/PARQUET-2164
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.2
>Reporter: Parth Chandra
>Priority: Major
> Fix For: 1.12.3
>
> Attachments: TestLargeDictionaryWriteParquet.java
>
>
> It is possible, while writing a parquet file, to cause 
> {{CapacityByteArrayOutputStream}} to overflow.
> This is an extreme case but it has been observed in a real world data set.
> The attached Spark program manages to reproduce the issue.
> Short summary of how this happens - 
> 1. After many small records possibly including nulls, the dictionary page 
> fills up and subsequent pages are written using plain encoding
> 2. The estimate of when to perform the page size check is based on the number 
> of values observed per page so far. Let's say this is about 100K
> 3. A sequence of very large records shows up. Let's say each of these records 
> is 200K. 
> 4. After 11K of these records the size of the page has grown beyond 2GB.
> 5. {{CapacityByteArrayOutputStream}} is capable of holding more than 2GB of 
> data, but it holds the size of the data in an int, which overflows.
> There are a couple of things to fix here -
> 1. The check for page size should consider both the number of values added 
> and the buffered size of the data
> 2. {{CapacityByteArrayOutputStream}} should throw an exception if the data 
> size increases beyond 2GB ({{java.io.ByteArrayOutputStream}} does exactly 
> that).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-mr] parthchandra opened a new pull request, #1032: PARQUET-2164: Check size of buffered data to prevent page data from overflowing

2023-02-20 Thread via GitHub


parthchandra opened a new pull request, #1032:
URL: https://github.com/apache/parquet-mr/pull/1032

   This PR addresses the following 
[PARQUET-2164](https://issues.apache.org/jira/browse/PARQUET-2164) 
   The configuration parameters 
   ```
   parquet.page.size.check.estimate=false
   parquet.page.size.row.check.min=
   parquet.page.size.row.check.max=
   ```
   address the reported problem. However, the issue can still be hit because the 
default value of `parquet.page.size.check.estimate` is `true`.
   This PR simply adds a check to make sure that 
`CapacityByteArrayOutputStream` cannot overflow but will instead throw an 
exception.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file

2023-02-20 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17691259#comment-17691259
 ] 

ASF GitHub Bot commented on PARQUET-2228:
-

shangxinli commented on PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#issuecomment-1437355639

   Let me know if you have more feedback, @ggershinsky @gszadovszky 




> ParquetRewriter supports more than one input file
> -
>
> Key: PARQUET-2228
> URL: https://issues.apache.org/jira/browse/PARQUET-2228
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-mr
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> ParquetRewriter currently supports only one input file. The scope of this 
> task is to support multiple input files; the rewriter merges them into a 
> single file, with or without rewrite options specified.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-mr] shangxinli commented on pull request #1026: PARQUET-2228: ParquetRewriter supports more than one input file

2023-02-20 Thread via GitHub


shangxinli commented on PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#issuecomment-1437355639

   Let me know if you have more feedback, @ggershinsky @gszadovszky 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-02-20 Thread Xuwei Fu (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17691257#comment-17691257
 ] 

Xuwei Fu commented on PARQUET-2249:
---

Maybe we can take a look at:
 # [https://github.com/apache/arrow-rs/issues/264]
 # https://issues.apache.org/jira/browse/PARQUET-1222

> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member state the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is an only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause stating that "if a page contains only NaNs or a 
> mixture of NaNs and NULLs, then NaN should be written as min & max".
>  
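A small, self-contained illustration of the conflict (not actual parquet-mr writer code): following the statistics rule and skipping NaNs leaves nothing valid to write for a NaN-only page, even though `null_pages` must be false for it.

```java
public class NaNOnlyPage {
  public static void main(String[] args) {
    double[] page = {Double.NaN, Double.NaN, Double.NaN};
    double min = Double.POSITIVE_INFINITY;
    double max = Double.NEGATIVE_INFINITY;
    for (double v : page) {
      if (!Double.isNaN(v)) { // spec: NaNs are excluded from min/max
        min = Math.min(min, v);
        max = Math.max(max, v);
      }
    }
    // No value qualified, so the sentinels remain: the writer has nothing
    // valid to put into min_values/max_values, yet the page is not a null
    // page either.
    System.out.println("min=" + min + " max=" + max);
  }
}
```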



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-02-20 Thread Xuwei Fu (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17691256#comment-17691256
 ] 

Xuwei Fu commented on PARQUET-2249:
---

I guess NaN is not always larger than all values:
 # Postgres, DB2, and Oracle put NULL higher than any other value;
 # MSSQL and MySQL go the other way.

SQL may have `null_first` for this case. Still, I think your idea is OK.

> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member state the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is an only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause stating that "if a page contains only NaNs or a 
> mixture of NaNs and NULLs, then NaN should be written as min & max".
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-02-20 Thread Gang Wu (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17691199#comment-17691199
 ] 

Gang Wu commented on PARQUET-2249:
--

As of today, there are many different Parquet implementations (Java, C++, Rust, 
etc.). Directly making a PR to the parquet-format repo is a good start to reach 
the maintainers of the different implementations. However, the tricky thing is 
that we have to deal with legacy readers and files.

> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member state the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is an only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause stating that "if a page contains only NaNs or a 
> mixture of NaNs and NULLs, then NaN should be written as min & max".
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-02-20 Thread Jan Finis (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17691184#comment-17691184
 ] 

Jan Finis edited comment on PARQUET-2249 at 2/20/23 1:50 PM:
-

I would be willing to propose a fixing commit for this, but I'm not part of the 
ASF and the whole process yet, so I don't know exactly how to get that going. I 
could start a PR on the parquet-format GitHub repo. Is that the right place to 
suggest changes to the spec/parquet.thrift?

Side note: NaN being larger than all other values is also:
 * The semantic that SQL has for NaN
 * What parquet-mr seems to be doing right now. At least, I have found parquet 
files that have NaN written as max_value in row group statistics.

However, treating NaN as something extra by maintaining NaN counts will allow 
incorporating any NaN semantics the query engine wishes to use.


was (Author: jfinis):
I would be willing to propose a fixing commit for this, but I'm not part of the 
ASF and the whole process yet, so I don't know exactly how to get that going. I 
could start a PR on the parquet-format GitHub repo. Is that the right place to 
suggest changes to the spec/parquet.thrift?

Note that NaN being larger than all other values is also:
 * The semantic that SQL has for NaN
 * What parquet-mr seems to be doing right now. At least, I have found parquet 
files that have NaN written as max_value in row group statistics.

> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member state the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is an only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> 

[jira] [Comment Edited] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-02-20 Thread Jan Finis (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17691184#comment-17691184
 ] 

Jan Finis edited comment on PARQUET-2249 at 2/20/23 1:48 PM:
-

I would be willing to propose a fixing commit for this, but I'm not part of the 
ASF and the whole process yet, so I don't know exactly how to get that going. I 
could start a PR on the parquet-format GitHub repo. Is that the right place to 
suggest changes to the spec/parquet.thrift?

Note that NaN being larger than all other values is also:
 * The semantic that SQL has for NaN
 * What parquet-mr seems to be doing right now. At least, I have found parquet 
files that have NaN written as max_value in row group statistics.


was (Author: jfinis):
I would be willing to suggest a fix for this, but I'm not part of the ASF and 
the whole process yet, so I don't know exactly how to get that going. I could 
start a PR on the parquet-format GitHub repo. Is that the right place to 
suggest changes to the spec/parquet.thrift?

Note that NaN being larger than all other values is also:
* The semantic that SQL has for NaN
* What parquet-mr seems to be doing right now. At least, I have found parquet 
files that have NaN written as max_value in row group statistics.

> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member state the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is an only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause stating that "if a page contains only NaNs or a 
> mixture of 

[jira] [Comment Edited] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-02-20 Thread Jan Finis (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17691184#comment-17691184
 ] 

Jan Finis edited comment on PARQUET-2249 at 2/20/23 1:48 PM:
-

I would be willing to suggest a fix for this, but I'm not part of the ASF and 
the whole process yet, so I don't know exactly how to get that going. I could 
start a PR on the parquet-format GitHub repo. Is that the right place to 
suggest changes to the spec/parquet.thrift?

Note that NaN being larger than all other values is also:
* The semantic that SQL has for NaN
* What parquet-mr seems to be doing right now. At least, I have found parquet 
files that have NaN written as max_value in row group statistics.


was (Author: jfinis):
I would be willing to suggest a fix for this, but I'm not part of the ASF and 
the whole process yet, so I don't know exactly how to get that going. I could 
start a PR on the parquet-mr GitHub repo. Is that the right place to suggest 
changes to the spec/parquet.thrift?

> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member state the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is an only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause stating that "if a page contains only NaNs or a 
> mixture of NaNs and NULLs, then NaN should be written as min & max".
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-02-20 Thread Jan Finis (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17691184#comment-17691184
 ] 

Jan Finis commented on PARQUET-2249:


I would be willing to suggest a fix for this, but I'm not part of the ASF and 
the whole process yet, so I don't know exactly how to get that going. I could 
start a PR on the parquet-mr GitHub repo. Is that the right place to suggest 
changes to the spec/parquet.thrift?

> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member state the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is an only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause stating that "if a page contains only NaNs or a 
> mixture of NaNs and NULLs, then NaN should be written as min & max".
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-02-20 Thread Xuwei Fu (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17691179#comment-17691179
 ] 

Xuwei Fu edited comment on PARQUET-2249 at 2/20/23 1:30 PM:


The problem seems to be that floating point is so widely used, yet floats are 
only partially ordered. It seems that Iceberg provides NaN counts, and min/max 
is unrelated to NaN.

For sorting, Iceberg requires that:

> Sorting floating-point numbers should produce the following behavior: 
> {{-NaN}} < {{-Infinity}} < {{-value}} < {{-0}} < {{0}} < {{value}} < 
> {{Infinity}} < {{NaN}}. This aligns with the implementation of Java 
> floating-point types comparisons.

I think (1) is bad, because NaN is never equal to NULL.

IEEE 754 ([https://ieeexplore.ieee.org/document/8766229]) and the C++ standard 
support a "totalOrder" predicate, but I think regarding it as the total order is 
strange, so min/max as `byte[0]` is weird to me. I think the Iceberg style looks great.

If I'm wrong, please correct me
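A small check of part of this in plain Java, which is presumably what the Iceberg wording refers to: `Double.compare` implements a total order in which NaN sorts above positive infinity, while the IEEE 754 operators remain a partial order.

```java
public class FloatOrderCheck {
  public static void main(String[] args) {
    // Total order used for sorting: NaN above +Infinity, -0.0 below 0.0.
    System.out.println(Double.compare(Double.NaN, Double.POSITIVE_INFINITY) > 0); // true
    System.out.println(Double.compare(-0.0, 0.0) < 0);                            // true
    // IEEE 754 comparison operators: every comparison against NaN is false.
    System.out.println(Double.NaN > Double.POSITIVE_INFINITY); // false
    System.out.println(Double.NaN == Double.NaN);              // false
  }
}
```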


was (Author: JIRAUSER294084):
It seems that Iceberg provides NaN counts, and min/max is unrelated to NaN.

For sorting, Iceberg requires that:

> Sorting floating-point numbers should produce the following behavior: 
> {{-NaN}} < {{-Infinity}} < {{-value}} < {{-0}} < {{0}} < {{value}} < 
> {{Infinity}} < {{NaN}}. This aligns with the implementation of Java 
> floating-point types comparisons.

I think (1) is bad, because NaN is never equal to NULL.

IEEE 754 (https://ieeexplore.ieee.org/document/8766229) and the C++ standard 
support a "totalOrder" predicate, but I think regarding it as the total order is 
strange, so min/max as `byte[0]` is weird to me. I think the Iceberg style looks great.

> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member state the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is an only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the 

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-02-20 Thread Xuwei Fu (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17691179#comment-17691179
 ] 

Xuwei Fu commented on PARQUET-2249:
---

It seems that Iceberg provides NaN counts, and min/max is unrelated to NaN.

For sorting, Iceberg requires that:

> Sorting floating-point numbers should produce the following behavior: 
> {{-NaN}} < {{-Infinity}} < {{-value}} < {{-0}} < {{0}} < {{value}} < 
> {{Infinity}} < {{NaN}}. This aligns with the implementation of Java 
> floating-point types comparisons.

I think (1) is bad, because NaN is never equal to NULL.

IEEE 754 (https://ieeexplore.ieee.org/document/8766229) and the C++ standard 
support a "totalOrder" predicate, but I think regarding it as the total order is 
strange, so min/max as `byte[0]` is weird to me. I think the Iceberg style looks great.

> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member state the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is an only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause stating that "if a page contains only NaNs or a 
> mixture of NaNs and NULLs, then NaN should be written as min & max".
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2237) Improve performance when filters in RowGroupFilter can match exactly

2023-02-20 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17691126#comment-17691126
 ] 

ASF GitHub Bot commented on PARQUET-2237:
-

yabola commented on code in PR #1023:
URL: https://github.com/apache/parquet-mr/pull/1023#discussion_r1109441802


##
parquet-hadoop/src/main/java/org/apache/parquet/filter2/compat/PredicateEvaluation.java:
##
@@ -0,0 +1,105 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.parquet.filter2.compat;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+
+import org.apache.parquet.filter2.predicate.FilterPredicate;
+import org.apache.parquet.filter2.predicate.Operators;
+
+/**
+ * Used in Filters to mark whether we should DROP the block if data matches the condition.
+ * If we cannot decide whether the block matches, it is always safe to return BLOCK_MIGHT_MATCH.
+ * We use distinct Boolean object instances here so they can be compared by identity; please do not modify them.
+ */
+public class PredicateEvaluation {
+  /* The block might match, but we cannot decide yet; later filters will check. */
+  public static final Boolean BLOCK_MIGHT_MATCH = new Boolean(false);
+  /* The block must match for sure. */
+  public static final Boolean BLOCK_MUST_MATCH = new Boolean(false);
+  /* The block cannot match for sure. */
+  public static final Boolean BLOCK_CANNOT_MATCH = new Boolean(true);
+
+  public static Boolean evaluateAnd(Operators.And and, FilterPredicate.Visitor<Boolean> predicate) {
+    Boolean left = and.getLeft().accept(predicate);
+    if (left == BLOCK_CANNOT_MATCH) {
+      // seems unintuitive to put an || not an && here but we can
+      // drop a chunk of records if we know that either the left or
+      // the right predicate agrees that no matter what we don't
+      // need this chunk.
+      return BLOCK_CANNOT_MATCH;
+    }
+    Boolean right = and.getRight().accept(predicate);
+    if (right == BLOCK_CANNOT_MATCH) {
+      return BLOCK_CANNOT_MATCH;
+    } else if (left == BLOCK_MUST_MATCH && right == BLOCK_MUST_MATCH) {

Review Comment:
   If left is `BLOCK_MUST_MATCH` and right is `BLOCK_MIGHT_MATCH`, then left && right 
should be `BLOCK_MIGHT_MATCH`, because a later filter may still evaluate right to 
`BLOCK_CANNOT_MATCH`, in which case we should drop the block.
   
   I also added a new 
[UT](https://github.com/apache/parquet-mr/pull/1023/files#diff-8915e6fa23018e02c2e79a3f6cc5078a8882f8031022dbdde217fe9bf1d908afR143):
   in `StatisticsFilter`, left might match (but cannot match in `DictionaryFilter`) 
and right must match -> `StatisticsFilter` returns might match, and 
`DictionaryFilter` then returns cannot match.





> Improve performance when filters in RowGroupFilter can match exactly
> 
>
> Key: PARQUET-2237
> URL: https://issues.apache.org/jira/browse/PARQUET-2237
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Mars
>Priority: Major
>
> If we can judge accurately from the min/max statistics, we don't need to load the 
> dictionary from the filesystem and compare values one by one anymore.
> Similarly, the Bloom filter needs to be loaded from the filesystem, which may cost 
> time and memory. If we can exactly determine the existence/nonexistence of the 
> value from the min/max or dictionary filters, then we can avoid using the Bloom 
> filter, to improve performance.
> For example,
>  # read data greater than {{x1}} in the block: if the min/max in the statistics 
> are all greater than {{x1}}, then we don't need to read the dictionary and 
> compare one by one.
>  # If we already have page dictionaries and have compared one by one, we 
> don't need to read the BloomFilter and compare.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
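
To illustrate the ordering the issue description above proposes, here is a rough sketch of a filter cascade that stops as soon as a cheaper level is decisive, so the dictionary and Bloom filter are only fetched when needed. The interfaces are hypothetical, not the parquet-mr API:

{code:java}
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical sketch: each level returns Optional.empty() when it cannot
// decide, or Optional.of(keep) when it can decide for sure.
public class FilterCascadeExample {

  static boolean keepRowGroup(List<Supplier<Optional<Boolean>>> levels) {
    for (Supplier<Optional<Boolean>> level : levels) {
      Optional<Boolean> verdict = level.get(); // runs the check lazily
      if (verdict.isPresent()) {
        return verdict.get(); // decisive: skip the later, costlier levels
      }
    }
    return true; // no level could decide: keeping the row group is always safe
  }

  public static void main(String[] args) {
    // Ordered cheapest-first: in-memory min/max stats, then the dictionary,
    // then the Bloom filter (the latter two would require filesystem reads).
    boolean keep = keepRowGroup(Arrays.asList(
        () -> Optional.of(false),                     // stats: min > x1, no row can match
        () -> { throw new AssertionError("never"); }, // dictionary: not even loaded
        () -> { throw new AssertionError("never"); }  // Bloom filter: not even loaded
    ));
    System.out.println(keep); // false: row group dropped using stats alone
  }
}
{code}
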


[GitHub] [parquet-mr] yabola commented on a diff in pull request #1023: PARQUET-2237 Improve performance when filters in RowGroupFilter can match exactly

2023-02-20 Thread via GitHub


yabola commented on code in PR #1023:
URL: https://github.com/apache/parquet-mr/pull/1023#discussion_r1109441802


##
parquet-hadoop/src/main/java/org/apache/parquet/filter2/compat/PredicateEvaluation.java:
##
@@ -0,0 +1,105 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.parquet.filter2.compat;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+
+import org.apache.parquet.filter2.predicate.FilterPredicate;
+import org.apache.parquet.filter2.predicate.Operators;
+
+/**
+ * Used in Filters to mark whether we should DROP the block if data matches the condition.
+ * If we cannot decide whether the block matches, it is always safe to return BLOCK_MIGHT_MATCH.
+ * We use distinct Boolean object instances here so they can be compared by identity; please do not modify them.
+ */
+public class PredicateEvaluation {
+  /* The block might match, but we cannot decide yet; later filters will check. */
+  public static final Boolean BLOCK_MIGHT_MATCH = new Boolean(false);
+  /* The block must match for sure. */
+  public static final Boolean BLOCK_MUST_MATCH = new Boolean(false);
+  /* The block cannot match for sure. */
+  public static final Boolean BLOCK_CANNOT_MATCH = new Boolean(true);
+
+  public static Boolean evaluateAnd(Operators.And and, FilterPredicate.Visitor<Boolean> predicate) {
+    Boolean left = and.getLeft().accept(predicate);
+    if (left == BLOCK_CANNOT_MATCH) {
+      // seems unintuitive to put an || not an && here but we can
+      // drop a chunk of records if we know that either the left or
+      // the right predicate agrees that no matter what we don't
+      // need this chunk.
+      return BLOCK_CANNOT_MATCH;
+    }
+    Boolean right = and.getRight().accept(predicate);
+    if (right == BLOCK_CANNOT_MATCH) {
+      return BLOCK_CANNOT_MATCH;
+    } else if (left == BLOCK_MUST_MATCH && right == BLOCK_MUST_MATCH) {

Review Comment:
   If left is `BLOCK_MUST_MATCH` and right is `BLOCK_MIGHT_MATCH`, then left && right 
should be `BLOCK_MIGHT_MATCH`, because a later filter may still evaluate right to 
`BLOCK_CANNOT_MATCH`, in which case we should drop the block.
   
   I also added a new 
[UT](https://github.com/apache/parquet-mr/pull/1023/files#diff-8915e6fa23018e02c2e79a3f6cc5078a8882f8031022dbdde217fe9bf1d908afR143):
   in `StatisticsFilter`, left might match (but cannot match in `DictionaryFilter`) 
and right must match -> `StatisticsFilter` returns might match, and 
`DictionaryFilter` then returns cannot match.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Comment Edited] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-02-20 Thread Jan Finis (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17691108#comment-17691108
 ] 

Jan Finis edited comment on PARQUET-2249 at 2/20/23 9:55 AM:
-

[~wgtmac] True, not writing a column index in this case is also a solution. 
Note though that this is a pessimization for pages not containing NaN in the 
same column chunk. It would be a shame if a single NaN makes a whole column 
chunk non-indexable. It might be a good interim solution, but it's not too 
satisfying.

The whole topic of NaN handling in Parquet currently seems to be lacking and 
somewhat inconsistent, making columns with NaNs mostly unusable for scan 
pruning. Maybe there should be a redefinition of the semantics in a new 
version, so that columns with NaNs can be used for indexing like other columns. 
As mentioned, Iceberg has solved this problem by providing NaN counts.


was (Author: jfinis):
[~wgtmac] True, not writing a column index in this case is also a solution. 
Note though that this is a pessimization for pages not containing NaN in the 
same column chunk. It would be a shame if a single NaN makes a whole column 
chunk non-indexable. It might be a good interim solution, but it's not too 
satisfying.

> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that conforms to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comment in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is an only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause stating that "if a page contains only NaNs or a 
> mixture of NaNs and NULLs, then NaN should be written as min & max".
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
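
As a side note on the NaN-count idea raised in the comment above, here is a small hedged sketch of how Iceberg-style per-page NaN counts would make min/max pruning exact again for double columns. The fields are hypothetical extensions, not current Parquet metadata:

{code:java}
// Hypothetical per-page statistics extended with a NaN count,
// similar to what Iceberg tracks.
public class NaNCountStatsExample {

  static final class DoublePageStats {
    final double min;      // min over non-NaN values; meaningless if all values are NaN/null
    final double max;      // max over non-NaN values; meaningless if all values are NaN/null
    final long nanCount;   // number of NaN values in the page
    final long nullCount;  // number of nulls in the page
    final long valueCount; // total number of values, including NaNs and nulls

    DoublePageStats(double min, double max, long nanCount, long nullCount, long valueCount) {
      this.min = min;
      this.max = max;
      this.nanCount = nanCount;
      this.nullCount = nullCount;
      this.valueCount = valueCount;
    }

    /** True if the page holds nothing but NaNs and nulls, so min/max carry no information. */
    boolean onlyNaNsOrNulls() {
      return nanCount + nullCount == valueCount;
    }

    /** Can the page be skipped for the predicate "value > threshold"? */
    boolean canSkipGreaterThan(double threshold) {
      // NaN never satisfies ">", so only the non-NaN range matters.
      return onlyNaNsOrNulls() || max <= threshold;
    }
  }

  public static void main(String[] args) {
    DoublePageStats onlyNaN = new DoublePageStats(0, 0, 90, 10, 100);
    DoublePageStats mixed = new DoublePageStats(1.0, 5.0, 2, 0, 100);
    System.out.println(onlyNaN.canSkipGreaterThan(3.0)); // true: no comparable values at all
    System.out.println(mixed.canSkipGreaterThan(7.0));   // true: max 5.0 <= 7.0
  }
}
{code}
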

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-02-20 Thread Jan Finis (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17691108#comment-17691108
 ] 

Jan Finis commented on PARQUET-2249:


[~wgtmac] True, not writing a column index in this case is also a solution. 
Note though that this is a pessimization for pages not containing NaN in the 
same column chunk. It would be a shame if a single NaN makes a whole column 
chunk non-indexable. It might be a good interim solution, but it's not too 
satisfying.

> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that conforms to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comment in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is an only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause stating that "if a page contains only NaNs or a 
> mixture of NaNs and NULLs, then NaN should be written as min & max".
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)