[jira] [Commented] (PARQUET-2261) [Format] Add statistics that reflect decoded size to metadata

2023-03-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705797#comment-17705797
 ] 

ASF GitHub Bot commented on PARQUET-2261:
-

wgtmac commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1150020701


##
src/main/thrift/parquet.thrift:
##
@@ -190,6 +190,41 @@ enum FieldRepetitionType {
   /** The field is repeated and can contain 0 or more values */
   REPEATED = 2;
 }
+/**
+ * A structure for capturing metadata for estimating the unencoded, uncompressed size
+ * of data.
+ *
+ * Writers should populate all fields in this struct except for the exceptions listed per field.
+ */
+struct SizeEstimationStatistics {
+   /**
+* The number of logical physical bytes stored for BYTE_ARRAY data values. Logical bytes refers to the number
+* of bytes needed if no special encoding is used. This is exclusive of the bytes needed
+* to store the length of each byte array. In other words, this field is equivalent to the (size of
+* PLAIN-ENCODING the byte array values) - (4 bytes * number of values written). To determine logical sizes
+* of other types readers can use schema information multiplied by the number of non-null values.
+* The number of non-null values can be inferred from the histograms below.
+*
+* For example, if a column chunk is dictionary encoded with a dictionary ["a", "bc", "cde"] and a data page
+* has indexes [0, 0, 1, 2], this value is expected to be 7 (1 + 1 + 2 + 3).
+*
+* This option should only be set for physical and logical types that would use BYTE_ARRAY when encoded with PLAIN encoding.
+*/
+   1: optional i64 logical_variable_width_stored_bytes;
+   /**
+ * When present there is expected to be one element corresponding to each repetition (i.e. size=max repetition_level+1)
+ * where each element represents the number of times the repetition level was observed in the data.
+ *
+ * This value is optional if max_repetition_level is 0.
+ */
+   2: optional list<i64> repetition_level_histogram;
+   /**
+* Same as  repetition_level_histogram except for definition levels.
+*
+* This value is optional when max_definition_level is 0.
+*/
+   3: optional list<i64> definition_level_histogram;

Review Comment:
   BTW, do we need to add an extra histogram for `pair` 
if both exist?
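
The dictionary example in the quoted doc comment for `logical_variable_width_stored_bytes` can be checked with a small Python sketch. The function names are hypothetical and not part of any Parquet library; the only format fact assumed is that PLAIN encoding stores each BYTE_ARRAY value with a 4-byte length prefix, as the comment itself states.

```python
def variable_width_stored_bytes(dictionary, indexes):
    """Sum of the byte lengths of the referenced values, without length prefixes."""
    return sum(len(dictionary[i]) for i in indexes)


def plain_encoded_size(dictionary, indexes, length_prefix_bytes=4):
    """Size the same values would take if PLAIN encoded (prefix + payload per value)."""
    payload = variable_width_stored_bytes(dictionary, indexes)
    return payload + length_prefix_bytes * len(indexes)


# Values from the example in the quoted comment.
dictionary = [b"a", b"bc", b"cde"]
indexes = [0, 0, 1, 2]

stored = variable_width_stored_bytes(dictionary, indexes)   # 1 + 1 + 2 + 3
plain = plain_encoded_size(dictionary, indexes)             # stored + 4 * 4
print(stored, plain)
```

Subtracting `4 * len(indexes)` from the PLAIN-encoded size gives back the stored-bytes figure, which is the equivalence the comment describes.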





> [Format] Add statistics that reflect decoded size to metadata
> -
>
> Key: PARQUET-2261
> URL: https://issues.apache.org/jira/browse/PARQUET-2261
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2261) [Format] Add statistics that reflect decoded size to metadata

2023-03-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705742#comment-17705742
 ] 

ASF GitHub Bot commented on PARQUET-2261:
-

wgtmac commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1149936534


##
src/main/thrift/parquet.thrift:
##
@@ -190,6 +190,41 @@ enum FieldRepetitionType {
   /** The field is repeated and can contain 0 or more values */
   REPEATED = 2;
 }
+/**
+ * A structure for capturing metadata for estimating the unencoded, uncompressed size
+ * of data.
+ *
+ * Writers should populate all fields in this struct except for the exceptions listed per field.
+ */
+struct SizeEstimationStatistics {
+   /**
+* The number of logical physical bytes stored for BYTE_ARRAY data values. Logical bytes refers to the number
+* of bytes needed if no special encoding is used. This is exclusive of the bytes needed
+* to store the length of each byte array. In other words, this field is equivalent to the (size of
+* PLAIN-ENCODING the byte array values) - (4 bytes * number of values written). To determine logical sizes
+* of other types readers can use schema information multiplied by the number of non-null values.
+* The number of non-null values can be inferred from the histograms below.
+*
+* For example, if a column chunk is dictionary encoded with a dictionary ["a", "bc", "cde"] and a data page
+* has indexes [0, 0, 1, 2], this value is expected to be 7 (1 + 1 + 2 + 3).
+*
+* This option should only be set for physical and logical types that would use BYTE_ARRAY when encoded with PLAIN encoding.
+*/
+   1: optional i64 logical_variable_width_stored_bytes;
+   /**
+ * When present there is expected to be one element corresponding to each repetition (i.e. size=max repetition_level+1)
+ * where each element represents the number of times the repetition level was observed in the data.
+ *
+ * This value is optional if max_repetition_level is 0.
+ */
+   2: optional list<i64> repetition_level_histogram;
+   /**
+* Same as  repetition_level_histogram except for definition levels.

Review Comment:
   ```suggestion
   * Same as repetition_level_histogram except for definition levels.
   ```



##
src/main/thrift/parquet.thrift:
##
@@ -190,6 +190,41 @@ enum FieldRepetitionType {
   /** The field is repeated and can contain 0 or more values */
   REPEATED = 2;
 }
+/**
+ * A structure for capturing metadata for estimating the unencoded, uncompressed size
+ * of data.
+ *
+ * Writers should populate all fields in this struct except for the exceptions listed per field.
+ */
+struct SizeEstimationStatistics {
+   /**
+* The number of logical physical bytes stored for BYTE_ARRAY data values. Logical bytes refers to the number
+* of bytes needed if no special encoding is used. This is exclusive of the bytes needed
+* to store the length of each byte array. In other words, this field is equivalent to the (size of
+* PLAIN-ENCODING the byte array values) - (4 bytes * number of values written). To determine logical sizes
+* of other types readers can use schema information multiplied by the number of non-null values.
+* The number of non-null values can be inferred from the histograms below.
+*
+* For example, if a column chunk is dictionary encoded with a dictionary ["a", "bc", "cde"] and a data page
+* has indexes [0, 0, 1, 2], this value is expected to be 7 (1 + 1 + 2 + 3).
+*
+* This option should only be set for physical and logical types that would use BYTE_ARRAY when encoded with PLAIN encoding.

Review Comment:
   It is a little bit confusing. Did you mean
   
   > This option should only be set for physical and logical types that would use BYTE_ARRAY when **__NOT__** encoded with PLAIN encoding.
   
   or
   
   > This option should only be set for physical and logical types that would use BYTE_ARRAY **__EVEN__** when encoded with PLAIN encoding.



##
src/main/thrift/parquet.thrift:
##
@@ -190,6 +190,41 @@ enum FieldRepetitionType {
   /** The field is repeated and can contain 0 or more values */
   REPEATED = 2;
 }
+/**
+ * A structure for capturing metadata for estimating the unencoded, uncompressed size
+ * of data.
+ *
+ * Writers should populate all fields in this struct except for the exceptions listed per field.
+ */
+struct SizeEstimationStatistics {
+   /**
+* The number of logical physical bytes stored for BYTE_ARRAY data values. Logical bytes refers to the number
+* of bytes needed if no special encoding is used. This is exclusive of the bytes needed
+* to store the length of each byte array. In other words, this field is equivalent to the (size of
+* PLAIN-ENCODING the byte array values) - (4 bytes * 

[jira] [Commented] (PARQUET-2224) Publish SBOM artifacts

2023-03-27 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705697#comment-17705697
 ] 

Dongjoon Hyun commented on PARQUET-2224:


As you know, for SPARK-42380, we verified that the problem lies in the 
combination of Maven and its plugins, and we avoided it by pinning the Maven 
and plugin versions.
Thus, we don't think it is an Apache Spark issue, and we will retry once we 
can verify that Maven and its plugins work correctly.

> Publish SBOM artifacts
> --
>
> Key: PARQUET-2224
> URL: https://issues.apache.org/jira/browse/PARQUET-2224
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


RE: [DISCUSS] Add a Plain Encoding Size Bytes to Parquet Metadata

2023-03-27 Thread wish maple
+1 for an uncompressed size for the field. However, it's a bit tricky here.
I've implemented a similar size hint in our system; here are some problems I met:
1. Null values. In an Arrow array, a null value still occupies some space, but
the raw field size cannot represent that.
2. Size of FLBA/ByteArray. Its size should be either variable-size-summary or
variable-size-summary + sizeof(ByteArray) * value-count.
3. Sometimes Arrow data is not equal to Parquet data, like Decimal stored
as int32 or int64.
Hope that helps.

Best, Xuwei Fu
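
To make points 1 and 2 concrete, here is a hedged Python sketch of how a reader might turn a variable-width byte total into an in-memory estimate for an Arrow binary/string array. It assumes the standard Arrow variable-size binary layout (a validity bitmap of one bit per slot, `value_count + 1` offsets of 4 bytes each, and the data buffer) and ignores buffer padding and alignment. The function name and parameters are illustrative only and do not belong to any library.

```python
import math


def estimate_arrow_binary_bytes(variable_width_bytes, value_count, offset_width=4):
    """Rough size of an Arrow binary array; value_count includes null slots."""
    validity_bitmap = math.ceil(value_count / 8)      # one bit per slot, nulls included
    offsets = offset_width * (value_count + 1)        # n + 1 offsets
    return validity_bitmap + offsets + variable_width_bytes


# Four slots ["a", None, "bc", "cde"]: 6 payload bytes, one of the slots is null.
estimate = estimate_arrow_binary_bytes(variable_width_bytes=6, value_count=4)
print(estimate)
```

The null slot adds nothing to the payload total but still costs an offset entry and a validity bit, which is why a raw byte count alone underestimates the Arrow footprint.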

On 2023/03/24 16:26:51 Micah Kornfield wrote:
> Parquet metadata currently tracks uncompressed and compressed page/column
> sizes [1][2].  Uncompressed size here corresponds to encoded size which can
> differ substantially from the plain encoding size due to RLE/Dictionary
> encoding.
>
> When doing query planning/execution it can be useful to understand the
> total raw size of bytes (e.g. whether to do a broadcast join).
>
> Would people be open to adding an optional field that records the estimated
> (or exact) size of the column if plain encoding had been used?
>
> Thanks,
> Micah
>
> [1] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L728
> [2] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L637
>


[jira] [Commented] (PARQUET-2261) [Format] Add statistics that reflect decoded size to metadata

2023-03-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705655#comment-17705655
 ] 

ASF GitHub Bot commented on PARQUET-2261:
-

emkornfield commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1149697363


##
src/main/thrift/parquet.thrift:
##
@@ -190,6 +190,35 @@ enum FieldRepetitionType {
   /** The field is repeated and can contain 0 or more values */
   REPEATED = 2;
 }
+/**
+ * A structure for capturing metadata for estimating the unencoded, uncompressed size
+ * of data.
+ */
+struct SizeEstimationStatistics {
+   /**
+* The number of logical bytes needed to store present/non-null values.
+* Unless specified below, the computed size is the size it would take to plain-encode the underlying
+* physical type.
+* Special calculations:
+*  - Enum: plain-encoded BYTE_ARRAY size
+*  - Integers (same size used for signed and unsigned): int8 - 1 bytes, int16 - 2 
+*  - Decimal - Each value is assumed to take the minimal number of bytes necessary to encode
+*    the precision of the decimal value.
+*  - Nested types (lists, nested groups and maps) - No additional size for these structures
+*    is accounted for in this field; instead the histogram fields below can be
+*    used to estimate overhead to recreate these structures.
+*/
+   1: optional i64 logical_value_byte_storage;
+   /**
+ * When present there is expected to be one element corresponding to each repetition (i.e. size=max repetition_level+1)
+ * where each element represents the number of times the repetition level was observed in the data.
+ */
+   2: optional list<i64> repetition_level_histogram;

Review Comment:
   Made a proposal for these questions.
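
As a hedged illustration of what the histogram fields described in the quoted diff would contain, the Python sketch below builds both histograms from raw level arrays and reads counts back out of them. The helper names are hypothetical. The example column is assumed to be an optional list with optional elements, so the maximum repetition level is 1 and the maximum definition level is 3, and the four records are `[1, None, 2]`, `None`, `[]`, `[3]`.

```python
def level_histogram(levels, max_level):
    """Count how often each level 0..max_level occurs (size = max_level + 1)."""
    histogram = [0] * (max_level + 1)
    for level in levels:
        histogram[level] += 1
    return histogram


def non_null_count(definition_level_histogram):
    """A leaf value is present only at the maximum definition level."""
    return definition_level_histogram[-1]


def record_count(repetition_level_histogram):
    """Repetition level 0 marks the start of a new record."""
    return repetition_level_histogram[0]


# Levels for the records [1, None, 2], None, [], [3] under the assumed schema.
rep_levels = [0, 1, 1, 0, 0, 0]
def_levels = [3, 2, 3, 0, 1, 3]

rep_hist = level_histogram(rep_levels, max_level=1)
def_hist = level_histogram(def_levels, max_level=3)
print(rep_hist, def_hist, non_null_count(def_hist), record_count(rep_hist))
```

The last entry of the definition-level histogram is the non-null value count that the doc comment says readers can infer, and the remaining entries indicate how many nulls or empty lists occur at each nesting depth.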






[jira] [Commented] (PARQUET-2261) [Format] Add statistics that reflect decoded size to metadata

2023-03-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705653#comment-17705653
 ] 

ASF GitHub Bot commented on PARQUET-2261:
-

emkornfield commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1149696676


##
src/main/thrift/parquet.thrift:
##
@@ -190,6 +190,35 @@ enum FieldRepetitionType {
   /** The field is repeated and can contain 0 or more values */
   REPEATED = 2;
 }
+/**
+ * A structure for capturing metadata for estimating the unencoded, uncompressed size
+ * of data.
+ */
+struct SizeEstimationStatistics {
+   /**
+* The number of logical bytes needed to store present/non-null values.
+* Unless specified below, the computed size is the size it would take to plain-encode the underlying
+* physical type.
+* Special calculations:
+*  - Enum: plain-encoded BYTE_ARRAY size
+*  - Integers (same size used for signed and unsigned): int8 - 1 bytes, int16 - 2 
+*  - Decimal - Each value is assumed to take the minimal number of bytes necessary to encode
+*    the precision of the decimal value.
+*  - Nested types (lists, nested groups and maps) - No additional size for these structures
+*    is accounted for in this field; instead the histogram fields below can be
+*    used to estimate overhead to recreate these structures.
+*/
+   1: optional i64 logical_value_byte_storage;

Review Comment:
   I've updated this to reflect variable-width bytes.






[jira] [Commented] (PARQUET-2261) [Format] Add statistics that reflect decoded size to metadata

2023-03-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705650#comment-17705650
 ] 

ASF GitHub Bot commented on PARQUET-2261:
-

emkornfield commented on PR #197:
URL: https://github.com/apache/parquet-format/pull/197#issuecomment-1485717387

   > Seems this patch doesn't need to consider [backward-compatibility rules](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#nested-types)?
   
   Since we are storing rep/def levels directly, I don't think so; those rules 
are applied on top of this data to make the correct inferences.





[jira] [Commented] (PARQUET-2261) [Format] Add statistics that reflect decoded size to metadata

2023-03-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705649#comment-17705649
 ] 

ASF GitHub Bot commented on PARQUET-2261:
-

emkornfield commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1149682895


##
src/main/thrift/parquet.thrift:
##
@@ -190,6 +190,35 @@ enum FieldRepetitionType {
   /** The field is repeated and can contain 0 or more values */
   REPEATED = 2;
 }
+/**
+ * A structure for capturing metadata for estimating the unencoded, uncompressed size
+ * of data.
+ */
+struct SizeEstimationStatistics {
+   /**
+* The number of logical bytes needed to store present/non-null values.
+* Unless specified below, the computed size is the size it would take to plain-encode the underlying
+* physical type.
+* Special calculations:
+*  - Enum: plain-encoded BYTE_ARRAY size
+*  - Integers (same size used for signed and unsigned): int8 - 1 bytes, int16 - 2 
+*  - Decimal - Each value is assumed to take the minimal number of bytes necessary to encode

Review Comment:
   I originally had this. Given the two different opinions expressed, I'm 
going to change this field to record only variable-width bytes, and say that 
all other calculations can be performed by readers based on type and number of values.
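
A minimal sketch of the reader-side calculation mentioned in this comment, in Python: for fixed-width physical types, the plain-encoded size follows from the type and the number of non-null values. The widths are those of Parquet PLAIN encoding (BOOLEAN is bit-packed; FIXED_LEN_BYTE_ARRAY uses the schema's type length). The function name and its signature are hypothetical, not an existing API.

```python
import math

# Bytes per value under PLAIN encoding for fixed-width physical types.
PLAIN_WIDTH_BYTES = {"INT32": 4, "INT64": 8, "INT96": 12, "FLOAT": 4, "DOUBLE": 8}


def fixed_width_plain_bytes(physical_type, non_null_values, type_length=None):
    """Plain-encoded size of the non-null values of a fixed-width column."""
    if physical_type == "BOOLEAN":
        # PLAIN booleans are bit-packed, one bit per value.
        return math.ceil(non_null_values / 8)
    if physical_type == "FIXED_LEN_BYTE_ARRAY":
        if type_length is None:
            raise ValueError("FIXED_LEN_BYTE_ARRAY needs the schema's type_length")
        return type_length * non_null_values
    return PLAIN_WIDTH_BYTES[physical_type] * non_null_values


print(fixed_width_plain_bytes("INT32", 1000))
print(fixed_width_plain_bytes("BOOLEAN", 10))
print(fixed_width_plain_bytes("FIXED_LEN_BYTE_ARRAY", 3, type_length=16))
```

BYTE_ARRAY is deliberately absent from the table: its size cannot be derived from the schema, which is the case the variable-width field is meant to cover.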






[jira] [Commented] (PARQUET-2224) Publish SBOM artifacts

2023-03-27 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705646#comment-17705646
 ] 

Steve Loughran commented on PARQUET-2224:
-

+SPARK-42380

> Publish SBOM artifacts
> --
>
> Key: PARQUET-2224
> URL: https://issues.apache.org/jira/browse/PARQUET-2224
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>






[jira] [Commented] (PARQUET-2224) Publish SBOM artifacts

2023-03-27 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705645#comment-17705645
 ] 

Steve Loughran commented on PARQUET-2224:
-

HADOOP-18641. It didn't actually break the build; it just printed stack traces 
and didn't produce the manifests.

> Publish SBOM artifacts
> --
>
> Key: PARQUET-2224
> URL: https://issues.apache.org/jira/browse/PARQUET-2224
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>






Re: [VOTE] Release Apache Parquet 1.12.4 RC0

2023-03-27 Thread Dongjoon Hyun
+1

Thank you, Gang and Yuming.

Dongjoon.

On 2023/03/27 05:44:14 "Wang, Yuming" wrote:
> +1. Tested this release through Spark UT: 
> https://github.com/apache/spark/pull/40555.
> 
> 
> From: Gang Wu 
> Date: Sunday, March 26, 2023 at 22:42
> To: dev@parquet.apache.org 
> Subject: [VOTE] Release Apache Parquet 1.12.4 RC0
> 
> Hi everyone,
> 
> I propose the following RC to be released as the official Apache Parquet
> 1.12.4 release.
> 
> The commit id is 22069e58494e7cb5d50e664c7ffa1cf1468404f8
> * This corresponds to the tag: apache-parquet-1.12.4-rc0
> *
> https://github.com/apache/parquet-mr/tree/22069e58494e7cb5d50e664c7ffa1cf1468404f8
> 
> The release tarball, signature, and checksums are here:
> * 
> https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.12.4-rc0
> 
> You can find the KEYS file here:
> * 
> https://downloads.apache.org/parquet/KEYS
> 
> Binary artifacts are staged in Nexus here:
> * 
> https://repository.apache.org/content/groups/staging/org/apache/parquet/
> 
> This release includes important changes listed:
> * 
> https://github.com/apache/parquet-mr/blob/parquet-1.12.4/CHANGES.md
> 
> Please download, verify, and test.
> 
> Please vote in the next 72 hours.
> 
> [ ] +1 Release this as Apache Parquet 1.12.4
> [ ] +0
> [ ] -1 Do not release this because...
> 
> Best regards,
> Gang
> 
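
A minimal sketch of the checksum part of "download, verify, and test", assuming a SHA-512 checksum file whose first whitespace-separated token is the hex digest. That layout and the helper names are assumptions for illustration; signature verification against the KEYS file is a separate step not shown here.

```python
import hashlib


def sha512_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-512 digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, checksum_text: str) -> bool:
    """Compare a file against the contents of its checksum file.

    Assumption: the digest is the first token of the checksum text;
    anything after it (such as a file name) is ignored.
    """
    expected = checksum_text.split()[0].lower()
    return sha512_of(path) == expected
```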


[jira] [Commented] (PARQUET-2224) Publish SBOM artifacts

2023-03-27 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705436#comment-17705436
 ] 

Dongjoon Hyun commented on PARQUET-2224:


Never mind. I found it in the Apache Parquet 1.12.4 RC artifacts. It looks good 
to me. Thank you, [~wgtmac].

- 
https://repository.apache.org/content/groups/staging/org/apache/parquet/parquet-common/1.12.4/parquet-common-1.12.4-cyclonedx.json

> Publish SBOM artifacts
> --
>
> Key: PARQUET-2224
> URL: https://issues.apache.org/jira/browse/PARQUET-2224
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>






[jira] [Commented] (PARQUET-2224) Publish SBOM artifacts

2023-03-27 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705433#comment-17705433
 ] 

Dongjoon Hyun commented on PARQUET-2224:


BTW, [~wgtmac]. What is the `Fixed Version` of this issue? I want to check the 
released artifacts.

> Publish SBOM artifacts
> --
>
> Key: PARQUET-2224
> URL: https://issues.apache.org/jira/browse/PARQUET-2224
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>






[jira] [Commented] (PARQUET-2224) Publish SBOM artifacts

2023-03-27 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705430#comment-17705430
 ] 

Dongjoon Hyun commented on PARQUET-2224:


Apache Maven and its plugin ecosystem are also open source projects with many 
issues of their own.

FYI, Apache ORC 1.7.8 and 1.8.3 also have no issues, [~ste...@apache.org]. 
- 
https://repo1.maven.org/maven2/org/apache/orc/orc-core/1.7.8/orc-core-1.7.8-cyclonedx.json
- 
https://repo1.maven.org/maven2/org/apache/orc/orc-core/1.8.3/orc-core-1.8.3-cyclonedx.json

> Publish SBOM artifacts
> --
>
> Key: PARQUET-2224
> URL: https://issues.apache.org/jira/browse/PARQUET-2224
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>






[jira] [Commented] (PARQUET-2224) Publish SBOM artifacts

2023-03-27 Thread Gang Wu (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705396#comment-17705396
 ] 

Gang Wu commented on PARQUET-2224:
--

[~ste...@apache.org] Do you have the relevant JIRAs? We are releasing a new 
version of Parquet, and this sounds like a blocker.

> Publish SBOM artifacts
> --
>
> Key: PARQUET-2224
> URL: https://issues.apache.org/jira/browse/PARQUET-2224
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>






[jira] [Commented] (PARQUET-2261) [Format] Add statistics that reflect decoded size to metadata

2023-03-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705389#comment-17705389
 ] 

ASF GitHub Bot commented on PARQUET-2261:
-

mapleFU commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1149338204


##
src/main/thrift/parquet.thrift:
##
@@ -190,6 +190,35 @@ enum FieldRepetitionType {
   /** The field is repeated and can contain 0 or more values */
   REPEATED = 2;
 }
+/**
+ * A structure for capturing metadata for estimating the unencoded, 
uncompressed size
+ * of data.
+ */ 
+struct SizeEstimationStatistics {
+   /** 
+* The number of logic bytes needed to store present/non-null values.
+* Unless specified below, the computed size is the size it would take to 
plain-encode the underlying
+* physical type.
+* Special calculations:
+*  - Enum: plain-encoded BYTE_ARRAY size
+*  - Integers (same size used for signed and unsigned): int8 - 1 bytes, 
int16 - 2 
+*  - Decimal - Each value is assumed to take the minimal number of bytes 
necessary to encode

Review Comment:
   It seems that a small Decimal can be encoded as FLBA or BYTE_ARRAY, but a 
big decimal cannot be stored as i32. Should we force the use of the physical 
type, or define the size relative to the physical type?
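
To make the "minimal number of bytes necessary to encode the precision" rule in the quoted comment concrete, here is a minimal sketch. It assumes the unscaled value is stored as a two's-complement signed integer, as Parquet does for binary-backed decimals; the function name is hypothetical.

```python
def min_bytes_for_precision(precision: int) -> int:
    """Smallest byte count whose signed range covers every unscaled value
    with the given number of decimal digits."""
    if precision < 1:
        raise ValueError("precision must be positive")
    # The largest magnitude is 10**precision - 1; n bytes of two's complement
    # hold magnitudes up to 2**(8*n - 1) - 1.
    limit = 10 ** precision
    n = 1
    while 2 ** (8 * n - 1) < limit:
        n += 1
    return n
```

Under this rule a precision-9 decimal needs 4 bytes and a precision-18 decimal needs 8, which matches the INT32 and INT64 widths, whereas a precision-5 decimal stored as INT32 would be counted as 3 bytes rather than 4. That gap is where the minimal-bytes rule and the physical type disagree.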





> [Format] Add statistics that reflect decoded size to metadata
> -
>
> Key: PARQUET-2261
> URL: https://issues.apache.org/jira/browse/PARQUET-2261
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>






[GitHub] [parquet-format] mapleFU commented on a diff in pull request #197: PARQUET-2261: Proposal for unencoded/uncompressed statistics

2023-03-27 Thread via GitHub


mapleFU commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1149338204


##
src/main/thrift/parquet.thrift:
##
@@ -190,6 +190,35 @@ enum FieldRepetitionType {
   /** The field is repeated and can contain 0 or more values */
   REPEATED = 2;
 }
+/**
+ * A structure for capturing metadata for estimating the unencoded, 
uncompressed size
+ * of data.
+ */ 
+struct SizeEstimationStatistics {
+   /** 
+* The number of logic bytes needed to store present/non-null values.
+* Unless specified below, the computed size is the size it would take to 
plain-encode the underlying
+* physical type.
+* Special calculations:
+*  - Enum: plain-encoded BYTE_ARRAY size
+*  - Integers (same size used for signed and unsigned): int8 - 1 bytes, 
int16 - 2 
+*  - Decimal - Each value is assumed to take the minimal number of bytes 
necessary to encode

Review Comment:
   It seems that a small Decimal can be encoded as FLBA or BYTE_ARRAY, but a 
big decimal cannot be stored as i32. Should we force the use of the physical 
type, or define the size relative to the physical type?






[jira] [Commented] (PARQUET-2261) [Format] Add statistics that reflect decoded size to metadata

2023-03-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705387#comment-17705387
 ] 

ASF GitHub Bot commented on PARQUET-2261:
-

mapleFU commented on PR #197:
URL: https://github.com/apache/parquet-format/pull/197#issuecomment-1485217695

   It seems this patch doesn't need to consider the [backward-compatibility 
rules](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#nested-types)?




> [Format] Add statistics that reflect decoded size to metadata
> -
>
> Key: PARQUET-2261
> URL: https://issues.apache.org/jira/browse/PARQUET-2261
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>






[GitHub] [parquet-format] mapleFU commented on pull request #197: PARQUET-2261: Proposal for unencoded/uncompressed statistics

2023-03-27 Thread via GitHub


mapleFU commented on PR #197:
URL: https://github.com/apache/parquet-format/pull/197#issuecomment-1485217695

   It seems this patch doesn't need to consider the [backward-compatibility 
rules](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#nested-types)?





[jira] [Commented] (PARQUET-2261) [Format] Add statistics that reflect decoded size to metadata

2023-03-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705384#comment-17705384
 ] 

ASF GitHub Bot commented on PARQUET-2261:
-

mapleFU commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1149327881


##
src/main/thrift/parquet.thrift:
##
@@ -223,6 +223,17 @@ struct Statistics {
 */
5: optional binary max_value;
6: optional binary min_value;
+   /** The number of bytes the row/group or page would take if encoded with 
plain-encoding */
+   7: optional i64 plain_encoded_bytes;
+   /** 
+ * When present there is expected to be one element corresponding to each 
repetition (i.e. size=max repetition_leve) 
+ * where each element represens the count of the number of times that 
level occurs in the page/column chunk.
+ */
+   8: optional list repetition_level_histogram;

Review Comment:
   It seems this can help push down some filters on List/Map and help with 
constructing the list. That's great, but I think we may need some samples, 
because it's a bit hard to understand how to make full use of it. Like some 
rules in  ?





> [Format] Add statistics that reflect decoded size to metadata
> -
>
> Key: PARQUET-2261
> URL: https://issues.apache.org/jira/browse/PARQUET-2261
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>






[GitHub] [parquet-format] mapleFU commented on a diff in pull request #197: PARQUET-2261: Proposal for unencoded/uncompressed statistics

2023-03-27 Thread via GitHub


mapleFU commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1149327881


##
src/main/thrift/parquet.thrift:
##
@@ -223,6 +223,17 @@ struct Statistics {
 */
5: optional binary max_value;
6: optional binary min_value;
+   /** The number of bytes the row/group or page would take if encoded with 
plain-encoding */
+   7: optional i64 plain_encoded_bytes;
+   /** 
+ * When present there is expected to be one element corresponding to each 
repetition (i.e. size=max repetition_leve) 
+ * where each element represens the count of the number of times that 
level occurs in the page/column chunk.
+ */
+   8: optional list repetition_level_histogram;

Review Comment:
   It seems this can help push down some filters on List/Map and help with 
constructing the list. That's great, but I think we may need some samples, 
because it's a bit hard to understand how to make full use of it. Like some 
rules in  ?






[jira] [Commented] (PARQUET-2262) Fix local build failure from maven-surefire-plugin due to missing surefire.argLine

2023-03-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705382#comment-17705382
 ] 

ASF GitHub Bot commented on PARQUET-2262:
-

wgtmac commented on PR #1045:
URL: https://github.com/apache/parquet-mr/pull/1045#issuecomment-1485205157

   @gszadovszky @ggershinsky @shangxinli Could you please take a look? I hit 
this issue while releasing the 1.12.4-rc0.




> Fix local build failure from maven-surefire-plugin due to missing 
> surefire.argLine
> --
>
> Key: PARQUET-2262
> URL: https://issues.apache.org/jira/browse/PARQUET-2262
> Project: Parquet
>  Issue Type: Test
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> The issue can be reproduced simply by running *mvn test* on the laptop 
> locally.
> {quote}
> [INFO] 
> 
> [INFO] Reactor Summary for Apache Parquet MR 1.13.0-SNAPSHOT:
> [INFO]
> [INFO] Apache Parquet MR .. SUCCESS [ 1.056 s]
> [INFO] Apache Parquet Format Structures ... FAILURE [ 2.009 s]
> [INFO] Apache Parquet Generator ... SKIPPED
> [INFO] Apache Parquet Common .. SKIPPED
> [INFO] Apache Parquet Encodings ... SKIPPED
> [INFO] Apache Parquet Column .. SKIPPED
> [INFO] Apache Parquet Arrow ... SKIPPED
> [INFO] Apache Parquet Jackson . SKIPPED
> [INFO] Apache Parquet Hadoop .. SKIPPED
> [INFO] Apache Parquet Avro  SKIPPED
> [INFO] Apache Parquet Benchmarks .. SKIPPED
> [INFO] Apache Parquet Command-line  SKIPPED
> [INFO] Apache Parquet Pig . SKIPPED
> [INFO] Apache Parquet Pig Bundle .. SKIPPED
> [INFO] Apache Parquet Protobuf  SKIPPED
> [INFO] Apache Parquet Scala ... SKIPPED
> [INFO] Apache Parquet Thrift .. SKIPPED
> [INFO] Apache Parquet Hadoop Bundle ... SKIPPED
> [INFO] 
> 
> [INFO] BUILD FAILURE
> [INFO] 
> 
> [INFO] Total time: 3.162 s
> [INFO] Finished at: 2023-03-27T09:52:19+08:00
> [INFO] 
> 
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-surefire-plugin:2.22.0:test (default-test) on 
> project parquet-format-structures: There are test failures.
> [ERROR]
> [ERROR] Please refer to 
> /Users/gangwu/Projects/parquet-mr/parquet-format-structures/target/surefire-reports
>  for the individual test results.
> [ERROR] Please refer to dump files (if any exist) [date]-jvmRun[N].dump, 
> [date].dumpstream and [date]-jvmRun[N].dumpstream.
> [ERROR] The forked VM terminated without properly saying goodbye. VM crash or 
> System.exit called?
> [ERROR] Command was /bin/sh -c cd 
> /Users/gangwu/Projects/parquet-mr/parquet-format-structures && 
> /Library/Java/JavaVirtualMachines/zulu-8.jdk/Contents/Home/jre/bin/java 
> '${surefire.argLine}' -XX:+IgnoreUnrecognizedVMOptions 
> --add-opens=java.base/java.lang=ALL-UNNAMED 
> --add-opens=java.base/java.lang.invoke=ALL-UNNAMED 
> --add-opens=java.base/java.lang.reflect=ALL-UNNAMED 
> --add-opens=java.base/java.io=ALL-UNNAMED 
> --add-opens=java.base/java.net=ALL-UNNAMED 
> --add-opens=java.base/java.nio=ALL-UNNAMED 
> --add-opens=java.base/java.util=ALL-UNNAMED 
> --add-opens=java.base/java.util.concurrent=ALL-UNNAMED 
> --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED 
> --add-opens=java.base/sun.nio.ch=ALL-UNNAMED 
> --add-opens=java.base/sun.nio.cs=ALL-UNNAMED 
> --add-opens=java.base/sun.security.action=ALL-UNNAMED 
> --add-opens=java.base/sun.util.calendar=ALL-UNNAMED -jar 
> /Users/gangwu/Projects/parquet-mr/parquet-format-structures/target/surefire/surefirebooter4762879200927950684.jar
>  /Users/gangwu/Projects/parquet-mr/parquet-format-structures/target/surefire 
> 2023-03-27T09-52-19_523-jvmRun1 surefire1144200075397565188tmp 
> surefire_08731611064868391570tmp
> [ERROR] Error occurred in starting fork, check output in log
> [ERROR] Process Exit Code: 1
> [ERROR] org.apache.maven.surefire.booter.SurefireBooterForkException: The 
> forked VM terminated without properly saying goodbye. VM crash or System.exit 
> called?
> [ERROR] Command was /bin/sh -c cd 
> /Users/gangwu/Projects/parquet-mr/parquet-format-structures && 
> 

[GitHub] [parquet-mr] wgtmac commented on pull request #1045: PARQUET-2262: Fix local build failure due to missing surefire.argLine

2023-03-27 Thread via GitHub


wgtmac commented on PR #1045:
URL: https://github.com/apache/parquet-mr/pull/1045#issuecomment-1485205157

   @gszadovszky @ggershinsky @shangxinli Could you please take a look? I hit 
this issue while releasing the 1.12.4-rc0.





[jira] [Commented] (PARQUET-2224) Publish SBOM artifacts

2023-03-27 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705327#comment-17705327
 ] 

Steve Loughran commented on PARQUET-2224:
-

We had to roll this back from Hadoop because the Maven plugin didn't work with 
Maven 3.3.9. Is it better now?

> Publish SBOM artifacts
> --
>
> Key: PARQUET-2224
> URL: https://issues.apache.org/jira/browse/PARQUET-2224
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>






[jira] [Commented] (PARQUET-2261) [Format] Add statistics that reflect decoded size to metadata

2023-03-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705195#comment-17705195
 ] 

ASF GitHub Bot commented on PARQUET-2261:
-

emkornfield commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1148822969


##
src/main/thrift/parquet.thrift:
##
@@ -190,6 +190,35 @@ enum FieldRepetitionType {
   /** The field is repeated and can contain 0 or more values */
   REPEATED = 2;
 }
+/**
+ * A structure for capturing metadata for estimating the unencoded, 
uncompressed size
+ * of data.
+ */ 
+struct SizeEstimationStatistics {
+   /** 
+* The number of logic bytes needed to store present/non-null values.
+* Unless specified below, the computed size is the size it would take to 
plain-encode the underlying
+* physical type.
+* Special calculations:
+*  - Enum: plain-encoded BYTE_ARRAY size
+*  - Integers (same size used for signed and unsigned): int8 - 1 bytes, 
int16 - 2 
+*  - Decimal - Each value is assumed to take the minimal number of bytes 
necessary to encode
+*the precision of the decimal value.
+*  - Nested types (lists, nested groups and maps) - No additional size for 
these structures
+*are accounted for in this field, instead the histogram fields below 
can be
+*be used to estimate overhead to recreate these structures.
+*/
+   1: optional i64 logical_value_byte_storage;
+   /** 
+ * When present there is expected to be one element corresponding to each 
repetition (i.e. size=max repetition_level+1) 
+ * where each element represens the number of time the repetition level 
was observed in the data.
+ */
+   2: optional list repetition_level_histogram;

Review Comment:
   There are a few things to consider here:
   1. What happens if the max rep/def level is zero (should we require these)? 
This also relates to whether the size should be max_def_level + 1 or 
max_def_level. The first allows readers to sanity check that the statistics 
sum to num_values; the second does not.
   2. Should we require variable-size bytes if the column doesn't have any (0 
is an acceptable value here)?
   3. It has kind of been drilled into me that anyone who puts a required 
field in a message that lives long enough will live to regret it. I'd prefer 
to document that writers should populate relevant fields (and be specific 
about when we believe they are relevant).
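
A minimal sketch of the sanity check mentioned in point 1, assuming a histogram with max_level + 1 buckets and a single-level list column (max repetition level 1). The helper names and the sample rows are made up for illustration.

```python
from typing import List, Optional, Sequence


def repetition_levels(rows: Sequence[Optional[Sequence[int]]]) -> List[int]:
    """Repetition levels for a column of lists of ints (max level 1)."""
    levels: List[int] = []
    for row in rows:
        # Every row starts a new record, so its first entry has level 0.
        # A null or empty list still emits exactly one level entry.
        levels.append(0)
        if row:
            # Each further element continues the same list at level 1.
            levels.extend([1] * (len(row) - 1))
    return levels


def level_histogram(levels: Sequence[int], max_level: int) -> List[int]:
    """Count how often each level occurs; one bucket per level 0..max_level."""
    histogram = [0] * (max_level + 1)
    for level in levels:
        histogram[level] += 1
    return histogram


rows = [[1, 2, 3], [], None, [4]]
levels = repetition_levels(rows)
histogram = level_histogram(levels, max_level=1)

# With max_level + 1 buckets the counts add up to the number of level entries
# (num_values), which is the check a reader can run.
assert sum(histogram) == len(levels)
# Bucket 0 counts the entries that start a new row.
assert histogram[0] == len(rows)
```

With only max_level buckets one of the counts would be missing, so the sum could no longer be compared against num_values.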





> [Format] Add statistics that reflect decoded size to metadata
> -
>
> Key: PARQUET-2261
> URL: https://issues.apache.org/jira/browse/PARQUET-2261
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>






[GitHub] [parquet-format] emkornfield commented on a diff in pull request #197: PARQUET-2261: Proposal for unencoded/uncompressed statistics

2023-03-27 Thread via GitHub


emkornfield commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1148822969


##
src/main/thrift/parquet.thrift:
##
@@ -190,6 +190,35 @@ enum FieldRepetitionType {
   /** The field is repeated and can contain 0 or more values */
   REPEATED = 2;
 }
+/**
+ * A structure for capturing metadata for estimating the unencoded, 
uncompressed size
+ * of data.
+ */ 
+struct SizeEstimationStatistics {
+   /** 
+* The number of logic bytes needed to store present/non-null values.
+* Unless specified below, the computed size is the size it would take to 
plain-encode the underlying
+* physical type.
+* Special calculations:
+*  - Enum: plain-encoded BYTE_ARRAY size
+*  - Integers (same size used for signed and unsigned): int8 - 1 bytes, 
int16 - 2 
+*  - Decimal - Each value is assumed to take the minimal number of bytes 
necessary to encode
+*the precision of the decimal value.
+*  - Nested types (lists, nested groups and maps) - No additional size for 
these structures
+*are accounted for in this field, instead the histogram fields below 
can be
+*be used to estimate overhead to recreate these structures.
+*/
+   1: optional i64 logical_value_byte_storage;
+   /** 
+ * When present there is expected to be one element corresponding to each 
repetition (i.e. size=max repetition_level+1) 
+ * where each element represens the number of time the repetition level 
was observed in the data.
+ */
+   2: optional list repetition_level_histogram;

Review Comment:
   There are a few things to consider here:
   1. What happens if the max rep/def level is zero (should we require these)? 
This also relates to whether the size should be max_def_level + 1 or 
max_def_level. The first allows readers to sanity check that the statistics 
sum to num_values; the second does not.
   2. Should we require variable-size bytes if the column doesn't have any (0 
is an acceptable value here)?
   3. It has kind of been drilled into me that anyone who puts a required 
field in a message that lives long enough will live to regret it. I'd prefer 
to document that writers should populate relevant fields (and be specific 
about when we believe they are relevant).


