Re: Files with inconsistent num_rows and num_values?

2023-11-28 Thread Micah Kornfield
Hi Gang,
For the writers, I'm seeing "parquet-mr version 1.11.1" and "parquet-mr
version 1.10.1".  I still need to look into the page headers to check for
consistency.  At the column level, the number of values read by pyarrow is
consistent with num_rows in some cases and with num_values in others. I
don't see any discernible pattern based on schema or types.
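
A minimal sketch of the comparison described above, written in Java and working on plain counts rather than on a real Parquet footer. `RowCountCheck`, `classify`, and `check` are hypothetical names for illustration, not parquet-mr or Arrow APIs, and the numbers in `main` are made up. It assumes that a non-repeated column is expected to carry exactly one value (null or not) per row, so its num_values should equal the row group's num_rows.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of parquet-mr or Arrow.
final class RowCountCheck {
  enum Status { CONSISTENT, EXTRA_VALUES, MISSING_VALUES }

  // Compares a non-repeated column's value count with the row group's row count.
  static Status classify(long numRows, long numValues) {
    if (numValues == numRows) {
      return Status.CONSISTENT;
    }
    return numValues > numRows ? Status.EXTRA_VALUES : Status.MISSING_VALUES;
  }

  // Applies classify to every column of one row group, keyed by column path.
  static Map<String, Status> check(long numRows, Map<String, Long> numValuesByColumn) {
    Map<String, Status> result = new LinkedHashMap<>();
    for (Map.Entry<String, Long> entry : numValuesByColumn.entrySet()) {
      result.put(entry.getKey(), classify(numRows, entry.getValue()));
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, Long> counts = new LinkedHashMap<>();
    counts.put("id", 100L);   // made-up counts
    counts.put("name", 103L); // more values than rows, as in the files discussed
    System.out.println(check(100L, counts));
  }
}
```

A reader that treats num_rows as authoritative would truncate columns reported as EXTRA_VALUES; a strict reader would reject the file on anything other than CONSISTENT.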

It looks like the Parquet files might have been written with
Avro (a "parquet.avro.schema" key and a corresponding schema are present in
their metadata).

Thanks,
Micah

On Tue, Nov 28, 2023 at 6:30 PM Gang Wu wrote:

> Hi Micah,
>
> Does the FileMetaData.version [1] provide any information about
> the writer? What about the num_values in each page header? Is
> the actual number of values consistent with num_values in the
> ColumnMetaData?
>
> [1]
>
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1108
>
> Best,
> Gang
>
> On Wed, Nov 29, 2023 at 2:22 AM Micah Kornfield wrote:
>
> > We've recently encountered files in which the number of rows specified in
> > the row group [1] is inconsistent with the total number of values in a
> > column [2] for non-repeated columns. Within a file the counts differ
> > between columns, but all of them appear to be greater than or equal to the
> > number of rows.
> >
> > Two questions:
> > 1.  Is anyone aware of parquet implementations that might generate files
> > like this?
> > 2.  Does anyone have an opinion on the correct interpretation of these
> > files?  Should the files be treated as corrupt, or should the number of
> > rows be treated as authoritative and any additional data in a column be
> > truncated?
> >
> > It appears different engines make different choices in this case.  Arrow
> > treats this as corruption. Spark seems to allow reading the data.
> >
> > Thanks,
> > Micah
> >
> >
> > [1]
> > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L895
> > [2]
> > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L786
> >
>


[jira] [Commented] (PARQUET-2396) Refactor `ColumnIndexBuilder`

2023-11-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790918#comment-17790918
 ] 

ASF GitHub Bot commented on PARQUET-2396:
-

zhangjiashen commented on code in PR #1219:
URL: https://github.com/apache/parquet-mr/pull/1219#discussion_r1408788644


##
parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java:
##
@@ -298,24 +295,22 @@ public <T extends Comparable<T>> PrimitiveIterator.OfInt visit(NotEq<T> notEq) {
     public <T extends Comparable<T>> PrimitiveIterator.OfInt visit(In<T> in) {
       Set<T> values = in.getValues();
       IntSet matchingIndexesForNull = new IntOpenHashSet();  // for null
-      Iterator<T> it = values.iterator();
-      while(it.hasNext()) {
-        T value = it.next();
-        if (value == null) {
-          if (nullCounts == null) {
-            // Searching for nulls so if we don't have null related statistics we have to return all pages
-            return IndexIterator.all(getPageCount());
-          } else {
-            for (int i = 0; i < nullCounts.length; i++) {
-              if (nullCounts[i] > 0) {
-                matchingIndexesForNull.add(i);
+      for (T value : values) {
+          if (value == null) {

Review Comment:
   Nit: Let's change the indentation to 2 spaces for consistency; the same
applies to the changes below.
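
As a rough illustration of the loop under review, the sketch below uses an enhanced `for` with two-space indentation. It is a simplified stand-in, not the code from the PR: `NullPageLookup` and `pagesWithNulls` are hypothetical names, and a plain `List<Integer>` replaces the `IntSet` and `IndexIterator` types used in `ColumnIndexBuilder`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified stand-in for the null handling in visit(In).
final class NullPageLookup {
  // Returns the indexes of pages that may hold nulls when the value set contains null.
  static List<Integer> pagesWithNulls(Set<?> values, long[] nullCounts, int pageCount) {
    List<Integer> matching = new ArrayList<>();
    for (Object value : values) {
      if (value == null) {
        if (nullCounts == null) {
          // No null statistics available: every page has to be returned.
          List<Integer> all = new ArrayList<>();
          for (int i = 0; i < pageCount; i++) {
            all.add(i);
          }
          return all;
        }
        for (int i = 0; i < nullCounts.length; i++) {
          if (nullCounts[i] > 0) {
            matching.add(i);
          }
        }
      }
    }
    return matching;
  }
}
```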





> Refactor `ColumnIndexBuilder`
> -
>
> Key: PARQUET-2396
> URL: https://issues.apache.org/jira/browse/PARQUET-2396
> Project: Parquet
>  Issue Type: Improvement
>Affects Versions: 1.13.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] PARQUET-2396: Refactor `ColumnIndexBuilder` [parquet-mr]

2023-11-28 Thread via GitHub


zhangjiashen commented on code in PR #1219:
URL: https://github.com/apache/parquet-mr/pull/1219#discussion_r1408788644





-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-2396) Refactor `ColumnIndexBuilder`

2023-11-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790917#comment-17790917
 ] 

ASF GitHub Bot commented on PARQUET-2396:
-

zhangjiashen commented on code in PR #1219:
URL: https://github.com/apache/parquet-mr/pull/1219#discussion_r1408788644


##
parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java:
##
@@ -298,24 +295,22 @@ public <T extends Comparable<T>> PrimitiveIterator.OfInt visit(NotEq<T> notEq) {
     public <T extends Comparable<T>> PrimitiveIterator.OfInt visit(In<T> in) {
       Set<T> values = in.getValues();
       IntSet matchingIndexesForNull = new IntOpenHashSet();  // for null
-      Iterator<T> it = values.iterator();
-      while(it.hasNext()) {
-        T value = it.next();
-        if (value == null) {
-          if (nullCounts == null) {
-            // Searching for nulls so if we don't have null related statistics we have to return all pages
-            return IndexIterator.all(getPageCount());
-          } else {
-            for (int i = 0; i < nullCounts.length; i++) {
-              if (nullCounts[i] > 0) {
-                matchingIndexesForNull.add(i);
+      for (T value : values) {
+          if (value == null) {

Review Comment:
   Nit: Let's change the indentation to 2 spaces for consistency.





> Refactor `ColumnIndexBuilder`
> -
>
> Key: PARQUET-2396
> URL: https://issues.apache.org/jira/browse/PARQUET-2396
> Project: Parquet
>  Issue Type: Improvement
>Affects Versions: 1.13.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>






Re: [PR] PARQUET-2396: Refactor `ColumnIndexBuilder` [parquet-mr]

2023-11-28 Thread via GitHub


zhangjiashen commented on code in PR #1219:
URL: https://github.com/apache/parquet-mr/pull/1219#discussion_r1408788644








Re: Files with inconsistent num_rows and num_values?

2023-11-28 Thread Gang Wu
Hi Micah,

Does the FileMetaData.version [1] provide any information about
the writer? What about the num_values in each page header? Is
the actual number of values consistent with num_values in the
ColumnMetaData?

[1]
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1108

Best,
Gang

On Wed, Nov 29, 2023 at 2:22 AM Micah Kornfield wrote:

> We've recently encountered files in which the number of rows specified in
> the row group [1] is inconsistent with the total number of values in a
> column [2] for non-repeated columns. Within a file the counts differ
> between columns, but all of them appear to be greater than or equal to the
> number of rows.
>
> Two questions:
> 1.  Is anyone aware of parquet implementations that might generate files
> like this?
> 2.  Does anyone have an opinion on the correct interpretation of these
> files?  Should the files be treated as corrupt, or should the number of
> rows be treated as authoritative and any additional data in a column be
> truncated?
>
> It appears different engines make different choices in this case.  Arrow
> treats this as corruption. Spark seems to allow reading the data.
>
> Thanks,
> Micah
>
>
> [1]
>
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L895
> [2]
>
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L786
>


[jira] [Commented] (PARQUET-2386) More consistent code style in parquet-mr

2023-11-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790840#comment-17790840
 ] 

ASF GitHub Bot commented on PARQUET-2386:
-

wgtmac commented on PR #1209:
URL: https://github.com/apache/parquet-mr/pull/1209#issuecomment-1831045581

   Thanks for the improvement!
   
   Could you please take a look at this? @shangxinli @gszadovszky @Fokko




> More consistent code style in parquet-mr
> 
>
> Key: PARQUET-2386
> URL: https://issues.apache.org/jira/browse/PARQUET-2386
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Atour Mousavi Gourabi
>Assignee: Atour Mousavi Gourabi
>Priority: Major
>
> The code style conventions used in parquet-mr are generally inconsistent and 
> unenforced. We might want to consider using linters such as Spotless and a 
> more extensive .editorconfig configuration.





Re: [PR] PARQUET-2386: More consistent code style in parquet-mr [parquet-mr]

2023-11-28 Thread via GitHub


wgtmac commented on PR #1209:
URL: https://github.com/apache/parquet-mr/pull/1209#issuecomment-1831045581

   Thanks for the improvement!
   
   Could you please take a look at this? @shangxinli @gszadovszky @Fokko





[jira] [Commented] (PARQUET-2397) Make use of `isEmpty`

2023-11-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790822#comment-17790822
 ] 

ASF GitHub Bot commented on PARQUET-2397:
-

Fokko opened a new pull request, #1220:
URL: https://github.com/apache/parquet-mr/pull/1220

   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [ ] My PR addresses the following [Parquet 
Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references 
them in the PR title. For example, "PARQUET-1234: My Parquet PR"
 - https://issues.apache.org/jira/browse/PARQUET-2397
 - In case you are adding a dependency, check if the license complies with 
the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Tests
   
   - [ ] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   
   ### Commits
   
   - [ ] My commits all reference Jira issues in their subject lines. In 
addition, my commits follow the guidelines from "[How to write a good git 
commit message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain Javadoc that 
explain what it does
   




> Make use of `isEmpty`
> -
>
> Key: PARQUET-2397
> URL: https://issues.apache.org/jira/browse/PARQUET-2397
> Project: Parquet
>  Issue Type: Improvement
>Affects Versions: 1.13.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>






[PR] PARQUET-2397: Make use of `isEmpty` [parquet-mr]

2023-11-28 Thread via GitHub


Fokko opened a new pull request, #1220:
URL: https://github.com/apache/parquet-mr/pull/1220

   





[jira] [Created] (PARQUET-2397) Make use of `isEmpty`

2023-11-28 Thread Fokko Driesprong (Jira)
Fokko Driesprong created PARQUET-2397:
-

 Summary: Make use of `isEmpty`
 Key: PARQUET-2397
 URL: https://issues.apache.org/jira/browse/PARQUET-2397
 Project: Parquet
  Issue Type: Improvement
Affects Versions: 1.13.1
Reporter: Fokko Driesprong
Assignee: Fokko Driesprong
 Fix For: 1.14.0
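
The ticket has no description, so the exact call sites are not known here. As a generic sketch of the kind of change the summary suggests, the two methods below return the same result; `IsEmptyExample` and its method names are hypothetical and not taken from parquet-mr.

```java
import java.util.List;

// Hypothetical before/after pair; not code from parquet-mr.
final class IsEmptyExample {
  // Before: compares the size with zero.
  static boolean hasNoPagesBefore(List<String> pages) {
    return pages.size() == 0;
  }

  // After: states the intent directly and gives the same answer.
  static boolean hasNoPagesAfter(List<String> pages) {
    return pages.isEmpty();
  }
}
```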








[jira] [Created] (PARQUET-2396) Refactor `ColumnIndexBuilder`

2023-11-28 Thread Fokko Driesprong (Jira)
Fokko Driesprong created PARQUET-2396:
-

 Summary: Refactor `ColumnIndexBuilder`
 Key: PARQUET-2396
 URL: https://issues.apache.org/jira/browse/PARQUET-2396
 Project: Parquet
  Issue Type: Improvement
Affects Versions: 1.13.1
Reporter: Fokko Driesprong
Assignee: Fokko Driesprong
 Fix For: 1.14.0








[PR] PARQUET-2396: Refactor `ColumnIndexBuilder` [parquet-mr]

2023-11-28 Thread via GitHub


Fokko opened a new pull request, #1219:
URL: https://github.com/apache/parquet-mr/pull/1219

   Small refactor to improve readability
   
   





[jira] [Commented] (PARQUET-2396) Refactor `ColumnIndexBuilder`

2023-11-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790819#comment-17790819
 ] 

ASF GitHub Bot commented on PARQUET-2396:
-

Fokko opened a new pull request, #1219:
URL: https://github.com/apache/parquet-mr/pull/1219

   Small refactor to improve readability
   
   




> Refactor `ColumnIndexBuilder`
> -
>
> Key: PARQUET-2396
> URL: https://issues.apache.org/jira/browse/PARQUET-2396
> Project: Parquet
>  Issue Type: Improvement
>Affects Versions: 1.13.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>






[jira] [Commented] (PARQUET-2395) Prefer `singletonList` over `asList`

2023-11-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790815#comment-17790815
 ] 

ASF GitHub Bot commented on PARQUET-2395:
-

Fokko opened a new pull request, #1218:
URL: https://github.com/apache/parquet-mr/pull/1218

   




> Prefer `singletonList` over `asList`
> 
>
> Key: PARQUET-2395
> URL: https://issues.apache.org/jira/browse/PARQUET-2395
> Project: Parquet
>  Issue Type: Improvement
>Affects Versions: 1.13.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>






[PR] PARQUET-2395: Prefer `singletonList` over `asList` [parquet-mr]

2023-11-28 Thread via GitHub


Fokko opened a new pull request, #1218:
URL: https://github.com/apache/parquet-mr/pull/1218

   





[jira] [Created] (PARQUET-2395) Prefer `singletonList`

2023-11-28 Thread Fokko Driesprong (Jira)
Fokko Driesprong created PARQUET-2395:
-

 Summary: Prefer `singletonList`
 Key: PARQUET-2395
 URL: https://issues.apache.org/jira/browse/PARQUET-2395
 Project: Parquet
  Issue Type: Improvement
Affects Versions: 1.13.1
Reporter: Fokko Driesprong
Assignee: Fokko Driesprong
 Fix For: 1.14.0








[jira] [Updated] (PARQUET-2395) Prefer `singletonList` over `asList`

2023-11-28 Thread Fokko Driesprong (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fokko Driesprong updated PARQUET-2395:
--
Summary: Prefer `singletonList` over `asList`  (was: Prefer `singletonList`)
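
A small sketch of the substitution named in the summary, for a one-element list. `SingletonListExample` is a hypothetical class, not parquet-mr code. `Collections.singletonList` returns an immutable list, whereas `Arrays.asList` returns a fixed-size list backed by an array whose elements can still be replaced with `set`.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical before/after pair; not code from parquet-mr.
final class SingletonListExample {
  // Before: a fixed-size, array-backed list with one element.
  static List<String> before(String column) {
    return Arrays.asList(column);
  }

  // After: an immutable one-element list.
  static List<String> after(String column) {
    return Collections.singletonList(column);
  }
}
```

The two lists compare equal, so the swap is only safe to make blindly where callers never call `set` on the result.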

> Prefer `singletonList` over `asList`
> 
>
> Key: PARQUET-2395
> URL: https://issues.apache.org/jira/browse/PARQUET-2395
> Project: Parquet
>  Issue Type: Improvement
>Affects Versions: 1.13.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>






[jira] [Created] (PARQUET-2394) Use `computeIfAbsent` in `MessageColumnIO`

2023-11-28 Thread Fokko Driesprong (Jira)
Fokko Driesprong created PARQUET-2394:
-

 Summary: Use `computeIfAbsent` in `MessageColumnIO`
 Key: PARQUET-2394
 URL: https://issues.apache.org/jira/browse/PARQUET-2394
 Project: Parquet
  Issue Type: Improvement
Affects Versions: 1.13.1
Reporter: Fokko Driesprong
Assignee: Fokko Driesprong
 Fix For: 1.14.0
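
The ticket carries no description of the affected code in `MessageColumnIO`, so the following is only a generic sketch of the pattern the summary names. `ComputeIfAbsentExample` and its `buffers` map are hypothetical; the point is that `Map.computeIfAbsent` replaces the get, null check, and put sequence.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical before/after pair; not code from MessageColumnIO.
final class ComputeIfAbsentExample {
  private final Map<String, StringBuilder> buffers = new HashMap<>();

  // Before: look up, create on a miss, then store.
  StringBuilder before(String key) {
    StringBuilder buffer = buffers.get(key);
    if (buffer == null) {
      buffer = new StringBuilder();
      buffers.put(key, buffer);
    }
    return buffer;
  }

  // After: the map creates and stores the value only when the key is absent.
  StringBuilder after(String key) {
    return buffers.computeIfAbsent(key, k -> new StringBuilder());
  }
}
```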








[jira] [Commented] (PARQUET-2394) Use `computeIfAbsent` in `MessageColumnIO`

2023-11-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790812#comment-17790812
 ] 

ASF GitHub Bot commented on PARQUET-2394:
-

Fokko opened a new pull request, #1217:
URL: https://github.com/apache/parquet-mr/pull/1217

   




> Use `computeIfAbsent` in `MessageColumnIO`
> --
>
> Key: PARQUET-2394
> URL: https://issues.apache.org/jira/browse/PARQUET-2394
> Project: Parquet
>  Issue Type: Improvement
>Affects Versions: 1.13.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>






[PR] PARQUET-2394: Use `computeIfAbsent` in `MessageColumnIO` [parquet-mr]

2023-11-28 Thread via GitHub


Fokko opened a new pull request, #1217:
URL: https://github.com/apache/parquet-mr/pull/1217

   





[jira] [Commented] (PARQUET-2393) Make `ColumnIOCreatorVisitor` static

2023-11-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790810#comment-17790810
 ] 

ASF GitHub Bot commented on PARQUET-2393:
-

Fokko opened a new pull request, #1216:
URL: https://github.com/apache/parquet-mr/pull/1216

   




> Make `ColumnIOCreatorVisitor` static
> 
>
> Key: PARQUET-2393
> URL: https://issues.apache.org/jira/browse/PARQUET-2393
> Project: Parquet
> Issue Type: Improvement
> Affects Versions: 1.13.1
> Reporter: Fokko Driesprong
> Assignee: Fokko Driesprong
> Priority: Major
> Fix For: 1.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (PARQUET-2393) Make `ColumnIOCreatorVisitor` static

2023-11-28 Thread Fokko Driesprong (Jira)
Fokko Driesprong created PARQUET-2393:
-

 Summary: Make `ColumnIOCreatorVisitor` static
 Key: PARQUET-2393
 URL: https://issues.apache.org/jira/browse/PARQUET-2393
 Project: Parquet
  Issue Type: Improvement
Affects Versions: 1.13.1
Reporter: Fokko Driesprong
Assignee: Fokko Driesprong
 Fix For: 1.14.0








[PR] PARQUET-2393: Make `ColumnIOCreatorVisitor` static [parquet-mr]

2023-11-28 Thread via GitHub


Fokko opened a new pull request, #1216:
URL: https://github.com/apache/parquet-mr/pull/1216






[jira] [Commented] (PARQUET-2392) Remove StringBuilder in `LogicalTypeAnnotation`

2023-11-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790807#comment-17790807
 ] 

ASF GitHub Bot commented on PARQUET-2392:
-

Fokko opened a new pull request, #1215:
URL: https://github.com/apache/parquet-mr/pull/1215

   StringBuilder only makes sense when you concatenate in a loop.




> Remove StringBuilder in `LogicalTypeAnnotation`
> ---
>
> Key: PARQUET-2392
> URL: https://issues.apache.org/jira/browse/PARQUET-2392
> Project: Parquet
> Issue Type: Improvement
> Affects Versions: 1.13.1
> Reporter: Fokko Driesprong
> Assignee: Fokko Driesprong
> Priority: Major
> Fix For: 1.14.0
>
>






[PR] PARQUET-2392: Remove StringBuilder in `LogicalTypeAnnotation` [parquet-mr]

2023-11-28 Thread via GitHub


Fokko opened a new pull request, #1215:
URL: https://github.com/apache/parquet-mr/pull/1215

   StringBuilder only makes sense when you concatenate in a loop.





[jira] [Created] (PARQUET-2392) Remove StringBuilder in `LogicalTypeAnnotation`

2023-11-28 Thread Fokko Driesprong (Jira)
Fokko Driesprong created PARQUET-2392:
-

 Summary: Remove StringBuilder in `LogicalTypeAnnotation`
 Key: PARQUET-2392
 URL: https://issues.apache.org/jira/browse/PARQUET-2392
 Project: Parquet
  Issue Type: Improvement
Affects Versions: 1.13.1
Reporter: Fokko Driesprong
Assignee: Fokko Driesprong
 Fix For: 1.14.0








[jira] [Commented] (PARQUET-2391) Remove unnecessary unboxing

2023-11-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790805#comment-17790805
 ] 

ASF GitHub Bot commented on PARQUET-2391:
-

Fokko opened a new pull request, #1214:
URL: https://github.com/apache/parquet-mr/pull/1214





> Remove unnecessary unboxing
> ---
>
> Key: PARQUET-2391
> URL: https://issues.apache.org/jira/browse/PARQUET-2391
> Project: Parquet
> Issue Type: Improvement
> Affects Versions: 1.13.1
> Reporter: Fokko Driesprong
> Assignee: Fokko Driesprong
> Priority: Major
> Fix For: 1.14.0
>
>






[PR] PARQUET-2391: Remove unnecessary unboxing [parquet-mr]

2023-11-28 Thread via GitHub


Fokko opened a new pull request, #1214:
URL: https://github.com/apache/parquet-mr/pull/1214






[jira] [Created] (PARQUET-2391) Remove unnecessary unboxing

2023-11-28 Thread Fokko Driesprong (Jira)
Fokko Driesprong created PARQUET-2391:
-

 Summary: Remove unnecessary unboxing
 Key: PARQUET-2391
 URL: https://issues.apache.org/jira/browse/PARQUET-2391
 Project: Parquet
  Issue Type: Improvement
Affects Versions: 1.13.1
Reporter: Fokko Driesprong
Assignee: Fokko Driesprong
 Fix For: 1.14.0








[jira] [Commented] (PARQUET-2390) Replace anonymous functions with lambdas

2023-11-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790800#comment-17790800
 ] 

ASF GitHub Bot commented on PARQUET-2390:
-

Fokko opened a new pull request, #1213:
URL: https://github.com/apache/parquet-mr/pull/1213

   They are easier to read
   




> Replace anonymous functions with lambdas
> --
>
> Key: PARQUET-2390
> URL: https://issues.apache.org/jira/browse/PARQUET-2390
> Project: Parquet
> Issue Type: Improvement
> Affects Versions: 1.13.1
> Reporter: Fokko Driesprong
> Assignee: Fokko Driesprong
> Priority: Major
> Fix For: 1.14.0
>
>






[PR] PARQUET-2390: Replace anonymous functions with lambdas [parquet-mr]

2023-11-28 Thread via GitHub


Fokko opened a new pull request, #1213:
URL: https://github.com/apache/parquet-mr/pull/1213

   They are easier to read
   





[jira] [Created] (PARQUET-2390) Replace anonymous functions with lambdas

2023-11-28 Thread Fokko Driesprong (Jira)
Fokko Driesprong created PARQUET-2390:
-

 Summary: Replace anonymous functions with lambdas
 Key: PARQUET-2390
 URL: https://issues.apache.org/jira/browse/PARQUET-2390
 Project: Parquet
  Issue Type: Improvement
Affects Versions: 1.13.1
Reporter: Fokko Driesprong
Assignee: Fokko Driesprong
 Fix For: 1.14.0








[PR] PARQUET-2389: Remove redundant initializers [parquet-mr]

2023-11-28 Thread via GitHub


Fokko opened a new pull request, #1212:
URL: https://github.com/apache/parquet-mr/pull/1212

   Just some cleanup
   





[jira] [Created] (PARQUET-2389) Remove redundant initializers

2023-11-28 Thread Fokko Driesprong (Jira)
Fokko Driesprong created PARQUET-2389:
-

 Summary: Remove redundant initializers
 Key: PARQUET-2389
 URL: https://issues.apache.org/jira/browse/PARQUET-2389
 Project: Parquet
  Issue Type: Improvement
Affects Versions: 1.13.1
Reporter: Fokko Driesprong
Assignee: Fokko Driesprong
 Fix For: 1.14.0








[jira] [Commented] (PARQUET-2389) Remove redundant initializers

2023-11-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790797#comment-17790797
 ] 

ASF GitHub Bot commented on PARQUET-2389:
-

Fokko opened a new pull request, #1212:
URL: https://github.com/apache/parquet-mr/pull/1212

   Just some cleanup
   




> Remove redundant initializers
> -
>
> Key: PARQUET-2389
> URL: https://issues.apache.org/jira/browse/PARQUET-2389
> Project: Parquet
> Issue Type: Improvement
> Affects Versions: 1.13.1
> Reporter: Fokko Driesprong
> Assignee: Fokko Driesprong
> Priority: Major
> Fix For: 1.14.0
>
>






[jira] [Commented] (PARQUET-2388) Deprecate `CHARSETS` on `PlainValuesWriter`

2023-11-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790795#comment-17790795
 ] 

ASF GitHub Bot commented on PARQUET-2388:
-

Fokko opened a new pull request, #1211:
URL: https://github.com/apache/parquet-mr/pull/1211

   Not being used
   




> Deprecate `CHARSETS` on `PlainValuesWriter`
> ---
>
> Key: PARQUET-2388
> URL: https://issues.apache.org/jira/browse/PARQUET-2388
> Project: Parquet
> Issue Type: Improvement
> Affects Versions: 1.13.1
> Reporter: Fokko Driesprong
> Assignee: Fokko Driesprong
> Priority: Major
> Fix For: 1.14.0
>
>






[PR] PARQUET-2388: Deprecate `CHARSETS` on `PlainValuesWriter` [parquet-mr]

2023-11-28 Thread via GitHub


Fokko opened a new pull request, #1211:
URL: https://github.com/apache/parquet-mr/pull/1211

   Not being used
   





[jira] [Created] (PARQUET-2388) Deprecate `CHARSETS` on `PlainValuesWriter`

2023-11-28 Thread Fokko Driesprong (Jira)
Fokko Driesprong created PARQUET-2388:
-

 Summary: Deprecate `CHARSETS` on `PlainValuesWriter`
 Key: PARQUET-2388
 URL: https://issues.apache.org/jira/browse/PARQUET-2388
 Project: Parquet
  Issue Type: Improvement
Affects Versions: 1.13.1
Reporter: Fokko Driesprong
Assignee: Fokko Driesprong
 Fix For: 1.14.0








[jira] [Commented] (PARQUET-2387) Simplify `hasFieldsIgnored` expression

2023-11-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790793#comment-17790793
 ] 

ASF GitHub Bot commented on PARQUET-2387:
-

Fokko opened a new pull request, #1210:
URL: https://github.com/apache/parquet-mr/pull/1210





> Simplify `hasFieldsIgnored` expression
> --
>
> Key: PARQUET-2387
> URL: https://issues.apache.org/jira/browse/PARQUET-2387
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-thrift
> Affects Versions: 1.13.1
> Reporter: Fokko Driesprong
> Assignee: Fokko Driesprong
> Priority: Major
> Fix For: 1.14.0
>
>






[PR] PARQUET-2387: Simplify `hasFieldsIgnored` expression [parquet-mr]

2023-11-28 Thread via GitHub


Fokko opened a new pull request, #1210:
URL: https://github.com/apache/parquet-mr/pull/1210






[jira] [Created] (PARQUET-2387) Simplify `hasFieldsIgnored` expression

2023-11-28 Thread Fokko Driesprong (Jira)
Fokko Driesprong created PARQUET-2387:
-

 Summary: Simplify `hasFieldsIgnored` expression
 Key: PARQUET-2387
 URL: https://issues.apache.org/jira/browse/PARQUET-2387
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-thrift
Affects Versions: 1.13.1
Reporter: Fokko Driesprong
Assignee: Fokko Driesprong
 Fix For: 1.14.0








[jira] [Commented] (PARQUET-2344) Bump to Thrift 0.19.0

2023-11-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790791#comment-17790791
 ] 

ASF GitHub Bot commented on PARQUET-2344:
-

Fokko commented on PR #1192:
URL: https://github.com/apache/parquet-mr/pull/1192#issuecomment-1830859619

   @wgtmac Thanks for splitting out the format upgrade. Always a good idea to 
make PRs smaller.
   
   I finally fixed all the tests, and this looks good to go to me 👍 




> Bump to Thrift 0.19.0
> -
>
> Key: PARQUET-2344
> URL: https://issues.apache.org/jira/browse/PARQUET-2344
> Project: Parquet
> Issue Type: Bug
> Components: parquet-format, parquet-mr
> Reporter: Fokko Driesprong
> Assignee: Fokko Driesprong
> Priority: Major
> Fix For: format-2.10.0
>
>






Re: [PR] PARQUET-2344: Bump to Thrift 0.19.0 [parquet-mr]

2023-11-28 Thread via GitHub


Fokko commented on PR #1192:
URL: https://github.com/apache/parquet-mr/pull/1192#issuecomment-1830859619

   @wgtmac Thanks for splitting out the format upgrade. Always a good idea to 
make PRs smaller.
   
   I finally fixed all the tests, and this looks good to go to me 👍 





[jira] [Resolved] (PARQUET-2296) Bump easymock from 3.4 to 5.1.0

2023-11-28 Thread Fokko Driesprong (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fokko Driesprong resolved PARQUET-2296.
---
Resolution: Fixed

> Bump easymock from 3.4 to 5.1.0
> ---
>
> Key: PARQUET-2296
> URL: https://issues.apache.org/jira/browse/PARQUET-2296
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>






[jira] [Resolved] (PARQUET-2300) Update jackson-core 2.13.4 to a version without CVE PRISMA-2023-0067

2023-11-28 Thread Fokko Driesprong (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fokko Driesprong resolved PARQUET-2300.
---
  Assignee: Gang Wu
Resolution: Fixed

> Update jackson-core 2.13.4 to a version without CVE PRISMA-2023-0067
> 
>
> Key: PARQUET-2300
> URL: https://issues.apache.org/jira/browse/PARQUET-2300
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Gianluca Vagnoni
>Assignee: Gang Wu
>Priority: Major
> Fix For: 1.14.0
>
>
> The library "parquet-jackson", versions 1.13.0 and 1.13.1, contains the 
> vulnerability PRISMA-2023-0067 
> (https://github.com/FasterXML/jackson-core/pull/827, 
> https://github.com/IBM/ibm-cos-sdk-java/issues/58).
> Please upgrade the shaded library to jackson-core version 2.15.0 to fix it.





[jira] [Resolved] (PARQUET-2336) Add caching key to CodecFactory

2023-11-28 Thread Fokko Driesprong (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fokko Driesprong resolved PARQUET-2336.
---
Resolution: Fixed

> Add caching key to CodecFactory
> ---
>
> Key: PARQUET-2336
> URL: https://issues.apache.org/jira/browse/PARQUET-2336
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-hadoop
>Affects Versions: 1.13.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>






[jira] [Resolved] (PARQUET-2384) Mark toOriginalType as deprecated

2023-11-28 Thread Fokko Driesprong (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fokko Driesprong resolved PARQUET-2384.
---
Resolution: Fixed

> Mark toOriginalType as deprecated
> -
>
> Key: PARQUET-2384
> URL: https://issues.apache.org/jira/browse/PARQUET-2384
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>






[jira] [Resolved] (PARQUET-2368) Update japicmp to 1.18.1

2023-11-28 Thread Fokko Driesprong (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fokko Driesprong resolved PARQUET-2368.
---
Resolution: Fixed

> Update japicmp to 1.18.1
> 
>
> Key: PARQUET-2368
> URL: https://issues.apache.org/jira/browse/PARQUET-2368
> Project: Parquet
>  Issue Type: Improvement
>Affects Versions: 1.13.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>






[jira] [Updated] (PARQUET-2382) Remove the deprecated OriginalType

2023-11-28 Thread Fokko Driesprong (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fokko Driesprong updated PARQUET-2382:
--
Fix Version/s: 2.0.0
   (was: 1.14.0)

> Remove the deprecated OriginalType
> --
>
> Key: PARQUET-2382
> URL: https://issues.apache.org/jira/browse/PARQUET-2382
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 2.0.0
>
>






[jira] [Commented] (PARQUET-2384) Mark toOriginalType as deprecated

2023-11-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790790#comment-17790790
 ] 

ASF GitHub Bot commented on PARQUET-2384:
-

Fokko merged PR #1202:
URL: https://github.com/apache/parquet-mr/pull/1202




> Mark toOriginalType as deprecated
> -
>
> Key: PARQUET-2384
> URL: https://issues.apache.org/jira/browse/PARQUET-2384
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>






Re: [PR] Bump org.codehaus.mojo:exec-maven-plugin from 3.1.0 to 3.1.1 [parquet-mr]

2023-11-28 Thread via GitHub


Fokko merged PR #1206:
URL: https://github.com/apache/parquet-mr/pull/1206





[jira] [Commented] (PARQUET-2384) Mark toOriginalType as deprecated

2023-11-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790789#comment-17790789
 ] 

ASF GitHub Bot commented on PARQUET-2384:
-

Fokko commented on PR #1202:
URL: https://github.com/apache/parquet-mr/pull/1202#issuecomment-1830855336

   Thanks for the review @wgtmac 




> Mark toOriginalType as deprecated
> -
>
> Key: PARQUET-2384
> URL: https://issues.apache.org/jira/browse/PARQUET-2384
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>






Re: [PR] PARQUET-2384: Mark `toOriginalType` as deprecated [parquet-mr]

2023-11-28 Thread via GitHub


Fokko merged PR #1202:
URL: https://github.com/apache/parquet-mr/pull/1202





Re: [PR] PARQUET-2384: Mark `toOriginalType` as deprecated [parquet-mr]

2023-11-28 Thread via GitHub


Fokko commented on PR #1202:
URL: https://github.com/apache/parquet-mr/pull/1202#issuecomment-1830855336

   Thanks for the review @wgtmac 





Files with inconsistent num_rows and num_values?

2023-11-28 Thread Micah Kornfield
We've recently encountered files that have inconsistencies between the
number of rows specified in the row group [1] and the total number of
values in a column [2] for non-repeated columns (within a file there is
inconsistency between columns but all counts appear to be greater than or
equal to the number of rows).

Two questions:
1.  Is anyone aware of parquet implementations that might generate files
like this?
2.  Does anyone have an opinion on the correct interpretation of these
files?  Should the files be treated as corrupt, or should the number of
rows be treated as authoritative and any additional data in a column be
truncated?

It appears different engines make different choices in this case.  Arrow
treats this as corruption. Spark seems to allow reading the data.

Thanks,
Micah


[1]
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L895
[2]
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L786
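
The check described in this message can be sketched in a few lines. This is only an illustration of the two interpretations raised in question 2, not how Arrow or Spark actually implement their readers. `RowGroupCounts`, `find_count_mismatches`, `classify`, and the `truncate_extra` flag are hypothetical names introduced here; the counts are assumed to have been read from the file footer already (`RowGroup.num_rows` and `ColumnMetaData.num_values` for non-repeated columns).

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class RowGroupCounts:
    """Hypothetical stand-in for the counts stored in one row group's footer."""
    num_rows: int                      # RowGroup.num_rows
    column_num_values: Dict[str, int]  # ColumnMetaData.num_values, non-repeated columns only


def find_count_mismatches(rg: RowGroupCounts) -> Dict[str, int]:
    """Return {column: num_values} for every column whose count differs from num_rows."""
    return {name: n for name, n in rg.column_num_values.items() if n != rg.num_rows}


def classify(rg: RowGroupCounts, truncate_extra: bool = False) -> str:
    """Classify a row group under one of the two interpretations.

    truncate_extra=False: any mismatch is treated as corruption.
    truncate_extra=True:  num_rows is authoritative; surplus values may be
                          truncated, but a column with too few values is
                          still treated as corrupt.
    """
    mismatches = find_count_mismatches(rg)
    if not mismatches:
        return "consistent"
    if truncate_extra and all(n > rg.num_rows for n in mismatches.values()):
        return "truncate"
    return "corrupt"


if __name__ == "__main__":
    rg = RowGroupCounts(num_rows=100, column_num_values={"a": 100, "b": 120})
    print(find_count_mismatches(rg))
    print(classify(rg), classify(rg, truncate_extra=True))
```

With pyarrow, the inputs could presumably be filled from `pyarrow.parquet.ParquetFile(path).metadata.row_group(i)`, using its `num_rows` and each `column(j).num_values`; that wiring is left out so the sketch stays dependency-free.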


[jira] [Commented] (PARQUET-2386) More consistent code style in parquet-mr

2023-11-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790682#comment-17790682
 ] 

ASF GitHub Bot commented on PARQUET-2386:
-

amousavigourabi commented on PR #1209:
URL: https://github.com/apache/parquet-mr/pull/1209#issuecomment-1830328306

   The `.editorconfig` has been expanded for IntelliJ and is mostly compliant 
with the Spotless configuration. IntelliJ refactoring and Spotless occasionally 
disagree on continuation indents, which cannot be resolved at the moment. 
Because Spotless runs as part of the Maven lifecycle, its configuration takes 
precedence.




> More consistent code style in parquet-mr
> 
>
> Key: PARQUET-2386
> URL: https://issues.apache.org/jira/browse/PARQUET-2386
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Atour Mousavi Gourabi
>Assignee: Atour Mousavi Gourabi
>Priority: Major
>
> The code style conventions used in parquet-mr are generally inconsistent and 
> unenforced. We might want to consider using linters such as Spotless and a 
> more extensive .editorconfig configuration.





Re: [PR] PARQUET-2386: More consistent code style in parquet-mr [parquet-mr]

2023-11-28 Thread via GitHub


amousavigourabi commented on PR #1209:
URL: https://github.com/apache/parquet-mr/pull/1209#issuecomment-1830328306

   The `.editorconfig` has been expanded for IntelliJ and is mostly compliant 
with the Spotless configuration. IntelliJ refactoring and Spotless occasionally 
disagree on continuation indents, which cannot be resolved at the moment. 
Because Spotless runs as part of the Maven lifecycle, its configuration takes 
precedence.





[jira] [Commented] (PARQUET-2386) More consistent code style in parquet-mr

2023-11-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790681#comment-17790681
 ] 

ASF GitHub Bot commented on PARQUET-2386:
-

amousavigourabi commented on PR #1209:
URL: https://github.com/apache/parquet-mr/pull/1209#issuecomment-1830320974

   @wgtmac 




> More consistent code style in parquet-mr
> 
>
> Key: PARQUET-2386
> URL: https://issues.apache.org/jira/browse/PARQUET-2386
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Atour Mousavi Gourabi
>Assignee: Atour Mousavi Gourabi
>Priority: Major
>
> The code style conventions used in parquet-mr are generally inconsistent and 
> unenforced. We might want to consider using linters such as Spotless and a 
> more extensive .editorconfig configuration.





Re: [PR] PARQUET-2386: More consistent code style in parquet-mr [parquet-mr]

2023-11-28 Thread via GitHub


amousavigourabi commented on PR #1209:
URL: https://github.com/apache/parquet-mr/pull/1209#issuecomment-1830320974

   @wgtmac 





[jira] [Commented] (PARQUET-2386) More consistent code style in parquet-mr

2023-11-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790680#comment-17790680
 ] 

ASF GitHub Bot commented on PARQUET-2386:
-

amousavigourabi opened a new pull request, #1209:
URL: https://github.com/apache/parquet-mr/pull/1209

   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [x] My PR addresses the following [Parquet 
Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references 
them in the PR title. For example, "PARQUET-1234: My Parquet PR"
 - https://issues.apache.org/jira/browse/PARQUET-XXX
 - In case you are adding a dependency, check if the license complies with 
the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Tests
   
   - [x] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   - This PR only refactors style, no logic is added or removed in any way 
shape or form
   
   ### Commits
   
   - [x] My commits all reference Jira issues in their subject lines. In 
addition, my commits follow the guidelines from "[How to write a good git 
commit message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [x] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain Javadoc that 
explain what it does
 - Adds note in the PR template on the style checks
   
   ---
   
   This PR contains two commits: the first adds the style checks and 
configurations, and the second applies these changes to the repository.
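
   The mechanical parts of the commit-message guidelines in the checklist above can be checked with a short script. The following is a rough sketch, not a tool that parquet-mr ships: `check_commit_message` is a hypothetical helper, and the `PARQUET-<number>: ` subject prefix is an assumption modelled on the PR-title example in the template. The imperative-mood and "what and why" guidelines are not checked, since they need human judgment.

```python
import re
from typing import List

# Assumed prefix format, modelled on the template's "PARQUET-1234: My Parquet PR".
JIRA_PREFIX = re.compile(r"^PARQUET-\d+:\s*")


def check_commit_message(message: str) -> List[str]:
    """Return a list of guideline violations; an empty list means none were found."""
    lines = message.splitlines()
    if not lines or not lines[0].strip():
        return ["missing subject line"]

    problems = []
    subject = lines[0]

    if not JIRA_PREFIX.match(subject):
        problems.append("subject does not reference a Jira issue")

    # The 50-character limit excludes the Jira issue reference.
    if len(JIRA_PREFIX.sub("", subject)) > 50:
        problems.append("subject exceeds 50 characters")

    if subject.rstrip().endswith("."):
        problems.append("subject ends with a period")

    if len(lines) > 1 and lines[1].strip():
        problems.append("subject is not separated from body by a blank line")

    if any(len(line) > 72 for line in lines[2:]):
        problems.append("body line exceeds 72 characters")

    return problems


if __name__ == "__main__":
    msg = "PARQUET-1234: Add style checks\n\nExplain why the checks are needed."
    print(check_commit_message(msg))
```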




> More consistent code style in parquet-mr
> 
>
> Key: PARQUET-2386
> URL: https://issues.apache.org/jira/browse/PARQUET-2386
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Atour Mousavi Gourabi
>Assignee: Atour Mousavi Gourabi
>Priority: Major
>
> The code style conventions used in parquet-mr are generally inconsistent and 
> unenforced. We might want to consider using linters such as Spotless and a 
> more extensive .editorconfig configuration.





[PR] PARQUET-2386: More consistent code style in parquet-mr [parquet-mr]

2023-11-28 Thread via GitHub


amousavigourabi opened a new pull request, #1209:
URL: https://github.com/apache/parquet-mr/pull/1209

   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [x] My PR addresses the following [Parquet 
Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references 
them in the PR title. For example, "PARQUET-1234: My Parquet PR"
 - https://issues.apache.org/jira/browse/PARQUET-XXX
 - In case you are adding a dependency, check if the license complies with 
the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Tests
   
   - [x] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   - This PR only refactors style, no logic is added or removed in any way 
shape or form
   
   ### Commits
   
   - [x] My commits all reference Jira issues in their subject lines. In 
addition, my commits follow the guidelines from "[How to write a good git 
commit message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [x] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain Javadoc that 
explain what it does
 - Adds note in the PR template on the style checks
   
   ---
   
   This PR contains two commits: the first adds the style checks and 
configurations, and the second applies these changes to the repository.

