[jira] [Updated] (PARQUET-2398) Make static variables final

2023-12-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated PARQUET-2398:

Labels: pull-request-available  (was: )

> Make static variables final
> ---
>
> Key: PARQUET-2398
> URL: https://issues.apache.org/jira/browse/PARQUET-2398
> Project: Parquet
>  Issue Type: Improvement
>Affects Versions: 1.13.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.14.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (PARQUET-2407) Add custom .asf.yaml for finer-grained control of email notifications

2023-12-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated PARQUET-2407:

Labels: pull-request-available  (was: )

> Add custom .asf.yaml for finer-grained control of email notifications
> -
>
> Key: PARQUET-2407
> URL: https://issues.apache.org/jira/browse/PARQUET-2407
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format, parquet-mr
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> As per the discussion on ML: 
> https://lists.apache.org/thread/4x2ob2ojkznfft3czz0gypmtoz7vo9fz, we can add 
> a customized .asf.yaml config file for better control of email notifications.





[jira] [Commented] (PARQUET-2407) Add custom .asf.yaml for finer-grained control of email notifications

2023-12-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17793078#comment-17793078
 ] 

ASF GitHub Bot commented on PARQUET-2407:
-

wgtmac merged PR #1232:
URL: https://github.com/apache/parquet-mr/pull/1232




> Add custom .asf.yaml for finer-grained control of email notifications
> -
>
> Key: PARQUET-2407
> URL: https://issues.apache.org/jira/browse/PARQUET-2407
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format, parquet-mr
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> As per the discussion on ML: 
> https://lists.apache.org/thread/4x2ob2ojkznfft3czz0gypmtoz7vo9fz, we can add 
> a customized .asf.yaml config file for better control of email notifications.





[jira] [Commented] (PARQUET-2407) Add custom .asf.yaml for finer-grained control of email notifications

2023-12-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17793077#comment-17793077
 ] 

ASF GitHub Bot commented on PARQUET-2407:
-

wgtmac commented on PR #1232:
URL: https://github.com/apache/parquet-mr/pull/1232#issuecomment-1839865078

   Thanks @gszadovszky @pitrou! I'll merge it to see what happens. If it works 
as expected, I'll create a PR for parquet-format as well.




> Add custom .asf.yaml for finer-grained control of email notifications
> -
>
> Key: PARQUET-2407
> URL: https://issues.apache.org/jira/browse/PARQUET-2407
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format, parquet-mr
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> As per the discussion on ML: 
> https://lists.apache.org/thread/4x2ob2ojkznfft3czz0gypmtoz7vo9fz, we can add 
> a customized .asf.yaml config file for better control of email notifications.





[jira] [Commented] (PARQUET-2408) Fix license header in .gitattributes

2023-12-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792995#comment-17792995
 ] 

ASF GitHub Bot commented on PARQUET-2408:
-

Fokko merged PR #1231:
URL: https://github.com/apache/parquet-mr/pull/1231




> Fix license header in .gitattributes
> 
>
> Key: PARQUET-2408
> URL: https://issues.apache.org/jira/browse/PARQUET-2408
> Project: Parquet
>  Issue Type: Bug
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
> Fix For: 1.14.0
>
>
> {code:java}
> ➜  parquet-mr git:(master) ✗ git status
> (ASF) is not a valid attribute name: .gitattributes:2
> License, is not a valid attribute name: .gitattributes:6
> "License"); is not a valid attribute name: .gitattributes:7
> http://www.apache.org/licenses/LICENSE-2.0 is not a valid attribute name: 
> .gitattributes:10
> writing, is not a valid attribute name: .gitattributes:12
> "AS is not a valid attribute name: .gitattributes:14
> KIND, is not a valid attribute name: .gitattributes:15
> On branch master
> Your branch is up to date with 'origin/master'.
> {code}





[jira] [Commented] (PARQUET-2261) [Format] Add statistics that reflect decoded size to metadata

2023-12-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792965#comment-17792965
 ] 

ASF GitHub Bot commented on PARQUET-2261:
-

shangxinli commented on code in PR #1177:
URL: https://github.com/apache/parquet-mr/pull/1177#discussion_r1414220467


##
parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterBase.java:
##
@@ -409,4 +428,14 @@ abstract void writePage(
   ValuesWriter definitionLevels,
   ValuesWriter values)
   throws IOException;
+
+  abstract void writePage(
+  int rowCount,
+  int valueCount,
+  Statistics statistics,
+  SizeStatistics sizeStatistics,

Review Comment:
   There could be some confusion between the two names: sizeStatistics is one 
type of statistics, but we are separating them.





> [Format] Add statistics that reflect decoded size to metadata
> -
>
> Key: PARQUET-2261
> URL: https://issues.apache.org/jira/browse/PARQUET-2261
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-format
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
> Fix For: format-2.10.0
>
>






[jira] [Commented] (PARQUET-2261) [Format] Add statistics that reflect decoded size to metadata

2023-12-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792963#comment-17792963
 ] 

ASF GitHub Bot commented on PARQUET-2261:
-

shangxinli commented on code in PR #1177:
URL: https://github.com/apache/parquet-mr/pull/1177#discussion_r1414218649


##
parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterBase.java:
##
@@ -389,7 +400,14 @@ void writePage() {
 this.rowsWrittenSoFar += pageRowCount;
 if (DEBUG) LOG.debug("write page");
 try {
-  writePage(pageRowCount, valueCount, statistics, repetitionLevelColumn, 
definitionLevelColumn, dataColumn);
+  writePage(
+  pageRowCount,
+  valueCount,
+  statistics,
+  sizeStatisticsBuilder.build(),

Review Comment:
   Can we have parity of lines 406 and 407? You can use a variable in line 407 





> [Format] Add statistics that reflect decoded size to metadata
> -
>
> Key: PARQUET-2261
> URL: https://issues.apache.org/jira/browse/PARQUET-2261
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-format
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
> Fix For: format-2.10.0
>
>






[jira] [Commented] (PARQUET-2261) [Format] Add statistics that reflect decoded size to metadata

2023-12-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792964#comment-17792964
 ] 

ASF GitHub Bot commented on PARQUET-2261:
-

shangxinli commented on code in PR #1177:
URL: https://github.com/apache/parquet-mr/pull/1177#discussion_r1414218649


##
parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterBase.java:
##
@@ -389,7 +400,14 @@ void writePage() {
 this.rowsWrittenSoFar += pageRowCount;
 if (DEBUG) LOG.debug("write page");
 try {
-  writePage(pageRowCount, valueCount, statistics, repetitionLevelColumn, 
definitionLevelColumn, dataColumn);
+  writePage(
+  pageRowCount,
+  valueCount,
+  statistics,
+  sizeStatisticsBuilder.build(),

Review Comment:
   Can we have the parity of lines 406 and 407? You can use a variable in line 
407 





> [Format] Add statistics that reflect decoded size to metadata
> -
>
> Key: PARQUET-2261
> URL: https://issues.apache.org/jira/browse/PARQUET-2261
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-format
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
> Fix For: format-2.10.0
>
>






[jira] [Commented] (PARQUET-2407) Add custom .asf.yaml for finer-grained control of email notifications

2023-12-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792930#comment-17792930
 ] 

ASF GitHub Bot commented on PARQUET-2407:
-

wgtmac commented on PR #1232:
URL: https://github.com/apache/parquet-mr/pull/1232#issuecomment-1838950498

   Per the [discussion on 
ML](https://lists.apache.org/thread/4x2ob2ojkznfft3czz0gypmtoz7vo9fz), I 
propose adding an .asf.yaml file to customize email notifications. Please 
take a look, thanks! @gszadovszky @shangxinli @pitrou @emkornfield 




> Add custom .asf.yaml for finer-grained control of email notifications
> -
>
> Key: PARQUET-2407
> URL: https://issues.apache.org/jira/browse/PARQUET-2407
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format, parquet-mr
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> As per the discussion on ML: 
> https://lists.apache.org/thread/4x2ob2ojkznfft3czz0gypmtoz7vo9fz, we can add 
> a customized .asf.yaml config file for better control of email notifications.





[jira] [Commented] (PARQUET-2407) Add custom .asf.yaml for finer-grained control of email notifications

2023-12-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792927#comment-17792927
 ] 

ASF GitHub Bot commented on PARQUET-2407:
-

wgtmac opened a new pull request, #1232:
URL: https://github.com/apache/parquet-mr/pull/1232

   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [ ] My PR addresses the following [Parquet 
Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references
 them in the PR title. For example, "PARQUET-1234: My Parquet PR"
   - https://issues.apache.org/jira/browse/PARQUET-2407
   - In case you are adding a dependency, check if the license complies with
 the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Tests
   
   - [ ] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   
   ### Commits
   
   - [ ] My commits all reference Jira issues in their subject lines. In 
addition, my commits follow the guidelines
 from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
   1. Subject is separated from body by a blank line
   1. Subject is limited to 50 characters (not including Jira issue 
reference)
   1. Subject does not end with a period
   1. Subject uses the imperative mood ("add", not "adding")
   1. Body wraps at 72 characters
   1. Body explains "what" and "why", not "how"
   
   ### Style
   - [ ] My contribution adheres to the code style guidelines and Spotless 
passes.
   - To apply the necessary changes, run `mvn spotless:apply 
-Pvector-plugins`
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes 
how to use it.
   - All the public functions and the classes in the PR contain Javadoc 
that explain what it does
   




> Add custom .asf.yaml for finer-grained control of email notifications
> -
>
> Key: PARQUET-2407
> URL: https://issues.apache.org/jira/browse/PARQUET-2407
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format, parquet-mr
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> As per the discussion on ML: 
> https://lists.apache.org/thread/4x2ob2ojkznfft3czz0gypmtoz7vo9fz, we can add 
> a customized .asf.yaml config file for better control of email notifications.





[jira] [Commented] (PARQUET-2408) Fix license header in .gitattributes

2023-12-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792926#comment-17792926
 ] 

ASF GitHub Bot commented on PARQUET-2408:
-

wgtmac commented on PR #1231:
URL: https://github.com/apache/parquet-mr/pull/1231#issuecomment-1838934525

   PTAL @amousavigourabi @Fokko 




> Fix license header in .gitattributes
> 
>
> Key: PARQUET-2408
> URL: https://issues.apache.org/jira/browse/PARQUET-2408
> Project: Parquet
>  Issue Type: Bug
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> {code:java}
> ➜  parquet-mr git:(master) ✗ git status
> (ASF) is not a valid attribute name: .gitattributes:2
> License, is not a valid attribute name: .gitattributes:6
> "License"); is not a valid attribute name: .gitattributes:7
> http://www.apache.org/licenses/LICENSE-2.0 is not a valid attribute name: 
> .gitattributes:10
> writing, is not a valid attribute name: .gitattributes:12
> "AS is not a valid attribute name: .gitattributes:14
> KIND, is not a valid attribute name: .gitattributes:15
> On branch master
> Your branch is up to date with 'origin/master'.
> {code}





[jira] [Commented] (PARQUET-2408) Fix license header in .gitattributes

2023-12-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792925#comment-17792925
 ] 

ASF GitHub Bot commented on PARQUET-2408:
-

wgtmac opened a new pull request, #1231:
URL: https://github.com/apache/parquet-mr/pull/1231

   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [ ] My PR addresses the following [Parquet 
Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references
 them in the PR title. For example, "PARQUET-1234: My Parquet PR"
   - https://issues.apache.org/jira/browse/PARQUET-2408
   - In case you are adding a dependency, check if the license complies with
 the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Tests
   
   - [ ] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   
   ### Commits
   
   - [ ] My commits all reference Jira issues in their subject lines. In 
addition, my commits follow the guidelines
 from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
   1. Subject is separated from body by a blank line
   1. Subject is limited to 50 characters (not including Jira issue 
reference)
   1. Subject does not end with a period
   1. Subject uses the imperative mood ("add", not "adding")
   1. Body wraps at 72 characters
   1. Body explains "what" and "why", not "how"
   
   ### Style
   - [ ] My contribution adheres to the code style guidelines and Spotless 
passes.
   - To apply the necessary changes, run `mvn spotless:apply 
-Pvector-plugins`
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes 
how to use it.
   - All the public functions and the classes in the PR contain Javadoc 
that explain what it does
   




> Fix license header in .gitattributes
> 
>
> Key: PARQUET-2408
> URL: https://issues.apache.org/jira/browse/PARQUET-2408
> Project: Parquet
>  Issue Type: Bug
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> {code:java}
> ➜  parquet-mr git:(master) ✗ git status
> (ASF) is not a valid attribute name: .gitattributes:2
> License, is not a valid attribute name: .gitattributes:6
> "License"); is not a valid attribute name: .gitattributes:7
> http://www.apache.org/licenses/LICENSE-2.0 is not a valid attribute name: 
> .gitattributes:10
> writing, is not a valid attribute name: .gitattributes:12
> "AS is not a valid attribute name: .gitattributes:14
> KIND, is not a valid attribute name: .gitattributes:15
> On branch master
> Your branch is up to date with 'origin/master'.
> {code}





[jira] [Commented] (PARQUET-1822) Parquet without Hadoop dependencies

2023-12-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792829#comment-17792829
 ] 

ASF GitHub Bot commented on PARQUET-1822:
-

drealeed commented on PR #:
URL: https://github.com/apache/parquet-mr/pull/#issuecomment-1838584976

   @amousavigourabi , that's actually what I did and it's working for us now. 
Thanks




> Parquet without Hadoop dependencies
> ---
>
> Key: PARQUET-1822
> URL: https://issues.apache.org/jira/browse/PARQUET-1822
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-avro
>Affects Versions: 1.11.0
> Environment: Amazon Fargate (linux), Windows development box.
> We are writing Parquet to be read by the Snowflake and Athena databases.
>Reporter: mark juchems
>Priority: Minor
>  Labels: documentation, newbie
> Fix For: 1.14.0
>
>
> I have been trying for weeks to create a parquet file from avro and write to 
> S3 in Java.  This has been incredibly frustrating and odd as Spark can do it 
> easily (I'm told).
> I have assembled the correct jars through luck and diligence, but now I find 
> out that I have to have hadoop installed on my machine. I am currently 
> developing in Windows and it seems a dll and exe can fix that up but am 
> wondering about Linux as the code will eventually run in Fargate on AWS.
> *Why do I need external dependencies and not pure java?*
> The thing really is how utterly complex all this is.  I would like to create 
> an avro file and convert it to Parquet and write it to S3, but I am trapped 
> in "ParquetWriter" hell! 
> *Why can't I get a normal OutputStream and write it wherever I want?*
> I have scoured the web for examples and there are a few but we really need 
> some documentation on this stuff.  I understand that there may be reasons for 
> all this but I can't find them on the web anywhere.  Any help?  Can't we get 
> the "SimpleParquet" jar that does this:
>  
> ParquetWriter writer = 
> AvroParquetWriter.builder(outputStream)
>  .withSchema(avroSchema)
>  .withConf(conf)
>  .withCompressionCodec(CompressionCodecName.SNAPPY)
>  .withWriteMode(Mode.OVERWRITE)//probably not good for prod. (overwrites 
> files).
>  .build();
>  





[jira] [Commented] (PARQUET-2405) Include logic for a fallback CodecFactory implementation

2023-12-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792814#comment-17792814
 ] 

ASF GitHub Bot commented on PARQUET-2405:
-

amousavigourabi opened a new pull request, #1230:
URL: https://github.com/apache/parquet-mr/pull/1230

   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [x] My PR addresses the following [Parquet 
Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references
 them in the PR title. For example, "PARQUET-1234: My Parquet PR"
   - https://issues.apache.org/jira/browse/PARQUET-XXX
   - In case you are adding a dependency, check if the license complies with
 the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Tests
   
   - [x] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   
   ### Commits
   
   - [x] My commits all reference Jira issues in their subject lines. In 
addition, my commits follow the guidelines
 from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
   1. Subject is separated from body by a blank line
   1. Subject is limited to 50 characters (not including Jira issue 
reference)
   1. Subject does not end with a period
   1. Subject uses the imperative mood ("add", not "adding")
   1. Body wraps at 72 characters
   1. Body explains "what" and "why", not "how"
   
   ### Style
   - [x] My contribution adheres to the code style guidelines and Spotless 
passes.
   - To apply the necessary changes, run `mvn spotless:apply 
-Pvector-plugins`
   
   ### Documentation
   
   - [x] In case of new functionality, my PR adds documentation that describes 
how to use it.
   - All the public functions and the classes in the PR contain Javadoc 
that explain what it does
   




> Include logic for a fallback CodecFactory implementation
> 
>
> Key: PARQUET-2405
> URL: https://issues.apache.org/jira/browse/PARQUET-2405
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: Atour Mousavi Gourabi
>Priority: Major
>
> Now that CodecFactories are pluggable in the read/write API, a fallback 
> CodecFactory implementation should be included. This would allow users to 
> more easily implement their own partially delegating CodecFactory for, for 
> example, hardware accelerating a codec they expect to encounter a lot, 
> without failing on codecs they did not implement themselves. For these cases, 
> the fallback implementation could delegate to the default 
> (Direct)CodecFactory, or any other implementation of their liking.
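
The delegation idea described above can be sketched as follows. This is only an illustration under stated assumptions: `CodecName`, `Compressor`, `CodecFactory`, and `PartiallyDelegatingFactory` are hypothetical, simplified stand-ins defined here, not the parquet-mr types or the implementation proposed in the PR.

```java
public class FallbackCodecFactorySketch {
    /** Hypothetical stand-in for a codec identifier (not Parquet's own enum). */
    enum CodecName { SNAPPY, GZIP, ZSTD }

    /** Hypothetical, simplified compressor: it only reports who would handle the codec. */
    interface Compressor {
        String describe();
    }

    /** Hypothetical, simplified factory interface. */
    interface CodecFactory {
        Compressor getCompressor(CodecName name);
    }

    /** Handles SNAPPY itself and delegates every other codec to the fallback factory. */
    static final class PartiallyDelegatingFactory implements CodecFactory {
        private final CodecFactory fallback;

        PartiallyDelegatingFactory(CodecFactory fallback) {
            this.fallback = fallback;
        }

        @Override
        public Compressor getCompressor(CodecName name) {
            if (name == CodecName.SNAPPY) {
                // The one codec this factory implements itself (for example, accelerated).
                return () -> "accelerated " + name;
            }
            // Anything else goes to the fallback instead of failing.
            return fallback.getCompressor(name);
        }
    }

    public static void main(String[] args) {
        CodecFactory defaults = name -> () -> "default " + name;
        CodecFactory factory = new PartiallyDelegatingFactory(defaults);
        System.out.println(factory.getCompressor(CodecName.SNAPPY).describe());
        System.out.println(factory.getCompressor(CodecName.GZIP).describe());
    }
}
```

Passing the fallback in through the constructor keeps the choice of default open, which matches the issue's wording that the fallback could be the default factory "or any other implementation".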





[jira] [Commented] (PARQUET-2406) Remove redundant valueOf calls

2023-12-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792792#comment-17792792
 ] 

ASF GitHub Bot commented on PARQUET-2406:
-

amousavigourabi opened a new pull request, #1229:
URL: https://github.com/apache/parquet-mr/pull/1229

   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [ ] My PR addresses the following [Parquet 
Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references
 them in the PR title. For example, "PARQUET-1234: My Parquet PR"
   - https://issues.apache.org/jira/browse/PARQUET-XXX
   - In case you are adding a dependency, check if the license complies with
 the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Tests
   
   - [ ] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   
   ### Commits
   
   - [ ] My commits all reference Jira issues in their subject lines. In 
addition, my commits follow the guidelines
 from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
   1. Subject is separated from body by a blank line
   1. Subject is limited to 50 characters (not including Jira issue 
reference)
   1. Subject does not end with a period
   1. Subject uses the imperative mood ("add", not "adding")
   1. Body wraps at 72 characters
   1. Body explains "what" and "why", not "how"
   
   ### Style
   - [ ] My contribution adheres to the code style guidelines and Spotless 
passes.
   - To apply the necessary changes, run `mvn spotless:apply 
-Pvector-plugins`
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes 
how to use it.
   - All the public functions and the classes in the PR contain Javadoc 
that explain what it does
   




> Remove redundant valueOf calls
> --
>
> Key: PARQUET-2406
> URL: https://issues.apache.org/jira/browse/PARQUET-2406
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Atour Mousavi Gourabi
>Priority: Minor
>
> Remove redundant valueOf calls, or replace them with parseXXX where possible. 
> This could avoid some redundant calls and/or (un)boxing.
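
As a rough illustration of the kind of change described above (not code taken from the PR), the sketch below contrasts `Integer.valueOf(String)`, which returns a boxed `Integer`, with `Integer.parseInt(String)`, which returns a primitive `int`. The class and method names are made up for this example.

```java
public class ParseVersusValueOf {
    /** Before: valueOf produces an Integer object that is immediately auto-unboxed. */
    static int viaValueOf(String text) {
        return Integer.valueOf(text);
    }

    /** After: parseInt yields the primitive directly, with no boxing step. */
    static int viaParseInt(String text) {
        return Integer.parseInt(text);
    }

    public static void main(String[] args) {
        // Both forms give the same numeric result; only the intermediate boxing differs.
        System.out.println(viaValueOf("1234") == viaParseInt("1234"));
    }
}
```

The same pattern applies to the other wrapper types, for example `Long.parseLong` and `Boolean.parseBoolean`.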





[jira] [Commented] (PARQUET-2401) Synchronize on final fields

2023-12-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792774#comment-17792774
 ] 

ASF GitHub Bot commented on PARQUET-2401:
-

Fokko merged PR #1224:
URL: https://github.com/apache/parquet-mr/pull/1224




> Synchronize on final fields
> ---
>
> Key: PARQUET-2401
> URL: https://issues.apache.org/jira/browse/PARQUET-2401
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.1
>Reporter: Atour Mousavi Gourabi
>Priority: Minor
> Fix For: 1.14.0
>
>






[jira] [Commented] (PARQUET-2393) Make `ColumnIOCreatorVisitor` static

2023-12-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792773#comment-17792773
 ] 

ASF GitHub Bot commented on PARQUET-2393:
-

Fokko merged PR #1216:
URL: https://github.com/apache/parquet-mr/pull/1216




> Make `ColumnIOCreatorVisitor` static
> 
>
> Key: PARQUET-2393
> URL: https://issues.apache.org/jira/browse/PARQUET-2393
> Project: Parquet
>  Issue Type: Improvement
>Affects Versions: 1.13.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>






[jira] [Commented] (PARQUET-2392) Remove StringBuilder in `LogicalTypeAnnotation`

2023-12-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792772#comment-17792772
 ] 

ASF GitHub Bot commented on PARQUET-2392:
-

Fokko merged PR #1215:
URL: https://github.com/apache/parquet-mr/pull/1215




> Remove StringBuilder in `LogicalTypeAnnotation`
> ---
>
> Key: PARQUET-2392
> URL: https://issues.apache.org/jira/browse/PARQUET-2392
> Project: Parquet
>  Issue Type: Improvement
>Affects Versions: 1.13.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>






[jira] [Commented] (PARQUET-2394) Use `computeIfAbsent` in `MessageColumnIO`

2023-12-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792769#comment-17792769
 ] 

ASF GitHub Bot commented on PARQUET-2394:
-

Fokko merged PR #1217:
URL: https://github.com/apache/parquet-mr/pull/1217




> Use `computeIfAbsent` in `MessageColumnIO`
> --
>
> Key: PARQUET-2394
> URL: https://issues.apache.org/jira/browse/PARQUET-2394
> Project: Parquet
>  Issue Type: Improvement
>Affects Versions: 1.13.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>






[jira] [Commented] (PARQUET-1647) [Java] support for Arrow's float16

2023-12-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792698#comment-17792698
 ] 

ASF GitHub Bot commented on PARQUET-1647:
-

wgtmac commented on PR #1142:
URL: https://github.com/apache/parquet-mr/pull/1142#issuecomment-1838002383

   BTW, it would be good to add an interoperability test to read parquet files 
from here: 
https://github.com/apache/parquet-testing/commit/da467dac2f095b979af37bcf40fa0d1dee5ff652.
 You may want to take a look at this example: 
https://github.com/apache/parquet-mr/blob/44b56225be6fe7b74667f4f2430326ef1f076cc5/parquet-hadoop/src/test/java/org/apache/parquet/hadoop/codec/TestInteropReadLz4RawCodec.java#L40
 




> [Java] support for Arrow's float16
> --
>
> Key: PARQUET-1647
> URL: https://issues.apache.org/jira/browse/PARQUET-1647
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format, parquet-thrift
>Reporter: The Alchemist
>Priority: Minor
>
> h2. DESCRIPTION
>  
> I'm wondering if there's any interest in supporting Arrow's {{float16}} type 
> in Parquet.
> There seem to be one or two {{float16}} / {{halffloat}} tickets here (e.g., 
> PARQUET-1403) but nothing that speaks to adding half-float support to Parquet 
> in-general.
>  
> h2. PLANS
> I'm able to spend some time on this, if someone points me  in the right 
> direction.
>  
>  # Add the {{HALFFLOAT}} or {{FLOAT16}} enum (any preferred naming 
> convention?) to 
> [https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L32]
>  # Add {{HALFFLOAT}} to {{org.apache.parquet.schema.PrimitiveType}}
>  # Add {{HALFFLOAT}} support to 
> {{org.apache.parquet.arrow.schema.SchemaConverter}}
>  # Add encoding for new type at {{org.apache.parquet.column.Encoding}}
>  # ??
> If anyone has any interest in this, pointers, or comments, they would be 
> greatly appreciated!





[jira] [Commented] (PARQUET-2395) Prefer `singletonList` over `asList`

2023-12-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792695#comment-17792695
 ] 

ASF GitHub Bot commented on PARQUET-2395:
-

wgtmac commented on PR #1218:
URL: https://github.com/apache/parquet-mr/pull/1218#issuecomment-1837995639

   Thanks for the explanation! 




> Prefer `singletonList` over `asList`
> 
>
> Key: PARQUET-2395
> URL: https://issues.apache.org/jira/browse/PARQUET-2395
> Project: Parquet
>  Issue Type: Improvement
>Affects Versions: 1.13.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>






[jira] [Commented] (PARQUET-2395) Prefer `singletonList` over `asList`

2023-12-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792693#comment-17792693
 ] 

ASF GitHub Bot commented on PARQUET-2395:
-

Fokko merged PR #1218:
URL: https://github.com/apache/parquet-mr/pull/1218




> Prefer `singletonList` over `asList`
> 
>
> Key: PARQUET-2395
> URL: https://issues.apache.org/jira/browse/PARQUET-2395
> Project: Parquet
>  Issue Type: Improvement
>Affects Versions: 1.13.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>






[jira] [Commented] (PARQUET-2395) Prefer `singletonList` over `asList`

2023-12-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792694#comment-17792694
 ] 

ASF GitHub Bot commented on PARQUET-2395:
-

Fokko commented on PR #1218:
URL: https://github.com/apache/parquet-mr/pull/1218#issuecomment-1837993837

   Thanks for the review @wgtmac, @zhangjiashen and @amousavigourabi  




> Prefer `singletonList` over `asList`
> 
>
> Key: PARQUET-2395
> URL: https://issues.apache.org/jira/browse/PARQUET-2395
> Project: Parquet
>  Issue Type: Improvement
>Affects Versions: 1.13.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>






[jira] [Commented] (PARQUET-2395) Prefer `singletonList` over `asList`

2023-12-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792692#comment-17792692
 ] 

ASF GitHub Bot commented on PARQUET-2395:
-

Fokko commented on PR #1218:
URL: https://github.com/apache/parquet-mr/pull/1218#issuecomment-1837993326

   @wgtmac Two things:
   
   - `singletonList` is completely immutable, while with `asList` you can still 
replace elements through `set`.
   - `singletonList` is not backed by an array, reducing the memory footprint.
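
The difference described above can be checked with a minimal sketch. The class and helper names (`SingletonListDemo`, `canSetFirst`) are made up for illustration and are not part of the PR; only `Arrays.asList` and `Collections.singletonList` come from the JDK.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SingletonListDemo {
    /** Returns true if replacing the first element via set() succeeds. */
    static boolean canSetFirst(List<String> list) {
        try {
            list.set(0, "changed");
            return true;
        } catch (UnsupportedOperationException e) {
            // Immutable lists reject set().
            return false;
        }
    }

    public static void main(String[] args) {
        // Arrays.asList returns a fixed-size list backed by an array: set() works.
        System.out.println(canSetFirst(Arrays.asList("a")));
        // Collections.singletonList holds the single element directly: set() throws.
        System.out.println(canSetFirst(Collections.singletonList("a")));
    }
}
```

Neither list supports `add` or `remove`; the practical difference is that `asList` still allows in-place replacement and carries the backing array.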




> Prefer `singletonList` over `asList`
> 
>
> Key: PARQUET-2395
> URL: https://issues.apache.org/jira/browse/PARQUET-2395
> Project: Parquet
>  Issue Type: Improvement
>Affects Versions: 1.13.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>






[jira] [Commented] (PARQUET-2344) Bump to Thirft 0.19.0

2023-12-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792691#comment-17792691
 ] 

ASF GitHub Bot commented on PARQUET-2344:
-

wgtmac commented on code in PR #1192:
URL: https://github.com/apache/parquet-mr/pull/1192#discussion_r141345


##
pom.xml:
##
@@ -619,6 +622,9 @@
 
   true
   true
+  
+
javax.annotation:javax.annotation-api:jar:1.3.2

Review Comment:
   Why do we need to ignore this?



##
parquet-thrift/src/main/java/org/apache/parquet/thrift/ThriftSchemaConverter.java:
##
@@ -225,14 +225,18 @@ private static ThriftField toThriftField(String name, 
Field field, ThriftField.R
 final Field listElemField = field.getListElemField();
 type = new ThriftType.ListType(toThriftField(listElemField.getName(), 
listElemField, requirement));
 break;
+  case UUID:
   case ENUM:
-Collection enumValues = field.getEnumValues();
-List values = new ArrayList();
-for (TEnum tEnum : enumValues) {
-  values.add(new EnumValue(tEnum.getValue(), tEnum.toString()));
+if (field.isEnum()) {

Review Comment:
   Why mixing UUID and ENUM in this case?



##
parquet-format-structures/pom.xml:
##
@@ -156,6 +156,11 @@
   libthrift
   ${format.thrift.version}
 
+
+  javax.annotation
+  javax.annotation-api

Review Comment:
   Where do we need this?



##
parquet-thrift/src/main/java/org/apache/parquet/thrift/struct/ThriftTypeID.java:
##
@@ -51,10 +51,15 @@ public enum ThriftTypeID {
   LIST (TType.LIST, true, ListType.class),
   ENUM (TType.ENUM, TType.I32, EnumType.class);
 
-  private static ThriftTypeID[] types = new ThriftTypeID[17];
+  private static final ThriftTypeID[] types;
   static {
+types = new ThriftTypeID[18];
 for (ThriftTypeID t : ThriftTypeID.values()) {
-  types[t.thriftType] = t;
+  if (t.thriftType == -1) {

Review Comment:
   It would be good to add the link to the comment as well. Or at least we need 
to explain why -1 is used here.





> Bump to Thirft 0.19.0
> -
>
> Key: PARQUET-2344
> URL: https://issues.apache.org/jira/browse/PARQUET-2344
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format, parquet-mr
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: format-2.10.0
>
>






[jira] [Commented] (PARQUET-2396) Refactor `ColumnIndexBuilder`

2023-12-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792690#comment-17792690
 ] 

ASF GitHub Bot commented on PARQUET-2396:
-

Fokko merged PR #1219:
URL: https://github.com/apache/parquet-mr/pull/1219




> Refactor `ColumnIndexBuilder`
> -
>
> Key: PARQUET-2396
> URL: https://issues.apache.org/jira/browse/PARQUET-2396
> Project: Parquet
>  Issue Type: Improvement
>Affects Versions: 1.13.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>






[jira] [Commented] (PARQUET-2391) Remove unnecessary unboxing

2023-12-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792668#comment-17792668
 ] 

ASF GitHub Bot commented on PARQUET-2391:
-

Fokko commented on PR #1214:
URL: https://github.com/apache/parquet-mr/pull/1214#issuecomment-1837969587

   Thanks for the review @wgtmac & @amousavigourabi  




> Remove unnecessary unboxing
> ---
>
> Key: PARQUET-2391
> URL: https://issues.apache.org/jira/browse/PARQUET-2391
> Project: Parquet
>  Issue Type: Improvement
>Affects Versions: 1.13.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>






[jira] [Commented] (PARQUET-1647) [Java] support for Arrow's float16

2023-12-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792667#comment-17792667
 ] 

ASF GitHub Bot commented on PARQUET-1647:
-

zhangjiashen commented on code in PR #1142:
URL: https://github.com/apache/parquet-mr/pull/1142#discussion_r1413455235


##
pom.xml:
##
@@ -596,6 +597,9 @@
 

> [Java] support for Arrow's float16
> --
>
> Key: PARQUET-1647
> URL: https://issues.apache.org/jira/browse/PARQUET-1647
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format, parquet-thrift
>Reporter: The Alchemist
>Priority: Minor
>
> h2. DESCRIPTION
>  
> I'm wondering if there's any interest in supporting Arrow's {{float16}} type 
> in Parquet.
> There seem to be one or two {{float16}} / {{halffloat}} tickets here (e.g., 
> PARQUET-1403) but nothing that speaks to adding half-float support to Parquet 
> in-general.
>  
> h2. PLANS
> I'm able to spend some time on this, if someone points me  in the right 
> direction.
>  
>  # Add the {{HALFFLOAT}} or {{FLOAT16}} enum (any preferred naming 
> convention?) to 
> [https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L32]
>  # Add {{HALFFLOAT}} to {{org.apache.parquet.schema.PrimitiveType}}
>  # Add {{HALFFLOAT}} support to 
> {{org.apache.parquet.arrow.schema.SchemaConverter}}
>  # Add encoding for new type at {{org.apache.parquet.column.Encoding}}
>  # ??
> If anyone has any interest in this, pointers, or comments, they would be 
> greatly appreciated!





[jira] [Commented] (PARQUET-2391) Remove unnecessary unboxing

2023-12-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792666#comment-17792666
 ] 

ASF GitHub Bot commented on PARQUET-2391:
-

Fokko merged PR #1214:
URL: https://github.com/apache/parquet-mr/pull/1214




> Remove unnecessary unboxing
> ---
>
> Key: PARQUET-2391
> URL: https://issues.apache.org/jira/browse/PARQUET-2391
> Project: Parquet
>  Issue Type: Improvement
>Affects Versions: 1.13.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>






[jira] [Commented] (PARQUET-2385) Don't initialize CodecFactory in ParquetWriter

2023-12-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792645#comment-17792645
 ] 

ASF GitHub Bot commented on PARQUET-2385:
-

wgtmac merged PR #1203:
URL: https://github.com/apache/parquet-mr/pull/1203




> Don't initialize CodecFactory in ParquetWriter
> --
>
> Key: PARQUET-2385
> URL: https://issues.apache.org/jira/browse/PARQUET-2385
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Atour Mousavi Gourabi
>Priority: Major
>
> In ParquetWriter we initialize a CodecFactory; instead, we should allow users 
> to set their own via the builder, to provide a little more flexibility 
> (analogous to PARQUET-2282).





[jira] [Commented] (PARQUET-2390) Replace anonymouse functions with lambda's

2023-12-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792641#comment-17792641
 ] 

ASF GitHub Bot commented on PARQUET-2390:
-

Fokko merged PR #1213:
URL: https://github.com/apache/parquet-mr/pull/1213




> Replace anonymouse functions with lambda's
> --
>
> Key: PARQUET-2390
> URL: https://issues.apache.org/jira/browse/PARQUET-2390
> Project: Parquet
>  Issue Type: Improvement
>Affects Versions: 1.13.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>






[jira] [Commented] (PARQUET-2390) Replace anonymouse functions with lambda's

2023-12-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792642#comment-17792642
 ] 

ASF GitHub Bot commented on PARQUET-2390:
-

Fokko commented on PR #1213:
URL: https://github.com/apache/parquet-mr/pull/1213#issuecomment-1837879063

   Thanks for the review @wgtmac 




> Replace anonymouse functions with lambda's
> --
>
> Key: PARQUET-2390
> URL: https://issues.apache.org/jira/browse/PARQUET-2390
> Project: Parquet
>  Issue Type: Improvement
>Affects Versions: 1.13.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>






[jira] [Commented] (PARQUET-2400) Update Spotless command in PR prompt to include vector plugins

2023-12-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792640#comment-17792640
 ] 

ASF GitHub Bot commented on PARQUET-2400:
-

wgtmac merged PR #1223:
URL: https://github.com/apache/parquet-mr/pull/1223




> Update Spotless command in PR prompt to include vector plugins
> --
>
> Key: PARQUET-2400
> URL: https://issues.apache.org/jira/browse/PARQUET-2400
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Atour Mousavi Gourabi
>Priority: Minor
>
> The Maven command to apply Spotless referenced in the PR prompt does not 
> include applying it to the parquet-plugins. This needs to be updated in those 
> docs.





[jira] [Commented] (PARQUET-1647) [Java] support for Arrow's float16

2023-12-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792628#comment-17792628
 ] 

ASF GitHub Bot commented on PARQUET-1647:
-

wgtmac commented on code in PR #1142:
URL: https://github.com/apache/parquet-mr/pull/1142#discussion_r1413353347


##
pom.xml:
##
@@ -596,6 +597,9 @@
 

> [Java] support for Arrow's float16
> --
>
> Key: PARQUET-1647
> URL: https://issues.apache.org/jira/browse/PARQUET-1647
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format, parquet-thrift
>Reporter: The Alchemist
>Priority: Minor
>
> h2. DESCRIPTION
>  
> I'm wondering if there's any interest in supporting Arrow's {{float16}} type 
> in Parquet.
> There seem to be one or two {{float16}} / {{halffloat}} tickets here (e.g., 
> PARQUET-1403) but nothing that speaks to adding half-float support to Parquet 
> in-general.
>  
> h2. PLANS
> I'm able to spend some time on this, if someone points me  in the right 
> direction.
>  
>  # Add the {{HALFFLOAT}} or {{FLOAT16}} enum (any preferred naming 
> convention?) to 
> [https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L32]
>  # Add {{HALFFLOAT}} to {{org.apache.parquet.schema.PrimitiveType}}
>  # Add {{HALFFLOAT}} support to 
> {{org.apache.parquet.arrow.schema.SchemaConverter}}
>  # Add encoding for new type at {{org.apache.parquet.column.Encoding}}
>  # ??
> If anyone has any interest in this, pointers, or comments, they would be 
> greatly appreciated!





[jira] [Commented] (PARQUET-1647) [Java] support for Arrow's float16

2023-12-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792621#comment-17792621
 ] 

ASF GitHub Bot commented on PARQUET-1647:
-

zhangjiashen commented on PR #1142:
URL: https://github.com/apache/parquet-mr/pull/1142#issuecomment-1837800275

   > Could you please rebase it?
   
   Rebased, can you help merge this PR?




> [Java] support for Arrow's float16
> --
>
> Key: PARQUET-1647
> URL: https://issues.apache.org/jira/browse/PARQUET-1647
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format, parquet-thrift
>Reporter: The Alchemist
>Priority: Minor
>
> h2. DESCRIPTION
>  
> I'm wondering if there's any interest in supporting Arrow's {{float16}} type 
> in Parquet.
> There seem to be one or two {{float16}} / {{halffloat}} tickets here (e.g., 
> PARQUET-1403) but nothing that speaks to adding half-float support to Parquet 
> in-general.
>  
> h2. PLANS
> I'm able to spend some time on this, if someone points me  in the right 
> direction.
>  
>  # Add the {{HALFFLOAT}} or {{FLOAT16}} enum (any preferred naming 
> convention?) to 
> [https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L32]
>  # Add {{HALFFLOAT}} to {{org.apache.parquet.schema.PrimitiveType}}
>  # Add {{HALFFLOAT}} support to 
> {{org.apache.parquet.arrow.schema.SchemaConverter}}
>  # Add encoding for new type at {{org.apache.parquet.column.Encoding}}
>  # ??
> If anyone has any interest in this, pointers, or comments, they would be 
> greatly appreciated!





[jira] [Commented] (PARQUET-2388) Deprecate `CHARSETS` on `PlainValuesWriter`

2023-12-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792618#comment-17792618
 ] 

ASF GitHub Bot commented on PARQUET-2388:
-

Fokko merged PR #1211:
URL: https://github.com/apache/parquet-mr/pull/1211




> Deprecate `CHARSETS` on `PlainValuesWriter`
> ---
>
> Key: PARQUET-2388
> URL: https://issues.apache.org/jira/browse/PARQUET-2388
> Project: Parquet
>  Issue Type: Improvement
>Affects Versions: 1.13.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>






[jira] [Commented] (PARQUET-2389) Remove redundant initializers

2023-12-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792617#comment-17792617
 ] 

ASF GitHub Bot commented on PARQUET-2389:
-

Fokko commented on PR #1212:
URL: https://github.com/apache/parquet-mr/pull/1212#issuecomment-1837791938

   Thanks for the review @wgtmac and @amousavigourabi  




> Remove redundant initializers
> -
>
> Key: PARQUET-2389
> URL: https://issues.apache.org/jira/browse/PARQUET-2389
> Project: Parquet
>  Issue Type: Improvement
>Affects Versions: 1.13.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>






[jira] [Commented] (PARQUET-2389) Remove redundant initializers

2023-12-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792616#comment-17792616
 ] 

ASF GitHub Bot commented on PARQUET-2389:
-

Fokko merged PR #1212:
URL: https://github.com/apache/parquet-mr/pull/1212




> Remove redundant initializers
> -
>
> Key: PARQUET-2389
> URL: https://issues.apache.org/jira/browse/PARQUET-2389
> Project: Parquet
>  Issue Type: Improvement
>Affects Versions: 1.13.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>






[jira] [Commented] (PARQUET-2387) Simplify `hasFieldsIgnored` expression

2023-12-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792615#comment-17792615
 ] 

ASF GitHub Bot commented on PARQUET-2387:
-

Fokko commented on PR #1210:
URL: https://github.com/apache/parquet-mr/pull/1210#issuecomment-1837786670

   Thanks @wgtmac  




> Simplify `hasFieldsIgnored` expression
> --
>
> Key: PARQUET-2387
> URL: https://issues.apache.org/jira/browse/PARQUET-2387
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-thrift
>Affects Versions: 1.13.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>






[jira] [Commented] (PARQUET-2387) Simplify `hasFieldsIgnored` expression

2023-12-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792614#comment-17792614
 ] 

ASF GitHub Bot commented on PARQUET-2387:
-

Fokko merged PR #1210:
URL: https://github.com/apache/parquet-mr/pull/1210




> Simplify `hasFieldsIgnored` expression
> --
>
> Key: PARQUET-2387
> URL: https://issues.apache.org/jira/browse/PARQUET-2387
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-thrift
>Affects Versions: 1.13.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>






[jira] [Commented] (PARQUET-2399) Use deprecated tag in Javadoc

2023-12-02 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792463#comment-17792463
 ] 

ASF GitHub Bot commented on PARQUET-2399:
-

Fokko merged PR #1222:
URL: https://github.com/apache/parquet-mr/pull/1222




> Use deprecated tag in Javadoc
> -
>
> Key: PARQUET-2399
> URL: https://issues.apache.org/jira/browse/PARQUET-2399
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Atour Mousavi Gourabi
>Priority: Minor
>
> In some Javadoc, we use Deprecated: instead of the deprecated tag.
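
As a rough illustration of the convention this issue asks for, the sketch below pairs the Javadoc `@deprecated` block tag with the `@Deprecated` annotation. The class and method names (`DeprecatedTagDemo`, `getSize`, `sizeInBytes`, `isMarkedDeprecated`) are hypothetical and do not come from parquet-mr.

```java
public class DeprecatedTagDemo {
    /**
     * Returns the size in bytes.
     *
     * @deprecated use {@link #sizeInBytes()} instead
     */
    @Deprecated
    public static long getSize() {
        return sizeInBytes();
    }

    /** Returns the size in bytes. */
    public static long sizeInBytes() {
        return 42L;
    }

    /** Reflection helper: true if the named no-arg method carries @Deprecated. */
    static boolean isMarkedDeprecated(String methodName) {
        try {
            return DeprecatedTagDemo.class.getDeclaredMethod(methodName)
                .isAnnotationPresent(Deprecated.class);
        } catch (NoSuchMethodException e) {
            return false;
        }
    }
}
```

A free-text "Deprecated:" line in the description is only prose, whereas the `@deprecated` tag is rendered in its own section by the Javadoc tool and recognized by IDEs.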





[jira] [Commented] (PARQUET-2401) Synchronize on final fields

2023-12-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792271#comment-17792271
 ] 

ASF GitHub Bot commented on PARQUET-2401:
-

amousavigourabi opened a new pull request, #1224:
URL: https://github.com/apache/parquet-mr/pull/1224

   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [ ] My PR addresses the following [Parquet 
Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references
 them in the PR title. For example, "PARQUET-1234: My Parquet PR"
   - https://issues.apache.org/jira/browse/PARQUET-XXX
   - In case you are adding a dependency, check if the license complies with
 the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Tests
   
   - [ ] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   
   ### Commits
   
   - [ ] My commits all reference Jira issues in their subject lines. In 
addition, my commits follow the guidelines
 from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
   1. Subject is separated from body by a blank line
   1. Subject is limited to 50 characters (not including Jira issue 
reference)
   1. Subject does not end with a period
   1. Subject uses the imperative mood ("add", not "adding")
   1. Body wraps at 72 characters
   1. Body explains "what" and "why", not "how"
   
   ### Style
   - [ ] My contribution adheres to the code style guidelines and Spotless 
passes.
   - To apply the necessary changes, run the spotless:apply goal in Maven
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes 
how to use it.
   - All the public functions and the classes in the PR contain Javadoc 
that explain what it does
   




> Synchronize on final fields
> ---
>
> Key: PARQUET-2401
> URL: https://issues.apache.org/jira/browse/PARQUET-2401
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Atour Mousavi Gourabi
>Priority: Minor
>






[jira] [Commented] (PARQUET-1822) Parquet without Hadoop dependencies

2023-12-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792267#comment-17792267
 ] 

ASF GitHub Bot commented on PARQUET-1822:
-

amousavigourabi commented on PR #:
URL: https://github.com/apache/parquet-mr/pull/#issuecomment-1836931968

   > Our project needs this feature as well, is there a date for the next major 
release?
   
   @drealeed if you just need to be able to drop the Hadoop Path dependency, 
you might want to consider copying the InputFile, OutputFile implementations 
from this pull request before the next release is out. If you need to fully 
drop Hadoop, this is still being worked on.




> Parquet without Hadoop dependencies
> ---
>
> Key: PARQUET-1822
> URL: https://issues.apache.org/jira/browse/PARQUET-1822
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-avro
>Affects Versions: 1.11.0
> Environment: Amazon Fargate (linux), Windows development box.
> We are writing Parquet to be read by the Snowflake and Athena databases.
>Reporter: mark juchems
>Priority: Minor
>  Labels: documentation, newbie
> Fix For: 1.14.0
>
>
> I have been trying for weeks to create a parquet file from avro and write to 
> S3 in Java.  This has been incredibly frustrating and odd as Spark can do it 
> easily (I'm told).
> I have assembled the correct jars through luck and diligence, but now I find 
> out that I have to have hadoop installed on my machine. I am currently 
> developing in Windows and it seems a dll and exe can fix that up but am 
> wondering about Linux as the code will eventually run in Fargate on AWS.
> *Why do I need external dependencies and not pure java?*
> The thing really is how utterly complex all this is.  I would like to create 
> an avro file and convert it to Parquet and write it to S3, but I am trapped 
> in "ParquetWriter" hell! 
> *Why can't I get a normal OutputStream and write it wherever I want?*
> I have scoured the web for examples and there are a few but we really need 
> some documentation on this stuff.  I understand that there may be reasons for 
> all this but I can't find them on the web anywhere.  Any help?  Can't we get 
> the "SimpleParquet" jar that does this:
>  
> ParquetWriter writer = 
> AvroParquetWriter.builder(outputStream)
>  .withSchema(avroSchema)
>  .withConf(conf)
>  .withCompressionCodec(CompressionCodecName.SNAPPY)
>  .withWriteMode(Mode.OVERWRITE)//probably not good for prod. (overwrites 
> files).
>  .build();
>  





[jira] [Commented] (PARQUET-1822) Parquet without Hadoop dependencies

2023-12-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792120#comment-17792120
 ] 

ASF GitHub Bot commented on PARQUET-1822:
-

drealeed commented on PR #:
URL: https://github.com/apache/parquet-mr/pull/#issuecomment-1836333591

   Our project needs this feature as well, is there a date for the next major 
release?




> Parquet without Hadoop dependencies
> ---
>
> Key: PARQUET-1822
> URL: https://issues.apache.org/jira/browse/PARQUET-1822
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-avro
>Affects Versions: 1.11.0
> Environment: Amazon Fargate (linux), Windows development box.
> We are writing Parquet to be read by the Snowflake and Athena databases.
>Reporter: mark juchems
>Priority: Minor
>  Labels: documentation, newbie
> Fix For: 1.14.0
>
>
> I have been trying for weeks to create a parquet file from avro and write to 
> S3 in Java.  This has been incredibly frustrating and odd as Spark can do it 
> easily (I'm told).
> I have assembled the correct jars through luck and diligence, but now I find 
> out that I have to have hadoop installed on my machine. I am currently 
> developing in Windows and it seems a dll and exe can fix that up but am 
> wondering about Linux as the code will eventually run in Fargate on AWS.
> *Why do I need external dependencies and not pure java?*
> The thing really is how utterly complex all this is.  I would like to create 
> an avro file and convert it to Parquet and write it to S3, but I am trapped 
> in "ParquetWriter" hell! 
> *Why can't I get a normal OutputStream and write it wherever I want?*
> I have scoured the web for examples and there are a few but we really need 
> some documentation on this stuff.  I understand that there may be reasons for 
> all this but I can't find them on the web anywhere.  Any help?  Can't we get 
> the "SimpleParquet" jar that does this:
>  
> ParquetWriter writer = 
> AvroParquetWriter.builder(outputStream)
>  .withSchema(avroSchema)
>  .withConf(conf)
>  .withCompressionCodec(CompressionCodecName.SNAPPY)
>  .withWriteMode(Mode.OVERWRITE)//probably not good for prod. (overwrites 
> files).
>  .build();
>  





[jira] [Commented] (PARQUET-2387) Simplify `hasFieldsIgnored` expression

2023-11-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17791649#comment-17791649
 ] 

ASF GitHub Bot commented on PARQUET-2387:
-

Fokko commented on code in PR #1210:
URL: https://github.com/apache/parquet-mr/pull/1210#discussion_r1410732198


##
parquet-thrift/src/main/java/org/apache/parquet/thrift/BufferedProtocolReadToWrite.java:
##
@@ -369,7 +369,7 @@ public String toDebugString() {
   ThriftField expectedField;
   if ((expectedField = type.getChildById(field.id)) == null) {
 handleUnrecognizedField(field, type, in);
-hasFieldsIgnored |= true;

Review Comment:
   `hasFieldsIgnored |= true` is equivalent to `hasFieldsIgnored = 
hasFieldsIgnored | true`, which always evaluates to true.
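
A small sketch of the point made in this review comment: OR-assigning `true` to a boolean gives the same result as assigning `true` directly, whatever the previous value was. The class and method names are invented for illustration.

```java
public class OrAssignDemo {
    /** The original form: compound OR-assignment with a constant true. */
    static boolean orAssignTrue(boolean flag) {
        flag |= true; // same as: flag = flag | true
        return flag;
    }

    /** The simplified form: plain assignment. */
    static boolean plainAssign(boolean flag) {
        flag = true;
        return flag;
    }

    public static void main(String[] args) {
        for (boolean initial : new boolean[] {false, true}) {
            // Both forms agree for every possible starting value.
            System.out.println(orAssignTrue(initial) == plainAssign(initial));
        }
    }
}
```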





> Simplify `hasFieldsIgnored` expression
> --
>
> Key: PARQUET-2387
> URL: https://issues.apache.org/jira/browse/PARQUET-2387
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-thrift
>Affects Versions: 1.13.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>






[jira] [Commented] (PARQUET-2392) Remove StringBuilder in `LogicalTypeAnnotation`

2023-11-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17791643#comment-17791643
 ] 

ASF GitHub Bot commented on PARQUET-2392:
-

Fokko commented on PR #1215:
URL: https://github.com/apache/parquet-mr/pull/1215#issuecomment-1833844311

   @zhangjiashen Thanks for the review, appreciate it! I don't think there are 
any negative performance implications since the compiler will just optimize it 
to a single concatenation. A `StringBuilder` should only be used when you 
concatenate in a loop (fields in a schema for example).
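
To make the distinction concrete, here is a minimal sketch of the two cases. The names (`ConcatDemo`, `describe`, `joinFields`) are hypothetical and are not taken from `LogicalTypeAnnotation`.

```java
import java.util.Arrays;
import java.util.List;

public class ConcatDemo {
    /** A single expression: plain `+` is fine, the compiler emits one concatenation. */
    static String describe(String name, int precision, int scale) {
        return name + "(" + precision + "," + scale + ")";
    }

    /** Concatenation in a loop: a StringBuilder avoids creating a new String per iteration. */
    static String joinFields(List<String> fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                sb.append(", ");
            }
            sb.append(fields.get(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe("DECIMAL", 10, 2));
        System.out.println(joinFields(Arrays.asList("a", "b", "c")));
    }
}
```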




> Remove StringBuilder in `LogicalTypeAnnotation`
> ---
>
> Key: PARQUET-2392
> URL: https://issues.apache.org/jira/browse/PARQUET-2392
> Project: Parquet
>  Issue Type: Improvement
>Affects Versions: 1.13.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>






[jira] [Commented] (PARQUET-2171) Implement vectored IO in parquet file format

2023-11-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17791631#comment-17791631
 ] 

ASF GitHub Bot commented on PARQUET-2171:
-

steveloughran commented on code in PR #1139:
URL: https://github.com/apache/parquet-mr/pull/1139#discussion_r1410699415


##
parquet-hadoop/README.md:
##
@@ -501,3 +501,11 @@ If `false`, key material is stored in separate new files, 
created in the same fo
 **Description:** Length of key encryption keys (KEKs), randomly generated by 
parquet key management tools. Can be 128, 192 or 256 bits.  
 **Default value:** `128`
 
+---
+
+**Property:** `parquet.hadoop.vectored.io.enabled`  
+**Description:** Flag to enable use of the FileSystem Vector IO API on Hadoop 
releases which support the feature.
+If `true` then an attempt will be made to dynamically load the relevant 
classes; 

Review Comment:
   no, hdfs doesn't support it. Native IO does, so if you use file:// URLS you 
get direct NIO vectored IO into buffers (yay! hadoop APIs move to the 2010s!). 
S3A supports it with multiple parallel GET with some range coalescing in 
between. Would love ABFS connector to support it too...
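
The "range coalescing" mentioned above can be sketched roughly as follows. This is not the Hadoop or S3A implementation, only an illustration of the idea under assumed names (`RangeCoalescer`, `coalesce`, `maxGap`): nearby read ranges are merged so that fewer, larger requests are issued.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class RangeCoalescer {
    /**
     * Merges ranges given as {offset, length} pairs when the gap between
     * consecutive ranges is at most maxGap bytes. Input is not modified.
     */
    static List<long[]> coalesce(List<long[]> ranges, long maxGap) {
        List<long[]> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparingLong((long[] r) -> r[0]));

        List<long[]> merged = new ArrayList<>();
        for (long[] r : sorted) {
            long start = r[0];
            long end = r[0] + r[1];
            if (!merged.isEmpty()) {
                long[] last = merged.get(merged.size() - 1);
                long lastEnd = last[0] + last[1];
                if (start - lastEnd <= maxGap) {
                    // Close enough (or overlapping): extend the previous range.
                    last[1] = Math.max(lastEnd, end) - last[0];
                    continue;
                }
            }
            // Copy so that later extensions never touch the caller's arrays.
            merged.add(new long[] {start, r[1]});
        }
        return merged;
    }
}
```

With `maxGap = 64`, the ranges `{0, 100}`, `{120, 50}` and `{1000, 10}` collapse into `{0, 170}` and `{1000, 10}`; the bytes in the 20-byte gap are read and discarded in exchange for one fewer request.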





> Implement vectored IO in parquet file format
> 
>
> Key: PARQUET-2171
> URL: https://issues.apache.org/jira/browse/PARQUET-2171
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: Mukund Thakur
>Priority: Major
>
> We recently added a new feature called vectored IO in Hadoop for improving 
> read performance for seek-heavy readers. Spark jobs and others that use 
> Parquet will greatly benefit from this API. Details can be found here 
> [https://github.com/apache/hadoop/commit/e1842b2a749d79cbdc15c524515b9eda64c339d5]
> https://issues.apache.org/jira/browse/HADOOP-18103
> https://issues.apache.org/jira/browse/HADOOP-11867





[jira] [Commented] (PARQUET-2171) Implement vectored IO in parquet file format

2023-11-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17791630#comment-17791630
 ] 

ASF GitHub Bot commented on PARQUET-2171:
-

steveloughran commented on PR #1139:
URL: https://github.com/apache/parquet-mr/pull/1139#issuecomment-1833813575

   @Fokko that's japicmp getting its logic wrong because it's a new file; 
thought I'd edited the build rules so it would ignore that.
   
   
   anyway, need to fix the merge as something (#1209?) has just broken it




> Implement vectored IO in parquet file format
> 
>
> Key: PARQUET-2171
> URL: https://issues.apache.org/jira/browse/PARQUET-2171
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: Mukund Thakur
>Priority: Major
>
> We recently added a new feature called vectored IO in Hadoop for improving 
> read performance for seek-heavy readers. Spark jobs and others which use 
> Parquet will greatly benefit from this API. Details can be found here 
> [https://github.com/apache/hadoop/commit/e1842b2a749d79cbdc15c524515b9eda64c339d5]
> https://issues.apache.org/jira/browse/HADOOP-18103
> https://issues.apache.org/jira/browse/HADOOP-11867



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2385) Don't initialize CodecFactory in ParquetWriter

2023-11-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17791571#comment-17791571
 ] 

ASF GitHub Bot commented on PARQUET-2385:
-

amousavigourabi commented on PR #1203:
URL: https://github.com/apache/parquet-mr/pull/1203#issuecomment-183334

   > @amousavigourabi It seems a rebase is required.
   
   @wgtmac Done, thanks for the heads up




> Don't initialize CodecFactory in ParquetWriter
> --
>
> Key: PARQUET-2385
> URL: https://issues.apache.org/jira/browse/PARQUET-2385
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Atour Mousavi Gourabi
>Priority: Major
>
> In ParquetWriter we initialize a CodecFactory; instead, we should allow users 
> to set their own via the builder so as to provide a little more flexibility 
> (analogous to PARQUET-2282).
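
The idea in the description above (let the caller supply the factory through the builder and fall back to a default only when none is given) can be sketched as follows. This is a minimal sketch under stated assumptions: `CodecFactoryBuilderSketch`, its nested `CodecFactory`, `Writer` and `Builder` types, and the `withCodecFactory` method are hypothetical stand-ins and not the parquet-mr classes or the API that PR #1203 adds.

```java
/** Hypothetical stand-ins; not the parquet-mr ParquetWriter or CodecFactory. */
final class CodecFactoryBuilderSketch {

  /** Minimal placeholder for whatever creates compressors. */
  interface CodecFactory {
    String name();
  }

  static final class Writer {
    final CodecFactory codecFactory;

    private Writer(CodecFactory codecFactory) {
      this.codecFactory = codecFactory;
    }

    static Builder builder() {
      return new Builder();
    }
  }

  static final class Builder {
    // null means "the caller did not choose one".
    private CodecFactory codecFactory;

    Builder withCodecFactory(CodecFactory factory) {
      this.codecFactory = factory;
      return this;
    }

    Writer build() {
      // The default is created here, and only if nothing was supplied,
      // instead of being hard-wired inside the writer.
      CodecFactory effective = codecFactory != null ? codecFactory : () -> "default";
      return new Writer(effective);
    }
  }

  public static void main(String[] args) {
    Writer custom = Writer.builder().withCodecFactory(() -> "custom").build();
    System.out.println(custom.codecFactory.name());
  }
}
```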



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2400) Update Spotless command in PR prompt to include vector plugins

2023-11-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17791563#comment-17791563
 ] 

ASF GitHub Bot commented on PARQUET-2400:
-

amousavigourabi opened a new pull request, #1223:
URL: https://github.com/apache/parquet-mr/pull/1223

   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [ ] My PR addresses the following [Parquet 
Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references
 them in the PR title. For example, "PARQUET-1234: My Parquet PR"
   - https://issues.apache.org/jira/browse/PARQUET-XXX
   - In case you are adding a dependency, check if the license complies with
 the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Tests
   
   - [ ] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   
   ### Commits
   
   - [ ] My commits all reference Jira issues in their subject lines. In 
addition, my commits follow the guidelines
 from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
   1. Subject is separated from body by a blank line
   1. Subject is limited to 50 characters (not including Jira issue 
reference)
   1. Subject does not end with a period
   1. Subject uses the imperative mood ("add", not "adding")
   1. Body wraps at 72 characters
   1. Body explains "what" and "why", not "how"
   
   ### Style
   - [ ] My contribution adheres to the code style guidelines and Spotless 
passes.
   - To apply the necessary changes, run the spotless:apply goal in Maven
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes 
how to use it.
   - All the public functions and the classes in the PR contain Javadoc 
that explain what it does
   




> Update Spotless command in PR prompt to include vector plugins
> --
>
> Key: PARQUET-2400
> URL: https://issues.apache.org/jira/browse/PARQUET-2400
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Atour Mousavi Gourabi
>Priority: Minor
>
> The Maven command to apply Spotless referenced in the PR prompt does not 
> include applying it to the parquet-plugins. This needs to be updated in those 
> docs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2399) Use deprecated tag in Javadoc

2023-11-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17791560#comment-17791560
 ] 

ASF GitHub Bot commented on PARQUET-2399:
-

amousavigourabi opened a new pull request, #1222:
URL: https://github.com/apache/parquet-mr/pull/1222

   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [ ] My PR addresses the following [Parquet 
Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references
 them in the PR title. For example, "PARQUET-1234: My Parquet PR"
   - https://issues.apache.org/jira/browse/PARQUET-XXX
   - In case you are adding a dependency, check if the license complies with
 the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Tests
   
   - [ ] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   
   ### Commits
   
   - [ ] My commits all reference Jira issues in their subject lines. In 
addition, my commits follow the guidelines
 from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
   1. Subject is separated from body by a blank line
   1. Subject is limited to 50 characters (not including Jira issue 
reference)
   1. Subject does not end with a period
   1. Subject uses the imperative mood ("add", not "adding")
   1. Body wraps at 72 characters
   1. Body explains "what" and "why", not "how"
   
   ### Style
   - [ ] My contribution adheres to the code style guidelines and Spotless 
passes.
   - To apply the necessary changes, run the spotless:apply goal in Maven
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes 
how to use it.
   - All the public functions and the classes in the PR contain Javadoc 
that explain what it does
   




> Use deprecated tag in Javadoc
> -
>
> Key: PARQUET-2399
> URL: https://issues.apache.org/jira/browse/PARQUET-2399
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Atour Mousavi Gourabi
>Priority: Minor
>
> In some Javadoc, we use Deprecated: instead of the deprecated tag.
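
For illustration, the difference described above might look like the sketch below. The class and method names (`DeprecatedTagSketch`, `rowCount`, `getRowCount`) are hypothetical and not taken from parquet-mr. The point is that the `@deprecated` Javadoc tag, usually paired with the `@Deprecated` annotation, is recognised by the Javadoc tool, whereas a plain "Deprecated:" sentence is just prose.

```java
/** Hypothetical example class; the names are not from parquet-mr. */
final class DeprecatedTagSketch {

  /**
   * Returns the row count.
   *
   * @deprecated use {@link #getRowCount()} instead
   */
  @Deprecated
  static long rowCount() {
    return getRowCount();
  }

  /** Returns the row count. */
  static long getRowCount() {
    return 42L;
  }

  public static void main(String[] args) {
    System.out.println(getRowCount());
  }
}
```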



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-1647) [Java] support for Arrow's float16

2023-11-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17791520#comment-17791520
 ] 

ASF GitHub Bot commented on PARQUET-1647:
-

wgtmac commented on PR #1142:
URL: https://github.com/apache/parquet-mr/pull/1142#issuecomment-1833370284

   Could you please rebase it?




> [Java] support for Arrow's float16
> --
>
> Key: PARQUET-1647
> URL: https://issues.apache.org/jira/browse/PARQUET-1647
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format, parquet-thrift
>Reporter: The Alchemist
>Priority: Minor
>
> h2. DESCRIPTION
>  
> I'm wondering if there's any interest in supporting Arrow's {{float16}} type 
> in Parquet.
> There seem to be one or two {{float16}} / {{halffloat}} tickets here (e.g., 
> PARQUET-1403) but nothing that speaks to adding half-float support to Parquet 
> in general.
>  
> h2. PLANS
> I'm able to spend some time on this, if someone points me in the right 
> direction.
>  
>  # Add the {{HALFFLOAT}} or {{FLOAT16}} enum (any preferred naming 
> convention?) to 
> [https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L32]
>  # Add {{HALFFLOAT}} to {{org.apache.parquet.schema.PrimitiveType}}
>  # Add {{HALFFLOAT}} support to 
> {{org.apache.parquet.arrow.schema.SchemaConverter}}
>  # Add encoding for new type at {{org.apache.parquet.column.Encoding}}
>  # ??
> If anyone has any interest in this, pointers, or comments, they would be 
> greatly appreciated!
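
As background for the half-float discussion above, here is a small sketch of decoding a 16-bit half-precision value into a Java `float`. It assumes the IEEE 754 binary16 layout (1 sign bit, 5 exponent bits, 10 mantissa bits). The class `Float16Sketch` and its method are hypothetical and say nothing about how parquet-mr or PR #1142 stores or converts these values.

```java
/** Hypothetical helper; assumes the IEEE 754 binary16 bit layout. */
final class Float16Sketch {

  /** Converts the 16 raw bits of a half-precision value to a float. */
  static float halfToFloat(short bits) {
    int h = bits & 0xFFFF;
    int sign = (h >>> 15) & 0x1;
    int exponent = (h >>> 10) & 0x1F;
    int mantissa = h & 0x3FF;

    float magnitude;
    if (exponent == 0) {
      // Zero or subnormal: value is mantissa * 2^-24.
      magnitude = mantissa * (1.0f / (1 << 24));
    } else if (exponent == 0x1F) {
      // All exponent bits set: infinity (mantissa 0) or NaN.
      magnitude = (mantissa == 0) ? Float.POSITIVE_INFINITY : Float.NaN;
    } else {
      // Normal: re-bias the exponent (15 -> 127) and widen the mantissa (10 -> 23 bits).
      magnitude = Float.intBitsToFloat(((exponent - 15 + 127) << 23) | (mantissa << 13));
    }
    return sign == 0 ? magnitude : -magnitude;
  }

  public static void main(String[] args) {
    System.out.println(halfToFloat((short) 0x3C00));
  }
}
```

Every binary16 value is exactly representable as a `float`, so this direction is lossless; the reverse conversion needs a rounding policy.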



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2385) Don't initialize CodecFactory in ParquetWriter

2023-11-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17791519#comment-17791519
 ] 

ASF GitHub Bot commented on PARQUET-2385:
-

wgtmac commented on PR #1203:
URL: https://github.com/apache/parquet-mr/pull/1203#issuecomment-1833368029

   @amousavigourabi It seems a rebase is required.




> Don't initialize CodecFactory in ParquetWriter
> --
>
> Key: PARQUET-2385
> URL: https://issues.apache.org/jira/browse/PARQUET-2385
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Atour Mousavi Gourabi
>Priority: Major
>
> In ParquetWriter we initialize a CodecFactory; instead, we should allow users 
> to set their own via the builder so as to provide a little more flexibility 
> (analogous to PARQUET-2282).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2386) More consistent code style in parquet-mr

2023-11-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17791517#comment-17791517
 ] 

ASF GitHub Bot commented on PARQUET-2386:
-

wgtmac commented on PR #1209:
URL: https://github.com/apache/parquet-mr/pull/1209#issuecomment-1833367177

   > I rebased on the latest master. I'd rather get this merged somewhat 
quickly (maybe by the end of the week?) so as to avoid blocking other merges 
and/or endless rebasing.
   
   I just merged this PR to unblock others. Thanks @amousavigourabi!




> More consistent code style in parquet-mr
> 
>
> Key: PARQUET-2386
> URL: https://issues.apache.org/jira/browse/PARQUET-2386
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Atour Mousavi Gourabi
>Assignee: Atour Mousavi Gourabi
>Priority: Major
> Fix For: 1.14.0
>
>
> The code style conventions used in parquet-mr are generally inconsistent and 
> unenforced. We might want to consider using linters such as Spotless and a 
> more extensive .editorconfig configuration.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2386) More consistent code style in parquet-mr

2023-11-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17791516#comment-17791516
 ] 

ASF GitHub Bot commented on PARQUET-2386:
-

wgtmac merged PR #1209:
URL: https://github.com/apache/parquet-mr/pull/1209




> More consistent code style in parquet-mr
> 
>
> Key: PARQUET-2386
> URL: https://issues.apache.org/jira/browse/PARQUET-2386
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Atour Mousavi Gourabi
>Assignee: Atour Mousavi Gourabi
>Priority: Major
>
> The code style conventions used in parquet-mr are generally inconsistent and 
> unenforced. We might want to consider using linters such as Spotless and a 
> more extensive .editorconfig configuration.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2386) More consistent code style in parquet-mr

2023-11-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17791324#comment-17791324
 ] 

ASF GitHub Bot commented on PARQUET-2386:
-

amousavigourabi commented on PR #1209:
URL: https://github.com/apache/parquet-mr/pull/1209#issuecomment-1832709341

   I rebased on the latest master. I'd rather get this merged somewhat quickly 
(maybe by the end of the week?) so as to avoid blocking other merges and/or 
endless rebasing.




> More consistent code style in parquet-mr
> 
>
> Key: PARQUET-2386
> URL: https://issues.apache.org/jira/browse/PARQUET-2386
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Atour Mousavi Gourabi
>Assignee: Atour Mousavi Gourabi
>Priority: Major
>
> The code style conventions used in parquet-mr are generally inconsistent and 
> unenforced. We might want to consider using linters such as Spotless and a 
> more extensive .editorconfig configuration.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2171) Implement vectored IO in parquet file format

2023-11-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17791320#comment-17791320
 ] 

ASF GitHub Bot commented on PARQUET-2171:
-

Fokko commented on PR #1139:
URL: https://github.com/apache/parquet-mr/pull/1139#issuecomment-1832684057

   @steveloughran Can you fix the compatibility issue?
   
   ```
   Error:  Failed to execute goal 
com.github.siom79.japicmp:japicmp-maven-plugin:0.18.2:cmp (default) on project 
parquet-hadoop: There is at least one incompatibility: 
org.apache.parquet.hadoop.util.vectorio.BindingUtils.raiseInnerCause(java.util.concurrent.ExecutionException):CLASS_GENERIC_TEMPLATE_CHANGED
 -> [Help 1]
   ```




> Implement vectored IO in parquet file format
> 
>
> Key: PARQUET-2171
> URL: https://issues.apache.org/jira/browse/PARQUET-2171
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: Mukund Thakur
>Priority: Major
>
> We recently added a new feature called vectored IO in Hadoop for improving 
> read performance for seek-heavy readers. Spark jobs and others which use 
> Parquet will greatly benefit from this API. Details can be found here 
> [https://github.com/apache/hadoop/commit/e1842b2a749d79cbdc15c524515b9eda64c339d5]
> https://issues.apache.org/jira/browse/HADOOP-18103
> https://issues.apache.org/jira/browse/HADOOP-11867



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2386) More consistent code style in parquet-mr

2023-11-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17791300#comment-17791300
 ] 

ASF GitHub Bot commented on PARQUET-2386:
-

Fokko commented on PR #1209:
URL: https://github.com/apache/parquet-mr/pull/1209#issuecomment-1832643602

   @amousavigourabi I can see that it is not nice having to rebase all the 
time. I can hold off on mine; they are probably easier to resolve than yours.




> More consistent code style in parquet-mr
> 
>
> Key: PARQUET-2386
> URL: https://issues.apache.org/jira/browse/PARQUET-2386
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Atour Mousavi Gourabi
>Assignee: Atour Mousavi Gourabi
>Priority: Major
>
> The code style conventions used in parquet-mr are generally inconsistent and 
> unenforced. We might want to consider using linters such as Spotless and a 
> more extensive .editorconfig configuration.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2386) More consistent code style in parquet-mr

2023-11-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17791277#comment-17791277
 ] 

ASF GitHub Bot commented on PARQUET-2386:
-

amousavigourabi commented on PR #1209:
URL: https://github.com/apache/parquet-mr/pull/1209#issuecomment-1832537608

   @Fokko, how do we want to coordinate this pull request and your refactoring 
PRs? I am not keen on applying the editorconfig and Spotless a dozen or 
so times to resolve conflicts in this PR after each of those gets merged.




> More consistent code style in parquet-mr
> 
>
> Key: PARQUET-2386
> URL: https://issues.apache.org/jira/browse/PARQUET-2386
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Atour Mousavi Gourabi
>Assignee: Atour Mousavi Gourabi
>Priority: Major
>
> The code style conventions used in parquet-mr are generally inconsistent and 
> unenforced. We might want to consider using linters such as Spotless and a 
> more extensive .editorconfig configuration.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2386) More consistent code style in parquet-mr

2023-11-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17791272#comment-17791272
 ] 

ASF GitHub Bot commented on PARQUET-2386:
-

amousavigourabi commented on code in PR #1209:
URL: https://github.com/apache/parquet-mr/pull/1209#discussion_r1409749808


##
pom.xml:
##
@@ -512,6 +515,43 @@
 
   
 
+



> More consistent code style in parquet-mr
> 
>
> Key: PARQUET-2386
> URL: https://issues.apache.org/jira/browse/PARQUET-2386
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Atour Mousavi Gourabi
>Assignee: Atour Mousavi Gourabi
>Priority: Major
>
> The code style conventions used in parquet-mr are generally inconsistent and 
> unenforced. We might want to consider using linters such as Spotless and a 
> more extensive .editorconfig configuration.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2386) More consistent code style in parquet-mr

2023-11-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17791271#comment-17791271
 ] 

ASF GitHub Bot commented on PARQUET-2386:
-

amousavigourabi commented on code in PR #1209:
URL: https://github.com/apache/parquet-mr/pull/1209#discussion_r1409748314


##
pom.xml:
##
@@ -512,6 +515,43 @@
 
   
 
+



> More consistent code style in parquet-mr
> 
>
> Key: PARQUET-2386
> URL: https://issues.apache.org/jira/browse/PARQUET-2386
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Atour Mousavi Gourabi
>Assignee: Atour Mousavi Gourabi
>Priority: Major
>
> The code style conventions used in parquet-mr are generally inconsistent and 
> unenforced. We might want to consider using linters such as Spotless and a 
> more extensive .editorconfig configuration.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2398) Make static variables final

2023-11-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17791101#comment-17791101
 ] 

ASF GitHub Bot commented on PARQUET-2398:
-

Fokko opened a new pull request, #1221:
URL: https://github.com/apache/parquet-mr/pull/1221

   Make sure you have checked _all_ steps below.
   
   These variables should not change, therefore they should be final.
   
   ### Jira
   
   - [ ] My PR addresses the following [Parquet 
Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references 
them in the PR title. For example, "PARQUET-1234: My Parquet PR"
 - https://issues.apache.org/jira/browse/PARQUET-XXX
 - In case you are adding a dependency, check if the license complies with 
the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Tests
   
   - [ ] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   
   ### Commits
   
   - [ ] My commits all reference Jira issues in their subject lines. In 
addition, my commits follow the guidelines from "[How to write a good git 
commit message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain Javadoc that 
explain what it does
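
The rationale given in this PR description (static variables that never change should be `final`) can be illustrated with a short sketch. The names `StaticFinalSketch`, `DEFAULT_PAGE_SIZE` and `SUPPORTED_CODECS` are hypothetical and were not taken from the PR.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical example; the field names are not from parquet-mr. */
final class StaticFinalSketch {

  // Before: a mutable static could be reassigned from anywhere with access.
  // static int DEFAULT_PAGE_SIZE = 1024 * 1024;

  // After: final turns any reassignment into a compile-time error.
  static final int DEFAULT_PAGE_SIZE = 1024 * 1024;

  // final only fixes the reference, so the list is also wrapped
  // to keep its contents from being modified.
  static final List<String> SUPPORTED_CODECS =
      Collections.unmodifiableList(Arrays.asList("SNAPPY", "GZIP", "ZSTD"));

  public static void main(String[] args) {
    System.out.println(DEFAULT_PAGE_SIZE + " " + SUPPORTED_CODECS);
  }
}
```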
   




> Make static variables final
> ---
>
> Key: PARQUET-2398
> URL: https://issues.apache.org/jira/browse/PARQUET-2398
> Project: Parquet
>  Issue Type: Improvement
>Affects Versions: 1.13.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2396) Refactor `ColumnIndexBuilder`

2023-11-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17791079#comment-17791079
 ] 

ASF GitHub Bot commented on PARQUET-2396:
-

Fokko commented on code in PR #1219:
URL: https://github.com/apache/parquet-mr/pull/1219#discussion_r1409257114


##
parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java:
##
@@ -298,24 +295,22 @@ public > PrimitiveIterator.OfInt 
visit(NotEq notEq) {
 public > PrimitiveIterator.OfInt visit(In in) {
   Set values = in.getValues();
   IntSet matchingIndexesForNull = new IntOpenHashSet();  // for null
-  Iterator it = values.iterator();
-  while(it.hasNext()) {
-T value = it.next();
-if (value == null) {
-  if (nullCounts == null) {
-// Searching for nulls so if we don't have null related statistics 
we have to return all pages
-return IndexIterator.all(getPageCount());
-  } else {
-for (int i = 0; i < nullCounts.length; i++) {
-  if (nullCounts[i] > 0) {
-matchingIndexesForNull.add(i);
+  for (T value : values) {
+  if (value == null) {

Review Comment:
   Ah, good point. I've updated the PR. Thanks for catching this  
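
For readers without the full diff, the refactoring under review replaces an explicit `Iterator` loop with an enhanced `for` loop. The sketch below shows the two forms side by side on a simplified, hypothetical null-counting task (the class `ForEachRefactorSketch` and its methods are not from `ColumnIndexBuilder`), using the 2-space indentation requested in the review.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical, simplified example; not the ColumnIndexBuilder code. */
final class ForEachRefactorSketch {

  /** Explicit-iterator form, in the style of the removed lines. */
  static int countNullsWithIterator(Iterable<String> values) {
    int nulls = 0;
    Iterator<String> it = values.iterator();
    while (it.hasNext()) {
      String value = it.next();
      if (value == null) {
        nulls++;
      }
    }
    return nulls;
  }

  /** Enhanced-for form, in the style of the added lines; same behaviour. */
  static int countNullsWithForEach(Iterable<String> values) {
    int nulls = 0;
    for (String value : values) {
      if (value == null) {
        nulls++;
      }
    }
    return nulls;
  }

  public static void main(String[] args) {
    List<String> values = Arrays.asList("a", null, "b", null);
    System.out.println(countNullsWithIterator(values) == countNullsWithForEach(values));
  }
}
```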








> Refactor `ColumnIndexBuilder`
> -
>
> Key: PARQUET-2396
> URL: https://issues.apache.org/jira/browse/PARQUET-2396
> Project: Parquet
>  Issue Type: Improvement
>Affects Versions: 1.13.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2386) More consistent code style in parquet-mr

2023-11-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17791077#comment-17791077
 ] 

ASF GitHub Bot commented on PARQUET-2386:
-

Fokko commented on PR #1209:
URL: https://github.com/apache/parquet-mr/pull/1209#issuecomment-1831861024

   This is great @amousavigourabi. I was thinking of adding linting as well, so 
great work here!




> More consistent code style in parquet-mr
> 
>
> Key: PARQUET-2386
> URL: https://issues.apache.org/jira/browse/PARQUET-2386
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Atour Mousavi Gourabi
>Assignee: Atour Mousavi Gourabi
>Priority: Major
>
> The code style conventions used in parquet-mr are generally inconsistent and 
> unenforced. We might want to consider using linters such as Spotless and a 
> more extensive .editorconfig configuration.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2397) Make use of `isEmpty`

2023-11-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17791074#comment-17791074
 ] 

ASF GitHub Bot commented on PARQUET-2397:
-

Fokko merged PR #1220:
URL: https://github.com/apache/parquet-mr/pull/1220




> Make use of `isEmpty`
> -
>
> Key: PARQUET-2397
> URL: https://issues.apache.org/jira/browse/PARQUET-2397
> Project: Parquet
>  Issue Type: Improvement
>Affects Versions: 1.13.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2171) Implement vectored IO in parquet file format

2023-11-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17791061#comment-17791061
 ] 

ASF GitHub Bot commented on PARQUET-2171:
-

steveloughran commented on PR #1139:
URL: https://github.com/apache/parquet-mr/pull/1139#issuecomment-1831828739

   Code-wise, no, other than reviews from others about where things best 
belong, such as that awaitFuture stuff, or any other suggestions from people 
who know the Parquet codebase. The code works, and we have been testing it 
against Amazon S3 Express storage for extra speedup. To be ruthless: there's 
no point paying the premium for that until you've embraced the extra speedups 
you get from this first.




> Implement vectored IO in parquet file format
> 
>
> Key: PARQUET-2171
> URL: https://issues.apache.org/jira/browse/PARQUET-2171
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: Mukund Thakur
>Priority: Major
>
> We recently added a new feature called vectored IO in Hadoop for improving 
> read performance for seek-heavy readers. Spark jobs and others which use 
> Parquet will greatly benefit from this API. Details can be found here 
> [https://github.com/apache/hadoop/commit/e1842b2a749d79cbdc15c524515b9eda64c339d5]
> https://issues.apache.org/jira/browse/HADOOP-18103
> https://issues.apache.org/jira/browse/HADOOP-11867



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2386) More consistent code style in parquet-mr

2023-11-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17790960#comment-17790960
 ] 

ASF GitHub Bot commented on PARQUET-2386:
-

gszadovszky commented on code in PR #1209:
URL: https://github.com/apache/parquet-mr/pull/1209#discussion_r1408884415


##
pom.xml:
##
@@ -512,6 +515,43 @@
 
   
 
+



> More consistent code style in parquet-mr
> 
>
> Key: PARQUET-2386
> URL: https://issues.apache.org/jira/browse/PARQUET-2386
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Atour Mousavi Gourabi
>Assignee: Atour Mousavi Gourabi
>Priority: Major
>
> The code style conventions used in parquet-mr are generally inconsistent and 
> unenforced. We might want to consider using linters such as Spotless and a 
> more extensive .editorconfig configuration.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2396) Refactor `ColumnIndexBuilder`

2023-11-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17790918#comment-17790918
 ] 

ASF GitHub Bot commented on PARQUET-2396:
-

zhangjiashen commented on code in PR #1219:
URL: https://github.com/apache/parquet-mr/pull/1219#discussion_r1408788644


##
parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java:
##
@@ -298,24 +295,22 @@ public > PrimitiveIterator.OfInt 
visit(NotEq notEq) {
 public > PrimitiveIterator.OfInt visit(In in) {
   Set values = in.getValues();
   IntSet matchingIndexesForNull = new IntOpenHashSet();  // for null
-  Iterator it = values.iterator();
-  while(it.hasNext()) {
-T value = it.next();
-if (value == null) {
-  if (nullCounts == null) {
-// Searching for nulls so if we don't have null related statistics 
we have to return all pages
-return IndexIterator.all(getPageCount());
-  } else {
-for (int i = 0; i < nullCounts.length; i++) {
-  if (nullCounts[i] > 0) {
-matchingIndexesForNull.add(i);
+  for (T value : values) {
+  if (value == null) {

Review Comment:
   Nit: let's change the indentation to 2 spaces for consistency, and make 
similar changes below?





> Refactor `ColumnIndexBuilder`
> -
>
> Key: PARQUET-2396
> URL: https://issues.apache.org/jira/browse/PARQUET-2396
> Project: Parquet
>  Issue Type: Improvement
>Affects Versions: 1.13.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2396) Refactor `ColumnIndexBuilder`

2023-11-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17790917#comment-17790917
 ] 

ASF GitHub Bot commented on PARQUET-2396:
-

zhangjiashen commented on code in PR #1219:
URL: https://github.com/apache/parquet-mr/pull/1219#discussion_r1408788644


##
parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java:
##
@@ -298,24 +295,22 @@ public > PrimitiveIterator.OfInt 
visit(NotEq notEq) {
 public > PrimitiveIterator.OfInt visit(In in) {
   Set values = in.getValues();
   IntSet matchingIndexesForNull = new IntOpenHashSet();  // for null
-  Iterator it = values.iterator();
-  while(it.hasNext()) {
-T value = it.next();
-if (value == null) {
-  if (nullCounts == null) {
-// Searching for nulls so if we don't have null related statistics 
we have to return all pages
-return IndexIterator.all(getPageCount());
-  } else {
-for (int i = 0; i < nullCounts.length; i++) {
-  if (nullCounts[i] > 0) {
-matchingIndexesForNull.add(i);
+  for (T value : values) {
+  if (value == null) {

Review Comment:
   Nit: let's change the indentation to 2 spaces for consistency?





> Refactor `ColumnIndexBuilder`
> -
>
> Key: PARQUET-2396
> URL: https://issues.apache.org/jira/browse/PARQUET-2396
> Project: Parquet
>  Issue Type: Improvement
>Affects Versions: 1.13.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2386) More consistent code style in parquet-mr

2023-11-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17790840#comment-17790840
 ] 

ASF GitHub Bot commented on PARQUET-2386:
-

wgtmac commented on PR #1209:
URL: https://github.com/apache/parquet-mr/pull/1209#issuecomment-1831045581

   Thanks for the improvement!
   
   Could you please take a look at this? @shangxinli @gszadovszky @Fokko




> More consistent code style in parquet-mr
> 
>
> Key: PARQUET-2386
> URL: https://issues.apache.org/jira/browse/PARQUET-2386
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Atour Mousavi Gourabi
>Assignee: Atour Mousavi Gourabi
>Priority: Major
>
> The code style conventions used in parquet-mr are generally inconsistent and 
> unenforced. We might want to consider using linters such as Spotless and a 
> more extensive .editorconfig configuration.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2397) Make use of `isEmpty`

2023-11-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17790822#comment-17790822
 ] 

ASF GitHub Bot commented on PARQUET-2397:
-

Fokko opened a new pull request, #1220:
URL: https://github.com/apache/parquet-mr/pull/1220

   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [ ] My PR addresses the following [Parquet 
Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references 
them in the PR title. For example, "PARQUET-1234: My Parquet PR"
 - https://issues.apache.org/jira/browse/PARQUET-2397
 - In case you are adding a dependency, check if the license complies with 
the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Tests
   
   - [ ] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   
   ### Commits
   
   - [ ] My commits all reference Jira issues in their subject lines. In 
addition, my commits follow the guidelines from "[How to write a good git 
commit message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain Javadoc that 
explain what it does
   




> Make use of `isEmpty`
> -
>
> Key: PARQUET-2397
> URL: https://issues.apache.org/jira/browse/PARQUET-2397
> Project: Parquet
>  Issue Type: Improvement
>Affects Versions: 1.13.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>
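
For readers unfamiliar with the cleanup named in the issue title, here is a minimal sketch of the pattern. It is not taken from the PR; the class and method names are hypothetical. `isEmpty()` states the intent directly, and for some collections (for example `ConcurrentLinkedQueue`, whose `size()` is not constant-time) it can also be cheaper.

```java
import java.util.Collection;

// Hypothetical illustration only; not code from parquet-mr.
final class IsEmptyExample {
  // Before: emptiness is inferred by comparing the size against zero.
  static boolean hasNoPagesVerbose(Collection<Integer> pageIndexes) {
    return pageIndexes.size() == 0;
  }

  // After: isEmpty() expresses the same check without a size computation.
  static boolean hasNoPages(Collection<Integer> pageIndexes) {
    return pageIndexes.isEmpty();
  }
}
```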




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2396) Refactor `ColumnIndexBuilder`

2023-11-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17790819#comment-17790819
 ] 

ASF GitHub Bot commented on PARQUET-2396:
-

Fokko opened a new pull request, #1219:
URL: https://github.com/apache/parquet-mr/pull/1219

   Small refactor to improve readability
   




> Refactor `ColumnIndexBuilder`
> -
>
> Key: PARQUET-2396
> URL: https://issues.apache.org/jira/browse/PARQUET-2396
> Project: Parquet
>  Issue Type: Improvement
>Affects Versions: 1.13.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2395) Prefer `singletonList` over `asList`

2023-11-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17790815#comment-17790815
 ] 

ASF GitHub Bot commented on PARQUET-2395:
-

Fokko opened a new pull request, #1218:
URL: https://github.com/apache/parquet-mr/pull/1218





> Prefer `singletonList` over `asList`
> 
>
> Key: PARQUET-2395
> URL: https://issues.apache.org/jira/browse/PARQUET-2395
> Project: Parquet
>  Issue Type: Improvement
>Affects Versions: 1.13.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>
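
As a rough illustration of the difference the issue title refers to (a sketch with hypothetical names, not code from the PR): `Arrays.asList` wraps a varargs array in a fixed-size list whose elements can still be replaced, while `Collections.singletonList` returns an immutable one-element list without allocating an array.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; not code from parquet-mr.
final class SingletonListExample {
  // Before: allocates a one-element array and wraps it in a fixed-size list.
  static List<String> viaAsList(String column) {
    return Arrays.asList(column);
  }

  // After: an immutable single-element list; set() throws
  // UnsupportedOperationException.
  static List<String> viaSingletonList(String column) {
    return Collections.singletonList(column);
  }
}
```

The two lists compare equal, so the swap is safe wherever the caller does not rely on replacing the element through `set`.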




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2394) Use `computeIfAbsent` in `MessageColumnIO`

2023-11-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17790812#comment-17790812
 ] 

ASF GitHub Bot commented on PARQUET-2394:
-

Fokko opened a new pull request, #1217:
URL: https://github.com/apache/parquet-mr/pull/1217





> Use `computeIfAbsent` in `MessageColumnIO`
> --
>
> Key: PARQUET-2394
> URL: https://issues.apache.org/jira/browse/PARQUET-2394
> Project: Parquet
>  Issue Type: Improvement
>Affects Versions: 1.13.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>
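
The idiom in the issue title can be sketched as follows. This is a generic example with hypothetical names and types, not the actual change to `MessageColumnIO`: `Map.computeIfAbsent` replaces the get, null check, and put sequence with a single call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not code from parquet-mr.
final class ComputeIfAbsentExample {
  // Before: look up, create on a miss, store, then use.
  static void addVerbose(Map<String, List<Integer>> index, String key, int value) {
    List<Integer> values = index.get(key);
    if (values == null) {
      values = new ArrayList<>();
      index.put(key, values);
    }
    values.add(value);
  }

  // After: computeIfAbsent creates and stores the list only when the key is missing.
  static void add(Map<String, List<Integer>> index, String key, int value) {
    index.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
  }
}
```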




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2393) Make `ColumnIOCreatorVisitor` static

2023-11-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17790810#comment-17790810
 ] 

ASF GitHub Bot commented on PARQUET-2393:
-

Fokko opened a new pull request, #1216:
URL: https://github.com/apache/parquet-mr/pull/1216





> Make `ColumnIOCreatorVisitor` static
> 
>
> Key: PARQUET-2393
> URL: https://issues.apache.org/jira/browse/PARQUET-2393
> Project: Parquet
>  Issue Type: Improvement
>Affects Versions: 1.13.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>
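
A short sketch of why a nested class is often made static (hypothetical names; the real `ColumnIOCreatorVisitor` is not reproduced here): a non-static inner class carries a hidden reference to its enclosing instance, whereas a static nested class does not and can be created without one.

```java
// Hypothetical illustration only; not code from parquet-mr.
final class NestedClassExample {
  // Inner class: every instance keeps an implicit reference to the
  // enclosing NestedClassExample, even though visit() never uses it.
  final class InnerVisitor {
    int visit(int depth) {
      return depth + 1;
    }
  }

  // Static nested class: no enclosing instance is required or retained.
  static final class StaticVisitor {
    int visit(int depth) {
      return depth + 1;
    }
  }
}
```

Creating the inner variant requires an outer object (`outer.new InnerVisitor()`), while the static variant is created with `new NestedClassExample.StaticVisitor()`.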




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2392) Remove StringBuilder in `LogicalTypeAnnotation`

2023-11-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17790807#comment-17790807
 ] 

ASF GitHub Bot commented on PARQUET-2392:
-

Fokko opened a new pull request, #1215:
URL: https://github.com/apache/parquet-mr/pull/1215

   StringBuilder only makes sense when you concatenate in a loop.




> Remove StringBuilder in `LogicalTypeAnnotation`
> ---
>
> Key: PARQUET-2392
> URL: https://issues.apache.org/jira/browse/PARQUET-2392
> Project: Parquet
>  Issue Type: Improvement
>Affects Versions: 1.13.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>
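
The PR description's point about loops can be sketched like this (hypothetical names, not the code from `LogicalTypeAnnotation`): for a single expression, plain `+` is compiled into efficient concatenation by `javac` anyway, so an explicit `StringBuilder` only adds noise. Inside a loop, an explicit builder avoids creating a new `String` on every iteration.

```java
import java.util.List;

// Hypothetical illustration only; not code from parquet-mr.
final class ConcatExample {
  // Before: explicit builder for a one-off concatenation.
  static String describeVerbose(String name, int precision, int scale) {
    StringBuilder sb = new StringBuilder();
    sb.append(name).append("(").append(precision).append(",").append(scale).append(")");
    return sb.toString();
  }

  // After: a single expression reads more clearly and builds the same string.
  static String describe(String name, int precision, int scale) {
    return name + "(" + precision + "," + scale + ")";
  }

  // A loop is where an explicit StringBuilder still pays off.
  static String joinAll(List<String> parts) {
    StringBuilder sb = new StringBuilder();
    for (String part : parts) {
      sb.append(part);
    }
    return sb.toString();
  }
}
```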




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2391) Remove unnecessary unboxing

2023-11-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17790805#comment-17790805
 ] 

ASF GitHub Bot commented on PARQUET-2391:
-

Fokko opened a new pull request, #1214:
URL: https://github.com/apache/parquet-mr/pull/1214





> Remove unnecessary unboxing
> ---
>
> Key: PARQUET-2391
> URL: https://issues.apache.org/jira/browse/PARQUET-2391
> Project: Parquet
>  Issue Type: Improvement
>Affects Versions: 1.13.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>
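
A minimal sketch of what "unnecessary unboxing" usually means (hypothetical names, not code from the PR): calling `intValue()` explicitly where the compiler would unbox automatically. Both forms throw a `NullPointerException` on a `null` element, so the shorter one does not change behavior.

```java
import java.util.List;

// Hypothetical illustration only; not code from parquet-mr.
final class UnboxingExample {
  // Before: explicit unboxing via intValue().
  static int sumVerbose(List<Integer> counts) {
    int total = 0;
    for (Integer count : counts) {
      total += count.intValue();
    }
    return total;
  }

  // After: the compiler inserts the same unboxing call automatically.
  static int sum(List<Integer> counts) {
    int total = 0;
    for (int count : counts) {
      total += count;
    }
    return total;
  }
}
```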




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2390) Replace anonymous functions with lambdas

2023-11-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17790800#comment-17790800
 ] 

ASF GitHub Bot commented on PARQUET-2390:
-

Fokko opened a new pull request, #1213:
URL: https://github.com/apache/parquet-mr/pull/1213

   They are easier to read
   




> Replace anonymous functions with lambdas
> --
>
> Key: PARQUET-2390
> URL: https://issues.apache.org/jira/browse/PARQUET-2390
> Project: Parquet
>  Issue Type: Improvement
>Affects Versions: 1.13.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>
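
To illustrate the readability claim in the PR description, here is a generic sketch (hypothetical names, not code from the PR) of the same comparator written as an anonymous class and as a lambda. The rewrite applies to any interface with a single abstract method.

```java
import java.util.Comparator;

// Hypothetical illustration only; not code from parquet-mr.
final class LambdaExample {
  // Before: anonymous class with the usual ceremony.
  static Comparator<String> byLengthAnonymous() {
    return new Comparator<String>() {
      @Override
      public int compare(String a, String b) {
        return Integer.compare(a.length(), b.length());
      }
    };
  }

  // After: the same comparison as a lambda.
  static Comparator<String> byLengthLambda() {
    return (a, b) -> Integer.compare(a.length(), b.length());
  }
}
```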




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2389) Remove redundant initializers

2023-11-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17790797#comment-17790797
 ] 

ASF GitHub Bot commented on PARQUET-2389:
-

Fokko opened a new pull request, #1212:
URL: https://github.com/apache/parquet-mr/pull/1212

   Just some cleanup
   




> Remove redundant initializers
> -
>
> Key: PARQUET-2389
> URL: https://issues.apache.org/jira/browse/PARQUET-2389
> Project: Parquet
>  Issue Type: Improvement
>Affects Versions: 1.13.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>
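
A sketch of the kind of cleanup the title describes (hypothetical classes, not code from the PR): Java fields already start out as `0`, `false`, or `null`, so spelling those values out is redundant. This applies to fields only; local variables have no default and must be assigned before use.

```java
// Hypothetical illustration only; not code from parquet-mr.
final class InitializerExample {
  // Before: each initializer repeats the value the field would get anyway.
  static final class Verbose {
    int count = 0;
    boolean closed = false;
    String name = null;
  }

  // After: the fields rely on the language's default initial values.
  static final class Concise {
    int count;
    boolean closed;
    String name;
  }
}
```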




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2388) Deprecate `CHARSETS` on `PlainValuesWriter`

2023-11-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17790795#comment-17790795
 ] 

ASF GitHub Bot commented on PARQUET-2388:
-

Fokko opened a new pull request, #1211:
URL: https://github.com/apache/parquet-mr/pull/1211

   Not being used
   




> Deprecate `CHARSETS` on `PlainValuesWriter`
> ---
>
> Key: PARQUET-2388
> URL: https://issues.apache.org/jira/browse/PARQUET-2388
> Project: Parquet
>  Issue Type: Improvement
>Affects Versions: 1.13.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2387) Simplify `hasFieldsIgnored` expression

2023-11-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17790793#comment-17790793
 ] 

ASF GitHub Bot commented on PARQUET-2387:
-

Fokko opened a new pull request, #1210:
URL: https://github.com/apache/parquet-mr/pull/1210





> Simplify `hasFieldsIgnored` expression
> --
>
> Key: PARQUET-2387
> URL: https://issues.apache.org/jira/browse/PARQUET-2387
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-thrift
>Affects Versions: 1.13.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2344) Bump to Thrift 0.19.0

2023-11-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17790791#comment-17790791
 ] 

ASF GitHub Bot commented on PARQUET-2344:
-

Fokko commented on PR #1192:
URL: https://github.com/apache/parquet-mr/pull/1192#issuecomment-1830859619

   @wgtmac Thanks for splitting out the format upgrade. Always a good idea to 
make PRs smaller.
   
   I finally fixed all the tests, and this looks good to go to me  




> Bump to Thrift 0.19.0
> -
>
> Key: PARQUET-2344
> URL: https://issues.apache.org/jira/browse/PARQUET-2344
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format, parquet-mr
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: format-2.10.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2384) Mark toOriginalType as deprecated

2023-11-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17790790#comment-17790790
 ] 

ASF GitHub Bot commented on PARQUET-2384:
-

Fokko merged PR #1202:
URL: https://github.com/apache/parquet-mr/pull/1202




> Mark toOriginalType as deprecated
> -
>
> Key: PARQUET-2384
> URL: https://issues.apache.org/jira/browse/PARQUET-2384
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2384) Mark toOriginalType as deprecated

2023-11-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17790789#comment-17790789
 ] 

ASF GitHub Bot commented on PARQUET-2384:
-

Fokko commented on PR #1202:
URL: https://github.com/apache/parquet-mr/pull/1202#issuecomment-1830855336

   Thanks for the review @wgtmac 




> Mark toOriginalType as deprecated
> -
>
> Key: PARQUET-2384
> URL: https://issues.apache.org/jira/browse/PARQUET-2384
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2386) More consistent code style in parquet-mr

2023-11-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17790682#comment-17790682
 ] 

ASF GitHub Bot commented on PARQUET-2386:
-

amousavigourabi commented on PR #1209:
URL: https://github.com/apache/parquet-mr/pull/1209#issuecomment-1830328306

   The `.editorconfig` has been expanded for IntelliJ and is mostly compliant 
with the Spotless configuration. IntelliJ refactoring and Spotless sometimes 
disagree slightly on continuation indents, which cannot really be resolved at 
the moment. Because Spotless is included in the Maven lifecycle, its 
configuration would take precedence.




> More consistent code style in parquet-mr
> 
>
> Key: PARQUET-2386
> URL: https://issues.apache.org/jira/browse/PARQUET-2386
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Atour Mousavi Gourabi
>Assignee: Atour Mousavi Gourabi
>Priority: Major
>
> The code style conventions used in parquet-mr are generally inconsistent and 
> unenforced. We might want to consider using linters such as Spotless and a 
> more extensive .editorconfig configuration.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2386) More consistent code style in parquet-mr

2023-11-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17790681#comment-17790681
 ] 

ASF GitHub Bot commented on PARQUET-2386:
-

amousavigourabi commented on PR #1209:
URL: https://github.com/apache/parquet-mr/pull/1209#issuecomment-1830320974

   @wgtmac 




> More consistent code style in parquet-mr
> 
>
> Key: PARQUET-2386
> URL: https://issues.apache.org/jira/browse/PARQUET-2386
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Atour Mousavi Gourabi
>Assignee: Atour Mousavi Gourabi
>Priority: Major
>
> The code style conventions used in parquet-mr are generally inconsistent and 
> unenforced. We might want to consider using linters such as Spotless and a 
> more extensive .editorconfig configuration.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2386) More consistent code style in parquet-mr

2023-11-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17790680#comment-17790680
 ] 

ASF GitHub Bot commented on PARQUET-2386:
-

amousavigourabi opened a new pull request, #1209:
URL: https://github.com/apache/parquet-mr/pull/1209

   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [x] My PR addresses the following [Parquet 
Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references 
them in the PR title. For example, "PARQUET-1234: My Parquet PR"
 - https://issues.apache.org/jira/browse/PARQUET-XXX
 - In case you are adding a dependency, check if the license complies with 
the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Tests
   
   - [x] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   - This PR only refactors style; no logic is added or removed in any way, 
shape, or form
   
   ### Commits
   
   - [x] My commits all reference Jira issues in their subject lines. In 
addition, my commits follow the guidelines from "[How to write a good git 
commit message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [x] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain Javadoc that 
explain what it does
 - Adds note in the PR template on the style checks
   
   ---
   
   This PR contains two commits: the first adds the style checks and 
configurations, and the second applies them to the repository.




> More consistent code style in parquet-mr
> 
>
> Key: PARQUET-2386
> URL: https://issues.apache.org/jira/browse/PARQUET-2386
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Atour Mousavi Gourabi
>Assignee: Atour Mousavi Gourabi
>Priority: Major
>
> The code style conventions used in parquet-mr are generally inconsistent and 
> unenforced. We might want to consider using linters such as Spotless and a 
> more extensive .editorconfig configuration.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2385) Don't initialize CodecFactory in ParquetWriter

2023-11-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17790129#comment-17790129
 ] 

ASF GitHub Bot commented on PARQUET-2385:
-

amousavigourabi commented on PR #1203:
URL: https://github.com/apache/parquet-mr/pull/1203#issuecomment-1828048073

   > > @wgtmac, after the comments on the code style I had a quick look 
through the repository and found that indentation and the like differ quite 
drastically between (and even within) files. ParquetWriter indents its 
constructor arguments by four spaces, whereas ParquetReader indents them by 14 
spaces. Would you be open to a more extensive `.editorconfig` and an 
easy-to-use linter such as Spotless?
   > 
   > Yes, I think it would be good to make the style consistent across all 
files automatically.
   
   I'll get started on expanding the `.editorconfig` for at least IntelliJ and 
setting up a compatible Spotless configuration then.




> Don't initialize CodecFactory in ParquetWriter
> --
>
> Key: PARQUET-2385
> URL: https://issues.apache.org/jira/browse/PARQUET-2385
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Atour Mousavi Gourabi
>Priority: Major
>
> In ParquetWriter we initialize a CodecFactory. Instead, we should allow users 
> to set their own via the builder, to provide a little more flexibility 
> (analogous to PARQUET-2282).
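
The description above asks for the CodecFactory to be supplied through the builder instead of being created inside the writer. The following is a minimal sketch of that pattern under stated assumptions: `WriterBuilderSketch`, its nested `CodecFactory`, `Writer` and `Builder` types, and the `withCodecFactory` method are hypothetical stand-ins for illustration, not the parquet-mr API.

```java
import java.util.Objects;

// Hypothetical sketch: none of these types are the real parquet-mr classes.
final class WriterBuilderSketch {
  private WriterBuilderSketch() {}

  /** Stand-in for a codec factory; the real interface is not modelled here. */
  interface CodecFactory {
    String describe();
  }

  /** Stand-in writer that receives its factory instead of constructing one. */
  static final class Writer {
    private final CodecFactory codecFactory;

    private Writer(CodecFactory codecFactory) {
      this.codecFactory = codecFactory;
    }

    CodecFactory codecFactory() {
      return codecFactory;
    }
  }

  static final class Builder {
    // null means "no factory supplied, fall back to the default".
    private CodecFactory codecFactory;

    Builder withCodecFactory(CodecFactory codecFactory) {
      this.codecFactory = Objects.requireNonNull(codecFactory, "codecFactory");
      return this;
    }

    Writer build() {
      // Only create a default factory when the caller did not provide one.
      CodecFactory effective = codecFactory != null ? codecFactory : () -> "default";
      return new Writer(effective);
    }
  }
}
```

With this shape, `new WriterBuilderSketch.Builder().build()` keeps the existing default behaviour, while callers who need a custom factory pass it through `withCodecFactory` before calling `build()`.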





[jira] [Commented] (PARQUET-2385) Don't initialize CodecFactory in ParquetWriter

2023-11-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17790122#comment-17790122
 ] 

ASF GitHub Bot commented on PARQUET-2385:
-

wgtmac commented on PR #1203:
URL: https://github.com/apache/parquet-mr/pull/1203#issuecomment-1828031389

   cc @Fokko 




> Don't initialize CodecFactory in ParquetWriter
> --
>
> Key: PARQUET-2385
> URL: https://issues.apache.org/jira/browse/PARQUET-2385
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Atour Mousavi Gourabi
>Priority: Major
>
> In ParquetWriter we initialize a CodecFactory. Instead, we should allow users 
> to set their own via the builder, to provide a little more flexibility 
> (analogous to PARQUET-2282).





[jira] [Commented] (PARQUET-2385) Don't initialize CodecFactory in ParquetWriter

2023-11-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17790121#comment-17790121
 ] 

ASF GitHub Bot commented on PARQUET-2385:
-

wgtmac commented on PR #1203:
URL: https://github.com/apache/parquet-mr/pull/1203#issuecomment-1828030327

   > @wgtmac, after the comments on the code style I had a quick look through 
the repository and found that indentation and the like differ quite 
drastically between (and even within) files. ParquetWriter indents its 
constructor arguments by four spaces, whereas ParquetReader indents them by 14 
spaces. Would you be open to a more extensive `.editorconfig` and an 
easy-to-use linter such as Spotless?
   
   Yes, I think it would be good to make the style consistent across all 
files automatically.




> Don't initialize CodecFactory in ParquetWriter
> --
>
> Key: PARQUET-2385
> URL: https://issues.apache.org/jira/browse/PARQUET-2385
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Atour Mousavi Gourabi
>Priority: Major
>
> In ParquetWriter we initialize a CodecFactory. Instead, we should allow users 
> to set their own via the builder, to provide a little more flexibility 
> (analogous to PARQUET-2282).





[jira] [Commented] (PARQUET-2385) Don't initialize CodecFactory in ParquetWriter

2023-11-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17790097#comment-17790097
 ] 

ASF GitHub Bot commented on PARQUET-2385:
-

amousavigourabi commented on PR #1203:
URL: https://github.com/apache/parquet-mr/pull/1203#issuecomment-1827942202

   @wgtmac, after the comments on the code style I had a quick look through 
the repository and found that indentation and the like differ quite 
drastically between (and even within) files. ParquetWriter indents its 
constructor arguments by four spaces, whereas ParquetReader indents them by 14 
spaces. Would you be open to a more extensive `.editorconfig` and an 
easy-to-use linter such as Spotless?




> Don't initialize CodecFactory in ParquetWriter
> --
>
> Key: PARQUET-2385
> URL: https://issues.apache.org/jira/browse/PARQUET-2385
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Atour Mousavi Gourabi
>Priority: Major
>
> In ParquetWriter we initialize a CodecFactory. Instead, we should allow users 
> to set their own via the builder, to provide a little more flexibility 
> (analogous to PARQUET-2282).





[jira] [Commented] (PARQUET-2373) Improve I/O performance with bloom_filter_length

2023-11-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17789930#comment-17789930
 ] 

ASF GitHub Bot commented on PARQUET-2373:
-

mapleFU commented on PR #1184:
URL: https://github.com/apache/parquet-mr/pull/1184#issuecomment-1827228087

   FYI, I've updated a BloomFilter with length for testing: 
https://github.com/apache/parquet-testing/pull/43




> Improve I/O performance with bloom_filter_length
> 
>
> Key: PARQUET-2373
> URL: https://issues.apache.org/jira/browse/PARQUET-2373
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Jiashen Zhang
>Priority: Minor
>
> The spec PARQUET-2257 has added bloom_filter_length for readers to load the 
> bloom filter in a single shot. This implementation alters the code to make 
> use of the 'bloom_filter_length' field for loading the bloom filter 
> (consisting of the header and bitset) in order to enhance I/O scheduling.





[jira] [Commented] (PARQUET-2373) Improve I/O performance with bloom_filter_length

2023-11-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17789926#comment-17789926
 ] 

ASF GitHub Bot commented on PARQUET-2373:
-

zhangjiashen commented on code in PR #1184:
URL: https://github.com/apache/parquet-mr/pull/1184#discussion_r1396702452


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java:
##
@@ -1347,11 +1348,24 @@ public BloomFilter readBloomFilter(ColumnChunkMetaData 
meta) throws IOException
   }
 }
 
-// Read Bloom filter data header.
+// Seek to Bloom filter offset.
 f.seek(bloomFilterOffset);
+
+// Read Bloom filter length.
+int bloomFilterLength = meta.getBloomFilterLength();
+
+// If it is set, read Bloom filter header and bitset together.
+// Otherwise, read Bloom filter header first and then bitset.
+InputStream in = null;
+if (bloomFilterLength > 0) {
+  byte[] headerAndBitSet = new byte[bloomFilterLength];
+  f.readFully(headerAndBitSet);
+  in = new ByteArrayInputStream(headerAndBitSet);
+}
+
 BloomFilterHeader bloomFilterHeader;
 try {
-  bloomFilterHeader = Util.readBloomFilterHeader(f, bloomFilterDecryptor, 
bloomFilterHeaderAAD);
+  bloomFilterHeader = Util.readBloomFilterHeader(in != null ? in : f, 
bloomFilterDecryptor, bloomFilterHeaderAAD);

Review Comment:
   It would make the code more complex to read if we separated these into two 
methods. I changed the code a little to avoid several checks; please check 
whether it makes sense.
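
The single-read idea in the hunk above can be sketched independently of parquet-mr. This is a minimal sketch under stated assumptions: `BloomFilterReadSketch` and `openBloomFilterStream` are hypothetical names, `DataInputStream` stands in for the seekable file stream, and header decoding and decryption are left out. It only shows the choice between buffering header and bitset in one read and falling back to the original stream.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical sketch, not the ParquetFileReader implementation.
final class BloomFilterReadSketch {
  private BloomFilterReadSketch() {}

  /**
   * Returns the stream from which the Bloom filter header and bitset should be decoded.
   *
   * @param source stream already positioned at the Bloom filter offset
   * @param bloomFilterLength total size of header plus bitset, or a non-positive
   *     value when the length was not recorded in the file metadata
   */
  static InputStream openBloomFilterStream(DataInputStream source, int bloomFilterLength) {
    if (bloomFilterLength <= 0) {
      // Length unknown: the caller reads the header first and then the bitset.
      return source;
    }
    try {
      // Length known: fetch header and bitset together in a single read.
      byte[] headerAndBitSet = new byte[bloomFilterLength];
      source.readFully(headerAndBitSet);
      return new ByteArrayInputStream(headerAndBitSet);
    } catch (IOException e) {
      // Wrapped only to keep the sketch short; real code would propagate IOException.
      throw new UncheckedIOException(e);
    }
  }
}
```

Returning one stream for both cases lets the decoding code that follows stay the same whether or not the length is present, which is the same effect the `in != null ? in : f` expression has in the hunk.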





> Improve I/O performance with bloom_filter_length
> 
>
> Key: PARQUET-2373
> URL: https://issues.apache.org/jira/browse/PARQUET-2373
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Jiashen Zhang
>Priority: Minor
>
> The spec PARQUET-2257 has added bloom_filter_length for readers to load the 
> bloom filter in a single shot. This implementation alters the code to make 
> use of the 'bloom_filter_length' field for loading the bloom filter 
> (consisting of the header and bitset) in order to enhance I/O scheduling.





[jira] [Commented] (PARQUET-1647) [Java] support for Arrow's float16

2023-11-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17789925#comment-17789925
 ] 

ASF GitHub Bot commented on PARQUET-1647:
-

zhangjiashen commented on PR #1142:
URL: https://github.com/apache/parquet-mr/pull/1142#issuecomment-1827206658

   > @zhangjiashen This can be rebased to adopt parquet-format 2.10.0
   
   @wgtmac I just rebased onto the master branch; could you please take a look 
when you get a chance?




> [Java] support for Arrow's float16
> --
>
> Key: PARQUET-1647
> URL: https://issues.apache.org/jira/browse/PARQUET-1647
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format, parquet-thrift
>Reporter: The Alchemist
>Priority: Minor
>
> h2. DESCRIPTION
>  
> I'm wondering if there's any interest in supporting Arrow's {{float16}} type 
> in Parquet.
> There seem to be one or two {{float16}} / {{halffloat}} tickets here (e.g., 
> PARQUET-1403) but nothing that speaks to adding half-float support to Parquet 
> in general.
>  
> h2. PLANS
> I'm able to spend some time on this, if someone points me in the right 
> direction.
>  
>  # Add the {{HALFFLOAT}} or {{FLOAT16}} enum (any preferred naming 
> convention?) to 
> [https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L32]
>  # Add {{HALFFLOAT}} to {{org.apache.parquet.schema.PrimitiveType}}
>  # Add {{HALFFLOAT}} support to 
> {{org.apache.parquet.arrow.schema.SchemaConverter}}
>  # Add encoding for new type at {{org.apache.parquet.column.Encoding}}
>  # ??
> If anyone has any interest in this, pointers, or comments, they would be 
> greatly appreciated!
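
For readers unfamiliar with the type under discussion, the following is a minimal sketch of decoding an IEEE 754 half-precision (binary16) value into a Java `float`. `HalfFloatSketch` and `halfBitsToFloat` are hypothetical names used only for illustration; how Parquet would store or annotate such values is not covered by the issue text above and is not assumed here.

```java
// Hypothetical helper: decodes IEEE 754 binary16 bit patterns.
final class HalfFloatSketch {
  private HalfFloatSketch() {}

  /** Decodes the 16 bits of an IEEE 754 binary16 value into a float. */
  static float halfBitsToFloat(short bits) {
    int h = bits & 0xFFFF;            // treat the short as unsigned
    int sign = (h >>> 15) & 0x1;      // 1 sign bit
    int exponent = (h >>> 10) & 0x1F; // 5 exponent bits, bias 15
    int mantissa = h & 0x3FF;         // 10 fraction bits

    float magnitude;
    if (exponent == 0) {
      // Zero or subnormal: value = mantissa * 2^-24.
      magnitude = mantissa * (1.0f / (1 << 24));
    } else if (exponent == 0x1F) {
      // All-ones exponent encodes infinity (zero fraction) or NaN.
      magnitude = (mantissa == 0) ? Float.POSITIVE_INFINITY : Float.NaN;
    } else {
      // Normal: value = (1 + mantissa / 1024) * 2^(exponent - 15).
      magnitude = (float) Math.scalb(1.0 + mantissa / 1024.0, exponent - 15);
    }
    return sign == 0 ? magnitude : -magnitude;
  }
}
```

Every finite binary16 value is exactly representable as a `float`, so the widening direction shown here is lossless; the reverse direction needs a rounding rule and is left out of this sketch.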





[jira] [Commented] (PARQUET-2373) Improve I/O performance with bloom_filter_length

2023-11-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17789924#comment-17789924
 ] 

ASF GitHub Bot commented on PARQUET-2373:
-

zhangjiashen commented on PR #1184:
URL: https://github.com/apache/parquet-mr/pull/1184#issuecomment-1827205946

   > @zhangjiashen This can be rebased to adopt parquet-format 2.10.0
   
   @wgtmac I just rebased onto the master branch; could you please take a look 
when you get a chance?




> Improve I/O performance with bloom_filter_length
> 
>
> Key: PARQUET-2373
> URL: https://issues.apache.org/jira/browse/PARQUET-2373
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Jiashen Zhang
>Priority: Minor
>
> The spec PARQUET-2257 has added bloom_filter_length for readers to load the 
> bloom filter in a single shot. This implementation alters the code to make 
> use of the 'bloom_filter_length' field for loading the bloom filter 
> (consisting of the header and bitset) in order to enhance I/O scheduling.





[jira] [Commented] (PARQUET-2042) Unwrap common Protobuf wrappers and logical Timestamps, Date, TimeOfDay

2023-11-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17789880#comment-17789880
 ] 

ASF GitHub Bot commented on PARQUET-2042:
-

shangxinli merged PR #900:
URL: https://github.com/apache/parquet-mr/pull/900




> Unwrap common Protobuf wrappers and logical Timestamps, Date, TimeOfDay
> ---
>
> Key: PARQUET-2042
> URL: https://issues.apache.org/jira/browse/PARQUET-2042
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-protobuf
>Reporter: Michael Wong
>Priority: Major
>
> Related to https://issues.apache.org/jira/browse/PARQUET-1595





[jira] [Commented] (PARQUET-2042) Unwrap common Protobuf wrappers and logical Timestamps, Date, TimeOfDay

2023-11-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17789881#comment-17789881
 ] 

ASF GitHub Bot commented on PARQUET-2042:
-

shangxinli commented on PR #900:
URL: https://github.com/apache/parquet-mr/pull/900#issuecomment-1827000412

   Merged




> Unwrap common Protobuf wrappers and logical Timestamps, Date, TimeOfDay
> ---
>
> Key: PARQUET-2042
> URL: https://issues.apache.org/jira/browse/PARQUET-2042
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-protobuf
>Reporter: Michael Wong
>Priority: Major
>
> Related to https://issues.apache.org/jira/browse/PARQUET-1595





[jira] [Commented] (PARQUET-2385) Don't initialize CodecFactory in ParquetWriter

2023-11-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17789879#comment-17789879
 ] 

ASF GitHub Bot commented on PARQUET-2385:
-

wgtmac commented on code in PR #1203:
URL: https://github.com/apache/parquet-mr/pull/1203#discussion_r1405536183


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java:
##
@@ -303,6 +303,32 @@ public ParquetWriter(Path file, Configuration conf, 
WriteSupport writeSupport
   int maxPaddingSize,
   ParquetProperties encodingProps,
   FileEncryptionProperties encryptionProperties) throws IOException {
+this(
+  file,
+  mode,
+  writeSupport,
+  compressionCodecName,
+  new CodecFactory(conf, encodingProps.getPageSizeThreshold()),
+  rowGroupSize,
+  validating,
+  conf,
+  maxPaddingSize,
+  encodingProps,
+  encryptionProperties);
+  }
+
+  ParquetWriter(
+OutputFile file,

Review Comment:
   It seems constructors are indented by 4 spaces elsewhere.



##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java:
##
@@ -321,17 +347,17 @@ public ParquetWriter(Path file, Configuration conf, 
WriteSupport writeSupport
   encodingProps.getPageWriteChecksumEnabled(), encryptionProperties);
 fileWriter.start();
 
-this.codecFactory = new CodecFactory(conf, 
encodingProps.getPageSizeThreshold());
+this.codecFactory = codecFactory;
 CompressionCodecFactory.BytesInputCompressor compressor = 
codecFactory.getCompressor(compressionCodecName);
 this.writer = new InternalParquetRecordWriter(
-fileWriter,
-writeSupport,
-schema,
-writeContext.getExtraMetaData(),
-rowGroupSize,
-compressor,
-validating,
-encodingProps);
+  fileWriter,

Review Comment:
   Could you please revert the irrelevant style change?





> Don't initialize CodecFactory in ParquetWriter
> --
>
> Key: PARQUET-2385
> URL: https://issues.apache.org/jira/browse/PARQUET-2385
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Atour Mousavi Gourabi
>Priority: Major
>
> In ParquetWriter we initialize a CodecFactory. Instead, we should allow users 
> to set their own via the builder, to provide a little more flexibility 
> (analogous to PARQUET-2282).





[jira] [Commented] (PARQUET-2042) Unwrap common Protobuf wrappers and logical Timestamps, Date, TimeOfDay

2023-11-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17789870#comment-17789870
 ] 

ASF GitHub Bot commented on PARQUET-2042:
-

mwong38 commented on PR #900:
URL: https://github.com/apache/parquet-mr/pull/900#issuecomment-1826944950

   > I just noticed this PR and am sorry to see it has not been merged. 
@mwong38 Could you try rebasing it one last time? Thanks!
   
   Done. I really hope it's the last time.




> Unwrap common Protobuf wrappers and logical Timestamps, Date, TimeOfDay
> ---
>
> Key: PARQUET-2042
> URL: https://issues.apache.org/jira/browse/PARQUET-2042
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-protobuf
>Reporter: Michael Wong
>Priority: Major
>
> Related to https://issues.apache.org/jira/browse/PARQUET-1595





[jira] [Commented] (PARQUET-2385) Don't initialize CodecFactory in ParquetWriter

2023-11-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17789508#comment-17789508
 ] 

ASF GitHub Bot commented on PARQUET-2385:
-

amousavigourabi opened a new pull request, #1203:
URL: https://github.com/apache/parquet-mr/pull/1203

   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [ ] My PR addresses the following [Parquet 
Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references 
them in the PR title. For example, "PARQUET-1234: My Parquet PR"
 - https://issues.apache.org/jira/browse/PARQUET-XXX
 - In case you are adding a dependency, check if the license complies with 
the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Tests
   
   - [ ] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   
   ### Commits
   
   - [ ] My commits all reference Jira issues in their subject lines. In 
addition, my commits follow the guidelines from "[How to write a good git 
commit message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain Javadoc that 
explain what it does
   




> Don't initialize CodecFactory in ParquetWriter
> --
>
> Key: PARQUET-2385
> URL: https://issues.apache.org/jira/browse/PARQUET-2385
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Atour Mousavi Gourabi
>Priority: Major
>
> In ParquetWriter we initialize a CodecFactory. Instead, we should allow users 
> to set their own via the builder, to provide a little more flexibility 
> (analogous to PARQUET-2282).




