[GitHub] [parquet-format] mapleFU commented on pull request #190: Minor: add FIXED_LEN_BYTE_ARRAY under Types in doc

2023-02-07 Thread via GitHub


mapleFU commented on PR #190:
URL: https://github.com/apache/parquet-format/pull/190#issuecomment-1422061712

   @pitrou I think this is great; would you mind taking a look?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization

2023-02-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17685675#comment-17685675
 ] 

ASF GitHub Bot commented on PARQUET-2159:
-

sunchao commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1099630871


##
README.md:
##
@@ -83,6 +83,16 @@ Parquet is a very active project, and new features are being 
added quickly. Here
 * Column stats
 * Delta encoding
 * Index pages
+* Java Vector API support
+
+## Java Vector API support

Review Comment:
   Ah, thanks! This looks promising, and I'm looking forward to the Spark PR!





> Parquet bit-packing de/encode optimization
> --
>
> Key: PARQUET-2159
> URL: https://issues.apache.org/jira/browse/PARQUET-2159
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Fang-Xie
>Assignee: Fang-Xie
>Priority: Major
> Fix For: 1.13.0
>
> Attachments: image-2022-06-15-22-56-08-396.png, 
> image-2022-06-15-22-57-15-964.png, image-2022-06-15-22-58-01-442.png, 
> image-2022-06-15-22-58-40-704.png
>
>
> Spark currently uses parquet-mr as its Parquet reader/writer library, but the
> built-in bit-packing encode/decode is not efficient enough.
> Our optimization of Parquet bit-packing encode/decode with jdk.incubator.vector
> in OpenJDK 18 brings a prominent performance improvement.
> Because the Vector API was added to OpenJDK in version 16, this optimization
> requires JDK 16 or higher.
> *Below are our test results*
> The functional test is based on the open-source parquet-mr bit-pack decoding
> function: *_public final void unpack8Values(final byte[] in, final int inPos,
> final int[] out, final int outPos)_*
> compared with our implementation using the Vector API: *_public final void
> unpack8Values_vec(final byte[] in, final int inPos, final int[] out, final
> int outPos)_*
> We tested 10 pairs of decode functions (open-source Parquet bit unpacking vs.
> our optimized vectorized SIMD implementation) with bit
> width=\{1,2,3,4,5,6,7,8,9,10}; the test results are below:
> !image-2022-06-15-22-56-08-396.png|width=437,height=223!
> We integrated our bit-packing decode implementation into parquet-mr and tested
> the Parquet batch-reading ability of Spark's VectorizedParquetRecordReader,
> which reads Parquet column data in batches. We constructed Parquet files
> with different row and column counts: the column data type is Int32, the
> maximum int value is 127 (which fits bit-pack encoding with bit width=7),
> the row count ranges from 10k to 100 million, and the column count
> ranges from 1 to 4.
> !image-2022-06-15-22-57-15-964.png|width=453,height=229!
> !image-2022-06-15-22-58-01-442.png|width=439,height=217!
> !image-2022-06-15-22-58-40-704.png|width=415,height=208!
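
To make the operation being benchmarked concrete, here is a minimal scalar sketch of unpacking 8 bit-packed values. It is not the parquet-mr code and not the vectorized implementation from the PR: the class name, the extra bitWidth parameter, and the pack8Values helper are hypothetical, and the sketch assumes values are packed starting from the least significant bit of each byte.

```java
// Illustrative only: a scalar bit-unpacking sketch, not the parquet-mr implementation.
class BitUnpackSketch {

    // Unpacks 8 values of `bitWidth` bits each (1..32), assuming LSB-first packing.
    static void unpack8Values(byte[] in, int inPos, int[] out, int outPos, int bitWidth) {
        int mask = (bitWidth == 32) ? -1 : (1 << bitWidth) - 1;
        long buffer = 0;      // holds bits that were read but not yet consumed
        int bitsInBuffer = 0;
        int next = inPos;
        for (int i = 0; i < 8; i++) {
            // Refill one byte at a time until a whole value is available.
            while (bitsInBuffer < bitWidth) {
                buffer |= (in[next++] & 0xFFL) << bitsInBuffer;
                bitsInBuffer += 8;
            }
            out[outPos + i] = (int) (buffer & mask);
            buffer >>>= bitWidth;
            bitsInBuffer -= bitWidth;
        }
    }

    // Hypothetical inverse, used here only to build test input.
    // 8 values of `bitWidth` bits occupy exactly `bitWidth` bytes.
    static byte[] pack8Values(int[] in, int inPos, int bitWidth) {
        byte[] out = new byte[bitWidth];
        long valueMask = (1L << bitWidth) - 1;
        long buffer = 0;
        int bits = 0;
        int next = 0;
        for (int i = 0; i < 8; i++) {
            buffer |= (in[inPos + i] & valueMask) << bits;
            bits += bitWidth;
            while (bits >= 8) {
                out[next++] = (byte) buffer; // emit the lowest 8 bits
                buffer >>>= 8;
                bits -= 8;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] values = {0, 1, 2, 63, 64, 100, 126, 127}; // all fit in 7 bits
        byte[] packed = pack8Values(values, 0, 7);
        int[] unpacked = new int[8];
        unpack8Values(packed, 0, unpacked, 0, 7);
        System.out.println(java.util.Arrays.equals(values, unpacked));
    }
}
```

With bit width 7, as in the Spark test described above, 8 Int32 values occupy 7 bytes instead of 32. The loop handles one value per iteration, which is the kind of per-value work a SIMD version would spread across vector lanes.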



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-mr] sunchao commented on a diff in pull request #1011: PARQUET-2159: java17 vector parquet bit-packing decode optimization

2023-02-07 Thread via GitHub


sunchao commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1099630871


##
README.md:
##
@@ -83,6 +83,16 @@ Parquet is a very active project, and new features are being 
added quickly. Here
 * Column stats
 * Delta encoding
 * Index pages
+* Java Vector API support
+
+## Java Vector API support

Review Comment:
   Ah, thanks! This looks promising, and I'm looking forward to the Spark PR!






[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization

2023-02-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17685657#comment-17685657
 ] 

ASF GitHub Bot commented on PARQUET-2159:
-

Fang-Xie commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1099602361


##
README.md:
##
@@ -83,6 +83,16 @@ Parquet is a very active project, and new features are being 
added quickly. Here
 * Column stats
 * Delta encoding
 * Index pages
+* Java Vector API support
+
+## Java Vector API support

Review Comment:
   @sunchao,
[here](https://issues.apache.org/jira/projects/PARQUET/issues/PARQUET-2159)
are the micro-benchmark of the bitpack function and the test report from Spark's
VectorizedParquetRecordReader (scan operators). Most TPC-H queries are dominated by
join-related operators, so the hotspot lies in the join/shuffle stage. The bitpack
optimization would be beneficial for SQL filter queries.







[GitHub] [parquet-mr] Fang-Xie commented on a diff in pull request #1011: PARQUET-2159: java17 vector parquet bit-packing decode optimization

2023-02-07 Thread via GitHub


Fang-Xie commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1099602361


##
README.md:
##
@@ -83,6 +83,16 @@ Parquet is a very active project, and new features are being 
added quickly. Here
 * Column stats
 * Delta encoding
 * Index pages
+* Java Vector API support
+
+## Java Vector API support

Review Comment:
   @sunchao,
[here](https://issues.apache.org/jira/projects/PARQUET/issues/PARQUET-2159)
are the micro-benchmark of the bitpack function and the test report from Spark's
VectorizedParquetRecordReader (scan operators). Most TPC-H queries are dominated by
join-related operators, so the hotspot lies in the join/shuffle stage. The bitpack
optimization would be beneficial for SQL filter queries.






[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization

2023-02-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17685656#comment-17685656
 ] 

ASF GitHub Bot commented on PARQUET-2159:
-

WangYuxing0924 commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1099601047


##
README.md:
##
@@ -83,6 +83,16 @@ Parquet is a very active project, and new features are being 
added quickly. Here
 * Column stats
 * Delta encoding
 * Index pages
+* Java Vector API support
+
+## Java Vector API support

Review Comment:
   @sunchao,
[here](https://issues.apache.org/jira/projects/PARQUET/issues/PARQUET-2159)
are the micro-benchmark of the bitpack function and the test report from Spark's
VectorizedParquetRecordReader (scan operators). Most TPC-H queries are dominated by
join-related operators, so the hotspot lies in the join/shuffle stage. The bitpack
optimization would be beneficial for SQL filter queries.







[GitHub] [parquet-mr] WangYuxing0924 commented on a diff in pull request #1011: PARQUET-2159: java17 vector parquet bit-packing decode optimization

2023-02-07 Thread via GitHub


WangYuxing0924 commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1099601047


##
README.md:
##
@@ -83,6 +83,16 @@ Parquet is a very active project, and new features are being 
added quickly. Here
 * Column stats
 * Delta encoding
 * Index pages
+* Java Vector API support
+
+## Java Vector API support

Review Comment:
   @sunchao,
[here](https://issues.apache.org/jira/projects/PARQUET/issues/PARQUET-2159)
are the micro-benchmark of the bitpack function and the test report from Spark's
VectorizedParquetRecordReader (scan operators). Most TPC-H queries are dominated by
join-related operators, so the hotspot lies in the join/shuffle stage. The bitpack
optimization would be beneficial for SQL filter queries.






[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization

2023-02-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17685617#comment-17685617
 ] 

ASF GitHub Bot commented on PARQUET-2159:
-

sunchao commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1099446478


##
README.md:
##
@@ -83,6 +83,16 @@ Parquet is a very active project, and new features are being 
added quickly. Here
 * Column stats
 * Delta encoding
 * Index pages
+* Java Vector API support
+
+## Java Vector API support

Review Comment:
   @jiangjiguang sounds good, could you share the TPC-H benchmark results too?







[GitHub] [parquet-mr] sunchao commented on a diff in pull request #1011: PARQUET-2159: java17 vector parquet bit-packing decode optimization

2023-02-07 Thread via GitHub


sunchao commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1099446478


##
README.md:
##
@@ -83,6 +83,16 @@ Parquet is a very active project, and new features are being 
added quickly. Here
 * Column stats
 * Delta encoding
 * Index pages
+* Java Vector API support
+
+## Java Vector API support

Review Comment:
   @jiangjiguang sounds good, could you share the TPC-H benchmark results too?






[GitHub] [parquet-format] dependabot[bot] opened a new pull request, #191: Bump junit from 4.10 to 4.13.1

2023-02-07 Thread via GitHub


dependabot[bot] opened a new pull request, #191:
URL: https://github.com/apache/parquet-format/pull/191

   Bumps [junit](https://github.com/junit-team/junit4) from 4.10 to 4.13.1.
   
   Release notes
   Sourced from [junit's releases](https://github.com/junit-team/junit4/releases).
   
   JUnit 4.13.1
   Please refer to the [release notes](https://github.com/junit-team/junit/blob/HEAD/doc/ReleaseNotes4.13.1.md) for details.
   JUnit 4.13
   Please refer to the [release notes](https://github.com/junit-team/junit/blob/HEAD/doc/ReleaseNotes4.13.md) for details.
   JUnit 4.13 RC 2
   Please refer to the [release notes](https://github.com/junit-team/junit4/wiki/4.13-Release-Notes) for details.
   JUnit 4.13 RC 1
   Please refer to the [release notes](https://github.com/junit-team/junit4/wiki/4.13-Release-Notes) for details.
   JUnit 4.13 Beta 3
   Please refer to the [release notes](https://github.com/junit-team/junit4/wiki/4.13-Release-Notes) for details.
   JUnit 4.13 Beta 2
   Please refer to the [release notes](https://github.com/junit-team/junit4/wiki/4.13-Release-Notes) for details.
   JUnit 4.13 Beta 1
   Please refer to the [release notes](https://github.com/junit-team/junit4/wiki/4.13-Release-Notes) for details.
   JUnit 4.12
   Please refer to the [release notes](https://github.com/junit-team/junit/blob/HEAD/doc/ReleaseNotes4.12.md) for details.
   JUnit 4.12 Beta 3
   Please refer to the [release notes](https://github.com/junit-team/junit/blob/HEAD/doc/ReleaseNotes4.12.md) for details.
   JUnit 4.12 Beta 2
   No release notes provided.
   JUnit 4.12 Beta 1
   No release notes provided.
   JUnit 4.11
   No release notes provided.
   
   Commits
   
   - [1b683f4](https://github.com/junit-team/junit4/commit/1b683f4ec07bcfa40149f086d32240f805487e66) [maven-release-plugin] prepare release r4.13.1
   - [ce6ce3a](https://github.com/junit-team/junit4/commit/ce6ce3aadc070db2902698fe0d3dc6729cd631f2) Draft 4.13.1 release notes
   - [c29dd82](https://github.com/junit-team/junit4/commit/c29dd8239d6b353e699397eb090a1fd27411fa24) Change version to 4.13.1-SNAPSHOT
   - [1d17486](https://github.com/junit-team/junit4/commit/1d174861f0b64f97ab0722bb324a760bfb02f567) Add a link to assertThrows in exception testing
   - [543905d](https://github.com/junit-team/junit4/commit/543905df72ff10364b94dda27552efebf3dd04e9) Use separate line for annotation in Javadoc
   - [510e906](https://github.com/junit-team/junit4/commit/510e906b391e7e46a346e1c852416dc7be934944) Add sub headlines to class Javadoc
   - [610155b](https://github.com/junit-team/junit4/commit/610155b8c22138329f0723eec22521627dbc52ae) Merge pull request from GHSA-269g-pwp5-87pp
   - [b6cfd1e](https://github.com/junit-team/junit4/commit/b6cfd1e3d736cc2106242a8be799615b472c7fec) Explicitly wrap float parameter for consistency ([#1671](https://github-redirect.dependabot.com/junit-team/junit4/issues/1671))
   - [a5d205c](https://github.com/junit-team/junit4/commit/a5d205c7956dbed302b3bb5ecde5ba4299f0b646) Fix GitHub link in FAQ ([#1672](https://github-redirect.dependabot.com/junit-team/junit4/issues/1672))
   - [3a5c6b4](https://github.com/junit-team/junit4/commit/3a5c6b4d08f408c8ca6a8e0bae71a9bc5a8f97e8) Deprecated since jdk9 replacing constructor instance of Double and Float ([#1660](https://github-redirect.dependabot.com/junit-team/junit4/issues/1660))
   - Additional commits viewable in [compare view](https://github.com/junit-team/junit4/compare/r4.10...r4.13.1)
   
   
   
   
   
   [![Dependabot compatibility 
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=junit:junit&package-manager=maven&previous-version=4.10&new-version=4.13.1)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
   
   Dependabot will resolve any conflicts with this PR as long as you don't 
alter it yourself. You can also trigger a rebase manually by commenting 
`@dependabot rebase`.
   
   [//]: # (dependabot-automerge-start)
   [//]: # (dependabot-automerge-end)
   
   ---
   
   
   Dependabot commands and options
   
   
   You can trigger Dependabot actions by commenting on this PR:
   - `@dependabot rebase` will rebase this PR
   - `@dependabot recreate` will recreate this PR, overwriting any edits that 
have been made to it
   - `@dependabot merge` will merge this PR after your CI passes on it
   - `@dependabot squash and merge` will squash and merge this PR after your CI 
passes on it
   - `@dependabot cancel merge` will cancel a previously requested merge and 
block automerging
   - `@dependabot reopen` will reopen this PR if it is closed
   - `@dependabot close` will close this PR and stop Dependabot recreating it. 
You can achieve the same result by closing it manually
   - `@dependabot ignore this major version` will close this PR and stop 
Dependabot creating any more for this major version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore thi

[jira] [Commented] (PARQUET-2005) Upgrade thrift to 0.14.1

2023-02-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17685474#comment-17685474
 ] 

ASF GitHub Bot commented on PARQUET-2005:
-

Fokko merged PR #175:
URL: https://github.com/apache/parquet-format/pull/175




> Upgrade thrift to 0.14.1
> 
>
> Key: PARQUET-2005
> URL: https://issues.apache.org/jira/browse/PARQUET-2005
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
> Fix For: 1.13.0
>
>






[GitHub] [parquet-format] Fokko merged pull request #175: PARQUET-2005: Upgrade Apache Thrift to 0.14.1

2023-02-07 Thread via GitHub


Fokko merged PR #175:
URL: https://github.com/apache/parquet-format/pull/175





[jira] [Commented] (PARQUET-2229) ParquetRewriter supports masking and encrypting the same column

2023-02-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17685443#comment-17685443
 ] 

ASF GitHub Bot commented on PARQUET-2229:
-

shangxinli commented on PR #1021:
URL: https://github.com/apache/parquet-mr/pull/1021#issuecomment-1421322311

   LGTM too. If no other comments, we can merge.




> ParquetRewriter supports masking and encrypting the same column
> ---
>
> Key: PARQUET-2229
> URL: https://issues.apache.org/jira/browse/PARQUET-2229
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-mr
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> ParquetRewriter does not yet support masking and encrypting the same column.
> The scope of this task is to enable it.





[GitHub] [parquet-mr] shangxinli commented on pull request #1021: PARQUET-2229: ParquetRewriter masks and encrypts the same column

2023-02-07 Thread via GitHub


shangxinli commented on PR #1021:
URL: https://github.com/apache/parquet-mr/pull/1021#issuecomment-1421322311

   LGTM too. If no other comments, we can merge.





[jira] [Commented] (PARQUET-2005) Upgrade thrift to 0.14.1

2023-02-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17685442#comment-17685442
 ] 

ASF GitHub Bot commented on PARQUET-2005:
-

shangxinli commented on PR #175:
URL: https://github.com/apache/parquet-format/pull/175#issuecomment-1421315309

   @Fokko same ask as @pitrou. 






[GitHub] [parquet-format] shangxinli commented on pull request #175: PARQUET-2005: Upgrade Apache Thrift to 0.14.1

2023-02-07 Thread via GitHub


shangxinli commented on PR #175:
URL: https://github.com/apache/parquet-format/pull/175#issuecomment-1421315309

   @Fokko same ask as @pitrou. 





[jira] [Commented] (PARQUET-1950) Define core features / compliance level

2023-02-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17685439#comment-17685439
 ] 

ASF GitHub Bot commented on PARQUET-1950:
-

shangxinli commented on PR #164:
URL: https://github.com/apache/parquet-format/pull/164#issuecomment-1421313170

   @gszadovszky Are you still looking forward to merging this PR? 




> Define core features / compliance level
> ---
>
> Key: PARQUET-1950
> URL: https://issues.apache.org/jira/browse/PARQUET-1950
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-format
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>
> The Parquet format is gaining more and more features, while the different
> implementations cannot keep pace and are left behind, with some features
> implemented and others not. In many cases it is also unclear whether a
> given feature is mature enough to be used widely or is still experimental.
> These are huge issues that make it hard to ensure interoperability between the
> different implementations.
> The following idea came up in a
> [discussion|https://lists.apache.org/thread.html/rde5cba8443487bccd47593ddf5dfb39f69c729d260165cb936a1a289%40%3Cdev.parquet.apache.org%3E].
>  Create a new document in the parquet-format repository that lists the "core
> features". This document is versioned by the parquet-format releases. This
> way, a certain version of the "core features" defines a level of compatibility
> between the different implementations. This version number can be written to
> a new field (e.g. complianceLevel) in the footer. If an implementation writes
> a file with a version in this field, it must implement all the related "core
> features" (read and write) and must not use any other features when writing,
> because doing so would make the data unreadable by another implementation that
> implements only the same level of "core features".
> For example, if encoding A is listed in the version 1 "core features" but
> encoding B is not, then at "complianceLevel = 1" we can use encoding A but
> cannot use encoding B, because it would make the related data unreadable.
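
The encoding A / encoding B example in the description can be sketched as a writer-side check. Everything here is hypothetical: the class, the table of levels, and the canWrite method are made up for illustration, and the assumption that a level 2 would add encoding B is mine, not part of the proposal.

```java
import java.util.Map;
import java.util.Set;

// Illustrative only: a writer-side gate keyed on a hypothetical complianceLevel.
class ComplianceLevelSketch {

    // Hypothetical table: compliance level -> encodings listed as "core features".
    // Level 1 mirrors the example in the issue; level 2 is an assumption.
    static final Map<Integer, Set<String>> CORE_ENCODINGS = Map.of(
            1, Set.of("A"),
            2, Set.of("A", "B"));

    // A writer that declares `complianceLevel` may only use encodings listed for that level.
    static boolean canWrite(int complianceLevel, String encoding) {
        Set<String> allowed = CORE_ENCODINGS.get(complianceLevel);
        return allowed != null && allowed.contains(encoding);
    }

    public static void main(String[] args) {
        System.out.println(canWrite(1, "A")); // true: A is a level 1 core feature
        System.out.println(canWrite(1, "B")); // false: B would be unreadable at level 1
    }
}
```

A reader would do the reverse check: if the complianceLevel in the footer is one it fully implements, it can expect to read the whole file.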





[GitHub] [parquet-format] shangxinli commented on pull request #164: PARQUET-1950: Define core features

2023-02-07 Thread via GitHub


shangxinli commented on PR #164:
URL: https://github.com/apache/parquet-format/pull/164#issuecomment-1421313170

   @gszadovszky Are you still looking forward to merging this PR? 





[jira] [Updated] (PARQUET-2240) DateTimeFormatter is used in static context, but not thread safe

2023-02-07 Thread Shani Elharrar (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shani Elharrar updated PARQUET-2240:

Description: 
The date formatter used in a static context is not thread safe (the stack trace
below shows it is a java.text.SimpleDateFormat). A formatter instance is created
in PrimitiveStringifier.DateStringifier, and DateStringifier instances are held
in the static final fields DATE_STRINGIFIER, TIMESTAMP_MILLIS_STRINGIFIER,
TIMESTAMP_MICROS_STRINGIFIER, TIMESTAMP_NANOS_STRINGIFIER,
TIMESTAMP_MILLIS_UTC_STRINGIFIER, TIMESTAMP_MICROS_UTC_STRINGIFIER, and
TIMESTAMP_NANOS_UTC_STRINGIFIER.

This causes exceptions like the following to be thrown from Parquet code:

java.lang.ArrayIndexOutOfBoundsException: Index 633 out of bounds for length 13

stacktrace:

    at java.base/sun.util.calendar.BaseCalendar.getCalendarDateFromFixedDate(BaseCalendar.java:457)
    at java.base/java.util.GregorianCalendar.computeFields(GregorianCalendar.java:2358)
    at java.base/java.util.GregorianCalendar.computeFields(GregorianCalendar.java:2273)
    at java.base/java.util.Calendar.setTimeInMillis(Calendar.java:1827)
    at java.base/java.util.Calendar.setTime(Calendar.java:1793)
    at java.base/java.text.SimpleDateFormat.format(SimpleDateFormat.java:978)
    at java.base/java.text.SimpleDateFormat.format(SimpleDateFormat.java:971)
    at java.base/java.text.DateFormat.format(DateFormat.java:339)
    at java.base/java.text.Format.format(Format.java:159)
    at org.apache.parquet.schema.PrimitiveStringifier$DateStringifier.toFormattedString(PrimitiveStringifier.java:265)
    at org.apache.parquet.schema.PrimitiveStringifier$DateStringifier.stringify(PrimitiveStringifier.java:256)
    at org.apache.parquet.column.statistics.IntStatistics.stringify(IntStatistics.java:92)
    at org.apache.parquet.column.statistics.IntStatistics.stringify(IntStatistics.java:25)
    at org.apache.parquet.column.statistics.Statistics.minAsString(Statistics.java:423)

    (... unrelated code)

A simple solution would be to change those fields from static to non-static 
values, so that no formatter instance is shared between threads.

I can create a PR if the maintainers of the library are OK with this solution.
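
As an illustration of one alternative to dropping the static fields, the sketch 
below keeps a single shared static formatter but uses java.time.format.DateTimeFormatter, 
which is documented as immutable and thread safe. The class name 
ThreadSafeDateStringifier, the method stringifyMillis, and the pattern string are 
assumptions made for this example; this is not the actual parquet-mr code or its 
eventual fix.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical stand-in for a DateStringifier-like helper (name is an assumption).
public final class ThreadSafeDateStringifier {

    // DateTimeFormatter is immutable, so sharing one static instance across
    // threads is safe. The pattern below is an assumption for illustration.
    private static final DateTimeFormatter FORMATTER =
        DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS", Locale.ROOT)
            .withZone(ZoneOffset.UTC);

    private ThreadSafeDateStringifier() {}

    // Formats a millisecond epoch timestamp as a UTC date-time string.
    public static String stringifyMillis(long epochMillis) {
        return FORMATTER.format(Instant.ofEpochMilli(epochMillis));
    }
}
```

Under these assumptions the static fields could stay static, because no mutable 
Calendar state is shared between callers.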


> DateTimeFormatter is used in static context, but not thread safe
> 
>
> Key: PARQUET-2240
> URL: https://issues.apache.org/jira/browse/PARQUET-2240
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
> Environment: Linux, OpenJDK 17 (based on docker image openjdk:17-slim)
>  
>Reporter: Shani Elharrar
>Priority: Trivial
>  Labels: bug
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The date formatter is used in a static context but is not thread safe. A formatter 
> instance is created in PrimitiveStringifier.DateStringifier, and 
> DateStringifier is created in static final DATE_STRINGIFIER, 
> TIMESTAMP_MILLIS_STRINGIFIER, TIMESTAMP_MICROS_STRINGIFIER, 
> TIMESTAMP_NANOS_STRINGIFIER, TIMESTAMP_MILLIS_UTC_STRINGIFIER, 
> TIMESTAMP_MICROS_UTC_ST

[GitHub] [parquet-format] XinyuZeng opened a new pull request, #190: Minor: add FIXED_LEN_BYTE_ARRAY under Types in doc

2023-02-07 Thread via GitHub


XinyuZeng opened a new pull request, #190:
URL: https://github.com/apache/parquet-format/pull/190

   Add missing FIXED_LEN_BYTE_ARRAY type under Types in README.md
   





[jira] [Created] (PARQUET-2240) DateTimeFormatter is used in static context, but not thread safe

2023-02-07 Thread Shani Elharrar (Jira)
Shani Elharrar created PARQUET-2240:
---

 Summary: DateTimeFormatter is used in static context, but not 
thread safe
 Key: PARQUET-2240
 URL: https://issues.apache.org/jira/browse/PARQUET-2240
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
 Environment: Linux, OpenJDK 17 (based on docker image openjdk:17-slim)

 
Reporter: Shani Elharrar


The date formatter is used in a static context but is not thread safe. A 
formatter instance (a java.text.SimpleDateFormat, per the stack trace below) is 
created in PrimitiveStringifier.DateStringifier, and DateStringifier instances 
are held in the static final fields DATE_STRINGIFIER, TIMESTAMP_MILLIS_STRINGIFIER, 
TIMESTAMP_MICROS_STRINGIFIER, TIMESTAMP_NANOS_STRINGIFIER, 
TIMESTAMP_MILLIS_UTC_STRINGIFIER, TIMESTAMP_MICROS_UTC_STRINGIFIER, and 
TIMESTAMP_NANOS_UTC_STRINGIFIER.

When those shared instances are used from several threads at once, exceptions 
like the following are thrown from Parquet code:

java.lang.ArrayIndexOutOfBoundsException: Index 633 out of bounds for length 13

stacktrace:

    at java.base/sun.util.calendar.BaseCalendar.getCalendarDateFromFixedDate(BaseCalendar.java:457)
    at java.base/java.util.GregorianCalendar.computeFields(GregorianCalendar.java:2358)
    at java.base/java.util.GregorianCalendar.computeFields(GregorianCalendar.java:2273)
    at java.base/java.util.Calendar.setTimeInMillis(Calendar.java:1827)
    at java.base/java.util.Calendar.setTime(Calendar.java:1793)
    at java.base/java.text.SimpleDateFormat.format(SimpleDateFormat.java:978)
    at java.base/java.text.SimpleDateFormat.format(SimpleDateFormat.java:971)
    at java.base/java.text.DateFormat.format(DateFormat.java:339)
    at java.base/java.text.Format.format(Format.java:159)
    at org.apache.parquet.schema.PrimitiveStringifier$DateStringifier.toFormattedString(PrimitiveStringifier.java:265)
    at org.apache.parquet.schema.PrimitiveStringifier$DateStringifier.stringify(PrimitiveStringifier.java:256)
    at org.apache.parquet.column.statistics.IntStatistics.stringify(IntStatistics.java:92)
    at org.apache.parquet.column.statistics.IntStatistics.stringify(IntStatistics.java:25)
    at org.apache.parquet.column.statistics.Statistics.minAsString(Statistics.java:423)

    (... unrelated code)

A simple solution would be to change those fields from static to non-static 
values, so that no formatter instance is shared between threads.
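
The idea of not sharing a formatter between threads can be sketched as follows. 
This is a minimal example, not the parquet-mr implementation: the class name 
PerThreadDateStringifier, the method stringifyMillis, and the pattern string are 
hypothetical. It keeps SimpleDateFormat but gives each thread its own instance 
through a ThreadLocal, so the mutable Calendar inside the formatter is never 
touched by two threads at once.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper (name is an assumption): one SimpleDateFormat per thread.
public final class PerThreadDateStringifier {

    // Each thread lazily creates and then reuses its own formatter instance.
    // The pattern is an assumption chosen for illustration.
    private static final ThreadLocal<SimpleDateFormat> FORMAT =
        ThreadLocal.withInitial(() -> {
            SimpleDateFormat format =
                new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS", Locale.ROOT);
            format.setTimeZone(TimeZone.getTimeZone("UTC"));
            return format;
        });

    private PerThreadDateStringifier() {}

    // Formats a millisecond epoch timestamp using the calling thread's formatter.
    public static String stringifyMillis(long epochMillis) {
        return FORMAT.get().format(new Date(epochMillis));
    }
}
```

Compared with making the fields non-static, this avoids allocating a new 
formatter per stringifier instance, at the cost of one formatter per thread.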



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2239) Replace log4j1 with reload4j

2023-02-07 Thread Akshat Mathur (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17685158#comment-17685158
 ] 

Akshat Mathur commented on PARQUET-2239:


Thanks [~ste...@apache.org] for sharing this, I will use it as a reference :)

> Replace log4j1 with reload4j
> 
>
> Key: PARQUET-2239
> URL: https://issues.apache.org/jira/browse/PARQUET-2239
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Akshat Mathur
>Priority: Major
>  Labels: pick-me-up
>
> Due to multiple CVEs in log4j1, replace the log4j dependency with reload4j.
> More about reload4j: https://reload4j.qos.ch/


