[GitHub] [parquet-format] mapleFU commented on pull request #190: Minor: add FIXED_LEN_BYTE_ARRAY under Types in doc
mapleFU commented on PR #190:
URL: https://github.com/apache/parquet-format/pull/190#issuecomment-1422061712

@pitrou I think this is great, mind taking a look?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization
[ https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17685675#comment-17685675 ]

ASF GitHub Bot commented on PARQUET-2159:
-----------------------------------------

sunchao commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1099630871

## README.md: ##
@@ -83,6 +83,16 @@ Parquet is a very active project, and new features are being added quickly. Here
 * Column stats
 * Delta encoding
 * Index pages
+* Java Vector API support
+
+## Java Vector API support

Review Comment:
Ah thanks! this looks promising and looking forward to the Spark PR!

> Parquet bit-packing de/encode optimization
> ------------------------------------------
>
> Key: PARQUET-2159
> URL: https://issues.apache.org/jira/browse/PARQUET-2159
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Affects Versions: 1.13.0
> Reporter: Fang-Xie
> Assignee: Fang-Xie
> Priority: Major
> Fix For: 1.13.0
> Attachments: image-2022-06-15-22-56-08-396.png, image-2022-06-15-22-57-15-964.png, image-2022-06-15-22-58-01-442.png, image-2022-06-15-22-58-40-704.png
>
> Spark currently uses parquet-mr as its Parquet reader/writer library, but the built-in bit-packing encode/decode is not efficient enough.
> Our optimization of Parquet bit-packing encode/decode with jdk.incubator.vector in OpenJDK 18 brings a prominent performance improvement.
> Because the Vector API was added to OpenJDK in version 16, this optimization requires JDK 16 or higher.
>
> *Below are our test results*
>
> The functional test is based on the open-source parquet-mr bit-pack decoding function
> *_public final void unpack8Values(final byte[] in, final int inPos, final int[] out, final int outPos)_*
> compared with our Vector API implementation
> *_public final void unpack8Values_vec(final byte[] in, final int inPos, final int[] out, final int outPos)_*
>
> We tested 10 pairs of decode functions (open-source Parquet bit unpacking vs. our optimized vectorized SIMD implementation) with bit width = {1,2,3,4,5,6,7,8,9,10}. The test results are below:
> !image-2022-06-15-22-56-08-396.png|width=437,height=223!
>
> We integrated our bit-packing decode implementation into parquet-mr and tested the Parquet batch reader from Spark's VectorizedParquetRecordReader, which fetches Parquet column data in batches. We constructed Parquet files with different row and column counts. The column data type is Int32 and the maximum int value is 127, which fits bit-pack encoding with bit width = 7. The row count ranges from 10k to 100 million and the column count from 1 to 4.
> !image-2022-06-15-22-57-15-964.png|width=453,height=229!
> !image-2022-06-15-22-58-01-442.png|width=439,height=217!
> !image-2022-06-15-22-58-40-704.png|width=415,height=208!

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
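
For readers unfamiliar with the operation being benchmarked, the following is a minimal scalar sketch of unpacking eight values of a fixed bit width from a byte array. It assumes LSB-first (little-endian) bit order. The class and method names (`BitUnpackSketch`, `pack8`, `unpack8`) and the explicit `bitWidth` parameter are hypothetical; this is neither parquet-mr's generated packer code nor the Vector API implementation from PR #1011.

```java
final class BitUnpackSketch {
    /** Packs 8 values of bitWidth bits each into bitWidth bytes, LSB first. */
    public static byte[] pack8(int[] in, int bitWidth) {
        byte[] out = new byte[bitWidth]; // 8 values * bitWidth bits = bitWidth bytes
        for (int i = 0; i < 8; i++) {
            for (int b = 0; b < bitWidth; b++) {
                int bitPos = i * bitWidth + b; // position in the packed bit stream
                if (((in[i] >>> b) & 1) != 0) {
                    out[bitPos >>> 3] |= (byte) (1 << (bitPos & 7));
                }
            }
        }
        return out;
    }

    /** Unpacks 8 values of bitWidth bits each, one bit at a time (no SIMD). */
    public static void unpack8(byte[] in, int inPos, int[] out, int outPos, int bitWidth) {
        for (int i = 0; i < 8; i++) {
            int value = 0;
            for (int b = 0; b < bitWidth; b++) {
                int bitPos = i * bitWidth + b;
                int bit = (in[inPos + (bitPos >>> 3)] >>> (bitPos & 7)) & 1;
                value |= bit << b;
            }
            out[outPos + i] = value;
        }
    }
}
```

With bit width 7, as in the Spark test described above, eight Int32 values occupy 7 bytes instead of 32. The optimization in the issue targets this per-value work; the vectorized code itself is not reproduced here.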
[GitHub] [parquet-mr] sunchao commented on a diff in pull request #1011: PARQUET-2159: java17 vector parquet bit-packing decode optimization
sunchao commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1099630871

## README.md: ##
@@ -83,6 +83,16 @@ Parquet is a very active project, and new features are being added quickly. Here
 * Column stats
 * Delta encoding
 * Index pages
+* Java Vector API support
+
+## Java Vector API support

Review Comment:
Ah thanks! this looks promising and looking forward to the Spark PR!
[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization
[ https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17685657#comment-17685657 ]

ASF GitHub Bot commented on PARQUET-2159:
-----------------------------------------

Fang-Xie commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1099602361

## README.md: ##
@@ -83,6 +83,16 @@ Parquet is a very active project, and new features are being added quickly. Here
 * Column stats
 * Delta encoding
 * Index pages
+* Java Vector API support
+
+## Java Vector API support

Review Comment:
@sunchao, [here](https://issues.apache.org/jira/projects/PARQUET/issues/PARQUET-2159) are the micro-benchmark of the bit-pack function and the test report from Spark's VectorizedParquetRecordReader (scan operators). Most TPC-H queries consist mainly of join-related operators, so the hotspot lies in the join/shuffle stage. The bit-pack optimization would be beneficial for SQL filter queries.

> Parquet bit-packing de/encode optimization
> ------------------------------------------
>
> Key: PARQUET-2159
> URL: https://issues.apache.org/jira/browse/PARQUET-2159
[GitHub] [parquet-mr] Fang-Xie commented on a diff in pull request #1011: PARQUET-2159: java17 vector parquet bit-packing decode optimization
Fang-Xie commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1099602361

## README.md: ##
@@ -83,6 +83,16 @@ Parquet is a very active project, and new features are being added quickly. Here
 * Column stats
 * Delta encoding
 * Index pages
+* Java Vector API support
+
+## Java Vector API support

Review Comment:
@sunchao, [here](https://issues.apache.org/jira/projects/PARQUET/issues/PARQUET-2159) are the micro-benchmark of the bit-pack function and the test report from Spark's VectorizedParquetRecordReader (scan operators). Most TPC-H queries consist mainly of join-related operators, so the hotspot lies in the join/shuffle stage. The bit-pack optimization would be beneficial for SQL filter queries.
[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization
[ https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17685656#comment-17685656 ]

ASF GitHub Bot commented on PARQUET-2159:
-----------------------------------------

WangYuxing0924 commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1099601047

## README.md: ##
@@ -83,6 +83,16 @@ Parquet is a very active project, and new features are being added quickly. Here
 * Column stats
 * Delta encoding
 * Index pages
+* Java Vector API support
+
+## Java Vector API support

Review Comment:
@sunchao, [here](https://issues.apache.org/jira/projects/PARQUET/issues/PARQUET-2159) are the micro-benchmark of the bit-pack function and the test report from Spark's VectorizedParquetRecordReader (scan operators). Most TPC-H queries consist mainly of join-related operators, so the hotspot lies in the join/shuffle stage. The bit-pack optimization would be beneficial for SQL filter queries.

> Parquet bit-packing de/encode optimization
> ------------------------------------------
>
> Key: PARQUET-2159
> URL: https://issues.apache.org/jira/browse/PARQUET-2159
[GitHub] [parquet-mr] WangYuxing0924 commented on a diff in pull request #1011: PARQUET-2159: java17 vector parquet bit-packing decode optimization
WangYuxing0924 commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1099601047

## README.md: ##
@@ -83,6 +83,16 @@ Parquet is a very active project, and new features are being added quickly. Here
 * Column stats
 * Delta encoding
 * Index pages
+* Java Vector API support
+
+## Java Vector API support

Review Comment:
@sunchao, [here](https://issues.apache.org/jira/projects/PARQUET/issues/PARQUET-2159) are the micro-benchmark of the bit-pack function and the test report from Spark's VectorizedParquetRecordReader (scan operators). Most TPC-H queries consist mainly of join-related operators, so the hotspot lies in the join/shuffle stage. The bit-pack optimization would be beneficial for SQL filter queries.
[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization
[ https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17685617#comment-17685617 ]

ASF GitHub Bot commented on PARQUET-2159:
-----------------------------------------

sunchao commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1099446478

## README.md: ##
@@ -83,6 +83,16 @@ Parquet is a very active project, and new features are being added quickly. Here
 * Column stats
 * Delta encoding
 * Index pages
+* Java Vector API support
+
+## Java Vector API support

Review Comment:
@jiangjiguang sounds good, could you share the TPC-H benchmark results too?

> Parquet bit-packing de/encode optimization
> ------------------------------------------
>
> Key: PARQUET-2159
> URL: https://issues.apache.org/jira/browse/PARQUET-2159
[GitHub] [parquet-mr] sunchao commented on a diff in pull request #1011: PARQUET-2159: java17 vector parquet bit-packing decode optimization
sunchao commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1099446478

## README.md: ##
@@ -83,6 +83,16 @@ Parquet is a very active project, and new features are being added quickly. Here
 * Column stats
 * Delta encoding
 * Index pages
+* Java Vector API support
+
+## Java Vector API support

Review Comment:
@jiangjiguang sounds good, could you share the TPC-H benchmark results too?
[GitHub] [parquet-format] dependabot[bot] opened a new pull request, #191: Bump junit from 4.10 to 4.13.1
dependabot[bot] opened a new pull request, #191:
URL: https://github.com/apache/parquet-format/pull/191

Bumps [junit](https://github.com/junit-team/junit4) from 4.10 to 4.13.1.

Release notes

Sourced from [junit's releases](https://github.com/junit-team/junit4/releases).

- JUnit 4.13.1: Please refer to the [release notes](https://github.com/junit-team/junit/blob/HEAD/doc/ReleaseNotes4.13.1.md) for details.
- JUnit 4.13: Please refer to the [release notes](https://github.com/junit-team/junit/blob/HEAD/doc/ReleaseNotes4.13.md) for details.
- JUnit 4.13 RC 2: Please refer to the [release notes](https://github.com/junit-team/junit4/wiki/4.13-Release-Notes) for details.
- JUnit 4.13 RC 1: Please refer to the [release notes](https://github.com/junit-team/junit4/wiki/4.13-Release-Notes) for details.
- JUnit 4.13 Beta 3: Please refer to the [release notes](https://github.com/junit-team/junit4/wiki/4.13-Release-Notes) for details.
- JUnit 4.13 Beta 2: Please refer to the [release notes](https://github.com/junit-team/junit4/wiki/4.13-Release-Notes) for details.
- JUnit 4.13 Beta 1: Please refer to the [release notes](https://github.com/junit-team/junit4/wiki/4.13-Release-Notes) for details.
- JUnit 4.12: Please refer to the [release notes](https://github.com/junit-team/junit/blob/HEAD/doc/ReleaseNotes4.12.md) for details.
- JUnit 4.12 Beta 3: Please refer to the [release notes](https://github.com/junit-team/junit/blob/HEAD/doc/ReleaseNotes4.12.md) for details.
- JUnit 4.12 Beta 2: No release notes provided.
- JUnit 4.12 Beta 1: No release notes provided.
- JUnit 4.11: No release notes provided.

Commits

- [1b683f4](https://github.com/junit-team/junit4/commit/1b683f4ec07bcfa40149f086d32240f805487e66) [maven-release-plugin] prepare release r4.13.1
- [ce6ce3a](https://github.com/junit-team/junit4/commit/ce6ce3aadc070db2902698fe0d3dc6729cd631f2) Draft 4.13.1 release notes
- [c29dd82](https://github.com/junit-team/junit4/commit/c29dd8239d6b353e699397eb090a1fd27411fa24) Change version to 4.13.1-SNAPSHOT
- [1d17486](https://github.com/junit-team/junit4/commit/1d174861f0b64f97ab0722bb324a760bfb02f567) Add a link to assertThrows in exception testing
- [543905d](https://github.com/junit-team/junit4/commit/543905df72ff10364b94dda27552efebf3dd04e9) Use separate line for annotation in Javadoc
- [510e906](https://github.com/junit-team/junit4/commit/510e906b391e7e46a346e1c852416dc7be934944) Add sub headlines to class Javadoc
- [610155b](https://github.com/junit-team/junit4/commit/610155b8c22138329f0723eec22521627dbc52ae) Merge pull request from GHSA-269g-pwp5-87pp
- [b6cfd1e](https://github.com/junit-team/junit4/commit/b6cfd1e3d736cc2106242a8be799615b472c7fec) Explicitly wrap float parameter for consistency ([#1671](https://github-redirect.dependabot.com/junit-team/junit4/issues/1671))
- [a5d205c](https://github.com/junit-team/junit4/commit/a5d205c7956dbed302b3bb5ecde5ba4299f0b646) Fix GitHub link in FAQ ([#1672](https://github-redirect.dependabot.com/junit-team/junit4/issues/1672))
- [3a5c6b4](https://github.com/junit-team/junit4/commit/3a5c6b4d08f408c8ca6a8e0bae71a9bc5a8f97e8) Deprecated since jdk9 replacing constructor instance of Double and Float ([#1660](https://github-redirect.dependabot.com/junit-team/junit4/issues/1660))
- Additional commits viewable in [compare view](https://github.com/junit-team/junit4/compare/r4.10...r4.13.1)

[![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=junit:junit&package-manager=maven&previous-version=4.10&new-version=4.13.1)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore thi
[jira] [Commented] (PARQUET-2005) Upgrade thrift to 0.14.1
[ https://issues.apache.org/jira/browse/PARQUET-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17685474#comment-17685474 ]

ASF GitHub Bot commented on PARQUET-2005:
-----------------------------------------

Fokko merged PR #175:
URL: https://github.com/apache/parquet-format/pull/175

> Upgrade thrift to 0.14.1
> ------------------------
>
> Key: PARQUET-2005
> URL: https://issues.apache.org/jira/browse/PARQUET-2005
> Project: Parquet
> Issue Type: Improvement
> Reporter: Gabor Szadovszky
> Assignee: Gabor Szadovszky
> Priority: Major
> Fix For: 1.13.0
[GitHub] [parquet-format] Fokko merged pull request #175: PARQUET-2005: Upgrade Apache Thrift to 0.14.1
Fokko merged PR #175:
URL: https://github.com/apache/parquet-format/pull/175
[jira] [Commented] (PARQUET-2229) ParquetRewriter supports masking and encrypting the same column
[ https://issues.apache.org/jira/browse/PARQUET-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17685443#comment-17685443 ]

ASF GitHub Bot commented on PARQUET-2229:
-----------------------------------------

shangxinli commented on PR #1021:
URL: https://github.com/apache/parquet-mr/pull/1021#issuecomment-1421322311

LGTM too. If no other comments, we can merge.

> ParquetRewriter supports masking and encrypting the same column
> ---------------------------------------------------------------
>
> Key: PARQUET-2229
> URL: https://issues.apache.org/jira/browse/PARQUET-2229
> Project: Parquet
> Issue Type: Sub-task
> Components: parquet-mr
> Reporter: Gang Wu
> Assignee: Gang Wu
> Priority: Major
>
> ParquetRewriter does not yet support masking and encrypting the same column.
> The scope of this task is to enable it.
[GitHub] [parquet-mr] shangxinli commented on pull request #1021: PARQUET-2229: ParquetRewriter masks and encrypts the same column
shangxinli commented on PR #1021:
URL: https://github.com/apache/parquet-mr/pull/1021#issuecomment-1421322311

LGTM too. If no other comments, we can merge.
[jira] [Commented] (PARQUET-2005) Upgrade thrift to 0.14.1
[ https://issues.apache.org/jira/browse/PARQUET-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17685442#comment-17685442 ]

ASF GitHub Bot commented on PARQUET-2005:
-----------------------------------------

shangxinli commented on PR #175:
URL: https://github.com/apache/parquet-format/pull/175#issuecomment-1421315309

@Fokko same ask as @pitrou.

> Upgrade thrift to 0.14.1
> ------------------------
>
> Key: PARQUET-2005
> URL: https://issues.apache.org/jira/browse/PARQUET-2005
> Project: Parquet
> Issue Type: Improvement
> Reporter: Gabor Szadovszky
> Assignee: Gabor Szadovszky
> Priority: Major
> Fix For: 1.13.0
[GitHub] [parquet-format] shangxinli commented on pull request #175: PARQUET-2005: Upgrade Apache Thrift to 0.14.1
shangxinli commented on PR #175:
URL: https://github.com/apache/parquet-format/pull/175#issuecomment-1421315309

@Fokko same ask as @pitrou.
[jira] [Commented] (PARQUET-1950) Define core features / compliance level
[ https://issues.apache.org/jira/browse/PARQUET-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17685439#comment-17685439 ]

ASF GitHub Bot commented on PARQUET-1950:
-----------------------------------------

shangxinli commented on PR #164:
URL: https://github.com/apache/parquet-format/pull/164#issuecomment-1421313170

@gszadovszky Are you still looking forward to merging this PR?

> Define core features / compliance level
> ---------------------------------------
>
> Key: PARQUET-1950
> URL: https://issues.apache.org/jira/browse/PARQUET-1950
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-format
> Reporter: Gabor Szadovszky
> Assignee: Gabor Szadovszky
> Priority: Major
>
> The Parquet format keeps gaining features, while the different implementations cannot keep pace and are left behind, with some features implemented and others not. In many cases it is also unclear whether a given feature is mature enough to be used widely or is still experimental.
> These are huge issues that make it hard to ensure interoperability between the different implementations.
> The following idea came up in a [discussion|https://lists.apache.org/thread.html/rde5cba8443487bccd47593ddf5dfb39f69c729d260165cb936a1a289%40%3Cdev.parquet.apache.org%3E].
> Create a new document in the parquet-format repository that lists the "core features". This document is versioned by the parquet-format releases. This way, a certain version of the "core features" defines a level of compatibility between the different implementations. This version number can be written to a new field (e.g. complianceLevel) in the footer. If an implementation writes a file with a version in this field, it must implement all the related "core features" (read and write) and must not use any other features when writing, because doing so makes the data unreadable by another implementation that implements only the same level of "core features".
> For example, if encoding A is listed in the version 1 "core features" but encoding B is not, then at "complianceLevel = 1" we can use encoding A but not encoding B, because it would make the related data unreadable.
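
The encoding A / encoding B rule in the issue description can be sketched as a simple lookup. This is only an illustration under stated assumptions: the class `ComplianceLevelSketch`, the method `mayWrite`, and the table contents are hypothetical, and `complianceLevel` is the field name the issue proposes, not an existing footer field.

```java
import java.util.Map;
import java.util.Set;

final class ComplianceLevelSketch {
    // Hypothetical table: compliance level -> encodings listed as "core features".
    // Level 1 lists encoding "A" only, mirroring the example in the issue.
    static final Map<Integer, Set<String>> CORE_ENCODINGS = Map.of(
        1, Set.of("A"));

    /** Returns true if a writer declaring this level may use the given encoding. */
    public static boolean mayWrite(int complianceLevel, String encoding) {
        Set<String> allowed = CORE_ENCODINGS.get(complianceLevel);
        // Unknown levels allow nothing in this sketch.
        return allowed != null && allowed.contains(encoding);
    }
}
```

A writer that declares `complianceLevel = 1` would call `mayWrite(1, "B")`, get `false`, and fall back to an encoding that every level-1 reader is required to support.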
[GitHub] [parquet-format] shangxinli commented on pull request #164: PARQUET-1950: Define core features
shangxinli commented on PR #164:
URL: https://github.com/apache/parquet-format/pull/164#issuecomment-1421313170

@gszadovszky Are you still looking forward to merging this PR?
[jira] [Updated] (PARQUET-2240) DateTimeFormatter is used in static context, but not thread safe
[ https://issues.apache.org/jira/browse/PARQUET-2240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shani Elharrar updated PARQUET-2240: Description: DateTimeFormatter is used in static context but not thread safe, a formatter instance is created in PrimitiveStringifer.DateStringifier, and DateStringifier is created in static final DATE_STRINGIFIER, TIMESTAMP_MILLIS_STRINGIFIER, TIMESTAMP_MICROS_STRINGIFIER, TIMESTAMP_NANOS_STRINGIFIER, TIMESTAMP_MILLIS_UTC_STRINGIFIER, TIMESTAMP_MICROS_UTC_STRINGIFIER, and TIMESTAMP_NANOS_UTC_STRINGIFIER. This causes exceptions like the following to be thrown from parquet-code: java.lang.ArrayIndexOutOfBoundsException: Index 633 out of bounds for length 13 stacktrace: at java.base/sun.util.calendar.BaseCalendar.getCalendarDateFromFixedDate(BaseCalendar.java:457) at java.base/java.util.GregorianCalendar.computeFields(GregorianCalendar.java:2358) at java.base/java.util.GregorianCalendar.computeFields(GregorianCalendar.java:2273) at java.base/java.util.Calendar.setTimeInMillis(Calendar.java:1827) at java.base/java.util.Calendar.setTime(Calendar.java:1793) at java.base/java.text.SimpleDateFormat.format(SimpleDateFormat.java:978) at java.base/java.text.SimpleDateFormat.format(SimpleDateFormat.java:971) at java.base/java.text.DateFormat.format(DateFormat.java:339) at java.base/java.text.Format.format(Format.java:159) at org.apache.parquet.schema.PrimitiveStringifier$DateStringifier.toFormattedString(PrimitiveStringifier.java:265) at org.apache.parquet.schema.PrimitiveStringifier$DateStringifier.stringify(PrimitiveStringifier.java:256) at org.apache.parquet.column.statistics.IntStatistics.stringify(IntStatistics.java:92) at org.apache.parquet.column.statistics.IntStatistics.stringify(IntStatistics.java:25) at org.apache.parquet.column.statistics.Statistics.minAsString(Statistics.java:423) (... unrelated code) A simple solution would be to change those from static to non static values. 
I can create a PR if the solution is ok by the maintainers of the library. was: DateTimeFormatter is used in static context but not thread safe, a formatter instance is created in PrimitiveStringifer.DateStringifier, and DateStringifier is created in static final DATE_STRINGIFIER, TIMESTAMP_MILLIS_STRINGIFIER, TIMESTAMP_MICROS_STRINGIFIER, TIMESTAMP_NANOS_STRINGIFIER, TIMESTAMP_MILLIS_UTC_STRINGIFIER, TIMESTAMP_MICROS_UTC_STRINGIFIER, and TIMESTAMP_NANOS_UTC_STRINGIFIER. This causes exceptions like the following to be thrown from parquet-code: java.lang.ArrayIndexOutOfBoundsException: Index 633 out of bounds for length 13 stacktrace: at java.base/sun.util.calendar.BaseCalendar.getCalendarDateFromFixedDate(BaseCalendar.java:457) at java.base/java.util.GregorianCalendar.computeFields(GregorianCalendar.java:2358) at java.base/java.util.GregorianCalendar.computeFields(GregorianCalendar.java:2273) at java.base/java.util.Calendar.setTimeInMillis(Calendar.java:1827) at java.base/java.util.Calendar.setTime(Calendar.java:1793) at java.base/java.text.SimpleDateFormat.format(SimpleDateFormat.java:978) at java.base/java.text.SimpleDateFormat.format(SimpleDateFormat.java:971) at java.base/java.text.DateFormat.format(DateFormat.java:339) at java.base/java.text.Format.format(Format.java:159) at org.apache.parquet.schema.PrimitiveStringifier$DateStringifier.toFormattedString(PrimitiveStringifier.java:265) at org.apache.parquet.schema.PrimitiveStringifier$DateStringifier.stringify(PrimitiveStringifier.java:256) at org.apache.parquet.column.statistics.IntStatistics.stringify(IntStatistics.java:92) at org.apache.parquet.column.statistics.IntStatistics.stringify(IntStatistics.java:25) at org.apache.parquet.column.statistics.Statistics.minAsString(Statistics.java:423) (... unrelated code) A simple solution would be to change those from static to non static values. 

> DateTimeFormatter is used in static context, but not thread safe
>
> Key: PARQUET-2240
> URL: https://issues.apache.org/jira/browse/PARQUET-2240
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Environment: Linux, OpenJDK 17 (based on docker image openjdk:17-slim)
> Reporter: Shani Elharrar
> Priority: Trivial
> Labels: bug
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> DateTimeFormatter is used in static context but not thread safe, a formatter
> instance is created in PrimitiveStringifier.DateStringifier, and
> DateStringifier is created in static final DATE_STRINGIFIER,
> TIMESTAMP_MILLIS_STRINGIFIER, TIMESTAMP_MICROS_STRINGIFIER,
> TIMESTAMP_NANOS_STRINGIFIER, TIMESTAMP_MILLIS_UTC_STRINGIFIER,
> TIMESTAMP_MICROS_UTC_ST
[GitHub] [parquet-format] XinyuZeng opened a new pull request, #190: Minor: add FIXED_LEN_BYTE_ARRAY under Types in doc
XinyuZeng opened a new pull request, #190: URL: https://github.com/apache/parquet-format/pull/190 Add missing FIXED_LEN_BYTE_ARRAY type under Types in README.md
[jira] [Created] (PARQUET-2240) DateTimeFormatter is used in static context, but not thread safe
Shani Elharrar created PARQUET-2240:

Summary: DateTimeFormatter is used in static context, but not thread safe
Key: PARQUET-2240
URL: https://issues.apache.org/jira/browse/PARQUET-2240
Project: Parquet
Issue Type: Bug
Components: parquet-mr
Environment: Linux, OpenJDK 17 (based on docker image openjdk:17-slim)
Reporter: Shani Elharrar

DateTimeFormatter is used in static context but not thread safe. A formatter instance is created in PrimitiveStringifier.DateStringifier, and DateStringifier is created in static final DATE_STRINGIFIER, TIMESTAMP_MILLIS_STRINGIFIER, TIMESTAMP_MICROS_STRINGIFIER, TIMESTAMP_NANOS_STRINGIFIER, TIMESTAMP_MILLIS_UTC_STRINGIFIER, TIMESTAMP_MICROS_UTC_STRINGIFIER, and TIMESTAMP_NANOS_UTC_STRINGIFIER.

This causes exceptions like the following to be thrown from parquet-code:

java.lang.ArrayIndexOutOfBoundsException: Index 633 out of bounds for length 13

stacktrace:
    at java.base/sun.util.calendar.BaseCalendar.getCalendarDateFromFixedDate(BaseCalendar.java:457)
    at java.base/java.util.GregorianCalendar.computeFields(GregorianCalendar.java:2358)
    at java.base/java.util.GregorianCalendar.computeFields(GregorianCalendar.java:2273)
    at java.base/java.util.Calendar.setTimeInMillis(Calendar.java:1827)
    at java.base/java.util.Calendar.setTime(Calendar.java:1793)
    at java.base/java.text.SimpleDateFormat.format(SimpleDateFormat.java:978)
    at java.base/java.text.SimpleDateFormat.format(SimpleDateFormat.java:971)
    at java.base/java.text.DateFormat.format(DateFormat.java:339)
    at java.base/java.text.Format.format(Format.java:159)
    at org.apache.parquet.schema.PrimitiveStringifier$DateStringifier.toFormattedString(PrimitiveStringifier.java:265)
    at org.apache.parquet.schema.PrimitiveStringifier$DateStringifier.stringify(PrimitiveStringifier.java:256)
    at org.apache.parquet.column.statistics.IntStatistics.stringify(IntStatistics.java:92)
    at org.apache.parquet.column.statistics.IntStatistics.stringify(IntStatistics.java:25)
    at org.apache.parquet.column.statistics.Statistics.minAsString(Statistics.java:423)
    (... unrelated code)

A simple solution would be to change those from static to non-static values.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
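
Editorial note on PARQUET-2240: the stack trace in the report passes through java.text.SimpleDateFormat.format, and the SimpleDateFormat Javadoc states that date formats are not synchronized. A single instance reached through static final fields can therefore have its internal Calendar corrupted by concurrent callers, which would be consistent with the ArrayIndexOutOfBoundsException shown. The following is a minimal sketch, not the fix adopted by parquet-mr, of two common ways to keep such fields static safely: an immutable java.time.format.DateTimeFormatter, or a per-thread SimpleDateFormat held in a ThreadLocal. The class name SafeDateStringifier, its method names, the "yyyy-MM-dd" pattern, and the UTC time zone are assumptions made only for illustration.

```java
import java.text.SimpleDateFormat;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Date;
import java.util.TimeZone;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; this is not the parquet-mr implementation.
final class SafeDateStringifier {
    // DateTimeFormatter is immutable and thread-safe, so one shared static instance is fine.
    private static final DateTimeFormatter ISO_DATE = DateTimeFormatter.ofPattern("yyyy-MM-dd");

    // SimpleDateFormat is not thread-safe; ThreadLocal gives each thread its own instance.
    private static final ThreadLocal<SimpleDateFormat> LEGACY = ThreadLocal.withInitial(() -> {
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd");
        format.setTimeZone(TimeZone.getTimeZone("UTC")); // assumed zone, for a stable result
        return format;
    });

    // Formats a count of days since 1970-01-01 using java.time.
    static String stringifyDays(int daysSinceEpoch) {
        return LocalDate.ofEpochDay(daysSinceEpoch).format(ISO_DATE);
    }

    // Same result for modern dates, using the calling thread's own SimpleDateFormat.
    static String stringifyDaysLegacy(int daysSinceEpoch) {
        return LEGACY.get().format(new Date(TimeUnit.DAYS.toMillis(daysSinceEpoch)));
    }

    public static void main(String[] args) {
        System.out.println(stringifyDays(0));           // 1970-01-01
        System.out.println(stringifyDaysLegacy(19000)); // 2022-01-08
    }
}
```

Either approach avoids shared mutable formatter state without turning the static fields into per-call allocations, which is the trade-off implied by the "static to non-static" suggestion in the report.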
[jira] [Commented] (PARQUET-2239) Replace log4j1 with reload4j
[ https://issues.apache.org/jira/browse/PARQUET-2239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17685158#comment-17685158 ]

Akshat Mathur commented on PARQUET-2239:

Thanks [~ste...@apache.org] for sharing this, will take the reference :)

> Replace log4j1 with reload4j
>
> Key: PARQUET-2239
> URL: https://issues.apache.org/jira/browse/PARQUET-2239
> Project: Parquet
> Issue Type: Improvement
> Reporter: Akshat Mathur
> Priority: Major
> Labels: pick-me-up
>
> Due to multiple CVEs in log4j1, replace the log4j dependency with reload4j.
> More about reload4j: https://reload4j.qos.ch/