[jira] [Commented] (PARQUET-2173) Fix parquet build against hadoop 3.3.3+
[ https://issues.apache.org/jira/browse/PARQUET-2173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17683250#comment-17683250 ] ASF GitHub Bot commented on PARQUET-2173: - gszadovszky merged PR #985: URL: https://github.com/apache/parquet-mr/pull/985 > Fix parquet build against hadoop 3.3.3+ > --- > > Key: PARQUET-2173 > URL: https://issues.apache.org/jira/browse/PARQUET-2173 > Project: Parquet > Issue Type: Bug > Components: parquet-cli >Affects Versions: 1.13.0 >Reporter: Steve Loughran >Priority: Major > > parquet won't build against hadoop 3.3.3+ because it swapped out log4j 1.17 > for reload4j, and this creates maven dependency problems in parquet cli > {code} > [INFO] --- maven-dependency-plugin:3.1.1:analyze-only (default) @ parquet-cli > --- > [WARNING] Used undeclared dependencies found: > [WARNING]ch.qos.reload4j:reload4j:jar:1.2.22:provided > {code} > the hadoop common dependencies need to exclude this jar and any changed slf4j > ones. -- This message was sent by Atlassian Jira (v8.20.10#820010)
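The exclusion described in the issue would look something like the following in the parquet-cli pom. This is only a sketch: the exact set of artifacts to exclude (e.g. whether `org.slf4j:slf4j-reload4j` is pulled in) should be checked against what hadoop-common 3.3.3+ actually declares.

```xml
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <scope>provided</scope>
  <exclusions>
    <!-- Hadoop 3.3.3+ swapped log4j 1.x for reload4j; exclude it and the
         matching slf4j binding so maven-dependency-plugin:analyze passes. -->
    <exclusion>
      <groupId>ch.qos.reload4j</groupId>
      <artifactId>reload4j</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-reload4j</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```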
[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization
[ https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17683229#comment-17683229 ] ASF GitHub Bot commented on PARQUET-2159: - jiangjiguang commented on code in PR #1011: URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1094064182 ## parquet-generator/src/main/resources/ByteBitPacking512VectorLE: ## @@ -0,0 +1,3095 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.parquet.column.values.bitpacking; + +import jdk.incubator.vector.ByteVector; +import jdk.incubator.vector.IntVector; +import jdk.incubator.vector.LongVector; +import jdk.incubator.vector.ShortVector; +import jdk.incubator.vector.Vector; +import jdk.incubator.vector.VectorMask; +import jdk.incubator.vector.VectorOperators; +import jdk.incubator.vector.VectorShuffle; +import jdk.incubator.vector.VectorSpecies; + +import java.nio.ByteBuffer; + +/** + * This is an auto-generated source file and should not edit it directly. + */ +public abstract class ByteBitPacking512VectorLE { Review Comment: > In this case, the script is not necessary. Manual bit-unpacking code is error-prone, we really rely on the quality and coverage of test cases. 
@wgtmac I strongly agree with you, so I have tried my best to cover all aspects (bit widths 1 through 32) in the class TestByteBitPacking512VectorLE. Besides, I have run TPC-H tests and compared the query results with those from before the optimization. In short, I have done extra work to ensure the code quality. > Parquet bit-packing de/encode optimization > -- > > Key: PARQUET-2159 > URL: https://issues.apache.org/jira/browse/PARQUET-2159 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr > Affects Versions: 1.13.0 > Reporter: Fang-Xie > Assignee: Fang-Xie > Priority: Major > Fix For: 1.13.0 > > Attachments: image-2022-06-15-22-56-08-396.png, > image-2022-06-15-22-57-15-964.png, image-2022-06-15-22-58-01-442.png, > image-2022-06-15-22-58-40-704.png > > > Spark currently uses parquet-mr as its Parquet reader/writer library, but the > built-in bit-packing en/decoding is not efficient enough. > Our optimization of Parquet bit-packing en/decoding with jdk.incubator.vector > in OpenJDK 18 brings a prominent performance improvement. > Because the Vector API has been part of OpenJDK since version 16, this optimization requires > JDK 16 or higher. > *Below are our test results* > The functional test is based on the open-source parquet-mr bit-pack decoding > function *_public final void unpack8Values(final byte[] in, final int inPos, > final int[] out, final int outPos)_*, > compared with our Vector API implementation *_public final void > unpack8Values_vec(final byte[] in, final int inPos, final int[] out, final > int outPos)_*. > We tested 10 pairs of decode functions (open-source Parquet bit unpacking vs. our optimized > vectorized SIMD implementation) with bit > width=\{1,2,3,4,5,6,7,8,9,10}; below are the test results: > !image-2022-06-15-22-56-08-396.png|width=437,height=223! > We integrated our bit-packing decode implementation into parquet-mr and tested > the Parquet batch-reader path from Spark's VectorizedParquetRecordReader, > which reads Parquet column data in batches. > We constructed Parquet files > with different row and column counts: the column data type is Int32, the > maximum int value is 127 (which satisfies bit-pack encoding with bit width=7), the > row count ranges from 10k to 100 million, and the column count ranges > from 1 to 4. > !image-2022-06-15-22-57-15-964.png|width=453,height=229! > !image-2022-06-15-22-58-01-442.png|width=439,height=217! > !image-2022-06-15-22-58-40-704.png|width=415,height=208! -- This message was sent by Atlassian Jira (v8.20.10#820010)
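To make the quoted signature concrete, here is a scalar sketch of what an `unpack8Values`-style function does for bit width 7 (little-endian bit order). The class and helper names are illustrative only; this is not the actual parquet-mr code, which is generated per bit width.

```java
public class Unpack7Demo {

    // Unpacks 8 values of bit width 7 from 7 consecutive bytes (8 * 7 = 56 bits),
    // little-endian bit order: value i occupies bits [7*i, 7*i+7) of the byte stream.
    static void unpack7(byte[] in, int inPos, int[] out, int outPos) {
        long packed = 0;
        for (int i = 0; i < 7; i++) {
            packed |= (in[inPos + i] & 0xFFL) << (8 * i);
        }
        for (int i = 0; i < 8; i++) {
            out[outPos + i] = (int) ((packed >>> (7 * i)) & 0x7F);
        }
    }

    public static void main(String[] args) {
        // Pack 8 seven-bit values the same way, then verify round trip.
        int[] values = {1, 127, 0, 64, 33, 5, 99, 2};
        long packed = 0;
        for (int i = 0; i < 8; i++) packed |= ((long) values[i]) << (7 * i);
        byte[] in = new byte[7];
        for (int i = 0; i < 7; i++) in[i] = (byte) (packed >>> (8 * i));

        int[] out = new int[8];
        unpack7(in, 0, out, 0);
        System.out.println(java.util.Arrays.equals(values, out)); // prints: true
    }
}
```

The SIMD version in the PR performs the same bit extraction, but on many bytes at once using `jdk.incubator.vector` shuffles and masks instead of a scalar loop.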
[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization
[ https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17683228#comment-17683228 ] ASF GitHub Bot commented on PARQUET-2159: - wgtmac commented on code in PR #1011: URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1094047290 ## parquet-generator/src/main/resources/ByteBitPacking512VectorLE: ## @@ -0,0 +1,3095 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.parquet.column.values.bitpacking; + +import jdk.incubator.vector.ByteVector; +import jdk.incubator.vector.IntVector; +import jdk.incubator.vector.LongVector; +import jdk.incubator.vector.ShortVector; +import jdk.incubator.vector.Vector; +import jdk.incubator.vector.VectorMask; +import jdk.incubator.vector.VectorOperators; +import jdk.incubator.vector.VectorShuffle; +import jdk.incubator.vector.VectorSpecies; + +import java.nio.ByteBuffer; + +/** + * This is an auto-generated source file and should not edit it directly. + */ +public abstract class ByteBitPacking512VectorLE { Review Comment: In this case, the script is not necessary. Manual bit-unpacking code is error-prone, we really rely on the quality and coverage of test cases. 
-- This message was sent by Atlassian Jira (v8.20.10#820010)
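The kind of per-bit-width round-trip coverage being discussed can be sketched as follows. This is a standalone toy packer/unpacker assuming LSB-first packing, not the actual parquet-mr test class; it only illustrates the testing strategy.

```java
import java.util.Arrays;
import java.util.Random;

public class BitPackRoundTrip {

    // Packs 8 values of the given bit width into exactly bitWidth bytes
    // (8 values * bitWidth bits = bitWidth * 8 bits), LSB-first.
    static byte[] pack8(int[] values, int bitWidth) {
        byte[] out = new byte[bitWidth];
        int bit = 0;
        for (int v : values) {
            for (int b = 0; b < bitWidth; b++) {
                if (((v >>> b) & 1) != 0) out[bit >> 3] |= (1 << (bit & 7));
                bit++;
            }
        }
        return out;
    }

    // Inverse of pack8: reads 8 * bitWidth bits back into 8 ints.
    static int[] unpack8(byte[] in, int bitWidth) {
        int[] out = new int[8];
        int bit = 0;
        for (int i = 0; i < 8; i++) {
            for (int b = 0; b < bitWidth; b++) {
                if (((in[bit >> 3] >> (bit & 7)) & 1) != 0) out[i] |= (1 << b);
                bit++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Property test: pack then unpack must be the identity for every bit width.
        Random rnd = new Random(42);
        for (int bitWidth = 1; bitWidth <= 31; bitWidth++) {
            int[] values = new int[8];
            for (int i = 0; i < 8; i++) values[i] = rnd.nextInt() >>> (32 - bitWidth);
            if (!Arrays.equals(values, unpack8(pack8(values, bitWidth), bitWidth))) {
                throw new AssertionError("round-trip failed at bitWidth " + bitWidth);
            }
        }
        System.out.println("round-trip OK for bit widths 1..31");
    }
}
```

A real test for the vectorized decoder would additionally compare its output against the existing scalar unpacker on the same random inputs, which is closer to what TestByteBitPacking512VectorLE reportedly does.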
[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization
[ https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17683215#comment-17683215 ] ASF GitHub Bot commented on PARQUET-2159: - jiangjiguang commented on code in PR #1011: URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1093985661 ## parquet-generator/src/main/resources/ByteBitPacking512VectorLE: ## @@ -0,0 +1,3095 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.parquet.column.values.bitpacking; + +import jdk.incubator.vector.ByteVector; +import jdk.incubator.vector.IntVector; +import jdk.incubator.vector.LongVector; +import jdk.incubator.vector.ShortVector; +import jdk.incubator.vector.Vector; +import jdk.incubator.vector.VectorMask; +import jdk.incubator.vector.VectorOperators; +import jdk.incubator.vector.VectorShuffle; +import jdk.incubator.vector.VectorSpecies; + +import java.nio.ByteBuffer; + +/** + * This is an auto-generated source file and should not edit it directly. + */ +public abstract class ByteBitPacking512VectorLE { Review Comment: @wgtmac I have the script, but it only generates part of the code. Completing the script would take hard work and a lot of time, which I don't think is necessary.
In fact, the code was mostly written by hand. Should I commit the partly completed script? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization
[ https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17683197#comment-17683197 ] ASF GitHub Bot commented on PARQUET-2159: - jiangjiguang commented on PR #1011: URL: https://github.com/apache/parquet-mr/pull/1011#issuecomment-1413064662 > @wgtmac I understand your concerns: 1. I will keep the content of the PR updated if needed when Java changes. 2. I have written a test, org.apache.parquet.column.values.bitpacking.TestByteBitPacking512VectorLE, to verify the generated code. 3. I have finished TPC-H integration testing with Spark; maybe I can write a document giving best practices for testing them. -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [C++] Parquet and Arrow overlap
Hi Will,

AFAIK, the Apache Parquet community no longer considers contributions to parquet-cpp when promoting new committers after the donation to Apache Arrow. It would be a dilemma for the parquet-cpp contributors if neither the Apache Arrow community nor the Apache Parquet community recognizes their work.

Does the parquet rust implementation have a similar issue?

Best,
Gang

On Thu, Feb 2, 2023 at 3:27 AM Will Jones wrote:
> [quoted message trimmed; see the original message below]
[C++] Parquet and Arrow overlap
Hello,

A while back, the Parquet C++ implementation was merged into the Apache Arrow monorepo [1]. As I understand it, this helped the development process immensely. However, I am noticing some governance issues because of it.

First, it's not obvious where issues are supposed to be opened: in Parquet Jira or Arrow GitHub issues. Looking back at some of the original discussion [2], it looks like the intention was:

> * use PARQUET-XXX for issues relating to Parquet core
> * use ARROW-XXX for issues relating to Arrow's consumption of Parquet core (e.g. changes that are in parquet/arrow right now)

The README for the old parquet-cpp repo [3] states instead in its migration note:

> JIRA issues should continue to be opened in the PARQUET JIRA project.

Either way, it doesn't seem like this process is obvious to people. Perhaps we could clarify this and add notices to Arrow's GitHub issue templates?

Second, committer status is a little unclear. I am a committer on Arrow, but not on Parquet right now. Does that mean I should only merge Parquet C++ PRs for code changes in parquet/arrow? Or that I shouldn't merge Parquet changes at all?

Also, are the contributions to Arrow C++ Parquet being actively reviewed for potential new committers?

Best,

Will Jones

[1] https://lists.apache.org/thread/76wzx2lsbwjl363bg066g8kdsocd03rw
[2] https://lists.apache.org/thread/dkh6vjomcfyjlvoy83qdk9j5jgxk7n4j
[3] https://github.com/apache/parquet-cpp
[jira] [Commented] (PARQUET-758) [Format] HALF precision FLOAT Logical type
[ https://issues.apache.org/jira/browse/PARQUET-758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17683128#comment-17683128 ] ASF GitHub Bot commented on PARQUET-758: shangxinli commented on PR #184: URL: https://github.com/apache/parquet-format/pull/184#issuecomment-1412526562 As @julienledem mentioned in the email discussion, let's have the corresponding PRs for support in the Java and C++ implementation ready before we merge this PR. We would like to have implementation support when the new type is released. > [Format] HALF precision FLOAT Logical type > -- > > Key: PARQUET-758 > URL: https://issues.apache.org/jira/browse/PARQUET-758 > Project: Parquet > Issue Type: Improvement > Components: parquet-format >Reporter: Julien Le Dem >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [parquet-format] pitrou commented on pull request #184: PARQUET-758: Add Float16/Half-float logical type
pitrou commented on PR #184: URL: https://github.com/apache/parquet-format/pull/184#issuecomment-1412476439 > @shangxinli are there guidelines for what needs to happen to accept this addition? I suppose it needs a discussion and then a formal vote on the ML? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (PARQUET-758) [Format] HALF precision FLOAT Logical type
[ https://issues.apache.org/jira/browse/PARQUET-758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17683111#comment-17683111 ] ASF GitHub Bot commented on PARQUET-758: emkornfield commented on PR #184: URL: https://github.com/apache/parquet-format/pull/184#issuecomment-1412470881 @shangxinli are there guidelines for what needs to happen to accept this addition? -- This message was sent by Atlassian Jira (v8.20.10#820010)
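For context on what a reader of the proposed type has to do, a minimal IEEE 754 binary16 (half-precision) decoder might look like this. This is only a sketch: parquet-format would define just the 2-byte layout, and the class and method names here are illustrative (newer JDKs also ship `Float.float16ToFloat`).

```java
public class HalfFloat {

    // Decodes an IEEE 754 binary16 value stored in a short:
    // 1 sign bit, 5 exponent bits (bias 15), 10 mantissa bits.
    static float halfToFloat(short h) {
        int sign = (h >> 15) & 0x1;
        int exp = (h >> 10) & 0x1F;
        int mant = h & 0x3FF;
        float val;
        if (exp == 0) {                        // zero or subnormal: mant * 2^-24
            val = mant * 0x1p-24f;
        } else if (exp == 0x1F) {              // infinity or NaN
            val = (mant == 0) ? Float.POSITIVE_INFINITY : Float.NaN;
        } else {                               // normal: (1 + mant/1024) * 2^(exp-15)
            val = (1f + mant / 1024f) * (float) Math.scalb(1.0, exp - 15);
        }
        return sign == 1 ? -val : val;
    }

    public static void main(String[] args) {
        System.out.println(halfToFloat((short) 0x3C00)); // prints: 1.0
        System.out.println(halfToFloat((short) 0xC000)); // prints: -2.0
        System.out.println(halfToFloat((short) 0x7C00)); // prints: Infinity
    }
}
```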
Parquet array schema incompatibilities
Hello,

I wanted to raise attention to incompatibilities between Parquet, Avro, and parquet-cli. My main findings can be found here: https://github.com/MrR0807/Notes/blob/master/parquet-not-working-cases.md#simple-schema-with-array. In short, the recommended schema definition for lists per parquet-format (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists) does not work well with Avro, with parquet-cli, or in general.

I wonder what you think about this. Is it something that should be explicitly addressed, at least in the documentation? Are you aware of these problems? I can create a PR against the documentation, but before that I wanted to validate it with you.

Thank you,
Laurynas
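For readers skimming the thread, the three-level list encoding that LogicalTypes.md recommends has this shape (the `List<String>` example from the spec, with a non-null list of non-null elements):

```
// List<String> (list non-null, elements non-null)
required group my_list (LIST) {
  repeated group list {
    required binary element (UTF8);
  }
}
```

The incompatibilities reported above arise because some writers and readers (Avro conversions among them) historically used two-level or differently named structures instead of this exact shape.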