[jira] [Commented] (PARQUET-2173) Fix parquet build against hadoop 3.3.3+

2023-02-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17683250#comment-17683250
 ] 

ASF GitHub Bot commented on PARQUET-2173:
-

gszadovszky merged PR #985:
URL: https://github.com/apache/parquet-mr/pull/985




> Fix parquet build against hadoop 3.3.3+
> ---
>
> Key: PARQUET-2173
> URL: https://issues.apache.org/jira/browse/PARQUET-2173
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cli
>Affects Versions: 1.13.0
>Reporter: Steve Loughran
>Priority: Major
>
> Parquet won't build against Hadoop 3.3.3+ because Hadoop swapped out log4j 
> 1.2.17 for reload4j, and this creates Maven dependency problems in parquet-cli.
> {code}
> [INFO] --- maven-dependency-plugin:3.1.1:analyze-only (default) @ parquet-cli 
> ---
> [WARNING] Used undeclared dependencies found:
> [WARNING]ch.qos.reload4j:reload4j:jar:1.2.22:provided
> {code}
> The hadoop-common dependencies need to exclude this jar and any changed slf4j 
> artifacts.
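
For anyone following along, the fix boils down to a dependency exclusion on hadoop-common (presumably in the parquet-cli POM). The snippet below is only an illustrative sketch based on the warning above — the exact artifact list is an assumption, and PR #985 contains the actual change:

{code}
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <scope>provided</scope>
  <exclusions>
    <!-- exclude the reload4j jar pulled in by Hadoop 3.3.3+ -->
    <exclusion>
      <groupId>ch.qos.reload4j</groupId>
      <artifactId>reload4j</artifactId>
    </exclusion>
    <!-- and the matching slf4j binding (assumed artifact name) -->
    <exclusion>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-reload4j</artifactId>
    </exclusion>
  </exclusions>
</dependency>
{code}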



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-mr] gszadovszky merged pull request #985: PARQUET-2173. Fix parquet build against hadoop 3.3.3+

2023-02-01 Thread via GitHub


gszadovszky merged PR #985:
URL: https://github.com/apache/parquet-mr/pull/985


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization

2023-02-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17683229#comment-17683229
 ] 

ASF GitHub Bot commented on PARQUET-2159:
-

jiangjiguang commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1094064182


##
parquet-generator/src/main/resources/ByteBitPacking512VectorLE:
##
@@ -0,0 +1,3095 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.parquet.column.values.bitpacking;
+
+import jdk.incubator.vector.ByteVector;
+import jdk.incubator.vector.IntVector;
+import jdk.incubator.vector.LongVector;
+import jdk.incubator.vector.ShortVector;
+import jdk.incubator.vector.Vector;
+import jdk.incubator.vector.VectorMask;
+import jdk.incubator.vector.VectorOperators;
+import jdk.incubator.vector.VectorShuffle;
+import jdk.incubator.vector.VectorSpecies;
+
+import java.nio.ByteBuffer;
+
+/**
+ * This is an auto-generated source file and should not edit it directly.
+ */
+public abstract class ByteBitPacking512VectorLE {

Review Comment:
   > In this case, the script is not necessary. Manual bit-unpacking code is 
error-prone, so we really rely on the quality and coverage of the test cases.
   
   @wgtmac  I strongly agree with you, so I have tried my best to cover all 
aspects (every bitWidth from 1 to 32) in the class TestByteBitPacking512VectorLE. 
Besides, I have run TPC-H tests and compared the query results with those from 
before the optimization. In short, I have done extra work to ensure the code 
quality.
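
As context for the discussion above, the shape of such a cross-check can be sketched without any SIMD code at all: pack values at every bit width with a plain scalar reference and verify that unpacking restores them. The class below is a standalone illustration (it is not code from the PR or from TestByteBitPacking512VectorLE); in a real test the unpack step would be the vectorized kernel under test, with the existing scalar unpacker providing the expected output.

{code}
import java.util.Arrays;
import java.util.Random;

/** Standalone LSB-first (little-endian) bit-packing round trip, illustration only. */
public class BitPackRoundTrip {

  static byte[] pack(int[] values, int width) {
    byte[] out = new byte[(values.length * width + 7) / 8];
    long mask = (width == 32) ? 0xFFFFFFFFL : (1L << width) - 1;
    long buffer = 0;
    int bits = 0, pos = 0;
    for (int v : values) {
      buffer |= (v & mask) << bits;   // append the value LSB-first
      bits += width;
      while (bits >= 8) {             // flush complete bytes
        out[pos++] = (byte) buffer;
        buffer >>>= 8;
        bits -= 8;
      }
    }
    if (bits > 0) {
      out[pos] = (byte) buffer;       // trailing partial byte, if any
    }
    return out;
  }

  static int[] unpack(byte[] in, int count, int width) {
    int[] out = new int[count];
    long mask = (width == 32) ? 0xFFFFFFFFL : (1L << width) - 1;
    long buffer = 0;
    int bits = 0, pos = 0;
    for (int i = 0; i < count; i++) {
      while (bits < width) {          // refill the bit buffer byte by byte
        buffer |= (in[pos++] & 0xFFL) << bits;
        bits += 8;
      }
      out[i] = (int) (buffer & mask);
      buffer >>>= width;
      bits -= width;
    }
    return out;
  }

  public static void main(String[] args) {
    Random rnd = new Random(42);
    for (int width = 1; width <= 32; width++) {
      long mask = (width == 32) ? 0xFFFFFFFFL : (1L << width) - 1;
      int[] values = new int[64];
      for (int i = 0; i < values.length; i++) {
        values[i] = (int) (rnd.nextLong() & mask);
      }
      int[] decoded = unpack(pack(values, width), values.length, width);
      if (!Arrays.equals(values, decoded)) {
        throw new AssertionError("round trip failed for bit width " + width);
      }
    }
    System.out.println("round trip OK for bit widths 1..32");
  }
}
{code}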





> Parquet bit-packing de/encode optimization
> --
>
> Key: PARQUET-2159
> URL: https://issues.apache.org/jira/browse/PARQUET-2159
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Fang-Xie
>Assignee: Fang-Xie
>Priority: Major
> Fix For: 1.13.0
>
> Attachments: image-2022-06-15-22-56-08-396.png, 
> image-2022-06-15-22-57-15-964.png, image-2022-06-15-22-58-01-442.png, 
> image-2022-06-15-22-58-40-704.png
>
>
> Spark currently uses parquet-mr as its Parquet reader/writer library, but the 
> built-in bit-packing en/decoding is not efficient enough.
> Our optimization of Parquet bit-packing en/decoding with jdk.incubator.vector 
> in OpenJDK 18 brings a prominent performance improvement.
> Because the Vector API has been part of OpenJDK since JDK 16, this 
> optimization requires JDK 16 or higher.
> *Below are our test results.*
> The functional test is based on the open-source parquet-mr bit-pack decoding 
> function *_public final void unpack8Values(final byte[] in, final int inPos, 
> final int[] out, final int outPos)_*, 
> compared with our Vector API implementation *_public final void 
> unpack8Values_vec(final byte[] in, final int inPos, final int[] out, final 
> int outPos)_*.
> We tested 10 pairs of decode functions (open-source parquet bit unpacking vs. 
> our optimized, vectorized SIMD implementation) with bit 
> width=\{1,2,3,4,5,6,7,8,9,10}; the test results are below:
> !image-2022-06-15-22-56-08-396.png|width=437,height=223!
> We integrated our bit-packing decode implementation into parquet-mr and tested 
> the Parquet batch-reading capability via Spark's VectorizedParquetRecordReader, 
> which fetches Parquet column data in batches. We constructed Parquet files 
> with different row and column counts; the column data type is Int32 and the 
> maximum int value is 127, which satisfies bit-pack encoding with bit width=7. 
> The row count ranges from 10k to 100 million and the column count ranges 
> from 1 to 4.
> !image-2022-06-15-22-57-15-964.png|width=453,height=229!
> !image-2022-06-15-22-58-01-442.png|width=439,height=217!
> !image-2022-06-15-22-58-40-704.png|width=415,height=208!
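
For readers who have not used jdk.incubator.vector, the sketch below shows only the general loop pattern this kind of optimization builds on — pick a species, load a vector, apply a lanewise operation, store, and handle the leftover elements with a scalar tail. It is not the unpack8Values_vec kernel from the description above; the 0x7F mask merely echoes the bit width=7 scenario. It needs JDK 16+ and --add-modules jdk.incubator.vector.

{code}
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class VectorLoopPattern {
  // The widest species the running CPU supports (512-bit lanes on AVX-512 hardware).
  private static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;

  /** Masks every value to its low 7 bits: SIMD main loop plus a scalar tail. */
  static void maskTo7Bits(int[] in, int[] out) {
    int i = 0;
    int upper = SPECIES.loopBound(in.length);        // largest multiple of the lane count
    for (; i < upper; i += SPECIES.length()) {
      IntVector v = IntVector.fromArray(SPECIES, in, i);
      v.lanewise(VectorOperators.AND, 0x7F).intoArray(out, i);
    }
    for (; i < in.length; i++) {                     // scalar tail for the leftovers
      out[i] = in[i] & 0x7F;
    }
  }
}
{code}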



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-mr] jiangjiguang commented on a diff in pull request #1011: PARQUET-2159: java17 vector parquet bit-packing decode optimization

2023-02-01 Thread via GitHub


jiangjiguang commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1094064182


##
parquet-generator/src/main/resources/ByteBitPacking512VectorLE:
##
@@ -0,0 +1,3095 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.parquet.column.values.bitpacking;
+
+import jdk.incubator.vector.ByteVector;
+import jdk.incubator.vector.IntVector;
+import jdk.incubator.vector.LongVector;
+import jdk.incubator.vector.ShortVector;
+import jdk.incubator.vector.Vector;
+import jdk.incubator.vector.VectorMask;
+import jdk.incubator.vector.VectorOperators;
+import jdk.incubator.vector.VectorShuffle;
+import jdk.incubator.vector.VectorSpecies;
+
+import java.nio.ByteBuffer;
+
+/**
+ * This is an auto-generated source file and should not edit it directly.
+ */
+public abstract class ByteBitPacking512VectorLE {

Review Comment:
   > In this case, the script is not necessary. Manual bit-unpacking code is 
error-prone, so we really rely on the quality and coverage of the test cases.
   
   @wgtmac  I strongly agree with you, so I have tried my best to cover all 
aspects (every bitWidth from 1 to 32) in the class TestByteBitPacking512VectorLE. 
Besides, I have run TPC-H tests and compared the query results with those from 
before the optimization. In short, I have done extra work to ensure the code 
quality.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization

2023-02-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17683228#comment-17683228
 ] 

ASF GitHub Bot commented on PARQUET-2159:
-

wgtmac commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1094047290


##
parquet-generator/src/main/resources/ByteBitPacking512VectorLE:
##
@@ -0,0 +1,3095 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.parquet.column.values.bitpacking;
+
+import jdk.incubator.vector.ByteVector;
+import jdk.incubator.vector.IntVector;
+import jdk.incubator.vector.LongVector;
+import jdk.incubator.vector.ShortVector;
+import jdk.incubator.vector.Vector;
+import jdk.incubator.vector.VectorMask;
+import jdk.incubator.vector.VectorOperators;
+import jdk.incubator.vector.VectorShuffle;
+import jdk.incubator.vector.VectorSpecies;
+
+import java.nio.ByteBuffer;
+
+/**
+ * This is an auto-generated source file and should not edit it directly.
+ */
+public abstract class ByteBitPacking512VectorLE {

Review Comment:
   In this case, the script is not necessary. Manual bit-unpacking code is 
error-prone, so we really rely on the quality and coverage of the test cases.





> Parquet bit-packing de/encode optimization
> --
>
> Key: PARQUET-2159
> URL: https://issues.apache.org/jira/browse/PARQUET-2159
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Fang-Xie
>Assignee: Fang-Xie
>Priority: Major
> Fix For: 1.13.0
>
> Attachments: image-2022-06-15-22-56-08-396.png, 
> image-2022-06-15-22-57-15-964.png, image-2022-06-15-22-58-01-442.png, 
> image-2022-06-15-22-58-40-704.png
>
>
> Spark currently uses parquet-mr as its Parquet reader/writer library, but the 
> built-in bit-packing en/decoding is not efficient enough.
> Our optimization of Parquet bit-packing en/decoding with jdk.incubator.vector 
> in OpenJDK 18 brings a prominent performance improvement.
> Because the Vector API has been part of OpenJDK since JDK 16, this 
> optimization requires JDK 16 or higher.
> *Below are our test results.*
> The functional test is based on the open-source parquet-mr bit-pack decoding 
> function *_public final void unpack8Values(final byte[] in, final int inPos, 
> final int[] out, final int outPos)_*, 
> compared with our Vector API implementation *_public final void 
> unpack8Values_vec(final byte[] in, final int inPos, final int[] out, final 
> int outPos)_*.
> We tested 10 pairs of decode functions (open-source parquet bit unpacking vs. 
> our optimized, vectorized SIMD implementation) with bit 
> width=\{1,2,3,4,5,6,7,8,9,10}; the test results are below:
> !image-2022-06-15-22-56-08-396.png|width=437,height=223!
> We integrated our bit-packing decode implementation into parquet-mr and tested 
> the Parquet batch-reading capability via Spark's VectorizedParquetRecordReader, 
> which fetches Parquet column data in batches. We constructed Parquet files 
> with different row and column counts; the column data type is Int32 and the 
> maximum int value is 127, which satisfies bit-pack encoding with bit width=7. 
> The row count ranges from 10k to 100 million and the column count ranges 
> from 1 to 4.
> !image-2022-06-15-22-57-15-964.png|width=453,height=229!
> !image-2022-06-15-22-58-01-442.png|width=439,height=217!
> !image-2022-06-15-22-58-40-704.png|width=415,height=208!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1011: PARQUET-2159: java17 vector parquet bit-packing decode optimization

2023-02-01 Thread via GitHub


wgtmac commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1094047290


##
parquet-generator/src/main/resources/ByteBitPacking512VectorLE:
##
@@ -0,0 +1,3095 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.parquet.column.values.bitpacking;
+
+import jdk.incubator.vector.ByteVector;
+import jdk.incubator.vector.IntVector;
+import jdk.incubator.vector.LongVector;
+import jdk.incubator.vector.ShortVector;
+import jdk.incubator.vector.Vector;
+import jdk.incubator.vector.VectorMask;
+import jdk.incubator.vector.VectorOperators;
+import jdk.incubator.vector.VectorShuffle;
+import jdk.incubator.vector.VectorSpecies;
+
+import java.nio.ByteBuffer;
+
+/**
+ * This is an auto-generated source file and should not edit it directly.
+ */
+public abstract class ByteBitPacking512VectorLE {

Review Comment:
   In this case, the script is not necessary. Manual bit-unpacking code is 
error-prone, so we really rely on the quality and coverage of the test cases.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization

2023-02-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17683215#comment-17683215
 ] 

ASF GitHub Bot commented on PARQUET-2159:
-

jiangjiguang commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1093985661


##
parquet-generator/src/main/resources/ByteBitPacking512VectorLE:
##
@@ -0,0 +1,3095 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.parquet.column.values.bitpacking;
+
+import jdk.incubator.vector.ByteVector;
+import jdk.incubator.vector.IntVector;
+import jdk.incubator.vector.LongVector;
+import jdk.incubator.vector.ShortVector;
+import jdk.incubator.vector.Vector;
+import jdk.incubator.vector.VectorMask;
+import jdk.incubator.vector.VectorOperators;
+import jdk.incubator.vector.VectorShuffle;
+import jdk.incubator.vector.VectorSpecies;
+
+import java.nio.ByteBuffer;
+
+/**
+ * This is an auto-generated source file and should not edit it directly.
+ */
+public abstract class ByteBitPacking512VectorLE {

Review Comment:
   @wgtmac I have the script, but it only generates part of the code. 
   It would take a lot of work and time to complete the script (and I don't 
think that is necessary). 
   In fact, most of the code was written manually.
   Should I commit the partly completed script? 





> Parquet bit-packing de/encode optimization
> --
>
> Key: PARQUET-2159
> URL: https://issues.apache.org/jira/browse/PARQUET-2159
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Fang-Xie
>Assignee: Fang-Xie
>Priority: Major
> Fix For: 1.13.0
>
> Attachments: image-2022-06-15-22-56-08-396.png, 
> image-2022-06-15-22-57-15-964.png, image-2022-06-15-22-58-01-442.png, 
> image-2022-06-15-22-58-40-704.png
>
>
> Spark currently uses parquet-mr as its Parquet reader/writer library, but the 
> built-in bit-packing en/decoding is not efficient enough.
> Our optimization of Parquet bit-packing en/decoding with jdk.incubator.vector 
> in OpenJDK 18 brings a prominent performance improvement.
> Because the Vector API has been part of OpenJDK since JDK 16, this 
> optimization requires JDK 16 or higher.
> *Below are our test results.*
> The functional test is based on the open-source parquet-mr bit-pack decoding 
> function *_public final void unpack8Values(final byte[] in, final int inPos, 
> final int[] out, final int outPos)_*, 
> compared with our Vector API implementation *_public final void 
> unpack8Values_vec(final byte[] in, final int inPos, final int[] out, final 
> int outPos)_*.
> We tested 10 pairs of decode functions (open-source parquet bit unpacking vs. 
> our optimized, vectorized SIMD implementation) with bit 
> width=\{1,2,3,4,5,6,7,8,9,10}; the test results are below:
> !image-2022-06-15-22-56-08-396.png|width=437,height=223!
> We integrated our bit-packing decode implementation into parquet-mr and tested 
> the Parquet batch-reading capability via Spark's VectorizedParquetRecordReader, 
> which fetches Parquet column data in batches. We constructed Parquet files 
> with different row and column counts; the column data type is Int32 and the 
> maximum int value is 127, which satisfies bit-pack encoding with bit width=7. 
> The row count ranges from 10k to 100 million and the column count ranges 
> from 1 to 4.
> !image-2022-06-15-22-57-15-964.png|width=453,height=229!
> !image-2022-06-15-22-58-01-442.png|width=439,height=217!
> !image-2022-06-15-22-58-40-704.png|width=415,height=208!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-mr] jiangjiguang commented on a diff in pull request #1011: PARQUET-2159: java17 vector parquet bit-packing decode optimization

2023-02-01 Thread via GitHub


jiangjiguang commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1093985661


##
parquet-generator/src/main/resources/ByteBitPacking512VectorLE:
##
@@ -0,0 +1,3095 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.parquet.column.values.bitpacking;
+
+import jdk.incubator.vector.ByteVector;
+import jdk.incubator.vector.IntVector;
+import jdk.incubator.vector.LongVector;
+import jdk.incubator.vector.ShortVector;
+import jdk.incubator.vector.Vector;
+import jdk.incubator.vector.VectorMask;
+import jdk.incubator.vector.VectorOperators;
+import jdk.incubator.vector.VectorShuffle;
+import jdk.incubator.vector.VectorSpecies;
+
+import java.nio.ByteBuffer;
+
+/**
+ * This is an auto-generated source file and should not edit it directly.
+ */
+public abstract class ByteBitPacking512VectorLE {

Review Comment:
   @wgtmac I have the script, but it only generates part of the code. 
   It would take a lot of work and time to complete the script (and I don't 
think that is necessary). 
   In fact, most of the code was written manually.
   Should I commit the partly completed script? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization

2023-02-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17683197#comment-17683197
 ] 

ASF GitHub Bot commented on PARQUET-2159:
-

jiangjiguang commented on PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#issuecomment-1413064662

   > 
   
   @wgtmac I understand your concerns: 
1. I will keep the content of the PR updated as needed when Java changes.
2. I have written a test to verify the generated code: 
org.apache.parquet.column.values.bitpacking.TestByteBitPacking512VectorLE.
3. I have finished TPC-H integration testing with Spark; maybe I can 
write a document describing best practices for testing it.




> Parquet bit-packing de/encode optimization
> --
>
> Key: PARQUET-2159
> URL: https://issues.apache.org/jira/browse/PARQUET-2159
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Fang-Xie
>Assignee: Fang-Xie
>Priority: Major
> Fix For: 1.13.0
>
> Attachments: image-2022-06-15-22-56-08-396.png, 
> image-2022-06-15-22-57-15-964.png, image-2022-06-15-22-58-01-442.png, 
> image-2022-06-15-22-58-40-704.png
>
>
> Spark currently uses parquet-mr as its Parquet reader/writer library, but the 
> built-in bit-packing en/decoding is not efficient enough.
> Our optimization of Parquet bit-packing en/decoding with jdk.incubator.vector 
> in OpenJDK 18 brings a prominent performance improvement.
> Because the Vector API has been part of OpenJDK since JDK 16, this 
> optimization requires JDK 16 or higher.
> *Below are our test results.*
> The functional test is based on the open-source parquet-mr bit-pack decoding 
> function *_public final void unpack8Values(final byte[] in, final int inPos, 
> final int[] out, final int outPos)_*, 
> compared with our Vector API implementation *_public final void 
> unpack8Values_vec(final byte[] in, final int inPos, final int[] out, final 
> int outPos)_*.
> We tested 10 pairs of decode functions (open-source parquet bit unpacking vs. 
> our optimized, vectorized SIMD implementation) with bit 
> width=\{1,2,3,4,5,6,7,8,9,10}; the test results are below:
> !image-2022-06-15-22-56-08-396.png|width=437,height=223!
> We integrated our bit-packing decode implementation into parquet-mr and tested 
> the Parquet batch-reading capability via Spark's VectorizedParquetRecordReader, 
> which fetches Parquet column data in batches. We constructed Parquet files 
> with different row and column counts; the column data type is Int32 and the 
> maximum int value is 127, which satisfies bit-pack encoding with bit width=7. 
> The row count ranges from 10k to 100 million and the column count ranges 
> from 1 to 4.
> !image-2022-06-15-22-57-15-964.png|width=453,height=229!
> !image-2022-06-15-22-58-01-442.png|width=439,height=217!
> !image-2022-06-15-22-58-40-704.png|width=415,height=208!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-mr] jiangjiguang commented on pull request #1011: PARQUET-2159: java17 vector parquet bit-packing decode optimization

2023-02-01 Thread via GitHub


jiangjiguang commented on PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#issuecomment-1413064662

   > 
   
   @wgtmac I understand your concerns: 
1. I will keep the content of the PR updated as needed when Java changes.
2. I have written a test to verify the generated code: 
org.apache.parquet.column.values.bitpacking.TestByteBitPacking512VectorLE.
3. I have finished TPC-H integration testing with Spark; maybe I can 
write a document describing best practices for testing it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [C++] Parquet and Arrow overlap

2023-02-01 Thread Gang Wu
Hi Will,

AFAIK, since the donation to Apache Arrow, the Apache Parquet community no
longer considers contributions to parquet-cpp when promoting new committers.

It would be a dilemma for the parquet-cpp contributors if neither the
Apache Arrow community nor the Apache Parquet community recognized their work.

Does the Parquet Rust implementation have a similar issue?

Best,
Gang

On Thu, Feb 2, 2023 at 3:27 AM Will Jones  wrote:

> Hello,
>
> A while back, the Parquet C++ implementation was merged into the Apache
> Arrow monorepo [1]. As I understand it, this helped the development process
> immensely. However, I am noticing some governance issues because of it.
>
> First, it's not obvious where issues are supposed to be opened: in Parquet
> Jira or in Arrow GitHub issues. Looking back at some of the original
> discussion, it looks like the intention was:
>
> * use PARQUET-XXX for issues relating to Parquet core
> > * use ARROW-XXX for issues relating to Arrow's consumption of Parquet
> > core (e.g. changes that are in parquet/arrow right now)
> >
>
> The README for the old parquet-cpp repo [3] states instead in its
> migration note:
>
>  JIRA issues should continue to be opened in the PARQUET JIRA project.
>
>
> Either way, it doesn't seem like this process is obvious to people. Perhaps
> we could clarify this and add notices to Arrow's GitHub issues template?
>
> Second, committer status is a little unclear. I am a committer on Arrow,
> but not on Parquet right now. Does that mean I should only merge Parquet
> C++ PRs for code changes in parquet/arrow? Or that I shouldn't merge
> Parquet changes at all?
>
> Also, are the contributions to Arrow C++ Parquet being actively reviewed
> for potential new committers?
>
> Best,
>
> Will Jones
>
> [1] https://lists.apache.org/thread/76wzx2lsbwjl363bg066g8kdsocd03rw
> [2] https://lists.apache.org/thread/dkh6vjomcfyjlvoy83qdk9j5jgxk7n4j
> [3] https://github.com/apache/parquet-cpp
>


[C++] Parquet and Arrow overlap

2023-02-01 Thread Will Jones
Hello,

A while back, the Parquet C++ implementation was merged into the Apache
Arrow monorepo [1]. As I understand it, this helped the development process
immensely. However, I am noticing some governance issues because of it.

First, it's not obvious where issues are supposed to be opened: in Parquet
Jira or in Arrow GitHub issues. Looking back at some of the original
discussion, it looks like the intention was:

* use PARQUET-XXX for issues relating to Parquet core
> * use ARROW-XXX for issues relating to Arrow's consumption of Parquet
> core (e.g. changes that are in parquet/arrow right now)
>

The README for the old parquet-cpp repo [3] states instead in its
migration note:

 JIRA issues should continue to be opened in the PARQUET JIRA project.


Either way, it doesn't seem like this process is obvious to people. Perhaps
we could clarify this and add notices to Arrow's GitHub issues template?

Second, committer status is a little unclear. I am a committer on Arrow,
but not on Parquet right now. Does that mean I should only merge Parquet
C++ PRs for code changes in parquet/arrow? Or that I shouldn't merge
Parquet changes at all?

Also, are the contributions to Arrow C++ Parquet being actively reviewed
for potential new committers?

Best,

Will Jones

[1] https://lists.apache.org/thread/76wzx2lsbwjl363bg066g8kdsocd03rw
[2] https://lists.apache.org/thread/dkh6vjomcfyjlvoy83qdk9j5jgxk7n4j
[3] https://github.com/apache/parquet-cpp


[jira] [Commented] (PARQUET-758) [Format] HALF precision FLOAT Logical type

2023-02-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17683128#comment-17683128
 ] 

ASF GitHub Bot commented on PARQUET-758:


shangxinli commented on PR #184:
URL: https://github.com/apache/parquet-format/pull/184#issuecomment-1412526562

   As @julienledem mentioned in the email discussion, let's have the 
corresponding PRs for support in the Java and C++ implementations ready before 
we merge this PR. We would like to have implementation support available when 
the new type is released. 
   




> [Format] HALF precision FLOAT Logical type
> --
>
> Key: PARQUET-758
> URL: https://issues.apache.org/jira/browse/PARQUET-758
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Julien Le Dem
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-format] shangxinli commented on pull request #184: PARQUET-758: Add Float16/Half-float logical type

2023-02-01 Thread via GitHub


shangxinli commented on PR #184:
URL: https://github.com/apache/parquet-format/pull/184#issuecomment-1412526562

   As @julienledem mentioned in the email discussion, let's have the 
corresponding PRs for support in the Java and C++ implementations ready before 
we merge this PR. We would like to have implementation support available when 
the new type is released. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-format] pitrou commented on pull request #184: PARQUET-758: Add Float16/Half-float logical type

2023-02-01 Thread via GitHub


pitrou commented on PR #184:
URL: https://github.com/apache/parquet-format/pull/184#issuecomment-1412476439

   > @shangxinli are there guidelines for what needs to happen to accept this 
addition?
   
   I suppose it needs a discussion and then a formal vote on the ML?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-758) [Format] HALF precision FLOAT Logical type

2023-02-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17683113#comment-17683113
 ] 

ASF GitHub Bot commented on PARQUET-758:


pitrou commented on PR #184:
URL: https://github.com/apache/parquet-format/pull/184#issuecomment-1412476439

   > @shangxinli are there guidelines for what needs to happen to accept this 
addition?
   
   I suppose it needs a discussion and then a formal vote on the ML?
   




> [Format] HALF precision FLOAT Logical type
> --
>
> Key: PARQUET-758
> URL: https://issues.apache.org/jira/browse/PARQUET-758
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Julien Le Dem
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-758) [Format] HALF precision FLOAT Logical type

2023-02-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17683111#comment-17683111
 ] 

ASF GitHub Bot commented on PARQUET-758:


emkornfield commented on PR #184:
URL: https://github.com/apache/parquet-format/pull/184#issuecomment-1412470881

   @shangxinli are there guidelines for what needs to happen to accept this 
addition?




> [Format] HALF precision FLOAT Logical type
> --
>
> Key: PARQUET-758
> URL: https://issues.apache.org/jira/browse/PARQUET-758
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Julien Le Dem
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-format] emkornfield commented on pull request #184: PARQUET-758: Add Float16/Half-float logical type

2023-02-01 Thread via GitHub


emkornfield commented on PR #184:
URL: https://github.com/apache/parquet-format/pull/184#issuecomment-1412470881

   @shangxinli are there guidelines for what needs to happen to accept this 
addition?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Parquet array schema incompatibilities

2023-02-01 Thread Laurynas Katkus
Hello,

I wanted to draw attention to some incompatibilities between Parquet, Avro
and parquet-cli. My main findings can be found here:
https://github.com/MrR0807/Notes/blob/master/parquet-not-working-cases.md#simple-schema-with-array.
In short, the schema definition recommended for lists by parquet-format
(https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists)
does not work well with Avro, with parquet-cli, or in general. What do you
think about this? Is it something that should be addressed explicitly, at
least in the documentation? Are you aware of these problems? I can open a PR
against the documentation, but before that I wanted to validate it with you.
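
For reference, the three-level list structure that LogicalTypes.md recommends
looks roughly like this (paraphrased here from memory; the linked document is
authoritative):

    <list-repetition> group <name> (LIST) {
      repeated group list {
        <element-repetition> <element-type> element;
      }
    }

My understanding is that older writers emit different shapes (parquet-avro,
for instance, has historically defaulted to a two-level layout controlled by
the parquet.avro.write-old-list-structure setting), which seems to be the kind
of mismatch described in the notes above.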

Thank you,
Laurynas