[jira] [Commented] (PARQUET-2251) Avoid generating Bloomfilter when all pages of a column are encoded by dictionary

2023-02-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17693833#comment-17693833
 ] 

ASF GitHub Bot commented on PARQUET-2251:
-

gszadovszky merged PR #1033:
URL: https://github.com/apache/parquet-mr/pull/1033




> Avoid generating Bloomfilter when all pages of a column are encoded by 
> dictionary
> -
>
> Key: PARQUET-2251
> URL: https://issues.apache.org/jira/browse/PARQUET-2251
> Project: Parquet
>  Issue Type: Bug
>Reporter: Mars
>Priority: Major
>
> In Parquet page V1, a BloomFilter is still generated even when all pages of a 
> column are dictionary-encoded. Generating it in that case is unnecessary: it 
> costs time and occupies storage.
> Parquet page V2 does not generate a BloomFilter when all pages of a column 
> are dictionary-encoded.
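
The decision described in the issue can be sketched roughly as follows. This is a minimal illustration, not the parquet-mr implementation: the class, the `Encoding` enum and the `shouldWriteBloomFilter` method are hypothetical names, and the rule assumes that a fully dictionary-encoded column chunk can already be filtered through its dictionary.

```java
import java.util.Set;

// Hypothetical sketch; none of these names come from parquet-mr.
final class BloomFilterDecisionSketch {
  // Stand-in for the encodings a writer might record for its data pages.
  enum Encoding { PLAIN, PLAIN_DICTIONARY, RLE_DICTIONARY }

  /**
   * Returns false only when a dictionary page exists and every data page
   * is dictionary-encoded, i.e. the writer never fell back to another encoding.
   */
  static boolean shouldWriteBloomFilter(boolean hasDictionaryPage,
                                        Set<Encoding> dataPageEncodings) {
    // Assumption: without a dictionary page (or without any recorded data
    // pages) this sketch conservatively keeps the bloom filter.
    if (!hasDictionaryPage || dataPageEncodings.isEmpty()) {
      return true;
    }
    for (Encoding encoding : dataPageEncodings) {
      if (encoding != Encoding.PLAIN_DICTIONARY && encoding != Encoding.RLE_DICTIONARY) {
        return true; // at least one page fell back to a non-dictionary encoding
      }
    }
    return false;
  }
}
```

With this rule, a column whose pages are all `RLE_DICTIONARY` skips the filter, while a column that fell back to `PLAIN` for some pages keeps it.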



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-mr] gszadovszky merged pull request #1033: PARQUET-2251 Avoid generating Bloomfilter when all pages of a column are encoded by dictionary in parquet v1

2023-02-26 Thread via GitHub


gszadovszky merged PR #1033:
URL: https://github.com/apache/parquet-mr/pull/1033


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-2230) Add a new rewrite command powered by ParquetRewriter

2023-02-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17693830#comment-17693830
 ] 

ASF GitHub Bot commented on PARQUET-2230:
-

gszadovszky merged PR #1034:
URL: https://github.com/apache/parquet-mr/pull/1034




> Add a new rewrite command powered by ParquetRewriter
> 
>
> Key: PARQUET-2230
> URL: https://issues.apache.org/jira/browse/PARQUET-2230
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-cli
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> parquet-cli has several commands for rewriting files, but it is missing a 
> consolidated one that provides the full feature set of ParquetRewriter.





[GitHub] [parquet-mr] gszadovszky merged pull request #1034: PARQUET-2230: Add a new rewrite command powered by ParquetRewriter

2023-02-26 Thread via GitHub


gszadovszky merged PR #1034:
URL: https://github.com/apache/parquet-mr/pull/1034





[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization

2023-02-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17693804#comment-17693804
 ] 

ASF GitHub Bot commented on PARQUET-2159:
-

wgtmac commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1118319141


##
parquet-generator/src/main/java/org/apache/parquet/encoding/vectorbitpacking/BitPackingGenerator512Vector.java:
##
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.encoding.vectorbitpacking;
+
+import java.io.File;
+import java.io.FileOutputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.OutputStream;
+
+/**
+ * This class generates vector bit packers that pack the most significant bit first.
+ * The result of the generation is checked in. To regenerate the code run this class and check in the result.
+ */
+public class BitPackingGenerator512Vector {
+  private static final String CLASS_NAME_PREFIX_FOR_INT = "ByteBitPacking512Vector";
+  private static final String CLASS_NAME_PREFIX_FOR_LONG = "ByteBitPacking512VectorForLong";
+
+  public static void main(String[] args) throws Exception {
+    String basePath = args[0];
+    //TODO: Int for Big Endian
+    //generateScheme(false, true, basePath);
+
+    // Int for Little Endian
+    generateScheme(false, false, basePath);
+
+    //TODO: Long for Big Endian
+    //generateScheme(true, true, basePath);
+
+    //TODO: Long for Little Endian
+    //generateScheme(true, false, basePath);
+  }
+
+  private static void generateScheme(boolean isLong, boolean msbFirst,

Review Comment:
   Not sure whether this answer works in this case: 
https://stackoverflow.com/a/13383092





> Parquet bit-packing de/encode optimization
> --
>
> Key: PARQUET-2159
> URL: https://issues.apache.org/jira/browse/PARQUET-2159
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Fang-Xie
>Assignee: Fang-Xie
>Priority: Major
> Fix For: 1.13.0
>
> Attachments: image-2022-06-15-22-56-08-396.png, 
> image-2022-06-15-22-57-15-964.png, image-2022-06-15-22-58-01-442.png, 
> image-2022-06-15-22-58-40-704.png
>
>
> Spark currently uses parquet-mr as its Parquet reader/writer library, but the 
> built-in bit-packing encode/decode is not efficient enough. 
> Our optimization of Parquet bit-packing encode/decode with jdk.incubator.vector 
> in OpenJDK 18 brings a significant performance improvement.
> Because the Vector API was added to OpenJDK in version 16, this optimization 
> requires JDK 16 or higher.
> *Below are our test results*
> The functional test is based on the open-source parquet-mr bit-pack decoding 
> function: *_public final void unpack8Values(final byte[] in, final int inPos, 
> final int[] out, final int outPos)_*
> compared with our Vector API implementation: *_public final void 
> unpack8Values_vec(final byte[] in, final int inPos, final int[] out, final 
> int outPos)_*
> We tested 10 pairs of decode functions (the open-source Parquet bit unpacking 
> vs. our optimized vectorized SIMD implementation) with bit 
> width={1,2,3,4,5,6,7,8,9,10}; the test results are below:
> !image-2022-06-15-22-56-08-396.png|width=437,height=223!
> We integrated our bit-packing decode implementation into parquet-mr and tested 
> the Parquet batch reader from Spark's VectorizedParquetRecordReader, 
> which reads Parquet column data in batches. We constructed Parquet files 
> with different row counts and column counts. The column data type is Int32 and 
> the maximum int value is 127, which fits bit-pack encoding with bit width=7. 
> The row count ranges from 10k to 100 million and the column count 
> ranges from 1 to 4.
> !image-2022-06-15-22-57-15-964.png|width=453,height=229!
> !image-2022-06-15-22-58-01-442.png|width=439,height=217!
> !image-2022-06-15-22-58-40-704.png|width=415,height=208!
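
For readers unfamiliar with the operation being benchmarked, here is a minimal scalar sketch of unpacking eight bit-packed values. It is not the generated parquet-mr packer and not the Vector API version from the PR. It assumes the least-significant-bit-first layout of Parquet's RLE/bit-packing hybrid, and the class name and the extra `bitWidth` parameter are illustrative only.

```java
// Hypothetical scalar baseline; real parquet-mr packers are generated per bit width.
final class BitUnpackSketch {
  /**
   * Unpacks 8 values of bitWidth bits each (1..32), starting at in[inPos],
   * into out[outPos..outPos+7]. Exactly bitWidth input bytes are consumed.
   */
  static void unpack8Values(int bitWidth, byte[] in, int inPos, int[] out, int outPos) {
    long mask = (1L << bitWidth) - 1;
    long buffer = 0;      // holds bits that have been read but not yet emitted
    int bitsInBuffer = 0;
    int bytePos = inPos;
    for (int i = 0; i < 8; i++) {
      // Refill byte by byte until one whole value is available.
      while (bitsInBuffer < bitWidth) {
        buffer |= (in[bytePos++] & 0xFFL) << bitsInBuffer;
        bitsInBuffer += 8;
      }
      out[outPos + i] = (int) (buffer & mask);
      buffer >>>= bitWidth;
      bitsInBuffer -= bitWidth;
    }
  }
}
```

For example, the three bytes `0x88 0xC6 0xFA` with bit width 3 decode to the values 0 through 7. A vectorized implementation performs the same shifts and masks on many values per instruction.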





[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1011: PARQUET-2159: java17 vector parquet bit-packing decode optimization

2023-02-26 Thread via GitHub


wgtmac commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1118319141


Review Comment:
   Not sure whether this answer works in this case: 
https://stackoverflow.com/a/13383092






[jira] [Commented] (PARQUET-2222) [Format] RLE encoding spec incorrect for v2 data pages

2023-02-26 Thread Xuwei Fu (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17693798#comment-17693798
 ] 

Xuwei Fu commented on PARQUET-2222:
---

I don't understand. Isn't the length part of the encoding in the spec?

Also, it seems that DataPageV2 in parquet-mr is not used in production?

 

> [Format] RLE encoding spec incorrect for v2 data pages
> --
>
> Key: PARQUET-2222
> URL: https://issues.apache.org/jira/browse/PARQUET-2222
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Antoine Pitrou
>Priority: Critical
> Fix For: format-2.10.0
>
>
> The spec 
> (https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)
>  has this:
> {code}
> rle-bit-packed-hybrid: <length> <encoded-data>
> length := length of the <encoded-data> in bytes stored as 4 bytes little 
> endian (unsigned int32)
> {code}
> But the length is actually prepended only in v1 data pages, not in v2 data 
> pages.
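
The difference the issue points out can be sketched as below. This is a rough illustration with hypothetical class and method names, not parquet-mr code. It assumes the RLE data in question is the level data of a data page, and that for v2 pages the byte length comes from the page header (for example the `definition_levels_byte_length` field of `DataPageHeaderV2`) rather than from a prefix.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper names; only the byte layout follows the spec text quoted above.
final class RleLengthPrefixSketch {
  /** v1 data page: a 4-byte little-endian length precedes the encoded data. */
  static ByteBuffer sliceV1(ByteBuffer page) {
    ByteBuffer view = page.duplicate().order(ByteOrder.LITTLE_ENDIAN);
    int length = view.getInt();
    // The spec calls the length unsigned; this sketch rejects values above Integer.MAX_VALUE.
    if (length < 0 || length > view.remaining()) {
      throw new IllegalArgumentException("Invalid RLE length prefix: " + length);
    }
    ByteBuffer encoded = view.slice();
    encoded.limit(length);
    return encoded;
  }

  /** v2 data page: no prefix; the caller supplies the length taken from the page header. */
  static ByteBuffer sliceV2(ByteBuffer page, int lengthFromHeader) {
    if (lengthFromHeader < 0 || lengthFromHeader > page.remaining()) {
      throw new IllegalArgumentException("Invalid length from header: " + lengthFromHeader);
    }
    ByteBuffer encoded = page.duplicate().slice();
    encoded.limit(lengthFromHeader);
    return encoded;
  }
}
```

A reader that applies `sliceV1` to a v2 page would misread the first four bytes of encoded data as a length, which is why the spec wording matters.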





[jira] [Commented] (PARQUET-2202) Redundant String allocation on the hot path in CapacityByteArrayOutputStream.setByte

2023-02-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17693672#comment-17693672
 ] 

ASF GitHub Bot commented on PARQUET-2202:
-

jerolba opened a new pull request, #1035:
URL: https://github.com/apache/parquet-mr/pull/1035

   The [original Jira ticket](https://issues.apache.org/jira/browse/PARQUET-2202) 
refers to a specific misuse of the `Preconditions.checkArgument` method, where a 
String is built before the check is evaluated. This creates overhead even though, 
in theory, the composed String goes unused in nearly 100% of cases.
   
   The solution proposed in the ticket inlines the call to `Preconditions.checkArgument`, 
but the correct approach is to call `checkArgument` with a template String that is 
formatted only if the argument check fails.
   
   A similar issue occurs in the 
[constructor](https://github.com/apache/parquet-mr/blob/62b774cd0f0c60cfbe540bbfa60bee15929af5d4/parquet-common/src/main/java/org/apache/parquet/bytes/CapacityByteArrayOutputStream.java#L153)
 of the `CapacityByteArrayOutputStream` class, where the error message is 
always created with `String.format` instead of passing the template 
String and its params to the `checkArgument` method. This performance issue is 
also visible in profiling.
   
   The PR fixes both cases and reviews the usage of the `Preconditions` class to 
ensure that error messages are not built before the boolean expression is 
checked.
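
   The difference between the two call styles can be sketched as below. The names `LazyMessageSketch`, `checkEager`, `checkLazy` and `build` are hypothetical and only illustrate the pattern; they are not the parquet-mr `Preconditions` API. The `messagesBuilt` counter exists purely to make the cost visible.

```java
// Hypothetical sketch of eager vs. lazy error-message construction.
final class LazyMessageSketch {
  static int messagesBuilt = 0; // counts how many messages were actually formatted

  static String build(String template, Object... args) {
    messagesBuilt++;
    return String.format(template, args);
  }

  // Eager style: the caller has already paid for the message, valid or not.
  static void checkEager(boolean isValid, String message) {
    if (!isValid) {
      throw new IllegalArgumentException(message);
    }
  }

  // Lazy style: the template is formatted only on the failure path.
  static void checkLazy(boolean isValid, String template, Object... args) {
    if (!isValid) {
      throw new IllegalArgumentException(build(template, args));
    }
  }
}
```

   Calling `checkEager(index < size, build("Index %d out of range (size %d)", index, size))` formats the message on every call, while `checkLazy(index < size, "Index %d out of range (size %d)", index, size)` formats it only when the check fails.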
   
   I've also reviewed the `Preconditions` methods to improve performance. When 
you call a method with a varargs parameter, Java allocates an 
array containing all the values. To avoid this allocation when the number of params 
is very low, it is recommended to overload the method with versions that take 
different fixed numbers of arguments. This approach is heavily used in logging 
frameworks and in the [Guava Preconditions implementation](https://github.com/google/guava/blob/4312d949967f3fb245636f66437a00dd8c346d38/guava/src/com/google/common/base/Preconditions.java#L118).
   Because the condition is valid in nearly 100% of cases, the 
array associated with the varargs call to the `Preconditions.strings` method will 
almost never be created.
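
   A minimal sketch of the fixed-arity overload pattern follows. The class name is hypothetical and the signatures are illustrative, not the ones added by the PR. Note that primitive arguments are still boxed at the call site; the overloads only avoid the varargs array.

```java
// Hypothetical sketch of fixed-arity overloads next to a varargs fallback.
final class FixedArityPreconditionsSketch {
  // One argument: no varargs array is allocated at the call site.
  static void checkArgument(boolean isValid, String template, Object arg0) {
    if (!isValid) {
      throw new IllegalArgumentException(String.format(template, arg0));
    }
  }

  // Two arguments: still no array on the success path.
  static void checkArgument(boolean isValid, String template, Object arg0, Object arg1) {
    if (!isValid) {
      throw new IllegalArgumentException(String.format(template, arg0, arg1));
    }
  }

  // Fallback for three or more arguments: the compiler allocates an Object[] per call.
  static void checkArgument(boolean isValid, String template, Object... args) {
    if (!isValid) {
      throw new IllegalArgumentException(String.format(template, args));
    }
  }
}
```

   Java overload resolution prefers an applicable fixed-arity method over a varargs one, so calls with one or two arguments bind to the first two overloads without any change at the call site.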
   
   Because all changes are related to `Preconditions.checkArgument`, I've 
created a single PR. I can split it into multiple PRs if needed. Each commit of 
the PR makes one type of change, but I can squash them into a single commit if needed.
   
   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [x] My PR addresses the following [Parquet 
Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references 
them in the PR title. For example, "PARQUET-2202: My Parquet PR"
 - https://issues.apache.org/jira/browse/PARQUET-2202
 - In case you are adding a dependency, check if the license complies with 
the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Tests
   
   - [x] My PR adds test cases to the following unit tests:
 1. 
https://github.com/jerolba/parquet-mr/blob/review_usage_of_preconditions_checkargument/parquet-common/src/test/java/org/apache/parquet/TestPreconditions.java
   
   ### Commits
   
   - [x] My commits all reference Jira issues in their subject lines. In 
addition, my commits follow the guidelines from "[How to write a good git 
commit message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [x] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain Javadoc that 
explain what it does
   




> Redundant String allocation on the hot path in 
> CapacityByteArrayOutputStream.setByte
> 
>
> Key: PARQUET-2202
> URL: https://issues.apache.org/jira/browse/PARQUET-2202
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.3
>Reporter: Andrei Pangin
>Priority: Major
>  Labels: performance
> Attachments: profile-alloc.png, profile-cpu.png
>
>
> Profiling of a Spark application revealed a performance issue in production:
> {{CapacityByteArrayOutputStream.setByte}} consumed 2.2% of total CPU time and 
> made up 4.6% of total allocations. However, in normal case, this method 
> should allocate nothing at all.
> Here is an excerpt from async-profiler report.
> CPU profile:
> !profile-cpu.

[GitHub] [parquet-mr] jerolba opened a new pull request, #1035: PARQUET-2202: Review usage and implementation of Preconditions.checkargument method

2023-02-26 Thread via GitHub


jerolba opened a new pull request, #1035:
URL: https://github.com/apache/parquet-mr/pull/1035





[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization

2023-02-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17693637#comment-17693637
 ] 

ASF GitHub Bot commented on PARQUET-2159:
-

jiangjiguang commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1118075418



[GitHub] [parquet-mr] jiangjiguang commented on a diff in pull request #1011: PARQUET-2159: java17 vector parquet bit-packing decode optimization

2023-02-26 Thread via GitHub


jiangjiguang commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1118075418


##
parquet-column/src/main/java/org/apache/parquet/column/values/bitpacking/ParquetReadRouter.java:
##
@@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.column.values.bitpacking;
+
+import org.apache.parquet.bytes.ByteBufferInputStream;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.EOFException;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.nio.charset.StandardCharsets;
+import java.nio.file.Files;
+import java.nio.file.Paths;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Set;
+import java.util.stream.Collectors;
+
+/**
+ * Utility class for big data applications (such as Apache Spark and Apache Flink).
+ * For Intel CPU, Flags containing avx512vbmi and avx512_vbmi2 can have better performance gains.
+ */
+public class ParquetReadRouter {
+  private static final Logger LOG = LoggerFactory.getLogger(ParquetReadRouter.class);
+
+  private static final int BITS_PER_BYTE = 8;
+
+  // register of avx512 are 512 bits, and can load up to 64 bytes
+  private static final int BYTES_PER_VECTOR_512 = 64;
+
+  // values are bit packed 8 at a time, so reading bitWidth will always work
+  private static final int NUM_VALUES_TO_PACK = 8;
+
+  private static final VectorSupport vectorSupport;
+
+  static {
+    vectorSupport = getSupportVectorFromCPUFlags();
+  }
+
+  // Dispatches to use vector when available. Directly call readBatchUsing512Vector() if you are sure about it.
+  public static void read(int bitWidth, ByteBufferInputStream in, int currentCount, int[] currentBuffer) throws IOException {
+    switch (vectorSupport) {
+      case VECTOR_512:
+        readBatchUsing512Vector(bitWidth, in, currentCount, currentBuffer);
+        break;
+      default:
+        readBatch(bitWidth, in, currentCount, currentBuffer);
+    }
+  }
+
+  // Call the method directly if your computer system contains avx512vbmi and avx512_vbmi2 CPU Flags
+  public static void readBatchUsing512Vector(int bitWidth, ByteBufferInputStream in, int currentCount, int[] currentBuffer) throws IOException {
+    BytePacker packer = Packer.LITTLE_ENDIAN.newBytePacker(bitWidth);
+    BytePacker packerVector = Packer.LITTLE_ENDIAN.newBytePackerVector(bitWidth);
+    int valueIndex = 0;
+    int byteIndex = 0;
+    int unpackCount = packerVector.getUnpackCount();
+    int inputByteCountPerVector = packerVector.getUnpackCount() / BITS_PER_BYTE * bitWidth;
+    int totalByteCount = currentCount * bitWidth / BITS_PER_BYTE;
+    int totalByteCountVector = totalByteCount - BYTES_PER_VECTOR_512;
+    ByteBuffer buffer = in.slice(totalByteCount);
+    if (buffer.hasArray()) {
+      for (; byteIndex < totalByteCountVector; byteIndex += inputByteCountPerVector, valueIndex += unpackCount) {
+        packerVector.unpackValuesUsingVector(buffer.array(), buffer.arrayOffset() + buffer.position() + byteIndex, currentBuffer, valueIndex);
+      }
+      // If the remaining bytes size <= {BYTES_PER_512VECTOR}, the remaining bytes are unpacked by packer
+      for (; byteIndex < totalByteCount; byteIndex += bitWidth, valueIndex += NUM_VALUES_TO_PACK) {
+        packer.unpack8Values(buffer.array(), buffer.arrayOffset() + buffer.position() + byteIndex, currentBuffer, valueIndex);
+      }
+    } else {
+      for (; byteIndex < totalByteCountVector; byteIndex += inputByteCountPerVector, valueIndex += unpackCount) {
+        packerVector.unpackValuesUsingVector(buffer, buffer.position() + byteIndex, currentBuffer, valueIndex);
+      }
+      for (; byteIndex < totalByteCount; byteIndex += bitWidth, valueIndex += NUM_VALUES_TO_PACK) {
+        packer.unpack8Values(buffer, buffer.position() + byteIndex, currentBuffer, valueIndex);
+      }
+    }
+  }
+
+  // Call the method directly if your computer system doesn't contain avx512vbmi and avx512_vbmi2 CPU Flags
+  public static void readBatch(int bitWidth, ByteBufferInputStream in, int currentCount, int[] currentBuffer) throws EOFException {
+    BytePacker packer = Packer.LITTLE_ENDIA
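
The loop arithmetic in `readBatchUsing512Vector` can be checked in isolation with a small sketch like the one below. It copies only the index calculations from the quoted diff; the class name, the `plan` method and the `unpackCount` value used in the example are assumptions. The 64-byte margin presumably keeps a full-width vector load from reading past the end of the input.

```java
// Hypothetical sketch: counts loop iterations instead of unpacking data.
final class ChunkedUnpackPlanSketch {
  static final int BITS_PER_BYTE = 8;
  static final int BYTES_PER_VECTOR_512 = 64;
  static final int NUM_VALUES_TO_PACK = 8;

  /** Returns {vectorIterations, scalarIterations} for the given batch. */
  static int[] plan(int bitWidth, int valueCount, int unpackCount) {
    int inputByteCountPerVector = unpackCount / BITS_PER_BYTE * bitWidth;
    int totalByteCount = valueCount * bitWidth / BITS_PER_BYTE;
    // The vector loop stops one vector width before the end of the input.
    int totalByteCountVector = totalByteCount - BYTES_PER_VECTOR_512;
    int byteIndex = 0;
    int vectorIterations = 0;
    int scalarIterations = 0;
    for (; byteIndex < totalByteCountVector; byteIndex += inputByteCountPerVector) {
      vectorIterations++;
    }
    // The tail is handled 8 values (bitWidth bytes) at a time.
    for (; byteIndex < totalByteCount; byteIndex += bitWidth) {
      scalarIterations++;
    }
    return new int[] {vectorIterations, scalarIterations};
  }
}
```

With `bitWidth = 7`, 1024 values and an assumed `unpackCount` of 64, this gives 15 vector iterations and 8 scalar iterations, which together cover 15 * 64 + 8 * 8 = 1024 values.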

[GitHub] [parquet-mr] jiangjiguang commented on a diff in pull request #1011: PARQUET-2159: java17 vector parquet bit-packing decode optimization

2023-02-26 Thread via GitHub


jiangjiguang commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1118074325


##
parquet-generator/src/main/java/org/apache/parquet/encoding/vectorbitpacking/BitPackingGenerator512Vector.java:
##
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.encoding.vectorbitpacking;
+
+import java.io.File;
+import java.io.FileOutputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.OutputStream;
+
+/**
+ * This class generates vector bit packers that pack the most significant bit 
first.
+ * The result of the generation is checked in. To regenerate the code run this 
class and check in the result.
+ */

Review Comment:
   I have checked in the generated code instead of relying on the generator.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization

2023-02-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17693636#comment-17693636
 ] 

ASF GitHub Bot commented on PARQUET-2159:
-

jiangjiguang commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1118074325


##
parquet-generator/src/main/java/org/apache/parquet/encoding/vectorbitpacking/BitPackingGenerator512Vector.java:
##
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.encoding.vectorbitpacking;
+
+import java.io.File;
+import java.io.FileOutputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.OutputStream;
+
+/**
+ * This class generates vector bit packers that pack the most significant bit 
first.
+ * The result of the generation is checked in. To regenerate the code run this 
class and check in the result.
+ */

Review Comment:
   I have checked in the generated code instead of relying on the generator.





> Parquet bit-packing de/encode optimization
> --
>
> Key: PARQUET-2159
> URL: https://issues.apache.org/jira/browse/PARQUET-2159
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Fang-Xie
>Assignee: Fang-Xie
>Priority: Major
> Fix For: 1.13.0
>
> Attachments: image-2022-06-15-22-56-08-396.png, 
> image-2022-06-15-22-57-15-964.png, image-2022-06-15-22-58-01-442.png, 
> image-2022-06-15-22-58-40-704.png
>
>
> Spark currently uses parquet-mr as its Parquet reader/writer library, but the 
> built-in bit-packing encode/decode is not efficient enough. 
> Our optimization of Parquet bit-packing encode/decode with jdk.incubator.vector 
> in OpenJDK 18 brings a significant performance improvement.
> Because the Vector API was added to OpenJDK in version 16, this optimization 
> requires JDK 16 or higher.
> *Below are our test results*
> The functional test is based on the open-source parquet-mr bit-pack decoding 
> function: *_public final void unpack8Values(final byte[] in, final int inPos, 
> final int[] out, final int outPos)_*
> compared with our implementation using the Vector API: *_public final void 
> unpack8Values_vec(final byte[] in, final int inPos, final int[] out, final 
> int outPos)_*
> We tested 10 pairs of decode functions (open-source Parquet bit unpacking vs. 
> our optimized vectorized SIMD implementation) with bit 
> width=\{1,2,3,4,5,6,7,8,9,10}; the test results are below:
> !image-2022-06-15-22-56-08-396.png|width=437,height=223!
> We integrated our bit-packing decode implementation into parquet-mr and tested 
> the Parquet batch-reading capability of Spark's VectorizedParquetRecordReader, 
> which reads Parquet column data in batches. We constructed Parquet files 
> with different row and column counts. The column data type is Int32 and the 
> maximum int value is 127, which fits bit-packed encoding with bit width=7. 
> The row count ranges from 10k to 100 million and the column count 
> ranges from 1 to 4.
> !image-2022-06-15-22-57-15-964.png|width=453,height=229!
> !image-2022-06-15-22-58-01-442.png|width=439,height=217!
> !image-2022-06-15-22-58-40-704.png|width=415,height=208!
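
For readers unfamiliar with the operation being optimized, the following is a minimal scalar sketch of what an `unpack8Values`-style routine does: it reads `bitWidth` bytes and expands them into 8 integers. It is not the parquet-mr implementation. The class name, the explicit `bitWidth` parameter, and the least-significant-bit-first ordering (suggested by the little-endian packer naming in the thread) are assumptions made for illustration.

```java
// Illustrative sketch only; not the parquet-mr code.
// Assumes values are packed back to back, least-significant bit first.
final class ScalarBitPackingSketch {

  // Packs 8 values of bitWidth bits each (bitWidth in 1..31) into bitWidth bytes.
  static void pack8Values(int bitWidth, int[] in, int inPos, byte[] out, int outPos) {
    long mask = (1L << bitWidth) - 1;
    long buffer = 0;      // pending bits, lowest bit is the next one to write
    int bitsInBuffer = 0;
    int next = outPos;
    for (int i = 0; i < 8; i++) {
      buffer |= (in[inPos + i] & mask) << bitsInBuffer;
      bitsInBuffer += bitWidth;
      while (bitsInBuffer >= 8) {
        out[next++] = (byte) buffer;
        buffer >>>= 8;
        bitsInBuffer -= 8;
      }
    }
  }

  // Unpacks 8 values of bitWidth bits each (bitWidth in 1..31) from bitWidth bytes.
  static void unpack8Values(int bitWidth, byte[] in, int inPos, int[] out, int outPos) {
    long mask = (1L << bitWidth) - 1;
    long buffer = 0;
    int bitsInBuffer = 0;
    int next = inPos;
    for (int i = 0; i < 8; i++) {
      // Refill one byte at a time until a whole value is available.
      while (bitsInBuffer < bitWidth) {
        buffer |= (in[next++] & 0xFFL) << bitsInBuffer;
        bitsInBuffer += 8;
      }
      out[outPos + i] = (int) (buffer & mask);
      buffer >>>= bitWidth;
      bitsInBuffer -= bitWidth;
    }
  }
}
```

With bit width 7, as in the benchmark described above, 8 values occupy exactly 7 bytes. A vectorized version performs the same expansion for many groups per instruction, which is where the reported speed-up would come from.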



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization

2023-02-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17693635#comment-17693635
 ] 

ASF GitHub Bot commented on PARQUET-2159:
-

jiangjiguang commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1118074070


##
parquet-generator/src/main/java/org/apache/parquet/encoding/vectorbitpacking/BitPackingGenerator512Vector.java:
##
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.encoding.vectorbitpacking;
+
+import java.io.File;
+import java.io.FileOutputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.OutputStream;
+
+/**
+ * This class generates vector bit packers that pack the most significant bit 
first.
+ * The result of the generation is checked in. To regenerate the code run this 
class and check in the result.
+ */
+public class BitPackingGenerator512Vector {
+  private static final String CLASS_NAME_PREFIX_FOR_INT = 
"ByteBitPacking512Vector";
+  private static final String CLASS_NAME_PREFIX_FOR_LONG = 
"ByteBitPacking512VectorForLong";
+
+  public static void main(String[] args) throws Exception {
+String basePath = args[0];
+//TODO: Int for Big Endian
+//generateScheme(false, true, basePath);
+
+// Int for Little Endian
+generateScheme(false, false, basePath);
+
+//TODO: Long for Big Endian
+//generateScheme(true, true, basePath);
+
+//TODO: Long for Little Endian
+//generateScheme(true, false, basePath);
+  }
+
+  private static void generateScheme(boolean isLong, boolean msbFirst,

Review Comment:
   @gszadovszky @wgtmac I agree with you, but I could not work out how to 
exclude a module (parquet-encoding-vector) from the profile by default.
   In fact, the current solution is also simple: the user only needs to go to 
plugins/parquet-encoding-vector and build it with Maven.







[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization

2023-02-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17693634#comment-17693634
 ] 

ASF GitHub Bot commented on PARQUET-2159:
-

jiangjiguang commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1118073030


##
pom.xml:
##
@@ -151,6 +151,9 @@
     <module>parquet-scala</module>
     <module>parquet-thrift</module>
     <module>parquet-hadoop-bundle</module>
+

[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization

2023-02-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17693633#comment-17693633
 ] 

ASF GitHub Bot commented on PARQUET-2159:
-

jiangjiguang commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1118072844


##
plugins/parquet-encoding-vector/pom.xml:
##
@@ -0,0 +1,131 @@
+
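+<project xmlns="http://maven.apache.org/POM/4.0.0"
+ xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+ xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 
http://maven.apache.org/xsd/maven-4.0.0.xsd">
+  <parent>
+    <groupId>org.apache.parquet</groupId>
+    <artifactId>parquet</artifactId>
+    <version>1.13.0-SNAPSHOT</version>
+    <relativePath>../../pom.xml</relativePath>
+  </parent>
+
+  <modelVersion>4.0.0</modelVersion>
+
+  <artifactId>parquet-encoding-vector</artifactId>
+  <packaging>jar</packaging>
+
+  <name>Apache Parquet Encodings Vector</name>
+  <url>https://parquet.apache.org</url>
+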
+  
+17

Review Comment:
   I have removed the property and now use 17, as the parent pom.xml does.
   plugins/parquet-encoding-vector is independent and needs some 
configuration (such as extraJavaVectorArgs) that other modules do not need, 
so I think it is best placed in its own module.







[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization

2023-02-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17693630#comment-17693630
 ] 

ASF GitHub Bot commented on PARQUET-2159:
-

jiangjiguang commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1118070969


##
parquet-benchmarks/src/main/java/org/apache/parquet/benchmarks/ByteBitPackingVectorBenchmarks.java:
##
@@ -0,0 +1,92 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.benchmarks;
+
+import org.apache.parquet.column.values.bitpacking.BytePacker;
+import org.apache.parquet.column.values.bitpacking.Packer;
+import org.openjdk.jmh.annotations.Benchmark;
+import org.openjdk.jmh.annotations.BenchmarkMode;
+import org.openjdk.jmh.annotations.Level;
+import org.openjdk.jmh.annotations.Measurement;
+import org.openjdk.jmh.annotations.Mode;
+import org.openjdk.jmh.annotations.OutputTimeUnit;
+import org.openjdk.jmh.annotations.Scope;
+import org.openjdk.jmh.annotations.Setup;
+import org.openjdk.jmh.annotations.State;
+import org.openjdk.jmh.annotations.Warmup;
+
+import java.util.concurrent.TimeUnit;
+
+/**
+ * This class uses the java17 vector API, add VM options 
--add-modules=jdk.incubator.vector
+ */
+
+@State(Scope.Benchmark)
+@BenchmarkMode(Mode.AverageTime)
+@Warmup(iterations = 1, batchSize = 10)
+@Measurement(iterations = 1, batchSize = 10)
+@OutputTimeUnit(TimeUnit.MILLISECONDS)
+public class ByteBitPackingVectorBenchmarks {

Review Comment:
   Yes, it will fail, so I created a new module, 
plugins/parquet-plugins-benchmarks, and moved the class into it.







[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization

2023-02-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17693628#comment-17693628
 ] 

ASF GitHub Bot commented on PARQUET-2159:
-

jiangjiguang commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1118070147


##
parquet-encoding/src/main/java/org/apache/parquet/column/values/bitpacking/Packer.java:
##
@@ -86,6 +105,7 @@ private static Object getStaticField(String className, 
String fieldName) {
   static IntPackerFactory leIntPackerFactory = 
getIntPackerFactory("LemireBitPackingLE");
   static BytePackerFactory beBytePackerFactory = 
getBytePackerFactory("ByteBitPackingBE");
   static BytePackerFactory leBytePackerFactory = 
getBytePackerFactory("ByteBitPackingLE");
+  static BytePackerFactory leBytePacker512VectorFactory = null;

Review Comment:
   I have added some comments
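
The diff above shows a factory field that starts out as null next to a reflective `getStaticField(String className, String fieldName)` helper. One plausible reading is that the vector factory is looked up by name only when the optional module is on the classpath. The sketch below illustrates that general pattern under that assumption; `OptionalFactoryLoader` and both method names are hypothetical and do not appear in the PR.

```java
// Hypothetical sketch of optional, reflection-based loading; not the Packer.java code.
final class OptionalFactoryLoader {

  // Returns the value of a public static field, or null when the class or field
  // cannot be loaded (for example, because an optional module is not on the classpath).
  static Object getStaticFieldOrNull(String className, String fieldName) {
    try {
      return Class.forName(className).getField(fieldName).get(null);
    } catch (ReflectiveOperationException | LinkageError e) {
      return null;
    }
  }

  // Returns the optional implementation when it is available, otherwise the fallback.
  static Object loadOrFallback(String className, String fieldName, Object fallback) {
    Object loaded = getStaticFieldOrNull(className, fieldName);
    return loaded != null ? loaded : fallback;
  }
}
```

The benefit of this shape is that code compiled for an older Java target never links against classes that need a newer runtime; a missing class simply results in the scalar fallback being used.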






[GitHub] [parquet-mr] jiangjiguang commented on a diff in pull request #1011: PARQUET-2159: java17 vector parquet bit-packing decode optimization

2023-02-26 Thread via GitHub


jiangjiguang commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1118070147


##
parquet-encoding/src/main/java/org/apache/parquet/column/values/bitpacking/Packer.java:
##
@@ -86,6 +105,7 @@ private static Object getStaticField(String className, String fieldName) {
   static IntPackerFactory leIntPackerFactory = getIntPackerFactory("LemireBitPackingLE");
   static BytePackerFactory beBytePackerFactory = getBytePackerFactory("ByteBitPackingBE");
   static BytePackerFactory leBytePackerFactory = getBytePackerFactory("ByteBitPackingLE");
+  static BytePackerFactory leBytePacker512VectorFactory = null;

Review Comment:
   I have added some comments






[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization

2023-02-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17693626#comment-17693626
 ] 

ASF GitHub Bot commented on PARQUET-2159:
-

jiangjiguang commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1118069968


##
README.md:
##
@@ -83,6 +83,16 @@ Parquet is a very active project, and new features are being added quickly. Here
 * Column stats
 * Delta encoding
 * Index pages
+* Java Vector API support
+
+## Java Vector API support
+Parquet-MR has supported Java Vector API to speed up reading, to enable the function:
+* Java 17+, 64-bit
+* For Intel CPU, Flags containing avx512vbmi and avx512_vbmi2 can have better performance gains(ICE Lake or newer processor).
+* mvn clean package -P java17-target -P vector

Review Comment:
   I have updated the document
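
The README requirements quoted above (Java 17+, 64-bit) can be probed at runtime. The following is only a minimal sketch under stated assumptions: the class `VectorSupportCheck` and its method names are hypothetical and not part of parquet-mr, the 64-bit test is a heuristic on the `os.arch` property, and the CPU flags (avx512vbmi, avx512_vbmi2) are not checked at all. Since `jdk.incubator.vector` is an incubator module, it is normally only resolved when the JVM is started with `--add-modules jdk.incubator.vector`.

```java
/**
 * Hypothetical helper (not parquet-mr code) that checks the
 * preconditions listed in the README hunk above.
 */
final class VectorSupportCheck {

    /** True when the running JVM reports feature version 17 or newer. */
    static boolean javaAtLeast17() {
        return Runtime.version().feature() >= 17;
    }

    /** Heuristic: treats any os.arch value containing "64" as 64-bit. */
    static boolean looks64Bit() {
        return System.getProperty("os.arch", "").contains("64");
    }

    /** True when the incubator module was resolved into the boot layer. */
    static boolean vectorModuleResolved() {
        return ModuleLayer.boot().findModule("jdk.incubator.vector").isPresent();
    }

    /** Combines the three checks; CPU feature flags are NOT inspected. */
    static boolean canUseVectorDecoder() {
        return javaAtLeast17() && looks64Bit() && vectorModuleResolved();
    }

    public static void main(String[] args) {
        System.out.println("Java 17+:            " + javaAtLeast17());
        System.out.println("64-bit (heuristic):  " + looks64Bit());
        System.out.println("vector module found: " + vectorModuleResolved());
        System.out.println("vector decoder OK:   " + canUseVectorDecoder());
    }
}
```

A caller could fall back to the scalar packers whenever `canUseVectorDecoder()` returns false; whether the PR selects its factory this way is not stated in this thread.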





> Parquet bit-packing de/encode optimization
> --
>
> Key: PARQUET-2159
> URL: https://issues.apache.org/jira/browse/PARQUET-2159
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Fang-Xie
>Assignee: Fang-Xie
>Priority: Major
> Fix For: 1.13.0
>
> Attachments: image-2022-06-15-22-56-08-396.png, 
> image-2022-06-15-22-57-15-964.png, image-2022-06-15-22-58-01-442.png, 
> image-2022-06-15-22-58-40-704.png
>
>
> Spark currently uses parquet-mr as its Parquet reader/writer library, but the 
> built-in bit-packing encode/decode is not efficient enough. 
> Our optimization of Parquet bit-packing encode/decode with jdk.incubator.vector 
> in OpenJDK 18 brings a significant performance improvement.
> Because the Vector API has been available in OpenJDK since version 16, this 
> optimization requires JDK 16 or higher.
> *Below are our test results*
> The functional test is based on the open-source parquet-mr bit-pack decoding 
> function: *_public final void unpack8Values(final byte[] in, final int inPos, 
> final int[] out, final int outPos)_*
> compared with our Vector API implementation: *_public final void 
> unpack8Values_vec(final byte[] in, final int inPos, final int[] out, final 
> int outPos)_*
> We tested 10 pairs of decode functions (the open-source parquet bit unpacking 
> vs. our optimized vectorized SIMD implementation) with bit 
> width=\{1,2,3,4,5,6,7,8,9,10}. The test results are below:
> !image-2022-06-15-22-56-08-396.png|width=437,height=223!
> We integrated our bit-packing decode implementation into parquet-mr and tested 
> the parquet batch reading path of Spark's VectorizedParquetRecordReader, 
> which fetches parquet column data in batches. We constructed parquet files 
> with different row and column counts. The column data type is Int32 and the 
> maximum int value is 127, which fits bit-pack encoding with bit width=7. 
> The row count ranges from 10k to 100 million and the column count 
> ranges from 1 to 4.
> !image-2022-06-15-22-57-15-964.png|width=453,height=229!
> !image-2022-06-15-22-58-01-442.png|width=439,height=217!
> !image-2022-06-15-22-58-40-704.png|width=415,height=208!
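
For readers unfamiliar with what `unpack8Values` does, here is a minimal scalar sketch of little-endian bit-unpacking for an arbitrary bit width, with a matching pack routine for round-trip checks. It is an illustration only: the class `BitUnpackSketch` is hypothetical, the extra `bitWidth` parameter is an assumption made to keep the sketch generic, and this loop is neither the generated parquet-mr packer code nor the Vector API implementation from the PR.

```java
import java.util.Arrays;

/** Generic scalar sketch of LSB-first bit-packing; not parquet-mr code. */
final class BitUnpackSketch {

    /** Packs 8 ints of the given bit width (0..32) into bitWidth bytes. */
    static byte[] pack8Values(int bitWidth, int[] in, int inPos) {
        byte[] out = new byte[bitWidth];          // 8 values * bitWidth bits
        long mask = (1L << bitWidth) - 1;
        for (int i = 0; i < 8; i++) {
            int bitPos = i * bitWidth;            // bit offset of value i
            int byteIdx = bitPos >>> 3;
            long shifted = (in[inPos + i] & mask) << (bitPos & 7);
            // Spread the shifted value over as many bytes as it touches.
            for (int b = 0; shifted != 0; b++, shifted >>>= 8) {
                out[byteIdx + b] |= (byte) shifted;
            }
        }
        return out;
    }

    /** Unpacks 8 ints of the given bit width (0..32) from in[inPos..]. */
    static void unpack8Values(int bitWidth, byte[] in, int inPos,
                              int[] out, int outPos) {
        long mask = (1L << bitWidth) - 1;
        for (int i = 0; i < 8; i++) {
            int bitPos = i * bitWidth;
            int shift = bitPos & 7;               // offset inside first byte
            int first = inPos + (bitPos >>> 3);
            int byteCount = (shift + bitWidth + 7) >>> 3;
            long window = 0;
            // Gather the bytes that hold value i into one little-endian word.
            for (int b = 0; b < byteCount; b++) {
                window |= (in[first + b] & 0xFFL) << (8 * b);
            }
            out[outPos + i] = (int) ((window >>> shift) & mask);
        }
    }

    public static void main(String[] args) {
        // Values up to 127 fit bit width 7, as in the benchmark description.
        int[] values = {0, 1, 5, 42, 64, 100, 126, 127};
        byte[] packed = pack8Values(7, values, 0);
        int[] decoded = new int[8];
        unpack8Values(7, packed, 0, decoded, 0);
        System.out.println(packed.length + " bytes -> " + Arrays.toString(decoded));
    }
}
```

At bit width 7, eight Int32 values occupy 7 bytes instead of 32. A vectorized decoder performs the same shift-and-mask work on many values per instruction, which is where the reported speedup would come from.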





[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization

2023-02-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17693627#comment-17693627
 ] 

ASF GitHub Bot commented on PARQUET-2159:
-

jiangjiguang commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1118070050


##
plugins/parquet-encoding-vector/src/main/java/org/apache/parquet/column/values/bitpacking/ByteBitPacking512VectorLE.java:
##
@@ -0,0 +1,3010 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.parquet.column.values.bitpacking;
+
+import jdk.incubator.vector.ByteVector;
+import jdk.incubator.vector.IntVector;
+import jdk.incubator.vector.LongVector;
+import jdk.incubator.vector.ShortVector;
+import jdk.incubator.vector.Vector;
+import jdk.incubator.vector.VectorMask;
+import jdk.incubator.vector.VectorOperators;
+import jdk.incubator.vector.VectorShuffle;
+import jdk.incubator.vector.VectorSpecies;
+
+import java.nio.ByteBuffer;
+
+/**
+ *

Review Comment:
   I have added comments









