[jira] [Commented] (PARQUET-2075) Unified Rewriter Tool
[ https://issues.apache.org/jira/browse/PARQUET-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17646980#comment-17646980 ]

ASF GitHub Bot commented on PARQUET-2075:
-----------------------------------------

wgtmac opened a new pull request, #1014:
URL: https://github.com/apache/parquet-mr/pull/1014

### Jira
- This patch aims to solve the first step of [PARQUET-2075](https://issues.apache.org/jira/browse/PARQUET-2075).

### Tests
- Make sure all tests pass, especially CompressionConverterTest and ColumnPrunerTest.

### Commits
- A new ParquetRewriter is introduced to unify the rewriting logic.
- RewriteOptions is defined to provide the essential settings.
- CompressionConverter and ColumnPruner have been refactored and marked as deprecated.

> Unified Rewriter Tool
> ---------------------
>
>                 Key: PARQUET-2075
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2075
>             Project: Parquet
>          Issue Type: New Feature
>            Reporter: Xinli Shang
>            Assignee: Gang Wu
>            Priority: Major
>
> During the discussion of PARQUET-2071, we came up with the idea of a
> universal tool that translates an existing file to a different state while
> skipping some low-level steps like encoding/decoding, to gain speed. For
> example, only decompress pages and then compress them directly; for
> PARQUET-2071, only decrypt and then encrypt directly. This will be useful
> for onboarding existing data to Parquet features like column encryption,
> zstd, etc.
> We already have tools like trans-compression and column pruning; we will
> consolidate all of these into this universal tool.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
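The "skip the low-level steps, just recompress" idea behind the rewriter can be sketched at the byte level. This is a minimal illustration using java.util.zip as a stand-in codec, not the ParquetRewriter API; `recompressPage` is a name made up for this sketch. The point is that the page's value encoding is never decoded, only its compression changes:

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Byte-level sketch of trans-compression: decompress a page's stored bytes
// and recompress them, never decoding the values inside. java.util.zip
// stands in for Parquet's real codecs; recompressPage is hypothetical.
public class TransCompressSketch {

  static byte[] deflate(byte[] raw, int level) {
    Deflater d = new Deflater(level);
    d.setInput(raw);
    d.finish();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[4096];
    while (!d.finished()) {
      out.write(buf, 0, d.deflate(buf));
    }
    d.end();
    return out.toByteArray();
  }

  static byte[] inflate(byte[] compressed) {
    Inflater i = new Inflater();
    i.setInput(compressed);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[4096];
    try {
      while (!i.finished()) {
        out.write(buf, 0, i.inflate(buf));
      }
    } catch (DataFormatException e) {
      throw new RuntimeException(e);
    }
    i.end();
    return out.toByteArray();
  }

  // "Rewrite" one page: bytes in, bytes out; the value encoding is untouched.
  static byte[] recompressPage(byte[] page) {
    return deflate(inflate(page), Deflater.BEST_COMPRESSION);
  }

  public static void main(String[] args) {
    byte[] encoded = "pretend these are encoded page bytes".getBytes();
    byte[] page = deflate(encoded, Deflater.BEST_SPEED);
    // Round-tripping through the rewrite preserves the encoded bytes exactly.
    byte[] rewritten = recompressPage(page);
    if (!java.util.Arrays.equals(inflate(rewritten), encoded)) {
      throw new AssertionError("content changed");
    }
    System.out.println("ok");
  }
}
```

The same shape applies to the PARQUET-2071 case: swap the codec pair for a decrypt/encrypt pair and the encoded values still pass through untouched.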
Re: parquet checksum coverage
> > i think there's a good case for turning it on as (a) there are lots of
> > other filesystems out there, including NTFS on windows laptops, *and*
> > there's the risk of corruption of data in flight from the hdfs data node
> > processes, where the CRC checks take place, to the actual reader code.

Yep, agreed, it seems reasonable; like I said, I'm not sure about the actual
performance implications.

On Fri, Dec 2, 2022 at 8:22 AM Steve Loughran wrote:

> thanks for that, especially the link
>
> as the docs say
>   * If enabled, this allows for disabling checksumming in HDFS if only a few
>   * pages need to be read.
>
> i think there's a good case for turning it on as (a) there are lots of
> other filesystems out there, including NTFS on windows laptops, *and*
> there's the risk of corruption of data in flight from the hdfs data node
> processes, where the CRC checks take place, to the actual reader code.
>
> On Thu, 1 Dec 2022 at 05:53, Micah Kornfield wrote:
>
> > Hi Steve,
> >
> > > 1. What data in a parquet file is covered by CRC checks, and are there
> > > any blocks of data (footers, summaries etc) which aren't checksummed?
> >
> > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L642
> > I think has the best summary. My understanding is that the thrift
> > metadata is not itself checksummed.
> >
> > > 2. I see that verification as set by
> > > "parquet.page.verify-checksum.enabled" is false by default. Why isn't
> > > it on? Is there a significant performance hit?
> >
> > Sorry, I don't know the answer to this.
> >
> > On Mon, Nov 14, 2022 at 3:39 AM Steve Loughran wrote:
> >
> > > hi
> > >
> > > I am busy dealing with a bug where the Azure abfs connector can get the
> > > prefetch data blocks of one thread/task overwritten by those of another
> > > task whose input stream was closed while a prefetch was in progress.
> > > https://issues.apache.org/jira/browse/HADOOP-18521
> > >
> > > I have not been able to trigger any failures reading parquet data,
> > > presumably because its seek-heavy read patterns don't benefit much from
> > > prefetching.
> > >
> > > Parquet also stores CRC checksums of pages of data written, which I
> > > need a bit of help understanding.
> > >
> > > 1. What data in a parquet file is covered by CRC checks, and are there
> > > any blocks of data (footers, summaries etc) which aren't checksummed?
> > > 2. I see that verification as set by
> > > "parquet.page.verify-checksum.enabled" is false by default. Why isn't
> > > it on? Is there a significant performance hit?
> > >
> > > Thanks
> > >
> > > steve
[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization
[ https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17646943#comment-17646943 ]

ASF GitHub Bot commented on PARQUET-2159:
-----------------------------------------

jatin-bhateja commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1048036480

## parquet-encoding/src/main/java/org/apache/parquet/column/values/bitpacking/BytePacker.java:

@@ -105,4 +116,16 @@ public void unpack8Values(final byte[] input, final int inPos, final int[] outpu
   public void unpack32Values(byte[] input, int inPos, int[] output, int outPos) {
     unpack32Values(ByteBuffer.wrap(input), inPos, output, outPos);
   }
+
+  public void unpackValuesVector(final byte[] input, final int inPos, final int[] output, final int outPos) {

Review Comment:
  Please add appropriate comments over the newly added definitions, like the ones on top of the scalar routines.

> Parquet bit-packing de/encode optimization
> ------------------------------------------
>
>                 Key: PARQUET-2159
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2159
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.13.0
>            Reporter: Fang-Xie
>            Assignee: Fang-Xie
>            Priority: Major
>             Fix For: 1.13.0
>
>         Attachments: image-2022-06-15-22-56-08-396.png,
> image-2022-06-15-22-57-15-964.png, image-2022-06-15-22-58-01-442.png,
> image-2022-06-15-22-58-40-704.png
>
>
> Spark currently uses parquet-mr as its parquet reader/writer library, but
> the built-in bit-packing en/decoding is not efficient enough.
> Our optimization of Parquet bit-packing en/decoding with
> jdk.incubator.vector in OpenJDK 18 brings a prominent performance
> improvement.
> Because the Vector API was added to OpenJDK in 16, this optimization
> requires JDK 16 or higher.
> *Below are our test results*
> The functional test is based on the open-source parquet-mr bit-pack
> decoding function *_public final void unpack8Values(final byte[] in, final
> int inPos, final int[] out, final int outPos)_*
> compared with our Vector API implementation *_public final void
> unpack8Values_vec(final byte[] in, final int inPos, final int[] out, final
> int outPos)_*
> We tested 10 pairs of decode functions (open-source parquet bit unpacking
> vs our optimized vectorized SIMD implementation) with bit
> width=\{1,2,3,4,5,6,7,8,9,10}; below are the test results:
> !image-2022-06-15-22-56-08-396.png|width=437,height=223!
> We integrated our bit-packing decode implementation into parquet-mr and
> tested the parquet batch-reading ability via Spark's
> VectorizedParquetRecordReader, which reads parquet column data in batches.
> We constructed parquet files with different row and column counts; the
> column data type is Int32, the maximum int value is 127 (which satisfies
> bit-pack encoding with bit width=7), the row count ranges from 10k to 100
> million, and the column count from 1 to 4.
> !image-2022-06-15-22-57-15-964.png|width=453,height=229!
> !image-2022-06-15-22-58-01-442.png|width=439,height=217!
> !image-2022-06-15-22-58-40-704.png|width=415,height=208!

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
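The unpack8Values contract described above (8 values of bitWidth bits each, packed back to back) can be illustrated with a plain scalar routine. This is a minimal sketch assuming Parquet's little-endian bit order (least-significant bit of the first byte first); it is not the generated parquet-mr code, which unrolls each bit width into straight-line shifts and masks:

```java
// Scalar reference for what unpack8Values does for a given bit width: read
// 8 values of `bitWidth` bits each, packed little-endian (LSB of the first
// byte first). A simplified illustration, not the generated parquet-mr code.
public class BitUnpackSketch {

  static void unpack8Values(byte[] in, int inPos, int[] out, int outPos, int bitWidth) {
    int bit = 0; // bit offset from in[inPos], counted LSB-first
    for (int i = 0; i < 8; i++) {
      int value = 0;
      for (int b = 0; b < bitWidth; b++, bit++) {
        int byteVal = in[inPos + (bit >>> 3)] & 0xFF;
        value |= ((byteVal >>> (bit & 7)) & 1) << b;
      }
      out[outPos + i] = value;
    }
  }

  public static void main(String[] args) {
    int[] out = new int[8];

    // bitWidth=1: one byte holds 8 values, LSB first.
    unpack8Values(new byte[] {(byte) 0b10110100}, 0, out, 0, 1);
    System.out.println(java.util.Arrays.toString(out)); // [0, 0, 1, 0, 1, 1, 0, 1]

    // bitWidth=8: each value is just the unsigned byte.
    unpack8Values(new byte[] {127, 0, 1, 2, 3, 4, 5, (byte) 200}, 0, out, 0, 8);
    System.out.println(java.util.Arrays.toString(out)); // [127, 0, 1, 2, 3, 4, 5, 200]
  }
}
```

The inner bit loop is exactly what the SIMD version replaces: with jdk.incubator.vector, many of these shift/mask lanes are processed per instruction instead of one bit at a time.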
[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization
[ https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17646941#comment-17646941 ]

ASF GitHub Bot commented on PARQUET-2159:
-----------------------------------------

jatin-bhateja commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1048036480

## parquet-encoding/src/main/java/org/apache/parquet/column/values/bitpacking/BytePacker.java:

@@ -105,4 +116,16 @@ public void unpack8Values(final byte[] input, final int inPos, final int[] outpu
   public void unpack32Values(byte[] input, int inPos, int[] output, int outPos) {
     unpack32Values(ByteBuffer.wrap(input), inPos, output, outPos);
   }
+
+  public void unpackValuesVector(final byte[] input, final int inPos, final int[] output, final int outPos) {

Review Comment:
  Please add appropriate comments over the newly added definitions, like the ones on top of the scalar routines.

## parquet-generator/src/main/resources/ByteBitPackingVectorLE:

@@ -0,0 +1,3218 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.parquet.column.values.bitpacking;
+
+import jdk.incubator.vector.*;
+
+import java.nio.ByteBuffer;
+
+/**
+ * This is an auto-generated source file and should not be edited directly.
+ */
+public abstract class ByteBitPackingVectorLE {
+  private static final BytePacker[] packers = new BytePacker[33];
+
+  static {
+    packers[0] = new Packer0();
+    packers[1] = new Packer1();
+    packers[2] = new Packer2();
+    packers[3] = new Packer3();
+    packers[4] = new Packer4();
+    packers[5] = new Packer5();
+    packers[6] = new Packer6();
+    packers[7] = new Packer7();
+    packers[8] = new Packer8();
+    packers[9] = new Packer9();
+    packers[10] = new Packer10();
+    packers[11] = new Packer11();
+    packers[12] = new Packer12();
+    packers[13] = new Packer13();
+    packers[14] = new Packer14();
+    packers[15] = new Packer15();
+    packers[16] = new Packer16();
+    packers[17] = new Packer17();
+    packers[18] = new Packer18();
+    packers[19] = new Packer19();
+    packers[20] = new Packer20();
+    packers[21] = new Packer21();
+    packers[22] = new Packer22();
+    packers[23] = new Packer23();
+    packers[24] = new Packer24();
+    packers[25] = new Packer25();
+    packers[26] = new Packer26();
+    packers[27] = new Packer27();
+    packers[28] = new Packer28();
+    packers[29] = new Packer29();
+    packers[30] = new Packer30();
+    packers[31] = new Packer31();
+    packers[32] = new Packer32();
+  }
+
+  public static final BytePackerFactory factory = new BytePackerFactory() {
+    public BytePacker newBytePacker(int bitWidth) {
+      return packers[bitWidth];
+    }
+  };
+
+  private static final class Packer0 extends BytePacker {
+    private int unpackCount = 0;
+
+    private Packer0() {
+      super(0);
+    }
+
+    public int getUnpackCount() {
+      return unpackCount;
+    }
+
+    public final void pack8Values(final int[] in, final int inPos, final byte[] out, final int outPos) {
+    }
+
+    public final void pack32Values(final int[] in, final int inPos, final byte[] out, final int outPos) {
+    }
+
+    public final void unpack8Values(final byte[] in, final int inPos, final int[] out, final int outPos) {
+    }
+
+    public final void unpack8Values(final ByteBuffer in, final int inPos, final int[] out, final int outPos) {
+    }
+
+    public final void unpack32Values(final byte[] in, final int inPos, final int[] out, final int outPos) {
+    }
+
+    public final void unpack32Values(final ByteBuffer in, final int inPos, final int[] out, final int outPos) {
+    }
+
+    public final void unpackValuesVector(final byte[] input, final int inPos, final int[] output, final int outPos) {
+    }
+
+    public final void unpackValuesVector(final ByteBuffer input, final int inPos, final int[] output, final int outPos) {

Review Comment:
  We also need to ensure that we invoke both the scalar generators and the vector generators for the jdk17 profile.

## parquet-benchmarks/src/main/java/org/apache/parquet/benchmarks/ByteBitPackingVectorBenchmarks.java:
[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization
[ https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17646899#comment-17646899 ]

ASF GitHub Bot commented on PARQUET-2159:
-----------------------------------------

jiangjiguang commented on PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#issuecomment-1350287549

> This work looks promising! It would be great if you can add some micro-benchmark to parquet-benchmarks.

@wgtmac I have added the micro-benchmark to parquet-benchmarks; this is the result:
![image](https://user-images.githubusercontent.com/12368495/207491959-a1a22134-98fd-45f6-aa08-1934584e0fbb.png)

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Commented] (PARQUET-2218) [Format] Clarify CRC computation
[ https://issues.apache.org/jira/browse/PARQUET-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17646685#comment-17646685 ]

ASF GitHub Bot commented on PARQUET-2218:
-----------------------------------------

mapleFU commented on PR #188:
URL: https://github.com/apache/parquet-format/pull/188#issuecomment-1348743274

The change looks good to me! Thanks a lot!

> [Format] Clarify CRC computation
> --------------------------------
>
>                 Key: PARQUET-2218
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2218
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>            Reporter: Antoine Pitrou
>            Assignee: Antoine Pitrou
>            Priority: Minor
>             Fix For: format-2.10.0
>
>
> The format spec on CRC checksumming felt ambiguous when trying to implement
> it in Parquet C++, so we should make the wording clearer.
> (see discussion on
> https://github.com/apache/parquet-format/pull/126#issuecomment-1348081137 and
> below)

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Commented] (PARQUET-1539) Clarify CRC checksum in page header
[ https://issues.apache.org/jira/browse/PARQUET-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17646675#comment-17646675 ]

ASF GitHub Bot commented on PARQUET-1539:
-----------------------------------------

pitrou commented on PR #126:
URL: https://github.com/apache/parquet-format/pull/126#issuecomment-1348718160

I opened https://github.com/apache/parquet-format/pull/188 to clarify the wording.

> Clarify CRC checksum in page header
> -----------------------------------
>
>                 Key: PARQUET-1539
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1539
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>            Reporter: Boudewijn Braams
>            Assignee: Boudewijn Braams
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: format-2.7.0
>
>
> Although a page-level CRC field is defined in the Thrift specification,
> currently neither parquet-cpp nor parquet-mr leverages it. Moreover, the
> [comment|https://github.com/apache/parquet-format/blob/2b38663a28ccd4156319c0bf7ae4e6280e0c6e2d/src/main/thrift/parquet.thrift#L607]
> in the Thrift specification reads ‘32bit crc for the data below’, which is
> somewhat ambiguous as to what exactly constitutes the ‘data’ that the
> checksum should be calculated on. To ensure backward- and
> cross-compatibility of Parquet readers/writers which do want to leverage
> the CRC checksums, the format should specify exactly how and on what data
> the checksum should be calculated.
> h2. Alternatives
> There are three main choices to be made here:
> # Which variant of CRC32 to use
> # Whether to include the page header itself in the checksum calculation
> # Whether to calculate the checksum on uncompressed or compressed data
> h3. Algorithm
> The CRC field holds a 32-bit value. There are many different variants of
> the original CRC32 algorithm, each producing different values for the same
> input. For ease of implementation we propose to use the standard CRC32
> algorithm.
> h3. Including the page header
> The page header itself could be included in the checksum calculation using
> an approach similar to what TCP does, whereby the checksum field itself is
> zeroed out before calculating the checksum that will be inserted there.
> Evidently, including the page header is better in the sense that it
> increases the data covered by the checksum. However, from an implementation
> perspective, not including it is likely easier. Furthermore, given the
> relatively small size of the page header compared to the page itself,
> simply not including it will likely be good enough.
> h3. Compressed vs uncompressed
> *Compressed*
> Pros
> * Inherently faster, less data to operate on
> * Potentially better triaging when determining where a corruption may have
> been introduced, as the checksum is calculated at a later stage
> Cons
> * We have to trust both the encoding stage and the compression stage
> *Uncompressed*
> Pros
> * We only have to trust the encoding stage
> * Possibly able to detect more corruptions, as data is checksummed at the
> earliest possible moment; the checksum will be more sensitive to corruption
> introduced further down the line
> Cons
> * Inherently slower, more data to operate on; we always need to decompress
> first
> * Potentially harder triaging, more stages in which the corruption could
> have been introduced
> h2. Proposal
> The checksum will be calculated using the *standard CRC32 algorithm*,
> whereby the checksum is calculated on the *data only, not including the
> page header* itself (simple implementation), and the checksum will be
> calculated on *compressed data* (inherently faster, likely better
> triaging).

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
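The proposal quoted above maps directly onto java.util.zip.CRC32, which implements the standard CRC32 variant: checksum the compressed page bytes only, leaving the page header out. A minimal sketch; `pageCrc` is a hypothetical helper name, not a parquet-mr API:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Sketch of the proposed page-header `crc` field: standard CRC32 computed
// over the compressed page bytes only (the page header itself is excluded).
public class PageCrcSketch {

  static long pageCrc(byte[] compressedPageBytes) {
    CRC32 crc = new CRC32();
    crc.update(compressedPageBytes, 0, compressedPageBytes.length);
    return crc.getValue(); // unsigned 32-bit result carried in a long
  }

  public static void main(String[] args) {
    // "123456789" is the classic check vector for standard CRC32: 0xCBF43926.
    long crc = pageCrc("123456789".getBytes(StandardCharsets.US_ASCII));
    System.out.println(Long.toHexString(crc)); // cbf43926
  }
}
```

Getting this check vector (0xCBF43926) is a quick way for an implementation to confirm it picked the standard CRC32 rather than one of the other variants (CRC32C, etc.) mentioned in the discussion.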
[jira] [Commented] (PARQUET-2218) [Format] Clarify CRC computation
[ https://issues.apache.org/jira/browse/PARQUET-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17646674#comment-17646674 ]

ASF GitHub Bot commented on PARQUET-2218:
-----------------------------------------

pitrou commented on PR #188:
URL: https://github.com/apache/parquet-format/pull/188#issuecomment-1348714880

@bbraams @gszadovszky @mapleFU thoughts?

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Commented] (PARQUET-2218) [Format] Clarify CRC computation
[ https://issues.apache.org/jira/browse/PARQUET-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17646672#comment-17646672 ]

ASF GitHub Bot commented on PARQUET-2218:
-----------------------------------------

pitrou opened a new pull request, #188:
URL: https://github.com/apache/parquet-format/pull/188

When trying to implement CRC computation in Parquet C++, we found the wording
to be ambiguous. Clarify that CRC computation happens on the exact binary
serialization (instead of a long-winded and confusing elaboration about v1 and
v2 data page layouts). Also clarify that CRC computation can apply to all page
kinds, not only data pages (for reference, parquet-mr currently supports
checksumming v1 data pages as well as dictionary pages).

Also, see discussion on
https://github.com/apache/parquet-format/pull/126#issuecomment-1348081137 and
below.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Updated] (PARQUET-2218) [Format] Clarify CRC computation
[ https://issues.apache.org/jira/browse/PARQUET-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated PARQUET-2218: Description: The format spec on CRC checksumming felt ambiguous when trying to implement it in Parquet C++, so we should make the wording clearer. (see discussion on https://github.com/apache/parquet-format/pull/126#issuecomment-1348081137 and below) was: The format spec on CRC checksumming felt ambiguous when trying to implement it in Parquet C++, so we should make the wording clearer.
[jira] [Created] (PARQUET-2218) [Format] Clarify CRC computation
Antoine Pitrou created PARQUET-2218: --- Summary: [Format] Clarify CRC computation Key: PARQUET-2218 URL: https://issues.apache.org/jira/browse/PARQUET-2218 Project: Parquet Issue Type: Improvement Components: parquet-format Reporter: Antoine Pitrou Assignee: Antoine Pitrou Fix For: format-2.10.0 The format spec on CRC checksumming felt ambiguous when trying to implement it in Parquet C++, so we should make the wording clearer.
[jira] [Commented] (PARQUET-1539) Clarify CRC checksum in page header
[ https://issues.apache.org/jira/browse/PARQUET-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17646656#comment-17646656 ] ASF GitHub Bot commented on PARQUET-1539: - pitrou commented on PR #126: URL: https://github.com/apache/parquet-format/pull/126#issuecomment-1348674622 @wgtmac No particular rule, no. AFAIU we only synchronize when we want to get meaningful spec changes.
> Clarify CRC checksum in page header
> ---
> Key: PARQUET-1539
> URL: https://issues.apache.org/jira/browse/PARQUET-1539
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-format
> Reporter: Boudewijn Braams
> Assignee: Boudewijn Braams
> Priority: Major
> Labels: pull-request-available
> Fix For: format-2.7.0
>
> Although a page-level CRC field is defined in the Thrift specification, currently neither parquet-cpp nor parquet-mr leverage it. Moreover, the [comment|https://github.com/apache/parquet-format/blob/2b38663a28ccd4156319c0bf7ae4e6280e0c6e2d/src/main/thrift/parquet.thrift#L607] in the Thrift specification reads ‘32bit crc for the data below’, which is somewhat ambiguous as to what exactly constitutes the ‘data’ that the checksum should be calculated on. To ensure backward- and cross-compatibility of Parquet readers/writers which do want to leverage the CRC checksums, the format should specify exactly how and on what data the checksum should be calculated.
>
> h2. Alternatives
> There are three main choices to be made here:
> # Which variant of CRC32 to use
> # Whether to include the page header itself in the checksum calculation
> # Whether to calculate the checksum on uncompressed or compressed data
>
> h3. Algorithm
> The CRC field holds a 32-bit value. There are many different variants of the original CRC32 algorithm, each producing different values for the same input. For ease of implementation we propose to use the standard CRC32 algorithm.
>
> h3. Including page header
> The page header itself could be included in the checksum calculation using an approach similar to what TCP does, whereby the checksum field itself is zeroed out before calculating the checksum that will be inserted there. Evidently, including the page header is better in the sense that it increases the data covered by the checksum. However, from an implementation perspective, not including it is likely easier. Furthermore, given the relatively small size of the page header compared to the page itself, simply not including it will likely be good enough.
>
> h3. Compressed vs uncompressed
> *Compressed*
> Pros
> * Inherently faster, less data to operate on
> * Potentially better triaging when determining where a corruption may have been introduced, as the checksum is calculated in a later stage
> Cons
> * We have to trust both the encoding stage and the compression stage
> *Uncompressed*
> Pros
> * We only have to trust the encoding stage
> * Possibly able to detect more corruptions, as data is checksummed at the earliest possible moment; the checksum will be more sensitive to corruption introduced further down the line
> Cons
> * Inherently slower, more data to operate on, always need to decompress first
> * Potentially harder triaging, more stages in which corruption could have been introduced
>
> h2. Proposal
> The checksum will be calculated using the *standard CRC32 algorithm*, whereby the checksum is to be calculated on the *data only, not including the page header* itself (simple implementation) and the checksum will be calculated on *compressed data* (inherently faster, likely better triaging).
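The proposal quoted above (standard CRC32, computed over the compressed page body only, excluding the page header) can be sketched as a write/verify round trip. This is an illustrative model in Python, not the parquet-mr implementation; the function names `write_page` and `verify_page` are hypothetical.

```python
# Hedged sketch of the proposed scheme: standard CRC32 over the *compressed*
# page body, with the page header excluded from the calculation.
import zlib

def write_page(raw: bytes):
    """Writer side: compress the page body, then checksum the compressed bytes.
    Returns (crc, compressed_body) as they would be stored in the file."""
    body = zlib.compress(raw)
    return zlib.crc32(body) & 0xFFFFFFFF, body

def verify_page(crc: int, body: bytes) -> bool:
    """Reader side: recompute the checksum over the compressed bytes read
    from the file and compare against the stored value."""
    return (zlib.crc32(body) & 0xFFFFFFFF) == crc

crc, body = write_page(b"page payload " * 100)
assert verify_page(crc, body)

# A single flipped byte in the compressed body is caught before decompression
# (CRC32 detects all single-bit and sub-32-bit burst errors).
corrupted = bytes([body[0] ^ 0x01]) + body[1:]
assert not verify_page(crc, corrupted)
```

Checksumming the compressed bytes also means verification needs no decompression, which is the "inherently faster" point in the proposal.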
[GitHub] [parquet-format] pitrou commented on pull request #126: PARQUET-1539: Clarify CRC checksum in page header
pitrou commented on PR #126: URL: https://github.com/apache/parquet-format/pull/126#issuecomment-1348674622 @wgtmac No particular rule, no. AFAIU we only synchronize when we want to get meaningful spec changes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (PARQUET-1539) Clarify CRC checksum in page header
[ https://issues.apache.org/jira/browse/PARQUET-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17646655#comment-17646655 ] ASF GitHub Bot commented on PARQUET-1539: - wgtmac commented on PR #126: URL: https://github.com/apache/parquet-format/pull/126#issuecomment-1348672686 > And, yes, it would probably be nice to make the spec wording clearer. I can try to submit something. Quick question: is there any rule to sync the `parquet.thrift` file from apache/parquet-format to apache/arrow? @pitrou
[GitHub] [parquet-format] wgtmac commented on pull request #126: PARQUET-1539: Clarify CRC checksum in page header
wgtmac commented on PR #126: URL: https://github.com/apache/parquet-format/pull/126#issuecomment-1348672686 > And, yes, it would probably be nice to make the spec wording clearer. I can try to submit something. Quick question: is there any rule to sync the `parquet.thrift` file from apache/parquet-format to apache/arrow? @pitrou
[jira] [Commented] (PARQUET-1539) Clarify CRC checksum in page header
[ https://issues.apache.org/jira/browse/PARQUET-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17646613#comment-17646613 ] ASF GitHub Bot commented on PARQUET-1539: - mapleFU commented on PR #126: URL: https://github.com/apache/parquet-format/pull/126#issuecomment-1348441147 > And, yes, it would probably be nice to make the spec wording clearer. I can try to submit something. OK, thanks for your patience. I updated the descriptions in https://issues.apache.org/jira/browse/ARROW-17904 . I will finish them in the coming two patches; let's keep this patch simple and finish https://github.com/apache/arrow/pull/14351 first.
[GitHub] [parquet-format] mapleFU commented on pull request #126: PARQUET-1539: Clarify CRC checksum in page header
mapleFU commented on PR #126: URL: https://github.com/apache/parquet-format/pull/126#issuecomment-1348441147 > And, yes, it would probably be nice to make the spec wording clearer. I can try to submit something. OK, thanks for your patience. I updated the descriptions in https://issues.apache.org/jira/browse/ARROW-17904 . I will finish them in the coming two patches; let's keep this patch simple and finish https://github.com/apache/arrow/pull/14351 first.
[jira] [Commented] (PARQUET-1629) Page-level CRC checksum verification for DataPageV2
[ https://issues.apache.org/jira/browse/PARQUET-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17646612#comment-17646612 ] Antoine Pitrou commented on PARQUET-1629: - [~mwish] for the record. Perhaps you would be interested in doing this, if you can do some Java. > Page-level CRC checksum verification for DataPageV2 > --- > > Key: PARQUET-1629 > URL: https://issues.apache.org/jira/browse/PARQUET-1629 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Boudewijn Braams >Priority: Major > > In https://jira.apache.org/jira/browse/PARQUET-1580 (Github PR: > https://github.com/apache/parquet-mr/pull/647) we implemented page level CRC > checksum verification for DataPageV1. As a follow up, we should add support > for DataPageV2 that follows the spec (see > https://jira.apache.org/jira/browse/PARQUET-1539). > What needs to be done: > * Add writing out checksums for DataPageV2 > * Add checksum verification for DataPageV2 > * Create new test suite > * Create new benchmarks
[jira] [Commented] (PARQUET-1539) Clarify CRC checksum in page header
[ https://issues.apache.org/jira/browse/PARQUET-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17646609#comment-17646609 ] ASF GitHub Bot commented on PARQUET-1539: - pitrou commented on PR #126: URL: https://github.com/apache/parquet-format/pull/126#issuecomment-1348433863 It seems it was done deliberately in parquet-mr and all Parquet committers there agreed that it was how the spec should be interpreted: https://github.com/apache/parquet-mr/pull/647 So I would vote for doing it in Parquet C++. Can you first generate test files similar to what you did for data pages?
[GitHub] [parquet-format] pitrou commented on pull request #126: PARQUET-1539: Clarify CRC checksum in page header
pitrou commented on PR #126: URL: https://github.com/apache/parquet-format/pull/126#issuecomment-1348435655 And, yes, it would probably be nice to make the spec wording clearer. I can try to submit something.
[jira] [Commented] (PARQUET-1539) Clarify CRC checksum in page header
[ https://issues.apache.org/jira/browse/PARQUET-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17646610#comment-17646610 ] ASF GitHub Bot commented on PARQUET-1539: - pitrou commented on PR #126: URL: https://github.com/apache/parquet-format/pull/126#issuecomment-1348435655 And, yes, it would probably be nice to make the spec wording clearer. I can try to submit something.
[GitHub] [parquet-format] pitrou commented on pull request #126: PARQUET-1539: Clarify CRC checksum in page header
pitrou commented on PR #126: URL: https://github.com/apache/parquet-format/pull/126#issuecomment-1348433863 It seems it was done deliberately in parquet-mr and all Parquet committers there agreed that it was how the spec should be interpreted: https://github.com/apache/parquet-mr/pull/647 So I would vote for doing it in Parquet C++. Can you first generate test files similar to what you did for data pages?
[jira] [Commented] (PARQUET-1539) Clarify CRC checksum in page header
[ https://issues.apache.org/jira/browse/PARQUET-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17646595#comment-17646595 ] ASF GitHub Bot commented on PARQUET-1539: - mapleFU commented on PR #126: URL: https://github.com/apache/parquet-format/pull/126#issuecomment-1348405417 So, should we update the `parquet-format`, or just keep it here and not implement crc in parquet c++ version? @pitrou
[GitHub] [parquet-format] mapleFU commented on pull request #126: PARQUET-1539: Clarify CRC checksum in page header
mapleFU commented on PR #126: URL: https://github.com/apache/parquet-format/pull/126#issuecomment-1348405417 So, should we update the `parquet-format`, or just keep it here and not implement crc in parquet c++ version? @pitrou
[jira] [Commented] (PARQUET-1539) Clarify CRC checksum in page header
[ https://issues.apache.org/jira/browse/PARQUET-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17646591#comment-17646591 ]

ASF GitHub Bot commented on PARQUET-1539:

pitrou commented on PR #126: URL: https://github.com/apache/parquet-format/pull/126#issuecomment-1348378187

It does seem that parquet-mr writes a CRC value for dictionary pages... https://github.com/apache/parquet-mr/blob/433de8df33fcf31927f7b51456be9f53e64d48b9/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L499-L510

> Clarify CRC checksum in page header
> ---
>
> Key: PARQUET-1539
> URL: https://issues.apache.org/jira/browse/PARQUET-1539
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-format
> Reporter: Boudewijn Braams
> Assignee: Boudewijn Braams
> Priority: Major
> Labels: pull-request-available
> Fix For: format-2.7.0
>
> Although a page-level CRC field is defined in the Thrift specification, currently neither parquet-cpp nor parquet-mr leverages it. Moreover, the [comment|https://github.com/apache/parquet-format/blob/2b38663a28ccd4156319c0bf7ae4e6280e0c6e2d/src/main/thrift/parquet.thrift#L607] in the Thrift specification reads '32bit crc for the data below', which is ambiguous as to what exactly constitutes the 'data' the checksum should be calculated on. To ensure backward- and cross-compatibility of Parquet readers/writers that want to leverage the CRC checksums, the format should specify exactly how and on what data the checksum is calculated.
>
> h2. Alternatives
> There are three main choices to be made here:
> # Which variant of CRC32 to use
> # Whether to include the page header itself in the checksum calculation
> # Whether to calculate the checksum on uncompressed or compressed data
>
> h3. Algorithm
> The CRC field holds a 32-bit value. There are many different variants of the original CRC32 algorithm, each producing different values for the same input. For ease of implementation we propose to use the standard CRC32 algorithm.
>
> h3. Including page header
> The page header itself could be included in the checksum calculation using an approach similar to what TCP does, whereby the checksum field itself is zeroed out before calculating the checksum that will be inserted there. Evidently, including the page header is better in the sense that it increases the data covered by the checksum. However, from an implementation perspective, not including it is likely easier. Furthermore, given the relatively small size of the page header compared to the page itself, simply not including it will likely be good enough.
>
> h3. Compressed vs uncompressed
> *Compressed*
> Pros
> * Inherently faster, less data to operate on
> * Potentially better triaging when determining where a corruption may have been introduced, as the checksum is calculated in a later stage
> Cons
> * We have to trust both the encoding stage and the compression stage
>
> *Uncompressed*
> Pros
> * We only have to trust the encoding stage
> * Possibly able to detect more corruptions: as data is checksummed at the earliest possible moment, the checksum is more sensitive to corruption introduced further down the line
> Cons
> * Inherently slower, more data to operate on, always need to decompress first
> * Potentially harder triaging, more stages in which corruption could have been introduced
>
> h2. Proposal
> The checksum will be calculated using the *standard CRC32 algorithm*, on the *data only, not including the page header* itself (simple implementation), and on *compressed data* (inherently faster, likely better triaging).

-- This message was sent by Atlassian Jira (v8.20.10#820010)
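To make the proposal concrete, here is a minimal sketch (Python, with a hypothetical helper name `page_crc`; this is not code from parquet-mr or parquet-cpp). The "standard CRC32 algorithm" is the gzip/zip polynomial implemented by `zlib.crc32`, so that call yields the value a conforming writer would store in the page header:

```python
import zlib

def page_crc(compressed_page_body: bytes) -> int:
    """Standard CRC32 over the page body only; the page header
    itself is excluded from the calculation."""
    # zlib.crc32 implements the standard CRC32 (gzip/zip polynomial);
    # mask to keep the value an unsigned 32-bit integer.
    return zlib.crc32(compressed_page_body) & 0xFFFFFFFF

# Writer side: checksum the compressed bytes exactly as written to disk.
body = zlib.compress(b"encoded repetition levels, definition levels, values")
crc = page_crc(body)

# Reader side: recompute over the bytes read back and compare.
assert page_crc(body) == crc
```

Because the checksum covers the compressed bytes, a reader can verify a page without decompressing it, which is part of the "inherently faster" rationale above.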
[jira] [Commented] (PARQUET-1539) Clarify CRC checksum in page header
[ https://issues.apache.org/jira/browse/PARQUET-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17646585#comment-17646585 ]

ASF GitHub Bot commented on PARQUET-1539:

mapleFU commented on PR #126: URL: https://github.com/apache/parquet-format/pull/126#issuecomment-1348324323

> (also cc @mapleFU, who's working on CRC support for Parquet C++)

Hi all, I have a question here. The format says:

```
/** The 32bit CRC for the page, to be calculated as follows:
 * - Using the standard CRC32 algorithm
 * - On the data only, i.e. this header should not be included. 'Data'
 *   hereby refers to the concatenation of the repetition levels, the
 *   definition levels and the column value, in this exact order.
 * - On the encoded versions of the repetition levels, definition levels and
 *   column values
 * - On the compressed versions of the repetition levels, definition levels
 *   and column values where possible;
 *   - For v1 data pages, the repetition levels, definition levels and column
 *     values are always compressed together. If a compression scheme is
 *     specified, the CRC shall be calculated on the compressed version of
 *     this concatenation. If no compression scheme is specified, the CRC
 *     shall be calculated on the uncompressed version of this concatenation.
 *   - For v2 data pages, the repetition levels and definition levels are
 *     handled separately from the data and are never compressed (only
 *     encoded). If a compression scheme is specified, the CRC shall be
 *     calculated on the concatenation of the uncompressed repetition levels,
 *     uncompressed definition levels and the compressed column values.
 *     If no compression scheme is specified, the CRC shall be calculated on
 *     the uncompressed concatenation.
 * - In encrypted columns, CRC is calculated after page encryption; the
 *   encryption itself is performed after page compression (if compressed)
 * If enabled, this allows for disabling checksumming in HDFS if only a few
 * pages need to be read.
 **/
```

and in `README`:

```
Data pages can be individually checksummed.
```

But in our code, we have:

```c++
int64_t WriteDictionaryPage(const DictionaryPage& page) override {
  // TODO(PARQUET-594) crc checksum
  ...
}
```

So, could DICTIONARY_PAGE or even INDEX_PAGE have a CRC? /cc @pitrou
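The v1/v2 rules quoted from the spec reduce to choosing which bytes feed the checksum. A minimal sketch of that choice, assuming hypothetical buffer and function names (Python for brevity; `zlib.crc32` implements the standard CRC32 the spec names):

```python
import zlib

def crc_v1(compressed_page_body: bytes) -> int:
    # v1 data pages: repetition levels, definition levels and values
    # are compressed together, so the CRC covers that single buffer.
    return zlib.crc32(compressed_page_body) & 0xFFFFFFFF

def crc_v2(rep_levels: bytes, def_levels: bytes,
           compressed_values: bytes) -> int:
    # v2 data pages: levels are encoded but never compressed; the CRC
    # covers uncompressed levels + compressed values, in this exact order.
    return zlib.crc32(rep_levels + def_levels + compressed_values) & 0xFFFFFFFF
```

In both cases the input is exactly the byte sequence that follows the page header on disk, which makes the reader-side check a plain recomputation over what was read.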
[jira] [Commented] (PARQUET-1539) Clarify CRC checksum in page header
[ https://issues.apache.org/jira/browse/PARQUET-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17646534#comment-17646534 ]

ASF GitHub Bot commented on PARQUET-1539:

pitrou commented on PR #126: URL: https://github.com/apache/parquet-format/pull/126#issuecomment-1348081137

@bbraams @gszadovszky Could you explain why the spec's wording is so complex? It seems to me that the CRC is basically computed over the entire serialized data exactly as it's written to disk (after optional compression and encryption, and including the rep/def levels area that's prepended to the actual data). But the wording makes it seem that special care is needed to accumulate the CRC over different pieces of data, which may scare implementors (context: https://github.com/apache/arrow/pull/14351 ). Am I right in my interpretation above?
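If that interpretation is right, no special accumulation is needed: CRC32 is streamable, so checksumming the serialized buffer in one shot and folding it in chunk by chunk give the same value. A small illustration with plain `zlib.crc32` (not Parquet code; the buffer is a stand-in for a serialized page body):

```python
import zlib

page_body = bytes(range(64))  # stand-in for a serialized page body

# One shot over the whole buffer, exactly as written to disk...
whole = zlib.crc32(page_body) & 0xFFFFFFFF

# ...equals accumulating over arbitrary slices: feed each chunk with
# the previous CRC as the seed (zlib.crc32's second argument).
running = 0
for chunk in (page_body[:10], page_body[10:40], page_body[40:]):
    running = zlib.crc32(chunk, running)
running &= 0xFFFFFFFF

assert whole == running
```

This is why a writer can emit the CRC from whatever intermediate buffers it holds, while a reader may simply checksum the contiguous bytes it read back.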