[jira] [Commented] (PARQUET-2075) Unified Rewriter Tool
[ https://issues.apache.org/jira/browse/PARQUET-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17646980#comment-17646980 ]

ASF GitHub Bot commented on PARQUET-2075:
-----------------------------------------

wgtmac opened a new pull request, #1014:
URL: https://github.com/apache/parquet-mr/pull/1014

### Jira
- This patch aims to solve the first step of [PARQUET-2075](https://issues.apache.org/jira/browse/PARQUET-2075).

### Tests
- Make sure all tests pass, especially CompressionConverterTest and ColumnPrunerTest.

### Commits
- A new ParquetRewriter is introduced to unify the rewriting logic.
- RewriteOptions is defined to provide the essential settings.
- CompressionConverter and ColumnPruner have been refactored and marked as deprecated.

> Unified Rewriter Tool
> ---------------------
>
>                 Key: PARQUET-2075
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2075
>             Project: Parquet
>          Issue Type: New Feature
>            Reporter: Xinli Shang
>            Assignee: Gang Wu
>            Priority: Major
>
> During the discussion of PARQUET-2071, we came up with the idea of a
> universal tool that translates an existing file to a different state while
> skipping some low-level steps like encoding/decoding, to gain speed. For
> example, only decompress pages and then compress them directly; for
> PARQUET-2071, only decrypt and then encrypt directly. This will be useful
> for onboarding existing data to Parquet features like column encryption,
> zstd, etc.
> We already have tools like trans-compression and column pruning; we will
> consolidate all of these into this universal tool.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
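The "skip the low-level steps, just recompress" idea behind the rewriter can be sketched at the byte level. This is a minimal illustration using java.util.zip as a stand-in codec, not the ParquetRewriter API; `recompressPage` is a name made up for this sketch. The point is that the page's value encoding is never decoded, only its compression changes:

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Byte-level sketch of trans-compression: decompress a page's stored bytes
// and recompress them, never decoding the values inside. java.util.zip
// stands in for Parquet's real codecs; recompressPage is hypothetical.
public class TransCompressSketch {

  static byte[] deflate(byte[] raw, int level) {
    Deflater d = new Deflater(level);
    d.setInput(raw);
    d.finish();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[4096];
    while (!d.finished()) {
      out.write(buf, 0, d.deflate(buf));
    }
    d.end();
    return out.toByteArray();
  }

  static byte[] inflate(byte[] compressed) {
    Inflater i = new Inflater();
    i.setInput(compressed);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[4096];
    try {
      while (!i.finished()) {
        out.write(buf, 0, i.inflate(buf));
      }
    } catch (DataFormatException e) {
      throw new RuntimeException(e);
    }
    i.end();
    return out.toByteArray();
  }

  // "Rewrite" one page: bytes in, bytes out; the value encoding is untouched.
  static byte[] recompressPage(byte[] page) {
    return deflate(inflate(page), Deflater.BEST_COMPRESSION);
  }

  public static void main(String[] args) {
    byte[] encoded = "pretend these are encoded page bytes".getBytes();
    byte[] page = deflate(encoded, Deflater.BEST_SPEED);
    // Round-tripping through the rewrite preserves the encoded bytes exactly.
    byte[] rewritten = recompressPage(page);
    if (!java.util.Arrays.equals(inflate(rewritten), encoded)) {
      throw new AssertionError("content changed");
    }
    System.out.println("ok");
  }
}
```

The same shape applies to the PARQUET-2071 case: swap the codec pair for a decrypt/encrypt pair and the encoded values still pass through untouched.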
Re: parquet checksum coverage
> > i think there's a good case for turning it on as (a) there are lots of
> > other filesystems out there, including NTFS on windows laptops, *and*
> > there's the risk of corruption of data in flight from the hdfs data node
> > processes, where the CRC checks take place, to the actual reader code.

Yep, agreed, it seems reasonable; like I said, I'm not sure about the actual
performance implications.

On Fri, Dec 2, 2022 at 8:22 AM Steve Loughran wrote:

> thanks for that, especially the link
>
> as the docs say
>   * If enabled, this allows for disabling checksumming in HDFS if only a few
>   * pages need to be read.
>
> i think there's a good case for turning it on as (a) there are lots of
> other filesystems out there, including NTFS on windows laptops, *and*
> there's the risk of corruption of data in flight from the hdfs data node
> processes, where the CRC checks take place, to the actual reader code.
>
> On Thu, 1 Dec 2022 at 05:53, Micah Kornfield wrote:
>
> > Hi Steve,
> >
> > > 1. What data in a parquet file is covered by CRC checks, and are there
> > > any blocks of data (footers, summaries etc) which aren't checksummed?
> >
> > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L642
> > I think has the best summary. My understanding is that the thrift
> > metadata is not itself checksummed.
> >
> > > 2. I see that verification as set by
> > > "parquet.page.verify-checksum.enabled" is false by default. Why isn't
> > > it on? Is there a significant performance hit?
> >
> > Sorry, I don't know the answer to this.
> >
> > On Mon, Nov 14, 2022 at 3:39 AM Steve Loughran wrote:
> >
> > > hi
> > >
> > > I am busy dealing with a bug where the Azure abfs connector can get the
> > > prefetch data blocks of one thread/task overwritten by those of another
> > > task whose input stream was closed while a prefetch was in progress.
> > > https://issues.apache.org/jira/browse/HADOOP-18521
> > >
> > > I have not been able to trigger any failures reading parquet data,
> > > presumably because its seek-heavy read patterns don't benefit much from
> > > prefetching.
> > >
> > > Parquet also stores CRC checksums of pages of data written, which I
> > > need a bit of help understanding.
> > >
> > > 1. What data in a parquet file is covered by CRC checks, and are there
> > > any blocks of data (footers, summaries etc) which aren't checksummed?
> > > 2. I see that verification as set by
> > > "parquet.page.verify-checksum.enabled" is false by default. Why isn't
> > > it on? Is there a significant performance hit?
> > >
> > > Thanks
> > >
> > > steve
[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization
[ https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17646943#comment-17646943 ]

ASF GitHub Bot commented on PARQUET-2159:
-----------------------------------------

jatin-bhateja commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1048036480

## parquet-encoding/src/main/java/org/apache/parquet/column/values/bitpacking/BytePacker.java:

@@ -105,4 +116,16 @@ public void unpack8Values(final byte[] input, final int inPos, final int[] outpu
   public void unpack32Values(byte[] input, int inPos, int[] output, int outPos) {
     unpack32Values(ByteBuffer.wrap(input), inPos, output, outPos);
   }
+
+  public void unpackValuesVector(final byte[] input, final int inPos, final int[] output, final int outPos) {

Review Comment:
  Please add appropriate comments over the newly added definitions, like the ones on top of the scalar routines.

> Parquet bit-packing de/encode optimization
> ------------------------------------------
>
>                 Key: PARQUET-2159
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2159
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.13.0
>            Reporter: Fang-Xie
>            Assignee: Fang-Xie
>            Priority: Major
>             Fix For: 1.13.0
>
>         Attachments: image-2022-06-15-22-56-08-396.png,
> image-2022-06-15-22-57-15-964.png, image-2022-06-15-22-58-01-442.png,
> image-2022-06-15-22-58-40-704.png
>
>
> Spark currently uses parquet-mr as its parquet reader/writer library, but
> the built-in bit-packing en/decoding is not efficient enough.
> Our optimization of Parquet bit-packing en/decoding with
> jdk.incubator.vector in OpenJDK 18 brings a prominent performance
> improvement.
> Because the Vector API was added to OpenJDK in 16, this optimization
> requires JDK 16 or higher.
> *Below are our test results*
> The functional test is based on the open-source parquet-mr bit-pack
> decoding function *_public final void unpack8Values(final byte[] in, final
> int inPos, final int[] out, final int outPos)_*
> compared with our Vector API implementation *_public final void
> unpack8Values_vec(final byte[] in, final int inPos, final int[] out, final
> int outPos)_*
> We tested 10 pairs of decode functions (open-source parquet bit unpacking
> vs our optimized vectorized SIMD implementation) with bit
> width=\{1,2,3,4,5,6,7,8,9,10}; below are the test results:
> !image-2022-06-15-22-56-08-396.png|width=437,height=223!
> We integrated our bit-packing decode implementation into parquet-mr and
> tested the parquet batch-reading ability via Spark's
> VectorizedParquetRecordReader, which reads parquet column data in batches.
> We constructed parquet files with different row and column counts; the
> column data type is Int32, the maximum int value is 127 (which satisfies
> bit-pack encoding with bit width=7), the row count ranges from 10k to 100
> million, and the column count from 1 to 4.
> !image-2022-06-15-22-57-15-964.png|width=453,height=229!
> !image-2022-06-15-22-58-01-442.png|width=439,height=217!
> !image-2022-06-15-22-58-40-704.png|width=415,height=208!

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
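The unpack8Values contract described above (8 values of bitWidth bits each, packed back to back) can be illustrated with a plain scalar routine. This is a minimal sketch assuming Parquet's little-endian bit order (least-significant bit of the first byte first); it is not the generated parquet-mr code, which unrolls each bit width into straight-line shifts and masks:

```java
// Scalar reference for what unpack8Values does for a given bit width: read
// 8 values of `bitWidth` bits each, packed little-endian (LSB of the first
// byte first). A simplified illustration, not the generated parquet-mr code.
public class BitUnpackSketch {

  static void unpack8Values(byte[] in, int inPos, int[] out, int outPos, int bitWidth) {
    int bit = 0; // bit offset from in[inPos], counted LSB-first
    for (int i = 0; i < 8; i++) {
      int value = 0;
      for (int b = 0; b < bitWidth; b++, bit++) {
        int byteVal = in[inPos + (bit >>> 3)] & 0xFF;
        value |= ((byteVal >>> (bit & 7)) & 1) << b;
      }
      out[outPos + i] = value;
    }
  }

  public static void main(String[] args) {
    int[] out = new int[8];

    // bitWidth=1: one byte holds 8 values, LSB first.
    unpack8Values(new byte[] {(byte) 0b10110100}, 0, out, 0, 1);
    System.out.println(java.util.Arrays.toString(out)); // [0, 0, 1, 0, 1, 1, 0, 1]

    // bitWidth=8: each value is just the unsigned byte.
    unpack8Values(new byte[] {127, 0, 1, 2, 3, 4, 5, (byte) 200}, 0, out, 0, 8);
    System.out.println(java.util.Arrays.toString(out)); // [127, 0, 1, 2, 3, 4, 5, 200]
  }
}
```

The inner bit loop is exactly what the SIMD version replaces: with jdk.incubator.vector, many of these shift/mask lanes are processed per instruction instead of one bit at a time.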
[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization
[ https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17646941#comment-17646941 ]

ASF GitHub Bot commented on PARQUET-2159:
-----------------------------------------

jatin-bhateja commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1048036480

## parquet-encoding/src/main/java/org/apache/parquet/column/values/bitpacking/BytePacker.java:

@@ -105,4 +116,16 @@ public void unpack8Values(final byte[] input, final int inPos, final int[] outpu
   public void unpack32Values(byte[] input, int inPos, int[] output, int outPos) {
     unpack32Values(ByteBuffer.wrap(input), inPos, output, outPos);
   }
+
+  public void unpackValuesVector(final byte[] input, final int inPos, final int[] output, final int outPos) {

Review Comment:
  Please add appropriate comments over the newly added definitions, like the ones on top of the scalar routines.

## parquet-generator/src/main/resources/ByteBitPackingVectorLE:

@@ -0,0 +1,3218 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.parquet.column.values.bitpacking;
+
+import jdk.incubator.vector.*;
+
+import java.nio.ByteBuffer;
+
+/**
+ * This is an auto-generated source file and should not be edited directly.
+ */
+public abstract class ByteBitPackingVectorLE {
+  private static final BytePacker[] packers = new BytePacker[33];
+
+  static {
+    packers[0] = new Packer0();
+    packers[1] = new Packer1();
+    packers[2] = new Packer2();
+    packers[3] = new Packer3();
+    packers[4] = new Packer4();
+    packers[5] = new Packer5();
+    packers[6] = new Packer6();
+    packers[7] = new Packer7();
+    packers[8] = new Packer8();
+    packers[9] = new Packer9();
+    packers[10] = new Packer10();
+    packers[11] = new Packer11();
+    packers[12] = new Packer12();
+    packers[13] = new Packer13();
+    packers[14] = new Packer14();
+    packers[15] = new Packer15();
+    packers[16] = new Packer16();
+    packers[17] = new Packer17();
+    packers[18] = new Packer18();
+    packers[19] = new Packer19();
+    packers[20] = new Packer20();
+    packers[21] = new Packer21();
+    packers[22] = new Packer22();
+    packers[23] = new Packer23();
+    packers[24] = new Packer24();
+    packers[25] = new Packer25();
+    packers[26] = new Packer26();
+    packers[27] = new Packer27();
+    packers[28] = new Packer28();
+    packers[29] = new Packer29();
+    packers[30] = new Packer30();
+    packers[31] = new Packer31();
+    packers[32] = new Packer32();
+  }
+
+  public static final BytePackerFactory factory = new BytePackerFactory() {
+    public BytePacker newBytePacker(int bitWidth) {
+      return packers[bitWidth];
+    }
+  };
+
+  private static final class Packer0 extends BytePacker {
+    private int unpackCount = 0;
+
+    private Packer0() {
+      super(0);
+    }
+
+    public int getUnpackCount() {
+      return unpackCount;
+    }
+
+    public final void pack8Values(final int[] in, final int inPos, final byte[] out, final int outPos) {
+    }
+
+    public final void pack32Values(final int[] in, final int inPos, final byte[] out, final int outPos) {
+    }
+
+    public final void unpack8Values(final byte[] in, final int inPos, final int[] out, final int outPos) {
+    }
+
+    public final void unpack8Values(final ByteBuffer in, final int inPos, final int[] out, final int outPos) {
+    }
+
+    public final void unpack32Values(final byte[] in, final int inPos, final int[] out, final int outPos) {
+    }
+
+    public final void unpack32Values(final ByteBuffer in, final int inPos, final int[] out, final int outPos) {
+    }
+
+    public final void unpackValuesVector(final byte[] input, final int inPos, final int[] output, final int outPos) {
+    }
+
+    public final void unpackValuesVector(final ByteBuffer input, final int inPos, final int[] output, final int outPos) {

Review Comment:
  We also need to ensure that we invoke both the scalar generators and the vector generators for the jdk17 profile.

## parquet-benchmarks/src/main/java/org/apache/parquet/benchmarks/ByteBitPackingVectorBenchmarks.java:
[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization
[ https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17646899#comment-17646899 ]

ASF GitHub Bot commented on PARQUET-2159:
-----------------------------------------

jiangjiguang commented on PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#issuecomment-1350287549

> This work looks promising! It would be great if you can add some micro-benchmark to parquet-benchmarks.

@wgtmac I have added the micro-benchmark to parquet-benchmarks; this is the result:
![image](https://user-images.githubusercontent.com/12368495/207491959-a1a22134-98fd-45f6-aa08-1934584e0fbb.png)

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Commented] (PARQUET-2218) [Format] Clarify CRC computation
[ https://issues.apache.org/jira/browse/PARQUET-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17646685#comment-17646685 ]

ASF GitHub Bot commented on PARQUET-2218:
-----------------------------------------

mapleFU commented on PR #188:
URL: https://github.com/apache/parquet-format/pull/188#issuecomment-1348743274

The change looks good to me! Thanks a lot!

> [Format] Clarify CRC computation
> --------------------------------
>
>                 Key: PARQUET-2218
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2218
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>            Reporter: Antoine Pitrou
>            Assignee: Antoine Pitrou
>            Priority: Minor
>             Fix For: format-2.10.0
>
>
> The format spec on CRC checksumming felt ambiguous when trying to implement
> it in Parquet C++, so we should make the wording clearer.
> (see discussion on
> https://github.com/apache/parquet-format/pull/126#issuecomment-1348081137 and
> below)

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Commented] (PARQUET-1539) Clarify CRC checksum in page header
[ https://issues.apache.org/jira/browse/PARQUET-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17646675#comment-17646675 ]

ASF GitHub Bot commented on PARQUET-1539:
-----------------------------------------

pitrou commented on PR #126:
URL: https://github.com/apache/parquet-format/pull/126#issuecomment-1348718160

I opened https://github.com/apache/parquet-format/pull/188 to clarify the wording.

> Clarify CRC checksum in page header
> -----------------------------------
>
>                 Key: PARQUET-1539
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1539
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>            Reporter: Boudewijn Braams
>            Assignee: Boudewijn Braams
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: format-2.7.0
>
>
> Although a page-level CRC field is defined in the Thrift specification,
> currently neither parquet-cpp nor parquet-mr leverages it. Moreover, the
> [comment|https://github.com/apache/parquet-format/blob/2b38663a28ccd4156319c0bf7ae4e6280e0c6e2d/src/main/thrift/parquet.thrift#L607]
> in the Thrift specification reads ‘32bit crc for the data below’, which is
> somewhat ambiguous as to what exactly constitutes the ‘data’ that the
> checksum should be calculated on. To ensure backward- and
> cross-compatibility of Parquet readers/writers which do want to leverage
> the CRC checksums, the format should specify exactly how and on what data
> the checksum should be calculated.
> h2. Alternatives
> There are three main choices to be made here:
> # Which variant of CRC32 to use
> # Whether to include the page header itself in the checksum calculation
> # Whether to calculate the checksum on uncompressed or compressed data
> h3. Algorithm
> The CRC field holds a 32-bit value. There are many different variants of
> the original CRC32 algorithm, each producing different values for the same
> input. For ease of implementation we propose to use the standard CRC32
> algorithm.
> h3. Including the page header
> The page header itself could be included in the checksum calculation using
> an approach similar to what TCP does, whereby the checksum field itself is
> zeroed out before calculating the checksum that will be inserted there.
> Evidently, including the page header is better in the sense that it
> increases the data covered by the checksum. However, from an implementation
> perspective, not including it is likely easier. Furthermore, given the
> relatively small size of the page header compared to the page itself,
> simply not including it will likely be good enough.
> h3. Compressed vs uncompressed
> *Compressed*
> Pros
> * Inherently faster, less data to operate on
> * Potentially better triaging when determining where a corruption may have
> been introduced, as the checksum is calculated at a later stage
> Cons
> * We have to trust both the encoding stage and the compression stage
> *Uncompressed*
> Pros
> * We only have to trust the encoding stage
> * Possibly able to detect more corruptions, as data is checksummed at the
> earliest possible moment; the checksum will be more sensitive to corruption
> introduced further down the line
> Cons
> * Inherently slower, more data to operate on; we always need to decompress
> first
> * Potentially harder triaging, more stages in which the corruption could
> have been introduced
> h2. Proposal
> The checksum will be calculated using the *standard CRC32 algorithm*,
> whereby the checksum is calculated on the *data only, not including the
> page header* itself (simple implementation), and the checksum will be
> calculated on *compressed data* (inherently faster, likely better
> triaging).

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
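The proposal quoted above maps directly onto java.util.zip.CRC32, which implements the standard CRC32 variant: checksum the compressed page bytes only, leaving the page header out. A minimal sketch; `pageCrc` is a hypothetical helper name, not a parquet-mr API:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Sketch of the proposed page-header `crc` field: standard CRC32 computed
// over the compressed page bytes only (the page header itself is excluded).
public class PageCrcSketch {

  static long pageCrc(byte[] compressedPageBytes) {
    CRC32 crc = new CRC32();
    crc.update(compressedPageBytes, 0, compressedPageBytes.length);
    return crc.getValue(); // unsigned 32-bit result carried in a long
  }

  public static void main(String[] args) {
    // "123456789" is the classic check vector for standard CRC32: 0xCBF43926.
    long crc = pageCrc("123456789".getBytes(StandardCharsets.US_ASCII));
    System.out.println(Long.toHexString(crc)); // cbf43926
  }
}
```

Getting this check vector (0xCBF43926) is a quick way for an implementation to confirm it picked the standard CRC32 rather than one of the other variants (CRC32C, etc.) mentioned in the discussion.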
[jira] [Commented] (PARQUET-2218) [Format] Clarify CRC computation
[ https://issues.apache.org/jira/browse/PARQUET-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17646674#comment-17646674 ]

ASF GitHub Bot commented on PARQUET-2218:
-----------------------------------------

pitrou commented on PR #188:
URL: https://github.com/apache/parquet-format/pull/188#issuecomment-1348714880

@bbraams @gszadovszky @mapleFU thoughts?

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Commented] (PARQUET-2218) [Format] Clarify CRC computation
[ https://issues.apache.org/jira/browse/PARQUET-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17646672#comment-17646672 ]

ASF GitHub Bot commented on PARQUET-2218:
-----------------------------------------

pitrou opened a new pull request, #188:
URL: https://github.com/apache/parquet-format/pull/188

When trying to implement CRC computation in Parquet C++, we found the wording
to be ambiguous. Clarify that CRC computation happens on the exact binary
serialization (instead of a long-winded and confusing elaboration about v1 and
v2 data page layouts). Also clarify that CRC computation can apply to all page
kinds, not only data pages (for reference, parquet-mr currently supports
checksumming v1 data pages as well as dictionary pages).

Also, see discussion on
https://github.com/apache/parquet-format/pull/126#issuecomment-1348081137 and
below.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Updated] (PARQUET-2218) [Format] Clarify CRC computation
[ https://issues.apache.org/jira/browse/PARQUET-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated PARQUET-2218: Description: The format spec on CRC checksumming felt ambiguous when trying to implement it in Parquet C++, so we should make the wording clearer. (see discussion on https://github.com/apache/parquet-format/pull/126#issuecomment-1348081137 and below) was: The format spec on CRC checksumming felt ambiguous when trying to implement it in Parquet C++, so we should make the wording clearer.
[jira] [Created] (PARQUET-2218) [Format] Clarify CRC computation
Antoine Pitrou created PARQUET-2218: --- Summary: [Format] Clarify CRC computation Key: PARQUET-2218 URL: https://issues.apache.org/jira/browse/PARQUET-2218 Project: Parquet Issue Type: Improvement Components: parquet-format Reporter: Antoine Pitrou Assignee: Antoine Pitrou Fix For: format-2.10.0 The format spec on CRC checksumming felt ambiguous when trying to implement it in Parquet C++, so we should make the wording clearer.
[jira] [Commented] (PARQUET-1539) Clarify CRC checksum in page header
[ https://issues.apache.org/jira/browse/PARQUET-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17646656#comment-17646656 ] ASF GitHub Bot commented on PARQUET-1539: - pitrou commented on PR #126: URL: https://github.com/apache/parquet-format/pull/126#issuecomment-1348674622 @wgtmac No particular rule, no. AFAIU we only synchronize when we want to get meaningful spec changes.
> Clarify CRC checksum in page header
> ---
> Key: PARQUET-1539
> URL: https://issues.apache.org/jira/browse/PARQUET-1539
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-format
> Reporter: Boudewijn Braams
> Assignee: Boudewijn Braams
> Priority: Major
> Labels: pull-request-available
> Fix For: format-2.7.0
>
> Although a page-level CRC field is defined in the Thrift specification, currently neither parquet-cpp nor parquet-mr leverage it. Moreover, the [comment|https://github.com/apache/parquet-format/blob/2b38663a28ccd4156319c0bf7ae4e6280e0c6e2d/src/main/thrift/parquet.thrift#L607] in the Thrift specification reads ‘32bit crc for the data below’, which is somewhat ambiguous as to what exactly constitutes the ‘data’ that the checksum should be calculated on. To ensure backward- and cross-compatibility of Parquet readers/writers which do want to leverage the CRC checksums, the format should specify exactly how and on what data the checksum should be calculated.
>
> h2. Alternatives
> There are three main choices to be made here:
> # Which variant of CRC32 to use
> # Whether to include the page header itself in the checksum calculation
> # Whether to calculate the checksum on uncompressed or compressed data
>
> h3. Algorithm
> The CRC field holds a 32-bit value. There are many different variants of the original CRC32 algorithm, each producing different values for the same input. For ease of implementation we propose to use the standard CRC32 algorithm.
>
> h3. Including page header
> The page header itself could be included in the checksum calculation using an approach similar to what TCP does, whereby the checksum field itself is zeroed out before calculating the checksum that will be inserted there. Evidently, including the page header is better in the sense that it increases the data covered by the checksum. However, from an implementation perspective, not including it is likely easier. Furthermore, given the relatively small size of the page header compared to the page itself, simply not including it will likely be good enough.
>
> h3. Compressed vs uncompressed
> *Compressed*
> Pros
> * Inherently faster, less data to operate on
> * Potentially better triaging when determining where a corruption may have been introduced, as the checksum is calculated in a later stage
> Cons
> * We have to trust both the encoding stage and the compression stage
> *Uncompressed*
> Pros
> * We only have to trust the encoding stage
> * Possibly able to detect more corruptions, as data is checksummed at the earliest possible moment; the checksum will be more sensitive to corruption introduced further down the line
> Cons
> * Inherently slower, more data to operate on, always need to decompress first
> * Potentially harder triaging, more stages in which corruption could have been introduced
>
> h2. Proposal
> The checksum will be calculated using the *standard CRC32 algorithm*, whereby the checksum is to be calculated on the *data only, not including the page header* itself (simple implementation) and the checksum will be calculated on *compressed data* (inherently faster, likely better triaging).
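The proposal quoted above (standard CRC32, computed over the compressed page body only, excluding the page header) can be sketched as a write/verify round trip. This is an illustrative model in Python, not the parquet-mr implementation; the function names `write_page` and `verify_page` are hypothetical.

```python
# Hedged sketch of the proposed scheme: standard CRC32 over the *compressed*
# page body, with the page header excluded from the calculation.
import zlib

def write_page(raw: bytes):
    """Writer side: compress the page body, then checksum the compressed bytes.
    Returns (crc, compressed_body) as they would be stored in the file."""
    body = zlib.compress(raw)
    return zlib.crc32(body) & 0xFFFFFFFF, body

def verify_page(crc: int, body: bytes) -> bool:
    """Reader side: recompute the checksum over the compressed bytes read
    from the file and compare against the stored value."""
    return (zlib.crc32(body) & 0xFFFFFFFF) == crc

crc, body = write_page(b"page payload " * 100)
assert verify_page(crc, body)

# A single flipped byte in the compressed body is caught before decompression
# (CRC32 detects all single-bit and sub-32-bit burst errors).
corrupted = bytes([body[0] ^ 0x01]) + body[1:]
assert not verify_page(crc, corrupted)
```

Checksumming the compressed bytes also means verification needs no decompression, which is the "inherently faster" point in the proposal.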
[GitHub] [parquet-format] pitrou commented on pull request #126: PARQUET-1539: Clarify CRC checksum in page header
pitrou commented on PR #126: URL: https://github.com/apache/parquet-format/pull/126#issuecomment-1348674622 @wgtmac No particular rule, no. AFAIU we only synchronize when we want to get meaningful spec changes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (PARQUET-1539) Clarify CRC checksum in page header
[ https://issues.apache.org/jira/browse/PARQUET-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17646655#comment-17646655 ] ASF GitHub Bot commented on PARQUET-1539: - wgtmac commented on PR #126: URL: https://github.com/apache/parquet-format/pull/126#issuecomment-1348672686 > And, yes, it would probably be nice to make the spec wording clearer. I can try to submit something. Quick question: is there any rule to sync the `parquet.thrift` file from apache/parquet-format to apache/arrow? @pitrou
[GitHub] [parquet-format] wgtmac commented on pull request #126: PARQUET-1539: Clarify CRC checksum in page header
wgtmac commented on PR #126: URL: https://github.com/apache/parquet-format/pull/126#issuecomment-1348672686 > And, yes, it would probably be nice to make the spec wording clearer. I can try to submit something. Quick question: is there any rule to sync the `parquet.thrift` file from apache/parquet-format to apache/arrow? @pitrou
[jira] [Commented] (PARQUET-1539) Clarify CRC checksum in page header
[ https://issues.apache.org/jira/browse/PARQUET-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17646613#comment-17646613 ] ASF GitHub Bot commented on PARQUET-1539: - mapleFU commented on PR #126: URL: https://github.com/apache/parquet-format/pull/126#issuecomment-1348441147 > And, yes, it would probably be nice to make the spec wording clearer. I can try to submit something. OK, thanks for your patience. I updated the descriptions in https://issues.apache.org/jira/browse/ARROW-17904 . I will finish them in the coming two patches; let's keep this patch simple and finish https://github.com/apache/arrow/pull/14351 first.
[GitHub] [parquet-format] mapleFU commented on pull request #126: PARQUET-1539: Clarify CRC checksum in page header
mapleFU commented on PR #126: URL: https://github.com/apache/parquet-format/pull/126#issuecomment-1348441147 > And, yes, it would probably be nice to make the spec wording clearer. I can try to submit something. OK, thanks for your patience. I updated the descriptions in https://issues.apache.org/jira/browse/ARROW-17904 . I will finish them in the coming two patches; let's keep this patch simple and finish https://github.com/apache/arrow/pull/14351 first.
[jira] [Commented] (PARQUET-1629) Page-level CRC checksum verification for DataPageV2
[ https://issues.apache.org/jira/browse/PARQUET-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17646612#comment-17646612 ] Antoine Pitrou commented on PARQUET-1629: - [~mwish] for the record. Perhaps you would be interested in doing this, if you can do some Java. > Page-level CRC checksum verification for DataPageV2 > --- > > Key: PARQUET-1629 > URL: https://issues.apache.org/jira/browse/PARQUET-1629 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Boudewijn Braams >Priority: Major > > In https://jira.apache.org/jira/browse/PARQUET-1580 (Github PR: > https://github.com/apache/parquet-mr/pull/647) we implemented page level CRC > checksum verification for DataPageV1. As a follow up, we should add support > for DataPageV2 that follows the spec (see > https://jira.apache.org/jira/browse/PARQUET-1539). > What needs to be done: > * Add writing out checksums for DataPageV2 > * Add checksum verification for DataPageV2 > * Create new test suite > * Create new benchmarks
[jira] [Commented] (PARQUET-1539) Clarify CRC checksum in page header
[ https://issues.apache.org/jira/browse/PARQUET-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17646609#comment-17646609 ] ASF GitHub Bot commented on PARQUET-1539: - pitrou commented on PR #126: URL: https://github.com/apache/parquet-format/pull/126#issuecomment-1348433863 It seems it was done deliberately in parquet-mr and all Parquet committers there agreed that it was how the spec should be interpreted: https://github.com/apache/parquet-mr/pull/647 So I would vote for doing it in Parquet C++. Can you first generate test files similar to what you did for data pages?
[GitHub] [parquet-format] pitrou commented on pull request #126: PARQUET-1539: Clarify CRC checksum in page header
pitrou commented on PR #126: URL: https://github.com/apache/parquet-format/pull/126#issuecomment-1348435655 And, yes, it would probably be nice to make the spec wording clearer. I can try to submit something.
[jira] [Commented] (PARQUET-1539) Clarify CRC checksum in page header
[ https://issues.apache.org/jira/browse/PARQUET-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17646610#comment-17646610 ] ASF GitHub Bot commented on PARQUET-1539: - pitrou commented on PR #126: URL: https://github.com/apache/parquet-format/pull/126#issuecomment-1348435655 And, yes, it would probably be nice to make the spec wording clearer. I can try to submit something.
[GitHub] [parquet-format] pitrou commented on pull request #126: PARQUET-1539: Clarify CRC checksum in page header
pitrou commented on PR #126: URL: https://github.com/apache/parquet-format/pull/126#issuecomment-1348433863 It seems it was done deliberately in parquet-mr and all Parquet committers there agreed that it was how the spec should be interpreted: https://github.com/apache/parquet-mr/pull/647 So I would vote for doing it in Parquet C++. Can you first generate test files similar to what you did for data pages?
[jira] [Commented] (PARQUET-1539) Clarify CRC checksum in page header
[ https://issues.apache.org/jira/browse/PARQUET-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17646595#comment-17646595 ] ASF GitHub Bot commented on PARQUET-1539: - mapleFU commented on PR #126: URL: https://github.com/apache/parquet-format/pull/126#issuecomment-1348405417 So, should we update the `parquet-format`, or just keep it here and not implement crc in parquet c++ version? @pitrou
[GitHub] [parquet-format] mapleFU commented on pull request #126: PARQUET-1539: Clarify CRC checksum in page header
mapleFU commented on PR #126: URL: https://github.com/apache/parquet-format/pull/126#issuecomment-1348405417 So, should we update the `parquet-format`, or just keep it here and not implement crc in parquet c++ version? @pitrou
[jira] [Commented] (PARQUET-1539) Clarify CRC checksum in page header
[ https://issues.apache.org/jira/browse/PARQUET-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17646591#comment-17646591 ]

ASF GitHub Bot commented on PARQUET-1539:

pitrou commented on PR #126: URL: https://github.com/apache/parquet-format/pull/126#issuecomment-1348378187

It does seem that parquet-mr writes a CRC value for dictionary pages... https://github.com/apache/parquet-mr/blob/433de8df33fcf31927f7b51456be9f53e64d48b9/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L499-L510

> Clarify CRC checksum in page header
> ---
>
> Key: PARQUET-1539
> URL: https://issues.apache.org/jira/browse/PARQUET-1539
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-format
> Reporter: Boudewijn Braams
> Assignee: Boudewijn Braams
> Priority: Major
> Labels: pull-request-available
> Fix For: format-2.7.0
>
> Although a page-level CRC field is defined in the Thrift specification, currently neither parquet-cpp nor parquet-mr leverages it. Moreover, the [comment|https://github.com/apache/parquet-format/blob/2b38663a28ccd4156319c0bf7ae4e6280e0c6e2d/src/main/thrift/parquet.thrift#L607] in the Thrift specification reads '32bit crc for the data below', which is ambiguous as to what exactly constitutes the 'data' the checksum should be calculated on. To ensure backward- and cross-compatibility of Parquet readers/writers that want to leverage the CRC checksums, the format should specify exactly how and on what data the checksum is calculated.
>
> h2. Alternatives
> There are three main choices to be made here:
> # Which variant of CRC32 to use
> # Whether to include the page header itself in the checksum calculation
> # Whether to calculate the checksum on uncompressed or compressed data
>
> h3. Algorithm
> The CRC field holds a 32-bit value. There are many different variants of the original CRC32 algorithm, each producing different values for the same input. For ease of implementation we propose to use the standard CRC32 algorithm.
>
> h3. Including page header
> The page header itself could be included in the checksum calculation using an approach similar to what TCP does, whereby the checksum field itself is zeroed out before calculating the checksum that will be inserted there. Evidently, including the page header is better in the sense that it increases the data covered by the checksum. However, from an implementation perspective, not including it is likely easier. Furthermore, given the relatively small size of the page header compared to the page itself, simply not including it will likely be good enough.
>
> h3. Compressed vs uncompressed
> *Compressed*
> Pros
> * Inherently faster, less data to operate on
> * Potentially better triaging when determining where a corruption may have been introduced, as the checksum is calculated in a later stage
> Cons
> * We have to trust both the encoding stage and the compression stage
>
> *Uncompressed*
> Pros
> * We only have to trust the encoding stage
> * Possibly able to detect more corruptions: as data is checksummed at the earliest possible moment, the checksum is more sensitive to corruption introduced further down the line
> Cons
> * Inherently slower, more data to operate on, always need to decompress first
> * Potentially harder triaging, more stages in which corruption could have been introduced
>
> h2. Proposal
> The checksum will be calculated using the *standard CRC32 algorithm*, on the *data only, not including the page header* itself (simple implementation), and on *compressed data* (inherently faster, likely better triaging).

-- This message was sent by Atlassian Jira (v8.20.10#820010)
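To make the proposal concrete, here is a minimal sketch (Python, with a hypothetical helper name `page_crc`; this is not code from parquet-mr or parquet-cpp). The "standard CRC32 algorithm" is the gzip/zip polynomial implemented by `zlib.crc32`, so that call yields the value a conforming writer would store in the page header:

```python
import zlib

def page_crc(compressed_page_body: bytes) -> int:
    """Standard CRC32 over the page body only; the page header
    itself is excluded from the calculation."""
    # zlib.crc32 implements the standard CRC32 (gzip/zip polynomial);
    # mask to keep the value an unsigned 32-bit integer.
    return zlib.crc32(compressed_page_body) & 0xFFFFFFFF

# Writer side: checksum the compressed bytes exactly as written to disk.
body = zlib.compress(b"encoded repetition levels, definition levels, values")
crc = page_crc(body)

# Reader side: recompute over the bytes read back and compare.
assert page_crc(body) == crc
```

Because the checksum covers the compressed bytes, a reader can verify a page without decompressing it, which is part of the "inherently faster" rationale above.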
[jira] [Commented] (PARQUET-1539) Clarify CRC checksum in page header
[ https://issues.apache.org/jira/browse/PARQUET-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17646585#comment-17646585 ]

ASF GitHub Bot commented on PARQUET-1539:

mapleFU commented on PR #126: URL: https://github.com/apache/parquet-format/pull/126#issuecomment-1348324323

> (also cc @mapleFU, who's working on CRC support for Parquet C++)

Hi all, I have a question here. The format says:

```
/** The 32bit CRC for the page, to be calculated as follows:
 * - Using the standard CRC32 algorithm
 * - On the data only, i.e. this header should not be included. 'Data'
 *   hereby refers to the concatenation of the repetition levels, the
 *   definition levels and the column value, in this exact order.
 * - On the encoded versions of the repetition levels, definition levels and
 *   column values
 * - On the compressed versions of the repetition levels, definition levels
 *   and column values where possible;
 *   - For v1 data pages, the repetition levels, definition levels and column
 *     values are always compressed together. If a compression scheme is
 *     specified, the CRC shall be calculated on the compressed version of
 *     this concatenation. If no compression scheme is specified, the CRC
 *     shall be calculated on the uncompressed version of this concatenation.
 *   - For v2 data pages, the repetition levels and definition levels are
 *     handled separately from the data and are never compressed (only
 *     encoded). If a compression scheme is specified, the CRC shall be
 *     calculated on the concatenation of the uncompressed repetition levels,
 *     uncompressed definition levels and the compressed column values.
 *     If no compression scheme is specified, the CRC shall be calculated on
 *     the uncompressed concatenation.
 * - In encrypted columns, CRC is calculated after page encryption; the
 *   encryption itself is performed after page compression (if compressed)
 * If enabled, this allows for disabling checksumming in HDFS if only a few
 * pages need to be read.
 **/
```

and in `README`:

```
Data pages can be individually checksummed.
```

But in our code, we have:

```c++
int64_t WriteDictionaryPage(const DictionaryPage& page) override {
  // TODO(PARQUET-594) crc checksum
  ...
}
```

So, could DICTIONARY_PAGE or even INDEX_PAGE have a CRC? /cc @pitrou
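The v1/v2 rules quoted from the spec reduce to choosing which bytes feed the checksum. A minimal sketch of that choice, assuming hypothetical buffer and function names (Python for brevity; `zlib.crc32` implements the standard CRC32 the spec names):

```python
import zlib

def crc_v1(compressed_page_body: bytes) -> int:
    # v1 data pages: repetition levels, definition levels and values
    # are compressed together, so the CRC covers that single buffer.
    return zlib.crc32(compressed_page_body) & 0xFFFFFFFF

def crc_v2(rep_levels: bytes, def_levels: bytes,
           compressed_values: bytes) -> int:
    # v2 data pages: levels are encoded but never compressed; the CRC
    # covers uncompressed levels + compressed values, in this exact order.
    return zlib.crc32(rep_levels + def_levels + compressed_values) & 0xFFFFFFFF
```

In both cases the input is exactly the byte sequence that follows the page header on disk, which makes the reader-side check a plain recomputation over what was read.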
[jira] [Commented] (PARQUET-1539) Clarify CRC checksum in page header
[ https://issues.apache.org/jira/browse/PARQUET-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17646534#comment-17646534 ]

ASF GitHub Bot commented on PARQUET-1539:

pitrou commented on PR #126: URL: https://github.com/apache/parquet-format/pull/126#issuecomment-1348081137

@bbraams @gszadovszky Could you explain why the spec's wording is so complex? It seems to me that the CRC is basically computed over the entire serialized data exactly as it's written to disk (after optional compression and encryption, and including the rep/def levels area that's prepended to the actual data). But the wording makes it seem that special care is needed to accumulate the CRC over different pieces of data, which may scare implementors (context: https://github.com/apache/arrow/pull/14351 ). Am I right in my interpretation above?
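If that interpretation is right, no special accumulation is needed: CRC32 is streamable, so checksumming the serialized buffer in one shot and folding it in chunk by chunk give the same value. A small illustration with plain `zlib.crc32` (not Parquet code; the buffer is a stand-in for a serialized page body):

```python
import zlib

page_body = bytes(range(64))  # stand-in for a serialized page body

# One shot over the whole buffer, exactly as written to disk...
whole = zlib.crc32(page_body) & 0xFFFFFFFF

# ...equals accumulating over arbitrary slices: feed each chunk with
# the previous CRC as the seed (zlib.crc32's second argument).
running = 0
for chunk in (page_body[:10], page_body[10:40], page_body[40:]):
    running = zlib.crc32(chunk, running)
running &= 0xFFFFFFFF

assert whole == running
```

This is why a writer can emit the CRC from whatever intermediate buffers it holds, while a reader may simply checksum the contiguous bytes it read back.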