This is an automated email from the ASF dual-hosted git repository.
dbtsai pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-3.2 by this push:
new a7dc824 [SPARK-36726] Upgrade Parquet to 1.12.1
a7dc824 is described below
commit a7dc8242ea913841a2627949fde4cd2953d0b053
Author: Chao Sun <[email protected]>
AuthorDate: Wed Sep 15 19:17:34 2021 +0000
[SPARK-36726] Upgrade Parquet to 1.12.1
### What changes were proposed in this pull request?
Upgrade Apache Parquet to 1.12.1
### Why are the changes needed?
Parquet 1.12.1 contains the following bug fixes:
- PARQUET-2064: Make Range public accessible in RowRanges
- PARQUET-2022: ZstdDecompressorStream should close `zstdInputStream`
- PARQUET-2052: Integer overflow when writing huge binary using dictionary encoding
- PARQUET-1633: Fix integer overflow
- PARQUET-2054: fix TCP leaking when calling ParquetFileWriter.appendFile
- PARQUET-2072: Do Not Determine Both Min/Max for Binary Stats
- PARQUET-2073: Fix estimate remaining row count in ColumnWriteStoreBase
- PARQUET-2078: Failed to read parquet file after writing with the same
In particular, PARQUET-2078 is a blocker for the upcoming Apache Spark 3.2.0
release.
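As background on the integer-overflow class of bug named in PARQUET-2052 and PARQUET-1633 above, here is a minimal sketch of how a 32-bit size counter can wrap around. It is illustrative only: `OverflowSketch`, `naiveTotal`, `wideTotal`, and `checkedTotal` are hypothetical names, and this is not the code changed in Parquet or Spark.

```scala
// Illustrative sketch only; none of these names come from Parquet or Spark.
object OverflowSketch {
  // Plain Int addition silently wraps around when the sum exceeds Int.MaxValue.
  def naiveTotal(bufferedBytes: Int, nextValueBytes: Int): Int =
    bufferedBytes + nextValueBytes

  // Widening both operands to Long before adding preserves the true total.
  def wideTotal(bufferedBytes: Int, nextValueBytes: Int): Long =
    bufferedBytes.toLong + nextValueBytes.toLong

  // Math.addExact throws ArithmeticException instead of wrapping.
  def checkedTotal(bufferedBytes: Int, nextValueBytes: Int): Int =
    Math.addExact(bufferedBytes, nextValueBytes)

  def main(args: Array[String]): Unit = {
    println(naiveTotal(Int.MaxValue, 1)) // wraps to Int.MinValue
    println(wideTotal(Int.MaxValue, 1))  // 2147483648
    println(scala.util.Try(checkedTotal(Int.MaxValue, 1)).isFailure) // true
  }
}
```

Widening to `Long` or using a checked add are two common ways to avoid the wrap-around; which one a given fix uses depends on the surrounding code.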
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests + a new test for the issue in SPARK-36696
Closes #33969 from sunchao/upgrade-parquet-12.1.
Authored-by: Chao Sun <[email protected]>
Signed-off-by: DB Tsai <[email protected]>
(cherry picked from commit a927b0836bd59d6731b4970957e82ac1e403ddc4)
Signed-off-by: DB Tsai <[email protected]>
---
dev/deps/spark-deps-hadoop-2.7-hive-2.3 | 12 ++++++------
dev/deps/spark-deps-hadoop-3.2-hive-2.3 | 12 ++++++------
pom.xml | 2 +-
.../resources/test-data/malformed-file-offset.parquet | Bin 0 -> 37968 bytes
.../execution/datasources/parquet/ParquetIOSuite.scala | 6 ++++++
5 files changed, 19 insertions(+), 13 deletions(-)
diff --git a/dev/deps/spark-deps-hadoop-2.7-hive-2.3 b/dev/deps/spark-deps-hadoop-2.7-hive-2.3
index 6ba3c86..44f210f 100644
--- a/dev/deps/spark-deps-hadoop-2.7-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-2.7-hive-2.3
@@ -201,12 +201,12 @@ orc-shims/1.6.10//orc-shims-1.6.10.jar
oro/2.0.8//oro-2.0.8.jar
osgi-resource-locator/1.0.3//osgi-resource-locator-1.0.3.jar
paranamer/2.8//paranamer-2.8.jar
-parquet-column/1.12.0//parquet-column-1.12.0.jar
-parquet-common/1.12.0//parquet-common-1.12.0.jar
-parquet-encoding/1.12.0//parquet-encoding-1.12.0.jar
-parquet-format-structures/1.12.0//parquet-format-structures-1.12.0.jar
-parquet-hadoop/1.12.0//parquet-hadoop-1.12.0.jar
-parquet-jackson/1.12.0//parquet-jackson-1.12.0.jar
+parquet-column/1.12.1//parquet-column-1.12.1.jar
+parquet-common/1.12.1//parquet-common-1.12.1.jar
+parquet-encoding/1.12.1//parquet-encoding-1.12.1.jar
+parquet-format-structures/1.12.1//parquet-format-structures-1.12.1.jar
+parquet-hadoop/1.12.1//parquet-hadoop-1.12.1.jar
+parquet-jackson/1.12.1//parquet-jackson-1.12.1.jar
protobuf-java/2.5.0//protobuf-java-2.5.0.jar
py4j/0.10.9.2//py4j-0.10.9.2.jar
pyrolite/4.30//pyrolite-4.30.jar
diff --git a/dev/deps/spark-deps-hadoop-3.2-hive-2.3 b/dev/deps/spark-deps-hadoop-3.2-hive-2.3
index 326229b..1e28dd1 100644
--- a/dev/deps/spark-deps-hadoop-3.2-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3.2-hive-2.3
@@ -172,12 +172,12 @@ orc-shims/1.6.10//orc-shims-1.6.10.jar
oro/2.0.8//oro-2.0.8.jar
osgi-resource-locator/1.0.3//osgi-resource-locator-1.0.3.jar
paranamer/2.8//paranamer-2.8.jar
-parquet-column/1.12.0//parquet-column-1.12.0.jar
-parquet-common/1.12.0//parquet-common-1.12.0.jar
-parquet-encoding/1.12.0//parquet-encoding-1.12.0.jar
-parquet-format-structures/1.12.0//parquet-format-structures-1.12.0.jar
-parquet-hadoop/1.12.0//parquet-hadoop-1.12.0.jar
-parquet-jackson/1.12.0//parquet-jackson-1.12.0.jar
+parquet-column/1.12.1//parquet-column-1.12.1.jar
+parquet-common/1.12.1//parquet-common-1.12.1.jar
+parquet-encoding/1.12.1//parquet-encoding-1.12.1.jar
+parquet-format-structures/1.12.1//parquet-format-structures-1.12.1.jar
+parquet-hadoop/1.12.1//parquet-hadoop-1.12.1.jar
+parquet-jackson/1.12.1//parquet-jackson-1.12.1.jar
protobuf-java/2.5.0//protobuf-java-2.5.0.jar
py4j/0.10.9.2//py4j-0.10.9.2.jar
pyrolite/4.30//pyrolite-4.30.jar
diff --git a/pom.xml b/pom.xml
index 54025de..e8d3863 100644
--- a/pom.xml
+++ b/pom.xml
@@ -136,7 +136,7 @@
<kafka.version>2.8.0</kafka.version>
<!-- After 10.15.1.3, the minimum required version is JDK9 -->
<derby.version>10.14.2.0</derby.version>
- <parquet.version>1.12.0</parquet.version>
+ <parquet.version>1.12.1</parquet.version>
<orc.version>1.6.10</orc.version>
<jetty.version>9.4.43.v20210629</jetty.version>
<jakartaservlet.version>4.0.3</jakartaservlet.version>
diff --git a/sql/core/src/test/resources/test-data/malformed-file-offset.parquet b/sql/core/src/test/resources/test-data/malformed-file-offset.parquet
new file mode 100644
index 0000000..5abeabe
Binary files /dev/null and b/sql/core/src/test/resources/test-data/malformed-file-offset.parquet differ
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala
index a330b82..e03a50b 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala
@@ -855,6 +855,12 @@ class ParquetIOSuite extends QueryTest with ParquetTest with SharedSparkSession
}
}
+ test("SPARK-36726: test incorrect Parquet row group file offset") {
+ readParquetFile(testFile("test-data/malformed-file-offset.parquet")) { df =>
+ assert(df.count() == 3650)
+ }
+ }
+
test("VectorizedParquetRecordReader - direct path read") {
val data = (0 to 10).map(i => (i, (i + 'a').toChar.toString))
withTempPath { dir =>