github-actions[bot] commented on code in PR #64723:
URL: https://github.com/apache/doris/pull/64723#discussion_r3465374663


##########
regression-test/suites/external_table_p0/iceberg/write/test_iceberg_write_parquet_compression.groovy:
##########
@@ -0,0 +1,95 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+// Verifies that Doris can write (and read back) Iceberg Parquet tables for 
every
+// compression codec it claims to support, including LZ4. LZ4 is emitted as the
+// Hadoop-framed Parquet "LZ4" codec (not LZ4_RAW) so the output stays 
readable by
+// Spark/Iceberg/Trino, matching what those engines write for
+// `write.parquet.compression-codec=lz4`.
+suite("test_iceberg_write_parquet_compression", "p0,external") {
+    String enabled = context.config.otherConfigs.get("enableIcebergTest")
+    if (enabled == null || !enabled.equalsIgnoreCase("true")) {
+        logger.info("disable iceberg test.")
+        return
+    }
+
+    String rest_port = context.config.otherConfigs.get("iceberg_rest_uri_port")
+    String minio_port = context.config.otherConfigs.get("iceberg_minio_port")
+    String externalEnvIp = context.config.otherConfigs.get("externalEnvIp")
+    String catalog_name = "test_iceberg_write_parquet_compression"
+    String db = catalog_name + "_db"
+
+    sql """drop catalog if exists ${catalog_name}"""
+    sql """
+    CREATE CATALOG ${catalog_name} PROPERTIES (
+        'type'='iceberg',
+        'iceberg.catalog.type'='rest',
+        'uri' = 'http://${externalEnvIp}:${rest_port}',
+        "s3.access_key" = "admin",
+        "s3.secret_key" = "password",
+        "s3.endpoint" = "http://${externalEnvIp}:${minio_port}";,
+        "s3.region" = "us-east-1"
+    );"""
+
+    sql """ switch ${catalog_name} """
+    sql """ drop database if exists ${db} force """
+    sql """ create database ${db} """
+
+    def expected = [["doris0", 0L], ["doris1", 1L], ["doris2", 2L]]
+
+    // Codecs Doris is expected to support for Parquet writes.
+    for (String codec : ["lz4", "snappy", "zstd", "uncompressed"]) {

Review Comment:
   This loop only covers Iceberg Parquet data-file inserts. The PR also changes 
ORC LZ4 handling (`VOrcTransformer`), Hive Parquet LZ4 handling 
(`VHivePartitionWriter`), and Iceberg delete-file compression 
(`VIcebergDeleteFileWriter`), but none of those paths are exercised here or by 
the existing LZ4 suites I found. Please add focused coverage for the other 
changed paths, at least Hive Parquet LZ4 and Iceberg/Hive ORC LZ4 writes; if 
delete files inherit the table compression, add a delete/update scenario that 
drives the LZ4 `VIcebergDeleteFileWriter` path too.



##########
be/src/exec/sink/writer/vhive_partition_writer.cpp:
##########
@@ -83,6 +83,11 @@ Status VHivePartitionWriter::open(RuntimeState* state, 
RuntimeProfile* operator_
             parquet_compression_type = TParquetCompressionType::ZSTD;
             break;
         }
+        case TFileCompressType::LZ4BLOCK: {

Review Comment:
   This BE path now accepts Hive Parquet `LZ4BLOCK`, but the Doris Hive DDL 
path still rejects the same user-facing setting. `HiveUtil.setCompressType()` 
only allows `plain/snappy/zstd` for Parquet and `plain/zlib/snappy/zstd` for 
ORC, so `CREATE TABLE ... ENGINE=hive PROPERTIES ('file_format'='parquet', 
'compression'='lz4')` still fails before it can reach this writer. Please add 
`lz4` to the Hive Parquet/ORC validation path and cover it in the Hive 
DDL/write tests so the advertised Hive LZ4 writer support is reachable from 
Doris-created Hive tables.



##########
regression-test/suites/external_table_p0/iceberg/write/test_iceberg_write_parquet_compression.groovy:
##########
@@ -0,0 +1,95 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+// Verifies that Doris can write (and read back) Iceberg Parquet tables for 
every
+// compression codec it claims to support, including LZ4. LZ4 is emitted as the
+// Hadoop-framed Parquet "LZ4" codec (not LZ4_RAW) so the output stays 
readable by
+// Spark/Iceberg/Trino, matching what those engines write for
+// `write.parquet.compression-codec=lz4`.
+suite("test_iceberg_write_parquet_compression", "p0,external") {
+    String enabled = context.config.otherConfigs.get("enableIcebergTest")
+    if (enabled == null || !enabled.equalsIgnoreCase("true")) {
+        logger.info("disable iceberg test.")
+        return
+    }
+
+    String rest_port = context.config.otherConfigs.get("iceberg_rest_uri_port")
+    String minio_port = context.config.otherConfigs.get("iceberg_minio_port")
+    String externalEnvIp = context.config.otherConfigs.get("externalEnvIp")
+    String catalog_name = "test_iceberg_write_parquet_compression"
+    String db = catalog_name + "_db"
+
+    sql """drop catalog if exists ${catalog_name}"""
+    sql """
+    CREATE CATALOG ${catalog_name} PROPERTIES (
+        'type'='iceberg',
+        'iceberg.catalog.type'='rest',
+        'uri' = 'http://${externalEnvIp}:${rest_port}',
+        "s3.access_key" = "admin",
+        "s3.secret_key" = "password",
+        "s3.endpoint" = "http://${externalEnvIp}:${minio_port}";,
+        "s3.region" = "us-east-1"
+    );"""
+
+    sql """ switch ${catalog_name} """
+    sql """ drop database if exists ${db} force """
+    sql """ create database ${db} """
+
+    def expected = [["doris0", 0L], ["doris1", 1L], ["doris2", 2L]]
+
+    // Codecs Doris is expected to support for Parquet writes.
+    for (String codec : ["lz4", "snappy", "zstd", "uncompressed"]) {
+        String tbl = "${db}.tbl_${codec}"
+        sql """ drop table if exists ${tbl} """
+        sql """
+        CREATE TABLE ${tbl} (a STRING, b BIGINT) PROPERTIES (
+            'write-format' = 'parquet',
+            'write.format.default' = 'parquet',
+            'format-version' = '2',
+            'write.parquet.compression-codec' = '${codec}'
+        )"""
+
+        sql """ INSERT INTO ${tbl} VALUES ('doris0', 0), ('doris1', 1), 
('doris2', 2) """
+
+        def rows = sql """ SELECT a, b FROM ${tbl} ORDER BY b """
+        assertEquals(expected.size(), rows.size())
+        for (int i = 0; i < expected.size(); i++) {
+            assertEquals(expected[i][0], rows[i][0])
+            assertEquals(expected[i][1], rows[i][1] as Long)
+        }
+        logger.info("iceberg parquet write+read round-trip OK for 
codec=${codec}")
+    }
+
+    // GZIP is intentionally not yet supported by the Doris Parquet writer; 
the INSERT must
+    // fail explicitly rather than silently fall back. Remove this block when 
GZIP support is
+    // added (and update the cross-engine product test expectation in tandem).
+    sql """ drop table if exists ${db}.tbl_gzip """
+    sql """
+    CREATE TABLE ${db}.tbl_gzip (a STRING, b BIGINT) PROPERTIES (
+        'write-format' = 'parquet',
+        'write.format.default' = 'parquet',
+        'format-version' = '2',
+        'write.parquet.compression-codec' = 'gzip'
+    )"""
+    test {
+        sql """ INSERT INTO ${db}.tbl_gzip VALUES ('doris0', 0) """
+        exception "Unsupported compress type GZ with parquet"
+    }
+
+    sql """ drop database if exists ${db} force """

Review Comment:
   Please remove the final database/catalog cleanup. The suite already drops 
the catalog/database before creating them, and the regression-test standard 
asks new tests to clean before use rather than after completion so the created 
objects remain available for debugging a failed or suspicious run.



##########
regression-test/suites/external_table_p0/iceberg/write/test_iceberg_write_parquet_compression.groovy:
##########
@@ -0,0 +1,95 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+// Verifies that Doris can write (and read back) Iceberg Parquet tables for 
every
+// compression codec it claims to support, including LZ4. LZ4 is emitted as the
+// Hadoop-framed Parquet "LZ4" codec (not LZ4_RAW) so the output stays 
readable by
+// Spark/Iceberg/Trino, matching what those engines write for
+// `write.parquet.compression-codec=lz4`.
+suite("test_iceberg_write_parquet_compression", "p0,external") {
+    String enabled = context.config.otherConfigs.get("enableIcebergTest")
+    if (enabled == null || !enabled.equalsIgnoreCase("true")) {
+        logger.info("disable iceberg test.")
+        return
+    }
+
+    String rest_port = context.config.otherConfigs.get("iceberg_rest_uri_port")
+    String minio_port = context.config.otherConfigs.get("iceberg_minio_port")
+    String externalEnvIp = context.config.otherConfigs.get("externalEnvIp")
+    String catalog_name = "test_iceberg_write_parquet_compression"
+    String db = catalog_name + "_db"
+
+    sql """drop catalog if exists ${catalog_name}"""
+    sql """
+    CREATE CATALOG ${catalog_name} PROPERTIES (
+        'type'='iceberg',
+        'iceberg.catalog.type'='rest',
+        'uri' = 'http://${externalEnvIp}:${rest_port}',
+        "s3.access_key" = "admin",
+        "s3.secret_key" = "password",
+        "s3.endpoint" = "http://${externalEnvIp}:${minio_port}";,
+        "s3.region" = "us-east-1"
+    );"""
+
+    sql """ switch ${catalog_name} """
+    sql """ drop database if exists ${db} force """
+    sql """ create database ${db} """
+
+    def expected = [["doris0", 0L], ["doris1", 1L], ["doris2", 2L]]
+
+    // Codecs Doris is expected to support for Parquet writes.
+    for (String codec : ["lz4", "snappy", "zstd", "uncompressed"]) {
+        String tbl = "${db}.tbl_${codec}"
+        sql """ drop table if exists ${tbl} """
+        sql """
+        CREATE TABLE ${tbl} (a STRING, b BIGINT) PROPERTIES (
+            'write-format' = 'parquet',
+            'write.format.default' = 'parquet',
+            'format-version' = '2',
+            'write.parquet.compression-codec' = '${codec}'
+        )"""
+
+        sql """ INSERT INTO ${tbl} VALUES ('doris0', 0), ('doris1', 1), 
('doris2', 2) """
+
+        def rows = sql """ SELECT a, b FROM ${tbl} ORDER BY b """

Review Comment:
   This only round-trips the rows through Doris with Groovy `assertEquals`, so 
it misses two important checks. First, deterministic regression results should 
be captured with `order_qt`/generated `.out` rather than hand assertions. 
Second, this PR specifically needs to prove the file footer says Parquet `LZ4` 
(Hadoop-framed), not `LZ4_RAW`; Doris' reader handles both, so this query would 
still pass if the writer accidentally used `TParquetCompressionType::LZ4`. 
Please add a generated-output check that inspects the written file metadata, 
for example through the existing `parquet_meta` metadata mode compression 
column, and expects `LZ4` for the LZ4 table.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to