github-actions[bot] commented on code in PR #64723: URL: https://github.com/apache/doris/pull/64723#discussion_r3465374663
########## regression-test/suites/external_table_p0/iceberg/write/test_iceberg_write_parquet_compression.groovy: ########## @@ -0,0 +1,95 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +// Verifies that Doris can write (and read back) Iceberg Parquet tables for every +// compression codec it claims to support, including LZ4. LZ4 is emitted as the +// Hadoop-framed Parquet "LZ4" codec (not LZ4_RAW) so the output stays readable by +// Spark/Iceberg/Trino, matching what those engines write for +// `write.parquet.compression-codec=lz4`. +suite("test_iceberg_write_parquet_compression", "p0,external") { + String enabled = context.config.otherConfigs.get("enableIcebergTest") + if (enabled == null || !enabled.equalsIgnoreCase("true")) { + logger.info("disable iceberg test.") + return + } + + String rest_port = context.config.otherConfigs.get("iceberg_rest_uri_port") + String minio_port = context.config.otherConfigs.get("iceberg_minio_port") + String externalEnvIp = context.config.otherConfigs.get("externalEnvIp") + String catalog_name = "test_iceberg_write_parquet_compression" + String db = catalog_name + "_db" + + sql """drop catalog if exists ${catalog_name}""" + sql """ + CREATE CATALOG ${catalog_name} PROPERTIES ( + 'type'='iceberg', + 'iceberg.catalog.type'='rest', + 'uri' = 'http://${externalEnvIp}:${rest_port}', + "s3.access_key" = "admin", + "s3.secret_key" = "password", + "s3.endpoint" = "http://${externalEnvIp}:${minio_port}", + "s3.region" = "us-east-1" + );""" + + sql """ switch ${catalog_name} """ + sql """ drop database if exists ${db} force """ + sql """ create database ${db} """ + + def expected = [["doris0", 0L], ["doris1", 1L], ["doris2", 2L]] + + // Codecs Doris is expected to support for Parquet writes. + for (String codec : ["lz4", "snappy", "zstd", "uncompressed"]) { Review Comment: This loop only covers Iceberg Parquet data-file inserts. The PR also changes ORC LZ4 handling (`VOrcTransformer`), Hive Parquet LZ4 handling (`VHivePartitionWriter`), and Iceberg delete-file compression (`VIcebergDeleteFileWriter`), but none of those paths are exercised here or by the existing LZ4 suites I found. Please add focused coverage for the other changed paths, at least Hive Parquet LZ4 and Iceberg/Hive ORC LZ4 writes; if delete files inherit the table compression, add a delete/update scenario that drives the LZ4 `VIcebergDeleteFileWriter` path too. ########## be/src/exec/sink/writer/vhive_partition_writer.cpp: ########## @@ -83,6 +83,11 @@ Status VHivePartitionWriter::open(RuntimeState* state, RuntimeProfile* operator_ parquet_compression_type = TParquetCompressionType::ZSTD; break; } + case TFileCompressType::LZ4BLOCK: { Review Comment: This BE path now accepts Hive Parquet `LZ4BLOCK`, but the Doris Hive DDL path still rejects the same user-facing setting. `HiveUtil.setCompressType()` only allows `plain/snappy/zstd` for Parquet and `plain/zlib/snappy/zstd` for ORC, so `CREATE TABLE ... ENGINE=hive PROPERTIES ('file_format'='parquet', 'compression'='lz4')` still fails before it can reach this writer. Please add `lz4` to the Hive Parquet/ORC validation path and cover it in the Hive DDL/write tests so the advertised Hive LZ4 writer support is reachable from Doris-created Hive tables. ########## regression-test/suites/external_table_p0/iceberg/write/test_iceberg_write_parquet_compression.groovy: ########## @@ -0,0 +1,95 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +// Verifies that Doris can write (and read back) Iceberg Parquet tables for every +// compression codec it claims to support, including LZ4. LZ4 is emitted as the +// Hadoop-framed Parquet "LZ4" codec (not LZ4_RAW) so the output stays readable by +// Spark/Iceberg/Trino, matching what those engines write for +// `write.parquet.compression-codec=lz4`. +suite("test_iceberg_write_parquet_compression", "p0,external") { + String enabled = context.config.otherConfigs.get("enableIcebergTest") + if (enabled == null || !enabled.equalsIgnoreCase("true")) { + logger.info("disable iceberg test.") + return + } + + String rest_port = context.config.otherConfigs.get("iceberg_rest_uri_port") + String minio_port = context.config.otherConfigs.get("iceberg_minio_port") + String externalEnvIp = context.config.otherConfigs.get("externalEnvIp") + String catalog_name = "test_iceberg_write_parquet_compression" + String db = catalog_name + "_db" + + sql """drop catalog if exists ${catalog_name}""" + sql """ + CREATE CATALOG ${catalog_name} PROPERTIES ( + 'type'='iceberg', + 'iceberg.catalog.type'='rest', + 'uri' = 'http://${externalEnvIp}:${rest_port}', + "s3.access_key" = "admin", + "s3.secret_key" = "password", + "s3.endpoint" = "http://${externalEnvIp}:${minio_port}", + "s3.region" = "us-east-1" + );""" + + sql """ switch ${catalog_name} """ + sql """ drop database if exists ${db} force """ + sql """ create database ${db} """ + + def expected = [["doris0", 0L], ["doris1", 1L], ["doris2", 2L]] + + // Codecs Doris is expected to support for Parquet writes. + for (String codec : ["lz4", "snappy", "zstd", "uncompressed"]) { + String tbl = "${db}.tbl_${codec}" + sql """ drop table if exists ${tbl} """ + sql """ + CREATE TABLE ${tbl} (a STRING, b BIGINT) PROPERTIES ( + 'write-format' = 'parquet', + 'write.format.default' = 'parquet', + 'format-version' = '2', + 'write.parquet.compression-codec' = '${codec}' + )""" + + sql """ INSERT INTO ${tbl} VALUES ('doris0', 0), ('doris1', 1), ('doris2', 2) """ + + def rows = sql """ SELECT a, b FROM ${tbl} ORDER BY b """ + assertEquals(expected.size(), rows.size()) + for (int i = 0; i < expected.size(); i++) { + assertEquals(expected[i][0], rows[i][0]) + assertEquals(expected[i][1], rows[i][1] as Long) + } + logger.info("iceberg parquet write+read round-trip OK for codec=${codec}") + } + + // GZIP is intentionally not yet supported by the Doris Parquet writer; the INSERT must + // fail explicitly rather than silently fall back. Remove this block when GZIP support is + // added (and update the cross-engine product test expectation in tandem). + sql """ drop table if exists ${db}.tbl_gzip """ + sql """ + CREATE TABLE ${db}.tbl_gzip (a STRING, b BIGINT) PROPERTIES ( + 'write-format' = 'parquet', + 'write.format.default' = 'parquet', + 'format-version' = '2', + 'write.parquet.compression-codec' = 'gzip' + )""" + test { + sql """ INSERT INTO ${db}.tbl_gzip VALUES ('doris0', 0) """ + exception "Unsupported compress type GZ with parquet" + } + + sql """ drop database if exists ${db} force """ Review Comment: Please remove the final database/catalog cleanup. The suite already drops the catalog/database before creating them, and the regression-test standard asks new tests to clean before use rather than after completion so the created objects remain available for debugging a failed or suspicious run. ########## regression-test/suites/external_table_p0/iceberg/write/test_iceberg_write_parquet_compression.groovy: ########## @@ -0,0 +1,95 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +// Verifies that Doris can write (and read back) Iceberg Parquet tables for every +// compression codec it claims to support, including LZ4. LZ4 is emitted as the +// Hadoop-framed Parquet "LZ4" codec (not LZ4_RAW) so the output stays readable by +// Spark/Iceberg/Trino, matching what those engines write for +// `write.parquet.compression-codec=lz4`. +suite("test_iceberg_write_parquet_compression", "p0,external") { + String enabled = context.config.otherConfigs.get("enableIcebergTest") + if (enabled == null || !enabled.equalsIgnoreCase("true")) { + logger.info("disable iceberg test.") + return + } + + String rest_port = context.config.otherConfigs.get("iceberg_rest_uri_port") + String minio_port = context.config.otherConfigs.get("iceberg_minio_port") + String externalEnvIp = context.config.otherConfigs.get("externalEnvIp") + String catalog_name = "test_iceberg_write_parquet_compression" + String db = catalog_name + "_db" + + sql """drop catalog if exists ${catalog_name}""" + sql """ + CREATE CATALOG ${catalog_name} PROPERTIES ( + 'type'='iceberg', + 'iceberg.catalog.type'='rest', + 'uri' = 'http://${externalEnvIp}:${rest_port}', + "s3.access_key" = "admin", + "s3.secret_key" = "password", + "s3.endpoint" = "http://${externalEnvIp}:${minio_port}", + "s3.region" = "us-east-1" + );""" + + sql """ switch ${catalog_name} """ + sql """ drop database if exists ${db} force """ + sql """ create database ${db} """ + + def expected = [["doris0", 0L], ["doris1", 1L], ["doris2", 2L]] + + // Codecs Doris is expected to support for Parquet writes. + for (String codec : ["lz4", "snappy", "zstd", "uncompressed"]) { + String tbl = "${db}.tbl_${codec}" + sql """ drop table if exists ${tbl} """ + sql """ + CREATE TABLE ${tbl} (a STRING, b BIGINT) PROPERTIES ( + 'write-format' = 'parquet', + 'write.format.default' = 'parquet', + 'format-version' = '2', + 'write.parquet.compression-codec' = '${codec}' + )""" + + sql """ INSERT INTO ${tbl} VALUES ('doris0', 0), ('doris1', 1), ('doris2', 2) """ + + def rows = sql """ SELECT a, b FROM ${tbl} ORDER BY b """ Review Comment: This only round-trips the rows through Doris with Groovy `assertEquals`, so it misses two important checks. First, deterministic regression results should be captured with `order_qt`/generated `.out` rather than hand assertions. Second, this PR specifically needs to prove the file footer says Parquet `LZ4` (Hadoop-framed), not `LZ4_RAW`; Doris' reader handles both, so this query would still pass if the writer accidentally used `TParquetCompressionType::LZ4`. Please add a generated-output check that inspects the written file metadata, for example through the existing `parquet_meta` metadata mode compression column, and expects `LZ4` for the LZ4 table. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
