github-actions[bot] commented on code in PR #64723:
URL: https://github.com/apache/doris/pull/64723#discussion_r3471309043


##########
be/src/format/transformer/vparquet_transformer.cpp:
##########
@@ -123,6 +123,14 @@ void ParquetBuildHelper::build_compression_type(
         builder.compression(arrow::Compression::LZ4);
         break;
     }
+    case TParquetCompressionType::LZ4_HADOOP: {
+        // Hadoop-framed LZ4 -> Parquet thrift codec "LZ4" (deprecated). This matches what
+        // Spark/Iceberg writes for `write.parquet.compression-codec=lz4` and is readable by
+        // Trino/Spark/Doris. Note arrow::Compression::LZ4 (above) instead emits LZ4_RAW, which
+        // Trino cannot read.
+        builder.compression(arrow::Compression::LZ4_HADOOP);

Review Comment:
   This still does not guarantee the Spark/Trino compatibility the comment is 
relying on. Doris is pinned to Arrow 17.0.0, and Arrow's 
`Lz4HadoopCodec::Compress` writes one Hadoop LZ4 block for the whole Parquet 
page/dictionary page. Upstream Arrow issue apache/arrow#49641 documents that 
JVM readers using parquet-mr/Hadoop fail once that block decompresses above 
Hadoop's 256 KiB LZ4 buffer. Arrow's writer defaults both the data page size 
and the dictionary page size limit to 1 MiB, so ordinary Hive/Iceberg LZ4 files 
with a large page can still be unreadable by Spark/Trino even though the footer 
says `LZ4`.
   
   Please either patch/upgrade the Arrow Hadoop-LZ4 writer to split blocks, or 
cap Doris' `LZ4_HADOOP` Parquet data and dictionary pages to a 
Hadoop-compatible size, and add a large-page JVM/Spark/Trino coverage case. The 
current tests only write one or three tiny rows and check footer metadata, so 
they would not catch this compatibility failure.
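
As a rough illustration of the "split blocks" option, the sketch below only plans how an uncompressed page could be divided into chunks that each fit the 256 KiB limit cited above. It does not call LZ4, Arrow, or Doris. `BlockRange`, `plan_hadoop_lz4_blocks`, and `kAssumedHadoopLz4BufferBytes` are hypothetical names introduced here, and the limit is taken from this comment rather than verified against Hadoop.

```cpp
#include <cstddef>
#include <vector>

// Assumption: the 256 KiB figure comes from the review comment above;
// it is not checked against the Hadoop sources here.
constexpr std::size_t kAssumedHadoopLz4BufferBytes = 256 * 1024;

// Hypothetical helper type: one chunk of the uncompressed page that would be
// compressed and framed as its own Hadoop LZ4 block.
struct BlockRange {
    std::size_t offset;  // start of the chunk within the uncompressed page
    std::size_t length;  // uncompressed length of the chunk
};

// Hypothetical helper: split `page_bytes` of uncompressed data into
// consecutive ranges no larger than `max_block_bytes`. Only the boundaries
// are computed; compression and framing are left out of this sketch.
std::vector<BlockRange> plan_hadoop_lz4_blocks(
        std::size_t page_bytes,
        std::size_t max_block_bytes = kAssumedHadoopLz4BufferBytes) {
    std::vector<BlockRange> blocks;
    if (max_block_bytes == 0) {
        return blocks;  // no valid plan with a zero-sized block limit
    }
    std::size_t offset = 0;
    while (offset < page_bytes) {
        const std::size_t remaining = page_bytes - offset;
        const std::size_t length =
                remaining < max_block_bytes ? remaining : max_block_bytes;
        blocks.push_back({offset, length});
        offset += length;
    }
    return blocks;
}
```

Under these assumptions, a default-sized 1 MiB page would be planned as four 256 KiB blocks instead of one oversized block. Capping the page size instead avoids the need for any splitting logic, at the cost of more pages per column chunk.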



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

