parisni commented on issue #7117: URL: https://github.com/apache/hudi/issues/7117#issuecomment-1547017607
Hudi is able to benefit from parquet files written with bloom filters. (I tested this by replacing the Hudi parquet files with ones written by vanilla Spark, and the Hudi datasource then triggers the bloom filters.) Digging into the source code, I guess the reason blooms are not taken into consideration is in [Hudi's parquet writer wrapper](https://github.com/apache/hudi/blob/67ae0c8e7e4e58454cce18a8f58bfa43f67c1183/hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieBaseParquetWriter.java#L49-L59). It calls [the `ParquetWriter` public constructor](https://github.com/apache/parquet-mr/blob/cac8f7cf55b390c2ac5ef5d14a6aa72597b99284/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java#L231-L236), which has very limited parquet feature support. [There is a more complete constructor](https://github.com/apache/parquet-mr/blob/cac8f7cf55b390c2ac5ef5d14a6aa72597b99284/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java#L276), but sadly its access is limited to the package. The package-private constructor could be reached by moving `HoodieBaseParquetWriter` into the `org.apache.parquet.hadoop` package, but then `ParquetWriter` would also have to be present in the same jar (a common package cannot be spread over multiple jars). A better option would be for parquet to provide more suitable constructors. Or am I missing something?
