[ https://issues.apache.org/jira/browse/HUDI-5323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-5323:
----------------------------
    Description: 
When the virtual key feature is enabled by setting hoodie.populate.meta.fields to false, bloom filters are not written to the parquet base files during write transactions. The relevant logic is in the HoodieFileWriterFactory class:
{code:java}
private static <T extends HoodieRecordPayload, R extends IndexedRecord> HoodieFileWriter<R> newParquetFileWriter(
    String instantTime, Path path, HoodieWriteConfig config, Schema schema, HoodieTable hoodieTable,
    TaskContextSupplier taskContextSupplier, boolean populateMetaFields) throws IOException {
  // populateMetaFields is reused as the enableBloomFilter argument, coupling
  // the virtual key feature to bloom filter writing
  return newParquetFileWriter(instantTime, path, config, schema, hoodieTable.getHadoopConf(),
      taskContextSupplier, populateMetaFields, populateMetaFields);
}

private static <T extends HoodieRecordPayload, R extends IndexedRecord> HoodieFileWriter<R> newParquetFileWriter(
    String instantTime, Path path, HoodieWriteConfig config, Schema schema, Configuration conf,
    TaskContextSupplier taskContextSupplier, boolean populateMetaFields, boolean enableBloomFilter) throws IOException {
  // When enableBloomFilter is false, no bloom filter is attached to the write support
  Option<BloomFilter> filter = enableBloomFilter ? Option.of(createBloomFilter(config)) : Option.empty();
  HoodieAvroWriteSupport writeSupport = new HoodieAvroWriteSupport(new AvroSchemaConverter(conf).convert(schema), schema, filter);

  HoodieParquetConfig<HoodieAvroWriteSupport> parquetConfig = new HoodieParquetConfig<>(writeSupport, config.getParquetCompressionCodec(),
      config.getParquetBlockSize(), config.getParquetPageSize(), config.getParquetMaxFileSize(),
      conf, config.getParquetCompressionRatio(), config.parquetDictionaryEnabled());

  return new HoodieAvroParquetWriter<>(path, parquetConfig, instantTime, taskContextSupplier, populateMetaFields);
} {code}
Because the bloom filters are absent, the writer encounters an NPE when the Bloom Index is used on the same table (HUDI-5319).
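
For illustration, the problematic combination can be set up with a write config along these lines. This is a minimal, hypothetical sketch: the class name and the basePath/schemaStr placeholders are made up, and the builder calls are assumed to follow the standard HoodieWriteConfig builder API.
{code:java}
import org.apache.hudi.config.HoodieIndexConfig;
import org.apache.hudi.config.HoodieWriteConfig;
import org.apache.hudi.index.HoodieIndex;

public class VirtualKeyBloomRepro {
  // basePath and schemaStr are placeholders for a real table path and Avro schema
  static HoodieWriteConfig buildConfig(String basePath, String schemaStr) {
    return HoodieWriteConfig.newBuilder()
        .withPath(basePath)
        .withSchema(schemaStr)
        // hoodie.populate.meta.fields=false: virtual keys enabled
        .withPopulateMetaFields(false)
        // With the current coupling, base files written under this config carry
        // no bloom filters, so Bloom Index lookups hit the NPE from HUDI-5319
        .withIndexConfig(HoodieIndexConfig.newBuilder()
            .withIndexType(HoodieIndex.IndexType.BLOOM)
            .build())
        .build();
  }
}
{code}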

We should decouple the virtual key feature from the bloom filter and always write the bloom filters to the parquet files.
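
A minimal sketch of the decoupling in HoodieFileWriterFactory, assuming we simply force the flag on (the final fix may instead gate bloom filter writing on the configured index type):
{code:java}
private static <T extends HoodieRecordPayload, R extends IndexedRecord> HoodieFileWriter<R> newParquetFileWriter(
    String instantTime, Path path, HoodieWriteConfig config, Schema schema, HoodieTable hoodieTable,
    TaskContextSupplier taskContextSupplier, boolean populateMetaFields) throws IOException {
  // Sketch: always enable the bloom filter instead of reusing populateMetaFields,
  // so disabling meta fields (virtual keys) no longer drops the bloom filters
  final boolean enableBloomFilter = true;
  return newParquetFileWriter(instantTime, path, config, schema, hoodieTable.getHadoopConf(),
      taskContextSupplier, populateMetaFields, enableBloomFilter);
}
{code}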

> Decouple virtual key from writing bloom filters to parquet files
> ----------------------------------------------------------------
>
>                 Key: HUDI-5323
>                 URL: https://issues.apache.org/jira/browse/HUDI-5323
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: index, writer-core
>            Reporter: Ethan Guo
>            Priority: Critical
>             Fix For: 0.13.0

