pushpavanthar opened a new issue, #7667: URL: https://github.com/apache/hudi/issues/7667
**Describe the problem you faced**

In Hudi 0.11.1, `hoodie.metadata.enable=true` is the default. In the logs I see `HoodieTableMetaClient` loading the table as type `COPY_ON_WRITE(version=1, baseFileFormat=PARQUET)` from the base path, while the same class loads the table as `MERGE_ON_READ(version=1, baseFileFormat=HFILE)` from the metadata path.

Steps to reproduce the behavior:

1. Deploy HoodieDeltaStreamer in continuous mode with `hoodie.metadata.enable=true` and table type `COPY_ON_WRITE` with the configs below:

```
acks: all
auto.offset.reset: earliest
bootstrap.servers: kafka:9092
client.dns.lookup: use_all_dns_ips
group.id: hudi-cow-continuous-credit-analysis-data
hive.metastore.disallow.incompatible.col.type.changes: false
hoodie.archive.async: true
hoodie.archive.automatic: true
hoodie.archive.delete.parallelism: 500
hoodie.archive.merge.enable: true
hoodie.archive.merge.files.batch.size: 20
hoodie.auto.adjust.lock.configs: true
hoodie.bloom.index.update.partition.path: false
hoodie.bloom.index.use.metadata: true
hoodie.clean.allow.multiple: false
hoodie.clean.async: true
hoodie.clean.automatic: true
hoodie.clean.max.commits: 10
hoodie.cleaner.hours.retained: 1
hoodie.cleaner.incremental.mode: true
hoodie.cleaner.parallelism: 500
hoodie.cleaner.policy: KEEP_LATEST_BY_HOURS
hoodie.clustering.async.enabled: false
hoodie.clustering.async.max.commits: 1
hoodie.clustering.execution.strategy.class: org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy
hoodie.clustering.inline: false
hoodie.clustering.plan.strategy.class: org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy
hoodie.clustering.plan.strategy.small.file.limit: 629145600
hoodie.clustering.plan.strategy.target.file.max.bytes: 1073741824
hoodie.commits.archival.batch: 20
hoodie.datasource.hive_sync.database: test_clustering
hoodie.datasource.hive_sync.partition_extractor_class: org.apache.hudi.hive.NonPartitionedExtractor
hoodie.datasource.hive_sync.table: cow_credit_analysis_data
hoodie.datasource.write.keygenerator.class: org.apache.hudi.keygen.NonpartitionedKeyGenerator
hoodie.datasource.write.partitionpath.field: ''
hoodie.datasource.write.precombine.field: __lsn
hoodie.datasource.write.reconcile.schema: false
hoodie.datasource.write.recordkey.field: id
hoodie.deltastreamer.schemaprovider.registry.url: https://schema_registry/subjects/lending_customer_service.public.credit_analysis_data-value/versions/latest
hoodie.deltastreamer.schemaprovider.spark_avro_post_processor.enable: false
hoodie.deltastreamer.source.kafka.auto.reset.offsets: earliest
hoodie.deltastreamer.source.kafka.enable.commit.offset: true
hoodie.deltastreamer.source.kafka.topic: lending_customer_service.public.credit_analysis_data
hoodie.index.type: BLOOM
hoodie.keep.max.commits: 800
hoodie.keep.min.commits: 600
hoodie.metrics.on: true
hoodie.metrics.pushgateway.delete.on.shutdown: false
hoodie.metrics.pushgateway.host: pushgateway
hoodie.metrics.pushgateway.job.name: hudi_cow_continuous_credit_analysis_data
hoodie.metrics.pushgateway.port: 443
hoodie.metrics.pushgateway.random.job.name.suffix: false
hoodie.metrics.reporter.metricsname.prefix: hudi
hoodie.metrics.reporter.type: PROMETHEUS_PUSHGATEWAY
hoodie.parquet.compression.codec: snappy
partition.assignment.strategy: org.apache.kafka.clients.consumer.RangeAssignor
sasl.jaas.config: org.apache.kafka.common.security.plain.PlainLoginModule required username='***************' password='***************';
sasl.mechanism: PLAIN
schema.registry.url: https://schema_registry
security.protocol: SASL_SSL
```

2. Check the logs for `HoodieTableMetaClient` entries matching the pattern `HoodieTableMetaClient: Finished Loading Table of type `.

**Expected behavior**

Is this the expected behaviour for the table config read from the metadata table's hoodie.properties? Isn't this misleading? Aren't we supposed to have a consistent table type within the metadata table as well?
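Step 2 can be scripted. A minimal sketch (the two log lines below are stubbed from this issue's driver log so the command is runnable end to end; in practice, point `grep` at your actual Spark driver log instead):

```shell
# Stub a driver log containing the two table-type lines seen in this issue
# ("driver.log" is a hypothetical local copy of the Spark driver log).
cat > driver.log <<'EOF'
23/01/13 07:19:55 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3://datalake_bucket/test/hudi_poc/continuous_cow/cow_credit_analysis_data
23/01/13 07:19:55 INFO HoodieTableMetaClient: Finished Loading Table of type MERGE_ON_READ(version=1, baseFileFormat=HFILE) from s3://datalake_bucket/test/hudi_poc/continuous_cow/cow_credit_analysis_data/.hoodie/metadata
EOF

# Count how many times each table type was loaded
grep -o 'Finished Loading Table of type [A-Z_]*' driver.log | sort | uniq -c
```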
I've been noticing a lot of issues when metadata is enabled; not sure if this is the root cause.

**Environment Description**

* Hudi version : 0.11.1
* Spark version : 3.1.1
* Hive version : 3.1.2
* Hadoop version :
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no

**Additional context**

I'm attaching driver logs for a better understanding of the problem. I can share the entire driver logs if required.

```
23/01/13 07:19:55 INFO [pool-31-thread-1] HoodieTableMetaClient: Loading HoodieTableMetaClient from s3://datalake_bucket/test/hudi_poc/continuous_cow/cow_credit_analysis_data
23/01/13 07:19:55 INFO [pool-31-thread-1] HoodieTableConfig: Loading table properties from s3://datalake_bucket/test/hudi_poc/continuous_cow/cow_credit_analysis_data/.hoodie/hoodie.properties
23/01/13 07:19:55 INFO [pool-31-thread-1] S3NativeFileSystem: Opening 's3://datalake_bucket/test/hudi_poc/continuous_cow/cow_credit_analysis_data/.hoodie/hoodie.properties' for reading
23/01/13 07:19:55 INFO [pool-31-thread-1] HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3://datalake_bucket/test/hudi_poc/continuous_cow/cow_credit_analysis_data
23/01/13 07:19:55 INFO [pool-31-thread-1] HoodieTableMetaClient: Loading HoodieTableMetaClient from s3://datalake_bucket/test/hudi_poc/continuous_cow/cow_credit_analysis_data/.hoodie/metadata
23/01/13 07:19:55 INFO [pool-31-thread-1] HoodieTableConfig: Loading table properties from s3://datalake_bucket/test/hudi_poc/continuous_cow/cow_credit_analysis_data/.hoodie/metadata/.hoodie/hoodie.properties
23/01/13 07:19:55 INFO [pool-31-thread-1] S3NativeFileSystem: Opening 's3://datalake_bucket/test/hudi_poc/continuous_cow/cow_credit_analysis_data/.hoodie/metadata/.hoodie/hoodie.properties' for reading
23/01/13 07:19:55 INFO [pool-31-thread-1] HoodieTableMetaClient: Finished Loading Table of type MERGE_ON_READ(version=1, baseFileFormat=HFILE) from s3://datalake_bucket/test/hudi_poc/continuous_cow/cow_credit_analysis_data/.hoodie/metadata
23/01/13 07:19:55 INFO [pool-31-thread-1] HoodieTableMetaClient: Loading HoodieTableMetaClient from s3://datalake_bucket/test/hudi_poc/continuous_cow/cow_credit_analysis_data
23/01/13 07:19:55 INFO [pool-31-thread-1] HoodieTableConfig: Loading table properties from s3://datalake_bucket/test/hudi_poc/continuous_cow/cow_credit_analysis_data/.hoodie/hoodie.properties
23/01/13 07:19:55 INFO [pool-31-thread-1] S3NativeFileSystem: Opening 's3://datalake_bucket/test/hudi_poc/continuous_cow/cow_credit_analysis_data/.hoodie/hoodie.properties' for reading
23/01/13 07:19:55 INFO [pool-31-thread-1] HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3://datalake_bucket/test/hudi_poc/continuous_cow/cow_credit_analysis_data
23/01/13 07:19:55 INFO [pool-31-thread-1] HoodieTableMetaClient: Loading HoodieTableMetaClient from s3://datalake_bucket/test/hudi_poc/continuous_cow/cow_credit_analysis_data/.hoodie/metadata
23/01/13 07:19:55 INFO [pool-31-thread-1] HoodieTableConfig: Loading table properties from s3://datalake_bucket/test/hudi_poc/continuous_cow/cow_credit_analysis_data/.hoodie/metadata/.hoodie/hoodie.properties
23/01/13 07:19:55 INFO [pool-31-thread-1] S3NativeFileSystem: Opening 's3://datalake_bucket/test/hudi_poc/continuous_cow/cow_credit_analysis_data/.hoodie/metadata/.hoodie/hoodie.properties' for reading
23/01/13 07:19:55 INFO [pool-31-thread-1] HoodieTableMetaClient: Finished Loading Table of type MERGE_ON_READ(version=1, baseFileFormat=HFILE) from s3://datalake_bucket/test/hudi_poc/continuous_cow/cow_credit_analysis_data/.hoodie/metadata
```
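The mismatch in the logs can also be confirmed straight from the two hoodie.properties files, since `hoodie.table.type` is the key both `HoodieTableMetaClient` loads read. A hedged sketch using a local stub of the table layout (on S3 the same two files live under the table base path and under `.hoodie/metadata`):

```shell
# Stub the on-disk layout of a COW table and its internal metadata table
# (local directories here; the real files live under the S3 base path above).
mkdir -p table/.hoodie/metadata/.hoodie
echo 'hoodie.table.type=COPY_ON_WRITE' > table/.hoodie/hoodie.properties
echo 'hoodie.table.type=MERGE_ON_READ' > table/.hoodie/metadata/.hoodie/hoodie.properties

# Compare the table type of the data table vs. its metadata table
grep -H 'hoodie.table.type' \
  table/.hoodie/hoodie.properties \
  table/.hoodie/metadata/.hoodie/hoodie.properties
```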
