ehurheap opened a new issue, #6194:
URL: https://github.com/apache/hudi/issues/6194

   **Problem**
   In hudi-cli I’m trying to run `repair deduplicate` against a partition where I have confirmed, via a separate Spark query, that there are in fact duplicates on `_hoodie_record_key`.
   I'm getting:
   ```
   cannot resolve '_hoodie_record_key' given input columns: []
   ```
   
   **To Reproduce**
   1. Verify duplicates exist with separate spark query:
   ```
   val basePath="s3://thebucket/data/tables/events"
   val partitionPath="env_id=123/week=20220711"
   val inputPath=s"$basePath/$partitionPath"
   val df = spark.read.load(inputPath)
   df.printSchema() // shows expected schema including `_hoodie_record_key`
   df.createOrReplaceTempView("hoh")
   val hoodieKeyQuery = "select _hoodie_record_key, count(*) from hoh group by _hoodie_record_key having count(*) > 1"
   val dupes = spark.sql(hoodieKeyQuery)
   dupes.count() // about 1000 dupes counted
   ```
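   As a follow-up on our side (not required to reproduce): Hudi stores meta columns such as `_hoodie_file_name` and `_hoodie_commit_time` in every record, so joining the duplicated keys back against the table shows which base files and commits hold each copy. A sketch, assuming the same `spark` session and `hoh` view from step 1:
   ```scala
   // Join the duplicated keys back to the table to see which file and
   // commit each copy of a duplicated _hoodie_record_key lives in.
   dupes.createOrReplaceTempView("dupe_keys")
   val dupeLocations = spark.sql("""
     select h._hoodie_record_key, h._hoodie_file_name, h._hoodie_commit_time
     from hoh h
     join dupe_keys d on h._hoodie_record_key = d._hoodie_record_key
     order by h._hoodie_record_key
   """)
   dupeLocations.show(20, truncate = false)
   ```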
   
   2. hudi-cli repair attempt:
   ```
   connect --path s3://thebucket/data/tables/events
   repair deduplicate --duplicatedPartitionPath "env_id=123/week=20220711" --repairedOutputPath hhdeduplicates --sparkMaster local[2] --sparkMemory 4G --dryrun true --dedupeType "upsert_type"
   ```
   This outputs (full stack trace at the end of this report):
   ```
   cannot resolve '_hoodie_record_key' given input columns: []
   ```
   
   
   **Expected behavior**
   The dry run should produce information about the files that would be repaired.
   
   
   **Environment Description**
   * Hudi version : 0.10.1
   * Hudi cli version: hudi-cli-0.10.1-amzn-0.jar
   * Spark version : 3.2.0
   * Storage: S3
   * EMR: emr-6.6.0
   * Hadoop distribution: Amazon 3.2.1
   
   **Additional context**
   We have both a streaming ingest job and a backfill job writing to this Hudi table. Here are the write options for each job:
   Stream ingest write options:
   ```
   HoodieWriteConfig.TBL_NAME.key -> "events",
   DataSourceWriteOptions.TABLE_TYPE.key -> MOR_TABLE_TYPE_OPT_VAL,
   DataSourceWriteOptions.RECORDKEY_FIELD.key -> "env_id,user_id,event_id",
   DataSourceWriteOptions.PARTITIONPATH_FIELD.key -> "env_id,week",
   DataSourceWriteOptions.PRECOMBINE_FIELD.key -> "time",
   DataSourceWriteOptions.OPERATION.key -> INSERT_OPERATION_OPT_VAL,
   HIVE_STYLE_PARTITIONING.key() -> "true",
   "hoodie.insert.shuffle.parallelism" -> ingestConfig.hudiInsertParallelism,
   "checkpointLocation" -> ingestConfig.hudiCheckpointPath,
   "hoodie.metadata.enable" -> "true",
   "hoodie.bloom.index.filter.dynamic.max.entries" -> "800000",
   "hoodie.datasource.write.streaming.ignore.failed.batch" -> "false",
   "hoodie.write.concurrency.mode" -> "optimistic_concurrency_control",
   "hoodie.cleaner.policy.failed.writes" -> "LAZY",
   "hoodie.write.lock.provider" -> "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider",
   "hoodie.write.lock.dynamodb.table" -> "datalake-locks",
   "hoodie.write.lock.dynamodb.partition_key" -> "events",
   "hoodie.write.lock.dynamodb.region" -> "us-east-1"
   ```
   Backfill job write options:
   ```
   HoodieWriteConfig.TBL_NAME.key -> "events",
   DataSourceWriteOptions.TABLE_TYPE.key -> COW_TABLE_TYPE_OPT_VAL, // note: this seems to be ignored: the table was created as MOR before the backfill began
   DataSourceWriteOptions.RECORDKEY_FIELD.key -> "env_id,user_id,event_id",
   DataSourceWriteOptions.PARTITIONPATH_FIELD.key -> "env_id,week",
   DataSourceWriteOptions.PRECOMBINE_FIELD.key -> "time",
   DataSourceWriteOptions.OPERATION.key -> BULK_INSERT_OPERATION_OPT_VAL,
   HIVE_STYLE_PARTITIONING.key() -> "true",
   "hoodie.insert.shuffle.parallelism" -> hudiInsertParallelism,
   "hoodie.upsert.shuffle.parallelism" -> hudiBulkInsertParallelism,
   "hoodie.metadata.enable" -> "true",
   "hoodie.bloom.index.filter.dynamic.max.entries" -> "800000",
   "hoodie.datasource.write.streaming.ignore.failed.batch" -> "false",
   "hoodie.bulkinsert.sort.mode" -> "NONE",
   "hoodie.combine.before.insert" -> "false",
   "hoodie.datasource.write.row.writer.enable" -> "false",
   "hoodie.write.concurrency.mode" -> "optimistic_concurrency_control",
   "hoodie.cleaner.policy.failed.writes" -> "LAZY",
   "hoodie.write.lock.provider" -> "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider",
   "hoodie.write.lock.dynamodb.table" -> "datalake-locks",
   "hoodie.write.lock.dynamodb.partition_key" -> "events",
   "hoodie.write.lock.dynamodb.region" -> "us-east-1"
   ```
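   A hedged aside on how the duplicates may arise (our assumption, not confirmed): with `OPERATION` set to insert/bulk_insert and `hoodie.combine.before.insert` disabled, neither writer de-duplicates incoming records, so overlapping stream and backfill writes can each insert their own copy of a key. One possible mitigation on the insert path (these are standard Hudi write configs, but whether they fit this pipeline is an assumption):
   ```scala
   // Sketch, not verified against this table: make insert paths de-duplicate.
   "hoodie.combine.before.insert" -> "true",                   // pre-combine the incoming batch on the record key
   "hoodie.datasource.write.insert.drop.duplicates" -> "true"  // drop incoming records whose key already exists
   ```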
   
   **Stacktrace**
   ```
   22/07/22 19:18:58 ERROR SparkMain: Fail to execute commandString
   org.apache.spark.sql.AnalysisException: cannot resolve '_hoodie_record_key' given input columns: []; line 5 pos 15;
   'UnresolvedHaving ('dupe_cnt > 1)
   +- 'Aggregate ['_hoodie_record_key], ['_hoodie_record_key AS dupe_key#0, count(1) AS dupe_cnt#1L]
      +- SubqueryAlias htbl_1658517533303
         +- View (htbl_1658517533303, [])
            +- LocalRelation <empty>
   ```
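   Reading the plan: `View (htbl_1658517533303, [])` over `LocalRelation <empty>` suggests the temp table registered by the repair job had no columns at all, i.e. zero files were loaded from the partition before the dedupe query ran. A minimal sketch (our assumption: plain Spark, no Hudi involved) that should reproduce the same class of error:
   ```scala
   import org.apache.spark.sql.{Row, SparkSession}
   import org.apache.spark.sql.types.StructType

   // Registering an empty-schema DataFrame as a view and then querying a
   // column yields "cannot resolve ... given input columns: []".
   val spark = SparkSession.builder().master("local[1]").getOrCreate()
   val empty = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], new StructType())
   empty.createOrReplaceTempView("htbl")
   spark.sql("select _hoodie_record_key from htbl") // throws AnalysisException
   ```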
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
