ehurheap opened a new issue, #6194:
URL: https://github.com/apache/hudi/issues/6194
**Problem**
In hudi-cli I'm trying to run `repair deduplicate` against a partition in which I have confirmed, via a separate Spark query, that there are in fact duplicates on `_hoodie_record_key`.
I'm getting:
```
cannot resolve '_hoodie_record_key' given input columns: []
```
**To Reproduce**
1. Verify duplicates exist with separate spark query:
```
val basePath="s3://thebucket/data/tables/events"
val partitionPath="env_id=123/week=20220711"
val inputPath=s"$basePath/$partitionPath"
val df = spark.read.load(inputPath)
df.printSchema() // shows expected schema including `_hoodie_record_key`
df.createOrReplaceTempView("hoh")
val hoodieKeyQuery = "select _hoodie_record_key, count(*) from hoh group by _hoodie_record_key having count(*) > 1"
val dupes = spark.sql(hoodieKeyQuery)
dupes.count() // about 1000 dupes counted
```
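It can also help to see *which* file groups and commits the duplicates land in. This is a hedged sketch against the same temp view, using the standard Hudi meta columns (`_hoodie_file_name`, `_hoodie_commit_time`) that are present on every record:

```scala
// Sketch: locate the files/commits containing the duplicate keys.
// Assumes the temp view "hoh" registered above.
val dupeFilesQuery = """
  select _hoodie_record_key,
         collect_set(_hoodie_file_name)   as files,
         collect_set(_hoodie_commit_time) as commits,
         count(*)                         as dupe_cnt
  from hoh
  group by _hoodie_record_key
  having count(*) > 1
"""
spark.sql(dupeFilesQuery).show(20, truncate = false)
```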
2. hudi-cli repair attempt:
```
connect --path s3://thebucket/data/tables/events
repair deduplicate --duplicatedPartitionPath "env_id=123/week=20220711" --repairedOutputPath hhdeduplicates --sparkMaster local[2] --sparkMemory 4G --dryrun true --dedupeType "upsert_type"
```
This outputs (full stack trace at the end of this issue):
```
cannot resolve '_hoodie_record_key' given input columns: []
```
**Expected behavior**
The dry run should report information about the files that would be repaired.
**Environment Description**
* Hudi version : 0.10.1
* Hudi cli version: hudi-cli-0.10.1-amzn-0.jar
* Spark version : 3.2.0
* Storage: S3
* EMR: emr-6.6.0
* Hadoop distribution: Amazon 3.2.1
**Additional context**
We have both a streaming ingest job and a backfill job writing to this Hudi table. Here are the write options for each job:
Stream ingest write options:
```
HoodieWriteConfig.TBL_NAME.key -> "events",
DataSourceWriteOptions.TABLE_TYPE.key -> MOR_TABLE_TYPE_OPT_VAL,
DataSourceWriteOptions.RECORDKEY_FIELD.key -> "env_id,user_id,event_id",
DataSourceWriteOptions.PARTITIONPATH_FIELD.key -> "env_id,week",
DataSourceWriteOptions.PRECOMBINE_FIELD.key -> "time",
DataSourceWriteOptions.OPERATION.key -> INSERT_OPERATION_OPT_VAL,
HIVE_STYLE_PARTITIONING.key() -> "true",
"hoodie.insert.shuffle.parallelism" -> ingestConfig.hudiInsertParallelism,
"checkpointLocation" -> ingestConfig.hudiCheckpointPath,
"hoodie.metadata.enable" -> "true",
"hoodie.bloom.index.filter.dynamic.max.entries" -> "800000",
"hoodie.datasource.write.streaming.ignore.failed.batch" -> "false",
"hoodie.write.concurrency.mode" -> "optimistic_concurrency_control",
"hoodie.cleaner.policy.failed.writes" -> "LAZY",
"hoodie.write.lock.provider" ->
"org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider",
"hoodie.write.lock.dynamodb.table" -> "datalake-locks",
"hoodie.write.lock.dynamodb.partition_key" -> "events",
"hoodie.write.lock.dynamodb.region" -> "us-east-1"
```
Backfill job write options:
```
HoodieWriteConfig.TBL_NAME.key -> "events",
// note: TABLE_TYPE seems to be ignored; the table was created as MOR before the backfill began
DataSourceWriteOptions.TABLE_TYPE.key -> COW_TABLE_TYPE_OPT_VAL,
DataSourceWriteOptions.RECORDKEY_FIELD.key -> "env_id,user_id,event_id",
DataSourceWriteOptions.PARTITIONPATH_FIELD.key -> "env_id,week",
DataSourceWriteOptions.PRECOMBINE_FIELD.key -> "time",
DataSourceWriteOptions.OPERATION.key -> BULK_INSERT_OPERATION_OPT_VAL,
HIVE_STYLE_PARTITIONING.key() -> "true",
"hoodie.insert.shuffle.parallelism" -> hudiInsertParallelism,
"hoodie.upsert.shuffle.parallelism" -> hudiBulkInsertParallelism,
"hoodie.metadata.enable" -> "true",
"hoodie.bloom.index.filter.dynamic.max.entries" -> "800000",
"hoodie.datasource.write.streaming.ignore.failed.batch" -> "false",
"hoodie.bulkinsert.sort.mode" -> "NONE",
"hoodie.combine.before.insert" -> "false",
"hoodie.datasource.write.row.writer.enable" -> "false",
"hoodie.write.concurrency.mode" -> "optimistic_concurrency_control",
"hoodie.cleaner.policy.failed.writes" -> "LAZY",
"hoodie.write.lock.provider" ->
"org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider",
"hoodie.write.lock.dynamodb.table" -> "datalake-locks",
"hoodie.write.lock.dynamodb.partition_key" -> "events",
"hoodie.write.lock.dynamodb.region" -> "us-east-1"
```
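While the CLI repair is failing, a possible interim workaround (a sketch only, not the hudi-cli repair path) is to build a deduplicated frame for the affected partition in Spark, keeping the latest row per record key according to the `time` precombine field used by both writers above, and then write that back (e.g. via an upsert or partition overwrite) after validating it:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Sketch: keep only the newest row per record key in the duplicated
// partition. Paths match the reproduction steps above; validate the
// result before writing anything back to the table.
val basePath      = "s3://thebucket/data/tables/events"
val partitionPath = "env_id=123/week=20220711"

val byKey = Window.partitionBy("_hoodie_record_key").orderBy(col("time").desc)
val deduped = spark.read.load(s"$basePath/$partitionPath")
  .withColumn("rn", row_number().over(byKey))
  .filter(col("rn") === 1)
  .drop("rn")

deduped.count() // should equal the distinct _hoodie_record_key count
```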
**Stacktrace**
```
22/07/22 19:18:58 ERROR SparkMain: Fail to execute commandString
org.apache.spark.sql.AnalysisException: cannot resolve '_hoodie_record_key'
given input columns: []; line 5 pos 15;
'UnresolvedHaving ('dupe_cnt > 1)
+- 'Aggregate ['_hoodie_record_key], ['_hoodie_record_key AS dupe_key#0,
count(1) AS dupe_cnt#1L]
+- SubqueryAlias htbl_1658517533303
+- View (htbl_1658517533303, [])
+- LocalRelation <empty>
```