selvarajperiyasamy opened a new issue #1696:
URL: https://github.com/apache/hudi/issues/1696
I already have the Hive table below, loaded by my existing Spark jobs. Now I
need to start updating these directories with a Hudi dataset.
[svchdc110p@sl73caehepc906 ~]$ hdfs dfs -ls /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200501
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200502
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200503
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:49 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200504
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200505
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200506
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200507
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:49 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200508
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200509
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200510
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200511
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:49 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200512
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200513
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200514
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200515
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:49 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200516
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200517
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200518
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200519
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200520
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200521
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200522
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200523
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:49 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200524
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200525
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200526
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200527
When I try writing with the code below, I get an error.
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig.TABLE_NAME
import org.apache.spark.sql.SaveMode.Append

try {
  // save() returns Unit, so there is no DataFrame to capture here.
  replicateDF.write.format("org.apache.hudi").
    option("hoodie.insert.shuffle.parallelism", "100").
    option("hoodie.upsert.shuffle.parallelism", "100").
    option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
    option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
    option(PRECOMBINE_FIELD_OPT_KEY, "header__change_seq").
    option("hoodie.memory.merge.max.size", "2004857600000").
    option(PARTITIONPATH_FIELD_OPT_KEY, "transaction_day,transaction_hour").
    option(KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.ComplexKeyGenerator").
    option(PAYLOAD_CLASS_OPT_KEY, "com.cdp.reporting.dp.hudi.custom.CustomOverWriteWithLatestAvroPayload").
    option("hoodie.cleaner.policy", "KEEP_LATEST_FILE_VERSIONS").
    option("hoodie.cleaner.commits.retained", "1").
    option("hoodie.keep.min.commits", "2").
    option("hoodie.keep.max.commits", "3").
    option("hoodie.copyonwrite.record.size.estimate", "160").
    option("hoodie.parquet.max.file.size", String.valueOf(256 * 1024 * 1024)).
    option("hoodie.parquet.small.file.limit", String.valueOf(128 * 1024 * 1024)).
    option(RECORDKEY_FIELD_OPT_KEY, "request_id,type_code").
    option(TABLE_NAME, "ptdb_pay_rpt_payment_transaction").
    mode(Append).
    save("/projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction")
} catch {
  case excep: Exception =>
    // Rethrow with the original exception as the cause so the stack trace is kept.
    throw new Exception(excep.getMessage, excep)
}
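For context, the payload option above points at one of our own classes. Below is an illustrative stand-in, not the actual class, just so the snippet is self-contained; it assumes the OverwriteWithLatestAvroPayload constructor from the Hudi 0.5.x line.

import org.apache.avro.generic.GenericRecord
import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload

// Hypothetical stand-in for the custom payload referenced in the options
// above. Payloads in this style extend OverwriteWithLatestAvroPayload and
// override its combine/merge hooks; the base behaviour keeps the record
// with the highest value of the precombine field (header__change_seq).
class CustomOverWriteWithLatestAvroPayload(record: GenericRecord, orderingVal: Comparable[_])
  extends OverwriteWithLatestAvroPayload(record, orderingVal)

This is the error I get: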
20/06/01 21:57:54 ERROR ComposeEngine$: [App] ***********************
Exception occurred in failure block. Error Message : Boxed Error
java.util.concurrent.ExecutionException: Boxed Error
    at scala.concurrent.impl.Promise$.resolver(Promise.scala:55)
    at scala.concurrent.impl.Promise$.scala$concurrent$impl$Promise$$resolveTry(Promise.scala:47)
    at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:244)
    at scala.concurrent.Promise$class.complete(Promise.scala:55)
    at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:23)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths:
    hdfs://anahnn/projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction
    hdfs://anahnn/projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/20200523/15
    hdfs://anahnn/projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/20200523/08
    hdfs://anahnn/projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/20200523/14
    hdfs://anahnn/projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/default
If provided paths are partition directories, please set "basePath" in the options of the data source to specify the root directory of the table. If there are multiple root directories, please load them separately and then union them.
    at scala.Predef$.assert(Predef.scala:170)
    at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:133)
    at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:98)
    at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning(PartitioningAwareFileIndex.scala:153)
    at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.partitionSpec(InMemoryFileIndex.scala:71)
    at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.partitionSchema(PartitioningAwareFileIndex.scala:50)
    at org.apache.spark.sql.execution.datasources.DataSource.combineInferredAndUserSpecifiedPartitionSchema(DataSource.scala:115)
    at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:166)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392)
    at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:81)
    at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:92)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:654)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225)
    at com.cybs.cdp.reporting.dp.compose.ComposeEngine$.transactionTableWrite(ComposeEngine.scala:270)
    at com.cybs.cdp.reporting.dp.compose.ComposeEngine$$anonfun$1$$anonfun$apply$1.apply(ComposeEngine.scala:119)
    at com.cybs.cdp.reporting.dp.compose.ComposeEngine$$anonfun$1$$anonfun$apply$1.apply(ComposeEngine.scala:46)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
Below is the directory structure after the failed write.
[svchdc110p@sl73caehepc906 ~]$ hdfs dfs -ls /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/
Found 148 items
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-06-01 21:57 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/.hoodie
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-06-01 21:56 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/20200523
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-06-01 21:55 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/default
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200501
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200502
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200503
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:49 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200504
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200505
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200506
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200507
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:49 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200508
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200509
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200510
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200511
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:49 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200512
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200513
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200514
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200515
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:49 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200516
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200517
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200518
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200519
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200520
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200521
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200522
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200523
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:49 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200524
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200525
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200526
drwxr-xr-x - svchdc110p Hadoop_cdp 0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200527
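Based on the error message, Spark's generic suggestion for mixed partition directories seems to be to pass the table root as "basePath" on the read side. Here is a sketch of my understanding for a plain parquet read; I don't see a way to pass this through the Hudi datasource, so it may not apply here:

// Spark's generic hint: give the file source the table root via "basePath"
// so partition inference starts from the right directory. This is standard
// for spark.read.parquet; whether format("org.apache.hudi") honours the
// same option is unclear to me.
val df = spark.read.
  option("basePath", "hdfs://anahnn/projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction").
  parquet("hdfs://anahnn/projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200523")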
Questions:
1. Why does Hudi not allow a single write to span multiple partition directories?
2. How can I keep the new directories aligned with the existing partition naming
convention? The existing Hive table is partitioned by transaction_day and
transaction_hour, and the value I pass for PARTITIONPATH_FIELD_OPT_KEY in Hudi
is the same. Yet the Hudi directories carry only the bare values (20200523/08)
instead of the transaction_day= prefix (a guess at a fix is sketched below).
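For question 2, after skimming the docs I suspect an option along these lines is what I need, but I have not confirmed it exists in my Hudi version, so treat this as a guess:

// Guess: ask Hudi to emit hive-style partition paths
// (transaction_day=20200523/transaction_hour=08 instead of 20200523/08).
// The hive_style_partitioning option may only exist in newer Hudi releases.
replicateDF.write.format("org.apache.hudi").
  option(PARTITIONPATH_FIELD_OPT_KEY, "transaction_day,transaction_hour").
  option(KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.ComplexKeyGenerator").
  option("hoodie.datasource.write.hive_style_partitioning", "true").
  // ... same remaining options as in the snippet above ...
  save("/projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction")

If that option is version-gated, a pointer to the first release that supports it would help.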
Thanks,
Selva