selvarajperiyasamy opened a new issue #1696:
URL: https://github.com/apache/hudi/issues/1696


   I already have the Hive table below, loaded by my existing Spark jobs. Now I need to start updating these directories with a Hudi dataset.
   [svchdc110p@sl73caehepc906 ~]$ hdfs dfs -ls /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200501
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200502
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200503
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:49 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200504
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200505
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200506
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200507
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:49 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200508
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200509
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200510
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200511
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:49 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200512
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200513
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200514
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200515
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:49 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200516
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200517
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200518
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200519
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200520
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200521
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200522
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200523
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:49 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200524
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200525
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200526
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200527
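
   For context, the existing (non-Hudi) load is a plain Spark partitioned write, roughly like the following. This is a minimal sketch with the source path and session setup assumed, not the actual job; the point is that partitionBy() produces the Hive-style transaction_day=<value> directories shown above.

   import org.apache.spark.sql.{SaveMode, SparkSession}

   // Sketch of the existing load job (source path and session are assumed).
   val spark = SparkSession.builder().appName("existing-load").getOrCreate()
   val src = spark.read.parquet("/some/staging/path") // hypothetical source

   src.write.
     mode(SaveMode.Append).
     // Hive-style layout: one directory per (transaction_day, transaction_hour)
     partitionBy("transaction_day", "transaction_hour").
     parquet("/projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction")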
   
   When I try writing with the code below, I get an error.
   
   try {
     val responseDF = replicateDF.write.format("org.apache.hudi").
       option("hoodie.insert.shuffle.parallelism", "100").
       option("hoodie.upsert.shuffle.parallelism", "100").
       option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
       option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
       option(PRECOMBINE_FIELD_OPT_KEY, "header__change_seq").
       option("hoodie.memory.merge.max.size", "2004857600000").
       option(PARTITIONPATH_FIELD_OPT_KEY, "transaction_day,transaction_hour").
       option(KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.ComplexKeyGenerator").
       option(PAYLOAD_CLASS_OPT_KEY, "com.cdp.reporting.dp.hudi.custom.CustomOverWriteWithLatestAvroPayload").
       option("hoodie.cleaner.policy", "KEEP_LATEST_FILE_VERSIONS").
       option("hoodie.cleaner.commits.retained", 1).
       option("hoodie.keep.min.commits", 2).
       option("hoodie.keep.max.commits", 3).
       option("hoodie.copyonwrite.record.size.estimate", "160").
       option("hoodie.parquet.max.file.size", String.valueOf(256*1024*1024)).
       option("hoodie.parquet.small.file.limit", String.valueOf(128*1024*1024)).
       option(RECORDKEY_FIELD_OPT_KEY, "request_id,type_code").
       option(TABLE_NAME, "ptdb_pay_rpt_payment_transaction").
       mode(Append).
       save("/projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction")
   } catch {
     case excep: Exception =>
       throw new Exception(excep.getMessage)
   }
   
   
   20/06/01 21:57:54 ERROR ComposeEngine$: [App] *********************** Exception occurred in failure block. Error Message : Boxed Error
   java.util.concurrent.ExecutionException: Boxed Error
        at scala.concurrent.impl.Promise$.resolver(Promise.scala:55)
        at scala.concurrent.impl.Promise$.scala$concurrent$impl$Promise$$resolveTry(Promise.scala:47)
        at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:244)
        at scala.concurrent.Promise$class.complete(Promise.scala:55)
        at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
        at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:23)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
   Caused by: java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths:
        hdfs://anahnn/projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction
        hdfs://anahnn/projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/20200523/15
        hdfs://anahnn/projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/20200523/08
        hdfs://anahnn/projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/20200523/14
        hdfs://anahnn/projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/default

   If provided paths are partition directories, please set "basePath" in the options of the data source to specify the root directory of the table. If there are multiple root directories, please load them separately and then union them.
        at scala.Predef$.assert(Predef.scala:170)
        at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:133)
        at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:98)
        at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning(PartitioningAwareFileIndex.scala:153)
        at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.partitionSpec(InMemoryFileIndex.scala:71)
        at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.partitionSchema(PartitioningAwareFileIndex.scala:50)
        at org.apache.spark.sql.execution.datasources.DataSource.combineInferredAndUserSpecifiedPartitionSchema(DataSource.scala:115)
        at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:166)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392)
        at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:81)
        at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:92)
        at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
        at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
        at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
        at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
        at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
        at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
        at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:654)
        at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225)
        at com.cybs.cdp.reporting.dp.compose.ComposeEngine$.transactionTableWrite(ComposeEngine.scala:270)
        at com.cybs.cdp.reporting.dp.compose.ComposeEngine$$anonfun$1$$anonfun$apply$1.apply(ComposeEngine.scala:119)
        at com.cybs.cdp.reporting.dp.compose.ComposeEngine$$anonfun$1$$anonfun$apply$1.apply(ComposeEngine.scala:46)
        at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
        at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
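
   The "Suspicious paths" list shows the conflict: the table root now holds the original Hive-style directories (transaction_day=20200523) next to the <day>/<hour> directories Hudi wrote (20200523/15, default), and Spark's partition discovery requires a single naming scheme under one root. As far as I can tell, the same assertion can be reproduced with a direct read, as a sketch:

   // Sketch: reading a root that mixes Hive-style and non-Hive-style
   // partition directories trips the same "Conflicting directory
   // structures detected" assertion in Spark's PartitioningUtils.
   val mixed = spark.read.
     parquet("/projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction")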
   
   
   Below is the directory structure after the error. Note the new .hoodie, 20200523, and default directories created by the Hudi write, alongside the original transaction_day= directories.
   
   [svchdc110p@sl73caehepc906 ~]$ hdfs dfs -ls /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/
   Found 148 items
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-06-01 21:57 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/.hoodie
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-06-01 21:56 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/20200523
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-06-01 21:55 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/default
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200501
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200502
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200503
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:49 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200504
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200505
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200506
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200507
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:49 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200508
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200509
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200510
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200511
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:49 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200512
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200513
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200514
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200515
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:49 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200516
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200517
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200518
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200519
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200520
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200521
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200522
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200523
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:49 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200524
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200525
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200526
   drwxr-xr-x   - svchdc110p Hadoop_cdp          0 2020-05-28 05:50 /projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction/transaction_day=20200527
   
   
   Questions:
   
   1. Why does the write fail when the table path contains multiple partition directory layouts?
   2. How can I keep the new directories aligned with the existing partition naming convention? The existing Hive table is partitioned by transaction_day and transaction_hour, and the value I pass for PARTITIONPATH_FIELD_OPT_KEY in Hudi is exactly the same. But the Hudi directories contain only the date value, not transaction_day= in the directory name (see the sketch after these questions for what I am after).
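
   For question 2, what I am after is something like a hive-style partitioning switch on the writer. The sketch below assumes an option named hoodie.datasource.write.hive_style_partitioning; I am not sure it exists in the version I am running:

   // Sketch: same write as above, but asking Hudi to name partition
   // directories column=value instead of just the value.
   // NOTE: hoodie.datasource.write.hive_style_partitioning is assumed here
   // and may not be available in all Hudi versions.
   replicateDF.write.format("org.apache.hudi").
     option(PARTITIONPATH_FIELD_OPT_KEY, "transaction_day,transaction_hour").
     option(KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.ComplexKeyGenerator").
     option("hoodie.datasource.write.hive_style_partitioning", "true"). // assumed
     option(RECORDKEY_FIELD_OPT_KEY, "request_id,type_code").
     option(PRECOMBINE_FIELD_OPT_KEY, "header__change_seq").
     option(TABLE_NAME, "ptdb_pay_rpt_payment_transaction").
     mode(Append).
     save("/projects/cdp/data/cdp_base/ptdb_pay_rpt_payment_transaction")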
   
   Thanks,
   Selva

