sivabalan narayanan created HUDI-8471:
-----------------------------------------

             Summary: Unify row writer and non-row writer code paths
                 Key: HUDI-8471
                 URL: https://issues.apache.org/jira/browse/HUDI-8471
             Project: Apache Hudi
          Issue Type: Sub-task
          Components: writer-core
            Reporter: sivabalan narayanan
            Assignee: sivabalan narayanan


The row writer uses the writeClient in an unconventional way compared to other
operations.

A typical write operation takes the following flow:

```
1.
WriteClient.upsert {
   Instantiate HoodieTable
   result = table.upsert()
   postWrite()
   return HoodieData<WriteStatus>
}

2. writeClient.commitStats(return value from (1), i.e. HoodieData<WriteStatus>, ...)
   which internally will commit the write and then call clean, archive,
   compaction, clustering, etc.

1.a 
HoodieTable.upsert() { 
    calls into SparkUpsertCommitActionExecutor.execute() 
}

1.a.i 
SparkUpsertCommitActionExecutor.execute() {

     return HoodieWriteHelper.newInstance().write(...)
}

1.a.i.1
HoodieWriteHelper.newInstance().write() {
   dedup records
   tagRecords or index lookup 
   return BaseCommitActionExecutor.execute() 
}

1.a.i.1.a
BaseSparkCommitActionExecutor.execute() {
   build workload profile
   getPartitioner
   HoodieData<WriteStatus> writeStatuses =
       mapPartitionsAsRDD(inputRecordsWithClusteringUpdate, partitioner); // this is where the writes happen
   update index and return HoodieData<WriteStatus>
}

```
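The two-step contract above (write returns statuses, the caller hands them back to commit) can be sketched as self-contained Java. All types here (WriteStatusSketch, HoodieTableSketch, WriteClientSketch, String records) are hypothetical simplified stand-ins, not Hudi's actual signatures:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for Hudi's WriteStatus.
class WriteStatusSketch {
    final String fileId;
    WriteStatusSketch(String fileId) { this.fileId = fileId; }
}

// Stand-in for HoodieTable: 1.a — table.upsert() would delegate to a
// commit action executor; elided here.
class HoodieTableSketch {
    List<WriteStatusSketch> upsert(List<String> records) {
        List<WriteStatusSketch> statuses = new ArrayList<>();
        for (String r : records) {
            statuses.add(new WriteStatusSketch("file-for-" + r));
        }
        return statuses;
    }
}

// Stand-in for the write client, showing the two-step caller contract.
class WriteClientSketch {
    // Step 1: instantiate the table, write, do post-write bookkeeping,
    // and hand the statuses back to the caller.
    List<WriteStatusSketch> upsert(List<String> records) {
        HoodieTableSketch table = new HoodieTableSketch();
        List<WriteStatusSketch> result = table.upsert(records);
        postWrite();
        return result;
    }

    void postWrite() { /* per-write bookkeeping */ }

    // Step 2: the caller passes the statuses back; this is where the
    // commit lands and table services (clean, archive, compaction,
    // clustering) would be triggered.
    boolean commitStats(List<WriteStatusSketch> statuses) {
        return !statuses.isEmpty();
    }
}

public class TypicalWriteFlow {
    public static void main(String[] args) {
        WriteClientSketch client = new WriteClientSketch();
        List<WriteStatusSketch> statuses = client.upsert(List.of("r1", "r2"));
        System.out.println(statuses.size());              // prints 2
        System.out.println(client.commitStats(statuses)); // prints true
    }
}
```

The key point the sketch illustrates: the write and the commit are two separate calls on the client, with HoodieData<WriteStatus> flowing between them.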

While the row writer's flow looks like this:

```
1. HoodieSparkSqlWriter.write
    bulkInsertAsRow {
      writeClient.startCommit
      WriteResult = BaseDatasetBulkInsertCommitActionExecutor.execute() // by the
      time we return from here, the data is fully committed along with any
      inline table services.
   }

1.a BaseDatasetBulkInsertCommitActionExecutor.execute {
   write to the custom Spark datasource
}
```

2. Custom Spark datasource:
We have implemented a series of interfaces, which go as follows:
DefaultSource -> HoodieDataSourceInternalTable

HoodieDataSourceInternalTable.newWriteBuilder will return
HoodieDataSourceInternalBatchWriteBuilder.

This builder has buildForBatch(), which will return a BatchWrite.

BatchWrite is the core of our write path.
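The builder chain above can be sketched with simplified local stand-ins for Spark's DataSourceV2 write interfaces (the real ones live in org.apache.spark.sql.connector.write; InternalTableSketch is a hypothetical analogue of HoodieDataSourceInternalTable, and String commit messages stand in for WriterCommitMessage):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for Spark's BatchWrite: receives the per-task
// commit messages and finalizes the write.
interface BatchWriteSketch {
    void commit(List<String> writerCommitMessages);
}

// Simplified stand-in for Spark's WriteBuilder.
interface WriteBuilderSketch {
    BatchWriteSketch buildForBatch();
}

// Hypothetical analogue of HoodieDataSourceInternalTable:
// newWriteBuilder() hands back the builder whose buildForBatch()
// yields the BatchWrite that finally commits the tasks' results.
class InternalTableSketch {
    final List<String> committed = new ArrayList<>();

    WriteBuilderSketch newWriteBuilder() {
        // BatchWrite here just records what was committed.
        return () -> committed::addAll;
    }
}

public class BatchWriteChain {
    public static void main(String[] args) {
        InternalTableSketch table = new InternalTableSketch();
        // The chain the ticket describes: table -> write builder -> batch write.
        BatchWriteSketch write = table.newWriteBuilder().buildForBatch();
        write.commit(List.of("task-0", "task-1"));
        System.out.println(table.committed.size()); // prints 2
    }
}
```

The sketch shows why this path bypasses the usual two-step client contract: the commit happens inside the datasource's BatchWrite rather than via a separate writeClient.commitStats call.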
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
