[
https://issues.apache.org/jira/browse/HUDI-8628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17913506#comment-17913506
]
Davis Zhang commented on HUDI-8628:
-----------------------------------
Plumbed through the code; the root cause is:
When MERGE INTO (MIT) does its job, it internally uses
```
public static final ConfigProperty<String> WRITE_SCHEMA_OVERRIDE =
ConfigProperty
.key("hoodie.write.schema")
```
when writing its completion instant file. The write schema is something else, as
it is paired with an internal RDD that MIT needs to handle.
```
public static final ConfigProperty<String> AVRO_SCHEMA_STRING = ConfigProperty
.key("hoodie.avro.schema")
```
Later, when inline clustering runs, it uses the stale AVRO_SCHEMA_STRING
config as the writer schema, which causes the issue. Basically, MIT applies a
hack internally; when a different write operation runs later, MIT does not
clean up the writer config it left behind, and that leads to the issue.
The fix: for table services, always read the table schema. We will not rely
on WRITE_SCHEMA_OVERRIDE or AVRO_SCHEMA_STRING.
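As a minimal, self-contained sketch of the stale-override problem and the fix (the class and method names below are invented for illustration; only the two config keys come from the actual Hudi code):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class SchemaResolution {
    // Real Hudi config keys referenced in the comment above.
    static final String WRITE_SCHEMA_OVERRIDE = "hoodie.write.schema";
    static final String AVRO_SCHEMA_STRING = "hoodie.avro.schema";
    // Stand-in for the schema stored in table metadata.
    static final String TABLE_SCHEMA = "{\"type\":\"record\",\"name\":\"full\"}";

    // Buggy behavior: the table service trusts whatever schema the last
    // write operation (e.g. MERGE INTO) left behind in the config map.
    static String buggyResolve(Map<String, String> cfg) {
        return Optional.ofNullable(cfg.get(WRITE_SCHEMA_OVERRIDE))
                .orElse(cfg.getOrDefault(AVRO_SCHEMA_STRING, TABLE_SCHEMA));
    }

    // Fixed behavior, per the fix described above: table services always
    // resolve the schema from table metadata and ignore per-write overrides.
    static String fixedResolve(Map<String, String> cfg) {
        return TABLE_SCHEMA;
    }

    public static void main(String[] args) {
        Map<String, String> cfg = new HashMap<>();
        // MERGE INTO leaves its internal projection schema behind.
        cfg.put(AVRO_SCHEMA_STRING, "{\"type\":\"record\",\"name\":\"mitInternal\"}");
        System.out.println(buggyResolve(cfg)); // stale MIT schema leaks into clustering
        System.out.println(fixedResolve(cfg)); // table service uses the table schema
    }
}
```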
> Merge Into is pulling in additional fields which are not set as per the
> condition
> ----------------------------------------------------------------------------------
>
> Key: HUDI-8628
> URL: https://issues.apache.org/jira/browse/HUDI-8628
> Project: Apache Hudi
> Issue Type: Sub-task
> Components: spark-sql
> Reporter: sivabalan narayanan
> Assignee: Davis Zhang
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 1.0.1
>
> Attachments: image-2024-12-02-03-58-48-178.png
>
> Original Estimate: 6h
> Time Spent: 3.5h
> Remaining Estimate: 2.5h
>
> spark.sql(s"set ${HoodieWriteConfig.MERGE_SMALL_FILE_GROUP_CANDIDATES_LIMIT.key} = 0")
> spark.sql(s"set ${DataSourceWriteOptions.ENABLE_MERGE_INTO_PARTIAL_UPDATES.key} = true")
> spark.sql(s"set ${HoodieStorageConfig.LOGFILE_DATA_BLOCK_FORMAT.key} = $logDataBlockFormat")
> spark.sql(s"set ${HoodieReaderConfig.FILE_GROUP_READER_ENABLED.key} = false")
> spark.sql(
>   s"""
>      |create table $tableName (
>      |  id int,
>      |  name string,
>      |  price long,
>      |  ts long,
>      |  description string
>      |) using hudi
>      |tblproperties(
>      |  type = '$tableType',
>      |  primaryKey = 'id',
>      |  preCombineField = 'ts'
>      |)
>      |location '$basePath'
>      """.stripMargin)
> spark.sql(s"insert into $tableName values (1, 'a1', 10, 1000, 'a1: desc1')," +
>   "(2, 'a2', 20, 1200, 'a2: desc2'), (3, 'a3', 30.0, 1250, 'a3: desc3')")
>
>
> Merge Into:
> // Partial updates using MERGE INTO statement with changed fields: "price" and "_ts"
> spark.sql(
>   s"""
>      |merge into $tableName t0
>      |using ( select 1 as id, 'a1' as name, 12 as price, 1001 as _ts
>      | union select 3 as id, 'a3' as name, 25 as price, 1260 as _ts) s0
>      |on t0.id = s0.id
>      |when matched then update set price = s0.price, ts = s0._ts
>      |""".stripMargin)
>
> The schema for this merge into command when we reach
> HoodieSparkSqlWriter.deduceWriterSchema is given below.
> i.e.
> val writerSchema = HoodieSchemaUtils.deduceWriterSchema(sourceSchema,
> latestTableSchemaOpt, internalSchemaOpt, parameters)
>
> !image-2024-12-02-03-58-48-178.png!
>
> The merge into command only instructs it to update price and _ts, right? So
> why are other fields (e.g. name) also picked up from the source?
> You can check out the test in TestPartialUpdateForMergeInto, "Test partial
> update with MOR and Avro log format".
>
> Note: this is the partial-update support in MERGE INTO btw, not a regular
> MERGE INTO.
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)