[https://issues.apache.org/jira/browse/HUDI-8628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17913403#comment-17913403]

Davis Zhang commented on HUDI-8628:
-----------------------------------

Cannot repro what Jon reported. The repro tests are here:
[https://github.com/apache/hudi/pull/12646]

I hit a new issue instead:
{code:java}
Caused by: java.io.EOFException
    at org.apache.avro.io.BinaryDecoder$ByteArrayByteSource.readRaw(BinaryDecoder.java:970)
    at org.apache.avro.io.BinaryDecoder.doReadBytes(BinaryDecoder.java:372)
    at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:289)
    at org.apache.avro.io.ResolvingDecoder.readString(ResolvingDecoder.java:208)
    at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:470)
    at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:460)
    at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:192)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:161)
    at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:188)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:161)
    at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:260)
    at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:248)
    at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:180)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:161)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:154)
    at org.apache.hudi.avro.HoodieAvroUtils.bytesToAvro(HoodieAvroUtils.java:264)
    at org.apache.hudi.avro.HoodieAvroUtils.bytesToAvro(HoodieAvroUtils.java:252)
    at org.apache.hudi.avro.HoodieAvroUtils.bytesToAvro(HoodieAvroUtils.java:244)
    at org.apache.hudi.common.model.DefaultHoodieRecordPayload.combineAndGetUpdateValue(DefaultHoodieRecordPayload.java:82)
    at org.apache.hudi.common.model.HoodieAvroRecordMerger.combineAndGetUpdateValue(HoodieAvroRecordMerger.java:62)
{code}
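The EOFException above surfaces from BinaryDecoder running out of bytes mid-record, which usually points to a writer/reader schema mismatch: the schema used for decoding expects more data than the serialized payload actually contains. A minimal stdlib-only analogy of that failure mode (hypothetical two-field vs. three-field layout, not Hudi's actual Avro encoding):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;

public class SchemaMismatchEof {
    // "Writer" side: serialize a record with TWO fields (id, price).
    public static byte[] writeTwoFields() throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeInt(1);     // id
            out.writeLong(12L);  // price
        }
        return buf.toByteArray();
    }

    // "Reader" side: decode as if THREE fields (id, price, ts) had been written.
    public static String readThreeFields(byte[] payload) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(payload));
        try {
            int id = in.readInt();
            long price = in.readLong();
            long ts = in.readLong(); // runs past the payload, like BinaryDecoder above
            return "ok: " + id + "," + price + "," + ts;
        } catch (EOFException e) {
            return "EOFException: reader expects more data than the writer produced";
        }
    }

    public static void main(String[] args) throws IOException {
        // The extra readLong() has no bytes left, so the EOFException branch fires.
        System.out.println(readThreeFields(writeTwoFields()));
    }
}
```

In the real stack, the mismatch would presumably be between the schema passed to HoodieAvroUtils.bytesToAvro during combineAndGetUpdateValue and the schema the log-block bytes were originally written with.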

> Merge Into is pulling in additional fields which are not set as per the 
> condition 
> ----------------------------------------------------------------------------------
>
>                 Key: HUDI-8628
>                 URL: https://issues.apache.org/jira/browse/HUDI-8628
>             Project: Apache Hudi
>          Issue Type: Sub-task
>          Components: spark-sql
>            Reporter: sivabalan narayanan
>            Assignee: Davis Zhang
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.0.1
>
>         Attachments: image-2024-12-02-03-58-48-178.png
>
>
> spark.sql(s"set ${HoodieWriteConfig.MERGE_SMALL_FILE_GROUP_CANDIDATES_LIMIT.key} = 0")
> spark.sql(s"set ${DataSourceWriteOptions.ENABLE_MERGE_INTO_PARTIAL_UPDATES.key} = true")
> spark.sql(s"set ${HoodieStorageConfig.LOGFILE_DATA_BLOCK_FORMAT.key} = $logDataBlockFormat")
> spark.sql(s"set ${HoodieReaderConfig.FILE_GROUP_READER_ENABLED.key} = false")
> spark.sql(
>   s"""
>      |create table $tableName (
>      |  id int,
>      |  name string,
>      |  price long,
>      |  ts long,
>      |  description string
>      |) using hudi
>      |tblproperties(
>      |  type = '$tableType',
>      |  primaryKey = 'id',
>      |  preCombineField = 'ts'
>      |)
>      |location '$basePath'
>      """.stripMargin)
> spark.sql(s"insert into $tableName values (1, 'a1', 10, 1000, 'a1: desc1')," +
>   "(2, 'a2', 20, 1200, 'a2: desc2'), (3, 'a3', 30.0, 1250, 'a3: desc3')")
>  
>  
> Merge Into:
> // Partial updates using MERGE INTO statement with changed fields: "price" and "_ts"
> spark.sql(
>   s"""
>      |merge into $tableName t0
>      |using ( select 1 as id, 'a1' as name, 12 as price, 1001 as _ts
>      |union select 3 as id, 'a3' as name, 25 as price, 1260 as _ts) s0
>      |on t0.id = s0.id
>      |when matched then update set price = s0.price, ts = s0._ts
>      |""".stripMargin)
>  
> The schema for this MERGE INTO command when we reach
> HoodieSparkSqlWriter.deduceWriterSchema is shown below, i.e.
> val writerSchema = HoodieSchemaUtils.deduceWriterSchema(sourceSchema,
>   latestTableSchemaOpt, internalSchemaOpt, parameters)
>  
> !image-2024-12-02-03-58-48-178.png!
>  
> The MERGE INTO command only instructs updating price and _ts, right? So why are
> other fields (e.g. name) also picked up from the source?
> You can check out the test in TestPartialUpdateForMergeInto, "Test partial
> update with MOR and Avro log format".
>  
> Note: this is about partial-update support with MERGE INTO, not a regular
> MERGE INTO.
>  
>  
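Regarding the expectation quoted above: with partial updates enabled, the deduced writer schema should contain only the SET-clause targets plus the fields Hudi needs to merge (record key and precombine field), so "name" should not appear. A hypothetical sketch of that deduction (illustrative names, not Hudi's actual deduceWriterSchema implementation):

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

public class PartialUpdateSchema {
    // Hypothetical deduction: keep the record key and precombine field plus the
    // columns assigned in the MERGE INTO ... UPDATE SET clause, in table order.
    public static List<String> deduceWriterFields(List<String> tableFields,
                                                  List<String> setClauseTargets,
                                                  String recordKeyField,
                                                  String preCombineField) {
        LinkedHashSet<String> wanted = new LinkedHashSet<>();
        wanted.add(recordKeyField);
        wanted.add(preCombineField);
        wanted.addAll(setClauseTargets);
        List<String> result = new ArrayList<>();
        for (String f : tableFields) {      // preserve the table's column order
            if (wanted.contains(f)) result.add(f);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> fields = deduceWriterFields(
                List.of("id", "name", "price", "ts", "description"),
                List.of("price", "ts"),  // MERGE INTO ... SET price = ..., ts = ...
                "id", "ts");
        System.out.println(fields); // → [id, price, ts] — no "name"
    }
}
```

The reported bug is that the actual deduced schema pulls in the remaining source fields (such as name) as well.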



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
