stayrascal edited a comment on issue #4030:
URL: https://github.com/apache/hudi/issues/4030#issuecomment-1025077842


   Hi @danny0405, I tried the solution of changing the ValueState of 
BucketAssignFunction to store the whole HoodieRecord instead of 
HoodieRecordGlobalLocation. (Once the partition changes, output a delete record 
to the old file, update the location of the old record with the new partition 
path, and output a new record to the new file.)
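
   A minimal sketch of that partition-change handling, in plain Java. 
Everything here is hypothetical: `PartitionChangeSketch`, the string payloads 
and the `HashMap` standing in for the keyed ValueState are illustrations only, 
not Hudi or Flink classes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the idea; not the actual BucketAssignFunction.
class PartitionChangeSketch {
    // Stand-in for the keyed ValueState: record key -> {partition, payload}
    // of the last record seen (the "whole record", not just its location).
    private final Map<String, String[]> state = new HashMap<>();

    // Returns the operations that would be emitted downstream for one input.
    List<String> process(String key, String partition, String payload) {
        List<String> out = new ArrayList<>();
        String[] stored = state.get(key);
        if (stored != null && !stored[0].equals(partition)) {
            // Partition changed: delete the record from the old partition ...
            out.add("DELETE " + key + " @ " + stored[0]);
            // ... and re-emit the stored record under the new partition path.
            out.add("UPSERT " + key + " @ " + partition + ": " + stored[1]);
        }
        out.add("UPSERT " + key + " @ " + partition + ": " + payload);
        // Only the latest incoming record is kept in state.
        state.put(key, new String[] {partition, payload});
        return out;
    }

    public static void main(String[] args) {
        PartitionChangeSketch sketch = new PartitionChangeSketch();
        System.out.println(sketch.process("k", "2022-01-01", "a=1"));
        System.out.println(sketch.process("k", "2022-01-02", "b=2"));
    }
}
```

   The second call emits three operations: the delete for the old partition, 
the stored record relocated to the new partition, and the incoming record.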
   
   It works in some cases, but not in all of them.
   
   **Working cases:**
   - The old record exists in the base file (bootstrap index enabled), and the 
incoming record's partition has changed. 
     - The new record built from the old record and the incoming record is 
merged before being written to the new partition file.
   - Only one incoming record arrives before another incoming record with a 
changed partition; these two records are merged into the new partition file.
     - Assume incoming records a(a=1, b=null, c=2022-01-01) and b(a=null, 
b=2, c=2022-01-02) arrive in the same commit. Record a is stored in ValueState 
and sunk to the 2022-01-01 partition file. When b arrives later, a delete 
record (c=2022-01-01) is output (it merges with the previous record, so no 
record is written to the 2022-01-01 file), and at the same time a new record 
a1(a=1, b=null, c=2022-01-02) is output as well. a1 and b are then merged and 
written to the 2022-01-02 partition file.
   
   **Non-working cases:**
   - More than one incoming record arrives before another incoming record 
with a changed partition; only the last record (before the partition change) 
is merged.
     - Assume incoming records a(a=1, b=null, c=null, d=2022-01-01), b(a=null, 
b=2, c=null, d=2022-01-01) and c(a=null, b=null, c=3, d=2022-01-02) arrive in 
the same commit.
       - Records a and b are first sunk downstream (StreamWriteFunction) with 
location 2022-01-01.
       - When record c arrives, a delete record derived from b is sunk 
downstream with location 2022-01-01.
       - A new record b1(a=null, b=2, c=null, d=2022-01-02), derived from b, 
is sunk downstream with location 2022-01-02.
       - Records b1 and c are merged and written to the 2022-01-02 partition, 
but the information from record a is lost, because the ValueState stores only 
one element.
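
   The failing sequence above can be reproduced with a small simulation. 
This is only a sketch under assumptions: `SingleValueStateLoss`, `Rec`, and 
the in-memory `buffers` map are hypothetical stand-ins for the keyed state and 
the write buffers, absent map entries model null columns, and the merge is 
assumed to be "non-null columns overwrite".

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical simulation; none of these are Hudi or Flink classes.
class SingleValueStateLoss {
    static final class Rec {
        final String key;
        final String partition;
        final boolean delete;
        final Map<String, Integer> fields; // absent entry = null column

        Rec(String key, String partition, boolean delete, Map<String, Integer> fields) {
            this.key = key;
            this.partition = partition;
            this.delete = delete;
            this.fields = fields;
        }
    }

    // Stand-in for the keyed ValueState: only the last incoming record per key.
    final Map<String, Rec> state = new HashMap<>();
    // Stand-in for the write buffers: partition -> key -> merged columns.
    final Map<String, Map<String, Map<String, Integer>>> buffers = new TreeMap<>();

    void process(Rec in) {
        Rec prev = state.get(in.key);
        if (prev != null && !prev.partition.equals(in.partition)) {
            write(new Rec(prev.key, prev.partition, true, Map.of()));   // delete from old partition
            write(new Rec(prev.key, in.partition, false, prev.fields)); // re-emit stored record
        }
        write(in);
        state.put(in.key, in); // overwrites whatever was stored before
    }

    void write(Rec r) {
        Map<String, Map<String, Integer>> partition =
                buffers.computeIfAbsent(r.partition, p -> new TreeMap<>());
        if (r.delete) {
            partition.remove(r.key);
        } else {
            // Partial update: incoming non-null columns overwrite buffered ones.
            partition.computeIfAbsent(r.key, k -> new TreeMap<>()).putAll(r.fields);
        }
    }

    static Map<String, Map<String, Map<String, Integer>>> run() {
        SingleValueStateLoss job = new SingleValueStateLoss();
        job.process(new Rec("k", "2022-01-01", false, Map.of("a", 1))); // record a
        job.process(new Rec("k", "2022-01-01", false, Map.of("b", 2))); // record b
        job.process(new Rec("k", "2022-01-02", false, Map.of("c", 3))); // record c
        return job.buffers;
    }

    public static void main(String[] args) {
        // prints {2022-01-01={}, 2022-01-02={k={b=2, c=3}}}
        System.out.println(run());
    }
}
```

   Column `a` is missing from the row in the 2022-01-02 partition, because 
the state held only record b when the partition changed.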
   
   So, to support partial update or overwrite-non-default (exists) 
capabilities in the current architecture, we might need to change ValueState 
to ListState (or a ValueState with a customized object) to store the most 
recent n records, or merge the records before storing them in state. Once the 
partition changes, rewrite these latest n records to the new partition and 
clear the state. All of the above logic should be controlled by a feature 
toggle via a configuration. (Note: n means that only the latest n records in 
the same commit are merged if the partition path changes.)
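
   A rough sketch of the "merge before storing in state" option, with the 
behaviour behind a toggle. All names are assumptions for illustration: 
`MergeBeforeStoreSketch` and the `mergeBeforeStore` flag are hypothetical (not 
an existing Hudi option), the model tracks a single record key, and absent map 
entries model null columns.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-key model; not Hudi or Flink code.
class MergeBeforeStoreSketch {
    // Hypothetical feature toggle.
    final boolean mergeBeforeStore;
    // Stand-in for the keyed state: partition and columns of the stored record.
    String statePartition;
    Map<String, Integer> stateFields = new TreeMap<>();
    // Stand-in for the write buffers: partition -> merged columns of the key.
    final Map<String, Map<String, Integer>> buffers = new TreeMap<>();

    MergeBeforeStoreSketch(boolean mergeBeforeStore) {
        this.mergeBeforeStore = mergeBeforeStore;
    }

    void process(String partition, Map<String, Integer> fields) {
        if (statePartition != null && !statePartition.equals(partition)) {
            buffers.remove(statePartition); // delete from the old partition
            write(partition, stateFields);  // re-emit the stored record
        }
        write(partition, fields);
        if (!mergeBeforeStore) {
            stateFields = new TreeMap<>();  // keep only the latest record
        }
        stateFields.putAll(fields);         // otherwise accumulate a merged record
        statePartition = partition;
    }

    // Partial update: incoming non-null columns overwrite buffered ones.
    void write(String partition, Map<String, Integer> fields) {
        buffers.computeIfAbsent(partition, p -> new TreeMap<>()).putAll(fields);
    }

    static Map<String, Map<String, Integer>> run(boolean mergeBeforeStore) {
        MergeBeforeStoreSketch job = new MergeBeforeStoreSketch(mergeBeforeStore);
        job.process("2022-01-01", Map.of("a", 1));
        job.process("2022-01-01", Map.of("b", 2));
        job.process("2022-01-02", Map.of("c", 3));
        return job.buffers;
    }

    public static void main(String[] args) {
        System.out.println(run(false)); // prints {2022-01-02={b=2, c=3}}
        System.out.println(run(true));  // prints {2022-01-02={a=1, b=2, c=3}}
    }
}
```

   With the toggle on, the state carries the merged columns of all earlier 
records, so nothing is lost when the partition changes. The trade-off is that 
the state then holds a full merged record per key rather than a location.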
   
   What are your thoughts?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

