stayrascal commented on issue #4030:
URL: https://github.com/apache/hudi/issues/4030#issuecomment-1025077842
Hi @danny0405, I tried the solution of changing the ValueState of
BucketAssignFunction to store the whole HoodieRecord instead of
HoodieRecordGlobalLocation. (Once the partition changes, output a delete record
to the old file, update the location of the old record with the new partition
path, and output a new record to the new file.)
It works in some cases, but not in all of them.
**Working cases:**
- The old record exists in the base file (bootstrap index enabled) and the
partition of the incoming record has changed.
  - The old record and the incoming record are merged into a new record before
  it is written to the new partition file.
- Only one incoming record arrives before another incoming record with a
changed partition; these two records are merged into the new partition file.
  - Assume incoming records a(a=1, b=null, c=2022-01-01) and b(a=null,
  b=2, c=2022-01-02) arrive in the same commit. Record a is stored in the
  ValueState and sunk to the 2022-01-01 partition file. When b arrives later, a
  delete record (c=2022-01-01) is output (it merges with the previous
  record, so no record is written to the 2022-01-01 file). At the same time, a
  new record a1(a=1, b=null, c=2022-01-02) is output as well. Records a1 and b
  are then merged and written to the 2022-01-02 partition file.
**Non-working cases:**
- More than one incoming record arrives before another incoming record
with a changed partition; only the last record (before the partition change)
is merged.
  - Assume incoming records a(a=1, b=null, c=null, d=2022-01-01), b(a=null,
  b=2, c=null, d=2022-01-01) and c(a=null, b=null, c=3, d=2022-01-02) arrive in
  the same commit.
    - Records a and b are first sunk downstream (StreamWriteFunction) with
    location 2022-01-01.
    - When record c arrives, a delete record derived from b is sunk downstream
    with location 2022-01-01.
    - A new record b1(a=null, b=2, c=null, d=2022-01-02), derived from b, is
    sunk downstream with location 2022-01-02.
    - Records b1 and c are merged and written to the 2022-01-02 partition, but
    the information from record a is lost, because the ValueState stores only
    one element.
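
To make the failing case concrete, here is a minimal, self-contained Java sketch of the a/b/c scenario above. It is only a simulation: `Rec`, `Event`, `process` and `merge` are hypothetical stand-ins, not Hudi or Flink classes, and a plain `HashMap` with one slot per key plays the role of the ValueState. It also assumes that the downstream writer merges records by letting later non-null values win and that a delete clears the row.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SingleSlotStateDemo {
    /** Hypothetical record: one key, three nullable value columns, one partition path. */
    static final class Rec {
        final String key; final Integer a, b, c; final String partition;
        Rec(String key, Integer a, Integer b, Integer c, String partition) {
            this.key = key; this.a = a; this.b = b; this.c = c; this.partition = partition;
        }
        Rec withPartition(String p) { return new Rec(key, a, b, c, p); }
    }

    /** Hypothetical event sent to the writer: an upsert or a delete of rec in rec.partition. */
    static final class Event {
        final boolean delete; final Rec rec;
        Event(boolean delete, Rec rec) { this.delete = delete; this.rec = rec; }
    }

    // One slot per key, mimicking a ValueState that holds the last whole record.
    private final Map<String, Rec> state = new HashMap<>();

    List<Event> process(Rec in) {
        List<Event> out = new ArrayList<>();
        Rec old = state.get(in.key);
        if (old != null && !old.partition.equals(in.partition)) {
            out.add(new Event(true, old));                              // delete from the old partition
            out.add(new Event(false, old.withPartition(in.partition))); // re-emit old record to the new one
        }
        out.add(new Event(false, in));
        state.put(in.key, in); // overwrites the slot, so earlier records are forgotten
        return out;
    }

    /** Replays the events of one partition: later non-null values win, a delete clears the row. */
    static Rec merge(List<Event> events, String partition) {
        Rec merged = null;
        for (Event e : events) {
            if (!e.rec.partition.equals(partition)) continue;
            if (e.delete) { merged = null; continue; }
            Rec r = e.rec;
            merged = merged == null ? r : new Rec(r.key,
                    r.a != null ? r.a : merged.a,
                    r.b != null ? r.b : merged.b,
                    r.c != null ? r.c : merged.c, partition);
        }
        return merged;
    }

    /** Runs the a/b/c scenario; returns the merged row of the old and the new partition. */
    static Rec[] run() {
        SingleSlotStateDemo f = new SingleSlotStateDemo();
        List<Event> events = new ArrayList<>();
        events.addAll(f.process(new Rec("k", 1, null, null, "2022-01-01")));
        events.addAll(f.process(new Rec("k", null, 2, null, "2022-01-01")));
        events.addAll(f.process(new Rec("k", null, null, 3, "2022-01-02")));
        return new Rec[] { merge(events, "2022-01-01"), merge(events, "2022-01-02") };
    }

    public static void main(String[] args) {
        Rec[] r = run();
        System.out.println("old partition row: " + r[0]); // prints "old partition row: null"
        // prints "new partition row: a=null, b=2, c=3"
        System.out.println("new partition row: a=" + r[1].a + ", b=" + r[1].b + ", c=" + r[1].c);
    }
}
```

In this simulation the old partition ends up empty, as intended, but the row in 2022-01-02 has `a=null`: the value `a=1` was only ever sent to the old partition and was no longer in the state when the partition changed.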
So, in order to support the partial update or overwrite non-default (exists)
capabilities in the current architecture, we might need to change the ValueState
to a ListState (or a ValueState with a customized object) that stores the most
recent n records. Once the partition changes, rewrite these latest n records to
the new partition and clear the state. All of the above logic should be
controlled by a feature toggle via a configuration. (Note: n means that only the
latest n records in the same commit are merged if the partition path changes.)
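
A rough sketch of the idea in plain Java, to show the intended behaviour rather than an actual implementation. Everything here is hypothetical: `Rec`, `Event` and the constructor argument `n` are illustrative names, a `Deque` per key stands in for the ListState, and the merge rule (later non-null values win, a delete clears the row) is an assumption about the downstream writer. Setting `n = 1` behaves like the single-element state, which is roughly what the feature toggle being off would mean.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RecentRecordsStateDemo {
    /** Hypothetical record: one key, three nullable value columns, one partition path. */
    static final class Rec {
        final String key; final Integer a, b, c; final String partition;
        Rec(String key, Integer a, Integer b, Integer c, String partition) {
            this.key = key; this.a = a; this.b = b; this.c = c; this.partition = partition;
        }
        Rec withPartition(String p) { return new Rec(key, a, b, c, p); }
    }

    /** Hypothetical event sent to the writer: an upsert or a delete of rec in rec.partition. */
    static final class Event {
        final boolean delete; final Rec rec;
        Event(boolean delete, Rec rec) { this.delete = delete; this.rec = rec; }
    }

    private final int n; // how many recent records to keep per key (must be >= 1)
    private final Map<String, Deque<Rec>> state = new HashMap<>(); // stand-in for a ListState

    RecentRecordsStateDemo(int n) { this.n = n; }

    List<Event> process(Rec in) {
        List<Event> out = new ArrayList<>();
        Deque<Rec> recent = state.computeIfAbsent(in.key, k -> new ArrayDeque<>());
        Rec last = recent.peekLast();
        if (last != null && !last.partition.equals(in.partition)) {
            out.add(new Event(true, last)); // delete the key from the old partition
            for (Rec r : recent) {          // rewrite the kept records, oldest first
                out.add(new Event(false, r.withPartition(in.partition)));
            }
            recent.clear();                 // clear the state after the move
        }
        out.add(new Event(false, in));
        recent.addLast(in);
        if (recent.size() > n) recent.removeFirst(); // keep only the latest n records
        return out;
    }

    /** Replays the events of one partition: later non-null values win, a delete clears the row. */
    static Rec merge(List<Event> events, String partition) {
        Rec merged = null;
        for (Event e : events) {
            if (!e.rec.partition.equals(partition)) continue;
            if (e.delete) { merged = null; continue; }
            Rec r = e.rec;
            merged = merged == null ? r : new Rec(r.key,
                    r.a != null ? r.a : merged.a,
                    r.b != null ? r.b : merged.b,
                    r.c != null ? r.c : merged.c, partition);
        }
        return merged;
    }

    /** Runs the a/b/c scenario with the given n; returns the merged row of the new partition. */
    static Rec run(int n) {
        RecentRecordsStateDemo f = new RecentRecordsStateDemo(n);
        List<Event> events = new ArrayList<>();
        events.addAll(f.process(new Rec("k", 1, null, null, "2022-01-01")));
        events.addAll(f.process(new Rec("k", null, 2, null, "2022-01-01")));
        events.addAll(f.process(new Rec("k", null, null, 3, "2022-01-02")));
        return merge(events, "2022-01-02");
    }

    public static void main(String[] args) {
        Rec one = run(1), two = run(2);
        System.out.println("n=1: a=" + one.a + ", b=" + one.b + ", c=" + one.c); // n=1: a=null, b=2, c=3
        System.out.println("n=2: a=" + two.a + ", b=" + two.b + ", c=" + two.c); // n=2: a=1, b=2, c=3
    }
}
```

With `n = 2` both a and b are rewritten to 2022-01-02 before c is merged, so no column is lost. The trade-off is larger keyed state, which is one reason to bound it by n and keep it behind a configuration flag.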
What are your thoughts?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.