Log compaction will compact log files into one log file. What I mean compacting 
partial log files is to merge partial log files into parquet, while keeping the 
left log files retained to next commits.


I will raise a RFC for this, then we can discuss there.








At 2023-12-04 10:29:49, "Danny Chan" <danny0...@apache.org> wrote:
>The general direction looks good, for functionality that only compact
>partial log files, does the existing log compaction match your needs?
>
>https://github.com/apache/hudi/blob/master/rfc/rfc-48/rfc-48.md
>
>Best,
>Danny
>
>孔维 <18701146...@163.com> 于2023年11月27日周一 23:43写道:
>
>> Background:
>>
>> 1. The data arrives roughly in event time order
>>
>> 2. When some users read the hudi table, they may not concern with the
>> immediate full data, but the full data before time T (eg. daily snapshot
>> data)
>>
>> 3. Reading the RT table will be more time-consuming than reading the COW
>> table (RO table) due to compaction
>>
>> 4. Currently, compact strategy will select ALL log files under the file
>> slice or not.
>>
>> 5. There is no compaction strategy based on event time. The only
>> DayBasedCompactionStrategy need the table partitioned by day in specified
>> format(yyyy/mm/dd)
>>
>>
>> Based on this, I plan to launch a compaction strategy based on event time:
>> *EventTimeBasedCompactionStrategy*.
>>
>>
>> This strategy can
>>
>> 1. Expand use-case of RO table: assign the event time attribute to the RO
>> table without date partition. Given event time T, the log files before T
>> can be compacted, then resulting RO table obtains all data before T,
>> reducing query latency.
>>
>> 2. Exact event time data use-case: in some cases, user want their data to
>> be partitioned by date. Based on the strategy, we can achieve this
>> application scenario。
>>
>>
>>
>> To implement the strategy, we need to
>>
>>
>> 1. support merge some log files in a file slice.
>>
>> Currently, all compaction strategies only support select all log files or
>> select no log file in the orderAndFilter method.
>>
>> To support event time based strategy, there will be log files left after
>> the compaction. This need hudi to support the new feature of merging some
>> log files not all the log files in one file slice.
>>
>> For the left log files, we have to make them visible to the timeline,
>> which can be achieved by creating symlinks for those log files or just copy
>> log files to new instant time.
>>
>> A simple diagram is shown below.
>>
>>
>> 2. write min event time property to the header of log block
>>
>> when append log, we add the min event time property to the log block, then
>> we can qeury out the min event time without deserializing the log data.
>>
>>
>> 3. design the EventTimeBasedCompactionStrategy
>>
>> the strategy can select all the log files needed to compact before event
>> time T
>>
>>
>> 4. sync min event time to RO table property
>>
>> with this, user can know the freshness of RO table.
>>
>>
>> In my company, this scenario is common, which can reduce read latency
>> while meeting user requirements for data time.
>>
>> I would like to ask if I can propose an RFC for this feature. I think this
>> feature would be useful for the community as well.
>>
>>
>>
>>

Reply via email to