The general direction looks good, for functionality that only compact
partial log files, does the existing log compaction match your needs?

https://github.com/apache/hudi/blob/master/rfc/rfc-48/rfc-48.md

Best,
Danny

孔维 <18701146...@163.com> 于2023年11月27日周一 23:43写道:

> Background:
>
> 1. The data arrives roughly in event time order
>
> 2. When some users read the hudi table, they may not concern with the
> immediate full data, but the full data before time T (eg. daily snapshot
> data)
>
> 3. Reading the RT table will be more time-consuming than reading the COW
> table (RO table) due to compaction
>
> 4. Currently, compact strategy will select ALL log files under the file
> slice or not.
>
> 5. There is no compaction strategy based on event time. The only
> DayBasedCompactionStrategy need the table partitioned by day in specified
> format(yyyy/mm/dd)
>
>
> Based on this, I plan to launch a compaction strategy based on event time:
> *EventTimeBasedCompactionStrategy*.
>
>
> This strategy can
>
> 1. Expand use-case of RO table: assign the event time attribute to the RO
> table without date partition. Given event time T, the log files before T
> can be compacted, then resulting RO table obtains all data before T,
> reducing query latency.
>
> 2. Exact event time data use-case: in some cases, user want their data to
> be partitioned by date. Based on the strategy, we can achieve this
> application scenario。
>
>
>
> To implement the strategy, we need to
>
>
> 1. support merge some log files in a file slice.
>
> Currently, all compaction strategies only support select all log files or
> select no log file in the orderAndFilter method.
>
> To support event time based strategy, there will be log files left after
> the compaction. This need hudi to support the new feature of merging some
> log files not all the log files in one file slice.
>
> For the left log files, we have to make them visible to the timeline,
> which can be achieved by creating symlinks for those log files or just copy
> log files to new instant time.
>
> A simple diagram is shown below.
>
>
> 2. write min event time property to the header of log block
>
> when append log, we add the min event time property to the log block, then
> we can qeury out the min event time without deserializing the log data.
>
>
> 3. design the EventTimeBasedCompactionStrategy
>
> the strategy can select all the log files needed to compact before event
> time T
>
>
> 4. sync min event time to RO table property
>
> with this, user can know the freshness of RO table.
>
>
> In my company, this scenario is common, which can reduce read latency
> while meeting user requirements for data time.
>
> I would like to ask if I can propose an RFC for this feature. I think this
> feature would be useful for the community as well.
>
>
>
>

Reply via email to