The general direction looks good, for functionality that only compact partial log files, does the existing log compaction match your needs?
https://github.com/apache/hudi/blob/master/rfc/rfc-48/rfc-48.md Best, Danny 孔维 <18701146...@163.com> 于2023年11月27日周一 23:43写道: > Background: > > 1. The data arrives roughly in event time order > > 2. When some users read the hudi table, they may not concern with the > immediate full data, but the full data before time T (eg. daily snapshot > data) > > 3. Reading the RT table will be more time-consuming than reading the COW > table (RO table) due to compaction > > 4. Currently, compact strategy will select ALL log files under the file > slice or not. > > 5. There is no compaction strategy based on event time. The only > DayBasedCompactionStrategy need the table partitioned by day in specified > format(yyyy/mm/dd) > > > Based on this, I plan to launch a compaction strategy based on event time: > *EventTimeBasedCompactionStrategy*. > > > This strategy can > > 1. Expand use-case of RO table: assign the event time attribute to the RO > table without date partition. Given event time T, the log files before T > can be compacted, then resulting RO table obtains all data before T, > reducing query latency. > > 2. Exact event time data use-case: in some cases, user want their data to > be partitioned by date. Based on the strategy, we can achieve this > application scenario。 > > > > To implement the strategy, we need to > > > 1. support merge some log files in a file slice. > > Currently, all compaction strategies only support select all log files or > select no log file in the orderAndFilter method. > > To support event time based strategy, there will be log files left after > the compaction. This need hudi to support the new feature of merging some > log files not all the log files in one file slice. > > For the left log files, we have to make them visible to the timeline, > which can be achieved by creating symlinks for those log files or just copy > log files to new instant time. > > A simple diagram is shown below. > > > 2. write min event time property to the header of log block > > when append log, we add the min event time property to the log block, then > we can qeury out the min event time without deserializing the log data. > > > 3. design the EventTimeBasedCompactionStrategy > > the strategy can select all the log files needed to compact before event > time T > > > 4. sync min event time to RO table property > > with this, user can know the freshness of RO table. > > > In my company, this scenario is common, which can reduce read latency > while meeting user requirements for data time. > > I would like to ask if I can propose an RFC for this feature. I think this > feature would be useful for the community as well. > > > >