Log compaction compacts log files into a single log file. What I mean by compacting partial log files is merging a subset of the log files into parquet, while retaining the remaining log files for subsequent commits.
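To make the distinction concrete, here is a toy sketch in Python. It is not Hudi code: the file contents are modelled as plain dicts, all function names are hypothetical, and it assumes the merged files are the oldest ones in commit order, so that replaying the retained files on top of the new base yields the same snapshot.

```python
from typing import Dict, List, Tuple

# Toy stand-in for file contents: record key -> latest value.
Records = Dict[str, str]


def apply_logs(base: Records, logs: List[Records]) -> Records:
    """Snapshot read: replay log files over the base file in commit order."""
    merged = dict(base)
    for log in logs:
        merged.update(log)
    return merged


def log_compaction(base: Records, logs: List[Records]) -> Tuple[Records, List[Records]]:
    """Collapse all log files into one log file; the base file is untouched."""
    return base, [apply_logs({}, logs)]


def partial_compaction(
    base: Records, logs: List[Records], n: int
) -> Tuple[Records, List[Records]]:
    """Merge the n oldest log files into a new base; retain the rest as logs."""
    return apply_logs(base, logs[:n]), logs[n:]
```

In both cases a snapshot read of the file slice returns the same records; the difference is that after partial compaction the base file alone already contains everything from the merged log files.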
I will raise an RFC for this, and then we can discuss it there.

At 2023-12-04 10:29:49, "Danny Chan" <danny0...@apache.org> wrote:
>The general direction looks good. For functionality that only compacts
>partial log files, does the existing log compaction match your needs?
>
>https://github.com/apache/hudi/blob/master/rfc/rfc-48/rfc-48.md
>
>Best,
>Danny
>
>孔维 <18701146...@163.com> wrote on Mon, Nov 27, 2023 at 23:43:
>
>> Background:
>>
>> 1. The data arrives roughly in event time order.
>>
>> 2. When reading the Hudi table, some users do not need the latest
>> complete data, only the complete data before time T (e.g. daily
>> snapshot data).
>>
>> 3. Reading the RT table is more time-consuming than reading the COW
>> table (RO table) due to compaction.
>>
>> 4. Currently, a compaction strategy selects either ALL log files under
>> a file slice or none of them.
>>
>> 5. There is no compaction strategy based on event time. The only
>> related one, DayBasedCompactionStrategy, requires the table to be
>> partitioned by day in a specified format (yyyy/mm/dd).
>>
>>
>> Based on this, I plan to introduce a compaction strategy based on
>> event time: *EventTimeBasedCompactionStrategy*.
>>
>>
>> This strategy can
>>
>> 1. Expand the use cases of the RO table: assign an event time attribute
>> to the RO table without date partitioning. Given event time T, the log
>> files before T can be compacted, so the resulting RO table contains all
>> data before T, reducing query latency.
>>
>> 2. Support exact event time data use cases: in some cases, users want
>> their data to be partitioned by date. With this strategy, we can
>> support that scenario.
>>
>>
>> To implement the strategy, we need to
>>
>>
>> 1. Support merging only some of the log files in a file slice.
>>
>> Currently, all compaction strategies can only select all log files or
>> no log files in the orderAndFilter method.
>>
>> To support an event time based strategy, there will be log files left
>> after the compaction. This requires Hudi to support a new feature:
>> merging some, but not all, of the log files in one file slice.
>>
>> The remaining log files have to be made visible to the timeline, which
>> can be achieved by creating symlinks for those log files or by simply
>> copying them to the new instant time.
>>
>> A simple diagram is shown below.
>>
>>
>> 2. Write a min event time property to the header of the log block.
>>
>> When appending to the log, we add the min event time property to the
>> log block, so that we can query the min event time without
>> deserializing the log data.
>>
>>
>> 3. Design the EventTimeBasedCompactionStrategy.
>>
>> The strategy selects all the log files that need to be compacted
>> before event time T.
>>
>>
>> 4. Sync the min event time to an RO table property.
>>
>> With this, users can know the freshness of the RO table.
>>
>>
>> This scenario is common in my company; the feature would reduce read
>> latency while meeting users' requirements on data time.
>>
>> I would like to ask if I can propose an RFC for this feature. I think
>> this feature would be useful for the community as well.
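For reference, steps 2 to 4 of the quoted proposal could be sketched roughly as below. This is only an illustration in Python, not the Hudi API: `LogFile`, `min_event_time`, `select_before` and `ro_table_complete_until` are hypothetical names, event times are plain integers (e.g. epoch millis), and reading "freshness" as the smallest min event time among the retained log files is my assumption.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class LogFile:
    name: str
    # Hypothetical header property: smallest event time in the file,
    # readable without deserializing the log data.
    min_event_time: int


def select_before(
    logs: List[LogFile], cutoff: int
) -> Tuple[List[LogFile], List[LogFile]]:
    """Split log files into (selected for compaction, retained).

    A file whose min event time is before the cutoff holds at least some
    data before T, so it must be compacted for the RO table to contain
    all data before T.
    """
    selected = [f for f in logs if f.min_event_time < cutoff]
    retained = [f for f in logs if f.min_event_time >= cutoff]
    return selected, retained


def ro_table_complete_until(retained: List[LogFile]) -> Optional[int]:
    """Assumed freshness value to sync to the RO table property.

    Everything before the smallest min event time of the retained log
    files is already in the base files; None means nothing was retained.
    """
    if not retained:
        return None
    return min(f.min_event_time for f in retained)
```

With a cutoff T, `select_before(logs, T)` gives the files to merge into parquet and the files to carry over to the new instant, and `ro_table_complete_until(retained)` gives a value of at least T that could be published as the table property.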