Hello, everyone!

@Surya, first of all, I wanted to say that I think this is a great proposal!

> A new compaction strategy can be added, but we thought it might
> complicate the existing logic and need to rely on some hacks, especially
> since Compaction action writes to a base file and places a .commit file
> upon completion. Whereas, in our use case we are not concerned with the
> base file at all, instead we are merging blocks and writing back to the log
> file. So, we thought it is better to use a new action(called
> LogCompaction), which works at a log file level and writes back to the log
> file. Since log files are in general added by deltacommit, upon completion
> LogCompaction can place a .deltacommit.


Did I understand your proposal correctly, that you're suggesting the Minor
compaction, unlike the Major one, be performed as part of a Delta Commit?
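
Just to check my understanding, here is a minimal sketch of how I picture
the two paths; the class names are purely illustrative, not the actual
Hudi APIs:

    // Illustrative sketch only -- not the actual Hudi classes.
    // Major compaction: merges the base file with its log files into a new
    // base file version and completes with a .commit instant.
    // Minor compaction (LogCompaction): merges only the log blocks into one
    // new log block and completes with a .deltacommit instant.
    interface FileGroupAction {
      String completedInstantExtension();
    }

    class MajorCompaction implements FileGroupAction {
      // Reads baseFile + logFiles and writes a new base file version.
      public String completedInstantExtension() { return ".commit"; }
    }

    class LogCompaction implements FileGroupAction {
      // Reads only logFiles and appends one merged log block to the log file.
      public String completedInstantExtension() { return ".deltacommit"; }
    }
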

I think Sagar's question is very valid from the perspective of configuring
and balancing major/minor compactions. Nevertheless, we can cover and
discuss that as part of the RFC process.

> LogCompaction is not a replacement for regular compaction. LogCompaction
> is performed as a minor compaction so as to reduce the no. of log blocks to
> consider. It does not consider base files while merging the log blocks. To
> merge log files with base file Compaction action is still needed. By using
> LogCompaction action frequently, the frequency with which we do full scale
> compaction is reduced.
> Consider a scenario in which, after 'X' no. of LogCompaction actions, for
> some file groups the log files size becomes comparable to that of base file
> size, in this scenario LogCompaction action is going to take close to the
> same amount of time as compaction action. So, now instead of LogCompaction,
> full scale Compaction needs to be performed on those file groups. In future
> we can also introduce logic to determine what is the right
> action(Compaction or LogCompaction) to be performed depending on the state
> of the file group.
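
That scenario also hints at the kind of heuristic we could eventually
codify per file group. A rough sketch, with invented names and threshold
(nothing like this exists in Hudi today):

    // Hypothetical heuristic: once the accumulated log data is comparable in
    // size to the base file, a LogCompaction would cost roughly as much as a
    // full Compaction, so prefer the full one for that file group.
    class CompactionPlanner {
      enum Kind { COMPACTION, LOG_COMPACTION }

      static Kind chooseAction(long baseFileBytes, long totalLogBytes,
                               double sizeRatioThreshold) {
        if (baseFileBytes == 0
            || (double) totalLogBytes / baseFileBytes >= sizeRatioThreshold) {
          return Kind.COMPACTION;      // full merge into a new base file
        }
        return Kind.LOG_COMPACTION;    // just merge log blocks into one new block
      }
    }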


To be able to reason about the strategy for both major and minor
compactions, we need to clearly define the criteria we're optimizing for
(is it the number of RPCs to HDFS/Object Storage, I/O, etc.). That would
also help us objectively measure the performance improvement from
introducing minor compaction.
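For instance, if read amplification is the primary target, per-file-group
counters along these lines would make the comparison concrete (hypothetical
names, not existing Hudi metrics):

    // Hypothetical per-file-group read statistics, just to make the
    // comparison between the two strategies measurable.
    class FileGroupReadStats {
      long fileHandlesOpened;   // opens/RPCs against HDFS or object storage
      long logBlocksScanned;    // blocks that had to be merged at read time
      long bytesRead;           // raw I/O
    }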

On Mon, Mar 21, 2022 at 8:26 AM Vinoth Chandar <vin...@apache.org> wrote:

> +1 overall
>
> On Sat, Mar 19, 2022 at 5:02 PM Surya Prasanna <prasannakumar...@gmail.com>
> wrote:
>
> > Hi Sagar,
> > Sorry for the delay in response. Thanks for the questions.
> >
> > 1. Trying to understand the main goal. Is it to balance the tradeoff
> > between read and write amplification for metadata table? Or is it purely
> > to optimize for reads?
> > > On large tables, write amplification is a side effect of frequent
> > compactions. So, instead of increasing the frequency of full compaction,
> > we are proposing minor compaction(LogCompaction) to be done frequently to
> > merge only the log blocks and write a new log block. By merging the blocks,
> > there are less no. of blocks to deal with during read, that way we are
> > optimizing for read performance and potentially avoiding the write
> > amplification problem.
> >
> > 2. Why do we need a separate action? Why can't any of the existing
> > compaction strategies (or a new one if needed) help to achieve this?
> > > A new compaction strategy can be added, but we thought it might
> > complicate the existing logic and need to rely on some hacks, especially
> > since Compaction action writes to a base file and places a .commit file
> > upon completion. Whereas, in our use case we are not concerned with the
> > base file at all, instead we are merging blocks and writing back to the log
> > file. So, we thought it is better to use a new action(called
> > LogCompaction), which works at a log file level and writes back to the log
> > file. Since log files are in general added by deltacommit, upon completion
> > LogCompaction can place a .deltacommit.
> >
> > 3. Is the proposed LogCompaction a replacement for regular compaction for
> > metadata table i.e. if LogCompaction is enabled then compaction cannot be
> > done?
> > > LogCompaction is not a replacement for regular compaction. LogCompaction
> > is performed as a minor compaction so as to reduce the no. of log blocks to
> > consider. It does not consider base files while merging the log blocks. To
> > merge log files with base file Compaction action is still needed. By using
> > LogCompaction action frequently, the frequency with which we do full scale
> > compaction is reduced.
> > Consider a scenario in which, after 'X' no. of LogCompaction actions, for
> > some file groups the log files size becomes comparable to that of base file
> > size, in this scenario LogCompaction action is going to take close to the
> > same amount of time as compaction action. So, now instead of LogCompaction,
> > full scale Compaction needs to be performed on those file groups. In future
> > we can also introduce logic to determine what is the right
> > action(Compaction or LogCompaction) to be performed depending on the state
> > of the file group.
> >
> > Thanks,
> > Surya
> >
> >
> > On Fri, Mar 18, 2022 at 11:22 PM Surya Prasanna Yalla <sya...@uber.com>
> > wrote:
> >
> > >
> > >
> > > ---------- Forwarded message ---------
> > > From: sagar sumit <sagarsumi...@gmail.com>
> > > Date: Wed, Mar 16, 2022 at 11:17 PM
> > > Subject: Re: [DISCUSS] New RFC to create LogCompaction action for MOR
> > > tables?
> > > To: <dev@hudi.apache.org>
> > >
> > >
> > > Hi Surya,
> > >
> > > This is a very interesting idea! I'll be looking forward to RFC.
> > >
> > > I have a few high-level questions:
> > >
> > > 1. Trying to understand the main goal. Is it to balance the tradeoff
> > > between read and write amplification for metadata table? Or is it purely
> > > to optimize for reads?
> > > 2. Why do we need a separate action? Why can't any of the existing
> > > compaction strategies (or a new one if needed) help to achieve this?
> > > 3. Is the proposed LogCompaction a replacement for regular compaction for
> > > metadata table i.e. if LogCompaction is enabled then compaction cannot be
> > > done?
> > >
> > > Regards,
> > > Sagar
> > >
> > > On Thu, Mar 17, 2022 at 12:51 AM Surya Prasanna <prasannakumar...@gmail.com>
> > > wrote:
> > >
> > > > Hi Team,
> > > >
> > > >
> > > > Record level index uses a metadata table which is a MOR table type.
> > > >
> > > > Each delta commit in the metadata table creates multiple hfile log blocks
> > > > and so to read them multiple file handles have to be opened which might
> > > > cause issues in read performance. To improve the read performance,
> > > > compaction can be run frequently which basically merges all the log blocks
> > > > to base file and creates another version of base file. If this is done
> > > > frequently, it would cause write amplification.
> > > >
> > > > Instead of merging all the log blocks to base file and doing a full
> > > > compaction, minor compaction can be done which basically merges log blocks
> > > > and creates one new log block.
> > > >
> > > > This can be achieved by adding a new action to Hudi called LogCompaction
> > > > and requires an RFC. Please let me know what you think.
> > > >
> > > >
> > > > Thanks,
> > > >
> > > > Surya
> > > >
> > >
> >
>
