Hi Rajeshbabu,

I think that compaction management and execution are important areas for
experimentation and growth of HBase. I’m more interested in the harness and
APIs that make an implementation possible than in any specific
implementation. I’d also like to see consideration for a cluster-wide
compaction scheduler, something to prioritize allocation of precious IO
resources.
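
To make that concrete, here is a minimal sketch of the kind of pluggable
API I have in mind; every name in it is hypothetical, nothing here exists
in HBase today:

  import java.util.List;
  import java.util.concurrent.CompletableFuture;

  // Hypothetical only: a harness API decoupled from any particular runtime.
  public interface CompactionRuntime {

    // What the harness hands to a runtime: which region/family to compact
    // and which store files were selected. Fields are illustrative.
    record CompactionRequest(String regionName, String family,
        List<String> storeFiles) {}

    // Submit the request to whatever backs this runtime: in-process
    // threads, an external service, a batch framework, and so on.
    CompletableFuture<Void> submit(CompactionRequest request);
  }

  // A cluster-wide scheduler would sit in front of the runtimes and ration
  // IO: given a budget, decide which pending requests may proceed now.
  interface CompactionScheduler {
    List<CompactionRuntime.CompactionRequest> admit(
        List<CompactionRuntime.CompactionRequest> pending,
        long ioBytesPerSecondBudget);
  }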

I agree with Andrew that MapReduce is unlikely to be a popular compute
runtime for externalizing this feature, but I also have no statistics
about which runtimes are commonly available.

I look forward to seeing how your design proposal develops.

Thanks,
Nick

On Thu, Mar 2, 2023 at 02:46 Andrew Purtell <apurt...@apache.org> wrote:

> Hi Rajeshbabu,
>
> You have proposed a solution without describing the problem. Please do that
> first.
>
> That said, compaction is fundamental to HBase operation and should have no
> external dependency on a particular compute framework, especially
> MapReduce, which is out of favor and deprecated in many places. If this is
> an optional feature it could be fine. So perhaps you could also explain how
> you see this potential feature fitting into the long-term roadmap for the
> project.
>
>
>
> On Wed, Mar 1, 2023 at 3:54 PM rajeshb...@apache.org <
> chrajeshbab...@gmail.com> wrote:
>
> > Hi Team,
> >
> > I would like to discuss a new compactor implementation that runs major
> > compactions through a MapReduce job (MapReduce jobs are a natural fit for
> > merge-sort workloads).
> >
> > I have a high-level plan and would like to check with you before
> > proceeding to detailed design and implementation, to learn about any
> > challenges or similar solutions you are aware of.
> >
> > High level plan:
> >
> > We would add a new compactor implementation that creates a MapReduce job
> > for the major compaction and waits in a thread for the job to complete.
> > The MapReduce job would work as follows (a rough driver sketch follows
> > the list):
> > 1) Since a major compaction needs to read through all the files in a
> > column family, we can pass the column family directory to the MapReduce
> > job. File filters might be required so that newly created hfiles are not
> > picked up.
> > 2) We can derive the partitions (input splits) from hfile boundaries and
> > use the existing HFileInputFormat to scan each hfile partition, so that
> > each mapper sorts the data within its partition range.
> > 3) If possible, we can use a combiner to remove old versions and deleted
> > cells.
> > 4) We can use HFileOutputFormat2 to create a new HFile in a tmp
> > directory, with the reducer reading the sorted data from the mappers and
> > writing the cells to it.
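> >
> > A rough driver sketch for steps 1-4 is below. It assumes the existing
> > HFileInputFormat, HFileOutputFormat2 and CellSortReducer classes from the
> > hbase-mapreduce module can be reused for this purpose; the mapper and the
> > serialization/partitioner setup are only hinted at and would need to be
> > worked out in the detailed design.
> >
> >   import org.apache.hadoop.conf.Configuration;
> >   import org.apache.hadoop.fs.Path;
> >   import org.apache.hadoop.hbase.HBaseConfiguration;
> >   import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
> >   import org.apache.hadoop.hbase.mapreduce.CellSortReducer;
> >   import org.apache.hadoop.hbase.mapreduce.HFileInputFormat;
> >   import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
> >   import org.apache.hadoop.mapreduce.Job;
> >   import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
> >   import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
> >
> >   // Sketch only: wiring a major-compaction MapReduce job. HFileInputFormat
> >   // is an internal class today, so whether it can be reused directly or
> >   // needs a copy/extension has to be verified.
> >   public class MapReduceCompactionDriver {
> >
> >     public static boolean run(String table, String family,
> >         Path familyDir, Path tmpOutputDir) throws Exception {
> >       Configuration conf = HBaseConfiguration.create();
> >       Job job = Job.getInstance(conf,
> >           "major-compaction-" + table + "-" + family);
> >       job.setJarByClass(MapReduceCompactionDriver.class);
> >
> >       // Step 1: the column family directory is the input; a path filter
> >       // would still be needed to skip hfiles flushed after file selection.
> >       FileInputFormat.addInputPath(job, familyDir);
> >       job.setInputFormatClass(HFileInputFormat.class);
> >
> >       // Steps 2-3: a custom mapper (not shown) would re-emit cells keyed
> >       // by row; a combiner could drop expired versions and delete markers.
> >       // Cell serialization and partitioning would need setup similar to
> >       // what HFileOutputFormat2.configureIncrementalLoad does for bulk
> >       // loads.
> >       job.setMapOutputKeyClass(ImmutableBytesWritable.class);
> >
> >       // Step 4: sort cells per key and write the new hfile(s) under the
> >       // tmp output directory.
> >       job.setReducerClass(CellSortReducer.class);
> >       job.setOutputFormatClass(HFileOutputFormat2.class);
> >       FileOutputFormat.setOutputPath(job, tmpOutputDir);
> >
> >       return job.waitForCompletion(true);
> >     }
> >   }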
> >
> > Once the MapReduce job completes and the new hfile is in the tmp
> > directory, we can move the compacted file into the column family
> > location, move the old files out, and refresh the store files, the same
> > as in the default implementation (see the sketch below).
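> >
> > A minimal sketch of the compactor side, with all names hypothetical: run
> > the job from a background thread and, on success, perform the same commit
> > sequence as the default compactor.
> >
> >   import java.util.concurrent.CompletableFuture;
> >   import org.apache.hadoop.fs.Path;
> >
> >   // Hypothetical wrapper: submit the MapReduce job asynchronously, then
> >   // commit the result exactly as the default compactor would.
> >   public class MapReduceCompactor {
> >
> >     public CompletableFuture<Void> compact(String table, String family,
> >         Path familyDir, Path tmpOutputDir) {
> >       return CompletableFuture.runAsync(() -> {
> >         boolean ok;
> >         try {
> >           // Blocks this worker thread until the MapReduce job finishes.
> >           ok = MapReduceCompactionDriver.run(table, family,
> >               familyDir, tmpOutputDir);
> >         } catch (Exception e) {
> >           throw new RuntimeException("compaction job failed", e);
> >         }
> >         if (ok) {
> >           commit(familyDir, tmpOutputDir);
> >         }
> >       });
> >     }
> >
> >     private void commit(Path familyDir, Path tmpOutputDir) {
> >       // Same sequence as the default implementation: move the compacted
> >       // hfile from tmpOutputDir into familyDir, archive the replaced
> >       // hfiles, and refresh the store's file list. Left unimplemented
> >       // in this sketch.
> >     }
> >   }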
> >
> > There is a trade-off with this solution: the MapReduce job requires
> > intermediate copies of the data, even though the hfiles are already
> > sorted.
> >
> > Thanks,
> > Rajeshbabu.
> >
>
>
> --
> Best regards,
> Andrew
>
> Unrest, ignorance distilled, nihilistic imbeciles -
>     It's what we’ve earned
> Welcome, apocalypse, what’s taken you so long?
> Bring us the fitting end that we’ve been counting on
>    - A23, Welcome, Apocalypse
>
