Hi Rajeshbabu,

You have proposed a solution without describing the problem. Please do that first.
That said, compaction is fundamental to HBase operation and should have no external dependency on a particular compute framework, especially MapReduce, which is out of favor and deprecated in many places. If this is an optional feature, it could be fine. So perhaps you could also explain how you see this potential feature fitting into the long-term roadmap for the project.

On Wed, Mar 1, 2023 at 3:54 PM rajeshb...@apache.org <chrajeshbab...@gmail.com> wrote:

> Hi Team,
>
> I would like to discuss a new compactor implementation that runs major
> compactions through a MapReduce job (MapReduce being a good fit for
> merge-sort applications).
>
> I have a high-level plan and would like to check with you before proceeding
> with a detailed design and implementation, to learn of any challenges or
> similar solutions you are aware of.
>
> High-level plan:
>
> We would add a new compactor implementation that creates a MapReduce job
> to run the major compaction and waits in a thread for the job to complete.
> The MapReduce job would work as follows:
> 1) Since a major compaction must read all the files in a column family,
> we can pass the column family directory to the MapReduce job.
> If needed, file filters can exclude newly created HFiles.
> 2) We can derive the partitions (input splits) from HFile boundaries and
> use the existing HFileInputFormat to scan each HFile partition,
> so that each mapper sorts data within its partition range.
> 3) If possible, we can use a combiner to remove old versions and deleted cells.
> 4) We can use HFileOutputFormat2 to create a new HFile in a tmp directory,
> with the reducer writing cells read from the mappers' sorted output.
>
> Once the HFile is created in the tmp directory and the MapReduce job has
> completed, we can move the compacted file into the column family location,
> move the old files out, and refresh the HFiles, the same as in the default
> implementation.
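[Inline note: the core of steps 2-4 above is a k-way merge of already-sorted cell streams, with old versions and deleted cells dropped along the way. A minimal standalone sketch of that merge logic follows; the `Cell` record, `compact()` method, and the "one delete marker masks all older cells in the row" rule are simplified illustrations, not the HBase API or its full delete semantics.]

```java
import java.util.*;

/**
 * Standalone sketch of the merge at the heart of the proposed MapReduce
 * compaction: k-way merge several sorted "HFile" streams, keep at most
 * maxVersions cells per row, and drop cells masked by a delete marker.
 * All names here are illustrative, not HBase classes.
 */
public class CompactionMergeSketch {

    /** Minimal stand-in for an HBase cell: row key, timestamp, delete flag. */
    record Cell(String row, long ts, boolean delete) {}

    /** Merge already-sorted inputs, applying version limit and delete masking. */
    static List<Cell> compact(List<List<Cell>> sortedFiles, int maxVersions) {
        // Sort by row ascending, then timestamp descending (newest first),
        // mirroring the order cells are stored in HFiles.
        Comparator<Cell> order = Comparator.comparing(Cell::row)
                .thenComparing(Comparator.comparingLong(Cell::ts).reversed());

        // k-way merge via a priority queue of {file index, position} pairs.
        PriorityQueue<int[]> heap = new PriorityQueue<>((a, b) ->
                order.compare(sortedFiles.get(a[0]).get(a[1]),
                              sortedFiles.get(b[0]).get(b[1])));
        for (int f = 0; f < sortedFiles.size(); f++) {
            if (!sortedFiles.get(f).isEmpty()) heap.add(new int[]{f, 0});
        }

        List<Cell> out = new ArrayList<>();
        String currentRow = null;
        int versionsSeen = 0;
        boolean rowDeleted = false;
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            Cell c = sortedFiles.get(top[0]).get(top[1]);
            if (top[1] + 1 < sortedFiles.get(top[0]).size()) {
                heap.add(new int[]{top[0], top[1] + 1});  // advance that file
            }
            if (!c.row().equals(currentRow)) {   // new row: reset per-row state
                currentRow = c.row();
                versionsSeen = 0;
                rowDeleted = false;
            }
            if (c.delete()) {                    // marker masks older cells
                rowDeleted = true;
                continue;
            }
            if (!rowDeleted && versionsSeen < maxVersions) {
                out.add(c);
                versionsSeen++;
            }
        }
        return out;
    }
}
```

In the actual job this logic would be split across the shuffle sort, an optional combiner (step 3), and the reducer (step 4), rather than running in one loop.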
> There is a tradeoff with this solution: intermediate copies of the data
> are required while running the MapReduce job, even though the HFiles
> already contain sorted data.
>
> Thanks,
> Rajeshbabu.

--
Best regards,
Andrew

Unrest, ignorance distilled, nihilistic imbeciles -
It's what we’ve earned
Welcome, apocalypse, what’s taken you so long?
Bring us the fitting end that we’ve been counting on
   - A23, Welcome, Apocalypse