Hi Rajeshbabu,

You have proposed a solution without describing the problem. Please do that first.
That said, compaction is fundamental to HBase operation and should have no external dependency on a particular compute framework, especially MapReduce, which is out of favor and deprecated in many places. If this is an optional feature, it could be fine. So perhaps you could also explain how you see this potential feature fitting into the long-term roadmap for the project.

On Wed, Mar 1, 2023 at 3:54 PM rajeshb...@apache.org <chrajeshbab...@gmail.com> wrote:

> Hi Team,
>
> I would like to discuss a new compactor implementation that runs major
> compactions through a MapReduce job (MapReduce being a good fit for
> merge-sort applications).
>
> I have a high-level plan and would like to check with you before proceeding
> with a detailed design and implementation, to learn of any challenges or
> similar solutions you are aware of.
>
> High-level plan:
>
> We would add a new compactor implementation that creates a MapReduce job
> to run the major compaction and waits in a thread for the job to complete.
> The MapReduce job would work as follows:
> 1) Since a major compaction must read all the files in a column family,
> we can pass the column family directory to the MapReduce job.
> If needed, file filters can exclude newly created HFiles.
> 2) We can derive the partitions (input splits) from HFile boundaries and
> use the existing HFileInputFormat to scan each HFile partition,
> so that each mapper sorts data within its partition range.
> 3) If possible, we can use a combiner to remove old versions and deleted cells.
> 4) We can use HFileOutputFormat2 to create a new HFile in a tmp directory,
> with the reducer writing cells read from the mappers' sorted output.
>
> Once the HFile is created in the tmp directory and the MapReduce job has
> completed, we can move the compacted file into the column family location,
> move the old files out, and refresh the HFiles, the same as in the default
> implementation.
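[Inline note: the core of steps 2-4 above is a k-way merge of already-sorted cell streams, with old versions and deleted cells dropped along the way. A minimal standalone sketch of that merge logic follows; the `Cell` record, `compact()` method, and the "one delete marker masks all older cells in the row" rule are simplified illustrations, not the HBase API or its full delete semantics.]

```java
import java.util.*;

/**
 * Standalone sketch of the merge at the heart of the proposed MapReduce
 * compaction: k-way merge several sorted "HFile" streams, keep at most
 * maxVersions cells per row, and drop cells masked by a delete marker.
 * All names here are illustrative, not HBase classes.
 */
public class CompactionMergeSketch {

    /** Minimal stand-in for an HBase cell: row key, timestamp, delete flag. */
    record Cell(String row, long ts, boolean delete) {}

    /** Merge already-sorted inputs, applying version limit and delete masking. */
    static List<Cell> compact(List<List<Cell>> sortedFiles, int maxVersions) {
        // Sort by row ascending, then timestamp descending (newest first),
        // mirroring the order cells are stored in HFiles.
        Comparator<Cell> order = Comparator.comparing(Cell::row)
                .thenComparing(Comparator.comparingLong(Cell::ts).reversed());

        // k-way merge via a priority queue of {file index, position} pairs.
        PriorityQueue<int[]> heap = new PriorityQueue<>((a, b) ->
                order.compare(sortedFiles.get(a[0]).get(a[1]),
                              sortedFiles.get(b[0]).get(b[1])));
        for (int f = 0; f < sortedFiles.size(); f++) {
            if (!sortedFiles.get(f).isEmpty()) heap.add(new int[]{f, 0});
        }

        List<Cell> out = new ArrayList<>();
        String currentRow = null;
        int versionsSeen = 0;
        boolean rowDeleted = false;
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            Cell c = sortedFiles.get(top[0]).get(top[1]);
            if (top[1] + 1 < sortedFiles.get(top[0]).size()) {
                heap.add(new int[]{top[0], top[1] + 1});  // advance that file
            }
            if (!c.row().equals(currentRow)) {   // new row: reset per-row state
                currentRow = c.row();
                versionsSeen = 0;
                rowDeleted = false;
            }
            if (c.delete()) {                    // marker masks older cells
                rowDeleted = true;
                continue;
            }
            if (!rowDeleted && versionsSeen < maxVersions) {
                out.add(c);
                versionsSeen++;
            }
        }
        return out;
    }
}
```

In the actual job this logic would be split across the shuffle sort, an optional combiner (step 3), and the reducer (step 4), rather than running in one loop.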
> There is a tradeoff with this solution: intermediate copies of the data
> are required while running the MapReduce job, even though the HFiles
> already contain sorted data.
>
> Thanks,
> Rajeshbabu.

--
Best regards,
Andrew

Unrest, ignorance distilled, nihilistic imbeciles -
It's what we’ve earned
Welcome, apocalypse, what’s taken you so long?
Bring us the fitting end that we’ve been counting on
   - A23, Welcome, Apocalypse