On Thu, Mar 21, 2013 at 12:06 PM, Nicolas Seyvet
<nicolas.sey...@gmail.com> wrote:
> @Ram: You are entirely correct, I made the exact same mistake of mixing up
> major and minor compaction.  Looking closely, what I see is that at
> around 200 HFiles per region it starts minor compacting files in groups of
> 10 HFiles.  The "problem" seems to be that this minor compacting never stops,
> even when there are only about 20 HFiles left.  It just keeps going, taking
> more and more time (I guess because the files to compact are getting
> bigger).
>
> Of course in parallel we keep on adding more and more data.
>
> @J-D: "It seems to me that it would be better if you were able to do a
> single load for all your files." Yes, I agree, but that is not what we are
> testing; our use case is to use 1-minute batch files.

I worked on a very similar use case recently and would recommend
against doing bulk loads like this. The way bulk loaded files are
treated by the compaction selection algorithm is broken when loads are
done in a continuous fashion. The fix for this is tracked in
HBASE-7842 [1], but it is still being worked on.

What you are seeing is that the files picked up for compactions will
often include the bigger already-compacted files. As those files get
bigger, compactions will take longer and longer, up to a point where
the data that is selected for compaction is greater than your
compacting capacity.
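
To make the effect concrete, here is a rough toy model in Python. It is
only a sketch under stated assumptions, not HBase's actual selection
code: the function name, the parameters, and the sizes (10 new files of
size 1 per round) are all made up for illustration. It compares the
amount of data rewritten per round when the big already-compacted file
is pulled into every compaction against the case where only the fresh
files are merged.

```python
def simulate(rounds, new_files_per_round=10, new_file_size=1, include_big=True):
    """Toy model: return the amount of data rewritten in each compaction round.

    Hypothetical illustration only; this is NOT HBase's selection algorithm.
    """
    big = 0          # size of the already-compacted file
    rewritten = []   # data rewritten per round
    for _ in range(rounds):
        fresh = new_files_per_round * new_file_size
        if include_big:
            # The big file is selected together with the fresh files,
            # so everything accumulated so far is rewritten again.
            cost = big + fresh
            big = cost
        else:
            # Only the fresh files are merged; the big file is left alone.
            cost = fresh
        rewritten.append(cost)
    return rewritten


if __name__ == "__main__":
    print(simulate(5))                     # big file included every round
    print(simulate(5, include_big=False))  # only fresh files compacted
```

In this model the per-round cost grows linearly when the big file is
always selected, so the total work grows quadratically with the number
of rounds, while it stays flat when only the new files are merged.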

The workaround would be to use the normal write API (regular puts
instead of bulk loads), since files are then selected for compaction
more sensibly. It won't be as fast or efficient as continuous bulk
loading would be if the selection algorithm weren't broken.

J-D

1. https://issues.apache.org/jira/browse/HBASE-7842
