Hi Norbert,

To answer your question directly: the RegionSizeCalculator class is annotated with @InterfaceAudience.Private, which means there's a good chance that its implementation can be changed without the need for a deprecation cycle and user participation.
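For context, this is roughly what that marker looks like in the source; the exact package of the annotation varies by HBase branch, so treat this as a sketch rather than the literal declaration:

```java
// Sketch only: the annotation import differs between HBase branches
// (older branches use org.apache.hadoop.hbase.classification instead).
import org.apache.yetus.audience.InterfaceAudience;

// @InterfaceAudience.Private marks the class as internal: no compatibility
// guarantees are made to downstream users, so its internals (including the
// units stored in sizeMap) can change between releases without deprecation.
@InterfaceAudience.Private
public class RegionSizeCalculator {
  // internal mapping of region name -> size; free to change shape or units
}
```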
Curiously, I noticed that this `sizeMap` is accessed down in the method `long getRegionSize(byte[])`, whose javadoc explicitly states that the returned unit is bytes. A little investigation with git blame shows that the switch from returning values in bytes to values in megabytes came in through HBASE-16169 -- your proposed change is essentially the old implementation. For whatever reason, that approach was determined not to be scalable. So we could revert, but we'd need a new solution to the problem HBASE-16169 aimed to solve. (A small sketch of the truncation you describe is appended below your quoted message.)

I hope this helps.

Thanks,
Nick

On Tue, Oct 12, 2021 at 10:54 AM Norbert Kalmar <[email protected]> wrote:

> Hi All,
>
> There is a new optimization in Spark (SPARK-34809) where ignoreEmptySplits
> filters out all regions whose size is 0. They use a Hadoop library,
> getSize() in TableInputFormat.
>
> Drilling down, this will return bytes, but it converts the value from
> megabytes - meaning anything under 1 MB will come back as 0 bytes, i.e.
> empty.
> I did a quick PR I thought would help:
> https://github.com/apache/hbase/pull/3737
> But it turns out it's not as easy as requesting the size in bytes instead
> of MB from the Size class, as we set it in MB to begin with in
> RegionMetricsBuilder:
> -> setStoreFileSize(new Size(regionLoadPB.getStorefileSizeMB(),
> Size.Unit.MEGABYTE))
>
> I did some testing: after inserting a few kilobytes of data, calling
> list_regions does in fact give back size 0.
>
> My question is: is it okay to store the region size in bytes instead?
> I'm mainly asking because of backward compatibility concerns.
>
> Regards,
> Norbert
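For completeness, here is a minimal, self-contained illustration of the truncation described above. It assumes only HBase's public org.apache.hadoop.hbase.Size class; the hard-coded 0 is a stand-in for what regionLoadPB.getStorefileSizeMB() would report for a region holding less than 1 MB:

```java
import org.apache.hadoop.hbase.Size;

public class SubMegabyteTruncation {
  public static void main(String[] args) {
    // The server reports store file size as a whole number of megabytes,
    // so a region holding only a few hundred KB is reported as 0 MB.
    int storefileSizeMB = 0; // stand-in for regionLoadPB.getStorefileSizeMB()

    // This mirrors what RegionMetricsBuilder does: the sub-MB precision is
    // already gone by the time the Size object is constructed.
    Size storeFileSize = new Size(storefileSizeMB, Size.Unit.MEGABYTE);

    // Converting back to bytes cannot recover the lost precision, so a
    // downstream caller (e.g. Spark's ignoreEmptySplits check) sees 0
    // and treats the region as empty.
    System.out.println(storeFileSize.get(Size.Unit.BYTE)); // prints 0.0
  }
}
```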
