Norbert made the change that Duo recommended: https://github.com/apache/hbase/pull/3872 I'll wait a few days in case someone wants to review it as well.
On Sun, Oct 17, 2021 at 6:21 PM 张铎(Duo Zhang) <[email protected]> wrote: > I think the problem is here: > > > https://github.com/apache/hbase/blob/736f3e77c8d13719fc48e04876368d3494699280/hbase-protocol-shaded/src/main/protobuf/server/ClusterStatus.proto#L98 > > We only store the storefile_size_MB in RegionLoad, so even if you use get > Unit.Byte, you can not get the actual size in bytes... > > Maybe we could always round the size to 1MB if the region size is greater > than 0 but less than 1MB? > > Nick Dimiduk <[email protected]> 于2021年10月14日周四 上午1:51写道: > > > Maybe the master is only tracking sizes in mb because that’s what is sent > > from the region servers via heartbeat protocol? Reducing the load over > the > > wire makes sense as a scalability concern. > > > > On Wed, Oct 13, 2021 at 03:45 Norbert Kalmar > <[email protected] > > > > > wrote: > > > > > Thank you for the inputs. > > > Yes, we do return in bytes, but it is multiplied from MB so we get a > > false > > > size of 0 below 1MB. > > > > > > I checked HBASE-16169, even before that patch we used MB in > > > REgionSizeCalculator, the patch just kept the idea of using MB as a > unit. > > > > > > I will map every usage related to region size and try to figure out the > > > reason why MB is the standardize unit. And after all, we can always > > convert > > > to MB - as we do convert to Bytes right now. But this way we wouldn't > > lose > > > the precise size due to conversation if we need the size in Bytes. > > > > > > I'll update my PR once I think I figured out the solution. > > > > > > - Norbert > > > > > > > > > > > > On Wed, Oct 13, 2021 at 4:59 AM 张铎(Duo Zhang) <[email protected]> > > > wrote: > > > > > > > The return value is in bytes, the problem is that we normalize the > size > > > in > > > > MB and then multiply MB to get the size in bytes, so if a file is > less > > > than > > > > 1MB, the returned value will be zero. > > > > > > > > Need to investigate more here. > > > > > > > > Reading the issue, the scalable problem they wanted to solve is that > we > > > > will go to master to get the region size, not about whether the unit > is > > > in > > > > MB or not. > > > > > > > > Thanks. > > > > > > > > Nick Dimiduk <[email protected]> 于2021年10月13日周三 上午7:47写道: > > > > > > > > > Hi Norbert, > > > > > > > > > > To answer your question directly: the RegionSizeCalculator class is > > > > > annotated with @InterfaceAudience.Private, which means there's a > good > > > > > chance that it's implementation can be changed without need for a > > > > > deprecation cycle and user participation. > > > > > > > > > > Curiously, I noticed that this `sizeMap` is accessed down in the > > method > > > > > `long getRegionSize(byte[])`, and its javadoc mentions the returned > > > unit > > > > > explicitly as bytes. > > > > > > > > > > So with a little investigation using git blame, I see that the > switch > > > > from > > > > > returning values in bytes to values in megabytes came in through > > > > > HBASE-16169 -- your proposed change was the old implementation. For > > > > > whatever reasons, it was determined to not be scalable. So, we > could > > > > revert > > > > > back, but we'd need some new solution to what HBASE-16169 aimed to > > > solve. > > > > > > > > > > I hope this helps. > > > > > > > > > > Thanks, > > > > > Nick > > > > > > > > > > On Tue, Oct 12, 2021 at 10:54 AM Norbert Kalmar < > [email protected]> > > > > > wrote: > > > > > > > > > > > Hi All, > > > > > > > > > > > > There is a new optimization in spark (SPARK-34809) where > > > > > ignoreEmptySplits > > > > > > filters out all regions that's size is 0. They use a hadoop > library > > > > > > getSize() in TableInputFormat. > > > > > > > > > > > > Drilling down, this will return Bytes, but it converts it from > > > > MegaBytes > > > > > - > > > > > > meaning anything under 1 MB will come down as 0 Bytes, meaning > > empty. > > > > > > I did a quick PR I thought would help: > > > > > > https://github.com/apache/hbase/pull/3737 > > > > > > But it turns out it's not as easy as requesting the size in Bytes > > > > instead > > > > > > of MB from Size class, as we set it in MB te begin with in > > > > > > RegionMetricsBuilder > > > > > > -> setStoreFileSize(new Size(regionLoadPB.getStorefileSizeMB(), > > > > > > Size.Unit.MEGABYTE)) > > > > > > > > > > > > I did some testing, and inserting a few kilobytes of data, then > > > > > > calling list_regions > > > > > > will in fact give back size 0. > > > > > > > > > > > > My question is, is it okay to store the region size in Bytes > > instead? > > > > > > Mainly asking because of backward compatibility reasons. > > > > > > > > > > > > Regards, > > > > > > Norbert > > > > > > > > > > > > > > > > > > > > >
