I think the problem is here:

https://github.com/apache/hbase/blob/736f3e77c8d13719fc48e04876368d3494699280/hbase-protocol-shaded/src/main/protobuf/server/ClusterStatus.proto#L98

We only store the storefile_size_MB in RegionLoad, so even if you use get
Unit.Byte, you can not get the actual size in bytes...

Maybe we could always round the size to 1MB if the region size is greater
than 0 but less than 1MB?

Nick Dimiduk <[email protected]> 于2021年10月14日周四 上午1:51写道:

> Maybe the master is only tracking sizes in mb because that’s what is sent
> from the region servers via heartbeat protocol? Reducing the load over the
> wire makes sense as a scalability concern.
>
> On Wed, Oct 13, 2021 at 03:45 Norbert Kalmar <[email protected]
> >
> wrote:
>
> > Thank you for the inputs.
> > Yes, we do return in bytes, but it is multiplied from MB so we get a
> false
> > size of 0 below 1MB.
> >
> > I checked HBASE-16169, even before that patch we used MB in
> > REgionSizeCalculator, the patch just kept the idea of using MB as a unit.
> >
> > I will map every usage related to region size and try to figure out the
> > reason why MB is the standardize unit. And after all, we can always
> convert
> > to MB - as we do convert to Bytes right now. But this way we wouldn't
> lose
> > the precise size due to conversation if we need the size in Bytes.
> >
> > I'll update my PR once I think I figured out the solution.
> >
> > - Norbert
> >
> >
> >
> > On Wed, Oct 13, 2021 at 4:59 AM 张铎(Duo Zhang) <[email protected]>
> > wrote:
> >
> > > The return value is in bytes, the problem is that we normalize the size
> > in
> > > MB and then multiply MB to get the size in bytes, so if a file is less
> > than
> > > 1MB, the returned value will be zero.
> > >
> > > Need to investigate more here.
> > >
> > > Reading the issue, the scalable problem they wanted to solve is that we
> > > will go to master to get the region size, not about whether the unit is
> > in
> > > MB or not.
> > >
> > > Thanks.
> > >
> > > Nick Dimiduk <[email protected]> 于2021年10月13日周三 上午7:47写道:
> > >
> > > > Hi Norbert,
> > > >
> > > > To answer your question directly: the RegionSizeCalculator class is
> > > > annotated with @InterfaceAudience.Private, which means there's a good
> > > > chance that it's implementation can be changed without need for a
> > > > deprecation cycle and user participation.
> > > >
> > > > Curiously, I noticed that this `sizeMap` is accessed down in the
> method
> > > > `long getRegionSize(byte[])`, and its javadoc mentions the returned
> > unit
> > > > explicitly as bytes.
> > > >
> > > > So with a little investigation using git blame, I see that the switch
> > > from
> > > > returning values in bytes to values in megabytes came in through
> > > > HBASE-16169 -- your proposed change was the old implementation. For
> > > > whatever reasons, it was determined to not be scalable. So, we could
> > > revert
> > > > back, but we'd need some new solution to what HBASE-16169 aimed to
> > solve.
> > > >
> > > > I hope this helps.
> > > >
> > > > Thanks,
> > > > Nick
> > > >
> > > > On Tue, Oct 12, 2021 at 10:54 AM Norbert Kalmar <[email protected]>
> > > > wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > There is a new optimization in spark (SPARK-34809) where
> > > > ignoreEmptySplits
> > > > > filters out all regions that's size is 0. They use a hadoop library
> > > > > getSize() in TableInputFormat.
> > > > >
> > > > > Drilling down, this will return Bytes, but it converts it from
> > > MegaBytes
> > > > -
> > > > > meaning anything under 1 MB will come down as 0 Bytes, meaning
> empty.
> > > > > I did a quick PR I thought would help:
> > > > > https://github.com/apache/hbase/pull/3737
> > > > > But it turns out it's not as easy as requesting the size in Bytes
> > > instead
> > > > > of MB from Size class, as we set it in MB te begin with in
> > > > > RegionMetricsBuilder
> > > > > -> setStoreFileSize(new Size(regionLoadPB.getStorefileSizeMB(),
> > > > > Size.Unit.MEGABYTE))
> > > > >
> > > > > I did some testing, and inserting a few kilobytes of data, then
> > > > > calling list_regions
> > > > > will in fact give back size 0.
> > > > >
> > > > > My question is, is it okay to store the region size in Bytes
> instead?
> > > > > Mainly asking because of backward compatibility reasons.
> > > > >
> > > > > Regards,
> > > > > Norbert
> > > > >
> > > >
> > >
> >
>

Reply via email to