Thank you for the inputs.
Yes, we do return in bytes, but it is multiplied from MB so we get a false
size of 0 below 1MB.

I checked HBASE-16169, even before that patch we used MB in
REgionSizeCalculator, the patch just kept the idea of using MB as a unit.

I will map every usage related to region size and try to figure out the
reason why MB is the standardize unit. And after all, we can always convert
to MB - as we do convert to Bytes right now. But this way we wouldn't lose
the precise size due to conversation if we need the size in Bytes.

I'll update my PR once I think I figured out the solution.

- Norbert



On Wed, Oct 13, 2021 at 4:59 AM 张铎(Duo Zhang) <[email protected]> wrote:

> The return value is in bytes, the problem is that we normalize the size in
> MB and then multiply MB to get the size in bytes, so if a file is less than
> 1MB, the returned value will be zero.
>
> Need to investigate more here.
>
> Reading the issue, the scalable problem they wanted to solve is that we
> will go to master to get the region size, not about whether the unit is in
> MB or not.
>
> Thanks.
>
> Nick Dimiduk <[email protected]> 于2021年10月13日周三 上午7:47写道:
>
> > Hi Norbert,
> >
> > To answer your question directly: the RegionSizeCalculator class is
> > annotated with @InterfaceAudience.Private, which means there's a good
> > chance that it's implementation can be changed without need for a
> > deprecation cycle and user participation.
> >
> > Curiously, I noticed that this `sizeMap` is accessed down in the method
> > `long getRegionSize(byte[])`, and its javadoc mentions the returned unit
> > explicitly as bytes.
> >
> > So with a little investigation using git blame, I see that the switch
> from
> > returning values in bytes to values in megabytes came in through
> > HBASE-16169 -- your proposed change was the old implementation. For
> > whatever reasons, it was determined to not be scalable. So, we could
> revert
> > back, but we'd need some new solution to what HBASE-16169 aimed to solve.
> >
> > I hope this helps.
> >
> > Thanks,
> > Nick
> >
> > On Tue, Oct 12, 2021 at 10:54 AM Norbert Kalmar <[email protected]>
> > wrote:
> >
> > > Hi All,
> > >
> > > There is a new optimization in spark (SPARK-34809) where
> > ignoreEmptySplits
> > > filters out all regions that's size is 0. They use a hadoop library
> > > getSize() in TableInputFormat.
> > >
> > > Drilling down, this will return Bytes, but it converts it from
> MegaBytes
> > -
> > > meaning anything under 1 MB will come down as 0 Bytes, meaning empty.
> > > I did a quick PR I thought would help:
> > > https://github.com/apache/hbase/pull/3737
> > > But it turns out it's not as easy as requesting the size in Bytes
> instead
> > > of MB from Size class, as we set it in MB te begin with in
> > > RegionMetricsBuilder
> > > -> setStoreFileSize(new Size(regionLoadPB.getStorefileSizeMB(),
> > > Size.Unit.MEGABYTE))
> > >
> > > I did some testing, and inserting a few kilobytes of data, then
> > > calling list_regions
> > > will in fact give back size 0.
> > >
> > > My question is, is it okay to store the region size in Bytes instead?
> > > Mainly asking because of backward compatibility reasons.
> > >
> > > Regards,
> > > Norbert
> > >
> >
>

Reply via email to