Hi All,

There is a new optimization in Spark (SPARK-34809) where ignoreEmptySplits
filters out all regions whose size is 0. It uses a Hadoop-level getSize()
call in TableInputFormat.

Drilling down, this returns bytes, but the value is converted from
megabytes, so anything under 1 MB comes back as 0 bytes, i.e. the region
looks empty.
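To make the rounding concrete, here is a minimal, self-contained sketch
(plain Java, not the actual HBase/Spark classes) of what happens when a
region's size is reported as a whole number of MB and then converted back
to bytes:

    public class RegionSizeTruncation {
      private static final long MB = 1024L * 1024L;

      public static void main(String[] args) {
        long actualBytes = 300L * 1024L; // a ~300 KB region

        // The size is reported as an integer number of MB,
        // so anything under 1 MB rounds down to 0.
        long reportedMB = actualBytes / MB; // 0

        // Converting back to bytes cannot recover the lost precision.
        long bytesSeen = reportedMB * MB; // 0 -> looks like an empty split

        System.out.println("actual = " + actualBytes
            + " B, reported = " + bytesSeen + " B");
      }
    }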
I did a quick PR I thought would help:
https://github.com/apache/hbase/pull/3737
But it turns out it's not as easy as requesting the size in bytes instead
of MB from the Size class, as we set it in MB to begin with in
RegionMetricsBuilder:
setStoreFileSize(new Size(regionLoadPB.getStorefileSizeMB(),
Size.Unit.MEGABYTE))
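In other words, assuming the org.apache.hadoop.hbase.Size API (a
constructor taking a value and a unit, and get(Unit) for conversion), once
the value has gone through the integer-MB protobuf field, asking the Size
object for bytes just scales the already-rounded value back up:

    import org.apache.hadoop.hbase.Size;

    public class SizePrecision {
      public static void main(String[] args) {
        // getStorefileSizeMB() already returned 0 for a sub-1-MB region,
        // so requesting bytes cannot recover the lost precision.
        Size size = new Size(0, Size.Unit.MEGABYTE);
        System.out.println(size.get(Size.Unit.BYTE)); // prints 0.0
      }
    }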

I did some testing: after inserting a few kilobytes of data, calling
list_regions does in fact report a size of 0.
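A minimal reproduction along these lines in the HBase shell shows the same
thing (table name and payload are placeholders):

    create 't1', 'cf'
    put 't1', 'row1', 'cf:q', 'a few KB of data...'
    flush 't1'
    list_regions 't1'    # SIZE column reports 0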

My question is: is it okay to store the region size in bytes instead?
I'm mainly asking because of backward compatibility concerns.

Regards,
Norbert
