Hi All,

There is a new optimization in Spark (SPARK-34809) where ignoreEmptySplits filters out every region whose size is 0. The size is obtained through a Hadoop library getSize() call in TableInputFormat.
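In effect the check amounts to something like this (a simplified, self-contained sketch with made-up sizes, not the actual Spark code):

    import java.util.ArrayList;
    import java.util.List;

    public class IgnoreEmptySplitsSketch {
      public static void main(String[] args) {
        // Reported split sizes in bytes. A region holding a few hundred KB
        // that was rounded down to 0 MB shows up here as 0.
        List<Long> splitSizes = List.of(5L * 1024 * 1024, 0L, 12L * 1024 * 1024);

        List<Long> kept = new ArrayList<>();
        for (long size : splitSizes) {
          if (size > 0) { // this is the ignoreEmptySplits filter, in spirit
            kept.add(size);
          }
        }
        System.out.println(kept); // [5242880, 12582912] -- the "empty" split is dropped
      }
    }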
Drilling down, getSize() returns bytes, but the value is converted from megabytes, so anything under 1 MB comes back as 0 bytes and the region is treated as empty.

I opened a quick PR that I thought would help: https://github.com/apache/hbase/pull/3737. But it turns out it is not as simple as requesting the size in bytes instead of megabytes from the Size class, because we set it in megabytes to begin with in RegionMetricsBuilder:

    setStoreFileSize(new Size(regionLoadPB.getStorefileSizeMB(), Size.Unit.MEGABYTE))

I did some testing: after inserting a few kilobytes of data, calling list_regions does indeed report size 0.

My question is: would it be acceptable to store the region size in bytes instead? I am mainly asking because of backward compatibility.

Regards,
Norbert
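P.S. Here is a minimal standalone sketch of the rounding problem, using plain longs instead of the real Size class (the numbers are made up for illustration):

    public class RegionSizeTruncationDemo {
      public static void main(String[] args) {
        // A region holding roughly 300 KB of store files.
        long actualBytes = 300L * 1024;

        // The server reports storefileSizeMB as a whole number of megabytes,
        // so anything under 1 MB truncates to 0.
        long storefileSizeMB = actualBytes / (1024 * 1024); // -> 0

        // Converting that back to bytes cannot recover the lost precision.
        long reportedBytes = storefileSizeMB * 1024 * 1024; // -> 0

        System.out.println("actual bytes:   " + actualBytes);   // 307200
        System.out.println("reported bytes: " + reportedBytes); // 0
        // Spark then sees a 0-byte split and ignoreEmptySplits filters it out.
      }
    }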
