Hi!

While working on PHOENIX-6587, I've found that we call
SchemaUtil.processSplit() for both explicit split points and salt bucket
bytes.
This effectively right-pads the split point with zeros to reach the
minimum possible PK length.

For some queries, this results in an additional scan being run on an
additional region, scanning the <partialkey> ..
<partialkey>\x00\x00... range, where no row keys are possible.

This can be seen at
https://github.com/apache/phoenix/blob/fb9065760faa3986f49671df2cb64dcaca7d3476/phoenix-core/src/test/java/org/apache/phoenix/compile/QueryCompilerTest.java#L4901

There we should be scanning a single region, but we scan an extra one,
because the padded region boundaries are A\x00\x00\x00\x00..., B\x00\x00\x00\x00...
instead of the A, B specified in the CREATE TABLE command, and we start a
scan for the A .. A\x00\x00\x00\x00... range at the end of the first region,
where no keys are even possible.
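To make the padding concrete, here is a small standalone sketch (not Phoenix source; padSplit is a hypothetical helper mirroring the behavior described above) showing how a one-byte split point "A" gets right-padded with zero bytes, turning the region boundary into a key that no real row can sort immediately before:

```java
import java.util.Arrays;

public class SplitPadding {

    // Hypothetical helper approximating what SchemaUtil.processSplit()
    // effectively does: right-pad the split key with 0x00 bytes up to
    // the minimum possible PK length.
    public static byte[] padSplit(byte[] split, int minKeyLength) {
        if (split.length >= minKeyLength) {
            return split;
        }
        // Arrays.copyOf zero-fills the extra trailing bytes.
        return Arrays.copyOf(split, minKeyLength);
    }

    public static void main(String[] args) {
        byte[] split = "A".getBytes();      // explicit split point from CREATE TABLE
        byte[] padded = padSplit(split, 8); // assume an 8-byte minimum PK length

        // The region boundary becomes A\x00\x00\x00\x00\x00\x00\x00 instead of A,
        // so the tail range [A, A\x00...\x00) of the previous region gets its own
        // scan even though no row key can fall inside it.
        System.out.println(Arrays.toString(padded));
    }
}
```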

Running the test suite without this padding did not find any queries that
returned wrong results.

I checked the commit history for clues, but this is present in the initial
commit, and I could not find any discussion on it.

My best guess is that the method tries to approximate what an automatic
split would do, which uses an existing rowkey as a split point, but I can't
see the benefit.

Can someone shed some light on why we are doing this (and whether anything
would break or slow down if we just used the explicit split points provided,
or the unpadded single salt bytes, directly as split points)?

regards
Istvan
