Hi! While working on PHOENIX-6587, I've found that we call SchemaUtil.processSplit() for both explicit split points and salt bucket bytes. This effectively right-pads the split point with zero bytes to reach the minimum possible PK length.
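To make the effect concrete, here is a minimal, self-contained sketch (not the actual SchemaUtil code; the helper name and minimum length are made up for illustration) of what the padding effectively does to a short split point:

```java
import java.util.Arrays;

public class SplitPadding {

    // Hypothetical helper mirroring the observed behavior: a split point
    // shorter than the minimum PK length is right-padded with zero bytes.
    static byte[] padSplitPoint(byte[] split, int minPkLength) {
        if (split.length >= minPkLength) {
            return split;
        }
        // Arrays.copyOf fills the tail with (byte) 0
        return Arrays.copyOf(split, minPkLength);
    }

    public static void main(String[] args) {
        byte[] split = {'A'};
        byte[] padded = padSplitPoint(split, 8);
        // "A" becomes A\x00\x00\x00\x00\x00\x00\x00, so the region boundary
        // no longer matches the explicit split point "A", and a scan can end
        // up covering the empty range A .. A\x00...\x00
        System.out.println(Arrays.toString(padded)); // [65, 0, 0, 0, 0, 0, 0, 0]
    }
}
```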
For some queries, this results in an extra scan being run on an additional region, covering the <partialkey> .. <partialkey>\x00\x00... range, where no row keys are possible. This can be seen at https://github.com/apache/phoenix/blob/fb9065760faa3986f49671df2cb64dcaca7d3476/phoenix-core/src/test/java/org/apache/phoenix/compile/QueryCompilerTest.java#L4901 . There we should be scanning a single region, but we scan an extra one, because the padded region boundaries are A\x00\x00..., B\x00\x00... instead of the A, B specified in the CREATE TABLE command, and we start a scan for the A .. A\x00\x00... range at the end of the first region, where no keys are even possible.

Running the test suite without this padding did not turn up any queries that returned wrong results. I checked the commit history for clues, but the padding is present in the initial commit, and I could not find any discussion of it. My best guess is that the method tries to approximate what an automatic split would do, which uses an existing row key as a split point, but I can't see the benefit.

Can someone shed some light on why we are doing this, and on whether anything would break or slow down if we just used the explicit split points provided, or the unpadded single salt bytes, directly as split points?

regards
Istvan
