GitHub user sureshsubbiah commented on a diff in the pull request:
https://github.com/apache/incubator-trafodion/pull/229#discussion_r47987524
--- Diff: core/sql/src/main/java/org/trafodion/sql/HBaseClient.java ---
@@ -1088,36 +1139,65 @@ public boolean estimateRowCount(String tblName, int partialRowSize,
//printQualifiers(reader, 100);
if (ROWS_TO_SAMPLE > 0 &&
totalEntries == reader.getEntries()) { // first file only
- // Trafodion column qualifiers are ordinal numbers, which
- // makes it easy to count missing (null) values. We also count
- // the non-Put KVs (typically delete-row markers) to estimate
- // their frequency in the full file set.
+
+ // Trafodion column qualifiers are ordinal numbers, but are represented
+ // as varying length unsigned little-endian integers in lexicographical
+ // order. So, for example, in a table with 260 columns, the column
+ // qualifiers (if present) will be read in this order:
+ // 1 (x'01'), 257 (x'0101'), 2 (x'02'), 258 (x'0201'), 3 (x'03'),
+ // 259 (x'0301'), 4 (x'04'), 260 (x'0401'), 5 (x'05'), 6 (x'06'),
+ // 7 (x'07'), ...
+ // We have crossed the boundary to the next row if and only if the
+ // next qualifier read is less than or equal to the previous,
+ // compared unsigned, lexicographically.
+
--- End diff --
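For illustration only, the boundary rule stated in the new comment can be sketched as a standalone Java class. This is not the code in HBaseClient.java: the class name `QualifierBoundary` and its methods are hypothetical, and the sketch assumes the qualifiers of one file are already available, in read order, as byte arrays.

```java
class QualifierBoundary {
    // Compare two byte arrays lexicographically, treating each byte as unsigned.
    // A shorter array that is a prefix of a longer one sorts first.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xFF;
            int y = b[i] & 0xFF;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    // Count rows in a stream of qualifiers: a new row starts whenever the
    // next qualifier is less than or equal to the previous one.
    static int countRows(byte[][] qualifiers) {
        if (qualifiers.length == 0) {
            return 0;
        }
        int rows = 1;
        for (int i = 1; i < qualifiers.length; i++) {
            if (compareUnsigned(qualifiers[i], qualifiers[i - 1]) <= 0) {
                rows++;
            }
        }
        return rows;
    }
}
```

With the qualifiers x'01', x'0101', x'02', x'0201', x'03' followed by x'01', x'02', the only non-increasing step is from x'03' back to x'01', so the sketch counts two rows.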
The current code is good, though I am confused as to why we do not try
something simpler, such as comparing the key in consecutive KeyValue
objects until it changes. There is a method on KeyValue that returns the
key as a string:
https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/KeyValue.html#getKeyString()
Maybe we could just compare the strings? Or is the idea that keys can be
long strings and are therefore more expensive to compare?
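
To make the suggestion concrete, here is a rough sketch of the simpler approach. It assumes only the row portion of each key is compared (a full HBase key also carries the family, qualifier, and timestamp, so it can differ between entries of the same row). The class `RowKeyBoundary` is hypothetical and works on plain byte arrays rather than real KeyValue objects.

```java
import java.util.Arrays;

class RowKeyBoundary {
    // Count rows by comparing the row key of each entry with the previous one.
    // rowKeys holds the row portion of each entry's key, in read order.
    static int countRows(byte[][] rowKeys) {
        int rows = 0;
        byte[] previous = null;
        for (byte[] row : rowKeys) {
            // A new row starts at the first entry and whenever the row key changes.
            if (previous == null || !Arrays.equals(previous, row)) {
                rows++;
            }
            previous = row;
        }
        return rows;
    }
}
```

Each comparison here costs time proportional to the row key length, which is the possible expense the question above is asking about.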