Hi all,

I was looking at HBase scan operator, and noticed one interesting behavior
regarding how Drill/HBase reader handling a column with nulll value.

1. Date prepare (HBase shell).  A table with 3 rows. one row has
'addr.city' column only, while the other two rows have column 'order.id'
only.

create 'customer', {NAME=>'addr'}, {NAME=>'order'}
put 'customer', 'jsmith', 'addr:city', 'sanjose'
put 'customer', 'tom', 'order:id', '2'
put 'customer', 'frank', 'order:id', '3'

scan 'customer'
ROW                                           COLUMN+CELL
 frank                                        column=order:id,
timestamp=1494969396032, value=3
 jsmith                                       column=addr:city,
timestamp=1494969355484, value=sanjose
 tom                                          column=order:id,
timestamp=1494969387941, value=2
3 row(s) in 0.0170 seconds

2. Query in Drill
Q1. check row count in Drill. Result looks good.
select count(*)  from hbase.customer t;
+---------+
| EXPR$0  |
+---------+
| 3       |
+---------+

Q2. Get column 'addr.city' only, just return 1 row.
select convert_from(t.addr.city, 'UTF8') as city  from hbase.customer t;
+----------+
|   city   |
+----------+
| sanjose  |
+----------+

Q3. Get column 'addr.city' and 'order.id', return 3 rows
select convert_from(t.addr.city, 'UTF8') as city,
convert_from(t.`order`.id, 'UTF8') as id  from hbase.customer t;
+----------+-------+
|   city   |  id   |
+----------+-------+
| null     | 3     |
| sanjose  | null  |
| null     | 2     |
+----------+-------+

Comparing Q2 and Q3, looks like Drill/Hbase scan will skip rows where all
the requested columns are null.

Is this the expected behavior? I understand that behavior comes from HBase
Scan specification. But it looks a bit hard to understand initially, from
SQL's perspective.

If this is expected behavior,  will it make sense to document this ( I did
search drill doc and did not find doc related to this behavior) ?

Another point, if we disable project push-down in query planner, then Q2
would return 3 rows. In theory, project push-down would only impact query
performance, and should not impact query result.

Any thoughts?

Thanks,

Jinfeng

Reply via email to