Hi all, I was looking at HBase scan operator, and noticed one interesting behavior regarding how Drill/HBase reader handling a column with nulll value.
1. Date prepare (HBase shell). A table with 3 rows. one row has 'addr.city' column only, while the other two rows have column 'order.id' only. create 'customer', {NAME=>'addr'}, {NAME=>'order'} put 'customer', 'jsmith', 'addr:city', 'sanjose' put 'customer', 'tom', 'order:id', '2' put 'customer', 'frank', 'order:id', '3' scan 'customer' ROW COLUMN+CELL frank column=order:id, timestamp=1494969396032, value=3 jsmith column=addr:city, timestamp=1494969355484, value=sanjose tom column=order:id, timestamp=1494969387941, value=2 3 row(s) in 0.0170 seconds 2. Query in Drill Q1. check row count in Drill. Result looks good. select count(*) from hbase.customer t; +---------+ | EXPR$0 | +---------+ | 3 | +---------+ Q2. Get column 'addr.city' only, just return 1 row. select convert_from(t.addr.city, 'UTF8') as city from hbase.customer t; +----------+ | city | +----------+ | sanjose | +----------+ Q3. Get column 'addr.city' and 'order.id', return 3 rows select convert_from(t.addr.city, 'UTF8') as city, convert_from(t.`order`.id, 'UTF8') as id from hbase.customer t; +----------+-------+ | city | id | +----------+-------+ | null | 3 | | sanjose | null | | null | 2 | +----------+-------+ Comparing Q2 and Q3, looks like Drill/Hbase scan will skip rows where all the requested columns are null. Is this the expected behavior? I understand that behavior comes from HBase Scan specification. But it looks a bit hard to understand initially, from SQL's perspective. If this is expected behavior, will it make sense to document this ( I did search drill doc and did not find doc related to this behavior) ? Another point, if we disable project push-down in query planner, then Q2 would return 3 rows. In theory, project push-down would only impact query performance, and should not impact query result. Any thoughts? Thanks, Jinfeng