[jira] [Commented] (IMPALA-886) Always display HBase cols in same order as CREATE TABLE statement

ASF subversion and git services (Jira) Tue, 19 Jul 2022 04:02:06 -0700


    [ 
https://issues.apache.org/jira/browse/IMPALA-886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17568477#comment-17568477
 ]


ASF subversion and git services commented on IMPALA-886:
--------------------------------------------------------

Commit 06e8e7bba7a2423f986d3f371e0f64a50b1dd027 in impala's branch 
refs/heads/master from Csaba Ringhofer
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=06e8e7bba ]

IMPALA-886: Support displaying HBase cols in the order from HMS

Before this patch catalogd always ordered HBase columns
lexicographically by family/qualifier. This is incompatible with other
table formats and the way Hive handles HBase tables, where the order
comes from HMS as defined during CREATE TABLE.

I don't know of any valid reason behind this old behavior. It probably
just made the implementation a bit easier by doing the ordering in the FE
instead of the BE: the BE needs this ordering during scanning because
the HBase API returns results in this order, but this should have
no effect on other parts of Impala.

Added flag use_hms_column_order_for_hbase_tables (used by catalogd)
to decide whether to do this reordering:
- true: keep HMS order
- false: reorder by family/qualifier [default]
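
To illustrate the two settings, here is a minimal Python sketch of the
ordering decision. It is not the catalogd code: the HBaseColumn type and
the order_columns function are hypothetical names made up for this
example, and placing the row key first in the old ordering is assumed
from the DESCRIBE output in the quoted issue description.

```python
from typing import List, NamedTuple


class HBaseColumn(NamedTuple):
    """Hypothetical stand-in for a column mapped from HMS to HBase."""
    name: str        # column name as declared in CREATE TABLE
    family: str      # HBase column family ("" for the row key)
    qualifier: str   # HBase column qualifier ("" for the row key)


def order_columns(hms_columns: List[HBaseColumn],
                  use_hms_order: bool) -> List[HBaseColumn]:
    """Return columns in the order they would be displayed.

    use_hms_order=True  -> keep the order received from HMS.
    use_hms_order=False -> row key first, then the remaining columns
                           sorted lexicographically by family/qualifier.
    """
    if use_hms_order:
        return list(hms_columns)
    key_cols = [c for c in hms_columns if c.family == ""]
    value_cols = [c for c in hms_columns if c.family != ""]
    return key_cols + sorted(value_cols, key=lambda c: (c.family, c.qualifier))


# Columns from the issue's example, in CREATE TABLE (HMS) order.
columns = [
    HBaseColumn("id", "", ""),
    HBaseColumn("val", "cols", "val"),
    HBaseColumn("zfill", "cols", "zfill"),
    HBaseColumn("name", "cols", "name"),
    HBaseColumn("assertion", "cols", "assertion"),
]

hms_order = [c.name for c in order_columns(columns, use_hms_order=True)]
old_order = [c.name for c in order_columns(columns, use_hms_order=False)]
print(hms_order)
print(old_order)
```

With the table from the issue description, the first list matches the
Hive DESCRIBE output and the second matches the impala-shell output.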

The old way is kept as default to avoid breaking existing workloads,
but it would make sense to change it in the next major release.

Note that a query option would be more convenient to use, but it
would be much harder to implement because the order is decided during
loading in catalogd.

Testing:
- added custom cluster test for
  use_hms_column_order_for_hbase_tables = true

Change-Id: Ibc5df8b803f2ae3b93951765326cdaea706e3563
Reviewed-on: http://gerrit.cloudera.org:8080/18635
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Always display HBase cols in same order as CREATE TABLE statement
> -----------------------------------------------------------------
>
>                 Key: IMPALA-886
>                 URL: https://issues.apache.org/jira/browse/IMPALA-886
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Catalog
>    Affects Versions: Impala 1.3
>            Reporter: John Russell
>            Assignee: Csaba Ringhofer
>            Priority: Minor
>              Labels: catalog-server, hbase, usability
>
> I noticed a discrepancy with Hive in how Impala handles column order for
> HBase tables.
> I think it would be preferable to use the same behavior as Hive; otherwise
> life becomes more complicated for anyone doing INSERT or SELECT * with an
> HBase table through Impala.
> (And I have to add caveats and usage notes in the docs.)
> Repro:
> In HBase shell, create a table with a single column family. I think most 
> Impala tests use 1 column family per column, where you won't notice this 
> behavior.
> hbase(main):008:0> create 'sample_data_fast','cols'
> 0 row(s) in 71.8750 seconds
> In Hive shell, create a mapping table. Notice how DESCRIBE repeats back the 
> columns in the same order as in CREATE TABLE.
> hive> create external table sample_data_fast (id string, val int, zfill string, name string, assertion boolean)
>     > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
>     > WITH SERDEPROPERTIES (
>     > "hbase.columns.mapping" =
>     > ":key,cols:val,cols:zfill,cols:name,cols:assertion")
>     > TBLPROPERTIES("hbase.table.name" = "sample_data_fast")
>     > ;
> OK
> Time taken: 1.7 seconds
> hive> desc sample_data_fast;
> OK
> id  string  from deserializer
> val int from deserializer
> zfill string  from deserializer
> name  string  from deserializer
> assertion boolean from deserializer
> Time taken: 0.302 seconds
> Now try the same DESCRIBE in impala-shell. The key column (id) is listed 
> first. Then all the other columns, part of the same column family, are listed 
> in alphabetical order rather than the order from CREATE TABLE:
> [localhost:21000] > desc sample_data_fast;
> Query: describe sample_data_fast
> +-----------+---------+---------+
> | name      | type    | comment |
> +-----------+---------+---------+
> | id        | string  |         |
> | assertion | boolean |         |
> | name      | string  |         |
> | val       | int     |         |
> | zfill     | string  |         |
> +-----------+---------+---------+
> Returned 5 row(s) in 0.02s
> Thus if you already had Hive code that was doing SELECT * from an HBase table 
> like this, you would get a different result set (different column order) in 
> Impala.
> If you tried to copy from an HDFS table via 'INSERT INTO hbase_table SELECT * 
> FROM hdfs_table', you would get an error because the columns don't match. If 
> you made a separate column family for each column, the discrepancy is masked 
> because you need more than one column per column family to experience the 
> alphabetical ordering.
> Since Hive is preserving the column order, the relevant info must be there in 
> the metastore.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
