[DISCUSS] State of the work-in-progress HBase branch

Kasper Sørensen Fri, 24 Jan 2014 11:37:01 -0800

Hi everyone,

I was looking at our "hbase-module" branch and as much as I like this idea,
I think we've been a bit too idle with the branch. Maybe we should try to
make something final e.g. for a version 4.1.


So I thought to give an overview/status of the module's current
capabilities and its shortcomings. We should figure out if we think this
is good enough for a first version, or if we want to do some improvements
to the module before adding it to our portfolio of MetaModel modules.

1) The module only offers read-only/query access to HBase. That is in my
opinion OK for now; we have several such modules, and write support is
something we can add later, once we have straightened out the remaining
topics in this mail.

2) With regards to metadata mapping: HBase is different because it has both
column families and in column families there are columns. For the sake of
our view on HBase I would describe column families simply as "a logical
group of columns". Column families are fixed within a table, but rows in a
table may
contain arbitrary numbers of columns within each column family. So... You
can instantiate the HBaseDataContext in two ways:

2a) You can let MetaModel discover the metadata. This unfortunately has a
severe limitation. We discover the table names and column families using
the HBase API. But the actual columns and their contents cannot be provided
by the API. So instead we simply expose each column family as a column with
a MAP data type. The trouble with this is that the keys and values of the
maps will simply be byte arrays ... Usually not very useful! But it's sort
of the only thing (as far as I can see) that's "safe" in HBase, since HBase
allows anything (byte arrays) in its columns.
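
To make the limitation concrete: a consumer of such a MAP column would have
to decode every key and value itself and guess the encoding. A minimal
sketch in plain Java, assuming the cells happen to hold UTF-8 strings (the
class and method names are hypothetical and not part of the branch):

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of MetaModel or HBase.
final class ColumnFamilyMaps {

    private ColumnFamilyMaps() {
    }

    // Decodes one column family value (column name -> cell value, both raw
    // bytes) under the ASSUMPTION that both sides are UTF-8 encoded strings.
    // Nothing in HBase guarantees this, which is the problem described above.
    static Map<String, String> decodeAsUtf8(Map<byte[], byte[]> rawFamily) {
        Map<String, String> decoded = new LinkedHashMap<>();
        for (Map.Entry<byte[], byte[]> cell : rawFamily.entrySet()) {
            String column = new String(cell.getKey(), StandardCharsets.UTF_8);
            String value = new String(cell.getValue(), StandardCharsets.UTF_8);
            decoded.put(column, value);
        }
        return decoded;
    }
}
```

If a cell actually holds e.g. a serialized long, this decoding silently
produces garbage, so the user still needs out-of-band knowledge of the
schema.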

2b) Like in e.g. MongoDb or CouchDb modules you can provide an array of
tables (SimpleTableDef). That way the user defines the metadata himself and
the implementation assumes that it is correct (or else it will break). The
good thing about this is that the user can define the proper data types
etc. for columns. The user defines the column family and column name by
setting the MetaModel column name like this: "family:name"
(consistent with most HBase tools and API calls).
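
As an illustration of the naming convention only, here is a small sketch of
how a "family:name" column name could be split into its two parts. The class
is hypothetical and not taken from the branch. It assumes the family name
contains no colon and, as a simplification, requires both parts to be
non-empty:

```java
// Hypothetical helper for illustration; not part of MetaModel or HBase.
final class QualifiedColumn {

    final String family;
    final String qualifier;

    private QualifiedColumn(String family, String qualifier) {
        this.family = family;
        this.qualifier = qualifier;
    }

    // Splits on the FIRST colon, so the part after it may itself contain
    // colons. Rejects names without a family or without a column name
    // (an assumption of this sketch, not a rule stated in the mail).
    static QualifiedColumn parse(String columnName) {
        int separator = columnName.indexOf(':');
        if (separator <= 0 || separator == columnName.length() - 1) {
            throw new IllegalArgumentException(
                    "Expected \"family:name\" but got: " + columnName);
        }
        return new QualifiedColumn(
                columnName.substring(0, separator),
                columnName.substring(separator + 1));
    }
}
```

For example, QualifiedColumn.parse("details:first_name") yields the family
"details" and the column name "first_name".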

3) With regards to querying: We've implemented basic query capabilities
using the MetaModel query postprocessor. But not all queries are very
efficient... In addition to full table scans, of course, we have optimized
support for COUNT queries and for table scans with maxRows.

We could rather easily add optimized support for a couple of other typical
queries:
 * lookup record by ID
 * paged table scans (both firstRow and maxRows)
 * queries with simple filters/where items
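
To sketch what the paged table scan in the list above amounts to: skip the
rows before firstRow, collect at most maxRows, and stop reading instead of
scanning the rest of the table. The code below is plain Java over a generic
Iterator standing in for a table scanner; the class name is hypothetical,
and treating firstRow as 1-based is an assumption of this sketch:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper for illustration; not part of MetaModel or HBase.
final class PagedScan {

    private PagedScan() {
    }

    // Returns up to maxRows rows, starting at the (1-based) firstRow.
    // The scanner is not advanced beyond the last row that is returned.
    static <T> List<T> page(Iterator<T> scanner, int firstRow, int maxRows) {
        if (firstRow < 1 || maxRows < 0) {
            throw new IllegalArgumentException(
                    "firstRow must be >= 1 and maxRows must be >= 0");
        }
        // Skip the firstRow - 1 rows that precede the requested page.
        for (int row = 1; row < firstRow && scanner.hasNext(); row++) {
            scanner.next();
        }
        // Collect the page and stop early once it is full.
        List<T> rows = new ArrayList<>();
        while (rows.size() < maxRows && scanner.hasNext()) {
            rows.add(scanner.next());
        }
        return rows;
    }
}
```

A real implementation would presumably push the skipping down to HBase as
well, rather than iterating over the skipped rows on the client side.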

4) With regards to dependencies: The module right now depends on the
artifact called "hbase-client". This dependency has a lot of transitive
dependencies, so the size of the module is quite extreme. As an example, it
includes stuff like jetty, jersey, jackson and of course hadoop... But I am
wondering if we can have a thinner client side than that! If anyone knows
whether we can e.g. easily use the REST interface, that would maybe be
better. I'm not an expert on HBase though, so please enlighten me!

Kind regards,
Kasper
