At a high level, the design is similar to the one described here:
<http://wiki.apache.org/hadoop/Hive/HBaseIntegration>.
It does require the addition of a new Hypertable "Row" object as well as new
Input/Output formats to read/write Rows via Hadoop.

*Row:*
The fundamental unit of operation in Hive is a row, as opposed to a Cell in
Hypertable (for instance, a query like 'select col2, col3 from table where
col1=foo' makes no sense when operating on individual cells). Hence Hypertable
needs to expose a row as a collection of cells to Hive. A Hive column can be
mapped either to an entire column family (using the Hive Map data type) or to
a specific qualified column. Since a Hypertable row may be sparse, the Row
object needs to provide a way to access a Cell (or set of Cells) given the
column family (CF), as well as a way to access a cell given column family +
qualifier (CQ). For simplicity, the row will contain only the latest version
of each cell; support for multiple versions can be added later if required.
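
For concreteness, here is a minimal sketch of what the Row access API could
look like in Java. All names and signatures here are hypothetical, not an
existing Hypertable interface, and Cell stands in for Hypertable's Thrift
Cell type:

  import java.util.List;

  public interface Row {
    // The row key shared by all cells in this row.
    byte[] getRowKey();

    // All cells in the given column family; may be empty, since rows are sparse.
    List<Cell> getCells(String columnFamily);

    // The single (latest-version) cell at CF:CQ, or null if absent.
    Cell getCell(String columnFamily, String qualifier);
  }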

*Implementation:*

A row could contain a ByteBuffer holding the cells, along with two maps
storing the 'CF' -> set of cell buffer offsets and 'CF:CQ' -> cell buffer
offset mappings. Alternatively, since the maximum number of CFs is 256, an
array of size 256 could store indices into an array which stores the cell
buffer offsets.
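
A rough sketch of the first option, in Java (the class and field names are
made up for illustration):

  import java.nio.ByteBuffer;
  import java.util.Collections;
  import java.util.List;
  import java.util.Map;

  public class SerializedRow {
    private final ByteBuffer cells;                   // serialized cells, packed back to back
    private final Map<String, List<Integer>> cfIndex; // "CF"    -> offsets of its cells
    private final Map<String, Integer> cfCqIndex;     // "CF:CQ" -> offset of the single cell

    public SerializedRow(ByteBuffer cells,
                         Map<String, List<Integer>> cfIndex,
                         Map<String, Integer> cfCqIndex) {
      this.cells = cells;
      this.cfIndex = cfIndex;
      this.cfCqIndex = cfCqIndex;
    }

    // Offsets of all cells in a column family; empty if the row is sparse there.
    public List<Integer> offsetsFor(String cf) {
      List<Integer> offsets = cfIndex.get(cf);
      return offsets != null ? offsets : Collections.<Integer>emptyList();
    }

    // Offset of the cell at CF:CQ, or -1 if absent.
    public int offsetFor(String cf, String cq) {
      Integer offset = cfCqIndex.get(cf + ":" + cq);
      return offset != null ? offset : -1;
    }
  }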

*LazyHTRow:*
Hive supports typed columns while Hypertable currently does not. The
LazyHTRow minimizes the amount of deserialization, allowing Hive to
deserialize only the Cell contents it needs.
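
The lazy pattern itself is straightforward; a minimal sketch (LazyField and
Deserializer are hypothetical helper names, not Hive classes):

  public class LazyField<T> {
    public interface Deserializer<T> {
      T deserialize(byte[] bytes);
    }

    private final byte[] raw;          // serialized cell value from the Row
    private final Deserializer<T> de;  // per-type decoder
    private T value;
    private boolean parsed;

    public LazyField(byte[] raw, Deserializer<T> de) {
      this.raw = raw;
      this.de = de;
    }

    // Decode at most once, and only when the query actually reads the field.
    public T get() {
      if (!parsed) {
        value = de.deserialize(raw);
        parsed = true;
      }
      return value;
    }
  }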

*HypertableSerDe:*
Deserializes rows into LazyHTRow objects, and serializes them back from
LazyHTRow.
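
For reference, a skeleton of what this could look like against Hive's serde2
SerDe interface (the same mechanism the HBase handler uses). LazyHTRow and
the method bodies are placeholders for this proposal, and the exact
interface methods may differ across Hive versions:

  import java.util.Properties;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hive.serde2.SerDe;
  import org.apache.hadoop.hive.serde2.SerDeException;
  import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
  import org.apache.hadoop.io.Writable;

  public class HypertableSerDe implements SerDe {
    private LazyHTRow cachedRow = new LazyHTRow(); // hypothetical lazy row type

    public void initialize(Configuration conf, Properties tbl) throws SerDeException {
      // Parse the Hive-column <-> CF/CF:CQ mapping from the table
      // properties and build the matching ObjectInspector.
    }

    public Object deserialize(Writable serializedRow) throws SerDeException {
      // Point the lazy row at the new serialized bytes; cell contents
      // are only decoded when Hive actually accesses them.
      cachedRow.init(serializedRow);
      return cachedRow;
    }

    public Writable serialize(Object obj, ObjectInspector oi) throws SerDeException {
      // Walk the Hive object via oi and re-encode it as a serialized row.
      return null; // sketch only
    }

    public ObjectInspector getObjectInspector() throws SerDeException {
      return null; // built in initialize(); sketch only
    }

    public Class<? extends Writable> getSerializedClass() {
      return Writable.class; // would be the concrete row writable type
    }
  }
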
*Row-Input/Output-Formats:*
We need input and output formats so Hadoop can read and write the Row
objects described above. These can use a get/set_serialized_cells API or
get_/set_serialized_row; either way they will need to convert between the
Row objects and a collection of cells.
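
The input side might look roughly like this, assuming Hadoop's
org.apache.hadoop.mapred API (which Hive uses for its storage layer).
RowWritable and the scanner call mentioned in the comments are hypothetical
placeholders for this proposal's Row type:

  import java.io.IOException;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.InputFormat;
  import org.apache.hadoop.mapred.InputSplit;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.RecordReader;
  import org.apache.hadoop.mapred.Reporter;

  // RowWritable is a hypothetical Writable wrapper around the Row object above.
  public class HypertableRowInputFormat implements InputFormat<Text, RowWritable> {

    public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
      // One split per table range, obtained from the Hypertable master.
      return new InputSplit[0]; // sketch only
    }

    public RecordReader<Text, RowWritable> getRecordReader(
        InputSplit split, JobConf job, Reporter reporter) throws IOException {
      return new RecordReader<Text, RowWritable>() {
        public boolean next(Text key, RowWritable value) throws IOException {
          // Fetch the next serialized row for this range, e.g. via a
          // get_serialized_row-style scanner call, and fill key/value.
          return false; // sketch only
        }
        public Text createKey() { return new Text(); }
        public RowWritable createValue() { return new RowWritable(); }
        public long getPos() { return 0; }
        public float getProgress() { return 0f; }
        public void close() { }
      };
    }
  }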

*MetaHook:*
The Hive metastore has to store the mapping between Hive columns and
Hypertable columns, as well as between Hypertable column family names and
column family ids (i.e., the HT table schema). The table id and schema
generation number will have to be passed into the Input/Output formats to
detect table alteration. If the table has been altered, the Hive table
needs to be dropped and recreated.
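
A skeleton against Hive's HiveMetaHook interface (again, the mechanism the
HBase handler uses); the table-property keys in the comments are
hypothetical:

  import org.apache.hadoop.hive.metastore.HiveMetaHook;
  import org.apache.hadoop.hive.metastore.api.MetaException;
  import org.apache.hadoop.hive.metastore.api.Table;

  public class HypertableMetaHook implements HiveMetaHook {

    public void preCreateTable(Table tbl) throws MetaException {
      // Validate the column mapping, resolve CF names to CF ids, and
      // record the HT table id and schema generation as table properties,
      // e.g. "hypertable.table.id" / "hypertable.schema.generation"
      // (hypothetical keys), so the Input/Output formats can check them.
    }

    public void commitCreateTable(Table tbl) throws MetaException { }
    public void rollbackCreateTable(Table tbl) throws MetaException { }
    public void preDropTable(Table tbl) throws MetaException { }
    public void commitDropTable(Table tbl, boolean deleteData) throws MetaException { }
    public void rollbackDropTable(Table tbl) throws MetaException { }
  }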

Thoughts?


-Sanjit
