[ 
https://issues.apache.org/jira/browse/PIG-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987839#action_12987839
 ] 

Bill Graham commented on PIG-1782:
----------------------------------

Assigning this to myself since I have a working patch, but the design of this 
approach needs further vetting.

One issue is that the number of columns per family per row is not constant, so 
with a sparse table you'd have no idea which column name goes with each value in 
the returned tuple. Another issue is that in HBase the column name itself is 
often dynamic, descriptive data, and a cell can have multiple timestamped 
values.

* Option A:
Instead of returning a tuple of values, the load can return a tuple of tuples. 
Each inner tuple is a two-tuple that contains the column descriptor and the 
most recent value. This data structure would be returned if a 'cf:' style 
column exists in the column list; explicit column names would keep the current 
default behavior. This is the simplest approach.
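
As a rough illustration of the Option A shape, here is a plain Python sketch. 
It is not HBaseStorage code: the function name, the dict-based row, and the 
sample values are all made up for the example, and columns are sorted only to 
make the output predictable.

```python
# Hypothetical sketch of Option A; names and sample data are assumptions,
# not part of HBaseStorage.

def latest_per_column(row):
    """row maps a column descriptor to a list of (value, timestamp) pairs."""
    result = []
    for column in sorted(row):
        # Keep only the value with the highest timestamp for this column.
        value, _ts = max(row[column], key=lambda cell: cell[1])
        result.append((column, value))
    return tuple(result)

# A sparse row: only two of the possible cpu: columns are present.
sparse_row = {
    "cpu:sys.0": [("0.12", 1000), ("0.15", 2000)],
    "cpu:user.1": [("0.40", 1500)],
}
option_a = latest_per_column(sparse_row)
```

Because each value travels with its column descriptor, a sparse row no longer 
loses track of which column a value came from.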

* Option B:
Build out an even richer (and more complex) data structure that also takes into 
account multiple values and their timestamps: tuples nested four levels deep 
that capture the entire HBase KeyValue data structure. Something like this:

{code}
(
 ( column name, ( (value, ts), ... ) ), ...
)
{code}
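
To make the nesting concrete, here is a plain Python sketch of the Option B 
shape. Again this is only an assumption-laden illustration: the function name, 
the dict-based row, and the newest-first ordering of versions are choices made 
for the example, not anything decided in the patch.

```python
# Hypothetical sketch of Option B; names, sample data and the newest-first
# ordering are assumptions, not part of HBaseStorage.

def all_versions(row):
    """row maps a column descriptor to a list of (value, timestamp) pairs."""
    result = []
    for column in sorted(row):
        # Keep every (value, ts) pair, ordered newest first.
        versions = tuple(
            sorted(row[column], key=lambda cell: cell[1], reverse=True)
        )
        result.append((column, versions))
    return tuple(result)

sparse_row = {
    "cpu:sys.0": [("0.12", 1000), ("0.15", 2000)],
    "cpu:user.1": [("0.40", 1500)],
}
option_b = all_versions(sparse_row)
```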

Either way, the variable-length tuples returned for each row, which themselves 
contain variable-length tuples, would probably require a number of custom 
UDFs to do anything useful with variable column names and multiple timestamped 
values. 
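
For a sense of what such a helper might do, here is a Python sketch of a 
lookup over an Option A style tuple of (column, value) pairs. A real UDF would 
be written against Pig's Java API; the function name and the sample tuple here 
are hypothetical.

```python
# Hypothetical lookup helper over (column, value) pairs; the name and data
# are assumptions used only for illustration.

def value_for(pairs, column, default=None):
    """Return the value paired with `column`, or `default` if it is absent."""
    for name, value in pairs:
        if name == column:
            return value
    return default

row_tuple = (("cpu:sys.0", "0.15"), ("cpu:user.1", "0.40"))
user_1 = value_for(row_tuple, "cpu:user.1")
missing = value_for(row_tuple, "cpu:sys.1")
```

The default covers the sparse case, where a column present in one row is 
simply missing from another.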

I guess I lean towards option B so we can support more use cases down the road 
with this refactor. Other opinions?

> Add ability to load data by column family in HBaseStorage
> ---------------------------------------------------------
>
>                 Key: PIG-1782
>                 URL: https://issues.apache.org/jira/browse/PIG-1782
>             Project: Pig
>          Issue Type: New Feature
>         Environment: Java 6, Mac OS X 10.6
>            Reporter: Eric Yang
>            Assignee: Bill Graham
>
> It would be nice to load all columns in a column family by using shorthand 
> syntax like:
> {noformat}
> CpuMetrics = load 'hbase://SystemMetrics' USING 
> org.apache.pig.backend.hadoop.hbase.HBaseStorage('cpu:','-loadKey');
> {noformat}
> Assuming there are columns cpu:sys.0, cpu:sys.1, cpu:user.0, cpu:user.1 in the 
> cpu column family, 
> CpuMetrics would contain something like:
> {noformat}
> (rowKey, cpu:sys.0, cpu:sys.1, cpu:user.0, cpu:user.1)
> {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
