[ https://issues.apache.org/jira/browse/PIG-6?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12648582#action_12648582 ]
Samuel Guo commented on PIG-6:
------------------------------

My ideas about this issue.

** Load from / Store into Table **

* Target *

Let Pig load from / store into tables in bigtable-like systems (such as HBase, Hypertable, and maybe Cassandra in the future).

* Grammar *

{code}
<tableloadclause>  := <LOAD> "TABLE" <tablepath> "PROJECTION" <projections_list> <AS> <schema>
<tablestoreclause> := <STORE> <IDENTIFIER> "PROJECTION" <INTO> "TABLE" <tablepath> <projections_list>
<projections_list> := <projection> ["," <projections_list>]
<projection>       := "'" <string> ":" <string> ":" <string> "'"
<tablepath>        := "'" <string> ":" <string> "'"
{code}

<tablepath> is formed of two parts: "schema" and "tablename". "schema" identifies the system the table lives in; it may be "hbase", "hypertable", or another system.
<projection> is formed of three parts: "column_family_name", "column_name", and "timestamp".

* Examples *

An example is below:

{code}
-- load the table 'table1' from 'hbase'
-- project "family1:column1"'s content at timestamp1 to field1
-- project all of "family2:"'s contents at timestamp2 to field2
-- project "family3:"'s latest content to field3
A = LOAD TABLE 'hbase:table1' PROJECTION 'family1:column1:timestamp1', 'family2::timestamp2', 'family3::' AS (field1: chararray, field2: tuple, field3: tuple);

-- do some operation over A
B = ...A;

-- store B into 'hbase' as table 'table2'
-- project B.$1 to 'family1:column1' with the system's current timestamp
-- project B.$2 to 'family2:column2' with timestamp v2
STORE B PROJECTION INTO TABLE 'hbase:table2' 'family1:column1:', 'family2:column2:v2';
{code}

* Data I/O over Table *

First, we need a custom data storage to do the table data I/O. Something like:

{code}
public interface TableDataStorage extends DataStorage {
}
{code}

*TableDataStorage* abstracts over all the bigtable-like systems. For HBase, we can construct an HBase data storage:

{code}
public class HbaseDataStorage implements TableDataStorage {
}
{code}

and for Hypertable, we may have a different data storage:

{code}
public class HypertableDataStorage implements TableDataStorage {
}
{code}

* MapReduce Stuff *

Because a table is different from a file, we may need a different slice interface. Something like:

{code}
public interface TableSlice extends Serializable {
    // get the locations of this slice
    String[] getLocations();
    // initialize the data storage
    void init(TableDataStorage store) throws IOException;
    // get the name of the table this slice belongs to
    byte[] getTableName();
    // get the start row of this slice in the table
    byte[] getStartRow();
    // get the end row of this slice in the table
    byte[] getEndRow();
    // get the current row of this slice in the table
    byte[] getCurRow();
    // get the progress
    float getProgress() throws IOException;
    // get the next tuple
    boolean next(Tuple value) throws IOException;
}
{code}

And we need a related table slicer:

{code}
public interface TableSlicer {
    void validate(TableDataStorage store, String location) throws IOException;
    TableSlice[] slice(TableDataStorage store, String location) throws IOException;
}
{code}

Finally, we need the InputFormat, OutputFormat, and RecordReader for running map/reduce over a table; two sketches of how these pieces might fit together follow below.
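To make the slicing side concrete, here is a minimal sketch of a TableSlicer that cuts a table into one slice per contiguous row range, so each map task scans rows that are stored together. It builds only on the TableSlice/TableSlicer interfaces above; RowRangeSlice and the tableExists()/getRegionStartKeys() helpers on TableDataStorage are hypothetical names for this sketch, not existing Pig or HBase API.

{code}
import java.io.IOException;

// Sketch of a slicer for an HBase-like table. An HBase-backed
// TableDataStorage would answer getRegionStartKeys() by reading the
// region boundaries from the META table, giving one slice per region.
public class RowRangeSlicer implements TableSlicer {

    public void validate(TableDataStorage store, String location)
            throws IOException {
        // location is the <tablepath> string, e.g. 'hbase:table1'
        if (!store.tableExists(location)) {
            throw new IOException("table not found: " + location);
        }
    }

    public TableSlice[] slice(TableDataStorage store, String location)
            throws IOException {
        byte[][] starts = store.getRegionStartKeys(location);
        TableSlice[] slices = new TableSlice[starts.length];
        for (int i = 0; i < starts.length; i++) {
            // each slice covers [startRow, endRow); the last slice has
            // an open end (null end row)
            byte[] end = (i + 1 < starts.length) ? starts[i + 1] : null;
            slices[i] = new RowRangeSlice(location, starts[i], end);
        }
        return slices;
    }
}
{code}

Here RowRangeSlice would be the matching TableSlice implementation: it carries (tablename, startRow, endRow), opens a scanner over that range in init(), and fills tuples in next().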
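On the reading side, the RecordReader is mostly a thin adapter that drives a TableSlice. Below is a sketch against Hadoop's old org.apache.hadoop.mapred API; how the slice and the TableDataStorage reach the reader (normally via an InputSplit wrapper and the job conf) is left out as an assumption.

{code}
import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Sketch of the record reader: all real work happens inside the slice.
public class TableRecordReader implements RecordReader<NullWritable, Tuple> {

    private final TableSlice slice;

    public TableRecordReader(TableSlice slice, TableDataStorage store)
            throws IOException {
        this.slice = slice;
        slice.init(store);            // open the connection to the table
    }

    public boolean next(NullWritable key, Tuple value) throws IOException {
        return slice.next(value);     // the slice fills in the next row
    }

    public NullWritable createKey() {
        return NullWritable.get();
    }

    public Tuple createValue() {
        // tuple construction is version-dependent; with TupleFactory-era
        // Pig it looks like this
        return TupleFactory.getInstance().newTuple();
    }

    public long getPos() throws IOException {
        return 0;                     // progress is tracked by rows, not bytes
    }

    public float getProgress() throws IOException {
        return slice.getProgress();   // delegate to the slice's row position
    }

    public void close() throws IOException {
        // a real implementation would release the table scanner here
    }
}
{code}

The corresponding table InputFormat would call the TableSlicer in getSplits() and hand each resulting slice to one of these readers; the OutputFormat would do the mirror-image work for store.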
* Pig Translation *

Pig's translation is currently divided into three steps:

First: parser -> logical plan;
Second: logical plan -> physical plan;
Last: physical plan -> map/reduce plan.

In the first two steps, we just need to add operators analogous to the ones file load/store uses:

LOLoad -> LOTableLoad
POLoad -> POTableLoad
LOStore -> LOTableStore
POStore -> POTableStore

The difference is in the last step. When we construct a map/reduce job that contains a table load/store operation, we should use the table's map/reduce machinery (InputFormat, OutputFormat, and so on) to build the job. Load/store between jobs keeps using temp files, so a Pig script using table load/store runs like:

source-table --> Job1 (table InputFormat) --> tempfiles (PigInputFormat/PigOutputFormat) --> Job2 --> ... --> JobN --> target-table (table OutputFormat)

* Other Problems *

There may be further optimization questions around using tables for data processing. They are left out of this solution to keep it clear. Comments welcome :-)

> Addition of Hbase Storage Option In Load/Store Statement
> --------------------------------------------------------
>
>                 Key: PIG-6
>                 URL: https://issues.apache.org/jira/browse/PIG-6
>             Project: Pig
>          Issue Type: New Feature
>         Environment: all environments
>            Reporter: Edward J. Yoon
>
> It needs to be able to load a full table from HBase. (Maybe ... difficult? I'm not sure yet.)
> Also, as described below, it needs to compose an abstract 2d-table holding only certain data filtered from the HBase array structure, using an arbitrary query delimiter.
> {code}
> A = LOAD table('hbase_table');
> or
> B = LOAD table('hbase_table') Using HbaseQuery('Query-delimited by attributes & timestamp') as (f1, f2[, f3]);
> {code}
> Once testing is done on my local machines, I will clarify the grammar and give you more examples to help explain more storage options.
> Any advice welcome.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.