[ https://issues.apache.org/jira/browse/PIG-6?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12648582#action_12648582 ]
Samuel Guo commented on PIG-6:
------------------------------

My ideas about this issue.

** Load from / Store into Table **

* Target *

Let Pig load from / store into tables in bigtable-like systems (such as HBase, Hypertable, and maybe Cassandra in the future).

* Grammar *

{code}
<tableloadclause>  := <LOAD> "TABLE" <tablepath> "PROJECTION" <projections_list> <AS> <schema>
<tablestoreclause> := <STORE> <IDENTIFIER> "PROJECTION" <INTO> "TABLE" <tablepath> <projections_list>
<projections_list> := <projection> ["," <projections_list>]
<projection>       := "'" <string> ":" <string> ":" <string> "'"
<tablepath>        := "'" <string> ":" <string> "'"
{code}

<tablepath> is formed of two parts: "schema" and "tablename". "schema" identifies the system the table lives in; it may be "hbase", "hypertable", or another system.
<projection> is formed of three parts: "column_family_name", "column_name", and "timestamp".

* Examples *

An example is below:

{code}
-- load the table 'table1' from 'hbase'
-- project "family1:column1"'s content at timestamp1 to field1
-- project all of "family2:"'s contents at timestamp2 to field2
-- project "family3:"'s latest content to field3
A = LOAD TABLE 'hbase:table1' PROJECTION 'family1:column1:timestamp1', 'family2::timestamp2', 'family3::' AS (field1: chararray, field2: tuple, field3: tuple);

-- do some operation over A
B = ...A;

-- store B into 'hbase' as table 'table2'
-- project B.$1 to 'family1:column1' with the system's current timestamp
-- project B.$2 to 'family2:column2' with timestamp v2
STORE B PROJECTION INTO TABLE 'hbase:table2' 'family1:column1:', 'family2:column2:v2';
{code}

* Data I/O over Table *

First, we need a custom data storage to do the table data I/O. Something like:

{code}
public interface TableDataStorage extends DataStorage {
}
{code}

*TableDataStorage* abstracts over all the bigtable-like systems. For HBase, we can construct an HBase data storage:

{code}
public class HbaseDataStorage implements TableDataStorage {
}
{code}

and for Hypertable, we may have a different data storage:

{code}
public class HypertableDataStorage implements TableDataStorage {
}
{code}

* MapReduce Stuff *

Because a table is different from a file, we may need a different slice interface. Something like:

{code}
public interface TableSlice extends Serializable {
    // get the locations of this slice
    String[] getLocations();
    // initialize the data storage
    void init(TableDataStorage store) throws IOException;
    // get the name of the table this slice belongs to
    byte[] getTableName();
    // get the start row of this slice in the table
    byte[] getStartRow();
    // get the end row of this slice in the table
    byte[] getEndRow();
    // get the current row of this slice in the table
    byte[] getCurRow();
    // get the progress
    float getProgress() throws IOException;
    // get the next tuple
    boolean next(Tuple value) throws IOException;
}
{code}

And we need a related table slicer:

{code}
public interface TableSlicer {
    void validate(TableDataStorage store, String location) throws IOException;
    TableSlice[] slice(TableDataStorage store, String location) throws IOException;
}
{code}

Finally, we need the InputFormat, OutputFormat, and RecordReader for running map/reduce over a table; two sketches of how these pieces might fit together follow below.
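To make the slicing side concrete, here is a minimal sketch of a TableSlicer that cuts a table into one slice per contiguous row range, so each map task scans rows that are stored together. It builds only on the TableSlice/TableSlicer interfaces above; RowRangeSlice and the tableExists()/getRegionStartKeys() helpers on TableDataStorage are hypothetical names for this sketch, not existing Pig or HBase API.

{code}
import java.io.IOException;

// Sketch of a slicer for an HBase-like table. An HBase-backed
// TableDataStorage would answer getRegionStartKeys() by reading the
// region boundaries from the META table, giving one slice per region.
public class RowRangeSlicer implements TableSlicer {

    public void validate(TableDataStorage store, String location)
            throws IOException {
        // location is the <tablepath> string, e.g. 'hbase:table1'
        if (!store.tableExists(location)) {
            throw new IOException("table not found: " + location);
        }
    }

    public TableSlice[] slice(TableDataStorage store, String location)
            throws IOException {
        byte[][] starts = store.getRegionStartKeys(location);
        TableSlice[] slices = new TableSlice[starts.length];
        for (int i = 0; i < starts.length; i++) {
            // each slice covers [startRow, endRow); the last slice has
            // an open end (null end row)
            byte[] end = (i + 1 < starts.length) ? starts[i + 1] : null;
            slices[i] = new RowRangeSlice(location, starts[i], end);
        }
        return slices;
    }
}
{code}

Here RowRangeSlice would be the matching TableSlice implementation: it carries (tablename, startRow, endRow), opens a scanner over that range in init(), and fills tuples in next().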
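On the reading side, the RecordReader is mostly a thin adapter that drives a TableSlice. Below is a sketch against Hadoop's old org.apache.hadoop.mapred API; how the slice and the TableDataStorage reach the reader (normally via an InputSplit wrapper and the job conf) is left out as an assumption.

{code}
import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Sketch of the record reader: all real work happens inside the slice.
public class TableRecordReader implements RecordReader<NullWritable, Tuple> {

    private final TableSlice slice;

    public TableRecordReader(TableSlice slice, TableDataStorage store)
            throws IOException {
        this.slice = slice;
        slice.init(store);            // open the connection to the table
    }

    public boolean next(NullWritable key, Tuple value) throws IOException {
        return slice.next(value);     // the slice fills in the next row
    }

    public NullWritable createKey() {
        return NullWritable.get();
    }

    public Tuple createValue() {
        // tuple construction is version-dependent; with TupleFactory-era
        // Pig it looks like this
        return TupleFactory.getInstance().newTuple();
    }

    public long getPos() throws IOException {
        return 0;                     // progress is tracked by rows, not bytes
    }

    public float getProgress() throws IOException {
        return slice.getProgress();   // delegate to the slice's row position
    }

    public void close() throws IOException {
        // a real implementation would release the table scanner here
    }
}
{code}

The corresponding table InputFormat would call the TableSlicer in getSplits() and hand each resulting slice to one of these readers; the OutputFormat would do the mirror-image work for store.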
* Pig Translation *

Pig's translation is currently divided into three steps:

First: parser -> logical plan;
Second: logical plan -> physical plan;
Last: physical plan -> map/reduce plan.

In the first two steps, we just need to add operators analogous to the ones file load/store uses:

LOLoad -> LOTableLoad
POLoad -> POTableLoad
LOStore -> LOTableStore
POStore -> POTableStore

The difference is in the last step. When we construct a map/reduce job that contains a table load/store operation, we should use the table's map/reduce machinery (InputFormat, OutputFormat, and so on) to build the job. Load/store between jobs keeps using temp files, so a Pig script using table load/store runs like:

source-table --> Job1 (table InputFormat) --> tempfiles (PigInputFormat/PigOutputFormat) --> Job2 --> ... --> JobN --> target-table (table OutputFormat)

* Other Problems *

There may be further optimization questions around using tables for data processing. They are left out of this solution to keep it clear. Comments welcome :-)

> Addition of Hbase Storage Option In Load/Store Statement
> --------------------------------------------------------
>
>                 Key: PIG-6
>                 URL: https://issues.apache.org/jira/browse/PIG-6
>             Project: Pig
>          Issue Type: New Feature
>         Environment: all environments
>            Reporter: Edward J. Yoon
>
> It needs to be able to load a full table from HBase. (Maybe ... difficult? I'm not sure yet.)
> Also, as described below, it needs to compose an abstract 2d-table holding only certain data filtered from the HBase array structure, using an arbitrary query delimiter.
> {code}
> A = LOAD table('hbase_table');
> or
> B = LOAD table('hbase_table') Using HbaseQuery('Query-delimited by attributes & timestamp') as (f1, f2[, f3]);
> {code}
> Once testing is done on my local machines, I will clarify the grammar and give you more examples to help explain more storage options.
> Any advice welcome.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.