[jira] [Commented] (OOZIE-1123) EL Functions for hcatalog

Rohini Palaniswamy (JIRA) Sun, 09 Dec 2012 11:15:22 -0800

    [ https://issues.apache.org/jira/browse/OOZIE-1123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13527589#comment-13527589 ]


Rohini Palaniswamy commented on OOZIE-1123:
-------------------------------------------

Consider the Oozie coordinator definition below:

{noformat}

<datasets>
  <dataset name="5mLogs" frequency="${coord:minutes(15)}"
           initial-instance="2009-01-01T01:00Z" timezone="UTC">
    
    <uri-template>hcat://bar:9000/db1/table1/dt=${YEAR}${MONTH}${DAY}${HOUR}${MINUTE},region=us</uri-template>
  </dataset>
.
  <dataset name="hourlyLogs" frequency="${coord:hours(1)}"
           initial-instance="2009-01-01T02:00Z" timezone="UTC">
    
    <uri-template>hdfs://bar:9000/db2/table2/dt=${YEAR}${MONTH}${DAY}${HOUR},region=us</uri-template>
  </dataset>
</datasets>

<input-events>
  <data-in name="inputLogs" dataset="hourlyLogs">
    <start-instance>${coord:current(-3)}</start-instance>
    <end-instance>${coord:current(0)}</end-instance>
  </data-in>
</input-events>
<output-events>
  <data-out name="outputLogs" dataset="dailyLogs">
    <instance>${coord:current(0)}</instance>
  </data-out>
</output-events>

{noformat} 
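
To make the instance resolution concrete, here is a rough Python sketch of how a uri-template could expand for coord:current(-3) through coord:current(0). It is only an illustration, not Oozie code: resolve_instances and its parameters are made-up names, the db1/table1 hourly URI is a hypothetical example, and the sketch assumes the nominal time already lines up with a dataset instance.

```python
from datetime import datetime, timedelta


def resolve_instances(template, nominal, frequency, start_offset, end_offset):
    """Expand a uri-template for current(start_offset)..current(end_offset).

    Hypothetical helper: assumes `nominal` is already aligned to a dataset
    instance, so current(n) is simply nominal + n * frequency.
    """
    uris = []
    for n in range(start_offset, end_offset + 1):
        t = nominal + n * frequency
        uri = (template
               .replace("${YEAR}", f"{t.year:04d}")
               .replace("${MONTH}", f"{t.month:02d}")
               .replace("${DAY}", f"{t.day:02d}")
               .replace("${HOUR}", f"{t.hour:02d}")
               .replace("${MINUTE}", f"{t.minute:02d}"))
        uris.append(uri)
    return uris


# Hypothetical hourly hcat dataset, four instances ending at the nominal time.
example_uris = resolve_instances(
    "hcat://bar:9000/db1/table1/dt=${YEAR}${MONTH}${DAY}${HOUR},region=us",
    nominal=datetime(2012, 1, 1, 3, 0),
    frequency=timedelta(hours=1),
    start_offset=-3,
    end_offset=0,
)
for uri in example_uris:
    print(uri)
```

Each resolved URI carries one partition spec (dt=...,region=us), which is the raw material the EL functions proposed below would work from.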

Here are the two most common ways in which HCatLoader 
(http://incubator.apache.org/hcatalog/docs/r0.4.0/loadstore.html#HCatLoader) and 
HCatStorer will be used.

Case 1:

{noformat}
A = load db1.table1 using HCatLoader();
B = FILTER A BY ((dt = '201201010015' AND region = 'us') OR
                 (dt = '201201010030' AND region = 'us') OR
                 (dt = '201201010045' AND region = 'us') OR
                 (dt = '201201010100' AND region = 'us'));
C = store B into db2.table2 using HCatStorer('dt=20120103,region=us');
{noformat}


Case 2:
{noformat}
A = load db1.table1 using HCatLoader();
B = FILTER A BY dt >= '201201010015' AND dt < '201201010100' AND region='us';
C = store B into db2.table2 using HCatStorer('dt=20120103,region=us');
{noformat}


The functions needed, based on the above example, are listed below. For ease of 
understanding, the EL functions are used directly in the Pig script.
1) database(name) - the database name for a given data-in or data-out event name. 
The return values would be db1 and db2 respectively.
2) table(name) - the table name for a given data-in or data-out event name. 
The return values would be table1 and table2 respectively.
{noformat}
A = load ${coord:database(inputLogs)}.${coord:table(inputLogs)} using 
HCatLoader();
C = store B into ${coord:database(outputLogs)}.${coord:table(outputLogs)} 
using HCatStorer('dt=20120103,region=us');
{noformat}
  Database and table have already been added. 
3) dataInPartitionFilter(name) - a filter clause for a data-in event. For example, 
${coord:dataInPartitionFilter(inputLogs)} would return:
{noformat}
((dt = '20120101' AND region = 'us') OR (dt = '20120102' AND region = 'us') OR 
(dt = '20120103' AND region = 'us'))
{noformat}
  Currently dataIn has been modified to return this for hcat URIs. I would prefer 
to keep dataIn returning the comma-separated URI list, as it does for hdfs, to be 
consistent. That would also be easy to use in MapReduce jobs.
4) dataOutPartition(eventname, partitionName) - an EL function to get the value 
of a partition given a data-out event. For example, 
${coord:dataOutPartition(outputLogs,dt)}:
{noformat}
C = store B into db2.table2 using 
HCatStorer('dt=${coord:dataOutPartition(outputLogs,dt)},region=${coord:dataOutPartition(outputLogs,region)}');
{noformat}
5) dataInPartitionMin(eventname, partitionName)
6) dataInPartitionMax(eventname, partitionName) - get the minimum and maximum 
values of a partition for use in range queries, since filter statements built 
from just ==/AND/OR become inefficient as the number of instances increases.
{noformat}
B = FILTER A BY dt >= '${coord:dataInPartitionMin(inputLogs,dt)}' AND dt < 
'${coord:dataInPartitionMax(inputLogs,dt)}' AND region='us';
{noformat}
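
As a rough illustration of what functions 3) to 6) would compute, here is a small Python sketch that derives the values from a list of resolved hcat URIs. This is not the Oozie implementation: the helper names (parse_hcat_uri, partition_filter, partition_min, partition_max, partition_value) are made up for this example, and it assumes URIs of the form hcat://host:port/db/table/k1=v1,k2=v2 as used above. The min and max rely on plain string comparison, which only works when the partition values are fixed-width, zero-padded strings such as dt.

```python
from urllib.parse import urlsplit


def parse_hcat_uri(uri):
    """Split hcat://host:port/db/table/k1=v1,k2=v2 into (db, table, partitions)."""
    db, table, spec = urlsplit(uri).path.lstrip("/").split("/", 2)
    partitions = dict(kv.split("=", 1) for kv in spec.split(","))
    return db, table, partitions


def partition_filter(uris):
    """One AND clause per instance, OR-ed together (item 3)."""
    clauses = []
    for uri in uris:
        _, _, parts = parse_hcat_uri(uri)
        clause = " AND ".join(f"{key} = '{value}'" for key, value in parts.items())
        clauses.append(f"({clause})")
    return "(" + " OR ".join(clauses) + ")"


def partition_value(uri, key):
    """Value of one partition key for a single data-out instance (item 4)."""
    return parse_hcat_uri(uri)[2][key]


def partition_min(uris, key):
    """Smallest value of a partition key across data-in instances (item 5)."""
    return min(parse_hcat_uri(uri)[2][key] for uri in uris)


def partition_max(uris, key):
    """Largest value of a partition key across data-in instances (item 6)."""
    return max(parse_hcat_uri(uri)[2][key] for uri in uris)


# Hypothetical resolved data-in instances.
input_uris = [
    "hcat://bar:9000/db1/table1/dt=20120101,region=us",
    "hcat://bar:9000/db1/table1/dt=20120102,region=us",
    "hcat://bar:9000/db1/table1/dt=20120103,region=us",
]
print(partition_filter(input_uris))
print(partition_min(input_uris, "dt"), partition_max(input_uris, "dt"))
```

The OR-ed filter grows by one clause per instance, while the min/max pair stays constant in size, which is the motivation for 5) and 6).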

dataInPartitionMin and dataInPartitionMax become very complicated if, instead of 
a single dt partition, the data is partitioned at a finer grain into year, month, 
dt and hour. Even without min and max, the filter becomes very complicated because 
it has to take month, date and hour boundaries into account. I am assuming those 
cases are out of scope for the first iteration and that dataInPartitionFilter 
needs to be used for them. 
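
The boundary problem can be seen with a small Python sketch. It takes four hourly instances that cross a month boundary, computes a separate min and max for each partition key, and shows that a range filter built from those bounds also matches partitions that were never requested. The partition keys (year, month, day, hour) and the helper names are assumptions made only for this illustration.

```python
from datetime import datetime, timedelta


def instance_partitions(start, count, step):
    """Partition values for `count` instances, assuming year/month/day/hour keys."""
    partitions = []
    for i in range(count):
        t = start + i * step
        partitions.append({
            "year": f"{t.year:04d}",
            "month": f"{t.month:02d}",
            "day": f"{t.day:02d}",
            "hour": f"{t.hour:02d}",
        })
    return partitions


def per_key_bounds(partitions):
    """Independent (min, max) for every partition key."""
    return {
        key: (min(p[key] for p in partitions), max(p[key] for p in partitions))
        for key in partitions[0]
    }


def matches(partition, bounds):
    """True if every key of `partition` falls inside its own [min, max] range."""
    return all(low <= partition[key] <= high for key, (low, high) in bounds.items())


# Four hourly instances from 2012-01-31 22:00 to 2012-02-01 01:00.
wanted = instance_partitions(datetime(2012, 1, 31, 22), 4, timedelta(hours=1))
bounds = per_key_bounds(wanted)
print(bounds)

# A partition in the middle of January that is not one of the four instances.
unrelated = {"year": "2012", "month": "01", "day": "15", "hour": "12"}
print(matches(unrelated, bounds))
```

Because the hour range wraps from 23 to 00 and the day range from 31 to 01, the per-key bounds cover every hour of every day in both months, so a correct range filter would need explicit boundary handling.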
                
> EL Functions for hcatalog
> -------------------------
>
>                 Key: OOZIE-1123
>                 URL: https://issues.apache.org/jira/browse/OOZIE-1123
>             Project: Oozie
>          Issue Type: Sub-task
>            Reporter: Rohini Palaniswamy
>            Assignee: Mona Chitnis
>             Fix For: trunk
>
>
>   Need new EL functions for hcatalog to support getting the partition filter and 
> min and max values for use in pig scripts. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
