[
https://issues.apache.org/jira/browse/OOZIE-1123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13527589#comment-13527589
]
Rohini Palaniswamy commented on OOZIE-1123:
-------------------------------------------
Consider the Oozie coordinator definition below:
{noformat}
<datasets>
<dataset name="5mLogs" frequency="${coord:minutes(15)}"
initial-instance="2009-01-01T01:00Z" timezone="UTC">
<uri-template>hcat://bar:9000/db1/table1/dt=${YEAR}${MONTH}${DAY}${HOUR}${MINUTE},region=us</uri-template>
</dataset>
.
<dataset name="hourlyLogs" frequency="${coord:hours(1)}"
initial-instance="2009-01-01T02:00Z" timezone="UTC">
<uri-template>hcat://bar:9000/db2/table2/dt=${YEAR}${MONTH}${DAY}${HOUR},region=us</uri-template>
</dataset>
</datasets>
<input-events>
<data-in name="inputLogs" dataset="5mLogs">
<start-instance>${coord:current(-3)}</start-instance>
<end-instance>${coord:current(0)}</end-instance>
</data-in>
</input-events>
<output-events>
<data-out name="outputLogs" dataset="hourlyLogs">
<instance>${coord:current(0)}</instance>
</data-out>
</output-events>
{noformat}
Here are the two most common use cases for HCatLoader
(http://incubator.apache.org/hcatalog/docs/r0.4.0/loadstore.html#HCatLoader) and
HCatStorer.
Case 1:
{noformat}
A = load db1.table1 using HCatLoader();
B = FILTER A BY ((dt = '201201010015' AND region = 'us') OR (dt =
'201201010030' AND region = 'us') OR (dt = '201201010045' AND region = 'us') OR
(dt = '201201010100' AND region = 'us'));
C = store B into db2.table2 using HCatStorer('dt=20120103,region=us');
{noformat}
Case 2:
{noformat}
A = load db1.table1 using HCatLoader();
B = FILTER A BY dt >= '201201010015' AND dt < '201201010100' AND region='us';
C = store B into db2.table2 using HCatStorer('dt=20120103,region=us');
{noformat}
The functions needed for the above examples are listed below. For ease of
understanding, the EL functions are used directly in the pig script.
1) database(name) - db name given data-in or data-out event name. Return values
would be db1 and db2 respectively
2) table(name) - table name given data-in or data-out event name. Return values
would be table1 and table2 respectively
{noformat}
A = load ${coord:database(inputLogs)}.${coord:table(inputLogs)} using
HCatLoader();
C = store B into ${coord:database(outputLogs)}.${coord:table(outputLogs)}
using HCatStorer('dt=20120103,region=us');
{noformat}
Database and table have already been added.
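To make the expected return values concrete, here is a rough Python sketch (not
Oozie code) of how the database and table could be pulled out of a resolved
hcat URI. The name parse_hcat_uri and the assumption that the URI path always
has the form /db/table/partitions are mine, for illustration only.

```python
from urllib.parse import urlparse


def parse_hcat_uri(uri):
    """Split a resolved hcat URI into (database, table, partitions).

    Hypothetical helper: assumes the path is /db/table/key1=v1,key2=v2.
    """
    path = urlparse(uri).path.strip("/")
    parts = path.split("/", 2)
    if len(parts) < 2:
        raise ValueError("expected /db/table[/partitions] in " + uri)
    database, table = parts[0], parts[1]
    partitions = {}
    if len(parts) == 3 and parts[2]:
        # The partition spec is a comma separated list of key=value pairs.
        for pair in parts[2].split(","):
            key, _, value = pair.partition("=")
            partitions[key] = value
    return database, table, partitions


db, table, partitions = parse_hcat_uri(
    "hcat://bar:9000/db1/table1/dt=201201010015,region=us")
print(db, table, partitions)
```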
3) dataInPartitionFilter(name) - A filter clause for a data-in event. For example:
${coord:dataInPartitionFilter(inputLogs)}
{noformat}
((dt = '20120101' AND region = 'us') OR (dt = '20120102' AND region = 'us') OR
(dt = '20120103' AND region = 'us'))
{noformat}
Currently dataIn has been modified to return this filter for hcat URIs. I would
prefer to keep dataIn returning the comma-separated URI list, as it does for
hdfs, to be consistent. That would also be easy to use in mapreduce jobs.
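As a rough illustration of the filter string proposed for
dataInPartitionFilter, the Python sketch below builds it from the resolved URIs
of a data-in event. It is not Oozie code; the function name partition_filter
and the assumption that the last path segment holds the key=value partition
spec are hypothetical.

```python
def partition_filter(uris):
    """Build a Pig filter clause from resolved hcat URIs.

    Hypothetical helper: assumes each URI ends in /key1=v1,key2=v2.
    """
    clauses = []
    for uri in uris:
        spec = uri.rsplit("/", 1)[1]  # e.g. "dt=20120101,region=us"
        terms = []
        for pair in spec.split(","):
            key, _, value = pair.partition("=")
            terms.append("{} = '{}'".format(key, value))
        # One parenthesised AND group per instance.
        clauses.append("(" + " AND ".join(terms) + ")")
    # Instances are alternatives, so the groups are ORed together.
    return "(" + " OR ".join(clauses) + ")"


uris = [
    "hcat://bar:9000/db1/table1/dt=20120101,region=us",
    "hcat://bar:9000/db1/table1/dt=20120102,region=us",
    "hcat://bar:9000/db1/table1/dt=20120103,region=us",
]
print(partition_filter(uris))
```

The function takes a list rather than a single comma-separated string because
the partition spec inside each hcat URI itself contains commas.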
4) dataOutPartition(eventname, partitionName) - An EL function to get the value
of a partition given a data-out event. For example:
${coord:dataOutPartition(outputLogs,dt)}
{noformat}
C = store B into db2.table2 using
HCatStorer('dt=${coord:dataOutPartition(outputLogs,dt)},region=${coord:dataOutPartition(outputLogs,region)}');
{noformat}
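The lookup that dataOutPartition would perform can be sketched in Python as
follows. This is only an illustration under my own assumptions (the name
data_out_partition and the /key=value,key=value URI suffix), not the actual
implementation.

```python
def data_out_partition(uri, partition_name):
    """Return the value of one partition key from a resolved hcat URI.

    Hypothetical helper: assumes the URI ends in /key1=v1,key2=v2.
    """
    spec = uri.rsplit("/", 1)[1]
    for pair in spec.split(","):
        key, _, value = pair.partition("=")
        if key == partition_name:
            return value
    raise KeyError(partition_name)


out_uri = "hcat://bar:9000/db2/table2/dt=20120103,region=us"
storer_arg = "dt={},region={}".format(
    data_out_partition(out_uri, "dt"),
    data_out_partition(out_uri, "region"))
print(storer_arg)  # dt=20120103,region=us
```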
5) dataInPartitionMin(eventname, partitionName)
6) dataInPartitionMax(eventname, partitionName) - Get the minimum and maximum
partition values for range queries, since filter statements built from just
==/AND/OR become inefficient as the number of instances increases.
{noformat}
B = FILTER A BY dt >= '${coord:dataInPartitionMin(inputLogs,dt)}' AND dt <
'${coord:dataInPartitionMax(inputLogs,dt)}' AND region='us';
{noformat}
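For a single dt partition key, the min and max could be computed along the
lines of this Python sketch. It is hypothetical (the function names are mine)
and assumes the dt values are fixed-width, zero-padded timestamps, so that
plain string comparison matches chronological order.

```python
def partition_values(uris, partition_name):
    """Collect the values of one partition key across resolved hcat URIs."""
    values = []
    for uri in uris:
        spec = uri.rsplit("/", 1)[1]  # e.g. "dt=201201010015,region=us"
        for pair in spec.split(","):
            key, _, value = pair.partition("=")
            if key == partition_name:
                values.append(value)
    return values


def data_in_partition_min(uris, partition_name):
    # String comparison is only valid for fixed-width, zero-padded values.
    return min(partition_values(uris, partition_name))


def data_in_partition_max(uris, partition_name):
    return max(partition_values(uris, partition_name))


uris = [
    "hcat://bar:9000/db1/table1/dt=201201010100,region=us",
    "hcat://bar:9000/db1/table1/dt=201201010015,region=us",
    "hcat://bar:9000/db1/table1/dt=201201010045,region=us",
    "hcat://bar:9000/db1/table1/dt=201201010030,region=us",
]
print(data_in_partition_min(uris, "dt"))  # 201201010015
print(data_in_partition_max(uris, "dt"))  # 201201010100
```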
dataInPartitionMin and dataInPartitionMax become very complicated if, instead of
a single dt, the partitions are split more finely into year, month, dt and hour.
Even without min and max, the filter becomes very complicated because it has to
account for month, date and hour boundaries. I am assuming those cases are out
of scope for the first iteration and that dataInPartitionFilter needs to be used
for them.
> EL Functions for hcatalog
> -------------------------
>
> Key: OOZIE-1123
> URL: https://issues.apache.org/jira/browse/OOZIE-1123
> Project: Oozie
> Issue Type: Sub-task
> Reporter: Rohini Palaniswamy
> Assignee: Mona Chitnis
> Fix For: trunk
>
>
> Need new EL functions for hcatalog to support getting partition filter and
> min and max values for use in pig scripts.