Adam Kawa created FALCON-997:
--------------------------------

             Summary: Injecting the $falcon_output_path variable into Falcon 
process
                 Key: FALCON-997
                 URL: https://issues.apache.org/jira/browse/FALCON-997
             Project: Falcon
          Issue Type: New Feature
          Components: feed, process
    Affects Versions: 0.6, 0.7, trunk
            Reporter: Adam Kawa


Whenever possible, I use Falcon with HCatalog. Falcon already injects 
several useful variables, such as {{falcon_output_database}} and 
{{falcon_output_table}}, into a process so that you can parametrize your script.

In some use cases, however, even if you use feeds backed by Hive tables, it is 
useful to have the path to the dataset that you want to create, e.g.:

* you run a Camus job to move fresh logs from Kafka to HDFS. 

Once Camus finishes, you would like to create a Hive partition on top of the 
newly-created directory. Later this directory becomes an input to ETL processes 
managed by Falcon, so you have to have a Hive table on top of it. Therefore, 
you need to know the Hive table and the exact path to the partition.

* you want to remove an existing dataset before regenerating it, to prevent 
data duplication and make the operation idempotent.

e.g. some versions of Pig and HCatalog append to the existing dataset if the 
script is re-run (https://issues.apache.org/jira/browse/HIVE-8371). If you 
just drop the partition of an external table, the partition is removed, but 
the data in HDFS still exists.

Injecting a variable like {{falcon_output_path}} into the Falcon process 
could help here. The {{falcon_output_path}} could be taken directly from the 
Hive metastore (if the partition is already created), or constructed in some 
predefined way (if the partition isn't created yet).
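
The two cases above can be sketched roughly as follows. This is only an 
illustration of the proposed resolution order, not Falcon code: the function 
name and its parameters are hypothetical, and the "predefined way" is assumed 
here to be Hive's default key=value partition directory layout under the table 
location.

```python
from typing import Dict, Optional


def resolve_output_path(
    table_location: str,
    partition: Dict[str, str],
    metastore_location: Optional[str] = None,
) -> str:
    """Return the directory a falcon_output_path variable could point to.

    Hypothetical helper: all names here are assumptions for illustration.
    """
    # Case 1: the partition already exists, so use the location that the
    # metastore recorded for it (passed in by the caller).
    if metastore_location:
        return metastore_location.rstrip("/")

    base = table_location.rstrip("/")
    # An unpartitioned table resolves to the table location itself.
    if not partition:
        return base

    # Case 2: the partition is not created yet. Assume Hive's default
    # key=value layout, with keys in the table's partition-column order.
    # Escaping of special characters in partition values is not handled.
    suffix = "/".join(f"{key}={value}" for key, value in partition.items())
    return f"{base}/{suffix}"
```

With a value resolved like this, a script could delete the directory before 
regenerating the partition, or register a partition on top of a directory 
that Camus has just written.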








--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
