Adam Kawa created FALCON-997:
--------------------------------
Summary: Injecting the $falcon_output_path variable into Falcon
process
Key: FALCON-997
URL: https://issues.apache.org/jira/browse/FALCON-997
Project: Falcon
Issue Type: New Feature
Components: feed, process
Affects Versions: 0.6, 0.7, trunk
Reporter: Adam Kawa
Whenever possible, I try to use Falcon with HCatalog. Falcon already injects
several useful variables, such as {{falcon_output_database}} and
{{falcon_output_table}}, into a process so that you can parametrize your script.
In some use cases, however, even if you use feeds backed by Hive tables, it is
useful to have the path to the dataset that you want to create, e.g.
* you run a Camus job to move fresh logs from Kafka to HDFS.
Once Camus finishes, you would like to create a Hive partition on top of the
newly-created directory. Later this directory becomes an input to ETL processes
managed by Falcon, so you need a Hive table on top of it. Therefore,
you need to know both the Hive table and the exact path to the partition.
* you want to remove an existing dataset before regenerating it, to prevent
data duplication and make the operation idempotent,
e.g. some versions of Pig and HCatalog append to the existing dataset if
the script is re-run (https://issues.apache.org/jira/browse/HIVE-8371). If you
just drop the partition of an external table, the partition is removed, but
the data in HDFS still exists.
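As a rough illustration of both use cases, here is a minimal Python sketch of what a
process script could do if it received the proposed path. The helper names, the
sample database, table, partition and path are all hypothetical and not part of
Falcon; the sketch only builds the HiveQL and the HDFS command and does not run them.

```python
def partition_spec(partition):
    """Render [(key, value), ...] as a Hive partition spec, e.g. dt='2015-01-28'."""
    return ", ".join("{}='{}'".format(key, value) for key, value in partition)


def add_partition_hql(database, table, partition, output_path):
    """HiveQL that registers an existing directory as a partition (Camus case)."""
    return (
        "USE {db}; ALTER TABLE {tbl} ADD IF NOT EXISTS "
        "PARTITION ({spec}) LOCATION '{path}';"
    ).format(db=database, tbl=table, spec=partition_spec(partition), path=output_path)


def drop_partition_hql(database, table, partition):
    """HiveQL that drops the partition; for an external table the data stays in HDFS."""
    return "USE {db}; ALTER TABLE {tbl} DROP IF EXISTS PARTITION ({spec});".format(
        db=database, tbl=table, spec=partition_spec(partition)
    )


def remove_data_command(output_path):
    """Argument list for deleting the leftover directory before regenerating it."""
    return ["hdfs", "dfs", "-rm", "-r", "-f", output_path]


if __name__ == "__main__":
    # Hypothetical values; in the proposal the path would come from
    # falcon_output_path and the names from falcon_output_database / _table.
    path = "/data/logs/events/dt=2015-01-28"
    part = [("dt", "2015-01-28")]
    print(add_partition_hql("logs", "events", part, path))
    print(drop_partition_hql("logs", "events", part))
    print(" ".join(remove_data_command(path)))
```

Running the drop statement followed by the removal command before the job starts is one way to make a re-run idempotent, since neither step fails when the partition or the directory is already gone.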
Injecting a variable like {{falcon_output_path}} into the Falcon process
could help here. The {{falcon_output_path}} could be taken directly from the Hive
metastore (if the partition is already created), or constructed in some
predefined way (if the partition isn't created yet).
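The two ways of obtaining the path could be sketched roughly as follows. This is
only an assumption about what the "predefined way" might look like: it uses Hive's
default partition layout (table location followed by key=value directories). The
function name and its parameters are hypothetical, and the metastore lookup itself
is left out; its result is simply passed in.

```python
import posixpath


def resolve_output_path(table_location, partition, metastore_location=None):
    """Return the path for an output partition.

    table_location     -- location of the Hive table (assumed to be known)
    partition          -- ordered list of (key, value) pairs
    metastore_location -- location already recorded in the metastore, or None
                          if the partition has not been created yet
    """
    if metastore_location:
        # The partition exists: trust what the metastore says.
        return metastore_location
    # Otherwise fall back to Hive's default layout: <table>/<k1>=<v1>/<k2>=<v2>
    segments = ["{}={}".format(key, value) for key, value in partition]
    return posixpath.join(table_location, *segments)


if __name__ == "__main__":
    part = [("year", "2015"), ("month", "01")]
    print(resolve_output_path("hdfs://nn/data/events", part))
    print(resolve_output_path("hdfs://nn/data/events", part,
                              metastore_location="hdfs://nn/custom/2015-01"))
```

Preferring the metastore value matters because a partition of an external table may live at an arbitrary location that the default layout would not predict.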
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)