Jacob Tolar created OOZIE-3668:
----------------------------------

             Summary: Simplify setting oozie.launcher.mapreduce.job.hdfs-servers
                 Key: OOZIE-3668
                 URL: https://issues.apache.org/jira/browse/OOZIE-3668
             Project: Oozie
          Issue Type: New Feature
            Reporter: Jacob Tolar


When running Oozie jobs that depend on cross-cluster HDFS paths, I am required 
to provide the parameter {{oozie.launcher.mapreduce.job.hdfs-servers}}.
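
For reference, here is roughly how this is set today in a workflow action's {{<configuration>}} block (a sketch only; the NameNode hosts and ports are placeholders):

{code:xml}
<!-- Sketch only: the NameNode hosts/ports below are placeholders. The value is
     a comma-separated list of NameNodes the launcher job needs access to. -->
<configuration>
  <property>
    <name>oozie.launcher.mapreduce.job.hdfs-servers</name>
    <value>hdfs://nn-cluster-a:8020,hdfs://nn-cluster-b:8020</value>
  </property>
</configuration>
{code}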

This is a pain to manage when there are many data sources, or when the same 
coordinator/workflow is deployed to multiple clusters (e.g. staging, 
production) that have different cross-cluster data access requirements. We 
need to keep track of the datasets and nameNode lists in two places.

It's especially obnoxious if you are using something like an HCatalog table 
with partitions registered on a different HDFS. In that case, you can define 
your dataset and Oozie's coordinator takes care of all the details no matter 
where the partitions are stored, but the workflow will fail unless you inspect 
the table and add the correct name nodes to the hdfs-servers setting.
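
For example, a dataset defined roughly like the following (a sketch; the metastore host, database, and table names are placeholders) may resolve to partitions stored on any HDFS, yet the workflow still fails unless those NameNodes are listed by hand:

{code:xml}
<!-- Sketch only: hcat-host, mydb, and clicks are placeholder names. -->
<datasets>
  <dataset name="clicks" frequency="${coord:days(1)}"
           initial-instance="2021-01-01T00:00Z" timezone="UTC">
    <uri-template>hcat://hcat-host:9083/mydb/clicks/dt=${YEAR}${MONTH}${DAY}</uri-template>
  </dataset>
</datasets>
{code}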

If you are using Oozie coordinators with data dependencies to schedule jobs, 
Oozie should already have access to all the information required to provide 
this setting automatically, which would help eliminate errors when the setting 
is missing or set incorrectly.

I think there are two reasonable approaches which should be feasible. They're 
not necessarily mutually exclusive, but I would be happy with just one of them: 

1. Oozie sets the value automatically

In this case, Oozie coordinator execution is updated to compute the list of 
hdfs-servers and pass it through to the workflow via the configuration. The 
Oozie workflow execution is updated to use the value provided by the 
coordinator as the default value for 
{{oozie.launcher.mapreduce.job.hdfs-servers}} if the setting is not provided.

The user should still be able to override the setting if needed. It would be 
helpful if there were a way for the user to specify *additional* hdfs-servers, 
i.e. specify 
{{oozie.launcher.mapreduce.job.hdfs-servers=${oozie.coord.hdfs-servers},hdfs://name-node}} 
to get everything computed by the coordinator plus something else, but that may 
be an uncommon use case.
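
A sketch of what that override could look like in the workflow configuration, assuming a hypothetical {{oozie.coord.hdfs-servers}} property carrying the coordinator-computed list (neither the property nor the behavior exists today):

{code:xml}
<!-- Hypothetical sketch: ${oozie.coord.hdfs-servers} is the proposed
     coordinator-computed list; the extra NameNode is a placeholder. -->
<property>
  <name>oozie.launcher.mapreduce.job.hdfs-servers</name>
  <value>${oozie.coord.hdfs-servers},hdfs://extra-name-node:8020</value>
</property>
{code}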

2. Oozie provides EL functions for easily computing the {{hdfs-servers}} setting

In this case, Oozie could be updated to provide three new coordinator EL 
functions. The output could be passed through to the workflow and used as 
needed by the user; a usage sketch follows the list below.

1. {{coord:getAllDatasetHdfsServers()}}: Takes no parameters and outputs a 
string.

This function will iterate over all {{dataIn}} and {{dataOut}} datasets 
configured in the coordinator and construct a string suitable for passing to 
the workflow parameter {{oozie.launcher.mapreduce.job.hdfs-servers}}. It should 
work for all supported dataset types (e.g. HDFS, HCatalog, etc.).

2. {{coord:getDataInHdfsServers(String dataIn)}}: Takes one parameter and 
outputs a string. 

This function does the same thing as (1), but only for the specified {{dataIn}} 
dataset.

3. {{coord:getDataOutHdfsServers(String dataOut)}}: Takes one parameter and 
outputs a string. 

This function does the same thing as (1), but only for the specified dataOut 
dataset. 
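
A sketch of how the proposed functions might be used from a coordinator action (none of these EL functions exist yet; {{workflowAppUri}} is a placeholder):

{code:xml}
<!-- Hypothetical sketch: coord:getAllDatasetHdfsServers() is the proposed EL
     function described above; workflowAppUri is a placeholder. -->
<action>
  <workflow>
    <app-path>${workflowAppUri}</app-path>
    <configuration>
      <property>
        <name>oozie.launcher.mapreduce.job.hdfs-servers</name>
        <value>${coord:getAllDatasetHdfsServers()}</value>
      </property>
    </configuration>
  </workflow>
</action>
{code}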



