[
https://issues.apache.org/jira/browse/PIG-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286769#comment-13286769
]
Daniel Dai commented on PIG-2552:
---------------------------------
Sounds good.
> Better Property handling to deal with deprecation and variable substitution
> of Hadoop config
> --------------------------------------------------------------------------------------------
>
> Key: PIG-2552
> URL: https://issues.apache.org/jira/browse/PIG-2552
> Project: Pig
> Issue Type: Improvement
> Components: impl
> Reporter: Daniel Dai
> Assignee: Daniel Dai
> Fix For: 0.11
>
>
> Traditionally Pig handles hadoop configuration using PigContext.properties,
> the flow is:
> 1. Instantiate Hadoop Configuration, read all entries and save into
> PigContext.properties
> 2. adding system properties, pig.properties and "set" command in script into
> PigContext.properties
> 3. Every time we need to instantiate a Hadoop Configuration, we iterate
> PigContext.properties and add to Hadoop Configuration
> This approach does not deal with hadoop 23 deprecated config option. Eg, in
> hadoop 23, "mapred.output.compression.codec" is replaced with
> "mapreduce.output.fileoutputformat.compression.codec". mapred-default.xml
> contains
> "mapreduce.output.fileoutputformat.compression.codec=org.apache.hadoop.io.compress.DefaultCodec".
> In Pig script, user may override it with "set
> mapred.output.compression.codec 'org.apache.hadoop.io.compress.BZip2Codec'".
> This is what happen:
> 1. Pig instantiate Hadoop Configuration, and put
> mapreduce.output.fileoutputformat.compression.codec=org.apache.hadoop.io.compress.DefaultCodec
> into PigContext.properties
> 2. Adding
> "mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec" to
> PigContext.properties
> 3. When creating Hadoop Configuration to submit Hadoop job, Pig iterate
> PigContext.properties, it first see
> "mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec",
> Hadoop Configuration translate it into the new property
> "mapreduce.output.fileoutputformat.compression.codec=org.apache.hadoop.io.compress.BZip2Codec",
> which is right until this point. Then Pig see
> "mapreduce.output.fileoutputformat.compression.codec=org.apache.hadoop.io.compress.DefaultCodec",
> and overwrite the previous right entry.
> In PIG-2508, we address the issue by using a Configuration to handle system
> properties, pig.properties and "set" command, the flow is:
> 1. Instantiate Hadoop Configuration, adding system properties,
> pig.properties, then read all entries and save into PigContext.properties
> 2. For every set command, instantiate Hadoop Configuration with
> PigContext.properties, set the property to Configuration (Configuration
> translate the old option into new option), then read all entries back into
> PigContext.properies
> This works but is cumbersome when doing "set".
> In trunk, I want to use the following approach:
> 1. Write a subclass PigProperties extends Properties, and use it as
> PigContext.properties. The interface for PigContext remains the same
> 2. In PigProperties, we maintain a set of hadoop properties, a set of system
> properties, a set of pig.properties and a set of "set" command properties
> 3. Upon invoking PigContext.getProperties(), we instantiate hadoop
> configuration, put all properties in a sequence, then flatten it into a
> combined properties
> 4. We can do optimization to avoid recreating combined properties every time
> we call PigContext.getProperties()
> The benefit for this approach:
> 1. Solve deprecate Hadoop config in a more clear way
> 2. Separate different layer of properties to ease the debugging, also provide
> potential to show properties to the user at different level
> 3. Potential to add job level properties in the future
> 4. No backward incompatibility introduced
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira