[
https://issues.apache.org/jira/browse/PIG-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13975915#comment-13975915
]
Cheolsoo Park commented on PIG-3901:
------------------------------------
[~mrflip], thank you for the wonderful work. This is very much needed. See my
responses below:
{code}
# TODO: what is this, what is the default, why would I change it?
# brief=false
{code}
Brief logging (no timestamps). The default is false. See
http://pig.apache.org/docs/r0.12.0/cmds.html#utillity-cmds
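For what it's worth, it can also be toggled per run from the command line (the
script name is illustrative):
{code}
pig -b myscript.pig
{code}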
{code}
# TODO: what is this, what is the default, why would I change it?
# debug=INFO
{code}
Logging level, one of OFF|ERROR|WARN|INFO|DEBUG; the default is INFO. See
http://pig.apache.org/docs/r0.12.0/cmds.html#utillity-cmds
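For example, to get verbose logging for a single run (the script name is
illustrative):
{code}
pig -d DEBUG myscript.pig
{code}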
{code}
# TODO: what is this, what is the default, why would I change it?
# stop.on.failure=false
{code}
Per Pig job, a DAG of MR jobs is submitted. When this property is set to true,
all waiting/running MR jobs are canceled/killed upon an MR job failure.
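For example, it can be enabled per run from the command line (-F is the short
form; the script name is illustrative):
{code}
pig -stop_on_failure myscript.pig
{code}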
{code}
# TODO: what is this, what is the default, why would I change it?
# pig.additional.jars=<comma separated list of jars>
{code}
You can register additional jars (to use with your Pig script) via the command
line using the -Dpig.additional.jars option. See
http://pig.apache.org/docs/r0.12.0/basic.html
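For example (the jar path and script name are illustrative; a REGISTER statement
inside the script achieves the same thing):
{code}
pig -Dpig.additional.jars=/local/path/myudfs.jar myscript.pig
{code}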
{code}
# TODO: what is this, what is the default, why would I change it?
# udf.import.list=<comma separated list of imports>
{code}
An import list allows you to specify the package to which a UDF or a group of
UDFs belong, eliminating the need to qualify the UDF on every call. An import
list can be specified via the udf.import.list Java property on the Pig command
line. See http://pig.apache.org/docs/r0.12.0/udf.html
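A minimal sketch, assuming a hypothetical package org.example.udfs that
contains an UPPER UDF:
{code}
pig -Dudf.import.list=org.example.udfs myscript.pig
{code}
Then, inside the script, the UDF can be called unqualified:
{code}
a = LOAD 'input' AS (name:chararray);
b = FOREACH a GENERATE UPPER(name); -- instead of org.example.udfs.UPPER(name)
{code}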
{code}
# (TODO: what is the default, and why would you ever set this to be false?)
# pig.user.cache.enabled=true
{code}
The default is false. You might want to disable it because the cache can become
stale.
{code}
# Omit empty part files from the output? (TODO: what is the default, and why
# would you ever set this to be false?)
# pig.output.lazy=true
{code}
The default is false. The default behavior of MapReduce is to generate an empty
file when there is no data, so Pig follows that.
{code}
# TODO: what is this, what is the default, why would I change it?
# pig.cachedbag.memusage=0.2
{code}
The amount of memory allocated to bags is determined by pig.cachedbag.memusage;
the default is set to 20% (0.2) of available memory. Note that this memory is
shared across all large bags used by the application. See
http://pig.apache.org/docs/r0.12.0/perf.html#memory-management
{code}
# TODO: what is this, what is the default, why would I change it?
# pig.skewedjoin.reduce.memusage=0.3
{code}
The pig.skewedjoin.reduce.memusage Java parameter specifies the fraction of
heap available for the reducer to perform the join. A low fraction forces Pig
to use more reducers but increases copying cost. See
http://pig.apache.org/docs/r0.12.0/perf.html#skewed-joins
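For reference, the skewed join itself is requested in the script (relation and
field names are illustrative):
{code}
big  = LOAD 'big_data'  AS (b1:int, b2:int);
huge = LOAD 'huge_data' AS (h1:int, h2:int);
j    = JOIN big BY b1, huge BY h1 USING 'skewed';
{code}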
{code}
# EXPERIMENTAL: Use SchemaTuples in merge joins. (default: value of
# `pig.schematuple`). (TODO: Memory savings are highest when (???))
# pig.schematuple.merge_join=false
{code}
Memory savings come from using Java primitives instead of objects. When the
schema is known, we can generate a custom Java class to hold records instead of
wrapping primitives in objects. In my experiments, I haven't found good
guidance, so I wouldn't add more comments than what's already there.
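For completeness, a sketch of enabling it for a merge join (relation names are
illustrative; both inputs must be sorted on the join key):
{code}
set pig.schematuple 'true';
set pig.schematuple.merge_join 'true';
a = LOAD 'sorted_left'  AS (id:int, v:chararray);
b = LOAD 'sorted_right' AS (id:int, w:chararray);
j = JOIN a BY id, b BY id USING 'merge';
{code}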
{code}
# Do not spill temp files smaller than this size (bytes)
# TODO: what is this, what is the default, why would I change it?
# pig.spill.size.threshold=5000000
{code}
Bags are spilled only if their size is greater than the configured threshold.
The default is 5000000 bytes. Usually, the more spilling, the longer the
runtime, so you might want to tune it according to the heap size of each task.
{code}
# Tempfile storage container type: (TODO: what is the default, and why would I
# change it?)
# * seqfile: only supports gz(gzip), lzo, snappy, and bzip2
# * tfile: only supports gz(gzip) and lzo
# pig.tmpfilecompression.storage=seqfile
{code}
The default is tfile. You can change it depending on which format you want to
use. For now, there are only seqfile and tfile, but we can add a new format in
the future. Regarding why tfile over seqfile, this might be helpful:
https://issues.apache.org/jira/secure/attachment/12396286/TFile%20Specification%2020081217.pdf
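For example, to enable compressed temp files with the seqfile container (the
codec choice is illustrative; note that snappy works with seqfile but not
tfile, per the list above):
{code}
pig.tmpfilecompression=true
pig.tmpfilecompression.codec=snappy
pig.tmpfilecompression.storage=seqfile
{code}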
{code}
# TODO: what is this, what is the default, why would I change it?
#pig.noSplitCombination=true
{code}
The default is false. See
http://pig.apache.org/docs/r0.12.0/perf.html#combine-files
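For example, instead of disabling combination entirely, small splits can be
combined up to a target size (the value is in bytes and illustrative):
{code}
pig.noSplitCombination=false
pig.maxCombinedSplitSize=134217728
{code}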
{code}
# TODO: what is this, what is the default, why would I change it?
# pig.exec.nocombiner=false
{code}
The default is false. The combiner optimization does not always help
performance. When the reduction from the combiner is not significant, it only
adds extra overhead. In that case, you may want to disable it.
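It can also be disabled per script, without a cluster-wide change:
{code}
set pig.exec.nocombiner 'true';
{code}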
{code}
# TODO: what is this, what is the default, why would I change it?
# opt.multiquery=true
{code}
The default is true. MultiQuery optimization is not bug-free. Sometimes a
query compiles fine without MQ optimization but fails with it. In that case,
you may want to disable it.
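It can also be turned off per run from the command line (-M is the short form;
the script name is illustrative):
{code}
pig -no_multiquery myscript.pig
{code}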
{code}
# TODO: what is this, what is the default, why would I change it?
# opt.fetch=true
{code}
The default is true. This enables direct fetch optimization for small queries
(PIG-3642). But sometimes you want to force Pig to launch an MR job, for example
when you're testing a live cluster. If so, you can disable it.
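For example, to force a real MR job for a small test query (the script name is
illustrative):
{code}
pig -Dopt.fetch=false myscript.pig
{code}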
{code}
# TODO: advice on how to choose
# TODO: This value seems low -- unless you have a fully-saturated cluster,
# driving more reducers than available slots can have unintended consequences
# that non-experts may not appreciate.
# pig.exec.reducers.bytes.per.reducer=1000000000
# pig.exec.reducers.max=999
{code}
See http://pig.apache.org/docs/r0.12.0/perf.html#reducer-estimation
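To make the estimate concrete: with these defaults, a job reading roughly 50 GB
of input gets min(ceil(50 GB / 1 GB), 999) = 50 reducers. The estimate only
applies when parallelism isn't set explicitly; a PARALLEL clause or
default_parallel takes precedence, e.g.:
{code}
set default_parallel 50;
{code}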
{code}
# TODO: advice on how to choose
# pig.exec.reducer.estimator = <fully qualified class name of a
# PigReducerEstimator implementation>
{code}
There is only one default implementation shipped with Pig. If you want to
customize it, you can do so by setting this property. See
http://pig.apache.org/docs/r0.12.0/perf.html#reducer-estimation
{code}
# TODO: what is this, what is the default, why would I change it?
#pig.exec.mapPartAgg=false
#pig.exec.mapPartAgg.minReduction=10
{code}
The defaults are false and 10, respectively. See
http://pig.apache.org/docs/r0.12.0/perf.html#hash-based-aggregation
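Both can also be set per script (the minReduction value is illustrative):
{code}
set pig.exec.mapPartAgg 'true';
set pig.exec.mapPartAgg.minReduction '3';
{code}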
{code}
# TODO: what is this, what is the default, why would I change it?
#pig.load.default.statements=
{code}
You can use this property to load a bootstrap file that contains default
statements that you want to execute in every Pig job. It's similar to .bashrc.
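A minimal sketch of such a bootstrap file, reusing the hypothetical
org.example.udfs package from above (all paths are illustrative):
{code}
-- contents of /home/me/.pigbootup, pointed to by
-- pig.load.default.statements=/home/me/.pigbootup
REGISTER /local/path/myudfs.jar;
DEFINE UPPER org.example.udfs.UPPER();
set default_parallel 20;
{code}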
{code}
# Support recovery when the application master is restarted? (TODO: what is
# the default, and advice on setting it)
# pig.output.committer.recovery.support=true
{code}
The default is false. This is a Hadoop 2-specific property. You can enable it
if you want to take advantage of AM recovery.
{code}
# TODO: what is this, what is the default, and why would I change it?
hcat.bin=/usr/local/hcat/bin/hcat
{code}
The default is null. You want to change it if hcat is installed at a
non-default location.
{code}
# TODO: what is this, what is the default, and why would I change it?
#pig.sql.type=hcat
{code}
Currently, hcat is the only SQL backend. We might add a new one in the future.
{code}
# `load '/path/to/tmp/file' using org.apache.pig.impl.io.TFileStorage();`.
# (TODO: is it true that pig.tmpfilecompression.storage affects this)
# pig.delete.temp.files=true
{code}
Yes. You should use SequenceFileInterStorage() if
pig.tmpfilecompression.storage is seqfile.
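For example, reading a temp file back when seqfile storage is in effect (the
path is illustrative, and I'm assuming the class lives in the same package as
TFileStorage):
{code}
a = LOAD '/path/to/tmp/file' USING org.apache.pig.impl.io.SequenceFileInterStorage();
{code}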
{code}
# TODO: advice on how to choose
# pig.script.max.size=10240
{code}
If Pig executes a long query, the pig.script entry can waste space in the
JobConf. If so, you can truncate it. The default is 10240 characters.
Thanks!
> Organize the Pig properties file and document all properties
> ------------------------------------------------------------
>
> Key: PIG-3901
> URL: https://issues.apache.org/jira/browse/PIG-3901
> Project: Pig
> Issue Type: Improvement
> Reporter: Philip (flip) Kromer
> Priority: Minor
> Labels: conf, config, documentation, properties, settings
> Attachments: organize_pig_properties.patch
>
>
> The current pig.properties file can use some love. Each property should be
> introduced by a documentation string explaining
> * what the feature does,
> * what its default and other allowed values are,
> * why a user might change it from the default,
> * and what might go wrong with each.
> The documentation should follow a common format -- I propose the following
> guidelines:
> * Each property should supply either a bulleted list of acceptable values,
> indicating the default; or provide the default value inline with the
> description
> * Don't say 'This setting lets you control whether Pig will decide to use the
> Hemiconducer feature', say 'Enables the hemiconducer feature, which [...]'
> * Don't document the internals of the feature. Describe its impact on job
> execution or performance.
> * Use consistent indentation, title formatting, and block delimiting. (The
> current patch does not yet do so completely, as I'm figuring it out)
> * Place each setting in the appropriate block according to its impact on the
> user experience.
> * Call out Experimental features with `EXPERIMENTAL`, but group them with
> similar settings.
> * If a setting is dangerous, call that out with `WARNING`
> * If one value is always appropriate for casual use, or always appropriate
> for production use, we should call that out. Production use should assume a
> moderately loaded single rack hadoop cluster according to the major distro's
> reference configuration -- people running massive-scale installations don't
> need this file's advice.
> I've attached a patch that organizes the current properties file and
> documents everything I felt confident describing. This is a preliminary
> patch, as I'll need some help documenting many of the currently un-documented
> ones. Please review what I've written carefully; I have reasonable experience
> programming Pig but limited familiarity with the experimental features.