Author: olga
Date: Thu Apr 8 19:43:04 2010
New Revision: 932076
URL: http://svn.apache.org/viewvc?rev=932076&view=rev
Log:
PIG-1320: more documentation updates for Pig 0.7.0 (chandec via olgan)
Modified:
hadoop/pig/branches/branch-0.7/CHANGES.txt
hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/cookbook.xml
hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/piglatin_ref2.xml
hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/udf.xml
Modified: hadoop/pig/branches/branch-0.7/CHANGES.txt
URL:
http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.7/CHANGES.txt?rev=932076&r1=932075&r2=932076&view=diff
==============================================================================
--- hadoop/pig/branches/branch-0.7/CHANGES.txt (original)
+++ hadoop/pig/branches/branch-0.7/CHANGES.txt Thu Apr 8 19:43:04 2010
@@ -68,6 +68,8 @@ manner (rding via pradeepkth)
IMPROVEMENTS
+PIG-1320: more documentation updates for Pig 0.7.0 (chandec via olgan)
+
PIG-1316: TextLoader should use Bzip2TextInputFormat for bzip files so that
bzip files can be efficiently processed by splitting the files (pradeepkth)
Modified:
hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/cookbook.xml
URL:
http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/cookbook.xml?rev=932076&r1=932075&r2=932076&view=diff
==============================================================================
---
hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/cookbook.xml
(original)
+++
hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/cookbook.xml
Thu Apr 8 19:43:04 2010
@@ -272,7 +272,7 @@ STORE D INTO 'mysortedcount' U
<section>
<title>Use the LIMIT Operator</title>
-<p>A lot of the times, you are not interested in the entire output but either
a sample or top results. In those cases, using LIMIT can yeild a much better
performance as we push the limit as high as possible to minimize the amount of
data travelling through the pipeline. </p>
+<p>Often you are not interested in the entire output but rather a sample or
top results. In such cases, using LIMIT can yield much better performance as
we push the limit as high as possible to minimize the amount of data
travelling through the pipeline. </p>
<p>Sample:
</p>
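Illustratively (the aliases and input file below are hypothetical, not part of
this patch), the recommended pattern is to apply LIMIT immediately after the
ORDER so the limit is pushed into the sort:

A = LOAD 'myinput' AS (user, visits);
B = ORDER A BY visits DESC;
C = LIMIT B 500;   -- only the top 500 rows travel through the rest of the pipeline
DUMP C;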
Modified:
hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/piglatin_ref2.xml
URL:
http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/piglatin_ref2.xml?rev=932076&r1=932075&r2=932076&view=diff
==============================================================================
---
hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/piglatin_ref2.xml
(original)
+++
hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/piglatin_ref2.xml
Thu Apr 8 19:43:04 2010
@@ -6976,7 +6976,7 @@ DUMP B;
<informaltable frame="all">
<tgroup cols="1"><tbody><row>
<entry>
- <para>EXPLAIN [-out path] [-brief] [-dot] [-param
param_name = param_value] [-param_file file_name] alias; </para>
+ <para>EXPLAIN [-script pigscript] [-out path] [-brief]
[-dot] [-param param_name = param_value] [-param_file file_name]
alias; </para>
</entry>
</row></tbody></tgroup>
</informaltable></section>
@@ -6985,14 +6985,23 @@ DUMP B;
<title>Terms</title>
<informaltable frame="all">
<tgroup cols="2"><tbody>
-
<row>
<entry>
- <para>-out path</para>
+ <para>-script</para>
+ </entry>
+ <entry>
+ <para>Use to specify a pig script.</para>
</entry>
+ </row>
+
+ <row>
<entry>
- <para>Will generate logical_plan.[txt||dot],
physical_plan.[text||dot], exec_plan.[text||dot] in the specified directory
(path).</para>
- <para>Default (no path given): Stdout </para>
+ <para>-out</para>
+ </entry>
+ <entry>
+ <para>Use to specify the output path (directory).</para>
+ <para>Will generate a logical_plan[.txt|.dot],
physical_plan[.txt|.dot], exec_plan[.txt|.dot] file in the specified
path.</para>
+ <para>Default (no path specified): Stdout </para>
</entry>
</row>
@@ -7010,9 +7019,10 @@ DUMP B;
<para>-dot</para>
</entry>
<entry>
- <para>Dot mode: outputs a format that can be passed to dot for
graphical display.</para>
- <para>Text mode: multiple output (split) will be broken out in
sections. </para>
- <para>Default: Text </para>
+
+ <para>Text mode (default): multiple output (split) will be
broken out in sections. </para>
+ <para>Dot mode: outputs a format that can be passed to the dot
utility for graphical display;
+ it will generate a directed acyclic graph (DAG) of the plans in
any supported format (.gif, .jpg ...).</para>
</entry>
</row>
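Taken together, a usage sketch of these options (the script name here is
hypothetical; num_user_visits is an alias used elsewhere in this reference):

grunt> explain -script visits.pig -out /tmp/plans -dot num_user_visits;

With -dot, the generated plan files can then be rendered with the Graphviz dot
utility, for example: dot -Tgif /tmp/plans/exec_plan.dot > exec_plan.gif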
@@ -7295,7 +7305,7 @@ ILLUSTRATE num_user_visits;
<para>USING - Keyword.</para>
</listitem>
<listitem>
- <para>serializer - A function that converts data from
tuples to stream format. PigStorage is the default serializer. You can also
write your own UDF.</para>
+ <para>serializer - PigStreaming is the default serializer.
</para>
</listitem>
</itemizedlist>
</entry>
@@ -7318,7 +7328,7 @@ ILLUSTRATE num_user_visits;
<para>USING - Keyword.</para>
</listitem>
<listitem>
- <para>deserializer - A function that converts data from
stream format to tuples. PigStorage is the default deserializer. You can also
write your own UDF.</para>
+ <para>deserializer - PigStreaming is the default
deserializer. </para>
</listitem>
</itemizedlist>
</entry>
@@ -7365,7 +7375,7 @@ ILLUSTRATE num_user_visits;
<para>Use DEFINE to specify a function when:</para>
<itemizedlist>
<listitem>
- <para>The function has a log package name that you don't want to
include in a script, especially if you call the function several times in that
script.</para>
+ <para>The function has a long package name that you don't want to
include in a script, especially if you call the function several times in that
script.</para>
</listitem>
<listitem>
<para>The constructor for the function takes string parameters. If
you need to use different constructor parameters for different calls to the
function you will need to create multiple defines - one for each parameter
set.</para>
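A brief sketch covering both cases (the package name, function, and
constructor arguments are hypothetical):

DEFINE lookupUS com.acme.pig.udfs.DictionaryLookup('en-US');
DEFINE lookupDE com.acme.pig.udfs.DictionaryLookup('de-DE');
A = LOAD 'queries';
B = FOREACH A GENERATE lookupUS($0), lookupDE($0);

Each DEFINE binds one constructor parameter set to a short alias, so the long
package name does not have to be repeated at every call site.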
@@ -7375,8 +7385,46 @@ ILLUSTRATE num_user_visits;
<section
><title>About Input and Output</title>
- <para>Serialization is needed to convert data from tuples to a format that
can be processed by the streaming application. Deserialization is needed to
convert the output from the streaming application back into tuples.</para>
- <para>PigStorage, the default serialization/deserialization function,
converts tuples to tab-delimited lines. Pig's BinarySerializer and
BinaryDeserializer functions treat the entire file as a byte stream (no
formatting or interpretation takes place). You can also write your own
serialization/deserialization functions.</para>
+ <para>Serialization is needed to convert data from tuples to a format that
can be processed by the streaming application. Deserialization is needed to
convert the output from the streaming application back into tuples.
PigStreaming is the default serialization/deserialization function.</para>
+
+<para>Streaming uses the same default format as PigStorage to
serialize/deserialize the data. If you want to explicitly specify a format, you
can do it as shown below (see more examples in the Examples: Input/Output
section). </para>
+
+<programlisting>
+DEFINE CMD 'perl PigStreaming.pl - nameMap' input(stdin using
PigStreaming(',')) output(stdout using PigStreaming(','));
+A = LOAD 'file';
+B = STREAM A THROUGH CMD;
+</programlisting>
+
+<para>If you need an alternative format, you will need to create a custom
serializer/deserializer by implementing the following interfaces.</para>
+
+<programlisting>
+interface PigToStream {
+
+    /**
+     * Given a tuple, produce an array of bytes to be passed to the streaming
+     * executable.
+     */
+    public byte[] serialize(Tuple t) throws IOException;
+}
+
+interface StreamToPig {
+
+    /**
+     * Given a byte array from a streaming executable, produce a tuple.
+     */
+    public Tuple deserialize(byte[] bytes) throws IOException;
+
+    /**
+     * This will be called on the front end during planning and not on the back
+     * end during execution.
+     *
+     * @return the {@link LoadCaster} associated with this object.
+     * @throws IOException if there is an exception during LoadCaster creation.
+     */
+    public LoadCaster getLoadCaster() throws IOException;
+}
+</programlisting>
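As a minimal sketch of a custom serializer/deserializer built on these
interfaces (the class name and the fixed '|' delimiter are hypothetical;
Utf8StorageConverter is Pig's standard LoadCaster implementation):

import java.io.IOException;

import org.apache.pig.LoadCaster;
import org.apache.pig.PigToStream;
import org.apache.pig.StreamToPig;
import org.apache.pig.builtin.Utf8StorageConverter;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Hypothetical pipe-delimited serializer/deserializer.
public class PipeStreaming implements PigToStream, StreamToPig {

    private final TupleFactory tupleFactory = TupleFactory.getInstance();

    public byte[] serialize(Tuple t) throws IOException {
        // Join the tuple's fields with '|' and terminate the record with a newline.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < t.size(); i++) {
            if (i > 0) {
                sb.append('|');
            }
            Object field = t.get(i);
            sb.append(field == null ? "" : field.toString());
        }
        sb.append('\n');
        return sb.toString().getBytes("UTF-8");
    }

    public Tuple deserialize(byte[] bytes) throws IOException {
        // Split one record of streaming output back into fields.
        String line = new String(bytes, "UTF-8").trim();
        String[] fields = line.split("\\|", -1);
        Tuple t = tupleFactory.newTuple(fields.length);
        for (int i = 0; i < fields.length; i++) {
            t.set(i, fields[i]);
        }
        return t;
    }

    public LoadCaster getLoadCaster() throws IOException {
        // Reuse the standard UTF-8 caster rather than writing a new one.
        return new Utf8StorageConverter();
    }
}

Such a class would then be referenced the same way as PigStreaming, e.g.
DEFINE Y 'stream.pl' INPUT(stdin USING PipeStreaming()) OUTPUT(stdout USING
PipeStreaming());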
+
</section>
<section>
@@ -7448,15 +7496,15 @@ OP = stream IP through 'perl /a/b/c/scri
</section>
<section>
- <title>Example: Input/Output</title>
- <para>In this example PigStorage is the default serialization/deserialization
function. The tuples from relation A are converted to tab-delimited lines that
are passed to the script.</para>
+ <title>Examples: Input/Output</title>
+ <para>In this example PigStreaming is the default
serialization/deserialization function. The tuples from relation A are
converted to tab-delimited lines that are passed to the script.</para>
<programlisting>
X = STREAM A THROUGH 'stream.pl';
</programlisting>
- <para>In this example PigStorage is used as the
serialization/deserialization function, but a comma is used as the
delimiter.</para>
+ <para>In this example PigStreaming is used as the
serialization/deserialization function, but a comma is used as the
delimiter.</para>
<programlisting>
-DEFINE Y 'stream.pl' INPUT(stdin USING PigStorage(',')) OUTPUT (stdout USING
PigStorage(','));
+DEFINE Y 'stream.pl' INPUT(stdin USING PigStreaming(',')) OUTPUT (stdout USING
PigStreaming(','));
X = STREAM A THROUGH Y;
</programlisting>
@@ -7470,7 +7518,7 @@ X = STREAM A THROUGH Y;
</section>
<section>
- <title>Example: Ship/Cache</title>
+ <title>Examples: Ship/Cache</title>
<para>In this example ship is used to send the script to the cluster
compute nodes.</para>
<programlisting>
DEFINE Y 'stream.pl' SHIP('/work/stream.pl');
@@ -7487,7 +7535,7 @@ X = STREAM A THROUGH Y;
</section>
<section>
- <title>Example: Logging</title>
+ <title>Examples: Logging</title>
<para>In this example the streaming stderr is stored in the
_logs/<dir> directory of the job's output directory. Because the job can
have multiple streaming applications associated with it, you need to ensure
that different directory names are used to avoid conflicts. Pig stores up to
100 tasks per streaming job.</para>
<programlisting>
DEFINE Y 'stream.pl' stderr('<dir>' limit 100);
@@ -8590,6 +8638,43 @@ DUMP X;
<section>
+ <title>Handling Compression</title>
+
+<para>Support for compression is determined by the load/store function.
PigStorage and TextLoader support gzip and bzip compression for both read
(load) and write (store). BinStorage does not support compression.</para>
+
+<para>To work with gzip compressed files, input/output files need to have a
.gz extension. Gzipped files cannot be split across multiple maps; this means
that the number of maps created is equal to the number of part files in the
input location.</para>
+
+<programlisting>
+A = load 'myinput.gz';
+store A into 'myoutput.gz';
+</programlisting>
+
+<para>To work with bzip compressed files, the input/output files need to have
a .bz or .bz2 extension. Because the compression is block-oriented, bzipped
files can be split across multiple maps.</para>
+
+<programlisting>
+A = load 'myinput.bz';
+store A into 'myoutput.bz';
+</programlisting>
+
+<para>Note: PigStorage and TextLoader correctly read compressed files as long
as they are NOT CONCATENATED FILES generated in this manner: </para>
+ <itemizedlist>
+ <listitem>
+ <para>cat *.gz > text/concat.gz</para>
+ </listitem>
+ <listitem>
+ <para>cat *.bz > text/concat.bz </para>
+ </listitem>
+ <listitem>
+ <para>cat *.bz2 > text/concat.bz2</para>
+ </listitem>
+ </itemizedlist>
+
+<para>If you use concatenated gzip or bzip files with your Pig jobs, you will
NOT see a failure but the results will be INCORRECT.</para>
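A sketch of the safe alternative (directory and file names hypothetical):
leave the compressed part files unconcatenated and point the load at their
directory, so each file is decompressed individually:

A = load 'mydata';     -- directory holding part-00000.gz, part-00001.gz, ...
store A into 'myoutput.gz';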
+<para></para>
+
+</section>
+
+ <section>
<title>BinStorage</title>
<para>Loads and stores data in machine-readable format.</para>
@@ -8618,9 +8703,10 @@ DUMP X;
<section>
<title>Usage</title>
- <para>BinStorage works with data that is represented on disk in
machine-readable format.</para>
- <para>BinStorage does not support compression.</para>
- <para>BinStorage is used internally by Pig to store the temporary data that
is created between multiple map/reduce jobs.</para></section>
+ <para>BinStorage works with data that is represented on disk in
machine-readable format.
+ BinStorage does NOT support <ulink
url="#Handling+Compression">compression</ulink>.</para>
+
+ <para>BinStorage is used internally by Pig to store the temporary data
that is created between multiple map/reduce jobs.</para></section>
<section>
<title>Example</title>
@@ -8665,9 +8751,7 @@ STORE X into 'output' USING BinStorage()
<title>Usage</title>
<para>PigStorage is the default function for the LOAD and STORE operators
and works with both simple and complex data types. </para>
- <para>PigStorage supports structured text files (in human-readable UTF-8
format).</para>
-
- <para>PigStorage also supports gzip (.gz) and bzip(.bz or .bz2) compressed
files. PigStorage correctly reads compressed files as long as they are NOT
CONCATENATED files generated in this manner: cat *.gz > text/concat.gz OR cat
*.bz > text/concat.bz (OR cat *.bz2 > text/concat.bz2). If you use concatenated
gzip or bzip files with your Pig jobs, you will not see a failure but the
results will be INCORRECT.</para>
+ <para>PigStorage supports structured text files (in human-readable UTF-8
format). PigStorage also supports <ulink
url="#Handling+Compression">compression</ulink>.</para>
<para>Load statements - PigStorage expects data to be formatted using
field delimiters, either the tab character ('\t') or other specified
character.</para>
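For example (input file and schema hypothetical), the delimiter is passed as
the constructor argument and defaults to tab when omitted:

A = LOAD 'student_data' USING PigStorage(',') AS (name, age, gpa);
STORE A INTO 'output' USING PigStorage('*');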
@@ -8762,7 +8846,7 @@ STORE X INTO 'output' USING PigDump();
<section>
<title>Usage</title>
- <para>TextLoader works with unstructured data in UTF8 format. Each
resulting tuple contains a single field with one line of input text. </para>
+ <para>TextLoader works with unstructured data in UTF8 format. Each
resulting tuple contains a single field with one line of input text. TextLoader
also supports <ulink url="#Handling+Compression">compression</ulink>.</para>
<para>Currently, TextLoader support for compression is limited.</para>
<para>TextLoader cannot be used to store data.</para>
</section>
Modified:
hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/udf.xml
URL:
http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/udf.xml?rev=932076&r1=932075&r2=932076&view=diff
==============================================================================
---
hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/udf.xml
(original)
+++
hadoop/pig/branches/branch-0.7/src/docs/src/documentation/content/xdocs/udf.xml
Thu Apr 8 19:43:04 2010
@@ -762,8 +762,12 @@ has methods to convert byte arrays to sp
<p>The LoadFunc abstract class is the main class to extend for implementing a
loader. The methods which need to be overridden are explained below:</p>
<ul>
- <li>getInputFormat() :This method will be called by Pig to get the
InputFormat used by the loader. The methods in the InputFormat (and underlying
RecordReader) will be called by pig in the same manner (and in the same
context) as by Hadoop in a map-reduce java program. If the InputFormat is a
hadoop packaged one, the implementation should use the new API based one under
org.apache.hadoop.mapreduce. If it is a custom InputFormat, it should be
implemented using the new API in org.apache.hadoop.mapreduce. If a custom
loader using a text-based InputFormat or a file based InputFormat would like to
read files in all subdirectories under a given input directory recursively,
then it should use the PigFileInputFormat and PigTextInputFormat classes
provided in org.apache.pig.backend.hadoop.executionengine.mapReduceLayer. This
is to work around the current limitation in Hadoop's TextInputFormat and
FileInputFormat which only read one level down from provided input directory.
So for example if the input in the load statement is 'dir1' and there are subdirs
'dir2' and 'dir2/dir3' underneath dir1, using Hadoop's TextInputFormat or
FileInputFormat only files under 'dir1' can be read. Using PigFileInputFormat
or PigTextInputFormat (or by extending them), files in all the directories can
be read. </li>
+ <li>getInputFormat(): This method is called by Pig to get the InputFormat
used by the loader. The methods in the InputFormat (and underlying
RecordReader) are called by Pig in the same manner (and in the same context) as
by Hadoop in a MapReduce Java program. If the InputFormat is a Hadoop packaged
one, the implementation should use the new API based one under
org.apache.hadoop.mapreduce. If it is a custom InputFormat, it should be
implemented using the new API in org.apache.hadoop.mapreduce.<br></br>
<br></br>
+
+ If a custom loader using a text-based InputFormat or a file-based InputFormat
would like to read files in all subdirectories under a given input directory
recursively, then it should use the PigTextInputFormat and PigFileInputFormat
classes provided in
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer. The Pig
InputFormat classes work around a current limitation in the Hadoop
TextInputFormat and FileInputFormat classes which only read one level down from
the provided input directory. For example, if the input in the load statement
is 'dir1' and there are subdirs 'dir2' and 'dir2/dir3' beneath dir1, the Hadoop
TextInputFormat and FileInputFormat classes read the files under 'dir1' only.
Using PigTextInputFormat or PigFileInputFormat (or by extending them), the
files in all the directories can be read. </li>
+
<li>setLocation(): This method is called by Pig to communicate the load
location to the loader. The loader should use this method to communicate the
same information to the underlying InputFormat. This method is called multiple
times by Pig - implementations should bear this in mind and should ensure there
are no inconsistent side effects due to the multiple calls. </li>
+
<li>prepareToRead(): Through this method the RecordReader associated with
the InputFormat provided by the LoadFunc is passed to the LoadFunc. The
RecordReader can then be used by the implementation in getNext() to return a
tuple representing a record of data back to Pig. </li>
<li>getNext(): The meaning of getNext() has not changed and is called by the
Pig runtime to get the next tuple in the data - in this method the
implementation should use the underlying RecordReader and construct the tuple
to return (a skeletal sketch follows this list). </li>
</ul>
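To make the contract concrete, a skeletal loader (the class name is
hypothetical; it produces one single-field tuple per line of text, similar to
TextLoader, and touches only the methods described above):

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class SkeletonLoader extends LoadFunc {

    private RecordReader reader;
    private final TupleFactory tupleFactory = TupleFactory.getInstance();

    @Override
    public InputFormat getInputFormat() throws IOException {
        // A new-API (org.apache.hadoop.mapreduce) InputFormat; use
        // PigTextInputFormat instead if subdirectories must be read recursively.
        return new TextInputFormat();
    }

    @Override
    public void setLocation(String location, Job job) throws IOException {
        // Called multiple times by Pig; setting the input path is idempotent.
        FileInputFormat.setInputPaths(job, location);
    }

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) {
        this.reader = reader;
    }

    @Override
    public Tuple getNext() throws IOException {
        try {
            if (!reader.nextKeyValue()) {
                return null;    // end of input
            }
            Text line = (Text) reader.getCurrentValue();
            // One single-field tuple per line of input text.
            return tupleFactory.newTuple(line.toString());
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}

In a script the loader is then used like any other load function:
A = LOAD 'dir1' USING SkeletonLoader();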
@@ -1124,13 +1128,6 @@ public class SimpleTextStorer extends St
</section>
<!-- END LOAD/STORE FUNCTIONS -->
-<section>
-<title> Comparison Functions</title>
-
-<p>Comparison UDFs are mostly obsolete now. They were added to the language
because, at that time, the <code>ORDER</code> operator had two significant
shortcomings. First, it did not allow descending order and, second, it only
supported alphanumeric order. </p>
-<p>The latest version of Pig solves both of these issues. The <a
href="http://wiki.apache.org/pig/UserDefinedOrdering"> pointer</a> to the
original documentation is provided here for completeness. </p>
-
-</section>
<section>
<title>Builtin Functions and Function Repositories</title>