Author: daijy
Date: Sat May 23 22:56:36 2015
New Revision: 1681396

URL: http://svn.apache.org/r1681396
Log:
PIG-4560: Pig 0.15.0 Documentation

Modified:
    pig/branches/branch-0.15/CHANGES.txt
    pig/branches/branch-0.15/src/docs/src/documentation/content/xdocs/func.xml
    pig/branches/branch-0.15/src/docs/src/documentation/content/xdocs/perf.xml

Modified: pig/branches/branch-0.15/CHANGES.txt
URL: 
http://svn.apache.org/viewvc/pig/branches/branch-0.15/CHANGES.txt?rev=1681396&r1=1681395&r2=1681396&view=diff
==============================================================================
--- pig/branches/branch-0.15/CHANGES.txt (original)
+++ pig/branches/branch-0.15/CHANGES.txt Sat May 23 22:56:36 2015
@@ -24,6 +24,8 @@ INCOMPATIBLE CHANGES
  
 IMPROVEMENTS
 
+PIG-4560: Pig 0.15.0 Documentation (daijy)
+
 PIG-4429: Add Pig alias information and Pig script to the DAG view in Tez UI 
(daijy)
 
 PIG-3994: Implement getting backend exception for Tez (rohini)

Modified: 
pig/branches/branch-0.15/src/docs/src/documentation/content/xdocs/func.xml
URL: 
http://svn.apache.org/viewvc/pig/branches/branch-0.15/src/docs/src/documentation/content/xdocs/func.xml?rev=1681396&r1=1681395&r2=1681396&view=diff
==============================================================================
--- pig/branches/branch-0.15/src/docs/src/documentation/content/xdocs/func.xml 
(original)
+++ pig/branches/branch-0.15/src/docs/src/documentation/content/xdocs/func.xml 
Sat May 23 22:56:36 2015
@@ -1877,6 +1877,8 @@ A = LOAD 'data' USING TextLoader();
                     less than this value</li>
                 <li>-timestamp=timestamp Return cell values that have a 
creation timestamp equal to
                     this value</li>
+                <li>-includeTimestamp=Record will include the timestamp after the rowkey on store (rowkey, timestamp, ...)</li>
+                <li>-includeTombstone=Record will include a tombstone marker on store after the rowKey and timestamp (if included) (rowkey, [timestamp,] tombstone, ...)</li>
                </ul>
             </td>
          </tr>
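As a sketch of how the new store options might be used (the option names come from the list above; the table, column family and field names here are hypothetical):

```pig
-- A holds (rowkey, timestamp, name) tuples; with -includeTimestamp the
-- second field is taken as the cell timestamp, as described above
STORE A INTO 'hbase://sampletable'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'info:name', '-includeTimestamp true');
```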
@@ -6056,6 +6058,74 @@ bottomResults = FOREACH D {
 </section>
 </section>
 <!-- End Other Functions -->
-
+<!-- ======================================================== -->
+<!-- ======================================================== -->
+<!-- Other Functions -->
+<section id="hive-udf">
+<title>Hive UDF</title>
+<p>Pig can invoke all types of Hive UDFs: UDF, GenericUDF, UDAF, GenericUDAF and GenericUDTF. Depending on the Hive UDF you want to use, you declare it in Pig with HiveUDF (handles UDF and GenericUDF), HiveUDAF (handles UDAF and GenericUDAF), or HiveUDTF (handles GenericUDTF).</p>
+  <section>
+    <title>Syntax</title>
+    <p>HiveUDF, HiveUDAF and HiveUDTF share the same syntax.</p>
+    <table>
+      <tr>
+        <td>
+          <p>HiveUDF(name[, constant parameters])</p>
+        </td>
+      </tr>
+    </table>
+  </section>
+  <section>
+    <title>Terms</title>
+    <table>
+      <tr>
+        <td>
+          <p>name</p>
+        </td>
+        <td>
+          <p>Hive UDF name. This can be a fully qualified class name of the Hive UDF/UDTF/UDAF class, or a short name registered in the Hive FunctionRegistry (as most Hive builtin UDFs are)</p>
+        </td>
+      </tr>
+      <tr>
+        <td>
+          <p>constant parameters</p>
+        </td>
+        <td>
+          <p>Optional tuple representing the constant parameters of a Hive UDF/UDTF/UDAF. If a Hive UDF requires a constant parameter, this is the only way Pig can pass that information to Hive, since the Pig schema does not record whether a parameter is constant or not. A null item in the tuple means the corresponding field is not a constant; a non-null item represents a constant field. The data type of each item is determined by the Pig constant parser.</p>
+        </td>
+      </tr>
+    </table>
+  </section>
+  <section>
+  <title>Example</title>
+  <p>HiveUDF</p>
+  <source>
+define sin HiveUDF('sin');
+A = LOAD 'student' as (name:chararray, age:int, gpa:double);
+B = foreach A generate sin(gpa);
+  </source>
+  <p>HiveUDTF</p>
+  <source>
+define explode HiveUDTF('explode');
+A = load 'mydata' as (a0:{(b0:chararray)});
+B = foreach A generate flatten(explode(a0));
+  </source>
+  <p>HiveUDAF</p>
+  <source>
+define avg HiveUDAF('avg');
+A = LOAD 'student' as (name:chararray, age:int, gpa:double);
+B = group A by name;
+C = foreach B generate group, avg(A.age);
+  </source>
+  <p>HiveUDF with constant parameter</p>
+<source>
+define in_file HiveUDF('in_file', '(null, "names.txt")');
+A = load 'student' as (name:chararray, age:long, gpa:double);
+B = foreach A generate in_file(name, 'names.txt');
+</source>
+  </section>
+<p>In this example, we pass (null, "names.txt") to the constructor of the UDF in_file, meaning the first parameter is a regular parameter and the second is a constant. names.txt can be double quoted (unlike other Pig syntax), or quoted in \'. Note that we need to pass 'names.txt' again in line 3. This looks redundant, but it is needed to fill the semantic gap between Pig and Hive: the constant must still be passed through the data pipeline in line 3, just as with a regular Pig UDF. Initialization code in a Hive UDF takes ObjectInspectors, which capture both the data type and whether the parameter is a constant. Initialization code in Pig, however, takes a schema, which only captures the former, so an additional mechanism (the constructor parameter) is needed to convey the latter.</p>
+<p>Note: A few Hive 0.14 UDFs contain bugs which affect Pig; they are fixed in Hive 1.0. Here is the list: compute_stats, context_ngrams, count, ewah_bitmap, histogram_numeric, collect_list, collect_set, ngrams, case, in, named_struct, stack, percentile_approx.</p>
+</section>
   </body>
 </document>

Modified: 
pig/branches/branch-0.15/src/docs/src/documentation/content/xdocs/perf.xml
URL: 
http://svn.apache.org/viewvc/pig/branches/branch-0.15/src/docs/src/documentation/content/xdocs/perf.xml?rev=1681396&r1=1681395&r2=1681396&view=diff
==============================================================================
--- pig/branches/branch-0.15/src/docs/src/documentation/content/xdocs/perf.xml 
(original)
+++ pig/branches/branch-0.15/src/docs/src/documentation/content/xdocs/perf.xml 
Sat May 23 22:56:36 2015
@@ -47,10 +47,11 @@
   <section id="auto-parallelism">
     <title>Automatic parallelism</title>
     <p>Just like MapReduce, if user specify "parallel" in their Pig statement, 
or user define default_parallel in Tez mode, Pig will honor it (the only 
exception is if user specify a parallel which is apparently too low, Pig will 
override it) </p>
-    <p>If user specify neither "parallel" or "default_parallel", Pig will use 
automatic parallelism. In MapReduce, Pig submit one MapReduce job a time and 
before submiting a job, Pig has chance to automatically set reduce parallelism 
based on the size of input file. On the contrary, Tez submit a DAG as a unit 
and automatic parallelism is managed in two parts</p>
+    <p>If the user specifies neither "parallel" nor "default_parallel", Pig will use automatic parallelism. In MapReduce, Pig submits one MapReduce job at a time, and before submitting a job, Pig has a chance to automatically set the reduce parallelism based on the size of the input file. Tez, on the contrary, submits a DAG as a unit, and automatic parallelism is managed in three parts</p>
     <ul>
     <li>Before submiting a DAG, Pig estimate parallelism of each vertex 
statically based on the input file size of the DAG and the complexity of the 
pipeline of each vertex</li>
-    <li>At runtime, Tez adjust vertex parallelism dynamically based on the 
input data volume of the vertex. Note currently Tez can only decrease the 
parallelism dynamically not increase. So in step 1, Pig overestimate the 
parallelism</li>
+    <li>As the DAG progresses, Pig adjusts the parallelism of vertices with the best knowledge available at that moment (Pig grace parallelism)</li>
+    <li>At runtime, Tez adjusts vertex parallelism dynamically based on the input data volume of the vertex. Note that currently Tez can only decrease the parallelism dynamically, not increase it, so in steps 1 and 2 Pig overestimates the parallelism</li>
     </ul>
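For contrast with the automatic behavior described above, explicit parallelism can be set as sketched below (the relation names and input path are hypothetical; the honored-unless-too-low behavior is from the paragraph above):

```pig
-- session-wide default reduce parallelism
SET default_parallel 20;
A = LOAD 'input' AS (k:chararray, v:int);
-- per-operator override; Pig honors it unless it is clearly too low
B = GROUP A BY k PARALLEL 50;
```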
     <p>The following parameter control the behavior of automatic parallelism 
in Tez (share with MapReduce):</p>
 <source>

