Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The following page has been changed by ChrisOlston:
http://wiki.apache.org/pig/PigOverview

------------------------------------------------------------------------------
- ---+++ What is Pig:
+ The purpose of this page is to give a quick tour through Pig, for a newcomer.
+ 
+ Complete documentation is available at: [http://wiki.apache.org/pig Pig Wiki 
Main Page]
+ 
+ == What is Pig: ==
  
   * Pig has two parts:
-    * A language for processing data, called <i>Pig Latin</i>.
+    * A language for processing data, called ''Pig Latin''.
-    * A set of <i>evaluation mechanisms</i> for evaluating a Pig Latin 
program. Current evaluation mechanisms include (a) local evaluation in a single 
JVM, (2) evaluation by translation into one or more Map-Reduce jobs, executed 
using Hadoop.
+    * A set of ''evaluation mechanisms'' for evaluating a Pig Latin program. 
Current evaluation mechanisms include (a) local evaluation in a single JVM, (b) 
evaluation by translation into one or more Map-Reduce jobs, executed using 
Hadoop.
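+ 
+ For illustration, the evaluation mechanism is typically chosen when invoking Pig (the exact flag names may vary by release):
+ 
+ {{{
+ pig -x local myscript.pig       # local evaluation in a single JVM
+ pig -x mapreduce myscript.pig   # compile into Map-Reduce jobs run on Hadoop
+ }}}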
  
- ---+++ Pig Latin programs:
+ == Pig Latin programs: ==
  
  * Pig Latin has built-in relational-style operations such as filter, project, group, and join. Pig Latin also has a map operation that applies a custom user function to every member of a set; in Pig Latin, this map operation is called "foreach".
  
   * Additionally, users can incorporate their own custom code into essentially 
any Pig Latin operation. For example, if a user has a function that determines 
whether a given image contains a human face, the user can ask Pig to filter 
images according to this function. Pig will then evaluate this function on the 
user's behalf, over the images. If the evaluation mechanism incorporates 
parallelism, as is the case with the Hadoop evaluation mechanism, then the 
user's function will be executed in a parallel fashion.
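+ 
+ As a sketch of this point (containsFace() is a hypothetical user function, and myImageStorageFunc() is the storage function from the example below), a filter driven by custom code might look like:
+ 
+ {{{
+ images = load '/myimages' using myImageStorageFunc() as (id, img);
+ faces = filter images by containsFace(img);
+ store faces into '/faceimages' using myImageStorageFunc();
+ }}}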
  
- ---+++ Data:
+ == Data: ==
  
   * Pig can process data of any format. Some standard formats, e.g. tab 
delimited text files, are supported via built-in capabilities. A user can add 
support for a file format by writing a function that parses the bytes of a file 
into objects in Pig's data model, and vice versa.
   * Pig's data model is similar to the relational data model, except that 
tuples can be nested. For example, you can have a table of tuples, where the 
third field of each tuple contains a table. In Pig, tables are called bags.
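+ 
+ For example (the schema here is illustrative), grouping produces exactly this kind of nesting: each output tuple holds a group key plus a bag of the tuples belonging to that group:
+ 
+ {{{
+ visits = load '/visits' as (user, url);
+ byUser = group visits by user;
+ -- each tuple of byUser has the form (user, {(user, url), (user, url), ...})
+ }}}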
  
- ---+++ 
+ == Examples: ==
  
+ === Example 1: Thumbnail Generation ===
+ 
+ Suppose you have a function makeThumbnail() that converts an image into a 
small thumbnail representation. You want to convert a set of images into 
thumbnails. A Pig Latin program to do this is:
+ 
+ {{{
+ images = load '/myimages' using myImageStorageFunc() as (id, img);
+ thumbnails = foreach images generate id, makeThumbnail(img);
+ store thumbnails into '/mythumbnails' using myImageStorageFunc();
+ }}}
+ 
+ The first line tells Pig: (1) what the input to your computation is (in this case, the content of the directory '/myimages'), (2) how Pig can convert the file content into tuples (in this case, by invoking myImageStorageFunc), and (3) the schema of the parsed file content (in this case, each tuple has two fields: 
id and img).
+ 
+ #2 and #3 are optional. If #2 is omitted, Pig will attempt to parse the file 
using its default parser. If #3 is omitted, then the user must refer to tuple 
fields by position instead of by name (e.g., $0 for id, and $1 for img).
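+ 
+ For instance, the same program with #2 and #3 omitted would rely on Pig's default parser and refer to fields by position:
+ 
+ {{{
+ images = load '/myimages';
+ thumbnails = foreach images generate $0, makeThumbnail($1);
+ store thumbnails into '/mythumbnails';
+ }}}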
+ 
+ The second line instructs Pig to convert every (id, img) pair into an (id, 
thumbnail) pair, by running the makeThumbnail function on each image.
+ 
+ The third line instructs Pig to store the result into the directory 
'/mythumbnails', and encode the tuples into the file according to the 
myImageStorageFunc() function.
+ 
+ Most Pig Latin commands consist of an assignment to a variable (e.g., images, 
thumbnails). These variables denote tables, but these tables are not 
necessarily materialized on disk or in the memory of any one machine. The final 
"store" command causes Pig to compile the preceeding commands into an execution 
plan, e.g., one or more Map-Reduce jobs to execute on Hadoop. In the above 
example, the program will get compiled into a single Map-Reduce job where the 
Reduce phase is disabled, i.e., the output of the Map is the final output.
+ 
