Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The following page has been changed by ChrisOlston:
http://wiki.apache.org/pig/PigOverview

------------------------------------------------------------------------------
- The purpose of this page is to give a quick tour through Pig, for a newcomer.
+ The purpose of this page is to give a quick tour of Pig. It is intentionally 
high level and omits some details, making it easy to digest for somebody who 
just wants to understand the main capabilities of Pig.
  
  Complete documentation is available at: [http://wiki.apache.org/pig Pig Wiki 
Main Page]
  
  == What is Pig: ==
  
-  * Pig has two parts:
+ Pig has two parts:
-    * A language for processing data, called ''Pig Latin''.
+  * A language for processing data, called ''Pig Latin''.
-    * A set of ''evaluation mechanisms'' for evaluating a Pig Latin program. 
Current evaluation mechanisms include (a) local evaluation in a single JVM, (b) 
evaluation by translation into one or more Map-Reduce jobs, executed using 
Hadoop.
+  * A set of ''evaluation mechanisms'' for evaluating a Pig Latin program. 
Current evaluation mechanisms include (a) local evaluation in a single JVM, (b) 
evaluation by translation into one or more Map-Reduce jobs, executed using 
Hadoop.
  
  == Pig Latin programs: ==
  
@@ -16, +16 @@

  
  * Additionally, users can incorporate their own custom code into essentially 
any Pig Latin operation. For example, if a user has a function that determines 
whether a given image contains a human face, the user can ask Pig to filter 
images according to this function. Pig will then evaluate the function on the 
user's behalf, over the images. If the evaluation mechanism incorporates 
parallelism, as is the case with the Hadoop evaluation mechanism, the user's 
function will be executed in parallel.
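+ 
+ For instance, the face-filtering step might be written as follows (a sketch; 
+ containsFace() stands for the user's own function and is not a Pig built-in):
+ 
+ {{{
+ images = load '/myimages' using myImageStorageFunc();
+ faces = filter images by containsFace(*);
+ store faces into '/myfaces' using myImageStorageFunc();
+ }}}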
  
- == Data: ==
+ == How to run Pig: ==
  
-  * Pig can process data of any format. Some standard formats, e.g. tab 
delimited text files, are supported via built-in capabilities. A user can add 
support for a file format by writing a function that parses the bytes of a file 
into objects in Pig's data model, and vice versa.
+ You have three options:
+  * Interactive shell, called ''Grunt'' (see the sketch after this list)
+  * Script file
+  * Embed in a host language; currently we support Java as the host language 
(embedding Pig Latin in Java is very similar to JDBC)
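+ 
+ For example, in the Grunt shell you type Pig Latin statements directly at the 
+ prompt (a sketch; makeThumbnail() and myImageStorageFunc() are user-supplied 
+ functions, as in Example 1 below):
+ 
+ {{{
+ grunt> images = load '/myimages' using myImageStorageFunc();
+ grunt> thumbnails = foreach images generate makeThumbnail(*);
+ grunt> store thumbnails into '/mythumbnails' using myImageStorageFunc();
+ }}}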
+ 
+ == Data formats: ==
+ 
+  * Pig can process data of any format. (Pigs eat anything! ... or is that 
goats?) A few common formats, such as tab-delimited text files, are supported 
via built-in capabilities. A user can add support for a file format by writing 
a function that parses the bytes of a file into objects in Pig's data model, 
and vice versa.
   * Pig's data model is similar to the relational data model, except that 
tuples can be nested. For example, you can have a table of tuples, where the 
third field of each tuple contains a table. In Pig, tables are called bags.
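+ 
+ For example, in Pig's textual notation (parentheses delimit tuples, braces 
+ delimit bags), a tuple whose third field is itself a bag of tuples might look 
+ like this (made-up values):
+ 
+ {{{
+ (alice, 28, {(www.foo.com, 0.8), (www.bar.com, 0.3)})
+ }}}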
+ 
+ == Other capabilities: ==
+ 
+  * '''Combining multiple data sets:''' Pig can combine multiple data sets, 
via operations such as join, union or cogroup.
+  * '''Splitting data sets:''' Pig can split a single data set into multiple 
ones, using an operation called split (see the sketch after this list).
+  * '''Semistructured data:''' Pig supports "maps" of (key, value) pairs, 
where retrieving the value associated with a given key is an efficient 
operation. Maps provide a convenient way to represent semistructured data, 
where the set of non-null fields varies from record to record. Maps are helpful 
when processing JSON, XML, and sparse relational data (i.e., tables with a lot 
of null values).
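+ 
+ For example, a split and a map lookup might be written as follows (a sketch; 
+ the relation and field names are hypothetical, and '#' retrieves the value 
+ associated with a given key from a map field):
+ 
+ {{{
+ split VISITS into RECENT if time > '20071201', OLDER if time <= '20071201';
+ QUERIES = foreach LOGS generate user, params#'query';
+ }}}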
  
  == Examples: ==
  
- === Example 1: Face Detection ===
+ === Example 1: Hello Pig ===
  
- Suppose you have a function makeThumbnail() that converts an image into a 
small thumbnail representation. You want to convert a set of images into 
thumbnails. A Pig Latin program to do this is:
+ Suppose you have a function, makeThumbnail(), that converts an image into a 
small thumbnail representation. You want to convert a set of images into 
thumbnails. A Pig Latin program to do this is:
  
  {{{
- images = load '/myimages' using myImageStorageFunc() as (id, img);
+ images = load '/myimages' using myImageStorageFunc();
- thumbnails = foreach images generate id, makeThumbnail(img);
+ thumbnails = foreach images generate makeThumbnail(*);
  store thumbnails into '/mythumbnails' using myImageStorageFunc();
  }}}
  
- The first line tells Pig: (1) what is the input to your computation (in this 
case, the content of the directory '/myimages'), (2) how can Pig convert the 
file content into tuples (in this case, by invoking myImageStorageFunc), (3) 
the schema of the parsed file content (in this case each tuple has two fields: 
id and img).
+ The first line tells Pig: (1) what the input to your computation is (in this 
case, the content of the directory '/myimages'), and (2) how Pig can interpret 
the file and delineate individual images (in this case, by invoking 
myImageStorageFunc). (#2 is optional; if it is omitted, Pig will attempt to 
parse the file using its default parser.)
  
- #2 and #3 are optional. If #2 is omitted, Pig will attempt to parse the file 
using its default parser. If #3 is omitted, then the user must refer to tuple 
fields by position instead of by name (e.g., $0 for id, and $1 for img).
+ The second line instructs Pig to convert every image into a thumbnail, by 
running the user's makeThumbnail function on each image.
  
- The second line instructs Pig to convert every (id, img) pair into an (id, 
thumbnail) pair, by running the makeThumbnail function on each image.
- 
- The third line instructs Pig to store the result into the directory 
'/mythumbnails', and encode the tuples into the file according to the 
myImageStorageFunc() function.
+ The third line instructs Pig to store the result into the directory 
'/mythumbnails', and encode the data into the file according to the 
myImageStorageFunc() function.
  
  Most Pig Latin commands consist of an assignment to a variable (e.g., images, 
thumbnails). These variables denote tables, but these tables are not 
necessarily materialized on disk or in the memory of any one machine. The final 
"store" command causes Pig to compile the preceding commands into an execution 
plan, e.g., one or more Map-Reduce jobs to execute on Hadoop. In the above 
example, the program will get compiled into a single Map-Reduce job where the 
Reduce phase is disabled, i.e., the output of the Map is the final output.
  
+ 
+ === Example 2: Using the relational-style operations ===
+ 
+ Suppose you have a log of users visiting web pages, which has entries of the 
form (user, url, time). Say you want to compute the average number of page 
visits made by a user (e.g., if the answer is 4, that means that, on average, 
each user generated four page visit events in the log). Here's a Pig Latin 
program that computes this number:
+ 
+ {{{
+      VISITS = load '/visits' as (user, url, time);
+ USER_VISITS = group VISITS by user;
+ USER_COUNTS = foreach USER_VISITS generate group as user, COUNT(VISITS) as numvisits;
+  ALL_COUNTS = group USER_COUNTS all;
+   AVG_COUNT = foreach ALL_COUNTS generate AVG(USER_COUNTS.numvisits);
+ 
+ dump AVG_COUNT;
+ }}}
+ 
+ The first line loads the data and specifies the schema. The "using" clause 
has been omitted because here we assume the data is in a tab-delimited text 
format that Pig can parse by default. Since the data has multiple fields, we 
use the "as" clause to assign names to the fields: user, url, time. The "as" 
clause is optional; if it is not used, you can refer to the fields by position 
(e.g., $0 for user, $1 for url, $2 for time).
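+ 
+ For instance, without the "as" clause, the grouping step in the program above 
+ would be written using positional references:
+ 
+ {{{
+ USER_VISITS = group VISITS by $0;
+ }}}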
+ 
+ The second line forms groups of tuples, one group for each unique user. The 
third line computes the size of each group, i.e., the number of log events 
associated with each user.
+ 
+ The fourth line places all tuples output from the previous step into a single 
group, and the fifth line computes the average of these values, i.e., the 
average of the per-user counts. If you wanted, say, the standard deviation 
instead of the average, which at present is not a built-in Pig function, you 
could write your own function and reference it in the "generate" clause in 
place of AVG.
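+ 
+ For instance, assuming a user-supplied StdDev() function (hypothetical; not a 
+ Pig built-in), the last line would become:
+ 
+ {{{
+ DEV_COUNT = foreach ALL_COUNTS generate StdDev(USER_COUNTS.numvisits);
+ }}}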
+ 
+ Since the output is small (a single number), we use "dump" instead of "store" 
to produce the output. Dump causes the output to be printed to the screen, 
instead of being written to a file. In general, we recommend that you use this 
command with caution :)
+ 
+ === Example 3: Combining multiple data sets ===
+ 
+ One of the key features of Pig is that you can combine multiple data sets 
using operations like join, union, and cogroup. (These operations are explained 
in detail in our documentation.)
+ 
+ Suppose we have our web page visit log from Example 2, and we also have a 
file that records the pagerank of each URL in the known universe, including all 
URLs in the visit log. If you aren't familiar with pagerank, just think of it 
as a numeric quality score for each web page. (In this example we assume the 
pagerank values have been computed in advance, although the iterative pagerank 
computation can itself be expressed in Pig Latin, with an outer Java function 
controlling the looping.) Say you want to identify users who tend to visit 
"good" pages, i.e., users for whom the average pagerank of visited pages 
exceeds some threshold. Here it is in Pig Latin:
+ 
+ {{{
+       VISITS = load '/visits' as (user, url, time);
+        PAGES = load '/pages' as (url, pagerank);
+ VISITS_PAGES = join VISITS by url, PAGES by url;
+  USER_VISITS = group VISITS_PAGES by user;
+   USER_AVGPR = foreach USER_VISITS generate group, AVG(VISITS_PAGES.pagerank) as avgpr;
+   GOOD_USERS = filter USER_AVGPR by avgpr > '0.5';
+ 
+ store GOOD_USERS into '/goodusers';
+ }}}
+ 
+ The first two lines load our two data sets (visits and pages). The third line 
specifies a join over the two sets: it finds VISITS tuples that have the same 
URL as PAGES tuples, and glues them together. Hence the VISITS_PAGES table 
gives us the pagerank of the URL in each visit tuple.
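+ 
+ For instance, if VISITS contains the tuple (Amy, www.foo.com, 8am) and PAGES 
+ contains (www.foo.com, 0.7), then VISITS_PAGES contains the glued-together 
+ tuple shown below (the values are made up):
+ 
+ {{{
+ (Amy, www.foo.com, 8am, www.foo.com, 0.7)
+ }}}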
+ 
+ The fourth line groups tuples by user, and the fifth line computes the 
average pagerank of each user's visited URLs.
+ 
+ The sixth line filters out users whose average pagerank does not exceed 0.5.
+ 
