[ 
https://issues.apache.org/jira/browse/PIG-59?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-59:
-----------------------------

    Assignee: Shubham Chopra

> A new "ILLUSTRATE" command which will help people debug their pig programs
> --------------------------------------------------------------------------
>
>                 Key: PIG-59
>                 URL: https://issues.apache.org/jira/browse/PIG-59
>             Project: Pig
>          Issue Type: New Feature
>          Components: grunt
>            Reporter: Shubham Chopra
>            Assignee: Shubham Chopra
>             Fix For: 0.1.0
>
>         Attachments: displayAlternate.patch, ExampleGenerator.patch, 
> ExampleGenerator.patch, ExampleGenerator.patch
>
>
> I propose to add a new "ILLUSTRATE" command to Pig, which will help people 
> debug their Pig programs.
> The idea is to select a few example data items, and illustrate how they are 
> transformed by the sequence of Pig commands in the user's program. I have an 
> algorithm that can select an appropriate and concise set of example data 
> items automatically. It does a better job than random sampling; for example, 
> with random sampling, selective operations such as filters or joins can 
> eliminate *all* the sampled data items, leaving you with empty results that 
> are of no help in debugging.
> This "ILLUSTRATE" functionality will spare people from testing their Pig 
> programs on large data sets, which has a long turnaround time and wastes 
> system resources.
> Proposed Implementation:
> I will create a new package called org.apache.pig.exgen, which will contain 
> the aforementioned algorithm. The algorithm uses the "Local" execution 
> operators (it does not run on Hadoop), so as to generate illustrative example 
> data in near-real-time for the user. 
> For my algorithm to work properly, it needs to trace the "lineage" (sometimes 
> called "provenance") of data items as they flow through the local operator 
> tree corresponding to the user's Pig program. So I will have to add a 
> "lineage tracer" to the Local operators, which maintains a side data 
> structure to represent the lineage, or derivation sequence, among data items. 
> The lineage tracer will be DISABLED BY DEFAULT, so it will not affect normal 
> Pig operation.
> I will add a new method to PigServer called 
> "PigServer.showExamples(LogicalPlan)", which will cause my exgen algorithm to 
> be invoked.
> I will also add a new command to Grunt, called ILLUSTRATE. Syntactically it 
> will work the same way as the STORE command. For example, a user might type:
> grunt> visits = load 'visits.txt' as (user, url, timestamp);
> grunt> recent_visits = filter visits by timestamp >= '20071201';
> grunt> user_visits = group recent_visits by user;
> grunt> num_user_visits = foreach user_visits generate group, COUNT(recent_visits);
> grunt> illustrate num_user_visits
> This would trigger my exgen algorithm, which will display something like:
> visits:
> (Amy, www.cnn.com, 20070218)
> (Fred, www.harvard.edu, 20071204)
> (Amy, www.bbc.com, 20071205)
> (Fred, www.stanford.edu, 20071206)
> recent_visits:
> (Fred, www.harvard.edu, 20071204)
> (Amy, www.bbc.com, 20071205)
> (Fred, www.stanford.edu, 20071206)
> user_visits:
> (Fred, { (Fred, www.harvard.edu, 20071204), (Fred, www.stanford.edu, 20071206) } )
> (Amy, { (Amy, www.bbc.com, 20071205) } )
> num_user_visits:
> (Fred, 2)
> (Amy, 1)
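
The lineage idea described in the issue can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the exgen algorithm or any Pig code: the names `run_pipeline`, `source_rows`, and the `lineage` dictionary are hypothetical, and the example rows are the ones shown above. Each operator records, in a side structure, which items its outputs were derived from, so any final result can be traced back to the input rows that produced it.

```python
# Example rows from the issue text: (user, url, timestamp).
VISITS = [
    ("Amy", "www.cnn.com", "20070218"),
    ("Fred", "www.harvard.edu", "20071204"),
    ("Amy", "www.bbc.com", "20071205"),
    ("Fred", "www.stanford.edu", "20071206"),
]


def run_pipeline(visits, cutoff="20071201"):
    """Mimic filter -> group -> count, recording lineage on the side.

    `lineage` (a hypothetical structure) maps an (alias, item) key to the
    list of (alias, item) keys it was derived from.
    """
    lineage = {}

    # recent_visits = filter visits by timestamp >= cutoff
    recent = [row for row in visits if row[2] >= cutoff]
    for row in recent:
        lineage[("recent_visits", row)] = [("visits", row)]

    # user_visits = group recent_visits by user
    groups = {}
    for row in recent:
        groups.setdefault(row[0], []).append(row)
    user_visits = [(user, tuple(rows)) for user, rows in groups.items()]
    for user, rows in user_visits:
        lineage[("user_visits", (user, rows))] = [
            ("recent_visits", r) for r in rows
        ]

    # num_user_visits = foreach user_visits generate group, COUNT(...)
    num_user_visits = [(user, len(rows)) for user, rows in user_visits]
    for grouped, counted in zip(user_visits, num_user_visits):
        lineage[("num_user_visits", counted)] = [("user_visits", grouped)]

    return num_user_visits, lineage


def source_rows(item, lineage):
    """Follow lineage back to the base input rows that produced `item`."""
    parents = lineage.get(item)
    if parents is None:  # nothing recorded: this is a base input row
        return [item[1]]
    rows = []
    for parent in parents:
        rows.extend(source_rows(parent, lineage))
    return rows


if __name__ == "__main__":
    result, lineage = run_pipeline(VISITS)
    print(result)
    print(source_rows(("num_user_visits", ("Fred", 2)), lineage))
    # A sample containing only the old Amy row is wiped out by the filter,
    # which is the drawback of random sampling noted in the issue.
    print(run_pipeline([VISITS[0]])[0])
```

Tracing `("Fred", 2)` back through the recorded lineage yields the two Fred rows from `visits`, which is the kind of information an example selector could use to keep inputs that survive selective operators.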

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
