[ 
https://issues.apache.org/jira/browse/PIG-59?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12565948#action_12565948
 ] 

Utkarsh Srivastava commented on PIG-59:
---------------------------------------

Shubham, sorry for the delay in reviewing this. Chris, a TODO item in there for 
you. Here are my comments on the patch:

Major comments:

 * Is this patch before or after PIG-32? I guess before. I think you will have 
to merge the new trunk and generate a new patch.

 * Most of the logic is one monolithic class Exgen.java. Is there some logical 
way to break out the functionality into smaller classes? 

 * I didn't check the functionality of ExGen.java itself. Chris, can you do 
that? I merely checked for the effects on the current code.

 * POLoad
Why are the changes not in a if(lineageTracer!=null) block?



Other minor comments:

 * Test pattern in build.xml: 
Why did you have to change build.xml? The tests should be in the same suite.


 * LineageTracer
Why is the ASF License removed?



> A new "ILLUSTRATE" command which will help people debug their pig programs
> --------------------------------------------------------------------------
>
>                 Key: PIG-59
>                 URL: https://issues.apache.org/jira/browse/PIG-59
>             Project: Pig
>          Issue Type: New Feature
>          Components: grunt
>            Reporter: Shubham Chopra
>         Attachments: ExGenPatch
>
>
> I propose to add a new "ILLUSTRATE" command to Pig, which will help people 
> debug their Pig programs.
> The idea is to select a few example data items, and illustrate how they are 
> transformed by the sequence of Pig commands in the user's program. I have an 
> algorithm that can select an appropriate and concise set of example data 
> items automatically. It does a better job than random sampling would do; for 
> example, random sampling suffers from the drawback that selective operations 
> such as filters or joins can eliminate *all* the sampled data items, giving 
> you empty results which is of no help in debugging.
> This "ILLUSTRATE" functionality will avoid people having to test their Pig 
> programs on large data sets, which has a long turnaround time and wastes 
> system resources.
> Proposed Implementation:
> I will create a new package called org.apache.pig.exgen, which will contain 
> the aforementioned algorithm. The algorithm uses the "Local" execution 
> operators (it does not run on hadoop), so as to generate illustrative example 
> data in near-real-time for the user. 
> For my algorithm to work properly, it needs to trace the "lineage" (sometimes 
> called "provenance") of data items as they flow through the local operator 
> tree corresponding to the user's Pig program. So I will have to add a 
> "lineage tracer" to the Local operators, which maintains a side data 
> structure to represent the lineage, or derivation sequence, among data items. 
> The lineage tracer will be DISABLED BY DEFAULT, so it will not affect normal 
> Pig operation.
> I will add a new method to PigServer called 
> "PigServer.showExamples(LogicalPlan)", which will cause my exgen algorithm to 
> be invoked.
> I will also add a new command to Grunt, called ILLUSTRATE. Syntactically it 
> will work the same way as the STORE command. For example, a user might type:
> grunt> visits = load 'visits.txt' as (user, url, timestamp);
> grunt> recent_visits = filter visits by timestamp >= '20071201';
> grunt> user_visits = group recent_visits by user;
> grunt> num_user_visits = foreach user_visits generate group, 
> COUNT(recent_visits);
> grunt> illustrate num_user_visits
> This would trigger my exgen algorithm, which will display something like:
> visits:
> (Amy, www.cnn.com, 20070218)
> (Fred, www.harvard.edu, 20071204)
> (Amy, www.bbc.com, 20071205)
> (Fred, www.stanford.edu, 20071206)
> recent_visits:
> (Fred, www.harvard.edu, 20071204)
> (Amy, www.bbc.com, 20071205)
> (Fred, www.stanford.edu, 20071206)
> user_visits:
> (Fred, { (Fred, www.harvard.edu, 20071204), (Fred, www.stanford.edu, 
> 20071206) } )
> (Amy, { (Amy, www.bbc.com, 20071205) } )
> num_user_visits:
> (Fred, 2)
> (Amy, 1)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to