[jira] Commented: (PIG-958) Splitting output data on key field

Pradeep Kamath (JIRA) Wed, 23 Sep 2009 17:49:43 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12758952#action_12758952
 ]


Pradeep Kamath commented on PIG-958:
------------------------------------

In general the patch looks good - do address the hudson QA issues. Some code 
review comments:

1) I don't think pig.properties changes should be part of this patch - was this 
accidental? Likewise for Main.java.

2)
{noformat}
 72   /**                                                                       
                                                                                
                                                                           
 73    *  storeUID is needed to get the around the case of multi-query 
optimisation where                                                              
                                                                                
    
 74    *  multiple instances of MultiStorage could be running in a single 
mapper/reducer                                                                  
                                                                                
 
 75    *  and run the risk of overwriting each other's output.                  
                                                                                
                                                                           
 76    */      
{noformat}
Wouldn't the multiple instances be writing to different locations in which case 
there should be no race condition right?

3)
{noformat}
172   private void initJobSpecificParams() throws IOException {                 
                                                                                
                                                                           
173     if (partition == null || outputPath == null) {   
{noformat}
Wouldn't the outputPath never be null since it is initialized in the 
constructor?

4) Consider removing the log.debug() since these are all in the inner loop and 
would possibly impact performance.

5)
{noformat}
291         String fieldValueBasedPathStr = 
fieldValueBasedPath.toUri().getPath(); 
{noformat}
This variable is only really used in a log.debug(), I think removing this and 
using fieldValueBasedPath in all the fs.create() will make the code shorter and 
cleaner.

6) Some comments on why the move of the output dirs is needed and in the 
moveResults() and removePart() methods would be helpful. Additional comments on 
the code flow would also help


> Splitting output data on key field
> ----------------------------------
>
>                 Key: PIG-958
>                 URL: https://issues.apache.org/jira/browse/PIG-958
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.4.0
>            Reporter: Ankur
>         Attachments: 958.v2.patch
>
>
> Pig users often face the need to split the output records into a bunch of 
> files and directories depending on the type of record. Pig's SPLIT operator 
> is useful when record types are few and known in advance. In cases where type 
> is not directly known but is derived dynamically from values of a key field 
> in the output tuple, a custom store function is a better solution.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-958) Splitting output data on key field

Reply via email to