[ 
https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760423#action_12760423
 ] 

Pradeep Kamath commented on PIG-953:
------------------------------------

Here is a proposal for representing sort column information in SortInfo. Rather 
than exposing an ArrayList of column names and a separate ArrayList of asc/desc 
flags, it would be better to have a single structure that holds both pieces of 
information for each sort column. There are also use cases where column names 
are needed (zebra) and others where names are optional and column positions 
are provided instead, which some other loader/optimizer might find useful. The 
column type might also be useful when it is available. Hence the proposal is to 
have a SortColumn class with the following attributes: column name, column 
position (zero-based index), column type, and asc/desc flag. SortInfo would 
then hold a List<SortColumn>, available through a getter. This should address 
both concerns above. Callers will need to explicitly check for null column 
names and an UNKNOWN column type, since both can occur when no schema is 
available for the Pig runtime to supply this information.
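
A rough sketch of the shape this could take, for discussion only. Apart from 
SortColumn, SortInfo and List<SortColumn>, every name here is an assumption: 
the Order and ColumnType enums, the getter names, and the describe() helper 
are hypothetical, and ColumnType is only a stand-in for whatever type codes 
Pig would actually use.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical: one sort column with name, position, type and order together. */
final class SortColumn {
    enum Order { ASCENDING, DESCENDING }

    /** Stand-in for Pig's type codes; only UNKNOWN matters for this sketch. */
    enum ColumnType { UNKNOWN, INTEGER, LONG, CHARARRAY }

    private final String name;      // null when no schema is available
    private final int position;     // zero-based index, always present
    private final ColumnType type;  // UNKNOWN when no schema is available
    private final Order order;

    SortColumn(String name, int position, ColumnType type, Order order) {
        if (position < 0) {
            throw new IllegalArgumentException("position must be >= 0");
        }
        this.name = name;
        this.position = position;
        this.type = (type == null) ? ColumnType.UNKNOWN : type;
        this.order = order;
    }

    String getName() { return name; }
    int getPosition() { return position; }
    ColumnType getType() { return type; }
    Order getOrder() { return order; }
}

/** Hypothetical: exposes the sort columns as one list through a getter. */
final class SortInfo {
    private final List<SortColumn> sortColumns;

    SortInfo(List<SortColumn> sortColumns) {
        // Defensive copy so callers cannot change the sort order afterwards.
        this.sortColumns =
            Collections.unmodifiableList(new ArrayList<>(sortColumns));
    }

    List<SortColumn> getSortColumns() { return sortColumns; }

    /** Caller-side null check: fall back to the position when no name exists. */
    static String describe(SortColumn c) {
        String label = (c.getName() != null) ? c.getName() : "$" + c.getPosition();
        return label + " " + c.getOrder();
    }
}
```

A loader that needs names (as zebra does) would check getName() for null, 
while one that only needs positions can rely on getPosition() alone, as the 
describe() helper illustrates.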

Thoughts?

> Enable merge join in pig to work with loaders and store functions which can 
> internally index sorted data 
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-953
>                 URL: https://issues.apache.org/jira/browse/PIG-953
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.3.0
>            Reporter: Pradeep Kamath
>            Assignee: Pradeep Kamath
>         Attachments: PIG-953-2.patch, PIG-953.patch
>
>
> Currently merge join implementation in pig includes construction of an index 
> on sorted data and use of that index to seek into the "right input" to 
> efficiently perform the join operation. Some loaders (notably the zebra 
> loader) internally implement an index on sorted data and can perform this 
> seek efficiently using their index. So the use of the index needs to be 
> abstracted in such a way that when the loader supports indexing, pig uses it 
> (indirectly through the loader) and does not construct an index. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
