Pradeep Kamath commented on PIG-953:

  I looked at the ResourceSchema proposed in 
http://wiki.apache.org/pig/LoadStoreRedesignProposal and also spoke with Alan 
to understand the intent more. The eventual goal is for the setSchema() call in 
StoreFunc to give the ResourceSchema to the store implementation. The 
ResourceSchema will contain both pig schema information and sort column 
information. So Zebra or any other storage function that needs to know about 
sort columns will get the information from the ResourceSchema passed in to 
setSchema().

However, today the Pig runtime already conveys the Pig schema to store 
functions through StoreConfig. We need a separate way to convey sort 
information, since the Pig schema cannot express it. Because the rewrite of 
the load/store interfaces will solve this problem through setSchema(), 
whatever solution we come up with in this jira will need to be rewritten 
anyway. So it is cleaner to keep only sort column information in SortColInfo 
and have an array of SortColInfo in SortInfo. If we used ResourceSchema 
instead, StoreConfig would hold both a Pig Schema and a ResourceSchema, which 
would be confusing to callers. 
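
To make the proposed shape concrete, here is a minimal sketch of a SortInfo 
holding an array of SortColInfo. Only the two class names come from the 
discussion above; the fields (column name, column index, sort order), the 
Order enum and the accessor names are assumptions for illustration and may 
not match what the patch actually contains.

```java
// Hypothetical sketch: only the names SortColInfo and SortInfo come from
// the comment; every field and method below is an assumption.
final class SortColInfo {
    enum Order { ASCENDING, DESCENDING }

    private final String colName; // assumed: alias of the sort column, may be null
    private final int colIndex;   // assumed: position of the column in the stored tuple
    private final Order order;    // assumed: direction in which the column is sorted

    SortColInfo(String colName, int colIndex, Order order) {
        this.colName = colName;
        this.colIndex = colIndex;
        this.order = order;
    }

    String getColName() { return colName; }
    int getColIndex() { return colIndex; }
    Order getOrder() { return order; }
}

// Carries only sort information: one entry per sort column, in sort-key order.
final class SortInfo {
    private final SortColInfo[] sortColInfos;

    SortInfo(SortColInfo[] sortColInfos) {
        // Copy the array so later changes by the caller do not leak in.
        this.sortColInfos = sortColInfos.clone();
    }

    SortColInfo[] getSortColInfos() {
        // Return a copy so callers cannot modify the internal array.
        return sortColInfos.clone();
    }
}
```

With this shape, a store function such as Zebra would read the sort columns 
from SortInfo alone, and StoreConfig would not need to carry a second schema 
representation.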

In short, since this piece of code will need a rewrite later, it is better 
not to make it generic now and to address only the immediate needs. The 
rewrite should then remove the multiple representations of schema/sort 
information.

> Enable merge join in pig to work with loaders and store functions which can 
> internally index sorted data 
> ---------------------------------------------------------------------------------------------------------
>                 Key: PIG-953
>                 URL: https://issues.apache.org/jira/browse/PIG-953
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.3.0
>            Reporter: Pradeep Kamath
>            Assignee: Pradeep Kamath
>         Attachments: PIG-953-2.patch, PIG-953.patch
> Currently merge join implementation in pig includes construction of an index 
> on sorted data and use of that index to seek into the "right input" to 
> efficiently perform the join operation. Some loaders (notably the zebra 
> loader) internally implement an index on sorted data and can perform this 
> seek efficiently using their index. So the use of the index needs to be 
> abstracted in such a way that when the loader supports indexing, pig uses it 
> (indirectly through the loader) and does not construct an index. 
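
The abstraction described in the issue can be sketched roughly as follows. 
All names here (Loader, SeekableLoader, seekNear, MergeJoinPlanner) are 
hypothetical and chosen only for illustration; they are not claimed to be 
Pig's actual interfaces. The idea is that a loader advertises its own index 
by implementing a seek-capable interface, and Pig builds its own index only 
when the right input's loader does not.

```java
// Hypothetical sketch; none of these names are taken from Pig itself.
interface Loader {
    // Returns the next record, or null at the end of the input.
    String[] next();
}

// A loader that keeps its own index on sorted data (as the zebra loader
// is said to) and can therefore position itself by key.
interface SeekableLoader extends Loader {
    // Assumed contract: position the loader at the first record whose
    // key is greater than or equal to the given key.
    void seekNear(String key);
}

final class MergeJoinPlanner {
    // Pig constructs its own index on the right input only when the
    // loader cannot seek by itself.
    static boolean needsPigBuiltIndex(Loader rightInput) {
        return !(rightInput instanceof SeekableLoader);
    }
}
```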
