Yan Zhou updated PIG-997:

    Status: Patch Available  (was: Open)

> [zebra] Sorted Table Support by Zebra
> -------------------------------------
>                 Key: PIG-997
>                 URL: https://issues.apache.org/jira/browse/PIG-997
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Yan Zhou
>             Fix For: 0.6.0
>         Attachments: SortedTable.patch, SortedTable.patch
> This new feature lets Zebra support sorted data in storage. As a storage 
> library, Zebra does not sort the data itself, but it supports the creation 
> and use of sorted data either through PIG or through map/reduce tasks that 
> use Zebra as the storage format.
> The sorted table keeps the data in a "totally sorted" manner across all 
> TFiles created by potentially all mappers or reducers.
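
To make "totally sorted" concrete: each file must be sorted internally, and no key in a later file may precede a key in an earlier file. The sketch below checks that property on in-memory lists that stand in for files. It is only an illustration; the class and method names are hypothetical and are not part of Zebra.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; not part of Zebra. */
final class TotalOrderSketch {

    /**
     * Returns true if the keys of all parts, read part by part in the
     * given order, form one ascending (non-decreasing) sequence.
     */
    static boolean isTotallySorted(List<List<Integer>> parts) {
        Integer previous = null;
        for (List<Integer> part : parts) {
            for (Integer key : part) {
                if (previous != null && previous.compareTo(key) > 0) {
                    return false; // a key goes backwards, within or across parts
                }
                previous = key;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Each part is sorted on its own, but the key ranges overlap.
        List<List<Integer>> overlapping =
                Arrays.asList(Arrays.asList(1, 5), Arrays.asList(2, 6));
        System.out.println(isTotallySorted(overlapping));
    }
}
```

Sorting each mapper's or reducer's output separately is therefore not enough; the key ranges of the files must not overlap either.
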
> For sorted data creation through PIG's STORE operator, if the input data is 
> sorted through "ORDER BY", the new Zebra table will be marked as sorted on 
> the sorted columns.
> For sorted data creation through Map/Reduce tasks, three new static methods 
> of the BasicTableOutput class will be provided to help the user achieve 
> this. "setSortInfo" allows the user to specify the sorted columns 
> of the input tuple to be stored; "getSortKeyGenerator" and "getSortKey" help 
> the user to generate a key acceptable to Zebra as a sort key based upon 
> the schema, sorted columns and the input tuple.
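
As a rough picture of what deriving a sort key from "the schema, sorted columns and the input tuple" involves, the sketch below resolves the sort columns against a schema once and then projects each tuple onto those columns. It does not use or reproduce Zebra's API: the class SortKeySketch and its methods are hypothetical, and the key is a plain list of values because the encoding Zebra actually expects is not described here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a sort key generator; not Zebra's API. */
final class SortKeySketch {
    // Positions of the sort columns within the schema, in sort-key order.
    private final int[] keyIndexes;

    SortKeySketch(List<String> schema, List<String> sortColumns) {
        keyIndexes = new int[sortColumns.size()];
        for (int i = 0; i < keyIndexes.length; i++) {
            int index = schema.indexOf(sortColumns.get(i));
            if (index < 0) {
                throw new IllegalArgumentException(
                        "Unknown sort column: " + sortColumns.get(i));
            }
            keyIndexes[i] = index;
        }
    }

    /** Copies the sort-column values of one tuple, in sort-key order. */
    List<Object> keyFor(List<Object> tuple) {
        List<Object> key = new ArrayList<>(keyIndexes.length);
        for (int index : keyIndexes) {
            key.add(tuple.get(index));
        }
        return key;
    }

    public static void main(String[] args) {
        SortKeySketch generator = new SortKeySketch(
                Arrays.asList("a", "b", "c"), Arrays.asList("c", "a"));
        System.out.println(generator.keyFor(Arrays.<Object>asList(1, "x", 3.0)));
    }
}
```

Resolving the column positions once and reusing them for every tuple keeps the per-record work to a simple projection.
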
> For sorted data read through PIG's LOAD operator, pass the string "sorted" as 
> an extra argument to the TableLoader constructor to ask for a sorted table to 
> be loaded.
> For sorted data read through Map/Reduce tasks, a new static method of the 
> TableInputFormat class, requireSortedTable, can be called to ask for a sorted 
> table to be read. Additionally, an overloaded version of the new method can 
> be called to ask for a table sorted on specified sort columns and comparator.
> For this release, sorted tables support sorting only in ascending order, not 
> in descending order. In addition, the sort keys must be of simple types, not 
> complex types such as RECORD, COLLECTION and MAP.
> Multiple-key sorting is supported, but the order of the sort keys is 
> significant: the first sort column is the primary sort key, the second is 
> the secondary sort key, and so on.
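
The effect of sort key order can be sketched with a comparator that walks the sort columns in the order given, ascending only, and lets the first column that differs decide. The rows are plain int arrays for brevity, and MultiKeyOrderSketch is a hypothetical name used for illustration; nothing here comes from Zebra.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical illustration of multi-key ascending order; not Zebra code. */
final class MultiKeyOrderSketch {

    /** Compares rows column by column, ascending, in the given column order. */
    static Comparator<int[]> bySortColumns(int... sortColumns) {
        return (left, right) -> {
            for (int column : sortColumns) {
                int result = Integer.compare(left[column], right[column]);
                if (result != 0) {
                    return result; // the first differing sort column decides
                }
            }
            return 0; // equal on every sort column
        };
    }

    /** Returns a sorted copy and leaves the input list untouched. */
    static List<int[]> sorted(List<int[]> rows, int... sortColumns) {
        List<int[]> copy = new ArrayList<>(rows);
        copy.sort(bySortColumns(sortColumns));
        return copy;
    }

    public static void main(String[] args) {
        List<int[]> rows = Arrays.asList(
                new int[] {2, 1}, new int[] {1, 2}, new int[] {1, 1});
        for (int[] row : sorted(rows, 0, 1)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Sorting the same rows on columns (1, 0) instead of (0, 1) produces a different order, which is why a table sorted on one column sequence cannot stand in for a table sorted on the reverse sequence.
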
> In this release, the sort keys are stored along with the sort columns from 
> which the keys were created, resulting in some redundant data storage.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
