[ 
https://issues.apache.org/jira/browse/PIG-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12765663#action_12765663
 ] 

Dmitriy V. Ryaboy commented on PIG-966:
---------------------------------------

Regarding histogram representation:

I took a look at how Postgres does it, and they simply store 3 arrays:

* An array of "Most Common Values", which contains exactly what it sounds like, 
ordered by decreasing frequency.
* A matching array of frequencies, expressed as a fraction of the total row 
count in the relation.
* An array of sorted values chosen in such a way that the number of rows with 
values between A[i] and A[i+1] is roughly the same for all i.  An interesting 
optimization they perform is that if the most common values array described 
above is defined for this field, then the values in that array are not included 
when calculating the boundaries for the histogram. They say that's called a 
"compressed histogram", if someone wants to dig up some papers on this.
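
To make the three arrays concrete, here is a rough Python sketch of how they 
could be computed for one column. This is only an illustration under my own 
assumptions, not Postgres's actual algorithm and not a proposed Pig API: the 
function name build_column_stats and the parameters num_mcv and num_buckets 
are made up, and it works on a full in-memory list rather than a sample.

```python
from collections import Counter


def build_column_stats(values, num_mcv=2, num_buckets=4):
    """Return (most_common_values, frequencies, histogram_bounds).

    Hypothetical helper for illustration only; all names are assumptions.
    """
    total = len(values)
    counts = Counter(values)

    # Most common values, ordered by decreasing frequency.
    # Ties are broken by value so the result is deterministic.
    top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:num_mcv]
    mcv = [value for value, _ in top]

    # Matching frequencies, as a fraction of the total row count.
    freqs = [count / total for _, count in top]

    # "Compressed" histogram: rows holding a most common value are left out
    # before the bucket boundaries are chosen.
    mcv_set = set(mcv)
    rest = sorted(v for v in values if v not in mcv_set)

    # Pick num_buckets + 1 boundaries at evenly spaced positions in the
    # sorted remainder, so each bucket covers roughly the same number of rows.
    bounds = []
    if rest:
        last = len(rest) - 1
        bounds = [rest[(i * last) // num_buckets] for i in range(num_buckets + 1)]

    return mcv, freqs, bounds


# 100 rows: the value 1 appears 50 times, 2 appears 30 times,
# and 3..22 appear once each.
values = [1] * 50 + [2] * 30 + list(range(3, 23))
mcv, freqs, bounds = build_column_stats(values, num_mcv=2, num_buckets=4)
print(mcv)     # [1, 2]
print(freqs)   # [0.5, 0.3]
print(bounds)  # [3, 7, 12, 17, 22]
```

Note how the heavy hitters 1 and 2 do not distort the bucket boundaries: the 
boundaries describe only the remaining 20 rows, while the skew is captured 
separately by the first two arrays.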

Any objections to this design?



> Proposed rework for LoadFunc, StoreFunc, and Slice/r interfaces
> ---------------------------------------------------------------
>
>                 Key: PIG-966
>                 URL: https://issues.apache.org/jira/browse/PIG-966
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>
> I propose that we rework the LoadFunc, StoreFunc, and Slice/r interfaces 
> significantly.  See http://wiki.apache.org/pig/LoadStoreRedesignProposal for 
> full details

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
