[ https://issues.apache.org/jira/browse/PIG-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12765663#action_12765663 ]
Dmitriy V. Ryaboy commented on PIG-966:
---------------------------------------

Regarding histogram representation: I took a look at how Postgres does it, and they simply store 3 arrays:

* An array of "Most Common Values", which contains exactly what it sounds like, ordered by decreasing frequency.
* A matching array of frequencies, each expressed as a fraction of the total row count in the relation.
* An array of sorted values chosen in such a way that the number of rows with values between A[i] and A[i+1] is roughly the same for all i.

An interesting optimization they perform is that if the most common values array described above is defined for this field, then the values in that array are not included when calculating the boundaries for the histogram. They say that's called a "compressed histogram", if someone wants to dig up some papers on this.

Any objections to this design?

> Proposed rework for LoadFunc, StoreFunc, and Slice/r interfaces
> ---------------------------------------------------------------
>
> Key: PIG-966
> URL: https://issues.apache.org/jira/browse/PIG-966
> Project: Pig
> Issue Type: Improvement
> Components: impl
> Reporter: Alan Gates
> Assignee: Alan Gates
>
> I propose that we rework the LoadFunc, StoreFunc, and Slice/r interfaces
> significantly. See http://wiki.apache.org/pig/LoadStoreRedesignProposal for
> full details
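For discussion, here is a minimal sketch of the three-array representation described in the comment above. It is not Postgres's or Pig's actual code. The function name `build_column_stats` and the parameters `num_mcv` and `num_buckets` are hypothetical, and it assumes the column values fit in memory and are sortable. A real implementation would likely work from a sample and choose which values count as "most common" by a heuristic.

```python
from collections import Counter


def build_column_stats(values, num_mcv=2, num_buckets=4):
    """Return (mcv, mcv_freqs, bounds) for one column.

    Hypothetical helper, for illustration only:
      mcv       - most common values, ordered by decreasing frequency
      mcv_freqs - matching frequencies as fractions of the total row count
      bounds    - sorted boundary values A[0..num_buckets], chosen so that
                  roughly the same number of rows falls between A[i] and
                  A[i+1]; rows holding an MCV are left out first
                  (the "compressed histogram" idea)
    """
    total = len(values)
    if total == 0:
        return [], [], []

    counts = Counter(values)
    # Rank by decreasing count; break ties by value so output is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    top = ranked[:num_mcv]
    mcv = [value for value, _ in top]
    mcv_freqs = [count / total for _, count in top]

    # Exclude rows whose value is already covered by the MCV array.
    mcv_set = set(mcv)
    rest = sorted(v for v in values if v not in mcv_set)
    if not rest:
        return mcv, mcv_freqs, []

    # Pick num_buckets + 1 evenly spaced positions in the sorted remainder.
    last = len(rest) - 1
    bounds = [rest[(i * last) // num_buckets] for i in range(num_buckets + 1)]
    return mcv, mcv_freqs, bounds


if __name__ == "__main__":
    column = [1] * 5 + [2] * 3 + list(range(10, 19))
    print(build_column_stats(column))
```

Leaving the most common values out of the boundary calculation keeps a few heavy hitters from collapsing several adjacent buckets onto the same value, so the boundaries describe the distribution of the remaining rows.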