neumarcx commented on issue #568: Add Aggregate Median to SPARQL ARQ syntax
URL: https://github.com/apache/jena/pull/568#issuecomment-493487384

> One observation is that the Commons Math library used notes that for percentile based stats to be evaluated correctly the data should be at least partially ordered (http://commons.apache.org/proper/commons-math/javadocs/api-3.6/org/apache/commons/math3/stat/descriptive/rank/Percentile.html). Since aggregation is computed prior to sorting in SPARQL there’s no guarantee that the accumulator will see the data in a sensible order and thus calculate a meaningful result.
>
> Could do with some test cases to see what happens in the case of generated random data inputs and may need some refactoring to do internal sorting of the accumulated values prior to passing them to the Math library

@rvesse That is a worthwhile observation on the Commons Math library, but by definition the median is the middle value in sorted order, and the same holds for the Commons Math implementation used here. The sort in the Commons Math library is certainly much more efficient than a standard sort with e.g. java.util.Arrays.

In preliminary tests I tend to run into heap space issues with sets of more than 200 million aggregate values in an array, and with more than 1 billion values in settings with a large Xmx allocation. Do you see this being an issue for a general release in ARQ?
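
To make the ordering point concrete, here is a minimal sketch of a median that sorts a copy of the accumulated values before picking the middle, so the order in which the accumulator saw them does not affect the result. This is not the code in the PR and not the Commons Math implementation: the class name `MedianSketch` and the method `median` are hypothetical, and it uses `java.util.Arrays` only so that it runs without extra dependencies.

```java
import java.util.Arrays;

// Hypothetical illustration only: not Jena's accumulator, not Commons Math.
final class MedianSketch {

    /** Returns the median; sorts a copy so the caller's arrival order is irrelevant. */
    static double median(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("median of an empty collection is undefined");
        }
        // The copy doubles the memory needed for the values while it is alive.
        double[] sorted = Arrays.copyOf(values, values.length);
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        if (sorted.length % 2 == 1) {
            return sorted[mid];                       // odd count: the middle element
        }
        return (sorted[mid - 1] + sorted[mid]) / 2.0; // even count: mean of the two middle elements
    }

    public static void main(String[] args) {
        // The same five values in two different arrival orders.
        System.out.println(median(new double[] {9.0, 1.0, 7.0, 3.0, 5.0})); // 5.0
        System.out.println(median(new double[] {5.0, 3.0, 9.0, 7.0, 1.0})); // 5.0
        System.out.println(median(new double[] {4.0, 1.0, 3.0, 2.0}));      // 2.5
    }
}
```

The sketch holds every value in memory and copies the array once more for sorting, which is the kind of footprint behind the heap space concern above.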
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

With regards,
Apache Git Services
