Hi Sanne, # Functionality > Why is the number of documents fetched via Degrees approximately half > of what you're fetching via Radians? > The test is randomized at this time. I do not have a fixed test set :s sorry In one run the same points are use for the 4 type of request. But switching radians<=>degrees branch and then it is a whole new test. I ll work on that to have a fixed center points set.
> This seems to be true for both "Grid" and "DoubleRange", while the > numbers (approximately) match when you apply the Distance Filter as > well. > The dynamic is the same on the two approach : 1°) First having a super set of candidates by querying the Lucene Index (avoiding distance calculation again the whole dataset) 2°) Refining candidates by true distance calculation > It was my understanding that using either Degrees or Radians in the > index should just be a performance difference, with no difference in > functionality (like result numbers) ? > Yes as soon as the dataset will be non random each run > Applying the Distance Filter seems to fix the result, but this filter > is optional right? > For exact answer to the question 'What are the document at x km max of point (lat,long), no it is not optional as the first request to the Lucene index returns false candidates. # Performance > The overhead mismatch you point out sheds some suspicion on the > overall test validity as I see no reason either.. how long are you > running these tests? I've had very bad experiences with short > performance tests; I've since learnt do use -XX:+PrintCompilation to > know how long (at the bare minimum) I have to run any test before JIT > has finished compiling all the relevant code. > Right. I will run much more longer tests. Thanks Sanne > On 11 May 2012 08:40, Nicolas Helleringer <nicolas.hellerin...@gmail.com> > wrote: > > There, back and again ... > > > > After fixing a bug in grid search here are some updated results on 2k > calls > > > > Degrees : > > Mean time with Grid : 4.4897266425641025 ms. Average number of docs > fetched > > : 2506.96 > > Mean time with Grid + Distance filter : 6.4930799487179485 ms. Average > > number of docs fetched : 425.33435897435896 > > Mean time with DoubleRange : 14.430638703076923 ms. Average number of > docs > > fetched : 542.0410256410256 > > Mean time with DoubleRange + Distance filter : 20.483300545128206 ms. > > Average number of docs fetched : 425.33435897435896 > > > > Radians : > > Mean time with Grid : 5.650845744102564 ms. Average number of docs > fetched > > : 5074.830769230769 > > Mean time with Grid + Distance filter : 8.627138825128204 ms. Average > number > > of docs fetched : 426.7902564102564 > > Mean time with DoubleRange : 15.337755502564102 ms. Average number of > docs > > fetched : 1087.705641025641 > > Mean time with DoubleRange + Distance filter : 20.82852138769231 ms. > Average > > number of docs fetched : 426.7902564102564 > > > > Next thing I do not explain yet is the distance filter overhead mismatch > : > > It is less on grid search with more docs to test than on DoubleRange. > > > > Niko > > > > > > 2012/5/7 Nicolas Helleringer <nicolas.hellerin...@gmail.com> > >> > >> Here are some results : > >> > >> Mean time with Grid : 4.9297471630769225 ms. Average number of docs > >> fetched : 2416.373846153846 > >> Mean time with Grid + Distance filter : 6.48634534 ms. Average number of > >> docs fetched : 425.84 > >> Mean time with DoubleRange : 15.39593650051282 ms. Average number of > docs > >> fetched : 542.72 > >> Mean time with DoubleRange + Distance filter : 21.158394677435897 ms. > >> Average number of docs fetched : 425.8779487179487 > >> > >> Sounds weird that with distance filter the two results are note the > same. > >> I shall investigate that. > >> > >> Niko > >> > >> 2012/5/7 Emmanuel Bernard <emman...@hibernate.org> > >>> > >>> Do you know the average amount of POI that were filtered in memory but > >>> the DistanceFilter during these runs? > >>> > >>> Emmanuel > >>> > >>> On 7 mai 2012, at 10:31, Nicolas Helleringer wrote: > >>> > >>> Hi all, > >>> > >>> I have done a radian patch/branch and some benchmarks on geonames > french > >>> database. > >>> > >>> Benchs are on 2k calls each run. > >>> > >>> Radians: > >>> run 1 > >>> Mean time with Grid : 4.808043092820513 ms > >>> Mean time with Grid + Distance filter : 6.571108878461538 ms > >>> Mean time with DoubleRange : 14.62661525128205 ms > >>> Mean time with DoubleRange + Distance filter : 20.143597923076925 ms > >>> > >>> run 2 > >>> Mean time with Grid : 5.290368523076923 ms > >>> Mean time with Grid + Distance filter : 6.706567517435897 ms > >>> Mean time with DoubleRange : 14.878960702564102 ms > >>> Mean time with DoubleRange + Distance filter : 20.75806591948718 ms > >>> > >>> Degrees: > >>> run 1 > >>> Mean time with Grid : 5.101956610769231 ms > >>> Mean time with Grid + Distance filter : 6.548685109230769 ms > >>> Mean time with DoubleRange : 14.767478146153845 ms > >>> Mean time with DoubleRange + Distance filter : 20.668063972820512 ms > >>> > >>> run 2 > >>> Mean time with Grid : 4.683360031282051 ms > >>> Mean time with Grid + Distance filter : 6.7065247435897435 ms > >>> Mean time with DoubleRange : 14.617140157948716 ms > >>> Mean time with DoubleRange + Distance filter : 20.074868595897435 ms > >>> > >>> The radian branch is here for review > >>> : > https://github.com/nicolashelleringer/hibernate-search/tree/HSEARCH-923-RADIANS > >>> > >>> While moving from degrees to radians I have seen that DSL has still > some > >>> work to do. > >>> I shall focus on that now. > >>> > >>> Niko > >>> > >>> 2012/5/3 Sanne Grinovero <sa...@hibernate.org> > >>>> > >>>> > >>>> On May 3, 2012 10:10 AM, "Emmanuel Bernard" <emman...@hibernate.org> > >>>> wrote: > >>>> > > >>>> > How comes the DistanceFilter has to compute the distance for the > whole > >>>> > corpus? > >>>> > >>>> You're right in that's not always the case, but it's possible. If > there > >>>> are more filters enabled and they are executed first, our filter will > need > >>>> to do the math only on the matched documents by the previous filters, > but if > >>>> there are no other constraints or filters our DistanceFilter might > need to > >>>> process all documents in all segments. This happens also when a limit > is > >>>> enabled on the collector - although limited to the current index > segment - > >>>> when the filter needs to be cached as it needs to evaluate each > document in > >>>> the segment. > >>>> > >>>> In our case this DistanceFilter is only applied after RangeQuery was > >>>> applied on both longitude and latitude, so I'm not sure if this is a > big > >>>> problem; personally I was just wondering but I'd be fine in keeping > this as > >>>> a possible future improvement - but if we go for a separate issue, > let's > >>>> keep in mind that that the index format would not be backwards > compatible. > >>>> > >>>> > >>>> > >>>> > By the way the actual storage (say via Hibernate ORM, or Infinispan) > >>>> > does not need to store in radian, so we don't need to do a > conversion when > >>>> > reading an entity. > >>>> > >>>> Right, another reason to index only in whatever format makes querying > >>>> more efficient. > >>>> > >>>> -- Sanne > >>>> > >>>> > >>>> > > >>>> > On 3 mai 2012, at 10:45, Sanne Grinovero wrote: > >>>> > > >>>> > > The reason for my comment is that the code is doing a conversion > to > >>>> > > radians in the DistanceFilter, which needs to be extremely > efficient > >>>> > > as it's not only applied on the resultset but potentially on the > >>>> > > whole > >>>> > > corpus of all Documents in the index. > >>>> > > So even if it's true that conversion would be needed on the final > >>>> > > results, we always expect people to retrieve only a limited amount > >>>> > > of > >>>> > > entities (like with pagination), while the index might need to > >>>> > > perform > >>>> > > this computation millions of times per query. > >>>> > > > >>>> > > If I look at the complexity of Point.getDistanceTo(double, > double), > >>>> > > I > >>>> > > get a feeling that that method will hardly provide speedy queries > >>>> > > because of the complex computations in it - this is just > speculation > >>>> > > at this point of course, to be sure we'd need to compare them > with a > >>>> > > large enough dataset, but it seems quite obvious that storing > >>>> > > normalized radians should be more efficient as it would avoid a > good > >>>> > > deal of math to be executed on each Document in the index. > >>>> > > > >>>> > > Also if we assume people might want to use radians in their user > >>>> > > data > >>>> > > (I know some who definitely would never touch decimals for such a > >>>> > > use > >>>> > > case), there would be no need at all to convert the end result. > >>>> > > > >>>> > > Some more thoughts inline: > >>>> > > > >>>> > > On 3 May 2012 09:12, Nicolas Helleringer > >>>> > > <nicolas.hellerin...@gmail.com> wrote: > >>>> > >> Hi all, > >>>> > >> > >>>> > >> Sanne and I have been wondering about the way the spatial > >>>> > >> branch/module/functionality for Hibernate Search shall store its > >>>> > >> coordinates in the Lucene index. > >>>> > >> > >>>> > >> Today it is implemented with decimal degree for : > >>>> > >> - easy debugging/readability > >>>> > >> - ease of conversion on storage as we want to accept mainly > decimal > >>>> > >> degree > >>>> > >> from users data > >>>> > > > >>>> > > Valid points, but consider that "storage" is going to be way > slower > >>>> > > anyway, and typically you'll process a Document to evaluate it > for a > >>>> > > hit many many orders of magnitude more frequently than the times > you > >>>> > > store it. > >>>> > > > >>>> > >> > >>>> > >> Sanne pointed out that when the search is done there is quite a > few > >>>> > >> conversion to radians for distance calculation and suggested that > >>>> > >> we may > >>>> > >> store directly coordinates under their radians form. > >>>> > >> > >>>> > >> I have tried a patch to implement this and as I was coding it I > >>>> > >> feel that > >>>> > >> the code was less readable, in the coordinates normalisation > mainly > >>>> > >> and > >>>> > >> that there was as many conversion as before. > >>>> > >> Conversions had moved from search to import / export of > coordinates > >>>> > >> in and > >>>> > >> out the spatial module scope to user scope. > >>>> > > > >>>> > > I'm sure the amount of points in the code in which they are > >>>> > > converted > >>>> > > won't change. I'm concerned about the cardinality of the > collections > >>>> > > on which it's applied ;) > >>>> > > "Less readable" isn't nice, but we can work on that I guess? > >>>> > > > >>>> > >> > >>>> > >> What the docs does not tell (yet), is that we are waiting for WGS > >>>> > >> 84 (this > >>>> > >> is a coordinate system) decimal degree coordinates input, as > these > >>>> > >> are > >>>> > >> quite a de facto standard (GPS output this way). > >>>> > > > >>>> > > How does it affect this? > >>>> > > > >>>> > >> > >>>> > >> Today this is not the purpose of Hibernate Search spatial > >>>> > >> initiative to > >>>> > >> handle projections. There are opensource libs to handle that on > >>>> > >> user side > >>>> > >> very well (Proj4j) > >>>> > >> > >>>> > >> So. The question is : shall we store as radians or decimal > degree ? > >>>> > >> > >>>> > >> Niko > >>>> > >> > >>>> > >> P.S : Hope it is clear. If not ask for more. > >>>> > > > >>>> > > Thanks! > >>>> > > Sanne > >>>> > > _______________________________________________ > >>>> > > hibernate-dev mailing list > >>>> > > hibernate-dev@lists.jboss.org > >>>> > > https://lists.jboss.org/mailman/listinfo/hibernate-dev > >>>> > > >>> > >>> > >>> > >> > > > _______________________________________________ hibernate-dev mailing list hibernate-dev@lists.jboss.org https://lists.jboss.org/mailman/listinfo/hibernate-dev