Thanks for letting me know about this, Rob. I think geonames is much simpler to work with than Wikipedia, and it's much less data. It's plain tab-delimited, and I like that it includes the population. I'll press forward with my benchmark-module-based patch. I can switch between the lat-lon type and my geohash type relatively easily, since both conform to the SpatialQueriable interface, so I don't need two complete Lucene checkouts. I had to add Solr and spatial as dependencies of the benchmark module, but it's worth it to me.
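For anyone who wants to try the same data, here is a rough Python sketch of reading a geonames row and generating a random box query centered on a random location from the data set. It is not the patch itself (that lives in the Java benchmark module). The column positions are assumed from the geonames export readme and should be checked against your download, and the names Place, parse_geonames_line and random_box are made up for illustration.

```python
import random
from typing import NamedTuple, Optional, Sequence, Tuple

# 0-based column positions in a geonames dump row. Assumed from the
# geonames export readme; verify against the file you downloaded.
NAME_COL, LAT_COL, LON_COL, POP_COL = 1, 4, 5, 14


class Place(NamedTuple):
    """Hypothetical record holding the fields a spatial benchmark needs."""
    name: str
    lat: float
    lon: float
    population: int


def parse_geonames_line(line: str) -> Optional[Place]:
    """Parse one tab-delimited row; return None if it is short or malformed."""
    cols = line.rstrip("\r\n").split("\t")
    if len(cols) <= POP_COL:
        return None
    try:
        return Place(
            name=cols[NAME_COL],
            lat=float(cols[LAT_COL]),
            lon=float(cols[LON_COL]),
            population=int(cols[POP_COL] or 0),  # treat an empty field as 0
        )
    except ValueError:
        return None


def random_box(
    places: Sequence[Place], half_size_deg: float, rng: random.Random
) -> Tuple[float, float, float, float]:
    """Return (min_lat, min_lon, max_lat, max_lon) centered on a random place.

    Latitude is clamped to [-90, 90]. Longitude is NOT wrapped at the
    dateline; a real query generator would have to decide how to handle that.
    """
    center = rng.choice(places)
    min_lat = max(-90.0, center.lat - half_size_deg)
    max_lat = min(90.0, center.lat + half_size_deg)
    return (min_lat, center.lon - half_size_deg,
            max_lat, center.lon + half_size_deg)
```

Passing in a seeded random.Random keeps the generated boxes repeatable, which matters if the same queries are to be run against both field types.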
~ David

On Dec 28, 2010, at 11:18 AM, Robert Muir wrote:

> On Tue, Dec 28, 2010 at 10:47 AM, Smiley, David W. <dsmi...@mitre.org> wrote:
>> Presently, I’m working on Lucene’s benchmark contrib module to evaluate the
>> performance of SOLR-2155 compared to the LatLon type (i.e. a pair of lat-lon
>> range queries), and then I’ll work on a more efficient probably non-geohash
>> implementation but based on the same underlying concept of a hierarchical
>> grid. I’m using the geonames.org data set. Unfortunately, the benchmark
>> code seems very oriented to a generic title-body document whereas I’m
>> looking to create lat-lon pairs… and furthermore to create documents
>> containing multiple lat-lon pairs, and even furthermore a query generator
>> that generates random box queries centered on a random location from the
>> data set. I seem to be stretching the benchmark framework beyond the
>> use-case it was designed for and so perhaps it won’t be committable but at
>> least I’ll have a patch for other geospatial birds-of-a-feather like you to
>> use.
>>
>> Stretch away. The Title/Body orientation is just a relic of what we have
>> done in the past, it doesn't have to stay that way.
>
> just for reference, a couple of us are using a python front-end to
> contrib/benchmark that Mike developed:
>
> http://code.google.com/p/luceneutil/
>
> This is nice as its designed for you to just declare 'competitors' (2
> checkouts of solrcene), and then you run the python script and it
> gives you the relative comparison... because they are 2 different
> checkouts its simple to compare different approaches, and each
> checkout can run with a different index (e.g. different codecs or test
> index format changes).
>
> I thought it might be interesting to you, because there's a variety of
> queries tested here like numeric range, sorting, primary-key lookup,
> span queries etc beyond the "standard" set of queries. The framework
> also ensures that you are bringing back the same results in the same
> order, runs multiple iterations (including iterations in new JVMs),
> makes it easy to test optimized, optimized with deletions,
> multi-segment, multi-segment with deletions, and can output to txt,
> html, jira format for convenience.
>
> currently we are generally testing with a line file format from
> wikipedia, but besides geonames i wanted to point out that wikipedia
> does include lat/long information for many articles (this is a major
> source for much of geonames place data!).
>
> it would definitely be cool if we could test spatial queries with this
> as well... e.g by parsing out the lat/long from the wikipedia XML and
> adding to the line files, and adding some spatial queries to the
> default list of queries being tested.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org