[
https://issues.apache.org/jira/browse/LUCENE-4698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558403#comment-13558403
]
David Smiley commented on LUCENE-4698:
--------------------------------------
A couple of months ago I developed an alternative approach based on DocValues
of bytes (StraightBytesDocValuesField)*. I own the intellectual property, but
it's not committable as-is: I had to resort to some hacks, and I did it on the
Solr side rather than in Lucene.
*: FYI I couldn't simply use floats because the float-based DV fields hold a
single float per document, not a variable number. I was initially concerned
about the overhead of converting 4 bytes to a float all the time, but some
benchmarking showed that this was negligible compared to the other things going
on. Besides, given more time, I'd like to use 3 bytes per float (the mantissa),
interleave the lat & lon, and use sortable de-ref'ed bytes so that nearby
spatial data is co-located.
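
To illustrate the byte layout described in the footnote, here is a minimal
sketch, not the actual implementation. It packs a variable number of lat/lon
pairs into a single byte[] at 4 bytes per float and decodes them again. The
class PointBytes and its methods are hypothetical names made up for this
example; only java.nio.ByteBuffer is a real API. The resulting byte[] is the
kind of value one would store in a per-document bytes DocValues field.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper for illustration; not part of Lucene or Solr. */
final class PointBytes {
    private PointBytes() {}

    /**
     * Packs (lat, lon) pairs into 8 bytes per point: two big-endian floats.
     * The input is a flat array: lat0, lon0, lat1, lon1, ...
     */
    static byte[] encode(double[] latLonPairs) {
        if (latLonPairs.length % 2 != 0) {
            throw new IllegalArgumentException("expected lat/lon pairs");
        }
        ByteBuffer buf = ByteBuffer.allocate(latLonPairs.length * Float.BYTES);
        for (double v : latLonPairs) {
            buf.putFloat((float) v); // narrows to float precision
        }
        return buf.array();
    }

    /**
     * Decodes a slice of a larger byte[] back into a flat float array.
     * offset/length let the caller point at one document's bytes.
     */
    static float[] decode(byte[] bytes, int offset, int length) {
        if (length % (2 * Float.BYTES) != 0) {
            throw new IllegalArgumentException("length is not a whole number of points");
        }
        ByteBuffer buf = ByteBuffer.wrap(bytes, offset, length);
        float[] out = new float[length / Float.BYTES];
        for (int i = 0; i < out.length; i++) {
            out[i] = buf.getFloat(); // the 4-bytes-to-float conversion mentioned above
        }
        return out;
    }
}
```

A document with two points would yield 16 bytes, and a document with one
point 8 bytes, which is the variable-length behavior the single-float DV
fields can't provide.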
One of the bigger obstacles to making progress is how multi-valued spatial
data is handled at the SpatialStrategy level. With a non-DV approach, you can
write code that looks at the indexed terms (which vary per strategy) and
figures out which points each document has (this is what the ShapeFieldCache
does), so no change to the SpatialStrategy is needed. But with DV,
SpatialStrategy.createFields() should only be called once per document, with
all of the points together, so that it can return a DocValues-based field. We
don't quite have a MultiPoint shape in Spatial4j yet (very close!), and Solr
has real problems with this model (to be addressed in another issue). In
particular, Solr's DocumentBuilder will take a java.util.Collection and invoke
FieldType.createField() for each value without giving the FieldType the
opportunity to see all values at once. I firmly believe DocValues is the way
to go, so these problems need to be tackled.
> Overhaul ShapeFieldCache because its a memory pig
> -------------------------------------------------
>
> Key: LUCENE-4698
> URL: https://issues.apache.org/jira/browse/LUCENE-4698
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/spatial
> Reporter: David Smiley
>
> The org.apache.lucene.spatial.util.ShapeFieldCache* classes together
> implement a spatial field cache for points, similar to FieldCache for other
> fields. It supports a variable number of points per document, and it's
> currently only used by the SpatialPrefixTree strategy because that's the only
> strategy that supports a variable number of points per document. The other
> spatial strategies use the FieldCache. The ShapeFieldCache has problems:
> * It's a memory pig. Each point is stored as a Point object, instead of an
> array of x & y coordinates. Furthermore, each Point is in an ArrayList that
> exists for each Document. It's not done any differently when your spatial
> data isn't multi-valued.
> * The cache is not per-segment, it's per-IndexReader, thereby making it
> un-friendly to NRT search.
> * The cache entries don't self-expire optimally to free up memory. The cache
> is simply stored in a WeakHashMap<IndexReader,ShapeFieldCache>. The big cache
> entries are only freed when the WeakHashMap is used and the JVM realizes the
> IndexReader instance has been GC'ed.