[ 
https://issues.apache.org/jira/browse/LUCENE-4698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558403#comment-13558403
 ] 

David Smiley commented on LUCENE-4698:
--------------------------------------

A couple of months ago I developed an alternative approach based on DocValues 
of bytes (StraightBytesDocValuesField)*.  I own the intellectual property, but 
it's not committable as-is: I had to resort to some hacks, and I implemented it 
on the Solr side rather than in Lucene.

*: FYI I couldn't simply use floats because the float-based DV fields hold a 
single float per document, not a variable number.  I was initially concerned 
about the overhead of converting 4 bytes to a float all the time, but some 
benchmarking showed that the cost was negligible compared to the other things 
going on.  Besides, given more time, I'd like to use 3 bytes per float (the 
mantissa), interleave the lat & lon, and use sortable de-ref'ed bytes so that 
nearby spatial data is co-located.
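
To illustrate the bytes-per-document idea, here is a minimal sketch of packing 
a variable number of points into one byte[] and converting 4 bytes back to a 
float on read.  The class and method names are hypothetical and this is not the 
Solr-side code described above; it uses plain 4-byte floats rather than the 
3-byte interleaved encoding.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper for illustration; not part of Lucene or Solr. */
final class PackedPoints {
  private static final int BYTES_PER_POINT = 2 * Float.BYTES;

  /** Packs x0, y0, x1, y1, ... into a single big-endian byte[]. */
  static byte[] encode(float[] xy) {
    if (xy.length % 2 != 0) {
      throw new IllegalArgumentException("expected x,y pairs");
    }
    ByteBuffer buf = ByteBuffer.allocate(xy.length * Float.BYTES);
    for (float v : xy) {
      buf.putFloat(v); // ByteBuffer is big-endian by default
    }
    return buf.array();
  }

  /** Number of points stored in a packed value. */
  static int pointCount(byte[] bytes) {
    return bytes.length / BYTES_PER_POINT;
  }

  static float x(byte[] bytes, int i) {
    return readFloat(bytes, i * BYTES_PER_POINT);
  }

  static float y(byte[] bytes, int i) {
    return readFloat(bytes, i * BYTES_PER_POINT + Float.BYTES);
  }

  /** The 4-bytes-to-float conversion done on every read. */
  private static float readFloat(byte[] b, int off) {
    int bits = ((b[off] & 0xFF) << 24)
        | ((b[off + 1] & 0xFF) << 16)
        | ((b[off + 2] & 0xFF) << 8)
        | (b[off + 3] & 0xFF);
    return Float.intBitsToFloat(bits);
  }
}
```

A document with two points would be stored as 16 bytes, and a reader would call 
pointCount() and then x(bytes, i) / y(bytes, i) for each point.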

One of the bigger obstacles to making progress is how multi-valued spatial data 
is handled at the SpatialStrategy level.  With a non-DV approach, you can write 
code that looks at the indexed terms (which vary per strategy) and figures out 
which points each document has (this is what the ShapeFieldCache does), so no 
change to the SpatialStrategy is needed.  But with DV, 
SpatialStrategy.createFields() should only be called once per document, and 
with all of the points together, so that it can return a DocValues-based field.  
We don't quite have a MultiPoint shape in Spatial4j yet (very close!), plus 
Solr has real problems with this model (to be addressed in another issue).  In 
particular, Solr's DocumentBuilder will take a java.util.Collection and invoke 
FieldType.createField() for each value without giving the FieldType the 
opportunity to see all values at once.  I firmly believe DocValues is the way 
to go, so these problems need to be tackled.
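
The difference between the two calling conventions can be sketched as follows.  
Everything here is a hypothetical stand-in (PointFieldType, RecordingType and 
both build methods are invented for illustration); it is not Solr's 
DocumentBuilder or FieldType API, only the shape of the problem.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

final class PerDocumentFieldsSketch {
  /** Hypothetical stand-in for a field type; not Solr's FieldType. */
  interface PointFieldType {
    String createField(double[] point);              // sees one value
    String createFieldForAll(List<double[]> points); // sees all values
  }

  /** Per-value convention: the type never sees the points together. */
  static List<String> buildPerValue(PointFieldType t, Collection<double[]> values) {
    List<String> fields = new ArrayList<>();
    for (double[] v : values) {
      fields.add(t.createField(v));
    }
    return fields;
  }

  /** Per-document convention: one call with every point, one field back. */
  static List<String> buildPerDocument(PointFieldType t, Collection<double[]> values) {
    List<String> fields = new ArrayList<>();
    fields.add(t.createFieldForAll(new ArrayList<>(values)));
    return fields;
  }

  /** Toy implementation that records how many points each call saw. */
  static final class RecordingType implements PointFieldType {
    final List<Integer> pointsSeenPerCall = new ArrayList<>();

    @Override
    public String createField(double[] point) {
      pointsSeenPerCall.add(1);
      return "field(1 point)";
    }

    @Override
    public String createFieldForAll(List<double[]> points) {
      pointsSeenPerCall.add(points.size());
      return "field(" + points.size() + " points)";
    }
  }
}
```

Only the per-document convention gives the type enough information to emit a 
single DocValues-style field holding all of a document's points.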

                
> Overhaul ShapeFieldCache because it's a memory pig
> --------------------------------------------------
>
>                 Key: LUCENE-4698
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4698
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/spatial
>            Reporter: David Smiley
>
> The org.apache.lucene.spatial.util.ShapeFieldCache* classes together 
> implement a spatial field cache for points, similar to FieldCache for other 
> fields.  It supports a variable number of points per document, and it's 
> currently only used by the SpatialPrefixTree strategy because that's the only 
> strategy that supports a variable number of points per document.  The other 
> spatial strategies use the FieldCache.  The ShapeFieldCache has problems:
> * It's a memory pig. Each point is stored as a Point object, instead of an 
> array of x & y coordinates. Furthermore, each Point is in an ArrayList that 
> exists for each Document. It's not done any differently when your spatial 
> data isn't multi-valued.
> * The cache is not per-segment, it's per-IndexReader, thereby making it 
> un-friendly to NRT search.
> * The cache entries don't self-expire optimally to free up memory. The cache 
> is simply stored in a WeakHashMap<IndexReader,ShapeFieldCache>. The big cache 
> entries are only freed when the WeakHashMap is used and the JVM realizes the 
> IndexSearcher instance has been GC'ed.
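
For the first problem in the description, the alternative to a Point object in 
an ArrayList per document is flat arrays of x & y coordinates plus a 
per-document offset.  A minimal sketch, assuming doubles and a hypothetical 
class name (this is not the ShapeFieldCache code, nor a proposed replacement 
for it):

```java
/** Hypothetical flat layout for illustration; not part of Lucene. */
final class FlatPointStore {
  // Points of document d live at indexes offsets[d] .. offsets[d + 1] - 1.
  private final int[] offsets;
  private final double[] xs;
  private final double[] ys;

  /** xyPerDoc[d] holds x0, y0, x1, y1, ... for document d (may be empty). */
  FlatPointStore(double[][] xyPerDoc) {
    offsets = new int[xyPerDoc.length + 1];
    for (int d = 0; d < xyPerDoc.length; d++) {
      offsets[d + 1] = offsets[d] + xyPerDoc[d].length / 2;
    }
    xs = new double[offsets[xyPerDoc.length]];
    ys = new double[xs.length];
    for (int d = 0; d < xyPerDoc.length; d++) {
      int base = offsets[d];
      for (int i = 0; i < xyPerDoc[d].length / 2; i++) {
        xs[base + i] = xyPerDoc[d][2 * i];
        ys[base + i] = xyPerDoc[d][2 * i + 1];
      }
    }
  }

  /** Number of points for a document; zero is allowed. */
  int count(int doc) {
    return offsets[doc + 1] - offsets[doc];
  }

  double x(int doc, int i) {
    return xs[offsets[doc] + i];
  }

  double y(int doc, int i) {
    return ys[offsets[doc] + i];
  }
}
```

With this layout the per-document overhead is one int, and there are no 
per-point objects or per-document lists regardless of whether the data is 
multi-valued.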

