[
https://issues.apache.org/jira/browse/LUCENE-4931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13630488#comment-13630488
]
Uwe Schindler commented on LUCENE-4931:
---------------------------------------
Robert, you are right.
I added the approach of an IndexWriter-based cache to the issue because this is
how Lucene 3.x worked. In Lucene 3.x you could recreate Field instances for
every document and IndexWriter would still reuse the internal per-thread
AttributeSource.
In Lucene 4.0, all NumericFields can be reused, and that works. The NumericField
reuses its internal NumericTokenStream (which is lazily initialized, so it is
not loaded if you are only interested in stored fields on the IndexReader side).
Whenever you set a new value on the Field, the stream is reset to the new value.
Of course, if you don't reuse your NumericField instances, it is created over
and over (this was the same in Lucene 3.x).
A default Field (e.g. StringField) creates a new TokenStream on every call to
Field#tokenStream(). So if you reuse your StringField instances, the
TokenStreams are recreated again and again. This is the bug here.
I will provide a patch that does the same as NumericField, using the same logic
as for NumericTokenStream:
- StringTokenStream gets a pkg-private reset(String) method; the String is
removed from the ctor.
- Field#tokenStream() first lazily creates the TokenStream (if needed)
using the no-arg ctor, then calls reset(stringValue()) and returns it.
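
To illustrate the two points above, here is a minimal sketch of the lazy-create
and reset pattern. It does not use the real Lucene classes: SingleTokenStream
and ReusableStringField are hypothetical stand-ins invented for this example,
and the actual patch may look different.

```java
/** Hypothetical stand-in for Field's private StringTokenStream (not Lucene code). */
final class SingleTokenStream {
    private String value;
    private boolean used = true; // nothing to emit until reset(String) is called

    // No-arg ctor: the String is no longer passed at construction time.
    SingleTokenStream() {}

    /** Points the stream at a new value so the same instance can be reused. */
    void reset(String value) {
        this.value = value;
        this.used = false;
    }

    /** Returns the single token once, then null. */
    String next() {
        if (used) {
            return null;
        }
        used = true;
        return value;
    }
}

/** Hypothetical stand-in for a reusable StringField (not Lucene code). */
final class ReusableStringField {
    private String value;
    private SingleTokenStream stream; // created lazily on first tokenStream() call

    ReusableStringField(String value) {
        this.value = value;
    }

    void setStringValue(String value) {
        this.value = value;
    }

    String stringValue() {
        return value;
    }

    /** Lazily creates the stream, resets it to the current value, and returns it. */
    SingleTokenStream tokenStream() {
        if (stream == null) {
            stream = new SingleTokenStream();
        }
        stream.reset(stringValue());
        return stream;
    }
}

final class ReuseDemo {
    public static void main(String[] args) {
        ReusableStringField field = new ReusableStringField("doc-1");
        SingleTokenStream first = field.tokenStream();
        System.out.println(first.next());

        // Reusing the field for the next document reuses the stream as well.
        field.setStringValue("doc-2");
        SingleTokenStream second = field.tokenStream();
        System.out.println(second.next());
        System.out.println(first == second);
    }
}
```

In this sketch a field that is never asked for a token stream (for example, one
only read back as a stored field) never allocates one, and a reused field
allocates it exactly once.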
> Make oal.document.Field reuse its internal StringTokenStream
> ------------------------------------------------------------
>
> Key: LUCENE-4931
> URL: https://issues.apache.org/jira/browse/LUCENE-4931
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/index
> Affects Versions: 4.0, 4.1, 4.2, 4.2.1
> Reporter: Uwe Schindler
>
> Followup from LUCENE-4930:
> Field.java has a private StringTokenStream which is used as the TokenStream
> implementation for StringField (single value String tokens). Unfortunately
> this TokenStream is created for every new document/field while indexing,
> so creating the TS takes a significant share of the time. With very old Java
> versions this also involves a lock in ReferenceQueue.poll() when called from
> addAttribute().
> In Lucene 3.x, DocInverterPerThread has a private thread-local
> AttributeSource for reuse, but because this was factored out to Field.java,
> we can no longer use CloseableThreadLocal (because Field instances are not
> Closeable). We should maybe move the special one-token TokenStream back to
> DocInverterPerThread and just let Field.java delegate there. I know this
> would take us back to 3.x, where we had special handling of single-token
> Fields in the indexer....
> Another approach would be to make Field.java use a static KeywordAnalyzer (it
> would then need to be moved to core), or we add a ThreadLocal to Field.java
> (which may be expensive). Unfortunately this is hard to maintain, because the
> thread-localness also needs to be bound to the IndexWriter instance: you
> could have 2 IndexWriters open at the same time and add documents to both of
> them from one thread... This brings us back to my previous solution.