Hi all,

I’ve been on holiday and away from a keyboard for a week, so of course I 
spent my time thinking about Lucene Analyzers, and specifically their 
ReuseStrategies…

Building a TokenStream can be quite a heavy operation, so we try to reuse 
already-constructed token streams as much as possible.  This is particularly 
important at index time, as having to create lots and lots of very short-lived 
token streams for documents with many short text fields could mean that we 
spend longer building these objects than we do pulling data from them.  To 
support this, Lucene Analyzers have a ReuseStrategy, which defaults to storing 
a map of fields to token streams in a ThreadLocal object.  Because ThreadLocals 
can behave badly in containers that have large thread pools, we use a special 
CloseableThreadLocal class that can null out its contents once the Analyzer 
is no longer needed, which is why Analyzer itself is Closeable.  
This makes extending analyzers more complicated, as delegating wrappers need to 
ensure that they don’t end up sharing token streams with their delegates.
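
To make the shape of this concrete, here is a toy sketch of per-thread, 
per-field reuse.  It is not Lucene's code: ToyComponents and 
PerThreadFieldCache are hypothetical names, and a plain java.lang.ThreadLocal 
stands in for CloseableThreadLocal.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Toy stand-in for an expensive-to-build token stream; not a Lucene class. */
final class ToyComponents {
    final String field;

    ToyComponents(String field) {
        this.field = field;
    }
}

/** Hypothetical cache: one map of field name to components per thread. */
final class PerThreadFieldCache {
    private final Function<String, ToyComponents> factory;

    // Each thread lazily gets its own private map, so no locking is needed.
    private final ThreadLocal<Map<String, ToyComponents>> perThread =
            ThreadLocal.withInitial(HashMap::new);

    PerThreadFieldCache(Function<String, ToyComponents> factory) {
        this.factory = factory;
    }

    /** Returns the calling thread's cached components, building them on first use. */
    ToyComponents get(String field) {
        return perThread.get().computeIfAbsent(field, factory);
    }

    /**
     * Drops the cache for the calling thread only. A plain ThreadLocal cannot
     * reach the values held by other threads, which is the gap a closeable
     * variant is meant to cover.
     */
    void close() {
        perThread.remove();
    }
}
```

Repeated calls to get("title") on one thread return the same object, while a 
second thread builds its own copy.  A wrapper that delegated to another 
analyzer would need its own cache of this kind rather than borrowing its 
delegate's.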

It’s common to use the same analyzer for indexing and for parsing user queries. 
At query time, reusing token streams is a lot less important: the amount of 
time spent building the query is typically much lower than the amount of time 
spent rewriting and executing it.  Because this reuse is only really useful 
at index time, the lifecycle of the analyzer is in practice very closely tied 
to the lifecycle of its associated IndexWriter, which makes me think that we 
should consider moving the reuse strategies into IndexWriter itself.  
One option would be to have token streams be constructed once per 
DocumentsWriterPerThread, which would lose some reuse but would mean we could 
avoid ThreadLocals entirely.
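
As a rough illustration of that option, the cache could simply be a field on 
the per-thread writer object.  This is only a sketch: ToyStream and 
ToyWriterThread are hypothetical stand-ins, not the real 
DocumentsWriterPerThread, and it assumes each writer object is used by one 
thread at a time.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Toy stand-in for an expensive-to-build token stream; not a Lucene class. */
final class ToyStream {
    final String field;

    ToyStream(String field) {
        this.field = field;
    }
}

/** Hypothetical stand-in for a per-thread indexing worker. */
final class ToyWriterThread {
    private final Function<String, ToyStream> factory;

    // Plain map owned by the worker. Assumption: only one thread uses a given
    // worker at a time, so neither a ThreadLocal nor locking is needed.
    private final Map<String, ToyStream> streams = new HashMap<>();

    ToyWriterThread(Function<String, ToyStream> factory) {
        this.factory = factory;
    }

    /** Returns this worker's stream for the field, building it on first use. */
    ToyStream streamFor(String field) {
        return streams.computeIfAbsent(field, factory);
    }
}
```

Here the cached streams become unreachable along with the worker that owns 
them, so nothing needs to be closed.  The cost is that every new worker 
rebuilds its streams from scratch.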

Any thoughts?