RE: TokenStreamComponents in Lucene 4.0

Uwe Schindler Tue, 20 Nov 2012 01:23:28 -0800

Hi,

All the components of your TokenStream in Lucene 4.0 are *required* to be 
reusable; see the documentation:
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/analysis/Analyzer.html


All your components must implement reset() according to the TokenStream 
contract:
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/analysis/TokenStream.html

The createComponents() method of an Analyzer is called only *once* for each 
thread, and the resulting TokenStream is *reused* for later documents. The 
Analyzer will call the final method Tokenizer#setReader() to notify the 
Tokenizer of a new Reader (this method updates the protected "input" field in 
the Tokenizer base class), and then it will reset() the whole tokenization 
chain. Custom TokenStream components must therefore re-initialize their 
per-document state from the new input in their reset() method.
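
To illustrate the lifecycle described above, here is a minimal sketch in plain 
Java. It does not use any Lucene classes: ToyTokenizer, ToyAnalyzer and 
analyze() are hypothetical stand-ins that only model the contract (components 
created once, then setReader() plus reset() for every further document). As a 
simplification, the toy analyzer calls reset() itself.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class ReuseSketch {

    /** Hypothetical stand-in for a custom Tokenizer: splits input on whitespace. */
    static class ToyTokenizer {
        protected Reader input;                 // models the protected "input" field
        private String[] tokens = new String[0];
        private int pos = 0;

        ToyTokenizer(Reader input) { this.input = input; }

        /** Models Tokenizer#setReader(): only swaps the Reader, nothing else. */
        final void setReader(Reader input) { this.input = input; }

        /** All per-document state is rebuilt here, not in the constructor. */
        void reset() throws IOException {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = input.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            String text = sb.toString().trim();
            tokens = text.isEmpty() ? new String[0] : text.split("\\s+");
            pos = 0;
        }

        /** Returns the next token, or null when the input is exhausted. */
        String next() { return pos < tokens.length ? tokens[pos++] : null; }
    }

    /** Hypothetical stand-in for an Analyzer that creates its components once. */
    static class ToyAnalyzer {
        private ToyTokenizer cached;
        int createCalls = 0;

        ToyTokenizer tokenStream(Reader reader) throws IOException {
            if (cached == null) {
                cached = new ToyTokenizer(reader);  // models createComponents()
                createCalls++;
            } else {
                cached.setReader(reader);           // later documents: reuse
            }
            cached.reset();
            return cached;
        }
    }

    /** Runs one "document" through the analyzer and collects its tokens. */
    static List<String> analyze(ToyAnalyzer analyzer, String text) {
        try {
            List<String> out = new ArrayList<>();
            ToyTokenizer t = analyzer.tokenStream(new StringReader(text));
            for (String tok = t.next(); tok != null; tok = t.next()) {
                out.add(tok);
            }
            return out;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ToyAnalyzer analyzer = new ToyAnalyzer();
        System.out.println(analyze(analyzer, "first document"));       // [first, document]
        System.out.println(analyze(analyzer, "second document here")); // [second, document, here]
        System.out.println(analyzer.createCalls);                      // 1
    }
}
```

If ToyTokenizer read its input in the constructor instead of in reset(), the 
second call would still return the tokens of the first document (or none at 
all), which is similar to the symptom described in the quoted message below.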

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Carsten Schnober [mailto:schno...@ids-mannheim.de]
> Sent: Tuesday, November 20, 2012 10:15 AM
> To: java-user@lucene.apache.org
> Subject: Re: TokenStreamComponents in Lucene 4.0
> 
> On 19.11.2012 17:44, Carsten Schnober wrote:
> 
> Hi,
> 
> > However, after switching to Lucene 4 and TokenStreamComponents, I'm
> > getting a strange behaviour: only the first document in the collection
> > is tokenized properly. The others do appear in the index, but
> > un-tokenized, although I have tried not to change anything in the logic.
> > The Analyzer now has this createComponents() method calling the custom
> > TokenStreamComponents class with my custom Tokenizer:
> 
> After some debugging, it turns out that the Analyzer method
> createComponents() is called only once, for the first document. This seems
> to be the problem: the other documents are just not analyzed.
> Here's the loop that creates the fields and supposedly calls the analyzer.
> Does anyone have a hint why this only happens for the first document,
> even though the loop itself runs once for every document?
> 
> ---------------------------------------------------------------
> 
> List<de.ids_mannheim.korap.main.Document> documents;
> Version lucene_version = Version.LUCENE_40;
> Analyzer analyzer = new KoraAnalyzer();
> IndexWriterConfig config = new IndexWriterConfig(lucene_version, analyzer);
> IndexWriter writer = new IndexWriter(dir, config);
> [...]
> 
> for (de.ids_mannheim.korap.main.Document doc : documents) {
>   luceneDocument = new Document();
> 
>   /* Store document name/ID */
>   Field idField = new StringField(titleFieldName, doc.getDocid(), Field.Store.YES);
> 
>   /* Store tokens */
>   String layerFile = layer.getFile();
>   Field textFieldAnalyzed = new TextField(textFieldName, layerFile, Field.Store.YES);
> 
>   luceneDocument.add(textFieldAnalyzed);
>   luceneDocument.add(idField);
> 
>   try {
>     writer.addDocument(luceneDocument);
>   } catch (IOException e) {
>     jlog.error("Error adding document " + doc.getDocid() + ":\n" + e.getLocalizedMessage());
>   }
> }
> [...]
> writer.close();
> -------------------------------------------------------------------
> 
> The class de.ids_mannheim.korap.main.Document defines our own
> document objects from which the relevant information can be read as shown
> in the loop. The list 'documents' is filled in by a method that is called
> in between.
> Best,
> Carsten
> 
> --
> Institut für Deutsche Sprache | http://www.ids-mannheim.de
> Projekt KorAP                 | http://korap.ids-mannheim.de
> Tel. +49-(0)621-43740789      | schno...@ids-mannheim.de
> Korpusanalyseplattform der nächsten Generation Next Generation Corpus
> Analysis Platform
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org

