Re: TokenStreamComponents in Lucene 4.0

Carsten Schnober Tue, 20 Nov 2012 01:15:26 -0800

On 19.11.2012 17:44, Carsten Schnober wrote:

Hi,


> However, after switching to Lucene 4 and TokenStreamComponents, I'm
> getting a strange behaviour: only the first document in the collection
> is tokenized properly. The others do appear in the index, but
> un-tokenized, although I have tried not to change anything in the logic.
> The Analyzer now has this createComponents() method calling the custom
> TokenStreamComponents class with my custom Tokenizer:

After some debugging, it turns out that the Analyzer method
createComponents() is called only once, for the first document. This
seems to be the problem: the other documents are simply not analyzed.
Here is the loop that creates the fields and supposedly calls the
analyzer. Does anyone have a hint why this happens only for the first
document, even though the loop itself runs once for every document?

---------------------------------------------------------------

List<de.ids_mannheim.korap.main.Document> documents;
Version lucene_version = Version.LUCENE_40;
Analyzer analyzer = new KoraAnalyzer();
IndexWriterConfig config = new IndexWriterConfig(lucene_version, analyzer);
IndexWriter writer = new IndexWriter(dir, config);
[...]

for (de.ids_mannheim.korap.main.Document doc : documents) {
  luceneDocument = new Document();
                        
  /* Store document name/ID */
  Field idField = new StringField(titleFieldName, doc.getDocid(),
      Field.Store.YES);
                        
  /* Store tokens */
  String layerFile = layer.getFile();
  Field textFieldAnalyzed = new TextField(textFieldName, layerFile,
      Field.Store.YES);
                
  luceneDocument.add(textFieldAnalyzed);
  luceneDocument.add(idField);
                                                
  try {
    writer.addDocument(luceneDocument);
  } catch (IOException e) {
    jlog.error("Error adding document "
        + doc.getDocid() + ":\n" + e.getLocalizedMessage());
  }
}
[...]
writer.close();
-------------------------------------------------------------------

The class de.ids_mannheim.korap.main.Document defines our own document
objects, from which the relevant information can be read as shown in the
loop. The list 'documents' is filled in by a method called in between.
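
To make the observation concrete, here is a minimal plain-Java sketch of
one pattern that would produce exactly this symptom: an analyzer that
builds its components once, caches them, and for every later document
only hands the new input to the existing tokenizer. This is a toy model
under stated assumptions, not Lucene's code. ReuseSketch, ToyAnalyzer,
ToyTokenizer and createCalls are hypothetical names; only the method
names createComponents() and setReader() are borrowed for readability.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class ReuseSketch {

    /** Hypothetical stand-in for a tokenizer; not Lucene's Tokenizer class. */
    static class ToyTokenizer {
        private Reader input;

        ToyTokenizer(Reader input) {
            this.input = input;
        }

        /** Points the existing instance at the next document's text. */
        void setReader(Reader input) {
            this.input = input;
        }

        /** Reads the current input fully and splits it on whitespace. */
        List<String> tokens() {
            StringBuilder text = new StringBuilder();
            try {
                int c;
                while ((c = input.read()) != -1) {
                    text.append((char) c);
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            List<String> out = new ArrayList<>();
            for (String token : text.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    out.add(token);
                }
            }
            return out;
        }
    }

    /** Hypothetical analyzer that caches the components it creates. */
    static class ToyAnalyzer {
        int createCalls = 0;          // how often createComponents() ran
        private ToyTokenizer cached;  // reused for every later document

        ToyTokenizer createComponents(Reader reader) {
            createCalls++;
            return new ToyTokenizer(reader);
        }

        ToyTokenizer tokenStream(Reader reader) {
            if (cached == null) {
                cached = createComponents(reader); // first document only
            } else {
                cached.setReader(reader);          // later documents
            }
            return cached;
        }
    }

    public static void main(String[] args) {
        ToyAnalyzer analyzer = new ToyAnalyzer();
        String[] docs = {"first document", "second document", "third document"};
        for (String doc : docs) {
            List<String> tokens = analyzer.tokenStream(new StringReader(doc)).tokens();
            System.out.println(tokens + " (createComponents calls so far: "
                + analyzer.createCalls + ")");
        }
    }
}
```

In this model all three documents are tokenized although
createComponents() runs only once, because the tokenizer reads whatever
input it was last given. A tokenizer that captures or consumes its input
only in the constructor would, in the same model, see only the first
document. Whether that is what happens in my case is still an open
question.
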
Best,
Carsten

-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP                 | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789      | [email protected]
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform
