Re: term counts during indexing

Gerret Apelt Wed, 29 Oct 2003 19:48:50 -0800

Peter Keegan wrote:

Is there a simple and efficient way of determining the number of tokens added to a document after adding each field ('Document.add'), as a result of the actions of the Analyzer, without having to re-parse the field?

Peter --

You can ask the Document instance:

    import java.util.Enumeration;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    Document doc = getDocumentInstanceFromSomewhere();
    int termCount = 0;
    Enumeration fields = doc.fields();
    while (fields.hasMoreElements()) {
        Field field = (Field) fields.nextElement();
        String fieldName = field.name();
        // getValues() returns the stored string values for this field name
        String[] fieldTerms = doc.getValues(fieldName);
        termCount += fieldTerms.length;
    }
    System.out.println("The fields of the document together contain " + termCount + " terms.");

Note that 1) I haven't tried to compile this code, so I'm not sure it works, and 2) it will only work for fields where field.isStored() == true. If a field isn't stored in the index, you have no choice but to go back to the original source document.

[Not sure about the following, so please correct me if I'm wrong:] Remember that unstored fields are indexed, so you can query on them, but the field text itself is not stored in the index. Therefore you cannot count its terms by asking Lucene. A Lucene Field instance also has no way to reference the source of the terms that are added to it; the field doesn't care where its terms came from. So if field.isStored() == false, Lucene cannot tell you how many terms that particular field contains. In that case you'll have to write your own code that analyzes the original data source.
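For the "write your own code" case, here is a rough sketch of what I mean. Everything in it is my own invention for illustration: FieldTokenCounter is a hypothetical helper, not a Lucene class, and countTokens() is only a stand-in that splits on anything that isn't a letter or digit. It will not match what your real Analyzer produces (stop words, stemming and so on), so treat the numbers as an approximation unless you swap in the same tokenization you index with.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of Lucene): tallies tokens per field from the
// original text, alongside whatever code builds the Document.
class FieldTokenCounter {
    private final Map<String, Integer> countsByField = new LinkedHashMap<>();

    // Stand-in for an Analyzer: splits on runs of characters that are
    // neither letters nor digits and counts the non-empty pieces.
    static int countTokens(String text) {
        int count = 0;
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                count++;
            }
        }
        return count;
    }

    // Record the text of one field; returns the tokens counted for this call.
    int record(String fieldName, String text) {
        int n = countTokens(text);
        countsByField.merge(fieldName, n, Integer::sum);
        return n;
    }

    // Tokens counted so far for a single field name (0 if never recorded).
    int countFor(String fieldName) {
        return countsByField.getOrDefault(fieldName, 0);
    }

    // Tokens counted so far across all fields.
    int total() {
        int sum = 0;
        for (int n : countsByField.values()) {
            sum += n;
        }
        return sum;
    }

    public static void main(String[] args) {
        FieldTokenCounter counter = new FieldTokenCounter();
        counter.record("title", "Term counts during indexing");
        counter.record("body", "Is there a simple way to count tokens?");
        System.out.println("Total tokens: " + counter.total());
    }
}
```

The idea is to call record() with the same field name and text right next to each Document.add call, so the count is available whether or not the field is stored.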

Alternatively, is there a way to determine the number of tokens added after adding the document to the index ('IndexWriter.addDocument')?

Whether you want the termCount for a document before or after you add the document to the index doesn't matter, so the answer is "see above".

cheers,
Gerret

