Sorry for the delay; I just happened to see your reply in the mailing list archive -- must have overlooked it in my inbox :(
Peter Keegan wrote:
As I understand it, the field text is being tokenized by the analyzer when IndexWriter.addDocument is called. At this point, the tokens are indexed and/or stored. Would it be possible for 'addDocument' to save and make the _actual_ counts of 'tokens stored' and 'tokens indexed' available in either the Document or IndexWriter object? I guess I may be turning this into a feature request :)
Lucene uses an inverted index, so the index is based on a mapping from "term" instances to the documents that contain them, as opposed to "document" instances mapping to a list of terms contained in that document (which is a fancy way of saying, "Lucene doesn't store documents; filesystems do that").
So in terms of the index representation, Lucene could not simply add a "term count" parameter to the entry for a given document, because (unless we're talking about a stored field) there is no table in which such an entry could exist. You would need to add a totally new data structure to the index, one that can store document properties for un-stored fields. That sort of defeats the purpose of un-stored fields. It sounds wrong to have an un-stored field and store its term count.
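To make the point above concrete, here is a toy inverted index. It is only a sketch: the class InvertedIndexSketch and its methods are made up for illustration and are not Lucene's real index format, and it leaves out term frequencies and positions. It just shows that the structure is keyed by term, so there is no per-document entry where a count could live.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy index, NOT Lucene's on-disk format.
class InvertedIndexSketch {
    // term -> ids of the documents that contain it (the only table we have)
    private final Map<String, List<Integer>> postings = new TreeMap<>();

    void addDocument(int docId, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            if (term.isEmpty()) {
                continue;
            }
            List<Integer> docs = postings.computeIfAbsent(term, t -> new ArrayList<>());
            // Record each document at most once per term.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                docs.add(docId);
            }
        }
    }

    // Cheap: a single lookup by term.
    List<Integer> documentsContaining(String term) {
        return postings.getOrDefault(term, Collections.emptyList());
    }

    // Expensive: no per-document entry exists, so every posting list is scanned.
    int distinctTermsIn(int docId) {
        int n = 0;
        for (List<Integer> docs : postings.values()) {
            if (docs.contains(docId)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument(1, "lucene stores terms");
        index.addDocument(2, "terms map to documents");
        System.out.println(index.documentsContaining("terms")); // prints [1, 2]
        System.out.println(index.distinctTermsIn(2));           // prints 4
    }
}
```

Going from a term to its documents is one lookup; going from a document to its terms means walking the whole term dictionary.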
Here's a proposal for a hack you could do: write an Analyzer wrapper that counts the tokens emitted by the next() method of the Analyzer's TokenStream, which is called by IndexWriter.addDocument(Document). When TokenStream.next() returns null, you can store the tokenCount that you have maintained in a file or database. This is fairly ugly, but it has the advantage that it will work for non-stored fields.
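A rough sketch of that wrapper idea follows. To keep it compilable on its own, the Analyzer and TokenStream types here are simplified stand-ins that I made up (the real Lucene classes return Token objects and declare IOException), and CountingAnalyzer, WhitespaceAnalyzer and count() are hypothetical names. It keeps the counts in memory instead of a file or database.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;

class TokenCountSketch {

    // Simplified stand-ins for the Lucene types of the same names (assumption).
    interface TokenStream {
        String next(); // returns null once the stream is exhausted
    }

    interface Analyzer {
        TokenStream tokenStream(String fieldName, Reader reader);
    }

    // Trivial analyzer that splits on whitespace, used only for the demo.
    static class WhitespaceAnalyzer implements Analyzer {
        public TokenStream tokenStream(String fieldName, Reader reader) {
            Scanner scanner = new Scanner(reader);
            return () -> scanner.hasNext() ? scanner.next() : null;
        }
    }

    // Wraps another analyzer and counts the tokens it emits, per field name.
    static class CountingAnalyzer implements Analyzer {
        private final Analyzer delegate;
        private final Map<String, Integer> counts = new HashMap<>();

        CountingAnalyzer(Analyzer delegate) {
            this.delegate = delegate;
        }

        public TokenStream tokenStream(String fieldName, Reader reader) {
            TokenStream inner = delegate.tokenStream(fieldName, reader);
            return new TokenStream() {
                private int seen = 0;
                private boolean recorded = false;

                public String next() {
                    String token = inner.next();
                    if (token != null) {
                        seen++;
                    } else if (!recorded) {
                        // End of stream: this is where a file or database write could go.
                        counts.merge(fieldName, seen, Integer::sum);
                        recorded = true;
                    }
                    return token;
                }
            };
        }

        int count(String fieldName) {
            return counts.getOrDefault(fieldName, 0);
        }
    }

    public static void main(String[] args) {
        CountingAnalyzer analyzer = new CountingAnalyzer(new WhitespaceAnalyzer());
        TokenStream stream = analyzer.tokenStream("body", new StringReader("the quick brown fox"));
        while (stream.next() != null) {
            // The indexer would consume tokens here.
        }
        System.out.println("body tokens: " + analyzer.count("body")); // prints body tokens: 4
    }
}
```

With the real classes you would pass the wrapper to the IndexWriter in place of your normal analyzer and reset or read the counts around each addDocument call; I haven't tried that against an actual Lucene release.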
I doubt there will be much support for extending Lucene to store field properties for unstored fields. Maybe there could be another field type called TERMCOUNTED_FIELD? Maybe some of the core coders could comment.
Also, I can't find this method from the code snippet provided by Gerret (I'm using v1.2):
String[] fieldTerms = doc.getValues(fieldName);
hmm, it must have been added later then: http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/document/Document.html
cheers, Gerret
Thanks, Peter
----- Original Message ----- From: "Gerret Apelt" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, October 29, 2003 9:44 PM
Subject: Re: term counts during indexing
Peter Keegan wrote:
Is there a simple and efficient way of determining the number of tokens added to a document after adding each field ('Document.add'), as a result of the actions of the Analyzer, without having to re-parse the field?
Peter -- you can ask the Document instance.
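    import java.util.Enumeration;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    Document doc = getDocumentInstanceFromSomewhere();
    int termCount = 0;
    Enumeration fields = doc.fields();
    while (fields.hasMoreElements()) {
        Field field = (Field) fields.nextElement();
        String fieldName = field.name();
        String[] fieldTerms = doc.getValues(fieldName);
        // getValues() can return null, so guard before reading the length
        if (fieldTerms != null) {
            termCount += fieldTerms.length;
        }
    }
    System.out.println("The fields of the document together contain " + termCount + " terms.");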
Note that 1) I haven't tried to compile this code, so I'm not sure if it works, and 2) this will only work for those fields where field.isStored() == true. If the field isn't stored in the index, then you don't have a choice but to go back to the document.
[not sure on the following, so please correct me if in error:] Remember that unStored fields are indexed, so you can query on them, but the field terms themselves are not stored in the index. Therefore you cannot count them by asking Lucene. A Lucene field instance also has no way to reference the source of the terms that are added to it. The field doesn't care where its terms came from. So if field.isStored() == false, then for that particular field Lucene cannot tell you how many terms are in it. You'll have to write your own code that analyzes the original data source in this case.
Alternatively, is there a way to determine the number of tokens added after adding the document to the index ('IndexWriter.addDocument')?
Whether you want the termCount for a document before or after you add the document to the index doesn't matter, so the answer is "see above".
cheers, Gerret
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
