Gerret,

Sorry, I didn't mean to suggest changing the index format to save the counts. But your suggestion of adding a 'term counting' analyzer at the end of the filter chain makes more sense to me (and now seems so obvious).
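
For the archive, here is a rough sketch of how I understand that suggestion. It is not written against Lucene itself: `TokenSource` and `ListTokenSource` are hypothetical stand-ins for a Lucene TokenStream (whose next() returns null at the end of the stream, as described in the quoted message below), and `CountingTokenSource`, `count()` and `isFinished()` are names made up for illustration. In real code the counting class would presumably wrap the last stream in the analyzer's filter chain.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class TermCountSketch {

    /** Hypothetical stand-in for a token stream: next() returns null when exhausted. */
    interface TokenSource {
        String next();
    }

    /** Hypothetical in-memory source, used here only to drive the example. */
    static class ListTokenSource implements TokenSource {
        private final Iterator<String> tokens;

        ListTokenSource(List<String> tokens) {
            this.tokens = tokens.iterator();
        }

        public String next() {
            return tokens.hasNext() ? tokens.next() : null;
        }
    }

    /** Wraps another source and counts every token that passes through it. */
    static class CountingTokenSource implements TokenSource {
        private final TokenSource input;
        private int count = 0;
        private boolean finished = false;

        CountingTokenSource(TokenSource input) {
            this.input = input;
        }

        public String next() {
            String token = input.next();
            if (token == null) {
                finished = true; // end of stream: the count is now final
            } else {
                count++;
            }
            return token;
        }

        int count() {
            return count;
        }

        boolean isFinished() {
            return finished;
        }
    }

    public static void main(String[] args) {
        CountingTokenSource counted = new CountingTokenSource(
                new ListTokenSource(Arrays.asList("term", "counts", "during", "indexing")));
        // Stands in for the indexer draining the stream while adding a document.
        while (counted.next() != null) {
            // each token would be indexed here
        }
        System.out.println("tokens seen: " + counted.count());
    }
}
```

The count is only final once next() has returned null, which is the point at which it could be saved to a file or database.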
Thanks,
Peter

----- Original Message -----
From: "Gerret Apelt" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Thursday, November 06, 2003 8:01 PM
Subject: Re: term counts during indexing

> Peter --
>
> sorry for the delay; I just accidentally saw your reply in the mailing
> list archive -- must have overlooked it in my inbox :(
>
> Peter Keegan wrote:
>
> >As I understand it, the field text is being tokenized by the analyzer when
> >IndexWriter.addDocument is called. At this point, the tokens are indexed
> >and/or stored. Would it be possible for 'addDocument' to save and make the
> >_actual_ counts of 'tokens stored' and 'tokens indexed' available in either
> >the Document or IndexWriter object? I guess I may be turning this into a
> >feature request :)
>
> Lucene uses an inverted index, so the index is based on a mapping from
> "term" instances to the documents that contain them, as opposed to
> "document" instances mapping to a list of terms contained in that
> document (which is a fancy way of saying, "Lucene doesn't store
> documents; filesystems do that").
>
> So in terms of the index representation, Lucene could not simply add a
> "term count" parameter to the entry for a given document, because
> (unless we're talking about a stored field) there is no table in which
> such an entry could exist. You would need to add a totally new data
> structure to the index, which can store document properties for
> un-stored fields. This sort of defeats the purpose of un-stored
> fields. It sounds wrong to have an un-stored field and store its term count.
>
> Here's a proposal for a hack you could do: write an Analyzer wrapper
> that counts tokens emitted by the Analyzer's TokenStream's next()
> method, which is called by IndexWriter.addDocument(Document). When
> TokenStream.next() returns null, you can store the tokenCount that you
> have maintained in a file or database.
> This is fairly ugly, but it has the advantage that it will work for
> non-stored fields.
>
> I doubt there will be much support for extending Lucene to store field
> properties for unstored fields. Maybe there could be another field type
> called TERMCOUNTED_FIELD? Maybe some of the core coders could comment.
>
> >Also, I can't find this method from the code snippet provided by Gerret (I'm
> >using v1.2):
> >
> >>String[] fieldTerms = doc.getValues(fieldName);
>
> hmm, it must have been added later then:
> http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/document/Document.html
>
> cheers,
> Gerret
>
> >Thanks,
> >Peter
> >
> >----- Original Message -----
> >From: "Gerret Apelt" <[EMAIL PROTECTED]>
> >To: "Lucene Users List" <[EMAIL PROTECTED]>
> >Sent: Wednesday, October 29, 2003 9:44 PM
> >Subject: Re: term counts during indexing
> >
> >>Peter Keegan wrote:
> >>
> >>>Is there a simple and efficient way of determining the number of tokens
> >>>added to a document after adding each field ('Document.add'), as a result
> >>>of the actions of the Analyzer, without having to re-parse the field?
> >>
> >>Peter --
> >>
> >>you can ask the Document instance.
> >>
> >>Document doc = getDocumentInstanceFromSomewhere();
> >>int termCount = 0;
> >>Enumeration fields = doc.fields();
> >>while (fields.hasMoreElements()) {
> >>    Field field = (Field)fields.nextElement();
> >>    String fieldName = field.name();
> >>    String[] fieldTerms = doc.getValues(fieldName);
> >>    termCount += fieldTerms.length;
> >>}
> >>System.out.println("The fields of the document together contain "
> >>    + termCount + " terms.");
> >>
> >>Note that
> >>1) I haven't tried to compile this code, so I'm not sure if it works
> >>2) this will only work for those fields where field.isStored() == true.
> >>If the field isn't stored in the index, then you don't have a choice but
> >>to go back to the document.
> >>
> >>[not sure on the following, so please correct me if in error:] Remember
> >>that unStored fields are indexed, so you can query on them, but the
> >>field terms themselves are not stored in the index. Therefore you cannot
> >>count them by asking Lucene. A Lucene field instance also has no way to
> >>reference the source of the terms that are added to it. The field
> >>doesn't care where its terms came from. So if field.isStored() == false,
> >>then for that particular field Lucene cannot tell you how many terms are
> >>in it. You'll have to write your own code that analyzes the original
> >>data source in this case.
> >>
> >>>Alternatively, is there a way to determine the number of tokens added
> >>>after adding the document to the index ('IndexWriter.addDocument')?
> >>
> >>Whether you want the termCount for a document before or after you add
> >>the document to the index doesn't matter, so the answer is "see above".
> >>
> >>cheers,
> >>Gerret
> >>
> >>---------------------------------------------------------------------
> >>To unsubscribe, e-mail: [EMAIL PROTECTED]
> >>For additional commands, e-mail: [EMAIL PROTECTED]
