Re: term counts during indexing

Peter Keegan Fri, 07 Nov 2003 10:42:59 -0800

Gerret,

Sorry, I didn't mean to suggest changing the index format to save the
counts. But your suggestion of adding a 'term counting' analyzer at the end
of the filter chain makes more sense to me (and now seems so obvious).
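
For the archive, here is a minimal sketch of that counting idea. It does not use the Lucene classes: `SimpleTokenStream` is a hypothetical stand-in that only mimics the `next()`-returns-null convention Gerret describes below, and `CountingTokenStream`, `TokenCountDemo`, and the whitespace tokenizer are names invented for illustration. In a real setup the counting wrapper would presumably sit last in the filter chain and wrap whatever stream the Analyzer produces.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for a token stream: next() returns null when exhausted. */
interface SimpleTokenStream {
    String next();
}

/** Passes tokens through unchanged and counts them on the way. */
class CountingTokenStream implements SimpleTokenStream {
    private final SimpleTokenStream input;
    private int tokenCount = 0;
    private boolean exhausted = false;

    CountingTokenStream(SimpleTokenStream input) {
        this.input = input;
    }

    public String next() {
        String token = input.next();
        if (token == null) {
            // End of stream: this is the point where the final count
            // could be written to a file or database.
            exhausted = true;
        } else {
            tokenCount++;
        }
        return token;
    }

    int getTokenCount() {
        return tokenCount;
    }

    boolean isExhausted() {
        return exhausted;
    }
}

class TokenCountDemo {
    /** Toy whitespace tokenizer, used only to feed the counter. */
    static SimpleTokenStream whitespaceStream(String text) {
        List<String> words = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        Iterator<String> it = words.iterator();
        return () -> it.hasNext() ? it.next() : null;
    }

    /** Drains the stream the way an indexer would and returns the count. */
    static int countTokens(String text) {
        CountingTokenStream counter = new CountingTokenStream(whitespaceStream(text));
        while (counter.next() != null) {
            // the consumer would index each token here
        }
        return counter.getTokenCount();
    }
}
```

Because the wrapper only observes tokens as they pass, the count reflects whatever the earlier filters let through, and it works whether or not the field is stored.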


Thanks,
Peter

----- Original Message ----- 
From: "Gerret Apelt" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Thursday, November 06, 2003 8:01 PM
Subject: Re: term counts during indexing


> Peter --
>
> sorry for the delay; I just accidentally saw your reply in the mailing
> list archive -- must have overlooked it in my inbox :(
>
> Peter Keegan wrote:
>
> >As I understand it, the field text is being tokenized by the analyzer when
> >IndexWriter.addDocument is called. At this point, the tokens are indexed
> >and/or stored. Would it be possible for 'addDocument' to save and make the
> >_actual_ counts of 'tokens stored' and 'tokens indexed' available in either
> >the Document or IndexWriter object? I guess I may be turning this into a
> >feature request :)
> >
> >
> >
> Lucene uses an inverted index, so the index is based on a mapping from
> "term" instances to the documents that contain them, as opposed to
> "document" instances mapping to a list of terms contained in that
> document (which is a fancy way of saying, "Lucene doesn't store
> documents; filesystems do that").
> So in terms of the index representation, Lucene could not simply add a
> "term count" parameter to the entry for a given document, because
> (unless we're talking about a stored field) there is no table in which
> such an entry could exist. You would need to add a totally new data
> structure to the index, which can store document properties for
> un-stored fields. This sort of defeats the purpose of un-stored
> fields. It sounds wrong to have an un-stored field and store its
> term count.
>
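
To illustrate the point about the inverted mapping, here is a toy sketch. It is not Lucene's index format: `ToyInvertedIndex` and its methods are invented for illustration, and it keeps only term-to-document postings with no frequencies. Looking up the documents for a term is a direct map access, while anything per-document has to be pieced together by walking every term's postings.

```java
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy inverted index: maps each term to the IDs of the documents containing it. */
class ToyInvertedIndex {
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();

    /** Splits on whitespace and records docId under every term seen. */
    void addDocument(int docId, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            if (term.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(term, t -> new TreeSet<Integer>()).add(docId);
        }
    }

    /** Direct lookup: this is the question the structure is built to answer. */
    SortedSet<Integer> docsContaining(String term) {
        SortedSet<Integer> docs = postings.get(term);
        return docs == null ? new TreeSet<Integer>() : docs;
    }

    /**
     * There is no per-document entry to read, so this scans every term.
     * It yields distinct terms only, because this toy keeps no frequencies.
     */
    int distinctTermsInDoc(int docId) {
        int count = 0;
        for (SortedSet<Integer> docs : postings.values()) {
            if (docs.contains(docId)) {
                count++;
            }
        }
        return count;
    }
}
```

In this sketch the original token count of a document ("to be or not to be" has six tokens but four distinct terms) is simply not recoverable, which is why counting at analysis time is the easier route.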
> Here's a proposal for a hack you could do: write an Analyzer wrapper
> that counts tokens emitted by the Analyzer's TokenStream's next()
> method, which is called by IndexWriter.addDocument(Document). When
> TokenStream.next() returns null, you can store the tokenCount that you
> have maintained in a file or database. This is fairly ugly but it has
> the advantage that it will work for non-stored fields.
>
> I doubt there will be much support for extending Lucene to store field
> properties for unstored fields. Maybe there could be another field type
> called TERMCOUNTED_FIELD? Maybe some of the core coders could comment.
>
> >Also, I can't find this method from the code snippet provided by Gerret (I'm
> >using v1.2):
> >
> >
> >>String[] fieldTerms = doc.getValues(fieldName);
> >>
> >>
> hmm, it must have been added later then:
>
> http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/document/Document.html
>
> cheers,
> Gerret
>
> >
> >
> >Thanks,
> >Peter
> >
> >----- Original Message ----- 
> >From: "Gerret Apelt" <[EMAIL PROTECTED]>
> >To: "Lucene Users List" <[EMAIL PROTECTED]>
> >Sent: Wednesday, October 29, 2003 9:44 PM
> >Subject: Re: term counts during indexing
> >
> >
> >
> >
> >>Peter Keegan wrote:
> >>
> >>
> >>
> >>>Is there a simple and efficient way of determining the number of tokens
> >>>added to a document after adding each field ('Document.add'), as a result
> >>>of the actions of the Analyzer, without having to re-parse the field?
> >>>
> >>>
> >>Peter --
> >>
> >>you can ask the Document instance.
> >>
> >>// needs: import java.util.Enumeration;
> >>Document doc = getDocumentInstanceFromSomewhere();
> >>int termCount = 0;
> >>Enumeration fields = doc.fields();
> >>while (fields.hasMoreElements()) {
> >>    Field field = (Field)fields.nextElement();
> >>    String fieldName = field.name();
> >>    String[] fieldTerms = doc.getValues(fieldName);
> >>    termCount += fieldTerms.length;
> >>}
> >>System.out.println("The fields of the document together contain "
> >>    + termCount + " terms.");
> >>
> >>Note that
> >>1) I haven't tried to compile this code, so I'm not sure if it works
> >>2) this will only work for those fields where field.isStored() == true.
> >>If the field isn't stored in the index, then you don't have a choice but
> >>to go back to the document.
> >>
> >>[not sure on the following, so please correct me if in error:] Remember
> >>that unStored fields are indexed, so you can query on them, but the
> >>field terms themselves are not stored in the index. Therefore you cannot
> >>count them by asking Lucene. A Lucene field instance also has no way to
> >>reference the source of the terms that are added to it. The field
> >>doesn't care where its terms came from. So if field.isStored() == false,
> >>then for that particular field Lucene cannot tell you how many terms are
> >>in it. You'll have to write your own code that analyzes the original
> >>data source in this case.
> >>
> >>
> >>
> >>>Alternatively, is there a way to determine the number of tokens added after
> >>>adding the document to the index ('IndexWriter.addDocument')?
> >>>
> >>Whether you want the termCount for a document before or after you add
> >>the document to the index doesn't matter, so the answer is "see above".
> >>
> >>cheers,
> >>Gerret
> >>
> >>
> >>---------------------------------------------------------------------
> >>To unsubscribe, e-mail: [EMAIL PROTECTED]
> >>For additional commands, e-mail: [EMAIL PROTECTED]
> >>
> >>
> >>

