I am not sure you will be able to do that at index time (that is,
without parsing the document text). Search engines usually maintain an
inverted index, so keywords are not stored per document; instead, for
each keyword the index keeps the list of documents containing that term
along with the corresponding position information. So I don't think the
Document/Field classes in Lucene have a getTermFrequency or similar
method.
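
For example, once the index has been built you can get at frequencies,
but only by going term-first. A minimal sketch with the Lucene 2.x API
(the index path and field name here are just placeholders):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;

    // Illustration only: walk the posting list for a single term in an
    // already-built index. The index returns (document id, frequency)
    // pairs, i.e. the term -> documents direction, not document -> terms.
    public class PostingListDemo {
      public static void main(String[] args) throws IOException {
        IndexReader reader = IndexReader.open("crawl/index");  // hypothetical path
        TermDocs docs = reader.termDocs(new Term("content", "cow"));
        while (docs.next()) {
          System.out.println("doc " + docs.doc()
              + " contains 'cow' " + docs.freq() + " times");
        }
        docs.close();
        reader.close();
      }
    }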

Cheers
-Jha


On Dec 8, 2007 4:50 PM, Glenn Barney <[EMAIL PROTECTED]> wrote:
> Hi All,
>
> I've been reading and going through the Nutch examples for a couple of days
> but haven't found an exact answer to my problem.  I want to add a category
> field (with a boost score) to each document I index, based on the text
> content of a web page.  For example, I'm creating the category "farm", and I
> have a fixed list of keywords I want to map to that category (say "cow",
> "pig", "farm", and "farmer").  The boost score for the new field "farm" is
> relative to the frequency of these terms in my document.
>
> The examples in this forum all talk about 1) scraping metadata from an HTML
> page while parsing and adding your category field if that metadata is
> present.  This doesn't work for me, since I don't have any special metadata
> in my documents (I'm indexing the open web), and in any case I don't want to
> do anything in the parse stage of crawling; I want to add my new field in
> the index stage.  So that leaves 2) the index stage: there I have a
> reference to the document text (from Parse.getText()) inside filter() in
> IndexingFilter.  I could use Java's string methods to search the text for
> each of my terms one by one (counting repeats), then compute a score from
> the frequencies and add it to a new field called "farm".  However, *this is
> the whole point of indexing*, and to my understanding Lucene/Nutch is
> already doing it: it is already tokenizing the content field and already
> calculating term frequencies there.
>
> As I index, I want Nutch to do its magic, tokenize and analyze the content
> field, and then let me use those results to add a new field based on the
> tokens.  I don't want to "index" the whole thing twice; I'm sure smarter
> people than I have written a very effective tokenizing implementation
> (removing punctuation, spotting repeated terms, and so on), and I'd like to
> reuse it.
>
> I guess if I had some magic pseudocode, I'm looking to do something like
> this:
>
> filter() {
>     for each word in my category:
>         // uses the index that's being built before this filter applies
>         score += thisDocument.getFrequency(word);
>     addNewField("farm", score);   // set the "farm" field's boost to score
> }
>
> Is there any way (or any better way) to do what I want above?
> Thanks,
> -Glenn
>
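
There is no per-document frequency lookup available at that point (the
index is still being written), but the same kind of analysis can be run
once over Parse.getText() inside filter(), which gets close to that
pseudocode. A rough sketch of the idea using the Lucene 2.x analysis API;
the keyword list, the StandardAnalyzer (a stand-in for whatever analyzer
Nutch is configured to use), the "1 + hits" boost formula, and the
helper's name are all assumptions, and the surrounding IndexingFilter
plumbing is omitted:

    import java.io.IOException;
    import java.io.StringReader;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    // Sketch: count how often any of the category keywords occurs in the
    // page text, then add a "farm" field whose boost reflects that count.
    // Intended to be called from an IndexingFilter's filter() with the
    // Lucene Document being built and parse.getText().
    public class FarmCategoryScorer {

      private static final Set<String> KEYWORDS = new HashSet<String>(
          Arrays.asList("cow", "pig", "farm", "farmer"));

      public static void addCategoryField(Document doc, String text)
          throws IOException {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        TokenStream tokens =
            analyzer.tokenStream("content", new StringReader(text));
        int hits = 0;
        Token token;
        while ((token = tokens.next()) != null) {   // Lucene 2.x token API
          if (KEYWORDS.contains(token.termText())) {
            hits++;
          }
        }
        tokens.close();

        // Store the category as an untokenized field and express the
        // keyword frequency through the field boost. "1 + hits" is just
        // one arbitrary way to turn a count into a boost.
        Field farm = new Field("farm", "farm",
            Field.Store.YES, Field.Index.UN_TOKENIZED);
        farm.setBoost(1.0f + hits);
        doc.add(farm);
      }
    }

This tokenizes the page text one extra time rather than reusing counts
from the index, but it stays consistent with Lucene's own tokenization
and avoids hand-rolled string matching.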
