Hi All,

I've been reading and going through the Nutch examples for a couple of days
but haven't found an exact answer to my problem.  I want to add a category
field (with a boost score) to each document I index, based on the text
content of the web page.  For example, I'm creating the category farm, and I
have a fixed list of keywords I want to map to the category farm (say "cow",
"pig", "farm", and "farmer").  The boost score for the new field "farm"
should be proportional to the frequency of these terms in my document.

The examples in this forum all talk about one approach: scraping metadata
from an HTML page while parsing, and adding the category field if that
metadata is present.  This doesn't work for me because 1) I don't have any
special metadata in my documents (I'm crawling the web) and 2) I don't want
to do anything in the parse stage of crawling.  I want to add my new field in
the index stage.  So that leaves a second approach: in the index stage, I
have a reference to the document text (via Parse.getText()) in filter() in
IndexingFilter.  I can use Java's string methods to search the text for each
of my terms one by one (counting repeats), then create a score based on
frequency and add it to a new field called "farm".  However, *this is the
whole point of indexing*, and to my understanding Lucene/Nutch is already
doing this: it's already tokenizing and already calculating term frequencies
in the tokenized content field.
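
For reference, the manual approach I'd like to avoid would look roughly like
the sketch below.  It is plain Java with no Nutch or Lucene classes; the
class name CategoryScorer, the FARM_TERMS list, and the regex tokenizer are
all made up for illustration, and the tokenizer is much cruder than a real
analyzer (no stemming, no stop words).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class CategoryScorer {
    // Hypothetical keyword list for the "farm" category.
    static final Set<String> FARM_TERMS =
            new HashSet<>(Arrays.asList("cow", "pig", "farm", "farmer"));

    // Counts how many tokens in the text are category keywords.
    static int score(String text, Set<String> terms) {
        int score = 0;
        // Crude tokenizer (an assumption, not what Lucene does):
        // lowercase, then split on anything that is not a letter or digit.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        for (String token : tokens) {
            if (terms.contains(token)) {
                score++;
            }
        }
        return score;
    }

    public static void main(String[] args) {
        String text = "The farmer sold a cow and a pig. The cow was old.";
        // farmer (1) + cow (2) + pig (1)
        System.out.println(score(text, FARM_TERMS)); // prints 4
    }
}
```

Inside filter() the text argument would be the string from Parse.getText(),
and the returned score would become the boost for the "farm" field.  This is
exactly the second pass over the text that I'm hoping not to need.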

As I index, I want to have Nutch do its magic (tokenize and parse the
content in the content field), then go in myself and use those results to
add a new field based on the tokens.  I don't want to "index" the whole
thing twice.  I'm sure people smarter than I am wrote a very effective
tokenizing implementation (say, removing punctuation and efficiently finding
duplicate terms), and that is what I want to use.

I guess if I had some magic pseudocode, I'm looking to do something like
this:
filter (
     score = 0
     for each word in my category
        // uses the index that's being built before this filter applies
        score += thisDocument.getFrequency(word);
     addNewField(farm, score) // set farm's boost to score
)

Is there any way (or any better way) to do what I want above?
Thanks,
-Glenn
