Re: adding category field based on terms

Jasper Kamperman Sat, 08 Dec 2007 13:56:33 -0800

There may be plenty of other ways, but the indices that nutch creates are standard Lucene indices. So after nutch is done creating an index you can use IndexReaders/Writers, which pretty much support all the methods you use in your "magic pseudo code".
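
As a rough sketch of the scoring step only: once the per-document term frequencies have been read back from the finished index, summing them over a category's keywords is simple. The class CategoryScorer and its score method are hypothetical names, not part of Nutch or Lucene, and the plain Map below merely stands in for whatever frequencies you pull out of the index through an IndexReader.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not a Nutch or Lucene class.
class CategoryScorer {

    // Sums the frequencies of the category's keywords for one document.
    // termFreqs is an assumption: it stands in for the per-document term
    // frequencies that would be read from the index.
    static int score(Map<String, Integer> termFreqs, List<String> keywords) {
        int score = 0;
        for (String keyword : keywords) {
            // Keywords that do not occur in the document contribute 0.
            score += termFreqs.getOrDefault(keyword, 0);
        }
        return score;
    }

    public static void main(String[] args) {
        Map<String, Integer> termFreqs = Map.of("cow", 3, "pig", 1, "tractor", 2);
        List<String> farm = List.of("cow", "pig", "farm", "farmer");
        System.out.println("farm score: " + score(termFreqs, farm)); // prints "farm score: 4"
    }
}
```

The resulting number could then be used as the boost when the new field is written back with an IndexWriter; how the document is re-added is left out here on purpose.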


On Dec 8, 2007, at 1:50 PM, Glenn Barney wrote:

Hi All,
I've been reading and going through the nutch examples for a couple of days but haven't found an exact answer to my problem. I want to add a category field (with a boost score) to each document I index based on the text content of a web page. For example, I'm creating the category farm, and I have a set list of keywords I want to map to the category farm (say "cow", "pig", "farm", and "farmer"). The boost score for the new field "farm" is relative to the frequency of these terms in my document.
The examples in this forum all talk about method 1): scraping metadata from an HTML page while parsing and adding your category field if this metadata is present. This doesn't work for me, because a) I don't have any special metadata in my documents (I'm using the web) and b) I don't want to do anything in the parse stage of crawling. I want to add my new field in the index stage. So that leaves method 2): in the index stage, I have a reference to the document text (in Parse.getText()) in filter() in IndexingFilter. I can use Java's string methods to search the text string for each of my terms one by one (and find repeats), and then create a score based on frequency and add this to a new field called "farm". However *this is the whole point of indexing* and to my understanding lucene/nutch is already doing this: it's already tokenizing and already calculating term frequencies in the tokenized content field.
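
For reference, the string-scanning approach described above might look roughly like the sketch below. NaiveCategoryCounter and countMatches are hypothetical names, and the regex split is only a crude stand-in for a real analyzer; it is an assumption for illustration, not how nutch or lucene tokenize.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration; not a Nutch or Lucene class.
class NaiveCategoryCounter {

    // Counts how many tokens of the text are category keywords.
    // Keywords are expected in lower case.
    static int countMatches(String text, Set<String> keywords) {
        int count = 0;
        // Lower-case the text and split on runs of non-letter, non-digit
        // characters, which drops punctuation and whitespace.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (keywords.contains(token)) {
                count++;
            }
        }
        return count;
    }
}
```

Inside filter(), the text from Parse.getText() would be passed in as the first argument. This works, but it repeats tokenizing work that the indexer already performs on the content field.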

As I index, I want to have nutch do its magic, tokenize and parse the content in the content field, then have me go in and use these results to add a new field based on these tokens. I don't want to "index" the whole thing twice; I'm sure smarter people than I wrote a very effective tokenizing implementation (say removing punctuation, effectively finding duplicate terms) that I want to use.
I guess if I had some magic pseudocode, I'm looking to do something like this:
filter (
     for each word in my category
          score += thisDocument.getFrequency(word); // uses the index that's being built before this filter applies
     addNewField(farm, score) // set farm's boost to score
)

Is there any way (or any better way) to do what I want above?
Thanks,
-Glenn
