There may be plenty of other ways, but the indices that Nutch creates
are standard Lucene indices. So once Nutch has finished creating an
index, you can open it with an IndexReader or IndexWriter, which
support pretty much all the methods you use in your "magic pseudocode".
On Dec 8, 2007, at 1:50 PM, Glenn Barney wrote:
Hi All,
I've been reading and going through the Nutch examples for a couple of
days but haven't found an exact answer to my problem. I want to add a
category field (with a boost score) to each document I index, based on
the text content of the web page. For example, I'm creating the
category "farm", and I have a set list of keywords I want to map to
that category (say "cow", "pig", "farm", and "farmer"). The boost score
for the new field "farm" is relative to the frequency of these terms in
my document.
The examples in this forum all talk about 1) scraping metadata from an
HTML page while parsing, and adding your category field if this
metadata is present. This doesn't work for me, because I don't have any
special metadata in my documents (I'm using the web) and I don't want
to do anything in the parse stage of crawling. I want to add my new
field in the index stage. So that leaves method 2) in the index stage,
I have a reference to the document text (in Parse.getText()) in
filter() in IndexingFilter. I can use Java's string methods to search
the text for each of my terms one by one (and find repeats), then
create a score based on frequency and add this to a new field called
"farm". However, *this is the whole point of indexing*, and to my
understanding Lucene/Nutch is already doing this: it's already
tokenizing and already calculating term frequencies in the tokenized
content field.
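
For illustration, the string-based approach described above (method 2)
might look roughly like the sketch below. It is plain Java with no
Nutch or Lucene dependency; the class name CategoryScorer and its
count() method are hypothetical, and the regex split is only a crude
stand-in for a real analyzer (no stemming, no stop words).

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Nutch or Lucene.
final class CategoryScorer {
    private final Set<String> keywords = new HashSet<>();

    CategoryScorer(String... keywords) {
        // Store keywords lower-cased so matching is case-insensitive.
        for (String k : keywords) {
            this.keywords.add(k.toLowerCase(Locale.ROOT));
        }
    }

    // Returns how many tokens in the text are category keywords.
    // Repeated occurrences are counted each time they appear.
    int count(String text) {
        int hits = 0;
        // Split on any run of characters that are not letters or digits.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (keywords.contains(token)) {
                hits++;
            }
        }
        return hits;
    }
}
```

Inside filter(), the text passed to count() would presumably be the
string returned by Parse.getText(). How the resulting count is turned
into a field and a boost is left out here, since that depends on the
Nutch and Lucene versions in use.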
As I index, I want to have Nutch do its magic, tokenize and parse the
content in the content field, and then go in myself and use these
results to add a new field based on these tokens. I don't want to
"index" the whole thing twice. I'm sure smarter people than I wrote a
very effective tokenizing implementation (say, removing punctuation and
effectively finding duplicate terms) that I want to use.
I guess if I had some magic pseudocode, I'm looking to do something
like this:
filter (
    for each word in my category
        // uses the index that's being built before this filter applies
        score += thisDocument.getFrequency(word);
    addNewField(farm, score) // set farm's boost to score
)
Is there any way (or any better way) to do what I want above?
Thanks,
-Glenn