I am not sure you will be able to do that at index time (that is, without scanning the document text yourself). Search engines usually maintain an inverted index: it does not store keywords by document. Instead, for each keyword, it keeps the list of documents containing that term, along with the corresponding position information. So I don't think the Document/Field classes in Lucene have a getTermFrequency or similar method.
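To make the inverted-index point concrete, here is a minimal sketch in plain Java. It is a toy, not Lucene's or Nutch's API: the class InvertedIndexSketch and its methods addDocument, termFrequency and categoryScore are names I made up for illustration, and the tokenizer is a crude split on non-alphanumeric characters with no stemming. It only shows that the data is keyed by term first, so a per-document frequency has to be looked up through each term's posting list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy inverted index (hypothetical, NOT the Lucene/Nutch API):
// term -> (document id -> positions of that term in the document).
class InvertedIndexSketch {
    private final Map<String, Map<Integer, List<Integer>>> postings = new HashMap<>();

    // Lower-case the text, split on runs of non-letter/non-digit characters,
    // and record the position of every token under its term.
    void addDocument(int docId, String text) {
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        int position = 0;
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            postings.computeIfAbsent(token, t -> new HashMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(position++);
        }
    }

    // Frequency of one term in one document = size of its position list.
    int termFrequency(int docId, String term) {
        Map<Integer, List<Integer>> docs = postings.get(term.toLowerCase(Locale.ROOT));
        if (docs == null) {
            return 0;
        }
        List<Integer> positions = docs.get(docId);
        return positions == null ? 0 : positions.size();
    }

    // Sum the frequencies of all category keywords for one document.
    int categoryScore(int docId, List<String> keywords) {
        int score = 0;
        for (String keyword : keywords) {
            score += termFrequency(docId, keyword);
        }
        return score;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument(1, "The farmer fed the cow; the cow and the pig slept on the farm.");
        List<String> farm = Arrays.asList("cow", "pig", "farm", "farmer");
        System.out.println("farm score for doc 1: " + index.categoryScore(1, farm));
    }
}
```

Note that even in this toy the text has to be tokenized once to build the postings. Inside an IndexingFilter you would presumably still be starting from the text in Parse.getText(), as you describe.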
Cheers
-Jha

On Dec 8, 2007 4:50 PM, Glenn Barney <[EMAIL PROTECTED]> wrote:
> Hi All,
>
> I've been reading and going through the Nutch examples for a couple of days but
> haven't found an exact answer to my problem. I want to add a category field
> (with a boost score) to each document I index based on the text content of a
> web page. For example, I'm creating the category farm, and I have a set
> list of keywords I want to map to the category farm (say "cow", "pig",
> "farm", and "farmer"). The boost score for the new field "farm" is relative
> to the frequency of these terms in my document.
>
> The examples in this forum all talk about 1) Scraping metadata from an HTML
> page while parsing and adding your category field if this metadata is
> present. This doesn't work for me as I don't have any special metadata in
> my documents (I'm using the web) and 2) I don't want to do anything in the
> parse stage of crawling. I want to add my new field in the index stage. So
> that leaves method 2) In the index stage, I have a reference to the document
> text (in Parse.getText()) in filter() in IndexingFilter. I can use Java's
> string methods to search the text string for each of my terms one by one
> (and find repeats), and then create a score based on frequency and add this
> to a new field called "farm". However *this is the whole point of indexing*
> and to my understanding Lucene/Nutch is already doing this, it's already
> tokenizing and already calculating term frequencies in the tokenized content
> field.
>
> As I index, I want to have Nutch do its magic, tokenize and parse the
> content in the content field, then have me go in and use these results to
> add a new field based on these tokens. I don't want to "index" the whole
> thing twice, I'm sure smarter people than I wrote a very effective
> tokenizing (say removing punctuation, effectively finding duplicate terms)
> implementation that I want to use.
>
> I guess if I had some magic pseudocode, I'm looking to do something like
> this
> filter (
>     for each word in my category
>         score += thisDocument.getFrequency(word); // uses the index that's
>                                                   // being built before this filter applies
>     addNewField(farm, score) // set farm's boost to score
> )
>
> Is there any way (or any better way) to do what I want above?
> Thanks,
> -Glenn
>
