Please see the inline comments.

On Tue, Mar 17, 2009 at 7:34 PM, Lukas, Ray <ray.lu...@idearc.com> wrote:
> I have some basic questions about Nutch. Can someone point me in the
> right direction, or if you have time, maybe just blast out an answer.
>
> Question One:
> I can see the terms that come from the web page. Can I set up a way to
> also add these things to the index. In other words, if "ice cream" came
> from a <h1> tag I want to know.

Modify the index plugin to include such changes. You can add more fields in the plugin. Of course, you also need to modify the HTML parser so that it keeps a record of the headings in the document being parsed. E.g., you can include a field "Heading" in the index that contains the terms that appear in the document's headings. While searching, you can give more boost to a document if the query terms are found in the "Heading" field. For that you need to modify the query formulation; for more, see the documentation on Lucene query formulation.

> Question Two:
> "Ice Cream" is really two words. But in the index it will be stored as
> two entries. How can I tell Nutch (Lucene) that this and other things
> are to be treated as one Token.. I know that somehow I will need to
> supply a dictionary of these terms, but is it possible.. and if so how?

If you have a multi-word extractor (MWE) or a dictionary, you can invoke the MWE or look up the dictionary before indexing a document, create a field "MWE" in the index, and give more boost if the query terms are found in the "MWE" field. In some sense Lucene/Nutch ranking already handles this; for more details, see the "coord" factor in Lucene ranking. However, if you still want to give more boost to the multi-word terms, you can do it by setting a high boost in the Lucene query; again, see Lucene query formulation.

> Question Three ( is will start hunting for this ):
> I have to hunt around for this so.. I have not yet.. but since I am
> asking questions.. How can I add more stop words into the stop word
> list?

You can look at the SMART system's stop word list. Or, if you are looking for domain-specific stop words, you can generate them using frequency analysis on some document collections.

> Question Four ( is will start hunting for this ):
> Last one, promise.. The indexes themselves. Is there an explanation
> written up for each of the fields in the index.

I'm not sure, but look at the Nutch wiki; you might find something there.

> Thanks for the help
> Ray
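
P.S. On the frequency analysis idea for Question Three, here is a rough sketch in plain Java of one way to do it. It is only an illustration and uses no Nutch or Lucene API: the class name `StopWordCandidates`, the method `find`, and the `minDocFraction` threshold are all made up for this example. It treats a term as a stop word candidate when it shows up in at least a given fraction of the documents; you would still need to review the list by hand and then add the terms to whatever stop word list your analyzer uses.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Nutch or Lucene.
class StopWordCandidates {

    // Returns the terms that occur in at least minDocFraction of the documents.
    // minDocFraction is an assumed tuning knob, e.g. 0.8 means "in 80% of the docs".
    static Set<String> find(List<String> documents, double minDocFraction) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (String doc : documents) {
            // Count each term once per document (document frequency, not raw count).
            Set<String> seen = new HashSet<>();
            for (String token : doc.toLowerCase().split("[^a-z0-9]+")) {
                if (!token.isEmpty() && seen.add(token)) {
                    docFreq.merge(token, 1, Integer::sum);
                }
            }
        }
        // Keep the terms whose document frequency reaches the threshold, sorted.
        Set<String> candidates = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : docFreq.entrySet()) {
            if (entry.getValue() >= minDocFraction * documents.size()) {
                candidates.add(entry.getKey());
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "The ice cream shop sells the best ice cream",
            "The dog chased the cat",
            "A cat and the dog share the ice cream");
        System.out.println(find(docs, 1.0));
    }
}
```

With this toy collection and a threshold of 1.0, only "the" appears in every document, so it is the only candidate. On a real collection you would probably use a lower threshold and a much larger sample.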