I found that everything between <select> and </select> is included in Nutch's index, whether or not the option is selected. Since drop-down lists usually contain many options, this reduces the quality of the index. For example, almost every registration form lists all country names, or even all US state names.
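What I would expect instead is that text extraction skips the whole <select> subtree. Here is a minimal sketch of that idea in plain Java. It uses the JDK's own DOM parser on well-formed XHTML, not Nutch's parse-html plugin or NekoHTML, and the class and method names are hypothetical, made up for illustration only:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of Nutch or NekoHTML.
final class SelectSkippingText {

    /** Parses well-formed XHTML and returns its text, ignoring <select> subtrees. */
    static String extractText(String xhtml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Node root = builder.parse(new InputSource(new StringReader(xhtml)))
                    .getDocumentElement();
            StringBuilder out = new StringBuilder();
            collect(root, out);
            // Collapse runs of whitespace so the result is easy to compare.
            return out.toString().trim().replaceAll("\\s+", " ");
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }

    /** Depth-first walk that appends text nodes but never descends into <select>. */
    private static void collect(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "select".equalsIgnoreCase(node.getNodeName())) {
            return; // skip the whole drop-down, including every <option>
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.append(node.getNodeValue()).append(' ');
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), out);
        }
    }
}
```

Calling SelectSkippingText.extractText(...) on a page with a country drop-down would keep the surrounding paragraph text and drop the option labels. The same check could presumably be applied wherever a parser walks the DOM to collect text, but where exactly that belongs in Nutch is an assumption on my part.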
In my case, the indexed options make every country name useless as a keyword: some of our pages have a registration form on them, so no matter whether you search for Italy, Spain, France, or Iran, those pages show up in the results. I checked Nutch's code (parse-html), and this seems to be by design, since Nutch just takes the parse result from NekoHTML. In my project, I remove all of the options with a custom parser (in practice, I hacked the indexer). But does Nutch have a general solution to this problem?

Regards,
Pan
