The existing parse-html plugin simply stores the clean text (with the HTML tags stripped) for later indexing. That includes, for instance, the content of huge <OPTIONS> tags, which we don't need at all in 99.99% of cases.
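To illustrate what I mean by skipping such content, here is a rough sketch in plain Java. It is not the parse-html plugin's code and uses no Nutch API; the class name SelectiveTextExtractor and the skip list are my own assumptions. It drops all text nested inside <select>, <datalist>, <script> and <style> elements (option lists live inside the first two) and keeps the rest. It is deliberately naive: it does not decode entities and it will be confused by a '>' inside an attribute value or a comment.

```java
import java.util.Locale;
import java.util.Set;

class SelectiveTextExtractor {
    // Hypothetical skip list: elements whose nested text should not be indexed.
    private static final Set<String> SKIPPED = Set.of("select", "datalist", "script", "style");

    /** Returns the text of html, omitting everything nested inside SKIPPED elements. */
    static String extract(String html) {
        StringBuilder out = new StringBuilder();
        String skipping = null; // name of the element being skipped, or null
        int depth = 0;          // nesting depth of that same element name
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            if (c != '<') {
                if (skipping == null) out.append(c); // keep text outside skipped elements
                i++;
                continue;
            }
            int end = html.indexOf('>', i);
            if (end < 0) break; // unterminated tag: stop scanning
            String tag = html.substring(i + 1, end).trim();
            boolean closing = tag.startsWith("/");
            boolean selfClosing = tag.endsWith("/");
            String name = tagName(closing ? tag.substring(1) : tag);
            if (skipping == null) {
                if (!closing && !selfClosing && SKIPPED.contains(name)) {
                    skipping = name;
                    depth = 1;
                }
                out.append(' '); // a tag boundary separates words
            } else if (name.equals(skipping) && !selfClosing) {
                depth += closing ? -1 : 1;
                if (depth == 0) skipping = null; // left the skipped element
            }
            i = end + 1;
        }
        return out.toString().replaceAll("\\s+", " ").trim();
    }

    // Extracts the lower-cased element name from the inside of a tag.
    private static String tagName(String tag) {
        int n = 0;
        while (n < tag.length() && !Character.isWhitespace(tag.charAt(n)) && tag.charAt(n) != '/') n++;
        return tag.substring(0, n).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String html = "<p>Pick a country:</p><select name=\"c\">"
                + "<option>Albania<option>Algeria</select><p>Thanks</p>";
        System.out.println(extract(html)); // prints "Pick a country: Thanks"
    }
}
```

Only the outermost skipped element is tracked, so option lists with omitted </option> end tags do not leave the scanner stuck in skip mode. A real plugin would do the same filtering on the DOM that the HTML parser already builds, not on the raw string.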
I find the Web-SQL idea very interesting: http://www.lotontech.com

I bought Tony Loton's book "Web Content Mining with Java"; about 90% of it is code, which I don't really need. However, I am going to implement some kind of Web-SQL together with some mathematical statistics. Usually about 90% of the HTML on a web site is the same from page to page, and I need only a subset of it.

I also need to find the point in Nutch where I can replace the Analyzer with my own "non-analyzer"; I don't need stop-word removal etc. I'd like to use Lucene as a database too: to run a lot of queries, to calculate some statistics...

-Fuad

-----Original Message-----
From: Jack Tang [mailto:[EMAIL PROTECTED]
Sent: Thursday, August 18, 2005 10:15 PM
To: [email protected]
Subject: Parse-html should be enhanced!

Hi Nutchers

I think the parse-html parser should be enhanced. In some of my projects (intranet search engines), we only need the content within specified detectors and want to filter out the junk, say the content between <div class="start-here"> and </div>, or detectors such as XPath.

Any thoughts on this enhancement?

Regards
/Jack

--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
