Wow, Efendi, the features you mentioned sound cool. Anyway, I hope Nutch will one day handle both DOM tree parsing and (very high-level) information extraction well. My suggestion is to add one layer between DOM tree parsing and indexing for information extraction.
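To make the idea a bit more concrete, here is a rough sketch of what such a layer could do. It is only an illustration, not existing Nutch code: the class name RegionExtractor and its extract method are hypothetical, it uses just the JDK's DOM and XPath classes, and it assumes the page has already been turned into well-formed XML (a real HTML parser would have to take care of that first).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical extraction layer: sits between DOM parsing and indexing
// and keeps only the text of the nodes selected by an XPath expression.
class RegionExtractor {

    // Assumes 'xhtml' is well-formed XML; returns the selected text, space-joined.
    static String extract(String xhtml, String xpathExpr) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document dom = builder.parse(new InputSource(new StringReader(xhtml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList hits =
            (NodeList) xpath.evaluate(xpathExpr, dom, XPathConstants.NODESET);

        StringBuilder text = new StringBuilder();
        for (int i = 0; i < hits.getLength(); i++) {
            if (text.length() > 0) {
                text.append(' ');
            }
            text.append(hits.item(i).getTextContent().trim());
        }
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // The <select> block stands in for the junk we do not want indexed.
        String page = "<html><body>"
            + "<select><option>junk</option></select>"
            + "<div class=\"start-here\">Real content</div>"
            + "</body></html>";
        System.out.println(extract(page, "//div[@class='start-here']"));
    }
}
```

With the sample page above, only the text inside the "start-here" div would be handed on to the indexer; the option list is dropped.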
Comments?

/Jack

On 8/19/05, Fuad Efendi <[EMAIL PROTECTED]> wrote:
> Existing PARSE-HTML plugin simply stores clean text (without HTML tags)
> for future indexing. It stores, for instance, content of huge <OPTIONS>
> tag which we don't need at all in 99.99% of cases.
>
> I found this idea very interesting, Web-SQL:
> http://www.lotontech.com
> I've bought a book, Tony Loton "Web Content Mining with Java", it
> consists 90% from code which I don't really need...
> However, I am going to implement some kind of Web-SQL and Math.
> Statistics. Usually web-sites have 90% of similar HTML, and I need only
> subset.
>
> Also, I need to find a point in Nutch where I can replace Analyzer with
> my own "non-analyzer"; I don't need to remove stop-words etc.
>
> I'd like to use Lucene as a database too... To perform a lot of queries,
> to calc some statistics...
>
> -Fuad
>
>
> -----Original Message-----
> From: Jack Tang [mailto:[EMAIL PROTECTED]
> Sent: Thursday, August 18, 2005 10:15 PM
> To: [email protected]
> Subject: Parse-html should be enhanced!
>
>
> Hi Nutchers
>
> I think parse-html parse should be enhanced. In some of my
> projects(Intranet search engine), we only need the content in the
> specified detectors and filter the junk, say the content between <div
> class="start-here"> and </div> or some detectors like XPath. Any
> thoughts on this enhancement?
>
> Regards
> /Jack
> --
> Keep Discovering ... ...
> http://www.jroller.com/page/jmars
>
>
>

--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
