Hi Jack,

I'd like to have more freedom with Nutch...

We have two classes, ParseText and ParseData, which are stored somewhere (I am a newbie!) and then indexed by Lucene. ParseText contains the plain text (after parsing by the existing parse-html plugin), and ParseData contains the links found on a page, meta tags (not sure), etc.

org.apache.nutch.fetcher.Fetch - this class downloads something over HTTP, then calls the parser plugin according to the "Content-Type" HTTP header (text/html in our case).

I'd like to have more freedom to add more fields to the database before indexing. Probably I can use ParseData.

I'd like to have a two-step indexing process: first, index the HTML tags and find similarities (such as the usual header, footer, Options, Menu, (c), etc.); then run a second parse and a second indexing pass to index only the unique text.

-Fuad

-----Original Message-----
From: Jack Tang [mailto:[EMAIL PROTECTED]
Sent: Thursday, August 18, 2005 11:30 PM
To: [email protected]
Subject: Re: Parse-html should be enhanced!

Wow, Efendi, the features you mentioned sound cool. Anyway, I hope Nutch will one day handle both DOM tree parsing and (very high level) information extraction well. My suggestion is to add one layer between DOM tree parsing and indexing for information extraction. Comments?

/Jack

On 8/19/05, Fuad Efendi <[EMAIL PROTECTED]> wrote:
> The existing PARSE-HTML plugin simply stores clean text (without HTML
> tags) for future indexing. It stores, for instance, the content of a huge
> <OPTIONS> tag which we don't need at all in 99.99% of cases.
>
> I found this idea very interesting, Web-SQL: http://www.lotontech.com
> I've bought a book, Tony Loton's "Web Content Mining with Java"; it
> consists 90% of code which I don't really need...
> However, I am going to implement some kind of Web-SQL and Math.
> Statistics. Usually web sites have 90% similar HTML, and I need only
> a subset.
>
> Also, I need to find a point in Nutch where I can replace the Analyzer
> with my own "non-analyzer"; I don't need to remove stop-words etc.
>
> I'd like to use Lucene as a database too... To perform a lot of
> queries, to calc some statistics...
>
> -Fuad
>
>
> -----Original Message-----
> From: Jack Tang [mailto:[EMAIL PROTECTED]
> Sent: Thursday, August 18, 2005 10:15 PM
> To: [email protected]
> Subject: Parse-html should be enhanced!
>
>
> Hi Nutchers
>
> I think the parse-html parser should be enhanced. In some of my
> projects (intranet search engine), we only need the content in the
> specified detectors and filter the junk, say the content between <div
> class="start-here"> and </div> or some detectors like XPath. Any
> thoughts on this enhancement?
>
> Regards
> /Jack
> --
> Keep Discovering ... ...
> http://www.jroller.com/page/jmars
>
>
>

--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
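
The two-step process described at the top of this thread (first find the blocks that many pages share, then index only what is unique to each page) could be sketched roughly as below. This is only an illustration in plain Java, not Nutch or Lucene code: the class BoilerplateFilter, its two methods, and the maxPages threshold are hypothetical names, and each page is assumed to have already been split into text blocks by some parser.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper for illustration; not part of Nutch or Lucene. */
final class BoilerplateFilter {

    /** Pass 1: for each text block, count the number of pages that contain it. */
    static Map<String, Integer> countBlocks(List<List<String>> pages) {
        Map<String, Integer> pageCounts = new HashMap<>();
        for (List<String> page : pages) {
            // Use a set so a block repeated within one page is counted once.
            Set<String> seen = new HashSet<>();
            for (String block : page) {
                String key = block.trim();
                if (!key.isEmpty() && seen.add(key)) {
                    pageCounts.merge(key, 1, Integer::sum);
                }
            }
        }
        return pageCounts;
    }

    /** Pass 2: keep only the blocks that occur on at most maxPages pages. */
    static List<String> uniqueBlocks(List<String> page,
                                     Map<String, Integer> pageCounts,
                                     int maxPages) {
        List<String> kept = new ArrayList<>();
        for (String block : page) {
            String key = block.trim();
            if (!key.isEmpty() && pageCounts.getOrDefault(key, 0) <= maxPages) {
                kept.add(key);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up sample pages: a shared menu and footer around one unique line.
        List<List<String>> pages = Arrays.asList(
            Arrays.asList("Home | Products | Contact", "Article about crawling", "(c) 2005 Example Inc."),
            Arrays.asList("Home | Products | Contact", "Article about indexing", "(c) 2005 Example Inc."),
            Arrays.asList("Home | Products | Contact", "Article about parsing", "(c) 2005 Example Inc."));

        Map<String, Integer> counts = countBlocks(pages);
        for (List<String> page : pages) {
            System.out.println(uniqueBlocks(page, counts, 1));
        }
    }
}
```

With the three sample pages in main, the shared menu and copyright blocks are dropped and only each page's article line is printed. Exact string matching is the simplest possible similarity test; near-duplicate blocks (for example a footer with a changing date) would need something fuzzier.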
