Hi Fuad

On 8/19/05, Fuad Efendi <[EMAIL PROTECTED]> wrote:
> Hi Jack,
>
> I'd like to have more freedom with Nutch... We have two classes,
> ParseText and ParseData, which are stored somewhere (I am a newbie!) and
> then indexed by Lucene. ParseText contains plain text (after parsing by
> the existing parse-html plugin), and ParseData - links found on a page,
> metatags (not sure), etc.
>
> org.apache.nutch.fetcher.Fetch - this class downloads something using HTTP,
> then calls the parser plugin according to the "Content" of the HTTP header
> (text/html in our case)
>
> I'd like to have more freedom, to add more fields to the database before
> indexing. Probably I can use ParseData.

I totally agree with you. I'd like to store the extracted information in
a new map, say an ExtractedInfo class.
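To make the idea a bit more concrete, here is a minimal sketch of what
such a holder might look like. ExtractedInfo and its methods are
hypothetical names used only for illustration; they are not existing
Nutch classes, and the sketch assumes a Java version with generics and
lambdas.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical holder for fields extracted from a parsed page.
// Not part of Nutch; the name only follows the suggestion above.
final class ExtractedInfo {
    // Field name -> values, kept in insertion order so output is deterministic.
    private final Map<String, List<String>> fields = new LinkedHashMap<>();

    // Adds one value under a field name; a field may hold several values.
    void add(String name, String value) {
        fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // Returns the values of a field, or an empty list if it was never set.
    List<String> get(String name) {
        List<String> values = fields.get(name);
        return values == null
                ? Collections.<String>emptyList()
                : Collections.unmodifiableList(values);
    }

    // Read-only view that an indexing step could iterate over,
    // creating one index field per entry.
    Map<String, List<String>> asMap() {
        return Collections.unmodifiableMap(fields);
    }
}
```

An extraction layer sitting between DOM parsing and indexing could fill
one such object per page, and the indexer could then turn each entry
into its own field.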
> I'd like to have a two-step indexing process: first index the HTML tags
> and find similarities (such as the usual header, footer, Options, Menu,
> (c), etc.), then run a second parse and a second indexing pass to index
> only the unique text.
>
> -Fuad
>
>
> -----Original Message-----
> From: Jack Tang [mailto:[EMAIL PROTECTED]
> Sent: Thursday, August 18, 2005 11:30 PM
> To: [email protected]
> Subject: Re: Parse-html should be enhanced!
>
>
> Wow, Efendi, the features you mentioned sound cool.
> Anyway, I hope Nutch will one day handle both DOM tree parsing and
> (very high-level) information extraction well. My suggestion is to add
> one layer between DOM tree parsing and indexing for information
> extraction.
>
> Comments?
>
> /Jack
>
> On 8/19/05, Fuad Efendi <[EMAIL PROTECTED]> wrote:
> > The existing PARSE-HTML plugin simply stores clean text (without HTML
> > tags) for future indexing. It stores, for instance, the content of a
> > huge <OPTIONS> tag which we don't need at all in 99.99% of cases.
> >
> > I found this idea very interesting, Web-SQL: http://www.lotontech.com
> > I've bought a book, Tony Loton's "Web Content Mining with Java"; it
> > consists 90% of code which I don't really need...
> > However, I am going to implement some kind of Web-SQL and Math.
> > Statistics. Usually web-sites have 90% similar HTML, and I need only
> > a subset.
> >
> > Also, I need to find a point in Nutch where I can replace the Analyzer
> > with my own "non-analyzer"; I don't need to remove stop-words etc.
> >
> > I'd like to use Lucene as a database too... To perform a lot of
> > queries, to calc some statistics...
> >
> > -Fuad
> >
> >
> > -----Original Message-----
> > From: Jack Tang [mailto:[EMAIL PROTECTED]
> > Sent: Thursday, August 18, 2005 10:15 PM
> > To: [email protected]
> > Subject: Parse-html should be enhanced!
> >
> >
> > Hi Nutchers
> >
> > I think the parse-html parser should be enhanced.
> > In some of my projects (an intranet search engine), we only need the
> > content inside specified delimiters and want to filter out the junk,
> > say the content between <div class="start-here"> and </div>, or
> > content picked out by something like XPath. Any thoughts on this
> > enhancement?
> >
> > Regards
> > /Jack
> > --
> > Keep Discovering ... ...
> > http://www.jroller.com/page/jmars
>
>
--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
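
P.S. As a rough illustration of the <div class="start-here"> case from
the original message, the sketch below pulls out the inner HTML of the
first div with a given class and reduces it to plain text. It is only a
regex-based approximation under stated assumptions: MarkerExtractor,
innerHtml and text are hypothetical names, nothing here is Nutch API,
the class attribute must match the given value exactly, and a real
plugin would more likely walk the DOM tree that the HTML parser has
already built.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: keeps only the content of a marked <div>.
final class MarkerExtractor {
    // Matches any opening or closing <div> tag, case-insensitively.
    private static final Pattern DIV_TAG =
            Pattern.compile("<(/?)div\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Returns the inner HTML of the first <div> whose class attribute equals
    // cssClass, or null if there is no such div or it is never closed.
    static String innerHtml(String html, String cssClass) {
        Pattern open = Pattern.compile(
                "<div\\b[^>]*\\bclass\\s*=\\s*[\"']" + Pattern.quote(cssClass)
                        + "[\"'][^>]*>",
                Pattern.CASE_INSENSITIVE);
        Matcher start = open.matcher(html);
        if (!start.find()) {
            return null;
        }
        int contentStart = start.end();
        int depth = 1; // we are now inside the marker div

        // Scan the remaining div tags, tracking nesting until the marker closes.
        Matcher tag = DIV_TAG.matcher(html);
        tag.region(contentStart, html.length());
        while (tag.find()) {
            boolean closing = !tag.group(1).isEmpty();
            depth += closing ? -1 : 1;
            if (depth == 0) {
                return html.substring(contentStart, tag.start());
            }
        }
        return null; // unbalanced markup: the marker div was never closed
    }

    // Replaces remaining tags with spaces and collapses whitespace.
    static String text(String fragment) {
        return fragment.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }
}
```

Calling MarkerExtractor.text(MarkerExtractor.innerHtml(page, "start-here"))
would give the text to index, with menus, headers and footers outside
the marker left out. Nested divs inside the marker are kept because the
scan counts nesting depth rather than stopping at the first </div>.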
