Re: Parse-html should be enhanced!

Jack Tang Thu, 18 Aug 2005 22:24:03 -0700

Hi Fuad

On 8/19/05, Fuad Efendi <[EMAIL PROTECTED]> wrote:
> Hi Jack,
> 
> I'd like to have more freedom with Nutch... We have two classes,
> ParseText and ParseData, which are stored somewhere (I am a newbie!) and
> then indexed by Lucene. ParseText contains plain text (after parsing by
> existing parse-html plugin), and ParseData - links found on a page,
> metatags (not sure), etc.
> 
> org.apache.nutch.fetcher.Fetch - this class downloads something over HTTP,
> then calls the parser plugin according to the "Content-Type" HTTP header
> (text/html in our case).
> 
> I'd like to have more freedom to add more fields to the database before
> indexing. Probably I can use ParseData.
I totally agree with you.
I'd like to store the extracted information in a new map, say an
ExtractedInfo class.
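
A rough sketch of what such a map could look like. ExtractedInfo and its
methods are names made up for illustration; nothing like this exists in
Nutch, and how it would be stored next to ParseData is left open. It
assumes Java 8 or later.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical container for extracted fields; not a Nutch class.
final class ExtractedInfo {
    // Field name -> values, in insertion order. A field may repeat on a page.
    private final Map<String, List<String>> fields = new LinkedHashMap<>();

    // Record one value for a named field.
    void add(String name, String value) {
        fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // All values recorded for a field, or an empty list if there are none.
    List<String> get(String name) {
        List<String> values = fields.get(name);
        return values == null
                ? Collections.emptyList()
                : Collections.unmodifiableList(values);
    }

    // Read-only view, e.g. for an indexing step that adds one field per entry.
    Map<String, List<String>> asMap() {
        return Collections.unmodifiableMap(fields);
    }
}
```

An extraction layer would call add("author", ...) while walking the DOM,
and the indexing step would later iterate over asMap().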


> I'd like to have a two-step indexing process: first index the HTML tags and
> find similarities (such as the usual header, footer, Options, Menu, (c),
> etc.), then run a second parsing and a second indexing pass to index only
> the unique text.
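
The two passes could be pictured roughly as below. This is only a sketch
under strong assumptions: each page has already been split into text
blocks, "similar" means exactly equal strings, and RepeatedBlockFilter,
documentFrequency, uniqueBlocks and maxShare are hypothetical names, not
Nutch or Lucene APIs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper; not part of Nutch.
final class RepeatedBlockFilter {
    // Pass 1: for every distinct block, count how many pages contain it.
    static Map<String, Integer> documentFrequency(List<List<String>> pages) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> page : pages) {
            // A set, so a block repeated within one page counts once.
            for (String block : new HashSet<>(page)) {
                df.merge(block, 1, Integer::sum);
            }
        }
        return df;
    }

    // Pass 2: keep only blocks that occur on at most maxShare of all pages.
    static List<String> uniqueBlocks(List<String> page, Map<String, Integer> df,
                                     int pageCount, double maxShare) {
        List<String> kept = new ArrayList<>();
        for (String block : page) {
            if (df.getOrDefault(block, 0) <= maxShare * pageCount) {
                kept.add(block);
            }
        }
        return kept;
    }
}
```

With maxShare set to 0.5, a menu or copyright line that appears on every
page of a site is dropped, while text found on a single page is kept.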
> 
> -Fuad
> 
> 
> -----Original Message-----
> From: Jack Tang [mailto:[EMAIL PROTECTED]
> Sent: Thursday, August 18, 2005 11:30 PM
> To: [email protected]
> Subject: Re: Parse-html should be enhanced!
> 
> 
> Wow, Efendi, the features you mentioned sound cool.
> Anyway, I hope Nutch will one day handle both DOM tree parsing and
> (very high-level) information extraction well. My suggestion is to add one
> layer between DOM tree parsing and indexing for information extraction.
> 
> Comments?
> 
> /Jack
> 
> On 8/19/05, Fuad Efendi <[EMAIL PROTECTED]> wrote:
> > The existing PARSE-HTML plugin simply stores clean text (without HTML
> > tags) for future indexing. It stores, for instance, the content of a huge
> > <OPTIONS> tag which we don't need at all in 99.99% of cases.
> >
> > I found this idea very interesting, Web-SQL: http://www.lotontech.com
> > I've bought a book, Tony Loton's "Web Content Mining with Java"; it
> > is 90% code which I don't really need...
> > However, I am going to implement some kind of Web-SQL and mathematical
> > statistics. Usually web sites have 90% similar HTML, and I need only
> > a subset.
> >
> > Also, I need to find a point in Nutch where I can replace Analyzer
> > with my own "non-analyzer"; I don't need to remove stop-words etc.
> >
> > I'd like to use Lucene as a database too... To perform a lot of
> > queries, to calc some statistics...
> >
> > -Fuad
> >
> >
> > -----Original Message-----
> > From: Jack Tang [mailto:[EMAIL PROTECTED]
> > Sent: Thursday, August 18, 2005 10:15 PM
> > To: [email protected]
> > Subject: Parse-html should be enhanced!
> >
> >
> > Hi Nutchers
> >
> > I think the parse-html parser should be enhanced. In some of my
> > projects (intranet search engines), we only need the content inside
> > specified detectors and want to filter out the junk, say the content
> > between <div class="start-here"> and </div>, or detectors such as
> > XPath. Any thoughts on this enhancement?
> >
> > Regards
> > /Jack
> > --
> > Keep Discovering ... ...
> > http://www.jroller.com/page/jmars
> >
> >
> >
> 
> 
> --
> Keep Discovering ... ...
> http://www.jroller.com/page/jmars
> 
> 
> 
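
For the "start-here" detector from the original post, a minimal sketch
using only the XPath support in the JDK. RegionExtractor and extract are
made-up names, and the input is assumed to be well-formed XHTML; real
pages would first have to go through a tolerant HTML parser that builds
a DOM, which is not shown here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper; not part of the parse-html plugin.
final class RegionExtractor {
    // Return the trimmed text of every node matched by the XPath expression.
    static List<String> extract(String xhtml, String xpathExpr) throws Exception {
        // Assumption: the markup is well-formed, so the XML parser accepts it.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes =
                (NodeList) xpath.evaluate(xpathExpr, doc, XPathConstants.NODESET);
        List<String> texts = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            texts.add(nodes.item(i).getTextContent().trim());
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><div class=\"menu\">Home | About</div>"
                + "<div class=\"start-here\">Quarterly report text</div>"
                + "</body></html>";
        // Only the text of the marked div is returned; the menu is skipped.
        System.out.println(extract(page, "//div[@class='start-here']"));
    }
}
```

The detector is just a string here, so it could come from per-site
configuration rather than being hard-coded.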


-- 
Keep Discovering ... ...
http://www.jroller.com/page/jmars
