RE: Parse-html should be enhanced!

Fuad Efendi Thu, 18 Aug 2005 21:24:31 -0700

Hi Jack,

I'd like to have more freedom with Nutch... We have two classes,
ParseText and ParseData, which are stored somewhere (I am a newbie!) and
then indexed by Lucene. ParseText contains the plain text (after parsing
by the existing parse-html plugin), and ParseData holds the links found
on a page, meta tags (not sure), etc.


org.apache.nutch.fetcher.Fetcher - this class downloads something over
HTTP, then calls the parser plugin according to the Content-Type HTTP
header (text/html in our case).

I'd like to have more freedom to add more fields to the database before
indexing. Probably I can use ParseData.

I'd like to have a two-step indexing process: first index the HTML tags
and find similarities (the usual header, footer, Options, Menu, (c),
etc.), then run a second parse and a second indexing pass to index only
the unique text.
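
The two-step idea above can be sketched in plain Java, outside of Nutch.
This is only an illustration under stated assumptions: each page is
assumed to be already split into text blocks (one per HTML element, for
example), and the class name BoilerplateFilter, its methods, and the
maxShare threshold are hypothetical, not part of Nutch or Lucene.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not a Nutch class: a sketch of two-pass boilerplate removal.
class BoilerplateFilter {

    // Pass 1: count on how many pages each text block occurs.
    // A block repeated within one page is counted once for that page.
    static Map<String, Integer> documentFrequency(List<List<String>> pages) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> page : pages) {
            for (String block : new HashSet<>(page)) {
                df.merge(block, 1, Integer::sum);
            }
        }
        return df;
    }

    // Pass 2: keep only the blocks that occur on at most maxShare of all pages
    // (assumed threshold; headers, footers and menus exceed it and are dropped).
    static List<String> uniqueBlocks(List<String> page, Map<String, Integer> df,
                                     int pageCount, double maxShare) {
        List<String> kept = new ArrayList<>();
        for (String block : page) {
            int n = df.getOrDefault(block, 0);
            if ((double) n / pageCount <= maxShare) {
                kept.add(block);
            }
        }
        return kept;
    }
}
```

With three pages that share a menu line and a "(c)" line, calling
uniqueBlocks with maxShare = 0.5 keeps only the article text of each
page. Exact string matching is the simplest possible similarity test; a
real site would probably need something more tolerant.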

-Fuad


-----Original Message-----
From: Jack Tang [mailto:[EMAIL PROTECTED] 
Sent: Thursday, August 18, 2005 11:30 PM
To: [email protected]
Subject: Re: Parse-html should be enhanced!


Wow, Efendi, the features you mentioned sound cool.
Anyway, I hope Nutch will one day handle both DOM tree parsing and
(very high-level) information extraction well. My suggestion is to add
a layer between DOM tree parsing and indexing for information extraction.
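
One possible shape for such a layer, using the <div class="start-here">
marker from the original post as the detector, is sketched below. It is
a rough illustration only: the class RegionExtractor and its methods are
hypothetical names, not Nutch code, and it scans tags with regular
expressions instead of walking a real DOM tree, so it assumes
double-quoted class attributes and ignores comments, scripts and
malformed markup.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a Nutch class: keeps only the text inside a marked <div>.
class RegionExtractor {

    // Any tag: group 1 is "/" for closing tags, group 2 the tag name, group 3 the raw attributes.
    private static final Pattern TAG =
        Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)([^>]*)>");
    // A double-quoted class attribute (assumption: no single quotes or unquoted values).
    private static final Pattern CLASS_ATTR =
        Pattern.compile("class\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    static String extract(String html, String marker) {
        StringBuilder out = new StringBuilder();
        Matcher m = TAG.matcher(html);
        int depth = 0;   // 0 = outside the marked div, otherwise the div nesting level
        int last = 0;    // end offset of the previous tag
        while (m.find()) {
            if (depth > 0) {
                // Text between the previous tag and this one lies inside the region.
                out.append(html, last, m.start()).append(' ');
            }
            last = m.end();
            if (!m.group(2).equalsIgnoreCase("div")) {
                continue;
            }
            boolean closing = !m.group(1).isEmpty();
            if (closing) {
                if (depth > 0) depth--;
            } else if (depth > 0) {
                depth++;                      // nested div inside the region
            } else if (hasClass(m.group(3), marker)) {
                depth = 1;                    // entering the marked region
            }
        }
        return out.toString().replaceAll("\\s+", " ").trim();
    }

    // True if the class attribute contains the marker as one of its space-separated names.
    private static boolean hasClass(String attrs, String marker) {
        Matcher m = CLASS_ATTR.matcher(attrs);
        if (!m.find()) return false;
        for (String token : m.group(1).trim().split("\\s+")) {
            if (token.equals(marker)) return true;
        }
        return false;
    }
}
```

RegionExtractor.extract(html, "start-here") returns the text of the
marked div, including nested divs, and nothing from the menus or footers
around it. An XPath-based detector would need a proper DOM tree instead
of this tag scan.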

Comments?

/Jack

On 8/19/05, Fuad Efendi <[EMAIL PROTECTED]> wrote:
> Existing PARSE-HTML plugin simply stores clean text (without HTML 
> tags) for future indexing. It stores, for instance, content of huge 
> <OPTIONS> tag which we don't need at all in 99.99% of cases.
> 
> I found this idea very interesting, Web-SQL: http://www.lotontech.com
> I've bought a book, Tony Loton's "Web Content Mining with Java"; it
> consists 90% of code which I don't really need...
> However, I am going to implement some kind of Web-SQL and mathematical
> statistics. Usually web sites have 90% similar HTML, and I need only
> a subset.
> 
> Also, I need to find a point in Nutch where I can replace Analyzer 
> with my own "non-analyzer"; I don't need to remove stop-words etc.
> 
> I'd like to use Lucene as a database too... To perform a lot of 
> queries, to calc some statistics...
> 
> -Fuad
> 
> 
> -----Original Message-----
> From: Jack Tang [mailto:[EMAIL PROTECTED]
> Sent: Thursday, August 18, 2005 10:15 PM
> To: [email protected]
> Subject: Parse-html should be enhanced!
> 
> 
> Hi Nutchers
> 
> I think the parse-html parser should be enhanced. In some of my
> projects (intranet search engines), we only need the content matched
> by specified detectors, with the junk filtered out: say, the content
> between <div class="start-here"> and </div>, or detectors such as
> XPath. Any thoughts on this enhancement?
> 
> Regards
> /Jack
> --
> Keep Discovering ... ...
> http://www.jroller.com/page/jmars
> 
> 
> 


-- 
Keep Discovering ... ...
http://www.jroller.com/page/jmars
