Wow, Efendi, the features you mentioned sound cool.
Anyway, I hope Nutch will one day handle both DOM tree parsing and
(very high-level) information extraction well. My suggestion is to add
one layer between DOM tree parsing and indexing, dedicated to
information extraction.
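
To make the idea concrete, here is a rough sketch of what such a layer
could do, assuming the parser has already produced a W3C DOM. It keeps
only the text under nodes selected by an XPath expression and drops
everything else before indexing. RegionExtractor and extract() are
hypothetical names for illustration, not existing Nutch APIs, and the
sample input is well-formed XHTML; real-world HTML would first need a
tolerant HTML parser to build the DOM.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical extraction layer: sits between DOM parsing and indexing.
class RegionExtractor {

    // Returns the text of every node matched by xpathExpr, joined by single
    // spaces. Assumes the input is well-formed XML/XHTML.
    static String extract(String xhtml, String xpathExpr) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document dom = builder.parse(new InputSource(new StringReader(xhtml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList hits =
            (NodeList) xpath.evaluate(xpathExpr, dom, XPathConstants.NODESET);

        StringBuilder text = new StringBuilder();
        for (int i = 0; i < hits.getLength(); i++) {
            // Collapse runs of whitespace left over from the markup layout.
            String piece =
                hits.item(i).getTextContent().trim().replaceAll("\\s+", " ");
            if (piece.isEmpty()) {
                continue;
            }
            if (text.length() > 0) {
                text.append(' ');
            }
            text.append(piece);
        }
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
            + "<div class=\"nav\">Home | About</div>"
            + "<div class=\"start-here\"><p>Useful content</p></div>"
            + "<select><option>junk</option></select>"
            + "</body></html>";
        // Only the text inside the marked div is kept.
        System.out.println(extract(page, "//div[@class='start-here']"));
    }
}
```

The XPath expression would presumably come from per-site configuration,
so the same layer could serve both the "start-here" marker case and more
general detectors.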

Comments?

/Jack

On 8/19/05, Fuad Efendi <[EMAIL PROTECTED]> wrote:
> The existing PARSE-HTML plugin simply stores clean text (without HTML tags)
> for future indexing. It stores, for instance, the content of huge <OPTIONS>
> tags, which we don't need at all in 99.99% of cases.
> 
> I found this idea very interesting, Web-SQL:
> http://www.lotontech.com
> I've bought a book, Tony Loton's "Web Content Mining with Java"; 90% of
> it is code which I don't really need...
> However, I am going to implement some kind of Web-SQL and mathematical
> statistics. Usually the pages of a web site share 90% of their HTML, and
> I need only a subset.
> 
> Also, I need to find a point in Nutch where I can replace Analyzer with
> my own "non-analyzer"; I don't need to remove stop-words etc.
> 
> I'd like to use Lucene as a database too... To perform a lot of queries,
> to calculate some statistics...
> 
> -Fuad
> 
> 
> -----Original Message-----
> From: Jack Tang [mailto:[EMAIL PROTECTED]
> Sent: Thursday, August 18, 2005 10:15 PM
> To: [email protected]
> Subject: Parse-html should be enhanced!
> 
> 
> Hi Nutchers
> 
> I think the parse-html parser should be enhanced. In some of my
> projects (intranet search engines), we only need the content inside
> specified detectors and want to filter out the junk, say the content
> between <div class="start-here"> and </div>, or regions selected by
> detectors such as XPath. Any thoughts on this enhancement?
> 
> Regards
> /Jack
> --
> Keep Discovering ... ...
> http://www.jroller.com/page/jmars
> 
> 
> 


-- 
Keep Discovering ... ...
http://www.jroller.com/page/jmars
