The existing parse-html plugin simply stores the clean text (without
HTML tags) for later indexing. It stores, for instance, the content of
huge <OPTIONS> tags, which we don't need at all in 99.99% of cases.
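
To illustrate the kind of filtering meant here, a rough sketch in plain
Java follows. It is not the parse-html plugin's code: the class name
SkipTagsText and the method cleanText are hypothetical, the regex
approach is a crude stand-in for a real HTML parser, and the demo
assumes the unwanted option lists sit inside <select> elements.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: drops whole elements before
// stripping the remaining tags, so their text never reaches the index.
final class SkipTagsText {

    static String cleanText(String html, String... skipTags) {
        String out = html;
        for (String tag : skipTags) {
            // (?is): case-insensitive, '.' also matches newlines.
            // The reluctant .*? stops at the first matching close tag,
            // so nested elements of the same name are not handled.
            Pattern block = Pattern.compile(
                    "(?is)<" + Pattern.quote(tag) + "\\b[^>]*>.*?</"
                    + Pattern.quote(tag) + "\\s*>");
            out = block.matcher(out).replaceAll(" ");
        }
        out = out.replaceAll("(?s)<[^>]*>", " ");  // strip remaining tags
        return out.replaceAll("\\s+", " ").trim(); // collapse whitespace
    }

    public static void main(String[] args) {
        String page = "<p>Choose a country:</p>"
                + "<SELECT name=\"c\"><option>Albania</option>"
                + "<option>Zimbabwe</option></SELECT><p>Thanks</p>";
        System.out.println(cleanText(page, "select"));
    }
}
```

In a real plugin the same decision would more likely be made while
walking the parsed DOM, skipping the subtree of any unwanted element.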

I found one idea very interesting, Web-SQL:
http://www.lotontech.com
I bought Tony Loton's book "Web Content Mining with Java"; about 90% of
it is code, which I don't really need...
However, I am going to implement some kind of Web-SQL and mathematical
statistics. Usually the pages of a web site share about 90% of their
HTML, and I need only a subset.

Also, I need to find the point in Nutch where I can replace the Analyzer
with my own "non-analyzer"; I don't need to remove stop-words etc.

I'd like to use Lucene as a database too: to run a lot of queries and
to calculate some statistics...

-Fuad


-----Original Message-----
From: Jack Tang [mailto:[EMAIL PROTECTED] 
Sent: Thursday, August 18, 2005 10:15 PM
To: [email protected]
Subject: Parse-html should be enhanced!


Hi Nutchers

I think the parse-html parser should be enhanced. In some of my
projects (an intranet search engine), we only need the content inside
specified detectors, with the junk filtered out: say, the content
between <div class="start-here"> and </div>, or detectors such as
XPath expressions. Any thoughts on this enhancement?
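
As a rough illustration of such a detector, the sketch below selects
one region with an XPath expression using only the JDK's javax.xml
classes. It is not Nutch code: the class name RegionExtractor is
hypothetical, and the sketch assumes the page is well-formed XHTML,
which real-world HTML usually is not without a tidying step first.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: returns the text of the first
// node matched by an XPath expression, or "" when nothing matches.
final class RegionExtractor {

    static String extract(String xhtml, String xpathExpr) {
        try {
            // Assumption: the input parses as XML (well-formed XHTML).
            return XPathFactory.newInstance().newXPath()
                    .evaluate(xpathExpr,
                              new InputSource(new StringReader(xhtml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException(
                    "bad XPath or malformed markup", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body>"
                + "<div class=\"nav\">Home | About</div>"
                + "<div class=\"start-here\">Real <b>content</b> here.</div>"
                + "</body></html>";
        System.out.println(extract(page, "//div[@class='start-here']"));
    }
}
```

The navigation text is left out because only the string value of the
matched div is returned.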

Regards
/Jack
-- 
Keep Discovering ... ...
http://www.jroller.com/page/jmars

