Extracting the whole text of HTML documents when crawling

mohammad_108 Sun, 08 Feb 2009 05:06:04 -0800

I am quite new to Nutch. After a while, I succeeded in installing
Cygwin, Tomcat, and Nutch. I began a crawl of apache.org and received a
large number of files, but I don't even know how to read them. I have
realized that they are index files and that I need to learn Lucene; however,
I am not familiar with Lucene or Java either.
I want to crawl the web for a keyword, extract the cleaned text of each
HTML document, and concatenate the HTML files. I don't know how to do this.
