I am quite new to Nutch. After some effort, I managed to install Cygwin, Tomcat, and Nutch. I ran a crawl of apache.org and ended up with a large number of files, but I don't even know how to read them. I have realized that they are index files and that I need to learn Lucene to work with them; however, I am not familiar with Lucene or Java either.

What I want to do is crawl the web for a keyword, extract the clean text (without markup) of each HTML document, and concatenate the results. I don't know how to do this.

--
View this message in context: http://www.nabble.com/Extracting-the-whole-text-of-HTML-documents-when-crawling-tp21898694p21898694.html
Sent from the Nutch - User mailing list archive at Nabble.com.
