Hi everybody,

We are planning to use the Nutch 0.9 web crawler. It works fine with any static website that has static content: it crawls the pages and creates the binary crawl DB. We also have a CMS whose content is stored as objects in a database; that is, both the content file name and its body reside in the database. When we crawl this CMS, we get the following error:
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)

I presume Nutch is not able to find an index file for the root directory, because in the CMS there is no index page or content file on disk. Lucene works fine with the CMS, because on each content creation or update the Lucene indexer regenerates its binary index.

Now my questions are: how can the Nutch web crawler pick up content from the database? Do we need to write our own indexer that updates the crawl DB? Please share your ideas.

Kind regards,
Chandra
Infoaxon Technology

--
View this message in context: http://www.nabble.com/How-to-Crawl-CMS-System-tp14500406p14500406.html
Sent from the Nutch - Agent mailing list archive at Nabble.com.
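P.S. One idea we are considering (rather than writing a custom indexer) is to export the database content as static pages that Nutch can fetch like any ordinary site, since the crawler works over protocols such as http:// and file:// rather than reading a database directly. Below is only a rough sketch of that idea: the `CmsExport` class is hypothetical, and the hard-coded rows stand in for a real JDBC query against our CMS tables.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: render CMS rows (name -> body) as static HTML files
// under an export directory that a web server (or a file:// crawl) can serve,
// so Nutch fetches them like any other site.
public class CmsExport {

    // Stand-in for a SELECT over the CMS content table; a real setup would
    // read these rows via JDBC instead of hard-coding them.
    static Map<String, String> fetchRows() {
        Map<String, String> rows = new LinkedHashMap<>();
        rows.put("about", "Sample page body stored in the database.");
        rows.put("news-1", "Another content object from the CMS tables.");
        return rows;
    }

    // Writes one HTML file per row plus an index.html that links to them,
    // giving the crawler a root URL with links to follow. Returns the index.
    static Path export(Path dir) throws IOException {
        Files.createDirectories(dir);
        StringBuilder index = new StringBuilder("<html><body><ul>");
        for (Map.Entry<String, String> row : fetchRows().entrySet()) {
            String file = row.getKey() + ".html";
            String page = "<html><body><h1>" + row.getKey() + "</h1><p>"
                    + row.getValue() + "</p></body></html>";
            Files.write(dir.resolve(file), page.getBytes(StandardCharsets.UTF_8));
            index.append("<li><a href=\"").append(file).append("\">")
                 .append(row.getKey()).append("</a></li>");
        }
        index.append("</ul></body></html>");
        Path idx = dir.resolve("index.html");
        Files.write(idx, index.toString().getBytes(StandardCharsets.UTF_8));
        return idx;
    }

    public static void main(String[] args) throws IOException {
        Path idx = export(Paths.get("cms-export"));
        System.out.println("Wrote " + idx);
    }
}
```

With something like this run on each content update (the same trigger our Lucene indexer already uses), the exported directory could be seeded as the crawl root. Does this sound workable, or is a custom Nutch protocol plugin the better route?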