[
https://issues.apache.org/jira/browse/NUTCH-974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015886#comment-13015886
]
Niksa Jakovljevic commented on NUTCH-974:
-----------------------------------------
Hi Markus,
as I said, I was using the same conf folder in both cases (Nutch 1.2 and 1.1), so
I don't think configuration is the issue.
To be clear, I start the crawl through the API, not the script. I'll paste my code
below so you can check again.
Thanks!
public void crawl() throws IOException {
    Configuration conf = NutchConfiguration.createCrawlConfiguration();
    conf.set("http.agent.name", "Test");
    conf.set("http.agent.description", "Test Desc");
    conf.set("http.agent.url", "testAgent");
    conf.set("http.agent.email", "[email protected]");

    NutchJob job = new NutchJob(conf);

    Path rootUrlDir = new Path("D:/Development/crawler/url");
    Path dir = new Path("D:/tmp/crawl-" + getDate());
    int threads = job.getInt("fetcher.threads.fetch", 2);
    int depth = 2;
    long topN = Long.MAX_VALUE;
    String indexerName = "solr";
    String solrUrl = "http://localhost:8081/solr/";
    boolean isSolrIndex = StringUtils.equalsIgnoreCase(indexerName, "solr");

    FileSystem fs = FileSystem.get(job);
    Path crawlDb = new Path(dir + "/crawldb");
    Path linkDb = new Path(dir + "/linkdb");
    Path segments = new Path(dir + "/segments");
    Path indexes = new Path(dir + "/indexes");
    Path index = new Path(dir + "/index");
    Path tmpDir = job.getLocalPath("crawl" + Path.SEPARATOR + getDate());

    Injector injector = new Injector(conf);
    Generator generator = new Generator(conf);
    Fetcher fetcher = new Fetcher(conf);
    ParseSegment parseSegment = new ParseSegment(conf);
    CrawlDb crawlDbTool = new CrawlDb(conf);
    LinkDb linkDbTool = new LinkDb(conf);

    // initialize crawlDb
    injector.inject(crawlDb, rootUrlDir);

    int i;
    for (i = 0; i < depth; i++) {
        // generate new segment
        Path[] segs = generator.generate(crawlDb, segments, -1, topN,
            System.currentTimeMillis());
        if (segs == null) {
            LOG.info("Stopping at depth=" + i + " - no more URLs to fetch.");
            break;
        }
        fetcher.fetch(segs[0], threads,
            org.apache.nutch.fetcher.Fetcher.isParsing(conf)); // fetch it
        if (!Fetcher.isParsing(job)) {
            parseSegment.parse(segs[0]); // parse it, if needed
        }
        crawlDbTool.update(crawlDb, segs, true, true); // update crawldb
    }

    if (i > 0) {
        linkDbTool.invert(linkDb, segments, true, true, false); // invert links

        // index, dedup & merge
        FileStatus[] fstats = fs.listStatus(segments,
            HadoopFSUtil.getPassDirectoriesFilter(fs));
        if (isSolrIndex) {
            SolrIndexer indexer = new SolrIndexer(conf);
            indexer.indexSolr(solrUrl, crawlDb, linkDb,
                Arrays.asList(HadoopFSUtil.getPaths(fstats)));
        } else {
            DeleteDuplicates dedup = new DeleteDuplicates(conf);
            if (indexes != null) {
                // Delete old indexes
                if (fs.exists(indexes)) {
                    LOG.info("Deleting old indexes: " + indexes);
                    fs.delete(indexes, true);
                }
                // Delete old merged index
                if (fs.exists(index)) {
                    LOG.info("Deleting old merged index: " + index);
                    fs.delete(index, true);
                }
            }
            Indexer indexer = new Indexer(conf);
            indexer.index(indexes, crawlDb, linkDb,
                Arrays.asList(HadoopFSUtil.getPaths(fstats)));
            IndexMerger merger = new IndexMerger(conf);
            if (indexes != null) {
                dedup.dedup(new Path[] { indexes });
                fstats = fs.listStatus(indexes,
                    HadoopFSUtil.getPassDirectoriesFilter(fs));
                merger.merge(HadoopFSUtil.getPaths(fstats), index, tmpDir);
            }
        }
    }
    LOG.info("crawl finished: " + dir);
}
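If it helps to narrow this down, here is a minimal sketch of how I would re-run only the parse step on a segment that a previous fetch already produced, using the same ParseSegment API as in the code above, to reproduce the ParseException outside of the full crawl loop (the segment path below is just an example, not a real directory):
public void parseOnly() throws IOException {
    // same crawl configuration and agent name as in crawl() above
    Configuration conf = NutchConfiguration.createCrawlConfiguration();
    conf.set("http.agent.name", "Test");
    ParseSegment parseSegment = new ParseSegment(conf);
    // replace with a real segment directory left behind by an earlier fetch
    parseSegment.parse(new Path("D:/tmp/crawl-20110401163345/segments/20110401163400"));
}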
> Parsing Error in Nutch 1.2 on Windows7
> --------------------------------------
>
> Key: NUTCH-974
> URL: https://issues.apache.org/jira/browse/NUTCH-974
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.2
> Environment: Windows7 64-bit, Cygwin 1.7.9-1
> Reporter: Niksa Jakovljevic
> Assignee: Markus Jelsma
>
> The Hello World crawling example does not work with the Nutch 1.2 libs, but works
> fine with the Nutch 1.1 libs. Note that the same configuration is used with both Nutch
> 1.2 and Nutch 1.1.
> Nutch 1.2 always throws the following exception:
> 2011-04-01 16:33:45,177 WARN parse.ParseUtil - Unable to successfully parse
> content http://www.test.com/ of type text/html
> 2011-04-01 16:33:45,177 WARN fetcher.Fetcher - Error parsing:
> http://www.test.com/: failed(2,200): org.apache.nutch.parse.ParseException:
> Unable to successfully parse content
> Thanks,
> Niksa Jakovljevic