[
https://issues.apache.org/jira/browse/NUTCH-974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015886#comment-13015886
]
Niksa Jakovljevic commented on NUTCH-974:
-----------------------------------------
Hi Markus,
as I said, I was using the same conf folder in both cases (Nutch 1.2 and 1.1), so
I don't think configuration is the issue.
To be clear, I start the crawl through the API, not the script. I'll paste my code
below so you can check again.
Thanks!
public void crawl() throws IOException {
    Configuration conf = NutchConfiguration.createCrawlConfiguration();
    conf.set("http.agent.name", "Test");
    conf.set("http.agent.description", "Test Desc");
    conf.set("http.agent.url", "testAgent");
    conf.set("http.agent.email", "[email protected]");

    NutchJob job = new NutchJob(conf);

    Path rootUrlDir = new Path("D:/Development/crawler/url");
    Path dir = new Path("D:/tmp/crawl-" + getDate());
    int threads = job.getInt("fetcher.threads.fetch", 2);
    int depth = 2;
    long topN = Long.MAX_VALUE;
    String indexerName = "solr";
    String solrUrl = "http://localhost:8081/solr/";
    boolean isSolrIndex = StringUtils.equalsIgnoreCase(indexerName, "solr");

    FileSystem fs = FileSystem.get(job);
    Path crawlDb = new Path(dir + "/crawldb");
    Path linkDb = new Path(dir + "/linkdb");
    Path segments = new Path(dir + "/segments");
    Path indexes = new Path(dir + "/indexes");
    Path index = new Path(dir + "/index");
    Path tmpDir = job.getLocalPath("crawl" + Path.SEPARATOR + getDate());

    Injector injector = new Injector(conf);
    Generator generator = new Generator(conf);
    Fetcher fetcher = new Fetcher(conf);
    ParseSegment parseSegment = new ParseSegment(conf);
    CrawlDb crawlDbTool = new CrawlDb(conf);
    LinkDb linkDbTool = new LinkDb(conf);

    // initialize crawlDb
    injector.inject(crawlDb, rootUrlDir);

    int i;
    for (i = 0; i < depth; i++) {
        // generate new segment
        Path[] segs = generator.generate(crawlDb, segments, -1, topN,
            System.currentTimeMillis());
        if (segs == null) {
            LOG.info("Stopping at depth=" + i + " - no more URLs to fetch.");
            break;
        }
        fetcher.fetch(segs[0], threads,
            org.apache.nutch.fetcher.Fetcher.isParsing(conf)); // fetch it
        if (!Fetcher.isParsing(job)) {
            parseSegment.parse(segs[0]); // parse it, if needed
        }
        crawlDbTool.update(crawlDb, segs, true, true); // update crawldb
    }

    if (i > 0) {
        linkDbTool.invert(linkDb, segments, true, true, false); // invert links

        // index, dedup & merge
        FileStatus[] fstats = fs.listStatus(segments,
            HadoopFSUtil.getPassDirectoriesFilter(fs));
        if (isSolrIndex) {
            SolrIndexer indexer = new SolrIndexer(conf);
            indexer.indexSolr(solrUrl, crawlDb, linkDb,
                Arrays.asList(HadoopFSUtil.getPaths(fstats)));
        } else {
            DeleteDuplicates dedup = new DeleteDuplicates(conf);
            if (indexes != null) {
                // Delete old indexes
                if (fs.exists(indexes)) {
                    LOG.info("Deleting old indexes: " + indexes);
                    fs.delete(indexes, true);
                }
                // Delete old merged index
                if (fs.exists(index)) {
                    LOG.info("Deleting old merged index: " + index);
                    fs.delete(index, true);
                }
            }
            Indexer indexer = new Indexer(conf);
            indexer.index(indexes, crawlDb, linkDb,
                Arrays.asList(HadoopFSUtil.getPaths(fstats)));
            IndexMerger merger = new IndexMerger(conf);
            if (indexes != null) {
                dedup.dedup(new Path[] { indexes });
                fstats = fs.listStatus(indexes,
                    HadoopFSUtil.getPassDirectoriesFilter(fs));
                merger.merge(HadoopFSUtil.getPaths(fstats), index, tmpDir);
            }
        }
    }
    LOG.info("crawl finished: " + dir);
}
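If it helps to narrow this down, here is a minimal sketch of how I would re-run only the parse step on a segment that a previous fetch already produced, using the same ParseSegment API as in the code above, to reproduce the ParseException outside of the full crawl loop (the segment path below is just an example, not a real directory):
public void parseOnly() throws IOException {
    // same crawl configuration and agent name as in crawl() above
    Configuration conf = NutchConfiguration.createCrawlConfiguration();
    conf.set("http.agent.name", "Test");
    ParseSegment parseSegment = new ParseSegment(conf);
    // replace with a real segment directory left behind by an earlier fetch
    parseSegment.parse(new Path("D:/tmp/crawl-20110401163345/segments/20110401163400"));
}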
> Parsing Error in Nutch 1.2 on Windows7
> --------------------------------------
>
> Key: NUTCH-974
> URL: https://issues.apache.org/jira/browse/NUTCH-974
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.2
> Environment: Windows7 64-bit, Cygwin 1.7.9-1
> Reporter: Niksa Jakovljevic
> Assignee: Markus Jelsma
>
> The Hello World crawling example does not work with the Nutch 1.2 libs, but works
> fine with the Nutch 1.1 libs. Note that the same configuration is used with both Nutch
> 1.2 and Nutch 1.1.
> Nutch 1.2 always throws the following exception:
> 2011-04-01 16:33:45,177 WARN parse.ParseUtil - Unable to successfully parse
> content http://www.test.com/ of type text/html
> 2011-04-01 16:33:45,177 WARN fetcher.Fetcher - Error parsing:
> http://www.test.com/: failed(2,200): org.apache.nutch.parse.ParseException:
> Unable to successfully parse content
> Thanks,
> Niksa Jakovljevic