Briggs wrote:
> nutch 0.7.2
>
> I have 2 scenarios (both using the exact same configurations):
>
> 1) Running the crawl tool from the command line:
>
>    ./bin/nutch crawl -local urlfile.txt -dir /tmp/somedir -depth 5
>
> 2) Running the crawl tool from a web app somewhere in code like:
>
>    final String[] args = new String[]{
>        "-local", "/tmp/urlfile.txt",
>        "-dir", "/tmp/somedir",
>        "-depth", "5"};
>
>    CrawlTool.main(args);
>
>
> When I run the first scenario, I may get thousands of pages, but when
> I run the second scenario my results vary wildly.  I mean, I get
> perhaps 0, 1, 10+, or 100+.  But, I rarely ever get a good crawl from
> within a web application.  So, there are many things that could be
> going wrong here....
>
> 1) Is there some sort of parsing issue?  An xml parser, regex,
> timeouts... something?  Not sure.  But, it just won't crawl as well as
> the 'standalone mode'.
>
> 2) Is it a bad idea to use many concurrent CrawlTools, or even to
> reuse a crawl tool (more than once) within an instance of a JVM?  It
> seems to have problems doing this. I am thinking there are some
> static references that don't really like handling such use. But this
> is just a wild accusation that I am not sure of.
>
>
>
Checking the logs might help in this case. From my experience, I can 
say that there can be classloading problems when the crawl runs inside 
a servlet container. I also suggest running the crawl stepwise, by 
first running inject, then generate, fetch, etc.
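One way to sidestep both the servlet container's classloading and any static state left behind in the JVM by a previous run is to launch each crawl step as a separate process through the bin/nutch script, so every step gets a fresh JVM. Below is a minimal sketch; the step names follow the Nutch 0.7 stepwise tutorial, but the exact arguments, the `/opt/nutch` install path, and the `ExternalCrawl`/`nutchCommand` names are assumptions here, so verify the arguments against the usage output of `bin/nutch` for your version.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/**
 * Sketch: run each Nutch crawl step as a separate OS process, so the
 * servlet container's classloader and any static state inside the
 * tools cannot leak between runs. Step names follow the Nutch 0.7
 * stepwise tutorial; check `bin/nutch` usage for your install.
 */
public class ExternalCrawl {

    /** Builds the argv for one bin/nutch invocation (pure, easy to test). */
    static List<String> nutchCommand(String nutchHome, String... args) {
        List<String> cmd = new ArrayList<String>();
        cmd.add(nutchHome + "/bin/nutch");
        cmd.addAll(Arrays.asList(args));
        return cmd;
    }

    /** Runs one step in a fresh JVM and waits for it to finish. */
    static int runStep(String nutchHome, String... args) throws Exception {
        Process p = new ProcessBuilder(nutchCommand(nutchHome, args))
                .directory(new File(nutchHome))
                .inheritIO()        // step output goes to our own stdout/stderr
                .start();
        return p.waitFor();         // non-zero exit code means the step failed
    }

    public static void main(String[] argv) throws Exception {
        String home = "/opt/nutch"; // assumed install location
        // Stepwise sequence instead of one monolithic `crawl` call:
        runStep(home, "admin", "db", "-create");
        runStep(home, "inject", "db", "-urlfile", "/tmp/urlfile.txt");
        runStep(home, "generate", "db", "segments");
        // fetch / updatedb / index would follow, one segment at a time,
        // each checked for a zero exit code before moving on.
    }
}
```

Running the steps this way also makes the failure point obvious: each step reports its own exit code and writes its own logs, instead of everything disappearing inside one in-container `CrawlTool.main` call.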




_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
