Briggs wrote:
> nutch 0.7.2
>
> I have 2 scenarios (both using the exact same configurations):
>
> 1) Running the crawl tool from the command line:
>
> ./bin/nutch crawl -local urlfile.txt -dir /tmp/somedir -depth 5
>
> 2) Running the crawl tool from a web app somewhere in code like:
>
> final String[] args = new String[]{
> "-local", "/tmp/urlfile.txt",
> "-dir", "/tmp/somedir",
> "-depth", "5"};
>
> CrawlTool.main(args);
>
>
> When I run the first scenario, I may get thousands of pages, but when
> I run the second scenario my results vary wildly. I mean, I get
> perhaps 0, 1, 10+, or 100+. But, I rarely ever get a good crawl from
> within a web application. So, there are many things that could be
> going wrong here....
>
> 1) Is there some sort of parsing issue? An xml parser, regex,
> timeouts... something? Not sure. But, it just won't crawl as well as
> the 'standalone mode'.
>
> 2) Is it a bad idea to use many concurrent CrawlTools, or even to reuse
> a crawl tool (more than once) within an instance of a JVM? It seems to
> have problems doing this. I am thinking there are some static
> references that don't really like handling such use. But this is just
> a wild accusation that I am not sure of.
>
>
>
Checking the logs might help in this case. From my experience, I can
say that there can be classloading problems when the crawl runs inside
a servlet container. I also suggest running the crawl step-wise, by
first running inject, then generate, fetch, etc.
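For reference, the step-wise equivalent of the one-shot crawl in 0.7.x
looks roughly like the following. This is a sketch based on the old 0.7
tutorial; the exact command names, options, and paths may differ in your
install, so verify them against your own bin/nutch usage output:

```shell
# Step-wise crawl sketch for Nutch 0.7.x (tutorial-style commands).
# Paths and option names here are assumptions -- check your install.

bin/nutch admin db -create                 # create an empty WebDB
bin/nutch inject db -urlfile urlfile.txt   # seed it with your URL list
bin/nutch generate db segments             # generate a fetchlist segment
s1=`ls -d segments/2* | tail -1`           # pick the newest segment
bin/nutch fetch $s1                        # fetch its pages
bin/nutch updatedb db $s1                  # fold results back into the WebDB
```

Running each phase by hand like this makes it much easier to see from
the logs which step loses pages when the same configuration runs inside
a web app.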
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general