Briggs wrote:
nutch 0.7.2
I have 2 scenarios (both using the exact same configurations):
1) Running the crawl tool from the command line:
./bin/nutch crawl -local urlfile.txt -dir /tmp/somedir -depth 5
2) Running the crawl tool from a web app, with code along these lines:
final String[] args = new String[] {
        "-local", "/tmp/urlfile.txt",
        "-dir", "/tmp/somedir",
        "-depth", "5"
};
CrawlTool.main(args);
When I run the first scenario I may get thousands of pages, but when
I run the second the results vary wildly: sometimes 0 pages, sometimes
1, sometimes 10+ or 100+. I rarely ever get a good crawl from within
a web application. So, there are several things that could be going
wrong here:
1) Is there some sort of parsing issue? An XML parser, a regex,
timeouts... something? I'm not sure, but it just won't crawl as well
as it does in 'standalone mode'.
2) Is it a bad idea to run many concurrent CrawlTools, or even to
reuse a crawl tool (more than once) within a single instance of a JVM?
It seems to have problems with this. I suspect there are some static
references that don't handle that kind of use well, but that's just a
guess I haven't verified.
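One way to take static state out of the picture is to launch each
crawl in its own child JVM from the web app, reusing the same
arguments as scenario 2. A minimal sketch, assuming a hypothetical
/opt/nutch-0.7.2 install directory (adjust to your layout):

import java.io.BufferedReader;
import java.io.File;
import java.io.InputStreamReader;

public class CrawlLauncher {

    // Hypothetical install location; point this at your nutch-0.7.2 dir.
    private static final File NUTCH_HOME = new File("/opt/nutch-0.7.2");

    public static void runCrawl() throws Exception {
        // Same invocation as scenario 1, but in a fresh JVM, so whatever
        // static state CrawlTool keeps is discarded when the process exits.
        ProcessBuilder pb = new ProcessBuilder(
                new File(NUTCH_HOME, "bin/nutch").getPath(),
                "crawl",
                "-local", "/tmp/urlfile.txt",
                "-dir", "/tmp/somedir",
                "-depth", "5");
        pb.directory(NUTCH_HOME);      // run from the install dir
        pb.redirectErrorStream(true);  // fold stderr into stdout

        Process p = pb.start();
        BufferedReader out = new BufferedReader(
                new InputStreamReader(p.getInputStream()));
        String line;
        while ((line = out.readLine()) != null) {
            System.out.println("[nutch] " + line);  // or the webapp's logger
        }
        int exit = p.waitFor();
        if (exit != 0) {
            throw new IllegalStateException("nutch crawl exited with " + exit);
        }
    }
}

A servlet would then call CrawlLauncher.runCrawl() instead of
CrawlTool.main(args); running one crawl process at a time also keeps
two crawls from writing into the same -dir.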
Checking the logs might help in this case. From my experience, I can
say that there can be classloading problems when the crawl runs inside
a servlet container. I also suggest running the crawl step-wise, by
first running inject, then generate, fetch, etc., as sketched below.
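For reference, a step-wise run on 0.7.x looks roughly like the
following. These commands are recalled from the 0.7 tutorial, and
urls.txt, db, and segments are placeholder names; running bin/nutch
with no arguments prints the exact usage for your build:

bin/nutch admin db -create
bin/nutch inject db -urlfile urls.txt
bin/nutch generate db segments
bin/nutch fetch segments/<new-segment-dir>
bin/nutch updatedb db segments/<new-segment-dir>

Running the steps one at a time should show where the in-container
crawl diverges: if inject and generate behave the same in both
environments but fetch comes back nearly empty, the problem is more
likely plugin or classloader related than a configuration issue.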