I think end-to-end testing must focus on end-to-end
problems (i.e. checking PDF parsing is already covered
by unit tests, and that is really the right place for
doing it).
Hate to say it, but today was the first time I got ant
test to work (I hadn't tried too hard before), and yeah,
I saw several such problems.
Anyway, would you post your conf/nutch-site.xml and
walk through your crawl process a bit?
Thanks,
Earl
--- Paul Harrison [EMAIL PROTECTED] wrote:
Murray,
We are running on the following:
5 Pentium 4 3.2 GHz machines, each with 4 GB of RAM,
one 40 GB OS drive, and two 250 GB SATA data drives.
I am trying to do a crawl on trunk of one of my sites,
and it isn't working. I make a file, urls, that just
contains the site
http://shopthar.com/
In my conf/crawl-urlfilter.txt I have
+^http://shopthar.com/
I then do
bin/nutch crawl urls -dir crawl.test -depth 100 -threads 20
and it kicks in
I have been trying to get an answer to this same
question without much luck. I would like to see users
start posting their network setups and
conf/nutch-site.xml files to the list, and perhaps on a
page in the wiki.
I can say that the mapreduce branch is aimed at doing
this
Hi Sébastien,
Yahoo! just hosed my message, glad I had it elsewhere.
As you probably saw in the OutlinkExtractor class, the
links are extracted with a regexp.
Ahh, didn't see it before, but I now see URL_PATTERN.
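For anyone following along who hasn't opened OutlinkExtractor: the idea is just to run a URL-matching regexp over the page text and collect every match. Here's a minimal sketch of that technique; the class name and pattern below are mine for illustration, not Nutch's actual URL_PATTERN.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OutlinkSketch {
    // Hypothetical pattern for illustration only (Nutch's real
    // URL_PATTERN is more elaborate): match an http URL up to
    // the first whitespace, quote, or angle bracket.
    private static final Pattern URL_PATTERN =
        Pattern.compile("http://[^\\s\"'<>]+");

    // Scan free text (or raw HTML) and collect every URL match.
    public static List<String> extractOutlinks(String text) {
        List<String> links = new ArrayList<>();
        Matcher m = URL_PATTERN.matcher(text);
        while (m.find()) {
            links.add(m.group());
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://shopthar.com/\">shop</a>";
        System.out.println(extractOutlinks(html));
    }
}
```

The quote character in the character class is what stops the match at the end of an href attribute value in the example above.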
I know it's minor, but if you later apply
Jérôme,
which Nutch version do you use?
Kind of gave up on mapred for a while, so I am using
trunk.
There was a bug concerning content-types with
parameters, such as
text/html; charset=iso-8859-1.
Yeah, when I telnet in and GET / from shopthar.com, I get
Content-Type: text/html;
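The bug Jérôme mentions boils down to matching the whole header value ("text/html; charset=iso-8859-1") against a parser registry keyed on the bare mime type ("text/html"). One way to normalize it is to strip everything after the first semicolon; this is a minimal sketch of that idea under my own class and method names, not Nutch's actual fix.

```java
public class ContentTypeSketch {
    // Strip any parameters (e.g. "; charset=iso-8859-1") so that
    // "text/html; charset=iso-8859-1" resolves to the same parser
    // as plain "text/html". Sketch only; names are hypothetical.
    public static String baseMimeType(String contentType) {
        if (contentType == null) {
            return null;
        }
        int semi = contentType.indexOf(';');
        String base = (semi >= 0) ? contentType.substring(0, semi)
                                  : contentType;
        return base.trim().toLowerCase();
    }

    public static void main(String[] args) {
        System.out.println(
            baseMimeType("text/html; charset=iso-8859-1")); // prints "text/html"
    }
}
```

If the server's charset parameter is what tripped the parser lookup, a normalization like this (applied before the lookup) would explain why pages with a bare "text/html" header parsed fine while shopthar.com's did not.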
Trunk? MapReduce? Could you describe your box setup
and job division, and maybe post your
conf/nutch-site.xml file?
Just trying to get things going and not having much
luck with the mapreduce branch. I also tried trunk; the
crawl stops around 3 pages (out of maybe a million),
and once it's