Re: a simple map reduce tutorial

2005-10-04 Thread Earl Cahill
I think end-to-end testing must focus on end-to-end problems (i.e., PDF parsing is already checked by unit tests, and that is really the right place for it). Hate to say it, but today was the first time I got ant test to work (hadn't tried too hard), and yeah, I saw several such

RE: Nutch Search Speed Concern

2005-10-17 Thread Earl Cahill
Any way you would post your conf/nutch-site.xml and walk through your crawl process a bit? Thanks, Earl --- Paul Harrison [EMAIL PROTECTED] wrote: Murray, We are running on the following: 5 Pentium 4 3.2 GHz machines, 4 GB of RAM each, 1 40 GB OS drive and 2 SATA 250 GB data drives

crawl problems

2005-10-19 Thread Earl Cahill
I am trying to do a crawl on trunk of one of my sites, and it isn't working. I make a file, urls, that just contains the site http://shopthar.com/. In my conf/crawl-urlfilter.txt I have +^http://shopthar.com/. I then do bin/nutch crawl urls -dir crawl.test -depth 100 -threads 20. It kicks in
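
For readers following along, here is a minimal sketch of how a filter line such as +^http://shopthar.com/ might be evaluated. It is not Nutch's own code. It assumes that each rule is a + or - followed by a regular expression, that the first matching rule decides, and that a URL matching no rule is rejected. The names load_rules and accepts are hypothetical.

```python
import re


def load_rules(text):
    """Parse '+regex' / '-regex' lines; skip blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        # True means "include", False means "exclude".
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Assumed semantics: the first matching rule decides; no match rejects."""
    for include, pattern in rules:
        if pattern.search(url):
            return include
    return False


# The single rule quoted in the message above.
RULES = load_rules("+^http://shopthar.com/")
```

Under these assumptions the rule accepts http://shopthar.com/ and anything beneath it, but rejects http://www.shopthar.com/ and https://shopthar.com/, since the pattern is anchored at the start of the URL. The unescaped dot also matches any character, so writing shopthar\.com would be stricter.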

Re: Nutch and Clustering

2005-10-20 Thread Earl Cahill
I have been trying to get an answer to this same question without much luck. I would like to see users start to post their network setups and conf/nutch-site.xml files to the list and perhaps on a page in the wiki. I can say that the mapreduce branch is aimed at doing this

Re: crawl problems (a bug/patch)

2005-10-20 Thread Earl Cahill
Hi Sébastien, Yahoo! just hosed my message, glad I had it elsewhere. As you probably saw in the OutlinkExtractor class, the links are extracted with a Regexp. Ahh, didn't see it before, but I now see URL_PATTERN. I know it's minor, but if you later apply
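
As a rough illustration of regex-based link extraction in general, the sketch below pulls absolute http/https URLs out of a block of text. The pattern is a hypothetical stand-in, not the URL_PATTERN from OutlinkExtractor, and extract_outlinks is an illustrative name.

```python
import re

# Hypothetical pattern: an http(s) scheme followed by any run of characters
# that are not whitespace, angle brackets, or quotes.
HYPOTHETICAL_URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def extract_outlinks(text):
    """Return the URLs found in text, in order of first appearance."""
    seen = set()
    links = []
    for match in HYPOTHETICAL_URL_PATTERN.finditer(text):
        url = match.group(0)
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links
```

A pattern this loose will also pick up trailing punctuation such as a closing parenthesis or period, which is one reason small changes to the expression can matter.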

Re: crawl problems (a bug/patch)

2005-10-20 Thread Earl Cahill
Jérôme, which Nutch version do you use? Kind of gave up on mapred for a while, so I am using trunk. There was a bug concerning the content-types with parameters such as text/html; charset=iso-8859-1. Yeah, when I telnet in to GET / shopthar.com, I get Content-Type: text/html;
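
To make the content-type issue concrete, here is a small sketch of separating the bare media type from its parameters, so that a lookup keyed on text/html still works when the header carries a charset. This is an assumption about the general shape of such a fix, not the actual Nutch patch, and split_content_type is a hypothetical helper.

```python
def split_content_type(header_value):
    """Split a Content-Type value into (media_type, params).

    Simplification: quoted parameter values that contain ';' are not handled.
    """
    parts = [part.strip() for part in header_value.split(";")]
    media_type = parts[0].lower()
    params = {}
    for part in parts[1:]:
        if "=" in part:
            key, _, value = part.partition("=")
            params[key.strip().lower()] = value.strip().strip('"')
    return media_type, params
```

With the header from the message, split_content_type("text/html; charset=iso-8859-1") yields the media type text/html and a params dict holding the charset, so code that selects a parser by media type no longer sees the parameter suffix.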

Re: fetch questions - freezing

2005-10-28 Thread Earl Cahill
Trunk? Map reduce? Could you describe your box setup, job division, and maybe post your conf/nutch-site.xml file? Just trying to get things going and not having much luck with the mapreduce branch. I also tried trunk; the crawl stops around 3 pages (out of maybe a million), and once it's