> I'm not sure what you mean.  I set environment
> variables in my .bashrc, 
> then simply use 'bin/start-all.sh' and 'bin/nutch
> crawl'.

Well, not sure if you looked at my tutorial, which is
now on the wiki

http://wiki.apache.org/nutch/SimpleMapReduceTutorial

but yeah, that is much simpler than what I am doing. 
Looks like a little example has been added to the FAQ,
which wasn't there last time I looked.

> NutchBean now looks for things in the subdirectory
> of the connected 
> directory named 'crawl'.  Is that an improvement or
> is it just confusing?

I think magic is ok so long as it is documented and it
works.

> I think it would be better to have the junit tests
> start jetty then 
> crawl localhost.  I'd love to see some end-to-end
> unit tests like that.

I think I will start working on this.  Maybe start
with one page that contains just a few phrases, or
maybe just the word nutch, then make sure it can be
queried out at the end?  Could also check status
throughout the process to make sure everything looks
good.  If nothing else, I would likely understand the
process pretty well by the time I finished writing it.
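Something like the following sketch of the idea, in plain Java:
it serves a single page containing the word "nutch" and fetches
it back over localhost.  (This uses the JDK's built-in
com.sun.net.httpserver as a stand-in; a real test would instead
run Jetty and point the Nutch crawl and NutchBean query at the
local URL.  The class name LocalCrawlTest is just made up here.)

```java
import com.sun.net.httpserver.HttpServer;
import java.io.*;
import java.net.*;
import java.nio.charset.StandardCharsets;

public class LocalCrawlTest {
    public static void main(String[] args) throws Exception {
        // Serve one page containing a known phrase.  In the real
        // test this content would live somewhere in the source tree
        // and be served by Jetty for 'bin/nutch crawl' to fetch.
        byte[] page = "<html><body>nutch test phrase</body></html>"
                .getBytes(StandardCharsets.UTF_8);
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/", exchange -> {
            exchange.sendResponseHeaders(200, page.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(page);
            }
        });
        server.start();
        int port = server.getAddress().getPort();

        // Fetch the page back over localhost; this stands in for the
        // crawl-then-query steps of the end-to-end test.
        URL url = new URL("http://localhost:" + port + "/");
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(),
                        StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line);
            }
        }
        server.stop(0);

        // Make sure the known phrase can be "queried out" at the end.
        if (!body.toString().contains("nutch")) {
            throw new AssertionError("expected phrase not found");
        }
        System.out.println("end-to-end fetch ok");
    }
}
```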

I think this would also make it easy to test things
like recursive linking, parsing PDFs and other file
formats, observing robots.txt, and any crawling bugs
that are encountered and then fixed.

Suggestions for where to put such test content in the
tree?

> You should be able to add them to the wiki yourself.

Thanks, I added them.

Earl


        
                