Yeah, the big link on the homepage is

<a href=/sitemap.html>browse</a>

which then opens several other pages.  All the links
on the site start with /.
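
For illustration, here is a rough sketch of how a crawler would typically
turn root-relative hrefs like that one into absolute URLs.  It uses Python's
standard library, not Nutch's own parser, and the sample HTML is only the two
links mentioned in this thread, so the names LinkCollector, extract_links and
SAMPLE are made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # "/sitemap.html" becomes "http://shopthar.com/sitemap.html"
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Return the absolute URLs of all anchors found in html."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links


# Hypothetical sample: the two anchors discussed in this thread.
SAMPLE = (
    '<a href=/sitemap.html>browse</a> '
    '<a href="http://shopthar.com/">shopthar.com</a>'
)

if __name__ == "__main__":
    for link in extract_links("http://shopthar.com/", SAMPLE):
        print(link)
```

Note that a plain grep for shopthar.com in the page source would not show
the first link, since the host name never appears in a root-relative href.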

So I tried

+^/

in my conf/crawl-urlfilter.txt with no luck.  

Against my better judgement, I also tried

+^/.*

which also didn't work.
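
One possible explanation, sketched below: if the filter is applied to the
absolute URL after relative links have been resolved (an assumption about
Nutch here, not something confirmed in this thread), then a pattern anchored
at ^/ can never match, because every candidate starts with http://.  The
sketch also assumes that the first matching rule decides and that a URL
matching no rule is rejected.  load_rules and accepts are hypothetical
helpers, not Nutch code.

```python
import re


def load_rules(lines):
    """Parse '+regex' / '-regex' lines into (allow, compiled_regex) pairs."""
    rules = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Assumed semantics: first matching rule wins, no match means reject."""
    for allow, regex in rules:
        if regex.search(url):
            return allow
    return False


if __name__ == "__main__":
    # The absolute form of the /sitemap.html link from the homepage.
    url = "http://shopthar.com/sitemap.html"
    for rule in ["+^/", "+^/.*", "+^http://shopthar.com/"]:
        print(rule, accepts(load_rules([rule]), url))
```

Under those assumptions, only the rule that spells out the scheme and host
accepts the sitemap URL; the two rules starting with ^/ reject it.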

Thanks,
Earl

--- Doug Cutting <[EMAIL PROTECTED]> wrote:

> The only link on http://shopthar.com/ to the domain shopthar.com is a
> link to http://shopthar.com/.  So a crawl starting from that page that
> only visits pages in shopthar.com will only find that one page.
> 
> % wget -q -O - http://shopthar.com/ | grep shopthar.com
>    <tr><td colspan=2>Welcome to shopthar.com</td></td></tr>
> <a href=http://shopthar.com/>shopthar.com</a> |
> 
> Doug
> 
> Earl Cahill wrote:
> > I am trying to do a crawl on trunk of one of my sites,
> > and it isn't working.  I made a file, urls, that just
> > contains the site
> > 
> > http://shopthar.com/
> > 
> > in my conf/crawl-urlfilter.txt I have
> > 
> > +^http://shopthar.com/
> > 
> > I then do
> > 
> > bin/nutch crawl urls -dir crawl.test -depth 100 -threads 20
> > 
> > it kicks in and I get repeating chunks like
> > 
> > 051019 010450 Updating
> > /home/nutch/nutch/trunk/crawl.test/db
> > 051019 010450 Updating for
> > /home/nutch/nutch/trunk/crawl.test/segments/20051019010449
> > 051019 010450 Finishing update
> > 051019 010450 Update finished
> > 051019 010450 FetchListTool started
> > 051019 010450 Overall processing: Sorted 0 entries in 0.0 seconds.
> > 051019 010450 Overall processing: Sorted NaN entries/second
> > 051019 010450 FetchListTool completed
> > 051019 010450 logging at INFO
> > 
> > For ages, but I only see two nutch hits in my access
> > log: one for my robots.txt and one for my front page.
> > Nothing else.
> > 
> > The "crawl" finishes, then I do a search and can only
> > get a hit for the front page.  When I do the search
> > via lynx, I get a momentary
> > 
> > Bad partial reference!  Stripping lead dots.
> > 
> > I can't imagine this is really the problem, but pretty
> > well all my links are relative.  I mean nutch has to
> > be able to follow relative links, right?
> > 
> > Ideas?
> > 
> > Thanks,
> > Earl



                