Hi,

I'm a new Nutch user. My company wants me to look into using this technology
to index our internal wiki site as well as SharePoint documents (using Tika).

Right now I just want Nutch to index the entire wiki site, but I'm having
problems. I've read about other people hitting the same issue, but I haven't
found a solution that works for me.

I have Nutch 1.0 installed. The wiki is MoinMoin, if that helps. The pages
don't have extensions like .html; they take the form
http://wiki:8000/Engineering, so every page has a one-level path.

I'm running Nutch with the following command (as I understand it, -depth is
the number of generate/fetch/update rounds and -topN caps the URLs fetched
per round):
bin/nutch crawl urls -dir crawl -depth 100 -topN 1000000 > crawl.log

I have a urls folder with a file called wiki that points to the top-level
page of the site.
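
For reference, the seed file (urls/wiki) is a single line with the root URL;
the hostname and port below just match the example above:

http://wiki:8000/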

I set crawl-urlfilter.txt to accept everything except the default
exclusions:
-^(file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
-[...@=]
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
+.

I also set the db.ignore.external.links property in nutch-default.xml to
true so the crawl doesn't go outside the site (db.ignore.internal.links is
left at false).
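
For completeness, here is the property as I set it (I edited
nutch-default.xml directly; I gather overrides normally go in
conf/nutch-site.xml instead, in case that matters):

<!-- set in nutch-default.xml; the usual place for overrides is nutch-site.xml -->
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>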

After the crawl command completes, search returns some pages, but pages
that are two or three link hops from the starting page still don't show up
in the results.

Any help would be appreciated.

Thanks,
Kane