Hi,

I am currently configuring/customizing Nutch to crawl a select number
of sites structured like news/auction/forum sites, where there is a
set of item listings (a news headlines list, auctions in progress, a
forum thread list) and each item has its own page (the actual news
story, auction, or forum thread).

My planned setup is a specialized URL injector per site: a subprogram
that crawls the item listings (which vary from site to site), walks
through them, and extracts the URLs of the pages for each individual
item. I would then crawl those URLs for the individual item details.
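As a concrete (if simplified) illustration of the listing step, here
is a minimal sketch in Java. The class name, the listing URL, and the
item-link pattern are hypothetical placeholders; a real per-site
injector would also need paging logic and politeness delays.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.LinkedHashSet;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    /**
     * Sketch of a per-site listing crawler: fetch one listing page,
     * extract the item-detail links matching a site-specific pattern,
     * and hand them to whatever injector is in use.
     */
    public class ListingLinkExtractor {

        // Site-specific pattern for item-detail links, e.g. ".../item?id=123".
        private final Pattern itemLink;

        public ListingLinkExtractor(Pattern itemLink) {
            this.itemLink = itemLink;
        }

        /** Fetch one listing page and return the item links found on it. */
        public Set<String> extractItemLinks(String listingUrl) throws Exception {
            StringBuilder page = new StringBuilder();
            BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(listingUrl).openStream(), "UTF-8"));
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    page.append(line).append('\n');
                }
            } finally {
                in.close();
            }
            // LinkedHashSet keeps page order (newest first on most listings)
            // while dropping duplicate hrefs on the same page.
            Set<String> links = new LinkedHashSet<String>();
            Matcher m = itemLink.matcher(page);
            while (m.find()) {
                links.add(m.group());
            }
            return links;
        }

        public static void main(String[] args) throws Exception {
            ListingLinkExtractor extractor = new ListingLinkExtractor(
                Pattern.compile("http://example\\.com/item\\?id=\\d+"));
            for (String link : extractor.extractItemLinks("http://example.com/listing")) {
                System.out.println(link);   // feed these to the injector
            }
        }
    }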

In addition, I am planning on using an actual database to store the
webdb information. I am not using Nutch's built-in webdb data
structure for the following reasons:
(1) webdb does not seem to support multiple injector threads adding
URLs to it at the same time.
(2) webdb cannot tell you at inject time whether the URL to be
injected is already present. WebDBWriter does not do that; I could
find out using WebDBReader, but again I am afraid the dbread/dbwrite
locks combined with multiple injecting threads could cause issues.
Knowing whether the URL to be injected is already present would aid
my walk of the item listings: ideally the items would be sorted in
reverse chronological order, and I would know to stop when I hit an
item link that is already in the database (see the sketch below).
The downside is that it would take more code to reproduce the
randomization of the fetch order (via sorting by MD5) and the
emission of the fetch list from a database instead of webdb.
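To make the idea concrete, here is a rough sketch of a JDBC-backed
store, assuming a hypothetical "urls" table with a PRIMARY KEY on the
URL column. The duplicate-key trick covers both points above (inject
is safe across threads, and the caller learns whether the URL was
already present), and storing the MD5 at insert time lets the fetch
list be emitted with a plain ORDER BY. The schema, column names, and
the LIMIT syntax are my assumptions, not anything Nutch provides.

    import java.math.BigInteger;
    import java.security.MessageDigest;
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.ArrayList;
    import java.util.List;

    /**
     * Sketch of a database-backed URL store, assuming a table like:
     *   CREATE TABLE urls (url VARCHAR(1024) PRIMARY KEY,
     *                      md5 CHAR(32) NOT NULL,
     *                      fetched BOOLEAN NOT NULL DEFAULT FALSE);
     */
    public class UrlStore {

        private final Connection conn;

        public UrlStore(Connection conn) {
            this.conn = conn;
        }

        /**
         * Insert a URL; returns false if it was already present. The
         * PRIMARY KEY makes this safe with several injector threads:
         * the loser of a concurrent insert gets a duplicate-key error,
         * which we treat as "already present" -- the signal to stop
         * walking a reverse-chronological listing.
         */
        public boolean inject(String url) throws Exception {
            PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO urls (url, md5, fetched) VALUES (?, ?, FALSE)");
            try {
                ps.setString(1, url);
                ps.setString(2, md5Hex(url));
                ps.executeUpdate();
                return true;      // URL was new
            } catch (SQLException dup) {
                return false;     // assume unique-key violation: already injected
            } finally {
                ps.close();
            }
        }

        /**
         * Emit a fetch list of up to 'limit' unfetched URLs, ordered by
         * the stored MD5 for the same pseudo-random spread that webdb's
         * MD5 sort gives. LIMIT syntax varies by database.
         */
        public List<String> fetchList(int limit) throws SQLException {
            PreparedStatement ps = conn.prepareStatement(
                "SELECT url FROM urls WHERE NOT fetched ORDER BY md5 LIMIT ?");
            List<String> urls = new ArrayList<String>();
            try {
                ps.setInt(1, limit);
                ResultSet rs = ps.executeQuery();
                while (rs.next()) {
                    urls.add(rs.getString(1));
                }
            } finally {
                ps.close();
            }
            return urls;
        }

        private static String md5Hex(String s) throws Exception {
            byte[] digest = MessageDigest.getInstance("MD5").digest(s.getBytes("UTF-8"));
            // Left-pad to 32 hex chars so lexical order matches numeric order.
            String hex = new BigInteger(1, digest).toString(16);
            while (hex.length() < 32) {
                hex = "0" + hex;
            }
            return hex;
        }
    }

Catching every SQLException in inject() as a duplicate is a shortcut;
a real version should check for SQLState 23xxx (integrity constraint
violation) and rethrow anything else.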

Does anyone here have ideas or experience in crawling these types of
sites: how you've configured/customized Nutch to crawl them (did you
use specialized injectors, or were you able to get Nutch's default
wide web-crawling working for them), or how you have customized or
replaced webdb for your needs? Any comments or ideas are definitely
appreciated :)

Regards,
CW
