Hi, I am currently configuring/customizing Nutch to crawl a select number of sites structured like news/auction/forum sites: each has a number of item listings (news headline lists, auctions in progress, forum thread lists), and each item has its own page with the actual news/auction/forum information.
My planned setup is a specialized URL injector per site: a subprogram that crawls the item listings (whose structure varies from site to site) and extracts the URL links to each individual item's page. I would then crawl those links for the individual item details.

In addition, I am planning to store the webdb information in an actual database rather than Nutch's built-in webdb data structure, for two reasons: (1) webdb does not seem to support multiple injector threads adding URLs to it at the same time, and (2) webdb cannot tell you at inject time whether the URL being injected is already present. WebDBWriter does not do that; I could find out using WebDBReader, but I am afraid the dbread/dbwrite locks combined with multiple injecting threads could cause issues.

Knowing at inject time whether a URL is already present would help when walking the item listings: ideally the items would be sorted in reverse chronological order, so I would know to stop as soon as I hit an item link that is already in the database (a rough sketch of what I mean is in the P.S. below). The trade-off is that it would take extra code to reproduce the randomization of fetch order (via sorting by MD5) and the emission of the fetch list from a database instead of from webdb.

Does anyone here have ideas or experience crawling these types of sites? How have you configured/customized Nutch for them (did you use specialized injectors, or were you able to get Nutch's default wide web-crawling to work)? And how have you customized or replaced webdb for your needs? Any comments or ideas definitely appreciated :)

Regards,
CW
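P.S. For concreteness, here is a minimal sketch of the inject-time duplicate check I have in mind, using plain JDBC. The urls table and its columns are hypothetical names of my own, and INSERT IGNORE is MySQL-specific; the point is only that a unique key on the URL lets the database arbitrate between concurrent injector threads.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

/**
 * Sketch of an inject-time duplicate check against a relational DB.
 * Assumes a hypothetical table:
 *   CREATE TABLE urls (url VARCHAR(512) PRIMARY KEY,
 *                      fetched BOOLEAN NOT NULL DEFAULT FALSE);
 */
public class DbInjector {

    private final Connection conn;

    public DbInjector(Connection conn) {
        this.conn = conn;
    }

    /**
     * Tries to inject a URL; returns false if it was already present.
     * The PRIMARY KEY on url makes this safe across multiple injector
     * threads: a duplicate insert simply affects zero rows.
     * (INSERT IGNORE is MySQL syntax; other databases differ.)
     */
    public boolean inject(String url) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT IGNORE INTO urls (url) VALUES (?)")) {
            ps.setString(1, url);
            return ps.executeUpdate() == 1; // 1 = new URL, 0 = already known
        }
    }
}

An injector thread walking a reverse-chronological item listing would then simply loop while (inject(nextItemUrl)) and stop at the first false.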
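And here is roughly how I picture emitting the fetch list from the database while keeping webdb's MD5-based randomization of fetch order (again with my hypothetical table/column names; MD5() is MySQL's built-in function, so another database would need its own equivalent):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class DbFetchList {

    /**
     * Emits a fetch list of up to 'limit' unfetched URLs, randomizing
     * the fetch order by sorting on the MD5 of each URL, the same trick
     * webdb uses when it emits a fetch list.
     */
    public static List<String> nextFetchList(Connection conn, int limit)
            throws SQLException {
        List<String> urls = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT url FROM urls WHERE fetched = FALSE "
              + "ORDER BY MD5(url) LIMIT ?")) {
            ps.setInt(1, limit);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    urls.add(rs.getString("url"));
                }
            }
        }
        return urls;
    }
}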
