At 11:30 AM -0400 9/18/09, Paul Tomblin wrote:
Is anybody here familiar with how Dieselpoint (DP) works?

Dieselpoint is designed specifically for intranets and therefore doesn't take robots.txt into account because the Dieselpoint administrator and the web administrator (theoretically) work toward the same goals (see the thread from last Friday, "Ignoring Robots.txt" for an instance where that wasn't the case).

Nutch is designed specifically for all-web crawling (like Google or Bing) and respects robots.txt because Nutch needs to be polite when indexing sites over which it has no control.

Your client has a robots.txt file to control Google and/or Bing, so Nutch is respecting it the same way Google or Bing would.

While Nutch is not designed as an intranet indexer, it can be used that way, but the Nutch administrator must make some compromises. You can't simply build a URL list with wget and hand those URLs to Nutch, because Nutch will still respect the robots.txt file when it fetches the pages. The workaround has to address the robots.txt file itself.

You can attack the problem in one of these ways:

*Modify the nutch-default.xml file (or, better, override it in nutch-site.xml), changing the http.robots.agents property accordingly (search the list for "Jake Jacobson" to see how to do this). Then create a specific record in your client's robots.txt file that cites the Nutch user agent and allows a crawl of everything.
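As a rough sketch of that first option (property names as in the Nutch configuration files; the agent name "mynutchbot" and the Disallow path are made up for illustration):

```xml
<!-- nutch-site.xml overrides; "mynutchbot" is a hypothetical agent name -->
<property>
  <name>http.agent.name</name>
  <value>mynutchbot</value>
</property>
<property>
  <name>http.robots.agents</name>
  <value>mynutchbot,*</value>
</property>
```

and the matching record in the client's robots.txt:

```
# Give the Nutch agent free rein; everyone else keeps the old rules.
User-agent: mynutchbot
Disallow:

User-agent: *
Disallow: /private/
```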

*Modify Nutch to ignore robots.txt files; as I recall, the robots.txt handling lives in the HTTP protocol plugin code (the RobotRulesParser in lib-http), so that's where you would need to work.

*Modify the robots.txt file either by hand or by script. If you're only crawling once (unlikely), just open the robots.txt file, comment out the offending lines, save it, run Nutch, then reopen the file, uncomment the lines, and save again. If you're crawling on a cron job, write one script that swaps in a permissive robots.txt just before Nutch's crawl starts and another that restores the original after Nutch finishes.
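A minimal sketch of the swap-and-restore approach; the paths are assumptions, so adjust them for your web root:

```shell
#!/bin/sh
# Swap in a permissive robots.txt around a Nutch crawl.
# Hypothetical paths -- change for your server layout.
ROBOTS=/var/www/html/robots.txt
BACKUP=/var/www/html/robots.txt.pre-crawl

# Before the crawl: save the real file, install an allow-all version.
cp "$ROBOTS" "$BACKUP"
printf 'User-agent: *\nDisallow:\n' > "$ROBOTS"

# ... run the Nutch crawl here ...

# After the crawl: put the original back.
mv "$BACKUP" "$ROBOTS"
```

Schedule the first half a few minutes before the crawl and the second half after it; just make sure the restore runs even if the crawl fails, or the permissive file will be left in place for Google and Bing to find.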

*In Apache (or whatever HTTP server your client runs), create a ruleset that serves a permissive robots.txt (allowing crawling everywhere) to the IP address where Nutch is running and the regular one to all other IP addresses.
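For Apache, one way to do that is with mod_rewrite; this is only a sketch, and the IP address and the alternate filename (robots-nutch.txt) are made up:

```apache
# Assumes mod_rewrite is enabled and 10.0.0.5 is the Nutch host.
RewriteEngine On
RewriteCond %{REMOTE_ADDR} ^10\.0\.0\.5$
RewriteRule ^/robots\.txt$ /robots-nutch.txt [L]
```

The nice thing about this approach is that the real robots.txt on disk never changes, so there's no window where Google or Bing can see the permissive version.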

There may be other kludges available.

Hope this helps.

\dmc

--
*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+
   David M. Cole                                            d...@colegroup.com
   Editor & Publisher, NewsInc. <http://newsinc.net>        V: (650) 557-2993
   Consultant: The Cole Group <http://colegroup.com/>       F: (650) 475-8479
*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+
