At 11:30 AM -0400 9/18/09, Paul Tomblin wrote:
Is anybody here familiar with how Dieselpoint (DP) works?
Dieselpoint is designed specifically for intranets and therefore
doesn't take robots.txt into account because the Dieselpoint
administrator and the web administrator (theoretically) work toward
the same goals (see the thread from last Friday, "Ignoring
Robots.txt" for an instance where that wasn't the case).
Nutch is designed specifically for all-web crawling (like Google or
Bing) and respects robots.txt because Nutch needs to be polite when
indexing sites over which it has no control.
Your client has a robots.txt file to control Google and/or Bing, so
Nutch is respecting it the same way Google or Bing would.
While Nutch is not designed as an intranet indexer, it can be used
that way, but the Nutch administrator must make some compromises. You
can't simply build a URL list with a wget script and hand it to Nutch
to index, because Nutch will still respect the robots.txt file. You
have to work around the problem at the robots.txt level.
You can attack the problem in one of these ways:
*Modify the nutch-default.xml file, changing the http.robots.agents
property accordingly (search the list for "Jake Jacobson" to see how
to do this). Then create a specific record in your client's
robots.txt file that cites the Nutch user agent and allows a crawl of
everything (see the config sketch after this list).
*Modify Nutch to ignore robots.txt files; you will need to dig into
Nutch's fetching code (the robots.txt rules are applied by the HTTP
protocol plugin; parse-html only deals with robots META tags).
*Modify the robots.txt file, either by hand or by script. If you're
only crawling once (unlikely), just open the robots.txt file, comment
out the offending lines, save it, run Nutch, then reopen the file,
uncomment the lines, and save again. If you're crawling on a cron
job, schedule one script to loosen robots.txt just before Nutch's
crawl and another to restore it after Nutch finishes (see the
swap-script sketch after this list).
*In Apache (or whatever HTTP server the client runs), create a
ruleset that delivers one robots.txt file (which allows crawling
everywhere) to the IP address where Nutch is running and the regular
one to all other IP addresses (see the mod_rewrite sketch after this
list).
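
To make the first option concrete, here's a rough sketch. I'm
assuming you override the defaults in nutch-site.xml rather than
editing nutch-default.xml directly, and "mycompany-nutch" is just a
placeholder agent name; double-check the property names against your
Nutch version:

   <property>
     <name>http.agent.name</name>
     <value>mycompany-nutch</value>
   </property>
   <property>
     <name>http.robots.agents</name>
     <value>mycompany-nutch,*</value>
   </property>

Then add a record to the client's robots.txt that lets that agent in
(an empty Disallow means nothing is off limits for that agent) and
leave the rest of the file alone:

   User-agent: mycompany-nutch
   Disallow: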
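
For the cron option, a minimal sketch of the swap script in Python;
all paths are placeholders, and it assumes you keep a wide-open copy
of robots.txt next to the real one:

   #!/usr/bin/env python
   # Minimal sketch: swap a permissive robots.txt in before the crawl
   # and restore the real one afterward.  All paths are placeholders.
   import shutil
   import sys

   DOCROOT = "/var/www/html"             # assumed document root
   LIVE = DOCROOT + "/robots.txt"        # the file the server serves
   REAL = DOCROOT + "/robots.txt.real"   # saved copy of the real file
   OPEN = DOCROOT + "/robots.txt.open"   # copy that allows everything

   def loosen():
       shutil.copy2(LIVE, REAL)   # keep the real file safe
       shutil.copy2(OPEN, LIVE)   # serve the wide-open copy

   def restore():
       shutil.copy2(REAL, LIVE)   # put the real file back

   if __name__ == "__main__":
       if sys.argv[1:] == ["loosen"]:
           loosen()
       elif sys.argv[1:] == ["restore"]:
           restore()
       else:
           sys.exit("usage: swap_robots.py loosen|restore")

The crontab entries would then look something like this (times and
paths made up; run-nutch-crawl.sh stands in for however you kick off
the crawl):

   55 1 * * * python /usr/local/bin/swap_robots.py loosen
   0 2 * * * /usr/local/bin/run-nutch-crawl.sh
   0 4 * * * python /usr/local/bin/swap_robots.py restore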
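
For the Apache option, a mod_rewrite ruleset along these lines would
do it. The IP address and the robots-nutch.txt filename are invented
for the example; the permissive copy would sit next to the real
robots.txt in the document root:

   RewriteEngine On
   # 192.168.1.50 = the machine Nutch runs on (placeholder)
   RewriteCond %{REMOTE_ADDR} ^192\.168\.1\.50$
   RewriteRule ^/?robots\.txt$ /robots-nutch.txt [L]

Everyone else keeps getting the normal robots.txt, so Google and Bing
behave exactly as before.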
There may be other kludges available.
Hope this helps.
\dmc
--
*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+
David M. Cole d...@colegroup.com
Editor & Publisher, NewsInc. <http://newsinc.net> V: (650) 557-2993
Consultant: The Cole Group <http://colegroup.com/> F: (650) 475-8479
*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+