Is anybody here familiar with how Dieselpoint (DP) works? I'm working on a contract to replace DP with Nutch, because the person paying me decided she didn't want to pay the licensing costs for DP.

One huge bone of contention has come up. On one of the sites she tells DP to index, she only wants a single page (evidently a search page that she passes some parameters to). DP is happy to fetch it, but Nutch looks at the site's robots.txt file, says "hey, I'm not supposed to crawl this directory", and won't download the page.

So she's mad at me, because it's somehow my fault that DP behaves differently from Nutch. She keeps saying "DP is a proper commercial product, they wouldn't be doing something they're not supposed to do" (to which I think, but don't say, "tell that to all the companies that have been screwed by Microsoft").

So is DP doing the right thing by fetching the requested page or not? I'm tempted to just write a script that uses wget to fetch that one page to a local directory, and then tell Nutch to crawl that directory.
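
For anyone who wants to see the kind of check involved, here is a minimal Python sketch of a robots.txt test using the standard library's urllib.robotparser. It is only an illustration of the general robots.txt check, not Nutch's actual code. The robots.txt contents, the example.com URLs, and the is_allowed helper are all made up for the example; I haven't shown the real site's rules here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents. The real site's rules are not shown
# in this post, so this is an assumption for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical URL standing in for the one search page she wants indexed.
    search_page = "http://example.com/search/results?q=widgets"
    print(is_allowed(ROBOTS_TXT, "Nutch", search_page))
```

With rules like these, a crawler that honors robots.txt would skip the search page while still being free to fetch pages outside /search/, which matches what I'm seeing from Nutch.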
-- http://www.linkedin.com/in/paultomblin