Hi Richard,

I'm not a Droids expert, but I can answer some of the more general questions below.

On Jan 12, 2010, at 2:44pm, Richard Frovarp wrote:

I'm trying to figure out the best way to do a few things with Droids and Solr. Perhaps I've missed the functionality in the API, as I'm just at the planning stage. Any ideas or best practices would be appreciated.

1) We currently use Nutch for everything: crawling, indexing, and searching. It has the notion of searching against the text of anchor tags of incoming links. Is this useful?

Yes. In fact I'd say it's critical to getting good search results.

How would one do this with Droids? I was thinking of keeping a db of outgoing links, and at the end of a crawl reversing the direction and updating the records that have new links pointing at them. Any other ideas?
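
The approach you describe (record outgoing links during the crawl, then invert them at the end) could be sketched roughly as follows. This is plain Python, not the Droids API; the function name invert_links and the in-memory dict standing in for the db are assumptions made for illustration.

```python
from collections import defaultdict


def invert_links(outlinks):
    """Turn {source_url: [(target_url, anchor_text), ...]} into
    {target_url: [anchor_text, ...]} so each page can be indexed
    together with the anchor text of the links pointing at it."""
    inbound = defaultdict(list)
    for source, links in outlinks.items():
        for target, anchor in links:
            text = " ".join(anchor.split())  # collapse runs of whitespace
            # Skip empty anchors and links from a page to itself.
            if text and target != source:
                inbound[target].append(text)
    return dict(inbound)


# Hypothetical crawl output: two pages linking to /b, plus one self-link.
outlinks = {
    "http://example.com/a": [
        ("http://example.com/b", "Droids  overview"),
        ("http://example.com/a", "top"),
    ],
    "http://example.com/c": [("http://example.com/b", "crawler docs")],
}
inbound = invert_links(outlinks)
```

The inbound anchor text for each target could then be written to a separate field of that page's index record, so only pages whose set of incoming links changed need updating.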

2) How best to do an incremental crawl? I'm going to want to do if-last-modified checks as I crawl.

Most crawlers, including Nutch, don't rely on any last modified data returned by servers. Why? Well, it's wrong much of the time. So you wind up having to fetch the content and (ideally) generate/compare a "signature hash" that tries to ignore meaningless differences such as a "number of visitors" counter.
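
As a rough illustration of the signature idea, here is a minimal Python sketch. It is not how Nutch or Droids implements it; the VOLATILE pattern list and the content_signature name are assumptions, and a real crawler would need patterns tuned to the sites it visits.

```python
import hashlib
import re

# Hypothetical patterns for text that changes on every fetch
# without the page really changing.
VOLATILE = [
    re.compile(r"\d[\d,]*\s+visitors", re.IGNORECASE),
]


def content_signature(text):
    """Hash the page text after stripping volatile fragments and
    normalizing case and whitespace."""
    for pattern in VOLATILE:
        text = pattern.sub(" ", text)
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


first_fetch = content_signature("Welcome! 1,024 visitors so far.")
second_fetch = content_signature("Welcome!  2,311 visitors so far.")
changed_page = content_signature("Goodbye! 1,024 visitors so far.")
```

If the stored signature equals the new one, the page can be treated as unchanged and skipped at index time, whatever the server says about its modification date.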

Is there a way to do if-last-modified checks in Droids? Any ideas as to how to best do this? Furthermore, I'm going to want to get the HTTP status code to determine if a record should be deleted.
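
I can't say how Droids exposes the HTTP status code, but once you have it, the decision logic might look something like this sketch. The function name action_for_status, the action labels, and the choice of which codes mean "delete" are all assumptions to adapt to your setup.

```python
def action_for_status(status):
    """Map an HTTP status code to what the indexer should do
    with the corresponding record."""
    if status in (404, 410):
        return "delete"    # page is gone, so drop it from the index
    if status == 304:
        return "skip"      # server says not modified since last fetch
    if 200 <= status < 300:
        return "reindex"   # got content; compare signatures, then update
    return "retry"         # redirects, server errors, etc.: handle later


actions = [action_for_status(code) for code in (200, 304, 404, 410, 503)]
```

Given how unreliable servers are about modification dates, a 304 is probably best treated as a hint, with a full re-fetch every so often anyway.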

3) Updating the Regex block rules. I want to dynamically exclude different paths from my crawl in the long term. If I see a certain meta tag in page content, I don't want to go any further down that path. Any way to do this?

As far as getting the meta tag, this is pretty easy with Tika since it returns a metadata object with all of the meta tags that it finds during the HTML parse.
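
For the dynamic exclusion part, one possible shape is sketched below in Python, with the standard-library HTML parser standing in for Tika. The meta tag name "crawl-stop", the blocked_prefixes set, and the record_page and is_blocked helpers are all hypothetical; the idea is only that seeing the tag adds the page's directory to a prefix list that is checked before each fetch.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

BLOCK_META = "crawl-stop"   # hypothetical meta tag name
blocked_prefixes = set()    # URL prefixes the crawler should not enter


class MetaCollector(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attributes = dict(attrs)
            name = attributes.get("name")
            content = attributes.get("content")
            if name and content is not None:
                self.meta[name.lower()] = content


def record_page(url, html):
    """If the page carries the blocking meta tag, block its directory."""
    parser = MetaCollector()
    parser.feed(html)
    if BLOCK_META in parser.meta:
        parts = urlsplit(url)
        directory = parts.path.rsplit("/", 1)[0] + "/"
        blocked_prefixes.add(f"{parts.scheme}://{parts.netloc}{directory}")


def is_blocked(url):
    """Check a candidate URL against the prefixes collected so far."""
    return any(url.startswith(prefix) for prefix in blocked_prefixes)


record_page(
    "http://example.com/private/index.html",
    '<html><head><meta name="crawl-stop" content="true"></head></html>',
)
```

After that call, is_blocked("http://example.com/private/deep/page.html") is true while URLs outside /private/ are still allowed. The same prefix list could be turned into regex rules if that is what the crawler's URL filter expects.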

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g



