Re: Couple of Droids questions

Ken Krugler Tue, 12 Jan 2010 15:18:31 -0800

Hi Richard,

I'm not a Droids expert, but I can answer some of the more general questions below.


On Jan 12, 2010, at 2:44pm, Richard Frovarp wrote:

I'm trying to figure out the best way to do a few things with Droids and Solr. Perhaps I've missed the functionality in the API, as I'm just at the planning stage. Any ideas or best practices would be appreciated.
1) We currently use Nutch for everything: crawling, indexing, and searching. It has the notion of searching against the text of anchor tags of incoming links. Is this useful?


Yes. In fact I'd say it's critical to getting good search results.

How would one do this with Droids? I was thinking of keeping a db of outgoing links, and at the end of a crawl reversing the direction and updating the records that have new links pointing at them. Any other ideas?
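
The reversal step described above could look roughly like the following. This is only a sketch in Python, not anything from the Droids API: the `invert_links` name and the shape of the outgoing-link store (a dict of source URL to `(target URL, anchor text)` pairs) are assumptions made for illustration.

```python
from collections import defaultdict


def invert_links(outlinks):
    """Turn a source -> [(target, anchor)] map into target -> [(source, anchor)].

    `outlinks` is a hypothetical in-memory stand-in for the db of
    outgoing links collected during the crawl.
    """
    inlinks = defaultdict(list)
    for source, links in outlinks.items():
        for target, anchor in links:
            anchor = anchor.strip()
            if anchor:  # skip links with no usable anchor text
                inlinks[target].append((source, anchor))
    return dict(inlinks)
```

The anchor texts collected for each target could then be written to an extra field on that target's index record.
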
2) How best to do an incremental crawl? I'm going to want to do if-last-modified checks as I crawl.

Most crawlers, including Nutch, don't rely on any last modified data returned by servers. Why? Well, it's wrong much of the time. So you wind up having to fetch the content and (ideally) generate/compare a "signature hash" that tries to ignore meaningless differences such as a "number of visitors" counter.
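
A rough sketch of that kind of signature, in Python for brevity. The normalization rules here (lowercase, replace every run of digits, collapse whitespace) are an assumption chosen to keep the example short, not what Nutch actually does, and the function names are made up.

```python
import hashlib
import re


def content_signature(text):
    """Hash the page text after stripping differences we consider noise."""
    normalized = text.lower()
    # Crude assumption: any run of digits (counters, dates) is noise.
    normalized = re.sub(r"\d+", "0", normalized)
    # Collapse whitespace so reformatting alone does not change the hash.
    normalized = re.sub(r"\s+", " ", normalized).strip()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def has_changed(old_signature, new_text):
    """Compare the stored signature against freshly fetched text."""
    return old_signature != content_signature(new_text)
```

Dropping all digits also hides real changes such as a new price or version number, so a production signature would need more careful rules.
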

Is there a way to do if-last-modified checks in Droids? Any ideas as to how to best do this? Furthermore, I'm going to want to get the HTTP status code to determine if a record should be deleted.
3) Updating the Regex block rules. I want to dynamically exclude different paths from my crawl in the long term. If I see a certain meta tag in page content, I don't want to go any further down that path. Any way to do this?

As far as getting the meta tag, this is pretty easy with Tika since it returns a metadata object with all of the meta tags that it finds during the HTML parse.
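
To make the idea concrete, here is a sketch of the "see a meta tag, block everything below it" step. It uses Python's standard `html.parser` as a stand-in for Tika's metadata object, and the meta tag name `crawler-stop`, the function names, and the generated rule format are all hypothetical. How such a rule would get loaded into Droids' regex filter is left open.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlsplit


class MetaTagCollector(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.meta[name.lower()] = content


def block_rule_for(url, html, tag_name="crawler-stop"):
    """Return a regex that excludes the page's directory, or None."""
    collector = MetaTagCollector()
    collector.feed(html)
    if tag_name not in collector.meta:
        return None
    parts = urlsplit(url)
    directory = parts.path.rsplit("/", 1)[0] + "/"
    prefix = f"{parts.scheme}://{parts.netloc}{directory}"
    return "^" + re.escape(prefix) + ".*"
```

A crawl loop could call `block_rule_for` on each fetched page and append any non-`None` result to its exclusion list before queueing that page's outgoing links.
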


-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
