Hi Matt,

I read through your comments last week and didn't have time to reply,
but I wanted to share some thoughts.

> Crawling in the same manner as Google is probably a disaster for any
> startup. Whole-web crawling is quite tricky & expensive, and Google
> has done such a good job already here that, once your crawl succeeds,
> how do you provide results that are noticeably better than Google's?
> Failure to differentiate your product is also a quick path to death
> for a startup.

I think at some level you are ignoring human intervention in this
process. A lot of what you are trying to achieve crawler-wise is doable
via tools already built into Nutch, e.g. prune, the url-filter plugins,
etc. Do you realize how much human intervention is involved in
maintaining Google's indexes? Assume you could assign an editor to each
million-page segment and allow pruning, adding, and so on, and you
would probably be close to what Google does. There is no way any
crawler can automagically deliver a high-quality index; there are just
too many variables out there in the wild.
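
For a concrete (if simplified) example of the kind of editorial control
I mean: the stock regex-urlfilter plugin reads a regex-urlfilter.txt
in which rules are tried top to bottom and the first match wins.
Something like the following (the host name is just a placeholder)
keeps a crawl inside one site and out of CGI traps:

    # skip URLs containing characters that usually mark CGI/session traps
    -[?*!@=]
    # keep anything under the target site (placeholder host)
    +^http://([a-z0-9-]*\.)*example\.com/
    # reject everything that didn't match above
    -.

An editor can tune a file like this per segment without touching any
crawler code.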

> If you bet on Nutch as your foundation but cannot build a
> differentiated product quickly, you'll be screwed, and you will drop
> out of the Nutch community and move on. Nutch will lose a
> possibly-valuable contributor. 

That is missing the point, IMO: Nutch can be used to build search
indexes that deliver result sets nearly identical to Google's, so it's
not a quality issue. I see the failure of SE startups as more of a
marketing issue. Have you considered Google's weak points: click fraud,
hostility toward privacy, etc.? Can you do better in those areas?

Solely blaming or relying on tech/Nutch to compete in SE land is a
"fish bowl" perspective.

Just my 2 cents.

John


Matt Kangas wrote:
> Dear nutch-user readers,
>
> I have a question for everyone here: Is the current Nutch crawler
> (Fetcher/Fetcher2) flexible enough for your needs?
> If not, what would you like to see it do?
>
> I'm asking because, last week, I suggested that the Nutch crawler
> could be much more useful to many people if it was structured more as
> a "crawler construction toolkit". But I realize that my comments could
> seem like sour grapes unless there's some plan for moving forward. So,
> I thought I'd just ask everybody what you think and tally the results.
>
> What kind of crawls would you like to do that aren't supported? I'll
> start with some nonstandard crawls I've done:
>
> 1) Outlinks-only crawl: crawl a specific website, keep only the
> outlinks from articles, etc.
> 2) Crawl into CGIs w/o infinite crawl -- via crawl-depth filter
> 3) Plug in a "feature detector" (address, date, brand-name, etc) and
> use this signal to guide the crawl
>
> 4) .... (fill in your own here!)
>
> -- 
> Matt Kangas / [EMAIL PROTECTED]
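
A footnote for anyone curious about (2) in Matt's list: one way to get
a crawl-depth limit is to hang it off Nutch's URLFilter extension
point. The sketch below is hypothetical -- the class name and the
urlfilter.depth.max property are invented for illustration -- and it
approximates depth by counting path segments, since a URLFilter only
ever sees the URL string:

    // Hypothetical URLFilter plugin: drops URLs nested deeper than a
    // fixed path depth. Class name and config property are illustrative.
    package org.example.nutch;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.net.URLFilter;

    public class DepthURLFilter implements URLFilter {

      private Configuration conf;
      private int maxDepth = 5;

      // Return the URL to keep it, or null to drop it.
      public String filter(String urlString) {
        int schemeEnd = urlString.indexOf("://");
        if (schemeEnd < 0) return null;            // malformed; drop it
        int pathStart = urlString.indexOf('/', schemeEnd + 3);
        if (pathStart < 0) return urlString;       // no path; keep it
        int depth = 0;
        for (int i = pathStart; i < urlString.length(); i++) {
          if (urlString.charAt(i) == '/') depth++;
        }
        return (depth <= maxDepth) ? urlString : null;
      }

      public void setConf(Configuration conf) {
        this.conf = conf;
        // made-up property name for this sketch
        this.maxDepth = conf.getInt("urlfilter.depth.max", 5);
      }

      public Configuration getConf() {
        return conf;
      }
    }

Wiring it in still takes the usual plugin.xml boilerplate, but the
point stands: per-crawl editorial policy like this is cheap to express.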
