Hi,
We're barely past the install stage with Nutch, and I'd like to ask the
more experienced users a few general questions before I jump in with
both feet.
I'm thinking about creating a country-specific (by TLD) search engine.
- Can Nutch be restricted to crawling only specific TLDs (e.g. .it or
.uk.com)? My suspicion is that I could easily modify Nutch to do this.
- Can I run crawlers on two separate machines, then merge the results
for search? I'm guessing yes, just looking for confirmation.
- If I only crawl a specific TLD, I think I would need a 'submit your
site' function. Does Nutch provide this? I didn't see it in our install,
and I'm wondering if it's a common practice.
- In the future I think I'd want to branch out to other TLDs while
keeping the results country-specific (e.g. .com sites that are relevant
to the country). I'm guessing this is a largish project that would
require substantial changes to the ranking algorithm to score a site's
'country-specificness'?
- I'm also considering hand-editing the crawl. Is this reasonably
possible? That is, I unleash the crawler on a seed set of sites, then
hand-approve any further sites the crawler finds from there. Actually,
I guess that's a double question: is it currently technically possible,
and am I an idiot for even thinking of such a task? :)
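To make the first question concrete, here is roughly the kind of TLD
restriction I have in mind. This is only a sketch in plain Python, not
Nutch code: the names ALLOWED_SUFFIXES, host_allowed and filter_urls are
made up for illustration, and I'm assuming a simple suffix match on the
hostname is good enough.

```python
from urllib.parse import urlsplit

# Hypothetical suffix list for illustration; not a Nutch setting.
ALLOWED_SUFFIXES = (".it", ".uk.com")


def host_allowed(url, suffixes=ALLOWED_SUFFIXES):
    """Return True if the URL's hostname ends with an allowed suffix."""
    # hostname is None for strings that do not parse as a URL with a host.
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    return any(host.endswith(suffix) for suffix in suffixes)


def filter_urls(urls, suffixes=ALLOWED_SUFFIXES):
    """Keep only the URLs whose host falls under an allowed suffix."""
    return [url for url in urls if host_allowed(url, suffixes)]
```

Matching on the parsed hostname rather than the raw URL string means a
host like example.it.evil.com is rejected, which a naive substring
check would let through.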
Thanks - I'm trying to get a handle on things I might run into before I
get too far into this. I'm confident I can make minor tweaks if needed,
but some of the above seem to me to need some heavy duty work if they're
not already available; perhaps more than I can do for what I'm looking
at as my next hobby.
Thanks!
- Using nutch for niche/country specific TLD Insurance Squared Inc.