Hi EM,
In the future, please use the right Subject: line...
EM wrote:
Hi,
I'd like to start with Nutch and I need some help and guidance. First, I've read the tutorial and everything I could find distributed with the project. Some questions are still open, so here I go.
I’d like to try to index two countries’ TLDs, which Google says are about 1 and 1.5 million pages respectively - from what I’ve read, Nutch should be able to handle that without problems. Can I start my crawling from this set of links (i.e. is there any way of obtaining all the results Google returns for ‘site:.xy’)? I doubt it, but I have to ask nonetheless ;)
Yes, it's possible - some people have alleged that the MSN bot was seeded this way. You can also take DMOZ data, under World / whatever... You can also visit each country's most popular portals and collect links from there.
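As a rough sketch (file names and the '.xy'/'.zz' TLDs here are placeholders, and the exact commands depend on your Nutch version), seeding from the DMOZ dump and keeping the crawl inside your TLDs would look roughly like this:

  # pull a sample of seed URLs out of the DMOZ dump
  mkdir urls
  bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 3000 > urls/seeds

  # then restrict the fetcher to your TLDs in conf/crawl-urlfilter.txt, e.g.
  +^http://([a-z0-9\-]+\.)+xy/
  +^http://([a-z0-9\-]+\.)+zz/
  # skip everything else
  -.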
When searching, can the user configure which TLD (of the two intended) is searched? Say, search TLD 1, TLD 2, or both?
It can be done easily with an extension. Please take a look at the "index-more" plugin - you need to put the TLD name into a separate field.
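For illustration only (the class and field names below are made up, and the plugin wiring follows the pattern of index-more rather than being copied from it), the core of such a filter is just pulling the TLD out of the host name and storing it in its own field:

  import java.net.MalformedURLException;
  import java.net.URL;

  // Hypothetical helper for an indexing-filter plugin modelled on index-more:
  // extract the TLD from a page URL so it can go into its own index field.
  public class TldExtractor {
    public static String getTld(String urlString) {
      try {
        String host = new URL(urlString).getHost().toLowerCase();
        int dot = host.lastIndexOf('.');
        return (dot >= 0) ? host.substring(dot + 1) : host;
      } catch (MalformedURLException e) {
        return "";
      }
    }
    // In your plugin's filter() method you would add this value to the Lucene
    // Document as an untokenized "tld" field, and add a matching query plugin
    // so users can restrict searches with e.g. tld:xy, tld:zz, or both.
  }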
The intended language isn’t English - what mapping should be done on the crawler side? I still want the search queries to come in as plain English, and to search the extended character set with them (map ‘xyz’ on a foreign keyboard to ‘abc’ in English (ASCII - something?)).
You need to modify Nutch Analyzer to strip accents. The same Analyzer will be used for indexing and querying.
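A minimal sketch of the normalization step such an Analyzer (or a TokenFilter inside it) would apply to every token - assuming a JDK that ships java.text.Normalizer; on older JDKs you would need an equivalent accent-stripping filter:

  import java.text.Normalizer;

  // Sketch of the accent-stripping step applied to each token, both at
  // indexing and at query time, so "café" and "cafe" match each other.
  public class AccentStripper {
    public static String strip(String token) {
      // Decompose characters (e.g. 'é' -> 'e' + combining accent), then drop the marks.
      String decomposed = Normalizer.normalize(token, Normalizer.Form.NFD);
      return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
      System.out.println(strip("café naïve"));  // prints "cafe naive"
    }
  }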
Any chance of auto-correcting the query to a more ‘intended’ one (like Google does: ‘Did you mean XYZ’)?
Look into the archives of this list for a "Did you mean" thread.
It starts as a hobby project so I don’t plan spending too much $ on it.
Can I run the crawler on my home DSL (160 kb/s down, 60 kb/s up) and then upload the database for the search engine to the server? (This might be obsolete depending on the next question.)
Depending on your settings, each downloaded page will take on average 15-20 kB of disk space. I tried in the past to run crawlers at one location and then move the segment data to another location... well, it was painful because of the size of the segment data.
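Rough back-of-envelope, using that 15-20 kB/page figure: 1-1.5 million pages comes out to something like 15-30 GB of segment data before any index overhead. Pushing 30 GB through a 60 kb/s uplink is on the order of weeks of continuous uploading (about a week even if that 60 is kilobytes per second), which is why I would not plan on shipping segments around.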
Can I run the search engine on a shared hosting account (the one I have at present has jailed ssh access) rather than on a dedicated server? (How hard will it be to switch to a dedicated server later on?)
Some ISPs disallow running public services on shared hosts - you need to make sure that you are able to do that (and that you are allowed to do that).
Moving to a dedicated server is no problem - you just move the data, install Tomcat, and you are done.
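For reference, the move is roughly this (a sketch - the war file name, the paths and the crawl directory are placeholders for whatever your installation uses):

  # drop the Nutch webapp into Tomcat
  cp nutch-0.7.war $CATALINA_HOME/webapps/ROOT.war

  # and point the webapp at the crawl data, e.g. via searcher.dir in nutch-site.xml:
  <property>
    <name>searcher.dir</name>
    <value>/path/to/crawl</value>
  </property>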
What are the real-world expectations about the requirements? For a subset of 3 million pages (HTML, DOC - what else can Nutch index?), what would be:
Expected space/bandwidth/processor/ram requirements?
Time crawling?
If you’ve read up to this point, thanks - I know it’s a long list of questions, but I had to ask them.
Please read the Wiki pages about the hw/sw/bandwidth requirements. Time for crawling will depend on your available bandwidth - in my experience, during crawling it is difficult to hit bottlenecks other than bandwidth. Other tasks will depend on your available disk I/O, RAM and CPU (in that order) and on the number of machines, if you decide to run some tasks in a distributed setup.
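To give you an order of magnitude (a rough estimate, assuming something like 10-15 kB of fetched HTML per page): 1.5 million pages is roughly 15-22 GB of transfer. If your 160 kb/s is kilobits (about 20 kB/s), that is on the order of two weeks of continuous fetching before politeness delays; if it is kilobytes per second, under two days. Either way, the fetch time is dominated by the line, not by the machine.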
--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
