My apologies for attaching my previous message to this thread.
Emilijan
EM wrote:
Hi,
I'd like to start with Nutch and I need some help and guidance. First, I've read the tutorial and everything I could find distributed with the project. Some questions are still open, so here I go.
I'd like to try to index two countries' TLDs, which, according to Google, hold about 1 and 1.5 million pages respectively; from what I've read, Nutch should be able to handle that without problems. Can I start my crawl from this set of links? (i.e., is there any way of obtaining all the results Google returns for 'site:.xy'?) I doubt it, but I have to ask nonetheless ;)
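Failing that, my fallback would be to seed the crawl from a hand-picked list of URLs and keep the fetcher inside the two TLDs with the URL filter; from the tutorial, my understanding is that something like the following should do it (.xy and .zz, and the seed hosts, are placeholders for the real ones):

    # urls/seed.txt -- a few well-linked start pages per TLD (hosts made up)
    http://www.example.xy/
    http://www.example.zz/

    # conf/regex-urlfilter.txt -- accept only the two TLDs, drop everything else
    +^http://([a-z0-9-]+\.)*[a-z0-9-]+\.xy(/|$)
    +^http://([a-z0-9-]+\.)*[a-z0-9-]+\.zz(/|$)
    -.

and then something like 'bin/nutch crawl urls -dir crawl -depth 5' to run the whole cycle. Is that the right idea?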
When searching, can the user choose which of the two TLDs to search? Say, TLD 1, TLD 2, or both?
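If there's nothing built in, would storing the TLD as an extra index field and filtering on it at query time be the way to go? A rough Lucene-level sketch of what I mean (the 'tld' field is my own invention, not a stock Nutch field as far as I know):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class TldFilter {
        // Wrap the user's query so only hits from the chosen TLD match;
        // for "both", simply skip the wrapping and run the query as-is.
        public static Query restrictToTld(Query userQuery, String tld) {
            BooleanQuery combined = new BooleanQuery();
            combined.add(userQuery, BooleanClause.Occur.MUST);
            combined.add(new TermQuery(new Term("tld", tld)), BooleanClause.Occur.MUST);
            return combined;
        }
    }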
The intended language isn't English; what mapping should be done on the crawler side? I still want the search queries to come in as plain English and to match the extended character set with them (i.e., map 'xyz' on the local keyboard to 'abc' in English; ASCII to something?).
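To illustrate the kind of mapping I mean, a toy folding routine (the character pairs are placeholders; the real table would cover the whole local alphabet, and I'd guess it has to be applied at both index and query time so the two sides agree):

    public class AsciiFolder {
        // Fold a few extended characters down to plain ASCII so a query
        // typed on an English keyboard can match them. Pairs are examples only.
        public static String fold(String text) {
            StringBuffer out = new StringBuffer(text.length());
            for (int i = 0; i < text.length(); i++) {
                char c = text.charAt(i);
                switch (c) {
                    case '\u00E9': out.append('e'); break; // e-acute -> e
                    case '\u00E8': out.append('e'); break; // e-grave -> e
                    case '\u00E7': out.append('c'); break; // c-cedilla -> c
                    default: out.append(c);
                }
            }
            return out.toString();
        }
    }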
Any chance of auto-correcting the query to a more 'intended' one (like Google's 'Did you mean XYZ?')?
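I noticed Lucene ships a spell-checker contrib module; would building its dictionary from the main index's terms be the usual route? Roughly what I picture (the paths and the 'content' field name are guesses on my part):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.spell.LuceneDictionary;
    import org.apache.lucene.search.spell.SpellChecker;
    import org.apache.lucene.store.FSDirectory;

    public class DidYouMean {
        public static void main(String[] args) throws Exception {
            // Build a spelling index from the terms of the existing search index.
            IndexReader reader = IndexReader.open(FSDirectory.getDirectory("crawl/index"));
            SpellChecker checker = new SpellChecker(FSDirectory.getDirectory("spell"));
            checker.indexDictionary(new LuceneDictionary(reader, "content"));
            // Ask for up to 5 close matches for a (possibly misspelled) word.
            String[] suggestions = checker.suggestSimilar("misspeled", 5);
            for (int i = 0; i < suggestions.length; i++) {
                System.out.println("Did you mean: " + suggestions[i]);
            }
        }
    }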
It starts as a hobby project, so I don't plan on spending too much $ on it.
Can I run the crawler on my home DSL (160 kb/s down, 60 kb/s up) and then upload the database for the search engine to the server? (This might be obsolete depending on the next question.)
Can I run the search engine on a shared hosting account (the one I have at present offers jailed SSH access) rather than on a dedicated server? (How hard would switching to a dedicated server be later on?)
What are the real-world expectations for the requirements? For a subset of 3 million pages (HTML, DOC; what else can Nutch index?), what would be:
the expected disk space/bandwidth/processor/RAM requirements?
the time spent crawling?
If you've read up to this point, thanks; I know it's a long list of questions, but I had to ask them.
Cheers, Emilijan
