Hello, I have some general questions about Nutch. I need to create a
simple web crawler, but we want to index a lot of documents, probably
about 100 million in the future. I have a couple of servers I can use, and
I want to distribute the index between them. Ideally, one computer would
crawl the web and fetch pages. Dedicated indexing machines would then each
take a subset of the fetched pages, index them, and remove the fetched
files to save disk space. Searches would be performed on those machines
and the results would then be combined. I realise that this will take a lot
of customisation. I was thinking about NDFS, but this will only utilize the
disk space of my servers, and not the processors, and I need search results
to be as fast as possible. So here are a couple of questions:
1. How difficult would it be to make the Nutch fetcher split the fetched data
into sections that I can then index on different computers? Is it worth it,
or would I be better off writing a crawler from scratch?
2. How processor-intensive is searching? Maybe NDFS would be good
enough?
3. This should probably go to the Lucene mailing list, but how well does
Lucene handle huge indexes? I have four 150GB drives, and they will fill up
with indexed data someday (the full text of documents will be indexed). Will
Lucene handle that?
4. The systems will be handling a lot of searches, so I need fast response
times. 1-2 seconds is acceptable, but not more. Is that possible with this
kind of setup? Search queries will be simple, with no fancy similarity or
complicated boolean searches.
5. Any thoughts or suggestions are welcome.
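
To make the layout described above more concrete, here is a rough Python
sketch of what I have in mind. It is not anything Nutch or Lucene provides:
the names shard_for and merge_results are made up for illustration, and it
assumes that scores returned by separate indexes are comparable, which may
not hold in practice.

```python
import hashlib
import heapq


def shard_for(url, num_shards):
    """Pick the indexing machine for a fetched page by hashing its URL.

    Hypothetical helper: the same URL always maps to the same shard,
    so each indexer only ever sees its own subset of the fetched pages.
    """
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def merge_results(per_shard_hits, k):
    """Combine the hit lists returned by each search machine.

    per_shard_hits is a list of lists of (score, url) tuples, one list
    per machine. Returns the k highest-scoring hits overall, best first.
    Assumes scores from different shards can be compared directly.
    """
    all_hits = (hit for hits in per_shard_hits for hit in hits)
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[0])
```

The crawler would call shard_for on every fetched URL to decide where the
page goes, and a front end would send each query to all machines and call
merge_results on the answers.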


Karol Rybak
Programmer
University of Internet Technology and Management
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
