To all Nutch gurus,

My question is about the Nutch distributed WebDB and distributed web crawling.

The Nutch website mentions that a distributed WebDB is implemented so that the WebDB database can span multiple machines. How do you configure the distributed WebDB? I looked through several email threads and could not find a satisfactory answer.

All I know is how to query a distributed WebDB: you create a search-servers.txt file on the Nutch client (i.e. the Tomcat server), list each host with its port number in that file, and run bin/nutch server <portnumber> on the other servers where the distributed WebDB resides. Is this understanding correct?

However, when the fetch runs and updates a WebDB that is going to grow beyond one machine, do you need to configure something specific so that the WebDB knows it has to expand onto another machine?

Also, does Nutch support distributed web crawling? For example, if I have a cluster of machines and need to use them for web crawling, can I tell Nutch in any way to spread the crawl across those machines?

Finally, if the cluster runs software that makes it behave like a single virtual machine, would it then not matter to the Nutch crawl procedure that this virtual machine internally distributes the load across multiple machines to perform the crawl?

Thanks in advance for any quick pointers on the above questions.

~ Ashish
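
P.S. To make the query-side setup I described concrete, here is a rough sketch of what I have in mind. The host names and port 9999 are made up for illustration, and I am assuming the file takes one "host port" pair per line; please correct me if the format or the command differs.

```text
search-servers.txt (on the Tomcat client; hosts and port are hypothetical):

    node1.example.com 9999
    node2.example.com 9999

On node1 and node2 (any further arguments the command may need are omitted):

    bin/nutch server 9999
```

This only covers searching across machines; it says nothing about how the WebDB itself would be split, which is the part I am unsure about.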
- [Nutch-general] Question: Nutch distributed WebDB... PATEL Ashish RD-ILAB-SSF
