Hello dear experts,
I'm new at my company, which was looking for a solution to crawl a
selection of 1,000,000 URLs.
Naturally I chose Nutch for its scalability and its Java code base.
I began working with Nutch three weeks ago and appreciate many things about it,
but I have some questions I can't answer:
How can I refresh my crawl? The script on the wiki page continues the
previous crawl. I tried to start a new crawl, but that is not possible because
the directory already exists.
If I want to add new URLs to my index, must I create a new index and then
merge it with the existing index?
I have heard about subcollections, but I don't understand what they are.
And finally, when I set up a MapReduce configuration following the how-to
on the wiki page and launch a crawl, I get this error in the Hadoop logs:
dfs.DataNode - Failed to transfer blk_-4263222254813988872 to tasktracker:50010
java.net.UnknownHostException: tasktracker
at java.net.PlainSocketImpl.connect(Unknown Source)
at java.net.SocksSocketImpl.connect(Unknown Source)
at java.net.Socket.connect(Unknown Source)
at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:797)
at java.lang.Thread.run(Unknown Source)
The tasktracker box was reachable through the administration interface before
I launched the crawl, yet it became unreachable afterwards. Why?
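In case it helps with the diagnosis, here is a minimal sketch of a check that
could be run on each node to see whether the hostname from the log resolves
there. The class name HostCheck and the method resolves are hypothetical and
not part of Nutch or Hadoop; the default hostname "tasktracker" is simply the
one shown in the exception above. It only tests name resolution, not whether
port 50010 is open.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper, not part of Nutch or Hadoop.
class HostCheck {

    // Returns true if this machine can resolve the given hostname
    // to an IP address (via /etc/hosts, DNS, or whatever the OS uses).
    static boolean resolves(String host) {
        try {
            InetAddress.getByName(host);
            return true;
        } catch (UnknownHostException e) {
            // Same exception type as in the DataNode stack trace.
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: "tasktracker" is the hostname taken from the log.
        String host = args.length > 0 ? args[0] : "tasktracker";
        if (resolves(host)) {
            System.out.println(host + " resolves on this machine");
        } else {
            System.out.println(host + " does NOT resolve on this machine");
        }
    }
}
```

Running it as "java HostCheck tasktracker" on the datanode that logged the
error should show whether that machine knows the name at all.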
Thanks a lot for your help, and congratulations to the developers and on
the wiki.