Hi, I started using Nutch recently and am now trying to get it to work with an HBase cluster. I have two questions. I'm not sure whether I should post them to this list or the HBase one, so let me know if this is not the right place.

At first, I followed Nutch2Tutorial [1], and Nutch fetched a few hundred thousand pages until I stopped it. Then I set up a small HBase cluster with 2 nodes and migrated the data to the cluster. I briefly checked with the HBase shell that the data had been migrated correctly. So far, so good.

I made a symbolic link to hbase-site.xml in NUTCH_HOME/runtime/local/conf and executed nutch. It seemed to start running, but after several minutes it threw the exception shown at the bottom of this email. I did a search, and it looks like it was caused by too many connections going to ZooKeeper. So I added the following lines to hbase-site.xml:

  <property>
    <name>hbase.zookeeper.property.maxClientCnxns</name>
    <value>100</value>
  </property>

I also added this line to zoo.cfg:

  maxClientCnxns=100

Then I restarted ZooKeeper and HBase and ran Nutch again, but the same problem occurred. The number of connections to ZK reached 100 after several minutes, and Nutch threw the same exception.

Q1. How can I fix this issue? Do I need any other settings for Nutch to use an HBase cluster correctly?

Q2. The second question is about Nutch and Hadoop. I didn't install the Hadoop JobTracker and TaskTracker because HBase itself doesn't need them according to an SO question [2], but does Nutch need them for some types of jobs? I looked for documents or diagrams that describe the overall architecture of Nutch with Gora and HBase, but couldn't find a good one.

Any help would be appreciated.

Thanks,
Kaz

[1] http://wiki.apache.org/nutch/Nutch2Tutorial
[2] http://stackoverflow.com/questions/10006649/hbase-do-i-need-jobtracker-tasktracker

2013-02-06 17:09:32,775 WARN zookeeper.ClientCnxn - Session 0x0 for server node1.xxxxxx.com/xxx.xxx.xxx.xxx:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcher.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198)
        at sun.nio.ch.IOUtil.read(IOUtil.java:166)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:245)
        at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:817)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1089)
2013-02-06 17:09:34,337 WARN mapred.FileOutputCommitter - Output path is null in cleanup
2013-02-06 17:09:34,337 WARN mapred.LocalJobRunner - job_local_0001
org.apache.hadoop.hbase.ZooKeeperConnectionException: HBase is able to connect to ZooKeeper but the connection closes immediately. This could be a sign that the server has too many connections (30 is the default). Consider inspecting your ZK server logs for that error and then make sure you are reusing HBaseConfiguration as often as you can. See HTable's javadoc for more information.
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:155)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getZooKeeperWatcher(HConnectionManager.java:1002)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.setupZookeeperTrackers(HConnectionManager.java:304)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.<init>(HConnectionManager.java:295)
        at org.apache.hadoop.hbase.client.HConnectionManager.getConnection(HConnectionManager.java:157)
        at org.apache.hadoop.hbase.client.HBaseAdmin.<init>(HBaseAdmin.java:90)
        at org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:108)
        at org.apache.gora.store.impl.DataStoreBase.readFields(DataStoreBase.java:181)
        at org.apache.gora.query.impl.QueryBase.readFields(QueryBase.java:222)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
        at org.apache.gora.util.IOUtils.deserialize(IOUtils.java:217)
        at org.apache.gora.util.IOUtils.deserialize(IOUtils.java:237)
        at org.apache.gora.query.impl.PartitionQueryImpl.readFields(PartitionQueryImpl.java:141)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
        at org.apache.gora.util.IOUtils.deserialize(IOUtils.java:217)
        at org.apache.gora.util.IOUtils.deserialize(IOUtils.java:237)
        at org.apache.gora.mapreduce.GoraInputSplit.readFields(GoraInputSplit.java:76)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
        at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:396)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:728)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
        at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:809)
        at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:837)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.createAndFailSilent(ZKUtil.java:903)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:133)
        ... 24 more
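
For anyone wanting to reproduce the ZooKeeper connection count mentioned in this message, one possible way is ZooKeeper's four-letter `cons` command, which lists the server's open client connections. The sketch below is only an illustration and not part of the setup described here: the names `count_connections` and `fetch_cons` and the sample lines are made up, the per-line format (`/ip:port[...](...)`) is assumed from typical `cons` output, and some ZooKeeper releases may restrict four-letter commands.

```python
import re
import socket
from collections import Counter

# Assumed line format of `cons` output: " /<ip>:<port>[<ops>](queued=...,...)".
# The greedy (.+) backtracks to the last ":<port>[" so IPv6 addresses also match.
_CONN_RE = re.compile(r"^\s*/(.+):(\d+)\[")


def count_connections(cons_output):
    """Return a Counter mapping client IP -> number of open connections."""
    counts = Counter()
    for line in cons_output.splitlines():
        match = _CONN_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def fetch_cons(host, port=2181, timeout=5.0):
    """Send the four-letter command `cons` to a ZooKeeper server, return the reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"cons")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the socket after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


# Made-up sample lines, used so the parser can be tried without a live server.
SAMPLE = """\
 /10.0.0.5:40012[1](queued=0,recved=12,sent=12)
 /10.0.0.5:40013[1](queued=0,recved=3,sent=3)
 /10.0.0.7:51877[0](queued=0,recved=1,sent=0)
"""

if __name__ == "__main__":
    # Against a real server: count_connections(fetch_cons("node1.example.com"))
    for ip, n in count_connections(SAMPLE).most_common():
        print(ip, n)
```

If a single client IP (the machine running Nutch) accounts for nearly all connections and the count keeps climbing toward the configured maxClientCnxns, that would be consistent with the "too many connections" hint in the exception message above.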