Nutch 2.1 + HBase cluster settings

k4200 Wed, 06 Feb 2013 01:48:47 -0800

Hi,

I started using Nutch recently and am now trying to get it to work
with an HBase cluster. I have two questions. I'm not sure whether they
belong on this list or the HBase one, so let me know if this is not
the right place.


At first, I followed the Nutch2Tutorial [1], and Nutch fetched a few
hundred thousand pages before I stopped it. Then I set up a small
two-node HBase cluster and migrated the data to it. I briefly checked
in the HBase shell that the data had been migrated correctly. So far,
so good.

I made a symbolic link to hbase-site.xml in
NUTCH_HOME/runtime/local/conf and ran Nutch. It seemed to start
normally, but after several minutes it threw the exception shown at
the bottom of this email. From a search, it looks like it was caused
by too many connections to ZooKeeper.

So, I added the following lines to hbase-site.xml:
  <property>
    <name>hbase.zookeeper.property.maxClientCnxns</name>
    <value>100</value>
  </property>

I also added this line to zoo.cfg:
maxClientCnxns=100

Then, I restarted ZooKeeper and HBase, and ran Nutch again, but the
same problem occurred. The number of connections to ZK reached 100
after several minutes and Nutch threw the same exception.
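
For anyone who wants to check the same thing, the per-client connection
count can be read from ZooKeeper itself. The following is only a
minimal sketch: it assumes the ZooKeeper "cons" four-letter command is
available on the server, and that each connection line in its reply
starts with "/ip:port[" (the exact reply format may differ between
ZooKeeper versions). The class name ZkConnectionCount and its methods
are hypothetical helpers, not part of Nutch, Gora, or HBase.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration; not part of Nutch, Gora, or HBase.
public class ZkConnectionCount {

    // Sends a four-letter command (e.g. "cons") to a ZooKeeper server
    // and returns the raw text reply.
    static String fourLetterWord(String host, int port, String cmd) throws IOException {
        try (Socket socket = new Socket(host, port)) {
            socket.setSoTimeout(5000);
            OutputStream out = socket.getOutputStream();
            out.write(cmd.getBytes(StandardCharsets.US_ASCII));
            out.flush();
            // Like "nc": signal end of input before reading the reply.
            socket.shutdownOutput();
            InputStream in = socket.getInputStream();
            ByteArrayOutputStream reply = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                reply.write(buf, 0, n);
            }
            return new String(reply.toByteArray(), StandardCharsets.US_ASCII);
        }
    }

    // Counts connections per client address in a "cons" reply.
    // Assumed line format: " /10.0.0.5:41234[1](queued=0,...)".
    static Map<String, Integer> countByClient(String consOutput) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String raw : consOutput.split("\n")) {
            String line = raw.trim();
            if (!line.startsWith("/")) {
                continue; // not a connection line
            }
            int bracket = line.indexOf('[');
            if (bracket < 0) {
                continue;
            }
            // The last ':' before '[' separates the address from the port.
            int colon = line.lastIndexOf(':', bracket);
            if (colon < 1) {
                continue;
            }
            counts.merge(line.substring(1, colon), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 2181;
        Map<String, Integer> counts = countByClient(fourLetterWord(host, port, "cons"));
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

Running it as "java ZkConnectionCount node1 2181" (host name is a
placeholder) prints one line per client address with its connection
count. Since maxClientCnxns is enforced per client IP address, a single
address climbing toward the limit would point at the Nutch process.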

Q1. How can I fix this issue? Do I need any other settings for Nutch
to use an HBase cluster correctly?

Q2. The second question is about Nutch and Hadoop. I didn't install
the Hadoop JobTracker and TaskTracker because HBase itself doesn't need
them according to an SO question [2], but does Nutch need them for some
types of jobs? I looked for documents or diagrams that describe the
overall architecture of Nutch with Gora and HBase, but couldn't find a
good one.

Any help would be appreciated.

Thanks,
Kaz

[1] http://wiki.apache.org/nutch/Nutch2Tutorial
[2] http://stackoverflow.com/questions/10006649/hbase-do-i-need-jobtracker-tasktracker

2013-02-06 17:09:32,775 WARN  zookeeper.ClientCnxn - Session 0x0 for server node1.xxxxxx.com/xxx.xxx.xxx.xxx:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcher.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198)
        at sun.nio.ch.IOUtil.read(IOUtil.java:166)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:245)
        at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:817)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1089)
2013-02-06 17:09:34,337 WARN  mapred.FileOutputCommitter - Output path is null in cleanup
2013-02-06 17:09:34,337 WARN  mapred.LocalJobRunner - job_local_0001
org.apache.hadoop.hbase.ZooKeeperConnectionException: HBase is able to connect to ZooKeeper but the connection closes immediately. This could be a sign that the server has too many connections (30 is the default). Consider inspecting your ZK server logs for that error and then make sure you are reusing HBaseConfiguration as often as you can. See HTable's javadoc for more information.
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:155)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getZooKeeperWatcher(HConnectionManager.java:1002)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.setupZookeeperTrackers(HConnectionManager.java:304)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.<init>(HConnectionManager.java:295)
        at org.apache.hadoop.hbase.client.HConnectionManager.getConnection(HConnectionManager.java:157)
        at org.apache.hadoop.hbase.client.HBaseAdmin.<init>(HBaseAdmin.java:90)
        at org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:108)
        at org.apache.gora.store.impl.DataStoreBase.readFields(DataStoreBase.java:181)
        at org.apache.gora.query.impl.QueryBase.readFields(QueryBase.java:222)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
        at org.apache.gora.util.IOUtils.deserialize(IOUtils.java:217)
        at org.apache.gora.util.IOUtils.deserialize(IOUtils.java:237)
        at org.apache.gora.query.impl.PartitionQueryImpl.readFields(PartitionQueryImpl.java:141)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
        at org.apache.gora.util.IOUtils.deserialize(IOUtils.java:217)
        at org.apache.gora.util.IOUtils.deserialize(IOUtils.java:237)
        at org.apache.gora.mapreduce.GoraInputSplit.readFields(GoraInputSplit.java:76)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
        at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:396)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:728)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
        at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:809)
        at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:837)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.createAndFailSilent(ZKUtil.java:903)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:133)
        ... 24 more
