Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "GORA_HBase" page has been changed by FerdyGalema:
http://wiki.apache.org/nutch/GORA_HBase?action=diff&rev1=10&rev2=11

Comment:
Reflect changes in trunk: Added hbase-gora-mapping and added thrift to exclude

- This document describes how to get Nutch 2.0 to use HBase as a backend for GORA and is based on the revision 993857 of the Nutch trunk
+ This document describes how to get Nutch to use HBase as a backend for GORA and is based on the revision 993857 of the Nutch trunk
  
   * Install and configure HBase 0.20.6. You can check it out from [[http://svn.apache.org/repos/asf/hbase/tags/0.20.6/|here]] ('''N.B.''' It is important that you grab HBase version 0.20.6, as this is the version supported by Gora)
  
-  * Add the following to nutch/ivy/ivy.xml (global exclusion):
- 
- {{{
- <exclude module="thrift" />
- }}}
- 
   * Specify the GORA backend in nutch-site.xml
  
  {{{
@@ -20, +14 @@

  }}}
  
  Note: Currently HBaseStore is NOT YET THREAD-SAFE, so all processes should have single-threaded settings (i.e. set number of fetchers to 1). Work to make it thread-safe is in progress.
  
-  * Create a mapping file for hbase in conf/gora-hbase-mapping.xml
- 
- {{{
- <?xml version="1.0" encoding="UTF-8"?>
- <gora-orm>
-     <table name="webpage">
-         <family name="p"/> <!-- This can also have params like compression, bloom filters -->
-         <family name="f"/>
-         <family name="s"/>
-         <family name="il"/>
-         <family name="ol"/>
-         <family name="h"/>
-         <family name="mtdt"/>
-         <family name="mk"/>
-     </table>
-     <class table="webpage" keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage">
-         <!-- fetch fields -->
-         <field name="baseUrl" family="f" qualifier="bas"/>
-         <field name="status" family="f" qualifier="st"/>
-         <field name="prevFetchTime" family="f" qualifier="pts"/>
-         <field name="fetchTime" family="f" qualifier="ts"/>
-         <field name="fetchInterval" family="f" qualifier="fi"/>
-         <field name="retriesSinceFetch" family="f" qualifier="rsf"/>
-         <field name="reprUrl" family="f" qualifier="rpr"/>
-         <field name="content" family="f" qualifier="cnt"/>
-         <field name="contentType" family="f" qualifier="typ"/>
-         <field name="protocolStatus" family="f" qualifier="prot"/>
-         <field name="modifiedTime" family="f" qualifier="mod"/>
-         <!-- parse fields -->
-         <field name="title" family="p" qualifier="t"/>
-         <field name="text" family="p" qualifier="c"/>
-         <field name="parseStatus" family="p" qualifier="st"/>
-         <field name="signature" family="p" qualifier="sig"/>
-         <field name="prevSignature" family="p" qualifier="psig"/>
-         <!-- score fields -->
-         <field name="score" family="s" qualifier="s"/>
-         <field name="headers" family="h"/>
-         <field name="inlinks" family="il"/>
-         <field name="outlinks" family="ol"/>
-         <field name="metadata" family="mtdt"/>
-         <field name="markers" family="mk"/>
-     </class>
- </gora-orm>
- }}}
   * Compile Nutch -> ant runtime
  
   * Make sure HBase is started and working properly as per the quick start tutorial [[http://hbase.apache.org/book/quickstart.html|here]]
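
The thread-safety note in the page asks for single-threaded settings while HBaseStore is not thread-safe. A minimal sketch of what that could look like in nutch-site.xml follows. It assumes that the `fetcher.threads.fetch` property from nutch-default.xml is the setting meant by "number of fetchers"; the page itself does not name the property, so treat this as an illustration rather than the documented configuration.

{{{
<!-- Assumption: fetcher.threads.fetch is the relevant setting;
     the wiki page only says "set number of fetchers to 1". -->
<property>
  <name>fetcher.threads.fetch</name>
  <value>1</value>
</property>
}}}

The fragment would sit inside the existing <configuration> element of nutch-site.xml, alongside the GORA backend property. Other jobs that run with multiple threads may need comparable settings, which this sketch does not cover.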

