Ninad:

Are you using Nutch crawling?  If not, out of interest, why not?  Have you
seen NUTCH-650 -- it works I believe (jdcryans?).

Your PermalinkTable is small?  Has only a few rows?  Maybe lower the size at
which this table splits by changing the flush and maximum file sizes -- see
hbase-default.xml.
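
For example, a sketch of an override in hbase-site.xml (the value here is
illustrative only -- a 64MB split threshold instead of the usual 256MB
default; check your hbase-default.xml for the current default and for the
related flush-size property):

  <property>
    <name>hbase.hregion.max.filesize</name>
    <value>67108864</value>
  </property>

A smaller maximum file size makes regions split sooner, so even a small
table spreads across more than one region server.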

St.Ack

On Mon, Apr 20, 2009 at 4:14 AM, Jean-Daniel Cryans <[email protected]> wrote:

> Ninad,
>
> Regarding the timeouts, I recently gave a tip in the thread "Tip when
> scanning and spending a lot of time on each row" which should solve
> your problem.
>
> Regarding your table, you should split it. In the shell, type the
> command "tools" to see how to use the "split" command. Issue a couple
> of them, waiting a bit between each call.
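>
> A hypothetical shell session (the table name is yours; "tools" lists
> the admin commands available in your shell version, including "split"):
>
>   hbase> tools
>   hbase> split 'PermalinkTable'
>
> Each call asks the master to split a region; give the cluster a moment
> to finish one split before issuing the next.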
>
> J-D
>
> On Mon, Apr 20, 2009 at 5:49 AM, Ninad Raut <[email protected]>
> wrote:
> > Hi,
> >
> > I have been trying to crawl data using MapReduce on HBase. Here is the
> scenario:
> >
> > 1) I have a fetch list which has all the permalinks to be fetched.
> > They are stored in a PermalinkTable.
> >
> > 2) A MapReduce job scans over each permalink, fetches the
> > data, and dumps it into a ContentTable.
> >
> > Here are the issues I face:
> >
> > The permalink table is not split, so I have just one map running on a
> > single machine. The benefit of using MapReduce is nullified.
> >
> > The MapReduce job keeps giving scanner timeout exceptions, causing task
> > failures and further delays.
> >
> >
> > If anyone can give me tips for this use case, it would really help me.
> >
>