NUTCH-650 looks good. I'll test it. Thanks for the direction.

...

On Mon, Apr 20, 2009 at 9:48 PM, stack <[email protected]> wrote:
> Ninad:
>
> Are you using Nutch crawling? If not, out of interest, why not? Have you
> seen NUTCH-650 -- it works I believe (jdcryans?).
>
> Your PermalinkTable is small? Has only a few rows? Maybe lower the size
> at which this table splits by changing the flush and maximum file size --
> see hbase-default.xml.
>
> St.Ack
>
> On Mon, Apr 20, 2009 at 4:14 AM, Jean-Daniel Cryans <[email protected]> wrote:
>
> > Ninad,
> >
> > Regarding the timeouts, I recently gave a tip in the thread "Tip when
> > scanning and spending a lot of time on each row" which should solve
> > your problem.
> >
> > Regarding your table, you should split it. In the shell, type the
> > command "tools" to see how to use the "split" command. Issue a couple
> > of them, waiting a bit between each call.
> >
> > J-D
> >
> > On Mon, Apr 20, 2009 at 5:49 AM, Ninad Raut <[email protected]> wrote:
> > > Hi,
> > >
> > > I have been trying to crawl data using MapReduce on HBase. Here is
> > > the scenario:
> > >
> > > 1) I have a fetch list with all the permalinks to be fetched. They
> > > are stored in a PermalinkTable.
> > >
> > > 2) A MapReduce job scans over each permalink, fetches its data, and
> > > dumps it in a ContentTable.
> > >
> > > Here are the issues I face:
> > >
> > > The PermalinkTable is not split, so I have just one map running on a
> > > single machine. The benefit of using MapReduce is nullified.
> > >
> > > The MapReduce job keeps throwing scanner timeout exceptions, causing
> > > task failures and further delays.
> > >
> > > If anyone can give me tips for this use case, it would really help.
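For concreteness, the manual split J-D describes looks roughly like this in
the HBase shell (a sketch only; the exact prompt and argument syntax vary by
HBase version, and PermalinkTable is the table name from this thread):

    hbase> tools                   # lists the tool commands, 'split' among them
    hbase> split 'PermalinkTable'
    # wait a bit between calls, as J-D advises, so the new regions deploy
    hbase> split 'PermalinkTable'

Lowering the size at which regions split, per St.Ack's suggestion, would be an
override in hbase-site.xml along these lines (the property name and values are
assumptions based on the hbase-default.xml of that era; check the copy shipped
with your version):

    <property>
      <name>hbase.hregion.max.filesize</name>
      <!-- assumed era default is 256MB; splitting at 64MB instead
           yields more regions, hence more map tasks -->
      <value>67108864</value>
    </property>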

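As for the scanner timeouts themselves: the usual cause is a scanner that
prefetches many rows per RPC while each row is processed slowly, so the region
server's scanner lease expires between next() calls. Below is a minimal
client-side sketch of the fix; it assumes a 0.20-style HBase client API (the
0.19 API current when this thread was written differs) and that era's
configuration property names:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;

    // Illustrative sketch; the class name is a placeholder, and the
    // property name below is assumed from HBase configs of that era.
    public class SlowRowScan {
      public static void main(String[] args) throws Exception {
        HBaseConfiguration conf = new HBaseConfiguration();
        // Fetch one row per next() RPC so the scanner lease is renewed
        // for every row rather than once per prefetched batch; a large
        // caching value plus slow per-row work is what lets it expire.
        conf.setInt("hbase.client.scanner.caching", 1);

        HTable table = new HTable(conf, "PermalinkTable");
        ResultScanner scanner = table.getScanner(new Scan());
        try {
          for (Result row : scanner) {
            // ... fetch the permalink's content, write it to ContentTable ...
          }
        } finally {
          scanner.close();
        }
      }
    }

Alternatively, or in addition, the lease itself can be lengthened with
hbase.regionserver.lease.period (milliseconds, 60000 by default) set in the
region servers' hbase-site.xml, since that setting takes effect server-side.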