Derek: Can you say more on the version of HBase you used and how it failed?
Thanks,
St.Ack

On Thu, Apr 23, 2009 at 12:31 PM, Derek Pappas <[email protected]> wrote:
> This is sort of off topic. We tried using HBase for storage in a crawler application. We gave up. We use Heritrix to crawl. We then use Lucene to index the arc files which contain the html files.
>
> Sent from my iPhone
>
> On Apr 22, 2009, at 10:41 PM, Andrew Purtell <[email protected]> wrote:
>
> > Hi Ninad,
> >
> > I developed a crawling application for HBase with the same basic design, if I understand you correctly.
> >
> > First, you can set the split threshold lower for your work table (the one which you run the TableMap job against). See this JIRA for more info in that regard:
> > https://issues.apache.org/jira/browse/HBASE-903
> >
> > As stack suggests, you can also manually split the work table. Really, you should also prime it with more than 1M jobs or similar, enough to store enough data for the splits to be meaningful. However, you also have to increase the scanner timeout and perhaps also the mapred job timeout to compensate for crawler maps which stall for long periods of time.
> >
> > After tinkering with this, however, I went in a different direction and used Heritrix 2.0 and the hbase-writer. See:
> > http://code.google.com/p/hbase-writer/
> >
> > Nutch would have been another option for me.
> >
> > Hope this helps,
> >
> >    - Andy
> >
> > > From: Ninad Raut
> > > Subject: Re: Crawling Using HBase as a back end --Issue
> > > To: [email protected]
> > > Date: Monday, April 20, 2009, 9:37 AM
> > >
> > > Nutch 650 looks good.. will test it. Thanks for the direction.
> > > ...
> > >
> > > On Mon, Apr 20, 2009 at 9:48 PM, stack <[email protected]> wrote:
> > >
> > > > Ninad:
> > > >
> > > > Are you using Nutch crawling? If not, out of interest, why not? Have you seen NUTCH-650 -- it works, I believe (jdcryans?).
> > > >
> > > > Your PermalinkTable is small? Has only a few rows? Maybe bring down the size at which this table splits by changing the flush and maximum file sizes -- see hbase-default.xml.
> > > >
> > > > St.Ack
> > > >
> > > > On Mon, Apr 20, 2009 at 4:14 AM, Jean-Daniel Cryans <[email protected]> wrote:
> > > >
> > > > > Ninad,
> > > > >
> > > > > Regarding the timeouts, I recently gave a tip in the thread "Tip when scanning and spending a lot of time on each row" which should solve your problem.
> > > > >
> > > > > Regarding your table, you should split it. In the shell, type the command "tools" to see how to use the "split" command. Issue a couple of them, waiting a bit between each call.
> > > > >
> > > > > J-D
> > > > >
> > > > > On Mon, Apr 20, 2009 at 5:49 AM, Ninad Raut <[email protected]> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I have been trying to crawl data using MapReduce on HBase. Here is the scenario:
> > > > > >
> > > > > > 1) I have a fetch list which holds all the permalinks to be fetched. They are stored in a PermalinkTable.
> > > > > >
> > > > > > 2) A MapReduce job scans over each permalink, fetches the data, and dumps it into a ContentTable.
> > > > > >
> > > > > > Here are the issues I face:
> > > > > >
> > > > > > The permalink table is not split, so I have just one map running on a single machine. The benefit of MapReduce is nullified.
> > > > > >
> > > > > > The MapReduce job keeps giving scanner timeout exceptions, causing task failures and further delays.
> > > > > >
> > > > > > If anyone can give me tips for this use case, it would really help me.
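
For the split-size and timeout knobs that Andrew and stack refer to, a minimal sketch of the relevant settings follows. The property names are the 0.19-era ones from hbase-default.xml and the Hadoop mapred configuration; the values are only illustrative, and the flush-size name in particular should be checked against the hbase-default.xml shipped with your version:

  <!-- hbase-site.xml on the region servers -->
  <property>
    <name>hbase.hregion.max.filesize</name>
    <value>67108864</value>   <!-- 64MB instead of the 256MB default, so the work table splits sooner -->
  </property>
  <property>
    <name>hbase.hregion.memcache.flush.size</name>
    <value>16777216</value>   <!-- smaller flushes, again to encourage earlier splits -->
  </property>
  <property>
    <name>hbase.regionserver.lease.period</name>
    <value>600000</value>     <!-- scanner lease; raise it so slow crawler maps do not time out between next() calls -->
  </property>

  <!-- hadoop-site.xml, or set on the JobConf -->
  <property>
    <name>mapred.task.timeout</name>
    <value>1800000</value>    <!-- 30 minutes before the framework kills a map it considers stuck -->
  </property>

A manual split of the work table, as J-D describes, is then just split 'PermalinkTable' from the shell (listed under "tools"), issued a few times with a pause between calls.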

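To make the shape of the job concrete, here is a minimal sketch of the kind of TableMap described above, written against the 0.19-era org.apache.hadoop.hbase.mapred API. The table and column names and the naive fetch() helper are illustrative only, not taken from the thread, and the map writes straight to the content table rather than going through the output collector:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.BatchUpdate;
import org.apache.hadoop.hbase.io.Cell;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.io.RowResult;
import org.apache.hadoop.hbase.mapred.TableMap;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class FetchMap extends MapReduceBase
    implements TableMap<ImmutableBytesWritable, BatchUpdate> {

  private HTable contentTable;

  public void configure(JobConf job) {
    try {
      // Open the output table once per map task.
      contentTable = new HTable(new HBaseConfiguration(job), "ContentTable");
    } catch (IOException e) {
      throw new RuntimeException("cannot open ContentTable", e);
    }
  }

  public void map(ImmutableBytesWritable row, RowResult value,
      OutputCollector<ImmutableBytesWritable, BatchUpdate> output,
      Reporter reporter) throws IOException {
    // "info:url" is an illustrative column holding the permalink.
    Cell cell = value.get(Bytes.toBytes("info:url"));
    if (cell == null) {
      return;
    }
    String url = Bytes.toString(cell.getValue());
    byte[] html = fetch(url);
    // Tell the framework we are alive so a slow fetch does not
    // trip mapred.task.timeout.
    reporter.progress();
    // Write the fetched content directly to the content table,
    // keyed by the same row as the permalink.
    BatchUpdate update = new BatchUpdate(row.get());
    update.put("content:html", html);
    contentTable.commit(update);
  }

  // Naive single-threaded fetch, purely for illustration.
  private byte[] fetch(String url) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    InputStream in = new URL(url).openStream();
    try {
      byte[] chunk = new byte[8192];
      int n;
      while ((n = in.read(chunk)) > 0) {
        buf.write(chunk, 0, n);
      }
    } finally {
      in.close();
    }
    return buf.toByteArray();
  }
}

Wiring the job up would go through something like TableMapReduceUtil.initTableMapJob("PermalinkTable", "info:url", FetchMap.class, ...) plus the timeout settings above. Once the work table has split into several regions, the framework schedules one map per region, which is what restores the parallelism Ninad is missing.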