Derek: Can you say more on the version of HBase you used and how it failed?
Thanks,
St.Ack

On Thu, Apr 23, 2009 at 12:31 PM, Derek Pappas <[email protected]> wrote:
> This is sort of off topic. We tried using HBase for storage in a crawler application. We gave up. We use Heritrix to crawl. We then use Lucene to index the arc files which contain the html files.
>
> Sent from my iPhone
>
> On Apr 22, 2009, at 10:41 PM, Andrew Purtell <[email protected]> wrote:
>
> > Hi Ninad,
> >
> > I developed a crawling application for HBase with the same basic design, if I understand you correctly.
> >
> > First, you can set the split threshold lower for your work table (the one which you run the TableMap job against). See this JIRA for more info in that regard:
> > https://issues.apache.org/jira/browse/HBASE-903
> >
> > As stack suggests, you can also manually split the work table. Really, you should also prime it with more than 1M jobs or similar, enough to store enough data for the splits to be meaningful. However, you also have to increase the scanner timeout and perhaps also the mapred job timeout to compensate for crawler maps which stall for long periods of time.
> >
> > After tinkering with this, however, I went in a different direction and used Heritrix 2.0 and the hbase-writer. See:
> > http://code.google.com/p/hbase-writer/
> >
> > Nutch would have been another option for me.
> >
> > Hope this helps,
> >
> >    - Andy
> >
> > > From: Ninad Raut
> > > Subject: Re: Crawling Using HBase as a back end --Issue
> > > To: [email protected]
> > > Date: Monday, April 20, 2009, 9:37 AM
> > >
> > > Nutch 650 looks good.. will test it. Thanks for the direction.
> > > ...
> > >
> > > On Mon, Apr 20, 2009 at 9:48 PM, stack <[email protected]> wrote:
> > >
> > > > Ninad:
> > > >
> > > > Are you using Nutch crawling? If not, out of interest, why not? Have you seen NUTCH-650 -- it works, I believe (jdcryans?).
> > > >
> > > > Your PermalinkTable is small? Has only a few rows? Maybe bring down the size at which this table splits by changing the flush and maximum file sizes -- see hbase-default.xml.
> > > >
> > > > St.Ack
> > > >
> > > > On Mon, Apr 20, 2009 at 4:14 AM, Jean-Daniel Cryans <[email protected]> wrote:
> > > >
> > > > > Ninad,
> > > > >
> > > > > Regarding the timeouts, I recently gave a tip in the thread "Tip when scanning and spending a lot of time on each row" which should solve your problem.
> > > > >
> > > > > Regarding your table, you should split it. In the shell, type the command "tools" to see how to use the "split" command. Issue a couple of them, waiting a bit between each call.
> > > > >
> > > > > J-D
> > > > >
> > > > > On Mon, Apr 20, 2009 at 5:49 AM, Ninad Raut <[email protected]> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I have been trying to crawl data using MapReduce on HBase. Here is the scenario:
> > > > > >
> > > > > > 1) I have a fetch list which holds all the permalinks to be fetched. They are stored in a PermalinkTable.
> > > > > >
> > > > > > 2) A MapReduce job scans over each permalink, fetches the data, and dumps it into a ContentTable.
> > > > > >
> > > > > > Here are the issues I face:
> > > > > >
> > > > > > The permalink table is not split, so I have just one map running on a single machine. The benefit of MapReduce is nullified.
> > > > > >
> > > > > > The MapReduce job keeps giving scanner timeout exceptions, causing task failures and further delays.
> > > > > >
> > > > > > If anyone can give me tips for this use case, it would really help me.
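
For the split-size and timeout knobs that Andrew and stack refer to, a minimal sketch of the relevant settings follows. The property names are the 0.19-era ones from hbase-default.xml and the Hadoop mapred configuration; the values are only illustrative, and the flush-size name in particular should be checked against the hbase-default.xml shipped with your version:

  <!-- hbase-site.xml on the region servers -->
  <property>
    <name>hbase.hregion.max.filesize</name>
    <value>67108864</value>   <!-- 64MB instead of the 256MB default, so the work table splits sooner -->
  </property>
  <property>
    <name>hbase.hregion.memcache.flush.size</name>
    <value>16777216</value>   <!-- smaller flushes, again to encourage earlier splits -->
  </property>
  <property>
    <name>hbase.regionserver.lease.period</name>
    <value>600000</value>     <!-- scanner lease; raise it so slow crawler maps do not time out between next() calls -->
  </property>

  <!-- hadoop-site.xml, or set on the JobConf -->
  <property>
    <name>mapred.task.timeout</name>
    <value>1800000</value>    <!-- 30 minutes before the framework kills a map it considers stuck -->
  </property>

A manual split of the work table, as J-D describes, is then just split 'PermalinkTable' from the shell (listed under "tools"), issued a few times with a pause between calls.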

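To make the shape of the job concrete, here is a minimal sketch of the kind of TableMap described above, written against the 0.19-era org.apache.hadoop.hbase.mapred API. The table and column names and the naive fetch() helper are illustrative only, not taken from the thread, and the map writes straight to the content table rather than going through the output collector:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.BatchUpdate;
import org.apache.hadoop.hbase.io.Cell;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.io.RowResult;
import org.apache.hadoop.hbase.mapred.TableMap;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class FetchMap extends MapReduceBase
    implements TableMap<ImmutableBytesWritable, BatchUpdate> {

  private HTable contentTable;

  public void configure(JobConf job) {
    try {
      // Open the output table once per map task.
      contentTable = new HTable(new HBaseConfiguration(job), "ContentTable");
    } catch (IOException e) {
      throw new RuntimeException("cannot open ContentTable", e);
    }
  }

  public void map(ImmutableBytesWritable row, RowResult value,
      OutputCollector<ImmutableBytesWritable, BatchUpdate> output,
      Reporter reporter) throws IOException {
    // "info:url" is an illustrative column holding the permalink.
    Cell cell = value.get(Bytes.toBytes("info:url"));
    if (cell == null) {
      return;
    }
    String url = Bytes.toString(cell.getValue());
    byte[] html = fetch(url);
    // Tell the framework we are alive so a slow fetch does not
    // trip mapred.task.timeout.
    reporter.progress();
    // Write the fetched content directly to the content table,
    // keyed by the same row as the permalink.
    BatchUpdate update = new BatchUpdate(row.get());
    update.put("content:html", html);
    contentTable.commit(update);
  }

  // Naive single-threaded fetch, purely for illustration.
  private byte[] fetch(String url) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    InputStream in = new URL(url).openStream();
    try {
      byte[] chunk = new byte[8192];
      int n;
      while ((n = in.read(chunk)) > 0) {
        buf.write(chunk, 0, n);
      }
    } finally {
      in.close();
    }
    return buf.toByteArray();
  }
}

Wiring the job up would go through something like TableMapReduceUtil.initTableMapJob("PermalinkTable", "info:url", FetchMap.class, ...) plus the timeout settings above. Once the work table has split into several regions, the framework schedules one map per region, which is what restores the parallelism Ninad is missing.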