NUTCH-650 looks good. I'll test it. Thanks for the direction.

...

On Mon, Apr 20, 2009 at 9:48 PM, stack <[email protected]> wrote:
> Ninad:
>
> Are you using Nutch crawling? If not, out of interest, why not? Have you
> seen NUTCH-650 -- it works I believe (jdcryans?).
>
> Your PermalinkTable is small? Has only a few rows? Maybe lower the size
> at which this table splits by changing the flush and maximum file size --
> see hbase-default.xml.
>
> St.Ack
>
> On Mon, Apr 20, 2009 at 4:14 AM, Jean-Daniel Cryans <[email protected]> wrote:
>
> > Ninad,
> >
> > Regarding the timeouts, I recently gave a tip in the thread "Tip when
> > scanning and spending a lot of time on each row" which should solve
> > your problem.
> >
> > Regarding your table, you should split it. In the shell, type the
> > command "tools" to see how to use the "split" command. Issue a couple
> > of them, waiting a bit between each call.
> >
> > J-D
> >
> > On Mon, Apr 20, 2009 at 5:49 AM, Ninad Raut <[email protected]> wrote:
> > > Hi,
> > >
> > > I have been trying to crawl data using MapReduce on HBase. Here is
> > > the scenario:
> > >
> > > 1) I have a fetch list with all the permalinks to be fetched. They
> > > are stored in a PermalinkTable.
> > >
> > > 2) A MapReduce job scans over each permalink, fetches its data, and
> > > dumps it in a ContentTable.
> > >
> > > Here are the issues I face:
> > >
> > > The PermalinkTable is not split, so I have just one map running on a
> > > single machine. The benefit of using MapReduce is nullified.
> > >
> > > The MapReduce job keeps throwing scanner timeout exceptions, causing
> > > task failures and further delays.
> > >
> > > If anyone can give me tips for this use case, it would really help.
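For concreteness, the manual split J-D describes looks roughly like this in
the HBase shell (a sketch only; the exact prompt and argument syntax vary by
HBase version, and PermalinkTable is the table name from this thread):

    hbase> tools                   # lists the tool commands, 'split' among them
    hbase> split 'PermalinkTable'
    # wait a bit between calls, as J-D advises, so the new regions deploy
    hbase> split 'PermalinkTable'

Lowering the size at which regions split, per St.Ack's suggestion, would be an
override in hbase-site.xml along these lines (the property name and values are
assumptions based on the hbase-default.xml of that era; check the copy shipped
with your version):

    <property>
      <name>hbase.hregion.max.filesize</name>
      <!-- assumed era default is 256MB; splitting at 64MB instead
           yields more regions, hence more map tasks -->
      <value>67108864</value>
    </property>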

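As for the scanner timeouts themselves: the usual cause is a scanner that
prefetches many rows per RPC while each row is processed slowly, so the region
server's scanner lease expires between next() calls. Below is a minimal
client-side sketch of the fix; it assumes a 0.20-style HBase client API (the
0.19 API current when this thread was written differs) and that era's
configuration property names:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;

    // Illustrative sketch; the class name is a placeholder, and the
    // property name below is assumed from HBase configs of that era.
    public class SlowRowScan {
      public static void main(String[] args) throws Exception {
        HBaseConfiguration conf = new HBaseConfiguration();
        // Fetch one row per next() RPC so the scanner lease is renewed
        // for every row rather than once per prefetched batch; a large
        // caching value plus slow per-row work is what lets it expire.
        conf.setInt("hbase.client.scanner.caching", 1);

        HTable table = new HTable(conf, "PermalinkTable");
        ResultScanner scanner = table.getScanner(new Scan());
        try {
          for (Result row : scanner) {
            // ... fetch the permalink's content, write it to ContentTable ...
          }
        } finally {
          scanner.close();
        }
      }
    }

Alternatively, or in addition, the lease itself can be lengthened with
hbase.regionserver.lease.period (milliseconds, 60000 by default) set in the
region servers' hbase-site.xml, since that setting takes effect server-side.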