Lucene ended up being used to index the contents of the Heritrix arc files. Tried using HBase to store the HTML files and look them up, but ran into bugs. Long story...
Sent from my iPhone

On Apr 23, 2009, at 2:23 PM, "Jonathan Gray" <[email protected]> wrote:

Lucene is rather poor at storing things and rather excellent at indexing them. HBase is excellent at storing things and very poor at (or not capable of) indexing them. If you just need indexing, are dealing with sharding of your indexes, and don't need to store the original content, then Lucene is all you need. Storing the original content inside of Lucene will dramatically increase index size, and it's not always as easy to partition, so sometimes you need a separate persistent store.

Were you trying to index in HBase? Or what was it that turned you away?

JG

On Thu, April 23, 2009 12:31 pm, Derek Pappas wrote:

This is sort of off topic. We tried using HBase for storage in a crawler application. We gave up. We use Heritrix to crawl. We then use Lucene to index the arc files which contain the HTML files.

Sent from my iPhone

On Apr 22, 2009, at 10:41 PM, Andrew Purtell <[email protected]> wrote:

Hi Ninad,

I developed a crawling application for HBase with the same basic design, if I understand you correctly.

First, you can set the split threshold lower for your work table (the one which you run the TableMap job against). See this JIRA for more info in that regard: https://issues.apache.org/jira/browse/HBASE-903

As stack suggests, you can also manually split the work table. Really, you should also prime it with > 1M jobs or similar -- enough data for splits to be meaningful. However, you also have to increase the scanner timeout, and perhaps also the mapred job timeout, to compensate for crawler maps which stall for long periods of time.

After tinkering with this, however, I went in a different direction and used Heritrix 2.0 and the hbase-writer. See: http://code.google.com/p/hbase-writer/

Nutch would have been another option for me.

Hope this helps,

- Andy

From: Ninad Raut
Subject: Re: Crawling Using HBase as a back end -- Issue
To: [email protected]
Date: Monday, April 20, 2009, 9:37 AM

NUTCH-650 looks good.. I'll test it. Thanks for the direction.

...

On Mon, Apr 20, 2009 at 9:48 PM, stack <[email protected]> wrote:

Ninad: Are you using Nutch crawling? If not, out of interest, why not? Have you seen NUTCH-650 -- it works, I believe (jdcryans?).

Your PermalinkTable is small? Has only a few rows? Maybe lower the size at which this table splits by changing the flush and maximum file sizes -- see hbase-default.xml.

St.Ack

On Mon, Apr 20, 2009 at 4:14 AM, Jean-Daniel Cryans <[email protected]> wrote:

Ninad,

Regarding the timeouts, I recently gave a tip in the thread "Tip when scanning and spending a lot of time on each row" which should solve your problem.

Regarding your table, you should split it. In the shell, type the command "tools" to see how to use the "split" command. Issue a couple of them, waiting a bit between each call.

J-D

On Mon, Apr 20, 2009 at 5:49 AM, Ninad Raut <[email protected]> wrote:

Hi,

I have been trying to crawl data using MapReduce on HBase. Here is the scenario:

1) I have a fetch list which holds all the permalinks to be fetched. They are stored in a PermalinkTable.
2) A MapReduce job scans over each permalink, fetches the data, and dumps it into a ContentTable.

Here are the issues I face:

The PermalinkTable is not split, so I have just one map running on a single machine. The benefit of MapReduce gets nullified. The MapReduce keeps giving scanner timeout exceptions, causing task failures and further delays.

If anyone can give me tips for this use case it would really help me.
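
As a rough sketch of J-D's tip above, a manual split of the work table from the HBase shell would look something like the session below. The table name PermalinkTable is taken from Ninad's description; the exact command syntax may differ by release, so run "tools" in your shell first to confirm:

    $ hbase shell
    hbase(main):001:0> tools                      # lists admin commands, including split
    hbase(main):002:0> split 'PermalinkTable'     # ask the master to split the table's region(s)
    hbase(main):003:0> # wait a little while for the split to complete, then issue another
    hbase(main):004:0> split 'PermalinkTable'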

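Similarly, Andrew's and stack's tuning suggestions roughly translate into a handful of hbase-site.xml and mapred-site.xml overrides. This is only a minimal sketch: the property names are the 0.19/0.20-era ones and the values are illustrative assumptions, so check hbase-default.xml and mapred-default.xml for your release before copying them:

    <!-- hbase-site.xml: lower the region size so the work table splits sooner,
         and lengthen the scanner lease so slow crawler maps don't time out -->
    <property>
      <name>hbase.hregion.max.filesize</name>
      <value>67108864</value>   <!-- 64 MB instead of the 256 MB default -->
    </property>
    <property>
      <name>hbase.regionserver.lease.period</name>
      <value>300000</value>     <!-- 5 minutes instead of the 60 second default -->
    </property>

    <!-- mapred-site.xml: give stalled crawler maps more time before the
         job tracker kills the task -->
    <property>
      <name>mapred.task.timeout</name>
      <value>1800000</value>    <!-- 30 minutes -->
    </property>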