large reduce group sizes

2007-10-11 Thread Joydeep Sen Sarma
Hi all, I am facing a problem with aggregations where reduce groups are extremely large. It's a very common usage scenario: for example, someone might want the equivalent of 'count(distinct e.field2) from events e group by e.field1'. The natural thing to do is emit e.field1 as the map-key

RE: large reduce group sizes

2007-10-11 Thread Runping Qi
The values passed to reduce come from a disk-backed iterator. The problematic part is computing the distinct count. You have to keep the unique values in memory, or you have to use some other trick. One such trick is sampling. Another is to write the values out to disk, merge sort them, then read
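
A minimal sketch of the sort-then-scan idea mentioned above, in plain Java with no Hadoop dependency. It assumes the values of one reduce group have already been sorted (for example by an external merge sort on disk) and are non-null, so distinct values can be counted by comparing each value with its predecessor while holding only one value in memory. The names `SortedDistinctCount` and `countDistinct` are hypothetical and not part of any Hadoop API.

```java
import java.util.Arrays;
import java.util.Iterator;

public class SortedDistinctCount {

    // Counts distinct values in a stream that is already sorted.
    // Only the previous value is kept, so memory use does not grow
    // with the size of the reduce group.
    public static long countDistinct(Iterator<String> sortedValues) {
        long distinct = 0;
        String previous = null;
        while (sortedValues.hasNext()) {
            String current = sortedValues.next();
            // A new distinct value starts whenever the value changes.
            if (previous == null || !current.equals(previous)) {
                distinct++;
            }
            previous = current;
        }
        return distinct;
    }

    public static void main(String[] args) {
        Iterator<String> values =
            Arrays.asList("a", "a", "b", "c", "c", "c").iterator();
        System.out.println(countDistinct(values)); // prints 3
    }
}
```

The scan is only correct when equal values are adjacent, which is why the sort has to happen first.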

RE: large reduce group sizes

2007-10-11 Thread Joydeep Sen Sarma
Great! I didn't realize that the iterator was disk based. The below sounds very doable. Will give it a shot. Do you see this as an option in the mapred job (optionally sort values)? -Original Message- From: Runping Qi [mailto:[EMAIL PROTECTED] Sent: Thursday, October 11, 2007 2:04 PM To:

RE: large reduce group sizes

2007-10-11 Thread Joydeep Sen Sarma
Yeah - I am doing it with two MR jobs right now. Understood the second solution. Is this what Pig uses internally (lazy - should just look at the code)? (One of the issues is that the optimal implementation requires anticipating the group size. Easy to do by custom code, hard to do automatically
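
The message does not say what the two jobs do. One common way to split a distinct count into two passes is sketched below as an assumption, not as the poster's actual implementation: the first pass collapses duplicate (field1, field2) pairs, and the second counts the surviving pairs per field1. The sketch uses in-memory Java collections purely to show the data flow; in a real MapReduce job each pass would be a separate map and reduce stage. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class TwoPassDistinctCount {

    // Pass 1 (assumed): key on the composite (field1, field2) so that
    // duplicates collapse into a single record.
    public static Set<List<String>> dedupePairs(List<String[]> events) {
        Set<List<String>> pairs = new LinkedHashSet<>();
        for (String[] event : events) {
            pairs.add(Arrays.asList(event[0], event[1]));
        }
        return pairs;
    }

    // Pass 2 (assumed): key on field1 alone and count the unique pairs.
    public static Map<String, Long> countPerGroup(Set<List<String>> pairs) {
        Map<String, Long> counts = new TreeMap<>();
        for (List<String> pair : pairs) {
            counts.merge(pair.get(0), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String[]> events = Arrays.asList(
            new String[] {"u1", "x"},
            new String[] {"u1", "x"},
            new String[] {"u1", "y"},
            new String[] {"u2", "x"});
        System.out.println(countPerGroup(dedupePairs(events))); // prints {u1=2, u2=1}
    }
}
```

With this split, the second pass only needs a running counter per group, so no reducer has to hold a group's distinct values in memory.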

Re: large reduce group sizes

2007-10-11 Thread Ted Dunning
Why do you need to know the group size? Did I miss a transition in exactly what you are talking about? On 10/11/07 2:57 PM, Joydeep Sen Sarma [EMAIL PROTECTED] wrote: Yeah - I am doing it with two MR jobs right now. ... (One of the issues is that the optimal implementation requires

HBase performance

2007-10-11 Thread Rafael Turk
Hi All, Does anyone have comments about how HBase will perform in a 4 node cluster compared to an equivalent MySQL configuration? Thanks, Rafael

Re: HBase performance

2007-10-11 Thread Michael Bieniosek
MySQL and HBase are optimized for different operations. What are you trying to do? -Michael On 10/11/07 3:35 PM, Rafael Turk [EMAIL PROTECTED] wrote: Hi All, Does anyone have comments about how HBase will perform in a 4 node cluster compared to an equivalent MySQL configuration? Thanks,

coding question: user's global variables

2007-10-11 Thread James Yu
A very basic question: where to store my personal global variables such that the map and/or reduce functions can see it? Thanks, James

RE: HBase performance

2007-10-11 Thread Jim Kellerman
Performance always depends on the work load. However, having said that, you should read Michael Stonebraker's paper The End of an Architectural Era (It's Time for a Complete Rewrite) which was presented at the Very Large Database

Re: coding question: user's global variables

2007-10-11 Thread James Yu
Problem solved. Please ignore. On 10/11/07, James Yu [EMAIL PROTECTED] wrote: A very basic question: where to store my personal global variables such that the map and/or reduce functions can see it? Thanks, James

I have a new cluster (Xserve + 4 Mac Minis) How to Hadoop?

2007-10-11 Thread Bob Futrelle
I'm a rank beginner with clusters, but am determined to move into them, starting with Hadoop. I have a habuntu machine under VMware on my MacBook Pro for starters (got it on a DVD when visiting at the Googleplex). Now I've just received an Apple Xserve and four Mac Minis to set up a cluster.

Re: coding question: user's global variables

2007-10-11 Thread Bob Futrelle
Yeah, but what's the answer? - rpf On 10/11/07, James Yu [EMAIL PROTECTED] wrote: Problem solved. Please ignore. On 10/11/07, James Yu [EMAIL PROTECTED] wrote: A very basic question: where to store my personal global variables such that the map and/or reduce functions can see it?

Possible for recovering the corrupted sequence file?

2007-10-11 Thread qi wu
Hi, I am using Nutch/Hadoop in single-node mode. Nutch failed to generate a new segment, and in the Hadoop log I find the error message below: 2007-10-12 11:09:53,961 INFO crawl.Generator - Generator: jobtracker is 'local', generating exactly one partition. 2007-10-12 11:09:58,602 WARN

Re: coding question: user's global variables

2007-10-11 Thread James Yu
For example: I put all user global variables in a class I called MyGlobals public class MyGlobals { static public int var1; ... } Then, in whatever map function I have, I can refer to my globals like this: public void map(LongWritable key, Text value, OutputCollector output, Reporter