Hi all,
I am facing a problem with aggregations where reduce groups are
extremely large.
It's a very common usage scenario - for example, someone might want the
equivalent of 'count(distinct e.field2) from events e group by e.field1'.
The natural thing to do is to emit e.field1 as the map key.
The values to reduce are a disk-backed iterator.
The problematic part is to compute the distinct count.
You have to keep the unique values in memory, or you have to use some other
trick. One such trick is sampling. Another is to write the values out to
disk, do a merge sort, then read them back in sorted order.
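To illustrate the second trick (my sketch, not from the thread; plain Java with an in-memory sort standing in for the disk-backed merge sort): once the values of a group arrive in sorted order, the distinct count needs only the previously seen value in memory, no matter how large the group is.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

public class DistinctCount {
    // Count distinct values from a sorted iterator in a single pass,
    // holding only the previously seen value in memory.
    static int countDistinctSorted(Iterator<String> sorted) {
        int count = 0;
        String prev = null;
        while (sorted.hasNext()) {
            String cur = sorted.next();
            if (prev == null || !cur.equals(prev)) {
                count++;
            }
            prev = cur;
        }
        return count;
    }

    public static void main(String[] args) {
        // Stand-in for the values of one reduce group.
        List<String> values = new ArrayList<>();
        Collections.addAll(values, "b", "a", "c", "a", "b", "b");
        Collections.sort(values); // stands in for the on-disk merge sort
        System.out.println(countDistinctSorted(values.iterator())); // prints 3
    }
}
```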
Great - didn't realize that the iterator was disk-backed.
The below sounds very doable; will give it a shot. Do you see this as an
option in the mapred job (optionally sort values)?
-Original Message-
From: Runping Qi [mailto:[EMAIL PROTECTED]
Sent: Thursday, October 11, 2007 2:04 PM
Yeah - I am doing it with two MR jobs right now.
Understood the second solution. Is this what Pig uses internally (lazy -
should just look at the code)?
(One of the issues is that the optimal implementation requires
anticipating the group size. Easy to do with custom code, hard to do
automatically.)
Why do you need to know the group size?
Did I miss a transition in exactly what you are talking about?
On 10/11/07 2:57 PM, Joydeep Sen Sarma [EMAIL PROTECTED] wrote:
Yeah - I am doing it with two MR jobs right now.
...
(One of the issues is that the optimal implementation requires
Hi All,
Does anyone have comments on how HBase will perform on a 4-node cluster
compared to an equivalent MySQL configuration?
Thanks,
Rafael
MySQL and HBase are optimized for different operations. What are you trying to
do?
-Michael
On 10/11/07 3:35 PM, Rafael Turk [EMAIL PROTECTED] wrote:
Hi All,
Does anyone have comments on how HBase will perform on a 4-node cluster
compared to an equivalent MySQL configuration?
Thanks,
A very basic question: where should I store my personal global variables so
that the map and/or reduce functions can see them?
Thanks,
James
Performance always depends on the workload. Having said that, you should
read Michael Stonebraker's paper "The End of an Architectural Era (It's
Time for a Complete Rewrite)", which was presented at the Very Large Data
Bases (VLDB) conference.
Problem solved. Please ignore.
On 10/11/07, James Yu [EMAIL PROTECTED] wrote:
A very basic question: where should I store my personal global variables so
that the map and/or reduce functions can see them?
Thanks,
James
I'm a rank beginner with clusters, but am determined to move into
them, starting with Hadoop. For starters, I have a habuntu machine under
VMware on my MacBook Pro (got it on a DVD when visiting the
Googleplex).
Now I've just received an Apple Xserve and four Mac Minis to set up a cluster.
Yeah, but what's the answer?
- rpf
On 10/11/07, James Yu [EMAIL PROTECTED] wrote:
Problem solved. Please ignore.
On 10/11/07, James Yu [EMAIL PROTECTED] wrote:
A very basic question: where should I store my personal global variables so
that the map and/or reduce functions can see them?
Hi,
I am using Nutch/Hadoop in single-node mode. Nutch failed to generate a new
segment, and in the Hadoop log I find
the error message below:
2007-10-12 11:09:53,961 INFO crawl.Generator - Generator: jobtracker is
'local', generating exactly one partition.
2007-10-12 11:09:58,602 WARN
For example:
I put all user global variables in a class I called MyGlobals
public class MyGlobals {
public static int var1;
...
}
Then, in whatever map function I have, I can refer to my globals like this:
public void map(LongWritable key, Text value, OutputCollector output,
                Reporter reporter) throws IOException {
    int v = MyGlobals.var1;  // the global is visible here
    ...
}
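One caveat worth adding (my note, not from the thread): a static field like MyGlobals.var1 is only "global" within a single JVM. In a truly distributed job each task runs in its own JVM, so a value set in the driver program will not be visible inside map/reduce tasks there; the usual approach in that case is to pass values through the job configuration. A minimal single-JVM illustration of the static-holder pattern (MyGlobalsDemo and useGlobal are hypothetical names for this sketch):

```java
public class MyGlobalsDemo {
    // Holder class for "global" values, visible anywhere in the same JVM.
    static class MyGlobals {
        public static int var1;
    }

    // Stands in for a map/reduce function: reads the static directly.
    static int useGlobal() {
        return MyGlobals.var1 * 2;
    }

    public static void main(String[] args) {
        MyGlobals.var1 = 21;             // set once, before the "job" runs
        System.out.println(useGlobal()); // prints 42
    }
}
```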