Never mind, I got it.
Elia Mazzawi wrote:
I need some help with the implementation. I want the mapper to produce
key=id, value=type,timestamp
which is essentially string, string.
What do I give output.collect for the value? I want to store
type,timestamp, but it only takes <Text, IntWritable> and I want
<Text, Text>. What can I store in there?
Here is my mapper, which doesn't compile because output.collect
won't accept <Text, Text>:
public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private Text Key = new Text();
    private Text Value = new Text();

    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        // line is parsed and now I have 2 strings:
        // String S1; // contains the key
        // String S2; // contains the value
        Key.set(S1);
        Value.set(S2);
        // compile error here: output is declared as
        // OutputCollector<Text, IntWritable>, but Value is a Text
        output.collect(Key, Value);
    }
}
Miles Osborne wrote:
unless you have a gigantic number of items with the same id, this is
straightforward. have a mapper emit items of the form:
key=id, value=type,timestamp
and your reducer will then see all values that share the same id
together.
it is then a simple matter to process all items with the same id. for
example, you could simply read them into a list and work on them in any
manner you see fit.
(note that hadoop is perfectly fine at dealing with multi-line items.
all you need to do is make sure that the items you want to process
together all share the same key)
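The grouping described here can be sketched in plain Java, outside Hadoop (class and method names are illustrative, not Hadoop API): the shuffle phase hands each reducer call all the values that were emitted under one key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ShuffleSketch {
    // Group (id, "type,timestamp") pairs by id, the way the shuffle
    // phase would before handing each group to a reducer.
    static Map<String, List<String>> groupById(List<String[]> emitted) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String[] kv : emitted) {
            groups.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String[]> emitted = List.of(
            new String[]{"A1", "X,1215647404"},
            new String[]{"A2", "X,1215647405"},
            new String[]{"A1", "Y,1215647409"});
        // A reducer invoked for key "A1" sees both of its values together.
        System.out.println(groupById(emitted).get("A1"));
    }
}
```

The reducer itself never needs the lines in file order; it only needs all records for one id delivered as one group.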
Miles
2008/7/18 Elia Mazzawi <[EMAIL PROTECTED]>:
well here is the problem I'm trying to solve,
I have a data set that looks like this:
ID type Timestamp
A1 X 1215647404
A2 X 1215647405
A3 X 1215647406
A1 Y 1215647409
I want to count how many A1 Y rows show up within 5 seconds of an A1 X.
I was planning to have the data sorted by ID, then timestamp,
then read it backwards (or have it sorted by reverse timestamp),
going through it and caching all Y's for the same ID for 5 seconds, to
either find a matching X or not.
The results don't need to be 100% accurate.
So if hadoop gives me the same file with the same lines in order, then
this will work.
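Once all events for one ID are together and sorted by timestamp, the matching step is a simple scan. A minimal sketch in plain Java (not Hadoop code; the Event record and the 5-second rule are illustrative assumptions):

```java
import java.util.List;

public class WithinWindow {
    // One event for a single ID: its type ("X" or "Y") and a
    // timestamp in seconds. Record syntax requires Java 16+.
    record Event(String type, long ts) {}

    // Count Y events that occur within `window` seconds after the
    // most recent X event. Assumes `events` is sorted by timestamp,
    // which the planned sort (by ID, then timestamp) would produce.
    static int countMatchedYs(List<Event> events, long window) {
        int count = 0;
        Long lastX = null; // timestamp of the most recent X, if any
        for (Event e : events) {
            if (e.type().equals("X")) {
                lastX = e.ts();
            } else if (e.type().equals("Y")
                    && lastX != null && e.ts() - lastX <= window) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // The sample rows for ID A1: X at ...404, Y at ...409.
        List<Event> a1 = List.of(new Event("X", 1215647404L),
                                 new Event("Y", 1215647409L));
        System.out.println(countMatchedYs(a1, 5)); // prints 1
    }
}
```

This logic would live in the reducer, applied per ID, so no global line ordering is needed from Hadoop at all.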
It seems hadoop is really good at solving problems that depend on one
line at a time, but not on multiple lines?
hadoop has to get data in order and be able to work on multiple lines,
otherwise how could it be setting records in data-sorting benchmarks?
I'd appreciate other suggestions on how to go about this.
Jim R. Wilson wrote:
does wordcount get the lines in order? or are they random? can i have
hadoop return them in reverse order?
You can't really depend on the order in which the lines are given; it's
best to think of them as random. The purpose of MapReduce/Hadoop is
to distribute a problem among a number of cooperating nodes.
The idea is that any given line can be interpreted separately,
completely independent of any other line. So in wordcount, this makes
sense. For example, say you and I are nodes. Each of us gets half the
lines in a file and we can count the words we see and report on them -
it doesn't matter what order we're given the lines, or which lines
we're given, or even whether we get the same number of lines (if
you're faster at it, or maybe you get shorter lines, you may get more
lines to process in the interest of saving time).
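The two-node example can be sketched in plain Java (a toy simulation, not Hadoop code; names are illustrative): each "node" counts its own share of the lines, and merging the partial counts gives the same totals regardless of how or in what order the lines were split.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SplitCount {
    // Count words in a batch of lines; line order does not matter.
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines)
            for (String w : line.split("\\s+"))
                if (!w.isEmpty()) counts.merge(w, 1, Integer::sum);
        return counts;
    }

    // Merge two partial counts, as the reduce step would.
    static Map<String, Integer> merge(Map<String, Integer> a,
                                      Map<String, Integer> b) {
        Map<String, Integer> out = new HashMap<>(a);
        b.forEach((w, n) -> out.merge(w, n, Integer::sum));
        return out;
    }

    public static void main(String[] args) {
        List<String> all = List.of("to be or", "not to be");
        // Split the lines between two "nodes", in either order:
        Map<String, Integer> merged =
            merge(count(all.subList(1, 2)), count(all.subList(0, 1)));
        System.out.println(merged.equals(count(all))); // prints true
    }
}
```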
So if the project you're working on requires getting the lines in a
particular order, then you probably need to rethink your approach. It
may be that hadoop isn't right for your problem, or maybe that the
problem just needs to be attacked in a different way. Without knowing
more about what you're trying to achieve, I can't offer any specifics.
Good luck!
-- Jim
On Thu, Jul 17, 2008 at 4:41 PM, Elia Mazzawi
<[EMAIL PROTECTED]> wrote:
I have a program based on wordcount.java,
and I have files that are smaller than 64mb (so I believe each file
is one task).
So does wordcount get the lines in order, or are they random? Can I
have hadoop return them in reverse order?
Jim R. Wilson wrote:
It sounds to me like you're talking about hadoop streaming
(correct me
if I'm wrong there). In that case, there's really no "order" to the
lines being doled out as I understand it. Any given line could be
handed to any given mapper task running on any given node.
I may be wrong, of course, someone closer to the project could give
you the right answer in that case.
-- Jim R. Wilson (jimbojw)
On Thu, Jul 17, 2008 at 4:06 PM, Elia Mazzawi
<[EMAIL PROTECTED]> wrote:
Is there a way to have hadoop hand over the lines of a file backwards
to my mapper? As in, give the last line first.