Hi Gabriel,
Yes, this is indeed a small PoC to get familiar with Crunch in relation
to my problem. Basically, I have the following algorithm in mind:
1. Read data rows
2. Create custom keys for each of them, built from various attributes
of the data (this time it is just a simple hash code, but I would like
to emit multiple key-value pairs)
3. Group similar data based on the created keys
4. Iterate over the individual items in each group and do an extensive
comparison between all of them
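The four steps above could be sketched in plain Java, outside Crunch,
roughly as follows. This is only an illustration under assumptions: the
class name, the comma-separated row format and the "first:"/"last:" key
scheme are all made up, and the actual comparison is left as a
placeholder.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class GroupAndCompareSketch {

    // Step 2 (hypothetical key scheme): one row may produce several keys.
    // Here the keys are the lower-cased first and last comma-separated fields.
    static List<String> keysFor(String row) {
        String[] fields = row.split(",");
        List<String> keys = new ArrayList<>();
        keys.add("first:" + fields[0].trim().toLowerCase());
        keys.add("last:" + fields[fields.length - 1].trim().toLowerCase());
        return keys;
    }

    // Step 3: put each row into the group of every key it produced.
    static Map<String, List<String>> group(List<String> rows) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String row : rows) {
            for (String key : keysFor(row)) {
                groups.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
            }
        }
        return groups;
    }

    // Step 4: visit every ordered pair of distinct positions in one group,
    // which amounts to n*(n-1) comparisons for a group of size n.
    static int countOrderedPairs(List<String> group) {
        int comparisons = 0;
        for (int i = 0; i < group.size(); i++) {
            for (int j = 0; j < group.size(); j++) {
                if (i != j) {
                    comparisons++; // the real comparison would go here
                }
            }
        }
        return comparisons;
    }
}
```

Because a row is stored under every key it emits, the same pair of rows
can meet in more than one group, so the comparison step may need to
tolerate duplicates.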
I just built an outline in the test case to see what can be done and
how. Can you advise something better?
regards
Rahul
On 28-06-2012 12:30, Gabriel Reid wrote:
Hi Rahul,
Ok, looks like I misunderstood your code. In that case, you're indeed
correct that a PeekingIterator won't help you -- it looks like you will
indeed need to store the data in a collection per group in order to do
the processing that you're trying to do.
Am I correct in assuming that this code is an attempt to get familiar
with Crunch, and less about solving a real-world problem right now? If
you are trying to put together a solution for a problem, maybe you could
outline what you're trying to get to -- there may be a better way to get
there. I noticed that you're grouping values by the hash code of the
input line, which looks questionable to me.
Regards,
Gabriel
On Thu, Jun 28, 2012 at 8:05 AM, Rahul<[email protected]> wrote:
Hi Gabriel,
I am doing n*(n-1) comparisons here: every element is compared with
every other element, so a peeking iterator would not help much. It would
give me the next element, but I need to keep all the elements that have
already been accessed in another Collection so that I can iterate over
them again and again.
Or is there something else that would help here?
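The buffering approach described above might look roughly like this in
plain Java. It is a sketch under assumptions: the class and method names
are hypothetical, the similarity rule in main is invented, and the whole
group is assumed to fit in memory. Hadoop is documented to reuse the
value objects passed to reduce, so depending on the value type each
element may need to be copied before it is buffered; that is not shown
here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiPredicate;

class BufferedPairwise {

    // Copy a single-pass Iterable into a list so it can be traversed
    // repeatedly. The whole group has to fit in memory.
    static <T> List<T> buffer(Iterable<T> values) {
        List<T> buffered = new ArrayList<>();
        for (T value : values) {
            buffered.add(value);
        }
        return buffered;
    }

    // Apply the comparison to every ordered pair of distinct positions,
    // i.e. n*(n-1) calls for a group of size n, and count the matches.
    static <T> int countMatches(Iterable<T> values, BiPredicate<T, T> similar) {
        List<T> buffered = buffer(values);
        int matches = 0;
        for (int i = 0; i < buffered.size(); i++) {
            for (int j = 0; j < buffered.size(); j++) {
                if (i != j && similar.test(buffered.get(i), buffered.get(j))) {
                    matches++;
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> group = Arrays.asList("apple", "apply", "berry");
        // Hypothetical similarity rule: same first letter.
        BiPredicate<String, String> sameInitial = (a, b) -> a.charAt(0) == b.charAt(0);
        System.out.println(countMatches(group, sameInitial));
    }
}
```

If the comparison is symmetric, starting the inner loop at i + 1 would
halve the work to n*(n-1)/2 comparisons.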
regards,
Rahul
On 27-06-2012 17:48, Gabriel Reid wrote:
On Wed, Jun 27, 2012 at 1:41 PM, Rahul<[email protected]> wrote:
I am trying to create multiple iterators in a DoFn process method.
public void process(Pair<Integer, Iterable<TupleN>> input,
Emitter<Pair<String, Integer>> emitter) {}
Every time I ask for an iterator it gives back the same one, and thus I
cannot traverse the list again and again, as I am hitting the following
stack trace.
The reason the Iterable.iterator call always returns the same iterator
is that this behaviour is inherited from the reduce method of the Hadoop
Reducer class (and this behaviour is there because of the underlying way
in which Hadoop MapReduce functions). In both Crunch and pure MapReduce,
you've just got one shot at looping over an Iterable in a reducer (or a
DoFn that is functioning on a PGroupedTable).
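The one-shot behaviour can be imitated with a toy Iterable that hands
back the same Iterator on every call. This is a hypothetical stand-in
written only for illustration, not the actual Hadoop or Crunch class,
and the real classes may fail differently on a second pass (the original
report mentions a stack trace).

```java
import java.util.Arrays;
import java.util.Iterator;

class OneShotIterable<T> implements Iterable<T> {
    private final Iterator<T> single;

    OneShotIterable(Iterator<T> single) {
        this.single = single;
    }

    // Always returns the same Iterator instance, so a second loop
    // continues where the first one stopped instead of starting over.
    @Override
    public Iterator<T> iterator() {
        return single;
    }

    // Count how many elements one for-each pass over the Iterable yields.
    static <T> int count(Iterable<T> values) {
        int n = 0;
        for (T ignored : values) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        Iterable<String> values =
                new OneShotIterable<>(Arrays.asList("a", "b", "c").iterator());
        System.out.println(count(values)); // prints 3: first pass sees everything
        System.out.println(count(values)); // prints 0: iterator is already exhausted
    }
}
```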
If I understood your code correctly, you're trying to loop over an
Iterable while looking at two consecutive elements at a time. Probably
the easiest way of doing this is using the PeekingIterator class in
Google Guava
(http://docs.guava-libraries.googlecode.com/git-history/release/javadoc/com/google/common/collect/Iterators.html#peekingIterator%28java.util.Iterator%29).
This will allow you to look one element ahead within an iterator.
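To show the look-ahead idea without pulling in a dependency, here is a
minimal hand-rolled peeking iterator. It is a simplified stand-in, not
Guava's implementation; the class name and the countAdjacentEqual helper
are made up for the example. With Guava on the classpath,
Iterators.peekingIterator(iterator) would take its place.

```java
import java.util.Arrays;
import java.util.Iterator;

class SimplePeekingIterator<T> implements Iterator<T> {
    private final Iterator<T> delegate;
    private T peeked;          // element fetched early by peek(), if any
    private boolean hasPeeked; // true while 'peeked' holds an unconsumed element

    SimplePeekingIterator(Iterator<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public boolean hasNext() {
        return hasPeeked || delegate.hasNext();
    }

    // Return the next element without consuming it.
    public T peek() {
        if (!hasPeeked) {
            peeked = delegate.next(); // throws NoSuchElementException if exhausted
            hasPeeked = true;
        }
        return peeked;
    }

    @Override
    public T next() {
        if (!hasPeeked) {
            return delegate.next();
        }
        T result = peeked;
        hasPeeked = false;
        peeked = null;
        return result;
    }

    // Walk the source once and count consecutive elements that are equal.
    static <T> int countAdjacentEqual(Iterator<T> source) {
        SimplePeekingIterator<T> it = new SimplePeekingIterator<>(source);
        int count = 0;
        while (it.hasNext()) {
            T current = it.next();
            if (it.hasNext() && current.equals(it.peek())) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countAdjacentEqual(Arrays.asList(1, 1, 2, 2, 2, 3).iterator()));
    }
}
```

This stays within a single pass over the iterator, which is why it only
suits comparisons between neighbouring elements.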
Regards,
Gabriel