Rachit Soni created CRUNCH-373:
----------------------------------
Summary: Problem while Performing MapSide join with
ImmutableBytesWritable/Text
Key: CRUNCH-373
URL: https://issues.apache.org/jira/browse/CRUNCH-373
Project: Crunch
Issue Type: Bug
Components: Core
Reporter: Rachit Soni
Assignee: Josh Wills
I have been having trouble performing a map-side join with ImmutableBytesWritable
as the join key: the map created in the initialize method of MapSideJoinDoFn [1]
always ends up with only one value. With the same data set, a reduce-side join
works fine and gives me the correct result.
I have also made sure that the map fits in memory.
The two joins therefore produce different results. I dug into the code where the
map-side join is performed in MapSideJoinDoFn [1]: when the right side is read
into memory and converted to a map [2], every key gets overwritten with the
last key put into the map. It seems the map ends up referencing the same object
each time, because the key is not cloned before it is stored.
This only happens when I use ImmutableBytesWritable/Text; any other key type
works fine.
It looks like SeqFileReaderFactory (which I believe implements the PTable under
the hood for Writables) does indeed reuse keys/values [3], in much the same way
reducers do. So I think this code [4] needs to clone the keys/values rather
than just store them in a map.
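
The suspected failure mode can be sketched without Hadoop or Crunch on the
classpath. Everything below is a hypothetical stand-in written for
illustration: MutableKey plays the role of a mutable Writable such as Text or
ImmutableBytesWritable, and reusingReader mimics a reader that returns the same
key instance for every record. It is not the MapsideJoinStrategy code and not
the actual fix.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class KeyReuseSketch {

  /** Hypothetical stand-in for a mutable Writable key (Text, ImmutableBytesWritable). */
  static final class MutableKey {
    private byte[] bytes = new byte[0];

    static MutableKey of(String s) {
      MutableKey k = new MutableKey();
      k.set(s);
      return k;
    }

    void set(String s) {
      bytes = s.getBytes(StandardCharsets.UTF_8);
    }

    /** Deep copy, so later mutation of the original does not affect the copy. */
    MutableKey copy() {
      MutableKey k = new MutableKey();
      k.bytes = bytes.clone();
      return k;
    }

    @Override public boolean equals(Object o) {
      return o instanceof MutableKey && Arrays.equals(bytes, ((MutableKey) o).bytes);
    }

    @Override public int hashCode() {
      return Arrays.hashCode(bytes);
    }

    @Override public String toString() {
      return new String(bytes, StandardCharsets.UTF_8);
    }
  }

  /** Mimics a reader that refills and returns one shared key object per record. */
  static Iterable<MutableKey> reusingReader(List<String> keys) {
    final MutableKey shared = new MutableKey();
    return () -> {
      final Iterator<String> it = keys.iterator();
      return new Iterator<MutableKey>() {
        @Override public boolean hasNext() { return it.hasNext(); }
        @Override public MutableKey next() { shared.set(it.next()); return shared; }
      };
    };
  }

  /** Loads the "right side" into a map, optionally copying each key first. */
  static Map<MutableKey, String> load(List<String> keys, boolean copyKeys) {
    Map<MutableKey, String> map = new HashMap<>();
    for (MutableKey key : reusingReader(keys)) {
      String value = "value-" + key; // String is immutable, safe to store as-is
      map.put(copyKeys ? key.copy() : key, value);
    }
    return map;
  }

  /** Textual form of every key currently held by the map. */
  static Set<String> distinctKeys(Map<MutableKey, String> map) {
    Set<String> out = new TreeSet<>();
    for (MutableKey k : map.keySet()) {
      out.add(k.toString());
    }
    return out;
  }

  public static void main(String[] args) {
    List<String> keys = Arrays.asList("a", "b", "c");
    System.out.println("without copy: " + distinctKeys(load(keys, false)));
    System.out.println("with copy:    " + distinctKeys(load(keys, true)));
  }
}
```

Without the copy, every entry in the map holds the same key object, so all keys
read back as the last record and lookups for earlier keys miss. With the copy,
each key keeps its own bytes. In Crunch itself the copy would presumably go
through the PType rather than a hand-written copy() method (for example
something along the lines of PType.getDetachedValue); that is an assumption
here and has not been checked against the code at [4].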
Also, I am attaching a test which I wrote to reproduce the issue.
[1]
https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/lib/join/MapsideJoinStrategy.java#L131
[2]
https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/lib/join/MapsideJoinStrategy.java#L153
[3]
https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/io/seq/SeqFileReaderFactory.java#L88
[4]
https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/lib/join/MapsideJoinStrategy.java#L153