[ 
https://issues.apache.org/jira/browse/CRUNCH-278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13801568#comment-13801568
 ] 

Gabriel Reid commented on CRUNCH-278:
-------------------------------------

Looks good to me in general, and it seems to work as it should.

About the name, yeah, ReadableData isn't super clear, but it's of course 
difficult to think of something better. DataSourceIterable? 
MaterializableDataStream? MaterializableIterable? 

A few other things I came up with while going over it:
* if you don't materialize a PCollection in a whole stream of DoFns between a 
Source and where a ReadableData is being used (e.g. MapsideJoinStrategy), then 
everything gets run in-memory. I know that this is what was discussed/agreed 
upon, but it's also a pretty big change with the previous (default) behaviour 
which worries me a little bit. Maybe we should at least add some javadoc in 
MapsideJoin to make this clear.
* I think that DelegatingReadableData and DoFnIterator should probably be moved 
to the o.a.c.impl.mr.collect package. They're only used from that package 
right, and they're currently in o.a.c.util which I think is intended to include 
user-facing utils.
* in the default case (ReadableData is read directly from the source and DoFns 
are run there) it doesn't show up in the job plan dotfile at all. Definitely 
not a reason to block this ticket, but it is something that should be added in 
a follow-up ticket.
* the file headers seem a little wonky. I saw a couple of files with Cloudera 
headers, which I'm guessing is a bigger issue, as well as a few with some kind 
of messed up pasted-in file headers (i.e. TrevniReadableData.java). I know 
that's being super nit-picky (and I totally feel like I'm being _that_ guy for 
pointing it out), but I guess that's stuff that's probably best to keep in 
order.

> Improvements to MapsideJoin code
> --------------------------------
>
>                 Key: CRUNCH-278
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-278
>             Project: Crunch
>          Issue Type: Bug
>          Components: Core, MapReduce Patterns
>            Reporter: Josh Wills
>            Assignee: Josh Wills
>         Attachments: CRUNCH-278b.patch, CRUNCH-278.patch
>
>
> The fact that we have special-case code in the MapsideJoinStrategy for the 
> in-memory and MR-based Pipeline instances has always bugged me, so I set out 
> to eliminate the distinction between the two impls by creating a new 
> interface, ReadableSourceBundle<T>, that encapsulates the MR and in-memory 
> specific logic for doing mapside joins in order to remove the special-case 
> code in MapsideJoinStrategy and hopefully make other implementations that use 
> our mapside-join patterns much easier to test.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

Reply via email to