[ https://issues.apache.org/jira/browse/CRUNCH-557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Josh Wills updated CRUNCH-557:
------------------------------
    Attachment: CRUNCH-557.patch

Fix for this, which is tiny and probably a bit too hacky at the moment. [~gabriel.reid] or [~mkwhitacre], could you clean this up and merge it to master when you're happy with it? I'll be offline for the next few days.

> Fix file distribution from HDFS in Crunch-on-Spark
> --------------------------------------------------
>
>                 Key: CRUNCH-557
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-557
>             Project: Crunch
>          Issue Type: Bug
>            Reporter: Josh Wills
>         Attachments: CRUNCH-557.patch
>
>
> From the user list:
> I was trying to determine the effect of changing JoinStrategy on a Spark
> pipeline. I noticed that my pipeline works fine with DefaultJoinStrategy,
> however I could not get it to work with MapSideJoinStrategy and
> BloomFilterJoinStrategy. For MapSideJoinStrategy I get an exception [1] on
> the driver itself, and for BloomFilterJoinStrategy I get exceptions [2] in
> one of the stages. I have not tried any configuration changes, but I did run
> tests with datasets of different sizes to ensure that my PCollection is small
> enough to fit in memory. I am running Spark in yarn-client mode with Crunch
> 0.11.0-cdh5.4.2.
> [1] https://gist.github.com/anonymous/15d6c691b743ad392d42
> [2] https://gist.github.com/anonymous/b02a82401a30a69f1cff
> The bug is in the SparkRuntime.distributeFiles method, which needs to include
> a scheme for the URI it's handing to Spark.
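
For anyone reading along without the patch open, here is a rough sketch of what "include a scheme for the URI" could look like. This is not the contents of CRUNCH-557.patch and does not use Crunch, Hadoop, or Spark classes; the class name SchemeQualifier, the qualify helper, the namenode address, and the file paths are all hypothetical, and it relies only on java.net.URI. The idea is that a path with no scheme may be read as a local file by whatever receives it, so the scheme and authority of the default filesystem are attached first.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, for illustration only; not taken from the Crunch patch.
final class SchemeQualifier {

    /**
     * Returns the path unchanged if it already carries a scheme; otherwise
     * borrows the scheme and authority of defaultFs (assumed here to look
     * like hdfs://namenode:8020). The path must be absolute.
     */
    public static URI qualify(URI path, URI defaultFs) {
        if (path.getScheme() != null) {
            return path; // already qualified, e.g. hdfs://... or file:///...
        }
        try {
            return new URI(defaultFs.getScheme(), defaultFs.getAuthority(),
                    path.getPath(), null, null);
        } catch (URISyntaxException e) {
            // java.net.URI rejects a relative path combined with a scheme.
            throw new IllegalArgumentException("Cannot qualify " + path, e);
        }
    }

    public static void main(String[] args) {
        URI defaultFs = URI.create("hdfs://namenode:8020"); // assumed address
        // A scheme-less path gets the default filesystem's scheme and authority.
        System.out.println(qualify(URI.create("/tmp/crunch/cache.ser"), defaultFs));
        // A path that already has a scheme is left alone.
        System.out.println(qualify(URI.create("file:///tmp/local.ser"), defaultFs));
    }
}
```

In a real Hadoop-based fix the qualification would more likely come from the job's FileSystem and Configuration than from a hard-coded address; the sketch only shows the shape of the check.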