[jira] [Updated] (CRUNCH-557) Fix file distribution from HDFS in Crunch-on-Spark

Josh Wills (JIRA) Wed, 02 Sep 2015 15:57:07 -0700

    [ https://issues.apache.org/jira/browse/CRUNCH-557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Josh Wills updated CRUNCH-557:
------------------------------
    Attachment: CRUNCH-557.patch

Attached a fix for this; it is tiny and probably a bit too hacky at the moment. 
[~gabriel.reid] or [~mkwhitacre], could you clean this up and merge it to 
master when you're happy with it? I'll be offline for the next few days.

> Fix file distribution from HDFS in Crunch-on-Spark
> --------------------------------------------------
>
>                 Key: CRUNCH-557
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-557
>             Project: Crunch
>          Issue Type: Bug
>            Reporter: Josh Wills
>         Attachments: CRUNCH-557.patch
>
>
> From the user list:
> I was trying to determine the effect of changing the JoinStrategy on a Spark 
> pipeline. I noticed that my pipeline works fine with DefaultJoinStrategy; 
> however, I could not get it working with MapSideJoinStrategy or 
> BloomFilterJoinStrategy. For MapSideJoinStrategy I get an exception[1] on 
> the driver itself, and for BloomFilterJoinStrategy I get exceptions[2] in one 
> of the stages. I have not tried any configuration changes, but I did run 
> tests with datasets of different sizes to ensure that my PCollection is small 
> enough to fit in memory. I am running Spark in yarn-client mode with Crunch 
> 0.11.0-cdh5.4.2.
> [1] https://gist.github.com/anonymous/15d6c691b743ad392d42
> [2] https://gist.github.com/anonymous/b02a82401a30a69f1cff
> The bug is in the SparkRuntime.distributeFiles method, which needs to include 
> a scheme for the URI it's handing to Spark.
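
To illustrate the diagnosis above, here is a minimal sketch of what "including a scheme" could look like. It is not the attached patch: the class name QualifyUri, the helper qualify, the namenode address, and the file path are all hypothetical, and it uses only java.net.URI so it runs without Hadoop on the classpath. In Crunch itself the qualification would more plausibly go through the Hadoop FileSystem/Path API, but that is an assumption about the fix, not a description of it.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class QualifyUri {

    // Hypothetical helper: returns the path unchanged if it already carries a
    // scheme, otherwise prefixes it with the scheme and authority of the
    // default filesystem. Assumes a scheme-less path is absolute ("/...").
    static URI qualify(String path, URI defaultFs) {
        URI uri = URI.create(path);
        if (uri.getScheme() != null) {
            return uri; // already qualified, e.g. hdfs://... or file:/...
        }
        try {
            return new URI(defaultFs.getScheme(), defaultFs.getAuthority(),
                           uri.getPath(), null, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot qualify " + path, e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical default filesystem; in a real job this would come from
        // the Hadoop configuration rather than a literal.
        URI defaultFs = URI.create("hdfs://namenode:8020");
        System.out.println(qualify("/tmp/crunch/cache.ser", defaultFs));
        System.out.println(qualify("hdfs://other:8020/data/part-0", defaultFs));
    }
}
```

The point of the sketch is only that a bare path such as /tmp/crunch/cache.ser gives the receiving side no indication of which filesystem to read from, whereas the qualified form does.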



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
