[jira] [Created] (CRUNCH-557) Fix file distribution from HDFS in Crunch-on-Spark

Josh Wills (JIRA) Wed, 02 Sep 2015 15:55:16 -0700

Josh Wills created CRUNCH-557:
---------------------------------

             Summary: Fix file distribution from HDFS in Crunch-on-Spark
                 Key: CRUNCH-557
                 URL: https://issues.apache.org/jira/browse/CRUNCH-557
             Project: Crunch
          Issue Type: Bug
            Reporter: Josh Wills



From the user list:

I was trying to determine the effect of changing the JoinStrategy on a Spark
pipeline. My pipeline works fine with DefaultJoinStrategy, but I could not get
it working with MapSideJoinStrategy or BloomFilterJoinStrategy. With
MapSideJoinStrategy I get an exception[1] on the driver itself, and with
BloomFilterJoinStrategy I get exceptions[2] in one of the stages. I have not
tried any configuration changes, but I did run tests with datasets of
different sizes to ensure that my PCollection is small enough to fit in memory.
I am running Spark in yarn-client mode with Crunch 0.11.0-cdh5.4.2.

[1] https://gist.github.com/anonymous/15d6c691b743ad392d42
[2] https://gist.github.com/anonymous/b02a82401a30a69f1cff

The bug is in the SparkRuntime.distributeFiles method, which needs to include a 
scheme for the URI it's handing to Spark.
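
To illustrate what "include a scheme" means, here is a minimal sketch in plain
Java. It is not the actual Crunch patch: the class QualifyUri, the helper
qualify, and the hdfs://namenode:8020 address are all hypothetical, and a real
fix would more likely qualify the path against the job's Hadoop FileSystem
than build the URI by hand. The idea is only that a bare path such as
/tmp/crunch/part-0 carries no scheme, so the receiver cannot tell that it
lives on HDFS, whereas hdfs://namenode:8020/tmp/crunch/part-0 is unambiguous.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class QualifyUri {
    // Hypothetical helper: if the path already has a scheme, return it as-is;
    // otherwise borrow the scheme and authority of the default filesystem.
    // Assumes the schemeless path is absolute; a relative path combined with
    // an authority makes the URI constructor throw URISyntaxException.
    static URI qualify(String path, URI defaultFs) throws URISyntaxException {
        URI uri = new URI(path);
        if (uri.getScheme() != null) {
            return uri;
        }
        return new URI(defaultFs.getScheme(), defaultFs.getAuthority(),
                       uri.getPath(), null, null);
    }

    public static void main(String[] args) throws URISyntaxException {
        // Assumed default filesystem address, for illustration only.
        URI defaultFs = new URI("hdfs://namenode:8020");
        // Schemeless path: gets qualified with hdfs://namenode:8020.
        System.out.println(qualify("/tmp/crunch/part-0", defaultFs));
        // Already qualified: returned unchanged.
        System.out.println(qualify("hdfs://namenode:8020/tmp/x", defaultFs));
    }
}
```

With a qualified URI in hand, the file can be registered with Spark without
relying on how a schemeless path happens to be interpreted.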



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
