Josh Wills created CRUNCH-557:
---------------------------------
Summary: Fix file distribution from HDFS in Crunch-on-Spark
Key: CRUNCH-557
URL: https://issues.apache.org/jira/browse/CRUNCH-557
Project: Crunch
Issue Type: Bug
Reporter: Josh Wills
From the user list:
I was trying to determine the effect of changing the JoinStrategy on a Spark
pipeline. I noticed that my pipeline works fine with DefaultJoinStrategy;
however, I could not get it to work with MapSideJoinStrategy or
BloomFilterJoinStrategy. For MapSideJoinStrategy I get an exception [1] on the
driver itself, and for BloomFilterJoinStrategy I get exceptions [2] in one of
the stages. I have not tried any configuration changes, but I did run tests
with datasets of different sizes to ensure that my PCollection is small enough
to fit in memory.
I am running Spark in yarn-client mode with Crunch 0.11.0-cdh5.4.2.
[1] https://gist.github.com/anonymous/15d6c691b743ad392d42
[2] https://gist.github.com/anonymous/b02a82401a30a69f1cff
The bug is in the SparkRuntime.distributeFiles method, which needs to include a
scheme for the URI it's handing to Spark.
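A rough sketch of the kind of fix described above, using only java.net.URI: if
a path has no scheme, resolve it against the default filesystem URI before it
is handed on. This is not the actual SparkRuntime code; the class name
SchemeQualifier, the qualify helper, and the hdfs://namenode:8020 address are
all made up for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of Crunch or Spark.
final class SchemeQualifier {

    // Returns the path unchanged if it already carries a scheme; otherwise
    // prefixes it with the scheme and authority of the default filesystem.
    static String qualify(String path, String defaultFs) {
        URI uri = URI.create(path);
        if (uri.getScheme() != null) {
            return uri.toString();
        }
        URI base = URI.create(defaultFs);
        try {
            // Throws if the path is relative, since scheme + relative path is invalid.
            return new URI(base.getScheme(), base.getAuthority(),
                           uri.getPath(), null, null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot qualify path: " + path, e);
        }
    }

    public static void main(String[] args) {
        // Assumed default filesystem address; a real job would read it from
        // the Hadoop configuration.
        String defaultFs = "hdfs://namenode:8020";
        System.out.println(qualify("/tmp/crunch/part-00000", defaultFs));
        System.out.println(qualify("hdfs://namenode:8020/tmp/crunch/x", defaultFs));
    }
}
```

In Hadoop code the same effect is usually obtained with
FileSystem.makeQualified(Path) rather than by assembling the URI by hand.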
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)