Kiyan Ahmadizadeh created CRUNCH-73:
---------------------------------------
Summary: Scrunch applications using PipelineApp do not properly
serialize closures to MapReduce tasks.
Key: CRUNCH-73
URL: https://issues.apache.org/jira/browse/CRUNCH-73
Project: Crunch
Issue Type: Bug
Components: Scrunch
Affects Versions: 0.4.0
Reporter: Kiyan Ahmadizadeh
Assignee: Kiyan Ahmadizadeh
One of the great potential advantages of using Scala for writing MapReduce
pipelines is the ability to send side data as part of function closures, rather
than through Hadoop Configurations or the Distributed Cache. As an absurdly
simple example, consider the following Scala PipelineApp that divides all
elements of a numeric PCollection by an arbitrary argument:
object DivideApp extends PipelineApp {
  val divisor = Integer.valueOf(args(0))
  val nums = read(From.textFile("numbers.txt"))
  val dividedNums = nums.map { n => n / divisor }
  dividedNums.write(To.textFile("dividedNums"))
  run()
}
Executing this PipelineApp fails. MapReduce tasks see a value of "null" for
divisor (or 0 if divisor is forced to be a primitive numeric type). This
suggests that Scala function closures are not being serialized correctly: the
free variables of the closure (those captured from the enclosing scope, such as
divisor) arrive in the task with their default JVM values.
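For comparison, the behavior one would expect from closure serialization can be
sketched outside of Scrunch. The snippet below is a minimal illustration, not
the Scrunch code path: it assumes plain Java serialization as a stand-in for
however functions are shipped to tasks, and ClosureRoundTrip, roundTrip, and
makeDivider are hypothetical names introduced only for this example. A closure
that captures a local value should still hold that value after a
serialize/deserialize round trip:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object ClosureRoundTrip {
  // Serialize a value to bytes and read it back in the same JVM. This only
  // approximates shipping a function to a remote task (assumption).
  def roundTrip[T <: AnyRef](value: T): T = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  // Hypothetical helper: divisor is a method parameter, so its value is
  // stored inside the closure object and serialized along with it.
  def makeDivider(divisor: Int): Int => Int = n => n / divisor

  def main(args: Array[String]): Unit = {
    val divide = roundTrip(makeDivider(4))
    println(divide(20))
  }
}
```

Here the divisor is a method parameter, so it travels inside the closure
object. In DivideApp the divisor is a member of the application object instead,
which may be relevant to the failure, though this report does not establish the
cause.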