Hi Ted,

Thanks for your response; perhaps this will help clarify. I am trying to
read binary files stored across a number of servers.

The line used to build the RDD:
val BIN_pairRDD: RDD[(BIN_Key, BIN_Value)] =
  spark.newAPIHadoopFile("not.used", classOf[BIN_InputFormat],
    classOf[BIN_Key], classOf[BIN_Value], config)

In order to support this, we have the following custom classes:
- BIN_Key and BIN_Value as the key/value pair types for the RDD
- BIN_RecordReader and BIN_FileSplit to handle the special splits
- BIN_FileSplit overrides getLocations() and getLocationInfo(), and we have
verified that the right IP address is being sent to Spark (a sketch of how
these overrides fit together follows this list)
- BIN_InputFormat queries a database for the details of every split to be
created, i.e. which file to read and the IP address of the machine where
that file is stored locally
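
For reference, here is a minimal sketch of the locality-reporting piece.
The constructor and field names are illustrative assumptions, not our
actual BIN_FileSplit; it only shows where the preferred-host strings are
handed to Hadoop/Spark:

import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapred.SplitLocationInfo
import org.apache.hadoop.mapreduce.lib.input.FileSplit

// Illustrative sketch only, not the real BIN_FileSplit:
// a split pinned to a single host.
class BIN_FileSplit(path: Path, length: Long, host: String)
    extends FileSplit(path, 0L, length, Array(host)) {

  // The strings returned here become the split's preferred locations;
  // Spark matches them against the hosts its executors registered with.
  override def getLocations(): Array[String] = Array(host)

  // Optional richer locality info; false = the data is on disk, not in memory.
  override def getLocationInfo(): Array[SplitLocationInfo] =
    Array(new SplitLocationInfo(host, false))
}

On the Spark side, these strings surface as the preferred locations of the
corresponding RDD partition.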

When it works:
- No problems running a local job.
- No problems running on a cluster with one machine as the master and a
second machine hosting three workers along with the files to process.

When it fails:
- Running on a cluster with multiple workers and the files spread across
multiple machines: tasks are not assigned to the nodes where the files are
local (see the snippet below).
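
A quick way to inspect what Spark has actually recorded for each partition
(a sketch only, using the BIN_pairRDD built above and the standard
RDD.preferredLocations API):

// Print the preferred locations Spark recorded for each partition; an
// empty list here would explain why tasks get scheduled on arbitrary nodes.
BIN_pairRDD.partitions.foreach { p =>
  println(s"partition ${p.index} -> " +
    BIN_pairRDD.preferredLocations(p).mkString(", "))
}

If these print the expected IP addresses, the splits themselves look fine
and the mismatch would have to be on the scheduling side.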

Thanks,
Raajen
