Not sure about your specific situation, but it reminds me of wondering why a
job only gets one mapper assigned to it: is the total dataset big enough to
require partitioning? Checking the partition count right after the read would
tell you; see the sketch below.
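
If partitioning is the issue, one quick check is to look at the partition
count right after the read and widen it explicitly. A minimal Scala sketch,
assuming a SparkContext named sc; the glob path and the count of 200 are
illustrative, not taken from your job:

  // Check how many partitions the textFile read actually produced.
  val files = "hdfs:///nano-batches/part-*"          // hypothetical path
  val raw = sc.textFile(files, minPartitions = 200)  // hint a minimum split count
  println(s"partitions after read: ${raw.partitions.length}")

  // Thousands of tiny files can still end up in a handful of partitions;
  // an explicit repartition shuffles the data so downstream tasks spread
  // across the executors instead of staying on one node.
  val spread = raw.repartition(200)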

On Thursday, April 23, 2015, Pat Ferrel <[email protected]> wrote:

> Using Spark streaming to create a large volume of small nano-batch input
> files, ~4k per file, thousands of ‘part-xxxxx’ files. When reading the
> nano-batch files and doing a cooccurrence calculation, my tasks run only on
> the machine where the job was launched. I’m launching in “yarn-client” mode.
> The RDD is created using sc.textFile(“list of thousand files”)
>
> The driver calls sc.textFile, then creates several intermediate RDDs and
> finally a DrmRdd[Int], which goes into cooccurrence. From the read onward,
> all tasks run only on the machine where the driver was launched.
>
> What would cause the read to occur only on the machine that launched the
> driver? I’ve seen this with and without Yarn.
>
> Do I need to do something to the RDD after reading? Has some partition
> factor been applied to all derived RDDs?
