That’s a good point. We’re starting with small data and scaling up, so I think 
of it as large, but right now it isn’t. I overstated the thousands of files; 
there are only hundreds at the moment.

I bet that’s it.
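
If it helps confirm, a quick check is to print the partition count right after
the read. This is only a minimal sketch assuming a SparkContext named sc; the
path string is a placeholder for the real file list:

    val lines = sc.textFile("comma-separated list of part-xxxxx files")
    // If this is 1 (or very few), Spark schedules correspondingly few tasks,
    // so they can all end up on the launching machine.
    println("partitions: " + lines.partitions.length)
    // For comparison, the parallelism the cluster would normally aim for
    println("default parallelism: " + sc.defaultParallelism)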

On Apr 23, 2015, at 10:26 AM, Andrew Musselman <[email protected]> 
wrote:

Not sure about your specific situation, but it reminds me of wondering why a
job gets only one mapper assigned to it: is the total dataset big enough to
require partitioning?
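
If it does turn out to need splitting up, one option is to ask for more input
partitions at read time or to repartition afterward. This is just a rough
sketch assuming the stock Spark API; the path and the partition count of 64
are placeholders:

    // Hint a minimum number of input partitions when reading
    val lines = sc.textFile("list of part-xxxxx files", 64)
    // Or redistribute an RDD that has already been read
    val spread = lines.repartition(64)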

On Thursday, April 23, 2015, Pat Ferrel <[email protected]> wrote:

> Using Spark streaming to create a large volume of small nano-batch input
> files, ~4k per file, thousands of ‘part-xxxxx’ files. When reading the
> nano-batch files and doing a cooccurrence calculation, my tasks run only on
> the machine where the driver was launched. I’m launching in “yarn-client”
> mode. The RDD is created using sc.textFile(“list of thousands of files”)
> 
> The driver runs sc.textFile, then creates several intermediate RDDs and
> finally a DrmRdd[Int], which goes into the cooccurrence calculation. From
> the read onward, all tasks run only on the machine where the driver was
> launched.
> 
> What would cause the read to occur only on the machine that launched the
> driver? I’ve seen this with and without Yarn.
> 
> Do I need to do something to the RDD after reading? Has some partition
> factor been applied to all derived RDDs?
