There's no canonical way to do this, as far as I understand. For instance,
when running under YARN you have no way of knowing in advance where your
containers will be started. Moreover, if one of the containers fails, it
might be restarted on another machine, so the number of machines might
change at runtime.

To check the current number of machines, you can do something like this
(Python):

import socket

# Each partition emits the hostname of the executor it runs on;
# distinct() then collapses duplicates into the set of active machines.
machines = sc.parallelize(xrange(1000)) \
             .mapPartitions(lambda _: [socket.gethostname()]) \
             .distinct().collect()
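
To answer the original question below about sizing receiver DStreams, here
is a rough sketch of how that machine count could be used. The Kafka details
("zk-host:2181", "my-group", "my-topic") are placeholders, and keep in mind
that Spark does not guarantee one receiver per host:

from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

ssc = StreamingContext(sc, 5)  # 5-second batches

# One receiver-based DStream per machine found above; union them so
# downstream processing sees a single stream.
streams = [KafkaUtils.createStream(ssc, "zk-host:2181", "my-group",
                                   {"my-topic": 1})
           for _ in machines]
unified = ssc.union(*streams)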

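As for the `.mapMachines`-style API Jason mentions below, the closest
approximation I know of is one task per machine via the partition count,
though again placement is not guaranteed; `pull_chunk` here is a
hypothetical function standing in for the S3 read:

n = len(machines)
# One partition per machine; each task pulls its own chunk. Spark may
# still schedule several of these tasks on the same host.
sc.parallelize(xrange(n), n).foreach(lambda i: pull_chunk(i))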

On Fri, Aug 28, 2015 at 9:09 PM, Jason <ja...@jasonknight.us> wrote:

> I've wanted similar functionality too: when network-IO-bound (in my case,
> pulling things from S3 to HDFS) I wished there were a `.mapMachines`
> API so I wouldn't have to guess at the proper partitioning of a
> 'driver' RDD for `sc.parallelize(1 to N, N).map( i => pull the i'th chunk
> from S3 )`.
>
> On Thu, Aug 27, 2015 at 10:01 AM Young, Matthew T
> <matthew.t.yo...@intel.com> wrote:
>
>> What’s the canonical way to find out the number of physical machines in a
>> cluster at runtime in Spark? I believe SparkContext.defaultParallelism will
>> give me the number of cores, but I’m interested in the number of NICs.
>>
>>
>>
>> I’m writing a Spark streaming application to ingest from Kafka with the
>> Receiver API and want to create one DStream per physical machine for read
>> parallelism’s sake. How can I figure out at run time how many machines
>> there are so I know how many DStreams to create?
>>
>


-- 
Best regards, Alexey Grishchenko

phone: +353 (87) 262-2154
email: programme...@gmail.com
web:   http://0x0fff.com
