Thanks, Dean, for the point-by-point answers. I really appreciate it.

I would like to have your view on one more point:

As I understand it, Spark Streaming can be made highly available by 
running multiple master nodes (one active, the rest passive) within one 
cluster. Applying the same principle, i.e. active and passive master 
nodes in the same cluster, we can achieve high availability within each 
cluster. This raises two questions:

1. How does the master node replicate its running state to all the passive 
nodes, so that if it fails, one of the passive nodes can take over from there?
2. How will a worker node know where to send its results when it completes 
a job and the master node that assigned it has crashed? It seems that if a 
passive master node takes control as in point 1, it has to reschedule 
all jobs to the worker nodes again.
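
To make the two questions concrete, here is a toy Python model of the behaviour I am hoping for. It is only a sketch under my own assumptions, not how Spark actually works: SharedStore, Master, and the job names are all made up for illustration, and the "shared store" stands in for whatever mechanism would let a passive master see the active master's state.

```python
# Toy model only: SharedStore and Master are hypothetical names,
# and this is not Spark's actual recovery mechanism.

class SharedStore:
    """Stands in for some storage that every master node can reach."""
    def __init__(self):
        self.pending = set()    # jobs scheduled but not yet finished
        self.completed = set()  # jobs whose completion was reported


class Master:
    def __init__(self, name, store):
        self.name = name
        self.store = store

    def schedule(self, job):
        # Record the job in the shared store before handing it to a worker.
        self.store.pending.add(job)

    def report_done(self, job):
        # A worker reports a finished job to whichever master is active.
        self.store.pending.discard(job)
        self.store.completed.add(job)

    def jobs_to_reschedule(self):
        # After a takeover, only unfinished jobs need to be rescheduled.
        return sorted(self.store.pending)


store = SharedStore()
active = Master("master-1", store)
for job in ["job-a", "job-b", "job-c"]:
    active.schedule(job)
active.report_done("job-a")

# master-1 crashes; master-2 takes over using the same shared store.
standby = Master("master-2", store)
# A worker that finished job-b during the failover reports to the new master.
standby.report_done("job-b")
remaining = standby.jobs_to_reschedule()
print(remaining)
```

In this model the new master would only need to reschedule the jobs still marked as pending, rather than all of them. Whether Spark behaves like this is exactly what I am asking.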

Please share your views on this.

Regards
Neeraj

On Monday, April 20, 2015 at 7:53:32 PM UTC+5:30, Dean Wampler wrote:
>
> Answers inline below.
>
> On Sunday, April 19, 2015 at 4:04:15 AM UTC-5, tomerneeraj wrote:
>>
>> Hi, 
>>
>> We would like to use Spark without Hadoop. In a highly scalable, highly 
>> available setup, the YARN and HDFS APIs serve the purposes of resource 
>> scheduling and shared storage. Our data is stored on separate (non-shared) 
>> disks. A couple of questions about this: 
>>
>> 1. Can we replace YARN with Akka Cluster for resource scheduling 
>> (distributing work between master and worker nodes)? 
>>
>
> Akka cluster doesn't have the resource management capabilities nor 
> integration with Spark that are required. We at Typesafe are considering 
> implementing this capability. For now, your best alternatives to YARN are 
> Mesos, for which we are offering production support, and standalone mode, 
> where you manually configure a cluster yourself. Mesos is best for 
> general-purpose, multi-job and multi-use clustering, while standalone is 
> fine if you have just a few jobs running, like a continuous streaming job 
> with its own, dedicated hardware.
>
>
>> 2. Is a shared file system necessary for Spark Streaming? Can the master 
>> and workers each use their own standalone disk for Spark Streaming and 
>> resource scheduling, without sharing any disk between Spark nodes? 
>>
>
> It's necessary to have a shared filesystem. It could be NFS, but you'll have 
> poor I/O performance. Fortunately, running HDFS without the rest of Hadoop 
> is not difficult. It might be possible to use other distributed filesystems 
> like Ceph, but I haven't tried that.
>
>
>> 3. What algorithm does the master node use to distribute traffic to the 
>> worker nodes, and how does Spark Streaming scale? Could Akka Cluster 
>> help with this somehow? 
>>
>
> Spark does a good job partitioning data, even incoming streams, across the 
> cluster. When reading from a distributed file system it knows about (i.e., 
> HDFS and S3), it can read and process blocks in parallel. Akka messaging is 
> used for some internal communications, but Spark isn't "deeply" dependent 
> on Akka.
>
> Akka would be an excellent foundation for a big data system. At Typesafe, 
> we're thinking about how to make use of it for different use cases ;)
>  
>
>>
>> Regards 
>> Neeraj 
>>
>
