Re: Spark streaming multi-tasking during I/O

2015-08-22 Thread Sateesh Kavuri
Hi Akhil,

Think of the scenario as running a piece of code in plain Java with
multiple threads. Let's say a Java process spawns 4 threads to handle
reading from a database, some processing, and storing back to the database.
While one thread is blocked on database I/O, the CPU can schedule another
thread to do the processing, thus using the resources efficiently.
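
As a minimal sketch of that plain-JVM behaviour (the sleep stands in for a
blocking database call, and nothing here is Spark-specific):

    import java.util.concurrent.Executors

    object IoOverlapSketch extends App {
      val pool = Executors.newFixedThreadPool(2)

      // Thread 1: simulated database I/O; the thread blocks, the CPU is free
      pool.submit(new Runnable {
        def run(): Unit = { Thread.sleep(2000); println("I/O done") }
      })

      // Thread 2: CPU-bound processing that the OS can run in the meantime
      pool.submit(new Runnable {
        def run(): Unit = println("sum = " + (1 to 10000000).map(math.sqrt(_)).sum)
      })

      pool.shutdown()
    }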

In the case of Spark, when a node's executor runs the same "read from DB =>
process data => store to DB" flow, the CPU is not given to other queued
requests during the "read from DB" and "store to DB" phases, since the
executor allocates its resources entirely to the request currently in
progress.

Does the spark.streaming.concurrentJobs flag not enable this kind of
scenario, or is there another way to achieve what I am looking for?
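
For reference, this is roughly where I would expect to set the flag when
building the streaming context (the local[4] master and the 10-second batch
interval below are only assumptions for illustration):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Experimental flag: allow up to 2 streaming jobs to run at the same time
    val conf = new SparkConf()
      .setMaster("local[4]")                          // assumed 4-core local mode
      .setAppName("concurrent-jobs-sketch")
      .set("spark.streaming.concurrentJobs", "2")

    val ssc = new StreamingContext(conf, Seconds(10)) // assumed batch interval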

Thanks,
Sateesh

On Sat, Aug 22, 2015 at 7:26 PM, Akhil Das 
wrote:

> Hmm, for a single-core VM you will have to run it in local mode (specifying
> master=local[4]). The flag is available in all versions of Spark, I
> guess.
> On Aug 22, 2015 5:04 AM, "Sateesh Kavuri" 
> wrote:
>
>> Thanks Akhil. Does this mean that the executor running in the VM can
>> spawn two concurrent jobs on the same core? If this is the case, this is
>> what we are looking for. Also, which version of Spark is this flag in?
>>
>> Thanks,
>> Sateesh
>>
>> On Sat, Aug 22, 2015 at 1:44 AM, Akhil Das 
>> wrote:
>>
>>> You can look at spark.streaming.concurrentJobs; by default it runs a
>>> single job. If you set it to 2, then it can run 2 jobs in parallel. It's an
>>> experimental flag, but go ahead and give it a try.
>>> On Aug 21, 2015 3:36 AM, "Sateesh Kavuri" 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> My scenario goes like this:
>>>> I have an algorithm running in Spark streaming mode on a 4-core virtual
>>>> machine. The majority of the time, the algorithm does disk I/O and database
>>>> I/O. The question is: during the I/O, when the CPU is not considerably loaded,
>>>> is it possible to run any other task/thread so as to utilize the CPU
>>>> efficiently?
>>>>
>>>> Note that one DStream of the algorithm runs completely on a single CPU
>>>>
>>>> Thank you,
>>>> Sateesh
>>>>
>>>
>>


Re: Spark streaming multi-tasking during I/O

2015-08-22 Thread Sateesh Kavuri
Thanks Akhil. Does this mean that the executor running in the VM can spawn
two concurrent jobs on the same core? If this is the case, this is what we
are looking for. Also, which version of Spark is this flag in?

Thanks,
Sateesh

On Sat, Aug 22, 2015 at 1:44 AM, Akhil Das 
wrote:

> You can look at spark.streaming.concurrentJobs; by default it runs a
> single job. If you set it to 2, then it can run 2 jobs in parallel. It's an
> experimental flag, but go ahead and give it a try.
> On Aug 21, 2015 3:36 AM, "Sateesh Kavuri" 
> wrote:
>
>> Hi,
>>
>> My scenario goes like this:
>> I have an algorithm running in Spark streaming mode on a 4-core virtual
>> machine. The majority of the time, the algorithm does disk I/O and database
>> I/O. The question is: during the I/O, when the CPU is not considerably loaded,
>> is it possible to run any other task/thread so as to utilize the CPU
>> efficiently?
>>
>> Note that one DStream of the algorithm runs completely on a single CPU
>>
>> Thank you,
>> Sateesh
>>
>


Re: Spark streaming multi-tasking during I/O

2015-08-22 Thread Sateesh Kavuri
Hi Rishitesh,

We are not using any RDDs to parallelize the processing, and the whole
algorithm runs on a single core (and in a single thread). The parallelism
is done at the user level.

The disk I/O can be started in a separate thread, but then the executor will
not be able to take up more jobs, since that's how I believe Spark is
designed by default.
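
If we did push this through Spark, one rough way to overlap the I/O with the
CPU work inside a single task might look like the sketch below. Here process
and writeToDb are hypothetical placeholders for our algorithm, and rdd stands
for whatever RDD the job produces; this is only an illustration, not how our
code is structured today:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._

    rdd.foreachPartition { records =>
      // Kick off each database write asynchronously so the task thread can keep
      // doing CPU-bound processing while earlier writes are still in flight.
      val pendingWrites = records.map { rec =>
        val result = process(rec)         // hypothetical CPU-bound step
        Future { writeToDb(result) }      // hypothetical I/O-bound step
      }.toList
      // Block until all writes for this partition have finished
      pendingWrites.foreach(Await.result(_, 5.minutes))
    }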

On Sat, Aug 22, 2015 at 12:51 AM, Rishitesh Mishra  wrote:

> Hi Sateesh,
> It is interesting to know how you determined that the DStream runs on
> a single core. Did you mean receivers?
>
> Coming back to your question, could you not start the disk I/O in a separate
> thread, so that the scheduler can go ahead and assign other tasks?
> On 21 Aug 2015 16:06, "Sateesh Kavuri"  wrote:
>
>> Hi,
>>
>> My scenario goes like this:
>> I have an algorithm running in Spark streaming mode on a 4-core virtual
>> machine. The majority of the time, the algorithm does disk I/O and database
>> I/O. The question is: during the I/O, when the CPU is not considerably loaded,
>> is it possible to run any other task/thread so as to utilize the CPU
>> efficiently?
>>
>> Note that one DStream of the algorithm runs completely on a single CPU
>>
>> Thank you,
>> Sateesh
>>
>


Spark streaming multi-tasking during I/O

2015-08-21 Thread Sateesh Kavuri
Hi,

My scenario goes like this:
I have an algorithm running in Spark streaming mode on a 4-core virtual
machine. The majority of the time, the algorithm does disk I/O and database
I/O. The question is: during the I/O, when the CPU is not considerably loaded,
is it possible to run any other task/thread so as to utilize the CPU
efficiently?

Note that one DStream of the algorithm runs completely on a single CPU

Thank you,
Sateesh


Re: Spark or Storm

2015-06-16 Thread Sateesh Kavuri
Probably overloading the question a bit.

In Storm, Bolts have the ability to be triggered on events. Is that kind of
functionality possible with Spark Streaming? During each phase of the data
processing, the transformed data is stored to the database, and this
transformed data should then be sent to a new pipeline for further
processing.

How can this be achieved using Spark?
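
To make the question concrete, what I have in mind is roughly the sketch
below, where each stage writes its output and the next stage consumes it.
Here parse, enrich and saveStage are hypothetical placeholders for our own
logic, and ssc is an already-created StreamingContext; is chaining DStream
transformations like this the idiomatic Spark equivalent of chained Bolts?

    // Rough sketch only: parse, enrich and saveStage stand in for our own code.
    val events   = ssc.socketTextStream("localhost", 9999)    // assumed input source
    val parsed   = events.map(parse)                           // stage 1 transform
    parsed.foreachRDD(rdd => saveStage("parsed", rdd))         // persist stage 1 output
    val enriched = parsed.map(enrich)                          // stage 2 feeds off stage 1
    enriched.foreachRDD(rdd => saveStage("enriched", rdd))     // persist stage 2 output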



On Wed, Jun 17, 2015 at 10:10 AM, Spark Enthusiast  wrote:

> I have a use case where a stream of incoming events has to be aggregated
> and joined to create complex events. The aggregation will have to happen at
> an interval of 1 minute (or less).
>
> The pipeline is:
>
>   Upstream services --(send events)--> Kafka --(enrich event)-->
>   Event Stream Processor --> Complex Event Processor --> Elastic Search
>
> From what I understand, Storm will make a very good ESP and Spark
> Streaming will make a good CEP.
>
> But, we are also evaluating Storm with Trident.
>
> How does Spark Streaming compare with Storm with Trident?
>
> Sridhar Chellappa
>
>
>
>
>
>
>
>   On Wednesday, 17 June 2015 10:02 AM, ayan guha 
> wrote:
>
>
> I have a similar scenario where we need to bring data from Kinesis to
> HBase. Data velocity is 20k per 10 mins. A little manipulation of the data
> will be required, but that's regardless of the tool, so we will be writing
> that piece as a Java POJO.
> The whole environment is on AWS. HBase is on a long-running EMR cluster and
> Kinesis on a separate cluster.
> TIA.
> Best
> Ayan
> On 17 Jun 2015 12:13, "Will Briggs"  wrote:
>
> The programming models for the two frameworks are conceptually rather
> different; I haven't worked with Storm for quite some time, but based on my
> old experience with it, I would equate Spark Streaming more with Storm's
> Trident API, rather than with the raw Bolt API. Even then, there are
> significant differences, but it's a bit closer.
>
> If you can share your use case, we might be able to provide better
> guidance.
>
> Regards,
> Will
>
> On June 16, 2015, at 9:46 PM, asoni.le...@gmail.com wrote:
>
> Hi All,
>
> I am evaluating Spark vs. Storm (Spark Streaming) and I am not able to
> see what the equivalent of Storm's Bolt is inside Spark.
>
> Any help on this will be appreciated.
>
> Thanks ,
> Ashish
>
>
>
>


Re: Spark ML decision list

2015-06-05 Thread Sateesh Kavuri
Is there an existing way in SparkML to convert a decision tree to a
decision list?
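
If nothing built-in exists, would flattening the trained tree into one rule
per leaf be a reasonable substitute? A rough, untested sketch of what I mean,
assuming the MLlib 1.x tree model API and continuous splits only:

    import org.apache.spark.mllib.tree.model.{DecisionTreeModel, Node}

    // Walk the tree and emit one "IF ... THEN ..." rule per leaf.
    def toRules(node: Node, path: List[String] = Nil): List[String] =
      if (node.isLeaf)
        List(s"IF ${path.reverse.mkString(" AND ")} THEN predict ${node.predict.predict}")
      else {
        val s = node.split.get
        toRules(node.leftNode.get,  s"feature ${s.feature} <= ${s.threshold}" :: path) ++
          toRules(node.rightNode.get, s"feature ${s.feature} > ${s.threshold}" :: path)
      }

    // val rules = toRules(model.topNode)   // model: a trained DecisionTreeModel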

On Thu, Jun 4, 2015 at 10:50 PM, Reza Zadeh  wrote:

> The closest algorithm to decision lists that we have is decision trees
> https://spark.apache.org/docs/latest/mllib-decision-tree.html
>
> On Thu, Jun 4, 2015 at 2:14 AM, Sateesh Kavuri 
> wrote:
>
>> Hi,
>>
>> I have used the Weka machine learning library to generate a model for my
>> training set. I have used the PART algorithm (decision lists) from Weka.
>>
>> Now, I would like to use Spark ML for the PART algorithm on my training set
>> and could not seem to find a parallel. Could anyone point out the
>> corresponding algorithm, or whether it is even available in Spark ML?
>>
>> Thanks,
>> Sateesh
>>
>
>


Spark ML decision list

2015-06-04 Thread Sateesh Kavuri
Hi,

I have used the Weka machine learning library to generate a model for my
training set. I have used the PART algorithm (decision lists) from Weka.

Now, I would like to use Spark ML for the PART algorithm on my training set
and could not seem to find a parallel. Could anyone point out the
corresponding algorithm, or whether it is even available in Spark ML?

Thanks,
Sateesh


Re: Connection pooling in spark jobs

2015-04-02 Thread Sateesh Kavuri
Each executor runs for about 5 seconds, during which time the DB connection
can potentially stay open. Each executor will have 1 connection open.
Connection pooling surely has its advantages in performance and in not
hitting the DB server for every open/close. The database in question is not
used only by the Spark jobs but is shared with other systems, so the Spark
jobs have to be better at managing the resources.

I am not really looking for a DB connections counter (I will let the DB
handle that part), but rather to have a pool of connections on the Spark end
so that the connections can be reused across jobs.
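
One direction we are considering is to hold the pool in a singleton object so
that it is created once per executor JVM and reused by every task and job
that lands on that executor. This is only a sketch: the pooling library
(commons-dbcp2 here), the connection details and rdd are placeholders, not
what we actually run:

    import java.sql.Connection
    import org.apache.commons.dbcp2.BasicDataSource  // assumed JDBC pool; any pool would do

    object ExecutorConnectionPool {
      // Initialised lazily on first use inside each executor JVM, then reused
      // across tasks and jobs instead of opening/closing a connection per job.
      lazy val dataSource: BasicDataSource = {
        val ds = new BasicDataSource()
        ds.setUrl("jdbc:oracle:thin:@//dbhost:1521/SERVICE")  // hypothetical DB details
        ds.setUsername("app")
        ds.setPassword("secret")
        ds.setMaxTotal(4)                                      // cap connections per executor
        ds
      }
    }

    rdd.foreachPartition { rows =>
      val conn: Connection = ExecutorConnectionPool.dataSource.getConnection()
      try {
        rows.foreach { row =>
          val stmt = conn.prepareStatement("INSERT INTO results VALUES (?)") // hypothetical SQL
          stmt.setString(1, row.toString)
          stmt.executeUpdate()
          stmt.close()
        }
      } finally conn.close()  // returns the connection to the pool rather than closing it
    }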


On Fri, Apr 3, 2015 at 10:21 AM, Charles Feduke 
wrote:

> How long does each executor keep the connection open for? How many
> connections does each executor open?
>
> Are you certain that connection pooling is a performant and suitable
> solution? Are you running out of resources on the database server and
> cannot tolerate each executor having a single connection?
>
> If you need a solution that limits the number of open connections
> [resource starvation on the DB server] I think you'd have to fake it with a
> centralized counter of active connections, and logic within each executor
> that blocks when the counter is at a given threshold. If the counter is not
> at threshold, then an active connection can be created (after incrementing
> the shared counter). You could use something like ZooKeeper to store the
> counter value. This would have the overall effect of decreasing performance
> if your required number of connections outstrips the database's resources.
>
> On Fri, Apr 3, 2015 at 12:22 AM Sateesh Kavuri 
> wrote:
>
>> But this basically means that the pool is confined to the job (of a
>> single app) in question and is not sharable across multiple apps?
>> The setup we have is a job server (the spark-jobserver) that creates
>> jobs. Currently, we have each job opening and closing a connection to the
>> database. What we would like to achieve is for each of the jobs to obtain a
>> connection from a db pool
>>
>> Any directions on how this can be achieved?
>>
>> --
>> Sateesh
>>
>> On Thu, Apr 2, 2015 at 7:00 PM, Cody Koeninger 
>> wrote:
>>
>>> Connection pools aren't serializable, so you generally need to set them
>>> up inside of a closure.  Doing that for every item is wasteful, so you
>>> typically want to use mapPartitions or foreachPartition
>>>
>>> rdd.mapPartitions { part =>
>>>   setupPool
>>>   part.map { ...
>>>
>>>
>>>
>>> See "Design Patterns for using foreachRDD" in
>>> http://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams
>>>
>>> On Thu, Apr 2, 2015 at 7:52 AM, Sateesh Kavuri >> > wrote:
>>>
>>>> Right, I am aware of how to use connection pooling with Oracle, but the
>>>> specific question is how to use it in the context of Spark job execution.
>>>> On 2 Apr 2015 17:41, "Ted Yu"  wrote:
>>>>
>>>>> http://docs.oracle.com/cd/B10500_01/java.920/a96654/connpoca.htm
>>>>>
>>>>> The question doesn't seem to be Spark specific, btw
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> > On Apr 2, 2015, at 4:45 AM, Sateesh Kavuri 
>>>>> wrote:
>>>>> >
>>>>> > Hi,
>>>>> >
>>>>> > We have a case where we will have to run concurrent jobs (for the
>>>>> same algorithm) on different data sets. These jobs can run in parallel,
>>>>> and each of them will be fetching its data from the database.
>>>>> > We would like to optimize the database connections by making use of
>>>>> connection pooling. Any suggestions / best-known ways on how to achieve
>>>>> this? The database in question is Oracle.
>>>>> >
>>>>> > Thanks,
>>>>> > Sateesh
>>>>>
>>>>
>>>
>>


Re: Connection pooling in spark jobs

2015-04-02 Thread Sateesh Kavuri
But this basically means that the pool is confined to the job (of a single
app) in question and is not sharable across multiple apps?
The setup we have is a job server (the spark-jobserver) that creates jobs.
Currently, we have each job opening and closing a connection to the
database. What we would like to achieve is for each of the jobs to obtain a
connection from a db pool

Any directions on how this can be achieved?

--
Sateesh

On Thu, Apr 2, 2015 at 7:00 PM, Cody Koeninger  wrote:

> Connection pools aren't serializable, so you generally need to set them up
> inside of a closure.  Doing that for every item is wasteful, so you
> typically want to use mapPartitions or foreachPartition
>
> rdd.mapPartitions { part =>
>   setupPool
>   part.map { ...
>
>
>
> See "Design Patterns for using foreachRDD" in
> http://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams
>
> On Thu, Apr 2, 2015 at 7:52 AM, Sateesh Kavuri 
> wrote:
>
>> Right, I am aware of how to use connection pooling with Oracle, but the
>> specific question is how to use it in the context of Spark job execution.
>> On 2 Apr 2015 17:41, "Ted Yu"  wrote:
>>
>>> http://docs.oracle.com/cd/B10500_01/java.920/a96654/connpoca.htm
>>>
>>> The question doesn't seem to be Spark specific, btw
>>>
>>>
>>>
>>>
>>> > On Apr 2, 2015, at 4:45 AM, Sateesh Kavuri 
>>> wrote:
>>> >
>>> > Hi,
>>> >
>>> > We have a case where we will have to run concurrent jobs (for the same
>>> algorithm) on different data sets. These jobs can run in parallel, and
>>> each of them will be fetching its data from the database.
>>> > We would like to optimize the database connections by making use of
>>> connection pooling. Any suggestions / best-known ways on how to achieve
>>> this? The database in question is Oracle.
>>> >
>>> > Thanks,
>>> > Sateesh
>>>
>>
>


Re: Connection pooling in spark jobs

2015-04-02 Thread Sateesh Kavuri
Right, I am aware of how to use connection pooling with Oracle, but the
specific question is how to use it in the context of Spark job execution.
On 2 Apr 2015 17:41, "Ted Yu"  wrote:

> http://docs.oracle.com/cd/B10500_01/java.920/a96654/connpoca.htm
>
> The question doesn't seem to be Spark specific, btw
>
>
>
>
> > On Apr 2, 2015, at 4:45 AM, Sateesh Kavuri 
> wrote:
> >
> > Hi,
> >
> > We have a case where we will have to run concurrent jobs (for the same
> algorithm) on different data sets. These jobs can run in parallel, and
> each of them will be fetching its data from the database.
> > We would like to optimize the database connections by making use of
> connection pooling. Any suggestions / best-known ways on how to achieve
> this? The database in question is Oracle.
> >
> > Thanks,
> > Sateesh
>


Connection pooling in spark jobs

2015-04-02 Thread Sateesh Kavuri
Hi,

We have a case where we will have to run concurrent jobs (for the same
algorithm) on different data sets. These jobs can run in parallel, and each
of them will be fetching its data from the database.
We would like to optimize the database connections by making use of
connection pooling. Any suggestions / best-known ways on how to achieve
this? The database in question is Oracle.

Thanks,
Sateesh