Re: Spark Job Server application compilation issue

2018-03-14 Thread sujeet jog
Thanks for the pointer.


On Wed, Mar 14, 2018 at 11:19 PM, Vadim Semenov <va...@datadoghq.com> wrote:

> This question should be directed to the `spark-jobserver` group:
> https://github.com/spark-jobserver/spark-jobserver#contact
>
> They also have a gitter chat.
>
> Also include the errors you get when you ask them the question.
>
> On Wed, Mar 14, 2018 at 1:37 PM, sujeet jog <sujeet@gmail.com> wrote:
>
>>
>> Input is a JSON request, which is decoded in myJob() and processed
>> further.
>>
>> I am not sure what is wrong with the code below; it emits errors about
>> unimplemented methods (runJob/validate).
>> Any pointers on this would be helpful.
>>
>> jobserver-0.8.0
>>
>> object MyJobServer extends SparkSessionJob {
>>
>>   type JobData = String
>>   type JobOutput = Seq[String]
>>
>>   def myJob(a : String)  = {
>> }
>>
>>   def runJob(sc: SparkContext, runtime: JobEnvironment, data: JobData):
>> JobOutput = {
>>myJob(a)
>>}
>>
>>  def validate(sc: SparkContext, runtime: JobEnvironment, config: Config):
>> JobData Or Every[ValidationProblem] = {
>>Good(config.root().render())
>>  }
>>
>>
>
>
> --
> Sent from my iPhone
>


Spark Job Server application compilation issue

2018-03-14 Thread sujeet jog
Input is a JSON request, which is decoded in myJob() and processed
further.

I am not sure what is wrong with the code below; it emits errors about
unimplemented methods (runJob/validate).
Any pointers on this would be helpful.

jobserver-0.8.0

object MyJobServer extends SparkSessionJob {

  type JobData = String
  type JobOutput = Seq[String]

  def myJob(a : String) = {
  }

  def runJob(sc: SparkContext, runtime: JobEnvironment, data: JobData): JobOutput = {
    myJob(a)
  }

  def validate(sc: SparkContext, runtime: JobEnvironment, config: Config): JobData Or Every[ValidationProblem] = {
    Good(config.root().render())
  }
}
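
A likely cause, offered here as a hedged guess rather than a confirmed diagnosis:
with the jobserver 0.8.0 SparkSessionJob API, the abstract runJob/validate methods
take a SparkSession (not a SparkContext) as their first parameter, so the
SparkContext-based definitions above do not override them and the object is
reported as having unimplemented methods. A minimal sketch of what the job might
look like against that API (import paths and exact signatures are from memory and
should be checked against the jobserver-0.8.0 sources; myJob here just echoes its
input):

import com.typesafe.config.Config
import org.apache.spark.sql.SparkSession
import org.scalactic._
import spark.jobserver.api.{JobEnvironment, SparkSessionJob, ValidationProblem}

object MyJobServer extends SparkSessionJob {

  type JobData = String
  type JobOutput = Seq[String]

  // placeholder for the real JSON decoding / processing
  def myJob(a: String): Seq[String] = Seq(a)

  // note: the first parameter is a SparkSession, matching SparkSessionJob
  def runJob(spark: SparkSession, runtime: JobEnvironment, data: JobData): JobOutput =
    myJob(data)

  def validate(spark: SparkSession, runtime: JobEnvironment, config: Config): JobData Or Every[ValidationProblem] =
    Good(config.root().render())
}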


running Spark-JobServer in eclipse

2018-03-03 Thread sujeet jog
Is there a way to run Spark-JobServer in eclipse ?.. any pointers in this
regard ?


read parallel processing spark-cassandra

2018-02-13 Thread sujeet jog
Folks,

I have a time series table in which each record has 350 columns.

The primary key is ((date, bucket), objectid, timestamp).
The objective is to read one day's worth of data, which comes to around 12k
partitions, each partition holding around 25 MB of data.
I see only 1 task active during the read operation on a 5-node cluster (8
cores each). Does this mean that not enough Spark partitions are getting
created?
I have also set input.split.size_in_mb to a lower number, like 10 (see the
sketch below).
Any pointers in this regard would be helpful.
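
For reference, a minimal sketch of how that split size can be set and how to check
how many Spark partitions the scan actually produced (the host, keyspace and table
names are placeholders, and the property name is the one used by the 2.x connector;
treat both as assumptions to verify):

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("cassandra-read-check")
  .set("spark.cassandra.connection.host", "10.0.0.1")        // placeholder host
  .set("spark.cassandra.input.split.size_in_mb", "10")       // smaller splits -> more Spark partitions

val sc = new SparkContext(conf)

val rdd = sc.cassandraTable("my_keyspace", "ts_table")       // placeholder names

// if this prints 1, the connector collapsed the read into a single partition;
// repartitioning afterwards spreads the downstream work even if the read stays narrow
println(s"Spark partitions created by the connector: ${rdd.partitions.length}")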


Thanks,


Spark Docker

2017-12-25 Thread sujeet jog
Folks,

Can you share your experience of running Spark under Docker on a single
local / standalone node?
Is anybody using it in production environments? We have an existing Docker
Swarm deployment, and I want to run Spark in a separate fat VM hooked
into / controlled by Docker Swarm.

I know there is no official clustering support for running Spark under
Docker Swarm, but can it be used to run on a single fat VM controlled by
Swarm?

Any insights or production-mode experiences would be appreciated.

Thanks,
Sujeet


running dockerized spark applications in DC/OS

2017-08-31 Thread sujeet jog
Folks,

Does anybody have production experience running dockerized Spark
applications on DC/OS, and can the Spark cluster run in a mode other than
Spark standalone?
What are the major differences between running Spark with the Mesos cluster
manager vs. running Spark as a dockerized container under DC/OS? Are there
any performance issues?

It would be helpful if anybody could share some technical details on
dockerizing Spark applications in DC/OS, or point to the right location on
the web; unfortunately I could not find much information on this.

-Sujeet


Re: Cassandra querying time stamps

2017-06-20 Thread sujeet jog
Correction.

On Tue, Jun 20, 2017 at 5:27 PM, sujeet jog <sujeet@gmail.com> wrote:

> , Below is the query, looks like from physical plan, the query is same as
> that of cqlsh,
>
>  val query = s"""(select * from model_data
> where TimeStamp > \'$timeStamp+\' and TimeStamp <=
> \'$startTS+\'
> and MetricID = $metricID)"""
>
> println("Model query" + query)
>
> val df = spark.read
>   .format(Config.dbDriver)
>   .options(Map("Keyspace" -> Config.dbName, "table" ->
> "ml_forecast_tbl"))
>   .load
>
>
>    df.createOrReplaceTempView("model_data")
>    val modelDF = spark.sql(query).cache()
>    println(spark.sql(query).queryExecution)
>
>
>
> == Physical Plan ==
> InMemoryTableScan [MetricID#9045, TimeStamp#9046, ResourceID#9047,
> Forecast#9048, GlobalThresholdMax#9049, GlobalThresholdMin#9050, Hi85#9051,
> Hi99#9052, Low85#9053, Low99#9054]
> :  +- InMemoryRelation [MetricID#9045, TimeStamp#9046, ResourceID#9047,
> Forecast#9048, GlobalThresholdMax#9049, GlobalThresholdMin#9050, Hi85#9051,
> Hi99#9052, Low85#9053, Low99#9054], true, 1, StorageLevel(disk, memory,
> deserialized, 1 replicas)
> : :  +- *Filter cast(TimeStamp#9046 as string) > 2016-01-22
> 00:00:00+) && (cast(TimeStamp#9046 as string) <= 2016-01-22
> 00:30:00+)) && isnotnull(TimeStamp#9046)) && isnotnull(MetricID#9045))
> : : +- *Scan org.apache.spark.sql.cassandra.
> CassandraSourceRelation@40dc2ade [MetricID#9045,TimeStamp#9046,
> ResourceID#9047,Forecast#9048,GlobalThresholdMax#9049,
> GlobalThresholdMin#9050,Hi85#9051,Hi99#9052,Low85#9053,Low99#9054]
> PushedFilters: [IsNotNull(TimeStamp), IsNotNull(MetricID),
> EqualTo(MetricID,1)], ReadSchema: struct<MetricID:int,TimeStamp:
> timestamp,ResourceID:string,Forecast:double,GlobalThresholdMax:doub...
>
>
> On Tue, Jun 20, 2017 at 5:13 PM, Riccardo Ferrari <ferra...@gmail.com>
> wrote:
>
>> Hi,
>>
>> Personally I would inspect how dates are managed. How does your spark
>> code looks like? What does the explain say. Does TimeStamp gets parsed the
>> same way?
>>
>> Best,
>>
>> On Tue, Jun 20, 2017 at 12:52 PM, sujeet jog <sujeet@gmail.com>
>> wrote:
>>
>>> Hello,
>>>
>>> I have a table as below
>>>
>>> CREATE TABLE analytics_db.ml_forecast_tbl (
>>>"MetricID" int,
>>>"TimeStamp" timestamp,
>>>"ResourceID" timeuuid
>>>"Value"   double,
>>> PRIMARY KEY ("MetricID", "TimeStamp", "ResourceID")
>>> )
>>>
>>> select * from ml_forecast_tbl where "MetricID" = 1 and "TimeStamp" >
>>> '2016-01-22 00:00:00+' and "TimeStamp" <= '2016-01-22 00:30:00+' ;
>>>
>>>  MetricID | TimeStamp   | ResourceID
>>>   | Value|
>>> --+-+---
>>> ---+--+
>>> |1 | 2016-01-22 00:30:00.00+ |
>>> 4a925190-3b13-11e7-83c6-a32261219d03 | 32.16177 |
>>> | 23.74124 | 15.2371
>>> 1 | 2016-01-22 00:30:00.00+ |
>>> 4a92c6c0-3b13-11e7-83c6-a32261219d03 | 32.16177 |
>>> | 23.74124 | 15.2371
>>> 1 | 2016-01-22 00:30:00.00+ |
>>> 4a936300-3b13-11e7-83c6-a32261219d03 | 32.16177 |
>>> | 23.74124 | 15.2371
>>> 1 | 2016-01-22 00:30:00.00+ |
>>> 4a93d830-3b13-11e7-83c6-a32261219d03 | 32.16177 |
>>> | 23.74124 | 15.2371
>>>
>>> This query runs perfectly fine from cqlsh,   but not with Spark SQL, it
>>> just emits empty results,
>>> Is there a catch to think about on querying timestamp ranges with
>>> cassandra spark connector
>>>
>>> Any inputs on this ?..
>>>
>>>
>>> Thanks,
>>> Sujeet
>>>
>>
>>
>


Re: Cassandra querying time stamps

2017-06-20 Thread sujeet jog
Below is the query; from the physical plan, it looks like the query is the
same as the one run in cqlsh.

 val query = s"""(select * from model_data
where TimeStamp > \'$timeStamp+\' and TimeStamp <=
\'$startTS+\'
and MetricID = $metricID)"""

println("Model query" + query)

val df = spark.read
  .format(Config.dbDriver)
  .options(Map("Keyspace" -> Config.dbName, "table" ->
"ml_forecast_tbl"))
  .load

val d = spark.sparkContext.cassandraTable("analytics_db",
"ml_forecast_tbl")
 .where(" \"TimeStamp\" > ? and \"TimeStamp\" <= ? and \"MetricID\" =
1",
  timeStamp + "+", startTS + "+")


   df.createOrReplaceTempView("model_data")
   val modelDF = spark.sql(query).cache()
   println(spark.sql(query).queryExecution)



== Physical Plan ==
InMemoryTableScan [MetricID#9045, TimeStamp#9046, ResourceID#9047,
Forecast#9048, GlobalThresholdMax#9049, GlobalThresholdMin#9050, Hi85#9051,
Hi99#9052, Low85#9053, Low99#9054]
:  +- InMemoryRelation [MetricID#9045, TimeStamp#9046, ResourceID#9047,
Forecast#9048, GlobalThresholdMax#9049, GlobalThresholdMin#9050, Hi85#9051,
Hi99#9052, Low85#9053, Low99#9054], true, 1, StorageLevel(disk, memory,
deserialized, 1 replicas)
: :  +- *Filter cast(TimeStamp#9046 as string) > 2016-01-22
00:00:00+) && (cast(TimeStamp#9046 as string) <= 2016-01-22
00:30:00+)) && isnotnull(TimeStamp#9046)) && isnotnull(MetricID#9045))
: : +- *Scan
org.apache.spark.sql.cassandra.CassandraSourceRelation@40dc2ade
[MetricID#9045,TimeStamp#9046,ResourceID#9047,Forecast#9048,GlobalThresholdMax#9049,GlobalThresholdMin#9050,Hi85#9051,Hi99#9052,Low85#9053,Low99#9054]
PushedFilters: [IsNotNull(TimeStamp), IsNotNull(MetricID),
EqualTo(MetricID,1)], ReadSchema:
struct<MetricID:int,TimeStamp:timestamp,ResourceID:string,Forecast:double,GlobalThresholdMax:doub...


On Tue, Jun 20, 2017 at 5:13 PM, Riccardo Ferrari <ferra...@gmail.com>
wrote:

> Hi,
>
> Personally I would inspect how dates are managed. How does your spark code
> looks like? What does the explain say. Does TimeStamp gets parsed the same
> way?
>
> Best,
>
> On Tue, Jun 20, 2017 at 12:52 PM, sujeet jog <sujeet@gmail.com> wrote:
>
>> Hello,
>>
>> I have a table as below
>>
>> CREATE TABLE analytics_db.ml_forecast_tbl (
>>"MetricID" int,
>>"TimeStamp" timestamp,
>>"ResourceID" timeuuid
>>"Value"   double,
>> PRIMARY KEY ("MetricID", "TimeStamp", "ResourceID")
>> )
>>
>> select * from ml_forecast_tbl where "MetricID" = 1 and "TimeStamp" >
>> '2016-01-22 00:00:00+' and "TimeStamp" <= '2016-01-22 00:30:00+' ;
>>
>>  MetricID | TimeStamp   | ResourceID
>>   | Value|
>> --+-+---
>> ---+--+
>> |1 | 2016-01-22 00:30:00.00+ |
>> 4a925190-3b13-11e7-83c6-a32261219d03 | 32.16177 |
>> | 23.74124 | 15.2371
>> 1 | 2016-01-22 00:30:00.00+ |
>> 4a92c6c0-3b13-11e7-83c6-a32261219d03 | 32.16177 |
>> | 23.74124 | 15.2371
>> 1 | 2016-01-22 00:30:00.00+ |
>> 4a936300-3b13-11e7-83c6-a32261219d03 | 32.16177 |
>> | 23.74124 | 15.2371
>> 1 | 2016-01-22 00:30:00.00+ |
>> 4a93d830-3b13-11e7-83c6-a32261219d03 | 32.16177 |
>> | 23.74124 | 15.2371
>>
>> This query runs perfectly fine from cqlsh,   but not with Spark SQL, it
>> just emits empty results,
>> Is there a catch to think about on querying timestamp ranges with
>> cassandra spark connector
>>
>> Any inputs on this ?..
>>
>>
>> Thanks,
>> Sujeet
>>
>
>


Cassandra querying time stamps

2017-06-20 Thread sujeet jog
Hello,

I have a table as below

CREATE TABLE analytics_db.ml_forecast_tbl (
    "MetricID" int,
    "TimeStamp" timestamp,
    "ResourceID" timeuuid,
    "Value" double,
    PRIMARY KEY ("MetricID", "TimeStamp", "ResourceID")
)

select * from ml_forecast_tbl where "MetricID" = 1 and "TimeStamp" >
'2016-01-22 00:00:00+' and "TimeStamp" <= '2016-01-22 00:30:00+' ;

 MetricID | TimeStamp               | ResourceID                           | Value    | ...
----------+-------------------------+--------------------------------------+----------+--------------------
        1 | 2016-01-22 00:30:00.00+ | 4a925190-3b13-11e7-83c6-a32261219d03 | 32.16177 | 23.74124 | 15.2371
        1 | 2016-01-22 00:30:00.00+ | 4a92c6c0-3b13-11e7-83c6-a32261219d03 | 32.16177 | 23.74124 | 15.2371
        1 | 2016-01-22 00:30:00.00+ | 4a936300-3b13-11e7-83c6-a32261219d03 | 32.16177 | 23.74124 | 15.2371
        1 | 2016-01-22 00:30:00.00+ | 4a93d830-3b13-11e7-83c6-a32261219d03 | 32.16177 | 23.74124 | 15.2371

This query runs perfectly fine from cqlsh, but not with Spark SQL; it just
returns empty results.
Is there a catch to be aware of when querying timestamp ranges with the
Cassandra Spark connector?

Any inputs on this?


Thanks,
Sujeet
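
One thing worth ruling out, sketched here as a hedged suggestion rather than a
confirmed fix: if the range bounds reach Spark as formatted strings (as in the
$timeStamp string interpolation shown in the replies above), the comparison can end
up being done on string casts rather than on timestamps. Comparing against real
timestamp values keeps the filter typed; the column, keyspace and table names below
follow this mail, everything else is a placeholder:

import java.sql.Timestamp
import org.apache.spark.sql.functions._

val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "analytics_db", "table" -> "ml_forecast_tbl"))
  .load()

val from = Timestamp.valueOf("2016-01-22 00:00:00")
val to   = Timestamp.valueOf("2016-01-22 00:30:00")

// typed range filter on the timestamp column, plus the partition key
val slice = df.filter(col("MetricID") === 1 &&
                      col("TimeStamp") > lit(from) &&
                      col("TimeStamp") <= lit(to))

slice.show(5)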


Re: JSON Arrays and Spark

2016-10-12 Thread sujeet jog
I generally use the Play Framework APIs for complex JSON structures.

https://www.playframework.com/documentation/2.5.x/ScalaJson#Json
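
As a rough illustration of that approach (the field names are taken from the schema
quoted below; the input path, the use of wholeTextFiles for multi-line JSON files,
and sc being the SparkContext are assumptions):

import play.api.libs.json._

// read each multi-line JSON file as one string, then parse it with Play JSON
val rawJson = sc.wholeTextFiles("/data/json/").map { case (_, content) => content }  // placeholder path

val parsed = rawJson.map { text =>
  val js = Json.parse(text)
  val mainId   = (js \ "maindate" \ "mainidnId").asOpt[String]
  val entities = (js \ "Entity").asOpt[JsArray].map(_.value.size).getOrElse(0)
  (mainId, entities)
}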

On Wed, Oct 12, 2016 at 11:34 AM, Kappaganthu, Sivaram (ES) <
sivaram.kappagan...@adp.com> wrote:

> Hi,
>
>
>
> Does this mean that handling any Json with kind of below schema  with
> spark is not a good fit?? I have requirement to parse the below Json that
> spans across multiple lines. Whats the best way to parse the jsns of this
> kind?? Please suggest.
>
>
>
> root
>
> |-- maindate: struct (nullable = true)
>
> ||-- mainidnId: string (nullable = true)
>
> |-- Entity: array (nullable = true)
>
> ||-- element: struct (containsNull = true)
>
> |||-- Profile: struct (nullable = true)
>
> ||||-- Kind: string (nullable = true)
>
> |||-- Identifier: string (nullable = true)
>
> |||-- Group: array (nullable = true)
>
> ||||-- element: struct (containsNull = true)
>
> |||||-- Period: struct (nullable = true)
>
> ||||||-- pid: string (nullable = true)
>
> ||||||-- pDate: string (nullable = true)
>
> ||||||-- quarter: long (nullable = true)
>
> ||||||-- labour: array (nullable = true)
>
> |||||||-- element: struct (containsNull = true)
>
> ||||||||-- category: string (nullable = true)
>
> ||||||||-- id: string (nullable = true)
>
> ||||||||-- person: struct (nullable = true)
>
> |||||||||-- address: array (nullable =
> true)
>
> ||||||||||-- element: struct
> (containsNull = true)
>
> |||||||||||-- city: string
> (nullable = true)
>
> |||||||||||-- line1: string
> (nullable = true)
>
> |||||||||||-- line2: string
> (nullable = true)
>
> |||||||||||-- postalCode: string
> (nullable = true)
>
> |||||||||||-- state: string
> (nullable = true)
>
> |||||||||||-- type: string
> (nullable = true)
>
> |||||||||-- familyName: string (nullable =
> true)
>
> ||||||||-- tax: array (nullable = true)
>
> |||||||||-- element: struct (containsNull
> = true)
>
> ||||||||||-- code: string (nullable =
> true)
>
> ||||||||||-- qwage: double (nullable =
> true)
>
> ||||||||||-- qvalue: double (nullable
> = true)
>
> ||||||||||-- qSubjectvalue: double
> (nullable = true)
>
> ||||||||||-- qfinalvalue: double
> (nullable = true)
>
> ||||||||||-- ywage: double (nullable =
> true)
>
> ||||||||||-- yalue: double (nullable =
> true)
>
> ||||||||||-- ySubjectvalue: double
> (nullable = true)
>
> ||||||||||-- yfinalvalue: double
> (nullable = true)
>
> ||||||||-- tProfile: array (nullable = true)
>
> |||||||||-- element: struct (containsNull
> = true)
>
> ||||||||||-- isExempt: boolean
> (nullable = true)
>
> ||||||||||-- jurisdiction: struct
> (nullable = true)
>
> |||||||||||-- code: string
> (nullable = true)
>
> ||||||||||-- maritalStatus: string
> (nullable = true)
>
> ||||||||||-- numberOfDeductions: long
> (nullable = true)
>
> ||||||||-- wDate: struct (nullable = true)
>
> |||||||||-- originalHireDate: string
> (nullable = true)
>
> ||||||-- year: long (nullable = true)
>
>
>
>
>
> *From:* Luciano Resende [mailto:luckbr1...@gmail.com]
> *Sent:* Monday, October 10, 2016 11:39 PM
> *To:* Jean Georges Perrin
> *Cc:* user @spark
> *Subject:* Re: JSON Arrays and Spark
>
>
>
> Please take a look at
> http://spark.apache.org/docs/latest/sql-programming-guide.
> html#json-datasets
>
> Particularly the note at the required format :
>
> Note that the file that is offered as *a json file* is not a typical JSON
> file. Each line must contain a separate, self-contained valid JSON object.
> As a consequence, a regular multi-line JSON file will most often fail.
>
>
>
> On Mon, Oct 10, 2016 at 9:57 AM, Jean Georges Perrin  wrote:
>
> Hi folks,
>
>
>
> I am trying to parse JSON arrays and it’s getting a little crazy (for me

Convert RDD to JSON Rdd and append more information

2016-09-20 Thread sujeet jog
Hi,

I have an RDD of n rows. I want to transform it into a JSON RDD and also
add some more information; any idea how to accomplish this?


Example:

I have an RDD with n rows, with data like below:

 16.9527493170273,20.1989561393151,15.7065424947394
 17.9527493170273,21.1989561393151,15.7065424947394
 18.9527493170273,22.1989561393151,15.7065424947394


I would like to add the few header rows highlighted below to the beginning
of the RDD, and transform it all to JSON. The reason is that I intend to
push this as input to some application via pipe on the RDD for some
processing, and want to enforce a JSON structure on the input (see the
sketch below).

{
TimeSeriesID : 1234
NumOfInputSamples : 1008
Request Type : Fcast
 16.9527493170273,20.1989561393151,15.7065424947394
 17.9527493170273,21.1989561393151,15.7065424947394
 18.9527493170273,22.1989561393151,15.7065424947394
}


Thanks,
Sujeet
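
A minimal sketch of one way to do this (the envelope fields follow the example
above; the sample values, the one-JSON-block-per-partition choice and the script
name are assumptions):

// rows of raw sample values, as in the example above
val rows = sc.parallelize(Seq(
  "16.9527493170273,20.1989561393151,15.7065424947394",
  "17.9527493170273,21.1989561393151,15.7065424947394",
  "18.9527493170273,22.1989561393151,15.7065424947394"))

// wrap each partition's rows in the JSON envelope so the piped program
// receives one self-describing JSON document per partition
val jsonBlocks = rows.mapPartitions { it =>
  val samples = it.toSeq
  val body = samples.map(r => "\"" + r + "\"").mkString(",")
  Iterator(
    s"""{"TimeSeriesID": 1234, "NumOfInputSamples": ${samples.size}, "RequestType": "Fcast", "Samples": [$body]}""")
}

// hand each JSON document to the external program on stdin
val piped = jsonBlocks.pipe("Rscript /opt/app/forecast.R")   // placeholder script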


Partition n keys into exacly n partitions

2016-09-12 Thread sujeet jog
Hi,

Is there a way to partition a set of data with n keys into exactly n
partitions?

For example:

a tuple of 1008 rows with key x,
a tuple of 1008 rows with key y, and so on, for a total of 10 keys (x, y, etc.)

Total records = 10080
NumOfKeys = 10

I want to partition the 10080 elements into exactly 10 partitions, with
each partition holding only the elements of a single key.

Is there a way to make this happen? Any ideas on implementing a custom
partitioner (see the sketch below)?


The current partitioner I'm using is HashPartitioner, and there are cases
where key.hashCode() % numPartitions comes out the same for two different
keys such as x and y,
so elements with different keys sometimes fall into a single partition.



Thanks,
Sujeet
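
A minimal sketch of such a partitioner (it assumes String keys and that the full
set of keys is known up front; both are assumptions):

import org.apache.spark.Partitioner

// maps each distinct key to its own partition id, so n keys -> exactly n partitions
class ExactKeyPartitioner(keys: Seq[String]) extends Partitioner {
  private val index = keys.zipWithIndex.toMap
  override def numPartitions: Int = keys.size
  override def getPartition(key: Any): Int = index(key.asInstanceOf[String])
}

// usage, assuming kv: RDD[(String, Row)] and the 10 known keys:
// val partitioned = kv.partitionBy(new ExactKeyPartitioner(knownKeys))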


Re: iterating over DataFrame Partitions sequentially

2016-09-10 Thread sujeet jog
Thank you Jakob,
It works for me.

On Sat, Sep 10, 2016 at 12:54 AM, Jakob Odersky <ja...@odersky.com> wrote:

> > Hi Jakob, I have a DataFrame with like 10 patitions, based on the exact
> content on each partition i want to batch load some other data from DB, i
> cannot operate in parallel due to resource contraints i have,  hence want
> to sequential iterate over each partition and perform operations.
>
>
> Ah I see. I think in that case your best option is to run several
> jobs, selecting different subsets of your dataframe for each job and
> running them one after the other. One way to do that would be to get
> the underlying rdd, mapping with the partition's index and then
> filtering and itering over every element. Eg.:
>
> val withPartitionIndex = df.rdd.mapPartitionsWithIndex((idx, it) =>
>   it.map(elem => (idx, elem)))
>
> for (i <- 0 until n) {
>   withPartitionIndex.filter { case (idx, _) => idx == i }.foreach {
>     case (idx, elem) =>
>       // do something with elem
>   }
> }
>
> it's not the best use-case of Spark though and will probably be a
> performance bottleneck.
>
> On Fri, Sep 9, 2016 at 11:45 AM, Jakob Odersky <ja...@odersky.com> wrote:
> > Hi Sujeet,
> >
> > going sequentially over all parallel, distributed data seems like a
> > counter-productive thing to do. What are you trying to accomplish?
> >
> > regards,
> > --Jakob
> >
> > On Fri, Sep 9, 2016 at 3:29 AM, sujeet jog <sujeet@gmail.com> wrote:
> >> Hi,
> >> Is there a way to iterate over a DataFrame with n partitions
> sequentially,
> >>
> >>
> >> Thanks,
> >> Sujeet
> >>
>


iterating over DataFrame Partitions sequentially

2016-09-09 Thread sujeet jog
Hi,
Is there a way to iterate over a DataFrame with n partitions sequentially,


Thanks,
Sujeet


Re: Dataframe write to DB, losing primary key index & data types.

2016-08-24 Thread sujeet jog
There was an inherent bug in my code which caused this.

On Wed, Aug 24, 2016 at 8:07 PM, sujeet jog <sujeet@gmail.com> wrote:

> Hi,
>
> I have a table with definition as below , when i write any records to this
> table, the varchar(20 ) gets changes to text, and it also losses the
> primary key index,
> any idea how to write data with spark SQL without loosing the primary key
> index & data types. ?
>
>
> MariaDB [analytics]> show columns from fcast;
> +-+-+--+-+--
> -+-+
> | Field   | Type| Null | Key | Default   | Extra
> |
> +-+-+--+-+--
> -+-+
> | TimeSeriesID| varchar(20) | NO   | PRI |   |
> |
> | TimeStamp   | timestamp   | NO   | PRI | CURRENT_TIMESTAMP | on
> update CURRENT_TIMESTAMP |
> | Forecast| double  | YES  | | NULL  |
> |
>
> I'm just doinig DF.write.mode("append").jdbc
>
> Thanks,
>
>


Dataframe write to DB, losing primary key index & data types.

2016-08-24 Thread sujeet jog
Hi,

I have a table with the definition below. When I write any records to this
table, the varchar(20) gets changed to text, and the table also loses its
primary key index.
Any idea how to write data with Spark SQL without losing the primary key
index and the data types?


MariaDB [analytics]> show columns from fcast;
+--------------+-------------+------+-----+-------------------+-----------------------------+
| Field        | Type        | Null | Key | Default           | Extra                       |
+--------------+-------------+------+-----+-------------------+-----------------------------+
| TimeSeriesID | varchar(20) | NO   | PRI |                   |                             |
| TimeStamp    | timestamp   | NO   | PRI | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
| Forecast     | double      | YES  |     | NULL              |                             |
+--------------+-------------+------+-----+-------------------+-----------------------------+

I'm just doing DF.write.mode("append").jdbc.

Thanks,


Re: call a mysql stored procedure from spark

2016-08-15 Thread sujeet jog
Thanks Mich, Michael,

As Ayan rightly said, yes, this stored procedure is invoked from the
driver; it creates a temporary table in the DB. The reason is that I want
to load some specific data after processing it, and I do not wish to bring
that processing into Spark; instead I want to keep it at the DB level.
Later, once the temp table is prepared, I would load it via Spark SQL in
the executors to process it further (see the sketch below).
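
A minimal sketch of that flow (the URL, credentials, procedure and table names are
all placeholders): call the procedure once from the driver over plain JDBC so the
database builds the temporary/result table, then read that table back through the
Spark JDBC source.

import java.sql.DriverManager

val url = "jdbc:mysql://db-host:3306/analytics"                // placeholder

// 1) driver-side call: let MySQL run the heavy processing and fill the temp/result table
val conn = DriverManager.getConnection(url, "user", "password")
try {
  val stmt = conn.prepareCall("{call prepare_temp_table(?)}")   // hypothetical procedure
  stmt.setInt(1, 42)
  stmt.execute()
  stmt.close()
} finally {
  conn.close()
}

// 2) load the prepared table into Spark for further processing on the executors
val df = spark.read
  .format("jdbc")
  .option("url", url)
  .option("dbtable", "temp_result_tbl")                         // table the procedure filled
  .option("user", "user")
  .option("password", "password")
  .load()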


On Mon, Aug 15, 2016 at 4:24 AM, ayan guha <guha.a...@gmail.com> wrote:

> More than technical feasibility, I would ask why to invoke a stored
> procedure for every row? If not, jdbcRdd is moot point.
>
> In case stored procedure should be invoked from driver, it can be easily
> done. Or at most for each partition, at each executor.
> On 15 Aug 2016 03:06, "Mich Talebzadeh" <mich.talebza...@gmail.com> wrote:
>
>> Hi,
>>
>> The link deals with JDBC and states:
>>
>> [image: Inline images 1]
>>
>> So it is only SQL. It lacks functionality on Stored procedures with
>> returning result set.
>>
>> This is on an Oracle table
>>
>> scala>  var _ORACLEserver = "jdbc:oracle:thin:@rhes564:1521:mydb12"
>> _ORACLEserver: String = jdbc:oracle:thin:@rhes564:1521:mydb12
>> scala>  var _username = "scratchpad"
>> _username: String = scratchpad
>> scala> var _password = "xxx"
>> _password: String = oracle
>>
>> scala> val s = HiveContext.read.format("jdbc").options(
>>  | Map("url" -> _ORACLEserver,
>>  | *"dbtable" -> "exec weights_sp",*
>>  | "user" -> _username,
>>  | "password" -> _password)).load
>> java.sql.SQLSyntaxErrorException: ORA-00942: table or view does not exist
>>
>>
>> and that stored procedure exists in Oracle
>>
>> scratch...@mydb12.mich.LOCAL> desc weights_sp
>> PROCEDURE weights_sp
>>
>>
>> HTH
>>
>>
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 14 August 2016 at 17:42, Michael Armbrust <mich...@databricks.com>
>> wrote:
>>
>>> As described here
>>> <http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases>,
>>> you can use the DataSource API to connect to an external database using
>>> JDBC.  While the dbtable option is usually just a table name, it can
>>> also be any valid SQL command that returns a table when enclosed in
>>> (parentheses).  I'm not certain, but I'd expect you could use this feature
>>> to invoke a stored procedure and return the results as a DataFrame.
>>>
>>> On Sat, Aug 13, 2016 at 10:40 AM, sujeet jog <sujeet@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Is there a way to call a stored procedure using spark ?
>>>>
>>>>
>>>> thanks,
>>>> Sujeet
>>>>
>>>
>>>
>>


call a mysql stored procedure from spark

2016-08-13 Thread sujeet jog
Hi,

Is there a way to call a stored procedure using spark ?


thanks,
Sujeet


Re: update specific rows to DB using sqlContext

2016-08-11 Thread sujeet jog
I read the table via Spark SQL and perform some ML activity on the data;
the result is that some specific columns need to be updated with the
ML-improved values. Hence I do not have the option of doing the whole
operation in MySQL.


Thanks,
Sujeet

On Thu, Aug 11, 2016 at 3:29 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Ok it is clearer now.
>
> You are using Spark as the query tool on an RDBMS table? Read table via
> JDBC, write back updating certain records.
>
> I have not done this myself but I suspect the issue would be if Spark
> write will commit the transaction and maintains ACID compliance. (locking
> the rows etc).
>
> I know it cannot do this to a Hive transactional table.
>
> Any reason why you are not doing the whole operation in MySQL itself?
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 11 August 2016 at 10:46, sujeet jog <sujeet@gmail.com> wrote:
>
>> 1 ) using mysql DB
>> 2 ) will be inserting/update/overwrite to the same table
>> 3 ) i want to update a specific column in a record, the data is read via
>> Spark SQL,
>>
>> on the below table which is read via sparkSQL, i would like to update the
>> NumOfSamples column .
>>
>> consider DF as the dataFrame which holds the records,  registered as
>> temporary table MS .
>>
>> spark.sqlContext.write.format("jdbc").option("url", url
>> ).option("dbtable", "update ms  set NumOfSamples = 20 where 'TimeSeriesID =
>> '1000'" As MS ).save
>>
>> I believe updating a record via sparkSQL is not supported,  the only
>> workaround is to open up a jdbc connection without using spark API's and do
>> a direct update ?..
>>
>> Sample Ex : -
>>
>> mysql> show columns from ms;
>> +--+-+--+-+-+---+
>> | Field| Type| Null | Key | Default | Extra |
>> +--+-+--+-+-+---+
>> | TimeSeriesID | varchar(20) | YES  | | NULL|   |
>> | NumOfSamples | int(11) | YES  | | NULL|   |
>> +--+-+--+-+-+---+
>>
>>
>> Thanks,
>> Sujeet
>>
>>
>>
>> On Tue, Aug 9, 2016 at 6:31 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>>
>>>1. what is the underlying DB, say Hive etc
>>>2. Is table transactional or you are going to do insert/overwrite to
>>>the same table
>>>3. can you do all this in the database itself assuming it is an RDBMS
>>>4. Can you provide the sql or pseudo code for such an update
>>>
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 9 August 2016 at 13:39, sujeet jog <sujeet@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Is it possible to update certain columnr records  in DB  from spark,
>>>>
>>>> for example i have 10 rows with 3 columns  which are read from Spark
>>>> SQL,
>>>>
>>>> i want to update specific column entries  and write back to DB, but
>>>> since RDD"s are immutable i believe this would be difficult, is there a
>>>> workaround.
>>>>
>>>>
>>>> Thanks,
>>>> Sujeet
>>>>
>>>
>>>
>>
>


Re: update specific rows to DB using sqlContext

2016-08-11 Thread sujeet jog
1) Using a MySQL DB.
2) I will be inserting/updating/overwriting the same table.
3) I want to update a specific column in a record; the data is read via
Spark SQL.

On the table below, which is read via Spark SQL, I would like to update the
NumOfSamples column.

Consider DF as the DataFrame which holds the records, registered as the
temporary table MS.

spark.sqlContext.write.format("jdbc").option("url", url).option("dbtable",
"update ms set NumOfSamples = 20 where 'TimeSeriesID = '1000'" As MS).save

I believe updating a record via Spark SQL is not supported. Is the only
workaround to open a JDBC connection without using the Spark APIs and do a
direct update (see the sketch below)?

Sample Ex : -

mysql> show columns from ms;
+--------------+-------------+------+-----+---------+-------+
| Field        | Type        | Null | Key | Default | Extra |
+--------------+-------------+------+-----+---------+-------+
| TimeSeriesID | varchar(20) | YES  |     | NULL    |       |
| NumOfSamples | int(11)     | YES  |     | NULL    |       |
+--------------+-------------+------+-----+---------+-------+


Thanks,
Sujeet
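
A minimal sketch of that direct-update workaround (the URL and credentials are
placeholders; the table and columns follow the example above): push the updated
values from each partition over a plain JDBC connection, since the Spark JDBC
writer cannot issue UPDATE statements.

import java.sql.DriverManager

// df holds the ML-adjusted values for the two columns of interest
val updates = df.select("TimeSeriesID", "NumOfSamples")

updates.foreachPartition { rows =>
  val conn = DriverManager.getConnection(
    "jdbc:mysql://db-host:3306/analytics", "user", "password")   // placeholders
  val stmt = conn.prepareStatement(
    "UPDATE ms SET NumOfSamples = ? WHERE TimeSeriesID = ?")
  try {
    rows.foreach { r =>
      stmt.setInt(1, r.getInt(1))        // NumOfSamples
      stmt.setString(2, r.getString(0))  // TimeSeriesID
      stmt.executeUpdate()
    }
  } finally {
    stmt.close()
    conn.close()
  }
}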



On Tue, Aug 9, 2016 at 6:31 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Hi,
>
>
>1. what is the underlying DB, say Hive etc
>2. Is table transactional or you are going to do insert/overwrite to
>the same table
>3. can you do all this in the database itself assuming it is an RDBMS
>4. Can you provide the sql or pseudo code for such an update
>
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 9 August 2016 at 13:39, sujeet jog <sujeet@gmail.com> wrote:
>
>> Hi,
>>
>> Is it possible to update certain columnr records  in DB  from spark,
>>
>> for example i have 10 rows with 3 columns  which are read from Spark SQL,
>>
>> i want to update specific column entries  and write back to DB, but since
>> RDD"s are immutable i believe this would be difficult, is there a
>> workaround.
>>
>>
>> Thanks,
>> Sujeet
>>
>
>


update specific rows to DB using sqlContext

2016-08-09 Thread sujeet jog
Hi,

Is it possible to update certain column records in the DB from Spark?

For example, I have 10 rows with 3 columns which are read via Spark SQL;
I want to update specific column entries and write them back to the DB.
But since RDDs are immutable, I believe this would be difficult. Is there a
workaround?


Thanks,
Sujeet


Re: how to run local[k] threads on a single core

2016-08-05 Thread sujeet jog
Thanks,

Since I'm running in local mode, I plan to pin the JVM to a CPU with
taskset -cp; hopefully with this all the tasks will operate on the
specified CPU cores.

Thanks,
Sujeet

On Thu, Aug 4, 2016 at 8:11 PM, Daniel Darabos <
daniel.dara...@lynxanalytics.com> wrote:

> You could run the application in a Docker container constrained to one CPU
> with --cpuset-cpus (https://docs.docker.com/engine/reference/run/#/cpuset-
> constraint).
>
> On Thu, Aug 4, 2016 at 8:51 AM, Sun Rui <sunrise_...@163.com> wrote:
>
>> I don’t think it possible as Spark does not support thread to CPU
>> affinity.
>> > On Aug 4, 2016, at 14:27, sujeet jog <sujeet@gmail.com> wrote:
>> >
>> > Is there a way we can run multiple tasks concurrently on a single core
>> in local mode.
>> >
>> > for ex :- i have 5 partition ~ 5 tasks, and only a single core , i want
>> these tasks to run concurrently, and specifiy them to use /run on a single
>> core.
>> >
>> > The machine itself is say 4 core, but i want to utilize only 1 core out
>> of it,.
>> >
>> > Is it possible ?
>> >
>> > Thanks,
>> > Sujeet
>> >
>>
>>
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>


how to run local[k] threads on a single core

2016-08-04 Thread sujeet jog
Is there a way to run multiple tasks concurrently on a single core in local
mode?

For example: I have 5 partitions ~ 5 tasks and only a single core; I want
these tasks to run concurrently, and to specify that they use / run on a
single core.

The machine itself has, say, 4 cores, but I want to utilize only 1 of them.

Is it possible?

Thanks,
Sujeet


Re: Load selected rows with sqlContext in the dataframe

2016-07-22 Thread sujeet jog
Thanks Todd.

On Thu, Jul 21, 2016 at 9:18 PM, Todd Nist <tsind...@gmail.com> wrote:

> You can set the dbtable to this:
>
> .option("dbtable", "(select * from master_schema where 'TID' = '100_0')")
>
> HTH,
>
> Todd
>
>
> On Thu, Jul 21, 2016 at 10:59 AM, sujeet jog <sujeet@gmail.com> wrote:
>
>> I have a table of size 5GB, and want to load selective rows into
>> dataframe instead of loading the entire table in memory,
>>
>>
>> For me memory is a constraint hence , and i would like to peridically
>> load few set of rows and perform dataframe operations on it,
>>
>> ,
>> for the "dbtable"  is there a way to perform select * from master_schema
>> where 'TID' = '100_0';
>> which can load only this to memory as dataframe .
>>
>>
>>
>> Currently  I'm using code as below
>> val df  =  sqlContext.read .format("jdbc")
>>   .option("url", url)
>>   .option("dbtable", "master_schema").load()
>>
>>
>> Thansk,
>> Sujeet
>>
>
>


Load selected rows with sqlContext in the dataframe

2016-07-21 Thread sujeet jog
I have a table of size 5 GB and want to load selected rows into a DataFrame
instead of loading the entire table into memory.

Memory is a constraint for me, hence I would like to periodically load a
small set of rows and perform DataFrame operations on it.

For the "dbtable" option, is there a way to perform
select * from master_schema where 'TID' = '100_0';
so that only this result is loaded into memory as a DataFrame?


Currently I'm using code as below:

val df = sqlContext.read.format("jdbc")
  .option("url", url)
  .option("dbtable", "master_schema").load()


Thanks,
Sujeet
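
Following Todd's suggestion above, a minimal sketch of the full call (the subquery
alias is required by databases such as MySQL; the column quoting is adapted from
the example and should be matched to your schema):

// the JDBC source accepts a parenthesized subquery as "dbtable",
// so the row selection is pushed down to the database
val df = sqlContext.read
  .format("jdbc")
  .option("url", url)
  .option("dbtable", "(select * from master_schema where TID = '100_0') as t")
  .load()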


Re: Using R code as part of a Spark Application

2016-06-30 Thread sujeet jog
Thanks for the link, Sun. I believe running external scripts such as R code
on DataFrames is a much-needed facility. For example, for algorithms that
are not available in MLlib, invoking them from an R script would definitely
be a powerful feature when your app is Scala/Python based; you don't have
to use SparkR just for this when much of your application code is in
Scala/Python.

On Thu, Jun 30, 2016 at 8:25 AM, Sun Rui <sunrise_...@163.com> wrote:

> Hi, Gilad,
>
> You can try the dapply() and gapply() function in SparkR in Spark 2.0.
> Yes, it is required that R installed on each worker node.
>
> However, if your Spark application is Scala/Java based, it is not
> supported for now to run R code in DataFrames. There is closed lira
> https://issues.apache.org/jira/browse/SPARK-14746 which remains
> discussion purpose. You have to convert DataFrames to RDDs, and use pipe()
> on RDDs to launch external R processes and run R code.
>
> On Jun 30, 2016, at 07:08, Xinh Huynh <xinh.hu...@gmail.com> wrote:
>
> It looks like it. "DataFrame UDFs in R" is resolved in Spark 2.0:
> https://issues.apache.org/jira/browse/SPARK-6817
>
> Here's some of the code:
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/r/MapPartitionsRWrapper.scala
>
> /**
> * A function wrapper that applies the given R function to each partition.
> */
> private[sql] case class MapPartitionsRWrapper(
> func: Array[Byte],
> packageNames: Array[Byte],
> broadcastVars: Array[Broadcast[Object]],
> inputSchema: StructType,
> outputSchema: StructType) extends (Iterator[Any] => Iterator[Any])
>
> Xinh
>
> On Wed, Jun 29, 2016 at 2:59 PM, Sean Owen <so...@cloudera.com> wrote:
>
>> Here we (or certainly I) am not talking about R Server, but plain vanilla
>> R, as used with Spark and SparkR. Currently, SparkR doesn't distribute R
>> code at all (it used to, sort of), so I'm wondering if that is changing
>> back.
>>
>> On Wed, Jun 29, 2016 at 10:53 PM, John Aherne <john.ahe...@justenough.com
>> > wrote:
>>
>>> I don't think R server requires R on the executor nodes. I originally
>>> set up a SparkR cluster for our Data Scientist on Azure which required that
>>> I install R on each node, but for the R Server set up, there is an extra
>>> edge node with R server that they connect to. From what little research I
>>> was able to do, it seems that there are some special functions in R Server
>>> that can distribute the work to the cluster.
>>>
>>> Documentation is light, and hard to find but I found this helpful:
>>>
>>> https://blogs.msdn.microsoft.com/uk_faculty_connection/2016/05/10/r-server-for-hdinsight-running-on-microsoft-azure-cloud-data-science-challenges/
>>>
>>>
>>>
>>> On Wed, Jun 29, 2016 at 3:29 PM, Sean Owen <so...@cloudera.com> wrote:
>>>
>>>> Oh, interesting: does this really mean the return of distributing R
>>>> code from driver to executors and running it remotely, or do I
>>>> misunderstand? this would require having R on the executor nodes like
>>>> it used to?
>>>>
>>>> On Wed, Jun 29, 2016 at 5:53 PM, Xinh Huynh <xinh.hu...@gmail.com>
>>>> wrote:
>>>> > There is some new SparkR functionality coming in Spark 2.0, such as
>>>> > "dapply". You could use SparkR to load a Parquet file and then run
>>>> "dapply"
>>>> > to apply a function to each partition of a DataFrame.
>>>> >
>>>> > Info about loading Parquet file:
>>>> >
>>>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc1-docs/sparkr.html#from-data-sources
>>>> >
>>>> > API doc for "dapply":
>>>> >
>>>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc1-docs/api/R/index.html
>>>> >
>>>> > Xinh
>>>> >
>>>> > On Wed, Jun 29, 2016 at 6:54 AM, sujeet jog <sujeet@gmail.com>
>>>> wrote:
>>>> >>
>>>> >> try Spark pipeRDD's , you can invoke the R script from pipe , push
>>>> the
>>>> >> stuff you want to do on the Rscript stdin,  p
>>>> >>
>>>> >>
>>>> >> On Wed, Jun 29, 2016 at 7:10 PM, Gilad Landau <
>>>> gilad.lan...@clicktale.com>
>>>> >> wrote:
>>>> >>>
>>>> >>> Hello,
>>>> >>>
>>&g

Re: Using R code as part of a Spark Application

2016-06-29 Thread sujeet jog
Try Spark's pipe() on RDDs: you can invoke the R script from pipe and push
the data you want processed to the R script's stdin (see the sketch below).
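
A minimal sketch of that pattern (the row-to-CSV serialization and the script path
are assumptions; the R script is expected to read stdin and write its results to
stdout):

// serialize each Row as a CSV line for the external process
val input = df.rdd.map(_.mkString(","))

// stream each partition through the R script; its stdout lines come back as an RDD[String]
val results = input.pipe("Rscript /opt/models/forecast.R")   // placeholder script

results.take(5).foreach(println)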


On Wed, Jun 29, 2016 at 7:10 PM, Gilad Landau 
wrote:

> Hello,
>
>
>
> I want to use R code as part of spark application (the same way I would do
> with Scala/Python).  I want to be able to run an R syntax as a map function
> on a big Spark dataframe loaded from a parquet file.
>
> Is this even possible or the only way to use R is as part of RStudio
> orchestration of our Spark  cluster?
>
>
>
> Thanks for the help!
>
>
>
> Gilad
>
>
>


Re: Spark jobs

2016-06-29 Thread sujeet jog
Check if this helps:

from multiprocessing import Process
import os

def training():
    print("Training Workflow")
    cmd = "spark/bin/spark-submit ./ml.py &"
    os.system(cmd)

w_training = Process(target=training)
w_training.start()


On Wed, Jun 29, 2016 at 6:28 PM, Joaquin Alzola 
wrote:

> Hi,
>
>
>
> This is a totally newbie question but I seem not to find the link ….. when
> I create a spark-submit python script to be launch …
>
>
>
> how should I call it from the main python script with a subprocess.popen?
>
>
>
> BR
>
>
>
> Joaquin
>
>
>
>
>
>
>
>
>
>
>
>
> This email is confidential and may be subject to privilege. If you are not
> the intended recipient, please do not copy or disclose its content but
> contact the sender immediately upon receipt.
>


Re: Can we use existing R model in Spark

2016-05-30 Thread sujeet jog
Try invoking an R script from Spark using the RDD pipe method: get the work
done in R and receive the model/results back as an RDD.


For example:
    rdd.pipe("")


On Mon, May 30, 2016 at 3:57 PM, Sun Rui  wrote:

> Unfortunately no. Spark does not support loading external modes (for
> examples, PMML) for now.
> Maybe you can try using the existing random forest model in Spark.
>
> On May 30, 2016, at 18:21, Neha Mehta  wrote:
>
> Hi,
>
> I have an existing random forest model created using R. I want to use that
> to predict values on Spark. Is it possible to do the same? If yes, then how?
>
> Thanks & Regards,
> Neha
>
>
>


Re: local Vs Standalonecluster production deployment

2016-05-28 Thread sujeet jog
Great, Thanks.

On Sun, May 29, 2016 at 12:38 AM, Chris Fregly <ch...@fregly.com> wrote:

> btw, here's a handy Spark Config Generator by Ewan Higgs in in Gent,
> Belgium:
>
> code:  https://github.com/ehiggs/spark-config-gen
>
> demo:  http://ehiggs.github.io/spark-config-gen/
>
> my recent tweet on this:
> https://twitter.com/cfregly/status/736631633927753729
>
> On Sat, May 28, 2016 at 10:50 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> hang on. Free is telling me you have 8GB of memory. I was under the
>> impression that you had 4GB of RAM :)
>>
>> So with no app you have 3.99GB free ~ 4GB
>>  1st app takes 428MB of memory and the second is 425MB so pretty lean apps
>>
>> The question is the apps that I run take 2-3GB each. But your mileage
>> varies. If you end up with free memory running these minute apps and no
>> sudden spike in memory/cpu usage then as long as they run and finish within
>> SLA you should be OK whichever environment you run. May be you apps do not
>> require that amount of memory.
>>
>> I don't think there is clear cut answer to NOT to use local mode in prod.
>> Others may have different opinions on this.
>>
>> HTH
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 28 May 2016 at 18:37, sujeet jog <sujeet@gmail.com> wrote:
>>
>>> ran these from muliple bash shell for now, probably a multi threaded
>>> python script would do ,  memory and resource allocations are seen as
>>> submitted parameters
>>>
>>>
>>> *say before running any applications . *
>>>
>>> [root@fos-elastic02 ~]# /usr/bin/free
>>>  total   used   free sharedbuffers cached
>>> Mem:   8058568*4066296 *   3992272  10172 141368
>>>  1549520
>>> -/+ buffers/cache:23754085683160
>>> Swap:  8290300 1086728181628
>>>
>>>
>>> *only 1 App : *
>>>
>>> [root@fos-elastic02 ~]# /usr/bin/free
>>>  total   used   free sharedbuffers cached
>>> Mem:   8058568*4494488*3564080  10172 141392
>>>  1549948
>>> -/+ buffers/cache:28031485255420
>>> Swap:  8290300 1086728181628
>>>
>>>
>>> ran the single APP twice in parallel ( memory used double as expected )
>>>
>>> [root@fos-elastic02 ~]# /usr/bin/free
>>>  total   used   free sharedbuffers cached
>>> Mem:   8058568*4919532 *   3139036  10172 141444
>>>  1550376
>>> -/+ buffers/cache:32277124830856
>>> Swap:  8290300 1086728181628
>>>
>>>
>>> Curious to know if local mode is used in real deployments where there is
>>> a scarcity of resources.
>>>
>>>
>>> Thanks,
>>> Sujeet
>>>
>>> On Sat, May 28, 2016 at 10:50 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> OK that is good news. So briefly how do you kick off spark-submit for
>>>> each (or sparkConf). In terms of memory/resources allocations.
>>>>
>>>> Now what is the output of
>>>>
>>>> /usr/bin/free
>>>>
>>>>
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * 
>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>>
>>>> On 28 May 2016 at 18:12, sujeet jog <sujeet@gmail.com> wrote:
>>>>
>>>>> Yes Mich,
>>>>> They are currently emitting the results parallely,
>>>>> http://localhost:4040 &  http://localhost:4041 , i also see the
>>>>> monitoring from these URL's,
>>>>>
>>>>>
>>>>> On Sat, May 28, 2016 at 10:37 PM, Mich Talebzadeh <
>>>>> mich.talebza...@gm

Re: local Vs Standalonecluster production deployment

2016-05-28 Thread sujeet jog
I ran these from multiple bash shells for now; probably a multi-threaded
Python script would do the same. Memory and resource allocations are as
seen in the submitted parameters.


Say, before running any applications:

[root@fos-elastic02 ~]# /usr/bin/free
             total       used       free     shared    buffers     cached
Mem:       8058568    4066296    3992272      10172     141368    1549520
-/+ buffers/cache:    2375408    5683160
Swap:      8290300     108672    8181628


Only 1 app:

[root@fos-elastic02 ~]# /usr/bin/free
             total       used       free     shared    buffers     cached
Mem:       8058568    4494488    3564080      10172     141392    1549948
-/+ buffers/cache:    2803148    5255420
Swap:      8290300     108672    8181628


Ran the single app twice in parallel (memory used doubled as expected):

[root@fos-elastic02 ~]# /usr/bin/free
             total       used       free     shared    buffers     cached
Mem:       8058568    4919532    3139036      10172     141444    1550376
-/+ buffers/cache:    3227712    4830856
Swap:      8290300     108672    8181628


Curious to know whether local mode is used in real deployments where
resources are scarce.


Thanks,
Sujeet

On Sat, May 28, 2016 at 10:50 PM, Mich Talebzadeh <mich.talebza...@gmail.com
> wrote:

> OK that is good news. So briefly how do you kick off spark-submit for each
> (or sparkConf). In terms of memory/resources allocations.
>
> Now what is the output of
>
> /usr/bin/free
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 28 May 2016 at 18:12, sujeet jog <sujeet@gmail.com> wrote:
>
>> Yes Mich,
>> They are currently emitting the results parallely,
>> http://localhost:4040 &  http://localhost:4041 , i also see the
>> monitoring from these URL's,
>>
>>
>> On Sat, May 28, 2016 at 10:37 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> ok they are submitted but the latter one 14302 is it doing anything?
>>>
>>> can you check it with jmonitor or the logs created
>>>
>>> HTH
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 28 May 2016 at 18:03, sujeet jog <sujeet@gmail.com> wrote:
>>>
>>>> Thanks Ted,
>>>>
>>>> Thanks Mich,  yes i see that i can run two applications by submitting
>>>> these,  probably Driver + Executor running in a single JVM .  In-Process
>>>> Spark.
>>>>
>>>> wondering if this can be used in production systems,  the reason for me
>>>> considering local instead of standalone cluster mode is purely because of
>>>> CPU/MEM resources,  i.e,  i currently do not have the liberty to use 1
>>>> Driver & 1 Executor per application,( running in a embedded network
>>>> switch  )
>>>>
>>>>
>>>> jps output
>>>> [root@fos-elastic02 ~]# jps
>>>> 14258 SparkSubmit
>>>> 14503 Jps
>>>> 14302 SparkSubmit
>>>> ,
>>>>
>>>> On Sat, May 28, 2016 at 10:21 PM, Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Ok so you want to run all this in local mode. In other words something
>>>>> like below
>>>>>
>>>>> ${SPARK_HOME}/bin/spark-submit \
>>>>>
>>>>> --master local[2] \
>>>>>
>>>>> --driver-memory 2G \
>>>>>
>>>>> --num-executors=1 \
>>>>>
>>>>> --executor-memory=2G \
>>>>>
>>>>> --executor-cores=2 \
>>>>>
>>>>>
>>>>> I am not sure it will work for multiple drivers (app/JVM).  The only
>>>>> way you can find out is to do try it running two apps simultaneously. You
>>>>> have a number of tools.
>>>>>
>>>>>
>>>>>
>>>>

Re: local Vs Standalonecluster production deployment

2016-05-28 Thread sujeet jog
Yes Mich,
They are currently emitting the results in parallel: http://localhost:4040
& http://localhost:4041. I can also see the monitoring from these URLs.


On Sat, May 28, 2016 at 10:37 PM, Mich Talebzadeh <mich.talebza...@gmail.com
> wrote:

> ok they are submitted but the latter one 14302 is it doing anything?
>
> can you check it with jmonitor or the logs created
>
> HTH
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 28 May 2016 at 18:03, sujeet jog <sujeet@gmail.com> wrote:
>
>> Thanks Ted,
>>
>> Thanks Mich,  yes i see that i can run two applications by submitting
>> these,  probably Driver + Executor running in a single JVM .  In-Process
>> Spark.
>>
>> wondering if this can be used in production systems,  the reason for me
>> considering local instead of standalone cluster mode is purely because of
>> CPU/MEM resources,  i.e,  i currently do not have the liberty to use 1
>> Driver & 1 Executor per application,( running in a embedded network
>> switch  )
>>
>>
>> jps output
>> [root@fos-elastic02 ~]# jps
>> 14258 SparkSubmit
>> 14503 Jps
>> 14302 SparkSubmit
>> ,
>>
>> On Sat, May 28, 2016 at 10:21 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Ok so you want to run all this in local mode. In other words something
>>> like below
>>>
>>> ${SPARK_HOME}/bin/spark-submit \
>>>
>>> --master local[2] \
>>>
>>> --driver-memory 2G \
>>>
>>> --num-executors=1 \
>>>
>>> --executor-memory=2G \
>>>
>>> --executor-cores=2 \
>>>
>>>
>>> I am not sure it will work for multiple drivers (app/JVM).  The only way
>>> you can find out is to do try it running two apps simultaneously. You have
>>> a number of tools.
>>>
>>>
>>>
>>>1. use jps to see the apps and PID
>>>2. use jmonitor to see memory/cpu/heap usage for each spark-submit
>>>job
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 28 May 2016 at 17:41, Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>>> Sujeet:
>>>>
>>>> Please also see:
>>>>
>>>> https://spark.apache.org/docs/latest/spark-standalone.html
>>>>
>>>> On Sat, May 28, 2016 at 9:19 AM, Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Hi Sujeet,
>>>>>
>>>>> if you have a single machine then it is Spark standalone mode.
>>>>>
>>>>> In Standalone cluster mode Spark allocates resources based on cores.
>>>>> By default, an application will grab all the cores in the cluster.
>>>>>
>>>>> You only have one worker that lives within the driver JVM process that
>>>>> you start when you start the application with spark-shell or spark-submit
>>>>> in the host where the cluster manager is running.
>>>>>
>>>>> The Driver node runs on the same host that the cluster manager is
>>>>> running. The Driver requests the Cluster Manager for resources to run
>>>>> tasks. The worker is tasked to create the executor (in this case there is
>>>>> only one executor) for the Driver. The Executor runs tasks for the Driver.
>>>>> Only one executor can be allocated on each worker per application. In your
>>>>> case you only have
>>>>>
>>>>>
>>>>> The minimum you will need will be 2-4G of RAM and two cores. Well that
>>>>> is my experience. Yes you can submit more than one spark-submit (the
>>>>> driver) but they may queue up behind the running one if there is not 
>>>>> enough
>>>>> resources.
>>>>>
>>

Re: local Vs Standalonecluster production deployment

2016-05-28 Thread sujeet jog
Thanks Ted,

Thanks Mich. Yes, I see that I can run two applications by submitting
these, with the Driver + Executor probably running in a single JVM
(in-process Spark).

I'm wondering if this can be used in production systems. The reason I'm
considering local instead of standalone cluster mode is purely CPU/memory
resources, i.e. I currently do not have the liberty of using one driver and
one executor per application (this runs on an embedded network switch).


jps output:
[root@fos-elastic02 ~]# jps
14258 SparkSubmit
14503 Jps
14302 SparkSubmit

On Sat, May 28, 2016 at 10:21 PM, Mich Talebzadeh <mich.talebza...@gmail.com
> wrote:

> Ok so you want to run all this in local mode. In other words something
> like below
>
> ${SPARK_HOME}/bin/spark-submit \
>
> --master local[2] \
>
> --driver-memory 2G \
>
> --num-executors=1 \
>
> --executor-memory=2G \
>
> --executor-cores=2 \
>
>
> I am not sure it will work for multiple drivers (app/JVM).  The only way
> you can find out is to do try it running two apps simultaneously. You have
> a number of tools.
>
>
>
>1. use jps to see the apps and PID
>2. use jmonitor to see memory/cpu/heap usage for each spark-submit job
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 28 May 2016 at 17:41, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Sujeet:
>>
>> Please also see:
>>
>> https://spark.apache.org/docs/latest/spark-standalone.html
>>
>> On Sat, May 28, 2016 at 9:19 AM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi Sujeet,
>>>
>>> if you have a single machine then it is Spark standalone mode.
>>>
>>> In Standalone cluster mode Spark allocates resources based on cores. By
>>> default, an application will grab all the cores in the cluster.
>>>
>>> You only have one worker that lives within the driver JVM process that
>>> you start when you start the application with spark-shell or spark-submit
>>> in the host where the cluster manager is running.
>>>
>>> The Driver node runs on the same host that the cluster manager is
>>> running. The Driver requests the Cluster Manager for resources to run
>>> tasks. The worker is tasked to create the executor (in this case there is
>>> only one executor) for the Driver. The Executor runs tasks for the Driver.
>>> Only one executor can be allocated on each worker per application. In your
>>> case you only have
>>>
>>>
>>> The minimum you will need will be 2-4G of RAM and two cores. Well that
>>> is my experience. Yes you can submit more than one spark-submit (the
>>> driver) but they may queue up behind the running one if there is not enough
>>> resources.
>>>
>>>
>>> You pointed out that you will be running few applications in parallel on
>>> the same host. The likelihood is that you are using a VM machine for this
>>> purpose and the best option is to try running the first one, Check Web GUI
>>> on  4040 to see the progress of this Job. If you start the next JVM then
>>> assuming it is working, it will be using port 4041 and so forth.
>>>
>>>
>>> In actual fact try the command "free" to see how much free memory you
>>> have.
>>>
>>>
>>> HTH
>>>
>>>
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 28 May 2016 at 16:42, sujeet jog <sujeet@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have a question w.r.t  production deployment mode of spark,
>>>>
>>>> I have 3 applications which i would like to run independently on a
>>>> single machine, i need to run the drivers in the same machine.
>>>>
>>>> The amount of resources i have is also limited, like 4- 5GB RAM , 3 - 4
>>>> cores.
>>>>
>>>> For deployment in standalone mode : i believe i need
>>>>
>>>> 1 Driver JVM,  1 worker node ( 1 executor )
>>>> 1 Driver JVM,  1 worker node ( 1 executor )
>>>> 1 Driver JVM,  1 worker node ( 1 executor )
>>>>
>>>> The issue here is i will require 6 JVM running in parallel, for which i
>>>> do not have sufficient CPU/MEM resources,
>>>>
>>>>
>>>> Hence i was looking more towards a local mode deployment mode, would
>>>> like to know if anybody is using local mode where Driver + Executor run in
>>>> a single JVM in production mode.
>>>>
>>>> Are there any inherent issues upfront using local mode for production
>>>> base systems.?..
>>>>
>>>>
>>>
>>
>


local Vs Standalonecluster production deployment

2016-05-28 Thread sujeet jog
Hi,

I have a question w.r.t. the production deployment mode of Spark.

I have 3 applications which I would like to run independently on a single
machine; I need to run the drivers on the same machine.

The amount of resources I have is also limited: around 4-5 GB RAM and 3-4
cores.

For deployment in standalone mode, I believe I need:

1 Driver JVM, 1 worker node (1 executor)
1 Driver JVM, 1 worker node (1 executor)
1 Driver JVM, 1 worker node (1 executor)

The issue here is that I will require 6 JVMs running in parallel, for which
I do not have sufficient CPU/memory resources.

Hence I was looking more towards a local-mode deployment, and would like to
know if anybody is using local mode, where the Driver + Executor run in a
single JVM, in production.

Are there any inherent issues with using local mode for production
systems?


sparkApp on standalone/local mode with multithreading

2016-05-25 Thread sujeet jog
I had a few questions w.r.t. Spark deployment and the way I want to use it;
it would be helpful if you could answer a few.

I plan to use Spark on an embedded switch, which has a limited set of
resources, say 1 or 2 dedicated cores and 1.5 GB of memory.
I want to model network traffic with time series algorithms. The algorithms
I want to use do not currently exist in Spark, so I'm writing them in R,
and I plan to use pipe() to have them executed from Spark.

The reason I'm using Spark, beyond the ETL functions, is portability, so
that the same code can be reused on an x86 platform with more CPU and
memory resources if required.

Within the same application I would like to create multiple threads: one
thread doing the testing of the ML model, another doing training at some
specific time, and perhaps another thread doing some other activity if
required.

Can you please let me know if you see any apparent issues with this kind of
design, from your experience with Spark?


Re: Scala vs Python for Spark ecosystem

2016-04-20 Thread sujeet jog
It depends on the trade-offs you wish to make.

Python being an interpreted language, the speed of execution will be lower,
but since it is a very commonly used language, people can get hands-on
quickly.

Scala programs run in the Java environment, so it's obvious you will get
good execution speed, although it's less common for people to already know
this language.


I believe the PySpark APIs will, in the long run, have everything the Scala
Spark APIs offer.



On Wed, Apr 20, 2016 at 12:14 PM, berkerkozan  wrote:

> I know scala better than python but my team (2 other my friend) knows only
> python. We want to use graphx or maybe try graphframes.
> What will be the future of these 2 languages for spark ecosystem? Will
> python cover everything scala can in short time periods? what do you
> advice?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Scala-vs-Python-for-Spark-ecosystem-tp26805.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Aggregate subsequent x row values together.

2016-03-28 Thread sujeet jog
Hi Ted,

There is no row key per se, and I actually do not want to sort; I want to
aggregate each group of x subsequent rows together as a mean value while
maintaining the order of the row entries.

For example:
Input RDD
[ 12, 45 ]
[ 14, 50 ]
[ 10, 35 ]
[ 11, 50 ]

Expected output RDD (an aggregation by mean over each group of 2 subsequent
rows):

[ 13, 47.5 ]
[ 10.5, 42.5 ]


@Alexander: Yes, inducing a dummy key seems to be one of the ways. Can you
please post a snippet if possible on how to achieve this?


On Mon, Mar 28, 2016 at 10:30 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> Can you describe your use case a bit more ?
>
> Since the row keys are not sorted in your example, there is a chance that
> you get indeterministic results when you aggregate on groups of two
> successive rows.
>
> Thanks
>
> On Mon, Mar 28, 2016 at 9:21 AM, sujeet jog <sujeet@gmail.com> wrote:
>
>> Hi,
>>
>> I have a RDD  like this .
>>
>> [ 12, 45 ]
>> [ 14, 50 ]
>> [ 10, 35 ]
>> [ 11, 50 ]
>>
>> i want to aggreate values of first two rows into 1 row and subsequenty
>> the next two rows into another single row...
>>
>> i don't have a key to aggregate for using some of the aggregate pyspark
>> functions, how to achieve it ?
>>
>>
>>
>


Aggregate subsequent x row values together.

2016-03-28 Thread sujeet jog
Hi,

I have an RDD like this:

[ 12, 45 ]
[ 14, 50 ]
[ 10, 35 ]
[ 11, 50 ]

I want to aggregate the values of the first two rows into one row,
subsequently the next two rows into another single row, and so on.

I don't have a key to aggregate on for using some of the aggregate PySpark
functions; how do I achieve it? (See the sketch below.)
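
A minimal sketch of the dummy-key idea (written in Scala here, though
zipWithIndex/groupByKey exist in PySpark as well; the group size of 2 and the
single input partition are assumptions):

// sample rows as in the question
val rdd = sc.parallelize(Seq(Array(12.0, 45.0), Array(14.0, 50.0),
                             Array(10.0, 35.0), Array(11.0, 50.0)), numSlices = 1)

val groupSize = 2

val means = rdd.zipWithIndex()                       // zipWithIndex preserves the RDD order
  .map { case (row, idx) => (idx / groupSize, row) } // every groupSize consecutive rows share a key
  .groupByKey()
  .sortByKey()                                       // restore group order after the shuffle
  .map { case (_, rows) =>
    val n = rows.size
    rows.reduce((a, b) => a.zip(b).map { case (x, y) => x + y }).map(_ / n)
  }

// means.collect() => Array(Array(13.0, 47.5), Array(10.5, 42.5))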


Run External R script from Spark

2016-03-21 Thread sujeet jog
Hi,

I have been working on a POC on some time-series-related stuff. I'm using
Python since I need Spark Streaming and SparkR does not yet have a Spark
Streaming front end. A couple of algorithms I want to use are not yet
present in the Spark-TS package, so I'm thinking of invoking an external R
script for the algorithm part and passing the data from Spark to the R
script via pipe on the RDD.


What I wanted to understand is whether something like this can be used in
a production deployment, since passing the data through the R script would
mean a lot of serialization and would not really use the power of Spark for
parallel execution.

Has anyone used this kind of workaround: Spark -> pipe RDD -> R script?


Thanks,
Sujeet