Re: flatMap() returning large class

2017-12-14 Thread Richard Garris
Hi Don,

Good to hear from you. I think the problem is that, regardless of whether
you use yield or a generator, Spark will internally materialize the entire
result as a single large JVM object, which will blow up your heap space.

Would it be possible to shrink the overall size of the image object by
storing it as a vector or an Array rather than as a large Java class object?

That might be the more prudent approach.
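
Purely as an illustration of what I mean (these are placeholder classes, not
yours):

// Placeholder sketch only: a flat primitive array is a much more compact
// JVM representation than a graph of many small objects.
case class Pixel(r: Byte, g: Byte, b: Byte)
case class HeavyImage(id: String, pixels: Seq[Pixel])     // many small objects, lots of overhead

case class CompactImage(id: String, pixels: Array[Byte])  // one contiguous buffer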

-RG

Richard Garris

Principal Architect

Databricks, Inc

650.200.0840

rlgar...@databricks.com

On December 14, 2017 at 10:23:00 AM, Marcelo Vanzin (van...@cloudera.com)
wrote:

This sounds like something mapPartitions should be able to do, not
sure if there's an easier way.

On Thu, Dec 14, 2017 at 10:20 AM, Don Drake  wrote:
> I'm looking for some advice when I have a flatMap on a Dataset that is
> creating and returning a sequence of a new case class
> (Seq[BigDataStructure]) that contains a very large amount of data, much
> larger than the single input record (think images).
>
> In Python, you can use generators (yield) to avoid creating a large list
> of structures and returning the list.
>
> I'm programming this in Scala and was wondering if there are any similar
> tricks to optimally return a list of classes? I found the for/yield
> semantics, but it appears the compiler just creates a sequence for you,
> and this will blow through my heap given the number of elements in the
> list and the size of each element.
>
> Is there anything else I can use?
>
> Thanks.
>
> -Don
>
> --
> Donald Drake
> Drake Consulting
> http://www.drakeconsulting.com/
> https://twitter.com/dondrake
> 800-733-2143



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org


kinesis throughput problems

2017-12-14 Thread Jeremy Kelley
We have a largish Kinesis stream with about 25k events per second, and each
record is around 142k. I have tried multiple cluster sizes, multiple batch
sizes, and multiple parameter settings, and I am doing minimal transformations
on the data. Whatever I try, I can sustain consuming 25k EPS with minimal
effort and cluster load for about 5-10 minutes, and then the stream always
drops to hovering around 5k EPS.

I can give MANY more details but I was curious if anyone had seen similar 
behavior.

Thanks,
Jeremy


-- 
Jeremy Kelley | Technical Director, Data
jkel...@carbonblack.com | Carbon Black Threat Engineering






Re: Feature generation / aggregate functions / timeseries

2017-12-14 Thread Georg Heiler
Also, the RDD StatCounter will already compute most of your desired metrics,
as will df.describe:
https://databricks.com/blog/2015/06/02/statistical-and-mathematical-functions-with-dataframes-in-spark.html
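
For example (a minimal sketch; the column names here are made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("stats-sketch").getOrCreate()
import spark.implicits._

val df = Seq(("A", 100L, 100.0), ("A", 1000500L, 66.0)).toDF("id", "timestamp", "value")

// DataFrame.describe(): count, mean, stddev, min, max for the given columns
df.describe("value").show()

// RDD StatCounter: the same summary statistics for an RDD[Double]
val stats = df.select("value").as[Double].rdd.stats()
println(s"count=${stats.count} mean=${stats.mean} min=${stats.min} max=${stats.max}")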
Georg Heiler  wrote on Thu, 14 Dec 2017 at 19:40:

> Look at custom UDAF functions
>  wrote on Thu, 14 Dec 2017 at 09:31:
>
>> Hi dear spark community !
>>
>> I want to create a lib which generates features for potentially very
>> large datasets, so I believe spark could be a nice tool for that.
>> Let me explain what I need to do :
>>
>> Each file 'F' of my dataset is composed of at least :
>> - an id ( string or int )
>> - a timestamp ( or a long value )
>> - a value ( generally a double )
>>
>> I want my tool to :
>> - compute aggregate functions for many pairs 'instants + duration'
>> ===> FOR EXAMPLE :
>> = compute for the instant 't = 2001-01-01' aggregate functions for
>> data between 't-1month and t' and 't-12months and t-9months' and this,
>> FOR EACH ID !
>> ( aggregate functions such as
>> min/max/count/distinct/last/mode/kurtosis... or even user defined ! )
>>
>> My constraints :
>> - I don't want to compute aggregates for each tuple of 'F'
>> ---> I want to provide a list of couples 'instants + duration' (
>> potentially large )
>> - My 'window' defined by the duration may be really large ( but may
>> contain only a few values... )
>> - I may have many ids...
>> - I may have many timestamps...
>>
>> 
>> 
>> 
>>
>> Let me describe this with some kind of example to see if SPARK ( SPARK
>> STREAMING ? ) may help me to do that :
>>
>> Let's imagine that I have all my data in a DB or a file with the
>> following columns :
>> id | timestamp(ms) | value
>> A | 100 |  100
>> A | 1000500 |  66
>> B | 100 |  100
>> B | 110 |  50
>> B | 120 |  200
>> B | 250 |  500
>>
>> ( The timestamp is a long value, so as to be able to express date in ms
>> from -01-01. to today )
>>
>> I want to compute operations such as min, max, average, last on the
>> value column, for these couples :
>> -> instant = 1000500 / [-1000ms, 0 ] ( i.e. : aggregate data between [
>> t-1000ms and t ]
>> -> instant = 133 / [-5000ms, -2500 ] ( i.e. : aggregate data between
>> [ t-5000ms and t-2500ms ]
>>
>>
>> And this will produce this kind of output :
>>
>> id | timestamp(ms) | min_value | max_value | avg_value | last_value
>> ---
>> A | 1000500| min...| max   | avg   | last
>> B | 1000500| min...| max   | avg   | last
>> A | 133| min...| max   | avg   | last
>> B | 133| min...| max   | avg   | last
>>
>>
>>
>> Do you think we can do this efficiently with spark and/or spark
>> streaming, and do you have an idea on "how" ?
>> ( I have tested some solutions but I'm not really satisfied ATM... )
>>
>>
>> Thanks a lot Community :)
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>


Re: Feature generation / aggregate functions / timeseries

2017-12-14 Thread Georg Heiler
Look at custom UDAF functions.
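
A rough skeleton of one (this is the Spark 2.x UserDefinedAggregateFunction
API; the aggregate here just computes a mean as a stand-in for whatever you
actually need):

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class MeanUDAF extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", DoubleType) :: Nil)
  def bufferSchema: StructType =
    StructType(StructField("sum", DoubleType) :: StructField("count", LongType) :: Nil)
  def dataType: DataType = DoubleType
  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0.0; buffer(1) = 0L
  }
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      buffer(0) = buffer.getDouble(0) + input.getDouble(0)
      buffer(1) = buffer.getLong(1) + 1L
    }
  }
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getDouble(0) + buffer2.getDouble(0)
    buffer1(1) = buffer1.getLong(1) + buffer2.getLong(1)
  }
  def evaluate(buffer: Row): Double =
    if (buffer.getLong(1) == 0) Double.NaN else buffer.getDouble(0) / buffer.getLong(1)
}

// usage: df.groupBy("id").agg(new MeanUDAF()(col("value")))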
 wrote on Thu, 14 Dec 2017 at 09:31:

> Hi dear spark community !
>
> I want to create a lib which generates features for potentially very
> large datasets, so I believe spark could be a nice tool for that.
> Let me explain what I need to do :
>
> Each file 'F' of my dataset is composed of at least :
> - an id ( string or int )
> - a timestamp ( or a long value )
> - a value ( generally a double )
>
> I want my tool to :
> - compute aggregate functions for many pairs 'instants + duration'
> ===> FOR EXAMPLE :
> = compute for the instant 't = 2001-01-01' aggregate functions for
> data between 't-1month and t' and 't-12months and t-9months' and this,
> FOR EACH ID !
> ( aggregate functions such as
> min/max/count/distinct/last/mode/kurtosis... or even user defined ! )
>
> My constraints :
> - I don't want to compute aggregates for each tuple of 'F'
> ---> I want to provide a list of couples 'instants + duration' (
> potentially large )
> - My 'window' defined by the duration may be really large ( but may
> contain only a few values... )
> - I may have many ids...
> - I may have many timestamps...
>
> 
> 
> 
>
> Let me describe this with some kind of example to see if SPARK ( SPARK
> STREAMING ? ) may help me to do that :
>
> Let's imagine that I have all my data in a DB or a file with the
> following columns :
> id | timestamp(ms) | value
> A | 100 |  100
> A | 1000500 |  66
> B | 100 |  100
> B | 110 |  50
> B | 120 |  200
> B | 250 |  500
>
> ( The timestamp is a long value, so as to be able to express date in ms
> from -01-01. to today )
>
> I want to compute operations such as min, max, average, last on the
> value column, for these couples :
> -> instant = 1000500 / [-1000ms, 0 ] ( i.e. : aggregate data between [
> t-1000ms and t ]
> -> instant = 133 / [-5000ms, -2500 ] ( i.e. : aggregate data between
> [ t-5000ms and t-2500ms ]
>
>
> And this will produce this kind of output :
>
> id | timestamp(ms) | min_value | max_value | avg_value | last_value
> ---
> A | 1000500| min...| max   | avg   | last
> B | 1000500| min...| max   | avg   | last
> A | 133| min...| max   | avg   | last
> B | 133| min...| max   | avg   | last
>
>
>
> Do you think we can do this efficiently with spark and/or spark
> streaming, and do you have an idea on "how" ?
> ( I have tested some solutions but I'm not really satisfied ATM... )
>
>
> Thanks a lot Community :)
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: flatMap() returning large class

2017-12-14 Thread Marcelo Vanzin
This sounds like something mapPartitions should be able to do, not
sure if there's an easier way.
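
Roughly something like this (sketch only; the record types and the chunking
logic are placeholders, not your classes):

import org.apache.spark.sql.{Dataset, SparkSession}

case class InputRecord(id: String, payload: Array[Byte])
case class BigDataStructure(id: String, chunk: Array[Byte])

val spark = SparkSession.builder().appName("flatmap-sketch").getOrCreate()
import spark.implicits._

val ds: Dataset[InputRecord] = spark.createDataset(Seq.empty[InputRecord])

// mapPartitions hands you an Iterator and expects one back, so each output
// element can be produced lazily instead of first building a Seq in memory.
val out: Dataset[BigDataStructure] = ds.mapPartitions { records =>
  records.flatMap { rec =>
    // grouped() returns an Iterator, so chunks are emitted one at a time
    // as Spark consumes them.
    rec.payload.grouped(1024).map(chunk => BigDataStructure(rec.id, chunk))
  }
}

(Returning an Iterator rather than a Seq from a plain flatMap may also avoid
materializing the whole collection, since flatMap only needs a
TraversableOnce, but I haven't checked how the encoder path handles that.)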

On Thu, Dec 14, 2017 at 10:20 AM, Don Drake  wrote:
> I'm looking for some advice when I have a flatMap on a Dataset that is
> creating and returning a sequence of a new case class
> (Seq[BigDataStructure]) that contains a very large amount of data, much
> larger than the single input record (think images).
>
> In Python, you can use generators (yield) to avoid creating a large list of
> structures and returning the list.
>
> I'm programming this in Scala and was wondering if there are any similar
> tricks to optimally return a list of classes? I found the for/yield
> semantics, but it appears the compiler just creates a sequence for you,
> and this will blow through my heap given the number of elements in the list
> and the size of each element.
>
> Is there anything else I can use?
>
> Thanks.
>
> -Don
>
> --
> Donald Drake
> Drake Consulting
> http://www.drakeconsulting.com/
> https://twitter.com/dondrake
> 800-733-2143



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



flatMap() returning large class

2017-12-14 Thread Don Drake
I'm looking for some advice when I have a flatMap on a Dataset that is
creating and returning a sequence of a new case class
(Seq[BigDataStructure]) that contains a very large amount of data, much
larger than the single input record (think images).

In Python, you can use generators (yield) to avoid creating a large list
of structures and returning the list.

I'm programming this in Scala and was wondering if there are any similar
tricks to optimally return a list of classes? I found the for/yield
semantics, but it appears the compiler just creates a sequence for you,
and this will blow through my heap given the number of elements in the list
and the size of each element.

Is there anything else I can use?

Thanks.

-Don

-- 
Donald Drake
Drake Consulting
http://www.drakeconsulting.com/
https://twitter.com/dondrake 
800-733-2143


Re: bulk upsert data batch from Kafka dstream into Postgres db

2017-12-14 Thread salemi
Thank you for your response. In some cases we only need to update a record,
and in other cases we need to both update the existing record and insert a
new record. The statement you proposed doesn't handle that.



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: bulk upsert data batch from Kafka dstream into Postgres db

2017-12-14 Thread Cody Koeninger
Modern versions of Postgres have upsert, i.e. INSERT INTO ... ON CONFLICT ...
DO UPDATE.
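
For example, via plain JDBC (the table, columns and conflict key here are
made up; adjust to your schema):

import java.sql.DriverManager

// works on Postgres 9.5+; URL and credentials are placeholders
val conn = DriverManager.getConnection("jdbc:postgresql://host/db", "user", "pass")
val sql =
  """INSERT INTO events (id, payload)
    |VALUES (?, ?)
    |ON CONFLICT (id) DO UPDATE SET payload = EXCLUDED.payload""".stripMargin
val ps = conn.prepareStatement(sql)
ps.setLong(1, 42L)
ps.setString(2, "some payload")
ps.executeUpdate()
ps.close()
conn.close()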

On Thu, Dec 14, 2017 at 11:26 AM, salemi  wrote:
> Thank you for your response.
> That approach just loads the data into the DB. I am looking for an approach
> that allows me to update existing entries in the DB and/or insert a new entry
> if it doesn't exist.
>
>
>
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: bulk upsert data batch from Kafka dstream into Postgres db

2017-12-14 Thread salemi
Thank you for your response.
That approach just loads the data into the DB. I am looking for an approach
that allows me to update existing entries in the DB and/or insert a new entry
if it doesn't exist.






--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Spark multithreaded job submission from driver

2017-12-14 Thread Michael Artz
Hi,
   I want to pull data from about 1500 remote Oracle tables with Spark, and
I want a multi-threaded application that picks up a table per thread, or
maybe 10 tables per thread, and launches a Spark job to read from the
respective tables.

I read the official Spark docs at
https://spark.apache.org/docs/latest/job-scheduling.html and I can see:
"...cluster managers that Spark runs on provide facilities for scheduling
across applications. Second, *within* each Spark application, multiple
“jobs” (Spark actions) may be running concurrently if they were submitted by
different threads. This is common if your application is serving requests
over the network. Spark includes a fair scheduler to schedule resources
within each SparkContext."

Also, you might have noticed in this SO post,
https://stackoverflow.com/questions/30862956/concurrent-job-execution-in-spark,
that there was no accepted answer to this similar question, and the most
upvoted answer starts with "This is not really in the spirit of Spark". To
that I'd say: A. everyone knows it's not in the spirit of Spark, and B. who
cares what the spirit of Spark is, that doesn't actually mean anything.

Has anyone gotten something like this to work before? Did you have to do
anything special? I was thinking of sending out a message to the dev group
too, because maybe the person who actually wrote that page can give a
little more color to the above statement.
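
What I had in mind is roughly the following (sketch only; the JDBC URL,
credentials, table names and output path are placeholders):

import java.util.Properties
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parallel-jdbc-reads")
  .config("spark.scheduler.mode", "FAIR")   // fair scheduling across concurrent jobs
  .getOrCreate()

implicit val ec: ExecutionContext = ExecutionContext.global

val url = "jdbc:oracle:thin:@//dbhost:1521/SERVICE"
val props = new Properties()
props.setProperty("user", "scott")
props.setProperty("password", "tiger")

val tables = Seq("TABLE_A", "TABLE_B", "TABLE_C")   // ... up to ~1500

// Each Future submits its own Spark jobs from its own thread; they run
// concurrently inside the single SparkContext, subject to executor cores.
val work: Seq[Future[Unit]] = tables.map { t =>
  Future {
    spark.read.jdbc(url, t, props)
      .write.mode("overwrite").parquet(s"/tmp/out/$t")
  }
}

Await.result(Future.sequence(work), Duration.Inf)

Whether that actually helps presumably depends on how many executor cores are
available to run the jobs concurrently.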

Thanks, Mike


Re: bulk upsert data batch from Kafka dstream into Postgres db

2017-12-14 Thread Cody Koeninger
Use foreachPartition(), get a connection from a JDBC connection pool,
and insert the data the same way you would in a non-Spark program.

If you're only doing inserts, postgres COPY will be faster (e.g.
https://discuss.pivotal.io/hc/en-us/articles/204237003), but if you're
doing updates that's not an option.

Depending on how many spark partitions you have, coalesce() to
decrease the number of partitions may help avoid database contention
and speed things up, but you'll need to experiment.
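
Roughly this shape (sketch only; the table, columns and the getConnection()
helper, which stands in for a real connection pool such as HikariCP, are
placeholders):

import java.sql.{Connection, DriverManager}
import org.apache.spark.sql.Dataset

case class Event(id: Long, payload: String)

// stand-in for "get a connection from a jdbc connection pool"
def getConnection(): Connection =
  DriverManager.getConnection("jdbc:postgresql://host/db", "user", "pass")

def writeBatch(ds: Dataset[Event]): Unit = {
  ds.coalesce(8)                       // fewer partitions -> fewer concurrent DB writers
    .foreachPartition { events: Iterator[Event] =>
      val conn = getConnection()
      val ps = conn.prepareStatement("INSERT INTO events (id, payload) VALUES (?, ?)")
      try {
        events.foreach { e =>
          ps.setLong(1, e.id)
          ps.setString(2, e.payload)
          ps.addBatch()
        }
        ps.executeBatch()
      } finally {
        ps.close()
        conn.close()
      }
    }
}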

On Wed, Dec 13, 2017 at 11:52 PM, salemi  wrote:
> Hi All,
>
> we are consuming messages from Kafka using a Spark DStream. Once the processing
> is done we would like to update/insert the data in bulk fashion into the
> database.
>
> I was wondering what the best solution for this might be. Our Postgres
> database table is not partitioned.
>
>
> Thank you,
>
> Ali
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



cosine similarity implementation in Java Spark

2017-12-14 Thread Donni Khan
Hi all,
Is there any implementation of cosine similarity that supports Java?
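
The closest built-in I have found so far is RowMatrix.columnSimilarities() in
MLlib, which computes pairwise cosine similarities between the columns of a
matrix and is also reachable from Java through JavaRDD; a quick Scala sketch
with made-up data:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cosine-sketch").getOrCreate()

// each Vector is a row; columnSimilarities() gives cosine similarity per column pair
val rows = spark.sparkContext.parallelize(Seq(
  Vectors.dense(1.0, 0.0, 2.0),
  Vectors.dense(0.0, 1.0, 1.0)
))

val sims = new RowMatrix(rows).columnSimilarities()   // upper-triangular CoordinateMatrix
sims.entries.collect().foreach(println)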

Thanks,
Donni


cosine similarity in Java Spark

2017-12-14 Thread Donni Khan
Hi all,
Is there any implementation of cosine similarity that supports Java?

Thanks,
Donni


Feature generation / aggregate functions / timeseries

2017-12-14 Thread julio . cesare

Hi dear spark community !

I want to create a lib which generates features for potentially very 
large datasets, so I believe spark could be a nice tool for that.

Let me explain what I need to do :

Each file 'F' of my dataset is composed of at least :
- an id ( string or int )
- a timestamp ( or a long value )
- a value ( generally a double )

I want my tool to :
- compute aggregate functions for many pairs 'instants + duration'
===> FOR EXAMPLE :
= compute for the instant 't = 2001-01-01' aggregate functions for 
data between 't-1month and t' and 't-12months and t-9months' and this, 
FOR EACH ID !
( aggregate functions such as 
min/max/count/distinct/last/mode/kurtosis... or even user defined ! )


My constraints :
- I don't want to compute aggregates for each tuple of 'F'
---> I want to provide a list of couples 'instants + duration' ( 
potentially large )
- My 'window' defined by the duration may be really large ( but may 
contain only a few values... )

- I may have many ids...
- I may have many timestamps...





Let me describe this with some kind of example to see if SPARK ( SPARK 
STREAMING ? ) may help me to do that :


Let's imagine that I have all my data in a DB or a file with the 
following columns :

id | timestamp(ms) | value
A | 100 |  100
A | 1000500 |  66
B | 100 |  100
B | 110 |  50
B | 120 |  200
B | 250 |  500

( The timestamp is a long value, so as to be able to express date in ms 
from -01-01. to today )


I want to compute operations such as min, max, average, last on the 
value column, for these couples :
-> instant = 1000500 / [-1000ms, 0 ] ( i.e. : aggregate data between [ 
t-1000ms and t ]
-> instant = 133 / [-5000ms, -2500 ] ( i.e. : aggregate data between 
[ t-5000ms and t-2500ms ]



And this will produce this kind of output :

id | timestamp(ms) | min_value | max_value | avg_value | last_value
---
A | 1000500| min...| max   | avg   | last
B | 1000500| min...| max   | avg   | last
A | 133| min...| max   | avg   | last
B | 133| min...| max   | avg   | last



Do you think we can do this efficiently with spark and/or spark 
streaming, and do you have an idea on "how" ?

( I have tested some solutions but I'm not really satisfied ATM... )
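
To make the target more concrete, here is roughly what I am trying to express
in plain Spark SQL terms (toy values, and I have no idea yet whether this kind
of range join is efficient at my scale):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("feature-gen-sketch").getOrCreate()
import spark.implicits._

val data = Seq(
  ("A", 100L, 100.0), ("A", 1000500L, 66.0),
  ("B", 100L, 100.0), ("B", 110L, 50.0),
  ("B", 120L, 200.0), ("B", 250L, 500.0)
).toDF("id", "timestamp", "value")

// each couple: the query instant plus the window bounds relative to it (ms)
val couples = Seq(
  (1000500L, -1000L, 0L),
  (1330000L, -5000L, -2500L)
).toDF("instant", "lo", "hi")

val features = data
  .join(couples,
    data("timestamp") >= couples("instant") + couples("lo") &&
    data("timestamp") <= couples("instant") + couples("hi"))
  .groupBy($"id", $"instant")
  .agg(min($"value").as("min_value"),
       max($"value").as("max_value"),
       avg($"value").as("avg_value"),
       // note: last() is not ordered by timestamp here; a real "last" would
       // need an explicit ordering
       last($"value").as("last_value"))

features.show()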


Thanks a lot Community :)

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Fwd: Feature Generation for Large datasets composed of many time series

2017-12-14 Thread julio . cesare

Hi dear spark community !

I want to create a lib which generates features for potentially very 
large datasets, so I believe spark could be a nice tool for that.

Let me explain what I need to do :

Each file 'F' of my dataset is composed of at least :
- an id ( string or int )
- a timestamp ( or a long value )
- a value ( generally a double )

I want my tool to :
- compute aggregate functions for many pairs 'instants + duration'
===> FOR EXAMPLE :
= compute for the instant 't = 2001-01-01' aggregate functions for 
data between 't-1month and t' and 't-12months and t-9months' and this, 
FOR EACH ID !
( aggregate functions such as 
min/max/count/distinct/last/mode/kurtosis... or even user defined ! )


My constraints :
- I don't want to compute aggregates for each tuple of 'F'
---> I want to provide a list of couples 'instants + duration' ( 
potentially large )
- My 'window' defined by the duration may be really large ( but may 
contain only a few values... )

- I may have many ids...
- I may have many timestamps...





Let me describe this with some kind of example to see if SPARK ( SPARK 
STREAMING ? ) may help me to do that :


Let's imagine that I have all my data in a DB or a file with the 
following columns :

id | timestamp(ms) | value
A | 100 |  100
A | 1000500 |  66
B | 100 |  100
B | 110 |  50
B | 120 |  200
B | 250 |  500

( The timestamp is a long value, so as to be able to express date in ms 
from -01-01. to today )


I want to compute operations such as min, max, average, last on the 
value column, for these couples :
-> instant = 1000500 / [-1000ms, 0 ] ( i.e. : aggregate data between [ 
t-1000ms and t ]
-> instant = 133 / [-5000ms, -2500 ] ( i.e. : aggregate data between 
[ t-5000ms and t-2500ms ]



And this will produce this kind of output :

id | timestamp(ms) | min_value | max_value | avg_value | last_value
---
A | 1000500| min...| max   | avg   | last
B | 1000500| min...| max   | avg   | last
A | 133| min...| max   | avg   | last
B | 133| min...| max   | avg   | last



Do you think we can do this efficiently with spark and/or spark 
streaming, and do you have an idea on "how" ?

( I have tested some solutions but I'm not really satisfied ATM... )


Thanks a lot Community :)

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org