Scaling issues due to contention in Random

2016-11-24 Thread Prasun Ratn
Hi,

I am seeing perf degradation in the Spark/Pi example on a single-node
setup (using local[K])

Using 1, 2, 4, and 8 cores, these are the execution times in seconds for
the same number of iterations:
Random: 4.0, 7.0, 12.96, 17.96

If I change the code to use ThreadLocalRandom
(https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/SparkPi.scala#L35)
it scales properly:
ThreadLocalRandom: 2.2, 1.4, 1.07, 1.00
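For reference, the difference is reproducible in a single JVM without Spark: ThreadLocalRandom.current() hands each worker thread its own generator, while a shared java.util.Random makes every nextDouble() CAS-loop on one AtomicLong seed. A minimal sketch (class and method names are mine, not Spark's; java.util.stream supplies the parallelism):

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.stream.IntStream;

public class PiSketch {
    // Monte Carlo Pi estimate. ThreadLocalRandom.current() returns a
    // per-thread generator, so there is no CAS contention on a shared
    // java.util.Random seed (the usual scaling bottleneck).
    static double estimatePi(int iterations) {
        long inside = IntStream.range(0, iterations).parallel()
            .filter(i -> {
                ThreadLocalRandom rng = ThreadLocalRandom.current();
                double x = rng.nextDouble() * 2 - 1;
                double y = rng.nextDouble() * 2 - 1;
                return x * x + y * y <= 1.0;
            })
            .count();
        return 4.0 * inside / iterations;
    }

    public static void main(String[] args) {
        System.out.println(estimatePi(2_000_000));
    }
}
```

Swapping ThreadLocalRandom.current() for a single shared new Random() in this sketch should reproduce the inverse scaling as the core count grows.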

I see a similar issue with the Kryo serializer in another app - the push
function shows up at the top of the profile data, but goes away completely
if I use ThreadLocalRandom:

https://github.com/EsotericSoftware/kryo/blob/master/src/com/esotericsoftware/kryo/util/ObjectMap.java#L259

The JDK documentation
(https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ThreadLocalRandom.html)
says:

> When applicable, use of ThreadLocalRandom rather than shared Random objects
> in concurrent programs will typically encounter much less overhead and
> contention. Use of ThreadLocalRandom is particularly appropriate when
> multiple tasks (for example, each a ForkJoinTask) use random numbers in
> parallel in thread pools.

I am using Spark 1.5 and Java 1.8.0_91.

Is there any reason to prefer Random over ThreadLocalRandom?

Thanks
Prasun

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Parquet-like partitioning support in spark SQL's in-memory columnar cache

2016-11-24 Thread Reynold Xin
It's already there, isn't it? The in-memory columnar cache format.


On Thu, Nov 24, 2016 at 9:06 PM, Nitin Goyal  wrote:

> Hi,
>
> Do we have any plan of supporting parquet-like partitioning support in
> Spark SQL in-memory cache? Something like one RDD[CachedBatch] per
> in-memory cache partition.
>
>
> -Nitin
>


Parquet-like partitioning support in spark SQL's in-memory columnar cache

2016-11-24 Thread Nitin Goyal
Hi,

Do we have any plan of supporting parquet-like partitioning support in
Spark SQL in-memory cache? Something like one RDD[CachedBatch] per
in-memory cache partition.


-Nitin


Re: How is the order ensured in the jdbc relation provider when inserting data from multiple executors

2016-11-24 Thread nirandap
Hi Maciej,

Thanks again for the reply. One small clarification about the answer to
my point #1: I used local[4], and shouldn't this force Spark to read from 4
partitions in parallel and write in parallel (by parallel I mean that the
order in which the data is read from the set of 4 partitions is
non-deterministic)? That was the reason why I was surprised to see that the
final results are in the same order.
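On point #2 from the earlier exchange: tasks may indeed finish in any order, but collect() slots each partition's result array by partition index before flattening, so arrival order is irrelevant. A toy model of that, outside Spark (names and data are illustrative):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class CollectOrder {
    // Model of collect() after an ordered query: each partition is
    // internally sorted, and the partition *index* defines the global
    // order. Results are slotted by index, then flattened.
    static List<Integer> collect(List<List<Integer>> partitions) {
        List<List<Integer>> results = new ArrayList<>(
            Collections.nCopies(partitions.size(), (List<Integer>) null));
        List<Integer> finishOrder = new ArrayList<>();
        for (int i = 0; i < partitions.size(); i++) finishOrder.add(i);
        Collections.shuffle(finishOrder);            // nondeterministic task completion
        for (int idx : finishOrder) {
            results.set(idx, partitions.get(idx));   // slot by partition index
        }
        List<Integer> collected = new ArrayList<>();
        results.forEach(collected::addAll);          // flatten in partition order
        return collected;
    }

    public static void main(String[] args) {
        // Three sorted range partitions, as an ORDER BY would produce.
        List<Integer> out = collect(List.of(
            List.of(1, 3, 5), List.of(7, 8), List.of(9, 12)));
        System.out.println(out);                     // [1, 3, 5, 7, 8, 9, 12]
    }
}
```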

On Tue, Nov 22, 2016 at 5:24 PM, Maciej Szymkiewicz [via Apache Spark
Developers List]  wrote:

> On 11/22/2016 12:11 PM, nirandap wrote:
>
> Hi Maciej,
>
> Thank you for your reply.
>
> I have 2 queries.
> 1. I can understand your explanation. But in my experience, when I check
> the final RDBMS table, I see that the results follow the expected order,
> without an issue. Is this just a coincidence?
>
> Not exactly a coincidence. This is typically a result of the physical
> location on disk. If writes and reads are sequential (this is usually
> the case), you'll see things in the expected order, but you have to remember
> that the location on disk is not stable. For example, if you perform some
> updates, deletes and a VACUUM FULL (PostgreSQL), the physical location on
> disk will change, and with it the things you see.
>
> There are of course more advanced mechanisms out there. For example, modern
> columnar RDBMSs like HANA use techniques like dimension sorting and
> differential stores, so even the initial order may differ. There are probably
> other solutions which choose different strategies (maybe some
> time-series-oriented projects?) that I am not aware of.
>
>
> 2. I was further looking into this. So, say I run this query
> "select value, count(*) from table1 group by value order by value"
>
> and I call df.collect() in the resultant dataframe. From my experience, I
> see that the given values follow the expected order. May I know how spark
> manages to retain the order of the results in a collect operation?
>
> Once you execute an ordered operation, each partition is sorted and the order
> of partitions defines the global ordering. All collect does is preserve this
> order by creating an array of results for each partition and flattening it.
>
>
> Best
>
>
> On Mon, Nov 21, 2016 at 3:02 PM, Maciej Szymkiewicz [via Apache Spark
> Developers List] <[hidden email]
> > wrote:
>
>> In commonly used RDBMS systems, relations have no fixed order, and the
>> physical location of records can change during routine maintenance
>> operations. Unless you explicitly order data during retrieval, the order
>> you see is incidental and not guaranteed.
>>
>> Conclusion: the order of inserts just doesn't matter.
>> On 11/21/2016 10:03 AM, Niranda Perera wrote:
>>
>> Hi,
>>
>> Say I have a table with 1 column and 1000 rows. I want to save the
>> result in an RDBMS table using the JDBC relation provider. So I run the
>> following query,
>>
>> "insert into table table2 select value, count(*) from table1 group by
>> value order by value"
>>
>> While debugging, I found that the resultant df from "select value,
>> count(*) from table1 group by value order by value" would have around 200+
>> partitions, and say I have 4 executors attached to my driver. So I would
>> have 200+ writing tasks assigned to 4 executors. I want to understand how
>> these executors are able to write the data to the underlying RDBMS table of
>> table2 without messing up the order.
>>
>> I checked the jdbc insertable relation and in jdbcUtils [1] it does the
>> following
>>
>> df.foreachPartition { iterator =>
>>   savePartition(getConnection, table, iterator, rddSchema, nullTypes,
>> batchSize, dialect)
>> }
>>
>> So, my understanding is that all of my 4 executors will run the
>> savePartition function (or closure) in parallel, where they do not know
>> which one should write its data before the other!
>>
>> In the savePartition method, the comment says:
>> "Saves a partition of a DataFrame to the JDBC database. This is done in
>> a single database transaction in order to avoid repeatedly inserting
>> data as much as possible."
>>
>> I want to understand how these parallel executors save their partitions
>> without harming the order of the results. Is it by locking the database
>> resource from each executor (i.e. ex0 would first obtain a lock for the
>> table and write partition0, while ex1 ... ex3 would wait till the lock
>> is released)?
>>
>> In my experience, there is no harm done to the order of the results at
>> the end of the day!
>>
>> Would like to hear from you guys! :-)
>>
>> [1] https://github.com/apache/spark/blob/v1.6.2/sql/core/src
>> /main/scala/org/apache/spark/sql/execution/datasources/
>> jdbc/JdbcUtils.scala#L277
>>
>> --
>> Niranda Perera
>> @n1r44
>> +94 71 554 8430
>> https://www.linkedin.com/in/niranda
>> https://pythagoreanscript.wordpress.com/
>>
>>
>> --
>> Best regards,
>> Maciej Szymkiewicz
>>
>>
>>
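Maciej's conclusion above ("the order of inserts just doesn't matter") can be modelled without Spark or a database. In the sketch below (a toy stand-in, not Spark's actual code path), four threads play the role of executors running savePartition concurrently; every row arrives exactly once, but the physical insert order is whatever the scheduler produced:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;

public class InsertOrder {
    // Four "executors" concurrently write their sorted 100-row
    // partitions into a shared sink (the stand-in for the RDBMS table).
    static List<Integer> writeAllPartitions() throws InterruptedException {
        ConcurrentLinkedQueue<Integer> sink = new ConcurrentLinkedQueue<>();
        CountDownLatch done = new CountDownLatch(4);
        for (int p = 0; p < 4; p++) {
            final int base = p * 100;
            new Thread(() -> {
                for (int row = base; row < base + 100; row++) {
                    sink.add(row);                   // "insert" one row
                }
                done.countDown();
            }).start();
        }
        done.await();
        return new ArrayList<>(sink);                // rows in arrival order
    }

    public static void main(String[] args) throws InterruptedException {
        List<Integer> written = writeAllPartitions();
        List<Integer> sorted = new ArrayList<>(written);
        Collections.sort(sorted);
        // The multiset of rows is deterministic; their order is not.
        System.out.println(written.size() + " rows, ordered run: "
            + written.equals(sorted));
    }
}
```

Whether the output reports an ordered run depends entirely on scheduling, which is exactly why an ORDER BY on retrieval is the only guarantee.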

[no subject]

2016-11-24 Thread Rostyslav Sotnychenko


Re: SparkUI via proxy

2016-11-24 Thread Georg Heiler
Port forwarding will help you out.
marco rocchi  wrote on Thu, 24 Nov
2016 at 16:33:

> Hi,
> I'm working with Apache Spark in order to develop my master's thesis. I'm new
> to Spark and to working with clusters. I searched the internet but didn't
> find a way to solve this.
> My problem is the following: from my PC I can access a master node
> of a cluster only via a proxy.
> To connect to the proxy and then to the master node, I have to set up an SSH
> tunnel, but from a practical point of view I have no idea how, in this
> way, I can interact with the Spark WebUI.
> Can anyone help me?
> Thanks in advance
>
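Concretely, the port forwarding Georg mentions is an SSH local forward through the proxy host. The hostnames, usernames and ports below are placeholders for your environment (Spark's standalone master UI listens on 8080 by default, a running application's UI on 4040):

```shell
# Forward local port 8080 through the proxy to the master's WebUI.
ssh -L 8080:spark-master:8080 user@proxy-host
# While the tunnel is up, browse http://localhost:8080 locally.

# With OpenSSH 7.3+, ProxyJump chains the hop in one command,
# here forwarding a driver's application UI on port 4040:
ssh -J user@proxy-host -L 4040:localhost:4040 user@spark-master
```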


SparkUI via proxy

2016-11-24 Thread marco rocchi
Hi,
I'm working with Apache Spark in order to develop my master's thesis. I'm new
to Spark and to working with clusters. I searched the internet but didn't
find a way to solve this.
My problem is the following: from my PC I can access a master node
of a cluster only via a proxy.
To connect to the proxy and then to the master node, I have to set up an SSH
tunnel, but from a practical point of view I have no idea how, in this way, I
can interact with the Spark WebUI.
Can anyone help me?
Thanks in advance


Re: Handling questions in the mailing lists

2016-11-24 Thread eliasah
Besides the eventual traffic issue, I don't believe that it would benefit
users to get a standalone site. Some great answers are provided by users
who aren't Spark experts but are maybe Java, Python, AWS or even systems
experts - why would we want to play alone?

We are nevertheless trying to animate the Apache Spark chat room, which
isn't as easy as one might want it to be.

I'd rather things stay the way they are on SO. There are a bunch of us who
are actually very active and answer as much as we can, and we'll be glad to
help.



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Handling-questions-in-the-mailing-lists-tp19690p20012.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.




RE: Handling questions in the mailing lists

2016-11-24 Thread Ioannis.Deligiannis
…my 0.1 cent ☺
As a Spark and SO user, I would not find a separate SE a good thing.

* Part of the SO beauty is that you can filter easily and track different
topics from one dashboard.
* Being part of SO also gets good exposure, as it raises awareness of Spark
across a wider audience.
* High-reputation users, even if they are, say, “Python-centric”, add value by
moderating/commenting.
* I don’t think a Spark-specific site is a good thing either. Spark is
typically combined with a huge range of other technologies (Avro, Parquet,
Hadoop, Python, R, Scala, Akka, Java, HBase to name a few). Users who are
specialists in these topics can provide value and help build quality in the
Spark tag. By getting a new SE you kind of exclude them.
* It will take time to build enough reputable users to share the moderation
burden.
* A high-rep Java user is likely to ask a good question. Forcing people to
join an SE where reputation is reset means you lose the ability to track the
quality of your users (and, may I say, potential evangelists). By observation
(no idea if true), questions by high-rep users attract much better attention
than those from any user with 100 rep or less.
* Last but not least, high-rep users usually know, follow and impose SO rules
and best practices quite well, whereas a Spark-centric SE might not be as
rule-focused. Even though rules can sometimes be annoying, overall they build
quality questions, so more users get involved.


From: Sean Owen [mailto:so...@cloudera.com]
Sent: 24 November 2016 10:53
To: assaf.mendelson; dev@spark.apache.org
Subject: Re: Handling questions in the mailing lists

Here's a view into the requirements, for example: 
http://area51.stackexchange.com/proposals/76571/emacs

You're right there is a lot of activity on SO, easily 30-40 questions per day. 
One thing I noticed about, for example, the Data Science SE is that most 
questions relevant to it were still posted on SO or Cross Validated. It 
struggles as an SE even though there is, out there, more than enough activity 
that _should_ be on the specific SE.

There are more niche things that end up working as an SE, so I'm not dead set 
against it, though it would remain unofficial and my gut is that it might just 
split the conversation yet further. I'd leave it, however, to anyone active on 
SO already to decide that it's worth a dedicated SE and just do it.

On Thu, Nov 24, 2016 at 10:45 AM assaf.mendelson 
mailto:assaf.mendel...@rsa.com>> wrote:
I am not sure what is enough traffic. Some of the SE groups already existing do 
not have that much traffic.
Specifically the  user mailing list has ~50 emails per day. It wouldn’t be much 
of a stretch to extract 1-2 questions per day from that.  In the regular 
stackoverflow the apache-spark had more than 50 new questions in the last 24 
hours alone 
(http://stackoverflow.com/questions/tagged/apache-spark?sort=newest&pageSize=50).

I believe this should be enough traffic (and the traffic would rise once 
quality answers begin to appear).


From: Sean Owen [via Apache Spark Developers List]
[mailto:[hidden email]]
Sent: Thursday, November 24, 2016 12:32 PM

To: Mendelson, Assaf
Subject: Re: Handling questions in the mailing lists

I don't think there's nearly enough traffic to sustain a stand-alone SE. I 
helped mod the Data Science SE and it's still not technically critical mass 
after 2 years. It would just fracture the discussion to yet another place.
On Thu, Nov 24, 2016 at 6:52 AM assaf.mendelson <[hidden email]> wrote:
Sorry to reawaken this, but I just noticed it is possible to propose new topic 
specific sites 
(http://area51.stackexchange.com/faq

Re: Handling questions in the mailing lists

2016-11-24 Thread Sean Owen
Here's a view into the requirements, for example:
http://area51.stackexchange.com/proposals/76571/emacs

You're right there is a lot of activity on SO, easily 30-40 questions per
day. One thing I noticed about, for example, the Data Science SE is that
most questions relevant to it were still posted on SO or Cross Validated.
It struggles as an SE even though there is, out there, more than enough
activity that _should_ be on the specific SE.

There are more niche things that end up working as an SE, so I'm not dead
set against it, though it would remain unofficial and my gut is that it
might just split the conversation yet further. I'd leave it, however, to
anyone active on SO already to decide that it's worth a dedicated SE and
just do it.

On Thu, Nov 24, 2016 at 10:45 AM assaf.mendelson 
wrote:

> I am not sure what is enough traffic. Some of the SE groups already
> existing do not have that much traffic.
>
> Specifically the  user mailing list has ~50 emails per day. It wouldn’t be
> much of a stretch to extract 1-2 questions per day from that.  In the
> regular stackoverflow the apache-spark had more than 50 new questions in
> the last 24 hours alone (
> http://stackoverflow.com/questions/tagged/apache-spark?sort=newest&pageSize=50).
>
>
>
>
> I believe this should be enough traffic (and the traffic would rise once
> quality answers begin to appear).
>
>
>
>
>
> *From:* Sean Owen [via Apache Spark Developers List] [mailto:[hidden email]]
> *Sent:* Thursday, November 24, 2016 12:32 PM
>
>
> *To:* Mendelson, Assaf
> *Subject:* Re: Handling questions in the mailing lists
>
>
>
> I don't think there's nearly enough traffic to sustain a stand-alone SE. I
> helped mod the Data Science SE and it's still not technically critical mass
> after 2 years. It would just fracture the discussion to yet another place.
>
> On Thu, Nov 24, 2016 at 6:52 AM assaf.mendelson <[hidden email]
> > wrote:
>
> Sorry to reawaken this, but I just noticed it is possible to propose new
> topic-specific sites (http://area51.stackexchange.com/faq) for Stack
> Overflow. So, for example, we might have a Spark-specific site at
> spark.stackexchange.com.
>
> The advantages of such a site are many. First of all, it is Spark-specific.
> Secondly, people's reputation would be based on Spark rather than on general
> questions, and lastly (and most importantly in my opinion) it would have
> Spark-focused moderators (all of its moderators would be Spark moderators,
> as opposed to general-technology ones).
>
>
>
> The process of creating such a site is not complicated. Basically, someone
> creates a proposal (I have no problem doing so), then creates 5 example
> questions (the kind we want on the site), and 5 people need to ‘follow’
> it within 3 days. This creates a “definition” phase. The goal is to get at
> least 40 questions that embody the goal of the site and have at least 10
> net votes, with enough people following. When enough traction has been made
> (enough questions and enough followers), the site moves to the commitment
> phase. In this phase users “commit” to being on the site (basically this is
> aimed at verifying that the community of experts is big enough). Once all
> this happens, the site moves into beta. This means the site becomes active,
> and it will become a full site if it sees enough traction.
>
>
>
> I would suggest trying to set this up.
>
>
>
> Thanks,
>
> Assaf
>
>
>
> --
> View this message in context: RE: Handling questions in the mailing lists
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>


RE: Handling questions in the mailing lists

2016-11-24 Thread assaf.mendelson
I am not sure what counts as enough traffic. Some of the existing SE groups do
not have that much traffic.
Specifically, the user mailing list has ~50 emails per day. It wouldn’t be much
of a stretch to extract 1-2 questions per day from that. On regular
Stack Overflow, the apache-spark tag had more than 50 new questions in the
last 24 hours alone
(http://stackoverflow.com/questions/tagged/apache-spark?sort=newest&pageSize=50).

I believe this should be enough traffic (and the traffic would rise once
quality answers begin to appear).


From: Sean Owen [via Apache Spark Developers List] 
[mailto:ml-node+s1001551n2000...@n3.nabble.com]
Sent: Thursday, November 24, 2016 12:32 PM
To: Mendelson, Assaf
Subject: Re: Handling questions in the mailing lists

I don't think there's nearly enough traffic to sustain a stand-alone SE. I 
helped mod the Data Science SE and it's still not technically critical mass 
after 2 years. It would just fracture the discussion to yet another place.
On Thu, Nov 24, 2016 at 6:52 AM assaf.mendelson <[hidden 
email]> wrote:
Sorry to reawaken this, but I just noticed it is possible to propose new
topic-specific sites (http://area51.stackexchange.com/faq) for Stack Overflow.
So, for example, we might have a Spark-specific site at
spark.stackexchange.com.
The advantages of such a site are many. First of all, it is Spark-specific.
Secondly, people's reputation would be based on Spark rather than on general
questions, and lastly (and most importantly in my opinion) it would have
Spark-focused moderators (all of its moderators would be Spark moderators, as
opposed to general-technology ones).

The process of creating such a site is not complicated. Basically, someone
creates a proposal (I have no problem doing so), then creates 5 example
questions (the kind we want on the site), and 5 people need to ‘follow’ it
within 3 days. This creates a “definition” phase. The goal is to get at least
40 questions that embody the goal of the site and have at least 10 net votes,
with enough people following. When enough traction has been made (enough
questions and enough followers), the site moves to the commitment phase. In
this phase users “commit” to being on the site (basically this is aimed at
verifying that the community of experts is big enough). Once all this happens,
the site moves into beta. This means the site becomes active, and it will
become a full site if it sees enough traction.

I would suggest trying to set this up.

Thanks,
Assaf



If you reply to this email, your message will be added to the discussion below:
http://apache-spark-developers-list.1001551.n3.nabble.com/Handling-questions-in-the-mailing-lists-tp19690p20007.html




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Handling-questions-in-the-mailing-lists-tp19690p20008.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

Re: Handling questions in the mailing lists

2016-11-24 Thread Sean Owen
I don't think there's nearly enough traffic to sustain a stand-alone SE. I
helped mod the Data Science SE and it's still not technically critical mass
after 2 years. It would just fracture the discussion to yet another place.

On Thu, Nov 24, 2016 at 6:52 AM assaf.mendelson 
wrote:

> Sorry to reawaken this, but I just noticed it is possible to propose new
> topic-specific sites (http://area51.stackexchange.com/faq) for Stack
> Overflow. So, for example, we might have a Spark-specific site at
> spark.stackexchange.com.
>
> The advantages of such a site are many. First of all, it is Spark-specific.
> Secondly, people's reputation would be based on Spark rather than on general
> questions, and lastly (and most importantly in my opinion) it would have
> Spark-focused moderators (all of its moderators would be Spark moderators,
> as opposed to general-technology ones).
>
>
>
> The process of creating such a site is not complicated. Basically, someone
> creates a proposal (I have no problem doing so), then creates 5 example
> questions (the kind we want on the site), and 5 people need to ‘follow’
> it within 3 days. This creates a “definition” phase. The goal is to get at
> least 40 questions that embody the goal of the site and have at least 10
> net votes, with enough people following. When enough traction has been made
> (enough questions and enough followers), the site moves to the commitment
> phase. In this phase users “commit” to being on the site (basically this is
> aimed at verifying that the community of experts is big enough). Once all
> this happens, the site moves into beta. This means the site becomes active,
> and it will become a full site if it sees enough traction.
>
>
>
> I would suggest trying to set this up.
>
>
>
> Thanks,
>
> Assaf
>
>
>