Re: welcome a new batch of committers

2018-10-05 Thread Bhupendra Mishra
Congratulations to all of you
Good Luck
Regards

On Wed, Oct 3, 2018 at 2:29 PM Reynold Xin  wrote:

> Hi all,
>
> The Apache Spark PMC has recently voted to add several new committers to
> the project, for their contributions:
>
> - Shane Knapp (contributor to infra)
> - Dongjoon Hyun (contributor to ORC support and other parts of Spark)
> - Kazuaki Ishizaki (contributor to Spark SQL)
> - Xingbo Jiang (contributor to Spark Core and SQL)
> - Yinan Li (contributor to Spark on Kubernetes)
> - Takeshi Yamamuro (contributor to Spark SQL)
>
> Please join me in welcoming them!
>
>


Coalesce behaviour

2018-10-05 Thread Sergey Zhemzhitsky
Hello guys,

Currently I'm a little bit confused by coalesce behaviour.

Consider the following use case: I'd like to join two pretty big RDDs.
To make the join more stable and to prevent OOM failures, the RDDs are
usually repartitioned to redistribute the data more evenly and to keep
every partition below the 2GB limit, so the join ends up running with a
lot of partitions.

After the join succeeds I'd like to save the resulting dataset, but I
don't need as many output files as there were partitions/tasks during
the join. A number of files equal to the total number of executor cores
allocated to the job would be fine, so I've considered using coalesce.
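
For concreteness, a minimal sketch of the pattern described above
(bigRdd1, bigRdd2, the partition counts and the output path are
hypothetical placeholders; the final coalesce is exactly the call in
question):

// Hypothetical sketch; names and numbers are placeholders.
val bigRdd1 = sc.makeRDD(1 to 100).map(i => (i.toString, i))
val bigRdd2 = sc.makeRDD(1 to 100).map(i => (i.toString, i * 2))

// Repartition both sides so the join runs with many small partitions.
val joined = bigRdd1.repartition(1000).join(bigRdd2.repartition(1000))

// Desired: roughly one output file per executor core (say 32), without
// also forcing the join itself down to 32 tasks.
joined.coalesce(32).saveAsTextFile("hdfs:///tmp/joined")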

The problem is that coalesce with shuffling disabled prevents the join
from using the specified number of partitions and instead forces the
join to run with the number of partitions passed to coalesce:

scala> sc.makeRDD(1 to 100, 20).repartition(100).coalesce(5, false).toDebugString
res5: String =
(5) CoalescedRDD[15] at coalesce at <console>:25 []
 |  MapPartitionsRDD[14] at repartition at <console>:25 []
 |  CoalescedRDD[13] at repartition at <console>:25 []
 |  ShuffledRDD[12] at repartition at <console>:25 []
 +-(20) MapPartitionsRDD[11] at repartition at <console>:25 []
    |   ParallelCollectionRDD[10] at makeRDD at <console>:25 []

With shuffling enabled everything is ok, e.g.

scala> sc.makeRDD(1 to 100, 20).repartition(100).coalesce(5, true).toDebugString
res6: String =
(5) MapPartitionsRDD[24] at coalesce at <console>:25 []
 |  CoalescedRDD[23] at coalesce at <console>:25 []
 |  ShuffledRDD[22] at coalesce at <console>:25 []
 +-(100) MapPartitionsRDD[21] at coalesce at <console>:25 []
    |   MapPartitionsRDD[20] at repartition at <console>:25 []
    |   CoalescedRDD[19] at repartition at <console>:25 []
    |   ShuffledRDD[18] at repartition at <console>:25 []
    +-(20) MapPartitionsRDD[17] at repartition at <console>:25 []
       |   ParallelCollectionRDD[16] at makeRDD at <console>:25 []

In that case the problem is that for pretty huge datasets the
additional reshuffle can take hours, or at least an amount of time
comparable to the join itself.

So I'd like to understand whether this is a bug or just expected behaviour.
If it is expected, is there any way to insert an additional
ShuffleMapStage at the appropriate position in the DAG without actually
reshuffling the data?

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: welcome a new batch of committers

2018-10-05 Thread Suresh Thalamati
Congratulations to all!

-suresh

On Wed, Oct 3, 2018 at 1:59 AM Reynold Xin  wrote:

> Hi all,
>
> The Apache Spark PMC has recently voted to add several new committers to
> the project, for their contributions:
>
> - Shane Knapp (contributor to infra)
> - Dongjoon Hyun (contributor to ORC support and other parts of Spark)
> - Kazuaki Ishizaki (contributor to Spark SQL)
> - Xingbo Jiang (contributor to Spark Core and SQL)
> - Yinan Li (contributor to Spark on Kubernetes)
> - Takeshi Yamamuro (contributor to Spark SQL)
>
> Please join me in welcoming them!
>
>


Re: welcome a new batch of committers

2018-10-05 Thread Xiao Li
Congratulations all!

Weiqing Yang wrote on Wednesday, October 3, 2018 at 11:20 PM:

> Congratulations everyone!
>
> On Wed, Oct 3, 2018 at 11:14 PM, Driesprong, Fokko 
> wrote:
>
>> Congratulations all!
>>
>> On Wed, Oct 3, 2018 at 23:03, Bryan Cutler wrote:
>>
>>> Congratulations everyone! Very well deserved!!
>>>
>>> On Wed, Oct 3, 2018, 1:59 AM Reynold Xin  wrote:
>>>
 Hi all,

 The Apache Spark PMC has recently voted to add several new committers
 to the project, for their contributions:

 - Shane Knapp (contributor to infra)
 - Dongjoon Hyun (contributor to ORC support and other parts of Spark)
 - Kazuaki Ishizaki (contributor to Spark SQL)
 - Xingbo Jiang (contributor to Spark Core and SQL)
 - Yinan Li (contributor to Spark on Kubernetes)
 - Takeshi Yamamuro (contributor to Spark SQL)

 Please join me in welcoming them!


>


Re: [DISCUSS][K8S] Local dependencies with Kubernetes

2018-10-05 Thread Yinan Li
> Just to be clear: in client mode things work right? (Although I'm not
really familiar with how client mode works in k8s - never tried it.)

If the driver runs on the submission client machine, yes, it should just
work. If the driver runs in a pod, however, it faces the same problem as in
cluster mode.

Yinan

On Fri, Oct 5, 2018 at 11:06 AM Stavros Kontopoulos <
stavros.kontopou...@lightbend.com> wrote:

> @Marcelo is correct. Mesos does not have something similar. Only Yarn does
> due to the distributed cache thing.
> I have described most of the above in the jira; there are also some
> other options.
>
> Best,
> Stavros
>
> On Fri, Oct 5, 2018 at 8:28 PM, Marcelo Vanzin <
> van...@cloudera.com.invalid> wrote:
>
>> On Fri, Oct 5, 2018 at 7:54 AM Rob Vesse  wrote:
>> > Ideally this would all just be handled automatically for users in the
>> way that all other resource managers do
>>
>> I think you're giving other resource managers too much credit. In
>> cluster mode, only YARN really distributes local dependencies, because
>> YARN has that feature (its distributed cache) and Spark just uses it.
>>
>> Standalone doesn't do it (see SPARK-4160) and I don't remember seeing
>> anything similar on the Mesos side.
>>
>> There are things that could be done; e.g. if you have HDFS you could
>> do a restricted version of what YARN does (upload files to HDFS, and
>> change the "spark.jars" and "spark.files" URLs to point to HDFS
>> instead). Or you could turn the submission client into a file server
>> that the cluster-mode driver downloads files from - although that
>> requires connectivity from the driver back to the client.
>>
>> Neither is great, but better than not having that feature.
>>
>> Just to be clear: in client mode things work right? (Although I'm not
>> really familiar with how client mode works in k8s - never tried it.)
>>
>> --
>> Marcelo
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
>
>


Re: [DISCUSS][K8S] Local dependencies with Kubernetes

2018-10-05 Thread Stavros Kontopoulos
@Marcelo is correct. Mesos does not have something similar. Only YARN does,
due to its distributed cache.
I have described most of the above in the jira; there are also some
other options.

Best,
Stavros

On Fri, Oct 5, 2018 at 8:28 PM, Marcelo Vanzin 
wrote:

> On Fri, Oct 5, 2018 at 7:54 AM Rob Vesse  wrote:
> > Ideally this would all just be handled automatically for users in the
> way that all other resource managers do
>
> I think you're giving other resource managers too much credit. In
> cluster mode, only YARN really distributes local dependencies, because
> YARN has that feature (its distributed cache) and Spark just uses it.
>
> Standalone doesn't do it (see SPARK-4160) and I don't remember seeing
> anything similar on the Mesos side.
>
> There are things that could be done; e.g. if you have HDFS you could
> do a restricted version of what YARN does (upload files to HDFS, and
> change the "spark.jars" and "spark.files" URLs to point to HDFS
> instead). Or you could turn the submission client into a file server
> that the cluster-mode driver downloads files from - although that
> requires connectivity from the driver back to the client.
>
> Neither is great, but better than not having that feature.
>
> Just to be clear: in client mode things work right? (Although I'm not
> really familiar with how client mode works in k8s - never tried it.)
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [DISCUSS][K8S] Local dependencies with Kubernetes

2018-10-05 Thread Yinan Li
Agreed with Marcelo that this is not a problem unique to Spark on k8s. For
a lot of organizations, hosting dependencies on HDFS seems to be the choice.
One option that the Spark Operator offers is to automatically upload
application dependencies from the submission client machine to a
user-specified S3 or GCS bucket and substitute the local dependencies with
the remote ones. But regardless of which option is used to stage local
dependencies, this generally only works for small ones such as jars or
small config/data files.
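
As a rough illustration of that staging idea (a sketch only, not the Spark
Operator's actual implementation; the staging URI, paths, and helper name are
assumptions), local dependencies could be copied to a remote filesystem via
the Hadoop FileSystem API and spark.jars rewritten to point at the uploaded
copies:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch only: copy local dependencies to a remote staging location (HDFS,
// S3 or GCS, assuming the matching Hadoop connector is on the classpath)
// and return the remote URIs to use in spark.jars / spark.files.
def stageDependencies(localPaths: Seq[String], stagingDir: String): Seq[String] = {
  val fs = FileSystem.get(new URI(stagingDir), new Configuration())
  localPaths.map { local =>
    val src = new Path(local)
    val dst = new Path(stagingDir, src.getName)
    fs.copyFromLocalFile(false, true, src, dst)  // keep the source, overwrite the target
    dst.toString
  }
}

// Hypothetical usage before submission:
// val remoteJars = stageDependencies(Seq("/tmp/app.jar"), "hdfs://namenode/user/me/staging")
// sparkConf.set("spark.jars", remoteJars.mkString(","))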

Yinan

On Fri, Oct 5, 2018 at 10:28 AM Marcelo Vanzin 
wrote:

> On Fri, Oct 5, 2018 at 7:54 AM Rob Vesse  wrote:
> > Ideally this would all just be handled automatically for users in the
> way that all other resource managers do
>
> I think you're giving other resource managers too much credit. In
> cluster mode, only YARN really distributes local dependencies, because
> YARN has that feature (its distributed cache) and Spark just uses it.
>
> Standalone doesn't do it (see SPARK-4160) and I don't remember seeing
> anything similar on the Mesos side.
>
> There are things that could be done; e.g. if you have HDFS you could
> do a restricted version of what YARN does (upload files to HDFS, and
> change the "spark.jars" and "spark.files" URLs to point to HDFS
> instead). Or you could turn the submission client into a file server
> that the cluster-mode driver downloads files from - although that
> requires connectivity from the driver back to the client.
>
> Neither is great, but better than not having that feature.
>
> Just to be clear: in client mode things work right? (Although I'm not
> really familiar with how client mode works in k8s - never tried it.)
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [DISCUSS][K8S] Local dependencies with Kubernetes

2018-10-05 Thread Marcelo Vanzin
On Fri, Oct 5, 2018 at 7:54 AM Rob Vesse  wrote:
> Ideally this would all just be handled automatically for users in the way 
> that all other resource managers do

I think you're giving other resource managers too much credit. In
cluster mode, only YARN really distributes local dependencies, because
YARN has that feature (its distributed cache) and Spark just uses it.

Standalone doesn't do it (see SPARK-4160) and I don't remember seeing
anything similar on the Mesos side.

There are things that could be done; e.g. if you have HDFS you could
do a restricted version of what YARN does (upload files to HDFS, and
change the "spark.jars" and "spark.files" URLs to point to HDFS
instead). Or you could turn the submission client into a file server
that the cluster-mode driver downloads files from - although that
requires connectivity from the driver back to the client.

Neither is great, but better than not having that feature.

Just to be clear: in client mode things work right? (Although I'm not
really familiar with how client mode works in k8s - never tried it.)

-- 
Marcelo

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS][K8S] Local dependencies with Kubernetes

2018-10-05 Thread Stavros Kontopoulos
Hi Rob,

Interesting topic and affects UX a lot. I provided my thoughts in the
related jira.

Best,
Stavros

On Fri, Oct 5, 2018 at 5:53 PM, Rob Vesse  wrote:

> Folks
>
>
>
> One of the big limitations of the current Spark on K8S implementation is
> that it isn’t possible to use local dependencies (SPARK-23153 [1]) i.e.
> code, JARs, data etc that only lives on the submission client.  This
> basically leaves end users with several options on how to actually run
> their Spark jobs under K8S:
>
>
>
>1. Store local dependencies on some external distributed file system
>e.g. HDFS
>2. Build custom images with their local dependencies
>3. Mount local dependencies into volumes that are mounted by the K8S
>pods
>
>
>
> In all cases the onus is on the end user to do the prep work.  Option 1 is
> unfortunately rare in the environments we’re looking to deploy Spark and
> Option 2 tends to be a non-starter as many of our customers whitelist
> approved images i.e. custom images are not permitted.
>
>
>
> Option 3 is more workable but still requires the users to provide a bunch
> of extra config options to configure this for simple cases or rely upon the
> pending pod template feature for complex cases.
>
>
>
> Ideally this would all just be handled automatically for users in the way
> that all other resource managers do, the K8S backend even did this at one
> point in the downstream fork but after a long discussion [2] this got
> dropped in favour of using Spark standard mechanisms i.e. spark-submit.
> Unfortunately this apparently was never followed through upon as it doesn’t
> work with master as of today.  Moreover I am unclear how this would work in
> the case of Spark on K8S cluster mode where the driver itself is inside a
> pod since the spark-submit mechanism is based upon copying from the drivers
> filesystem to the executors via a file server on the driver, if the driver
> is inside a pod it won’t be able to see local files on the submission
> client.  I think this may work out of the box with client mode but I
> haven’t dug into that enough to verify yet.
>
>
>
> I would like to start work on addressing this problem but to be honest I
> am unclear where to start with this.  It seems using the standard
> spark-submit mechanism is the way to go but I’m not sure how to get around
> the driver pod issue.  I would appreciate any pointers from folks who’ve
> looked at this previously on how and where to start on this.
>
>
>
> Cheers,
>
>
>
> Rob
>
>
>
> [1] https://issues.apache.org/jira/browse/SPARK-23153
>
> [2] https://lists.apache.org/thread.html/82b4ae9a2eb5ddeb3f7240ebf154f06f19b830f8b3120038e5d687a1@%3Cdev.spark.apache.org%3E
>


Spark github sync works now

2018-10-05 Thread Xiao Li
FYI. The Spark GitHub sync was 7 hours behind this morning. You might get
failed merges because of this. I just triggered a re-sync; it should work now.

Thanks,

Xiao


[DISCUSS][K8S] Local dependencies with Kubernetes

2018-10-05 Thread Rob Vesse
Folks

 

One of the big limitations of the current Spark on K8S implementation is that
it isn’t possible to use local dependencies (SPARK-23153 [1]), i.e. code, JARs,
data etc. that only live on the submission client.  This basically leaves end
users with several options for how to actually run their Spark jobs under K8S:

 
1. Store local dependencies on some external distributed file system e.g. HDFS
2. Build custom images with their local dependencies
3. Mount local dependencies into volumes that are mounted by the K8S pods
 

In all cases the onus is on the end user to do the prep work.  Option 1 is
unfortunately rare in the environments where we’re looking to deploy Spark, and
Option 2 tends to be a non-starter as many of our customers whitelist approved
images, i.e. custom images are not permitted.

 

Option 3 is more workable but still requires users to provide a bunch of extra
config options for simple cases, or to rely upon the pending pod template
feature for complex cases.

 

Ideally this would all just be handled automatically for users in the way that
all other resource managers do. The K8S backend even did this at one point in
the downstream fork, but after a long discussion [2] this got dropped in favour
of using standard Spark mechanisms, i.e. spark-submit.  Unfortunately this
apparently was never followed through upon, as it doesn’t work with master as
of today.  Moreover, I am unclear how this would work in the case of Spark on
K8S cluster mode, where the driver itself is inside a pod: the spark-submit
mechanism is based upon copying from the driver’s filesystem to the executors
via a file server on the driver, and if the driver is inside a pod it won’t be
able to see local files on the submission client.  I think this may work out of
the box with client mode but I haven’t dug into that enough to verify yet.

 

I would like to start work on addressing this problem but, to be honest, I am
unclear where to begin.  It seems that using the standard spark-submit
mechanism is the way to go, but I’m not sure how to get around the driver pod
issue.  I would appreciate any pointers from folks who’ve looked at this
previously on how and where to start.

 

Cheers,

 

Rob

 

[1] https://issues.apache.org/jira/browse/SPARK-23153

[2] 
https://lists.apache.org/thread.html/82b4ae9a2eb5ddeb3f7240ebf154f06f19b830f8b3120038e5d687a1@%3Cdev.spark.apache.org%3E