Re: Apache Spark Docker image repository

2020-02-11 Thread Erik Erlandson
My takeaway from the last time we discussed this was:
1) To be ASF compliant, we needed to publish images only at official
releases
2) There was some ambiguity about whether a container image that
included GPL'ed packages (spark images do) might trip over the GPL "viral
propagation" due to integrating ASL and GPL in a "binary release".  The
"air gap" GPL provision may apply - the GPL software interacts only at
command-line boundaries.

On Wed, Feb 5, 2020 at 1:23 PM Dongjoon Hyun 
wrote:

> Hi, All.
>
> From 2020, shall we have an official Docker image repository as an
> additional distribution channel?
>
> I'm considering the following images.
>
> - Public binary release (no snapshot image)
> - Public non-Spark base image (OS + R + Python)
>   (This can be used in GitHub Action Jobs and Jenkins K8s Integration
> Tests to speed up jobs and to have more stable environments)
>
> Bests,
> Dongjoon.
>


Re: Initial Decom PR for Spark 3?

2020-02-08 Thread Erik Erlandson
I'd be willing to pull this in, unless others have concerns post branch-cut.

On Tue, Feb 4, 2020 at 2:51 PM Holden Karau  wrote:

> Hi Y’all,
>
> I’ve got a K8s graceful decom PR (
> https://github.com/apache/spark/pull/26440
>  ) I’d love to try and get in for Spark 3, but I don’t want to push on it
> if folks don’t think it’s worth it. I’ve been working on it since 2017 and
> it was really close in November but then I had the crash and had to step
> back for a while.
>
> Its effectiveness is behind a feature flag and it’s been outstanding for
> a while, so those points are in its favour. It does however change things in
> core which is not great.
>
> Cheers,
>
> Holden
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: [DISCUSS][SPARK-30275] Discussion about whether to add a gitlab-ci.yml file

2020-01-26 Thread Erik Erlandson
Can a '.gitlab-ci.yml' be considered code, in the same way that the k8s
related dockerfiles are code?  In other words, something like: "here is a
piece of code you might choose to use for building your own binaries, that
is not specifically endorsed by Apache Spark"? So it would not be involved
in the creation of nightly binaries by Apache Spark per se, but could be
used by individuals for that purpose.

On Thu, Jan 23, 2020 at 3:52 PM Jim Kleckner  wrote:

> I understand that "non-dev" persons could become confused and that some
> sort of signposting/warning makes sense.
>
> Certainly I consider my personal registry on gitlab.com as ephemeral and
> not intended for publishing.
> We have our own private instance of gitlab where I put derived artifacts;
> this was needed to work with GKE as mentioned, since 2.4.4 does not work
> out of the box with service accounts the way we use them.
>
> I can keep this file as a branch of my own that I manually merge when
> needed if others don't find this useful or the risk of confusion is greater
> than the value.
>
> Simply close the JIRA as not desirable at:
> https://issues.apache.org/jira/browse/SPARK-30275
>
> And now there are discussions both in email and JIRA...
>
> Jim
>
>
>
>
> On Thu, Jan 23, 2020 at 11:15 AM Sean Owen  wrote:
>
>> Yeah the color on this is that 'snapshot' or 'nightly' builds are not
>> quite _discouraged_ by the ASF, but need to be something only devs are
>> likely to find and clearly signposted, because they aren't officially
>> blessed releases. It gets into a gray area if the project is
>> 'officially' hosting a way to get snapshot builds. It is not at all
>> impossible, just something that's come up and generated some angst in
>> the past, so we dropped it.
>>
>> On Thu, Jan 23, 2020 at 1:09 PM Dongjoon Hyun 
>> wrote:
>> >
>> > Hi, Jim.
>> >
>> > Thank you for the proposal. I understand the request.
>> > However, the following key benefit sounds like unofficial snapshot
>> binary releases.
>> >
>> > > For example, this was used to build a version of spark that included
>> SPARK-28938 which has yet to be released and was necessary for
>> spark-operator to work properly with GKE service accounts
>> >
>> > Historically, we removed the existing snapshot binaries in some
>> personal repositories and there is no plan to add them back.
>> > Also, for snapshot dev jars, we use only the official Apache Maven
>> snapshot repository.
>> >
>> > For official releases, we aim to release Apache Spark source code (and
>> its artifacts) according to the pre-defined release cadence in an official
>> manner.
>> >
>> > BTW, SPARK-28938 doesn't mean that we need to publish a docker image.
>> Even in the official release, as you know, we only provide a reference
>> Dockerfile. That's the reason why we don't publish a docker image via GitHub
>> Actions (as of today).
>> >
>> > To achieve the following custom requirement, I'd recommend that you
>> maintain your own Dockerfile.
>> That is the best way for you to get the flexibility you need.
>> >
>> > > One value of this is the ability to create versions of dependent
>> packages such as spark-on-k8s-operator
>> >
>> > Thanks,
>> > Dongjoon.
>> >
>> >
>> > On Thu, Jan 23, 2020 at 9:32 AM Jim Kleckner 
>> wrote:
>> >>
>> >> This story [1] proposes adding a .gitlab-ci.yml file to make it easy
>> to create artifacts and images for spark.
>> >>
>> >> Using this mechanism, people can submit any subsequent version of
>> spark for building and image hosting with gitlab.com.
>> >>
>> >> There is a companion WIP branch [2] with a candidate and example for
>> doing this.
>> >> The exact steps for building are in the yml file [3].
>> >> The images get published into the namespace of the user, as shown here [4].
>> >>
>> >> One value of this is the ability to create versions of dependent
>> packages such as spark-on-k8s-operator that might use upgraded packages or
>> modifications for testing.  For example, this was used to build a version
>> of spark that included SPARK-28938 which has yet to be released and was
>> necessary for spark-operator to work properly with GKE service accounts [5].
>> >>
>> >> Comments about desirability?
>> >>
>> >> [1] https://issues.apache.org/jira/browse/SPARK-30275
>> >> [2] https://gitlab.com/jkleckner/spark/tree/add-gitlab-ci-yml
>> >> [3]
>> https://gitlab.com/jkleckner/spark/blob/add-gitlab-ci-yml/.gitlab-ci.yml
>> >> [4] https://gitlab.com/jkleckner/spark/container_registry
>> >> [5]
>> https://gitlab.com/jkleckner/spark-on-k8s-operator/container_registry
>>
>


[DISCUSS] commit to ExpressionEncoder?

2019-10-22 Thread Erik Erlandson
Currently the design of Encoder implies the possibility that encoders might
be customized, or at least that there are other internal alternatives to
ExpressionEncoder. However, there are both implicit and explicit
restrictions in spark-sql such that ExpressionEncoder is the only
functional option, including but not limited to:

  def encoderFor[A : Encoder]: ExpressionEncoder[A] = implicitly[Encoder[A]] match {
    case e: ExpressionEncoder[A] =>
      e.assertUnresolved()
      e
    case _ => sys.error(s"Only expression encoders are supported today")
  }

My impression, based on recent work around the SQL code, is that:
1) ExpressionEncoder does support the space of plausible data types and
objects, including custom ones.
2) Supporting other subclasses of Encoder would, in practice, be nontrivial.

Should Spark SQL more explicitly commit to an "ExpressionEncoder only"
position?
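
For illustration, here is a minimal sketch (not code from the Spark codebase;
the class and object names are hypothetical) of the restriction in practice: a
custom Encoder subclass compiles against the public trait, but any entry point
that resolves encoders through encoderFor is expected to hit the sys.error
branch above at runtime.

import scala.reflect.ClassTag
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.types.StructType

// Hypothetical Encoder implementation that is *not* an ExpressionEncoder.
class PassthroughStringEncoder extends Encoder[String] {
  override def schema: StructType = Encoders.STRING.schema
  override def clsTag: ClassTag[String] = implicitly[ClassTag[String]]
}

object EncoderRestrictionDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
    implicit val enc: Encoder[String] = new PassthroughStringEncoder
    // Compiles, but createDataset resolves the encoder via encoderFor, so this
    // should fail with "Only expression encoders are supported today".
    spark.createDataset(Seq("a", "b")).show()
  }
}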


Re: [k8s] Spark operator (the Java one)

2019-10-19 Thread Erik Erlandson
> It's applicable regardless of whether the operators are maintained as part of
> Spark core or not, with the maturity of Kubernetes features around CRD
> support and webhooks. The GCP Spark operator supports a lot of additional
> pod/container configs using a webhook, and this approach seems pretty
> successful so far.
>

Agreed (in fact the existence of at least two independent operator projects
testifies to this). I do believe this has implications for how feature
requests for spark-on-k8s get fielded here upstream. There's a non-zero
amount of cognitive load involved with recommending that a feature request
be deferred to some independent operator project. Going forward, will that
complicate the upstream story for spark-submit, history server, and shuffle
service on a kube backend?


Re: Spark 3.0 preview release feature list and major changes

2019-10-19 Thread Erik Erlandson
I'd like to get SPARK-27296 onto 3.0:
SPARK-27296  Efficient
User Defined Aggregators



On Mon, Oct 7, 2019 at 3:03 PM Xingbo Jiang  wrote:

> Hi all,
>
> I went over all the finished JIRA tickets targeted to Spark 3.0.0, here
> I'm listing all the notable features and major changes that are ready to
> test/deliver, please don't hesitate to add more to the list:
>
> SPARK-11215  Multiple
> columns support added to various Transformers: StringIndexer
>
> SPARK-11150  Implement
> Dynamic Partition Pruning
>
> SPARK-13677  Support
> Tree-Based Feature Transformation
>
> SPARK-16692  Add
> MultilabelClassificationEvaluator
>
> SPARK-19591  Add
> sample weights to decision trees
>
> SPARK-19712  Pushing
> Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc.
>
> SPARK-19827  R API for
> Power Iteration Clustering
>
> SPARK-20286  Improve
> logic for timing out executors in dynamic allocation
>
> SPARK-20636  Eliminate
> unnecessary shuffle with adjacent Window expressions
>
> SPARK-22148  Acquire
> new executors to avoid hang because of blacklisting
>
> SPARK-22796  Multiple
> columns support added to various Transformers: PySpark QuantileDiscretizer
>
> SPARK-23128  A new
> approach to do adaptive execution in Spark SQL
>
> SPARK-23674  Add Spark
> ML Listener for Tracking ML Pipeline Status
>
> SPARK-23710  Upgrade
> the built-in Hive to 2.3.5 for hadoop-3.2
>
> SPARK-24333  Add fit
> with validation set to Gradient Boosted Trees: Python API
>
> SPARK-24417  Build and
> Run Spark on JDK11
>
> SPARK-24615 
> Accelerator-aware task scheduling for Spark
>
> SPARK-24920  Allow
> sharing Netty's memory pool allocators
>
> SPARK-25250  Fix race
> condition with tasks running when new attempt for same stage is created
> leads to other task in the next attempt running on the same partition id
> retry multiple times
>
> SPARK-25341  Support
> rolling back a shuffle map stage and re-generate the shuffle files
>
> SPARK-25348  Data
> source for binary files
>
> SPARK-25603 
> Generalize Nested Column Pruning
>
> SPARK-26132  Remove
> support for Scala 2.11 in Spark 3.0.0
>
> SPARK-26215  define
> reserved keywords after SQL standard
>
> SPARK-26412  Allow
> Pandas UDF to take an iterator of pd.DataFrames
>
> SPARK-26785  data
> source v2 API refactor: streaming write
>
> SPARK-26956  remove
> streaming output mode from data source v2 APIs
>
> SPARK-27064  create
> StreamingWrite at the beginning of streaming execution
>
> SPARK-27119  Do not
> infer schema when reading Hive serde table with native data source
>
> SPARK-27225  Implement
> join strategy hints
>
> SPARK-27240  Use
> pandas DataFrame for struct type argument in Scalar Pandas UDF
>
> SPARK-27338  Fix
> deadlock between TaskMemoryManager and
> UnsafeExternalSorter$SpillableIterator
>
> SPARK-27396  Public
> APIs for extended Columnar Processing Support
>
> SPARK-27589 
> Re-implement file sources with data source V2 API
>
> SPARK-27677 
> Disk-persisted RDD blocks served by shuffle service, and ignored for
> Dynamic Allocation
>
> SPARK-27699  Partially
> push down disjunctive 

[DISCUSS] remove 'private[spark]' scoping from UserDefinedType

2019-10-19 Thread Erik Erlandson
The 3.0 release is as good an opportunity as any to make UserDefinedType
public again. What does the community think?
Cheers,
Erik


Re: [k8s] Spark operator (the Java one)

2019-10-16 Thread Erik Erlandson
Folks have (correctly) pointed out that an operator does not need to be
coupled to the Apache Spark project. However, I believe there are some
strategic community benefits to supporting a Spark operator that should be
weighed against the costs of maintaining one.

*) The Kubernetes ecosystem is evolving toward adopting operators as the de
facto standard for deploying and manipulating software resources on a kube
cluster. Supporting an out-of-the-box operator will increase the
attractiveness of Spark for users and stakeholders in the Kubernetes
ecosystem and maximize future uptake; it will continue to keep the barrier
to entry low for Spark on Kubernetes.

*) An operator provides a unified and idiomatic kube front-end not just for
spark job submissions, but also standalone spark clusters in the cloud, the
spark history server and eventually the modernized shuffle service, when
that is completed.

*) It represents an additional channel for exposing kube-specific features
that might otherwise need to be plumbed through spark-submit or the k8s
backend.

Cheers,
Erik

On Thu, Oct 10, 2019 at 9:23 PM Yinan Li  wrote:

> +1. This and the GCP Spark Operator, although very useful for k8s
> users, are not something needed by all Spark users, not even by all Spark
> on k8s users.
>
>
> On Thu, Oct 10, 2019 at 6:34 PM Stavros Kontopoulos <
> stavros.kontopou...@lightbend.com> wrote:
>
>> Hi all,
>>
>> I also left a comment on the PR with more details. I don't see why the
>> Java operator should be maintained by the Spark project.
>> This is an interesting project and could thrive on its own as an external
>> operator project.
>>
>> Best,
>> Stavros
>>
>> On Thu, Oct 10, 2019 at 7:51 PM Sean Owen  wrote:
>>
>>> I'd have the same question on the PR - why does this need to be in the
>>> Apache Spark project vs where it is now? Yes, it's not a Spark package
>>> per se, but it seems like this is a tool for K8S to use Spark rather
>>> than a core Spark tool.
>>>
>>> Yes of course all the packages, licenses, etc have to be overhauled,
>>> but that kind of underscores that this is a dump of a third party tool
>>> that works fine on its own?
>>>
>>> On Thu, Oct 10, 2019 at 9:30 AM Jiri Kremser 
>>> wrote:
>>> >
>>> > Hello,
>>> >
>>> >
>>> > Spark Operator is a tool that can deploy/scale and help with
>>> monitoring of Spark clusters on Kubernetes. It follows the operator pattern
>>> [1] introduced by CoreOS, so it watches for changes in custom resources
>>> representing the desired state of the clusters and takes the steps to
>>> achieve this state in Kubernetes by using the K8s client. It’s written
>>> in Java and there is an overlap with the spark dependencies (logging, k8s
>>> client, apache-commons-*, fasterxml-jackson, etc.). The operator also
>>> contains metadata that allows it to deploy smoothly using operatorhub.io
>>> [2]. For very basic info, check the readme on the project page, including
>>> the gif :) Another unique feature of this operator is the ability (it’s
>>> optional) to compile itself to a native image using the GraalVM compiler, to
>>> be able to start fast and have a very low memory footprint.
>>> >
>>> >
>>> > We would like to contribute this project to Spark’s code base. It
>>> can’t be distributed as a spark package, because it’s not a library that
>>> can be used from the Spark environment. So if you are interested, the directory
>>> under resource-managers/kubernetes/spark-operator/ could be a suitable
>>> destination.
>>> >
>>> >
>>> > The current repository is radanalytics/spark-operator [2] on GitHub
>>> and it also contains a test suite [3] that verifies that the operator can
>>> work well on K8s (using minikube) and also on OpenShift. I am not sure how
>>> to transfer those tests in case you would be interested in those as well.
>>> >
>>> >
>>> > I’ve already opened the PR [5], but it got closed, so I am opening the
>>> discussion here first. The PR contained old package names with our
>>> organisation called radanalytics.io but we are willing to change that
>>> to anything that will be more aligned with the existing Spark conventions;
>>> the same holds for the license headers in all the source files.
>>> >
>>> >
>>> > jk
>>> >
>>> >
>>> >
>>> > [1]: https://kubernetes.io/docs/concepts/extend-kubernetes/operator/
>>> >
>>> > [2]: https://operatorhub.io/operator/radanalytics-spark
>>> >
>>> > [3]: https://github.com/radanalyticsio/spark-operator
>>> >
>>> > [4]: https://travis-ci.org/radanalyticsio/spark-operator
>>> >
>>> > [5]: https://github.com/apache/spark/pull/26075
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>


Re: UDAFs have an inefficiency problem

2019-09-30 Thread Erik Erlandson
On the PR review, there were questions about adding a new aggregating
class, and whether or not Aggregator[IN,BUF,OUT] could be used.  I added a
proof of concept solution based on enhancing Aggregator to the pull-req:
https://github.com/apache/spark/pull/25024/

I wrote up my findings on the PR, but the gist is that Aggregator is a
feasible option; however, it does not provide *total* feature parity with
UDAF.  Note that this PR now includes two candidate solutions, for
comparison purposes, as well as an extra test file (tdigest.scala).
Eventually one of these solutions will be removed, depending on which option
is selected.
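
For context, a minimal sketch of the Aggregator[IN, BUF, OUT] shape referenced
above (illustrative only, not the tdigest code from the PR; all names here are
hypothetical):

import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Hypothetical buffer type: the intermediate aggregation state.
case class MeanBuf(sum: Double, count: Long)

// Aggregator[IN, BUF, OUT]: Double input, MeanBuf buffer, Double output.
object MeanAgg extends Aggregator[Double, MeanBuf, Double] {
  def zero: MeanBuf = MeanBuf(0.0, 0L)
  def reduce(b: MeanBuf, x: Double): MeanBuf = MeanBuf(b.sum + x, b.count + 1)
  def merge(a: MeanBuf, b: MeanBuf): MeanBuf = MeanBuf(a.sum + b.sum, a.count + b.count)
  def finish(b: MeanBuf): Double = if (b.count == 0) Double.NaN else b.sum / b.count
  def bufferEncoder: Encoder[MeanBuf] = Encoders.product[MeanBuf]
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

// Typed usage on a Dataset[Double]:  ds.select(MeanAgg.toColumn)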

I'm pushing this forward now with the goal of getting a solution into the
upcoming 3.0 branch cut.

On Wed, Mar 27, 2019 at 4:19 PM Erik Erlandson  wrote:

> I describe some of the details here:
> https://issues.apache.org/jira/browse/SPARK-27296
>
> The short version of the story is that aggregating data structures (UDTs)
> used by UDAFs are serialized to a Row object, and de-serialized, for every
> row in a data frame.
> Cheers,
> Erik
>
>


Re: Thoughts on Spark 3 release, or a preview release

2019-09-16 Thread Erik Erlandson
I'm in favor of adding SPARK-25299
 - Use remote storage
for persisting shuffle data
https://issues.apache.org/jira/browse/SPARK-25299

If that is far enough along to get onto the roadmap.


On Wed, Sep 11, 2019 at 11:37 AM Sean Owen  wrote:

> I'm curious what current feelings are about ramping down towards a
> Spark 3 release. It feels close to ready. There is no fixed date,
> though in the past we had informally tossed around "back end of 2019".
> For reference, Spark 1 was May 2014, Spark 2 was July 2016. I'd expect
> Spark 2 to last longer, so to speak, but feels like Spark 3 is coming
> due.
>
> What are the few major items that must get done for Spark 3, in your
> opinion? Below are all of the open JIRAs for 3.0 (which everyone
> should feel free to update with things that aren't really needed for
> Spark 3; I already triaged some).
>
> For me, it's:
> - DSv2?
> - Finishing touches on the Hive, JDK 11 update
>
> What about considering a preview release earlier, as happened for
> Spark 2, to get feedback much earlier than the RC cycle? Could that
> even happen ... about now?
>
> I'm also wondering what a realistic estimate of Spark 3 release is. My
> guess is quite early 2020, from here.
>
>
>
> SPARK-29014 DataSourceV2: Clean up current, default, and session catalog
> uses
> SPARK-28900 Test Pyspark, SparkR on JDK 11 with run-tests
> SPARK-28883 Fix a flaky test: ThriftServerQueryTestSuite
> SPARK-28717 Update SQL ALTER TABLE RENAME  to use TableCatalog API
> SPARK-28588 Build a SQL reference doc
> SPARK-28629 Capture the missing rules in HiveSessionStateBuilder
> SPARK-28684 Hive module support JDK 11
> SPARK-28548 explain() shows wrong result for persisted DataFrames
> after some operations
> SPARK-28372 Document Spark WEB UI
> SPARK-28476 Support ALTER DATABASE SET LOCATION
> SPARK-28264 Revisiting Python / pandas UDF
> SPARK-28301 fix the behavior of table name resolution with multi-catalog
> SPARK-28155 do not leak SaveMode to file source v2
> SPARK-28103 Cannot infer filters from union table with empty local
> relation table properly
> SPARK-28024 Incorrect numeric values when out of range
> SPARK-27936 Support local dependency uploading from --py-files
> SPARK-27884 Deprecate Python 2 support in Spark 3.0
> SPARK-27763 Port test cases from PostgreSQL to Spark SQL
> SPARK-27780 Shuffle server & client should be versioned to enable
> smoother upgrade
> SPARK-27714 Support Join Reorder based on Genetic Algorithm when the #
> of joined tables > 12
> SPARK-27471 Reorganize public v2 catalog API
> SPARK-27520 Introduce a global config system to replace hadoopConfiguration
> SPARK-24625 put all the backward compatible behavior change configs
> under spark.sql.legacy.*
> SPARK-24640 size(null) returns null
> SPARK-24702 Unable to cast to calendar interval in spark sql.
> SPARK-24838 Support uncorrelated IN/EXISTS subqueries for more operators
> SPARK-24941 Add RDDBarrier.coalesce() function
> SPARK-25017 Add test suite for ContextBarrierState
> SPARK-25083 remove the type erasure hack in data source scan
> SPARK-25383 Image data source supports sample pushdown
> SPARK-27272 Enable blacklisting of node/executor on fetch failures by
> default
> SPARK-27296 User Defined Aggregating Functions (UDAFs) have a major
> efficiency problem
> SPARK-25128 multiple simultaneous job submissions against k8s backend
> cause driver pods to hang
> SPARK-26731 remove EOLed spark jobs from jenkins
> SPARK-26664 Make DecimalType's minimum adjusted scale configurable
> SPARK-21559 Remove Mesos fine-grained mode
> SPARK-24942 Improve cluster resource management with jobs containing
> barrier stage
> SPARK-25914 Separate projection from grouping and aggregate in logical
> Aggregate
> SPARK-26022 PySpark Comparison with Pandas
> SPARK-20964 Make some keywords reserved along with the ANSI/SQL standard
> SPARK-26221 Improve Spark SQL instrumentation and metrics
> SPARK-26425 Add more constraint checks in file streaming source to
> avoid checkpoint corruption
> SPARK-25843 Redesign rangeBetween API
> SPARK-25841 Redesign window function rangeBetween API
> SPARK-25752 Add trait to easily whitelist logical operators that
> produce named output from CleanupAliases
> SPARK-23210 Introduce the concept of default value to schema
> SPARK-25640 Clarify/Improve EvalType for grouped aggregate and window
> aggregate
> SPARK-25531 new write APIs for data source v2
> SPARK-25547 Pluggable jdbc connection factory
> SPARK-20845 Support specification of column names in INSERT INTO
> SPARK-24417 Build and Run Spark on JDK11
> SPARK-24724 Discuss necessary info and access in barrier mode + Kubernetes
> SPARK-24725 Discuss necessary info and access in barrier mode + Mesos
> SPARK-25074 Implement maxNumConcurrentTasks() in
> MesosFineGrainedSchedulerBackend
> SPARK-23710 Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
> SPARK-25186 Stabilize Data Source V2 API
> SPARK-25376 Scenarios we should 

Re: Data Property Accumulators

2019-08-21 Thread Erik Erlandson
I'm wondering whether keeping track of accumulation in "consistent mode" is
really a case for mapping straight to the Try value, so parsedData has type
RDD[Try[...]], and counting failures is
parsedData.filter(_.isFailure).count, etc.

Put another way: consistent mode accumulation seems (to me) like it is
trying to obey spark's RDD compute model, contrasted with legacy
accumulators which subvert that model. I think the fact that your "option
3" is sending information about accumulators down through the mapping function
api, as well as passing through an "Option" stage, is also hinting at that
idea.

That might mean the idiomatic way to do consistent mode is via the existing
spark API, and using constructs like Try, Either, Option, Tuple, or just a
new column carrying additional accumulator channels.
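
A small sketch of that Try-based idiom (the SparkContext usage, input path, and
parsing logic below are hypothetical, just to make the shape concrete):

import scala.util.{Success, Try}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

val sc = SparkContext.getOrCreate()
val rawData: RDD[String] = sc.textFile("/data/records.txt")

// Keep parse failures as data instead of a side-channel accumulator, so the
// counts obey the normal RDD compute/retry semantics.
val parsedData: RDD[Try[Int]] = rawData.map(line => Try(line.trim.toInt))

val failureCount: Long = parsedData.filter(_.isFailure).count()
val goodValues: RDD[Int] = parsedData.collect { case Success(v) => v }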


On Fri, Aug 16, 2019 at 5:48 PM Holden Karau  wrote:

> Are folks interested in seeing data property accumulators for RDDs? I made
> a proposal for this back in 2016 (
> https://docs.google.com/document/d/1lR_l1g3zMVctZXrcVjFusq2iQVpr4XvRK_UUDsDr6nk/edit
>  ) but
> ABI compatibility was a stumbling block I couldn't design around. I can
> look at reviving it for Spark 3 or just go ahead and close out this idea.
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: UDAFs have an inefficiency problem

2019-07-05 Thread Erik Erlandson
I submitted a PR for this:
https://github.com/apache/spark/pull/25024

On Wed, Mar 27, 2019 at 4:19 PM Erik Erlandson  wrote:

> I describe some of the details here:
> https://issues.apache.org/jira/browse/SPARK-27296
>
> The short version of the story is that aggregating data structures (UDTs)
> used by UDAFs are serialized to a Row object, and de-serialized, for every
> row in a data frame.
> Cheers,
> Erik
>
>


Re: UDAFs have an inefficiency problem

2019-03-27 Thread Erik Erlandson
BTW, if this is known, is there an existing JIRA I should link to?

On Wed, Mar 27, 2019 at 4:36 PM Erik Erlandson  wrote:

>
> At a high level, some candidate strategies are:
> 1. "fix" the logic in ScalaUDAF (possibly in conjunction with mods to UDAF
> trait itself) so that the update method can do the right thing.
> 2. Expose TypedImperativeAggregate to users for defining their own, since
> it already does the right thing.
> 3. As a workaround, allow users to define their own sub-classes of
> DataType.  It would essentially allow one to define the sqlType of the UDT
> to be the aggregating object itself and make ser/de a no-op.  I tried doing
> this and it will compile, but spark's internals only consider a predefined
> universe of DataType classes.
>
> All of these options are likely to have implications for the catalyst
> systems. I'm not sure if they are minor or more substantial.
>
> On Wed, Mar 27, 2019 at 4:20 PM Reynold Xin  wrote:
>
>> Yes this is known and an issue for performance. Do you have any thoughts
>> on how to fix this?
>>
>> On Wed, Mar 27, 2019 at 4:19 PM Erik Erlandson 
>> wrote:
>>
>>> I describe some of the details here:
>>> https://issues.apache.org/jira/browse/SPARK-27296
>>>
>>> The short version of the story is that aggregating data structures
>>> (UDTs) used by UDAFs are serialized to a Row object, and de-serialized, for
>>> every row in a data frame.
>>> Cheers,
>>> Erik
>>>
>>>


Re: UDAFs have an inefficiency problem

2019-03-27 Thread Erik Erlandson
At a high level, some candidate strategies are:
1. "fix" the logic in ScalaUDAF (possibly in conjunction with mods to UDAF
trait itself) so that the update method can do the right thing.
2. Expose TypedImperativeAggregate to users for defining their own, since
it already does the right thing.
3. As a workaround, allow users to define their own sub-classes of
DataType.  It would essentially allow one to define the sqlType of the UDT
to be the aggregating object itself and make ser/de a no-op.  I tried doing
this and it will compile, but spark's internals only consider a predefined
universe of DataType classes.

All of these options are likely to have implications for the catalyst
systems. I'm not sure if they are minor or more substantial.

On Wed, Mar 27, 2019 at 4:20 PM Reynold Xin  wrote:

> Yes this is known and an issue for performance. Do you have any thoughts
> on how to fix this?
>
> On Wed, Mar 27, 2019 at 4:19 PM Erik Erlandson 
> wrote:
>
>> I describe some of the details here:
>> https://issues.apache.org/jira/browse/SPARK-27296
>>
>> The short version of the story is that aggregating data structures (UDTs)
>> used by UDAFs are serialized to a Row object, and de-serialized, for every
>> row in a data frame.
>> Cheers,
>> Erik
>>
>>


UDAFs have an inefficiency problem

2019-03-27 Thread Erik Erlandson
I describe some of the details here:
https://issues.apache.org/jira/browse/SPARK-27296

The short version of the story is that aggregating data structures (UDTs)
used by UDAFs are serialized to a Row object, and de-serialized, for every
row in a data frame.
Cheers,
Erik


Re: SPARK-25299: Updates As Of December 19, 2018

2019-01-09 Thread Erik Erlandson
Curious how SPARK-25299 (where file tracking is pushed to spark drivers, at
least in option-5) interacts with Splash. The shuffle data location in
SPARK-25299 would now have additional "fallback" logic for recovering from
executor loss.

On Thu, Jan 3, 2019 at 6:24 AM Peter Rudenko 
wrote:

> Hi Matt, I'm a developer of the SparkRDMA shuffle manager:
> https://github.com/Mellanox/SparkRDMA
> Thanks for your effort on improving the Spark Shuffle API. We are very
> interested in participating in this. I have several comments for now:
> 1. I went through these 4 documents:
>
>
> https://docs.google.com/document/d/1tglSkfblFhugcjFXZOxuKsCdxfrHBXfxgTs-sbbNB3c/edit#
> 
>
>
> https://docs.google.com/document/d/1TA-gDw3ophy-gSu2IAW_5IMbRK_8pWBeXJwngN9YB80/edit
>
>
> https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40
>
>
> https://docs.google.com/document/d/1kSpbBB-sDk41LeORm3-Hfr-up98Ozm5wskvB49tUhSs/edit#
> 
> As I understand it, there are two discussions: improving the shuffle manager
> API itself (Splash manager) and improving the external shuffle service.
>
> 
> 2. We may consider revisiting the SPIP: RDMA Accelerated Shuffle Engine
>  to decide whether to support
> RDMA in the main codebase or at least as a first-class shuffle plugin
> (there are not many other open source shuffle plugins). We actively
> develop it, adding new features. RDMA is now available on Azure (
> https://azure.microsoft.com/en-us/blog/introducing-the-new-hb-and-hc-azure-vm-sizes-for-hpc/),
> Alibaba, and other cloud providers. For now we support only memory <->
> memory transfer, but RDMA is extensible to NVM and GPU data transfer.
> 3. We have users that are interested in having this feature (
> https://issues.apache.org/jira/browse/SPARK-12196) - we can consider
> adding it to this new API.
>
> Let me know if you need help in review / testing / benchmark.
> I'll look more at the documents and the PR.
>
> Thanks,
> Peter Rudenko
> Software engineer at Mellanox Technologies.
>
>
> ср, 19 груд. 2018 о 20:54 John Zhuge  пише:
>
>> Matt, appreciate the update!
>>
>> On Wed, Dec 19, 2018 at 10:51 AM Matt Cheah  wrote:
>>
>>> Hi everyone,
>>>
>>>
>>>
>>> Earlier this year, we proposed SPARK-25299
>>> , proposing the idea
>>> of using other storage systems for persisting shuffle files. Since that
>>> time, we have been continuing to work on prototypes for this project. In
>>> the interest of increasing transparency into our work, we have created a 
>>> progress
>>> report document
>>> 
>>> where you may find a summary of the work we have been doing, as well as
>>> links to our prototypes on Github. We would ask that anyone who is very
>>> familiar with the inner workings of Spark’s shuffle could provide feedback
>>> and comments on our work thus far. We welcome any further discussion in
>>> this space. You may comment in this e-mail thread or by commenting on the
>>> progress report document.
>>>
>>>
>>>
>>> Looking forward to hearing from you. Thanks,
>>>
>>>
>>>
>>> -Matt Cheah
>>>
>>
>>
>> --
>> John
>>
>


Re: Remove non-Tungsten mode in Spark 3?

2019-01-09 Thread Erik Erlandson
Removing the user-facing config seems like a good idea from the standpoint
of reducing cognitive load and documentation burden.
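
For reference, a sketch of the user-facing settings under discussion, as they
can be set today (the values here are illustrative only):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.memory.useLegacyMode", "false")   // "true" selects the pre-tungsten StaticMemoryManager
  .set("spark.memory.offHeap.enabled", "true")  // Tungsten off-heap allocation
  .set("spark.memory.offHeap.size", "2g")       // must be positive when off-heap is enabled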

On Fri, Jan 4, 2019 at 7:03 AM Sean Owen  wrote:

> OK, maybe leave in tungsten for 3.0.
> I did a quick check, and removing StaticMemoryManager saves a few hundred
> lines. It's used in MemoryStore tests internally though, and not a trivial
> change to remove it. It's also used directly in HashedRelation. It could
> still be worth removing it as a user-facing option to reduce confusion
> about memory tuning, but it wouldn't take out much code. What do you all
> think?
>
> On Thu, Jan 3, 2019 at 9:41 PM Reynold Xin  wrote:
>
>> The issue with the offheap mode is it is a pretty big behavior change and
>> does require additional setup (also for users that run with UDFs that
>> allocate a lot of heap memory, it might not be as good).
>>
>> I can see us removing the legacy mode since it's been legacy for a long
>> time and perhaps very few users need it. How much code does it remove
>> though?
>>
>>
>> On Thu, Jan 03, 2019 at 2:55 PM, Sean Owen  wrote:
>>
>>> Just wondering if there is a good reason to keep around the pre-tungsten
>>> on-heap memory mode for Spark 3, and make spark.memory.offHeap.enabled
>>> always true? It would simplify the code somewhat, but I don't feel I'm so
>>> aware of the tradeoffs.
>>>
>>> I know we didn't deprecate it, but it's been off by default for a long
>>> time. It could be deprecated, too.
>>>
>>> Same question for spark.memory.useLegacyMode and all its various
>>> associated settings? Seems like these should go away at some point, and
>>> Spark 3 is a good point. Same issue about deprecation though.
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>
>>


Re: What's a blocker?

2018-10-25 Thread Erik Erlandson
I'd like to expand a bit on the phrase "opportunity cost" to try and make
it more concrete: delaying a release means that the community is *not*
receiving various bug fixes (and features).  Just as a particular example,
the wait for 2.3.2 delayed a fix for the Py3.7 iterator breaking change
that was also causing a correctness bug.  It also delays community feedback
from running new releases.  That in and of itself does not give an answer
to block/not-block for any specific case, but it's another way of saying
that blocking a release *prevents* people from getting bug fixes, as well
as delaying the feedback that leads to finding and fixing more bugs.


On Thu, Oct 25, 2018 at 7:39 AM Sean Owen  wrote:

> What does "PMC members aren't saying its a block for reasons other then
> the actual impact the jira has" mean that isn't already widely agreed?
> Likewise "Committers and PMC members should not be saying its not a
> blocker because they personally or their company doesn't care about this
> feature or api". It sounds like insinuation, and I'd rather make it
> explicit -- call out the bad actions -- or keep it to observable technical
> issues.
>
> Likewise one could say there's a problem just because A thinks X should be
> a blocker and B disagrees. I see no bad faith, process problem, or obvious
> errors. Do you? I see disagreement, and it's tempting to suspect motives. I
> have seen what I think are actual bad-faith decisions in the past in this
> project, too. I don't see it here though and want to stick to 'now'.
>
> (Aside: the implication is that those representing vendors are
> steam-rolling a release. Actually, the cynical incentives cut the other way
> here. Blessing the latest changes as OSS Apache Spark is predominantly
> beneficial to users of OSS, not distros. In fact, it forces distros to make
> changes. And broadly, vendors have much more accountability for quality of
> releases, because they're paid to.)
>
>
> I'm still not sure what specifically the objection is to here? I
> understand a lot is in flight and nobody agrees with every decision made,
> but, what else is new?
> Concretely: the release is held again to fix a few issues, in the end. For
> the map_filter issue, that seems like the right call, and there are a few
> other important issues that could be quickly fixed too. All is well there,
> yes?
>
> This has surfaced some implicit reasoning about releases that we could
> make explicit, like:
>
> (Sure, if you want to write down things like, release blockers should be
> decided in the interests of the project by the PMC, OK)
>
> We have a time-based release schedule, so time matters. There is an
> opportunity cost to not releasing. The bar for blockers goes up over time.
>
> Not all regressions are blockers. Would you hold a release over a trivial
> regression? But then which must or should block? There's no objective
> answer, but a reasonable rule is: non-trivial regressions from minor
> release x.y to x.{y+1} block releases. Regressions from x.{y-1} to x.{y+1}
> should, but not necessarily, block the release. We try hard to avoid
> regressions in x.y.0 releases because these are generally consumed by
> aggressive upgraders, on x.{y-1}.z now. If a bug exists in x.{y-1}, they're
> either not affected or have worked around it. The cautious upgrader goes from maybe
> x.{y-2}.z to x.y.1 later. They're affected, but not before, maybe, a
> maintenance release. A crude argument, and it's not an argument that
> regressions are OK. It's an argument that 'old' regressions matter less.
> And maybe it's reasonable to draw the "must" vs "should" line between them.
>
>
>
> On Thu, Oct 25, 2018 at 8:51 AM Tom Graves  wrote:
>
>> So just to clarify a few things in case people didn't read the entire
>> thread in the PR, the discussion is about what the criteria are for a blocker, and
>> really my concern is what people are using as criteria for not marking a
>> JIRA as a blocker.
>>
>> The only thing we have documented for marking a JIRA as a blocker is for
>> correctness issues: http://spark.apache.org/contributing.html.  And
>> really I think that is to initially mark it as a blocker to bring attention to
>> it.
>> The final decision as to whether something is a blocker is up to the PMC
>> who votes on whether a release passes.  I think it would be impossible to
>> properly define what a blocker is with strict rules.
>>
>> Personally from this thread I would like to make sure committers and PMC
>> members aren't saying it's a blocker for reasons other than the actual impact
>> the JIRA has, and if it's at all in question it should be brought to the
>> PMC's attention for a vote.  I agree with others that if it's during an RC
>> it should be talked about on the RC thread.
>>
>> A few specific things that were said that I disagree with are:
>>- it's not a blocker because it was also an issue in the last release
>> (meaning feature release), i.e. the bug was introduced in 2.2 and now we are
>> doing 2.4, so it's automatically not a blocker.  This to me is just

Re: What if anything to fix about k8s for the 2.4.0 RC5?

2018-10-25 Thread Erik Erlandson
I would be comfortable making the integration testing manual for now.  A
JIRA for ironing out how to make it reliable enough to run automatically, as a
goal for 3.0, seems like a good idea.

On Thu, Oct 25, 2018 at 8:11 AM Sean Owen  wrote:

> Forking this thread.
>
> Because we'll have another RC, we could possibly address these two
> issues. Only if we have a reliable change of course.
>
> Is it easy enough to propagate the -Pscala-2.12 profile? can't hurt.
>
> And is it reasonable to essentially 'disable'
> kubernetes/integration-tests by removing it from the kubernetes
> profile? it doesn't mean it goes away, just means it's run manually,
> not automatically. Is that actually how it's meant to be used anyway?
> in the short term? given the discussion around its requirements and
> minikube and all that?
>
> (Actually, this would also 'solve' the Scala 2.12 build problem too)
>
> On Tue, Oct 23, 2018 at 2:45 PM Sean Owen  wrote:
> >
> > To be clear I'm currently +1 on this release, with much commentary.
> >
> > OK, the explanation for kubernetes tests makes sense. Yes I think we
> need to propagate the scala-2.12 build profile to make it work. Go for it,
> if you have a lead on what the change is.
> > This doesn't block the release as it's an issue for tests, and only
> affects 2.12. However if we had a clean fix for this and there were another
> RC, I'd include it.
> >
> > Dongjoon has a good point about the spark-kubernetes-integration-tests
> artifact. That doesn't sound like it should be published in this way,
> though, of course, we publish the test artifacts from every module already.
> This is only a bit odd in being a non-test artifact meant for testing. But
> it's special testing! So I also don't think that needs to block a release.
> >
> > This happens because the integration tests module is enabled with the
> 'kubernetes' profile too, and also this output is copied into the release
> tarball at kubernetes/integration-tests/tests. Do we need that in a binary
> release?
> >
> > If these integration tests are meant to be run ad hoc, manually, not
> part of a normal test cycle, then I think we can just not enable it with
> -Pkubernetes. If it is meant to run every time, then it sounds like we need
> a little extra work shown in recent PRs to make that easier, but then, this
> test code should just be the 'test' artifact parts of the kubernetes
> module, no?
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [MLlib] PCA Aggregator

2018-10-19 Thread Erik Erlandson
For 3rd-party libs, I have been publishing independently, for example at
isarn-sketches-spark or silex:
https://github.com/isarn/isarn-sketches-spark
https://github.com/radanalyticsio/silex

Either of these repos provide some good working examples of publishing a
spark UDAF or ML library for jvm and pyspark.
(If anyone is interested in contributing new components to either of these,
feel free to reach out)

For people new to Spark library dev, Will Benton and I recently gave a
talk at SAI-EU on publishing Spark libraries:
https://databricks.com/session/apache-spark-for-library-developers-2
Cheers,
Erik
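
For anyone new to this, a minimal build.sbt sketch of publishing a Spark
UDAF/ML library as an independent package (all names and versions here are
hypothetical; see the repos above for complete, working builds):

// build.sbt (sketch)
name := "my-spark-aggregators"
organization := "io.example"
version := "0.1.0"
scalaVersion := "2.11.12"

// Mark Spark as "provided" so the published artifact does not bundle Spark itself.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"   % "2.3.2" % "provided",
  "org.apache.spark" %% "spark-mllib" % "2.3.2" % "provided"
)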

On Fri, Oct 19, 2018 at 9:40 AM Stephen Boesch  wrote:

> Erik - is there a current location for approved/recommended third party
> additions?  The spark-packages site has been stale for years, it seems.
>
> Am Fr., 19. Okt. 2018 um 07:06 Uhr schrieb Erik Erlandson <
> eerla...@redhat.com>:
>
>> Hi Matt!
>>
>> There are a couple ways to do this. If you want to submit it for
>> inclusion in Spark, you should start by filing a JIRA for it, and then a
>> pull request.   Another possibility is to publish it as your own 3rd party
>> library, which I have done for aggregators before.
>>
>>
>> On Wed, Oct 17, 2018 at 4:54 PM Matt Saunders  wrote:
>>
>>> I built an Aggregator that computes PCA on grouped datasets. I wanted to
>>> use the PCA functions provided by MLlib, but they only work on a full
>>> dataset, and I needed to do it on a grouped dataset (like a
>>> RelationalGroupedDataset).
>>>
>>> So I built a little Aggregator that can do that, here’s an example of
>>> how it’s called:
>>>
>>> val pcaAggregation = new PCAAggregator(vectorColumnName).toColumn
>>>
>>> // For each grouping, compute a PCA matrix/vector
>>> val pcaModels = inputData
>>>   .groupBy(keys:_*)
>>>   .agg(pcaAggregation.as(pcaOutput))
>>>
>>> I used the same algorithms under the hood as
>>> RowMatrix.computePrincipalComponentsAndExplainedVariance, though this works
>>> directly on Datasets without converting to RDD first.
>>>
>>> I’ve seen others who wanted this ability (for example on Stack Overflow)
>>> so I’d like to contribute it if it would be a benefit to the larger
>>> community.
>>>
>>> So.. is this something worth contributing to MLlib? And if so, what are
>>> the next steps to start the process?
>>>
>>> thanks!
>>>
>>


Re: [MLlib] PCA Aggregator

2018-10-19 Thread Erik Erlandson
Hi Matt!

There are a couple ways to do this. If you want to submit it for inclusion
in Spark, you should start by filing a JIRA for it, and then a pull
request.   Another possibility is to publish it as your own 3rd party
library, which I have done for aggregators before.


On Wed, Oct 17, 2018 at 4:54 PM Matt Saunders  wrote:

> I built an Aggregator that computes PCA on grouped datasets. I wanted to
> use the PCA functions provided by MLlib, but they only work on a full
> dataset, and I needed to do it on a grouped dataset (like a
> RelationalGroupedDataset).
>
> So I built a little Aggregator that can do that, here’s an example of how
> it’s called:
>
> val pcaAggregation = new PCAAggregator(vectorColumnName).toColumn
>
> // For each grouping, compute a PCA matrix/vector
> val pcaModels = inputData
>   .groupBy(keys:_*)
>   .agg(pcaAggregation.as(pcaOutput))
>
> I used the same algorithms under the hood as
> RowMatrix.computePrincipalComponentsAndExplainedVariance, though this works
> directly on Datasets without converting to RDD first.
>
> I’ve seen others who wanted this ability (for example on Stack Overflow)
> so I’d like to contribute it if it would be a benefit to the larger
> community.
>
> So.. is this something worth contributing to MLlib? And if so, what are
> the next steps to start the process?
>
> thanks!
>


Re: Starting to make changes for Spark 3 -- what can we delete?

2018-10-17 Thread Erik Erlandson
My understanding was that the legacy mllib API was frozen, with all new dev
going to ML, but that it was not going to be removed, although removing it would
get rid of a lot of `OldXxx` shims.

On Wed, Oct 17, 2018 at 12:55 AM Marco Gaido  wrote:

> Hi all,
>
> I think a very big topic on this would be: what do we want to do with the
> old mllib API? For long I have been told that it was going to be removed on
> 3.0. Is this still the plan?
>
> Thanks,
> Marco
>
> Il giorno mer 17 ott 2018 alle ore 03:11 Marcelo Vanzin
>  ha scritto:
>
>> Might be good to take a look at things marked "@DeveloperApi" and
>> whether they should stay that way.
>>
>> e.g. I was looking at SparkHadoopUtil and I've always wanted to just
>> make it private to Spark. I don't see why apps would need any of those
>> methods.
>> On Tue, Oct 16, 2018 at 10:18 AM Sean Owen  wrote:
>> >
>> > There was already agreement to delete deprecated things like Flume and
>> > Kafka 0.8 support in master. I've got several more on my radar, and
>> > wanted to highlight them and solicit general opinions on where we
>> > should accept breaking changes.
>> >
>> > For example how about removing accumulator v1?
>> > https://github.com/apache/spark/pull/22730
>> >
>> > Or using the standard Java Optional?
>> > https://github.com/apache/spark/pull/22383
>> >
>> > Or cleaning up some old workarounds and APIs while at it?
>> > https://github.com/apache/spark/pull/22729 (still in progress)
>> >
>> > I think I talked myself out of replacing Java function interfaces with
>> > java.util.function because...
>> > https://issues.apache.org/jira/browse/SPARK-25369
>> >
>> > There are also, say, old json and csv and avro reading method
>> > deprecated since 1.4. Remove?
>> > Anything deprecated since 2.0.0?
>> >
>> > Interested in general thoughts on these.
>> >
>> > Here are some more items targeted to 3.0:
>> >
>> https://issues.apache.org/jira/browse/SPARK-17875?jql=project%3D%22SPARK%22%20AND%20%22Target%20Version%2Fs%22%3D%223.0.0%22%20ORDER%20BY%20priority%20ASC
>> >
>> > -
>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >
>>
>>
>> --
>> Marcelo
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [DISCUSS][K8S][TESTS] Include Kerberos integration tests for Spark 2.4

2018-10-16 Thread Erik Erlandson
SPARK-23257 merged more recently than I realized. If that isn't on
branch-2.4, then the first question is how soon in the release sequence it
can be adopted.

On Tue, Oct 16, 2018 at 9:33 AM Reynold Xin  wrote:

> We shouldn’t merge new features into release branches anymore.
>
> On Tue, Oct 16, 2018 at 6:32 PM Rob Vesse  wrote:
>
>> Right now the Kerberos support for Spark on K8S is only on master AFAICT
>> i.e. the feature is not present on branch-2.4
>>
>>
>>
>> Therefore I don’t see any point in adding the tests into branch-2.4
>> unless the plan is to also merge the Kerberos support to branch-2.4
>>
>>
>>
>> Rob
>>
>>
>>
>> *From: *Erik Erlandson 
>> *Date: *Tuesday, 16 October 2018 at 16:47
>> *To: *dev 
>> *Subject: *[DISCUSS][K8S][TESTS] Include Kerberos integration tests for
>> Spark 2.4
>>
>>
>>
>> I'd like to propose including integration testing for Kerberos on the
>> Spark 2.4 release:
>>
>> https://github.com/apache/spark/pull/22608
>>
>>
>>
>> Arguments in favor:
>>
>> 1) it improves testing coverage on a feature important for integrating
>> with HDFS deployments
>>
>> 2) its intersection with existing code is small - it consists primarily
>> of new testing code, with a bit of refactoring into 'main' and 'test'
>> sub-trees. These new tests appear stable.
>>
>> 3) Spark 2.4 is still in RC, with outstanding correctness issues.
>>
>>
>>
>> The argument 'against' that I'm aware of would be the relatively large
>> size of the PR. I believe this is considered above, but am soliciting
>> community feedback before committing.
>>
>> Cheers,
>>
>> Erik
>>
>>
>>
>


[DISCUSS][K8S][TESTS] Include Kerberos integration tests for Spark 2.4

2018-10-16 Thread Erik Erlandson
I'd like to propose including integration testing for Kerberos on the Spark
2.4 release:
https://github.com/apache/spark/pull/22608

Arguments in favor:
1) it improves testing coverage on a feature important for integrating with
HDFS deployments
2) its intersection with existing code is small - it consists primarily of
new testing code, with a bit of refactoring into 'main' and 'test'
sub-trees. These new tests appear stable.
3) Spark 2.4 is still in RC, with outstanding correctness issues.

The argument 'against' that I'm aware of would be the relatively large size
of the PR. I believe this is considered above, but am soliciting community
feedback before committing.
Cheers,
Erik


Kubernetes Big-Data-SIG notes, September 19

2018-09-19 Thread Erik Erlandson
Meta
Following this week's regular meeting we will be meeting bi-weekly. The
next meeting will be October 3. I will be in London for Spark Summit and so
Yinan Li will chair that meeting.

Spark
K8s backend development for 2.4 is complete. There is some renewed
discussion about how much verification to do on user pod templates. This
feature is not
landing in 2.4 and there is leeway to generate more consensus during the
next release cycle.

Link to meeting minutes
https://docs.google.com/document/d/1pnF38NF6N5eM8DlK088XUW85Vms4V2uTsGZvSp8MNIA


Re: [DISCUSS][K8S] Supporting advanced pod customisation

2018-09-19 Thread Erik Erlandson
I can speak somewhat to the current design. Two of the goals for the design
of this feature are that
(1) its behavior is easy to reason about
(2) its implementation in the back-end is lightweight

Option 1 was chosen partly because its behavior is relatively simple to
describe to a user: "Your template will be taken as the starting point.
Spark may override a certain small set of fields (documented in a table)
that are necessary for its internal functioning."

This also keeps the actual back-end implementation relatively lightweight.
It can load the template (which also includes syntax validation) into a pod
structure, then modify any fields it needs to (per above).


On Wed, Sep 19, 2018 at 9:11 AM, Rob Vesse  wrote:

> Hey all
>
>
>
> For those following the K8S backend you are probably aware of SPARK-24434
> [1] (and PR 22416 [2]) which proposes a mechanism to allow for advanced pod
> customisation via pod templates.  This is motivated by the fact that
> introducing additional Spark configuration properties for each aspect of
> pod specification a user might wish to customise was becoming unwieldy.
>
>
>
> However I am concerned that the current implementation doesn’t go far
> enough and actually limits the utility of the proposed new feature.  The
> problem stems from the fact that the implementation simply uses the pod
> template as a base and then Spark attempts to build a pod spec on top of
> that.  As the code that does this doesn’t do any kind of validation or
> inspection of the incoming template it is possible to provide a template
> that causes Spark to generate an invalid pod spec ultimately causing the
> job to be rejected by Kubernetes.
>
>
>
> Now clearly Spark code cannot attempt to account for every possible
> customisation that a user may attempt to make via pod templates nor should
> it be responsible for ensuring that the user doesn’t start from an invalid
> template in the first place.  However it seems like we could be more
> intelligent in how we build our pod specs to avoid generating invalid specs
> in cases where we have a clear use case for advanced customisation.  For
> example the current implementation does not allow users to customise the
> volumes used to back SPARK_LOCAL_DIRS to better suit the compute
> environment the K8S cluster is running on and trying to do so with a pod
> template will result in an invalid spec due to duplicate volumes.
>
>
>
> I think there are a few ways the community could address this:
>
>
>
>1. Status quo – provide the pod template feature as-is and simply tell
>users that certain customisations are never supported and may result in
>invalid pod specs
>2. Provide the ability for advanced users to explicitly skip pod spec
>building steps they know interfere with their pod templates via
>configuration properties
>3. Modify the pod spec building code to be aware of known desirable
>user customisation points and avoid generating  invalid specs in those 
> cases
>
>
>
> Currently committers seem to be going for Option 1.  Personally I would
> like to see the community adopt option 3 but have already received
> considerable pushback when I proposed that in one of my PRs hence the
> suggestion of the compromise option 2.  Yes this still has the possibility
> of ending up with invalid specs if users are over-zealous in the spec
> building steps they disable but since this is a power user feature I think
> this would be a risk power users would be willing to assume.  If we are
> going to provide features for power users we should avoid unnecessarily
> limiting the utility of those features.
>
>
>
> What do other K8S folks think about this issue?
>
>
>
> Thanks,
>
>
>
> Rob
>
>
>
> [1] https://issues.apache.org/jira/browse/SPARK-24434
>
> [2] https://github.com/apache/spark/pull/22146
>
>
>


Re: Python friendly API for Spark 3.0

2018-09-18 Thread Erik Erlandson
I like the notion of empowering cross platform bindings.

The trend of computing frameworks seems to be that all APIs gradually
converge on a stable attractor which could be described as "data frames and
SQL"  Spark's early API design was RDD focused, but these days the center
of gravity is all about DataFrame (Python's prevalence combined with its
lack of a static type system substantially dilutes the benefits of DataSet,
for any library development that aspires to both JVM and python support).

I can imagine optimizing the developer layers of Spark APIs so that cross
platform support and also 3rd-party support for new and existing Spark
bindings would be maximized for "parallelizable dataframe+SQL".  Another of
Spark's strengths is its ability to federate heterogeneous data sources,
and making cross platform bindings easy for that is desirable.


On Sun, Sep 16, 2018 at 1:02 PM, Mark Hamstra 
wrote:

> It's not splitting hairs, Erik. It's actually very close to something that
> I think deserves some discussion (perhaps on a separate thread.) What I've
> been thinking about also concerns API "friendliness" or style. The original
> RDD API was very intentionally modeled on the Scala parallel collections
> API. That made it quite friendly for some Scala programmers, but not as
> much so for users of the other language APIs when they eventually came
> about. Similarly, the Dataframe API drew a lot from pandas and R, so it is
> relatively friendly for those used to those abstractions. Of course, the
> Spark SQL API is modeled closely on HiveQL and standard SQL. The new
> barrier scheduling draws inspiration from MPI. With all of these models and
> sources of inspiration, as well as multiple language targets, there isn't
> really a strong sense of coherence across Spark -- I mean, even though one
> of the key advantages of Spark is the ability to do within a single
> framework things that would otherwise require multiple frameworks, actually
> doing that is requiring more than one programming style or multiple design
> abstractions more than what is strictly necessary even when writing Spark
> code in just a single language.
>
> For me, that raises questions over whether we want to start designing,
> implementing and supporting APIs that are designed to be more consistent,
> friendly and idiomatic to particular languages and abstractions -- e.g. an
> API covering all of Spark that is designed to look and feel as much like
> "normal" code for a Python programmer, another that looks and feels more
> like "normal" Java code, another for Scala, etc. That's a lot more work and
> support burden than the current approach where sometimes it feels like you
> are writing "normal" code for your prefered programming environment, and
> sometimes it feels like you are trying to interface with something foreign,
> but underneath it hopefully isn't too hard for those writing the
> implementation code below the APIs, and it is not too hard to maintain
> multiple language bindings that are each fairly lightweight.
>
> It's a cost-benefit judgement, of course, whether APIs that are heavier
> (in terms of implementing and maintaining) and friendlier (for end users)
> are worth doing, and maybe some of these "friendlier" APIs can be done
> outside of Spark itself (imo, Frameless is doing a very nice job for the
> parts of Spark that it is currently covering -- https://github.com/
> typelevel/frameless); but what we have currently is a bit too ad hoc and
> fragmentary for my taste.
>
> On Sat, Sep 15, 2018 at 10:33 AM Erik Erlandson 
> wrote:
>
>> I am probably splitting hairs too finely, but I was considering the
>> difference between improvements to the jvm-side (py4j and the scala/java
>> code) that would make it easier to write the python layer ("python-friendly
>> api"), and actual improvements to the python layers ("friendly python api").
>>
>> They're not mutually exclusive of course, and both worth working on. But
>> it's *possible* to improve either without the other.
>>
>> Stub files look like a great solution for type annotations, maybe even if
>> only python 3 is supported.
>>
>> I definitely agree that any decision to drop python 2 should not be taken
>> lightly. Anecdotally, I'm seeing an increase in python developers
>> announcing that they are dropping support for python 2 (and loving it). As
>> people have already pointed out, if we don't drop python 2 for spark 3.0,
>> we're stuck with it until 4.0, which would place spark in a
>> possibly-awkward position of supporting python 2 for some time after it
>> goes EOL.
>>
>> Under the current release cadence, spark 3.0 will land some time in early
>> 2019, which at that point will be mere months until EOL for py2.

Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Erik Erlandson
I think that makes sense. The main benefit of deprecating *prior* to 3.0
would be informational - making the community aware of the upcoming
transition earlier. But there are other ways to start informing the
community between now and 3.0, besides formal deprecation.
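To make "formal deprecation" concrete: on the PySpark side it could be as
small as a one-time version check, something like the purely illustrative
sketch below (the wording, location, and timing are all hypothetical and
would be up to whoever implements it):

    # Hypothetical sketch of a formal Py2 deprecation in PySpark, e.g.
    # emitted once when a SparkContext is created. Illustrative only;
    # not code that exists in any Spark release.
    import sys
    import warnings

    def warn_if_python2():
        if sys.version_info[0] == 2:
            warnings.warn(
                "Support for Python 2 is deprecated and will be removed "
                "in a future Spark release. Please migrate to Python 3.",
                DeprecationWarning)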

I have some residual curiosity about what it might mean for a release like
2.4 to still be in its support lifetime after Py2 goes EOL. I asked Apache
Legal <https://issues.apache.org/jira/browse/LEGAL-407> to comment. It is
possible there are no issues with this at all.


On Mon, Sep 17, 2018 at 4:26 PM, Reynold Xin  wrote:

> i'd like to second that.
>
> if we want to communicate timeline, we can add to the release notes saying
> py2 will be deprecated in 3.0, and removed in a 3.x release.
>
> --
> excuse the brevity and lower case due to wrist injury
>
>
> On Mon, Sep 17, 2018 at 4:24 PM Matei Zaharia 
> wrote:
>
>> That’s a good point — I’d say there’s just a risk of creating a
>> perception issue. First, some users might feel that this means they have to
>> migrate now, which is before Python itself drops support; they might also
>> be surprised that we did this in a minor release (e.g. might we drop Python
>> 2 altogether in a Spark 2.5 if that later comes out?). Second, contributors
>> might feel that this means new features no longer have to work with Python
>> 2, which would be confusing. Maybe it’s OK on both fronts, but it just
>> seems scarier for users to do this now if we do plan to have Spark 3.0 in
>> the next 6 months anyway.
>>
>> Matei
>>
>> > On Sep 17, 2018, at 1:04 PM, Mark Hamstra 
>> wrote:
>> >
>> > What is the disadvantage to deprecating now in 2.4.0? I mean, it
>> doesn't change the code at all; it's just a notification that we will
>> eventually cease supporting Py2. Wouldn't users prefer to get that
>> notification sooner rather than later?
>> >
>> > On Mon, Sep 17, 2018 at 12:58 PM Matei Zaharia 
>> wrote:
>> > I’d like to understand the maintenance burden of Python 2 before
>> deprecating it. Since it is not EOL yet, it might make sense to only
>> deprecate it once it’s EOL (which is still over a year from now).
>> Supporting Python 2+3 seems less burdensome than supporting, say, multiple
>> Scala versions in the same codebase, so what are we losing out?
>> >
>> > The other thing is that even though Python core devs might not support
>> 2.x later, it’s quite possible that various Linux distros will if moving
>> from 2 to 3 remains painful. In that case, we may want Apache Spark to
>> continue releasing for it despite the Python core devs not supporting it.
>> >
>> > Basically, I’d suggest to deprecate this in Spark 3.0 and then remove
>> it later in 3.x instead of deprecating it in 2.4. I’d also consider looking
>> at what other data science tools are doing before fully removing it: for
>> example, if Pandas and TensorFlow no longer support Python 2 past some
>> point, that might be a good point to remove it.
>> >
>> > Matei
>> >
>> > > On Sep 17, 2018, at 11:01 AM, Mark Hamstra 
>> wrote:
>> > >
>> > > If we're going to do that, then we need to do it right now, since
>> 2.4.0 is already in release candidates.
>> > >
>> > > On Mon, Sep 17, 2018 at 10:57 AM Erik Erlandson 
>> wrote:
>> > > I like Mark’s concept for deprecating Py2 starting with 2.4: It may
>> seem like a ways off but even now there may be some spark versions
>> supporting Py2 past the point where Py2 is no longer receiving security
>> patches
>> > >
>> > >
>> > > On Sun, Sep 16, 2018 at 12:26 PM Mark Hamstra <
>> m...@clearstorydata.com> wrote:
>> > > We could also deprecate Py2 already in the 2.4.0 release.
>> > >
>> > > On Sat, Sep 15, 2018 at 11:46 AM Erik Erlandson 
>> wrote:
>> > > In case this didn't make it onto this thread:
>> > >
>> > > There is a 3rd option, which is to deprecate Py2 for Spark-3.0, and
>> remove it entirely on a later 3.x release.
>> > >
>> > > On Sat, Sep 15, 2018 at 11:09 AM, Erik Erlandson 
>> wrote:
>> > > On a separate dev@spark thread, I raised a question of whether or
>> not to support python 2 in Apache Spark, going forward into Spark 3.0.
>> > >
>> > > Python-2 is going EOL at the end of 2019. The upcoming release of
>> Spark 3.0 is an opportunity to make breaking changes to Spark's APIs, and
>> so it is a good time to consider support for Python-2 on PySpark.
>>

Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Erik Erlandson
FWIW, Pandas is dropping Py2 support at the end of this year. Tensorflow is
less clear. They only support py3 on windows, but there is no reference to
any policy about py2 on their roadmap or the TF 2.0 announcement.


Re: [VOTE] SPARK 2.4.0 (RC1)

2018-09-17 Thread Erik Erlandson
I have no binding vote but I second Stavros’ recommendation for spark-23200

Per parallel threads on Py2 support I would also like to propose
deprecating Py2 starting with this 2.4 release

On Mon, Sep 17, 2018 at 10:38 AM Marcelo Vanzin 
wrote:

> You can log in to https://repository.apache.org and see what's wrong.
> Just find that staging repo and look at the messages. In your case it
> seems related to your signature.
>
> failureMessageNo public key: Key with id: () was not able to be
> located on http://gpg-keyserver.de/. Upload your public key and try
> the operation again.
> On Sun, Sep 16, 2018 at 10:00 PM Wenchen Fan  wrote:
> >
> > I confirmed that
> https://repository.apache.org/content/repositories/orgapachespark-1285 is
> not accessible. I did it via ./dev/create-release/do-release-docker.sh -d
> /my/work/dir -s publish , not sure what's going wrong. I didn't see any
> error message during it.
> >
> > Any insights are appreciated! So that I can fix it in the next RC.
> Thanks!
> >
> > On Mon, Sep 17, 2018 at 11:31 AM Sean Owen  wrote:
> >>
> >> I think one build is enough, but haven't thought it through. The
> >> Hadoop 2.6/2.7 builds are already nearly redundant. 2.12 is probably
> >> best advertised as a 'beta'. So maybe publish a no-hadoop build of it?
> >> Really, whatever's the easy thing to do.
> >> On Sun, Sep 16, 2018 at 10:28 PM Wenchen Fan 
> wrote:
> >> >
> >> > Ah I missed the Scala 2.12 build. Do you mean we should publish a
> Scala 2.12 build this time? Current for Scala 2.11 we have 3 builds: with
> hadoop 2.7, with hadoop 2.6, without hadoop. Shall we do the same thing for
> Scala 2.12?
> >> >
> >> > On Mon, Sep 17, 2018 at 11:14 AM Sean Owen  wrote:
> >> >>
> >> >> A few preliminary notes:
> >> >>
> >> >> Wenchen for some weird reason when I hit your key in gpg --import, it
> >> >> asks for a passphrase. When I skip it, it's fine, gpg can still
> verify
> >> >> the signature. No issue there really.
> >> >>
> >> >> The staging repo gives a 404:
> >> >>
> https://repository.apache.org/content/repositories/orgapachespark-1285/
> >> >> 404 - Repository "orgapachespark-1285 (staging: open)"
> >> >> [id=orgapachespark-1285] exists but is not exposed.
> >> >>
> >> >> The (revamped) licenses are OK, though there are some minor glitches
> >> >> in the final release tarballs (my fault) : there's an extra
> directory,
> >> >> and the source release has both binary and source licenses. I'll fix
> >> >> that. Not strictly necessary to reject the release over those.
> >> >>
> >> >> Last, when I check the staging repo I'll get my answer, but, were you
> >> >> able to build 2.12 artifacts as well?
> >> >>
> >> >> On Sun, Sep 16, 2018 at 9:48 PM Wenchen Fan 
> wrote:
> >> >> >
> >> >> > Please vote on releasing the following candidate as Apache Spark
> version 2.4.0.
> >> >> >
> >> >> > The vote is open until September 20 PST and passes if a majority
> +1 PMC votes are cast, with
> >> >> > a minimum of 3 +1 votes.
> >> >> >
> >> >> > [ ] +1 Release this package as Apache Spark 2.4.0
> >> >> > [ ] -1 Do not release this package because ...
> >> >> >
> >> >> > To learn more about Apache Spark, please see
> http://spark.apache.org/
> >> >> >
> >> >> > The tag to be voted on is v2.4.0-rc1 (commit
> 1220ab8a0738b5f67dc522df5e3e77ffc83d207a):
> >> >> > https://github.com/apache/spark/tree/v2.4.0-rc1
> >> >> >
> >> >> > The release files, including signatures, digests, etc. can be
> found at:
> >> >> > https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc1-bin/
> >> >> >
> >> >> > Signatures used for Spark RCs can be found in this file:
> >> >> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >> >> >
> >> >> > The staging repository for this release can be found at:
> >> >> >
> https://repository.apache.org/content/repositories/orgapachespark-1285/
> >> >> >
> >> >> > The documentation corresponding to this release can be found at:
> >> >> > https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc1-docs/
> >> >> >
> >> >> > The list of bug fixes going into 2.4.0 can be found at the
> following URL:
> >> >> > https://issues.apache.org/jira/projects/SPARK/versions/2.4.0
> >> >> >
> >> >> > FAQ
> >> >> >
> >> >> > =
> >> >> > How can I help test this release?
> >> >> > =
> >> >> >
> >> >> > If you are a Spark user, you can help us test this release by
> taking
> >> >> > an existing Spark workload and running on this release candidate,
> then
> >> >> > reporting any regressions.
> >> >> >
> >> >> > If you're working in PySpark you can set up a virtual env and
> install
> >> >> > the current RC and see if anything important breaks, in the
> Java/Scala
> >> >> > you can add the staging repository to your projects resolvers and
> test
> >> >> > with the RC (make sure to clean up the artifact cache before/after
> so
> >> >> > you don't end up building with an out of date RC going forward).
> >> >> >
> >> >> > ===

Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Erik Erlandson
I like Mark’s concept for deprecating Py2 starting with 2.4: It may seem
like a ways off, but even now there may be some Spark versions supporting
Py2 past the point where Py2 is no longer receiving security patches.


On Sun, Sep 16, 2018 at 12:26 PM Mark Hamstra 
wrote:

> We could also deprecate Py2 already in the 2.4.0 release.
>
> On Sat, Sep 15, 2018 at 11:46 AM Erik Erlandson 
> wrote:
>
>> In case this didn't make it onto this thread:
>>
>> There is a 3rd option, which is to deprecate Py2 for Spark-3.0, and
>> remove it entirely on a later 3.x release.
>>
>> On Sat, Sep 15, 2018 at 11:09 AM, Erik Erlandson 
>> wrote:
>>
>>> On a separate dev@spark thread, I raised a question of whether or not
>>> to support python 2 in Apache Spark, going forward into Spark 3.0.
>>>
>>> Python-2 is going EOL <https://github.com/python/devguide/pull/344> at
>>> the end of 2019. The upcoming release of Spark 3.0 is an opportunity to
>>> make breaking changes to Spark's APIs, and so it is a good time to consider
>>> support for Python-2 on PySpark.
>>>
>>> Key advantages to dropping Python 2 are:
>>>
>>>- Support for PySpark becomes significantly easier.
>>>- Avoid having to support Python 2 until Spark 4.0, which is likely
>>>to imply supporting Python 2 for some time after it goes EOL.
>>>
>>> (Note that supporting python 2 after EOL means, among other things, that
>>> PySpark would be supporting a version of python that was no longer
>>> receiving security patches)
>>>
>>> The main disadvantage is that PySpark users who have legacy python-2
>>> code would have to migrate their code to python 3 to take advantage of
>>> Spark 3.0
>>>
>>> This decision obviously has large implications for the Apache Spark
>>> community and we want to solicit community feedback.
>>>
>>>
>>


Re: Should python-2 be supported in Spark 3.0?

2018-09-15 Thread Erik Erlandson
In case this didn't make it onto this thread:

There is a 3rd option, which is to deprecate Py2 for Spark-3.0, and remove
it entirely on a later 3.x release.

On Sat, Sep 15, 2018 at 11:09 AM, Erik Erlandson 
wrote:

> On a separate dev@spark thread, I raised a question of whether or not to
> support python 2 in Apache Spark, going forward into Spark 3.0.
>
> Python-2 is going EOL <https://github.com/python/devguide/pull/344> at
> the end of 2019. The upcoming release of Spark 3.0 is an opportunity to
> make breaking changes to Spark's APIs, and so it is a good time to consider
> support for Python-2 on PySpark.
>
> Key advantages to dropping Python 2 are:
>
>- Support for PySpark becomes significantly easier.
>- Avoid having to support Python 2 until Spark 4.0, which is likely to
>imply supporting Python 2 for some time after it goes EOL.
>
> (Note that supporting python 2 after EOL means, among other things, that
> PySpark would be supporting a version of python that was no longer
> receiving security patches)
>
> The main disadvantage is that PySpark users who have legacy python-2 code
> would have to migrate their code to python 3 to take advantage of Spark 3.0
>
> This decision obviously has large implications for the Apache Spark
> community and we want to solicit community feedback.
>
>


Should python-2 be supported in Spark 3.0?

2018-09-15 Thread Erik Erlandson
On a separate dev@spark thread, I raised a question of whether or not to
support python 2 in Apache Spark, going forward into Spark 3.0.

Python-2 is going EOL  at the
end of 2019. The upcoming release of Spark 3.0 is an opportunity to make
breaking changes to Spark's APIs, and so it is a good time to consider
support for Python-2 on PySpark.

Key advantages to dropping Python 2 are:

   - Support for PySpark becomes significantly easier.
   - Avoid having to support Python 2 until Spark 4.0, which is likely to
   imply supporting Python 2 for some time after it goes EOL.

(Note that supporting python 2 after EOL means, among other things, that
PySpark would be supporting a version of python that was no longer
receiving security patches)

The main disadvantage is that PySpark users who have legacy python-2 code
would have to migrate their code to python 3 to take advantage of Spark 3.0

This decision obviously has large implications for the Apache Spark
community and we want to solicit community feedback.


Re: Python friendly API for Spark 3.0

2018-09-15 Thread Erik Erlandson
I am probably splitting hairs too finely, but I was considering the
difference between improvements to the jvm-side (py4j and the scala/java
code) that would make it easier to write the python layer ("python-friendly
api"), and actual improvements to the python layers ("friendly python api").

They're not mutually exclusive of course, and both worth working on. But
it's *possible* to improve either without the other.

Stub files look like a great solution for type annotations, maybe even if
only python 3 is supported.
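To illustrate the stub-file approach: a hypothetical .pyi stub covering a
tiny slice of the DataFrame API could look like the sketch below. The names
mirror the existing pyspark.sql surface, but the stub itself is purely
illustrative and not something that ships with Spark today. Because stubs
live next to (or entirely outside) the package, the runtime code keeps
working on python 2 while python 3 tooling (mypy, IDEs) gets full types.

    # pyspark/sql/dataframe.pyi -- hypothetical type stub, illustrative only
    from typing import List, Union

    from pyspark.sql.column import Column
    from pyspark.sql.types import StructType

    class DataFrame:
        schema: StructType
        columns: List[str]
        def select(self, *cols: Union[Column, str]) -> "DataFrame": ...
        def filter(self, condition: Union[Column, str]) -> "DataFrame": ...
        def limit(self, num: int) -> "DataFrame": ...
        def count(self) -> int: ...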

I definitely agree that any decision to drop python 2 should not be taken
lightly. Anecdotally, I'm seeing an increase in python developers
announcing that they are dropping support for python 2 (and loving it). As
people have already pointed out, if we don't drop python 2 for spark 3.0,
we're stuck with it until 4.0, which would place spark in a
possibly-awkward position of supporting python 2 for some time after it
goes EOL.

Under the current release cadence, spark 3.0 will land some time in early
2019, which at that point will be mere months until EOL for py2.

On Fri, Sep 14, 2018 at 5:01 PM, Holden Karau  wrote:

>
>
> On Fri, Sep 14, 2018, 3:26 PM Erik Erlandson  wrote:
>
>> To be clear, is this about "python-friendly API" or "friendly python API"
>> ?
>>
> Well what would you consider to be different between those two statements?
> I think it would be good to be a bit more explicit, but I don't think we
> should necessarily limit ourselves.
>
>>
>> On the python side, it might be nice to take advantage of static typing.
>> Requires python 3.6 but with python 2 going EOL, a spark-3.0 might be a
>> good opportunity to jump the python-3-only train.
>>
> I think we can make types sort of work without ditching 2 (the types only
> would work in 3 but it would still function in 2). Ditching 2 entirely
> would be a big thing to consider, I honestly hadn't been considering that
> but it could be from just spending so much time maintaining a 2/3 code
> base. I'd suggest reaching out to user@ before making that kind of
> change.
>
>>
>> On Fri, Sep 14, 2018 at 12:15 PM, Holden Karau 
>> wrote:
>>
>>> Since we're talking about Spark 3.0 in the near future (and since some
>>> recent conversation on a proposed change reminded me) I wanted to open up
>>> the floor and see if folks have any ideas on how we could make a more
>>> Python friendly API for 3.0? I'm planning on taking some time to look at
>>> other systems in the solution space and see what we might want to learn
>>> from them but I'd love to hear what other folks are thinking too.
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/
>>> 2MaRAG9  <https://amzn.to/2MaRAG9>
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>
>>


Re: Python friendly API for Spark 3.0

2018-09-14 Thread Erik Erlandson
To be clear, is this about "python-friendly API" or "friendly python API" ?

On the python side, it might be nice to take advantage of static typing.
Requires python 3.6 but with python 2 going EOL, a spark-3.0 might be a
good opportunity to jump the python-3-only train.
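To make the 3.6 dependency concrete: inline annotations (as opposed to
separate stub files) would let PySpark and user code be type-checked
together, but syntax such as variable annotations only exists on python
3.6+. A purely illustrative sketch of what annotated user code could look
like:

    # Illustrative only: user code against a hypothetically annotated
    # PySpark API. Variable annotations ("df: DataFrame = ...") are
    # python 3.6+ syntax, which is why inline typing implies leaving 2.x.
    from pyspark.sql import DataFrame, SparkSession

    spark: SparkSession = (SparkSession.builder
                           .appName("typed-example")
                           .getOrCreate())
    df: DataFrame = spark.range(100).withColumnRenamed("id", "x")
    total: int = df.count()
    spark.stop()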

On Fri, Sep 14, 2018 at 12:15 PM, Holden Karau  wrote:

> Since we're talking about Spark 3.0 in the near future (and since some
> recent conversation on a proposed change reminded me) I wanted to open up
> the floor and see if folks have any ideas on how we could make a more
> Python friendly API for 3.0? I'm planning on taking some time to look at
> other systems in the solution space and see what we might want to learn
> from them but I'd love to hear what other folks are thinking too.
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/
> 2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Kubernetes Big-Data-SIG notes, September 12

2018-09-12 Thread Erik Erlandson
Spark
The pod template parameters feature is mostly done. The main remaining
design discussion is around how (or whether) to specify which container in
the pod template is the driver container.

HDFS
We had some discussions about the possibility of adding an HDFS operator,
in addition to the system of helm charts that Kimoon Kim has developed.
Yinan Li is looking into working on this.

Link to meeting minutes
https://docs.google.com/document/d/1pnF38NF6N5eM8DlK088XUW85Vms4V2uTsGZvSp8MNIA


Kubernetes Big-Data-SIG notes, September 5

2018-09-05 Thread Erik Erlandson
Meta
At the weekly K8s Big Data SIG meeting today, we agreed to experiment with
publishing a brief summary of noteworthy Spark-related topics from the
weekly meeting to dev@spark, as a reference for interested members of the
Apache Spark community.

The format is a brief summary, including a link to the SIG minutes (we also
post a link to the meeting recording on the minutes when it becomes
available).

With that, here are the first SIG meeting notes:

Spark
Based on initial feedback on the remote shuffle service storage design
exploration, further
work on design options is going to continue prior to attempting a POC
implementation. The group is consulting Facebook and Baidu for additional
input from their independent work in this area.

We reviewed the new fractional CPU
 support on the K8s
back-end (landing in Spark 2.4) and discussed options for configuring
fractional CPUs on the driver pod.

A consensus was reached to proceed with the current PR
 on apache/spark for the
upcoming user supplied pod template feature on the K8s back-end.

Link to meeting minutes
https://docs.google.com/document/d/1pnF38NF6N5eM8DlK088XUW85Vms4V2uTsGZvSp8MNIA


Re: [MLlib][Test] Smoke and Metamorphic Testing of MLlib

2018-08-23 Thread Erik Erlandson
Behaviors at this level of detail, across different ML implementations, are
highly unlikely to ever align exactly. Statistically small changes in
logic, such as "<" versus "<=", or differences in random number generators,
etc. (to say nothing of different implementation languages), will accumulate
over training to yield different models, even if their overall performance
should be similar.
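As a sketch of the kind of metamorphic check being discussed: the snippet
below (illustrative only; the toy data, seed, and estimator settings are
made up) fits the same MLlib estimator on a dataset and on a row-permuted
copy, then compares the fitted coefficients. Whether those two runs agree
bit-for-bit is exactly the implementation detail that can differ across
libraries without either being "wrong".

    # Metamorphic test sketch: does permuting instance order change the
    # fitted model? Data, seed, and settings here are arbitrary.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import rand
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.appName("metamorphic").getOrCreate()

    rows = [(float(i % 2), Vectors.dense([float(i), float(i * i % 7)]))
            for i in range(100)]
    df = spark.createDataFrame(rows, ["label", "features"])
    shuffled = df.orderBy(rand(seed=42))   # same rows, different order

    lr = LogisticRegression(maxIter=20, regParam=0.01)
    m1, m2 = lr.fit(df), lr.fit(shuffled)

    dist = float(m1.coefficients.squared_distance(m2.coefficients)) ** 0.5
    print("coefficient distance after permuting rows:", dist)
    spark.stop()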

> The random forests are a good example. I expected them to be dependent on
> feature/instance order. However, they are not in Weka, only in scikit-learn
> and Spark MLlib. There are more such examples, like logistic regression
> that exhibits different behavior in all three libraries.
>


Re: [DISCUSS] SparkR support on k8s back-end for Spark 2.4

2018-08-16 Thread Erik Erlandson
IMO sparkR support makes sense to merge for 2.4, as long as the release
wranglers agree that local integration testing is sufficiently convincing.
Part of the intent here is to allow this to happen without Shane having to
reorganize his complex upgrade schedule and make it even more complicated.

On Wed, Aug 15, 2018 at 7:08 PM, Wenchen Fan  wrote:

> I'm also happy to see we have R support on k8s for Spark 2.4. I'll do the
> manual testing for it if we don't want to upgrade the OS now. If the Python
> support is also merged in this way, I think we can merge the R support PR
> too?
>
> On Thu, Aug 16, 2018 at 7:23 AM shane knapp  wrote:
>
>>
>>> What is the current purpose of these builds?
>>>
>>> to be honest, i have absolutely no idea.  :)
>>
>> these were set up a long time ago, in a galaxy far far away, by someone
>> who is not me.
>>
>>
>>> - spark-docs seems to be building the docs, is that the only place
>>> where the docs build is tested?
>>>
>>> i think so...
>>
>>
>>> In the last many releases we've moved away from using jenkins jobs for
>>> preparing the packages, and the scripts have changed a lot to be
>>> friendlier to people running them locally (they even support docker
>>> now, and have flags to run "test" builds that don't require
>>> credentials such as GPG keys).
>>>
>>> awesome++
>>
>>
>>> Perhaps we should think about revamping these jobs instead of keeping
>>> them as is.
>>>
>>
>> i fully support this.  which is exactly why i punted on even trying to
>> get them ported over to the ubuntu nodes.
>>
>> shane
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>


[DISCUSS] SparkR support on k8s back-end for Spark 2.4

2018-08-15 Thread Erik Erlandson
The SparkR support PR is finished, along with integration testing, however
Shane has requested that the integration testing not be enabled until after
the 2.4 release because it requires the OS updates he wants to test *after*
the release.

The integration testing can be run locally, and so the question at hand is:
would the PMC be willing to consider inclusion of the SparkR support for
2.4, based on local verification of the testing? The PySpark PR was merged under
similar circumstances: the testing was verified locally and the PR was
merged before the testing was enabled for jenkins.

Cheers,
Erik


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-01 Thread Erik Erlandson
The PR for SparkR support on the kube back-end is completed, but waiting
for Shane to make some tweaks to the CI machinery for full testing support.
If the code freeze is being delayed, this PR could be merged as well.

On Fri, Jul 6, 2018 at 9:47 AM, Reynold Xin  wrote:

> FYI 6 mo is coming up soon since the last release. We will cut the branch
> and code freeze on Aug 1st in order to get 2.4 out on time.
>
>


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-01 Thread Erik Erlandson
I agree that looking at it from the pov of "code paths where isBarrier
tests were introduced" seems right.

From pr-21758 <https://github.com/apache/spark/pull/21758/files> (the one
already merged) there are 13 files touched under
core/src/main/scala/org/apache/spark/scheduler/, although most of those
appear to be relatively small edits. The "big" modifications are
concentrated on Task.scala and TaskSchedulerImpl.scala. The followup
pr-21898 <https://github.com/apache/spark/pull/21898/files> touches a
subset of those.

The project-hydrogen epic for "barrier execution" SPARK-24374
<https://issues.apache.org/jira/browse/SPARK-24374> contains 22 sub-issues,
most of which are still open. Some are marked for future release cycles; is
there a specific set being proposed for 2.4?  The various back-end supports
look tagged for subsequent release cycles: is the 2.4 scope standalone
clusters?

CI will obviously exercise standard task scheduling code paths, which
indicates some level of stability.  Folks on the k8s big data SIG today
were interested in building test distributions for the barrier-related
features. I was reflecting that although the spark-on-kube fork was awkward
in some ways, it did provide a unified distribution that interested
community members could build, download and/or run. Project hydrogen is
currently incarnated as a set of PRs, but a unified test build that
included pr-21758 <https://github.com/apache/spark/pull/21758/files> and
pr-21898 <https://github.com/apache/spark/pull/21898/files> (and others?)
would be cool. I've never seen an ideal workflow for handling multi-PR
development efforts.
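For anyone who wants a concrete sense of what these PRs expose to users,
here is a minimal PySpark sketch of the barrier API as it currently stands
in SPARK-24374 (details could still shift before 2.4 is final):

    # Minimal sketch of barrier execution mode as proposed in SPARK-24374.
    # All tasks of a barrier stage launch together, and each task can wait
    # at a global barrier (e.g. before kicking off an MPI/DL job).
    from pyspark import BarrierTaskContext
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("barrier-sketch").getOrCreate()
    sc = spark.sparkContext

    def run_partition(rows):
        ctx = BarrierTaskContext.get()
        peers = [info.address for info in ctx.getTaskInfos()]
        ctx.barrier()   # every task in the stage reaches this point first
        yield (ctx.partitionId(), len(peers))

    rdd = sc.parallelize(range(8), 4)
    print(rdd.barrier().mapPartitions(run_partition).collect())
    spark.stop()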


On Wed, Aug 1, 2018 at 1:43 PM, Imran Rashid  wrote:

> I still would like to do more review on barrier mode changes, but from
> what I've seen so far I agree. I dunno if it'll really be ready for use,
> but it should not pose much risk for code which doesn't touch the new
> features.  of course, every change has some risk, especially in the
> scheduler which has proven to be very brittle (I've written plenty of
> scheduler bugs while fixing other things myself).
>
> On Wed, Aug 1, 2018 at 1:13 PM, Xingbo Jiang 
> wrote:
>
>> Speaking of the code from hydrogen PRs, actually we didn't remove any of
>> the existing logic, and I tried my best to hide almost all of the newly
>> added logic behind an `isBarrier` tag (or something similar). I have to add
>> some new variables and new methods to the core code paths, but I think they
>> shall not be hit if you are not running barrier workloads.
>>
>> The only significant change I can think of is I swapped the sequence of
>> failure handling in DAGScheduler, moving the `case FetchFailed` block to
>> before the `case Resubmitted` block, but again I don't think this shall
>> affect a regular workload because anyway you can only have one failure type.
>>
>> Actually I also reviewed the previous PRs adding Spark on K8s support,
>> and I feel it's a good example of how to add new features to a project
>> without breaking existing workloads, I'm trying to follow that way in
>> adding barrier execution mode support.
>>
>> I really appreciate any notice on hydrogen PRs and welcome comments to
>> help improve the feature, thanks!
>>
>> 2018-08-01 4:19 GMT+08:00 Reynold Xin :
>>
>>> I actually totally agree that we should make sure it should have no
>>> impact on existing code if the feature is not used.
>>>
>>>
>>> On Tue, Jul 31, 2018 at 1:18 PM Erik Erlandson 
>>> wrote:
>>>
>>>> I don't have a comprehensive knowledge of the project hydrogen PRs,
>>>> however I've perused them, and they make substantial modifications to
>>>> Spark's core DAG scheduler code.
>>>>
>>>> What I'm wondering is: how high is the confidence level that the
>>>> "traditional" code paths are still stable. Put another way, is it even
>>>> possible to "turn off" or "opt out" of this experimental feature? This
>>>> analogy isn't perfect, but for example the k8s back-end is a major body of
>>>> code, but it has a very small impact on any *core* code paths, and so if
>>>> you opt out of it, it is well understood that you aren't running any
>>>> experimental code.
>>>>
>>>> Looking at the project hydrogen code, I'm less sure the same is true.
>>>> However, maybe there is a clear way to show how it is true.
>>>>
>>>>
>>>> On Tue, Jul 31, 2018 at 12:03 PM, Mark Hamstra >>> > wrote:
>>>>
>>>>> No reasonable amount of time is likely going to be suffi

Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Erik Erlandson
I don't have a comprehensive knowledge of the project hydrogen PRs, however
I've perused them, and they make substantial modifications to Spark's core
DAG scheduler code.

What I'm wondering is: how high is the confidence level that the
"traditional" code paths are still stable. Put another way, is it even
possible to "turn off" or "opt out" of this experimental feature? This
analogy isn't perfect, but for example the k8s back-end is a major body of
code, but it has a very small impact on any *core* code paths, and so if
you opt out of it, it is well understood that you aren't running any
experimental code.

Looking at the project hydrogen code, I'm less sure the same is true.
However, maybe there is a clear way to show how it is true.


On Tue, Jul 31, 2018 at 12:03 PM, Mark Hamstra 
wrote:

> No reasonable amount of time is likely going to be sufficient to fully vet
> the code as a PR. I'm not entirely happy with the design and code as they
> currently are (and I'm still trying to find the time to more publicly
> express my thoughts and concerns), but I'm fine with them going into 2.4
> much as they are as long as they go in with proper stability annotations
> and are understood not to be cast-in-stone final implementations, but
> rather as a way to get people using them and generating the feedback that
> is necessary to get us to something more like a final design and
> implementation.
>
> On Tue, Jul 31, 2018 at 11:54 AM Erik Erlandson 
> wrote:
>
>>
>> Barrier mode seems like a high impact feature on Spark's core code: is
>> one additional week enough time to properly vet this feature?
>>
>> On Tue, Jul 31, 2018 at 7:10 AM, Joseph Torres <
>> joseph.tor...@databricks.com> wrote:
>>
>>> Full continuous processing aggregation support ran into unanticipated
>>> scalability and scheduling problems. We’re planning to overcome those by
>>> using some of the barrier execution machinery, but since barrier execution
>>> itself is still in progress the full support isn’t going to make it into
>>> 2.4.
>>>
>>> Jose
>>>
>>> On Tue, Jul 31, 2018 at 6:07 AM Tomasz Gawęda 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> what is the status of Continuous Processing + Aggregations? As far as I
>>>> remember, Jose Torres said it should  be easy to perform aggregations
>>>> if
>>>> coalesce(1) work. IIRC it's already merged to master.
>>>>
>>>> Is this work in progress? If yes, it would be great to have full
>>>> aggregation/join support in Spark 2.4 in CP.
>>>>
>>>> Pozdrawiam / Best regards,
>>>>
>>>> Tomek
>>>>
>>>>
>>>> On 2018-07-31 10:43, Petar Zečević wrote:
>>>> > This one is important to us: https://issues.apache.org/
>>>> jira/browse/SPARK-24020 (Sort-merge join inner range optimization) but
>>>> I think it could be useful to others too.
>>>> >
>>>> > It is finished and is ready to be merged (was ready a month ago at
>>>> least).
>>>> >
>>>> > Do you think you could consider including it in 2.4?
>>>> >
>>>> > Petar
>>>> >
>>>> >
>>>> > Wenchen Fan @ 1970-01-01 01:00 CET:
>>>> >
>>>> >> I went through the open JIRA tickets and here is a list that we
>>>> should consider for Spark 2.4:
>>>> >>
>>>> >> High Priority:
>>>> >> SPARK-24374: Support Barrier Execution Mode in Apache Spark
>>>> >> This one is critical to the Spark ecosystem for deep learning. It
>>>> only has a few remaining works and I think we should have it in Spark 2.4.
>>>> >>
>>>> >> Middle Priority:
>>>> >> SPARK-23899: Built-in SQL Function Improvement
>>>> >> We've already added a lot of built-in functions in this release, but
>>>> there are a few useful higher-order functions in progress, like
>>>> `array_except`, `transform`, etc. It would be great if we can get them in
>>>> Spark 2.4.
>>>> >>
>>>> >> SPARK-14220: Build and test Spark against Scala 2.12
>>>> >> Very close to finishing, great to have it in Spark 2.4.
>>>> >>
>>>> >> SPARK-4502: Spark SQL reads unnecessary nested fields from Parquet
>>>> >> This one is there for years (thanks for your patience Michael!), and
>>>> is also close to finishing. Great to hav

Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Erik Erlandson
Barrier mode seems like a high impact feature on Spark's core code: is one
additional week enough time to properly vet this feature?

On Tue, Jul 31, 2018 at 7:10 AM, Joseph Torres  wrote:

> Full continuous processing aggregation support ran into unanticipated
> scalability and scheduling problems. We’re planning to overcome those by
> using some of the barrier execution machinery, but since barrier execution
> itself is still in progress the full support isn’t going to make it into
> 2.4.
>
> Jose
>
> On Tue, Jul 31, 2018 at 6:07 AM Tomasz Gawęda 
> wrote:
>
>> Hi,
>>
>> what is the status of Continuous Processing + Aggregations? As far as I
>> remember, Jose Torres said it should  be easy to perform aggregations if
>> coalesce(1) work. IIRC it's already merged to master.
>>
>> Is this work in progress? If yes, it would be great to have full
>> aggregation/join support in Spark 2.4 in CP.
>>
>> Pozdrawiam / Best regards,
>>
>> Tomek
>>
>>
>> On 2018-07-31 10:43, Petar Zečević wrote:
>> > This one is important to us: https://issues.apache.org/
>> jira/browse/SPARK-24020 (Sort-merge join inner range optimization) but I
>> think it could be useful to others too.
>> >
>> > It is finished and is ready to be merged (was ready a month ago at
>> least).
>> >
>> > Do you think you could consider including it in 2.4?
>> >
>> > Petar
>> >
>> >
>> > Wenchen Fan @ 1970-01-01 01:00 CET:
>> >
>> >> I went through the open JIRA tickets and here is a list that we should
>> consider for Spark 2.4:
>> >>
>> >> High Priority:
>> >> SPARK-24374: Support Barrier Execution Mode in Apache Spark
>> >> This one is critical to the Spark ecosystem for deep learning. It only
>> has a few remaining works and I think we should have it in Spark 2.4.
>> >>
>> >> Middle Priority:
>> >> SPARK-23899: Built-in SQL Function Improvement
>> >> We've already added a lot of built-in functions in this release, but
>> there are a few useful higher-order functions in progress, like
>> `array_except`, `transform`, etc. It would be great if we can get them in
>> Spark 2.4.
>> >>
>> >> SPARK-14220: Build and test Spark against Scala 2.12
>> >> Very close to finishing, great to have it in Spark 2.4.
>> >>
>> >> SPARK-4502: Spark SQL reads unnecessary nested fields from Parquet
>> >> This one is there for years (thanks for your patience Michael!), and
>> is also close to finishing. Great to have it in 2.4.
>> >>
>> >> SPARK-24882: data source v2 API improvement
>> >> This is to improve the data source v2 API based on what we learned
>> during this release. From the migration of existing sources and design of
>> new features, we found some problems in the API and want to address them. I
>> believe this should be
>> >> the last significant API change to data source v2, so great to have in
>> Spark 2.4. I'll send a discuss email about it later.
>> >>
>> >> SPARK-24252: Add catalog support in Data Source V2
>> >> This is a very important feature for data source v2, and is currently
>> being discussed in the dev list.
>> >>
>> >> SPARK-24768: Have a built-in AVRO data source implementation
>> >> Most of it is done, but date/timestamp support is still missing. Great
>> to have in 2.4.
>> >>
>> >> SPARK-23243: Shuffle+Repartition on an RDD could lead to incorrect
>> answers
>> >> This is a long-standing correctness bug, great to have in 2.4.
>> >>
>> >> There are some other important features like the adaptive execution,
>> streaming SQL, etc., not in the list, since I think we are not able to
>> finish them before 2.4.
>> >>
>> >> Feel free to add more things if you think they are important to Spark
>> 2.4 by replying to this email.
>> >>
>> >> Thanks,
>> >> Wenchen
>> >>
>> >> On Mon, Jul 30, 2018 at 11:00 PM Sean Owen  wrote:
>> >>
>> >>   In theory releases happen on a time-based cadence, so it's pretty
>> much wrap up what's ready by the code freeze and ship it. In practice, the
>> cadence slips frequently, and it's very much a negotiation about what
>> features should push the
>> >>   code freeze out a few weeks every time. So, kind of a hybrid
>> approach here that works OK.
>> >>
>> >>   Certainly speak up if you think there's something that really needs
>> to get into 2.4. This is that discuss thread.
>> >>
>> >>   (BTW I updated the page you mention just yesterday, to reflect the
>> plan suggested in this thread.)
>> >>
>> >>   On Mon, Jul 30, 2018 at 9:51 AM Tom Graves
>>  wrote:
>> >>
>> >>   Shouldn't this be a discuss thread?
>> >>
>> >>   I'm also happy to see more release managers and agree the time is
>> getting close, but we should see what features are in progress and see how
>> close things are and propose a date based on that.  Cutting a branch to
>> soon just creates
>> >>   more work for committers to push to more branches.
>> >>
>> >>http://spark.apache.org/versioning-policy.html mentioned the code
>> freeze and release branch cut mid-august.
>> >>
>> >>   Tom
>> >
>> > 

Toward an "API" for spark images used by the Kubernetes back-end

2018-03-21 Thread Erik Erlandson
During the review of the recent PR to remove use of the init_container from
kube pods as created by the Kubernetes back-end, the topic of documenting
the "API" for these container images also came up. What information does
the back-end provide to these containers? In what form? What assumptions
does the back-end make about the structure of these containers?  This
information is important in a scenario where a user wants to create custom
images, particularly if these are not based on the reference dockerfiles.

A related topic is deciding what such an API should look like.  For
example, early incarnations were based more purely on environment
variables, which could have advantages in terms of an API that is easy to
describe in a document.  If we document the current API, should we annotate
it as Experimental?  If not, does that effectively freeze the API?

We are interested in community input about possible customization use cases
and opinions on possible API designs!
Cheers,
Erik


Publishing container images for Apache Spark

2018-01-11 Thread Erik Erlandson
Dear ASF Legal Affairs Committee,

The Apache Spark development community has begun some discussions about
publishing container images for Spark as part of its release
process.  These discussions were spurred by the upstream adoption of a new
Kubernetes scheduling back-end, which by nature operates via container
images running Spark inside a Kubernetes cluster.

The current state of thinking on this topic is influenced by the LEGAL-270
Jira, which can be summarized as:
* A container image has the same legal status as other derived distributions
* As such, it is legally sound to publish a container image as long as that
image corresponds to an official project release
* An image that is regularly built from non-release code (e.g. a
'spark:latest' image built from the head of master branch) would not be
legally approved
* The image should not contain any code or binaries that carry GPL
licenses, or other licenses considered incompatible with ASF.

We are reaching out to you to get your additional input on what
requirements the community should meet to engineer Apache Spark container
images that meet ASF legal guidelines.

The original dev@spark thread is here:
http://apache-spark-developers-list.1001551.n3.nabble.com/Publishing-official-docker-images-for-KubernetesSchedulerBackend-td22928.html

LEGAL-270:
https://issues.apache.org/jira/browse/LEGAL-270


Fwd: Publishing official docker images for KubernetesSchedulerBackend

2017-12-19 Thread Erik Erlandson
Here are some specific questions I'd recommend for the Apache Spark PMC to
bring to ASF legal counsel:

1) Does the philosophy described on LEGAL-270 still represent a sanctioned
approach to publishing releases via container image?
2) If the transitive closure of pulled-in licenses on each of these images
is limited to licenses that are defined as compatible with Apache-2
<https://www.apache.org/legal/resolved.html>, does that satisfy ASF
licensing and legal guidelines?
3) What form of documentation/auditing for (2) should be provided to meet
legal requirements?

I would define the proposed action this way: to include, as part of the
Apache Spark official release process, publishing a "spark-base" image,
tagged with the specific release, that consists of a build of the spark
code for that release installed on a base image (currently alpine, but
possibly some other alternative like centos), combined with the jvm and
python (and any of their transitive deps).  Additionally, some number of
images derived from "spark-base" would be built, which consist of
spark-base plus a small layer of bash scripting for ENTRYPOINT and CMD, to
support the kubernetes back-end.  Optionally, similar images targeted for
mesos or yarn might also be created.


On Tue, Dec 19, 2017 at 1:28 PM, Mark Hamstra <m...@clearstorydata.com>
wrote:

> Reasoning by analogy to other Apache projects is generally not sufficient
> when it comes to securing legally permissible form or behavior -- that
> another project is doing something is not a guarantee that they are doing
> it right. If we have issues or legal questions, we need to formulate them
> and our proposed actions as clearly and concretely as possible so that the
> PMC can take those issues, questions and proposed actions to Apache counsel
> for advice or guidance.
>
> On Tue, Dec 19, 2017 at 10:34 AM, Erik Erlandson <eerla...@redhat.com>
> wrote:
>
>> I've been looking a bit more into ASF legal posture on licensing and
>> container images. What I have found indicates that ASF considers container
>> images to be just another variety of distribution channel.  As such, it is
>> acceptable to publish official releases; for example an image such as
>> spark:v2.3.0 built from the v2.3.0 source is fine.  It is not acceptable to
>> do something like regularly publish spark:latest built from the head of
>> master.
>>
>> More detail here:
>> https://issues.apache.org/jira/browse/LEGAL-270
>>
>> So as I understand it, making a release-tagged public image as part of
>> each official release does not pose any problems.
>>
>> With respect to considering the licenses of other ancillary dependencies
>> that are also installed on such container images, I noticed this clause in
>> the legal boilerplate for the Flink images
>> <https://hub.docker.com/r/library/flink/>:
>>
>> As with all Docker images, these likely also contain other software which
>>> may be under other licenses (such as Bash, etc from the base distribution,
>>> along with any direct or indirect dependencies of the primary software
>>> being contained).
>>>
>>
>> So it may be sufficient to resolve this via disclaimer.
>>
>> -Erik
>>
>> On Thu, Dec 14, 2017 at 7:55 PM, Erik Erlandson <eerla...@redhat.com>
>> wrote:
>>
>>> Currently the containers are based off alpine, which pulls in BSD2 and
>>> MIT licensing:
>>> https://github.com/apache/spark/pull/19717#discussion_r154502824
>>>
>>> to the best of my understanding, neither of those poses a problem.  If
>>> we based the image off of centos I'd also expect the licensing of any image
>>> deps to be compatible.
>>>
>>> On Thu, Dec 14, 2017 at 7:19 PM, Mark Hamstra <m...@clearstorydata.com>
>>> wrote:
>>>
>>>> What licensing issues come into play?
>>>>
>>>> On Thu, Dec 14, 2017 at 4:00 PM, Erik Erlandson <eerla...@redhat.com>
>>>> wrote:
>>>>
>>>>> We've been discussing the topic of container images a bit more.  The
>>>>> kubernetes back-end operates by executing some specific CMD and ENTRYPOINT
>>>>> logic, which is different than mesos, and which is probably not practical
>>>>> to unify at this level.
>>>>>
>>>>> However: These CMD and ENTRYPOINT configurations are essentially just
>>>>> a thin skin on top of an image which is just an install of a spark distro.
>>>>> We feel that a single "spark-base" image should be publishable, that is
>>>>> consumable by kube-spark images, and mesos-spa

Re: Publishing official docker images for KubernetesSchedulerBackend

2017-12-19 Thread Erik Erlandson
Agreed that the GPL family would be "toxic."

The current images have been at least informally confirmed to use licenses
that are ASF compatible.  Is there an officially sanctioned method of
license auditing that can be applied here?

On Tue, Dec 19, 2017 at 11:45 AM, Sean Owen <so...@cloudera.com> wrote:

> I think that's all correct, though the license of third party dependencies
> is actually a difficult and sticky part. The ASF couldn't make a software
> release including any GPL software for example, and it's not just a matter
> of adding a disclaimer. Any actual bits distributed by the PMC would have
> to follow all the license rules.
>
> On Tue, Dec 19, 2017 at 12:34 PM Erik Erlandson <eerla...@redhat.com>
> wrote:
>
>> I've been looking a bit more into ASF legal posture on licensing and
>> container images. What I have found indicates that ASF considers container
>> images to be just another variety of distribution channel.  As such, it is
>> acceptable to publish official releases; for example an image such as
>> spark:v2.3.0 built from the v2.3.0 source is fine.  It is not acceptable to
>> do something like regularly publish spark:latest built from the head of
>> master.
>>
>> More detail here:
>> https://issues.apache.org/jira/browse/LEGAL-270
>>
>> So as I understand it, making a release-tagged public image as part of
>> each official release does not pose any problems.
>>
>> With respect to considering the licenses of other ancillary dependencies
>> that are also installed on such container images, I noticed this clause in
>> the legal boilerplate for the Flink images
>> <https://hub.docker.com/r/library/flink/>:
>>
>> As with all Docker images, these likely also contain other software which
>>> may be under other licenses (such as Bash, etc from the base distribution,
>>> along with any direct or indirect dependencies of the primary software
>>> being contained).
>>>
>>
>> So it may be sufficient to resolve this via disclaimer.
>>
>> -Erik
>>
>> On Thu, Dec 14, 2017 at 7:55 PM, Erik Erlandson <eerla...@redhat.com>
>> wrote:
>>
>>> Currently the containers are based off alpine, which pulls in BSD2 and
>>> MIT licensing:
>>> https://github.com/apache/spark/pull/19717#discussion_r154502824
>>>
>>> to the best of my understanding, neither of those poses a problem.  If
>>> we based the image off of centos I'd also expect the licensing of any image
>>> deps to be compatible.
>>>
>>> On Thu, Dec 14, 2017 at 7:19 PM, Mark Hamstra <m...@clearstorydata.com>
>>> wrote:
>>>
>>>> What licensing issues come into play?
>>>>
>>>> On Thu, Dec 14, 2017 at 4:00 PM, Erik Erlandson <eerla...@redhat.com>
>>>> wrote:
>>>>
>>>>> We've been discussing the topic of container images a bit more.  The
>>>>> kubernetes back-end operates by executing some specific CMD and ENTRYPOINT
>>>>> logic, which is different than mesos, and which is probably not practical
>>>>> to unify at this level.
>>>>>
>>>>> However: These CMD and ENTRYPOINT configurations are essentially just
>>>>> a thin skin on top of an image which is just an install of a spark distro.
>>>>> We feel that a single "spark-base" image should be publishable, that is
>>>>> consumable by kube-spark images, and mesos-spark images, and likely any
>>>>> other community image whose primary purpose is running spark components.
>>>>> The kube-specific dockerfiles would be written "FROM spark-base" and just
>>>>> add the small command and entrypoint layers.  Likewise, the mesos images
>>>>> could add any specialization layers that are necessary on top of the
>>>>> "spark-base" image.
>>>>>
>>>>> Does this factorization sound reasonable to others?
>>>>> Cheers,
>>>>> Erik
>>>>>
>>>>>
>>>>> On Wed, Nov 29, 2017 at 10:04 AM, Mridul Muralidharan <
>>>>> mri...@gmail.com> wrote:
>>>>>
>>>>>> We do support running on Apache Mesos via docker images - so this
>>>>>> would not be restricted to k8s.
>>>>>> But unlike mesos support, which has other modes of running, I believe
>>>>>> k8s support more heavily depends on availability of docker images.
>>>>>>
>>>>>>
>>&

Re: Publishing official docker images for KubernetesSchedulerBackend

2017-12-19 Thread Erik Erlandson
I've been looking a bit more into ASF legal posture on licensing and
container images. What I have found indicates that ASF considers container
images to be just another variety of distribution channel.  As such, it is
acceptable to publish official releases; for example an image such as
spark:v2.3.0 built from the v2.3.0 source is fine.  It is not acceptable to
do something like regularly publish spark:latest built from the head of
master.

More detail here:
https://issues.apache.org/jira/browse/LEGAL-270

So as I understand it, making a release-tagged public image as part of each
official release does not pose any problems.

With respect to considering the licenses of other ancillary dependencies
that are also installed on such container images, I noticed this clause in
the legal boilerplate for the Flink images
<https://hub.docker.com/r/library/flink/>:

As with all Docker images, these likely also contain other software which
> may be under other licenses (such as Bash, etc from the base distribution,
> along with any direct or indirect dependencies of the primary software
> being contained).
>

So it may be sufficient to resolve this via disclaimer.

-Erik

On Thu, Dec 14, 2017 at 7:55 PM, Erik Erlandson <eerla...@redhat.com> wrote:

> Currently the containers are based off alpine, which pulls in BSD2 and MIT
> licensing:
> https://github.com/apache/spark/pull/19717#discussion_r154502824
>
> to the best of my understanding, neither of those poses a problem.  If we
> based the image off of centos I'd also expect the licensing of any image
> deps to be compatible.
>
> On Thu, Dec 14, 2017 at 7:19 PM, Mark Hamstra <m...@clearstorydata.com>
> wrote:
>
>> What licensing issues come into play?
>>
>> On Thu, Dec 14, 2017 at 4:00 PM, Erik Erlandson <eerla...@redhat.com>
>> wrote:
>>
>>> We've been discussing the topic of container images a bit more.  The
>>> kubernetes back-end operates by executing some specific CMD and ENTRYPOINT
>>> logic, which is different than mesos, and which is probably not practical
>>> to unify at this level.
>>>
>>> However: These CMD and ENTRYPOINT configurations are essentially just a
>>> thin skin on top of an image which is just an install of a spark distro.
>>> We feel that a single "spark-base" image should be publishable, that is
>>> consumable by kube-spark images, and mesos-spark images, and likely any
>>> other community image whose primary purpose is running spark components.
>>> The kube-specific dockerfiles would be written "FROM spark-base" and just
>>> add the small command and entrypoint layers.  Likewise, the mesos images
>>> could add any specialization layers that are necessary on top of the
>>> "spark-base" image.
>>>
>>> Does this factorization sound reasonable to others?
>>> Cheers,
>>> Erik
>>>
>>>
>>> On Wed, Nov 29, 2017 at 10:04 AM, Mridul Muralidharan <mri...@gmail.com>
>>> wrote:
>>>
>>>> We do support running on Apache Mesos via docker images - so this
>>>> would not be restricted to k8s.
>>>> But unlike mesos support, which has other modes of running, I believe
>>>> k8s support more heavily depends on availability of docker images.
>>>>
>>>>
>>>> Regards,
>>>> Mridul
>>>>
>>>>
>>>> On Wed, Nov 29, 2017 at 8:56 AM, Sean Owen <so...@cloudera.com> wrote:
>>>> > Would it be logical to provide Docker-based distributions of other
>>>> pieces of
>>>> > Spark? or is this specific to K8S?
>>>> > The problem is we wouldn't generally also provide a distribution of
>>>> Spark
>>>> > for the reasons you give, because if that, then why not RPMs and so
>>>> on.
>>>> >
>>>> > On Wed, Nov 29, 2017 at 10:41 AM Anirudh Ramanathan <
>>>> ramanath...@google.com>
>>>> > wrote:
>>>> >>
>>>> >> In this context, I think the docker images are similar to the
>>>> binaries
>>>> >> rather than an extension.
>>>> >> It's packaging the compiled distribution to save people the effort of
>>>> >> building one themselves, akin to binaries or the python package.
>>>> >>
>>>> >> For reference, this is the base dockerfile for the main image that we
>>>> >> intend to publish. It's not particularly complicated.
>>>> >> The driver and executor images are based on said base image and only
>&

Re: Publishing official docker images for KubernetesSchedulerBackend

2017-12-14 Thread Erik Erlandson
Currently the containers are based off alpine, which pulls in BSD2 and MIT
licensing:
https://github.com/apache/spark/pull/19717#discussion_r154502824

to the best of my understanding, neither of those poses a problem.  If we
based the image off of centos I'd also expect the licensing of any image
deps to be compatible.

On Thu, Dec 14, 2017 at 7:19 PM, Mark Hamstra <m...@clearstorydata.com>
wrote:

> What licensing issues come into play?
>
> On Thu, Dec 14, 2017 at 4:00 PM, Erik Erlandson <eerla...@redhat.com>
> wrote:
>
>> We've been discussing the topic of container images a bit more.  The
>> kubernetes back-end operates by executing some specific CMD and ENTRYPOINT
>> logic, which is different than mesos, and which is probably not practical
>> to unify at this level.
>>
>> However: These CMD and ENTRYPOINT configurations are essentially just a
>> thin skin on top of an image which is just an install of a spark distro.
>> We feel that a single "spark-base" image should be publishable, that is
>> consumable by kube-spark images, and mesos-spark images, and likely any
>> other community image whose primary purpose is running spark components.
>> The kube-specific dockerfiles would be written "FROM spark-base" and just
>> add the small command and entrypoint layers.  Likewise, the mesos images
>> could add any specialization layers that are necessary on top of the
>> "spark-base" image.
>>
>> Does this factorization sound reasonable to others?
>> Cheers,
>> Erik
>>
>>
>> On Wed, Nov 29, 2017 at 10:04 AM, Mridul Muralidharan <mri...@gmail.com>
>> wrote:
>>
>>> We do support running on Apache Mesos via docker images - so this
>>> would not be restricted to k8s.
>>> But unlike mesos support, which has other modes of running, I believe
>>> k8s support more heavily depends on availability of docker images.
>>>
>>>
>>> Regards,
>>> Mridul
>>>
>>>
>>> On Wed, Nov 29, 2017 at 8:56 AM, Sean Owen <so...@cloudera.com> wrote:
>>> > Would it be logical to provide Docker-based distributions of other
>>> pieces of
>>> > Spark? or is this specific to K8S?
>>> > The problem is we wouldn't generally also provide a distribution of
>>> Spark
>>> > for the reasons you give, because if that, then why not RPMs and so on.
>>> >
>>> > On Wed, Nov 29, 2017 at 10:41 AM Anirudh Ramanathan <
>>> ramanath...@google.com>
>>> > wrote:
>>> >>
>>> >> In this context, I think the docker images are similar to the binaries
>>> >> rather than an extension.
>>> >> It's packaging the compiled distribution to save people the effort of
>>> >> building one themselves, akin to binaries or the python package.
>>> >>
>>> >> For reference, this is the base dockerfile for the main image that we
>>> >> intend to publish. It's not particularly complicated.
>>> >> The driver and executor images are based on said base image and only
>>> >> customize the CMD (any file/directory inclusions are extraneous and
>>> will be
>>> >> removed).
>>> >>
>>> >> Is there only one way to build it? That's a bit harder to reason
>>> about.
>>> >> The base image I'd argue is likely going to always be built that way.
>>> The
>>> >> driver and executor images, there may be cases where people want to
>>> >> customize it - (like putting all dependencies into it for example).
>>> >> In those cases, as long as our images are bare bones, they can use the
>>> >> spark-driver/spark-executor images we publish as the base, and build
>>> their
>>> >> customization as a layer on top of it.
>>> >>
>>> >> I think the composability of docker images, makes this a bit different
>>> >> from say - debian packages.
>>> >> We can publish canonical images that serve as both - a complete image
>>> for
>>> >> most Spark applications, as well as a stable substrate to build
>>> >> customization upon.
>>> >>
>>> >> On Wed, Nov 29, 2017 at 7:38 AM, Mark Hamstra <
>>> m...@clearstorydata.com>
>>> >> wrote:
>>> >>>
>>> >>> It's probably also worth considering whether there is only one,
>>> >>> well-defined, correct way to create such an image or whether this is

Re: Publishing official docker images for KubernetesSchedulerBackend

2017-12-14 Thread Erik Erlandson
We've been discussing the topic of container images a bit more.  The
kubernetes back-end operates by executing some specific CMD and ENTRYPOINT
logic, which is different than mesos, and which is probably not practical
to unify at this level.

However: These CMD and ENTRYPOINT configurations are essentially just a
thin skin on top of an image which is just an install of a spark distro.
We feel that a single "spark-base" image should be publishable, that is
consumable by kube-spark images, and mesos-spark images, and likely any
other community image whose primary purpose is running spark components.
The kube-specific dockerfiles would be written "FROM spark-base" and just
add the small command and entrypoint layers.  Likewise, the mesos images
could add any specialization layers that are necessary on top of the
"spark-base" image.

Does this factorization sound reasonable to others?
Cheers,
Erik


On Wed, Nov 29, 2017 at 10:04 AM, Mridul Muralidharan 
wrote:

> We do support running on Apache Mesos via docker images - so this
> would not be restricted to k8s.
> But unlike mesos support, which has other modes of running, I believe
> k8s support more heavily depends on availability of docker images.
>
>
> Regards,
> Mridul
>
>
> On Wed, Nov 29, 2017 at 8:56 AM, Sean Owen  wrote:
> > Would it be logical to provide Docker-based distributions of other
> pieces of
> > Spark? or is this specific to K8S?
> > The problem is we wouldn't generally also provide a distribution of Spark
> > for the reasons you give, because if that, then why not RPMs and so on.
> >
> > On Wed, Nov 29, 2017 at 10:41 AM Anirudh Ramanathan <
> ramanath...@google.com>
> > wrote:
> >>
> >> In this context, I think the docker images are similar to the binaries
> >> rather than an extension.
> >> It's packaging the compiled distribution to save people the effort of
> >> building one themselves, akin to binaries or the python package.
> >>
> >> For reference, this is the base dockerfile for the main image that we
> >> intend to publish. It's not particularly complicated.
> >> The driver and executor images are based on said base image and only
> >> customize the CMD (any file/directory inclusions are extraneous and
> will be
> >> removed).
> >>
> >> Is there only one way to build it? That's a bit harder to reason about.
> >> The base image I'd argue is likely going to always be built that way.
> The
> >> driver and executor images, there may be cases where people want to
> >> customize it - (like putting all dependencies into it for example).
> >> In those cases, as long as our images are bare bones, they can use the
> >> spark-driver/spark-executor images we publish as the base, and build
> their
> >> customization as a layer on top of it.
> >>
> >> I think the composability of docker images, makes this a bit different
> >> from say - debian packages.
> >> We can publish canonical images that serve as both - a complete image
> for
> >> most Spark applications, as well as a stable substrate to build
> >> customization upon.
> >>
> >> On Wed, Nov 29, 2017 at 7:38 AM, Mark Hamstra 
> >> wrote:
> >>>
> >>> It's probably also worth considering whether there is only one,
> >>> well-defined, correct way to create such an image or whether this is a
> >>> reasonable avenue for customization. Part of why we don't do something
> like
> >>> maintain and publish canonical Debian packages for Spark is because
> >>> different organizations doing packaging and distribution of
> infrastructures
> >>> or operating systems can reasonably want to do this in a custom (or
> >>> non-customary) way. If there is really only one reasonable way to do a
> >>> docker image, then my bias starts to tend more toward the Spark PMC
> taking
> >>> on the responsibility to maintain and publish that image. If there is
> more
> >>> than one way to do it and publishing a particular image is more just a
> >>> convenience, then my bias tends more away from maintaining and publish
> it.
> >>>
> >>> On Wed, Nov 29, 2017 at 5:14 AM, Sean Owen  wrote:
> 
>  Source code is the primary release; compiled binary releases are
>  conveniences that are also released. A docker image sounds fairly
> different
>  though. To the extent it's the standard delivery mechanism for some
> artifact
>  (think: pyspark on PyPI as well) that makes sense, but is that the
>  situation? if it's more of an extension or alternate presentation of
> Spark
>  components, that typically wouldn't be part of a Spark release. The
> ones the
>  PMC takes responsibility for maintaining ought to be the core,
> critical
>  means of distribution alone.
> 
>  On Wed, Nov 29, 2017 at 2:52 AM Anirudh Ramanathan
>   wrote:
> >
> > Hi all,
> >
> > We're all working towards the Kubernetes scheduler backend (full
> steam
> > ahead!) that's targeted 

Re: Timeline for Spark 2.3

2017-12-14 Thread Erik Erlandson
I wanted to check in on the state of the 2.3 freeze schedule.  Original
proposal was "late Dec", which is a bit open to interpretation.

We are working to get some refactoring done on the integration testing for
the Kubernetes back-end in preparation for testing upcoming release
candidates, however holiday vacation time is about to begin taking its toll
both on upstream reviewing and on the "downstream" spark-on-kube fork.

If the freeze is pushed into January, that would take some of the pressure off
the kube back-end upstreaming. Regardless, I was wondering if the
dates could be clarified.
Cheers,
Erik


On Mon, Nov 13, 2017 at 5:13 PM, dji...@dataxu.com 
wrote:

> Hi,
>
> What is the process to request an issue/fix to be included in the next
> release? Is there a place to vote for features?
> I am interested in https://issues.apache.org/jira/browse/SPARK-13127, to
> see
> if we can get Spark upgrade parquet to 1.9.0, which addresses the
> https://issues.apache.org/jira/browse/PARQUET-686.
> Can we include the fix in Spark 2.3 release?
>
> Thanks,
>
> Dong
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Timeline for Spark 2.3

2017-11-09 Thread Erik Erlandson
+1 on extending the deadline. It will significantly improve the logistics
for upstreaming the Kubernetes back-end.  Also agreed on the general
realities of reduced bandwidth over the Nov-Dec holiday season.
Erik

On Thu, Nov 9, 2017 at 6:03 PM, Matei Zaharia 
wrote:

> I’m also +1 on extending this to get Kubernetes and other features in.
>
> Matei
>
> > On Nov 9, 2017, at 4:04 PM, Anirudh Ramanathan 
> wrote:
> >
> > This would help the community on the Kubernetes effort quite a bit -
> giving us additional time for reviews and testing for the 2.3 release.
> >
> > On Thu, Nov 9, 2017 at 3:56 PM, Justin Miller <
> justin.mil...@protectwise.com> wrote:
> > That sounds fine to me. I’m hoping that this ticket can make it into
> Spark 2.3: https://issues.apache.org/jira/browse/SPARK-18016
> >
> > It’s causing some pretty considerable problems when we alter the columns
> to be nullable, but we are OK for now without that.
> >
> > Best,
> > Justin
> >
> >> On Nov 9, 2017, at 4:54 PM, Michael Armbrust 
> wrote:
> >>
> >> According to the timeline posted on the website, we are nearing branch
> cut for Spark 2.3.  I'd like to propose pushing this out towards mid to
> late December for a couple of reasons and would like to hear what people
> think.
> >>
> >> 1. I've done release management during the Thanksgiving / Christmas
> time before and in my experience, we don't actually get a lot of testing
> during this time due to vacations and other commitments. I think beginning
> the RC process in early January would give us the best coverage in the
> shortest amount of time.
> >> 2. There are several large initiatives in progress that given a little
> more time would leave us with a much more exciting 2.3 release.
> Specifically, the work on the history server, Kubernetes and continuous
> processing.
> >> 3. Given the actual release date of Spark 2.2, I think we'll still get
> Spark 2.3 out roughly 6 months after.
> >>
> >> Thoughts?
> >>
> >> Michael
> >
> >
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Announcing Spark on Kubernetes release 0.4.0

2017-09-25 Thread Erik Erlandson
The Spark on Kubernetes development community is pleased to announce
release 0.4.0 of Apache Spark with native Kubernetes scheduler back-end!

The dev community is planning to use this release as the reference for
upstreaming native kubernetes capability over the Spark 2.3 release cycle.

This release includes a variety of bug fixes and code improvements, as well
as the following new features:

   - HDFS rack locality support
   - Mount small files using secrets, without running the resource staging
   server
   - Java options exposed to executor pods
   - User specified secrets injection for driver and executor pods
   - Unit testing for the Kubernetes scheduler backend
   - Standardized docker image build scripting
   - Reference YAML for RBAC configurations

The full release notes are available here:
https://github.com/apache-spark-on-k8s/spark/releases/tag/v2.2.0-kubernetes-0.4.0

Community resources for Spark on Kubernetes are available at:

   - Slack: https://kubernetes.slack.com
   - User Docs: https://apache-spark-on-k8s.github.io/userdocs/
   - GitHub: https://github.com/apache-spark-on-k8s/spark


Re: SPIP: Spark on Kubernetes

2017-09-02 Thread Erik Erlandson
We have started discussions about upstreaming and merge strategy at our
weekly meetings. The associated github issue is:
https://github.com/apache-spark-on-k8s/spark/issues/441

There is general consensus that breaking it up into smaller components will
be important for upstream review. Our current focus is on identifying a way
to factor it so that it results in a minimal amount of "artificial"
deconstruction of existing work (for example, un-doing features purely to
present smaller initial PRs, but having to re-add them later).


On Fri, Sep 1, 2017 at 10:27 PM, Reynold Xin <r...@databricks.com> wrote:

> Anirudh (or somebody else familiar with spark-on-k8s),
>
> Can you create a short plan on how we would integrate and do code review
> to merge the project? If the diff is too large it'd be difficult to review
> and merge in one shot. Once we have a plan we can create subtickets to
> track the progress.
>
>
>
> On Thu, Aug 31, 2017 at 5:21 PM, Anirudh Ramanathan <
> ramanath...@google.com> wrote:
>
>> The proposal is in the process of being updated to include the details on
>> testing that we have, that Imran pointed out.
>> Please expect an update on the SPARK-18278
>> <https://issues.apache.org/jira/browse/SPARK-18278>.
>>
>> Mridul had a couple of points as well, about exposing an SPI and we've
>> been exploring that, to ascertain the effort involved.
>> That effort is separate, fairly long-term and we should have a working
>> group of representatives from all cluster managers to make progress on it.
>> A proposal regarding this will be in SPARK-19700
>> <https://issues.apache.org/jira/browse/SPARK-19700>.
>>
>> This vote has passed.
>> So far, there have been 4 binding +1 votes, ~25 non-binding votes, and no
>> -1 votes.
>>
>> Thanks all!
>>
>> +1 votes (binding):
>> Reynold Xin
>> Matei Zahari
>> Marcelo Vanzin
>> Mark Hamstra
>>
>> +1 votes (non-binding):
>> Anirudh Ramanathan
>> Erik Erlandson
>> Ilan Filonenko
>> Sean Suchter
>> Kimoon Kim
>> Timothy Chen
>> Will Benton
>> Holden Karau
>> Seshu Adunuthula
>> Daniel Imberman
>> Shubham Chopra
>> Jiri Kremser
>> Yinan Li
>> Andrew Ash
>> 李书明
>> Gary Lucas
>> Ismael Mejia
>> Jean-Baptiste Onofré
>> Alexander Bezzubov
>> duyanghao
>> elmiko
>> Sudarshan Kadambi
>> Varun Katta
>> Matt Cheah
>> Edward Zhang
>> Vaquar Khan
>>
>>
>>
>>
>>
>> On Wed, Aug 30, 2017 at 10:42 PM, Reynold Xin <r...@databricks.com>
>> wrote:
>>
>>> This has passed, hasn't it?
>>>
>>> On Tue, Aug 15, 2017 at 5:33 PM Anirudh Ramanathan <fox...@google.com>
>>> wrote:
>>>
>>>> Spark on Kubernetes effort has been developed separately in a fork, and
>>>> linked back from the Apache Spark project as an experimental backend
>>>> <http://spark.apache.org/docs/latest/cluster-overview.html#cluster-manager-types>.
>>>> We're ~6 months in, have had 5 releases
>>>> <https://github.com/apache-spark-on-k8s/spark/releases>.
>>>>
>>>>- 2 Spark versions maintained (2.1, and 2.2)
>>>>- Extensive integration testing and refactoring efforts to maintain
>>>>code quality
>>>>- Developer
>>>><https://github.com/apache-spark-on-k8s/spark#getting-started> and
>>>>    user-facing <https://apache-spark-on-k8s.github.io/userdocs/> documentation
>>>>- 10+ consistent code contributors from different organizations
>>>>
>>>> <https://apache-spark-on-k8s.github.io/userdocs/contribute.html#project-contributions>
>>>>  involved
>>>>in actively maintaining and using the project, with several more members
>>>>involved in testing and providing feedback.
>>>>- The community has delivered several talks on Spark-on-Kubernetes
>>>>generating lots of feedback from users.
>>>>- In addition to these, we've seen efforts spawn off such as:
>>>>- HDFS on Kubernetes
>>>>   <https://github.com/apache-spark-on-k8s/kubernetes-HDFS> with
>>>>   Locality and Performance Experiments
>>>>   - Kerberized access
>>>>   
>>>> <https://docs.google.com/document/d/1RBnXD9jMDjGonOdKJ2bA1lN4AAV_1RwpU_ewFuCNWKg/edit>
>>>>  to
>>>>   HDFS from Spark running on Kubernetes
>>>>
>

Re: SPIP: Spark on Kubernetes

2017-08-28 Thread Erik Erlandson
>> On Mon, Aug 21, 2017 at 10:17 AM, Imran Rashid <iras...@cloudera.com>
>> wrote:
>>
>>> Overall this looks like a good proposal.  I do have some concerns which
>>> I'd like to discuss -- please understand I'm taking a "devil's advocate"
>>> stance here for discussion, not that I'm giving a -1.
>>>
>>> My primary concern is about testing and maintenance.  My concerns might
>>> be addressed if the doc included a section on testing that might just be
>>> this: https://github.com/apache-spark-on-k8s/spark/blob/branch-2.2
>>> -kubernetes/resource-managers/kubernetes/README.md#running-
>>> the-kubernetes-integration-tests
>>>
>>> but without the concerning warning "Note that the integration test
>>> framework is currently being heavily revised and is subject to change".
>>> I'd like the proposal to clearly indicate that some baseline testing can be
>>> done by devs and in spark's regular jenkins builds without special access
>>> to kubernetes clusters.
>>>
>>> Its worth noting that there *are* advantages to keeping it outside Spark:
>>> * when making changes to spark's scheduler, we do *not* have to worry
>>> about how those changes impact kubernetes.  This simplifies things for
>>> those making changes to spark
>>> * making changes changes to the kubernetes integration is not blocked by
>>> getting enough attention from spark's committers
>>>
>>> or in other words, each community of experts can maintain its focus.  I
>>> have these concerns based on past experience with the mesos integration --
>>> mesos contributors are blocked on committers reviewing their changes, and
>>> then committers have no idea how to test that the changes are correct, and
>>> find it hard to even learn the ins and outs of that code without access to
>>> a mesos cluster.
>>>
>>> The same could be said for the yarn integration, but I think its helped
>>> that (a) spark-on-yarn *does* have local tests for testing basic
>>> integration and (b) there is a sufficient community of contributors and
>>> committers for spark-on-yarn.   I realize (b) is a chicken-and-egg problem,
>>> but I'd like to be sure that at least (a) is addressed.  (and maybe even
>>> spark-on-yarn shouldln't be inside spark itself, as mridul said, but its
>>> not clear what the other home should be.)
>>>
>>> At some point, this is just a judgement call, of the value it brings to
>>> the spark community vs the added complexity.  I'm willing to believe that
>>> kubernetes will bring enough value to make this worthwhile, just voicing my
>>> concerns.
>>>
>>> Secondary concern:
>>> the RSS doesn't seem necessary for kubernetes support, or specific to
>>> it.  If its nice to have, and you want to add it to kubernetes first before
>>> other cluster managers, fine, but seems separate from this proposal.
>>>
>>>
>>>
>>> On Tue, Aug 15, 2017 at 10:32 AM, Anirudh Ramanathan <
>>> fox...@google.com.invalid> wrote:
>>>
>>>> Spark on Kubernetes effort has been developed separately in a fork, and
>>>> linked back from the Apache Spark project as an experimental backend
>>>> <http://spark.apache.org/docs/latest/cluster-overview.html#cluster-manager-types>.
>>>> We're ~6 months in, have had 5 releases
>>>> <https://github.com/apache-spark-on-k8s/spark/releases>.
>>>>
>>>>- 2 Spark versions maintained (2.1, and 2.2)
>>>>- Extensive integration testing and refactoring efforts to maintain
>>>>code quality
>>>>- Developer
>>>><https://github.com/apache-spark-on-k8s/spark#getting-started> and
>>>>    user-facing <https://apache-spark-on-k8s.github.io/userdocs/> documentation
>>>>- 10+ consistent code contributors from different organizations
>>>>
>>>> <https://apache-spark-on-k8s.github.io/userdocs/contribute.html#project-contributions>
>>>>  involved
>>>>in actively maintaining and using the project, with several more members
>>>>involved in testing and providing feedback.
>>>>- The community has delivered several talks on Spark-on-Kubernetes
>>>>generating lots of feedback from users.
>>>>- In addition to these, we've seen efforts spawn off such as:
>>>>- HDFS on Kubernetes
>>>>   <https://github.com/

Re: SPIP: Spark on Kubernetes

2017-08-21 Thread Erik Erlandson
   - Kerberized access
>>   
>> <https://docs.google.com/document/d/1RBnXD9jMDjGonOdKJ2bA1lN4AAV_1RwpU_ewFuCNWKg/edit>
>>  to
>>   HDFS from Spark running on Kubernetes
>>
>> *Following the SPIP process, I'm putting this SPIP up for a vote.*
>>
>>- +1: Yeah, let's go forward and implement the SPIP.
>>- +0: Don't really care.
>>- -1: I don't think this is a good idea because of the following
>>technical reasons.
>>
>> If there is any further clarification desired, on the design or the
>> implementation, please feel free to ask questions or provide feedback.
>>
>>
>> SPIP: Kubernetes as A Native Cluster Manager
>>
>> Full Design Doc: link
>> <https://issues.apache.org/jira/secure/attachment/12881586/SPARK-18278%20Spark%20on%20Kubernetes%20Design%20Proposal%20Revision%202%20%281%29.pdf>
>>
>> JIRA: https://issues.apache.org/jira/browse/SPARK-18278
>>
>> Kubernetes Issue: https://github.com/kubernetes/kubernetes/issues/34377
>>
>> Authors: Yinan Li, Anirudh Ramanathan, Erik Erlandson, Andrew Ash, Matt
>> Cheah,
>>
>> Ilan Filonenko, Sean Suchter, Kimoon Kim
>> Background and Motivation
>>
>> Containerization and cluster management technologies are constantly
>> evolving in the cluster computing world. Apache Spark currently implements
>> support for Apache Hadoop YARN and Apache Mesos, in addition to providing
>> its own standalone cluster manager. In 2014, Google announced development
>> of Kubernetes <https://kubernetes.io/> which has its own unique feature
>> set and differentiates itself from YARN and Mesos. Since its debut, it has
>> seen contributions from over 1300 contributors with over 5 commits.
>> Kubernetes has cemented itself as a core player in the cluster computing
>> world, and cloud-computing providers such as Google Container Engine,
>> Google Compute Engine, Amazon Web Services, and Microsoft Azure support
>> running Kubernetes clusters.
>>
>> This document outlines a proposal for integrating Apache Spark with
>> Kubernetes in a first class way, adding Kubernetes to the list of cluster
>> managers that Spark can be used with. Doing so would allow users to share
>> their computing resources and containerization framework between their
>> existing applications on Kubernetes and their computational Spark
>> applications. Although there is existing support for running a Spark
>> standalone cluster on Kubernetes
>> <https://github.com/kubernetes/examples/blob/master/staging/spark/README.md>,
>> there are still major advantages and significant interest in having native
>> execution support. For example, this integration provides better support
>> for multi-tenancy and dynamic resource allocation. It also allows users to
>> run applications of different Spark versions of their choices in the same
>> cluster.
>>
>> The feature is being developed in a separate fork
>> <https://github.com/apache-spark-on-k8s/spark> in order to minimize risk
>> to the main project during development. Since the start of the development
>> in November of 2016, it has received over 100 commits from over 20
>> contributors and supports two releases based on Spark 2.1 and 2.2
>> respectively. Documentation is also being actively worked on both in the
>> main project repository and also in the repository
>> https://github.com/apache-spark-on-k8s/userdocs. Regarding real-world
>> use cases, we have seen cluster setup that uses 1000+ cores. We are also
>> seeing growing interests on this project from more and more organizations.
>>
>> While it is easy to bootstrap the project in a forked repository, it is
>> hard to maintain it in the long run because of the tricky process of
>> rebasing onto the upstream and lack of awareness in the large Spark
>> community. It would be beneficial to both the Spark and Kubernetes
>> community seeing this feature being merged upstream. On one hand, it gives
>> Spark users the option of running their Spark workloads along with other
>> workloads that may already be running on Kubernetes, enabling better
>> resource sharing and isolation, and better cluster administration. On the
>> other hand, it gives Kubernetes a leap forward in the area of large-scale
>> data processing by being an officially supported cluster manager for Spark.
>> The risk of merging into upstream is low because most of the changes are
>> purely incremental, i.e., new Kubernetes-aware implementations of existing
>> interfaces/classes in Spark core are introduced. The development is also
>>

Re: SPIP: Spark on Kubernetes

2017-08-18 Thread Erik Erlandson
There are a fair number of people (myself included) who have interest in
making scheduler back-ends fully pluggable.  That will represent a
significant impact to core spark architecture, with corresponding risk.
Adding the kubernetes back-end in a manner similar to the other three
back-ends has had a very small impact on spark core, which allowed it to be
developed in parallel and easily stay re-based on successive spark releases
while we were developing it and building up community support.

On Thu, Aug 17, 2017 at 7:14 PM, Mridul Muralidharan <mri...@gmail.com>
wrote:

> While I definitely support the idea of Apache Spark being able to
> leverage kubernetes, IMO it is better for long term evolution of spark
> to expose appropriate SPI such that this support need not necessarily
> live within Apache Spark code base.
> It will allow for multiple backends to evolve, decoupled from spark core.
> In this case, would have made maintaining apache-spark-on-k8s repo
> easier; just as it would allow for supporting other backends -
> opensource (nomad for ex) and proprietary.
>
> In retrospect directly integrating yarn support into spark, while
> mirroring mesos support at that time, was probably an incorrect design
> choice on my part.
>
>
> Regards,
> Mridul
>
> On Tue, Aug 15, 2017 at 8:32 AM, Anirudh Ramanathan
> <fox...@google.com.invalid> wrote:
> > Spark on Kubernetes effort has been developed separately in a fork, and
> > linked back from the Apache Spark project as an experimental backend.
> We're
> > ~6 months in, have had 5 releases.
> >
> > 2 Spark versions maintained (2.1, and 2.2)
> > Extensive integration testing and refactoring efforts to maintain code
> > quality
> > Developer and user-facing documentation
> > 10+ consistent code contributors from different organizations involved in
> > actively maintaining and using the project, with several more members
> > involved in testing and providing feedback.
> > The community has delivered several talks on Spark-on-Kubernetes
> generating
> > lots of feedback from users.
> > In addition to these, we've seen efforts spawn off such as:
> >
> > HDFS on Kubernetes with Locality and Performance Experiments
> > Kerberized access to HDFS from Spark running on Kubernetes
> >
> > Following the SPIP process, I'm putting this SPIP up for a vote.
> >
> > +1: Yeah, let's go forward and implement the SPIP.
> > +0: Don't really care.
> > -1: I don't think this is a good idea because of the following technical
> > reasons.
> >
> > If there is any further clarification desired, on the design or the
> > implementation, please feel free to ask questions or provide feedback.
> >
> >
> > SPIP: Kubernetes as A Native Cluster Manager
> >
> >
> > Full Design Doc: link
> >
> > JIRA: https://issues.apache.org/jira/browse/SPARK-18278
> >
> > Kubernetes Issue: https://github.com/kubernetes/kubernetes/issues/34377
> >
> >
> > Authors: Yinan Li, Anirudh Ramanathan, Erik Erlandson, Andrew Ash, Matt
> > Cheah,
> >
> > Ilan Filonenko, Sean Suchter, Kimoon Kim
> >
> > Background and Motivation
> >
> > Containerization and cluster management technologies are constantly
> evolving
> > in the cluster computing world. Apache Spark currently implements support
> > for Apache Hadoop YARN and Apache Mesos, in addition to providing its own
> > standalone cluster manager. In 2014, Google announced development of
> > Kubernetes which has its own unique feature set and differentiates itself
> > from YARN and Mesos. Since its debut, it has seen contributions from over
> > 1300 contributors with over 5 commits. Kubernetes has cemented
> itself as
> > a core player in the cluster computing world, and cloud-computing
> providers
> > such as Google Container Engine, Google Compute Engine, Amazon Web
> Services,
> > and Microsoft Azure support running Kubernetes clusters.
> >
> >
> > This document outlines a proposal for integrating Apache Spark with
> > Kubernetes in a first class way, adding Kubernetes to the list of cluster
> > managers that Spark can be used with. Doing so would allow users to share
> > their computing resources and containerization framework between their
> > existing applications on Kubernetes and their computational Spark
> > applications. Although there is existing support for running a Spark
> > standalone cluster on Kubernetes, there are still major advantages and
> > significant interest in having native execution support. For example,
> this
> > integration provides better support for 

Re: Questions about the future of UDTs and Encoders

2017-08-16 Thread Erik Erlandson
I've been working on packaging some UDTs as well.  I have them working in
scala and pyspark, although I haven't been able to get them to serialize to
parquet, which puzzles me.

Although it works, I have to define UDTs under the org.apache.spark scope
due to the privatization, which is a bit awkward.
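
To make that concrete, here is a stripped-down sketch of the shape of such a UDT
(a toy 2-D point type with invented names, not my actual t-digest code). The
org.apache.spark package prefix is what gets around the private[spark]
visibility; the exact serialize/deserialize plumbing may vary a bit by Spark
version:

package org.apache.spark.example.udt

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types._

// Illustrative user type; the real use case is a t-digest sketch.
case class Point2D(x: Double, y: Double)

// Minimal UDT skeleton. It compiles only because the file lives under
// org.apache.spark, since UserDefinedType is private[spark] in Spark 2.x.
class Point2DUDT extends UserDefinedType[Point2D] {
  // catalyst-level representation: a struct of two doubles
  override def sqlType: DataType = StructType(Seq(
    StructField("x", DoubleType, nullable = false),
    StructField("y", DoubleType, nullable = false)))

  override def serialize(p: Point2D): Any = InternalRow(p.x, p.y)

  override def deserialize(datum: Any): Point2D = datum match {
    case row: InternalRow => Point2D(row.getDouble(0), row.getDouble(1))
  }

  override def userClass: Class[Point2D] = classOf[Point2D]
}

// The UDT still has to be attached to Point2D (e.g. via the SQLUserDefinedType
// annotation or UDT registration) before Spark will pick it up.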

On Wed, Aug 16, 2017 at 8:55 AM, Katherine Prevost  wrote:

> I'd say the quick summary of the problem is this:
>
> The encoder mechanism does not deal well with fields of case classes (you
> must use builtin types (including other case classes) for case class
> fields), and UDTs are not currently available (and never integrated well
> with built-in operations).
>
> Encoders work great for individual fields if you're using tuples, but once
> you get up over four or five fields this becomes incomprehensible. And, of
> course, encoders do nothing for you once you are in the realm of dataframes
> (including operations on fields, results of dataframe-based methods, and
> working in languages other than Scala.)
>
> The sort of machinations I describe below are unpleasant but not a huge
> deal for people who are trained as developers... but they're a much bigger
> mess when we have to provide these interfaces to our data scientists. Yes,
> they can do it, but the "every address is a string and you have to use
> these functions that parse the strings over and over again" approach is
> easier to use (if massively inefficient).
>
> I would like to improve Spark so that we can provide these types that our
> data scientists need to use *all the time* in a way that's both efficient
> and easy to use.
>
> Hence, my interest in doing work on the UDT and/or Encoder mechanisms of
> Spark (or equivalent, if something new is in the works), and my interest in
> hearing from anybody who is already working in this area, or hearing about
> any future plans that have already been made in this area.
>
>
> In more detail:
>
> On Wed, Aug 16, 2017 at 2:49 AM Jörn Franke  wrote:
>
>> Not sure I got to fully understand the issue (source code is always
>> helpful ;-) but why don't you override the toString method of IPAddress.
>> So, IP address could still be byte , but when it is displayed then toString
>> converts the byteaddress into something human-readable?
>>
>
> There are a couple of reasons it's not that simple. (If you look at the
> sample snippets of code I did include, you'll see that I did define
> toString methods.)
>
> The first problem is basically because toString doesn't happen when
> working with DataFrames, which are often the result of common Spark
> operations in Scala (though staying in the realm of Datasets is getting
> easier, and apparently also becoming more efficient). Outside of Scala,
> it's DataFrames all the way down.
>
> (If you look at my example code, you'll also see what happens when you
> have a DataFrame with a field that is a struct with a byte array in it, and
> nobody ever wants to see "[B@617f4814".)
>
> You can get around that (as long as you're still in a Dataset) with
> something like this (this is using the IPAddress.toString method to produce
> "IPAddress(Array(1,2,3,4))"):
>
> scala> ys.take(20)
> res10: Array[Rec] = Array(Rec(IPAddress(Array(1, 2, 3, 4)),
> IPAddress(Array(5, 6, 7, 8))), Rec(IPAddress(Array(1, 2, 3, 4, 5, 6, 7, 8,
> 9, 10, 11, 12, 13, 14, 15, 16)), IPAddress(Array(17, 18, 19, 20, 21, 22,
> 23, 24, 25, 26, 27, 28, 29, 30, 31, 32
>
> But then of course you lose any easy ability to view Rec fields in
> columns. (And while you could make something that prints Rec as columns,
> what happens once you transform your record and turn it into a tuple?)
>
> The second one is that operating on the fields cleanly is still rather
> painful, even if the values were to be displayed cleanly. This is what you
> have to do to search for rows that have a specific IPAddress value (ys("a")
> is a column of IPAddress, a is an IPAddress):
>
> scala> ys.select(ys("a.bytes") === a.bytes)
> res9: org.apache.spark.sql.DataFrame = [(a.bytes AS `bytes` =
> X'01020304'): boolean]
>
> It's worth noting that an implicit conversion from IPAddress to
> Array[Byte] or to Column wouldn't work here, because === accepts Any.
>
>
> katherine.
>
> > On 15. Aug 2017, at 18:49, Katherine Prevost  wrote:
>> >
>> > Hi, all!
>> >
>> >
>> > I'm a developer who works to support data scientists at CERT. We've
>> > been having some great success working with Spark for data analysis,
>> > and I have some questions about how we could contribute to work on
>> > Spark in support of our goals.
>> >
>> > Specifically, we have some interest in user-defined types, or their
>> > equivalents.
>> >
>> >
>> > When Spark 2 arrived, user-defined types (UDTs) were made private and
>> > seem to have fallen by the wayside in favor of using encoders for
>> > Datasets. I have some questions about the future of these mechanisms,
>> > and was wondering if there's been a plan 

Re: SPIP: Spark on Kubernetes

2017-08-15 Thread Erik Erlandson
Kubernetes has evolved into an important container orchestration platform;
it has a large and growing user base and an active ecosystem.  Users of
Apache Spark who are also deploying applications on Kubernetes (or are
planning to) will have convergence-related motivations for migrating their
Spark applications to Kubernetes as well. It avoids the need for deploying
separate cluster infra for Spark workloads and allows Spark applications to
take full advantage of inhabiting the same orchestration environment as
other applications.  In this respect, native Kubernetes support for Spark
represents a way to optimize uptake and retention of Apache Spark among the
members of the expanding Kubernetes community.

On Tue, Aug 15, 2017 at 8:43 AM, Erik Erlandson <eerla...@redhat.com> wrote:

> +1 (non-binding)
>
>
> On Tue, Aug 15, 2017 at 8:32 AM, Anirudh Ramanathan <fox...@google.com>
> wrote:
>
>> Spark on Kubernetes effort has been developed separately in a fork, and
>> linked back from the Apache Spark project as an experimental backend
>> <http://spark.apache.org/docs/latest/cluster-overview.html#cluster-manager-types>.
>> We're ~6 months in, have had 5 releases
>> <https://github.com/apache-spark-on-k8s/spark/releases>.
>>
>>- 2 Spark versions maintained (2.1, and 2.2)
>>- Extensive integration testing and refactoring efforts to maintain
>>code quality
>>- Developer
>><https://github.com/apache-spark-on-k8s/spark#getting-started> and
>>    user-facing <https://apache-spark-on-k8s.github.io/userdocs/> documentation
>>- 10+ consistent code contributors from different organizations
>>
>> <https://apache-spark-on-k8s.github.io/userdocs/contribute.html#project-contributions>
>>  involved
>>in actively maintaining and using the project, with several more members
>>involved in testing and providing feedback.
>>- The community has delivered several talks on Spark-on-Kubernetes
>>generating lots of feedback from users.
>>- In addition to these, we've seen efforts spawn off such as:
>>- HDFS on Kubernetes
>>   <https://github.com/apache-spark-on-k8s/kubernetes-HDFS> with
>>   Locality and Performance Experiments
>>   - Kerberized access
>>   
>> <https://docs.google.com/document/d/1RBnXD9jMDjGonOdKJ2bA1lN4AAV_1RwpU_ewFuCNWKg/edit>
>>  to
>>   HDFS from Spark running on Kubernetes
>>
>> *Following the SPIP process, I'm putting this SPIP up for a vote.*
>>
>>- +1: Yeah, let's go forward and implement the SPIP.
>>- +0: Don't really care.
>>- -1: I don't think this is a good idea because of the following
>>technical reasons.
>>
>> If there is any further clarification desired, on the design or the
>> implementation, please feel free to ask questions or provide feedback.
>>
>>
>> SPIP: Kubernetes as A Native Cluster Manager
>>
>> Full Design Doc: link
>> <https://issues.apache.org/jira/secure/attachment/12881586/SPARK-18278%20Spark%20on%20Kubernetes%20Design%20Proposal%20Revision%202%20%281%29.pdf>
>>
>> JIRA: https://issues.apache.org/jira/browse/SPARK-18278
>>
>> Kubernetes Issue: https://github.com/kubernetes/kubernetes/issues/34377
>>
>> Authors: Yinan Li, Anirudh Ramanathan, Erik Erlandson, Andrew Ash, Matt
>> Cheah,
>>
>> Ilan Filonenko, Sean Suchter, Kimoon Kim
>> Background and Motivation
>>
>> Containerization and cluster management technologies are constantly
>> evolving in the cluster computing world. Apache Spark currently implements
>> support for Apache Hadoop YARN and Apache Mesos, in addition to providing
>> its own standalone cluster manager. In 2014, Google announced development
>> of Kubernetes <https://kubernetes.io/> which has its own unique feature
>> set and differentiates itself from YARN and Mesos. Since its debut, it has
>> seen contributions from over 1300 contributors with over 5 commits.
>> Kubernetes has cemented itself as a core player in the cluster computing
>> world, and cloud-computing providers such as Google Container Engine,
>> Google Compute Engine, Amazon Web Services, and Microsoft Azure support
>> running Kubernetes clusters.
>>
>> This document outlines a proposal for integrating Apache Spark with
>> Kubernetes in a first class way, adding Kubernetes to the list of cluster
>> managers that Spark can be used with. Doing so would allow users to share
>> their computing resources and containerization framework between their
>> existing applications on Kubernetes

Re: SPIP: Spark on Kubernetes

2017-08-15 Thread Erik Erlandson
+1 (non-binding)

On Tue, Aug 15, 2017 at 8:32 AM, Anirudh Ramanathan <fox...@google.com>
wrote:

> Spark on Kubernetes effort has been developed separately in a fork, and
> linked back from the Apache Spark project as an experimental backend
> <http://spark.apache.org/docs/latest/cluster-overview.html#cluster-manager-types>.
> We're ~6 months in, have had 5 releases
> <https://github.com/apache-spark-on-k8s/spark/releases>.
>
>- 2 Spark versions maintained (2.1, and 2.2)
>- Extensive integration testing and refactoring efforts to maintain
>code quality
>- Developer
><https://github.com/apache-spark-on-k8s/spark#getting-started> and
>user-facing <https://apache-spark-on-k8s.github.io/userdocs/>
>documentation
>- 10+ consistent code contributors from different organizations
>
> <https://apache-spark-on-k8s.github.io/userdocs/contribute.html#project-contributions>
>  involved
>in actively maintaining and using the project, with several more members
>involved in testing and providing feedback.
>- The community has delivered several talks on Spark-on-Kubernetes
>generating lots of feedback from users.
>- In addition to these, we've seen efforts spawn off such as:
>- HDFS on Kubernetes
>   <https://github.com/apache-spark-on-k8s/kubernetes-HDFS> with
>   Locality and Performance Experiments
>   - Kerberized access
>   
> <https://docs.google.com/document/d/1RBnXD9jMDjGonOdKJ2bA1lN4AAV_1RwpU_ewFuCNWKg/edit>
>  to
>   HDFS from Spark running on Kubernetes
>
> *Following the SPIP process, I'm putting this SPIP up for a vote.*
>
>- +1: Yeah, let's go forward and implement the SPIP.
>- +0: Don't really care.
>- -1: I don't think this is a good idea because of the following
>technical reasons.
>
> If there is any further clarification desired, on the design or the
> implementation, please feel free to ask questions or provide feedback.
>
>
> SPIP: Kubernetes as A Native Cluster Manager
>
> Full Design Doc: link
> <https://issues.apache.org/jira/secure/attachment/12881586/SPARK-18278%20Spark%20on%20Kubernetes%20Design%20Proposal%20Revision%202%20%281%29.pdf>
>
> JIRA: https://issues.apache.org/jira/browse/SPARK-18278
>
> Kubernetes Issue: https://github.com/kubernetes/kubernetes/issues/34377
>
> Authors: Yinan Li, Anirudh Ramanathan, Erik Erlandson, Andrew Ash, Matt
> Cheah,
>
> Ilan Filonenko, Sean Suchter, Kimoon Kim
> Background and Motivation
>
> Containerization and cluster management technologies are constantly
> evolving in the cluster computing world. Apache Spark currently implements
> support for Apache Hadoop YARN and Apache Mesos, in addition to providing
> its own standalone cluster manager. In 2014, Google announced development
> of Kubernetes <https://kubernetes.io/> which has its own unique feature
> set and differentiates itself from YARN and Mesos. Since its debut, it has
> seen contributions from over 1300 contributors with over 5 commits.
> Kubernetes has cemented itself as a core player in the cluster computing
> world, and cloud-computing providers such as Google Container Engine,
> Google Compute Engine, Amazon Web Services, and Microsoft Azure support
> running Kubernetes clusters.
>
> This document outlines a proposal for integrating Apache Spark with
> Kubernetes in a first class way, adding Kubernetes to the list of cluster
> managers that Spark can be used with. Doing so would allow users to share
> their computing resources and containerization framework between their
> existing applications on Kubernetes and their computational Spark
> applications. Although there is existing support for running a Spark
> standalone cluster on Kubernetes
> <https://github.com/kubernetes/examples/blob/master/staging/spark/README.md>,
> there are still major advantages and significant interest in having native
> execution support. For example, this integration provides better support
> for multi-tenancy and dynamic resource allocation. It also allows users to
> run applications of different Spark versions of their choices in the same
> cluster.
>
> The feature is being developed in a separate fork
> <https://github.com/apache-spark-on-k8s/spark> in order to minimize risk
> to the main project during development. Since the start of the development
> in November of 2016, it has received over 100 commits from over 20
> contributors and supports two releases based on Spark 2.1 and 2.2
> respectively. Documentation is also being actively worked on both in the
> main project repository and also in the repository
> https://github.com/apache-spark-on-k8s/userdocs. Regarding re

Apache Spark on Kubernetes: New Release for Spark 2.2

2017-08-14 Thread Erik Erlandson
The Apache Spark on Kubernetes Community Development Project is pleased to
announce the latest release of Apache Spark with native Scheduler Backend
for Kubernetes!  Features provided in this release include:


   - Cluster-mode submission of Spark jobs to a Kubernetes cluster
   - Support for Scala, Java and PySpark
   - Static and Dynamic Allocation for Executors
   - Automatic staging of local resources onto Driver and Executor pods
   - Configurable security and credential management
   - HDFS, running on the Kubernetes cluster or externally
   - Launch jobs using kubectl proxy
   - Built against Apache Spark 2.1 and 2.2
   - Support for Kubernetes 1.5 - 1.7
   - Pre-built docker images


Apache Spark on Kubernetes is currently being developed as an independent
community project, with several actively contributing companies in
collaboration. The project resides at the apache-spark-on-k8s GitHub
organization, and tracks upstream Apache Spark releases:

https://github.com/apache-spark-on-k8s/spark

If you have any questions or issues, we are happy to help! Please feel free
to reach out to the Spark on Kubernetes community on these channels:

   - Slack: https://kubernetes.slack.com #sig-big-data
   - User Documentation: https://apache-spark-on-k8s.github.io/userdocs/
   - GitHub issues: https://github.com/apache-spark-on-k8s/spark/issues
   - SIG: https://github.com/kubernetes/community/tree/master/sig-big-data


Failing to write a data-frame containing a UDT to parquet format

2017-07-30 Thread Erik Erlandson
I'm trying to support parquet i/o for data-frames that contain a UDT (for
t-digests). The UDT is defined here:

https://github.com/erikerlandson/isarn-sketches-spark/blob/feature/pyspark/src/main/scala/org/apache/spark/isarnproject/sketches/udt/TDigestUDT.scala#L37

I can read and write using 'objectFile', but when I try to use '
...write.parquet(...)' I'm getting failures I can't make sense of.  The
full stack-dump is here:
https://gist.github.com/erikerlandson/054652fc2d34ef896717124991196c0e

Following is the first portion of the dump.  The associated error message
is: "failure: `TimestampType' expected but `{' found"

scala> val data = sc.parallelize(Seq(1,2,3,4,5)).toDF("x")
data: org.apache.spark.sql.DataFrame = [x: int]

scala> val udaf = tdigestUDAF[Double].maxDiscrete(10)
udaf: org.isarnproject.sketches.udaf.TDigestUDAF[Double] =
TDigestUDAF(0.5,10)

scala> val agg = data.agg(udaf($"x").alias("tdigest"))
agg: org.apache.spark.sql.DataFrame = [tdigest: tdigest]

scala> agg.show()
+--------------------+
|             tdigest|
+--------------------+
|TDigestSQL(TDiges...|
+--------------------+

scala> agg.write.parquet("/tmp/agg.parquet")
2017-07-30 13:32:13 ERROR Utils:91 - Aborting task
java.lang.IllegalArgumentException: Unsupported dataType:
{"type":"struct","fields":[{"name":"tdigest","type":{"type":"udt","class":"org.apache.spark.isarnproject.sketches.udt.TDigestUDT$","pyClass":"isarnproject.sketches.udt.tdigest.TDigestUDT","sqlType":{"type":"struct","fields":[{"name":"delta","type":"double","nullable":false,"metadata":{}},{"name":"maxDiscrete","type":"integer","nullable":false,"metadata":{}},{"name":"nclusters","type":"integer","nullable":false,"metadata":{}},{"name":"clustX","type":{"type":"array","elementType":"double","containsNull":false},"nullable":false,"metadata":{}},{"name":"clustM","type":{"type":"array","elementType":"double","containsNull":false},"nullable":false,"metadata":{}}]}},"nullable":true,"metadata":{}}]},
[1.1] failure: `TimestampType' expected but `{' found


Spark on Kubernetes: Birds-of-a-Feather Session 12:50pm 6/6

2017-06-05 Thread Erik Erlandson
Come learn about the community development project to add a native
Kubernetes scheduling back-end to Apache Spark!  Meet contributors
and network with community members interested in running Spark on
Kubernetes. Learn how to run Spark jobs on your Kubernetes cluster;
find out how to contribute to the project.


Re: [ml] Why all model classes are final?

2015-06-11 Thread Erik Erlandson
I was able to work around this problem in several cases using the class 
'enhancement' or 'extension' pattern to add some functionality to the decision 
tree model data structures.
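
For reference, here is roughly what that pattern looks like (a minimal sketch; 
the summaryString helper is hypothetical and only delegates to the public model 
API):

import org.apache.spark.mllib.tree.model.DecisionTreeModel

// "Enrich my library" style extension: add methods to a final model class
// without inheriting from it.
object TreeModelEnrichment {
  implicit class RichDecisionTreeModel(val model: DecisionTreeModel) extends AnyVal {
    // hypothetical convenience method, built only from public accessors
    def summaryString: String =
      s"DecisionTreeModel(depth=${model.depth}, numNodes=${model.numNodes})"
  }
}

// usage:
//   import TreeModelEnrichment._
//   println(myModel.summaryString)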


- Original Message -
 Hi, previously all the models in ml package were private to package, so
 if i need to customize some models i inherit them in org.apache.spark.ml
 package in my project. But now new models
 (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala#L46)
 are final classes. So if i need to customize 1 line or so, i need to
 redefine the whole class. Any reasons to do so? As a developer,i
 understand all the risks of using Developer/Alpha API. That's why i'm
 using spark, because it provides a building blocks that i could easily
 customize and combine for my need.
 
 Thanks,
 Peter Rudenko
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org
 
 

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Discussion | SparkContext 's setJobGroup and clearJobGroup should return a new instance of SparkContext

2015-01-12 Thread Erik Erlandson
setJobGroup needs fixing:
https://issues.apache.org/jira/browse/SPARK-4514

I'm interested in any community input on what the semantics or design ought 
to be changed to.
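
For what it's worth, one partial workaround with the current mutable API is to 
confine each submitted job to its own thread, since setJobGroup is backed by 
thread-local job properties. A minimal sketch (names invented; it does not help 
once user code hops to other threads, which is the harder case raised below):

import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}
import org.apache.spark.SparkContext

// Run 'body' on a dedicated thread so the job group only applies to work
// submitted from that thread, then clear it when the job finishes.
def runInJobGroup[T](sc: SparkContext, groupId: String, desc: String)(body: => T): Future[T] = {
  val exec = Executors.newSingleThreadExecutor()
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(exec)
  val result = Future {
    sc.setJobGroup(groupId, desc, interruptOnCancel = true)
    try body finally sc.clearJobGroup()
  }
  result.onComplete(_ => exec.shutdown())
  result
}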


- Original Message -
 Hi spark committers
 
 I would like to discuss the possibility of changing the signature
 of SparkContext 's setJobGroup and clearJobGroup functions to return a
 replica of SparkContext with the job group set/unset instead of mutating
 the original context. I am building a spark job server and I am assigning
 job groups before passing control to user provided logic that uses spark
 context to define and execute a job (very much like job-server). The issue
 is that I can't reliably know when to clear the job group as user defined
 code can use futures to submit multiple tasks in parallel. In fact, I am
 even allowing users to return a future from their function on which spark
 server can register callbacks to know when the user defined job is
 complete. Now, if I set the job group before passing control to user
 function and wait on future to complete so that I can clear the job group,
 I can no longer use that SparkContext for any other job. This means I will
 have to lock on the SparkContext which seems like a bad idea. Therefore, my
 proposal would be to return new instance of SparkContext (a replica with
 just job group set/unset) that can further be used in concurrent
 environment safely. I am also happy mutating the original SparkContext just
 not break backward compatibility as long as the returned SparkContext is
 not affected by set/unset of job groups on original SparkContext.
 
 Thoughts please?
 
 Thanks,
 Aniket
 

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Scalastyle improvements / large code reformatting

2014-10-13 Thread Erik Erlandson


- Original Message -
 I'm also against these huge reformattings. They slow down development and
 backporting for trivial reasons. Let's not do that at this point, the style
 of the current code is quite consistent and we have plenty of other things
 to worry about. Instead, what you can do is as you edit a file when you're
 working on a feature, fix up style issues you see. Or, as Josh suggested,
 some way to make this apply only to new files would help.


+1, the benefit/cost ratio of wide-spread code churn just to alter formatting 
is close to zero.



 
 Matei
 
 On Oct 12, 2014, at 10:16 PM, Patrick Wendell pwend...@gmail.com wrote:
 
  Another big problem with these patches are that they make it almost
  impossible to backport changes to older branches cleanly (there
  becomes like 100% chance of a merge conflict).
  
  One proposal is to do this:
  1. We only consider new style rules at the end of a release cycle,
  when there is the smallest chance of wanting to backport stuff.
  2. We require that they are submitted in individual patches with a (a)
  new style rule and (b) the associated changes. Then we can also
  evaluate on a case-by-case basis how large the change is for each
  rule. For rules that require sweeping changes across the codebase,
  personally I'd vote against them. For rules like import ordering that
  won't cause too much pain on the diff (it's pretty easy to deal with
  those conflicts) I'd be okay with it.
  
  If we went with this, we'd also have to warn people that we might not
  accept new style rules if they are too costly to enforce. I'm guessing
  people will still contribute even with those expectations.
  
  - Patrick
  
  On Sun, Oct 12, 2014 at 9:39 PM, Reynold Xin r...@databricks.com wrote:
  I actually think we should just take the bite and follow through with the
  reformatting. Many rules are simply not possible to enforce only on deltas
  (e.g. import ordering).
  
  That said, maybe there are better windows to do this, e.g. during the QA
  period.
  
  On Sun, Oct 12, 2014 at 9:37 PM, Josh Rosen rosenvi...@gmail.com wrote:
  
  There are a number of open pull requests that aim to extend Spark's
  automated style checks (see
  https://issues.apache.org/jira/browse/SPARK-3849 for an umbrella JIRA).
  These fixes are mostly good, but I have some concerns about merging these
  patches.  Several of these patches make large reformatting changes in
  nearly every file of Spark, which makes it more difficult to use `git
  blame` and has the potential to introduce merge conflicts with all open
  PRs
  and all backport patches.
  
  I feel that most of the value of automated style-checks comes from
  allowing reviewers/committers to focus on the technical content of pull
  requests rather than their formatting.  My concern is that the
  convenience
  added by these new style rules will not outweigh the other overheads that
  these reformatting patches will create for the committers.
  
  If possible, it would be great if we could extend the style checker to
  enforce the more stringent rules only for new code additions / deletions.
  If not, I don't think that we should proceed with the reformatting.
  Others
  might disagree, though, so I welcome comments / discussion.
  
  - Josh
  
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
  
 
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org
 
 

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Adding abstraction in MLlib

2014-09-12 Thread Erik Erlandson

Are interface designs being captured anywhere as documents that the community 
can follow along with as the proposals evolve?

I've worked on other open source projects where design docs were published as 
living documents (e.g. on google docs or etherpad; the particular 
mechanism isn't crucial).   FWIW, I found that to be a good way to work in a 
community environment.


- Original Message -
 Hi Egor,
 
 Thanks for the feedback! We are aware of some of the issues you
 mentioned and there are JIRAs created for them. Specifically, I'm
 pushing out the design on pipeline features and algorithm/model
 parameters this week. We can move our discussion to
 https://issues.apache.org/jira/browse/SPARK-1856 .
 
 It would be nice to make tests against interfaces. But it definitely
 needs more discussion before making PRs. For example, we discussed the
 learning interfaces in Christoph's PR
 (https://github.com/apache/spark/pull/2137/) but it takes time to
 reach a consensus, especially on interfaces. Hopefully all of us could
 benefit from the discussion. The best practice is to break down the
 proposal into small independent piece and discuss them on the JIRA
 before submitting PRs.
 
 For performance tests, there is a spark-perf package
 (https://github.com/databricks/spark-perf) and we added performance
 tests for MLlib in v1.1. But definitely more work needs to be done.
 
 The dev-list may not be a good place for discussion on the design,
 could you create JIRAs for each of the issues you pointed out, and we
 track the discussion on JIRA? Thanks!
 
 Best,
 Xiangrui
 
 On Fri, Sep 12, 2014 at 10:45 AM, Reynold Xin r...@databricks.com wrote:
  Xiangrui can comment more, but I believe Joseph and him are actually
  working on standardize interface and pipeline feature for 1.2 release.
 
  On Fri, Sep 12, 2014 at 8:20 AM, Egor Pahomov pahomov.e...@gmail.com
  wrote:
 
  Some architect suggestions on this matter -
  https://github.com/apache/spark/pull/2371
 
  2014-09-12 16:38 GMT+04:00 Egor Pahomov pahomov.e...@gmail.com:
 
   Sorry, I misswrote  - I meant learners part of framework - models
   already
   exists.
  
   2014-09-12 15:53 GMT+04:00 Christoph Sawade 
   christoph.saw...@googlemail.com:
  
   I totally agree, and we discovered also some drawbacks with the
   classification models implementation that are based on GLMs:
  
   - There is no distinction between predicting scores, classes, and
   calibrated scores (probabilities). For these models it is common to
   have
   access to all of them and the prediction function ``predict``should be
   consistent and stateless. Currently, the score is only available after
   removing the threshold from the model.
   - There is no distinction between multinomial and binomial
   classification. For multinomial problems, it is necessary to handle
   multiple weight vectors and multiple confidences.
   - Models are not serialisable, which makes it hard to use them in
   practise.
  
   I started a pull request [1] some time ago. I would be happy to
   continue
   the discussion and clarify the interfaces, too!
  
   Cheers, Christoph
  
   [1] https://github.com/apache/spark/pull/2137/
  
   2014-09-12 11:11 GMT+02:00 Egor Pahomov pahomov.e...@gmail.com:
  
    Here in Yandex, during implementation of gradient boosting in Spark and
    creation of our ML tool for internal use, we found the following serious
    problems in MLlib:

    - There is no Regression/Classification model abstraction. We were
      building abstract data processing pipelines, which should work with
      some regression - the exact algorithm specified outside this code.
      There is no abstraction which would allow me to do that. *(This is
      the main reason for all further problems.)*
    - There is no common practice in MLlib for testing algorithms: every
      model generates its own random test data. There are no easily
      extractable test cases applicable to other algorithms, and no
      benchmarks for comparing algorithms. After implementing a new
      algorithm it's very hard to understand how it should be tested.
    - Lack of serialization testing: MLlib algorithms don't contain tests
      which verify that a model still works after serialization.
    - During implementation of a new algorithm it's hard to understand what
      API you should create and which interface to implement.

    The starting point for solving all these problems must be the creation
    of a common interface for the typical algorithms/models - regression,
    classification, clustering, collaborative filtering.

    All main tests should be written against these interfaces, so when a new
    algorithm is implemented, all it has to do is pass the already written
    tests. That would let us maintain a manageable level of quality across
    the whole library.

    There should be a couple of benchmarks which allow a new Spark user to
    get a feeling for which algorithm to use.
  
   Test set against these abstractions 

PSA: SI-8835 (Iterator 'drop' method has a complexity bug causing quadratic behavior)

2014-09-06 Thread Erik Erlandson
I tripped over this recently while preparing a solution for SPARK-3250 
(efficient sampling):

Iterator 'drop' method has a complexity bug causing quadratic behavior
https://issues.scala-lang.org/browse/SI-8835

It's something of a corner case, as the impact is serious only if one is 
repeatedly invoking 'drop' on Iterator[T], and it doesn't apply to all iterator 
subclasses (e.g. Array().iterator).   It's actually a bug in 'slice', and so 
invocations of 'drop', 'take' and 'slice' are potential vulnerabilities.

Not something I'd expect to show up frequently, but it seemed worth mentioning, 
as RDD partitions are ubiquitously presented as Iterator[T] in the compute 
model, and if it does happen it turns a linear algorithm into something having 
quadratic behavior.
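
For anyone who wants to see the effect directly, here is a minimal standalone sketch (illustrative names, not code from the Spark PR) of the repeated-drop access pattern that hits the bug; on an affected Scala version the List-backed walk slows down dramatically while the Array-backed one does not:

// Illustrative only: walks an iterator by alternately consuming one element
// and skipping one via drop().  On Scala versions affected by SI-8835 each
// drop goes through slice() and adds another wrapper layer, so the k-th step
// pays for all previous ones and the walk becomes quadratic for List-backed
// iterators.  Array().iterator overrides drop efficiently, so it serves as a
// baseline.
object IteratorDropDemo extends App {
  def walkByDrop(it0: Iterator[Int]): Long = {
    var it  = it0
    var sum = 0L
    while (it.hasNext) {
      sum += it.next()
      it = it.drop(1)            // skip-ahead via drop, as a sampler might do
    }
    sum
  }

  def timed[A](label: String)(body: => A): A = {
    val t0 = System.nanoTime()
    val result = body
    println(f"$label%-15s ${(System.nanoTime() - t0) / 1e6}%.1f ms")
    result
  }

  val n = 50000                  // increase to make the gap more dramatic
  timed("List iterator")(walkByDrop(List.range(0, n).iterator))
  timed("Array iterator")(walkByDrop(Array.range(0, n).iterator))
}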




Re: Handling stale PRs

2014-08-26 Thread Erik Erlandson


- Original Message -

  Another thing is that we should, IMO, err on the side of explicitly saying
  "no" or "not yet" to patches, rather than letting them linger without
  attention. We do get patches where the user is well intentioned, but it is
 
  Completely agree. The solution is partly more supply of committer time
  on JIRAs. But that is detracting from the work the committers
  themselves want to do. More of the solution is reducing demand by
  helping people create useful, actionable, non-duplicate JIRAs from the
  start. Or encouraging people to resolve existing JIRAs and shepherding
  those in.
 
 saying no/not-yet is a vitally important piece of information.

+1, when I propose a contribution to a project, I consider an articulate (and 
hopefully polite) "no thanks", "not yet", or "needs work" to be far more 
useful and pleasing than just radio silence.  It allows me to either address 
the feedback, or just move on.

Although it takes effort to keep abreast of community contributions, I don't 
think it needs to be an overbearing or heavy-weight process.  I've seen other 
communities where they talked themselves out of better management because they 
conceived the ticket workflow as being more effort than it needed to be.  Much 
better to keep ticket triage and workflow fast/simple, but actually do it.



 
 
  Elsewhere, I've found people reluctant to close JIRA for fear of
  offending or turning off contributors. I think the opposite is true.
  There is nothing wrong with "no" or "not now", especially when accompanied
  with constructive feedback. Better to state for the record what is not
  being looked at and why, than let people work on and open the same
  JIRAs repeatedly.
 
 well stated!
 
 
  I have also found in the past that a culture of tolerating eternal
  JIRAs led people to file JIRAs in order to forget about a problem --
  it's in JIRA, and it's in progress, so it feels like someone else is
  going to fix it later and so it can be forgotten now.
 
 there's some value in these now-i-can-forget jira, though i'm not
 personally a fan. it can be good to keep them around and reachable by
 search, but they should be clearly marked as no/not-yet or something
 similar.
 
 
  For what it's worth, I think these project and culture mechanics are
  so important and it's my #1 concern for Spark at this stage. This
  challenge exists so much more here, exactly because there is so much
  potential. I'd love to help by trying to identify and close stale
  JIRAs but am afraid that tagging them is just adding to the heap of
  work.
 
 +1 concern and potential!
 
 
 best,
 
 
 matt
 
 
 




any interest in something like rdd.parent[T](n) (equivalent to firstParent[T] for n==0) ?

2014-08-05 Thread Erik Erlandson
Not that  rdd.dependencies(n).rdd.asInstanceOf[RDD[T]]  is terrible, but 
rdd.parent[T](n) better captures the intent.
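
For context, the kind of sugar being suggested could look roughly like the enrichment below (names are illustrative, not an actual Spark API):

import org.apache.spark.rdd.RDD

// Illustrative enrichment only -- not proposing this exact code, just showing
// the intent: rdd.parentRDD[T](n) as sugar for the cast through dependencies.
object ParentSyntax {
  implicit class RichRDD[U](val self: RDD[U]) extends AnyVal {
    /** The RDD behind the n-th dependency, cast to its known element type. */
    def parentRDD[T](n: Int): RDD[T] =
      self.dependencies(n).rdd.asInstanceOf[RDD[T]]
  }
}

// e.g., given some rdd with two parents:
//   import ParentSyntax._
//   val lhs = rdd.parentRDD[Int](0)
//   val rhs = rdd.parentRDD[String](1)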




Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-29 Thread Erik Erlandson


- Original Message -
 Sure, drop() would be useful, but breaking the "transformations are lazy;
 only actions launch jobs" model is abhorrent -- which is not to say that we
 haven't already broken that model for useful operations (cf.
 RangePartitioner, which is used for sorted RDDs), but rather that each such
 exception to the model is a significant source of pain that can be hard to
 work with or work around.
 
 I really wouldn't like to see another such model-breaking transformation
 added to the API.  On the other hand, being able to write transformations
 with dependencies on these kind of internal jobs is sometimes very
 useful, so a significant reworking of Spark's Dependency model that would
 allow for lazily running such internal jobs and making the results
 available to subsequent stages may be something worth pursuing.


It turns out that drop can be implemented as a proper lazy transform.  I 
discuss how that works here:
http://erikerlandson.github.io/blog/2014/07/29/deferring-spark-actions-to-lazy-transforms-with-the-promise-rdd/

I updated the PR with this lazy implementation.
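
For the curious, the gist can be sketched without the PromiseRDD machinery (this is a simplified, hypothetical illustration, not the code from the post or the PR): push the boundary-locating job into getPartitions of a custom RDD, which Spark only evaluates when a job actually runs on the result, so merely calling drop stays lazy.

import scala.reflect.ClassTag
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

// Simplified illustration of a lazy drop(n).  The per-partition skip counts
// require running a job over the parent RDD, but that job is deferred into
// getPartitions, which Spark calls (on the driver) only when a job is
// actually submitted on this RDD.  Counting every partition with it.size is
// a deliberate simplification; a real implementation would stop counting at
// the boundary partition.
case class DropPartition(index: Int, skip: Int, parent: Partition) extends Partition

class DropRDD[T: ClassTag](prev: RDD[T], n: Int) extends RDD[T](prev) {

  override protected def getPartitions: Array[Partition] = {
    val sizes = prev.mapPartitionsWithIndex((i, it) => Iterator((i, it.size)))
                    .collect().toMap
    var remaining = n
    prev.partitions.map { p =>
      val skip = math.min(remaining, sizes.getOrElse(p.index, 0))
      remaining -= skip
      DropPartition(p.index, skip, p): Partition
    }
  }

  override def compute(split: Partition, context: TaskContext): Iterator[T] = {
    val dp = split.asInstanceOf[DropPartition]
    prev.iterator(dp.parent, context).drop(dp.skip)
  }
}

Constructing the DropRDD triggers no work; the counting job only fires the first time Spark asks for its partitions.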




 
 
 On Mon, Jul 21, 2014 at 8:27 AM, Andrew Ash and...@andrewash.com wrote:
 
  Personally I'd find the method useful -- I've often had a .csv file with a
  header row that I want to drop so filter it out, which touches all
  partitions anyway.  I don't have any comments on the implementation quite
  yet though.
 
 
  On Mon, Jul 21, 2014 at 8:24 AM, Erik Erlandson e...@redhat.com wrote:
 
   A few weeks ago I submitted a PR for supporting rdd.drop(n), under
   SPARK-2315:
   https://issues.apache.org/jira/browse/SPARK-2315
  
   Supporting the drop method would make some operations convenient, however
   it forces computation of >= 1 partition of the parent RDD, and so it
  would
   behave like a partial action that returns an RDD as the result.
  
   I wrote up a discussion of these trade-offs here:
  
  
  http://erikerlandson.github.io/blog/2014/07/20/some-implications-of-supporting-the-scala-drop-method-for-spark-rdds/
  
 
 


Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-22 Thread Erik Erlandson


- Original Message -
 It could make sense to add a skipHeader argument to SparkContext.textFile?

I also looked into this.   I don't think it's feasible given the limits of the 
InputFormat and RecordReader interfaces.  RecordReader can't (I think) *ever* 
know which split it's attached to, and the getSplits() method has no concept of 
RecordReader, so it can't know how many records reside in its splits.   At 
least at the RDD level it's possible to do, if not attractive.
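
For reference, the workaround people usually fall back on operates at the RDD level rather than in the InputFormat; a rough sketch (the path and app name below are just placeholders):

import org.apache.spark.SparkContext

// Drop a single header line at the RDD level: only partition 0 of a plain
// textFile load contains the start of the file, so only its first record is
// skipped.  This assumes a one-line header and a non-empty, un-merged input.
object SkipHeaderExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "skip-header-demo")
    val lines = sc.textFile("data.csv")          // placeholder input path

    val noHeader = lines.mapPartitionsWithIndex { (idx, it) =>
      if (idx == 0) it.drop(1) else it
    }

    noHeader.take(5).foreach(println)
    sc.stop()
  }
}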



 
 
 On Mon, Jul 21, 2014 at 10:37 PM, Reynold Xin r...@databricks.com wrote:
 
  If the purpose is for dropping csv headers, perhaps we don't really need a
  common drop and only one that drops the first line in a file? I'd really
  try hard to avoid a common drop/dropWhile because they can be expensive to
  do.
 
  Note that I think we will be adding this functionality (ignoring headers)
  to the CsvRDD functionality in Spark SQL.
   https://github.com/apache/spark/pull/1351
 
 
  On Mon, Jul 21, 2014 at 1:45 PM, Mark Hamstra m...@clearstorydata.com
  wrote:
 
   You can find some of the prior, related discussion here:
   https://issues.apache.org/jira/browse/SPARK-1021
  
  
   On Mon, Jul 21, 2014 at 1:25 PM, Erik Erlandson e...@redhat.com wrote:
  
   
   
- Original Message -
 Rather than embrace non-lazy transformations and add more of them,
  I'd
 rather we 1) try to fully characterize the needs that are driving
  their
 creation/usage; and 2) design and implement new Spark abstractions
  that
 will allow us to meet those needs and eliminate existing non-lazy
 transformation.
   
   
In the case of drop, obtaining the index of the boundary partition can
  be
viewed as the action forcing compute -- one that happens to be invoked
inside of a transform.  The concept of a lazy action, that is only
triggered if the result rdd has compute invoked on it, might be
   sufficient
to restore laziness to the drop transform.   For that matter, I might
   find
some way to make use of Scala lazy values directly and achieve the same
goal for drop.
   
   
   
 They really mess up things like creation of asynchronous
 FutureActions, job cancellation and accounting of job resource usage,
etc.,
 so I'd rather we seek a way out of the existing hole rather than make
   it
 deeper.


 On Mon, Jul 21, 2014 at 10:24 AM, Erik Erlandson e...@redhat.com
   wrote:

 
 
  - Original Message -
   Sure, drop() would be useful, but breaking the transformations
  are
lazy;
   only actions launch jobs model is abhorrent -- which is not to
  say
that
  we
   haven't already broken that model for useful operations (cf.
   RangePartitioner, which is used for sorted RDDs), but rather that
each
  such
   exception to the model is a significant source of pain that can
  be
hard
  to
   work with or work around.
 
  A thought that comes to my mind here is that there are in fact
   already
two
  categories of transform: ones that are truly lazy, and ones that
  are
not.
   A possible option is to embrace that, and commit to documenting
  the
two
  categories as such, with an obvious bias towards favoring lazy
transforms
  (to paraphrase Churchill, we're down to haggling over the price).
 
 
  
   I really wouldn't like to see another such model-breaking
transformation
   added to the API.  On the other hand, being able to write
transformations
   with dependencies on these kind of internal jobs is sometimes
   very
   useful, so a significant reworking of Spark's Dependency model
  that
would
   allow for lazily running such internal jobs and making the
  results
   available to subsequent stages may be something worth pursuing.
 
 
  This seems like a very interesting angle.   I don't have much feel
   for
  what a solution would look like, but it sounds as if it would
  involve
  caching all operations embodied by RDD transform method code for
  provisional execution.  I believe that these levels of invocation
  are
  currently executed in the master, not executor nodes.
 
 
  
  
   On Mon, Jul 21, 2014 at 8:27 AM, Andrew Ash 
  and...@andrewash.com
  wrote:
  
Personally I'd find the method useful -- I've often had a .csv
   file
  with a
header row that I want to drop so filter it out, which touches
   all
partitions anyway.  I don't have any comments on the
   implementation
  quite
yet though.
   
   
On Mon, Jul 21, 2014 at 8:24 AM, Erik Erlandson 
  e...@redhat.com
  wrote:
   
 A few weeks ago I submitted a PR for supporting rdd.drop(n),
under
 SPARK-2315:
 https://issues.apache.org/jira/browse/SPARK-2315

 Supporting the drop method would make

RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Erik Erlandson
A few weeks ago I submitted a PR for supporting rdd.drop(n), under SPARK-2315:
https://issues.apache.org/jira/browse/SPARK-2315

Supporting the drop method would make some operations convenient, however it 
forces computation of >= 1 partition of the parent RDD, and so it would behave 
like a partial action that returns an RDD as the result.

I wrote up a discussion of these trade-offs here:
http://erikerlandson.github.io/blog/2014/07/20/some-implications-of-supporting-the-scala-drop-method-for-spark-rdds/


Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Erik Erlandson


- Original Message -
 I too would like this feature. Erik's post makes sense. However, shouldn't
 the RDD also repartition itself after drop to effectively make use of
 cluster resources?


My thinking is that in most use cases(*), one is dropping a small number of 
rows, and they are in only the 1st partition, and so repartitioning would not 
be worth the cost.  The first partition would be passed mostly intact, and the 
remainder would be completely unchanged.

(*) or at least most use cases that I've considered.


 On Jul 21, 2014 8:58 PM, Andrew Ash [via Apache Spark Developers List] 
 ml-node+s1001551n7434...@n3.nabble.com wrote:
 
  Personally I'd find the method useful -- I've often had a .csv file with a
  header row that I want to drop so filter it out, which touches all
  partitions anyway.  I don't have any comments on the implementation quite
  yet though.
 
 
  On Mon, Jul 21, 2014 at 8:24 AM, Erik Erlandson [hidden email] wrote:
 
   A few weeks ago I submitted a PR for supporting rdd.drop(n), under
   SPARK-2315:
   https://issues.apache.org/jira/browse/SPARK-2315
  
   Supporting the drop method would make some operations convenient,
  however
    it forces computation of >= 1 partition of the parent RDD, and so it
  would
   behave like a partial action that returns an RDD as the result.
  
   I wrote up a discussion of these trade-offs here:
  
  
  http://erikerlandson.github.io/blog/2014/07/20/some-implications-of-supporting-the-scala-drop-method-for-spark-rdds/
  
 
 


Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Erik Erlandson


- Original Message -
 Sure, drop() would be useful, but breaking the "transformations are lazy;
 only actions launch jobs" model is abhorrent -- which is not to say that we
 haven't already broken that model for useful operations (cf.
 RangePartitioner, which is used for sorted RDDs), but rather that each such
 exception to the model is a significant source of pain that can be hard to
 work with or work around.

A thought that comes to my mind here is that there are in fact already two 
categories of transform: ones that are truly lazy, and ones that are not.  A 
possible option is to embrace that, and commit to documenting the two 
categories as such, with an obvious bias towards favoring lazy transforms (to 
paraphrase Churchill, we're down to haggling over the price).
 

 
 I really wouldn't like to see another such model-breaking transformation
 added to the API.  On the other hand, being able to write transformations
 with dependencies on these kind of internal jobs is sometimes very
 useful, so a significant reworking of Spark's Dependency model that would
 allow for lazily running such internal jobs and making the results
 available to subsequent stages may be something worth pursuing.


This seems like a very interesting angle.   I don't have much feel for what a 
solution would look like, but it sounds as if it would involve caching all 
operations embodied by RDD transform method code for provisional execution.  I 
believe that these levels of invocation are currently executed in the master, 
not executor nodes.


 
 
 On Mon, Jul 21, 2014 at 8:27 AM, Andrew Ash and...@andrewash.com wrote:
 
  Personally I'd find the method useful -- I've often had a .csv file with a
  header row that I want to drop so filter it out, which touches all
  partitions anyway.  I don't have any comments on the implementation quite
  yet though.
 
 
  On Mon, Jul 21, 2014 at 8:24 AM, Erik Erlandson e...@redhat.com wrote:
 
   A few weeks ago I submitted a PR for supporting rdd.drop(n), under
   SPARK-2315:
   https://issues.apache.org/jira/browse/SPARK-2315
  
   Supporting the drop method would make some operations convenient, however
    it forces computation of >= 1 partition of the parent RDD, and so it
  would
   behave like a partial action that returns an RDD as the result.
  
   I wrote up a discussion of these trade-offs here:
  
  
  http://erikerlandson.github.io/blog/2014/07/20/some-implications-of-supporting-the-scala-drop-method-for-spark-rdds/
  
 
 


Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Erik Erlandson


- Original Message -
 Rather than embrace non-lazy transformations and add more of them, I'd
 rather we 1) try to fully characterize the needs that are driving their
 creation/usage; and 2) design and implement new Spark abstractions that
 will allow us to meet those needs and eliminate existing non-lazy
 transformations.


In the case of drop, obtaining the index of the boundary partition can be 
viewed as the action forcing compute -- one that happens to be invoked inside 
of a transform.  The concept of a lazy action, that is only triggered if the 
result rdd has compute invoked on it, might be sufficient to restore laziness 
to the drop transform.   For that matter, I might find some way to make use of 
Scala lazy values directly and achieve the same goal for drop.



 They really mess up things like creation of asynchronous
 FutureActions, job cancellation and accounting of job resource usage, etc.,
 so I'd rather we seek a way out of the existing hole rather than make it
 deeper.
 
 
 On Mon, Jul 21, 2014 at 10:24 AM, Erik Erlandson e...@redhat.com wrote:
 
 
 
  - Original Message -
   Sure, drop() would be useful, but breaking the "transformations are lazy;
   only actions launch jobs" model is abhorrent -- which is not to say that
  we
   haven't already broken that model for useful operations (cf.
   RangePartitioner, which is used for sorted RDDs), but rather that each
  such
   exception to the model is a significant source of pain that can be hard
  to
   work with or work around.
 
  A thought that comes to my mind here is that there are in fact already two
  categories of transform: ones that are truly lazy, and ones that are not.
   A possible option is to embrace that, and commit to documenting the two
  categories as such, with an obvious bias towards favoring lazy transforms
  (to paraphrase Churchill, we're down to haggling over the price).
 
 
  
   I really wouldn't like to see another such model-breaking transformation
   added to the API.  On the other hand, being able to write transformations
   with dependencies on these kind of internal jobs is sometimes very
   useful, so a significant reworking of Spark's Dependency model that would
   allow for lazily running such internal jobs and making the results
   available to subsequent stages may be something worth pursuing.
 
 
  This seems like a very interesting angle.   I don't have much feel for
  what a solution would look like, but it sounds as if it would involve
  caching all operations embodied by RDD transform method code for
  provisional execution.  I believe that these levels of invocation are
  currently executed in the master, not executor nodes.
 
 
  
  
   On Mon, Jul 21, 2014 at 8:27 AM, Andrew Ash and...@andrewash.com
  wrote:
  
Personally I'd find the method useful -- I've often had a .csv file
  with a
header row that I want to drop so filter it out, which touches all
partitions anyway.  I don't have any comments on the implementation
  quite
yet though.
   
   
On Mon, Jul 21, 2014 at 8:24 AM, Erik Erlandson e...@redhat.com
  wrote:
   
 A few weeks ago I submitted a PR for supporting rdd.drop(n), under
 SPARK-2315:
 https://issues.apache.org/jira/browse/SPARK-2315

 Supporting the drop method would make some operations convenient,
  however
 it forces computation of >= 1 partition of the parent RDD, and so it
would
 behave like a partial action that returns an RDD as the result.

 I wrote up a discussion of these trade-offs here:


   
  http://erikerlandson.github.io/blog/2014/07/20/some-implications-of-supporting-the-scala-drop-method-for-spark-rdds/