Re: data source api v2 refactoring

2018-08-31 Thread Jungtaek Lim
Nice suggestion, Reynold, and great news that Wenchen succeeded in
prototyping it!

One thing I would like to make sure of is how continuous mode works with such
an abstraction. Would continuous mode also be abstracted with Stream, with
createScan providing an unbounded Scan?

Thanks,
Jungtaek Lim (HeartSaVioR)

On Sat, Sep 1, 2018 at 8:26 AM, Ryan Blue wrote:

> Thanks, Reynold!
>
> I think your API sketch looks great. I appreciate having the Table level
> in the abstraction to plug into as well. I think this makes it clear what
> everything does, particularly having the Stream level that represents a
> configured (by ScanConfig) streaming read and can act as a factory for
> individual batch scans or for continuous scans.
>
> Wenchen, I'm not sure what you mean by doing pushdown at the table level.
> It seems to mean that pushdown is specific to a batch scan or streaming
> read, which seems to be what you're saying as well. Wouldn't the pushdown
> happen to create a ScanConfig, which is then used as Reynold suggests?
> Looking forward to seeing this PR when you get it posted. Thanks for all of
> your work on this!
>
> rb
>
> On Fri, Aug 31, 2018 at 3:52 PM Wenchen Fan 
> wrote:
>
>> Thanks, Reynold, for writing this and starting the discussion!
>>
>> Data source v2 was started with batch only, so we didn't pay much
>> attention to the abstraction and just followed the v1 API. Now that we are
>> designing the streaming API and catalog integration, the abstraction
>> becomes super important.
>>
>> I like this proposed abstraction and have successfully prototyped it to
>> make sure it works.
>>
>> During prototyping, I had to work around the issue that the current
>> streaming engine does query optimization/planning for each micro batch.
>> With this abstraction, the operator pushdown is only applied once
>> per-query. In my prototype, I do the physical planning up front to get the
>> pushdown result, and
>> add a logical linking node that wraps the resulting physical plan node
>> for the data source, and then swap that logical linking node into the
>> logical plan for each batch. In the future we should just let the streaming
>> engine do query optimization/planning only once.
>>
>> About pushdown, I think we should do it at the table level. The table
>> should create a new pushdown handler to apply operator pushdown for each
>> scan/stream, and create the scan/stream with the pushdown result. The
>> rationale is, a table should have the same pushdown behavior regardless of
>> the scan node.
>>
>> Thanks,
>> Wenchen
>>
>>
>>
>>
>>
>> On Fri, Aug 31, 2018 at 2:00 PM Reynold Xin  wrote:
>>
>>> I spent some time last week looking at the current data source v2 apis,
>>> and I thought we should be a bit more buttoned up in terms of the
>>> abstractions and the guarantees Spark provides. In particular, I feel we
>>> need the following levels of "abstractions", to fit the use cases in Spark,
>>> from batch, to streaming.
>>>
>>> Please don't focus on the naming at this stage. When possible, I draw
>>> parallels to what similar levels are named in the currently committed api:
>>>
>>> 0. Format: This represents a specific format, e.g. Parquet, ORC. There
>>> is currently no explicit class at this level.
>>>
>>> 1. Table: This should represent a logical dataset (with schema). This
>>> could be just a directory on the file system, or a table in the catalog.
>>> Operations on tables can include batch reads (Scan), streams, writes, and
>>> potentially other operations such as deletes. The closest to the table
>>> level abstraction in the current code base is the "Provider" class,
>>> although Provider isn't quite a Table. This is similar to Ryan's proposed
>>> design.
>>>
>>> 2. Stream: Specific to streaming. A stream is created out of a Table.
>>> This logically represents an instance of a StreamingQuery. Pushdowns and
>>> options are handled at this layer; i.e., Spark guarantees to the data source
>>> implementation that pushdowns and options don't change within a Stream. Each
>>> Stream consists of a sequence of scans. There is no equivalent concept
>>> in the current committed code.
>>>
>>> 3. Scan: A physical scan -- either as part of a streaming query, or a
>>> batch query. This should contain sufficient information and methods so we
>>> can run a Spark job over a defined subset of the table. It's functionally
>>> equivalent to an RDD, except there's no dependency on RDD so it is a
>>> smaller surface. In the current code, the equivalent class would be the
>>> ScanConfig, which represents the information needed, but in order to
>>> execute a job, ReadSupport is needed (various methods in ReadSupport take
>>> a ScanConfig).
>>>
>>>
>>> To illustrate with pseudocode what the different levels mean, a batch
>>> query would look like the following:
>>>
>>> val provider = reflection[Format]("parquet")
>>> val table = provider.createTable(options)
>>> val scan = table.createScan(scanConfig) // scanConfig includes pushdown
>>> and options
>>> // run tasks on executors

Re: [Feedback Requested] SPARK-25299: Using Distributed Storage for Persisting Shuffle Data

2018-08-31 Thread Yuanjian Li
Hi Matt,
 Thanks for the great document and proposal; I want to +1 the idea of
reliable shuffle data and give some feedback.
 I think a reliable shuffle service based on DFS is necessary for Spark,
especially when running Spark jobs in an unstable environment. For example,
when Spark is co-deployed with online services, Spark executors can be killed
at any time, and the current stage retry strategy can make the job many times
slower than a normal job.
 We (Baidu Inc.) actually solved this problem with a stable shuffle service
over Hadoop, and we are now integrating Spark with this shuffle service. We
expect the POC work to be done in October and will post more benchmarks and
details at that time. I'm still reading your discussion document and am happy
to give more feedback in the doc.

Thanks,
Yuanjian Li

On Sat, Sep 1, 2018 at 8:42 AM, Matt Cheah wrote:

> Hi everyone,
>
>
>
> I filed SPARK-25299 
> to promote discussion on how we can improve the shuffle operation in Spark.
> The basic premise is to discuss the ways we can leverage distributed
> storage to improve the reliability and isolation of Spark’s shuffle
> architecture.
>
>
>
> A few designs and a full problem statement are outlined in this architecture
> discussion document
> 
> .
>
>
>
> This is a complex problem and it would be great to get feedback from the
> community about the right direction to take this work in. Note that we have
> not yet committed to a specific implementation and architecture – there’s a
> lot that needs to be discussed for this improvement, so we hope to get as
> much input as possible before moving forward with a design.
>
>
>
> Please feel free to leave comments and suggestions on the JIRA ticket or
> on the discussion document.
>
>
>
> Thank you!
>
>
>
> -Matt Cheah
>


[Feedback Requested] SPARK-25299: Using Distributed Storage for Persisting Shuffle Data

2018-08-31 Thread Matt Cheah
Hi everyone,

 

I filed SPARK-25299 to promote discussion on how we can improve the shuffle 
operation in Spark. The basic premise is to discuss the ways we can leverage 
distributed storage to improve the reliability and isolation of Spark’s shuffle 
architecture.

 

A few designs and a full problem statement are outlined in this architecture 
discussion document.

 

This is a complex problem and it would be great to get feedback from the 
community about the right direction to take this work in. Note that we have not 
yet committed to a specific implementation and architecture – there’s a lot 
that needs to be discussed for this improvement, so we hope to get as much 
input as possible before moving forward with a design.

 

Please feel free to leave comments and suggestions on the JIRA ticket or on the 
discussion document.

 

Thank you!

 

-Matt Cheah





Re: [discuss] replacing SPIP template with Heilmeier's Catechism?

2018-08-31 Thread Ryan Blue
+1

I think this is a great suggestion. I agree a bit with Sean, but I think it
is really about mapping these questions into some of the existing
structure. These are a great way to think about projects, but they're
general and it would help to rephrase them for a software project, like
Matei's comment on considering cost. Similarly, we might rephrase
objectives to be goals/non-goals and add something to highlight that we
expect absolutely no jargon. A design sketch is needed to argue how long it
will take, what is new, and why it would be successful; adding these
questions will help people understand how to go from that design sketch to
an argument for that design. I think they will guide people to write
proposals that are persuasive and well-formed.

rb

On Fri, Aug 31, 2018 at 4:17 PM Jules Damji  wrote:

> +1
>
> One could argue that the litany of questions is really a double-click
> on the essence: why, what, how. The three interrogatives ought to be the
> essence and distillation of any proposal or technical exposition.
>
> Cheers
> Jules
>
> Sent from my iPhone
> Pardon the dumb thumb typos :)
>
> On Aug 31, 2018, at 11:23 AM, Reynold Xin  wrote:
>
> I helped craft the current SPIP template
>  last year. I was
> recently (re-)introduced to the Heilmeier Catechism, a set of questions
> DARPA developed to evaluate proposals. The set of questions are:
>
> - What are you trying to do? Articulate your objectives using absolutely
> no jargon.
> - How is it done today, and what are the limits of current practice?
> - What is new in your approach and why do you think it will be successful?
> - Who cares? If you are successful, what difference will it make?
> - What are the risks?
> - How much will it cost?
> - How long will it take?
> - What are the mid-term and final “exams” to check for success?
>
> When I read the above list, it resonates really well because they are
> almost always the same set of questions I ask myself and others before I
> decide whether something is worth doing. In some ways, our SPIP template
> tries to capture some of these (e.g. target persona), but it is not as
> explicit and well articulated.
>
> What do people think about replacing the current SPIP template with the
> above?
>
> At a high level, I think the Heilmeier's Catechism emphasizes less about
> the "how", and more the "why" and "what", which is what I'd argue SPIPs
> should be about. The hows should be left in design docs for larger projects.
>
>
>

-- 
Ryan Blue
Software Engineer
Netflix


Re: data source api v2 refactoring

2018-08-31 Thread Ryan Blue
Thanks, Reynold!

I think your API sketch looks great. I appreciate having the Table level in
the abstraction to plug into as well. I think this makes it clear what
everything does, particularly having the Stream level that represents a
configured (by ScanConfig) streaming read and can act as a factory for
individual batch scans or for continuous scans.

Wenchen, I'm not sure what you mean by doing pushdown at the table level.
It seems to mean that pushdown is specific to a batch scan or streaming
read, which seems to be what you're saying as well. Wouldn't the pushdown
happen to create a ScanConfig, which is then used as Reynold suggests?
Looking forward to seeing this PR when you get it posted. Thanks for all of
your work on this!

rb

On Fri, Aug 31, 2018 at 3:52 PM Wenchen Fan  wrote:

> Thanks, Reynold, for writing this and starting the discussion!
>
> Data source v2 was started with batch only, so we didn't pay much
> attention to the abstraction and just followed the v1 API. Now that we are
> designing the streaming API and catalog integration, the abstraction
> becomes super important.
>
> I like this proposed abstraction and have successfully prototyped it to
> make sure it works.
>
> During prototyping, I had to work around the issue that the current
> streaming engine does query optimization/planning for each micro batch.
> With this abstraction, the operator pushdown is only applied once
> per-query. In my prototype, I do the physical planning up front to get the
> pushdown result, and
> add a logical linking node that wraps the resulting physical plan node for
> the data source, and then swap that logical linking node into the logical
> plan for each batch. In the future we should just let the streaming engine
> do query optimization/planning only once.
>
> About pushdown, I think we should do it at the table level. The table
> should create a new pushdown handler to apply operator pushdown for each
> scan/stream, and create the scan/stream with the pushdown result. The
> rationale is, a table should have the same pushdown behavior regardless of
> the scan node.
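
A possible Scala reading of the table-level pushdown contract described above.
All names here (PushdownHandler, ScanConfig, and so on) are illustrative
assumptions rather than the committed DataSourceV2 API:

trait ScanConfig
trait Scan
trait Stream

trait PushdownHandler {
  // returns the filters the source cannot handle; the rest are pushed down
  def pushFilters(filters: Seq[String]): Seq[String]
  // records the columns the query actually needs
  def pruneColumns(requiredColumns: Seq[String]): Unit
  // freezes the accumulated pushdown state into an immutable config
  def build(): ScanConfig
}

trait TableWithPushdown {
  // a fresh handler per scan/stream keeps pushdown a property of the table,
  // identical regardless of which scan node triggers it
  def newPushdownHandler(): PushdownHandler
  def createScan(config: ScanConfig): Scan
  def createStream(config: ScanConfig): Stream
}
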
>
> Thanks,
> Wenchen
>
>
>
>
>
> On Fri, Aug 31, 2018 at 2:00 PM Reynold Xin  wrote:
>
>> I spent some time last week looking at the current data source v2 apis,
>> and I thought we should be a bit more buttoned up in terms of the
>> abstractions and the guarantees Spark provides. In particular, I feel we
>> need the following levels of "abstractions", to fit the use cases in Spark,
>> from batch, to streaming.
>>
>> Please don't focus on the naming at this stage. When possible, I draw
>> parallels to what similar levels are named in the currently committed api:
>>
>> 0. Format: This represents a specific format, e.g. Parquet, ORC. There is
>> currently no explicit class at this level.
>>
>> 1. Table: This should represent a logical dataset (with schema). This
>> could be just a directory on the file system, or a table in the catalog.
>> Operations on tables can include batch reads (Scan), streams, writes, and
>> potentially other operations such as deletes. The closest to the table
>> level abstraction in the current code base is the "Provider" class,
>> although Provider isn't quite a Table. This is similar to Ryan's proposed
>> design.
>>
>> 2. Stream: Specific to streaming. A stream is created out of a Table.
>> This logically represents an instance of a StreamingQuery. Pushdowns and
>> options are handled at this layer; i.e., Spark guarantees to the data source
>> implementation that pushdowns and options don't change within a Stream. Each
>> Stream consists of a sequence of scans. There is no equivalent concept
>> in the current committed code.
>>
>> 3. Scan: A physical scan -- either as part of a streaming query, or a
>> batch query. This should contain sufficient information and methods so we
>> can run a Spark job over a defined subset of the table. It's functionally
>> equivalent to an RDD, except there's no dependency on RDD so it is a
>> smaller surface. In the current code, the equivalent class would be the
>> ScanConfig, which represents the information needed, but in order to
>> execute a job, ReadSupport is needed (various methods in ReadSupport take
>> a ScanConfig).
>>
>>
>> To illustrate with pseudocode what the different levels mean, a batch
>> query would look like the following:
>>
>> val provider = reflection[Format]("parquet")
>> val table = provider.createTable(options)
>> val scan = table.createScan(scanConfig) // scanConfig includes pushdown
>> and options
>> // run tasks on executors
>>
>> A streaming micro-batch scan would look like the following:
>>
>> val provider = reflection[Format]("parquet")
>> val table = provider.createTable(options)
>> val stream = table.createStream(scanConfig)
>>
>> while(true) {
>>   val scan = stream.createScan(startOffset)
>>   // run tasks on executors
>> }
>>
>>
>> Vs the current API, the above:
>>
>> 1. Creates an explicit Table abstraction, and an explicit Scan
>> abstraction.
>>
>> 
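
To make the four levels above concrete, here is a rough Scala sketch of how
Format, Table, Stream, and Scan could fit together. Every name and signature
below is an illustrative assumption rather than the committed DataSourceV2
interface, and the continuous-scan method is only one guess at an answer to
Jungtaek's question upthread.

trait ScanConfig     // pushdown result + options, fixed for the lifetime of a Stream
trait InputPartition // placeholder for a unit of parallel work
trait Row            // placeholder for the data representation

trait Format {
  def createTable(options: Map[String, String]): Table
}

trait Table {
  def createScan(config: ScanConfig): Scan      // one-shot batch read
  def createStream(config: ScanConfig): Stream  // configured streaming read
}

trait Stream {
  // micro-batch mode: one bounded Scan per offset range
  def createScan(startOffset: Long): Scan
  // continuous mode might instead hand back a single unbounded Scan
  def createContinuousScan(startOffset: Long): Scan
}

trait Scan {
  // enough information and methods to run a Spark job over a defined subset
  // of the table, without depending on RDD directly
  def planInputPartitions(): Seq[InputPartition]
  def createReader(partition: InputPartition): Iterator[Row]
}

With these shapes, the batch and micro-batch pseudocode above reads the same
way: the Table is created once, pushdown and options are frozen into a
ScanConfig, and each micro-batch only asks the Stream for another Scan.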

Re: [discuss] replacing SPIP template with Heilmeier's Catechism?

2018-08-31 Thread Jules Damji
+1 

One could argue that the litany of questions is really a double-click on
the essence: why, what, how. The three interrogatives ought to be the essence 
and distillation of any proposal or technical exposition.

Cheers
Jules 

Sent from my iPhone
Pardon the dumb thumb typos :)

> On Aug 31, 2018, at 11:23 AM, Reynold Xin  wrote:
> 
> I helped craft the current SPIP template last year. I was recently 
> (re-)introduced to the Heilmeier Catechism, a set of questions DARPA 
> developed to evaluate proposals. The set of questions are:
> 
> - What are you trying to do? Articulate your objectives using absolutely no 
> jargon.
> - How is it done today, and what are the limits of current practice?
> - What is new in your approach and why do you think it will be successful?
> - Who cares? If you are successful, what difference will it make?
> - What are the risks?
> - How much will it cost?
> - How long will it take?
> - What are the mid-term and final “exams” to check for success?
> 
> When I read the above list, it resonates really well because they are almost 
> always the same set of questions I ask myself and others before I decide 
> whether something is worth doing. In some ways, our SPIP template tries to 
> capture some of these (e.g. target persona), but it is not as explicit and well
> articulated. 
> 
> What do people think about replacing the current SPIP template with the 
> above? 
> 
> At a high level, I think the Heilmeier's Catechism emphasizes less about the 
> "how", and more the "why" and "what", which is what I'd argue SPIPs should be 
> about. The hows should be left in design docs for larger projects.
> 
> 


Re: Nightly Builds in the docs (in spark-nightly/spark-master-bin/latest? Can't seem to find it)

2018-08-31 Thread Marcelo Vanzin
I think there still might be an active job publishing stuff. Here's a
pretty recent build from master:

https://dist.apache.org/repos/dist/dev/spark/2.4.0-SNAPSHOT-2018_08_31_12_02-32da87d-docs/_site/index.html

But it seems only docs are being published, which makes me think it's
those builds that Shane mentioned in a recent e-mail.

On Fri, Aug 31, 2018 at 1:29 PM Sean Owen  wrote:
>
> There are some builds there, but they're not recent:
>
> https://people.apache.org/~pwendell/spark-nightly/
>
> We can either get the jobs running again, or just knock this on the head and 
> remove it.
>
> Anyone know how to get it running again and want to? I have a feeling Shane 
> knows if anyone. Or does anyone know if we even need these at this point? if 
> nobody has complained in about a year, unlikely.
>
> On Fri, Aug 31, 2018 at 3:15 PM Cody Koeninger  wrote:
>>
>> Just got a question about this on the user list as well.
>>
>> Worth removing that link to pwendell's directory from the docs?
>>
>> On Sun, Jan 21, 2018 at 12:13 PM, Jacek Laskowski  wrote:
>> > Hi,
>> >
>> > http://spark.apache.org/developer-tools.html#nightly-builds reads:
>> >
>> >> Spark nightly packages are available at:
>> >> Latest master build:
>> >> https://people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest
>> >
>> > but the URL gives 404. Is this intended?
>> >
>> > Pozdrawiam,
>> > Jacek Laskowski
>> > 
>> > https://about.me/JacekLaskowski
>> > Mastering Spark SQL https://bit.ly/mastering-spark-sql
>> > Spark Structured Streaming https://bit.ly/spark-structured-streaming
>> > Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
>> > Follow me at https://twitter.com/jaceklaskowski
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>


-- 
Marcelo

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Nightly Builds in the docs (in spark-nightly/spark-master-bin/latest? Can't seem to find it)

2018-08-31 Thread Matei Zaharia
If we actually build stuff nightly in Jenkins, it wouldn’t hurt to publish them 
IMO. It helps more people try master and test it.

> On Aug 31, 2018, at 1:28 PM, Sean Owen  wrote:
> 
> There are some builds there, but they're not recent:
> 
> https://people.apache.org/~pwendell/spark-nightly/
> 
> We can either get the jobs running again, or just knock this on the head and 
> remove it.
> 
> Anyone know how to get it running again and want to? I have a feeling Shane 
> knows if anyone. Or does anyone know if we even need these at this point? if 
> nobody has complained in about a year, unlikely.
> 
> On Fri, Aug 31, 2018 at 3:15 PM Cody Koeninger  wrote:
> Just got a question about this on the user list as well.
> 
> Worth removing that link to pwendell's directory from the docs?
> 
> On Sun, Jan 21, 2018 at 12:13 PM, Jacek Laskowski  wrote:
> > Hi,
> >
> > http://spark.apache.org/developer-tools.html#nightly-builds reads:
> >
> >> Spark nightly packages are available at:
> >> Latest master build:
> >> https://people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest
> >
> > but the URL gives 404. Is this intended?
> >
> > Pozdrawiam,
> > Jacek Laskowski
> > 
> > https://about.me/JacekLaskowski
> > Mastering Spark SQL https://bit.ly/mastering-spark-sql
> > Spark Structured Streaming https://bit.ly/spark-structured-streaming
> > Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
> > Follow me at https://twitter.com/jaceklaskowski
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Nightly Builds in the docs (in spark-nightly/spark-master-bin/latest? Can't seem to find it)

2018-08-31 Thread Sean Owen
There are some builds there, but they're not recent:

https://people.apache.org/~pwendell/spark-nightly/

We can either get the jobs running again, or just knock this on the head
and remove it.

Anyone know how to get it running again and want to? I have a feeling Shane
knows if anyone. Or does anyone know if we even need these at this point?
if nobody has complained in about a year, unlikely.

On Fri, Aug 31, 2018 at 3:15 PM Cody Koeninger  wrote:

> Just got a question about this on the user list as well.
>
> Worth removing that link to pwendell's directory from the docs?
>
> On Sun, Jan 21, 2018 at 12:13 PM, Jacek Laskowski  wrote:
> > Hi,
> >
> > http://spark.apache.org/developer-tools.html#nightly-builds reads:
> >
> >> Spark nightly packages are available at:
> >> Latest master build:
> >>
> https://people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest
> >
> > but the URL gives 404. Is this intended?
> >
> > Pozdrawiam,
> > Jacek Laskowski
> > 
> > https://about.me/JacekLaskowski
> > Mastering Spark SQL https://bit.ly/mastering-spark-sql
> > Spark Structured Streaming https://bit.ly/spark-structured-streaming
> > Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
> > Follow me at https://twitter.com/jaceklaskowski
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [discuss] replacing SPIP template with Heilmeier's Catechism?

2018-08-31 Thread Matei Zaharia
I like this as well. Regarding “cost”, I think the equivalent concept for us is 
impact on the rest of the project (say maintenance cost down the line or 
whatever). This could be captured in the “risks” too, but it’s a slightly 
different concept. We should probably just clarify what we mean with each 
question.

Matei

> On Aug 31, 2018, at 1:09 PM, Cody Koeninger  wrote:
> 
> +1 to Sean's comment
> 
> On Fri, Aug 31, 2018 at 2:48 PM, Reynold Xin  wrote:
>> Yup all good points. One way I've done it in the past is to have an appendix
>> section for design sketch, as an expansion to the question "- What is new in
>> your approach and why do you think it will be successful?"
>> 
>> On Fri, Aug 31, 2018 at 12:47 PM Marcelo Vanzin
>>  wrote:
>>> 
>>> I like the questions (aside maybe from the cost one which perhaps does
>>> not matter much here), especially since they encourage explaining
>>> things in a more plain language than generally used by specs.
>>> 
>>> But I don't think we can ignore design aspects; it's been my
>>> observation that a good portion of SPIPs, when proposed, already have
>>> at the very least some sort of implementation (even if it's a barely
>>> working p.o.c.), so it would also be good to have that information up
>>> front if it's available.
>>> 
>>> (So I guess I'm just repeating Sean's reply.)
>>> 
>>> On Fri, Aug 31, 2018 at 11:23 AM Reynold Xin  wrote:
 
 I helped craft the current SPIP template last year. I was recently
 (re-)introduced to the Heilmeier Catechism, a set of questions DARPA
 developed to evaluate proposals. The set of questions are:
 
 - What are you trying to do? Articulate your objectives using absolutely
 no jargon.
 - How is it done today, and what are the limits of current practice?
 - What is new in your approach and why do you think it will be
 successful?
 - Who cares? If you are successful, what difference will it make?
 - What are the risks?
 - How much will it cost?
 - How long will it take?
 - What are the mid-term and final “exams” to check for success?
 
 When I read the above list, it resonates really well because they are
 almost always the same set of questions I ask myself and others before I
 decide whether something is worth doing. In some ways, our SPIP template
 tries to capture some of these (e.g. target persona), but are not as
 explicit and well articulated.
 
 What do people think about replacing the current SPIP template with the
 above?
 
 At a high level, I think the Heilmeier's Catechism emphasizes less about
 the "how", and more the "why" and "what", which is what I'd argue SPIPs
 should be about. The hows should be left in design docs for larger 
 projects.
 
 
>>> 
>>> 
>>> --
>>> Marcelo
>>> 
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> 
>> 
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Nightly Builds in the docs (in spark-nightly/spark-master-bin/latest? Can't seem to find it)

2018-08-31 Thread Cody Koeninger
Just got a question about this on the user list as well.

Worth removing that link to pwendell's directory from the docs?

On Sun, Jan 21, 2018 at 12:13 PM, Jacek Laskowski  wrote:
> Hi,
>
> http://spark.apache.org/developer-tools.html#nightly-builds reads:
>
>> Spark nightly packages are available at:
>> Latest master build:
>> https://people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest
>
> but the URL gives 404. Is this intended?
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://about.me/JacekLaskowski
> Mastering Spark SQL https://bit.ly/mastering-spark-sql
> Spark Structured Streaming https://bit.ly/spark-structured-streaming
> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
> Follow me at https://twitter.com/jaceklaskowski

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [discuss] replacing SPIP template with Heilmeier's Catechism?

2018-08-31 Thread Cody Koeninger
+1 to Sean's comment

On Fri, Aug 31, 2018 at 2:48 PM, Reynold Xin  wrote:
> Yup all good points. One way I've done it in the past is to have an appendix
> section for design sketch, as an expansion to the question "- What is new in
> your approach and why do you think it will be successful?"
>
> On Fri, Aug 31, 2018 at 12:47 PM Marcelo Vanzin
>  wrote:
>>
>> I like the questions (aside maybe from the cost one which perhaps does
>> not matter much here), especially since they encourage explaining
>> things in a more plain language than generally used by specs.
>>
>> But I don't think we can ignore design aspects; it's been my
>> observation that a good portion of SPIPs, when proposed, already have
>> at the very least some sort of implementation (even if it's a barely
>> working p.o.c.), so it would also be good to have that information up
>> front if it's available.
>>
>> (So I guess I'm just repeating Sean's reply.)
>>
>> On Fri, Aug 31, 2018 at 11:23 AM Reynold Xin  wrote:
>> >
>> > I helped craft the current SPIP template last year. I was recently
>> > (re-)introduced to the Heilmeier Catechism, a set of questions DARPA
>> > developed to evaluate proposals. The set of questions are:
>> >
>> > - What are you trying to do? Articulate your objectives using absolutely
>> > no jargon.
>> > - How is it done today, and what are the limits of current practice?
>> > - What is new in your approach and why do you think it will be
>> > successful?
>> > - Who cares? If you are successful, what difference will it make?
>> > - What are the risks?
>> > - How much will it cost?
>> > - How long will it take?
>> > - What are the mid-term and final “exams” to check for success?
>> >
>> > When I read the above list, it resonates really well because they are
>> > almost always the same set of questions I ask myself and others before I
>> > decide whether something is worth doing. In some ways, our SPIP template
>> > tries to capture some of these (e.g. target persona), but it is not as
>> > explicit and well articulated.
>> >
>> > What do people think about replacing the current SPIP template with the
>> > above?
>> >
>> > At a high level, I think the Heilmeier's Catechism emphasizes less about
>> > the "how", and more the "why" and "what", which is what I'd argue SPIPs
>> > should be about. The hows should be left in design docs for larger 
>> > projects.
>> >
>> >
>>
>>
>> --
>> Marcelo
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [discuss] replacing SPIP template with Heilmeier's Catechism?

2018-08-31 Thread Reynold Xin
Yup all good points. One way I've done it in the past is to have an
appendix section for design sketch, as an expansion to the question "- What
is new in your approach and why do you think it will be successful?"

On Fri, Aug 31, 2018 at 12:47 PM Marcelo Vanzin 
wrote:

> I like the questions (aside maybe from the cost one which perhaps does
> not matter much here), especially since they encourage explaining
> things in a more plain language than generally used by specs.
>
> But I don't think we can ignore design aspects; it's been my
> observation that a good portion of SPIPs, when proposed, already have
> at the very least some sort of implementation (even if it's a barely
> working p.o.c.), so it would also be good to have that information up
> front if it's available.
>
> (So I guess I'm just repeating Sean's reply.)
>
> On Fri, Aug 31, 2018 at 11:23 AM Reynold Xin  wrote:
> >
> > I helped craft the current SPIP template last year. I was recently
> (re-)introduced to the Heilmeier Catechism, a set of questions DARPA
> developed to evaluate proposals. The set of questions are:
> >
> > - What are you trying to do? Articulate your objectives using absolutely
> no jargon.
> > - How is it done today, and what are the limits of current practice?
> > - What is new in your approach and why do you think it will be
> successful?
> > - Who cares? If you are successful, what difference will it make?
> > - What are the risks?
> > - How much will it cost?
> > - How long will it take?
> > - What are the mid-term and final “exams” to check for success?
> >
> > When I read the above list, it resonates really well because they are
> almost always the same set of questions I ask myself and others before I
> decide whether something is worth doing. In some ways, our SPIP template
> tries to capture some of these (e.g. target persona), but it is not as
> explicit and well articulated.
> >
> > What do people think about replacing the current SPIP template with the
> above?
> >
> > At a high level, I think the Heilmeier's Catechism emphasizes less about
> the "how", and more the "why" and "what", which is what I'd argue SPIPs
> should be about. The hows should be left in design docs for larger projects.
> >
> >
>
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [discuss] replacing SPIP template with Heilmeier's Catechism?

2018-08-31 Thread Marcelo Vanzin
I like the questions (aside maybe from the cost one which perhaps does
not matter much here), especially since they encourage explaining
things in a more plain language than generally used by specs.

But I don't think we can ignore design aspects; it's been my
observation that a good portion of SPIPs, when proposed, already have
at the very least some sort of implementation (even if it's a barely
working p.o.c.), so it would also be good to have that information up
front if it's available.

(So I guess I'm just repeating Sean's reply.)

On Fri, Aug 31, 2018 at 11:23 AM Reynold Xin  wrote:
>
> I helped craft the current SPIP template last year. I was recently 
> (re-)introduced to the Heilmeier Catechism, a set of questions DARPA 
> developed to evaluate proposals. The set of questions are:
>
> - What are you trying to do? Articulate your objectives using absolutely no 
> jargon.
> - How is it done today, and what are the limits of current practice?
> - What is new in your approach and why do you think it will be successful?
> - Who cares? If you are successful, what difference will it make?
> - What are the risks?
> - How much will it cost?
> - How long will it take?
> - What are the mid-term and final “exams” to check for success?
>
> When I read the above list, it resonates really well because they are almost 
> always the same set of questions I ask myself and others before I decide 
> whether something is worth doing. In some ways, our SPIP template tries to 
> capture some of these (e.g. target persona), but it is not as explicit and well
> articulated.
>
> What do people think about replacing the current SPIP template with the above?
>
> At a high level, I think the Heilmeier's Catechism emphasizes less about the 
> "how", and more the "why" and "what", which is what I'd argue SPIPs should be 
> about. The hows should be left in design docs for larger projects.
>
>


-- 
Marcelo

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: TimSort bug

2018-08-31 Thread Reynold Xin
Thanks for looking into this, Sean! Loved the tl;dr.


On Fri, Aug 31, 2018 at 12:28 PM Sean Owen  wrote:

> TL;DR - We already had the fix from SPARK-5984. The delta from the current
> JDK implementation to Spark's looks actually inconsequential. No action
> required AFAICT.
>
> On Fri, Aug 31, 2018 at 12:30 PM Sean Owen  wrote:
>
>> I looked into this, because it sure sounds like a similar issue from a
>> few years ago that was fixed in
>> https://issues.apache.org/jira/browse/SPARK-5984
>> The change in that JIRA actually looks almost identical to the change
>> mentioned in the JDK bug:
>> http://hg.openjdk.java.net/jdk/jdk/rev/3a6d47df8239
>>
>> Reading the paper
>> http://drops.dagstuhl.de/opus/volltexte/2018/9467/pdf/LIPIcs-ESA-2018-4.pdf 
>> in
>> section 5 a little more, I think they are saying that there were two ways
>> to fix the original problem: a) fix the invariant, or b) increase some data
>> structure size. Java did the latter, it seems, and now they've shown it's
>> actually still busted. However Python and the original paper did the
>> former, and we seem to have copied that fix from
>> http://envisage-project.eu/proving-android-java-and-python-sorting-algorithm-is-broken-and-how-to-fix-it/
>>  My
>> understanding is that this still works, and is what Java *now* implements.
>>
>> The only difference I can see in implementation is an extra check for a
>> negative array index before dereferencing array[n]. We can add that for
>> full consistency with the JVM change, I suppose. I don't think it's
>> relevant to the problem reported in the paper, but could be an issue
>> otherwise. The JVM implementation suggests it thinks this needs to be
>> guarded.
>>
>> I did hack together a crude translation of the paper's bug reproduction
>> at http://igm.univ-mlv.fr/~pivoteau/Timsort/Test.java by copying some
>> Spark test code. It does need a huge amount of memory to run (>32g.. ended
>> up at 44g) so not so feasible to put in the test suite. Running it on Spark
>> master nets a JVM crash:
>>
>> # Problematic frame:
>> # J 10195 C2
>> org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(Ljava/lang/Object;IILjava/util/Comparator;)I
>> (198 bytes) @ 0x7ff64dd9a262 [0x7ff64dd99f20+0x342]
>>
>> That's... not good, but I can't tell if it's really due to this same issue
>> or something else going on in the off-heap code. Making the tiny change I
>> mentioned above doesn't do anything.
>>
>> On Fri, Aug 31, 2018 at 2:37 AM Reynold Xin  wrote:
>>
>>> “As a byproduct of our study, we uncover a bug in the Java
>>> implementation that can cause the sorting method to fail during the
>>> execution.”
>>>
>>> http://drops.dagstuhl.de/opus/volltexte/2018/9467/
>>>
>>> This might impact Spark since we took the Java based TimSort
>>> implementation. I have seen in the wild TimSort failing in the past. Maybe
>>> this is the cause.
>>>
>>


Re: TimSort bug

2018-08-31 Thread Sean Owen
TL;DR - We already had the fix from SPARK-5984. The delta from the current
JDK implementation to Spark's looks actually inconsequential. No action
required AFAICT.

On Fri, Aug 31, 2018 at 12:30 PM Sean Owen  wrote:

> I looked into this, because it sure sounds like a similar issue from a few
> years ago that was fixed in
> https://issues.apache.org/jira/browse/SPARK-5984
> The change in that JIRA actually looks almost identical to the change
> mentioned in the JDK bug:
> http://hg.openjdk.java.net/jdk/jdk/rev/3a6d47df8239
>
> Reading the paper
> http://drops.dagstuhl.de/opus/volltexte/2018/9467/pdf/LIPIcs-ESA-2018-4.pdf in
> section 5 a little more, I think they are saying that there were two ways
> to fix the original problem: a) fix the invariant, or b) increase some data
> structure size. Java did the latter, it seems, and now they've shown it's
> actually still busted. However Python and the original paper did the
> former, and we seem to have copied that fix from
> http://envisage-project.eu/proving-android-java-and-python-sorting-algorithm-is-broken-and-how-to-fix-it/
>  My
> understanding is that this still works, and is what Java *now* implements.
>
> The only difference I can see in implementation is an extra check for a
> negative array index before dereferencing array[n]. We can add that for
> full consistency with the JVM change, I suppose. I don't think it's
> relevant to the problem reported in the paper, but could be an issue
> otherwise. The JVM implementation suggests it thinks this needs to be
> guarded.
>
> I did hack together a crude translation of the paper's bug reproduction at
> http://igm.univ-mlv.fr/~pivoteau/Timsort/Test.java by copying some Spark
> test code. It does need a huge amount of memory to run (>32g.. ended up at
> 44g) so not so feasible to put in the test suite. Running it on Spark
> master nets a JVM crash:
>
> # Problematic frame:
> # J 10195 C2
> org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(Ljava/lang/Object;IILjava/util/Comparator;)I
> (198 bytes) @ 0x7ff64dd9a262 [0x7ff64dd99f20+0x342]
>
> That's... not good, but I can't tell if it's really due to this same issue
> or something else going on in the off-heap code. Making the tiny change I
> mentioned above doesn't do anything.
>
> On Fri, Aug 31, 2018 at 2:37 AM Reynold Xin  wrote:
>
>> “As a byproduct of our study, we uncover a bug in the Java
>> implementation that can cause the sorting method to fail during the
>> execution.”
>>
>> http://drops.dagstuhl.de/opus/volltexte/2018/9467/
>>
>> This might impact Spark since we took the Java based TimSort
>> implementation. I have seen in the wild TimSort failing in the past. Maybe
>> this is the cause.
>>
>


Re: [discuss] replacing SPIP template with Heilmeier's Catechism?

2018-08-31 Thread Sean Owen
Looks good. From the existing template at
https://spark.apache.org/improvement-proposals.html I might keep points
about design sketch, API, and non goals. And we don't need a cost section.

On Fri, Aug 31, 2018, 1:23 PM Reynold Xin  wrote:

> I helped craft the current SPIP template
>  last year. I was
> recently (re-)introduced to the Heilmeier Catechism, a set of questions
> DARPA developed to evaluate proposals. The set of questions are:
>
> - What are you trying to do? Articulate your objectives using absolutely
> no jargon.
> - How is it done today, and what are the limits of current practice?
> - What is new in your approach and why do you think it will be successful?
> - Who cares? If you are successful, what difference will it make?
> - What are the risks?
> - How much will it cost?
> - How long will it take?
> - What are the mid-term and final “exams” to check for success?
>
> When I read the above list, it resonates really well because they are
> almost always the same set of questions I ask myself and others before I
> decide whether something is worth doing. In some ways, our SPIP template
> tries to capture some of these (e.g. target persona), but are not as
> explicit and well articulated.
>
> What do people think about replacing the current SPIP template with the
> above?
>
> At a high level, I think the Heilmeier's Catechism emphasizes less about
> the "how", and more the "why" and "what", which is what I'd argue SPIPs
> should be about. The hows should be left in design docs for larger projects.
>
>
>


[discuss] replacing SPIP template with Heilmeier's Catechism?

2018-08-31 Thread Reynold Xin
I helped craft the current SPIP template
 last year. I was
recently (re-)introduced to the Heilmeier Catechism, a set of questions
DARPA developed to evaluate proposals. The set of questions are:

- What are you trying to do? Articulate your objectives using absolutely no
jargon.
- How is it done today, and what are the limits of current practice?
- What is new in your approach and why do you think it will be successful?
- Who cares? If you are successful, what difference will it make?
- What are the risks?
- How much will it cost?
- How long will it take?
- What are the mid-term and final “exams” to check for success?

When I read the above list, it resonates really well because they are
almost always the same set of questions I ask myself and others before I
decide whether something is worth doing. In some ways, our SPIP template
tries to capture some of these (e.g. target persona), but it is not as
explicit and well articulated.

What do people think about replacing the current SPIP template with the
above?

At a high level, I think the Heilmeier's Catechism emphasizes less about
the "how", and more the "why" and "what", which is what I'd argue SPIPs
should be about. The hows should be left in design docs for larger projects.


Re: TimSort bug

2018-08-31 Thread Sean Owen
I looked into this, because it sure sounds like a similar issue from a few
years ago that was fixed in https://issues.apache.org/jira/browse/SPARK-5984

The change in that JIRA actually looks almost identical to the change
mentioned in the JDK bug:
http://hg.openjdk.java.net/jdk/jdk/rev/3a6d47df8239

Reading the paper
http://drops.dagstuhl.de/opus/volltexte/2018/9467/pdf/LIPIcs-ESA-2018-4.pdf in
section 5 a little more, I think they are saying that there were two ways
to fix the original problem: a) fix the invariant, or b) increase some data
structure size. Java did the latter, it seems, and now they've shown it's
actually still busted. However Python and the original paper did the
former, and we seem to have copied that fix from
http://envisage-project.eu/proving-android-java-and-python-sorting-algorithm-is-broken-and-how-to-fix-it/
My
understanding is that this still works, and is what Java *now* implements.

The only difference I can see in implementation is an extra check for a
negative array index before dereferencing array[n]. We can add that for
full consistency with the JVM change, I suppose. I don't think it's
relevant to the problem reported in the paper, but could be an issue
otherwise. The JVM implementation suggests it thinks this needs to be
guarded.
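
For readers following along, here is an illustrative Scala transcription of the
strengthened invariant check (the "fix the invariant" approach that SPARK-5984
adopted). Spark's real implementation is Java code in
org.apache.spark.util.collection.TimSort and differs in detail; this sketch
only tracks run-stack bookkeeping and does no actual element merging.

class RunStack(capacity: Int) {
  private val runBase = new Array[Int](capacity)
  private val runLen  = new Array[Int](capacity)
  private var stackSize = 0

  def push(base: Int, len: Int): Unit = {
    runBase(stackSize) = base
    runLen(stackSize) = len
    stackSize += 1
  }

  // Stand-in for TimSort's mergeAt: merge runs i and i + 1 (bookkeeping only).
  private def mergeAt(i: Int): Unit = {
    runLen(i) += runLen(i + 1)
    if (i == stackSize - 3) {
      runBase(i + 1) = runBase(i + 2)
      runLen(i + 1) = runLen(i + 2)
    }
    stackSize -= 1
  }

  // The strengthened check: note the n >= 1 / n >= 2 guards before indexing
  // runLen(n - 1) and runLen(n - 2).
  def mergeCollapse(): Unit = {
    while (stackSize > 1) {
      var n = stackSize - 2
      if ((n >= 1 && runLen(n - 1) <= runLen(n) + runLen(n + 1)) ||
          (n >= 2 && runLen(n - 2) <= runLen(n - 1) + runLen(n))) {
        if (runLen(n - 1) < runLen(n + 1)) n -= 1
      } else if (runLen(n) > runLen(n + 1)) {
        return // invariants hold for the whole stack
      }
      mergeAt(n)
    }
  }
}
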

I did hack together a crude translation of the paper's bug reproduction at
http://igm.univ-mlv.fr/~pivoteau/Timsort/Test.java by copying some Spark
test code. It does need a huge amount of memory to run (>32g.. ended up at
44g) so not so feasible to put in the test suite. Running it on Spark
master nets a JVM crash:

# Problematic frame:
# J 10195 C2
org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(Ljava/lang/Object;IILjava/util/Comparator;)I
(198 bytes) @ 0x7ff64dd9a262 [0x7ff64dd99f20+0x342]

That's... not good, but I can't tell if it's really due to this same issue
or something else going on in the off-heap code. Making the tiny change I
mentioned above doesn't do anything.

On Fri, Aug 31, 2018 at 2:37 AM Reynold Xin  wrote:

> “As a byproduct of our study, we uncover a bug in the Java implementation
> that can cause the sorting method to fail during the execution.”
>
> http://drops.dagstuhl.de/opus/volltexte/2018/9467/
>
> This might impact Spark since we took the Java based TimSort
> implementation. I have seen in the wild TimSort failing in the past. Maybe
> this is the cause.
>


Re: SPIP: Executor Plugin (SPARK-24918)

2018-08-31 Thread Reynold Xin
Both ahead of time, or just in time. Just like a normal Spark closure.
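
For concreteness, here is a rough sketch of the two pieces being discussed.
Neither is the API actually proposed in SPARK-24918 (see the SPIP doc linked
below in the thread); every name here is hypothetical.

// Piece 1: a plugin registered up front, e.g. via a config key along the
// lines of "spark.executor.plugins" (key name assumed here), and instantiated
// on each executor.
trait ExecutorPlugin {
  def init(): Unit      // called when the executor starts, or when the plugin is installed
  def shutdown(): Unit  // called when the executor shuts down
}

// Piece 2: the "Spark is already up" case -- conceptually just shipping a
// closure to the executors like any other Spark closure. A crude approximation
// today (not guaranteed to reach every executor):
//   sc.parallelize(1 to 1000, 1000).foreachPartition(_ => MyPlugin.ensureStarted())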


On Fri, Aug 31, 2018 at 10:18 AM Nihar Sheth  wrote:

> Hi @rxin,
>
> Just to make sure I understand your desired use case, are you suggesting a
> way (for the driver) to call, at any given time, a general method that can
> be defined ahead of time on the executors?
>
> On Thu, Aug 30, 2018 at 11:11 PM, Reynold Xin  wrote:
>
>> I actually had a similar use case a while ago, but not entirely the same.
>> In my use case, Spark is already up, but I want to make sure all existing
>> (and new) executors run some specific code. Can we update the API to
>> support that? I think that's doable if we split the design into two: one is
>> the ability to do what I just mentioned, and second is the ability to
>> register via config class when Spark starts to run the code.
>>
>>
>> On Thu, Aug 30, 2018 at 11:01 PM Felix Cheung 
>> wrote:
>>
>>> +1
>>> --
>>> *From:* Mridul Muralidharan 
>>> *Sent:* Wednesday, August 29, 2018 1:27:27 PM
>>> *To:* dev@spark.apache.org
>>> *Subject:* Re: SPIP: Executor Plugin (SPARK-24918)
>>>
>>> +1
>>> I left a couple of comments in NiharS's PR, but this is very useful to
>>> have in spark !
>>>
>>> Regards,
>>> Mridul
>>> On Fri, Aug 3, 2018 at 10:00 AM Imran Rashid
>>>  wrote:
>>> >
>>> > I'd like to propose adding a plugin api for Executors, primarily for
>>> instrumentation and debugging (
>>> https://issues.apache.org/jira/browse/SPARK-24918).  The changes are
>>> small, but as its adding a new api, it might be spip-worthy.  I mentioned
>>> it as well in a recent email I sent about memory monitoring
>>> >
>>> > The spip proposal is here (and attached to the jira as well):
>>> https://docs.google.com/document/d/1a20gHGMyRbCM8aicvq4LhWfQmoA5cbHBQtyqIA2hgtc/edit?usp=sharing
>>> >
>>> > There are already some comments on the jira and pr, and I hope to get
>>> more thoughts and opinions on it.
>>> >
>>> > thanks,
>>> > Imran
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>


Re: SPIP: Executor Plugin (SPARK-24918)

2018-08-31 Thread Nihar Sheth
Hi @rxin,

Just to make sure I understand your desired use case, are you suggesting a
way (for the driver) to call, at any given time, a general method that can
be defined ahead of time on the executors?

On Thu, Aug 30, 2018 at 11:11 PM, Reynold Xin  wrote:

> I actually had a similar use case a while ago, but not entirely the same.
> In my use case, Spark is already up, but I want to make sure all existing
> (and new) executors run some specific code. Can we update the API to
> support that? I think that's doable if we split the design into two: one is
> the ability to do what I just mentioned, and second is the ability to
> register via config class when Spark starts to run the code.
>
>
> On Thu, Aug 30, 2018 at 11:01 PM Felix Cheung 
> wrote:
>
>> +1
>> --
>> *From:* Mridul Muralidharan 
>> *Sent:* Wednesday, August 29, 2018 1:27:27 PM
>> *To:* dev@spark.apache.org
>> *Subject:* Re: SPIP: Executor Plugin (SPARK-24918)
>>
>> +1
>> I left a couple of comments in NiharS's PR, but this is very useful to
>> have in spark !
>>
>> Regards,
>> Mridul
>> On Fri, Aug 3, 2018 at 10:00 AM Imran Rashid
>>  wrote:
>> >
>> > I'd like to propose adding a plugin api for Executors, primarily for
>> instrumentation and debugging (https://issues.apache.org/
>> jira/browse/SPARK-24918).  The changes are small, but as its adding a
>> new api, it might be spip-worthy.  I mentioned it as well in a recent email
>> I sent about memory monitoring
>> >
>> > The spip proposal is here (and attached to the jira as well):
>> https://docs.google.com/document/d/1a20gHGMyRbCM8aicvq4LhWfQmoA5c
>> bHBQtyqIA2hgtc/edit?usp=sharing
>> >
>> > There are already some comments on the jira and pr, and I hope to get
>> more thoughts and opinions on it.
>> >
>> > thanks,
>> > Imran
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: Upgrade SBT to the latest

2018-08-31 Thread Ted Yu
+1
 Original message From: Sean Owen  Date: 
8/31/18  6:40 AM  (GMT-08:00) To: Darcy Shen  Cc: 
dev@spark.apache.org Subject: Re: Upgrade SBT to the latest 
Certainly worthwhile. I think this should target Spark 3, which should come 
after 2.4, which is itself already just about ready to test and release.
On Fri, Aug 31, 2018 at 8:16 AM Darcy Shen  wrote:

SBT 1.x has been ready for a long time.

We may spare some time upgrading sbt for Spark.

An umbrella JIRA, like the one for Scala 2.12, should be created.





Re: mllib + SQL

2018-08-31 Thread Sean Owen
My $0.02 -- this isn't worthwhile.

Yes, there are ML-in-SQL tools. I'm thinking of MADlib for example. I think
these hold over from days when someone's only interface to a data warehouse
was SQL, and so there had to be SQL-language support for invoking ML jobs.
There was no programmatic alternative.

There's nothing particularly helpful about SQL as a language for expressing
this, versus simply writing operations in a high-level programming language.

Spark is that programmatic paradigm, and offers a more general way to
express ETL, ML and SQL within their own appropriate DSLs. There's no need
to also shoehorn Spark ML into Spark SQL.

I also think there's a bit of false abstraction here. The nice thing about
SQL-only access to these functions is it sounds much simpler, and
accessible to people that only know SQL and nothing about Python or JVMs.
In practice, using Spark means having some basic awareness of its
distributed execution environment. SQL-only analysts would struggle to be
effective with SQL-only access to Spark.
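
As a point of reference, the programmatic, DataFrame-based paradigm described
above looks roughly like the following spark.ml snippet (the input path and
column names are made up for illustration):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ml-example").getOrCreate()
val training = spark.read.parquet("/data/training")  // hypothetical dataset

// assemble raw columns into a feature vector, then fit a model in a pipeline
val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2", "f3"))
  .setOutputCol("features")
val lr = new LogisticRegression()
  .setLabelCol("label")
  .setFeaturesCol("features")

val model = new Pipeline().setStages(Array(assembler, lr)).fit(training)
model.transform(training).select("label", "prediction").show()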

On Fri, Aug 31, 2018 at 5:05 AM Hemant Bhanawat 
wrote:

> We allow our users to interact with the Spark cluster using SQL queries only.
> That's easy for them. MLlib does not have SQL extensions, so we cannot
> expose it to our users.
>
> SQL extensions can further accelerate MLlib's adoption. See
> https://cloud.google.com/bigquery/docs/bigqueryml-intro.
>
> Hemant
>
>
> On Thu, Aug 30, 2018 at 9:41 PM William Benton  wrote:
>
>> What are you interested in accomplishing?
>>
>> The spark.ml package has provided a machine learning API based on
>> DataFrames for quite some time.  If you are interested in mixing query
>> processing and machine learning, this is certainly the best place to start.
>>
>> See here:  https://spark.apache.org/docs/latest/ml-guide.html
>>
>>
>> best,
>> wb
>>
>>
>>
>> On Thu, Aug 30, 2018 at 1:45 AM Hemant Bhanawat 
>> wrote:
>>
>>> Is there a plan to support SQL extensions for mllib? Or is there an
>>> effort already underway?
>>>
>>> Any information is appreciated.
>>>
>>> Thanks in advance.
>>> Hemant
>>>
>>


Re: Upgrade SBT to the latest

2018-08-31 Thread Sean Owen
Certainly worthwhile. I think this should target Spark 3, which should come
after 2.4, which is itself already just about ready to test and release.

On Fri, Aug 31, 2018 at 8:16 AM Darcy Shen  wrote:

>
> SBT 1.x has been ready for a long time.
>
> We may spare some time upgrading sbt for Spark.
>
> An umbrella JIRA, like the one for Scala 2.12, should be created.
>
>
>


Upgrade SBT to the latest

2018-08-31 Thread Darcy Shen
SBT 1.x has been ready for a long time. We may spare some time upgrading sbt for
Spark. An umbrella JIRA, like the one for Scala 2.12, should be created.

Re: mllib + SQL

2018-08-31 Thread Hemant Bhanawat
BTW, I can contribute if there is already an effort going on somewhere.

On Fri, Aug 31, 2018 at 3:35 PM Hemant Bhanawat 
wrote:

> We allow our users to interact with the Spark cluster using SQL queries only.
> That's easy for them. MLlib does not have SQL extensions, so we cannot
> expose it to our users.
>
> SQL extensions can further accelerate MLlib's adoption. See
> https://cloud.google.com/bigquery/docs/bigqueryml-intro.
>
> Hemant
>
>
> On Thu, Aug 30, 2018 at 9:41 PM William Benton  wrote:
>
>> What are you interested in accomplishing?
>>
>> The spark.ml package has provided a machine learning API based on
>> DataFrames for quite some time.  If you are interested in mixing query
>> processing and machine learning, this is certainly the best place to start.
>>
>> See here:  https://spark.apache.org/docs/latest/ml-guide.html
>>
>>
>> best,
>> wb
>>
>>
>>
>> On Thu, Aug 30, 2018 at 1:45 AM Hemant Bhanawat 
>> wrote:
>>
>>> Is there a plan to support SQL extensions for mllib? Or is there an
>>> effort already underway?
>>>
>>> Any information is appreciated.
>>>
>>> Thanks in advance.
>>> Hemant
>>>
>>


Re: mllib + SQL

2018-08-31 Thread Hemant Bhanawat
We allow our users to interact with the Spark cluster using SQL queries only.
That's easy for them. MLlib does not have SQL extensions, so we cannot
expose it to our users.

SQL extensions can further accelerate MLlib's adoption. See
https://cloud.google.com/bigquery/docs/bigqueryml-intro.

Hemant


On Thu, Aug 30, 2018 at 9:41 PM William Benton  wrote:

> What are you interested in accomplishing?
>
> The spark.ml package has provided a machine learning API based on
> DataFrames for quite some time.  If you are interested in mixing query
> processing and machine learning, this is certainly the best place to start.
>
> See here:  https://spark.apache.org/docs/latest/ml-guide.html
>
>
> best,
> wb
>
>
>
> On Thu, Aug 30, 2018 at 1:45 AM Hemant Bhanawat 
> wrote:
>
>> Is there a plan to support SQL extensions for mllib? Or is there an
>> effort already underway?
>>
>> Any information is appreciated.
>>
>> Thanks in advance.
>> Hemant
>>
>


Re: SPIP: Executor Plugin (SPARK-24918)

2018-08-31 Thread Ted Yu
+1

-------- Original message --------
From: Reynold Xin
Date: 8/30/18 11:11 PM (GMT-08:00)
To: Felix Cheung
Cc: dev
Subject: Re: SPIP: Executor Plugin (SPARK-24918)

I actually had a similar use case a while ago, but not entirely the same. In my 
use case, Spark is already up, but I want to make sure all existing (and new) 
executors run some specific code. Can we update the API to support that? I 
think that's doable if we split the design into two: one is the ability to do 
what I just mentioned, and second is the ability to register via config class 
when Spark starts to run the code.

On Thu, Aug 30, 2018 at 11:01 PM Felix Cheung  wrote:

+1

From: Mridul Muralidharan
Sent: Wednesday, August 29, 2018 1:27:27 PM
To: dev@spark.apache.org
Subject: Re: SPIP: Executor Plugin (SPARK-24918)

+1
I left a couple of comments in NiharS's PR, but this is very useful to
have in spark !

Regards,
Mridul

On Fri, Aug 3, 2018 at 10:00 AM Imran Rashid  wrote:
>
> I'd like to propose adding a plugin api for Executors, primarily for
> instrumentation and debugging
> (https://issues.apache.org/jira/browse/SPARK-24918).  The changes are small,
> but as its adding a new api, it might be spip-worthy.  I mentioned it as well
> in a recent email I sent about memory monitoring
>
> The spip proposal is here (and attached to the jira as well):
> https://docs.google.com/document/d/1a20gHGMyRbCM8aicvq4LhWfQmoA5cbHBQtyqIA2hgtc/edit?usp=sharing
>
> There are already some comments on the jira and pr, and I hope to get more
> thoughts and opinions on it.
>
> thanks,
> Imran

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


TimSort bug

2018-08-31 Thread Reynold Xin
“As a byproduct of our study, we uncover a bug in the Java implementation
that can cause the sorting method to fail during the execution.”

http://drops.dagstuhl.de/opus/volltexte/2018/9467/

This might impact Spark since we took the Java based TimSort
implementation. I have seen in the wild TimSort failing in the past. Maybe
this is the cause.


Re: [DISCUSS] move away from python doctests

2018-08-31 Thread Hyukjin Kwon
IMHO, one thing we should consider before this is refactoring the PySpark
tests so that each main module has its own separate test file. Right now we put
all of those unit tests into a few large files, which makes the tests hard to
follow.

On Fri, Aug 31, 2018 at 2:05 PM, Felix Cheung wrote:

> +1 on what Li said.
>
> And +1 on getting more coverage in unit tests - however often times we
> omit python unit tests deliberately if the python “wrapper” is trivial.
> This is what I’ve learned over the years from the previous pyspark
> maintainers. Admittedly gaps are there.
>
>
> --
> *From:* Imran Rashid 
> *Sent:* Wednesday, August 29, 2018 1:42 PM
> *To:* ice.xell...@gmail.com
> *Cc:* dev
> *Subject:* Re: [DISCUSS] move away from python doctests
>
> (Also, maybe there are already good unit tests, and I just don't know
> where to find them, as Bryan Cutler pointed out for the bit of code I was
> originally asking about.)
>
> On Wed, Aug 29, 2018 at 3:26 PM Imran Rashid  wrote:
>
>> Hi Li,
>>
>> yes that makes perfect sense.  That more-or-less is the same as my view,
>> though I framed it differently.  I guess in that case, I'm really asking:
>>
>> Can pyspark changes please be accompanied by more unit tests, and not
>> assume we're getting coverage from doctests?
>>
>> Imran
>>
>> On Wed, Aug 29, 2018 at 2:02 PM Li Jin  wrote:
>>
>>> Hi Imran,
>>>
>>> My understanding is that doctests and unittests are orthogonal -
>>> doctests are used to make sure docstring examples are correct and are not
>>> meant to replace unittests.
>>> Functionality is covered by unit tests to ensure correctness, and
>>> doctests are used to test the docstring, not the functionality itself.
>>>
>>> There are issues with doctests - for example, we cannot test arrow-related
>>> functions in doctests because pyarrow is an optional dependency - but
>>> I think that's a separate issue.
>>>
>>> Does this make sense?
>>>
>>> Li
>>>
>>> On Wed, Aug 29, 2018 at 6:35 PM Imran Rashid
>>>  wrote:
>>>
 Hi,

 I'd like to propose that we move away from such heavy reliance on
 doctests in python, and move towards more traditional unit tests.  The main
 reason is that it's hard to share test code in doctests.  For example, I
 was just looking at

 https://github.com/apache/spark/commit/82c18c240a6913a917df3b55cc5e22649561c4dd
  and wondering if we had any tests for some of the pyspark changes.
 SparkSession.createDataFrame has doctests, but those are just run with one
 standard spark configuration, which does not enable arrow.  It's hard to
 easily reuse that test, just with another spark context with a different
 conf.  Similarly I've wondered about reusing test cases but with
 local-cluster instead of local mode.  I feel like they also discourage
 writing a test which tries to get more exhaustive coverage on corner cases.

 I'm not saying we should stop using doctests -- I see why they're
 nice.  I just think they should really only be used when you want that code
 snippet in the doc anyway, so you might as well test it.

 Admittedly, I'm not really a python-developer, so I could be totally
 wrong about the right way to author doctests -- pushback welcome!

 Thoughts?

 thanks,
 Imran

>>>


Re: SPIP: Executor Plugin (SPARK-24918)

2018-08-31 Thread Lars Francke
+1

On Fri, Aug 31, 2018 at 8:11 AM, Reynold Xin  wrote:

> I actually had a similar use case a while ago, but not entirely the same.
> In my use case, Spark is already up, but I want to make sure all existing
> (and new) executors run some specific code. Can we update the API to
> support that? I think that's doable if we split the design into two: one is
> the ability to do what I just mentioned, and the second is the ability to
> register the code via a config class when Spark starts.
>
>
> On Thu, Aug 30, 2018 at 11:01 PM Felix Cheung 
> wrote:
>
>> +1
>> --
>> *From:* Mridul Muralidharan 
>> *Sent:* Wednesday, August 29, 2018 1:27:27 PM
>> *To:* dev@spark.apache.org
>> *Subject:* Re: SPIP: Executor Plugin (SPARK-24918)
>>
>> +1
>> I left a couple of comments in NiharS's PR, but this is very useful to
>> have in spark!
>>
>> Regards,
>> Mridul
>> On Fri, Aug 3, 2018 at 10:00 AM Imran Rashid
>>  wrote:
>> >
>> > I'd like to propose adding a plugin api for Executors, primarily for
>> > instrumentation and debugging (https://issues.apache.org/jira/browse/SPARK-24918).
>> > The changes are small, but as it's adding a new api, it might be spip-worthy.
>> > I mentioned it as well in a recent email I sent about memory monitoring
>> >
>> > The spip proposal is here (and attached to the jira as well):
>> > https://docs.google.com/document/d/1a20gHGMyRbCM8aicvq4LhWfQmoA5cbHBQtyqIA2hgtc/edit?usp=sharing
>> >
>> > There are already some comments on the jira and pr, and I hope to get
>> > more thoughts and opinions on it.
>> >
>> > thanks,
>> > Imran
>>


Re: SPIP: Executor Plugin (SPARK-24918)

2018-08-31 Thread Reynold Xin
I actually had a similar use case a while ago, but not entirely the same.
In my use case, Spark is already up, but I want to make sure all existing
(and new) executors run some specific code. Can we update the API to
support that? I think that's doable if we split the design into two: one is
the ability to do what I just mentioned, and the second is the ability to
register the code via a config class when Spark starts.


On Thu, Aug 30, 2018 at 11:01 PM Felix Cheung 
wrote:

> +1
> --
> *From:* Mridul Muralidharan 
> *Sent:* Wednesday, August 29, 2018 1:27:27 PM
> *To:* dev@spark.apache.org
> *Subject:* Re: SPIP: Executor Plugin (SPARK-24918)
>
> +1
> I left a couple of comments in NiharS's PR, but this is very useful to
> have in spark!
>
> Regards,
> Mridul
> On Fri, Aug 3, 2018 at 10:00 AM Imran Rashid
>  wrote:
> >
> > I'd like to propose adding a plugin api for Executors, primarily for
> > instrumentation and debugging (https://issues.apache.org/jira/browse/SPARK-24918).
> > The changes are small, but as it's adding a new api, it might be spip-worthy.
> > I mentioned it as well in a recent email I sent about memory monitoring
> >
> > The spip proposal is here (and attached to the jira as well):
> > https://docs.google.com/document/d/1a20gHGMyRbCM8aicvq4LhWfQmoA5cbHBQtyqIA2hgtc/edit?usp=sharing
> >
> > There are already some comments on the jira and pr, and I hope to get
> > more thoughts and opinions on it.
> >
> > thanks,
> > Imran
>


Re: [DISCUSS] move away from python doctests

2018-08-31 Thread Felix Cheung
+1 on what Li said.

And +1 on getting more coverage in unit tests - however, we often omit
python unit tests deliberately if the python “wrapper” is trivial. This is what
I’ve learned over the years from the previous pyspark maintainers. Admittedly 
gaps are there.



From: Imran Rashid 
Sent: Wednesday, August 29, 2018 1:42 PM
To: ice.xell...@gmail.com
Cc: dev
Subject: Re: [DISCUSS] move away from python doctests

(Also, maybe there are already good unit tests, and I just don't know where to 
find them, as Bryan Cutler pointed out for the bit of code I was originally 
asking about.)

On Wed, Aug 29, 2018 at 3:26 PM Imran Rashid  wrote:
Hi Li,

yes that makes perfect sense.  That more-or-less is the same as my view, though 
I framed it differently.  I guess in that case, I'm really asking:

Can pyspark changes please be accompanied by more unit tests, and not assume 
we're getting coverage from doctests?

Imran

On Wed, Aug 29, 2018 at 2:02 PM Li Jin  wrote:
Hi Imran,

My understanding is that doctests and unittests are orthogonal - doctests are 
used to make sure docstring examples are correct and are not meant to replace 
unittests.
Functionality is covered by unit tests to ensure correctness, and doctests
are used to test the docstring, not the functionality itself.

There are issues with doctests - for example, we cannot test arrow-related
functions in doctests because pyarrow is an optional dependency - but I think
that's a separate issue.

Does this make sense?

Li

On Wed, Aug 29, 2018 at 6:35 PM Imran Rashid  
wrote:
Hi,

I'd like to propose that we move away from such heavy reliance on doctests in
python, and move towards more traditional unit tests.  The main reason is that
it's hard to share test code in doctests.  For example, I was just looking at
https://github.com/apache/spark/commit/82c18c240a6913a917df3b55cc5e22649561c4dd
 and wondering if we had any tests for some of the pyspark changes.
SparkSession.createDataFrame has doctests, but those are just run with one
standard spark configuration, which does not enable arrow.  It's hard to easily
reuse that test, just with another spark context with a different conf.
Similarly I've wondered about reusing test cases but with local-cluster instead
of local mode.  I feel like they also discourage writing a test which tries to
get more exhaustive coverage on corner cases.

I'm not saying we should stop using doctests -- I see why they're nice.  I just
think they should really only be used when you want that code snippet in the doc
anyway, so you might as well test it.

Admittedly, I'm not really a python-developer, so I could be totally wrong 
about the right way to author doctests -- pushback welcome!

Thoughts?

thanks,
Imran


Re: SPIP: Executor Plugin (SPARK-24918)

2018-08-31 Thread Felix Cheung
+1

From: Mridul Muralidharan 
Sent: Wednesday, August 29, 2018 1:27:27 PM
To: dev@spark.apache.org
Subject: Re: SPIP: Executor Plugin (SPARK-24918)

+1
I left a couple of comments in NiharS's PR, but this is very useful to
have in spark!

Regards,
Mridul
On Fri, Aug 3, 2018 at 10:00 AM Imran Rashid
 wrote:
>
> I'd like to propose adding a plugin api for Executors, primarily for 
> instrumentation and debugging 
> (https://issues.apache.org/jira/browse/SPARK-24918).  The changes are small, 
> but as it's adding a new api, it might be spip-worthy.  I mentioned it as well
> in a recent email I sent about memory monitoring
>
> The spip proposal is here (and attached to the jira as well): 
> https://docs.google.com/document/d/1a20gHGMyRbCM8aicvq4LhWfQmoA5cbHBQtyqIA2hgtc/edit?usp=sharing
>
> There are already some comments on the jira and pr, and I hope to get more 
> thoughts and opinions on it.
>
> thanks,
> Imran




data source api v2 refactoring

2018-08-31 Thread Reynold Xin
I spent some time last week looking at the current data source v2 apis, and
I thought we should be a bit more buttoned up in terms of the abstractions
and the guarantees Spark provides. In particular, I feel we need the
following levels of "abstractions", to fit the use cases in Spark, from
batch, to streaming.

Please don't focus on the naming at this stage. When possible, I draw
parallels to what similar levels are named in the currently committed api:

0. Format: This represents a specific format, e.g. Parquet, ORC. There is
currently no explicit class at this level.

1. Table: This should represent a logical dataset (with schema). This could
be just a directory on the file system, or a table in the catalog.
Operations on tables can include batch reads (Scan), streams, writes, and
potentially other operations such as deletes. The closest to the table
level abstraction in the current code base is the "Provider" class,
although Provider isn't quite a Table. This is similar to Ryan's proposed
design.

2. Stream: Specific to streaming. A stream is created out of a Table. This
logically represents an instance of a StreamingQuery. Pushdowns and
options are handled at this layer, i.e. Spark guarantees to the data source
implementation that pushdowns and options don't change within a Stream. Each
Stream consists of a sequence of scans. There is no equivalent concept in
the current committed code.

3. Scan: A physical scan -- either as part of a streaming query, or a batch
query. This should contain sufficient information and methods so we can run
a Spark job over a defined subset of the table. It's functionally
equivalent to an RDD, except there's no dependency on RDD so it is a
smaller surface. In the current code, the equivalent class would be the
ScanConfig, which represents the information needed, but in order to
execute a job, ReadSupport is needed (various methods in ReadSupport take
a ScanConfig).


To illustrate with pseudocode what the different levels mean, a batch query
would look like the following:

val provider = reflection[Format]("parquet")
val table = provider.createTable(options)
val scan = table.createScan(scanConfig)  // scanConfig includes pushdown and options
// run tasks on executors

A streaming micro-batch scan would look like the following:

val provider = reflection[Format]("parquet")
val table = provider.createTable(options)
val stream = table.createStream(scanConfig)

while (true) {
  val scan = stream.createScan(startOffset)
  // run tasks on executors
}


Compared to the current API, the above:

1. Creates an explicit Table abstraction, and an explicit Scan abstraction.

2. Has an explicit Stream level and makes it clear that pushdowns and options
are handled there, rather than at the individual scan (ReadSupport) level.
Data source implementations don't need to worry about pushdowns or options
changing mid-stream. For batch, those happen when the scan object is
created. (A rough interface sketch of these levels follows below.)
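
To make the hierarchy concrete, here is a rough, illustrative sketch of these
levels as interfaces. The names and signatures, and the ScanConfig / Offset /
Task placeholders, are mine for illustration, not a proposed API:

// Illustrative only -- ScanConfig, Offset, and Task are placeholders.
case class ScanConfig(pushedFilters: Seq[String], options: Map[String, String])
case class Offset(value: Long)
trait Task                                    // one unit of work for an executor

trait Format {                                // level 0: e.g. parquet, orc
  def createTable(options: Map[String, String]): Table
}

trait Table {                                 // level 1: logical dataset with schema
  def createScan(config: ScanConfig): Scan    // batch read
  def createStream(config: ScanConfig): Stream
}

trait Stream {                                // level 2: one streaming query's reads;
  def createScan(start: Offset): Scan         // pushdown and options are fixed here
}

trait Scan {                                  // level 3: a single physical scan,
  def planTasks(): Seq[Task]                  // enough to run one Spark job
}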



This email is just a high level sketch. I've asked Wenchen to prototype
this, to see if it is actually feasible and the degree of hacks it removes,
or creates.