Re: Remove Flume support in 3.0.0?

2018-10-10 Thread Sean Owen
Yup, was thinking the same. It is legacy at this point, too.

On Wed, Oct 10, 2018, 3:19 PM Marcelo Vanzin  wrote:

> BTW, although I did not file a bug for that, I think we should also
> consider getting rid of the kafka-0.8 connector.
>
> That would leave only kafka-0.10 as the single remaining dstream
> connector in Spark, though. (If you ignore kinesis which we can't ship
> in binary form or something like that?)
> On Wed, Oct 10, 2018 at 1:32 PM Sean Owen  wrote:
> >
> > Marcelo makes an argument that Flume support should be removed in
> > 3.0.0 at https://issues.apache.org/jira/browse/SPARK-25598
> >
> > I tend to agree. Is there an argument that it needs to be supported,
> > and can this move to Bahir if so?
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
>
> --
> Marcelo
>


Re: Remove Flume support in 3.0.0?

2018-10-10 Thread Marcelo Vanzin
BTW, although I did not file a bug for that, I think we should also
consider getting rid of the kafka-0.8 connector.

That would leave only kafka-0.10 as the single remaining dstream
connector in Spark, though. (If you ignore kinesis which we can't ship
in binary form or something like that?)
On Wed, Oct 10, 2018 at 1:32 PM Sean Owen  wrote:
>
> Marcelo makes an argument that Flume support should be removed in
> 3.0.0 at https://issues.apache.org/jira/browse/SPARK-25598
>
> I tend to agree. Is there an argument that it needs to be supported,
> and can this move to Bahir if so?
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>


-- 
Marcelo

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] SPARK 2.4.0 (RC3)

2018-10-10 Thread Sean Owen
+1. I tested the source build against Scala 2.12 and common build
profiles. License and sigs look OK.

No blockers; one critical:

SPARK-25378 ArrayData.toArray(StringType) assume UTF8String in 2.4

I think this one is "won't fix", though? Not trying to restore the behavior?

Other items open for 2.4.0:

SPARK-25347 Document image data source in doc site
SPARK-25584 Document libsvm data source in doc site
SPARK-25179 Document the features that require Pyarrow 0.10
SPARK-25507 Update documents for the new features in 2.4 release
SPARK-25346 Document Spark builtin data sources
SPARK-24464 Unit tests for MLlib's Instrumentation
SPARK-23197 Flaky test: spark.streaming.ReceiverSuite."receiver_life_cycle"
SPARK-22809 pyspark is sensitive to imports with dots
SPARK-21030 extend hint syntax to support any expression for Python and R

Anyone know enough to close or retarget them? They don't look critical
for 2.4; SPARK-25507 itself has no content. SPARK-25179 "Document the
features that require Pyarrow 0.10", however, sounds like it could have
been important for 2.4, if not a blocker.

PS I don't think that SPARK-25150 is an issue; see JIRA. At least
there is some ongoing discussion there.

I am evaluating
https://github.com/apache/spark/pull/22259#discussion_r224252642 right
now.


On Wed, Oct 10, 2018 at 9:47 AM Wenchen Fan  wrote:
>
> Please vote on releasing the following candidate as Apache Spark version 
> 2.4.0.
>
> The vote is open until October 1 PST and passes if a majority +1 PMC votes 
> are cast, with
> a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.4.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.4.0-rc3 (commit 
> 8e4a99bd201b9204fec52580f19ae70a229ed94e):
> https://github.com/apache/spark/tree/v2.4.0-rc3
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc3-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1289
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc3-docs/
>
> The list of bug fixes going into 2.4.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12342385
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with a out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.4.0?
> ===
>
> The current list of open tickets targeted at 2.4.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target 
> Version/s" = 2.4.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Remove Flume support in 3.0.0?

2018-10-10 Thread Jörn Franke
I think it makes sense to remove it.
If it is not too much effort, and the architecture of the Flume source is not
considered too strange, one could extract it as a separate project and put it
on GitHub in a dedicated, unsupported repository. This would enable
distributors and other companies to continue to use it with minor adaptations in
case their architecture depends on it. Furthermore, if there is growing
interest, someone could pick it up and create a clean connector based on the
current Spark architecture, to be made available as a dedicated connector or again in
later Spark versions.

That being said, there are also "indirect" ways to use Flume with Spark (e.g. via
Kafka), so I believe people would not be affected much by a removal.
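
For example, if events already flow Flume -> Kafka, a Structured Streaming job can
pick them up directly. A rough sketch (broker and topic names are made up, and the
spark-sql-kafka-0-10 package needs to be on the classpath):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("flume-via-kafka").getOrCreate()

// read the topic that a Flume Kafka sink writes to (names are illustrative)
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "flume-events")
  .load()
  .selectExpr("CAST(value AS STRING) AS body")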

(Non-voting, just my opinion)

> Am 10.10.2018 um 22:31 schrieb Sean Owen :
> 
> Marcelo makes an argument that Flume support should be removed in
> 3.0.0 at https://issues.apache.org/jira/browse/SPARK-25598
> 
> I tend to agree. Is there an argument that it needs to be supported,
> and can this move to Bahir if so?
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Remove Flume support in 3.0.0?

2018-10-10 Thread Sean Owen
Marcelo makes an argument that Flume support should be removed in
3.0.0 at https://issues.apache.org/jira/browse/SPARK-25598

I tend to agree. Is there an argument that it needs to be supported,
and can this move to Bahir if so?

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Sql custom streamer design questions and feedback

2018-10-10 Thread Vadim Chekan
Hi all,

I am trying to write a custom SQL streaming source, and I have quite a lot of
questions about how it is envisioned to be done.

My first attempt was to
extend org.apache.spark.sql.execution.streaming.Source. At first it looks
simple: the data source tells Spark the last offset it has, and Spark asks for
data from the previous offset up to the last one you reported. This leads to a
straightforward implementation: a background thread pumps data into an internal
buffer, getOffset reports the last offset in the buffer, and getBatch converts a
slice of the buffer into a DataFrame.

But during initialization from a checkpoint this workflow is turned upside
down: getBatch is called with an offset range which the driver never
reported it has. So just for initialization I would have to write a "fulfill
range" function. The data source now has to implement two modes of operation,
pull and push, which seems redundant and driven by the API design.
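
For reference, this is roughly the shape of the v1 contract being described (the
class name is made up and method bodies are elided with ??? -- a sketch of the
interface, not a working source):

import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.{Offset, Source}
import org.apache.spark.sql.types.StructType

class BufferedEventSource(sqlContext: SQLContext) extends Source {
  override def schema: StructType = ???
  // "push" side: report the newest offset the background thread has buffered
  override def getOffset: Option[Offset] = ???
  // "pull" side: after recovery from a checkpoint, (start, end] may predate
  // anything in the buffer -- the "fulfill range" case described above
  override def getBatch(start: Option[Offset], end: Offset): DataFrame = ???
  override def stop(): Unit = ???
}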

Ok, there are the V2 streaming interfaces. Let's try ContinuousReader.
A minimal implementation gives me "java.lang.UnsupportedOperationException:
Data source eventstore2 does not support microbatch processing." Hmm, I
thought ContinuousReader was an alternative to MicroBatchReader. Also, if
MicroBatchReader is required, why doesn't ContinuousReader inherit from it?
Perhaps it should be documented in ContinuousReader what is needed to make it
actually work.

Ok, let's try MicroBatchReader. It is better than Source because it provides
an explicit initialization call, setOffsetRange. But the next call to implement
is getEndOffset, and with the driver just initialized and not having
consumed any message yet, what should I return? There is no notion of "not
ready yet" in the API. Is the intention to finish all initialization in
setOffsetRange and block there until the first batch of data arrives? There
is no documentation in MicroBatchReader describing the start-up
logic, and I cannot find examples of implementations. The only
implementation is RateSourceProviderV2, which does not deal with
long-lasting initialization and does not demonstrate where it should be done.

Another question: how should this be implemented if I do not know my
start position at all, for example Kafka's notion of special offsets like
"start/end offset"? If the topic is empty, I will not be able to resolve the
actual offset for a very long time, maybe weeks. I can guess that Offset is
opaque to Spark and a source implementation can introduce its own notion of
special offsets, like "from start", but it would be nice to have this documented.
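
If Offset is indeed opaque to Spark (it is persisted and compared through its JSON
form), a source can define its own sentinel offsets and resolve them lazily once data
arrives. A minimal sketch of that idea (names and sentinel values are invented):

import org.apache.spark.sql.execution.streaming.Offset

case class EventStoreOffset(value: Long) extends Offset {
  override val json: String = value.toString
}

object EventStoreOffset {
  // sentinels resolved to real positions only once the store has data
  val Earliest = EventStoreOffset(-2L)
  val Latest   = EventStoreOffset(-1L)
}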

Ok, I've implemented getEndOffset as returning "0" until the driver actually
gets some data. The next question is about createDataReaderFactories. From the
comments I understand that this is done for partitioning purposes. So
Kafka, for example, could start multiple consumers for the same consumer
group, and those consumers would be wrapped into DataReaders by
DataReaderFactory. Does this mean that all connection and buffering
business is done in the DataReader instances? And upon "commit", should
MicroBatchReader inform its DataReaders that buffers can be
truncated?

Also, terminology: what does "mix-in at least one of the plug-in interfaces" mean?
I see "mix-in" a lot in interface descriptions and cannot understand what
it means in this context. The concept of a mixin does not exist in Java
AFAIK, and my best guess is that the documentation hints that it takes more than
one interface to make a useful class. But this only adds anxiety, because
now I do not know which interfaces I need to implement, and the only way
to find out is to implement one, see if it works, and keep adding interfaces
until it starts working (as was the case with ContinuousReader).
And I have no clue what a "plug-in interface" is.
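
"Mix in" here just means implementing several of the optional interfaces on one
provider class, Scala-trait / Java-interface style; Spark then checks at runtime
which capabilities the class actually implements. A purely illustrative sketch with
placeholder traits (these are not the real Spark interfaces):

trait DataSourceV2Like                                         // base marker
trait MicroBatchSupportLike { def createMicroBatchReader(): AnyRef }
trait ContinuousSupportLike { def createContinuousReader(): AnyRef }

// the provider "mixes in" only the capabilities it supports; asking Spark for a
// mode whose interface is missing yields errors like the one quoted above
class MyProvider extends DataSourceV2Like with MicroBatchSupportLike {
  override def createMicroBatchReader(): AnyRef = new AnyRef // placeholder
}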

Thanks,
Vadym.


-- 
From RFC 2631: In ASN.1, EXPLICIT tagging is implicit unless IMPLICIT is
explicitly specified


Re: [VOTE] SPARK 2.4.0 (RC3)

2018-10-10 Thread Nicholas Chammas
FYI I believe we have an open correctness issue here:

https://issues.apache.org/jira/browse/SPARK-25150

However, it needs review by another person to confirm whether it is indeed
a correctness issue (and whether it still impacts this latest RC).

Nick

On Wed, Oct 10, 2018 at 3:14 PM, Jean Georges Perrin wrote:

> Awesome - thanks Dongjoon!
>
>
> On Oct 10, 2018, at 2:36 PM, Dongjoon Hyun 
> wrote:
>
> For now, you can see generated release notes. Official one will be posted
> on the website when the official 2.4.0 is out.
>
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315420&version=12342385
>
> Bests,
> Dongjoon.
>
>
> On Wed, Oct 10, 2018 at 11:29 AM Jean Georges Perrin  wrote:
>
>> Hi,
>>
>> Sorry if it's stupid question, but where can I find the release notes of
>> 2.4.0?
>>
>> jg
>>
>> On Oct 10, 2018, at 2:00 PM, Imran Rashid 
>> wrote:
>>
>> Sorry I had messed up my testing earlier, so I only just discovered
>> https://issues.apache.org/jira/browse/SPARK-25704
>>
>> I dont' think this is a release blocker, because its not a regression and
>> there is a workaround, just fyi.
>>
>> On Wed, Oct 10, 2018 at 11:47 AM Wenchen Fan  wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 2.4.0.
>>>
>>> The vote is open until October 1 PST and passes if a majority +1 PMC
>>> votes are cast, with
>>> a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.4.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v2.4.0-rc3 (commit
>>> 8e4a99bd201b9204fec52580f19ae70a229ed94e):
>>> https://github.com/apache/spark/tree/v2.4.0-rc3
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc3-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1289
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc3-docs/
>>>
>>> The list of bug fixes going into 2.4.0 can be found at the following URL:
>>> https://issues.apache.org/jira/projects/SPARK/versions/12342385
>>>
>>> FAQ
>>>
>>> =
>>> How can I help test this release?
>>> =
>>>
>>> If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>> the current RC and see if anything important breaks, in the Java/Scala
>>> you can add the staging repository to your projects resolvers and test
>>> with the RC (make sure to clean up the artifact cache before/after so
>>> you don't end up building with a out of date RC going forward).
>>>
>>> ===
>>> What should happen to JIRA tickets still targeting 2.4.0?
>>> ===
>>>
>>> The current list of open tickets targeted at 2.4.0 can be found at:
>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>> Version/s" = 2.4.0
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should
>>> be worked on immediately. Everything else please retarget to an
>>> appropriate release.
>>>
>>> ==
>>> But my bug isn't fixed?
>>> ==
>>>
>>> In order to make timely releases, we will typically not hold the
>>> release unless the bug in question is a regression from the previous
>>> release. That being said, if there is something which is a regression
>>> that has not been correctly targeted please ping me or a committer to
>>> help target the issue.
>>>
>>
>>
>


Re: [VOTE] SPARK 2.4.0 (RC3)

2018-10-10 Thread Jean Georges Perrin
Awesome - thanks Dongjoon!

> On Oct 10, 2018, at 2:36 PM, Dongjoon Hyun  wrote:
> 
> For now, you can see generated release notes. Official one will be posted on 
> the website when the official 2.4.0 is out.
> 
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315420&version=12342385
>  
> 
> 
> Bests,
> Dongjoon.
> 
> 
> On Wed, Oct 10, 2018 at 11:29 AM Jean Georges Perrin  > wrote:
> Hi,
> 
> Sorry if it's stupid question, but where can I find the release notes of 
> 2.4.0?
> 
> jg
> 
>> On Oct 10, 2018, at 2:00 PM, Imran Rashid > > wrote:
>> 
>> Sorry I had messed up my testing earlier, so I only just discovered 
>> https://issues.apache.org/jira/browse/SPARK-25704 
>> 
>> 
>> I dont' think this is a release blocker, because its not a regression and 
>> there is a workaround, just fyi.
>> 
>> On Wed, Oct 10, 2018 at 11:47 AM Wenchen Fan > > wrote:
>> Please vote on releasing the following candidate as Apache Spark version 
>> 2.4.0.
>> 
>> The vote is open until October 1 PST and passes if a majority +1 PMC votes 
>> are cast, with
>> a minimum of 3 +1 votes.
>> 
>> [ ] +1 Release this package as Apache Spark 2.4.0
>> [ ] -1 Do not release this package because ...
>> 
>> To learn more about Apache Spark, please see http://spark.apache.org/ 
>> 
>> 
>> The tag to be voted on is v2.4.0-rc3 (commit 
>> 8e4a99bd201b9204fec52580f19ae70a229ed94e):
>> https://github.com/apache/spark/tree/v2.4.0-rc3 
>> 
>> 
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc3-bin/ 
>> 
>> 
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS 
>> 
>> 
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1289 
>> 
>> 
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc3-docs/ 
>> 
>> 
>> The list of bug fixes going into 2.4.0 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12342385 
>> 
>> 
>> FAQ
>> 
>> =
>> How can I help test this release?
>> =
>> 
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>> 
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks, in the Java/Scala
>> you can add the staging repository to your projects resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with a out of date RC going forward).
>> 
>> ===
>> What should happen to JIRA tickets still targeting 2.4.0?
>> ===
>> 
>> The current list of open tickets targeted at 2.4.0 can be found at:
>> https://issues.apache.org/jira/projects/SPARK 
>>  and search for "Target 
>> Version/s" = 2.4.0
>> 
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>> 
>> ==
>> But my bug isn't fixed?
>> ==
>> 
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
> 



Re: moving the spark jenkins job builder repo from dbricks --> spark

2018-10-10 Thread shane knapp
>
> Not sure if that's what you meant; but it should be ok for the jenkins
> servers to manually sync with master after you (or someone else) have
> verified the changes. That should prevent inadvertent breakages since
> I don't expect it to be easy to test those scripts without access to
> some test jenkins server.
>
JJB has some built-in lint and testing, so that'll be the first step in
verifying the build configs.

i still have a dream where i have a fully functioning jenkins staging
deployment...  one day i will make that happen.  :)

shane

-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [VOTE] SPARK 2.4.0 (RC3)

2018-10-10 Thread Dongjoon Hyun
For now, you can see generated release notes. Official one will be posted
on the website when the official 2.4.0 is out.

https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315420&version=12342385

Bests,
Dongjoon.


On Wed, Oct 10, 2018 at 11:29 AM Jean Georges Perrin  wrote:

> Hi,
>
> Sorry if it's stupid question, but where can I find the release notes of
> 2.4.0?
>
> jg
>
> On Oct 10, 2018, at 2:00 PM, Imran Rashid 
> wrote:
>
> Sorry I had messed up my testing earlier, so I only just discovered
> https://issues.apache.org/jira/browse/SPARK-25704
>
> I dont' think this is a release blocker, because its not a regression and
> there is a workaround, just fyi.
>
> On Wed, Oct 10, 2018 at 11:47 AM Wenchen Fan  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.4.0.
>>
>> The vote is open until October 1 PST and passes if a majority +1 PMC
>> votes are cast, with
>> a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 2.4.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v2.4.0-rc3 (commit
>> 8e4a99bd201b9204fec52580f19ae70a229ed94e):
>> https://github.com/apache/spark/tree/v2.4.0-rc3
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc3-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1289
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc3-docs/
>>
>> The list of bug fixes going into 2.4.0 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12342385
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks, in the Java/Scala
>> you can add the staging repository to your projects resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with a out of date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 2.4.0?
>> ===
>>
>> The current list of open tickets targeted at 2.4.0 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 2.4.0
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>>
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>>
>
>


Re: [VOTE] SPARK 2.4.0 (RC3)

2018-10-10 Thread Jean Georges Perrin
Hi,

Sorry if it's a stupid question, but where can I find the release notes for 2.4.0?

jg

> On Oct 10, 2018, at 2:00 PM, Imran Rashid  > wrote:
> 
> Sorry I had messed up my testing earlier, so I only just discovered 
> https://issues.apache.org/jira/browse/SPARK-25704 
> 
> 
> I dont' think this is a release blocker, because its not a regression and 
> there is a workaround, just fyi.
> 
> On Wed, Oct 10, 2018 at 11:47 AM Wenchen Fan  > wrote:
> Please vote on releasing the following candidate as Apache Spark version 
> 2.4.0.
> 
> The vote is open until October 1 PST and passes if a majority +1 PMC votes 
> are cast, with
> a minimum of 3 +1 votes.
> 
> [ ] +1 Release this package as Apache Spark 2.4.0
> [ ] -1 Do not release this package because ...
> 
> To learn more about Apache Spark, please see http://spark.apache.org/ 
> 
> 
> The tag to be voted on is v2.4.0-rc3 (commit 
> 8e4a99bd201b9204fec52580f19ae70a229ed94e):
> https://github.com/apache/spark/tree/v2.4.0-rc3 
> 
> 
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc3-bin/ 
> 
> 
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS 
> 
> 
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1289 
> 
> 
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc3-docs/ 
> 
> 
> The list of bug fixes going into 2.4.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12342385 
> 
> 
> FAQ
> 
> =
> How can I help test this release?
> =
> 
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
> 
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with a out of date RC going forward).
> 
> ===
> What should happen to JIRA tickets still targeting 2.4.0?
> ===
> 
> The current list of open tickets targeted at 2.4.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK 
>  and search for "Target 
> Version/s" = 2.4.0
> 
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
> 
> ==
> But my bug isn't fixed?
> ==
> 
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.



Re: [VOTE] SPARK 2.4.0 (RC3)

2018-10-10 Thread Imran Rashid
Sorry I had messed up my testing earlier, so I only just discovered
https://issues.apache.org/jira/browse/SPARK-25704

I don't think this is a release blocker, because it's not a regression and
there is a workaround; just FYI.

On Wed, Oct 10, 2018 at 11:47 AM Wenchen Fan  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.4.0.
>
> The vote is open until October 1 PST and passes if a majority +1 PMC votes
> are cast, with
> a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.4.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.4.0-rc3 (commit
> 8e4a99bd201b9204fec52580f19ae70a229ed94e):
> https://github.com/apache/spark/tree/v2.4.0-rc3
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc3-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1289
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc3-docs/
>
> The list of bug fixes going into 2.4.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12342385
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with a out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.4.0?
> ===
>
> The current list of open tickets targeted at 2.4.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.4.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>


RE: [VOTE] SPARK 2.4.0 (RC3)

2018-10-10 Thread Garlapati, Suryanarayana (Nokia - IN/Bangalore)
You might need to change the date (Oct 1 has already passed).

>> The vote is open until October 1 PST and passes if a majority +1 PMC votes 
>> are cast, with
>> a minimum of 3 +1 votes.

Regards
Surya

From: Wenchen Fan 
Sent: Wednesday, October 10, 2018 10:20 PM
To: Spark dev list 
Subject: Re: [VOTE] SPARK 2.4.0 (RC3)

I'm adding my own +1, since there are no known blocker issues. The correctness 
issue has been fixed, the streaming Java API problem has been resolved, and we 
have upgraded to Scala 2.12.7.

On Thu, Oct 11, 2018 at 12:46 AM Wenchen Fan 
mailto:cloud0...@gmail.com>> wrote:
Please vote on releasing the following candidate as Apache Spark version 2.4.0.

The vote is open until October 1 PST and passes if a majority +1 PMC votes are 
cast, with
a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 2.4.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.4.0-rc3 (commit 
8e4a99bd201b9204fec52580f19ae70a229ed94e):
https://github.com/apache/spark/tree/v2.4.0-rc3

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc3-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1289

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc3-docs/

The list of bug fixes going into 2.4.0 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12342385

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks, in the Java/Scala
you can add the staging repository to your projects resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with a out of date RC going forward).

===
What should happen to JIRA tickets still targeting 2.4.0?
===

The current list of open tickets targeted at 2.4.0 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target Version/s" 
= 2.4.0

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.


Re: moving the spark jenkins job builder repo from dbricks --> spark

2018-10-10 Thread Marcelo Vanzin
Thanks for doing this. The more things we have accessible to the
project members in general the better!

(Now there's that hive fork repo somewhere, but let's not talk about that.)

On Wed, Oct 10, 2018 at 9:30 AM shane knapp  wrote:
>> > * the JJB templates are able to be run by anyone w/jenkins login access 
>> > without the need to commit changes to the repo.  this means there's a 
>> > non-zero potential for bad actors to change the build configs.  since we 
>> > will only be managing test and compile jobs through this, the chances for 
>> > Real Bad Stuff[tm] is minimized.  i will also have a local server, not on 
>> > the jenkins network, run a nightly cron job that grabs the latest configs 
>> > from github and syncs them to jenkins.

Not sure if that's what you meant; but it should be ok for the jenkins
servers to manually sync with master after you (or someone else) have
verified the changes. That should prevent inadvertent breakages since
I don't expect it to be easy to test those scripts without access to
some test jenkins server.

-- 
Marcelo

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] SPARK 2.4.0 (RC3)

2018-10-10 Thread Wenchen Fan
I'm adding my own +1, since there are no known blocker issues. The
correctness issue has been fixed, the streaming Java API problem has been
resolved, and we have upgraded to Scala 2.12.7.

On Thu, Oct 11, 2018 at 12:46 AM Wenchen Fan  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.4.0.
>
> The vote is open until October 1 PST and passes if a majority +1 PMC votes
> are cast, with
> a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.4.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.4.0-rc3 (commit
> 8e4a99bd201b9204fec52580f19ae70a229ed94e):
> https://github.com/apache/spark/tree/v2.4.0-rc3
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc3-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1289
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc3-docs/
>
> The list of bug fixes going into 2.4.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12342385
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with a out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.4.0?
> ===
>
> The current list of open tickets targeted at 2.4.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.4.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>


[VOTE] SPARK 2.4.0 (RC3)

2018-10-10 Thread Wenchen Fan
Please vote on releasing the following candidate as Apache Spark version
2.4.0.

The vote is open until October 1 PST and passes if a majority +1 PMC votes
are cast, with
a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 2.4.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.4.0-rc3 (commit
8e4a99bd201b9204fec52580f19ae70a229ed94e):
https://github.com/apache/spark/tree/v2.4.0-rc3

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc3-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1289

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc3-docs/

The list of bug fixes going into 2.4.0 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12342385

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env, install
the current RC, and see if anything important breaks. In Java/Scala,
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).

===
What should happen to JIRA tickets still targeting 2.4.0?
===

The current list of open tickets targeted at 2.4.0 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target
Version/s" = 2.4.0

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.


Re: moving the spark jenkins job builder repo from dbricks --> spark

2018-10-10 Thread shane knapp
hey everyone!

just for visibility: after some lengthy conversations w/some PMC members
(mostly sean and josh) about the location of the jenkins job builder
templates being in a private, databricks repo, we've decided to move them
into the main apache spark repo.

https://docs.openstack.org/infra/jenkins-job-builder/

On Tue, Oct 9, 2018 at 10:22 PM Sean Owen  wrote:

> Some responses inline -- this discussion can go to dev@ though.
>
> dev@ added.


> On Tue, Oct 9, 2018 at 3:28 PM shane knapp  wrote:
> > JBB templates in spark repo:
> > * code path is currently undecided, maybe build/?  i honestly don't have
> any strong opinions
>
> How about a subfolder of dev/? that's where many items like the
> release scripts and build style checkers live.
>
> works for me.


> > * this stuff will only in the master branch
>
> Sure, it'll get versioned with branches anyway, but only the master
> branch will matter.
>
> > * the current JJB templates include release and packaging jobs, which
> aren't run via jenkins anymore.  this means we can remove the job builder
> configs for these two, as well as the encrypted secrets.
>
> Sure, if it's not used, remove it. I suppose the concerns below lessen
> if none of the jobs in question here create release artifacts.
>
> exactly.


> > * the JJB templates are able to be run by anyone w/jenkins login access
> without the need to commit changes to the repo.  this means there's a
> non-zero potential for bad actors to change the build configs.  since we
> will only be managing test and compile jobs through this, the chances for
> Real Bad Stuff[tm] is minimized.  i will also have a local server, not on
> the jenkins network, run a nightly cron job that grabs the latest configs
> from github and syncs them to jenkins.
>
> You mean anyone with access to amplab Jenkins? I think this is an
> acceptable risk, especially as it's never been an issue to date. The
> worst case is deleting or sabotaging CI jobs, right? not great, but,
> nobody would be able to commit code or anything. That's the worst
> case, right?
>
> re: access to jenkins -- correct.
re: worst case, deleting a job -- correct (but a cron sync of the jobs from
the tip of master will repair any damage done by nefarious folks).
re: committing code -- correct


> This might be a good time to ask whether we want to use a different CI
> system. I don't see a need to. I don't see any problem that's surfaced
> by publishing the configs as part of the project. If riselab is OK
> continuing to subsidize the build infra, and it continues to work for
> the project, it seems fine. As long as the PMC is meaningfully in
> control of it, I can't see any issue.
>

i have confirmation from RISELab's PI (ion stoica) that we are committed to
continuing to host the apache spark build system here.

if that situation changes (which i cannot foresee), it will of course be
brought up w/the community and the PMC immediately.  i would also like some
heads up from the PMC if the situation changes on their end.  ;)

btw, work will not begin on this until next week.  once i get a jira and PR
opened, i'll respond to this thread w/those links.

shane
-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: Docker image to build Spark/Spark doc

2018-10-10 Thread Robert Kruszewski
My colleagues and I built one for running Spark builds on CircleCI. The
images are at
https://hub.docker.com/r/palantirtechnologies/circle-spark-python/
(circle-spark-r if you want to build sparkr). Dockerfiles for those images
can be found at
https://github.com/palantir/spark/tree/master/dev/docker-images

On Wed, 10 Oct 2018 at 15:31, Sean Owen  wrote:

> You can just build it with Maven or SBT as in the docs. I don't know of a
> docker image but there isn't much to package.
>
> On Wed, Oct 10, 2018, 1:10 AM assaf.mendelson 
> wrote:
>
>> Hi all,
>> I was wondering if there was a docker image to build spark and/or spark
>> documentation
>>
>> The idea would be that I would start the docker image, supplying the
>> directory with my code and a target directory and it would simply build
>> everything (maybe with some options).
>>
>> Any chance there is already something like that which is working and
>> tested?
>>
>> Thanks,
>> Assaf
>>
>>
>>
>>
>> --
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: Docker image to build Spark/Spark doc

2018-10-10 Thread Sean Owen
You can just build it with Maven or SBT as in the docs. I don't know of a
docker image but there isn't much to package.

On Wed, Oct 10, 2018, 1:10 AM assaf.mendelson 
wrote:

> Hi all,
> I was wondering if there was a docker image to build spark and/or spark
> documentation
>
> The idea would be that I would start the docker image, supplying the
> directory with my code and a target directory and it would simply build
> everything (maybe with some options).
>
> Any chance there is already something like that which is working and
> tested?
>
> Thanks,
> Assaf
>
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Coalesce behaviour

2018-10-10 Thread Wenchen Fan
Note that RDD partitions and Spark tasks do not always map 1-1.

Assuming `rdd1` has 100 partitions, and `rdd2 = rdd1.coalesce(10)`: then
`rdd2` has 10 partitions, and there is no shuffle between `rdd1` and
`rdd2`. During scheduling, `rdd1` and `rdd2` are in the same stage, and
this stage has 10 tasks (decided by the last RDD). This means each Spark
task will process 10 partitions of `rdd1`.

Looking at your example, I don't see where the problem is. Can you describe
what is not expected?
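
For example, in spark-shell:

val rdd1 = sc.parallelize(1 to 1000, 100)   // 100 partitions
val rdd2 = rdd1.coalesce(10)                // narrow dependency, no shuffle
rdd2.getNumPartitions                       // 10
rdd2.toDebugString                          // CoalescedRDD sits directly on rdd1 (no
                                            // ShuffledRDD), so both run in one 10-task stage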

On Wed, Oct 10, 2018 at 2:11 PM Sergey Zhemzhitsky 
wrote:

> Well, it seems that I can still extend the CoalesceRDD to make it preserve
> the total number of partitions from the parent RDD, reduce some partitons
> in the same way as the original coalesce does for map-only jobs and fill
> the gaps (partitions which should reside on the positions of the coalesced
> ones) with just a special kind of partitions which do not have any parent
> dependencies and always return an empty iterator.
>
> I believe this should work as desired (at least the previous
> ShuffleMapStage will think that the number of partitons in the next stage,
> it generates shuffle output for, is not changed).
>
> There are few issues though - existence of empty partitions which can be
> evaluated almost for free and empty output files from these empty partitons
> which can be beaten by means of LazyOutputFormat in case of RDDs.
>
>
>
> On Mon, Oct 8, 2018, 23:57 Koert Kuipers  wrote:
>
>> although i personally would describe this as a bug the answer will be
>> that this is the intended behavior. the coalesce "infects" the shuffle
>> before it, making a coalesce useless for reducing output files after a
>> shuffle with many partitions b design.
>>
>> your only option left is a repartition for which you pay the price in
>> that it introduces another expensive shuffle.
>>
>> interestingly if you do a coalesce on a map-only job it knows how to
>> reduce the partitions and output files without introducing a shuffle, so
>> clearly it is possible, but i dont know how to get this behavior after a
>> shuffle in an existing job.
>>
>> On Fri, Oct 5, 2018 at 6:34 PM Sergey Zhemzhitsky 
>> wrote:
>>
>>> Hello guys,
>>>
>>> Currently I'm a little bit confused with coalesce behaviour.
>>>
>>> Consider the following usecase - I'd like to join two pretty big RDDs.
>>> To make a join more stable and to prevent it from failures by OOM RDDs
>>> are usually repartitioned to redistribute data more evenly and to
>>> prevent every partition from hitting 2GB limit. Then after join with a
>>> lot of partitions.
>>>
>>> Then after successful join I'd like to save the resulting dataset.
>>> But I don't need such a huge amount of files as the number of
>>> partitions/tasks during joining. Actually I'm fine with such number of
>>> files as the total number of executor cores allocated to the job. So
>>> I've considered using a coalesce.
>>>
>>> The problem is that coalesce with shuffling disabled prevents join
>>> from using the specified number of partitions and instead forces join
>>> to use the number of partitions provided to coalesce
>>>
>>> scala> sc.makeRDD(1 to 100, 20).repartition(100).coalesce(5,
>>> false).toDebugString
>>> res5: String =
>>> (5) CoalescedRDD[15] at coalesce at :25 []
>>>  |  MapPartitionsRDD[14] at repartition at :25 []
>>>  |  CoalescedRDD[13] at repartition at :25 []
>>>  |  ShuffledRDD[12] at repartition at :25 []
>>>  +-(20) MapPartitionsRDD[11] at repartition at :25 []
>>> |   ParallelCollectionRDD[10] at makeRDD at :25 []
>>>
>>> With shuffling enabled everything is ok, e.g.
>>>
>>> scala> sc.makeRDD(1 to 100, 20).repartition(100).coalesce(5,
>>> true).toDebugString
>>> res6: String =
>>> (5) MapPartitionsRDD[24] at coalesce at :25 []
>>>  |  CoalescedRDD[23] at coalesce at :25 []
>>>  |  ShuffledRDD[22] at coalesce at :25 []
>>>  +-(100) MapPartitionsRDD[21] at coalesce at :25 []
>>>  |   MapPartitionsRDD[20] at repartition at :25 []
>>>  |   CoalescedRDD[19] at repartition at :25 []
>>>  |   ShuffledRDD[18] at repartition at :25 []
>>>  +-(20) MapPartitionsRDD[17] at repartition at :25 []
>>> |   ParallelCollectionRDD[16] at makeRDD at :25 []
>>>
>>> In that case the problem is that for pretty huge datasets additional
>>> reshuffling can take hours or at least comparable amount of time as
>>> for the join itself.
>>>
>>> So I'd like to understand whether it is a bug or just an expected
>>> behaviour?
>>> In case it is expected is there any way to insert additional
>>> ShuffleMapStage into an appropriate position of DAG but without
>>> reshuffling itself?
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>


Re: [DISCUSS] Cascades style CBO for Spark SQL

2018-10-10 Thread 吴晓菊
Hi All,

Takeshi Yamamuro gave some comments on this topic on Twitter. After
more research, here are corrections and updates to my understanding of
bottom-up vs. top-down.

Bottom-up and top-down are just two strategies to enumerate join orders and
generate the search space. Both of them can reach the *same best plan* if the
statistics and cost model are the same. (The best plan is only "best"
theoretically, since the cost model is an estimate of the performance and
workload of the environment, and join selectivity is also based on estimation.)

The performance of the optimizer therefore depends not on bottom-up vs. top-down
as such, but on the algorithms used within the bottom-up or top-down framework.

Both bottom-up and top-down have many algorithms, and some are similar and
compete with each other.

Bottom-up:
  DPsize, DPsub, DPccp
  (Papers: *"Analysis of Two Existing and One New Dynamic Programming
Algorithm for the Generation of Optimal Bushy Join Trees without Cross
Products"*)
Top-down:
  Basic algorithms based on commutativity and associativity: RS-B1, RS-B2
  Graph algorithms: TDMinCutLazy, TDMinCutBranch, TDMinCutConservative, ...
  (Papers: *"Optimal Top-Down Join Enumeration"*,
           *"Optimizing Join Enumeration in Transformation-based Query Optimizers"*,
           *"Effective and Robust Pruning for Top-Down Join Enumeration Algorithms"*)

Before the graph-based algorithms for top-down, it was known that bottom-up is
more efficient, especially for a CartesianProduct-free search space, while
top-down has the advantage of being able to prune. With the graph-based
algorithms, top-down enumeration can also be CP-free. More details are
provided in the paper *"Optimizing Join Enumeration in Transformation-based
Query Optimizers"*.
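
As a concrete toy illustration of the "one best plan per relation subset" bookkeeping
that both enumeration families rely on, here is a small DPsub-flavoured sketch in
Scala. The cardinalities and the 0.1 selectivity are invented; a real optimizer would
use statistics and physical costs instead:

// toy bottom-up join enumeration: keep only the cheapest plan per relation subset
final case class Plan(rels: Set[String], cost: Double, card: Double, desc: String)

def bestJoinOrder(base: Map[String, Double]): Plan = {
  val best = scala.collection.mutable.Map[Set[String], Plan]()
  for ((r, card) <- base) best(Set(r)) = Plan(Set(r), 0.0, card, r)

  val rels = base.keys.toVector
  for (size <- 2 to rels.length; subset <- rels.combinations(size).map(_.toSet)) {
    val candidates = for {
      left <- subset.subsets().toVector
      if left.nonEmpty && left != subset
      l = best(left)
      r = best(subset -- left)
    } yield {
      val outCard = l.card * r.card * 0.1                // fake join selectivity
      Plan(subset, l.cost + r.cost + outCard, outCard, s"(${l.desc} JOIN ${r.desc})")
    }
    best(subset) = candidates.minBy(_.cost)              // only the cheapest survives
  }
  best(rels.toSet)
}

// e.g. bestJoinOrder(Map("A" -> 1e6, "B" -> 1e4, "C" -> 100)).desc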

In conclusion, my suggestion is to implement a Cascades-like top-down
optimizer, based on the best graph algorithm, CP-free and with pruning
enabled, together with a good cost model based on physical
implementations: for example, hash join and sort-merge join have the same
input and output, but the time spent on reading, computing and copying
differs. The details can be similar to what is done in Hive (
https://cwiki.apache.org/confluence/display/Hive/Cost-based+optimization+in+Hive
)

@Xiao Li  @Yamamuro  any comments?

Thanks,
Xiaoju



吴晓菊 wrote on Wed, Sep 26, 2018 at 10:39 AM:

> Hi Xiao
>
> Quite agree with you that a good cost model is important, instead of
> current stats based cost.
>
> While I think the bottom-up framework itself has limitation since it only
> keeps one best plan of each level. But it doesn't exactly mean the best
> plan of the final level. If you want to get the exact best plan of all in
> current bottom-up framework, you need to enumerate all alternative plans
> and compare the costs of them.
>
> Volcano/Cascades framework provides a more efficient solution which is
> already used in Calcite, Greenplum, SQL Server
>
> So I think both framework and cost model are important.
>
> We are now working on a Cascades POC, also considering about a new cost
> model. We want to know if the community is interested in this feature. If
> yes, we can share more detailed design and discuss with you.
>
> Thanks,
> Xiaoju
>
>
>
> Xiao Li wrote on Wed, Sep 26, 2018 at 8:30 AM:
>
>> Hi, Xiaoju,
>>
>> Thanks for sending this to the dev list. The current join reordering rule
>> is just a stats based optimizer rule. Either top-down or bottom-up
>> optimization can achieve the same-level optimized plans. DB2 is using
>> bottom up. In the future, we plan to move the stats based join reordering
>> rule to the cost-based planner, which is the right place of this rule based
>> on the original design of Spark SQL.
>>
>> Actually, building a good cost model is much more difficult than
>> implementing such a classic framework, especially when Spark does not own
>> the data. Also, we need to compute incremental stats instead of always
>> recomputing the stats.
>>
>> Cheers,
>>
>> Xiao
>>
>>
>>
>>> 吴晓菊 wrote on Mon, Sep 24, 2018 at 7:53 PM:
>>
>>> Hi All,
>>>
>>> Current Spark CBO implements a cost based multi-way join reordering
>>> algorithm based on the System-R’s paper [Access Path-SIGMOD’79]
>>> .
>>> When building m-way joins, it uses a bottom-up approach and put all items
>>> (basic joined nodes) into level 0, then build all two-way joins at level 1
>>> from plans at level 0 (single items), then build all 3-way joins ... etc.
>>> The algorithm also considers all combinations including left-deep trees,
>>> bushy trees, and right-deep-trees. It also prunes cartesian product
>>> candidates.
>>>
>>> While we still found many *limitations* of current CBO implementation:
>>> 1. The current CBO is a rule in logic phase, it only outputs one logical
>>> plan to physical phase optimize, while we cannot make sure the best plan in
>>> logical phase is still the best after physical optimize.
>>>
>>> 2. In current bottom-up approach, 

Docker image to build Spark/Spark doc

2018-10-10 Thread assaf.mendelson
Hi all,
I was wondering if there is a Docker image to build Spark and/or the Spark
documentation.

The idea would be that I would start the docker image, supplying the
directory with my code and a target directory and it would simply build
everything (maybe with some options).

Any chance there is already something like that which is working and tested?

Thanks, 
Assaf




--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Coalesce behaviour

2018-10-10 Thread Sergey Zhemzhitsky
Well, it seems that I can still extend CoalescedRDD to make it preserve
the total number of partitions from the parent RDD, reduce some partitions
in the same way as the original coalesce does for map-only jobs, and fill
the gaps (partitions which should reside at the positions of the coalesced
ones) with a special kind of partition which has no parent
dependencies and always returns an empty iterator.

I believe this should work as desired (at least the previous
ShuffleMapStage will think that the number of partitions in the next stage,
the one it generates shuffle output for, has not changed).

There are a few issues, though: the existence of empty partitions, which can be
evaluated almost for free, and empty output files from these empty partitions,
which can be avoided by means of LazyOutputFormat in the case of RDDs.
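
A very rough, untested sketch of that idea, just to make the "empty partitions with
no parent dependencies" part concrete (class names are made up; the actual merging of
the non-empty partitions, locality and the rest of the CoalescedRDD machinery are
omitted):

import scala.reflect.ClassTag
import org.apache.spark.{NarrowDependency, Partition, TaskContext}
import org.apache.spark.rdd.RDD

// a partition that carries no data and depends on no parent partitions
class PadPartition(override val index: Int) extends Partition

class PaddedRDD[T: ClassTag](parent: RDD[T], numPad: Int)
  extends RDD[T](parent.sparkContext, Seq(new NarrowDependency[T](parent) {
    // padded partitions map to no parent partitions at all
    override def getParents(pid: Int): Seq[Int] =
      if (pid < rdd.partitions.length) Seq(pid) else Seq.empty
  })) {

  private def prev: RDD[T] = dependencies.head.rdd.asInstanceOf[RDD[T]]

  override protected def getPartitions: Array[Partition] = {
    val real = prev.partitions
    real ++ Array.tabulate(numPad)(i => new PadPartition(real.length + i))
  }

  override def compute(split: Partition, ctx: TaskContext): Iterator[T] = split match {
    case _: PadPartition => Iterator.empty
    case p               => prev.iterator(p, ctx)
  }
}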



On Mon, Oct 8, 2018, 23:57 Koert Kuipers  wrote:

> although i personally would describe this as a bug the answer will be that
> this is the intended behavior. the coalesce "infects" the shuffle before
> it, making a coalesce useless for reducing output files after a shuffle
> with many partitions b design.
>
> your only option left is a repartition for which you pay the price in that
> it introduces another expensive shuffle.
>
> interestingly if you do a coalesce on a map-only job it knows how to
> reduce the partitions and output files without introducing a shuffle, so
> clearly it is possible, but i dont know how to get this behavior after a
> shuffle in an existing job.
>
> On Fri, Oct 5, 2018 at 6:34 PM Sergey Zhemzhitsky 
> wrote:
>
>> Hello guys,
>>
>> Currently I'm a little bit confused with coalesce behaviour.
>>
>> Consider the following usecase - I'd like to join two pretty big RDDs.
>> To make a join more stable and to prevent it from failures by OOM RDDs
>> are usually repartitioned to redistribute data more evenly and to
>> prevent every partition from hitting 2GB limit. Then after join with a
>> lot of partitions.
>>
>> Then after successful join I'd like to save the resulting dataset.
>> But I don't need such a huge amount of files as the number of
>> partitions/tasks during joining. Actually I'm fine with such number of
>> files as the total number of executor cores allocated to the job. So
>> I've considered using a coalesce.
>>
>> The problem is that coalesce with shuffling disabled prevents join
>> from using the specified number of partitions and instead forces join
>> to use the number of partitions provided to coalesce
>>
>> scala> sc.makeRDD(1 to 100, 20).repartition(100).coalesce(5,
>> false).toDebugString
>> res5: String =
>> (5) CoalescedRDD[15] at coalesce at :25 []
>>  |  MapPartitionsRDD[14] at repartition at :25 []
>>  |  CoalescedRDD[13] at repartition at :25 []
>>  |  ShuffledRDD[12] at repartition at :25 []
>>  +-(20) MapPartitionsRDD[11] at repartition at :25 []
>> |   ParallelCollectionRDD[10] at makeRDD at :25 []
>>
>> With shuffling enabled everything is ok, e.g.
>>
>> scala> sc.makeRDD(1 to 100, 20).repartition(100).coalesce(5,
>> true).toDebugString
>> res6: String =
>> (5) MapPartitionsRDD[24] at coalesce at :25 []
>>  |  CoalescedRDD[23] at coalesce at :25 []
>>  |  ShuffledRDD[22] at coalesce at :25 []
>>  +-(100) MapPartitionsRDD[21] at coalesce at :25 []
>>  |   MapPartitionsRDD[20] at repartition at :25 []
>>  |   CoalescedRDD[19] at repartition at :25 []
>>  |   ShuffledRDD[18] at repartition at :25 []
>>  +-(20) MapPartitionsRDD[17] at repartition at :25 []
>> |   ParallelCollectionRDD[16] at makeRDD at :25 []
>>
>> In that case the problem is that for pretty huge datasets additional
>> reshuffling can take hours or at least comparable amount of time as
>> for the join itself.
>>
>> So I'd like to understand whether it is a bug or just an expected
>> behaviour?
>> In case it is expected is there any way to insert additional
>> ShuffleMapStage into an appropriate position of DAG but without
>> reshuffling itself?
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>