Adaptive Query Execution performance results in 3TB TPC-DS

2020-02-11 Thread Jia, Ke A
Hi all,
We have completed the Spark 3.0 Adaptive Query Execution (AQE) performance tests 
on 3TB TPC-DS on a 5-node Cascade Lake cluster. Two queries show more than a 1.5x 
speedup and 37 queries show more than a 1.1x speedup with AQE. No query shows a 
significant performance degradation. The detailed performance results and key 
configurations are shown here.
Based on these results, we recommend users turn on AQE in Spark 3.0. If you 
encounter any bug or see room for improvement when AQE is enabled, please file 
the related JIRAs. Thanks.
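
(A minimal sketch of turning AQE on, for readers who want to try it - the example values are assumptions, not part of the original message; only the standard spark.sql.adaptive.enabled key is involved, and it defaults to false in Spark 3.0.)

import org.apache.spark.sql.SparkSession

// Hypothetical example: enable Adaptive Query Execution for a Spark 3.0 session.
val spark = SparkSession.builder()
  .appName("aqe-example")
  .config("spark.sql.adaptive.enabled", "true")
  .getOrCreate()

// The flag can also be toggled at runtime for subsequent queries:
spark.conf.set("spark.sql.adaptive.enabled", "true")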

Regards,
Jia Ke



Re: Request to document the direct relationship between other configurations

2020-02-11 Thread Hyukjin Kwon
To do that, we would have to explicitly document the structured-configuration
convention and its implicit effect, which is currently missing.
I would be more than happy to document such implied relationships, *and* to rely
on them once we are sure all configurations are structured correctly and coherently.
Until that point, I think it is more practical to simply document the relationship
explicitly for now.

> Btw, maybe off-topic, `spark.dynamicAllocation` is having another issue
on practice - whether to duplicate description between configuration code
and doc. I have been asked to add description on configuration code
regardlessly, and existing codebase doesn't. This configuration is
widely-used one.
This is actually something we should fix too. In the SQL configurations, we no
longer have such duplication as of
https://github.com/apache/spark/pull/27459, which generates the docs from the
configuration code. We should do the same for the other configurations.


On Wed, Feb 12, 2020 at 11:47 AM, Jungtaek Lim wrote:

> I'm looking into the case of `spark.dynamicAllocation` and this seems to
> be the thing to support my voice.
>
>
> https://github.com/apache/spark/blob/master/docs/configuration.md#dynamic-allocation
>
> I don't disagree with adding "This requires spark.shuffle.service.enabled
> to be set." in the description of `spark.dynamicAllocation.enabled`. This
> cannot be inferred implicitly, hence it should be better to have it.
>
> Why I'm in favor of structured configuration & implicit effect over
> describing everything explicitly is there.
>
> 1. There're 10 configurations (if the doc doesn't miss any other
> configuration) except `spark.dynamicAllocation.enabled`, and only 4
> configurations are referred in the description of
> `spark.dynamicAllocation.enabled` - majority of config keys are missing.
> 2. I think it's intentional, but the table starts
> with `spark.dynamicAllocation.enabled` which talks implicitly but
> intuitively that if you disable this then everything on dynamic allocation
> won't work. Missing majority of references on config keys don't get it hard
> to understand.
> 3. Even `spark.dynamicAllocation` has bad case - see
> `spark.dynamicAllocation.shuffleTracking.enabled` and
> `spark.dynamicAllocation.shuffleTimeout`. It is not respecting the
> structure of configuration. I think this is worse than not explicitly
> mentioning the description. Let's assume the name has
> been `spark.dynamicAllocation.shuffleTracking.timeout` - isn't it intuitive
> that setting `spark.dynamicAllocation.shuffleTracking.enabled` to `false`
> would effectively disable `spark.dynamicAllocation.shuffleTracking.timeout`?
>
> Btw, maybe off-topic, `spark.dynamicAllocation` is having another issue on
> practice - whether to duplicate description between configuration code and
> doc. I have been asked to add description on configuration code
> regardlessly, and existing codebase doesn't. This configuration is
> widely-used one.
>
>
> On Wed, Feb 12, 2020 at 11:22 AM Hyukjin Kwon  wrote:
>
>> Sure, adding "[DISCUSS]" is a good practice to label it. I had to do it
>> although it might be "redundant" :-) since anyone can give feedback to any
>> thread in Spark dev mailing list, and discuss.
>>
>> This is actually more prevailing given my rough reading of configuration
>> files. I would like to see this missing relationship as a bad pattern,
>> started from a personal preference.
>>
>> > Personally I'd rather not think someone won't understand setting
>> `.enabled` to `false` means the functionality is disabled and effectively
>> it disables all sub-configurations.
>> > E.g. when `spark.sql.adaptive.enabled` is `false`, all the
>> configurations for `spark.sql.adaptive.*` are implicitly no-op. For me this
>> is pretty intuitive and the one of major
>> > benefits of the structured configurations.
>>
>> I don't think this is a good idea we assume for users to know such
>> contexts. One might think
>> `spark.sql.adaptive.shuffle.fetchShuffleBlocksInBatch.enabled` can
>> partially enable the feature. It is better to be explicit to document
>> since some of configurations are even difficult for users to confirm if it
>> is working or not.
>> For instance, one might think setting
>> 'spark.eventLog.rolling.maxFileSize' automatically enables rolling. Then,
>> they realise the log is not rolling later after the file
>> size becomes bigger.
>>
>>
>> On Wed, Feb 12, 2020 at 10:47 AM, Jungtaek Lim wrote:
>>
>>> I'm sorry if I miss something, but this is ideally better to be started
>>> as [DISCUSS] as I haven't seen any reference to have consensus on this
>>> practice.
>>>
>>> For me it's just there're two different practices co-existing on the
>>> codebase, meaning it's closer to the preference of individual (with
>>> implicitly agreeing that others have different preferences), or it hasn't
>>> been discussed thoughtfully.
>>>
>>> Personally I'd rather not think someone won't understand setting
>>> `.enabled` to `false` means the functionality is disabled and effectively
>>> it disables all sub-configurations.

Re: Request to document the direct relationship between other configurations

2020-02-11 Thread Jungtaek Lim
I'm looking at the case of `spark.dynamicAllocation`, and it seems to support
my point.

https://github.com/apache/spark/blob/master/docs/configuration.md#dynamic-allocation

I don't disagree with adding "This requires spark.shuffle.service.enabled
to be set." to the description of `spark.dynamicAllocation.enabled`. That
relationship cannot be inferred implicitly, so it is better to state it.
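
(A minimal sketch of that cross-config relationship, with assumed example values that are not from the original message: the .enabled flag alone is not enough, the external shuffle service has to be enabled as well - or, in 3.0, the spark.dynamicAllocation.shuffleTracking alternative mentioned later in this message.)

import org.apache.spark.SparkConf

// Hypothetical example values illustrating the dependency between the keys.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")       // required by the flag above
  .set("spark.dynamicAllocation.minExecutors", "1")   // sub-settings only apply while .enabled is true
  .set("spark.dynamicAllocation.maxExecutors", "20")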

Here is why I'm in favor of structured configuration & implicit effect over
describing everything explicitly:

1. There are 10 configurations (assuming the doc doesn't miss any) besides
`spark.dynamicAllocation.enabled`, and only 4 of them are referenced in the
description of `spark.dynamicAllocation.enabled` - the majority of config keys
are missing.
2. I think it's intentional that the table starts
with `spark.dynamicAllocation.enabled`, which implicitly but intuitively says
that if you disable it, nothing else about dynamic allocation will take effect.
Even with most config-key references missing, it is not hard to understand.
3. Even `spark.dynamicAllocation` has a bad case - see
`spark.dynamicAllocation.shuffleTracking.enabled` and
`spark.dynamicAllocation.shuffleTimeout`. The latter does not respect the
structure of the configuration, which I think is worse than leaving the
relationship out of the description. Suppose the name had
been `spark.dynamicAllocation.shuffleTracking.timeout` - isn't it intuitive
that setting `spark.dynamicAllocation.shuffleTracking.enabled` to `false`
would effectively disable `spark.dynamicAllocation.shuffleTracking.timeout`?

Btw, maybe off-topic, but `spark.dynamicAllocation` has another issue in
practice - whether to duplicate the description between the configuration code
and the docs. I have been asked to add a description to the configuration code
regardless, while the existing codebase doesn't have one. And this is a widely
used configuration.


On Wed, Feb 12, 2020 at 11:22 AM Hyukjin Kwon  wrote:

> Sure, adding "[DISCUSS]" is a good practice to label it. I had to do it
> although it might be "redundant" :-) since anyone can give feedback to any
> thread in Spark dev mailing list, and discuss.
>
> This is actually more prevailing given my rough reading of configuration
> files. I would like to see this missing relationship as a bad pattern,
> started from a personal preference.
>
> > Personally I'd rather not think someone won't understand setting
> `.enabled` to `false` means the functionality is disabled and effectively
> it disables all sub-configurations.
> > E.g. when `spark.sql.adaptive.enabled` is `false`, all the
> configurations for `spark.sql.adaptive.*` are implicitly no-op. For me this
> is pretty intuitive and the one of major
> > benefits of the structured configurations.
>
> I don't think this is a good idea we assume for users to know such
> contexts. One might think
> `spark.sql.adaptive.shuffle.fetchShuffleBlocksInBatch.enabled` can
> partially enable the feature. It is better to be explicit to document
> since some of configurations are even difficult for users to confirm if it
> is working or not.
> For instance, one might think setting 'spark.eventLog.rolling.maxFileSize'
> automatically enables rolling. Then, they realise the log is not rolling
> later after the file
> size becomes bigger.
>
>
> On Wed, Feb 12, 2020 at 10:47 AM, Jungtaek Lim wrote:
>
>> I'm sorry if I miss something, but this is ideally better to be started
>> as [DISCUSS] as I haven't seen any reference to have consensus on this
>> practice.
>>
>> For me it's just there're two different practices co-existing on the
>> codebase, meaning it's closer to the preference of individual (with
>> implicitly agreeing that others have different preferences), or it hasn't
>> been discussed thoughtfully.
>>
>> Personally I'd rather not think someone won't understand setting
>> `.enabled` to `false` means the functionality is disabled and effectively
>> it disables all sub-configurations. E.g. when `spark.sql.adaptive.enabled`
>> is `false`, all the configurations for `spark.sql.adaptive.*` are
>> implicitly no-op. For me this is pretty intuitive and the one of major
>> benefits of the structured configurations.
>>
>> If we want to make it explicit, "all" sub-configurations should have
>> redundant part of the doc. More redundant if the condition is nested. I
>> agree this is the good step of "be kind" but less pragmatic.
>>
>> I'd be happy to follow the consensus we would make in this thread.
>> Appreciate more voices.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>>
>> On Wed, Feb 12, 2020 at 10:36 AM Hyukjin Kwon 
>> wrote:
>>
>>> > I don't plan to document this officially yet
>>> Just to prevent confusion, I meant I don't yet plan to document the fact
>>> that we should write the relationships in configurations as a code/review
>>> guideline in https://spark.apache.org/contributing.html
>>>
>>>
>>> On Wed, Feb 12, 2020 at 9:57 AM, Hyukjin Kwon wrote:
>>>
 Hi all,


Re: Request to document the direct relationship between other configurations

2020-02-11 Thread Hyukjin Kwon
Sure, adding "[DISCUSS]" to label it is a good practice. I should have done so,
although it might be "redundant" :-) since anyone can give feedback on, and
discuss, any thread on the Spark dev mailing list.

This is actually the more prevalent pattern, given my rough reading of the
configuration files. I would like to treat the missing relationship as a bad
pattern, even though this started from a personal preference.

> Personally I'd rather not think someone won't understand setting
`.enabled` to `false` means the functionality is disabled and effectively
it disables all sub-configurations.
> E.g. when `spark.sql.adaptive.enabled` is `false`, all the configurations
for `spark.sql.adaptive.*` are implicitly no-op. For me this is pretty
intuitive and the one of major
> benefits of the structured configurations.

I don't think it is a good idea to assume users know such context. One might
think `spark.sql.adaptive.shuffle.fetchShuffleBlocksInBatch.enabled` can
partially enable the feature. It is better to document the relationship
explicitly, since for some configurations it is even difficult for users to
confirm whether they are working or not.
For instance, one might think that setting 'spark.eventLog.rolling.maxFileSize'
automatically enables rolling, and only realise later, after the file size has
become bigger, that the log is not rolling.
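
(To make the event-log example concrete, a minimal sketch with assumed example values - the size key alone does nothing without the parent .enabled key.)

import org.apache.spark.SparkConf

// Hypothetical example values: maxFileSize alone does not turn rolling on.
val conf = new SparkConf()
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.rolling.enabled", "true")      // without this, the next line has no effect
  .set("spark.eventLog.rolling.maxFileSize", "128m")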


On Wed, Feb 12, 2020 at 10:47 AM, Jungtaek Lim wrote:

> I'm sorry if I miss something, but this is ideally better to be started as
> [DISCUSS] as I haven't seen any reference to have consensus on this
> practice.
>
> For me it's just there're two different practices co-existing on the
> codebase, meaning it's closer to the preference of individual (with
> implicitly agreeing that others have different preferences), or it hasn't
> been discussed thoughtfully.
>
> Personally I'd rather not think someone won't understand setting
> `.enabled` to `false` means the functionality is disabled and effectively
> it disables all sub-configurations. E.g. when `spark.sql.adaptive.enabled`
> is `false`, all the configurations for `spark.sql.adaptive.*` are
> implicitly no-op. For me this is pretty intuitive and the one of major
> benefits of the structured configurations.
>
> If we want to make it explicit, "all" sub-configurations should have
> redundant part of the doc. More redundant if the condition is nested. I
> agree this is the good step of "be kind" but less pragmatic.
>
> I'd be happy to follow the consensus we would make in this thread.
> Appreciate more voices.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
>
> On Wed, Feb 12, 2020 at 10:36 AM Hyukjin Kwon  wrote:
>
>> > I don't plan to document this officially yet
>> Just to prevent confusion, I meant I don't yet plan to document the fact
>> that we should write the relationships in configurations as a code/review
>> guideline in https://spark.apache.org/contributing.html
>>
>>
>> On Wed, Feb 12, 2020 at 9:57 AM, Hyukjin Kwon wrote:
>>
>>> Hi all,
>>>
>>> I happened to review some PRs and I noticed that some configurations
>>> don't have some information
>>> necessary.
>>>
>>> To be explicit, I would like to make sure we document the direct
>>> relationship between other configurations
>>> in the documentation. For example,
>>> `spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled`
>>> can be only enabled when `spark.sql.adaptive.enabled` is enabled. That's
>>> clearly documented.
>>> We're good in general given that we document them in general in Apache
>>> Spark.
>>> See 'spark.task.reaper.enabled', 'spark.dynamicAllocation.enabled',
>>> 'spark.sql.parquet.filterPushdown', etc.
>>>
>>> However, I noticed such a pattern that such information is missing in
>>> some components in general, for example,
>>> `spark.history.fs.cleaner.*`, `spark.history.kerberos.*` and
>>> `spark.history.ui.acls.* `
>>>
>>> I hope we all start to document such information. Logically users can't
>>> know the relationship and I myself
>>> had to read the codes to confirm when I review.
>>> I don't plan to document this officially yet because to me it looks a
>>> pretty logical request to me; however,
>>> let me know if you guys have some different opinions.
>>>
>>> Thanks.
>>>
>>>
>>>


Re: Request to document the direct relationship between other configurations

2020-02-11 Thread Jungtaek Lim
I'm sorry if I'm missing something, but this would ideally be started as a
[DISCUSS] thread, as I haven't seen any reference establishing consensus on this
practice.

To me it looks like two different practices simply co-exist in the codebase,
meaning it is closer to an individual preference (with implicit agreement that
others have different preferences), or it hasn't been discussed thoroughly.

Personally, I doubt anyone would fail to understand that setting `.enabled`
to `false` disables the functionality and effectively disables all
sub-configurations. E.g. when `spark.sql.adaptive.enabled` is `false`, all the
`spark.sql.adaptive.*` configurations are implicitly no-ops. For me this is
pretty intuitive and one of the major benefits of structured configurations.

If we want to make it explicit, "all" sub-configurations would need a redundant
part in their docs - more redundant still if the condition is nested. I agree
this is a good "be kind" step, but it is less pragmatic.

I'd be happy to follow whatever consensus we reach in this thread.
I'd appreciate more voices.

Thanks,
Jungtaek Lim (HeartSaVioR)


On Wed, Feb 12, 2020 at 10:36 AM Hyukjin Kwon  wrote:

> > I don't plan to document this officially yet
> Just to prevent confusion, I meant I don't yet plan to document the fact
> that we should write the relationships in configurations as a code/review
> guideline in https://spark.apache.org/contributing.html
>
>
> On Wed, Feb 12, 2020 at 9:57 AM, Hyukjin Kwon wrote:
>
>> Hi all,
>>
>> I happened to review some PRs and I noticed that some configurations
>> don't have some information
>> necessary.
>>
>> To be explicit, I would like to make sure we document the direct
>> relationship between other configurations
>> in the documentation. For example,
>> `spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled`
>> can be only enabled when `spark.sql.adaptive.enabled` is enabled. That's
>> clearly documented.
>> We're good in general given that we document them in general in Apache
>> Spark.
>> See 'spark.task.reaper.enabled', 'spark.dynamicAllocation.enabled',
>> 'spark.sql.parquet.filterPushdown', etc.
>>
>> However, I noticed such a pattern that such information is missing in
>> some components in general, for example,
>> `spark.history.fs.cleaner.*`, `spark.history.kerberos.*` and
>> `spark.history.ui.acls.* `
>>
>> I hope we all start to document such information. Logically users can't
>> know the relationship and I myself
>> had to read the codes to confirm when I review.
>> I don't plan to document this officially yet because to me it looks a
>> pretty logical request to me; however,
>> let me know if you guys have some different opinions.
>>
>> Thanks.
>>
>>
>>


Re: Request to document the direct relationship between other configurations

2020-02-11 Thread Hyukjin Kwon
> I don't plan to document this officially yet
Just to prevent confusion: I meant that I don't yet plan to document, as a
code/review guideline at https://spark.apache.org/contributing.html, that we
should describe these relationships between configurations.


On Wed, Feb 12, 2020 at 9:57 AM, Hyukjin Kwon wrote:

> Hi all,
>
> I happened to review some PRs and I noticed that some configurations don't
> have some information
> necessary.
>
> To be explicit, I would like to make sure we document the direct
> relationship between other configurations
> in the documentation. For example,
> `spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled`
> can be only enabled when `spark.sql.adaptive.enabled` is enabled. That's
> clearly documented.
> We're good in general given that we document them in general in Apache
> Spark.
> See 'spark.task.reaper.enabled', 'spark.dynamicAllocation.enabled',
> 'spark.sql.parquet.filterPushdown', etc.
>
> However, I noticed such a pattern that such information is missing in some
> components in general, for example,
> `spark.history.fs.cleaner.*`, `spark.history.kerberos.*` and
> `spark.history.ui.acls.* `
>
> I hope we all start to document such information. Logically users can't
> know the relationship and I myself
> had to read the codes to confirm when I review.
> I don't plan to document this officially yet because to me it looks a
> pretty logical request to me; however,
> let me know if you guys have some different opinions.
>
> Thanks.
>
>
>


Request to document the direct relationship between other configurations

2020-02-11 Thread Hyukjin Kwon
Hi all,

I happened to review some PRs and noticed that some configurations are missing
necessary information.

To be explicit, I would like to make sure we document the direct relationships
between configurations. For example,
`spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled`
only takes effect when `spark.sql.adaptive.enabled` is enabled, and that is
clearly documented.
In general we do document these relationships in Apache Spark - see
'spark.task.reaper.enabled', 'spark.dynamicAllocation.enabled',
'spark.sql.parquet.filterPushdown', etc.

However, I noticed a pattern where this information is missing in some
components, for example
`spark.history.fs.cleaner.*`, `spark.history.kerberos.*` and
`spark.history.ui.acls.*`.

I hope we all start to document such information. Users cannot logically infer
the relationship, and I myself had to read the code to confirm it when
reviewing.
I don't plan to document this officially yet, because it looks like a pretty
reasonable request to me; however, let me know if you have different opinions.

Thanks.


Re: [build system] enabled the ubuntu staging node to help w/build queue

2020-02-11 Thread Takeshi Yamamuro
Thanks always..!

Bests,
Takeshi

On Wed, Feb 12, 2020 at 3:28 AM shane knapp ☠  wrote:

> the build queue has been increasing and to help throughput i enabled the
> 'ubuntu-testing' node.  i spot-checked a bunch of the spark maven builds,
> and they passed.
>
> i'll keep an eye out for any failures caused by the system and either
> remove it from the worker pool of fix what i need to.
>
> shane
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
---
Takeshi Yamamuro


SMJ operator spilling perf improvements PR 27246

2020-02-11 Thread sinisa knezevic


Hello All,
Could you please let me know what the next step would be for PR
https://github.com/apache/spark/pull/27246? I would like to know if there is any
action item on my side.
Thank you,
Sinisa





Re: Apache Spark Docker image repository

2020-02-11 Thread Dongjoon Hyun
Hi, Sean.

Yes. We should keep this minimal.

BTW, regarding the following question:

> But how much value does that add?

How much value do you think our binary distribution at the following link
provides?

-
https://www.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz

A Docker image can provide similar value for users who are working in a
Dockerized environment.

If you are assuming users who build from source or live on vendor
distributions, then neither the existing binary distribution link above nor a
Docker image adds any value.

Bests,
Dongjoon.


On Tue, Feb 11, 2020 at 8:51 AM Sean Owen  wrote:

> To be clear this is a convenience 'binary' for end users, not just an
> internal packaging to aid the testing framework?
>
> There's nothing wrong with providing an additional official packaging
> if we vote on it and it follows all the rules. There is an open
> question about how much value it adds vs that maintenance. I see we do
> already have some Dockerfiles, sure. Is it possible to reuse or
> repurpose these so that we don't have more to maintain? or: what is
> different from the existing Dockerfiles here? (dumb question, never
> paid much attention to them)
>
> We definitely can't release GPL bits or anything, yes. Just releasing
> a Dockerfile referring to GPL bits is a gray area - no bits are being
> redistributed, but, does it constitute a derived work where the GPL
> stuff is a non-optional dependency? Would any publishing of these
> images cause us to put a copy of third party GPL code anywhere?
>
> At the least, we should keep this minimal. One image if possible, that
> you overlay on top of your preferred OS/Java/Python image. But how
> much value does that add? I have no info either way that people want
> or don't need such a thing.
>
> On Tue, Feb 11, 2020 at 10:13 AM Erik Erlandson 
> wrote:
> >
> > My takeaway from the last time we discussed this was:
> > 1) To be ASF compliant, we needed to only publish images at official
> releases
> > 2) There was some ambiguity about whether or not a container image that
> included GPL'ed packages (spark images do) might trip over the GPL "viral
> propagation" due to integrating ASL and GPL in a "binary release".  The
> "air gap" GPL provision may apply - the GPL software interacts only at
> command-line boundaries.
> >
> > On Wed, Feb 5, 2020 at 1:23 PM Dongjoon Hyun 
> wrote:
> >>
> >> Hi, All.
> >>
> >> From 2020, shall we have an official Docker image repository as an
> additional distribution channel?
> >>
> >> I'm considering the following images.
> >>
> >> - Public binary release (no snapshot image)
> >> - Public non-Spark base image (OS + R + Python)
> >>   (This can be used in GitHub Action Jobs and Jenkins K8s
> Integration Tests to speed up jobs and to have more stabler environments)
> >>
> >> Bests,
> >> Dongjoon.
>


Re: comparable and orderable CalendarInterval

2020-02-11 Thread Enrico Minack
I compute the difference of two timestamps and compare it with a 
constant interval:


import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.{CalendarIntervalType, TimestampType}
import spark.implicits._  // for $"..." and .toDF, given a SparkSession named `spark`

Seq(("2019-01-02 12:00:00", "2019-01-02 13:30:00"))
  .toDF("start", "end")
  .select($"start".cast(TimestampType), $"end".cast(TimestampType))
  .select($"start", $"end", ($"end" - $"start").as("diff"))
  .where($"diff" < lit("INTERVAL 2 HOUR").cast(CalendarIntervalType))
  .show

Coming from timestamps, the interval should have correct hours 
(millisecond component), so comparing it with the "right kinds of 
intervals" should always be correct.


Enrico


On 11.02.20 at 17:06, Wenchen Fan wrote:
What's your use case to compare intervals? It's tricky in Spark as 
there is only one interval type and you can't really compare one month 
with 30 days.


On Wed, Feb 12, 2020 at 12:01 AM Enrico Minack wrote:


Hi Devs,

I would like to know what is the current roadmap of making
CalendarInterval comparable and orderable again (SPARK-29679,
SPARK-29385, #26337).

With #27262, this got reverted but SPARK-30551 does not mention how to
go forward in this matter. I have found SPARK-28494, but this seems to
be stale.

While I find it useful to compare such intervals, I cannot find a way to
work around the missing comparability. Is there a way to get, e.g. the
seconds that an interval represents to be able to compare intervals? In
org.apache.spark.sql.catalyst.util.IntervalUtils there are methods like
getEpoch or getDuration, which I cannot see are exposed to SQL or in the
org.apache.spark.sql.functions package.

Thanks for the insights,
Enrico


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org






[build system] enabled the ubuntu staging node to help w/build queue

2020-02-11 Thread shane knapp ☠
the build queue has been increasing and to help throughput i enabled the
'ubuntu-testing' node.  i spot-checked a bunch of the spark maven builds,
and they passed.

i'll keep an eye out for any failures caused by the system and either
remove it from the worker pool or fix what i need to.

shane
-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: Apache Spark Docker image repository

2020-02-11 Thread Sean Owen
To be clear, this is a convenience 'binary' for end users, not just internal
packaging to aid the testing framework?

There's nothing wrong with providing an additional official packaging
if we vote on it and it follows all the rules. There is an open
question about how much value it adds versus the maintenance it requires. I see
we do already have some Dockerfiles, sure. Is it possible to reuse or
repurpose these so that we don't have more to maintain? Or: what is
different about the proposed images from the existing Dockerfiles? (dumb
question, I never paid much attention to them)

We definitely can't release GPL bits or anything, yes. Just releasing
a Dockerfile referring to GPL bits is a gray area - no bits are being
redistributed, but does it constitute a derived work where the GPL
stuff is a non-optional dependency? Would publishing these
images cause us to put a copy of third-party GPL code anywhere?

At the least, we should keep this minimal: one image if possible, which you
overlay on top of your preferred OS/Java/Python image. But how much value does
that add? I have no information either way on whether people want or need such
a thing.

On Tue, Feb 11, 2020 at 10:13 AM Erik Erlandson  wrote:
>
> My takeaway from the last time we discussed this was:
> 1) To be ASF compliant, we needed to only publish images at official releases
> 2) There was some ambiguity about whether or not a container image that 
> included GPL'ed packages (spark images do) might trip over the GPL "viral 
> propagation" due to integrating ASL and GPL in a "binary release".  The "air 
> gap" GPL provision may apply - the GPL software interacts only at 
> command-line boundaries.
>
> On Wed, Feb 5, 2020 at 1:23 PM Dongjoon Hyun  wrote:
>>
>> Hi, All.
>>
>> From 2020, shall we have an official Docker image repository as an 
>> additional distribution channel?
>>
>> I'm considering the following images.
>>
>> - Public binary release (no snapshot image)
>> - Public non-Spark base image (OS + R + Python)
>>   (This can be used in GitHub Action Jobs and Jenkins K8s Integration 
>> Tests to speed up jobs and to have more stabler environments)
>>
>> Bests,
>> Dongjoon.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Apache Spark Docker image repository

2020-02-11 Thread Erik Erlandson
My takeaway from the last time we discussed this was:
1) To be ASF compliant, we needed to only publish images at official
releases
2) There was some ambiguity about whether or not a container image that
included GPL'ed packages (spark images do) might trip over the GPL "viral
propagation" due to integrating ASL and GPL in a "binary release".  The
"air gap" GPL provision may apply - the GPL software interacts only at
command-line boundaries.

On Wed, Feb 5, 2020 at 1:23 PM Dongjoon Hyun 
wrote:

> Hi, All.
>
> From 2020, shall we have an official Docker image repository as an
> additional distribution channel?
>
> I'm considering the following images.
>
> - Public binary release (no snapshot image)
> - Public non-Spark base image (OS + R + Python)
>   (This can be used in GitHub Action Jobs and Jenkins K8s Integration
> Tests to speed up jobs and to have more stabler environments)
>
> Bests,
> Dongjoon.
>


Re: comparable and orderable CalendarInterval

2020-02-11 Thread Joseph Torres
The problem is that there isn't a consistent number of seconds an interval
represents - as Wenchen mentioned, a month interval isn't a fixed number of
days. If your use case can account for that, maybe you could add the
interval to a fixed reference date and then compare the result.
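
(A sketch of that workaround, assuming a hypothetical DataFrame `df` with a CalendarInterval column `diff`, as in the example elsewhere in the thread: anchor both sides to the same fixed reference timestamp and compare the resulting timestamps instead of the intervals.)

import org.apache.spark.sql.functions.{expr, lit, to_timestamp}
import spark.implicits._  // for $"...", given a SparkSession named `spark`

// Hypothetical: df has an interval column "diff" (e.g. end - start).
val ref = to_timestamp(lit("2000-01-01 00:00:00"))
val result = df.where(ref + $"diff" < ref + expr("INTERVAL 2 HOURS"))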

On Tue, Feb 11, 2020 at 8:01 AM Enrico Minack 
wrote:

> Hi Devs,
>
> I would like to know what is the current roadmap of making
> CalendarInterval comparable and orderable again (SPARK-29679,
> SPARK-29385, #26337).
>
> With #27262, this got reverted but SPARK-30551 does not mention how to
> go forward in this matter. I have found SPARK-28494, but this seems to
> be stale.
>
> While I find it useful to compare such intervals, I cannot find a way to
> work around the missing comparability. Is there a way to get, e.g. the
> seconds that an interval represents to be able to compare intervals? In
> org.apache.spark.sql.catalyst.util.IntervalUtils there are methods like
> getEpoch or getDuration, which I cannot see are exposed to SQL or in the
> org.apache.spark.sql.functions package.
>
> Thanks for the insights,
> Enrico
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: comparable and orderable CalendarInterval

2020-02-11 Thread Wenchen Fan
What's your use case for comparing intervals? It's tricky in Spark, as there is
only one interval type and you can't really compare one month with 30 days.

On Wed, Feb 12, 2020 at 12:01 AM Enrico Minack 
wrote:

> Hi Devs,
>
> I would like to know what is the current roadmap of making
> CalendarInterval comparable and orderable again (SPARK-29679,
> SPARK-29385, #26337).
>
> With #27262, this got reverted but SPARK-30551 does not mention how to
> go forward in this matter. I have found SPARK-28494, but this seems to
> be stale.
>
> While I find it useful to compare such intervals, I cannot find a way to
> work around the missing comparability. Is there a way to get, e.g. the
> seconds that an interval represents to be able to compare intervals? In
> org.apache.spark.sql.catalyst.util.IntervalUtils there are methods like
> getEpoch or getDuration, which I cannot see are exposed to SQL or in the
> org.apache.spark.sql.functions package.
>
> Thanks for the insights,
> Enrico
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


comparable and orderable CalendarInterval

2020-02-11 Thread Enrico Minack

Hi Devs,

I would like to know what is the current roadmap of making 
CalendarInterval comparable and orderable again (SPARK-29679, 
SPARK-29385, #26337).


With #27262, this got reverted but SPARK-30551 does not mention how to 
go forward in this matter. I have found SPARK-28494, but this seems to 
be stale.


While I find it useful to compare such intervals, I cannot find a way to 
work around the missing comparability. Is there a way to get, e.g., the 
number of seconds an interval represents, in order to compare intervals? In 
org.apache.spark.sql.catalyst.util.IntervalUtils there are methods like 
getEpoch or getDuration, but I cannot see them exposed to SQL or in the 
org.apache.spark.sql.functions package.


Thanks for the insights,
Enrico


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org