JDK11 QA (SPARK-29194)

2019-09-20 Thread Dongjoon Hyun
Hi, All.

As a next step, we started JDK11 QA.

https://issues.apache.org/jira/browse/SPARK-29194

This issue mainly focuses on the following areas, but feel free to add any
sub-issues that you hit on JDK11 from now on.

- Documentation
- Examples
- Performance
- Integration Tests

Bests,
Dongjoon.


Re: [DISCUSS] Spark 2.5 release

2019-09-20 Thread Dongjoon Hyun
Do you mean you want to have a breaking API change between 3.0 and 3.1?
I believe we follow Semantic Versioning (
https://spark.apache.org/versioning-policy.html ).

> We just won’t add any breaking changes before 3.1.

Bests,
Dongjoon.


On Fri, Sep 20, 2019 at 11:48 AM Ryan Blue 
wrote:

> I don’t think we need to gate a 3.0 release on making a more stable
> version of InternalRow
>
> Sounds like we agree, then. We will use it for 3.0, but there are known
> problems with it.
>
> Thinking we’d have dsv2 working in both 3.x (which will change and
> progress towards more stable, but will have to break certain APIs) and 2.x
> seems like a false premise.
>
> Why do you think we will need to break certain APIs before 3.0?
>
> I’m only suggesting that we release the same support in a 2.5 release that
> we do in 3.0. Since we are nearly finished with the 3.0 goals, it seems
> like we can certainly do that. We just won’t add any breaking changes
> before 3.1.
>
> On Fri, Sep 20, 2019 at 11:39 AM Reynold Xin  wrote:
>
>> I don't think we need to gate a 3.0 release on making a more stable
>> version of InternalRow, but thinking we'd have dsv2 working in both 3.x
>> (which will change and progress towards more stable, but will have to break
>> certain APIs) and 2.x seems like a false premise.
>>
>> To point out some problems with InternalRow that you think are already
>> pragmatic and stable:
>>
>> The class is in catalyst, which states:
>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/package.scala
>>
>> /**
>> * Catalyst is a library for manipulating relational query plans.  All
>> classes in catalyst are
>> * considered an internal API to Spark SQL and are subject to change
>> between minor releases.
>> */
>>
>> There is not even an annotation on the interface.
>>
>> The entire dependency chain was created to be private, and tightly
>> coupled with internal implementations. For example,
>>
>>
>> https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
>>
>> /**
>> * A UTF-8 String for internal Spark use.
>> * 
>> * A String encoded in UTF-8 as an Array[Byte], which can be used for
>> comparison,
>> * search, see http://en.wikipedia.org/wiki/UTF-8 for details.
>> * 
>> * Note: This is not designed for general use cases, should not be used
>> outside SQL.
>> */
>>
>>
>>
>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayData.scala
>>
>> (which again is in catalyst package)
>>
>>
>> If you want to argue this way, you might as well argue we should make the
>> entire catalyst package public to be pragmatic and not allow any changes.
>>
>>
>>
>>
>> On Fri, Sep 20, 2019 at 11:32 AM, Ryan Blue  wrote:
>>
>>> When you created the PR to make InternalRow public
>>>
>>> This isn’t quite accurate. The change I made was to use InternalRow
>>> instead of UnsafeRow, which is a specific implementation of InternalRow.
>>> Exposing this API has always been a part of DSv2 and while both you and I
>>> did some work to avoid this, we are still in the phase of starting with
>>> that API.
>>>
>>> Note that any change to InternalRow would be very costly to implement
>>> because this interface is widely used. That is why I think we can certainly
>>> consider it stable enough to use here, and that’s probably why UnsafeRow
>>> was part of the original proposal.
>>>
>>> In any case, the goal for 3.0 was not to replace the use of InternalRow,
>>> it was to get the majority of SQL working on top of the interface added
>>> after 2.4. That’s done and stable, so I think a 2.5 release with it is also
>>> reasonable.
>>>
>>> On Fri, Sep 20, 2019 at 11:23 AM Reynold Xin 
>>> wrote:
>>>
>>> To push back, while I agree we should not drastically change
>>> "InternalRow", there are a lot of changes that need to happen to make it
>>> stable. For example, none of the publicly exposed interfaces should be in
>>> the Catalyst package or the unsafe package. External implementations should
>>> be decoupled from the internal implementations, with cheap ways to convert
>>> back and forth.
>>>
>>> When you created the PR to make InternalRow public, the understanding
>>> was to work towards making it stable in the future, assuming we will start
>>> with an unstable API temporarily. You can't just make a bunch of internal APIs
>>> that are tightly coupled with other internal pieces public and stable and call
>>> it a day, just because they happen to satisfy some use cases temporarily,
>>> assuming the rest of Spark doesn't change.
>>>
>>>
>>>
>>> On Fri, Sep 20, 2019 at 11:19 AM, Ryan Blue  wrote:
>>>
>>> > DSv2 is far from stable right?
>>>
>>> No, I think it is reasonably stable and very close to being ready for a
>>> release.
>>>
>>> > All the actual data types are unstable and you guys have completely
>>> ignored that.
>>>
>>> I think what you're referring to is the use of `InternalRow`. That's 

Re: [DISCUSS] Spark 2.5 release

2019-09-21 Thread Dongjoon Hyun
>>> As I pointed out, DSv2 has been changing in the 2.x line, so this is
>>> expected. I don't think it will need incompatible changes in the 3.x line.
>>>
>>> On Fri, Sep 20, 2019 at 9:25 PM Jungtaek Lim  wrote:
>>>
>>> Just 2 cents: I haven't tracked the changes to DSv2 (though I had to
>>> deal with them, as the changes caused confusion in my PRs...), but my bet is
>>> that DSv2 has already changed in incompatible ways, at least for those who
>>> work on custom DataSources. Making downstream projects diverge their
>>> implementations heavily between minor versions (say, 2.4 vs 2.5) wouldn't be
>>> a good experience - especially since we have not completely closed off the
>>> chance to further modify DSv2, and such changes could be backward incompatible.
>>>
>>> If we really want to bring the DSv2 changes to the 2.x line so that end
>>> users aren't forced to upgrade to Spark 3.x to enjoy the new DSv2, I'd rather
>>> say preparation of Spark 2.5 should start after Spark 3.0 is officially
>>> released - honestly even later than that, say, after getting some reports
>>> from Spark 3.0 about DSv2 so that we feel DSv2 is OK. I hope we don't make
>>> Spark 2.5 a kind of "tech-preview" that frustrates Spark 2.4 users when they
>>> upgrade to the next minor version.
>>>
>>> Btw, do we have any specific target users for this? Personally, the DSv2
>>> change would be the major backward incompatibility that makes Spark 2.x users
>>> hesitate to upgrade, so if they are prepared to migrate to the new DSv2, they
>>> might already be prepared to migrate to Spark 3.0.
>>>
>>> On Sat, Sep 21, 2019 at 12:46 PM Dongjoon Hyun 
>>> wrote:
>>>
>>> Do you mean you want to have a breaking API change between 3.0 and 3.1?
>>> I believe we follow Semantic Versioning (
>>> https://spark.apache.org/versioning-policy.html ).
>>>
>>> > We just won’t add any breaking changes before 3.1.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Fri, Sep 20, 2019 at 11:48 AM Ryan Blue 
>>> wrote:
>>>
>>> I don’t think we need to gate a 3.0 release on making a more stable
>>> version of InternalRow
>>>
>>> Sounds like we agree, then. We will use it for 3.0, but there are known
>>> problems with it.
>>>
>>> Thinking we’d have dsv2 working in both 3.x (which will change and
>>> progress towards more stable, but will have to break certain APIs) and 2.x
>>> seems like a false premise.
>>>
>>> Why do you think we will need to break certain APIs before 3.0?
>>>
>>> I’m only suggesting that we release the same support in a 2.5 release
>>> that we do in 3.0. Since we are nearly finished with the 3.0 goals, it
>>> seems like we can certainly do that. We just won’t add any breaking changes
>>> before 3.1.
>>>
>>> On Fri, Sep 20, 2019 at 11:39 AM Reynold Xin 
>>> wrote:
>>>
>>> I don't think we need to gate a 3.0 release on making a more stable
>>> version of InternalRow, but thinking we'd have dsv2 working in both 3.x
>>> (which will change and progress towards more stable, but will have to break
>>> certain APIs) and 2.x seems like a false premise.
>>>
>>> To point out some problems with InternalRow that you think are already
>>> pragmatic and stable:
>>>
>>> The class is in catalyst, which states:
>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/package.scala
>>>
>>> /**
>>> * Catalyst is a library for manipulating relational query plans.  All
>>> classes in catalyst are
>>> * considered an internal API to Spark SQL and are subject to change
>>> between minor releases.
>>> */
>>>
>>> There is not even an annotation on the interface.
>>>
>>> The entire dependency chain was created to be private, and tightly
>>> coupled with internal implementations. For example,
>>>
>>>
>>> https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
>>>
>>> /**
>>> * A UTF-8 String for internal Spark use.
>>> * 
>>> * A String encoded in UTF-8 as an Array[Byte], which can be used for
>>> comparison,
>>> * search, see http://en.wikipedia.org/wiki/UTF-8 for details.
>>> * 
>>> * Note: This is not designed for general use cases, should not be u

Re: [DISCUSS] Preferred approach on dealing with SPARK-29322

2019-10-01 Thread Dongjoon Hyun
Thank you for reporting, Jungtaek.

Can we try to upgrade it to the newer version first?

Since we are at 1.4.2, the newer version is 1.4.3.
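(For reference, such an upgrade is essentially a one-line version bump of the
zstd-jni dependency in Spark's build. A minimal sketch of the Maven coordinates
involved; the exact layout in Spark's pom.xml and the `1.4.3-1` version string
are assumptions to verify against the zstd-jni releases.)

```
<!-- Sketch only: bump the zstd-jni dependency used by Spark's zstd codec -->
<dependency>
  <groupId>com.github.luben</groupId>
  <artifactId>zstd-jni</artifactId>
  <version>1.4.3-1</version> <!-- previously a 1.4.2 release -->
</dependency>
```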

Bests,
Dongjoon.



On Tue, Oct 1, 2019 at 9:18 PM Mridul Muralidharan  wrote:

> Makes more sense to drop support for zstd assuming the fix is not
> something at spark end (configuration, etc).
> Does not make sense to try to detect deadlock in codec.
>
> Regards,
> Mridul
>
> On Tue, Oct 1, 2019 at 8:39 PM Jungtaek Lim
>  wrote:
> >
> > Hi devs,
> >
> > I've discovered an issue with event logger, specifically reading
> incomplete event log file which is compressed with 'zstd' - the reader
> thread got stuck on reading that file.
> >
> > This is very easy to reproduce: set the configuration as below
> >
> > - spark.eventLog.enabled=true
> > - spark.eventLog.compress=true
> > - spark.eventLog.compression.codec=zstd
> >
> > and start a Spark application. While the application is running, load the
> > application in the SHS web page. It may succeed in replaying the event log,
> > but most likely it will get stuck and the loading page will be stuck as well.
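(Editor's note: the reproduction above boils down to something like the
following spark-submit invocation; the example class, jar path, and master URL
are placeholders, and any long-running application works.)

```
spark-submit \
  --master local[2] \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.compress=true \
  --conf spark.eventLog.compression.codec=zstd \
  --class org.apache.spark.examples.SparkPi \
  examples/jars/spark-examples_2.11-2.4.4.jar 10000
# While it runs, open the in-progress application in the Spark History Server UI.
```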
> >
> > Please refer SPARK-29322 for more details.
> >
> > As the issue only occurs with 'zstd', the simplest approach is dropping
> > support of 'zstd' for the event log. A more general approach would be
> > introducing a timeout on reading the event log file, but it would need to
> > differentiate a stuck thread from a thread busy reading a huge event log file.
> >
> > Which approach would be preferred in Spark community, or would someone
> propose better ideas for handling this?
> >
> > Thanks,
> > Jungtaek Lim (HeartSaVioR)
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [DISCUSS] Preferred approach on dealing with SPARK-29322

2019-10-02 Thread Dongjoon Hyun
Thank you for the investigation and making a fix.

So, both issues are only on the master (3.0.0) branch?

Bests,
Dongjoon.


On Wed, Oct 2, 2019 at 00:06 Jungtaek Lim 
wrote:

> FYI: patch submitted - https://github.com/apache/spark/pull/25996
>
> On Wed, Oct 2, 2019 at 3:25 PM Jungtaek Lim 
> wrote:
>
>> I need to do a full manual test to make sure, but according to an experiment
>> (a small UT), "closeFrameOnFlush" seems to work.
>>
>> There was a relevant change on the master branch, SPARK-26283 [1], which
>> changed the way the zstd event log file is read to "continuous", and that
>> seems to read an open frame. With "closeFrameOnFlush" being false for
>> ZstdOutputStream, the frame is never closed (even when flushing the output
>> stream) unless the output stream is closed.
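(Editor's note: a minimal Scala sketch of the idea described above. The
ZstdOutputStream constructor taking a closeFrameOnFlush flag is assumed from
the thread's description; check the zstd-jni version in use for the exact
signature.)

```
import java.io.{BufferedOutputStream, FileOutputStream}
import com.github.luben.zstd.ZstdOutputStream

// With closeFrameOnFlush = true, flush() finishes the current zstd frame, so a
// reader of the still-open (.inprogress) event log sees only complete frames
// instead of blocking on an open one.
val out = new ZstdOutputStream(
  new BufferedOutputStream(new FileOutputStream("eventlog.zstd.inprogress")),
  3,    // compression level
  true) // closeFrameOnFlush
out.write("""{"Event":"SparkListenerLogStart"}""".getBytes("UTF-8"))
out.flush() // the frame is closed here; a concurrent reader can decode what was written
out.close()
```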
>>
>> I'll raise a patch once the manual test passes. Sorry for the false alarm.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>> 1. https://issues.apache.org/jira/browse/SPARK-26283
>>
>> On Wed, Oct 2, 2019 at 2:33 PM Jungtaek Lim 
>> wrote:
>>
>>> The change log for zstd v1.4.3 makes me think that the changes aren't
>>> related.
>>>
>>> https://github.com/facebook/zstd/blob/dev/CHANGELOG#L1-L5
>>>
>>> v1.4.3
>>> bug: Fix Dictionary Compression Ratio Regression by @cyan4973 (#1709)
>>> bug: Fix Buffer Overflow in v0.3 Decompression by @felixhandte (#1722)
>>> build: Add support for IAR C/C++ Compiler for Arm by @joseph0918 (#1705)
>>> misc: Add NULL pointer check in util.c by @leeyoung624 (#1706)
>>>
>>> But it's only a matter of a dependency update and rebuild, so I'll try
>>> it out.
>>>
>>> Before that, I just noticed that ZstdOutputStream has a parameter,
>>> "closeFrameOnFlush", which seems to deal with flushing. We leave it at the
>>> default value, which is "false". Let me set the value to "true" and see if it
>>> helps. Please let me know if someone knows why we picked false (or left it at
>>> the default).
>>>
>>>
>>> On Wed, Oct 2, 2019 at 1:48 PM Dongjoon Hyun 
>>> wrote:
>>>
>>>> Thank you for reporting, Jungtaek.
>>>>
>>>> Can we try to upgrade it to the newer version first?
>>>>
>>>> Since we are at 1.4.2, the newer version is 1.4.3.
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>>
>>>>
>>>> On Tue, Oct 1, 2019 at 9:18 PM Mridul Muralidharan 
>>>> wrote:
>>>>
>>>>> Makes more sense to drop support for zstd assuming the fix is not
>>>>> something at spark end (configuration, etc).
>>>>> Does not make sense to try to detect deadlock in codec.
>>>>>
>>>>> Regards,
>>>>> Mridul
>>>>>
>>>>> On Tue, Oct 1, 2019 at 8:39 PM Jungtaek Lim
>>>>>  wrote:
>>>>> >
>>>>> > Hi devs,
>>>>> >
>>>>> > I've discovered an issue with event logger, specifically reading
>>>>> incomplete event log file which is compressed with 'zstd' - the reader
>>>>> thread got stuck on reading that file.
>>>>> >
>>>>> > This is very easy to reproduce: set the configuration as below
>>>>> >
>>>>> > - spark.eventLog.enabled=true
>>>>> > - spark.eventLog.compress=true
>>>>> > - spark.eventLog.compression.codec=zstd
>>>>> >
>>>>> > and start a Spark application. While the application is running, load
>>>>> > the application in the SHS web page. It may succeed in replaying the
>>>>> > event log, but most likely it will get stuck and the loading page will
>>>>> > be stuck as well.
>>>>> >
>>>>> > Please refer SPARK-29322 for more details.
>>>>> >
>>>>> > As the issue only occurs with 'zstd', the simplest approach is dropping
>>>>> > support of 'zstd' for the event log. A more general approach would be
>>>>> > introducing a timeout on reading the event log file, but it would need
>>>>> > to differentiate a stuck thread from a thread busy reading a huge event
>>>>> > log file.
>>>>> >
>>>>> > Which approach would be preferred in Spark community, or would
>>>>> someone propose better ideas for handling this?
>>>>> >
>>>>> > Thanks,
>>>>> > Jungtaek Lim (HeartSaVioR)
>>>>>
>>>>> -
>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>
>>>>>


Re: [DISCUSS] Spark 2.5 release

2019-09-23 Thread Dongjoon Hyun
Hi, Ryan.

This thread has many replies, as you can see. That is evidence that the
community is very interested in your suggestion.

> I'm offering to help build a stable release without breaking changes. But
if there is no community interest in it, I'm happy to drop this.

In this thread, the root cause of the disagreement is the lack of
supporting evidence for your claims.

1. Is DSv2 stable in `master`?
2. If so, what subset of DSv2 patches is Ryan suggesting backporting?
3. How differently would those backported DSv2 patches look in
`branch-2.4`?
4. What does he mean by `without breaking changes`? Is it technically
feasible?
Apache Spark 2.4.x and 2.5.x DSv2 should be compatible. (Not between
2.5.x DSv2 and 3.0.0 DSv2.)
5. How long would it take? Is it possible before 3.0.0-preview? Who will
work on that backporting?
6. Is this meaningful if 2.5 and 3.1 diverge again too soon (in
summer 2020)?

We are SW engineers.
If you have a working branch, please share it with us.
It will help us understand your suggestion and this discussion.
We can help you verify that branch achieves your goal.
The branch is tested already, isn't it?

Bests,
Dongjoon.




On Mon, Sep 23, 2019 at 10:44 AM Holden Karau  wrote:

> I would personally love to see us provide a gentle migration path to Spark
> 3 especially if much of the work is already going to happen anyways.
>
> Maybe giving it a different name (eg something like
> Spark-2-to-3-transitional) would make it more clear about its intended
> purpose and encourage folks to move to 3 when they can?
>
> On Mon, Sep 23, 2019 at 9:17 AM Ryan Blue 
> wrote:
>
>> My understanding is that 3.0-preview is not going to be a
>> production-ready release. For those of us that have been using backports of
>> DSv2 in production, that doesn't help.
>>
>> It also doesn't help as a stepping stone because users would need to
>> handle all of the incompatible changes in 3.0. Using 3.0-preview would be
>> an unstable release with breaking changes instead of a stable release
>> without the breaking changes.
>>
>> I'm offering to help build a stable release without breaking changes. But
>> if there is no community interest in it, I'm happy to drop this.
>>
>> On Sun, Sep 22, 2019 at 6:39 PM Hyukjin Kwon  wrote:
>>
>>> +1 for Matei's as well.
>>>
>>> On Sun, 22 Sep 2019, 14:59 Marco Gaido,  wrote:
>>>
>>>> I agree with Matei too.
>>>>
>>>> Thanks,
>>>> Marco
>>>>
>>>> On Sun, Sep 22, 2019 at 03:44, Dongjoon Hyun <
>>>> dongjoon.h...@gmail.com> wrote:
>>>>
>>>>> +1 for Matei's suggestion!
>>>>>
>>>>> Bests,
>>>>> Dongjoon.
>>>>>
>>>>> On Sat, Sep 21, 2019 at 5:44 PM Matei Zaharia 
>>>>> wrote:
>>>>>
>>>>>> If the goal is to get people to try the DSv2 API and build DSv2 data
>>>>>> sources, can we recommend the 3.0-preview release for this? That would 
>>>>>> get
>>>>>> people shifting to 3.0 faster, which is probably better overall compared 
>>>>>> to
>>>>>> maintaining two major versions. There’s not that much else changing in 
>>>>>> 3.0
>>>>>> if you already want to update your Java version.
>>>>>>
>>>>>> On Sep 21, 2019, at 2:45 PM, Ryan Blue 
>>>>>> wrote:
>>>>>>
>>>>>> > If you insist we shouldn't change the unstable temporary API in 3.x
>>>>>> . . .
>>>>>>
>>>>>> Not what I'm saying at all. I said we should carefully
>>>>>> consider whether a breaking change is the right decision in the 3.x line.
>>>>>>
>>>>>> All I'm suggesting is that we can make a 2.5 release with the feature
>>>>>> and an API that is the same as the one in 3.0.
>>>>>>
>>>>>> > I also don't get this backporting a giant feature to 2.x line
>>>>>>
>>>>>> I am planning to do this so we can use DSv2 before 3.0 is released.
>>>>>> Then we can have a source implementation that works in both 2.x and 3.0 
>>>>>> to
>>>>>> make the transition easier. Since I'm already doing the work, I'm 
>>>>>> offering
>>>>>> to share it with the community.
>>>>>>
>>>>>>
>>>>>> On Sat, Sep 21, 2019 at 2:36 PM Reynold Xin 
>>>>>> wrote:

Re: Release Apache Spark 2.4.4 before 3.0.0

2019-07-09 Thread Dongjoon Hyun
Thank you for the reply, Sean. Sure, 2.4.x should be an LTS version.

The main reason for a 2.4.4 release (before 3.0.0) is to have a better basis
for comparison with 3.0.0.
For example, SPARK-27798 had an old bug, but its correctness issue is only
exposed in Spark 2.4.3.
It would be great if we could have a better basis.

Bests,
Dongjoon.


On Tue, Jul 9, 2019 at 9:52 AM Sean Owen  wrote:

> We will certainly want a 2.4.4 release eventually. In fact I'd expect
> 2.4.x gets maintained for longer than the usual 18 months, as it's the
> last 2.x branch.
> It doesn't need to happen before 3.0, but could. Usually maintenance
> releases happen 3-4 months apart and the last one was 2 months ago. If
> these are significant issues, sure. It'll probably be August before
> it's out anyway.
>
> On Tue, Jul 9, 2019 at 11:15 AM Dongjoon Hyun 
> wrote:
> >
> > Hi, All.
> >
> > Spark 2.4.3 was released two months ago (8th May).
> >
> > As of today (9th July), there exist 45 fixes in `branch-2.4` including
> the following correctness or blocker issues.
> >
> > - SPARK-26038 Decimal toScalaBigInt/toJavaBigInteger not work for
> decimals not fitting in long
> > - SPARK-26045 Error in the spark 2.4 release package with the
> spark-avro_2.11 dependency
> > - SPARK-27798 from_avro can modify variables in other rows in local
> mode
> > - SPARK-27907 HiveUDAF should return NULL in case of 0 rows
> > - SPARK-28157 Make SHS clear KVStore LogInfo for the blacklist
> entries
> > - SPARK-28308 CalendarInterval sub-second part should be padded
> before parsing
> >
> > It would be great if we could have Spark 2.4.4 before we get busier
> > with 3.0.0.
> > If it's okay, I'd like to volunteer as the 2.4.4 release manager and roll
> > it next Monday (15th July).
> > What do you think about this?
> >
> > Bests,
> > Dongjoon.
>


Re: [VOTE] SPARK 3.0.0-preview (RC2)

2019-11-04 Thread Dongjoon Hyun
Hi, Xingbo.

Could you send a vote result email to finalize this vote, please?

Bests,
Dongjoon.

On Fri, Nov 1, 2019 at 2:55 PM Takeshi Yamamuro 
wrote:

> +1, too.
>
> On Sat, Nov 2, 2019 at 3:36 AM Hyukjin Kwon  wrote:
>
>> +1
>>
>> On Fri, 1 Nov 2019, 15:36 Wenchen Fan,  wrote:
>>
>>> The PR builder uses Hadoop 2.7 profile, which makes me think that 2.7 is
>>> more stable and we should make releases using 2.7 by default.
>>>
>>> +1
>>>
>>> On Fri, Nov 1, 2019 at 7:16 AM Xiao Li  wrote:
>>>
 Spark 3.0 will still use the Hadoop 2.7 profile by default, I think.
 Hadoop 2.7 profile is much more stable than Hadoop 3.2 profile.

 On Thu, Oct 31, 2019 at 3:54 PM Sean Owen  wrote:

> This isn't a big thing, but I see that the pyspark build includes
> Hadoop 2.7 rather than 3.2. Maybe later we change the build to put in
> 3.2 by default.
>
> Otherwise, the tests all seems to pass with JDK 8 / 11 with all
> profiles enabled, so I'm +1 on it.
>
>
> On Thu, Oct 31, 2019 at 1:00 AM Xingbo Jiang 
> wrote:
> >
> > Please vote on releasing the following candidate as Apache Spark
> version 3.0.0-preview.
> >
> > The vote is open until November 3 PST and passes if a majority +1
> PMC votes are cast, with
> > a minimum of 3 +1 votes.
> >
> > [ ] +1 Release this package as Apache Spark 3.0.0-preview
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see
> http://spark.apache.org/
> >
> > The tag to be voted on is v3.0.0-preview-rc2 (commit
> 007c873ae34f58651481ccba30e8e2ba38a692c4):
> > https://github.com/apache/spark/tree/v3.0.0-preview-rc2
> >
> > The release files, including signatures, digests, etc. can be found
> at:
> > https://dist.apache.org/repos/dist/dev/spark/v3.0.0-preview-rc2-bin/
> >
> > Signatures used for Spark RCs can be found in this file:
> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >
> > The staging repository for this release can be found at:
> >
> https://repository.apache.org/content/repositories/orgapachespark-1336/
> >
> > The documentation corresponding to this release can be found at:
> >
> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-preview-rc2-docs/
> >
> > The list of bug fixes going into 3.0.0 can be found at the following
> URL:
> > https://issues.apache.org/jira/projects/SPARK/versions/12339177
> >
> > FAQ
> >
> > =
> > How can I help test this release?
> > =
> >
> > If you are a Spark user, you can help us test this release by taking
> > an existing Spark workload and running on this release candidate,
> then
> > reporting any regressions.
> >
> > If you're working in PySpark you can set up a virtual env and install
> > the current RC and see if anything important breaks, in the
> Java/Scala
> > you can add the staging repository to your projects resolvers and
> test
> > with the RC (make sure to clean up the artifact cache before/after so
> > you don't end up building with an out of date RC going forward).
> >
> > ===
> > What should happen to JIRA tickets still targeting 3.0.0?
> > ===
> >
> > The current list of open tickets targeted at 3.0.0 can be found at:
> > https://issues.apache.org/jira/projects/SPARK and search for
> "Target Version/s" = 3.0.0
> >
> > Committers should look at those and triage. Extremely important bug
> > fixes, documentation, and API tweaks that impact compatibility should
> > be worked on immediately.
> >
> > ==
> > But my bug isn't fixed?
> > ==
> >
> > In order to make timely releases, we will typically not hold the
> > release unless the bug in question is a regression from the previous
> > release. That being said, if there is something which is a regression
> > that has not been correctly targeted please ping me or a committer to
> > help target the issue.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

 --

>>>
>
> --
> ---
> Takeshi Yamamuro
>


Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-01 Thread Dongjoon Hyun
Hi, Xiao.

How can JDK11 support make the `Hadoop-3.2` profile risky? We build and publish
with JDK8.

> In this release, Hive execution module upgrade [from 1.2 to 2.3], Hive
thrift-server upgrade, and JDK11 supports are added to Hadoop 3.2 profile
only.

Since we build and publish with JDK8 and the default runtime is still JDK8,
I don't think `hadoop-3.2 profile` is risky in that context.

For JDK11, the Hive execution module (2.3.6) still doesn't support JDK11 in
terms of a remote HiveMetastore.

So, among the above reasons, we can say that Hive execution module (with
Hive 2.3.6) can be the root cause of potential unknown issues.

In other words, `Hive 1.2.1` is the one you think is stable, isn't it?

Although Hive 2.3.6 might not be officially proven in Apache Spark, we also
resolved several SPARK issues by upgrading Hive from 1.2.1 to 2.3.6.

Bests,
Dongjoon.



On Fri, Nov 1, 2019 at 5:37 PM Jiaxin Shan  wrote:

> +1 for Hadoop 3.2.  Seems lots of cloud integration efforts Steve made is
> only available in 3.2. We see lots of users asking for better S3A support
> in Spark.
>
> On Fri, Nov 1, 2019 at 9:46 AM Xiao Li  wrote:
>
>> Hi, Steve,
>>
>> Thanks for your comments! My major quality concern is not against Hadoop
>> 3.2. In this release, Hive execution module upgrade [from 1.2 to 2.3], Hive
>> thrift-server upgrade, and JDK11 supports are added to Hadoop 3.2 profile
>> only. Compared with Hadoop 2.x profile, the Hadoop 3.2 profile is more
>> risky due to these changes.
>>
>> To speed up the adoption of Spark 3.0, which has many other highly
>> desirable features, I am proposing to keep Hadoop 2.x profile as the
>> default.
>>
>> Cheers,
>>
>> Xiao.
>>
>>
>>
>> On Fri, Nov 1, 2019 at 5:33 AM Steve Loughran 
>> wrote:
>>
>>> What is the current default value? as the 2.x releases are becoming EOL;
>>> 2.7 is dead, there might be a 2.8.x; for now 2.9 is the branch-2 release
>>> getting attention. 2.10.0 shipped yesterday, but the ".0" means there will
>>> inevitably be surprises.
>>>
>>> One issue about using older versions is that any problem reported -
>>> especially stack traces you can blame me for - will generally be met by
>>> a response of "does it go away when you upgrade?" The other issue is how
>>> much test coverage things are getting.
>>>
>>> w.r.t Hadoop 3.2 stability, nothing major has been reported. The ABFS
>>> client is there, and the big guava update (HADOOP-16213) went in. People
>>> will either love or hate that.
>>>
>>> No major changes in s3a code between 3.2.0 and 3.2.1; I have a large
>>> backport planned though, including changes to better handle AWS caching of
>>> 404s generated from HEAD requests before an object was actually created.
>>>
>>> It would be really good if the spark distributions shipped with later
>>> versions of the hadoop artifacts.
>>>
>>> On Mon, Oct 28, 2019 at 7:53 PM Xiao Li  wrote:
>>>
>>>> The stability and quality of Hadoop 3.2 profile are unknown. The
>>>> changes are massive, including Hive execution and a new version of Hive
>>>> thriftserver.
>>>>
>>>> To reduce the risk, I would like to keep the current default version
>>>> unchanged. When it becomes stable, we can change the default profile to
>>>> Hadoop-3.2.
>>>>
>>>> Cheers,
>>>>
>>>> Xiao
>>>>
>>>> On Mon, Oct 28, 2019 at 12:51 PM Sean Owen  wrote:
>>>>
>>>>> I'm OK with that, but don't have a strong opinion nor info about the
>>>>> implications.
>>>>> That said my guess is we're close to the point where we don't need to
>>>>> support Hadoop 2.x anyway, so, yeah.
>>>>>
>>>>> On Mon, Oct 28, 2019 at 2:33 PM Dongjoon Hyun 
>>>>> wrote:
>>>>> >
>>>>> > Hi, All.
>>>>> >
>>>>> > There was a discussion on publishing artifacts built with Hadoop 3 .
>>>>> > But, we are still publishing with Hadoop 2.7.3 and `3.0-preview`
>>>>> will be the same because we didn't change anything yet.
>>>>> >
>>>>> > Technically, we need to change two places for publishing.
>>>>> >
>>>>> > 1. Jenkins Snapshot Publishing
>>>>> >
>>>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/
>>>>> >
>>>>> > 2

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-02 Thread Dongjoon Hyun
Hi, Koert.

Could you be more specific about your Hadoop version requirement?

Although we will have a Hadoop 2.7 profile, Hadoop 2.6 and older support has
already been officially dropped in Apache Spark 3.0.0. We cannot give you an
answer for Hadoop 2.6 and older clusters because we are not testing them
at all.

Also, Steve already pointed out that Hadoop 2.7 is EOL as well. According to
his advice, we might need to upgrade our Hadoop 2.7 profile to the latest
2.x. I'm wondering whether you are against that because of Hadoop 2.6 or
older version support.

BTW, I'm one of the users of Hadoop 3.x clusters. They are already in use and
we are migrating more. Apache Spark 3.0 will arrive in 2020 (not today). We
need to consider that, too. Do you have any migration plan for 2020?

In short, for clusters using Hadoop 2.6 and older versions, Apache
Spark 2.4 is supported as an LTS version. You can get the bug fixes. For
Hadoop 2.7, Apache Spark 3.0 will have the profile and the binary release.
Making the Hadoop 3.2 profile the default is irrelevant to that.
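(For reference, a rough sketch of how the two combinations would be built; the
profile names follow the current master build, and the full set of accompanying
flags is an assumption.)

```
# Hadoop 2.7 binary (the proposed non-default profile)
./build/mvn -Phadoop-2.7 -Pyarn -Phive -Phive-thriftserver -DskipTests clean package

# Hadoop 3.2 binary (the proposed default)
./build/mvn -Phadoop-3.2 -Pyarn -Phive -Phive-thriftserver -DskipTests clean package
```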

Bests,
Dongjoon.


On Sat, Nov 2, 2019 at 09:35 Koert Kuipers  wrote:

> i dont see how we can be close to the point where we dont need to support
> hadoop 2.x. this does not agree with the reality from my perspective, which
> is that all our clients are on hadoop 2.x. not a single one is on hadoop
> 3.x currently. this includes deployments of cloudera distros, hortonworks
> distros, and cloud distros like emr and dataproc.
>
> forcing us to be on older spark versions would be unfortunate for us, and
> also bad for the community (as deployments like ours help find bugs in
> spark).
>
> On Mon, Oct 28, 2019 at 3:51 PM Sean Owen  wrote:
>
>> I'm OK with that, but don't have a strong opinion nor info about the
>> implications.
>> That said my guess is we're close to the point where we don't need to
>> support Hadoop 2.x anyway, so, yeah.
>>
>> On Mon, Oct 28, 2019 at 2:33 PM Dongjoon Hyun 
>> wrote:
>> >
>> > Hi, All.
>> >
>> > There was a discussion on publishing artifacts built with Hadoop 3 .
>> > But, we are still publishing with Hadoop 2.7.3 and `3.0-preview` will
>> be the same because we didn't change anything yet.
>> >
>> > Technically, we need to change two places for publishing.
>> >
>> > 1. Jenkins Snapshot Publishing
>> >
>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/
>> >
>> > 2. Release Snapshot/Release Publishing
>> >
>> https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh
>> >
>> > To minimize the change, we need to switch our default Hadoop profile.
>> >
>> > Currently, the default is `hadoop-2.7 (2.7.4)` profile and `hadoop-3.2
>> (3.2.0)` is optional.
>> > We had better use `hadoop-3.2` profile by default and `hadoop-2.7`
>> optionally.
>> >
>> > Note that this means we use Hive 2.3.6 by default. Only `hadoop-2.7`
>> distribution will use `Hive 1.2.1` like Apache Spark 2.4.x.
>> >
>> > Bests,
>> > Dongjoon.
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Removing `CRAN incoming feasibility` check from the main build

2019-11-02 Thread Dongjoon Hyun
Hi, All.

CRAN instability seems to be a blocker for our dev process.
The following simple check causes consecutive failures in 4 of 9 Jenkins
jobs + PR builder.

- spark-branch-2.4-test-sbt-hadoop-2.6
- spark-branch-2.4-test-sbt-hadoop-2.7
- spark-master-test-sbt-hadoop-2.7
- spark-master-test-sbt-hadoop-3.2
- PRBuilder

```
* checking CRAN incoming feasibility ...Error in
.check_package_CRAN_incoming(pkgdir) :
  dims [product 24] do not match the length of object [0]
```

Since this happens frequently and it's out of our control,
I'd like to suggest removing this R check from the above main Jenkins jobs.
Instead, like the `linter` or `snapshot publish` jobs, we can have independent
Jenkins jobs for the R CRAN incoming feasibility check.
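(Editor's note: a sketch of how the flaky step itself can be switched off when
a job drives R CMD check directly; whether Spark's R/check-cran.sh exposes this
knob is an assumption to verify.)

```
# R's check environment variable disables only the incoming-feasibility step
# (the part that contacts CRAN), leaving the rest of the --as-cran checks on.
_R_CHECK_CRAN_INCOMING_=FALSE R CMD check --as-cran SparkR_*.tar.gz
```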

What do you think about that?

Bests,
Dongjoon.


Re: Removing `CRAN incoming feasibility` check from the main build

2019-11-02 Thread Dongjoon Hyun
Hi, All.

I made a PR to recover the PR Builder and the above Jenkins jobs since this
has been blocking us for a day.

https://github.com/apache/spark/pull/26375

There is a discussion about how to proceed after recovering from this. We will
restore our `check-cran` test coverage as a follow-up.

Bests,
Dongjoon.

On Sat, Nov 2, 2019 at 7:10 PM Dongjoon Hyun 
wrote:

> Hi, All.
>
> CRAN instability seems to be a blocker for our dev process.
> The following simple check causes consecutive failures in 4 of 9 Jenkins
> jobs + PR builder.
>
> - spark-branch-2.4-test-sbt-hadoop-2.6
> - spark-branch-2.4-test-sbt-hadoop-2.7
> - spark-master-test-sbt-hadoop-2.7
> - spark-master-test-sbt-hadoop-3.2
> - PRBuilder
>
> ```
> * checking CRAN incoming feasibility ...Error in
> .check_package_CRAN_incoming(pkgdir) :
>   dims [product 24] do not match the length of object [0]
> ```
>
> Since this happens frequently and it's out of our control,
> I'd like to suggest removing this R check from the above main Jenkins jobs.
> Instead, like the `linter` or `snapshot publish` jobs, we can have independent
> Jenkins jobs for the R CRAN incoming feasibility check.
>
> What do you think about that?
>
> Bests,
> Dongjoon.
>


Re: [VOTE] SPARK 3.0.0-preview (RC1)

2019-10-30 Thread Dongjoon Hyun
Hi, Xingbo.

Currently, the RC2 tag is pointing at the RC1 tag.

https://github.com/apache/spark/tree/v3.0.0-preview-rc2

Could you cut it from the HEAD of the master branch?
Otherwise, nobody knows what release script you used for RC2.
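(For reference, re-pointing the tag at the current master HEAD is
straightforward; the release scripts normally create the tag, so the commands
below are only an illustration.)

```
git tag -d v3.0.0-preview-rc2
git push origin :refs/tags/v3.0.0-preview-rc2   # drop the mis-pointed tag
git checkout master && git pull
git tag v3.0.0-preview-rc2                      # or let the release script tag the new commit
git push origin v3.0.0-preview-rc2
```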

Bests,
Dongjoon.



On Wed, Oct 30, 2019 at 4:15 PM Xingbo Jiang  wrote:

> Hi all,
>
> This RC fails because:
> It fails to generate a PySpark release.
>
> I'll start RC2 soon.
>
> Thanks!
>
> Xingbo
>
>
> On Wed, Oct 30, 2019 at 4:10 PM Xingbo Jiang 
> wrote:
>
>> Thanks Sean, since we need to generate a PySpark release with a different
>> name, I would prefer to fail RC1 and start another release candidate.
>>
>>> On Wed, Oct 30, 2019 at 4:00 PM, Sean Owen wrote:
>>
>>> I agree that we need a Pyspark release for this preview release. If
>>> it's a matter of producing it from the same tag, we can evaluate it
>>> within this same release candidate. Otherwise, just roll another
>>> release candidate.
>>>
>>> I was able to build it and pass all tests with JDK 8 and JDK 11
>>> (hadoop-3.2 profile, note) on Ubuntu, so this is otherwise looking
>>> good to me.
>>>
>>> On Tue, Oct 29, 2019 at 9:01 PM Xingbo Jiang 
>>> wrote:
>>> >
>>> > Please vote on releasing the following candidate as Apache Spark
>>> version 3.0.0-preview.
>>> >
>>> > The vote is open until November 2 PST and passes if a majority +1 PMC
>>> votes are cast, with
>>> > a minimum of 3 +1 votes.
>>> >
>>> > [ ] +1 Release this package as Apache Spark 3.0.0-preview
>>> > [ ] -1 Do not release this package because ...
>>> >
>>> > To learn more about Apache Spark, please see http://spark.apache.org/
>>> >
>>> > The tag to be voted on is v3.0.0-preview-rc1 (commit
>>> 5eddbb5f1d9789696927f435c55df887e50a1389):
>>> > https://github.com/apache/spark/tree/v3.0.0-preview-rc1
>>> >
>>> > The release files, including signatures, digests, etc. can be found at:
>>> > https://dist.apache.org/repos/dist/dev/spark/v3.0.0-preview-rc1-bin/
>>> >
>>> > Signatures used for Spark RCs can be found in this file:
>>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>>> >
>>> > The staging repository for this release can be found at:
>>> >
>>> https://repository.apache.org/content/repositories/orgapachespark-1334/
>>> >
>>> > The documentation corresponding to this release can be found at:
>>> > https://dist.apache.org/repos/dist/dev/spark/v3.0.0-preview-rc1-docs/
>>> >
>>> > The list of bug fixes going into 3.0.0 can be found at the following
>>> URL:
>>> > https://issues.apache.org/jira/projects/SPARK/versions/12339177
>>> >
>>> > FAQ
>>> >
>>> > =
>>> > How can I help test this release?
>>> > =
>>> >
>>> > If you are a Spark user, you can help us test this release by taking
>>> > an existing Spark workload and running on this release candidate, then
>>> > reporting any regressions.
>>> >
>>> > If you're working in PySpark you can set up a virtual env and install
>>> > the current RC and see if anything important breaks, in the Java/Scala
>>> > you can add the staging repository to your projects resolvers and test
>>> > with the RC (make sure to clean up the artifact cache before/after so
>>> > you don't end up building with an out of date RC going forward).
>>> >
>>> > ===
>>> > What should happen to JIRA tickets still targeting 3.0.0?
>>> > ===
>>> >
>>> > The current list of open tickets targeted at 3.0.0 can be found at:
>>> > https://issues.apache.org/jira/projects/SPARK and search for "Target
>>> Version/s" = 3.0.0
>>> >
>>> > Committers should look at those and triage. Extremely important bug
>>> > fixes, documentation, and API tweaks that impact compatibility should
>>> > be worked on immediately. Everything else please retarget to an
>>> > appropriate release.
>>> >
>>> > ==
>>> > But my bug isn't fixed?
>>> > ==
>>> >
>>> > In order to make timely releases, we will typically not hold the
>>> > release unless the bug in question is a regression from the previous
>>> > release. That being said, if there is something which is a regression
>>> > that has not been correctly targeted please ping me or a committer to
>>> > help target the issue.
>>>
>>


Re: [DISCUSS] Deprecate Python < 3.6 in Spark 3.0

2019-10-30 Thread Dongjoon Hyun
Thank you all. I made a PR for that.

https://github.com/apache/spark/pull/26326

On Tue, Oct 29, 2019 at 5:45 AM Takeshi Yamamuro 
wrote:

> +1, too.
>
> On Tue, Oct 29, 2019 at 4:16 PM Holden Karau  wrote:
>
>> +1 to deprecating but not yet removing support for 3.6
>>
>> On Tue, Oct 29, 2019 at 3:47 AM Shane Knapp  wrote:
>>
>>> +1 to testing the absolute minimum number of python variants as
>>> possible.  ;)
>>>
>>> On Mon, Oct 28, 2019 at 7:46 PM Hyukjin Kwon 
>>> wrote:
>>>
>>>> +1 from me as well.
>>>>
>>>> On Tue, Oct 29, 2019 at 5:34 AM, Xiangrui Meng wrote:
>>>>
>>>>> +1. And we should start testing 3.7 and maybe 3.8 in Jenkins.
>>>>>
>>>>> On Thu, Oct 24, 2019 at 9:34 AM Dongjoon Hyun 
>>>>> wrote:
>>>>>
>>>>>> Thank you for starting the thread.
>>>>>>
>>>>>> In addition to that, we currently are testing Python 3.6 only in
>>>>>> Apache Spark Jenkins environment.
>>>>>>
>>>>>> Given that Python 3.8 is already out and Apache Spark 3.0.0 RC1 will
>>>>>> start next January
>>>>>> (https://spark.apache.org/versioning-policy.html), I'm +1 for the
>>>>>> deprecation (Python < 3.6) at Apache Spark 3.0.0.
>>>>>>
>>>>>> It's just a deprecation to prepare the next-step development cycle.
>>>>>>
>>>>>> Bests,
>>>>>> Dongjoon.
>>>>>>
>>>>>>
>>>>>> On Thu, Oct 24, 2019 at 1:10 AM Maciej Szymkiewicz <
>>>>>> mszymkiew...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> While deprecation of Python 2 in 3.0.0 has been announced
>>>>>>> <https://spark.apache.org/news/plan-for-dropping-python-2-support.html>,
>>>>>>> there is no clear statement about specific continuing support of 
>>>>>>> different
>>>>>>> Python 3 version.
>>>>>>>
>>>>>>> Specifically:
>>>>>>>
>>>>>>>- Python 3.4 has been retired this year.
>>>>>>>- Python 3.5 is already in the "security fixes only" mode and
>>>>>>>should be retired in the middle of 2020.
>>>>>>>
>>>>>>> Continued support of these two blocks adoption of many new Python
>>>>>>> features (PEP 468)  and it is hard to justify beyond 2020.
>>>>>>>
>>>>>>> Should these two be deprecated in 3.0.0 as well?
>>>>>>>
>>>>>>> --
>>>>>>> Best regards,
>>>>>>> Maciej
>>>>>>>
>>>>>>>
>>>
>>> --
>>> Shane Knapp
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>
>
> --
> ---
> Takeshi Yamamuro
>


Re: [apache/spark] [SPARK-29674][CORE] Update dropwizard metrics to 4.1.x for JDK 9+ (#26332)

2019-10-30 Thread Dongjoon Hyun
The Ganglia module has only 2 files.
In addition to dropping it, we may choose one of the following two ways to
keep supporting it partially,
like `kafka-0.8`, which Apache Spark supports only with Scala 2.11.

   1. We can stick to `dropwizard 3.x` for JDK8 (by default) and use
`dropwizard 4.x` for the `hadoop-3.2` profile only (see the sketch below, after
the file listing).
   2. If we upgrade to `dropwizard 4.x` completely, we can make the
Ganglia module an external package (with dropwizard 3.x) for Apache
Spark 3.0 on JDK8.

$ tree .
.
├── pom.xml
└── src
    └── main
        └── scala
            └── org
                └── apache
                    └── spark
                        └── metrics
                            └── sink
                                └── GangliaSink.scala

---------------------------------------------------------------------
Language          files        blank        comment          code
---------------------------------------------------------------------
Scala                 1           20             17            59
Maven                 1            4             17            27
---------------------------------------------------------------------
SUM:                  2           24             34            86
---------------------------------------------------------------------
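A rough sketch of option 1 above, expressed as a Maven profile override. The
property name `codahale.metrics.version` is an assumption about Spark's
pom.xml, not a verified reference, and the version numbers are illustrative.

```
<!-- Default (JDK8): keep dropwizard/metrics 3.x -->
<properties>
  <codahale.metrics.version>3.2.6</codahale.metrics.version>
</properties>

<profiles>
  <!-- Only the hadoop-3.2 profile (the JDK11 path) moves to metrics 4.x -->
  <profile>
    <id>hadoop-3.2</id>
    <properties>
      <codahale.metrics.version>4.1.1</codahale.metrics.version>
    </properties>
  </profile>
</profiles>
```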

Bests,
Dongjoon.


On Wed, Oct 30, 2019 at 6:18 PM Sean Owen  wrote:

> I wanted to raise this to dev@.
>
> So, updating dropwizard metrics from 3.2.x to 4.x might be important for
> JDK 11 support. Our tests pass as-is without this update. But we don't test
> some elements of this metrics support, like Ganglia integration. And I have
> heard reports that downstream custom usages of dropwizard 3.2.x doesn't
> work on JDK 11.
>
> The bad news is that the Ganglia integration doesn't exist anymore in 4.x.
> And we have a whole custom module for that integration with Spark.
>
> My question is: how much do we need to keep Ganglia integration in Spark
> 3.x? I think it does have some users. We can keep it as is and hope it
> works out in JDK 11, or consider dropping this module.
>
>
> -- Forwarded message -
> From: Apache Spark QA 
> Date: Wed, Oct 30, 2019 at 6:56 PM
> Subject: Re: [apache/spark] [SPARK-29674][CORE] Update dropwizard metrics
> to 4.1.x for JDK 9+ (#26332)
> To: apache/spark 
> Cc: Sean Owen , Assign 
>
>
> Test build #112974 has started for PR 26332 at commit aefde48.
>
>


Re: Adding JIRA ID as the prefix for the test case name

2019-11-12 Thread Dongjoon Hyun
Thank you for the suggestion, Hyukjin.

Previously, we added Jira IDs for the bug fix PR test cases as Gabor said.

For the new features (and improvements), we didn't add them

because all test cases in the newly added test suite share the same prefix
JIRA ID in that case.

It might look redundant.

However, I'm +1 for Hyukjin's original suggestion because we had better
have an official rule for this in some form.

Thank you again, Hyukjin.

Bests,
Dongjoon.
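(For illustration, the convention under discussion looks like this in a Spark
test suite; the JIRA ID and test body below are hypothetical.)

```
import org.apache.spark.SparkFunSuite

class ExampleSuite extends SparkFunSuite {
  // The "SPARK-XXXXX: ..." prefix ties a regression test back to its JIRA issue.
  test("SPARK-12345: empty partitions are skipped") {
    val partitions = Seq(Seq(1, 2), Seq.empty, Seq(3))
    assert(partitions.count(_.nonEmpty) === 2)
  }
}
```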



On Tue, Nov 12, 2019 at 1:13 AM Gabor Somogyi 
wrote:

> +1 for having that consistent rule in test names.
> +1 for making it a guideline.
> +1 defining exact guides in general.
>
> Until now I've followed the alternative (only add the prefix when the
> JIRA's type is bug) and that way I knew that such tests contain edge cases.
> In case of new features I'm pretty sure there is a reason to introduce it
> but at the moment can't imagine a use-case where it can help us (want to
> convert it to daily routine).
>
> > This is helpful when the test cases are moved to a different file.
> The test can be found by name without jira ID
>
>
> On Tue, Nov 12, 2019 at 5:31 AM Hyukjin Kwon  wrote:
>
>> In a few days, I will write this in our guidelines, probably after rewording
>> it a bit:
>>
>> 1. Add a prefix to a test name when a PR adds a couple of tests.
>> 2. Use the "SPARK-XXXXX: test name" format.
>>
>> Please let me know if you have any different opinion about what/when to
>> write the JIRA ID as the prefix.
>> I would like to make sure this simple rule is closer to the actual
>> practice from you guys.
>>
>>
>> On Tue, Nov 12, 2019 at 8:41 AM, Gengliang wrote:
>>
>>> +1 for making it a guideline. This is helpful when the test cases are
>>> moved to a different file.
>>>
>>> On Mon, Nov 11, 2019 at 3:23 PM Takeshi Yamamuro 
>>> wrote:
>>>
 +1 for having that consistent rule in test names.
 This is a trivial problem though, I think documenting this rule in the
 contribution guide
 might be able to make reviewer overhead a little smaller.

 Bests,
 Takeshi

 On Tue, Nov 12, 2019 at 1:46 AM Hyukjin Kwon 
 wrote:

> Hi all,
>
> Maybe it's not a big deal, but it has brought some confusion from time to time
> into Spark dev and the community. I think it's time to discuss when and in which
> format to add a JIRA ID as a prefix for the test case name in Scala test
> cases.
> cases.
>
> Currently we have many test case names with prefixes as below:
>
>- test("SPARK-X blah blah")
>- test("SPARK-X: blah blah")
>- test("SPARK-X - blah blah")
>- test("[SPARK-X] blah blah")
>- …
>
> It is a good practice to have the JIRA ID in general because, for
> instance,
> it takes less effort to track commit histories (or even when
> the files
> are totally moved), or to track related information about failed tests.
> Considering Spark is getting big, I think it's good to document this.
>
> I would like to suggest this and document it in our guideline:
>
> 1. Add a prefix to a test name when a PR adds a couple of tests.
> 2. Use the "SPARK-XXXXX: test name" format, which is used in our code base
> most
>   often[1].
>
> We should make it simple and clear but closer to the actual practice.
> So, I would like to listen to what other people think. I would appreciate
> if you guys give some feedback about when to add the JIRA prefix. One
> alternative is that, we only add the prefix when the JIRA's type is bug.
>
> [1]
> git grep -E 'test\("\SPARK-([0-9]+):' | wc -l
>  923
> git grep -E 'test\("\SPARK-([0-9]+) ' | wc -l
>  477
> git grep -E 'test\("\[SPARK-([0-9]+)\]' | wc -l
>   16
> git grep -E 'test\("\SPARK-([0-9]+) -' | wc -l
>   13
>
>
>
>

 --
 ---
 Takeshi Yamamuro

>>>


Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-16 Thread Dongjoon Hyun
Thank you for the suggestion.

Having a `hive-2.3` profile sounds good to me because it's orthogonal to
Hadoop 3.
IIRC, it was originally proposed that way, but we put it under
`hadoop-3.2` to avoid adding new profiles at that time.

And I'm wondering if you are considering additional pre-built distributions
and Jenkins jobs.

Bests,
Dongjoon.



On Fri, Nov 15, 2019 at 1:38 PM Cheng Lian  wrote:

> Cc Yuming, Steve, and Dongjoon
>
> On Fri, Nov 15, 2019 at 10:37 AM Cheng Lian  wrote:
>
>> Similar to Xiao, my major concern about making Hadoop 3.2 the default
>> Hadoop version is quality control. The current hadoop-3.2 profile covers
>> too many major component upgrades, i.e.:
>>
>>- Hadoop 3.2
>>- Hive 2.3
>>- JDK 11
>>
>> We have already found and fixed some feature and performance regressions
>> related to these upgrades. Empirically, I’m not surprised at all if more
>> regressions are lurking somewhere. On the other hand, we do want help from
>> the community to help us to evaluate and stabilize these new changes.
>> Following that, I’d like to propose:
>>
>>1.
>>
>>Introduce a new profile hive-2.3 to enable (hopefully) less risky
>>Hadoop/Hive/JDK version combinations.
>>
>>This new profile allows us to decouple Hive 2.3 from the hadoop-3.2
>>profile, so that users may try out some less risky Hadoop/Hive/JDK
>>combinations: if you only want Hive 2.3 and/or JDK 11, you don’t need to
>>face potential regressions introduced by the Hadoop 3.2 upgrade.
>>
>>Yuming Wang has already sent out PR #26533
>> to exercise the Hadoop
>>2.7 + Hive 2.3 + JDK 11 combination (this PR does not have the
>>hive-2.3 profile yet), and the result looks promising: the Kafka
>>streaming and Arrow related test failures should be irrelevant to the 
>> topic
>>discussed here.
>>
>>After decoupling Hive 2.3 and Hadoop 3.2, I don’t think it makes a
>>lot of difference between having Hadoop 2.7 or Hadoop 3.2 as the default
>>Hadoop version. For users who are still using Hadoop 2.x in production,
>>they will have to use a hadoop-provided prebuilt package or build
>>Spark 3.0 against their own 2.x version anyway. It does make a difference
>>for cloud users who don’t use Hadoop at all, though. And this probably 
>> also
>>helps to stabilize the Hadoop 3.2 code path faster since our PR builder
>>will exercise it regularly.
>>2.
>>
>>Defer Hadoop 2.x upgrade to Spark 3.1+
>>
>>I personally do want to bump our Hadoop 2.x version to 2.9 or even
>>2.10. Steve has already stated the benefits very well. My worry here is
>>still quality control: Spark 3.0 has already had tons of changes and major
>>component version upgrades that are subject to all kinds of known and
>>hidden regressions. Having Hadoop 2.7 there provides us a safety net, 
>> since
>>it’s proven to be stable. To me, it’s much less risky to upgrade Hadoop 
>> 2.7
>>to 2.9/2.10 after we stabilize the Hadoop 3.2/Hive 2.3 combinations in the
>>next 1 or 2 Spark 3.x releases.
>>
>> Cheng
>>
>> On Mon, Nov 4, 2019 at 11:24 AM Koert Kuipers  wrote:
>>
>>> i get that cdh and hdp backport a lot and in that way left 2.7 behind.
>>> but they kept the public apis stable at the 2.7 level, because thats kind
>>> of the point. arent those the hadoop apis spark uses?
>>>
>>> On Mon, Nov 4, 2019 at 10:07 AM Steve Loughran
>>>  wrote:
>>>


 On Mon, Nov 4, 2019 at 12:39 AM Nicholas Chammas <
 nicholas.cham...@gmail.com> wrote:

> On Fri, Nov 1, 2019 at 8:41 AM Steve Loughran
>  wrote:
>
>> It would be really good if the spark distributions shipped with later
>> versions of the hadoop artifacts.
>>
>
> I second this. If we need to keep a Hadoop 2.x profile around, why not
> make it Hadoop 2.8 or something newer?
>

 go for 2.9

>
> Koert Kuipers  wrote:
>
>> given that latest hdp 2.x is still hadoop 2.7 bumping hadoop 2
>> profile to latest would probably be an issue for us.
>
>
> When was the last time HDP 2.x bumped their minor version of Hadoop?
> Do we want to wait for them to bump to Hadoop 2.8 before we do the same?
>

 The internal builds of CDH and HDP are not those of ASF 2.7.x. A really
 large proportion of the later branch-2 patches are backported. 2,7 was left
 behind a long time ago




>>>


Re: Spark 2.4.5 release for Parquet and Avro dependency updates?

2019-11-22 Thread Dongjoon Hyun
Hi, Michael.

I'm not sure Apache Spark is in a state close to what you want.

First, both Apache Spark 3.0.0-preview and Apache Spark 2.4 are using Avro
1.8.2, and so do the `master` and `branch-2.4` branches. Cutting new releases
does not provide what you want.

Do we have a PR on the master branch? Otherwise, before starting to discuss
the releases, could you make a PR first on the master branch? For Parquet,
it's the same.

Second, we want to make Apache Spark 3.0.0 as compatible as possible.
An incompatible change could be a reason for rejection even in the `master`
branch for Apache Spark 3.0.0.

Lastly, we may consider backporting if it lands in the `master` branch for 3.0.
However, as Nan Zhu said, a dependency-upgrade backporting PR is -1 by
default. Usually, it's allowed only for serious cases like a security issue or
a production outage.

Bests,
Dongjoon.


On Fri, Nov 22, 2019 at 9:00 AM Ryan Blue  wrote:

> Just to clarify, I don't think that Parquet 1.10.1 to 1.11.0 is a
> runtime-incompatible change. The example mixed 1.11.0 and 1.10.1 in the
> same execution.
>
> Michael, please be more careful about announcing compatibility problems in
> other communities. If you've observed problems, let's find out the root
> cause first.
>
> rb
>
> On Fri, Nov 22, 2019 at 8:56 AM Michael Heuer  wrote:
>
>> Hello,
>>
>> Avro 1.8.2 to 1.9.1 is a binary incompatible update, and it appears that
>> Parquet 1.10.1 to 1.11 will be a runtime-incompatible update (see thread on
>> dev@parquet
>> 
>> ).
>>
>> Might there be any desire to cut a Spark 2.4.5 release so that users can
>> pick up these changes independently of all the other changes in Spark 3.0?
>>
>> Thank you in advance,
>>
>>michael
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

2019-11-22 Thread Dongjoon Hyun
Thank you, Steve and all.

As a conclusion of this thread, we will merge the following PR and move
forward.

[SPARK-29981][BUILD] Add hive-1.2/2.3 profiles
https://github.com/apache/spark/pull/26619

Please leave your comments if you have any concerns.
The following PRs, and more, will follow soon.

SPARK-29988 Adjust Jenkins jobs for hive-1.2/2.3 combination
SPARK-29989 Update release-script for hive-1.2/2.3 combination
SPARK-29991 Support hive-1.2/2.3 in PR Builder
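(For reference, once the PR lands, the two combinations could be exercised
roughly as below; the profile names come from the PR title, while the full set
of accompanying flags is an assumption.)

```
# Hive 1.2 (forked) execution module with Hadoop 2.7
./build/mvn -Phadoop-2.7 -Phive-1.2 -Phive -Phive-thriftserver -DskipTests clean package

# Hive 2.3 execution module with Hadoop 3.2
./build/mvn -Phadoop-3.2 -Phive-2.3 -Phive -Phive-thriftserver -DskipTests clean package
```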

In this thread, we have been focusing only on the Hive dependency.
These changes become effective in Apache Spark 3.0.0 (or the next preview).
For Hadoop 3 and JDK11, please follow up in the other threads.

Bests,
Dongjoon.


Re: SQL test failures in PR builder?

2019-12-04 Thread Dongjoon Hyun
Hi, Sean.

It seems that there is no failure on your other SQL PR.

https://github.com/apache/spark/pull/26748

Does the sequential failure happen only at `NewSparkPullRequestBuilder`?
Since `NewSparkPullRequestBuilder` is not the same as
`SparkPullRequestBuilder`,
there might be a root cause inside it if it happens only at
`NewSparkPullRequestBuilder`.

For `org.apache.hive.service.ServiceException: Failed to Start HiveServer2`,
I've observed them before, but the root cause might be different from this
one.

BTW, to reduce the scope of investigation, could you try with `[hive-1.2]`
tag in your PR?

Bests,
Dongjoon.


On Wed, Dec 4, 2019 at 6:29 AM Sean Owen  wrote:

> I'm seeing consistent failures in the PR builder when touching SQL code:
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4960/testReport/
>
>  org.apache.spark.sql.hive.thriftserver.SparkMetadataOperationSuite.Spark's
> own GetSchemasOperation (SparkGetSchemasOperation)
>  org.apache.spark.sql.hive.thriftserver.ThriftServerWithSparkContextSuite.(It
> is not a test it is a sbt.testing.SuiteSelector)
>
> Looks like this has failed about 6 builds in the past few days. Has anyone
> seen this / has a clue what's causing it? errors are like ...
>
> java.sql.SQLException: No suitable driver found for 
> jdbc:hive2://localhost:13694/?a=avalue;b=bvalue#c=cvalue;d=dvalue
>
>
> Caused by: sbt.ForkMain$ForkError: java.lang.RuntimeException: class 
> org.apache.hadoop.hive.metastore.DefaultMetaStoreFilterHookImpl not 
> org.apache.hadoop.hive.metastore.MetaStoreFilterHook
>
>


Re: Spark 3.0 preview release 2?

2019-12-09 Thread Dongjoon Hyun
Thank you, All.

+1 for another `3.0-preview`.

Also, thank you Yuming for volunteering for that!

Bests,
Dongjoon.


On Mon, Dec 9, 2019 at 9:39 AM Xiao Li  wrote:

> When entering the official release candidates, the new features have to be
> disabled or even reverted [if the conf is not available] if the fixes are
> not trivial; otherwise, we might need 10+ RCs to make the final release.
> The new features should not block the release based on the previous
> discussions.
>
> I agree we should have code freeze at the beginning of 2020. The preview
> releases should not block the official releases. The preview is just to
> collect more feedback about these new features or behavior changes.
>
> Also, for the release of Spark 3.0, we still need the Hive community to do
> us a favor to release 2.3.7 for having HIVE-22190
> . Before asking Hive
> community to do 2.3.7 release, if possible, we want our Spark community to
> have more tries, especially the support of JDK 11 on Hadoop 2.7 and 3.2,
> which is based on Hive 2.3 execution JAR. During the preview stage, we
> might find more issues that are not covered by our test cases.
>
>
>
> On Mon, Dec 9, 2019 at 4:55 AM Sean Owen  wrote:
>
>> Seems fine to me of course. Honestly that wouldn't be a bad result for
>> a release candidate, though we would probably roll another one now.
>> How about simply moving to a release candidate? If not now then at
>> least move to code freeze from the start of 2020. There is also some
>> downside in pushing out the 3.0 release further with previews.
>>
>> On Mon, Dec 9, 2019 at 12:32 AM Xiao Li  wrote:
>> >
>> > I got many great feedbacks from the community about the recent 3.0
>> preview release. Since the last 3.0 preview release, we already have 353
>> commits [https://github.com/apache/spark/compare/v3.0.0-preview...master].
>> There are various important features and behavior changes we want the
>> community to try before entering the official release candidates of Spark
>> 3.0.
>> >
>> >
>> > Below is my selected items that are not part of the last 3.0 preview
>> but already available in the upstream master branch:
>> >
>> > Support JDK 11 with Hadoop 2.7
>> > Spark SQL will respect its own default format (i.e., parquet) when
>> users do CREATE TABLE without USING or STORED AS clauses
>> > Enable Parquet nested schema pruning and nested pruning on expressions
>> by default
>> > Add observable Metrics for Streaming queries
>> > Column pruning through nondeterministic expressions
>> > RecordBinaryComparator should check endianness when compared by long
>> > Improve parallelism for local shuffle reader in adaptive query execution
>> > Upgrade Apache Arrow to version 0.15.1
>> > Various interval-related SQL support
>> > Add a mode to pin Python thread into JVM's
>> > Provide option to clean up completed files in streaming query
>> >
>> > I am wondering if we can have another preview release for Spark 3.0?
>> This can help us find the design/API defects as early as possible and avoid
>> the significant delay of the upcoming Spark 3.0 release
>> >
>> >
>> > Also, any committer is willing to volunteer as the release manager of
>> the next preview release of Spark 3.0, if we have such a release?
>> >
>> >
>> > Cheers,
>> >
>> >
>> > Xiao
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
> --
>


Release Apache Spark 2.4.5 and 2.4.6

2019-12-09 Thread Dongjoon Hyun
Hi, All.

Along with the discussion on 3.0.0, I'd like to discuss the next
releases on `branch-2.4`.

As we know, `branch-2.4` is our LTS branch, and there are some
questions about its release plans. More releases are important not only for
the latest K8s version support, but also for delivering important bug fixes
regularly (at least until 3.x becomes dominant).

In short, I'd like to propose the followings.

1. Apache Spark 2.4.5 release (2020 January)
2. Apache Spark 2.4.6 release (2020 July)

Of course, we can adjust the schedule.
This aims to have a pre-defined cadence in order to give release managers
time to prepare.

Bests,
Dongjoon.

PS. As of now, `branch-2.4` has 135 additional patches after `2.4.4`.


Re: Spark 3.0 preview release 2?

2019-12-10 Thread Dongjoon Hyun
BTW, our Jenkins seems to be behind.

1. For the first item, `Support JDK 11 with Hadoop 2.7`:
At least, we need a new Jenkins job
`spark-master-test-maven-hadoop-2.7-jdk-11/`.
2. https://issues.apache.org/jira/browse/SPARK-28900 (Test Pyspark, SparkR
on JDK 11 with run-tests)
3. https://issues.apache.org/jira/browse/SPARK-29988 (Adjust Jenkins jobs
for `hive-1.2/2.3` combination)

It would be great if we could finish the above three items before mentioning
them in the release notes of the next preview.

Bests,
Dongjoon.


On Tue, Dec 10, 2019 at 6:29 AM Tom Graves 
wrote:

> +1 for another preview
>
> Tom
>
> On Monday, December 9, 2019, 12:32:29 AM CST, Xiao Li <
> gatorsm...@gmail.com> wrote:
>
>
> I got many great feedbacks from the community about the recent 3.0
> preview release. Since the last 3.0 preview release, we already have 353
> commits [https://github.com/apache/spark/compare/v3.0.0-preview...master].
> There are various important features and behavior changes we want the
> community to try before entering the official release candidates of Spark
> 3.0.
>
>
> Below is my selected items that are not part of the last 3.0 preview but
> already available in the upstream master branch:
>
>
>- Support JDK 11 with Hadoop 2.7
>- Spark SQL will respect its own default format (i.e., parquet) when
>users do CREATE TABLE without USING or STORED AS clauses
>- Enable Parquet nested schema pruning and nested pruning on
>expressions by default
>- Add observable Metrics for Streaming queries
>- Column pruning through nondeterministic expressions
>- RecordBinaryComparator should check endianness when compared by long
>- Improve parallelism for local shuffle reader in adaptive query
>execution
>- Upgrade Apache Arrow to version 0.15.1
>- Various interval-related SQL support
>- Add a mode to pin Python thread into JVM's
>- Provide option to clean up completed files in streaming query
>
> I am wondering if we can have another preview release for Spark 3.0? This
> can help us find the design/API defects as early as possible and avoid the
> significant delay of the upcoming Spark 3.0 release
>
>
> Also, any committer is willing to volunteer as the release manager of the
> next preview release of Spark 3.0, if we have such a release?
>
>
> Cheers,
>
>
> Xiao
>


Re: R linter is broken

2019-12-13 Thread Dongjoon Hyun
It seems to fail at installation because the remote repository seems to have
changed.

Bests,
Dongjoon

On Fri, Dec 13, 2019 at 07:46 Nicholas Chammas 
wrote:

> The R linter GitHub action seems to be busted
> .
> Looks like we need to update some repository references
> 
> ?
>
> Nick
>


Re: R linter is broken

2019-12-13 Thread Dongjoon Hyun
Please see here for the root cause.

-
https://github.community/t5/GitHub-Actions/ubuntu-latest-Apt-repository-list-issues/td-p/41122


On Fri, Dec 13, 2019 at 9:11 AM Dongjoon Hyun 
wrote:

> It seems to fail at installation because the remote repository seems to be
> changed.
>
> Bests,
> Dongjoon
>
> On Fri, Dec 13, 2019 at 07:46 Nicholas Chammas 
> wrote:
>
>> The R linter GitHub action seems to be busted
>> <https://github.com/apache/spark/pull/26877/checks?check_run_id=347572350#step:4:68>.
>> Looks like we need to update some repository references
>> <https://github.com/apache/spark/blob/ac9b1881a281d33730d2bfb82ab2fb4bc04cc0a0/.github/workflows/master.yml#L109>
>> ?
>>
>> Nick
>>
>


Re: R linter is broken

2019-12-13 Thread Dongjoon Hyun
The Microsoft mirror has recovered now.

Bests,
Dongjoon.

On Fri, Dec 13, 2019 at 9:45 AM Dongjoon Hyun 
wrote:

> Please see here for the root cause.
>
> -
> https://github.community/t5/GitHub-Actions/ubuntu-latest-Apt-repository-list-issues/td-p/41122
>
>
> On Fri, Dec 13, 2019 at 9:11 AM Dongjoon Hyun 
> wrote:
>
>> It seems to fail at installation because the remote repository seems to
>> be changed.
>>
>> Bests,
>> Dongjoon
>>
>> On Fri, Dec 13, 2019 at 07:46 Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> The R linter GitHub action seems to be busted
>>> <https://github.com/apache/spark/pull/26877/checks?check_run_id=347572350#step:4:68>.
>>> Looks like we need to update some repository references
>>> <https://github.com/apache/spark/blob/ac9b1881a281d33730d2bfb82ab2fb4bc04cc0a0/.github/workflows/master.yml#L109>
>>> ?
>>>
>>> Nick
>>>
>>


Re: Release Apache Spark 2.4.5 and 2.4.6

2019-12-11 Thread Dongjoon Hyun
Thank you all. I'll make a PR to Apache Spark website.

Bests,
Dongjoon.

On Tue, Dec 10, 2019 at 11:43 PM Wenchen Fan  wrote:

> Sounds good. Thanks for bringing this up!
>
> On Wed, Dec 11, 2019 at 3:18 PM Takeshi Yamamuro 
> wrote:
>
>> That looks nice, thanks!
>> I checked the previous v2.4.4 release; it has around 130 commits (from
>> 2.4.3 to 2.4.4), so
>> I think branch-2.4 already has enough commits for the next release.
>>
>> A commit list from 2.4.3 to 2.4.4;
>>
>> https://github.com/apache/spark/compare/5ac2014e6c118fbeb1fe8e5c8064c4a8ee9d182a...7955b3962ac46b89564e0613db7bea98a1478bf2
>>
>> Bests,
>> Takeshi
>>
>> On Tue, Dec 10, 2019 at 3:32 AM Sean Owen  wrote:
>>
>>> Sure, seems fine. The release cadence slows down in a branch over time
>>> as there is probably less to fix, so Jan-Feb 2020 for 2.4.5 and
>>> something like middle or Q3 2020 for 2.4.6 is a reasonable
>>> expectation. It might plausibly be the last 2.4.x release but who
>>> knows.
>>>
>>> On Mon, Dec 9, 2019 at 12:29 PM Dongjoon Hyun 
>>> wrote:
>>> >
>>> > Hi, All.
>>> >
>>> > Along with the discussion on 3.0.0, I'd like to discuss about the next
>>> releases on `branch-2.4`.
>>> >
>>> > As we know, `branch-2.4` is our LTS branch and also there exists some
>>> questions on the release plans. More releases are important not only for
>>> the latest K8s version support, but also for delivering important bug fixes
>>> regularly (at least until 3.x becomes dominant.)
>>> >
>>> > In short, I'd like to propose the followings.
>>> >
>>> > 1. Apache Spark 2.4.5 release (2020 January)
>>> > 2. Apache Spark 2.4.6 release (2020 July)
>>> >
>>> > Of course, we can adjust the schedule.
>>> > This aims to have a pre-defined cadence in order to give release
>>> managers to prepare.
>>> >
>>> > Bests,
>>> > Dongjoon.
>>> >
>>> > PS. As of now, `branch-2.4` has 135 additional patches after `2.4.4`.
>>> >
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>


FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE syntax

2019-12-06 Thread Dongjoon Hyun
Hi, All.

I want to share the following change to the community.

SPARK-30098 Use default datasource as provider for CREATE TABLE syntax

This was merged today, and Spark's `CREATE TABLE` now uses Spark's
default data source instead of the `hive` provider. This is a big
improvement for Apache Spark 3.0, but it might surprise some users. (Please
note that there is a fallback option for them.)
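
For illustration, here is a minimal Scala sketch of the behavior change
described above, as one might try it in spark-shell on a build containing the
patch; the table name is hypothetical, and the exact key of the fallback
legacy configuration is intentionally not named here (see SPARK-30098 for it):

  // Minimal sketch of the SPARK-30098 behavior change (spark-shell).
  // The table name `t` is hypothetical.
  spark.sql("CREATE TABLE t (id INT, name STRING)")

  // Before SPARK-30098, a CREATE TABLE without USING or STORED AS produced a
  // Hive text-format table; now it should use Spark's default data source
  // (parquet, per spark.sql.sources.default).
  spark.sql("DESCRIBE TABLE EXTENDED t").show(100, truncate = false)
  // Expect the `Provider` row to show `parquet` instead of `hive`.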

Thank you, Yi, Wenchen, Xiao.

Cheers,
Dongjoon.


Re: Packages to release in 3.0.0-preview

2019-10-27 Thread Dongjoon Hyun
Hi, Yuming.

Does the project work correctly on JDK 8 for you?

When I simply cloned your repo and ran `mvn clean package` on
JDK 1.8.0_232, it did not pass the UTs.

I also tried to rerun after ignoring the two ORC table tests as follows,
but a UT still fails.

~/A/test-spark-jdk11:master$ git diff | grep 'ORC table'
-  test("Datasource ORC table") {
+  ignore("Datasource ORC table") {
-  test("Hive ORC table") {
+  ignore("Hive ORC table") {

~/A/test-spark-jdk11:master$ mvn clean package
...
- Hive ORC table !!! IGNORED !!!
Run completed in 36 seconds, 999 milliseconds.
Total number of tests run: 2
Suites: completed 3, aborted 0
Tests: succeeded 1, failed 1, canceled 0, ignored 2, pending 0
*** 1 TEST FAILED ***

~/A/test-spark-jdk11:master$ java -version
openjdk version "1.8.0_232"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_232-b09)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.232-b09, mixed mode)


Bests,
Dongjoon.

On Sun, Oct 27, 2019 at 1:38 PM Dongjoon Hyun 
wrote:

> It seems not a Hadoop issue, doesn't it?
>
> What Yuming pointed seems to be `Hive 2.3.6` profile implementation issue
> which is enabled only when `Hadoop 3.2`.
>
> From my side, I'm +1 for publishing jars which depends on `Hadoop 3.2.0 /
> Hive 2.3.6` jars to Maven since Apache Spark 3.0.0.
>
> For the others, I'd like to mention that this implies the followings, too.
>
> 1. We are not going to use Hive 1.2.1 library. Only Hadoop-2.7 profile
> tarball distribution will use Hive 1.2.1.
> 2. Although we depends on Hadoop 3.2.0, Hadoop 3.2.1 changes their Guava
> library version significantly.
> So, it requires some attentions in Apache Spark. Otherwise, we may hit
> some issues on Hadoop 3.2.1+ runtime later.
>
> Thanks,
> Dongjoon.
>
>
> On Sun, Oct 27, 2019 at 7:31 AM Sean Owen  wrote:
>
>> Is the Spark artifact actually any different between those builds? I
>> thought it just affected what else was included in the binary tarball.
>> If it matters, yes I'd publish a "Hadoop 3" version to Maven. (Scala
>> 2.12 is the only supported Scala version).
>>
>> On Sun, Oct 27, 2019 at 4:35 AM Yuming Wang  wrote:
>> >
>> > Do we need to publish the Scala 2.12 + hadoop 3.2 jar packages to the
>> Maven repository? Otherwise it will throw a NoSuchMethodError on Java 11.
>> > Here is an example:
>> >
>> https://github.com/wangyum/test-spark-jdk11/blob/master/src/test/scala/test/spark/HiveTableSuite.scala#L34-L38
>> >
>> https://github.com/wangyum/test-spark-jdk11/commit/927ce7d3766881fba98f2434055fa3a1d1544ad2/checks?check_suite_id=283076578
>> >
>> >
>> > On Sat, Oct 26, 2019 at 10:41 AM Takeshi Yamamuro <
>> linguin@gmail.com> wrote:
>> >>
>> >> Thanks for that work!
>> >>
>> >> > I don't think JDK 11 is a separate release (by design). We build
>> >> > everything targeting JDK 8 and it should work on JDK 11 too.
>> >> +1. a single package working on both jvms looks nice.
>> >>
>> >>
>> >> On Sat, Oct 26, 2019 at 4:18 AM Sean Owen  wrote:
>> >>>
>> >>> I don't think JDK 11 is a separate release (by design). We build
>> >>> everything targeting JDK 8 and it should work on JDK 11 too.
>> >>>
>> >>> So, just two releases, but, frankly I think we soon need to stop
>> >>> multiple releases for multiple Hadoop versions, and stick to Hadoop 3.
>> >>> I think it's fine to try to release for Hadoop 2 as the support still
>> >>> exists, and because the difference happens to be larger due to the
>> >>> different Hive dependency.
>> >>>
>> >>> On Fri, Oct 25, 2019 at 2:08 PM Xingbo Jiang 
>> wrote:
>> >>> >
>> >>> > Hi all,
>> >>> >
>> >>> > I would like to bring out a discussion on how many packages shall
>> be released in 3.0.0-preview, the ones I can think of now:
>> >>> >
>> >>> > * scala 2.12 + hadoop 2.7
>> >>> > * scala 2.12 + hadoop 3.2
>> >>> > * scala 2.12 + hadoop 3.2 + JDK 11
>> >>> >
>> >>> > Do you have other combinations to add to the above list?
>> >>> >
>> >>> > Cheers,
>> >>> >
>> >>> > Xingbo
>> >>>
>> >>> -
>> >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >>>
>> >>
>> >>
>> >> --
>> >> ---
>> >> Takeshi Yamamuro
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: Packages to release in 3.0.0-preview

2019-10-27 Thread Dongjoon Hyun
It doesn't seem to be a Hadoop issue, does it?

What Yuming pointed out seems to be a `Hive 2.3.6` profile implementation
issue, which is enabled only with the `Hadoop 3.2` profile.

From my side, I'm +1 for publishing jars that depend on the `Hadoop 3.2.0 /
Hive 2.3.6` jars to Maven starting with Apache Spark 3.0.0.

For the others, I'd like to mention that this implies the following, too.

1. We are not going to use the Hive 1.2.1 library. Only the Hadoop 2.7 profile
tarball distribution will use Hive 1.2.1.
2. Although we depend on Hadoop 3.2.0, Hadoop 3.2.1 changes its Guava
library version significantly.
So, it requires some attention in Apache Spark. Otherwise, we may hit
some issues on a Hadoop 3.2.1+ runtime later.

Thanks,
Dongjoon.


On Sun, Oct 27, 2019 at 7:31 AM Sean Owen  wrote:

> Is the Spark artifact actually any different between those builds? I
> thought it just affected what else was included in the binary tarball.
> If it matters, yes I'd publish a "Hadoop 3" version to Maven. (Scala
> 2.12 is the only supported Scala version).
>
> On Sun, Oct 27, 2019 at 4:35 AM Yuming Wang  wrote:
> >
> > Do we need to publish the Scala 2.12 + hadoop 3.2 jar packages to the
> Maven repository? Otherwise it will throw a NoSuchMethodError on Java 11.
> > Here is an example:
> >
> https://github.com/wangyum/test-spark-jdk11/blob/master/src/test/scala/test/spark/HiveTableSuite.scala#L34-L38
> >
> https://github.com/wangyum/test-spark-jdk11/commit/927ce7d3766881fba98f2434055fa3a1d1544ad2/checks?check_suite_id=283076578
> >
> >
> > On Sat, Oct 26, 2019 at 10:41 AM Takeshi Yamamuro 
> wrote:
> >>
> >> Thanks for that work!
> >>
> >> > I don't think JDK 11 is a separate release (by design). We build
> >> > everything targeting JDK 8 and it should work on JDK 11 too.
> >> +1. a single package working on both jvms looks nice.
> >>
> >>
> >> On Sat, Oct 26, 2019 at 4:18 AM Sean Owen  wrote:
> >>>
> >>> I don't think JDK 11 is a separate release (by design). We build
> >>> everything targeting JDK 8 and it should work on JDK 11 too.
> >>>
> >>> So, just two releases, but, frankly I think we soon need to stop
> >>> multiple releases for multiple Hadoop versions, and stick to Hadoop 3.
> >>> I think it's fine to try to release for Hadoop 2 as the support still
> >>> exists, and because the difference happens to be larger due to the
> >>> different Hive dependency.
> >>>
> >>> On Fri, Oct 25, 2019 at 2:08 PM Xingbo Jiang 
> wrote:
> >>> >
> >>> > Hi all,
> >>> >
> >>> > I would like to bring out a discussion on how many packages shall be
> released in 3.0.0-preview, the ones I can think of now:
> >>> >
> >>> > * scala 2.12 + hadoop 2.7
> >>> > * scala 2.12 + hadoop 3.2
> >>> > * scala 2.12 + hadoop 3.2 + JDK 11
> >>> >
> >>> > Do you have other combinations to add to the above list?
> >>> >
> >>> > Cheers,
> >>> >
> >>> > Xingbo
> >>>
> >>> -
> >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>>
> >>
> >>
> >> --
> >> ---
> >> Takeshi Yamamuro
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Apache Spark 3.0 timeline

2019-10-16 Thread Dongjoon Hyun
Hi, All.

I saw the following comment from Wenchen in the previous email thread.

> Personally I'd like to avoid cutting branch-3.0 right now, otherwise we
need to merge PRs into two branches in the following several months.

Since 3.0.0-preview already seems to be here for RC, can we update our
timeline on the official web page accordingly?

http://spark.apache.org/versioning-policy.html

-
Spark 2.4 Release Window

Date             Event
Mid Aug 2018     Code freeze. Release branch cut.
Late Aug 2018    QA period. Focus on bug fixes, tests, stability and docs. Generally, no new features merged.
Early Sep 2018   Release candidates (RC), voting, etc. until final release passes


branch-3.0 vs branch-3.0-preview (?)

2019-10-15 Thread Dongjoon Hyun
Hi,

It seems that we have `branch-3.0-preview` branch.

https://github.com/apache/spark/commits/branch-3.0-preview

Can we have `branch-3.0` instead of `branch-3.0-preview`?

We can tag `v3.0.0-preview` on `branch-3.0` and continue to use it for
`v3.0.0` later.

Bests,
Dongjoon.


Re: [DISCUSS] Deprecate Python < 3.6 in Spark 3.0

2019-10-24 Thread Dongjoon Hyun
Thank you for starting the thread.

In addition to that, we currently test only Python 3.6 in the Apache
Spark Jenkins environment.

Given that Python 3.8 is already out and Apache Spark 3.0.0 RC1 will start
next January
(https://spark.apache.org/versioning-policy.html), I'm +1 for
deprecating Python < 3.6 in Apache Spark 3.0.0.

It's just a deprecation to prepare for the next development cycle.

Bests,
Dongjoon.


On Thu, Oct 24, 2019 at 1:10 AM Maciej Szymkiewicz 
wrote:

> Hi everyone,
>
> While deprecation of Python 2 in 3.0.0 has been announced
> ,
> there is no clear statement about specific continuing support of different
> Python 3 version.
>
> Specifically:
>
>- Python 3.4 has been retired this year.
>- Python 3.5 is already in the "security fixes only" mode and should
>be retired in the middle of 2020.
>
> Continued support of these two blocks adoption of many new Python features
> (PEP 468)  and it is hard to justify beyond 2020.
>
> Should these two be deprecated in 3.0.0 as well?
>
> --
> Best regards,
> Maciej
>
>


Minimum JDK8 version

2019-10-24 Thread Dongjoon Hyun
Hi, All.

Apache Spark 3.x will support both JDK8 and JDK11.

I'm wondering if we can set a minimum JDK 8 version for Apache Spark 3.0.

Specifically, can we start to deprecate JDK8u81 and older at 3.0?

Currently, the Apache Spark testing infra tests only with jdk1.8.0_191
and above.

Bests,
Dongjoon.


Re: Minimum JDK8 version

2019-10-24 Thread Dongjoon Hyun
Thank you for the replies, Sean, Shane, and Takeshi.

The reason is that there is a PR that aims to add
`-XX:OnOutOfMemoryError="kill -9 %p"` as the default behavior at 3.0.0.
(Please note that the PR always adds it by *default*. There is no way
for users to remove it.)

- [SPARK-27900][CORE][K8s] Add `spark.driver.killOnOOMError` flag in
cluster mode
- https://github.com/apache/spark/pull/26161

If we can deprecate old JDK 8 versions, we can use the JVM option
`ExitOnOutOfMemoryError` instead.
(This was added in JDK 8u92. In my previous email, 8u82 was a typo.)

-
https://www.oracle.com/technetwork/java/javase/8u92-relnotes-2949471.html
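
As an illustration only (not the actual wiring in the SPARK-27900 PR, which
may differ), here is a hedged Scala sketch of how the two JVM options compare
when attached to executors through the standard
`spark.executor.extraJavaOptions` setting; the application name and the choice
to set this programmatically are assumptions:

  import org.apache.spark.SparkConf
  import org.apache.spark.sql.SparkSession

  // Illustrative sketch only: the two OOM-handling flags being compared.
  // `-XX:+ExitOnOutOfMemoryError` requires JDK 8u92 or newer.
  val legacyFlag = "-XX:OnOutOfMemoryError=\"kill -9 %p\""  // pre-8u92: run a shell command on OOM
  val modernFlag = "-XX:+ExitOnOutOfMemoryError"            // 8u92+: the JVM exits on the first OOM

  val conf = new SparkConf()
    .setAppName("oom-flag-sketch")                      // hypothetical app name
    .set("spark.executor.extraJavaOptions", modernFlag) // or legacyFlag on older JDK 8 builds

  val spark = SparkSession.builder.config(conf).getOrCreate()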

Naturally, not all JDK 8 versions are the same. For example, the Hadoop
community also has the following document, although it does not specify
the minimum versions.

-
https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions

Bests,
Dongjoon.


On Thu, Oct 24, 2019 at 6:05 PM Takeshi Yamamuro 
wrote:

> Hi, Dongjoon
>
> It might be worth clearly describing which jdk versions we check in the
> testing infra
> in some documents, e.g., https://spark.apache.org/docs/latest/#downloading
>
> btw, any other project announcing the minimum support jdk version?
> It seems that hadoop does not.
>
> On Fri, Oct 25, 2019 at 6:51 AM Sean Owen  wrote:
>
>> Probably, but what is the difference that makes it different to
>> support u81 vs later?
>>
>> On Thu, Oct 24, 2019 at 4:39 PM Dongjoon Hyun 
>> wrote:
>> >
>> > Hi, All.
>> >
>> > Apache Spark 3.x will support both JDK8 and JDK11.
>> >
>> > I'm wondering if we can have a minimum JDK8 version in Apache Spark 3.0.
>> >
>> > Specifically, can we start to deprecate JDK8u81 and older at 3.0.
>> >
>> > Currently, Apache Spark testing infra are testing only with
>> jdk1.8.0_191 and above.
>> >
>> > Bests,
>> > Dongjoon.
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
> --
> ---
> Takeshi Yamamuro
>


Re: Minimum JDK8 version

2019-10-24 Thread Dongjoon Hyun
Thank you. I created a PR for that. For now, the minimum requirement is
8u92 in that PR.

https://github.com/apache/spark/pull/26249

Bests,
Dongjoon.


On Thu, Oct 24, 2019 at 7:55 PM Sean Owen  wrote:

> I think that's fine, personally. Anyone using JDK 8 should / probably
> is on a recent release.
>
> On Thu, Oct 24, 2019 at 8:56 PM Dongjoon Hyun 
> wrote:
> >
> > Thank you for reply, Sean, Shane, Takeshi.
> >
> > The reason is that there is a PR to aim to add
> `-XX:OnOutOfMemoryError="kill -9 %p"` as a default behavior at 3.0.0.
> > (Please note that the PR will add it by *default* always. There is no
> way for user to remove it.)
> >
> > - [SPARK-27900][CORE][K8s] Add `spark.driver.killOnOOMError` flag in
> cluster mode
> > - https://github.com/apache/spark/pull/26161
> >
> > If we can deprecate old JDK8 versions, we are able to use JVM option
> `ExitOnOutOfMemoryError` instead.
> > (This is added at JDK 8u92. In my previous email, 8u82 was a typo.)
> >
> > -
> https://www.oracle.com/technetwork/java/javase/8u92-relnotes-2949471.html
> >
> > All versions of JDK8 are not the same naturally. For example, Hadoop
> community also have the following document although they are not specifying
> the minimum versions.
> >
> > -
> https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions
> >
> > Bests,
> > Dongjoon.
> >
> >
> > On Thu, Oct 24, 2019 at 6:05 PM Takeshi Yamamuro 
> wrote:
> >>
> >> Hi, Dongjoon
> >>
> >> It might be worth clearly describing which jdk versions we check in the
> testing infra
> >> in some documents, e.g.,
> https://spark.apache.org/docs/latest/#downloading
> >>
> >> btw, any other project announcing the minimum support jdk version?
> >> It seems that hadoop does not.
> >>
> >> On Fri, Oct 25, 2019 at 6:51 AM Sean Owen  wrote:
> >>>
> >>> Probably, but what is the difference that makes it different to
> >>> support u81 vs later?
> >>>
> >>> On Thu, Oct 24, 2019 at 4:39 PM Dongjoon Hyun 
> wrote:
> >>> >
> >>> > Hi, All.
> >>> >
> >>> > Apache Spark 3.x will support both JDK8 and JDK11.
> >>> >
> >>> > I'm wondering if we can have a minimum JDK8 version in Apache Spark
> 3.0.
> >>> >
> >>> > Specifically, can we start to deprecate JDK8u81 and older at 3.0.
> >>> >
> >>> > Currently, Apache Spark testing infra are testing only with
> jdk1.8.0_191 and above.
> >>> >
> >>> > Bests,
> >>> > Dongjoon.
> >>>
> >>> -
> >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>>
> >>
> >>
> >> --
> >> ---
> >> Takeshi Yamamuro
>


Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-10-28 Thread Dongjoon Hyun
Thank you for the feedback, Sean and Xiao.

Bests,
Dongjoon.

On Mon, Oct 28, 2019 at 12:52 PM Xiao Li  wrote:

> The stability and quality of Hadoop 3.2 profile are unknown. The changes
> are massive, including Hive execution and a new version of Hive
> thriftserver.
>
> To reduce the risk, I would like to keep the current default version
> unchanged. When it becomes stable, we can change the default profile to
> Hadoop-3.2.
>
> Cheers,
>
> Xiao
>
> On Mon, Oct 28, 2019 at 12:51 PM Sean Owen  wrote:
>
>> I'm OK with that, but don't have a strong opinion nor info about the
>> implications.
>> That said my guess is we're close to the point where we don't need to
>> support Hadoop 2.x anyway, so, yeah.
>>
>> On Mon, Oct 28, 2019 at 2:33 PM Dongjoon Hyun 
>> wrote:
>> >
>> > Hi, All.
>> >
>> > There was a discussion on publishing artifacts built with Hadoop 3 .
>> > But, we are still publishing with Hadoop 2.7.3 and `3.0-preview` will
>> be the same because we didn't change anything yet.
>> >
>> > Technically, we need to change two places for publishing.
>> >
>> > 1. Jenkins Snapshot Publishing
>> >
>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/
>> >
>> > 2. Release Snapshot/Release Publishing
>> >
>> https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh
>> >
>> > To minimize the change, we need to switch our default Hadoop profile.
>> >
>> > Currently, the default is `hadoop-2.7 (2.7.4)` profile and `hadoop-3.2
>> (3.2.0)` is optional.
>> > We had better use `hadoop-3.2` profile by default and `hadoop-2.7`
>> optionally.
>> >
>> > Note that this means we use Hive 2.3.6 by default. Only `hadoop-2.7`
>> distribution will use `Hive 1.2.1` like Apache Spark 2.4.x.
>> >
>> > Bests,
>> > Dongjoon.
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
> --
>


Re: [build system] intermittent network issues + potential power shutoff over the weekend

2019-10-28 Thread Dongjoon Hyun
Thank you for fixing the worker ENVs, Shane.

Bests,
Dongjoon.

On Mon, Oct 28, 2019 at 10:47 AM Shane Knapp  wrote:

> i will need to restart jenkins -- the worker's ENV vars got borked when
> they came back up.
>
> this is happening NOW.
>
> shane
>
> On Mon, Oct 28, 2019 at 10:37 AM Shane Knapp  wrote:
>
>> we're back up and building!
>>
>> On Mon, Oct 28, 2019 at 8:35 AM Shane Knapp  wrote:
>>
>>> ok, it looks like the colo will have power until monday morning, and
>>> it will be shut down from 8am to noon to perform some maintenance.
>>>
>>> this means jenkins will be up all weekend, but down monday morning.
>>>
>>> jenkins is currently down due to colo maintenance.  expect it to return
>>> in ~3.5 hours.
>>>
>>> shane
>>> --
>>> Shane Knapp
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>
>>
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-10-28 Thread Dongjoon Hyun
Hi, All.

There was a discussion on publishing artifacts built with Hadoop 3.
But we are still publishing with Hadoop 2.7.3, and `3.0-preview` will be
the same because we haven't changed anything yet.

Technically, we need to change two places for publishing.

1. Jenkins Snapshot Publishing

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/

2. Release Snapshot/Release Publishing

https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh

To minimize the change, we need to switch our default Hadoop profile.

Currently, the default is `hadoop-2.7 (2.7.4)` profile and `hadoop-3.2
(3.2.0)` is optional.
We had better use `hadoop-3.2` profile by default and `hadoop-2.7`
optionally.

Note that this means we use Hive 2.3.6 by default. Only `hadoop-2.7`
distribution will use `Hive 1.2.1` like Apache Spark 2.4.x.

Bests,
Dongjoon.


Re: [VOTE] SPARK 3.0.0-preview (RC1)

2019-10-29 Thread Dongjoon Hyun
Hi, Xingbo.

PySpark seems to have failed to build. There is only the `sha512` file; the PySpark tarball itself is missing.

SparkR_3.0.0-preview.tar.gz
SparkR_3.0.0-preview.tar.gz.asc
SparkR_3.0.0-preview.tar.gz.sha512
*pyspark-3.0.0.preview.tar.gz.sha512*
spark-3.0.0-preview-bin-hadoop2.7.tgz
spark-3.0.0-preview-bin-hadoop2.7.tgz.asc
spark-3.0.0-preview-bin-hadoop2.7.tgz.sha512
spark-3.0.0-preview-bin-hadoop3.2.tgz
spark-3.0.0-preview-bin-hadoop3.2.tgz.asc
spark-3.0.0-preview-bin-hadoop3.2.tgz.sha512
spark-3.0.0-preview-bin-without-hadoop.tgz
spark-3.0.0-preview-bin-without-hadoop.tgz.asc
spark-3.0.0-preview-bin-without-hadoop.tgz.sha512
spark-3.0.0-preview.tgz
spark-3.0.0-preview.tgz.asc
spark-3.0.0-preview.tgz.sha512


Bests,
Dongjoon.


On Tue, Oct 29, 2019 at 7:18 PM Xingbo Jiang  wrote:

> Thanks for the correction, we shall remove the statement
>>
>> Everything else please retarget to an appropriate release.
>>
>
> Reynold Xin  wrote on Tue, Oct 29, 2019 at 7:09 PM:
>
>> Does the description make sense? This is a preview release so there is no
>> need to retarget versions.
>>
>> On Tue, Oct 29, 2019 at 7:01 PM Xingbo Jiang 
>> wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 3.0.0-preview.
>>>
>>> The vote is open until November 2 PST and passes if a majority +1 PMC
>>> votes are cast, with
>>> a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 3.0.0-preview
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v3.0.0-preview-rc1 (commit
>>> 5eddbb5f1d9789696927f435c55df887e50a1389):
>>> https://github.com/apache/spark/tree/v3.0.0-preview-rc1
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-preview-rc1-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1334/
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-preview-rc1-docs/
>>>
>>> The list of bug fixes going into 3.0.0 can be found at the following URL:
>>> https://issues.apache.org/jira/projects/SPARK/versions/12339177
>>>
>>> FAQ
>>>
>>> =
>>> How can I help test this release?
>>> =
>>>
>>> If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>> the current RC and see if anything important breaks, in the Java/Scala
>>> you can add the staging repository to your projects resolvers and test
>>> with the RC (make sure to clean up the artifact cache before/after so
>>> you don't end up building with a out of date RC going forward).
>>>
>>> ===
>>> What should happen to JIRA tickets still targeting 3.0.0?
>>> ===
>>>
>>> The current list of open tickets targeted at 3.0.0 can be found at:
>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>> Version/s" = 3.0.0
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should
>>> be worked on immediately. Everything else please retarget to an
>>> appropriate release.
>>>
>>> ==
>>> But my bug isn't fixed?
>>> ==
>>>
>>> In order to make timely releases, we will typically not hold the
>>> release unless the bug in question is a regression from the previous
>>> release. That being said, if there is something which is a regression
>>> that has not been correctly targeted please ping me or a committer to
>>> help target the issue.
>>>
>>


Re: Apache Spark 3.0 timeline

2019-10-16 Thread Dongjoon Hyun
Thanks! That sounds reasonable. I'm +1. :)

Historically, 2.0-preview was in May 2016 and 2.0 was in July 2016. 3.0
seems to be different.

Bests,
Dongjoon

On Wed, Oct 16, 2019 at 16:38 Sean Owen  wrote:

> I think the branch question is orthogonal but yeah we can probably make an
> updated statement about 3.0 release. Clearly a preview is imminent. I
> figure we are probably moving to code freeze late in the year, release
> early next year? Any better ideas about estimates to publish? They aren't
> binding.
>
> On Wed, Oct 16, 2019, 4:01 PM Dongjoon Hyun 
> wrote:
>
>> Hi, All.
>>
>> I saw the following comment from Wenchen in the previous email thread.
>>
>> > Personally I'd like to avoid cutting branch-3.0 right now, otherwise we
>> need to merge PRs into two branches in the following several months.
>>
>> Since 3.0.0-preview seems to be already here for RC, can we update our
>> timeline in the official web page accordingly.
>>
>> http://spark.apache.org/versioning-policy.html
>>
>> -
>> Spark 2.4 Release Window
>> Date Event
>> Mid Aug 2018   Code freeze. Release branch cut.
>> Late Aug 2018   QA period. Focus on bug fixes, tests, stability and
>> docs. Generally, no new features merged.
>> Early Sep 2018   Release candidates (RC), voting, etc. until final
>> release passes
>>
>


Re: branch-3.0 vs branch-3.0-preview (?)

2019-10-17 Thread Dongjoon Hyun
Great! Thank you!

Bests,
Dongjoon.

On Thu, Oct 17, 2019 at 10:19 Xingbo Jiang  wrote:

> I've deleted the branch-3.0-preview branch, and added `3.0.0-preview` tag
> to master (https://github.com/apache/spark/releases/tag/3.0.0-preview).
> I'll be working on make a RC now.
>
> Cheers,
>
> Xingbo
>
> Sean Owen  wrote on Thu, Oct 17, 2019 at 4:23 PM:
>
>> Sure, if that works, that's a simpler solution. The preview release is
>> like an RC of the master branch itself.
>> Are there any issues with that approach right now?
>> Yes if it turns out that we can't get a reasonably stable release off
>> master, then we can branch and cherry-pick. We'd have to retain the
>> branch though.
>>
>> On Thu, Oct 17, 2019 at 12:28 AM Xingbo Jiang 
>> wrote:
>> >
>> > How about add `3.0.0-preview` tag on master branch, and claim that for
>> the preview release, we won't consider bugs introduced by new features
>> merged into master after the first preview RC ? This could rule out the
>> risk that we keep on import new commits and need to resolve more critical
>> bugs thus the release would never converge.
>> >
>> > Cheers,
>> >
>> > Xingbo
>> >
>> >> Sean Owen  wrote on Wed, Oct 16, 2019 at 6:34 PM:
>> >>
>> >> We do not have to do anything to branch-3.0-preview; it's just for the
>> >> convenience of the RM. Just continue to merge to master for 3.0.
>> >>
>> >> If it happens that some state of the master branch works as a preview
>> >> release, sure, just tag and release. We might get away with it. But if
>> >> for example we have a small issue to fix with the preview and
>> >> meanwhile something else has landed in the master branch that doesn't
>> >> work, we'll struggle to get an RC out. I agree, that would be nice to
>> >> not deal with this as a branch yet.
>> >>
>> >> But if we do: Yeah I figured the merge script would pick it up, which
>> >> is a little annoying, but you can still just type branch-2.4.
>> >> I think we have to retain the branch though if there are any
>> >> cherry-picks, to record the state of the release.
>> >>
>> >> We don't want a "3.0-preview" version in JIRA. Let's fix the script if
>> we must.
>> >>
>> >> So, I take it that the current preview RC didn't work. What if we
>> >> delete that branch and try again from master? does that work?
>> >>
>> >> On Wed, Oct 16, 2019 at 11:19 AM Dongjoon Hyun <
>> dongjoon.h...@gmail.com> wrote:
>> >> >
>> >> > Technically, `branch-3.0-preview` has many issues.
>> >> >
>> >> > First of all, are we going to delete `branch-3.0-preview` after
>> releasing `3.0-preview`?
>> >> > I guess we didn't delete old branches (including feature branches
>> like jdbc, yarn branches)
>> >> >
>> >> > Second, our merge script already starts to show `branch-3.0-preview`
>> instead of `branch-2.4` already.
>> >> > Currently, We need to merge to `master` -> `branch-3.0-preview` ->
>> `branch-2.4`.
>> >> > This already creates a burden to maintain our LTS branch
>> `branch-2.4`.
>> >> >
>> >> > Third, during updating JIRA, our merge script starts to fail because
>> it extracts the version number from `branch-3.0-preview` but Apache JIRA
>> doesn't have a version `3.0-preview`. Are we going to add a release version
>> at `Apache Spark JIRA`?
>> >> > (I'm -1 for this. `Fixed Version: 3.0-preview` seems to be overkill).
>> >> >
>> >> > If we are reluctant to have `branch-3.0` because it has a meaning of
>> `feature` and its merging cost, I'm +1 for tag on `master` (Reynold's
>> suggestion)
>> >> >
>> >> > We can do vote and stabilize `3.0-alpha` in master branch.
>> >> >
>> >> > Bests,
>> >> > Dongjoon.
>> >> >
>> >> >
>> >> > On Wed, Oct 16, 2019 at 3:04 AM Sean Owen  wrote:
>> >> >>
>> >> >> I don't think we would want to cut 'branch-3.0' right now, which
>> would
>> >> >> imply that master is 3.1. We don't want to merge every new change
>> into
>> >> >> two branches.
>> >> >> It may still be useful to have `branch-3.0-preview` as a short-lived
>> >> >> branch just used to manage the preview release, as we will need to
>> let

Re: branch-3.0 vs branch-3.0-preview (?)

2019-10-16 Thread Dongjoon Hyun
Technically, `branch-3.0-preview` has many issues.

First of all, are we going to delete `branch-3.0-preview` after releasing
`3.0-preview`?
I guess we haven't deleted old branches before (including feature branches
like the jdbc and yarn branches).

Second, our merge script already shows `branch-3.0-preview`
instead of `branch-2.4`.
Currently, we need to merge to `master` -> `branch-3.0-preview` ->
`branch-2.4`.
This already creates a burden for maintaining our LTS branch, `branch-2.4`.

Third, while updating JIRA, our merge script starts to fail because it
extracts the version number from `branch-3.0-preview`, but Apache JIRA
doesn't have a version `3.0-preview`. Are we going to add a release version
in the Apache Spark JIRA?
(I'm -1 on this; `Fixed Version: 3.0-preview` seems to be overkill.)

If we are reluctant to have `branch-3.0` because of what it implies for
feature development and its merging cost, I'm +1 for a tag on `master`
(Reynold's suggestion).

We can vote on and stabilize `3.0-alpha` in the master branch.

Bests,
Dongjoon.


On Wed, Oct 16, 2019 at 3:04 AM Sean Owen  wrote:

> I don't think we would want to cut 'branch-3.0' right now, which would
> imply that master is 3.1. We don't want to merge every new change into
> two branches.
> It may still be useful to have `branch-3.0-preview` as a short-lived
> branch just used to manage the preview release, as we will need to let
> development on 3.0 in master continue while stabilizing the preview
> release with a few selected cherry-picks, but that's only of concern
> to the release manager.
>
> On Wed, Oct 16, 2019 at 2:01 AM Xingbo Jiang 
> wrote:
> >
> > Hi Dongjoon,
> >
> > I'm not sure about the best practice of maintaining a preview release
> branch, since new features might still go into Spark 3.0 after preview
> release, I guess it might make more sense to have separated  branches for
> 3.0.0 and 3.0-preview.
> >
> > However, I'm open to both solutions, if we really want to reuse the
> branch to also release Spark 3.0.0, then I would be happy to create a new
> one.
> >
> > Thanks!
> >
> > Xingbo
> >
> >> Dongjoon Hyun  wrote on Wed, Oct 16, 2019 at 6:26 AM:
> >>
> >> Hi,
> >>
> >> It seems that we have `branch-3.0-preview` branch.
> >>
> >> https://github.com/apache/spark/commits/branch-3.0-preview
> >>
> >> Can we have `branch-3.0` instead of `branch-3.0-preview`?
> >>
> >> We can tag `v3.0.0-preview` on `branch-3.0` and continue to use for
> `v3.0.0` later.
> >>
> >> Bests,
> >> Dongjoon.
>


Re: [build system] intermittent network issues + potential power shutoff over the weekend

2019-10-25 Thread Dongjoon Hyun
Thank you for notice, Shane.

Bests,
Dongjoon.

On Fri, Oct 25, 2019 at 12:31 PM Shane Knapp  wrote:

> > 1) our department is having some issues w/their network and other
> > services.  this means that if you're at the jenkins site, you may
> > occasionally get a 503 error.  just hit refresh a couple of times and
> > it will come back.
> >
> btw this might impact some pull request builds, so if your PR or
> 'retest this please' comment doesn't seem to catch and trigger a build
> after ~15 mins, feel free (if you're whitelisted or an admin) to
> request another test with 'test this please'.
>
> shane
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Unable to resolve dependency of sbt-mima-plugin since yesterday

2019-10-22 Thread Dongjoon Hyun
Hi, All.

This is fixed in master/branch-2.4.

Bests,
Dongjoon.

On Tue, Oct 22, 2019 at 12:19 Sean Owen  wrote:

> Weird. Let's discuss at https://issues.apache.org/jira/browse/SPARK-29560
>
> On Tue, Oct 22, 2019 at 2:06 PM Xingbo Jiang 
> wrote:
> >
> > Hi,
> >
> > Do you have any idea why the `./dev/lint-scala` check are failure with
> the following message since yesterday ?
> >
> >> WARNING: An illegal reflective access operation has occurred
> >> 9WARNING: Illegal reflective access by
> org.apache.ivy.util.url.IvyAuthenticator
> (file:/home/runner/work/spark/spark/build/sbt-launch-0.13.18.jar) to field
> java.net.Authenticator.theAuthenticator
> >> 10WARNING: Please consider reporting this to the maintainers of
> org.apache.ivy.util.url.IvyAuthenticator
> >> 11WARNING: Use --illegal-access=warn to enable warnings of further
> illegal reflective access operations
> >> 12WARNING: All illegal access operations will be denied in a future
> release
> >> 13Scalastyle checks failed at following occurrences:
> >> 14[error] (*:update) sbt.ResolveException: unresolved dependency:
> com.typesafe#sbt-mima-plugin;0.3.0: not found
> >> 15##[error]Process completed with exit code 1.
> >
> >
> > I'm not able to reproduce the failure on my local environment, but seems
> all the open PRs are failing on this check.
> >
> > Thanks,
> >
> > Xingbo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-19 Thread Dongjoon Hyun
Cheng, could you elaborate on your criteria, `Hive 2.3 code paths are
proven to be stable`?
For me, it's difficult to imagine that we can reach any stable state when
we don't use it at all ourselves.

> The Hive 1.2 code paths can only be removed once the Hive 2.3 code
paths are proven to be stable.

Sean, our published POM points to and advertises the illegitimate Hive
1.2 fork as a compile dependency.
Yes, it can be overridden. So, why does Apache Spark need to publish like
that?
If someone wants to use that illegitimate Hive 1.2 fork, let them override
it. We are unable to delete those illegitimate Hive 1.2 fork artifacts;
they will be orphans.

> The published POM will be agnostic to Hadoop / Hive; well,
> it will link against a particular version but can be overridden.

-
https://mvnrepository.com/artifact/org.apache.spark/spark-hive_2.12/3.0.0-preview
   ->
https://mvnrepository.com/artifact/org.spark-project.hive/hive-exec/1.2.1.spark2
   ->
https://mvnrepository.com/artifact/org.spark-project.hive/hive-metastore/1.2.1.spark2
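
For readers who build against these published artifacts, here is a hedged sbt
(Scala) sketch of the kind of override being discussed; the coordinates come
from the links above, but the exclusion approach and the idea of supplying a
replacement Hive client are illustrative assumptions, not a recommendation
from this thread:

  // build.sbt sketch (illustrative only): drop the forked Hive 1.2 artifacts
  // that the published spark-hive POM advertises as compile dependencies.
  libraryDependencies += ("org.apache.spark" %% "spark-hive" % "3.0.0-preview" % "provided")
    .exclude("org.spark-project.hive", "hive-exec")
    .exclude("org.spark-project.hive", "hive-metastore")

  // A replacement Hive client would then be added explicitly; the exact
  // version is a deployment choice (e.g. an Apache Hive 2.3.x release).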

Bests,
Dongjoon.


On Tue, Nov 19, 2019 at 5:26 PM Hyukjin Kwon  wrote:

> > Should Hadoop 2 + Hive 2 be considered to work on JDK 11?
> This seems being investigated by Yuming's PR (
> https://github.com/apache/spark/pull/26533) if I am not mistaken.
>
> Oh, yes, what I meant by (default) was the default profiles we will use in
> Spark.
>
>
> 2019년 11월 20일 (수) 오전 10:14, Sean Owen 님이 작성:
>
>> Should Hadoop 2 + Hive 2 be considered to work on JDK 11? I wasn't
>> sure if 2.7 did, but honestly I've lost track.
>> Anyway, it doesn't matter much as the JDK doesn't cause another build
>> permutation. All are built targeting Java 8.
>>
>> I also don't know if we have to declare a binary release a default.
>> The published POM will be agnostic to Hadoop / Hive; well, it will
>> link against a particular version but can be overridden. That's what
>> you're getting at?
>>
>>
>> On Tue, Nov 19, 2019 at 7:11 PM Hyukjin Kwon  wrote:
>> >
>> > So, are we able to conclude our plans as below?
>> >
>> > 1. In Spark 3,  we release as below:
>> >   - Hadoop 3.2 + Hive 2.3 + JDK8 build that also works JDK 11
>> >   - Hadoop 2.7 + Hive 2.3 + JDK8 build that also works JDK 11
>> >   - Hadoop 2.7 + Hive 1.2.1 (fork) + JDK8 (default)
>> >
>> > 2. In Spark 3.1, we target:
>> >   - Hadoop 3.2 + Hive 2.3 + JDK8 build that also works JDK 11
>> >   - Hadoop 2.7 + Hive 2.3 + JDK8 build that also works JDK 11 (default)
>> >
>> > 3. Avoid to remove "Hadoop 2.7 + Hive 1.2.1 (fork) + JDK8 (default)"
>> combo right away after cutting branch-3 to see if Hive 2.3 is considered as
>> stable in general.
>> > I roughly suspect it would be a couple of months after Spark 3.0
>> release (?).
>> >
>> > BTW, maybe we should officially note that "Hadoop 2.7 + Hive 1.2.1
>> (fork) + JDK8 (default)" combination is deprecated anyway in Spark 3.
>> >
>>
>


Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-18 Thread Dongjoon Hyun
Hi, All.

First of all, I want to put this as a policy issue instead of a technical
issue.
Also, this is orthogonal to the `hadoop` version discussion.

The Apache Spark community kept (not maintained) the forked Apache Hive
1.2.1 because there were no other options before. As we can see in
SPARK-20202, it's not a desirable situation among Apache projects.

https://issues.apache.org/jira/browse/SPARK-20202

Also, please note that we `kept`, not `maintained`, it because we know it's
not good.
There were several attempts to update that forked repository
for several reasons (Hadoop 3 support is one example),
but those attempts were also turned down.

From Apache Spark 3.0, it seems that we have a feasible new option, the
`hive-2.3` profile. What about moving further in this direction?

For example, can we officially and completely remove the usage of the forked
`hive` in Apache Spark 3.0? If someone still needs to use the forked `hive`,
we can have a `hive-1.2` profile. Of course, it should not be the default
profile in the community.

I want to say this is a goal we should achieve someday.
If we don't do anything, nothing will happen. At least we need to prepare for
this. Without any preparation, Spark 3.1+ will be the same.

Shall we focus on what our actual problems with Hive 2.3.6 are?
If the only reason is that we haven't used it before, we can release another
`3.0.0-preview` for that.

Bests,
Dongjoon.


Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-18 Thread Dongjoon Hyun
I also agree with Steve and Felix.

Let's have another thread to discuss the Hive issue, because this thread was
originally about the `hadoop` version.

And now we can have the `hive-2.3` profile for both the `hadoop-2.7` and
`hadoop-3.0` versions. We don't need to mix the two.

Bests,
Dongjoon.


On Mon, Nov 18, 2019 at 8:19 PM Felix Cheung 
wrote:

> 1000% with Steve, the org.spark-project hive 1.2 will need a solution. It
> is old and rather buggy; and It’s been *years*
>
> I think we should decouple hive change from everything else if people are
> concerned?
>
> --
> *From:* Steve Loughran 
> *Sent:* Sunday, November 17, 2019 9:22:09 AM
> *To:* Cheng Lian 
> *Cc:* Sean Owen ; Wenchen Fan ;
> Dongjoon Hyun ; dev ;
> Yuming Wang 
> *Subject:* Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?
>
> Can I take this moment to remind everyone that the version of hive which
> spark has historically bundled (the org.spark-project one) is an orphan
> project put together to deal with Hive's shading issues and a source of
> unhappiness in the Hive project. What ever get shipped should do its best
> to avoid including that file.
>
> Postponing a switch to hadoop 3.x after spark 3.0 is probably the safest
> move from a risk minimisation perspective. If something has broken then at
> least you can start with the assumption that it is in the o.a.s packages
> without having to debug o.a.hadoop and o.a.hive first. There is a cost: if
> there are problems with the hadoop / hive dependencies those teams will
> inevitably ignore filed bug reports for the same reason spark team will
> probably because 1.6-related JIRAs as WONTFIX. WONTFIX responses for the
> Hadoop 2.x line include any compatibility issues with Java 9+. Do bear that
> in mind. It's not been tested, it has dependencies on artifacts we know are
> incompatible, and as far as the Hadoop project is concerned: people should
> move to branch 3 if they want to run on a modern version of Java
>
> It would be really really good if the published spark maven artefacts (a)
> included the spark-hadoop-cloud JAR and (b) were dependent upon hadoop 3.x.
> That way people doing things with their own projects will get up-to-date
> dependencies and don't get WONTFIX responses themselves.
>
> -Steve
>
> PS: Discussion on hadoop-dev @ making Hadoop 2.10 the official "last ever"
> branch-2 release and then declare its predecessors EOL; 2.10 will be the
> transition release.
>
> On Sun, Nov 17, 2019 at 1:50 AM Cheng Lian  wrote:
>
> Dongjoon, I didn't follow the original Hive 2.3 discussion closely. I
> thought the original proposal was to replace Hive 1.2 with Hive 2.3, which
> seemed risky, and therefore we only introduced Hive 2.3 under the
> hadoop-3.2 profile without removing Hive 1.2. But maybe I'm totally wrong
> here...
>
> Sean, Yuming's PR https://github.com/apache/spark/pull/26533 showed that
> Hadoop 2 + Hive 2 + JDK 11 looks promising. My major motivation is not
> about demand, but risk control: coupling Hive 2.3, Hadoop 3.2, and JDK 11
> upgrade together looks too risky.
>
> On Sat, Nov 16, 2019 at 4:03 AM Sean Owen  wrote:
>
> I'd prefer simply not making Hadoop 3 the default until 3.1+, rather
> than introduce yet another build combination. Does Hadoop 2 + Hive 2
> work and is there demand for it?
>
> On Sat, Nov 16, 2019 at 3:52 AM Wenchen Fan  wrote:
> >
> > Do we have a limitation on the number of pre-built distributions? Seems
> this time we need
> > 1. hadoop 2.7 + hive 1.2
> > 2. hadoop 2.7 + hive 2.3
> > 3. hadoop 3 + hive 2.3
> >
> > AFAIK we always built with JDK 8 (but make it JDK 11 compatible), so
> don't need to add JDK version to the combination.
> >
> > On Sat, Nov 16, 2019 at 4:05 PM Dongjoon Hyun 
> wrote:
> >>
> >> Thank you for suggestion.
> >>
> >> Having `hive-2.3` profile sounds good to me because it's orthogonal to
> Hadoop 3.
> >> IIRC, originally, it was proposed in that way, but we put it under
> `hadoop-3.2` to avoid adding new profiles at that time.
> >>
> >> And, I'm wondering if you are considering additional pre-built
> distribution and Jenkins jobs.
> >>
> >> Bests,
> >> Dongjoon.
> >>
>
>


Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

2019-11-20 Thread Dongjoon Hyun
Thanks. That will be a giant step forward, Sean!

> I'd prefer making it the default in the POM for 3.0.

Bests,
Dongjoon.

On Wed, Nov 20, 2019 at 11:02 AM Sean Owen  wrote:

> Yeah 'stable' is ambiguous. It's old and buggy, but at least it's the
> same old and buggy that's been there a while. "stable" in that sense
> I'm sure there is a lot more delta between Hive 1 and 2 in terms of
> bug fixes that are important; the question isn't just 1.x releases.
>
> What I don't know is how much affects Spark, as it's a Hive client
> mostly. Clearly some do.
>
> I'd prefer making it the default in the POM for 3.0. Mostly on the
> grounds that its effects are on deployed clusters, not apps. And
> deployers can still choose a binary distro with 1.x or make the choice
> they want. Those that don't care should probably be nudged to 2.x.
> Spark 3.x is already full of behavior changes and 'unstable', so I
> think this is minor relative to the overall risk question.
>
> On Wed, Nov 20, 2019 at 12:53 PM Dongjoon Hyun 
> wrote:
> >
> > Hi, All.
> >
> > I'm sending this email because it's important to discuss this topic
> narrowly
> > and make a clear conclusion.
> >
> > `The forked Hive 1.2.1 is stable`? It sounds like a myth we created
> > by ignoring the existing bugs. If you want to say the forked Hive 1.2.1
> is
> > stabler than XXX, please give us the evidence. Then, we can fix it.
> > Otherwise, let's stop making `The forked Hive 1.2.1` invincible.
> >
> > Historically, the following forked Hive 1.2.1 has never been stable.
> > It's just frozen. Since the forked Hive is out of our control, we
> ignored bugs.
> > That's all. The reality is a way far from the stable status.
> >
> > https://mvnrepository.com/artifact/org.spark-project.hive/
> >
> https://mvnrepository.com/artifact/org.spark-project.hive/hive-exec/1.2.1.spark
> (2015 August)
> >
> https://mvnrepository.com/artifact/org.spark-project.hive/hive-exec/1.2.1.spark2
> (2016 April)
> >
> > First, let's begin Hive itself by comparing with Apache Hive 1.2.2 and
> 1.2.3,
> >
> > Apache Hive 1.2.2 has 50 bug fixes.
> > Apache Hive 1.2.3 has 9 bug fixes.
> >
> > I will not cover all of them, but Apache Hive community also backports
> > important patches like Apache Spark community.
> >
> > Second, let's move to SPARK issues because we aren't exposed to all Hive
> issues.
> >
> > SPARK-19109 ORC metadata section can sometimes exceed protobuf
> message size limit
> > SPARK-22267 Spark SQL incorrectly reads ORC file when column order
> is different
> >
> > These were reported since Apache Spark 1.6.x because the forked Hive
> doesn't have
> > a proper upstream patch like HIVE-11592 (fixed at Apache Hive 1.3.0).
> >
> > Since we couldn't update the frozen forked Hive, we added Apache ORC
> dependency
> > at SPARK-20682 (2.3.0), added a switching configuration at SPARK-20728
> (2.3.0),
> > tured on `spark.sql.hive.convertMetastoreOrc by default` at SPARK-22279
> (2.4.0).
> > However, if you turn off the switch and start to use the forked hive,
> > you will be exposed to the buggy forked Hive 1.2.1 again.
> >
> > Third, let's talk about the new features like Hadoop 3 and JDK11.
> > No one believe that the ancient forked Hive 1.2.1 will work with this.
> > I saw that the following issue is mentioned as an evidence of Hive 2.3.6
> bug.
> >
> > SPARK-29245 ClassCastException during creating HiveMetaStoreClient
> >
> > Yes. I know that issue because I reported it and verified HIVE-21508.
> > It's fixed already and will be released ad Apache Hive 2.3.7.
> >
> > Can we imagine something like this in the forked Hive 1.2.1?
> > 'No'. There is no future on it. It's frozen.
> >
> > From now, I want to claim that the forked Hive 1.2.1 is the unstable one.
> > I welcome all your positive and negative opinions.
> > Please share your concerns and problems and fix them together.
> > Apache Spark is an open source project we shared.
> >
> > Bests,
> > Dongjoon.
> >
>


The Myth: the forked Hive 1.2.1 is stabler than XXX

2019-11-20 Thread Dongjoon Hyun
Hi, All.

I'm sending this email because it's important to discuss this topic narrowly
and make a clear conclusion.

`The forked Hive 1.2.1 is stable`? It sounds like a myth we created
by ignoring the existing bugs. If you want to say the forked Hive 1.2.1 is
stabler than XXX, please give us the evidence. Then, we can fix it.
Otherwise, let's stop making `The forked Hive 1.2.1` invincible.

Historically, the following forked Hive 1.2.1 has never been stable.
It's just frozen. Since the forked Hive is out of our control, we ignored
bugs.
That's all. The reality is far from stable.

https://mvnrepository.com/artifact/org.spark-project.hive/

https://mvnrepository.com/artifact/org.spark-project.hive/hive-exec/1.2.1.spark
(2015 August)

https://mvnrepository.com/artifact/org.spark-project.hive/hive-exec/1.2.1.spark2
(2016 April)

First, let's look at Hive itself by comparing it with Apache Hive 1.2.2 and
1.2.3.

Apache Hive 1.2.2 has 50 bug fixes.
Apache Hive 1.2.3 has 9 bug fixes.

I will not cover all of them, but the Apache Hive community also backports
important patches, just like the Apache Spark community does.

Second, let's move to SPARK issues because we aren't exposed to all Hive
issues.

SPARK-19109 ORC metadata section can sometimes exceed protobuf message
size limit
SPARK-22267 Spark SQL incorrectly reads ORC file when column order is
different

These were reported since Apache Spark 1.6.x because the forked Hive
doesn't have
a proper upstream patch like HIVE-11592 (fixed at Apache Hive 1.3.0).

Since we couldn't update the frozen forked Hive, we added Apache ORC
dependency
at SPARK-20682 (2.3.0), added a switching configuration at SPARK-20728
(2.3.0),
turned on `spark.sql.hive.convertMetastoreOrc` by default at SPARK-22279
(2.4.0).
However, if you turn off the switch and start to use the forked Hive,
you will be exposed to the buggy forked Hive 1.2.1 again.
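
A minimal sketch of that switch (an illustration, not taken from the thread;
the input path is made up):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("orc-reader-switch")
    // "native" uses the Apache ORC reader added in SPARK-20682;
    // "hive" falls back to the ORC reader in the built-in Hive.
    .config("spark.sql.orc.impl", "native")
    // Read Hive metastore ORC tables through Spark's data source path
    // (the default since 2.4.0, per SPARK-22279).
    .config("spark.sql.hive.convertMetastoreOrc", "true")
    .enableHiveSupport()
    .getOrCreate()

  spark.read.orc("/path/to/data.orc").show()  // illustrative path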

Third, let's talk about the new features like Hadoop 3 and JDK11.
No one believes that the ancient forked Hive 1.2.1 will work with these.
I saw that the following issue was mentioned as evidence of a Hive 2.3.6
bug.

SPARK-29245 ClassCastException during creating HiveMetaStoreClient

Yes. I know that issue because I reported it and verified HIVE-21508.
It's already fixed and will be released as Apache Hive 2.3.7.

Can we imagine something like this happening in the forked Hive 1.2.1?
'No'. There is no future for it. It's frozen.

From now on, I want to claim that the forked Hive 1.2.1 is the unstable one.
I welcome all your positive and negative opinions.
Please share your concerns and problems so we can fix them together.
Apache Spark is an open source project we share.

Bests,
Dongjoon.


Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-20 Thread Dongjoon Hyun
Yes. Right. That's the situation we are hitting and the result I expected.
We need to change our default to Hive 2 in the POM.

Dongjoon.


On Wed, Nov 20, 2019 at 5:20 AM Sean Owen  wrote:

> Yes, good point. A user would get whatever the POM says without
> profiles enabled so it matters.
>
> Playing it out, an app _should_ compile with the Spark dependency
> marked 'provided'. In that case the app that is spark-submit-ted is
> agnostic to the Hive dependency as the only one that matters is what's
> on the cluster. Right? we don't leak through the Hive API in the Spark
> API. And yes it's then up to the cluster to provide whatever version
> it wants. Vendors will have made a specific version choice when
> building their distro one way or the other.
>
> If you run a Spark cluster yourself, you're using the binary distro,
> and we're already talking about also publishing a binary distro with
> this variation, so that's not the issue.
>
> The corner cases where it might matter are:
>
> - I unintentionally package Spark in the app and by default pull in
> Hive 2 when I will deploy against Hive 1. But that's user error, and
> causes other problems
> - I run tests locally in my project, which will pull in a default
> version of Hive defined by the POM
>
> Double-checking, is that right? if so it kind of implies it doesn't
> matter. Which is an argument either way about what's the default. I
> too would then prefer defaulting to Hive 2 in the POM. Am I missing
> something about the implication?
>
> (That fork will stay published forever anyway, that's not an issue per se.)
>
> On Wed, Nov 20, 2019 at 1:40 AM Dongjoon Hyun 
> wrote:
> > Sean, our published POM is pointing to and advertising the illegitimate
> > Hive 1.2 fork as a compile dependency.
> > Yes. It can be overridden. So, why does Apache Spark need to publish
> > like that?
> > If someone wants to use that illegitimate Hive 1.2 fork, let them
> > override it. We are unable to delete that illegitimate Hive 1.2 fork.
> > Those artifacts will be orphans.
> >
>
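
A sketch of the "provided" packaging Sean describes above, in sbt form
(artifact names and versions are illustrative, not taken from the thread):

  // build.sbt: mark Spark as "provided" so the cluster's Spark/Hive jars
  // are the ones used at runtime by spark-submit.
  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-sql"  % "3.0.0" % "provided",
    "org.apache.spark" %% "spark-hive" % "3.0.0" % "provided"
  )

With this, the Hive client version that matters is the one shipped with the
cluster's Spark distribution, not the one advertised by the published POM.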


Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-19 Thread Dongjoon Hyun
Thank you for the feedback, Hyukjin and Sean.

I proposed `preview-2` for that purpose, but I'm also +1 for doing that at 3.1
if we can make a decision to eliminate the illegitimate Hive fork reference
immediately after the `branch-3.0` cut.

Sean, I'm referencing Cheng Lian's email for the status of `hadoop-2.7`.

-
https://lists.apache.org/thread.html/623dd9a6d4e951daeec985feffede12c7b419e03c2965018de7a72f1@%3Cdev.spark.apache.org%3E

The way I see this, it's not a user problem. The Apache Spark community
hasn't tried to drop the illegitimate Hive fork yet.
We need to drop it ourselves because we created it and it's our bad.

Bests,
Dongjoon.



On Tue, Nov 19, 2019 at 5:06 AM Sean Owen  wrote:

> Just to clarify, as even I have lost the details over time: hadoop-2.7
> works with hive-2.3? it isn't tied to hadoop-3.2?
> Roughly how much risk is there in using the Hive 1.x fork over Hive
> 2.x, for end users using Hive via Spark?
> I don't have a strong opinion, other than sharing the view that we
> have to dump the Hive 1.x fork at the first opportunity.
> Question is simply how much risk that entails. Keeping in mind that
> Spark 3.0 is already something that people understand works
> differently. We can accept some behavior changes.
>
> On Mon, Nov 18, 2019 at 11:11 PM Dongjoon Hyun 
> wrote:
> >
> > Hi, All.
> >
> > First of all, I want to put this as a policy issue instead of a
> technical issue.
> > Also, this is orthogonal from `hadoop` version discussion.
> >
> > Apache Spark community kept (not maintained) the forked Apache Hive
> > 1.2.1 because there were no other options before. As we see at
> > SPARK-20202, it's not a desirable situation among the Apache projects.
> >
> > https://issues.apache.org/jira/browse/SPARK-20202
> >
> > Also, please note that we `kept`, not `maintained`, because we know it's
> not good.
> > There have been several attempts to update that forked repository
> > for several reasons (Hadoop 3 support is one example),
> > but those attempts were also turned down.
> >
> > From Apache Spark 3.0, it seems that we have a new feasible option
> > `hive-2.3` profile. What about moving forward in this direction further?
> >
> > For example, can we remove the usage of forked `hive` in Apache Spark 3.0
> > completely officially? If someone still needs to use the forked `hive`,
> we can
> > have a profile `hive-1.2`. Of course, it should not be a default profile
> in the community.
> >
> > I want to say this is a goal we should achieve someday.
> > If we don't do anything, nothing happens. At least we need to prepare for
> > this.
> > Without any preparation, Spark 3.1+ will be the same.
> >
> > Shall we focus on what our problems with Hive 2.3.6 are?
> > If the only reason is that we didn't use it before, we can release
> another
> > `3.0.0-preview` for that.
> >
> > Bests,
> > Dongjoon.
>


Re: Status of Scala 2.13 support

2019-12-02 Thread Dongjoon Hyun
Thank you for sharing the status, Sean.

Given the current circumstances, our status and approach sound realistic to
me.

+1 for continuing after cutting `branch-3.0`.

Bests,
Dongjoon.


On Sun, Dec 1, 2019 at 10:50 AM Sean Owen  wrote:

> As you can see, I've been working on Scala 2.13 support. The umbrella
> is https://issues.apache.org/jira/browse/SPARK-25075 I wanted to lay
> out status and strategy.
>
> This will not be done for 3.0. At the least, there are a few key
> dependencies (Chill, Kafka) that aren't published for 2.13, and at
> least one change that will need removing an API deprecated as of 3.0.
> Realistically: maybe Spark 3.1. I don't yet think it's pressing.
>
>
> Making the change is difficult as it's hard to understand the extent
> of the necessary changes until the whole thing minimally compiles for
> 2.13. I have gotten essentially that far in a local clone. The good
> news is I don't see any obvious hard blockers, but the changes add up
> to thousands of lines in 200+ files.
>
>
> What do we need to do for 3.0? Any changes that entail breaking a
> public API, ideally. The biggest issue there is that extensive
> changes to the Scala collection hierarchy mean that many
> public APIs that return a Seq, Map, TraversableOnce, etc. _will_
> actually change types in 2.13 (become immutable). See:
> https://issues.apache.org/jira/browse/SPARK-27683 and
> https://issues.apache.org/jira/browse/SPARK-29292 as the main
> examples.
>
> In both cases, keeping the exact same public type would require much
> bigger changes. These are the type of changes that all applications
> face when migrating to 2.13 though. 2.12 and 2.13 apps were never
> meant to be binary-compatible. So, in both cases we're not changing
> these, to avoid a lot of change and parallel source trees.
>
> I _think_ we're done with any other must-do changes for 3.0, therefore.
>
>
> What _can_ we do for 3.0? small changes that don't affect the 2.12
> build are OK, and that's what you see in pull requests going in at the
> moment. The big question is whether we want to do the large change for
> https://issues.apache.org/jira/browse/SPARK-29292 before 3.0. It will
> mean adding a ton of ".toSeq" and ".toMap" calls to make mutable
> collections immutable when passed to methods. In theory, it won't
> affect behavior. We'll have to see if it does in practice.
>
> The rest will have to wait until after 3.0, I believe, including even
> testing the 2.13 build, which will probably turn up some more issues.
>
>
> Thoughts on approach?
>
>
>
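
A tiny standalone illustration of the Seq change discussed above (a sketch of
the 2.12 vs 2.13 difference, not Spark code):

  import scala.collection.mutable.ArrayBuffer

  object SeqChange {
    // On 2.13, Seq is an alias for scala.collection.immutable.Seq, so a
    // mutable buffer no longer conforms and needs an explicit conversion.
    def describe(xs: Seq[Int]): String = xs.mkString(",")

    def main(args: Array[String]): Unit = {
      val buf = ArrayBuffer(1, 2, 3)
      // describe(buf)              // compiles on 2.12, fails to compile on 2.13
      println(describe(buf.toSeq))  // works on both
    }
  }

This is the kind of place where the ".toSeq"/".toMap" calls mentioned for
SPARK-29292 come in.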


Re: [DISCUSS] PostgreSQL dialect

2019-11-27 Thread Dongjoon Hyun
+1

Bests,
Dongjoon.

On Tue, Nov 26, 2019 at 3:52 PM Takeshi Yamamuro 
wrote:

> Yea, +1, that looks pretty reasonable to me.
> > Here I'm proposing to hold off the PostgreSQL dialect. Let's remove it
> from the codebase before it's too late. Currently we only have 3 features
> under PostgreSQL dialect:
> I personally think we could at least stop work on the dialect until 3.0 is
> released.
>
>
> On Wed, Nov 27, 2019 at 2:41 AM Gengliang Wang <
> gengliang.w...@databricks.com> wrote:
>
>> +1 with the practical proposal.
>> To me, the major concern is that the code base becomes complicated, while
>> the PostgreSQL dialect has very limited features. I tried introducing one
>> big flag `spark.sql.dialect` and isolating related code in #25697,
>> but it seems hard to keep it clean.
>> Furthermore, the PostgreSQL dialect configuration overlaps with the ANSI
>> mode, which can be confusing sometimes.
>>
>> Gengliang
>>
>> On Tue, Nov 26, 2019 at 8:57 AM Xiao Li  wrote:
>>
>>> +1
>>>
>>>
 One particular negative effect has been that new postgresql tests add
 well over an hour to tests,
>>>
>>>
>>> Adding postgresql tests is for improving the test coverage of Spark SQL.
>>> We should continue to do this by importing more test cases. The quality of
>>> Spark highly depends on the test coverage. We can further parallelize the
>>> test execution to reduce the test time.
>>>
>>> Migrating PostgreSQL workloads to Spark SQL
>>>
>>>
>>> This should not be our current focus. In the near future, it is
>>> impossible to be fully compatible with PostgreSQL. We should focus on
>>> adding features that are useful to Spark community. PostgreSQL is a good
>>> reference, but we do not need to blindly follow it. We already closed
>>> multiple related JIRAs that try to add some PostgreSQL features that are
>>> not commonly used.
>>>
>>> Cheers,
>>>
>>> Xiao
>>>
>>>
>>> On Tue, Nov 26, 2019 at 8:30 AM Maciej Szymkiewicz <
>>> mszymkiew...@gmail.com> wrote:
>>>
 I think it is important to distinguish between two different concepts:

- Adherence to standards and their well established implementations.
- Enabling migrations from some product X to Spark.

 While these two problems are related, they are independent and one can
 be achieved without the other.

- The former approach doesn't imply that all features of the SQL
standard (or its specific implementation) are provided. It is sufficient
that the commonly used features that are implemented are standard
compliant. Therefore, if an end user applies some well-known pattern,
things will work as expected.

In my personal opinion that's something that is worth the required
development resources, and in general should happen within the project.


- The latter one is more complicated. First of all, the premise that
one can "migrate PostgreSQL workloads to Spark" seems to be flawed.
While both Spark and PostgreSQL evolve, and probably have more in common
today than a few years ago, they're not even close enough to pretend that
one can be a replacement for the other. In contrast, existing compatibility
layers between major vendors make sense, because feature disparity (at
least when it comes to core functionality) is usually minimal. And that
doesn't even touch the problem that PostgreSQL provides extensively used
extension points that enable a broad and evolving ecosystem (what should
we do about continuous queries? Should Structured Streaming provide some
compatibility layer as well?).

More realistically, Spark could provide a compatibility layer with some
analytical tools that themselves provide some PostgreSQL compatibility,
but these are not always fully compatible with upstream PostgreSQL, nor
do they necessarily follow the latest PostgreSQL development.

Furthermore, a compatibility layer can be, within certain limits (i.e.
availability of required primitives), maintained as a separate project,
without putting more strain on existing resources. Effectively, what we
care about here is whether we can translate a certain SQL string into a
logical or physical plan.


 On 11/26/19 3:26 PM, Wenchen Fan wrote:

 Hi all,

 Recently we start an effort to achieve feature parity between Spark and
 PostgreSQL: https://issues.apache.org/jira/browse/SPARK-27764

 This goes very well. We've added many missing features(parser rules,
 built-in functions, etc.) to Spark, and also corrected several
 inappropriate behaviors of Spark to follow SQL standard and PostgreSQL.
 Many thanks to all the people that contribute to it!

 There are several cases when adding a PostgreSQL feature:
 1. Spark doesn't have this 

Re: [VOTE] SPARK 3.0.0-preview (RC2)

2019-11-01 Thread Dongjoon Hyun
+1 for Apache Spark 3.0.0-preview (RC2).

Bests,
Dongjoon.

On Thu, Oct 31, 2019 at 11:36 PM Wenchen Fan  wrote:

> The PR builder uses Hadoop 2.7 profile, which makes me think that 2.7 is
> more stable and we should make releases using 2.7 by default.
>
> +1
>
> On Fri, Nov 1, 2019 at 7:16 AM Xiao Li  wrote:
>
>> Spark 3.0 will still use the Hadoop 2.7 profile by default, I think.
>> Hadoop 2.7 profile is much more stable than Hadoop 3.2 profile.
>>
>> On Thu, Oct 31, 2019 at 3:54 PM Sean Owen  wrote:
>>
>>> This isn't a big thing, but I see that the pyspark build includes
>>> Hadoop 2.7 rather than 3.2. Maybe later we change the build to put in
>>> 3.2 by default.
>>>
>>> Otherwise, the tests all seems to pass with JDK 8 / 11 with all
>>> profiles enabled, so I'm +1 on it.
>>>
>>>
>>> On Thu, Oct 31, 2019 at 1:00 AM Xingbo Jiang 
>>> wrote:
>>> >
>>> > Please vote on releasing the following candidate as Apache Spark
>>> version 3.0.0-preview.
>>> >
>>> > The vote is open until November 3 PST and passes if a majority +1 PMC
>>> votes are cast, with
>>> > a minimum of 3 +1 votes.
>>> >
>>> > [ ] +1 Release this package as Apache Spark 3.0.0-preview
>>> > [ ] -1 Do not release this package because ...
>>> >
>>> > To learn more about Apache Spark, please see http://spark.apache.org/
>>> >
>>> > The tag to be voted on is v3.0.0-preview-rc2 (commit
>>> 007c873ae34f58651481ccba30e8e2ba38a692c4):
>>> > https://github.com/apache/spark/tree/v3.0.0-preview-rc2
>>> >
>>> > The release files, including signatures, digests, etc. can be found at:
>>> > https://dist.apache.org/repos/dist/dev/spark/v3.0.0-preview-rc2-bin/
>>> >
>>> > Signatures used for Spark RCs can be found in this file:
>>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>>> >
>>> > The staging repository for this release can be found at:
>>> >
>>> https://repository.apache.org/content/repositories/orgapachespark-1336/
>>> >
>>> > The documentation corresponding to this release can be found at:
>>> > https://dist.apache.org/repos/dist/dev/spark/v3.0.0-preview-rc2-docs/
>>> >
>>> > The list of bug fixes going into 3.0.0 can be found at the following
>>> URL:
>>> > https://issues.apache.org/jira/projects/SPARK/versions/12339177
>>> >
>>> > FAQ
>>> >
>>> > =
>>> > How can I help test this release?
>>> > =
>>> >
>>> > If you are a Spark user, you can help us test this release by taking
>>> > an existing Spark workload and running on this release candidate, then
>>> > reporting any regressions.
>>> >
>>> > If you're working in PySpark you can set up a virtual env and install
>>> > the current RC and see if anything important breaks, in the Java/Scala
>>> > you can add the staging repository to your projects resolvers and test
>>> > with the RC (make sure to clean up the artifact cache before/after so
>>> > you don't end up building with an out of date RC going forward).
>>> >
>>> > ===
>>> > What should happen to JIRA tickets still targeting 3.0.0?
>>> > ===
>>> >
>>> > The current list of open tickets targeted at 3.0.0 can be found at:
>>> > https://issues.apache.org/jira/projects/SPARK and search for "Target
>>> Version/s" = 3.0.0
>>> >
>>> > Committers should look at those and triage. Extremely important bug
>>> > fixes, documentation, and API tweaks that impact compatibility should
>>> > be worked on immediately.
>>> >
>>> > ==
>>> > But my bug isn't fixed?
>>> > ==
>>> >
>>> > In order to make timely releases, we will typically not hold the
>>> > release unless the bug in question is a regression from the previous
>>> > release. That being said, if there is something which is a regression
>>> > that has not been correctly targeted please ping me or a committer to
>>> > help target the issue.
>>>
>>>
>>>
>>
>> --
>>
>


Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-19 Thread Dongjoon Hyun
Hi, Cheng.

This is irrelevant to JDK11 and Hadoop 3. I'm talking about JDK8 world.
If we consider them, it could be the followings.

+------------+-----------------+--------------------+
|            | Hive 1.2.1 fork |  Apache Hive 2.3.6 |
+------------+-----------------+--------------------+
| Legitimate |        X        |          O         |
| JDK11      |        X        |          O         |
| Hadoop3    |        X        |          O         |
| Hadoop2    |        O        |          O         |
| Functions  |     Baseline    |         More       |
| Bug fixes  |     Baseline    |         More       |
+------------+-----------------+--------------------+

To stabilize Spark's Hive 2.3 usage, we should use it by ourselves
(including Jenkins/GitHubAction/AppVeyor).

For me, AS-IS 3.0 is not enough for that. Following your advice,
to give more visibility to the whole community,

1. We need to give additional `hadoop-2.7 with Hive 2.3` pre-built
distribution
2. We need to switch our default Hive usage to 2.3 in `master` for 3.1
after `branch-3.0` branch cut.

I know that we have been reluctant to do (1) and (2) due to their burden.
But it's time to prepare. Without them, we are going to fall short
again and again.

Bests,
Dongjoon.




On Tue, Nov 19, 2019 at 9:26 AM Cheng Lian  wrote:

> Dongjoon, I'm with Hyukjin. There should be at least one Spark 3.x minor
> release to stabilize Hive 2.3 code paths before retiring the Hive 1.2
> fork. Even today, the Hive 2.3.6 version bundled in Spark 3.0 is still
> buggy in terms of JDK 11 support. (BTW, I just found that our root POM is
> referring to both Hive 2.3.6 and 2.3.5 at the moment, see here
> <https://github.com/apache/spark/blob/6fb8b8606544f26dc2d9719a2d009eb5aea65ba2/pom.xml#L135>
> and here
> <https://github.com/apache/spark/blob/6fb8b8606544f26dc2d9719a2d009eb5aea65ba2/pom.xml#L2927>
> .)
>
> Again, I'm happy to get rid of ancient legacy dependencies like Hadoop 2.7
> and the Hive 1.2 fork, but I do believe that we need a safety net for Spark
> 3.0. For preview releases, I'm afraid that their visibility is not good
> enough for covering such major upgrades.
>
> On Tue, Nov 19, 2019 at 8:39 AM Dongjoon Hyun 
> wrote:
>
>> Thank you for the feedback, Hyukjin and Sean.
>>
>> I proposed `preview-2` for that purpose, but I'm also +1 for doing that at 3.1
>> if we can make a decision to eliminate the illegitimate Hive fork
>> reference
>> immediately after the `branch-3.0` cut.
>>
>> Sean, I'm referencing Cheng Lian's email for the status of `hadoop-2.7`.
>>
>> -
>> https://lists.apache.org/thread.html/623dd9a6d4e951daeec985feffede12c7b419e03c2965018de7a72f1@%3Cdev.spark.apache.org%3E
>>
>> The way I see this, it's not a user problem. The Apache Spark
>> community hasn't tried to drop the illegitimate Hive fork yet.
>> We need to drop it ourselves because we created it and it's our bad.
>>
>> Bests,
>> Dongjoon.
>>
>>
>>
>> On Tue, Nov 19, 2019 at 5:06 AM Sean Owen  wrote:
>>
>>> Just to clarify, as even I have lost the details over time: hadoop-2.7
>>> works with hive-2.3? it isn't tied to hadoop-3.2?
>>> Roughly how much risk is there in using the Hive 1.x fork over Hive
>>> 2.x, for end users using Hive via Spark?
>>> I don't have a strong opinion, other than sharing the view that we
>>> have to dump the Hive 1.x fork at the first opportunity.
>>> Question is simply how much risk that entails. Keeping in mind that
>>> Spark 3.0 is already something that people understand works
>>> differently. We can accept some behavior changes.
>>>
>>> On Mon, Nov 18, 2019 at 11:11 PM Dongjoon Hyun 
>>> wrote:
>>> >
>>> > Hi, All.
>>> >
>>> > First of all, I want to put this as a policy issue instead of a
>>> technical issue.
>>> > Also, this is orthogonal from `hadoop` version discussion.
>>> >
>>> > Apache Spark community kept (not maintained) the forked Apache Hive
>>> > 1.2.1 because there were no other options before. As we see at
>>> > SPARK-20202, it's not a desirable situation among the Apache projects.
>>> >
>>> > https://issues.apache.org/jira/browse/SPARK-20202
>>> >
>>> > Also, please note that we `kept`, not `maintained`, because we know
>>> it's not good.
>>> > There have been several attempts to update that forked repository
>>> > for several reasons (Hadoop 3 support is one example),
>>> > but those attempts were also turned down.
>>> >
>>> > From Apache Spark 3.0, it seems that we have a new feasible option
>>&g

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-19 Thread Dongjoon Hyun
BTW, `hive.version.short` is a directory name. We are using 2.3.6 only.

For the directory names, we use '1.2.1' and '2.3.5' because we just delayed
renaming the directories until the 3.0.0 deadline to minimize the diff.

We can replace it immediately if we want right now.



On Tue, Nov 19, 2019 at 11:11 AM Dongjoon Hyun 
wrote:

> Hi, Cheng.
>
> This is irrelevant to JDK11 and Hadoop 3. I'm talking about JDK8 world.
> If we consider them, it could be the followings.
>
> +------------+-----------------+--------------------+
> |            | Hive 1.2.1 fork |  Apache Hive 2.3.6 |
> +------------+-----------------+--------------------+
> | Legitimate |        X        |          O         |
> | JDK11      |        X        |          O         |
> | Hadoop3    |        X        |          O         |
> | Hadoop2    |        O        |          O         |
> | Functions  |     Baseline    |         More       |
> | Bug fixes  |     Baseline    |         More       |
> +------------+-----------------+--------------------+
>
> To stabilize Spark's Hive 2.3 usage, we should use it by ourselves
> (including Jenkins/GitHubAction/AppVeyor).
>
> For me, AS-IS 3.0 is not enough for that. Following your advice,
> to give more visibility to the whole community,
>
> 1. We need to give additional `hadoop-2.7 with Hive 2.3` pre-built
> distribution
> 2. We need to switch our default Hive usage to 2.3 in `master` for 3.1
> after `branch-3.0` branch cut.
>
> I know that we have been reluctant to do (1) and (2) due to their burden.
> But it's time to prepare. Without them, we are going to fall short
> again and again.
>
> Bests,
> Dongjoon.
>
>
>
>
> On Tue, Nov 19, 2019 at 9:26 AM Cheng Lian  wrote:
>
>> Dongjoon, I'm with Hyukjin. There should be at least one Spark 3.x minor
>> release to stabilize Hive 2.3 code paths before retiring the Hive 1.2
>> fork. Even today, the Hive 2.3.6 version bundled in Spark 3.0 is still
>> buggy in terms of JDK 11 support. (BTW, I just found that our root POM is
>> referring to both Hive 2.3.6 and 2.3.5 at the moment, see here
>> <https://github.com/apache/spark/blob/6fb8b8606544f26dc2d9719a2d009eb5aea65ba2/pom.xml#L135>
>> and here
>> <https://github.com/apache/spark/blob/6fb8b8606544f26dc2d9719a2d009eb5aea65ba2/pom.xml#L2927>
>> .)
>>
>> Again, I'm happy to get rid of ancient legacy dependencies like Hadoop
>> 2.7 and the Hive 1.2 fork, but I do believe that we need a safety net for
>> Spark 3.0. For preview releases, I'm afraid that their visibility is not
>> good enough for covering such major upgrades.
>>
>> On Tue, Nov 19, 2019 at 8:39 AM Dongjoon Hyun 
>> wrote:
>>
>>> Thank you for the feedback, Hyukjin and Sean.
>>>
>>> I proposed `preview-2` for that purpose, but I'm also +1 for doing that at
>>> 3.1
>>> if we can make a decision to eliminate the illegitimate Hive fork
>>> reference
>>> immediately after the `branch-3.0` cut.
>>>
>>> Sean, I'm referencing Cheng Lian's email for the status of `hadoop-2.7`.
>>>
>>> -
>>> https://lists.apache.org/thread.html/623dd9a6d4e951daeec985feffede12c7b419e03c2965018de7a72f1@%3Cdev.spark.apache.org%3E
>>>
>>> The way I see this, it's not a user problem. The Apache Spark
>>> community hasn't tried to drop the illegitimate Hive fork yet.
>>> We need to drop it ourselves because we created it and it's our bad.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>>
>>> On Tue, Nov 19, 2019 at 5:06 AM Sean Owen  wrote:
>>>
>>>> Just to clarify, as even I have lost the details over time: hadoop-2.7
>>>> works with hive-2.3? it isn't tied to hadoop-3.2?
>>>> Roughly how much risk is there in using the Hive 1.x fork over Hive
>>>> 2.x, for end users using Hive via Spark?
>>>> I don't have a strong opinion, other than sharing the view that we
>>>> have to dump the Hive 1.x fork at the first opportunity.
>>>> Question is simply how much risk that entails. Keeping in mind that
>>>> Spark 3.0 is already something that people understand works
>>>> differently. We can accept some behavior changes.
>>>>
>>>> On Mon, Nov 18, 2019 at 11:11 PM Dongjoon Hyun 
>>>> wrote:
>>>> >
>>>> > Hi, All.
>>>> >
>>>> > First of all, I want to put this as a policy issue instead of a
>>>> technical issue.
>>>> > Also, this is orthogonal from `hadoop` version discussion.
>>>> >
>>>> > Apache Spark community 

Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

2019-11-20 Thread Dongjoon Hyun
Nice. That's a progress.

Let's narrow down to the path. We need to clarify what is the criteria we
can agree.

1. What does `battle-tested for years` mean exactly?
How and when can we start the `battle-tested` stage for Hive 2.3?

2. What is the new "Hive integration in Spark"?
    While introducing Hive 2.3, we fixed the compatibility stuff as you
said.
    Most of the code is shared between Hive 1.2 and Hive 2.3.
That means if there is a bug inside this shared code, both of them will
be affected.
Of course, we can fix this because it's Spark code. We will learn and
fix it as you said.

>  Yes, there are issues, but people have learned how to get along with
these issues.

    The only non-shared code is the following.
    Do you have a concern about the following directories?
    If there are no bugs in the following codebase, can we switch?

$ find . -name v2.3.5
./sql/core/v2.3.5
./sql/hive-thriftserver/v2.3.5

3. We know that we can keep both code bases, but the community should
choose Hive 2.3 officially.
    That's the right choice from the Apache project policy perspective. At
least, Sean and I prefer that.
    If someone really wants to stick to the Hive 1.2 fork, they can use it at
their own risk.

> for Spark 3.0 end-users who really don't want to interact with this
Hive 1.2 fork, they can always use Hive 2.3 at their own risks.

Specifically, what about having a `hive-1.2` profile in `3.0.0` with the
default Hive 2.3 POM, at least?
What do you think about that, Cheng?

Bests,
Dongjoon.


On Wed, Nov 20, 2019 at 12:59 PM Cheng Lian  wrote:

> Hey Dongjoon and Felix,
>
> I totally agree that Hive 2.3 is more stable than Hive 1.2. Otherwise, we
> wouldn't even consider integrating with Hive 2.3 in Spark 3.0.
>
> However, *"Hive" and "Hive integration in Spark" are two quite different
> things*, and I don't think anybody has ever mentioned "the forked Hive
> 1.2.1 is stable" in any recent Hadoop/Hive version discussions (at least I
> double-checked all my replies).
>
> What I really care about is the stability and quality of "Hive integration
> in Spark", which have gone through some major updates due to the recent
> Hive 2.3 upgrade in Spark 3.0. We had already found bugs in this piece, and
> empirically, for a significant upgrade like this one, it is not surprising
> that other bugs/regressions can be found in the near future. On the other
> hand, the Hive 1.2 integration code path in Spark has been battle-tested
> for years. Yes, there are issues, but people have learned how to get along
> with these issues. And please don't forget that, for Spark 3.0 end-users
> who really don't want to interact with this Hive 1.2 fork, they can always
> use Hive 2.3 at their own risks.
>
> True, "stable" is quite vague a criterion, and hard to be proven. But that
> is exactly the reason why we may want to be conservative and wait for some
> time and see whether there are further signals suggesting that the Hive 2.3
> integration in Spark 3.0 is *unstable*. After one or two Spark 3.x minor
> releases, if we've fixed all the outstanding issues and no more significant
> ones are showing up, we can declare that the Hive 2.3 integration in Spark
> 3.x is stable, and then we can consider removing reference to the Hive 1.2
> fork. Does that make sense?
>
> Cheng
>
> On Wed, Nov 20, 2019 at 11:49 AM Felix Cheung 
> wrote:
>
>> Just to add - hive 1.2 fork is definitely not more stable. We know of a
>> few critical bug fixes that we cherry picked into a fork of that fork to
>> maintain ourselves.
>>
>>
>> --
>> *From:* Dongjoon Hyun 
>> *Sent:* Wednesday, November 20, 2019 11:07:47 AM
>> *To:* Sean Owen 
>> *Cc:* dev 
>> *Subject:* Re: The Myth: the forked Hive 1.2.1 is stabler than XXX
>>
>> Thanks. That will be a giant step forward, Sean!
>>
>> > I'd prefer making it the default in the POM for 3.0.
>>
>> Bests,
>> Dongjoon.
>>
>> On Wed, Nov 20, 2019 at 11:02 AM Sean Owen  wrote:
>>
>> Yeah 'stable' is ambiguous. It's old and buggy, but at least it's the
>> same old and buggy that's been there a while; "stable" in that sense.
>> I'm sure there is a lot more delta between Hive 1 and 2 in terms of
>> bug fixes that are important; the question isn't just 1.x releases.
>>
>> What I don't know is how much affects Spark, as it's a Hive client
>> mostly. Clearly some do.
>>
>> I'd prefer making it the default in the POM for 3.0. Mostly on the
>> grounds that its effects are on deployed clusters, not apps. And
>> deployers can still choose a binary distro with 1.x or make the choice
>> they want. 

Migration `Spark QA Compile` Jenkins jobs to GitHub Action

2019-11-19 Thread Dongjoon Hyun
Hi, All.

The Apache Spark community has used the following dashboard for post-hook
verification.

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/

There are six registered jobs.

1. spark-branch-2.4-compile-maven-hadoop-2.6
2. spark-branch-2.4-compile-maven-hadoop-2.7
3. spark-branch-2.4-lint
4. spark-master-compile-maven-hadoop-2.7
5. spark-master-compile-maven-hadoop-3.2
6. spark-master-lint

Now, we added `GitHub Action` jobs. You can see the green check at every
commit.

https://github.com/apache/spark/commits/master
https://github.com/apache/spark/commits/branch-2.4

If you click the green check, you can see the details.
The following are example runs at the latest commits on both branches.

https://github.com/apache/spark/runs/310411948 (master)
https://github.com/apache/spark/runs/309522646 (branch-2.4)

The new `GitHub Action` jobs have more combinations than the old Jenkins jobs.

- branch-2.4-scala-2.11-hadoop-2.6 (compile/package/install)
- branch-2.4-scala-2.12-hadoop-2.6 (compile/package/install)
- branch-2.4-scala-2.11-hadoop-2.7 (compile/package/install)
- branch-2.4-scala-2.12-hadoop-2.7 (compile/package/install)
- branch-2.4-linters (Scala/Java/Python/R)
- master-scala-2.12-hadoop-2.7 (compile/package/install)
- master-scala-2.12-hadoop-3.2 (compile/package/install)
- master-scala-2.12-hadoop-3.2-jdk11 (compile/package/install)
- master-linters (Scala/Java/Python/R)

In addition, this is part of the Apache Spark code base, and everyone can
make contributions to it.

Finally, as the last piece of this work, we are going to remove the legacy
Jenkins jobs via the following JIRA issue.

https://issues.apache.org/jira/browse/SPARK-29935

Please let me know if you have any concerns on this.
(We can keep the legacy jobs, but two of them are already broken.)

Bests,
Dongjoon.


Re: Migration `Spark QA Compile` Jenkins jobs to GitHub Action

2019-11-19 Thread Dongjoon Hyun
Thank you, Sean, Shane, and Xiao!

Bests,
Dongjoon.

On Tue, Nov 19, 2019 at 2:15 PM Shane Knapp  wrote:

> i had a few minutes and everything has been deleted!
>
> On Tue, Nov 19, 2019 at 2:02 PM Shane Knapp  wrote:
> >
> > thank sean!
> >
> > i am all for moving these jobs to github actions, and will be doing
> > this 'soon' as i'm @ kubecon this week.
> >
> > btw the R ecosystem definitely needs some attention, however, but
> > that's an issue for another time.  :)
> >
> > On Tue, Nov 19, 2019 at 1:49 PM Sean Owen  wrote:
> > >
> > > I would favor moving whatever we can to Github. It's difficult to
> > > modify the Jenkins instances without Shane's valiant help, and over
> > > time makes more sense to modernize and integrate it into the project.
> > >
> > > On Tue, Nov 19, 2019 at 3:35 PM Dongjoon Hyun 
> wrote:
> > > >
> > > > Hi, All.
> > > >
> > > > The Apache Spark community has used the following dashboard for post-hook
> > > > verification.
> > > >
> > > >
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/
> > > >
> > > > There are six registered jobs.
> > > >
> > > > 1. spark-branch-2.4-compile-maven-hadoop-2.6
> > > > 2. spark-branch-2.4-compile-maven-hadoop-2.7
> > > > 3. spark-branch-2.4-lint
> > > > 4. spark-master-compile-maven-hadoop-2.7
> > > > 5. spark-master-compile-maven-hadoop-3.2
> > > > 6. spark-master-lint
> > > >
> > > > Now, we added `GitHub Action` jobs. You can see the green check at
> every commit.
> > > >
> > > > https://github.com/apache/spark/commits/master
> > > > https://github.com/apache/spark/commits/branch-2.4
> > > >
> > > > If you click the green check, you can see the details.
> > > > The following are example runs at the latest commits on both
> > > > branches.
> > > >
> > > > https://github.com/apache/spark/runs/310411948 (master)
> > > > https://github.com/apache/spark/runs/309522646 (branch-2.4)
> > > >
> > > > The new `GitHub Action` jobs have more combinations than the old Jenkins jobs.
> > > >
> > > > - branch-2.4-scala-2.11-hadoop-2.6 (compile/package/install)
> > > > - branch-2.4-scala-2.12-hadoop-2.6 (compile/package/install)
> > > > - branch-2.4-scala-2.11-hadoop-2.7 (compile/package/install)
> > > > - branch-2.4-scala-2.12-hadoop-2.7 (compile/package/install)
> > > > - branch-2.4-linters (Scala/Java/Python/R)
> > > > - master-scala-2.12-hadoop-2.7 (compile/package/install)
> > > > - master-scala-2.12-hadoop-3.2 (compile/package/install)
> > > > - master-scala-2.12-hadoop-3.2-jdk11 (compile/package/install)
> > > > - master-linters (Scala/Java/Python/R)
> > > >
> > > > In addition, this is part of the Apache Spark code base, and everyone
> > > > can make contributions to it.
> > > >
> > > > Finally, as the last piece of this work, we are going to remove the
> legacy Jenkins jobs via the following JIRA issue.
> > > >
> > > > https://issues.apache.org/jira/browse/SPARK-29935
> > > >
> > > > Please let me know if you have any concerns on this.
> > > > (We can keep the legacy jobs, but two of them are already broken.)
> > > >
> > > > Bests,
> > > > Dongjoon.
> > >
> > >
> >
> >
> > --
> > Shane Knapp
> > UC Berkeley EECS Research / RISELab Staff Technical Lead
> > https://rise.cs.berkeley.edu
>
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

2019-11-20 Thread Dongjoon Hyun
Thank you for the thoughtful clarification. I agree with all your options.

Especially for the Hive Metastore connection, the `Hive isolated client loader` is
also important with Hive 2.3 because the Hive 2.3 client cannot talk to Hive
2.1 and lower. The `Hive isolated client loader` is one of the good designs in
Apache Spark.

One of the reasons I started this thread focusing on the fork is that we *don't
actually use* that fork.

https://mvnrepository.com/artifact/org.spark-project.hive/

Big companies (and vendors) already maintain their own fork of that fork or
have upgraded its Hive dependency. So, when we say it's battle-tested, it
isn't really. It's not tested.

The above repository has become something like a stranded phantom. We point to
that repo as a legacy interface, but we don't really use the code in
large production environments. Since there is no way to contribute back to
that repo, we also have a fragmentation problem in the experience with Hive
1.2.1. Some may say it's good while others still struggle without
any community support.

Anyway, thank you so much for the conclusion.
I'll try to make a JIRA and PR for the `hive-1.2` profile first, as a conclusion.

Bests,
Dongjoon.


On Wed, Nov 20, 2019 at 4:10 PM Cheng Lian  wrote:

> Oh, actually, in order to decouple Hadoop 3.2 and Hive 2.3 upgrades, we
> will need a hive-2.3 profile anyway, whether we have the hive-1.2
> profile or not.
>
> On Wed, Nov 20, 2019 at 3:33 PM Cheng Lian  wrote:
>
>> Just to summarize my points:
>>
>>1. Let's still keep the Hive 1.2 dependency in Spark 3.0, but it is
>>optional. End-users may choose between Hive 1.2/2.3 via a new profile
>>(either adding a hive-1.2 profile or adding a hive-2.3 profile works for
>>me, depending on which Hive version we pick as the default version).
>>2. Decouple Hive version upgrade and Hadoop version upgrade, so that
>>people may have more choices in production, and makes Spark 3.0 migration
>>easier (e.g., you don't have to switch to Hadoop 3 in order to pick Hive
>>2.3 and/or JDK 11.).
>>3. For default Hadoop/Hive versions in Spark 3.0, I personally do not
>>have a preference as long as the above two are met.
>>
>>
>> On Wed, Nov 20, 2019 at 3:22 PM Cheng Lian  wrote:
>>
>>> Dongjoon, I don't think we have any conflicts here. As stated in other
>>> threads multiple times, as long as Hive 2.3 and Hadoop 3.2 version upgrades
>>> can be decoupled, I have no preference over which Hive/Hadoop
>>> version is picked as the default version. So the following two plans both work for me:
>>>
>>>1. Keep Hive 1.2 as default Spark 3.0 execution Hive version, and
>>>have an extra hive-2.3 profile.
>>>2. Choose Hive 2.3 as default Spark 3.0 execution Hive version, and
>>>have an extra hive-1.2 profile.
>>>
>>> BTW, I was also discussing Hive dependency issues with other people
>>> offline, and I realized that the Hive isolated client loader is not well
>>> known, and caused unnecessary confusion/worry. So I would like to provide
>>> some background context for readers who are not familiar with Spark Hive
>>> integration here. *Building Spark 3.0 with Hive 1.2.1 does NOT mean
>>> that you can only interact with Hive 1.2.1.*
>>>
>>> Spark does work with different versions of Hive metastore via an
>>> isolated classloading mechanism. *Even if Spark itself is built with
>>> the Hive 1.2.1 fork, you can still interact with a Hive 2.3 metastore, and
>>> this has been true ever since Spark 1.x.* In order to do this, just set
>>> the following two options according to instructions in our official doc
>>> page
>>> <http://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html#interacting-with-different-versions-of-hive-metastore>
>>> :
>>>
>>>- spark.sql.hive.metastore.version
>>>- spark.sql.hive.metastore.jars
>>>
>>> Say you set "spark.sql.hive.metastore.version" to "2.3.6", and
>>> "spark.sql.hive.metastore.jars" to "maven", Spark will pull Hive 2.3.6
>>> dependencies from Maven at runtime when initializing the Hive metastore
>>> client. And those dependencies will NOT conflict with the built-in Hive
>>> 1.2.1 jars, because the downloaded jars are loaded using an isolated
>>> classloader (see here
>>> <https://github.com/apache/spark/blob/1febd373ea806326d269a60048ee52543a76c918/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala>).
>>> Historically, we call these two sets of Hive dependencies "exe
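
A minimal sketch of the two options Cheng mentions above for talking to a
newer metastore (the version and app name are illustrative):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("metastore-client-example")
    // Version of the Hive metastore to talk to.
    .config("spark.sql.hive.metastore.version", "2.3.6")
    // "maven" downloads matching client jars at runtime and loads them in the
    // isolated classloader, so they don't clash with the built-in Hive jars.
    .config("spark.sql.hive.metastore.jars", "maven")
    .enableHiveSupport()
    .getOrCreate()

  spark.sql("SHOW DATABASES").show()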

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-20 Thread Dongjoon Hyun
Thank you all.

I'll try to make JIRA and PR for that.

Bests,
Dongjoon.

On Wed, Nov 20, 2019 at 4:08 PM Cheng Lian  wrote:

> Sean, thanks for the corner cases you listed. They make a lot of sense.
> Now I do incline to have Hive 2.3 as the default version.
>
> Dongjoon, apologies if I didn't make it clear before. What made me
> concerned initially was only the following part:
>
> > can we remove the usage of forked `hive` in Apache Spark 3.0 completely
> officially?
>
> So having Hive 2.3 as the default Hive version and adding a `hive-1.2`
> profile to keep the Hive 1.2.1 fork looks like a feasible approach to me.
> Thanks for starting the discussion!
>
> On Wed, Nov 20, 2019 at 9:46 AM Dongjoon Hyun 
> wrote:
>
>> Yes. Right. That's the situation we are hitting and the result I expected.
>> We need to change our default to Hive 2 in the POM.
>>
>> Dongjoon.
>>
>>
>> On Wed, Nov 20, 2019 at 5:20 AM Sean Owen  wrote:
>>
>>> Yes, good point. A user would get whatever the POM says without
>>> profiles enabled so it matters.
>>>
>>> Playing it out, an app _should_ compile with the Spark dependency
>>> marked 'provided'. In that case the app that is spark-submit-ted is
>>> agnostic to the Hive dependency as the only one that matters is what's
>>> on the cluster. Right? we don't leak through the Hive API in the Spark
>>> API. And yes it's then up to the cluster to provide whatever version
>>> it wants. Vendors will have made a specific version choice when
>>> building their distro one way or the other.
>>>
>>> If you run a Spark cluster yourself, you're using the binary distro,
>>> and we're already talking about also publishing a binary distro with
>>> this variation, so that's not the issue.
>>>
>>> The corner cases where it might matter are:
>>>
>>> - I unintentionally package Spark in the app and by default pull in
>>> Hive 2 when I will deploy against Hive 1. But that's user error, and
>>> causes other problems
>>> - I run tests locally in my project, which will pull in a default
>>> version of Hive defined by the POM
>>>
>>> Double-checking, is that right? if so it kind of implies it doesn't
>>> matter. Which is an argument either way about what's the default. I
>>> too would then prefer defaulting to Hive 2 in the POM. Am I missing
>>> something about the implication?
>>>
>>> (That fork will stay published forever anyway, that's not an issue per
>>> se.)
>>>
>>> On Wed, Nov 20, 2019 at 1:40 AM Dongjoon Hyun 
>>> wrote:
>>> > Sean, our published POM is pointing to and advertising the illegitimate
>>> > Hive 1.2 fork as a compile dependency.
>>> > Yes. It can be overridden. So, why does Apache Spark need to publish
>>> > like that?
>>> > If someone wants to use that illegitimate Hive 1.2 fork, let them
>>> > override it. We are unable to delete that illegitimate Hive 1.2 fork.
>>> > Those artifacts will be orphans.
>>> >
>>>
>>


Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-19 Thread Dongjoon Hyun
Yes. It does. I meant SPARK-20202.

Thanks. I understand that it can be considered like a Scala version issue.
That's the reason why I put this as a `policy` issue from the beginning.

> First of all, I want to put this as a policy issue instead of a technical
issue.

From the policy perspective, we should remove this immediately if we have a
solution to fix it.
For now, I set `Target Versions` of SPARK-20202 to `3.1.0` according to the
current discussion status.

https://issues.apache.org/jira/browse/SPARK-20202

And, if there is no other issues, I'll create a PR to remove it from
`master` branch when we cut `branch-3.0`.

For the additional `hadoop-2.7 with Hive 2.3` pre-built distribution, what do
you think, Sean?
The preparation has already started in another email thread, and I believe
that is a keystone for proving the stability of the `Hive 2.3` version
(which Cheng/Hyukjin/you asked for).

Bests,
Dongjoon.


On Tue, Nov 19, 2019 at 2:09 PM Cheng Lian  wrote:

> It's kinda like a Scala version upgrade. Historically, we only remove
> support for an older Scala version when the newer version is proven to be
> stable after one or more Spark minor versions.
>
> On Tue, Nov 19, 2019 at 2:07 PM Cheng Lian  wrote:
>
>> Hmm, what exactly did you mean by "remove the usage of forked `hive` in
>> Apache Spark 3.0 completely officially"? I thought you wanted to remove the
>> forked Hive 1.2 dependencies completely, no? As long as we still keep the
>> Hive 1.2 in Spark 3.0, I'm fine with that. I personally don't have a
>> particular preference between using Hive 1.2 or 2.3 as the default Hive
>> version. After all, for end-users and providers who need a particular
>> version combination, they can always build Spark with proper profiles
>> themselves.
>>
>> And thanks for clarifying the Hive 2.3.5 issue. I didn't notice that it's
>> due to the folder name.
>>
>> On Tue, Nov 19, 2019 at 11:15 AM Dongjoon Hyun 
>> wrote:
>>
>>> BTW, `hive.version.short` is a directory name. We are using 2.3.6 only.
>>>
>>> For the directory names, we use '1.2.1' and '2.3.5' because we just delayed
>>> renaming the directories until the 3.0.0 deadline to minimize the diff.
>>>
>>> We can replace it immediately if we want right now.
>>>
>>>
>>>
>>> On Tue, Nov 19, 2019 at 11:11 AM Dongjoon Hyun 
>>> wrote:
>>>
>>>> Hi, Cheng.
>>>>
>>>> This is irrelevant to JDK11 and Hadoop 3. I'm talking about JDK8 world.
>>>> If we consider them, it could be the followings.
>>>>
>>>> +------------+-----------------+--------------------+
>>>> |            | Hive 1.2.1 fork |  Apache Hive 2.3.6 |
>>>> +------------+-----------------+--------------------+
>>>> | Legitimate |        X        |          O         |
>>>> | JDK11      |        X        |          O         |
>>>> | Hadoop3    |        X        |          O         |
>>>> | Hadoop2    |        O        |          O         |
>>>> | Functions  |     Baseline    |         More       |
>>>> | Bug fixes  |     Baseline    |         More       |
>>>> +------------+-----------------+--------------------+
>>>>
>>>> To stabilize Spark's Hive 2.3 usage, we should use it by ourselves
>>>> (including Jenkins/GitHubAction/AppVeyor).
>>>>
>>>> For me, AS-IS 3.0 is not enough for that. Following your advice,
>>>> to give more visibility to the whole community,
>>>>
>>>> 1. We need to give additional `hadoop-2.7 with Hive 2.3` pre-built
>>>> distribution
>>>> 2. We need to switch our default Hive usage to 2.3 in `master` for 3.1
>>>> after `branch-3.0` branch cut.
>>>>
>>>> I know that we have been reluctant to do (1) and (2) due to their burden.
>>>> But it's time to prepare. Without them, we are going to fall short
>>>> again and again.
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Nov 19, 2019 at 9:26 AM Cheng Lian 
>>>> wrote:
>>>>
>>>>> Dongjoon, I'm with Hyukjin. There should be at least one Spark 3.x
>>>>> minor release to stabilize Hive 2.3 code paths before retiring the Hive 
>>>>> 1.2
>>>>> fork. Even today, the Hive 2.3.6 version bundled in Spark 3.0 is still
>>>>> buggy in terms of JDK 11 support. (BTW, I just found that our root POM is
>>>>> referring both Hive 2.3.6 and 2.3.5 at the moment, see here
>>>&g

Re: Spark 3.0 preview release on-going features discussion

2019-09-20 Thread Dongjoon Hyun
Thank you for the summarization, Xingbo.

I also agree with Sean because I don't think those should block the 3.0.0
preview release.
In particular, correctness issues should not be there.

Instead, could you summarize what we have as of now for the 3.0.0 preview?

I believe JDK11 (SPARK-28684) and Hive 2.3.5 (SPARK-23710) will be in the
what-we-have list for 3.0.0 preview.

Bests,
Dongjoon.

On Fri, Sep 20, 2019 at 6:22 AM Sean Owen  wrote:

> Is this a list of items that might be focused on for the final 3.0
> release? At least, Scala 2.13 support shouldn't be on that list. The
> others look plausible, or are already done, but there are probably
> more.
>
> As for the 3.0 preview, I wouldn't necessarily block on any particular
> feature, though, yes, the more work that can go into important items
> between now and then, the better.
> I wouldn't necessarily present any list of things that will or might
> be in 3.0 with that preview; just list the things that are done, like
> JDK 11 support.
>
> On Fri, Sep 20, 2019 at 2:46 AM Xingbo Jiang 
> wrote:
> >
> > Hi all,
> >
> > Let's start a new thread to discuss the on-going features for Spark 3.0
> preview release.
> >
> > Below is the feature list for the Spark 3.0 preview release. The list is
> collected from the previous discussions in the dev list.
> >
> > Followup of the shuffle+repartition correctness issue: support roll back
> shuffle stages (https://issues.apache.org/jira/browse/SPARK-25341)
> > Upgrade the built-in Hive to 2.3.5 for hadoop-3.2 (
> https://issues.apache.org/jira/browse/SPARK-23710)
> > JDK 11 support (https://issues.apache.org/jira/browse/SPARK-28684)
> > Scala 2.13 support (https://issues.apache.org/jira/browse/SPARK-25075)
> > DataSourceV2 features
> >
> > Enable file source v2 writers (
> https://issues.apache.org/jira/browse/SPARK-27589)
> > CREATE TABLE USING with DataSourceV2
> > New pushdown API for DataSourceV2
> > Support DELETE/UPDATE/MERGE Operations in DataSourceV2 (
> https://issues.apache.org/jira/browse/SPARK-28303)
> >
> > Correctness issue: Stream-stream joins - left outer join gives
> inconsistent output (https://issues.apache.org/jira/browse/SPARK-26154)
> > Revisiting Python / pandas UDF (
> https://issues.apache.org/jira/browse/SPARK-28264)
> > Spark Graph (https://issues.apache.org/jira/browse/SPARK-25994)
> >
> > Features that are nice to have:
> >
> > Use remote storage for persisting shuffle data (
> https://issues.apache.org/jira/browse/SPARK-25299)
> > Spark + Hadoop + Parquet + Avro compatibility problems (
> https://issues.apache.org/jira/browse/SPARK-25588)
> > Introduce new option to Kafka source - specify timestamp to start and
> end offset (https://issues.apache.org/jira/browse/SPARK-26848)
> > Delete files after processing in structured streaming (
> https://issues.apache.org/jira/browse/SPARK-20568)
> >
> > Here, I am proposing to cut the branch on October 15th. If the features
> > are targeting the 3.0 preview release, please prioritize the work and finish
> it before the date. Note, Oct. 15th is not the code freeze of Spark 3.0.
> That means, the community will still work on the features for the upcoming
> Spark 3.0 release, even if they are not included in the preview release.
> The goal of preview release is to collect more feedback from the community
> regarding the new 3.0 features/behavior changes.
> >
> > Thanks!
>
>
>


Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

2019-10-10 Thread Dongjoon Hyun
+1

Bests,
Dongjoon

On Thu, Oct 10, 2019 at 10:14 Ryan Blue  wrote:

> +1
>
> Thanks for fixing this!
>
> On Thu, Oct 10, 2019 at 6:30 AM Xiao Li  wrote:
>
>> +1
>>
>> On Thu, Oct 10, 2019 at 2:13 AM Hyukjin Kwon  wrote:
>>
>>> +1 (binding)
>>>
>>> On Thu, Oct 10, 2019 at 5:11 PM, Takeshi Yamamuro wrote:
>>>
 Thanks for the great work, Gengliang!

 +1 for that.
 As I said before, the behaviour is pretty common in DBMSs, so the change
 helps DBMS users.

 Bests,
 Takeshi


 On Mon, Oct 7, 2019 at 5:24 PM Gengliang Wang <
 gengliang.w...@databricks.com> wrote:

> Hi everyone,
>
> I'd like to call for a new vote on SPARK-28885
>  "Follow ANSI
> store assignment rules in table insertion by default" after revising the
> ANSI store assignment policy(SPARK-29326
> ).
> When inserting a value into a column with the different data type,
> Spark performs type coercion. Currently, we support 3 policies for the
> store assignment rules: ANSI, legacy and strict, which can be set via the
> option "spark.sql.storeAssignmentPolicy":
> 1. ANSI: Spark performs the store assignment as per ANSI SQL. In
> practice, the behavior is mostly the same as PostgreSQL. It disallows
> certain unreasonable type conversions such as converting `string` to `int`
> and `double` to `boolean`. It will throw a runtime exception if the value
> is out of range (overflow).
> 2. Legacy: Spark allows the store assignment as long as it is a valid
> `Cast`, which is very loose. E.g., converting either `string` to `int` or
> `double` to `boolean` is allowed. It is the current behavior in Spark 2.x
> for compatibility with Hive. When inserting an out-of-range value to an
> integral field, the low-order bits of the value are inserted (the same as
> Java/Scala numeric type casting). For example, if 257 is inserted into a
> field of Byte type, the result is 1.
> 3. Strict: Spark doesn't allow any possible precision loss or data
> truncation in store assignment, e.g., converting either `double` to `int`
> or `decimal` to `double` is not allowed. The rules are originally for Dataset
> encoder. As far as I know, no mainstream DBMS is using this policy by
> default.
>
> Currently, the V1 data source uses "Legacy" policy by default, while
> V2 uses "Strict". This proposal is to use "ANSI" policy by default for 
> both
> V1 and V2 in Spark 3.0.
>
> This vote is open until Friday (Oct. 11).
>
> [ ] +1: Accept the proposal
> [ ] +0
> [ ] -1: I don't think this is a good idea because ...
>
> Thank you!
>
> Gengliang
>


 --
 ---
 Takeshi Yamamuro

>>> --
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
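
A minimal sketch of the store assignment switch described in the proposal
above (the table and values are illustrative):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("store-assignment-example")
    .master("local[*]")
    .config("spark.sql.storeAssignmentPolicy", "ANSI")  // or LEGACY / STRICT
    .getOrCreate()

  spark.sql("CREATE TABLE t (i INT) USING parquet")
  // Under ANSI this insert is expected to be rejected at analysis time because
  // string -> int is not a safe store assignment; under LEGACY it would be cast.
  spark.sql("INSERT INTO t VALUES ('257')")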


Re: Spark 3.0 preview release feature list and major changes

2019-10-08 Thread Dongjoon Hyun
Thank you for the preparation of 3.0-preview, Xingbo!

Bests,
Dongjoon.

On Tue, Oct 8, 2019 at 2:32 PM Xingbo Jiang  wrote:

>  What's the process to propose a feature to be included in the final Spark
>> 3.0 release?
>>
>
> I don't know whether there exists any specific process here, normally you
> just merge the feature into Spark master before release code freeze, and
> then the feature would probably be included in the release. The code freeze
> date for Spark 3.0 has not been decided yet, though.
>
> Li Jin  于2019年10月8日周二 下午2:14写道:
>
>> Thanks for summary!
>>
>> I have a question that is semi-related - What's the process to propose a
>> feature to be included in the final Spark 3.0 release?
>>
>> In particular, I am interested in
>> https://issues.apache.org/jira/browse/SPARK-28006.  I am happy to do the
>> work so want to make sure I don't miss the "cut" date.
>>
>> On Tue, Oct 8, 2019 at 4:53 PM Xingbo Jiang 
>> wrote:
>>
>>> Hi all,
>>>
>>> Thanks for all the feedbacks, here is the updated feature list:
>>>
>>> SPARK-11215 
>>> Multiple columns support added to various Transformers: StringIndexer
>>>
>>> SPARK-11150 
>>> Implement Dynamic Partition Pruning
>>>
>>> SPARK-13677  Support
>>> Tree-Based Feature Transformation
>>>
>>> SPARK-16692  Add
>>> MultilabelClassificationEvaluator
>>>
>>> SPARK-19591  Add
>>> sample weights to decision trees
>>>
>>> SPARK-19712  Pushing
>>> Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc.
>>>
>>> SPARK-19827  R API
>>> for Power Iteration Clustering
>>>
>>> SPARK-20286  Improve
>>> logic for timing out executors in dynamic allocation
>>>
>>> SPARK-20636 
>>> Eliminate unnecessary shuffle with adjacent Window expressions
>>>
>>> SPARK-22148  Acquire
>>> new executors to avoid hang because of blacklisting
>>>
>>> SPARK-22796 
>>> Multiple columns support added to various Transformers: PySpark
>>> QuantileDiscretizer
>>>
>>> SPARK-23128  A new
>>> approach to do adaptive execution in Spark SQL
>>>
>>> SPARK-23155  Apply
>>> custom log URL pattern for executor log URLs in SHS
>>>
>>> SPARK-23539  Add
>>> support for Kafka headers
>>>
>>> SPARK-23674  Add
>>> Spark ML Listener for Tracking ML Pipeline Status
>>>
>>> SPARK-23710  Upgrade
>>> the built-in Hive to 2.3.5 for hadoop-3.2
>>>
>>> SPARK-24333  Add fit
>>> with validation set to Gradient Boosted Trees: Python API
>>>
>>> SPARK-24417  Build
>>> and Run Spark on JDK11
>>>
>>> SPARK-24615 
>>> Accelerator-aware task scheduling for Spark
>>>
>>> SPARK-24920  Allow
>>> sharing Netty's memory pool allocators
>>>
>>> SPARK-25250  Fix
>>> race condition with tasks running when new attempt for same stage is
>>> created leads to other task in the next attempt running on the same
>>> partition id retry multiple times
>>>
>>> SPARK-25341  Support
>>> rolling back a shuffle map stage and re-generate the shuffle files
>>>
>>> SPARK-25348  Data
>>> source for binary files
>>>
>>> SPARK-25501  Add
>>> kafka delegation token support
>>>
>>> SPARK-25603 
>>> Generalize Nested Column Pruning
>>>
>>> SPARK-26132  Remove
>>> support for Scala 2.11 in Spark 3.0.0
>>>
>>> SPARK-26215  define
>>> reserved keywords after SQL standard
>>>
>>> SPARK-26412  Allow
>>> Pandas UDF to take an iterator of pd.DataFrames
>>>
>>> SPARK-26759  Arrow
>>> optimization in SparkR's interoperability
>>>
>>> SPARK-26785  data
>>> source v2 API refactor: streaming write
>>>
>>> SPARK-26848 

Re: Auto-closing PRs when there are no feedback or response from its author

2019-10-09 Thread Dongjoon Hyun
Thank you for keeping eyes on this difficult issue, Hyukjin.

Although we try our best, there are always some corner cases. For
example,

1. Although we close old JIRA issues only for EOL versions, some issues
don't have any `Affected Versions` field info at all.
- https://issues.apache.org/jira/browse/SPARK-8542

2. Although we can auto-close PRs that have a merge conflict and haven't
been updated in months, some PRs don't have conflicts.
- https://github.com/apache/spark/pull/7842 (Actually, this is the
oldest PR due to the above reason.)

So, I'm +1 for trying to add a new manual tagging process,
because I believe it's better than leaving PRs with no activity, and the
grace period makes it feel softer than closing them directly.

Bests,
Dongjoon.


On Tue, Oct 8, 2019 at 7:26 PM Sean Owen  wrote:

> I'm generally all for closing pretty old PRs. They can be reopened
> easily. Closing a PR (a particular proposal for how to resolve an
> issue) is less drastic than closing a JIRA (a description of an
> issue). Closing them just delivers the reality, that nobody is going
> to otherwise revisit it, and can actually prompt a few contributors to
> update or revisit their proposal.
>
> I wouldn't necessarily want to adopt new process or tools though. Is
> it not sufficient to auto-close PRs that have a merge conflict and
> haven't been updated in months? or just haven't been updated in a
> year? Those are probably manual-ish processes, but, don't need to
> happen more than a couple times a year.
>
> If there's little overhead to adoption, cool, though I doubt people
> will consistently use a new tag. I'd prefer any process or tool that
> implements the above.
>
>
> On Tue, Oct 8, 2019 at 8:19 PM Hyukjin Kwon  wrote:
> >
> > Hi all,
> >
> > I think we talked about this before. Roughly speaking, there are two
> cases of PRs:
> >   1. PRs waiting for review and 2. PRs waiting for author's reaction
> > We might not have to take an action but wait for reviewing for the first
> case.
> > However, we can ping and/or take an action for the second case.
> >
> > I noticed (at Read the Docs,
> https://github.com/readthedocs/readthedocs.org/blob/master/.github/no-response.yml)
> there's a bot, integrated as a GitHub app, that does exactly what we want
> (see https://github.com/probot/no-response).
> >
> > 1. Maintainers (committers) can add a tag to a PR (e.g.,
> need-more-information)
> > 2. If the PR author responds with a comment or update, the bot removes
> the tag
> > 3. If the PR author does not respond, the bot closes the PR after
> waiting for the configured number of days.
> >
> > We already have a kind of simple mechanism for windowing the number of
> JIRAs. I think it's time to have such a mechanism for GitHub PRs as well.
> >
> > Although this repo doesn't look popular or widely used enough, it seems to
> match exactly what we want, and it is less aggressive since this mechanism
> will only work when maintainers (committers) add a tag to a PR.
> >
> > WDYT guys?
> >
> > I cc'ed a few people who I think were in similar discussions in the past.
> >
>
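
For context, the bot described above is driven by a small `.github/no-response.yml` file. A sketch is below; the field names follow the probot/no-response documentation, and the values are only placeholders:

    # .github/no-response.yml (illustrative placeholder values)
    # Days to wait for the author's response before closing the PR/issue
    daysUntilClose: 30
    # Label a committer adds to request more information from the author
    responseRequiredLabel: need-more-information
    # Comment posted when the PR/issue is closed for lack of response
    closeComment: >
      Closing this because the requested information was not provided.
      It can be reopened once the author responds.
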


Re: [VOTE] SPARK 3.0.0-preview2 (RC2)

2019-12-21 Thread Dongjoon Hyun
Hi, Yuming.

Could you summarize the vote result?

Bests,
Dongjoon.

On Wed, Dec 18, 2019 at 19:28 Wenchen Fan  wrote:

> +1, all tests pass
>
> On Thu, Dec 19, 2019 at 7:18 AM Takeshi Yamamuro 
> wrote:
>
>> Thanks, Yuming!
>>
>> I checked the links and the prepared binaries.
>> Also, I ran tests with -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver
>> -Pmesos -Pkubernetes -Psparkr
>> on Java version 1.8.0_181.
>> All the things above look fine.
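
For anyone reproducing that check locally, a hypothetical invocation with the same profiles (the Maven goals and any extra flags are assumptions):

    ./build/mvn -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver \
      -Pmesos -Pkubernetes -Psparkr clean test
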
>>
>> Bests,
>> Takeshi
>>
>> On Thu, Dec 19, 2019 at 6:31 AM Dongjoon Hyun 
>> wrote:
>>
>>> +1
>>>
>>> I also checked the signatures and docs, and built and tested with JDK
>>> 11.0.5, Hadoop 3.2, Hive 2.3.
>>> In addition, the newly added
>>> `spark-3.0.0-preview2-bin-hadoop2.7-hive1.2.tgz` distribution looks correct.
>>>
>>> Thank you Yuming and all.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Tue, Dec 17, 2019 at 4:11 PM Sean Owen  wrote:
>>>
>>>> Same result as last time. It all looks good and tests pass for me on
>>>> Ubuntu with all profiles enabled (Hadoop 3.2 + Hive 2.3), building
>>>> from source.
>>>> 'pyspark-3.0.0.dev2.tar.gz' appears to be the desired python artifact
>>>> name, yes.
>>>> +1
>>>>
>>>> On Tue, Dec 17, 2019 at 12:36 AM Yuming Wang  wrote:
>>>> >
>>>> > Please vote on releasing the following candidate as Apache Spark
>>>> version 3.0.0-preview2.
>>>> >
>>>> > The vote is open until December 20 PST and passes if a majority +1
>>>> PMC votes are cast, with
>>>> > a minimum of 3 +1 votes.
>>>> >
>>>> > [ ] +1 Release this package as Apache Spark 3.0.0-preview2
>>>> > [ ] -1 Do not release this package because ...
>>>> >
>>>> > To learn more about Apache Spark, please see http://spark.apache.org/
>>>> >
>>>> > The tag to be voted on is v3.0.0-preview2-rc2 (commit
>>>> bcadd5c3096109878fe26fb0d57a9b7d6fdaa257):
>>>> > https://github.com/apache/spark/tree/v3.0.0-preview2-rc2
>>>> >
>>>> > The release files, including signatures, digests, etc. can be found
>>>> at:
>>>> > https://dist.apache.org/repos/dist/dev/spark/v3.0.0-preview2-rc2-bin/
>>>> >
>>>> > Signatures used for Spark RCs can be found in this file:
>>>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>> >
>>>> > The staging repository for this release can be found at:
>>>> >
>>>> https://repository.apache.org/content/repositories/orgapachespark-1338/
>>>> >
>>>> > The documentation corresponding to this release can be found at:
>>>> >
>>>> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-preview2-rc2-docs/
>>>> >
>>>> > The list of bug fixes going into 3.0.0 can be found at the following
>>>> URL:
>>>> > https://issues.apache.org/jira/projects/SPARK/versions/12339177
>>>> >
>>>> > FAQ
>>>> >
>>>> > =
>>>> > How can I help test this release?
>>>> > =
>>>> >
>>>> > If you are a Spark user, you can help us test this release by taking
>>>> > an existing Spark workload and running on this release candidate, then
>>>> > reporting any regressions.
>>>> >
>>>> > If you're working in PySpark you can set up a virtual env and install
>>>> > the current RC and see if anything important breaks; in Java/Scala,
>>>> > you can add the staging repository to your project's resolvers and test
>>>> > with the RC (make sure to clean up the artifact cache before/after so
>>>> > you don't end up building with an out of date RC going forward).
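
As an illustration of the Java/Scala route, a minimal build.sbt sketch pointing at the staging repository listed above; the artifact coordinates and Scala version are assumptions, so adjust them to your project:

    // build.sbt -- resolve the RC artifacts from the staging repository above
    scalaVersion := "2.12.10"
    resolvers += "Apache Spark staging" at
      "https://repository.apache.org/content/repositories/orgapachespark-1338/"
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.0.0-preview2"
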
>>>> >
>>>> > ===
>>>> > What should happen to JIRA tickets still targeting 3.0.0?
>>>> > ===
>>>> >
>>>> > The current list of open tickets targeted at 3.0.0 can be found at:
>>>> > https://issues.apache.org/jira/projects/SPARK and search for "Target
>>>> Version/s" = 3.0.0
>>>> >
>>>> > Committers should look at those and triage. Extremely important bug
>>>> > fixes, documentation, and API tweaks that impact compatibility should
>>>> > be worked on immediately.
>>>> >
>>>> > ==
>>>> > But my bug isn't fixed?
>>>> > ==
>>>> >
>>>> > In order to make timely releases, we will typically not hold the
>>>> > release unless the bug in question is a regression from the previous
>>>> > release. That being said, if there is something which is a regression
>>>> > that has not been correctly targeted please ping me or a committer to
>>>> > help target the issue.
>>>>
>>>> -
>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>
>>>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>


Re: [VOTE][RESULT] SPARK 3.0.0-preview2 (RC2)

2019-12-22 Thread Dongjoon Hyun
Thank you all. Especially, Yuming as a release manager!
Happy Holidays!

Cheers,
Dongjoon.


On Sun, Dec 22, 2019 at 12:51 AM Yuming Wang  wrote:

> Hi, All.
>
> The vote passes. Thanks to all who helped with this release
> 3.0.0-preview2!
> I'll follow up later with a release announcement once everything is
> published.
>
> +1 (* = binding):
> - Sean Owen *
> - Dongjoon Hyun *
> - Takeshi Yamamuro *
> - Wenchen Fan *
>
> +0: None
>
> -1: None
>
>
>
>
> Regards,
> Yuming
>


Re: [VOTE] SPARK 3.0.0-preview2 (RC2)

2019-12-18 Thread Dongjoon Hyun
+1

I also checked the signatures and docs, and built and tested with JDK
11.0.5, Hadoop 3.2, Hive 2.3.
In addition, the newly added
`spark-3.0.0-preview2-bin-hadoop2.7-hive1.2.tgz` distribution looks correct.

Thank you Yuming and all.

Bests,
Dongjoon.


On Tue, Dec 17, 2019 at 4:11 PM Sean Owen  wrote:

> Same result as last time. It all looks good and tests pass for me on
> Ubuntu with all profiles enabled (Hadoop 3.2 + Hive 2.3), building
> from source.
> 'pyspark-3.0.0.dev2.tar.gz' appears to be the desired python artifact
> name, yes.
> +1
>
> On Tue, Dec 17, 2019 at 12:36 AM Yuming Wang  wrote:
> >
> > Please vote on releasing the following candidate as Apache Spark version
> 3.0.0-preview2.
> >
> > The vote is open until December 20 PST and passes if a majority +1 PMC
> votes are cast, with
> > a minimum of 3 +1 votes.
> >
> > [ ] +1 Release this package as Apache Spark 3.0.0-preview2
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see http://spark.apache.org/
> >
> > The tag to be voted on is v3.0.0-preview2-rc2 (commit
> bcadd5c3096109878fe26fb0d57a9b7d6fdaa257):
> > https://github.com/apache/spark/tree/v3.0.0-preview2-rc2
> >
> > The release files, including signatures, digests, etc. can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v3.0.0-preview2-rc2-bin/
> >
> > Signatures used for Spark RCs can be found in this file:
> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1338/
> >
> > The documentation corresponding to this release can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v3.0.0-preview2-rc2-docs/
> >
> > The list of bug fixes going into 3.0.0 can be found at the following URL:
> > https://issues.apache.org/jira/projects/SPARK/versions/12339177
> >
> > FAQ
> >
> > =
> > How can I help test this release?
> > =
> >
> > If you are a Spark user, you can help us test this release by taking
> > an existing Spark workload and running on this release candidate, then
> > reporting any regressions.
> >
> > If you're working in PySpark you can set up a virtual env and install
> > the current RC and see if anything important breaks; in Java/Scala,
> > you can add the staging repository to your project's resolvers and test
> > with the RC (make sure to clean up the artifact cache before/after so
> > you don't end up building with an out of date RC going forward).
> >
> > ===
> > What should happen to JIRA tickets still targeting 3.0.0?
> > ===
> >
> > The current list of open tickets targeted at 3.0.0 can be found at:
> > https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.0.0
> >
> > Committers should look at those and triage. Extremely important bug
> > fixes, documentation, and API tweaks that impact compatibility should
> > be worked on immediately.
> >
> > ==
> > But my bug isn't fixed?
> > ==
> >
> > In order to make timely releases, we will typically not hold the
> > release unless the bug in question is a regression from the previous
> > release. That being said, if there is something which is a regression
> > that has not been correctly targeted please ping me or a committer to
> > help target the issue.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [ANNOUNCE] Announcing Apache Spark 2.4.5

2020-02-08 Thread Dongjoon Hyun
There was a typo in one URL. The correct release note URL is here.

https://spark.apache.org/releases/spark-release-2-4-5.html



On Sat, Feb 8, 2020 at 5:22 PM Dongjoon Hyun 
wrote:

> We are happy to announce the availability of Spark 2.4.5!
>
> Spark 2.4.5 is a maintenance release containing stability fixes. This
> release is based on the branch-2.4 maintenance branch of Spark. We strongly
> recommend all 2.4 users to upgrade to this stable release.
>
> To download Spark 2.4.5, head over to the download page:
> http://spark.apache.org/downloads.html
>
> Note that you might need to clear your browser cache or
> use `Private`/`Incognito` mode, depending on your browser.
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-2.4.5.html
>
> We would like to acknowledge all community members for contributing to this
> release. This release would not have been possible without you.
>
> Dongjoon Hyun
>


[ANNOUNCE] Announcing Apache Spark 2.4.5

2020-02-08 Thread Dongjoon Hyun
We are happy to announce the availability of Spark 2.4.5!

Spark 2.4.5 is a maintenance release containing stability fixes. This
release is based on the branch-2.4 maintenance branch of Spark. We strongly
recommend all 2.4 users to upgrade to this stable release.

To download Spark 2.4.5, head over to the download page:
http://spark.apache.org/downloads.html

Note that you might need to clear your browser cache or
use `Private`/`Incognito` mode, depending on your browser.

To view the release notes:
https://spark.apache.org/releases/spark-release-2.4.5.html

We would like to acknowledge all community members for contributing to this
release. This release would not have been possible without you.

Dongjoon Hyun


Re: GitHub action permissions

2020-02-28 Thread Dongjoon Hyun
Hi, Thomas.

If you log in with a GitHub account that is registered as an Apache project
member, that should be enough.

On some Apache Spark PRs, can you see the 'Squash and merge' button?

Bests,
Dongjoon

On Fri, Feb 28, 2020 at 07:15 Thomas graves  wrote:

> Does anyone know how the GitHub action permissions are setup?
>
> I see a lot of random failures and want to be able to rerun them, but
> I don't seem to have a "rerun" button like some folks do.
>
> Thanks,
> Tom
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [Proposal] Modification to Spark's Semantic Versioning Policy

2020-03-05 Thread Dongjoon Hyun
Hi, All.

There is an ongoing PR from Xiao referencing this email.

https://github.com/apache/spark/pull/27821

Bests,
Dongjoon.

On Fri, Feb 28, 2020 at 11:20 AM Sean Owen  wrote:

> On Fri, Feb 28, 2020 at 12:03 PM Holden Karau 
> wrote:
> >> 1. Could you estimate how many revert commits are required in
> `branch-3.0` for new rubric?
>
> Fair question about what actual change this implies for 3.0? so far it
> seems like some targeted, quite reasonable reverts. I don't think
> anyone's suggesting reverting loads of changes.
>
>
> >> 2. Are you going to revert all removed test cases for the
> deprecated ones?
> > This is a good point, making sure we keep the tests as well is important
> (worse than removing a deprecated API is shipping it broken).
>
> (I'd say, yes of course! which seems consistent with what is happening now)
>
>
> >> 3. Does it make any delay for Apache Spark 3.0.0 release?
> >> (I believe it was previously scheduled on June before Spark
> Summit 2020)
> >
> > I think if we need to delay to make a better release this is ok,
> especially given our current preview releases being available to gather
> community feedback.
>
> Of course these things block 3.0 -- all the more reason to keep it
> specific and targeted -- but nothing so far seems inconsistent with
> finishing in a month or two.
>
>
> >> Although there was a discussion already, I want to make sure about the
> following tough parts.
> >> 4. We are not going to add Scala 2.11 API, right?
> > I hope not.
> >>
> >> 5. We are not going to support Python 2.x in Apache Spark 3.1+,
> right?
> > I think doing that would be bad, it's already end of lifed elsewhere.
>
> Yeah this is an important subtext -- the valuable principles here
> could be interpreted in many different ways depending on how much you
> weight value of keeping APIs for compatibility vs value in simplifying
> Spark and pushing users to newer APIs more forcibly. They're all
> judgment calls, based on necessarily limited data about the universe
> of users. We can only go on rare direct user feedback, on feedback
> perhaps from vendors as proxies for a subset of users, and the general
> good faith judgment of committers who have lived Spark for years.
>
> My specific interpretation is that the standard is (correctly)
> tightening going forward, and retroactively a bit for 3.0. But, I do
> not think anyone is advocating for the logical extreme of, for
> example, maintaining Scala 2.11 compatibility indefinitely. I think
> that falls out readily from the rubric here: maintaining 2.11
> compatibility is really quite painful if you ever support 2.13 too,
> for example.
>


Re: [Proposal] Modification to Spark's Semantic Versioning Policy

2020-02-28 Thread Dongjoon Hyun
Hi, Matei and Michael.

I'm also a big supporter for policy-based project management.

Before going further,

1. Could you estimate how many revert commits are required in
`branch-3.0` for new rubric?
2. Are you going to revert all removed test cases for the deprecated
ones?
3. Does it make any delay for Apache Spark 3.0.0 release?
(I believe it was previously scheduled on June before Spark Summit
2020)

Although there was a discussion already, I want to make sure about the
following tough parts.

4. We are not going to add Scala 2.11 API, right?
5. We are not going to support Python 2.x in Apache Spark 3.1+, right?
6. Do we have enough resource for testing the deprecated ones?
(Currently, we have 8 heavy Jenkins jobs for `branch-3.0` already.)

Especially for (2) and (6), we know that keeping deprecated ones without
tests doesn't give us any support for the new rubric.

Bests,
Dongjoon.

On Thu, Feb 27, 2020 at 5:31 PM Matei Zaharia 
wrote:

> +1 on this new rubric. It definitely captures the issues I’ve seen in
> Spark and in other projects. If we write down this rubric (or something
> like it), it will also be easier to refer to it during code reviews or in
> proposals of new APIs (we could ask “do you expect to have to change this
> API in the future, and if so, how”).
>
> Matei
>
> On Feb 24, 2020, at 3:02 PM, Michael Armbrust 
> wrote:
>
> Hello Everyone,
>
> As more users have started upgrading to Spark 3.0 preview (including
> myself), there have been many discussions around APIs that have been broken
> compared with Spark 2.x. In many of these discussions, one of the
> rationales for breaking an API seems to be "Spark follows semantic
> versioning, so this
> major release is our chance to get it right [by breaking APIs]". Similarly,
> in many cases the response to questions about why an API was completely
> removed has been, "this API has been deprecated since x.x, so we have to
> remove it".
>
> As a long time contributor to and user of Spark this interpretation of the
> policy is concerning to me. This reasoning misses the intention of the
> original policy, and I am worried that it will hurt the long-term success
> of the project.
>
> I definitely understand that these are hard decisions, and I'm not
> proposing that we never remove anything from Spark. However, I would like
> to give some additional context and also propose a different rubric for
> thinking about API breakage moving forward.
>
> Spark adopted semantic versioning back in 2014 during the preparations for
> the 1.0 release. As this was the first major release -- and as, up until
> fairly recently, Spark had only been an academic project -- no real
> promises had been made about API stability ever.
>
> During the discussion, some committers suggested that this was an
> opportunity to clean up cruft and give the Spark APIs a once-over, making
> cosmetic changes to improve consistency. However, in the end, it was
> decided that in many cases it was not in the best interests of the Spark
> community to break things just because we could. Matei actually said it
> pretty forcefully:
>
> I know that some names are suboptimal, but I absolutely detest breaking
> APIs, config names, etc. I’ve seen it happen way too often in other
> projects (even things we depend on that are officially post-1.0, like Akka
> or Protobuf or Hadoop), and it’s very painful. I think that we as fairly
> cutting-edge users are okay with libraries occasionally changing, but many
> others will consider it a show-stopper. Given this, I think that any
> cosmetic change now, even though it might improve clarity slightly, is not
> worth the tradeoff in terms of creating an update barrier for existing
> users.
>
> In the end, while some changes were made, most APIs remained the same and
> users of Spark <= 0.9 were pretty easily able to upgrade to 1.0. I think
> this served the project very well, as compatibility means users are able to
> upgrade and we keep as many people on the latest versions of Spark (though
> maybe not the latest APIs of Spark) as possible.
>
> As Spark grows, I think compatibility actually becomes more important and
> we should be more conservative rather than less. Today, there are very
> likely more Spark programs running than there were at any other time in the
> past. Spark is no longer a tool only used by advanced hackers, it is now
> also running "traditional enterprise workloads.'' In many cases these jobs
> are powering important processes long after the original author leaves.
>
> Broken APIs can also affect libraries that extend Spark. This dependency
> can be even harder for users, as if the library has not been upgraded to
> use new APIs and they need that library, they are stuck.
>
> Given all of this, I'd like to propose the 

Re: 'spark-master-docs' job missing in Jenkins

2020-02-26 Thread Dongjoon Hyun
Instead of adding another Jenkins job, adding a GitHub Action job will be a
better solution, because it lets us share the long-term maintenance workload.

I'll make a PR for that.
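
A hypothetical shape for such a job is sketched below; the file name, action versions, and dependency steps are assumptions, and the actual PR may look quite different:

    # .github/workflows/docs.yml (illustrative only)
    name: Build documentation
    on: [pull_request]
    jobs:
      docs:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v2
          - uses: ruby/setup-ruby@v1
            with:
              ruby-version: '2.7'
          - name: Install Jekyll
            run: gem install jekyll jekyll-redirect-from rouge
          - name: Build the docs (skip API docs to keep the job fast)
            run: cd docs && SKIP_API=1 jekyll build
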

Bests,
Dongjoon.


On Tue, Feb 25, 2020 at 9:10 PM Hyukjin Kwon  wrote:

> Hm, we should still run this I believe. PR builders do not run doc build
> (more specifically `cd docs && jekyll build`)
>
> Fortunately, Javadoc, Scaladoc, SparkR documentation and PySpark API
> documentation are being tested in PR builder.
> However, the MD files themselves under `docs` and the SQL built-in
> function documentation (
> https://spark.apache.org/docs/latest/api/sql/index.html) are
> not being tested anymore, if I am not mistaken. I believe spark-master-docs
> was the only job that tested them.
>
> Would it be difficult to re-enable?
>
> On Wed, Feb 26, 2020 at 12:37 PM, shane knapp ☠ wrote:
>
>> it's been gone for quite a long time.  these docs were being built but
>> not published.
>>
>> relevant discussion:
>>
>> http://apache-spark-developers-list.1001551.n3.nabble.com/Re-moving-the-spark-jenkins-job-builder-repo-from-dbricks-spark-tp25325p26222.html
>>
>> shane
>>
>> On Tue, Feb 25, 2020 at 6:18 PM Hyukjin Kwon  wrote:
>>
>>> Hi all,
>>>
>>> I just noticed we apparently don't build the documentation in the
>>> Jenkins anymore.
>>> I remember we have the job:
>>> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-docs
>>> Does anybody know what happened to this job?
>>>
>>> Thanks.
>>>
>>
>>
>> --
>> Shane Knapp
>> Computer Guy / Voice of Reason
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>


`Target Version` management on correctness/data-loss Issues

2020-01-26 Thread Dongjoon Hyun
Hi, All.

After 2.4.5 RC1 vote failure, I asked your opinions about
correctness/dataloss issues (at mailing lists/JIRAs/PRs) in order to
collect the current status and public opinion widely in the community to
build a consensus on this at this time.

Before talking about those issues, please keep in mind that

- Apache Spark 2.4.x is the only live version because 2.3.x is EOL and
3.0.0 is not released.
- Apache Spark community has the following rule: "Correctness and data
loss issues should be considered Blockers."

Unfortunately, we didn't build a consensus on what is really blocked by
that. In reality, it was just our resolution for the quality and it works a
little differently.

In this email, I want to talk about correctness/dataloss issues and
observed public opinions. They fall into the following categories roughly.

1. Resolved in both 3.0.0 and 2.4.x
   - ex) SPARK-30447 Constant propagation nullability issue
   - No problem. However, this case sometimes goes to (2)

2. Resolved in both 3.0.0 and 2.4.x. But, reverted in 2.4.x later.
   - ex) SPARK-26021 -0.0 and 0.0 not treated consistently, doesn't match
Hive
   - "We don't want to change the behavior in the maintenence release"

3. Resolved in 3.0.0 and not backported because this is 3.0.0-specific.
   - ex) SPARK-29906 Reading of csv file fails with adaptive execution
turned on
   - No problem.

4. Resolved in 3.0.0 and not backported due to technical difficulty.
   - ex) SPARK-26154 Stream-stream joins - left outer join gives
inconsistent output
   - "This is not backported due to the technical difficulty"

5. Resolved in 3.0.0 and not backported because this is not public API.
   - ex) SPARK-29503 MapObjects doesn't copy Unsafe data when nested under
Safe data
   - "Since `catalyst` is not public, it's less worth backporting this."

6. Resolved in 3.0.0 and not backported because we forgot, since there was
no Target Version.
   - ex) SPARK-28375 Make pullupCorrelatedPredicate idempotent
   - "Adding the 'correctness' label so we remember to backport this fix to
2.4.x."
   - "This is possible, if users add the rule into
postHocOptimizationBatches"

7. Open with Target Version 3.0.0.
   - ex) SPARK-29701 Correct behaviours of group analytical queries when
empty input given
   - "We aren't fully SQL compliant there and I think that has been true
since the beginning of spark sql"
   - "This is not a regression"

8. Open without Target Version.
   - I removed this case last week to give more visibility on them.

Here, I want to emphasize that Apache Spark is a very healthy community,
because we have diverse opinions, and reevaluating JIRA issues is the result
of community decisions based on discussion. I believe that it will go
well eventually. In the above, I added those example JIRA IDs and the
collected reasons just to give some color and to illustrate that all the cases
are real. There is no case to be blamed in the above.

Although some JIRA issues will jump from one category into another category
from time to time, the categories will remain there. I want to propose a small
amount of additional work on `Target Version` to distinguish the above categories
easily and communicate clearly in the community. This should be done by
committers because we have the following policy on `Target Version`.

"Target Version. This is assigned by committers to indicate a PR has
been accepted for possible fix by the target version."

Proposed Idea:
A. To reduce the mismatch between `Target Version` vs `Affected
Version`:
   When a committer sets a `correctness` or `data-loss` label, `Target
Version` should be set as well, according to the `Affected Versions`.
   In case of an insufficient `Target Version` (e.g. `Target
Version`=`3.0.0` for `Affected Version`=`2.4.4,3.0.0`), he/she needs to add
a comment on the JIRA.
   For example, "This is 3.0.0-specific issue"

B. To reduce the mismatch between `Target Version` vs `Fixed Version`:
   When a committer resolves a `correctness` or `data-loss` labeled issue,
`Target Version` should be compared with `Fixed Version`.
   In case of an insufficient `Fixed Version` (e.g. `Target
Version`=`2.4.4,3.0.0` and `Fixed Version`=`3.0.0`), he/she needs to add a
comment on the JIRA and adjust `Target Version` according to his/her
decision.
   For example, "This is not backported due to the technical
difficulty. I'll remove `2.4.4` from `Target Version`."

With the above rules, the combination of `Affected Version` / `Target
Version` / `Fixed Version` will give us a much easier way to search them,
understand the categories, and discuss how to handle them properly.

Bests,
Dongjoon.


Re: Block a user from spark-website who repeatedly open the invalid same PR

2020-01-26 Thread Dongjoon Hyun
+1

On Sun, Jan 26, 2020 at 13:22 Shane Knapp  wrote:

> +1
>
> On Sun, Jan 26, 2020 at 10:01 AM Denny Lee  wrote:
> >
> > +1
> >
> > On Sun, Jan 26, 2020 at 09:59 Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
> >>
> >> +1
> >>
> >> I think y'all have shown this person more patience than is merited by
> their behavior.
> >>
> >> On Sun, Jan 26, 2020 at 5:16 AM Takeshi Yamamuro 
> wrote:
> >>>
> >>> +1
> >>>
> >>> Bests,
> >>> Takeshi
> >>>
> >>> On Sun, Jan 26, 2020 at 3:05 PM Hyukjin Kwon 
> wrote:
> 
>  Hi all,
> 
>  I am thinking about opening an infra ticket to block the @DataWanderer
> user, who repeatedly opens the same invalid PR, from the spark-website
>  repository.
> 
>  The PR is about fixing documentation in the released version 2.4.4,
> and it should be fixed in the spark
>  repository. This was explained multiple times by me and Sean, but this
> user opens the same PR
>  repeatedly, which brings overhead to the devs.
> 
>  See the PRs below:
> 
>  https://github.com/apache/spark-website/pull/257
>  https://github.com/apache/spark-website/pull/256
>  https://github.com/apache/spark-website/pull/255
>  https://github.com/apache/spark-website/pull/254
>  https://github.com/apache/spark-website/pull/250
>  https://github.com/apache/spark-website/pull/249
> 
>  If there is no objection, and this guy opens the PR again, I am going
> to open an infra ticket to block
>  this guy from the spark-website repo to prevent such behaviour.
> 
>  Please let me know if you guys have any concerns.
> 
> >>>
> >>>
> >>> --
> >>> ---
> >>> Takeshi Yamamuro
>
>
>
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: `Target Version` management on correctness/data-loss Issues

2020-01-27 Thread Dongjoon Hyun
Yes. That is what I pointed out in `Unfortunately, we didn't build a consensus
on what is really blocked by that.` If you are suggesting a vote, do you
mean a majority-win vote or an unanimous decision? Will it be a permanent
decision?

> I think the other interesting thing here is how exactly to come to
agreement on whether it needs to be fixed in a particular release. Like we
have been discussing on SPARK-29701. This could be a matter of opinion, so
should we do something like mail the dev list whenever one of these issues
is tagged if its not going to be back ported to an affected release?

The following seems to happen when the committers initially think like
"Seems behavioral to me and its been consistent so seems ok to skip for
2.4.5"
For example, SPARK-27619 MapType should be prohibited in hash expressions.

> A) I'm not clear on this one as to why affected and target would be
different initially,

BTW, in this email thread, I'm focusing on the `Target Version` management.
That is the only way to detect the community decision change.

Bests,
Dongjoon.

On Mon, Jan 27, 2020 at 11:12 AM Tom Graves  wrote:

> thanks for bringing this up.
>
> A) I'm not clear on this one as to why affected and target would be
> different initially, other than the reasons target versions != fixed
> versions.  Is the intention here just to say, if its already been discussed
> and came to consensus not needed in certain release?  The only other
> obvious time is in spark releases that are no longer maintained.
>
> I think the other interesting thing here is how exactly to come to
> agreement on whether it needs to be fixed in a particular release. Like we
> have been discussing on SPARK-29701. This could be a matter of opinion, so
> should we do something like mail the dev list whenever one of these issues
> is tagged if its not going to be back ported to an affected release?
>
> Tom
> On Sunday, January 26, 2020, 11:22:13 PM CST, Dongjoon Hyun <
> dongjoon.h...@gmail.com> wrote:
>
>
> Hi, All.
>
> After 2.4.5 RC1 vote failure, I asked your opinions about
> correctness/dataloss issues (at mailing lists/JIRAs/PRs) in order to
> collect the current status and public opinion widely in the community to
> build a consensus on this at this time.
>
> Before talking about those issues, please keep in mind that
>
> - Apache Spark 2.4.x is the only live version because 2.3.x is EOL and
> 3.0.0 is not released.
> - Apache Spark community has the following rule: "Correctness and data
> loss issues should be considered Blockers."
>
> Unfortunately, we didn't build a consensus on what is really blocked by
> that. In reality, it was just our resolution for the quality and it works a
> little differently.
>
> In this email, I want to talk about correctness/dataloss issues and
> observed public opinions. They fall into the following categories roughly.
>
> 1. Resolved in both 3.0.0 and 2.4.x
>- ex) SPARK-30447 Constant propagation nullability issue
>- No problem. However, this case sometimes goes to (2)
>
> 2. Resolved in both 3.0.0 and 2.4.x. But, reverted in 2.4.x later.
>- ex) SPARK-26021 -0.0 and 0.0 not treated consistently, doesn't match
> Hive
>- "We don't want to change the behavior in the maintenence release"
>
> 3. Resolved in 3.0.0 and not backported because this is 3.0.0-specific.
>- ex) SPARK-29906 Reading of csv file fails with adaptive execution
> turned on
>- No problem.
>
> 4. Resolved in 3.0.0 and not backported due to technical difficulty.
>- ex) SPARK-26154 Stream-stream joins - left outer join gives
> inconsistent output
>- "This is not backported due to the technical difficulty"
>
> 5. Resolved in 3.0.0 and not backported because this is not public API.
>- ex) SPARK-29503 MapObjects doesn't copy Unsafe data when nested under
> Safe data
>- "Since `catalyst` is not public, it's less worth backporting this."
>
> 6. Resolved in 3.0.0 and not backported because we forgot, since there was
> no Target Version.
>- ex) SPARK-28375 Make pullupCorrelatedPredicate idempotent
>- "Adding the 'correctness' label so we remember to backport this fix
> to 2.4.x."
>- "This is possible, if users add the rule into
> postHocOptimizationBatches"
>
> 7. Open with Target Version 3.0.0.
>- ex) SPARK-29701 Correct behaviours of group analytical queries when
> empty input given
>- "We aren't fully SQL compliant there and I think that has been true
> since the beginning of spark sql"
>- "This is not a regression"
>
> 8. Open without Target Version.
>- I removed this case last week to give more visibility on them.
>
> Here, I want to

Re: Spark 2.4.5 RC2 Preparation Status

2020-01-29 Thread Dongjoon Hyun
Great. Sean.

Then, what are your criteria for removing the 2.4.5 target?

It doesn't depend on `Who`, right?

Bests,
Dongjoon.


On Wed, Jan 29, 2020 at 9:56 AM Sean Owen  wrote:

> OK what if anything is in question for 2.4.5? I don't see anything open
> and targeted for it.
> Are we talking about https://issues.apache.org/jira/browse/SPARK-28344 -
> targeted for 2.4.5 but not backported, and a 'correctness' issue?
> Simply: who argues this must hold up 2.4.5, and if so what's the status?
>
> On Wed, Jan 29, 2020 at 11:27 AM Dongjoon Hyun 
> wrote:
>
>> Hi, Nicholas and all.
>>
>> RC2 is blocked by the community policy on correctness/dataloss issues.
>>
>> I cut RC1 when there was no correctness/dataloss issue targeting
>> 2.4.5. However, it failed because one correctness issue (target = 3.0.0) was
>> resolved and the community changed the target to 2.4.5 on the last day of the
>> RC1 vote.
>>
>> As of now, there exists a correctness issue targeting 2.4.5. As a release
>> manager, I cannot cut RC2 until there is no correctness/dataloss issue with
>> target=2.4.5. We need to fix it, or we need to move the target version to
>> 2.4.6.
>>
>> That's the current situation. I'm trying to follow the existing community
>> policies, but they seem too idealistic for this release. I'm trying to figure
>> out what the best option for 2.4.5 is for the community. Hopefully, we can at
>> least start RC2 without known risks.
>>
>> For non-correctness issues, it's up to the progress and decision on them.
>> Those issues are not blockers.
>>
>> Bests,
>> Dongjoon.
>>
>> On Wed, Jan 29, 2020 at 05:39 Nicholas Marion  wrote:
>>
>>> Hello,
>>>
>>> Was wondering if RC2 is expected to release soon? Any chance that
>>> https://issues.apache.org/jira/browse/SPARK-30310 could be added to
>>> branch-2.4 as well for 2.4.5 release? Especially since 2.4.x introduced the
>>> bug?
>>>
>>>
>>> Regards,
>>>
>>> *NICHOLAS T. MARION *
>>> IBM Open Data Analytics for z/OS - *CPO* and *Service Team Lead*
>>> --
>>> *Phone: *1-845-433-5010 | *Tie-Line: *293-5010
>>> *E-mail:* *nmar...@us.ibm.com* 
>>>
>>> 2455 South Rd
>>> Poughkeepie, New York 12601-5400
>>> United States
>>>
>>>
>>>
>>>
>>>
>>> From: Dongjoon Hyun 
>>> To: dev 
>>> Date: 01/20/2020 11:27 PM
>>> Subject: [EXTERNAL] Spark 2.4.5 RC2 Preparation Status
>>> --
>>>
>>>
>>>
>>> Hi, All.
>>>
>>> RC2 was scheduled for today and all RC1 feedback seems to be addressed.
>>> However, I'm waiting for another on-going correctness PR.
>>>
>>> https://github.com/apache/spark/pull/27233
>>> [SPARK-29701][SQL] Correct behaviours of group analytical queries
>>> when empty input given
>>>
>>> Unlike the other correctness issues (which I sent previously), this one is
>>> active enough to make RC2 fail. As we know, the Spark 2.4.5 RC1 vote failed
>>> because the correctness patch landed on the `master` branch during the RC1 vote
>>> period and there were official requests for backporting.
>>>
>>> https://github.com/apache/spark/pull/27229
>>> [SPARK-29708][SQL][2.4] Correct aggregated values when grouping sets
>>> are duplicated
>>>
>>> It's risky to start RC2 without considering it, because a VOTE also
>>> consumes community resources.
>>>
>>> BTW, if there is another notable ongoing PR for 2.4.5 RC1, please reply
>>> to me.
>>>
>>> Thanks,
>>> Dongjoon.
>>>
>>>
>>>


Re: `Target Version` management on correctness/data-loss Issues

2020-01-27 Thread Dongjoon Hyun
Hi, All.

Currently, there is only one correctness issue targeting 2.4.5.

SPARK-28344 Fail the query if detect ambiguous self join
-> Duplicated by
 SPARK-10892 Join with Data Frame returns wrong results
 SPARK-27547 fix DataFrame self-join problems
 SPARK-30218 Columns used in inequality conditions for joins not
resolved correctly in case of common lineage

As I sent yesterday, we revisited the correctness/dataloss issues and
reinitiated the further discussion. Also, the use of `Target Version` is
proposed. So, please set `Target Version` explicitly if you think there is
any other correctness/dataloss issue which is blocking 2.4.5 RC2.
Otherwise, it's very hard for the release manager to notice it in the
haystacks of JIRA comments and PR comments.

Bests,
Dongjoon.


On Mon, Jan 27, 2020 at 12:30 PM Dongjoon Hyun 
wrote:

> Yes. That is what I pointed out in `Unfortunately, we didn't build a consensus
> on what is really blocked by that.` If you are suggesting a vote, do you
> mean a majority-win vote or an unanimous decision? Will it be a permanent
> decision?
>
> > I think the other interesting thing here is how exactly to come to
> agreement on whether it needs to be fixed in a particular release. Like we
> have been discussing on SPARK-29701. This could be a matter of opinion, so
> should we do something like mail the dev list whenever one of these issues
> is tagged if its not going to be back ported to an affected release?
>
> The following seems to happen when the committers initially think like
> "Seems behavioral to me and its been consistent so seems ok to skip for
> 2.4.5"
> For example, SPARK-27619 MapType should be prohibited in hash expressions.
>
> > A) I'm not clear on this one as to why affected and target would be
> different initially,
>
> BTW, in this email thread, I'm focusing on the `Target Version` management.
> That is the only way to detect the community decision change.
>
> Bests,
> Dongjoon.
>
> On Mon, Jan 27, 2020 at 11:12 AM Tom Graves  wrote:
>
>> thanks for bringing this up.
>>
>> A) I'm not clear on this one as to why affected and target would be
>> different initially, other than the reasons target versions != fixed
>> versions.  Is the intention here just to say, if its already been discussed
>> and came to consensus not needed in certain release?  The only other
>> obvious time is in spark releases that are no longer maintained.
>>
>> I think the other interesting thing here is how exactly to come to
>> agreement on whether it needs to be fixed in a particular release. Like we
>> have been discussing on SPARK-29701. This could be a matter of opinion, so
>> should we do something like mail the dev list whenever one of these issues
>> is tagged if its not going to be back ported to an affected release?
>>
>> Tom
>> On Sunday, January 26, 2020, 11:22:13 PM CST, Dongjoon Hyun <
>> dongjoon.h...@gmail.com> wrote:
>>
>>
>> Hi, All.
>>
>> After 2.4.5 RC1 vote failure, I asked your opinions about
>> correctness/dataloss issues (at mailing lists/JIRAs/PRs) in order to
>> collect the current status and public opinion widely in the community to
>> build a consensus on this at this time.
>>
>> Before talking about those issues, please keep in mind that
>>
>> - Apache Spark 2.4.x is the only live version because 2.3.x is EOL
>> and 3.0.0 is not released.
>> - Apache Spark community has the following rule: "Correctness and
>> data loss issues should be considered Blockers."
>>
>> Unfortunately, we didn't build a consensus on what is really blocked by
>> that. In reality, it was just our resolution for the quality and it works a
>> little differently.
>>
>> In this email, I want to talk about correctness/dataloss issues and
>> observed public opinions. They fall into the following categories roughly.
>>
>> 1. Resolved in both 3.0.0 and 2.4.x
>>- ex) SPARK-30447 Constant propagation nullability issue
>>- No problem. However, this case sometimes goes to (2)
>>
>> 2. Resolved in both 3.0.0 and 2.4.x. But, reverted in 2.4.x later.
>>- ex) SPARK-26021 -0.0 and 0.0 not treated consistently, doesn't match
>> Hive
>>- "We don't want to change the behavior in the maintenence release"
>>
>> 3. Resolved in 3.0.0 and not backported because this is 3.0.0-specific.
>>- ex) SPARK-29906 Reading of csv file fails with adaptive execution
>> turned on
>>- No problem.
>>
>> 4. Resolved in 3.0.0 and not backported due to technical difficulty.
>>- ex) SPARK-26154 Stream-st

Re: Spark 2.4.5 RC2 Preparation Status

2020-01-29 Thread Dongjoon Hyun
Thanks, Sean.

If there is no further objection on the mailing list,
could you remove the `Target Version: 2.4.5` from the followings?

SPARK-28344 Fail the query if detect ambiguous self join
SPARK-29578 JDK 1.8.0_232 timezone updates cause "Kwajalein" test
failures again

Then, after the regular RC preparation testing including the manual
integration tests,
I can roll 2.4.5 RC2 next Monday (Feb. 3rd, PST) and all late blocker
patches will block 2.4.6 instead of causing RC failure.

Bests,
Dongjoon.


On Wed, Jan 29, 2020 at 12:16 PM Sean Owen  wrote:

> OK, that's specific. It's always a judgment call whether to hold the
> release train for one more fix or not. Depends on how impactful it is (harm
> of releasing without it), and how big it is (harm of delaying release of
> other fixes further). I think we tend to weight regressions from a previous
> 2.4.x release more heavily; those are typically Blockers, otherwise not.
> Otherwise once RCs start, we're driving primarily to a no-Blocker release.
> The default should be to punt to 2.4.6 -- which can come relatively soon if
> one wants.
>
> SPARK-28125 is not even a bug, I'd argue, let alone Blocker. Looks like it
> was marked 'correctness' by the reporter. It's always been the case since
> Spark 1.0 (i.e. not a regression) that RDDs need to be deterministic for
> most of the semantics one expects to work out. If it isn't, many bets are
> off. I get that this is a 'gotcha', but it isn't even about the
> randomSplit. If anything recomputes the RDD, it could be different.
>
> SPARK-28067, I don't know anything about, but also is being reported as
> not a 2.4.x regression, and I don't see anyone working on it. For that
> reason, not sure it's a Blocker for 2.4.x.
>
> SPARK-30310 is not a 2.4.x regression either, nor particularly critical
> IMHO. Doesn't mean we can't back-port it to 2.4 though, and it's 'done' (in
> master)
>
> Anything else? not according to JIRA at least.
>
> I think it's valid to continue with RC2 assuming none of these are
> necessary for 2.4.5.
> It's not wrong to 'wait' if there are strong feelings about something,
> but, if we can't see a reason to expect the situation changes in a week, 2
> weeks, then, why? The release of 2.4.5 nowish doesn't necessarily make the
> release of said fix much further away -- in 2.4.6.
>
> On Wed, Jan 29, 2020 at 1:28 PM Dongjoon Hyun 
> wrote:
>
>> > SPARK-28125 dataframes created by randomSplit have overlapping rows
>> > Seems like something we should fix
>> > SPARK-28067 Incorrect results in decimal aggregation with
>> whole-stage code gen enabled
>> > Seems like we should fix
>>
>> Here, I'm trying to narrow down our focus to the issues with `Explicit
>> Target Version` and continue to release. In other words, as a release
>> manager, I hope I can officially ignore the other correctness issues which
>> are not explicitly targeting 2.4.5.
>>
>> Most correctness issues are long-standing and cause behavior changes.
>> During a maintenance RC vote, for those kinds of issues, I hope we set the
>> Target Version to `2.4.6` instead of casting a veto on the RC. It's the same
>> policy as with Fix Version. During the RC vote period, Fix Version is set to the next
>> version `2.4.6` instead of the current RC `2.4.5`. Since maintenance
>> happens more frequently, I believe that's okay.
>>
>>>
>>>


Re: Spark 2.4.5 RC2 Preparation Status

2020-01-29 Thread Dongjoon Hyun
Got it. Thanks!

Bests,
Dongjoon.

On Wed, Jan 29, 2020 at 1:40 PM Sean Owen  wrote:

> OK, we can wait a tick to confirm there aren't strong objections.
> I suppose I'd prefer someone who knows
> https://issues.apache.org/jira/browse/SPARK-28344 to confirm it was
> either erroneously targeted to 2.4, or else it's valid, but, not
> critical for the RC. Hearing nothing else shortly, I'd untarget it.
>
> SPARK-29578 is a tiny low-risk test change but probably worth picking
> up to avoid failing on certain JDKs during testing. I'll make a
> back-port, as this should be noncontroversial. (Not sure why I didn't
> backport originally)
>
> On Wed, Jan 29, 2020 at 3:27 PM Dongjoon Hyun 
> wrote:
> >
> > Thanks, Sean.
> >
> > If there is no further objection to the mailing list,
> > could you remove the `Target Version: 2.4.5` from the followings?
> >
> > SPARK-28344 Fail the query if detect ambiguous self join
> > SPARK-29578 JDK 1.8.0_232 timezone updates cause "Kwajalein" test
> failures again
> >
> > Then, after the regular RC preparation testing including the manual
> integration tests,
> > I can roll 2.4.5 RC2 next Monday (Feb. 3rd, PST) and all late blocker
> patches will block 2.4.6 instead of causing RC failure.
> >
> > Bests,
> > Dongjoon.
> >
> >
> > On Wed, Jan 29, 2020 at 12:16 PM Sean Owen  wrote:
> >>
> >> OK, that's specific. It's always a judgment call whether to hold the
> release train for one more fix or not. Depends on how impactful it is (harm
> of releasing without it), and how big it is (harm of delaying release of
> other fixes further). I think we tend to weight regressions from a previous
> 2.4.x release more heavily; those are typically Blockers, otherwise not.
> Otherwise once RCs start, we're driving primarily to a no-Blocker release.
> The default should be to punt to 2.4.6 -- which can come relatively soon if
> one wants.
> >>
> >> SPARK-28125 is not even a bug, I'd argue, let alone Blocker. Looks like
> it was marked 'correctness' by the reporter. It's always been the case
> since Spark 1.0 (i.e. not a regression) that RDDs need to be deterministic
> for most of the semantics one expects to work out. If it isn't, many bets
> are off. I get that this is a 'gotcha', but it isn't even about the
> randomSplit. If anything recomputes the RDD, it could be different.
> >>
> >> SPARK-28067, I don't know anything about, but also is being reported as
> not a 2.4.x regression, and I don't see anyone working on it. For that
> reason, not sure it's a Blocker for 2.4.x.
> >>
> >> SPARK-30310 is not a 2.4.x regression either, nor particularly critical
> IMHO. Doesn't mean we can't back-port it to 2.4 though, and it's 'done' (in
> master)
> >>
> >> Anything else? not according to JIRA at least.
> >>
> >> I think it's valid to continue with RC2 assuming none of these are
> necessary for 2.4.5.
> >> It's not wrong to 'wait' if there are strong feelings about something,
> but, if we can't see a reason to expect the situation changes in a week, 2
> weeks, then, why? The release of 2.4.5 nowish doesn't necessarily make the
> release of said fix much further away -- in 2.4.6.
> >>
> >> On Wed, Jan 29, 2020 at 1:28 PM Dongjoon Hyun 
> wrote:
> >>>
> >>> > SPARK-28125 dataframes created by randomSplit have overlapping
> rows
> >>> > Seems like something we should fix
> >>> > SPARK-28067 Incorrect results in decimal aggregation with
> whole-stage code gen enabled
> >>> > Seems like we should fix
> >>>
> >>> Here, I'm trying to narrow down our focus to the issues with `Explicit
> Target Version` and continue to release. In other words, as a release
> manager, I hope I can officially ignore the other correctness issues which
> are not explicitly targeting 2.4.5.
> >>>
> >>> Most correctness issues are long-standing and cause behavior changes.
> During a maintenance RC vote, for those kinds of issues, I hope we set the
> Target Version to `2.4.6` instead of casting a veto on the RC. It's the same
> policy as with Fix Version. During the RC vote period, Fix Version is set to the next
> version `2.4.6` instead of the current RC `2.4.5`. Since maintenance
> happens more frequently, I believe that's okay.
> >>>>
> >>>>
>


Re: Spark 3.0 and ORC 1.6

2020-01-29 Thread Dongjoon Hyun
Hi, David.

Thank you for sharing your opinion.
I'm also a supporter for ZStandard.

Apache Spark 3.0 starts to take advantage of ZStd a lot.

   1) Switch the default codec for MapOutputStatus from GZip to ZStd.
   2) Add spark.eventLog.compression.codec to allow ZStd.
   3) Use Parquet+ZStd easily for data storage off the shelf.
   (with Hadoop 3.2 pre-built distribution)
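
As a rough illustration of items 2) and 3) above, a sketch of how a user could opt in; the event-log keys come from the list above, while the builder usage, the `zstd` option value, and the Hadoop 3.2 pre-built distribution are assumptions:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("zstd-sketch")
      .config("spark.eventLog.enabled", "true")
      .config("spark.eventLog.compress", "true")
      // item 2) above: choose ZStd for event log compression
      .config("spark.eventLog.compression.codec", "zstd")
      .getOrCreate()

    // item 3) above: Parquet + ZStd off the shelf
    spark.range(1000).write.option("compression", "zstd").parquet("/tmp/zstd_parquet")
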

So, the last big missing piece is ORC, right.

As a PMC member of both Apache Spark and Apache ORC,
I've been trying to reduce those gaps of both communities
in order to maximize the synergy.

Historically,

   1) At Apache Spark 2.3.0, Apache Spark started to depend on Apache ORC
1.4.0 (SPARK-21422)
   2) At Apache Spark 2.4.0, Apache ORC 1.5.2 becomes the default ORC
library. (SPARK-23456)
   3) At Apache Spark 3.0.0, Apache Spark embraced the breaking change of
Apache ORC 1.5.6 (SPARK-28208)
  And, Apache Spark 3.0.0-preview2 caught up to ORC 1.5.8, while
Spark 2.4.5 and 2.4.6 will stay with 1.5.5.

However, Apache ORC 1.5.9 RC0 also had another breaking change. Although we
minimized the regression in 1.5.9 RC1, we still need to adapt to the
following. (Please see the ORC 1.5.9 RC0/RC1 vote emails for the details.)

   - Upgrade hive-storage-api upgrade from 2.6.0 to 2.7.1
   - Add new dependency `threeten-extra-1.5.0.jar`

The above breaking changes will be a subset of those of ORC 1.6.x. And, we
need to validate any potential performance regression in 1.6.x, too. I hope
we can use Apache ORC 1.5.9 as a stepping stone to reach Apache ORC 1.6.x.

I'll create a PR to upgrade to Apache ORC 1.5.9 as soon as possible, but the
Spark community will decide whether or not it will be in 3.0.0.

In short, given the circumstances, Apache Spark 3.1.0 will be a safer
candidate for Apache ORC 1.6.x adoption. Spark 3.1.0 will arrive
six months after Spark 3.0.0, per our release cadence. At that time, we
can focus on ORC improvements with more references. (For now, Apache ORC
1.6 is not used at Apache Hive, either.)

Bests,
Dongjoon.


On Tue, Jan 28, 2020 at 12:41 PM David Christle
 wrote:

> Hi all,
>
>
>
> I am a heavy user of Spark at LinkedIn, and am excited about the ZStandard
> compression option recently incorporated into ORC 1.6. I would love to
> explore using it for storing/querying of large (>10 TB) tables for my own
> disk I/O intensive workloads, and other users & companies may be interested
> in adopting ZStandard more broadly, since it seems to offer faster
> compression speeds at higher compression ratios with better multi-threaded
> support than zlib/Snappy. At scale, improvements of even ~10% on disk
> and/or compute, hopefully just from setting the “orc.compress” flag to a
> different value, could translate into palpable gains in capacity/cost
> cluster wide without requiring broad engineering migrations. See a somewhat
> recent FB Engineering blog post on the topic for their reported
> experiences: https://engineering.fb.com/core-data/zstandard/
>
>
>
> Do we know if ORC 1.6.x will make the cut for Spark 3.0?
>
>
>
> A recent PR (https://github.com/apache/spark/pull/26669) updated ORC to
> 1.5.8, but I don’t have a good understanding of how difficult incorporating
> ORC 1.6.x into Spark will be. For instance, in the PRs for enabling Java
> Zstd in ORC (https://github.com/apache/orc/pull/306 &
> https://github.com/apache/orc/pull/412), some additional work/discussion
> around Hadoop shims occurred to maintain compatibility across different
> versions of Hadoop (e.g. 2.7) and aircompressor (a library containing Java
> implementations of various compression codecs, so that dependence on Hadoop
> 2.9 is not required). Again, these may be non-issues, but I wanted to
> kindle discussion around whether this can make the cut for 3.0, since I
> imagine it’s a major upgrade many users will focus on migrating to once
> released.
>
>
>
> Kind regards,
>
> David Christle
>


Apache Spark Docker image repository

2020-02-05 Thread Dongjoon Hyun
Hi, All.

From 2020, shall we have an official Docker image repository as an
additional distribution channel?

I'm considering the following images.

- Public binary release (no snapshot image)
- Public non-Spark base image (OS + R + Python)
  (This can be used in GitHub Action Jobs and Jenkins K8s Integration
Tests to speed up jobs and to have more stable environments)

Bests,
Dongjoon.


Re: Spark 3.0 branch cut and code freeze on Jan 31?

2020-02-04 Thread Dongjoon Hyun
Thank you, Shane! :D

Bests,
Dongjoon

On Tue, Feb 4, 2020 at 13:28 shane knapp ☠  wrote:

> all the 3.0 builds have been created and are currently churning away!
>
> (the failed builds were due to a silly bug in the build scripts sneaking its
> way back in, but that's resolved now)
>
> shane
>
> On Sat, Feb 1, 2020 at 6:16 PM Reynold Xin  wrote:
>
>> Note that branch-3.0 was cut. Please focus on testing, polish, and let's
>> get the release out!
>>
>>
>> On Wed, Jan 29, 2020 at 3:41 PM, Reynold Xin  wrote:
>>
>>> Just a reminder - code freeze is coming this Fri!
>>>
>>> There can always be exceptions, but those should be exceptions and
>>> discussed on a case by case basis rather than becoming the norm.
>>>
>>>
>>>
>>> On Tue, Dec 24, 2019 at 4:55 PM, Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
>>>> Jan 31 sounds good to me.
>>>>
>>>> Just curious, do we allow some exceptions to the code freeze? One thing that
>>>> came to my mind is that some features could have multiple subtasks, where some
>>>> subtasks have been merged and other subtask(s) are still in review. In this
>>>> case, do we allow these subtasks a few more days to get reviewed and
>>>> merged later?
>>>>
>>>> Happy Holiday!
>>>>
>>>> Thanks,
>>>> Jungtaek Lim (HeartSaVioR)
>>>>
>>>> On Wed, Dec 25, 2019 at 8:36 AM Takeshi Yamamuro 
>>>> wrote:
>>>>
>>>>> Looks nice, happy holiday, all!
>>>>>
>>>>> Bests,
>>>>> Takeshi
>>>>>
>>>>> On Wed, Dec 25, 2019 at 3:56 AM Dongjoon Hyun 
>>>>> wrote:
>>>>>
>>>>>> +1 for January 31st.
>>>>>>
>>>>>> Bests,
>>>>>> Dongjoon.
>>>>>>
>>>>>> On Tue, Dec 24, 2019 at 7:11 AM Xiao Li 
>>>>>> wrote:
>>>>>>
>>>>>>> Jan 31 is pretty reasonable. Happy Holidays!
>>>>>>>
>>>>>>> Xiao
>>>>>>>
>>>>>>> On Tue, Dec 24, 2019 at 5:52 AM Sean Owen  wrote:
>>>>>>>
>>>>>>>> Yep, always happens. Is earlier realistic, like Jan 15? It's all
>>>>>>>> arbitrary, but indeed this has been in progress for a while, and
>>>>>>>> there's a downside to not releasing it: it makes the gap to 3.0 larger.
>>>>>>>> On my end I don't know of anything that's holding up a release; is
>>>>>>>> it basically DSv2?
>>>>>>>>
>>>>>>>> BTW these are the items still targeted to 3.0.0, some of which may
>>>>>>>> not have been legitimately tagged. It may be worth reviewing what's 
>>>>>>>> still
>>>>>>>> open and necessary, and what should be untargeted.
>>>>>>>>
>>>>>>>> SPARK-29768 nondeterministic expression fails column pruning
>>>>>>>> SPARK-29345 Add an API that allows a user to define and observe
>>>>>>>> arbitrary metrics on streaming queries
>>>>>>>> SPARK-29348 Add observable metrics
>>>>>>>> SPARK-29429 Support Prometheus monitoring natively
>>>>>>>> SPARK-29577 Implement p-value simulation and unit tests for chi2
>>>>>>>> test
>>>>>>>> SPARK-28900 Test Pyspark, SparkR on JDK 11 with run-tests
>>>>>>>> SPARK-28883 Fix a flaky test: ThriftServerQueryTestSuite
>>>>>>>> SPARK-28717 Update SQL ALTER TABLE RENAME  to use TableCatalog API
>>>>>>>> SPARK-28588 Build a SQL reference doc
>>>>>>>> SPARK-28629 Capture the missing rules in HiveSessionStateBuilder
>>>>>>>> SPARK-28684 Hive module support JDK 11
>>>>>>>> SPARK-28548 explain() shows wrong result for persisted DataFrames
>>>>>>>> after some operations
>>>>>>>> SPARK-28264 Revisiting Python / pandas UDF
>>>>>>>> SPARK-28301 fix the behavior of table name resolution with
>>>>>>>> multi-catalog
>>>>>>>> SPARK-28155 do not leak SaveMode to file source v2
>>>>>>>> SPARK-28103 Cannot infer filters fr

[VOTE][RESULT] Spark 2.4.5 (RC2)

2020-02-05 Thread Dongjoon Hyun
Hi, All.

The vote passes. Thanks to all who helped with this release 2.4.5!
I'll follow up later with a release announcement once everything is
published.

+1 (* = binding):
- Dongjoon Hyun *
- Wenchen Fan *
- Hyukjin Kwon *
- Takeshi Yamamuro
- Maxim Gekk
- Sean Owen *

+0: None

-1: None

Bests,
Dongjoon.


[FYI] `Target Version` on `Improvement`/`New Feature` JIRA issues

2020-02-01 Thread Dongjoon Hyun
Hi, All.

From today, we have `branch-3.0` as a tool for `Feature Freeze`.

https://github.com/apache/spark/tree/branch-3.0

All open JIRA issues whose type is `Improvement` or `New Feature` and which had
`3.0.0` as a `Target Version` have been updated accordingly, as a first step:

- Most of them were re-targeted to `3.1.0`.
- Some of them were resolved according to the JIRA content.
- Some unauthorized target versions were removed according to the
community policy.

To sum up, we have no open `Improvement/New Feature` JIRA issues officially
targeting `3.0.0`. For exceptional cases, we will discuss them case by
case during the `3.0.0 QA` phase.

Bests,
Dongjoon


Revise the blocker policy

2020-01-31 Thread Dongjoon Hyun
Hi, All.

We discussed the correctness/data-loss policies for two weeks.
To match our practice, I want to revise our policy on our website
explicitly.

- Correctness and data loss issues should be considered Blockers
+ Correctness and data loss issues should be considered Blockers for their
target versions.

I made a PR. Please review it and give me your feedback.

https://github.com/apache/spark-website/pull/258/files

Bests,
Dongjoon.


Re: [VOTE] Release Apache Spark 2.4.5 (RC2)

2020-02-03 Thread Dongjoon Hyun
Yes, it does officially since 2.4.0.

2.4.5 is a maintenance release of the 2.4.x line, and the community didn't
support Hadoop 3.x on `branch-2.4`. We didn't run tests for it at all.

Bests,
Dongjoon.

On Sun, Feb 2, 2020 at 22:58 Ajith shetty  wrote:

> Is the hadoop-3.1 profile supported for this release? I see a lot of UTs
> failing under this profile.
> https://github.com/apache/spark/blob/v2.4.5-rc2/pom.xml
>
> *Example:*
>  [INFO] Running org.apache.spark.sql.hive.JavaMetastoreDataSourcesSuite
> [ERROR] Tests run: 3, Failures: 0, Errors: 3, Skipped: 0, Time elapsed:
> 1.717 s <<< FAILURE! - in
> org.apache.spark.sql.hive.JavaMetastoreDataSourcesSuite
> [ERROR]
> saveExternalTableAndQueryIt(org.apache.spark.sql.hive.JavaMetastoreDataSourcesSuite)
> Time elapsed: 1.675 s  <<< ERROR!
> java.lang.ExceptionInInitializerError
> at
> org.apache.spark.sql.hive.JavaMetastoreDataSourcesSuite.setUp(JavaMetastoreDataSourcesSuite.java:66)
> Caused by: java.lang.IllegalArgumentException: *Unrecognized Hadoop major
> version number: 3.1.0*
> at
> org.apache.spark.sql.hive.JavaMetastoreDataSourcesSuite.setUp(JavaMetastoreDataSourcesSuite.java:66)
>


Re: new branch-3.0 jenkins job configs are ready to be deployed...

2020-01-31 Thread Dongjoon Hyun
Thank you, Shane.

BTW, we need to enable the JDK11 unit test runs for Python and R. (Currently, it's
only tested in the PRBuilder.)

https://issues.apache.org/jira/browse/SPARK-28900

Today, Thomas and I are hitting a Python UT failure in the JDK11 environment in
independent PRs.

ERROR [32.750s]: test_parameter_accuracy
(pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
--
...
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/mllib/tests/test_streaming_algorithms.py",
line 226, in condition
self.assertAlmostEqual(rel, 0.1, 1)
AssertionError: 0.17619737864096185 != 0.1 within 1 places

Although I'm investigating it now, we need the Jenkins jobs as a third-party
validator during the 3.0.0 QA period.

Bests,
Dongjoon


On Fri, Jan 31, 2020 at 11:26 AM Xiao Li  wrote:

> Thank you always, Shane!
>
> Xiao
>
> On Fri, Jan 31, 2020 at 11:19 AM shane knapp ☠ 
> wrote:
>
>> ...whenever i get the word.  :)
>>
>> FWIW they will all be identical to the current group of master
>> builds/tests.
>>
>> shane
>> --
>> Shane Knapp
>> Computer Guy / Voice of Reason
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
> --
> 
>


Re: new branch-3.0 jenkins job configs are ready to be deployed...

2020-01-31 Thread Dongjoon Hyun
Oops. I found that this flaky test fails even in `Hadoop 2.7 with Hive 1.2`.

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7-hive-1.2/lastCompletedBuild/testReport/pyspark.mllib.tests.test_streaming_algorithms/StreamingLogisticRegressionWithSGDTests/test_parameter_accuracy/

Anyway, I'll file a JIRA issue for this Python flakiness.

Bests,
Dongjoon.


On Fri, Jan 31, 2020 at 5:17 PM Dongjoon Hyun 
wrote:

> Thank you, Shane.
>
> BTW, we need to enable the JDK11 unit test runs for Python and R. (Currently, it's
> only tested in the PRBuilder.)
>
> https://issues.apache.org/jira/browse/SPARK-28900
>
> Today, Thomas and I are hitting a Python UT failure in the JDK11 environment in
> independent PRs.
>
> ERROR [32.750s]: test_parameter_accuracy
> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
> --
> ...
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/mllib/tests/test_streaming_algorithms.py",
> line 226, in condition
> self.assertAlmostEqual(rel, 0.1, 1)
> AssertionError: 0.17619737864096185 != 0.1 within 1 places
>
> Although I'm investigating it now, we need the Jenkins jobs as a third-party
> validator during the 3.0.0 QA period.
>
> Bests,
> Dongjoon
>
>
> On Fri, Jan 31, 2020 at 11:26 AM Xiao Li  wrote:
>
>> Thank you always, Shane!
>>
>> Xiao
>>
>> On Fri, Jan 31, 2020 at 11:19 AM shane knapp ☠ 
>> wrote:
>>
>>> ...whenever i get the word.  :)
>>>
>>> FWIW they will all be identical to the current group of master
>>> builds/tests.
>>>
>>> shane
>>> --
>>> Shane Knapp
>>> Computer Guy / Voice of Reason
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>
>> --
>> <https://databricks.com/sparkaisummit/north-america>
>>
>


Re: Apache Spark Docker image repository

2020-02-07 Thread Dongjoon Hyun
Thank you, Sean, Jiaxin, Shane, and Tom, for feedbacks.

1. For legal questions, please see the following three Apache-approved
approaches. We can follow one of them.

   1. https://hub.docker.com/u/apache (93 repositories,
Airflow/NiFi/Beam/Druid/Zeppelin/Hadoop/...)
   2. https://hub.docker.com/_/solr (This is also official. There are
more instances like this.)
   3. https://hub.docker.com/u/apachestreampipes (Some projects try
this form.)

2. For non-Spark dev-environment images, these will definitely help both our
Jenkins and GitHub Action jobs. The Apache Infra team also supports GitHub
Action secrets, like the following.

   https://issues.apache.org/jira/browse/INFRA-19565 Create a Docker
Hub secret for Github Actions

3. For Spark image content questions, we should not do the following. This is
not only because of legal issues, but also because we cannot include or maintain
all popular libraries (like the Nvidia libraries or TensorFlow) in our image.

   https://issues.apache.org/jira/browse/SPARK-26398 Support building
GPU docker images

4. The way I see this is as a minimal legal image containing only our
artifacts from the following. We can check the other Apache repos' best
practices.

   https://www.apache.org/dist/spark/

5. For OS/Java/Python/R runtimes and libraries, those (except the OS) can
generally be overlaid as additional layers by the users. I don't think
we need to provide every combination of (Debian/Ubuntu/CentOS/Alpine) x
(JDK/JRE) x (Python2/Python3/PyPy) x (R 3.6/3.6) x (many libraries).
Specifically, I don't think we need to install all libraries like `arrow`.

6. For the target users, this is a general Docker image. We don't need to
assume that this is for a K8s-only environment. This can be used in any
Docker environment.

7. For the number of images, as suggested in this thread, we may want to
follow our existing K8s integration test suite approach by splitting the PySpark
and R images from the Java one. But I don't have any strong requirement for this.

What I want to propose in this thread is that we can start with a minimal
viable product and evolve it (if needed) as an open source community.

Bests,
Dongjoon.

PS. BTW, the Apache Spark 2.4.5 artifacts have been published to our doc website,
our distribution repo, Maven Central, PyPI, CRAN, and Homebrew.
   I'm preparing the website news and download page updates.


On Thu, Feb 6, 2020 at 11:19 AM Tom Graves  wrote:

> When discussions of Docker have occurred in the past - mostly related to
> k8s - there was a lot of discussion about what the right image to publish
> is, as well as making sure Apache is OK with it. The Apache official
> release is the source code, so we may need to make sure to have a disclaimer,
> and we need to make sure it doesn't contain anything licensed that it
> shouldn't.  What happens when one of the Docker images we publish has a
> security update? We would need to make sure all the legal bases are covered
> first.
>
> Then the discussion comes down to what is in the Docker images and how useful
> they are. People run different OSes, different Python versions, etc. And, like
> Sean mentioned, how useful is it really, other than for a few examples?  Some
> discussions on https://issues.apache.org/jira/browse/SPARK-24655
>
> Tom
>
>
>
> On Wednesday, February 5, 2020, 02:16:37 PM CST, Dongjoon Hyun <
> dongjoon.h...@gmail.com> wrote:
>
>
> Hi, All.
>
> From 2020, shall we have an official Docker image repository as an
> additional distribution channel?
>
> I'm considering the following images.
>
> - Public binary release (no snapshot image)
> - Public non-Spark base image (OS + R + Python)
>   (This can be used in GitHub Action Jobs and Jenkins K8s Integration
> Tests to speed up jobs and to have more stable environments)
>
> Bests,
> Dongjoon.
>


Re: [spark-packages.org] Jenkins down

2020-01-24 Thread Dongjoon Hyun
Thank you for working on that, Xiao.

BTW, I'm wondering why SPARK-30636 is a blocker for 2.4.5 release?

Do you mean `Critical`?

Bests,
Dongjoon.

On Fri, Jan 24, 2020 at 10:20 AM Xiao Li  wrote:

> Hi, all,
>
> Because the Jenkins instance for spark-packages.org is down, new packages or
> releases cannot be created on spark-packages.org.
>
> Now, we are working on it. For the latest status, please follow the ticket
> https://issues.apache.org/jira/browse/SPARK-30636.
>
> Happy lunar new year,
>
> Xiao
>
>


Re: [spark-packages.org] Jenkins down

2020-01-24 Thread Dongjoon Hyun
Thank you for updating!

On Fri, Jan 24, 2020 at 10:29 AM Xiao Li  wrote:

> It does not block any Spark release. Reduced the priority to Critical.
>
> Cheers,
>
> Xiao
>
> Dongjoon Hyun  wrote on Fri, Jan 24, 2020 at 10:24 AM:
>
>> Thank you for working on that, Xiao.
>>
>> BTW, I'm wondering why SPARK-30636 is a blocker for 2.4.5 release?
>>
>> Do you mean `Critical`?
>>
>> Bests,
>> Dongjoon.
>>
>> On Fri, Jan 24, 2020 at 10:20 AM Xiao Li  wrote:
>>
>>> Hi, all,
>>>
>>> Because the Jenkins instance for spark-packages.org is down, new packages or
>>> releases cannot be created on spark-packages.org.
>>>
>>> Now, we are working on it. For the latest status, please follow the
>>> ticket https://issues.apache.org/jira/browse/SPARK-30636.
>>>
>>> Happy lunar new year,
>>>
>>> Xiao
>>>
>>>


Re: [DISCUSS][SPARK-30275] Discussion about whether to add a gitlab-ci.yml file

2020-01-23 Thread Dongjoon Hyun
Hi, Jim.

Thank you for the proposal. I understand the request.
However, the following key benefit sounds like unofficial snapshot binary
releases.

> For example, this was used to build a version of spark that included
SPARK-28938 which has yet to be released and was necessary for
spark-operator to work properly with GKE service accounts

Historically, we removed the existing snapshot binaries from some personal
repositories, and there is no plan to add them back.
Also, for snapshot dev jars, we use only the official Apache Maven snapshot
repository.

For official releases, we aim to release Apache Spark source code (and its
artifacts) according to the pre-defined release cadence in an official
manner.

BTW, SPARK-28938 doesn't mean that we need to publish a Docker image. Even
for the official release, as you know, we only provide a reference
Dockerfile. That's the reason why we don't publish a Docker image via GitHub
Actions (as of today).

To achieve the following custom requirement, I'd recommend that you
maintain your own Dockerfile.
That is the best way for you to keep the flexibility.

> One value of this is the ability to create versions of dependent packages
such as spark-on-k8s-operator

Thanks,
Dongjoon.


On Thu, Jan 23, 2020 at 9:32 AM Jim Kleckner  wrote:

> This story [1] proposes adding a .gitlab-ci.yml file to make it easy to
> create artifacts and images for spark.
>
> Using this mechanism, people can submit any subsequent version of spark for
> building and image hosting with gitlab.com.
>
> There is a companion WIP branch [2] with a candidate and example for doing
> this.
> The exact steps for building are in the yml file [3].
> The images get published into the namespace of the user as here [4]
>
> One value of this is the ability to create versions of dependent packages
> such as spark-on-k8s-operator that might use upgraded packages or
> modifications for testing.  For example, this was used to build a version
> of spark that included SPARK-28938 which has yet to be released and was
> necessary for spark-operator to work properly with GKE service accounts
> [5].
>
> Comments about desirability?
>
> [1] https://issues.apache.org/jira/browse/SPARK-30275
> [2] https://gitlab.com/jkleckner/spark/tree/add-gitlab-ci-yml
> [3] https://gitlab.com/jkleckner/spark
> /blob/add-gitlab-ci-yml/.gitlab-ci.yml
> [4] https://gitlab.com/jkleckner/spark/container_registry
> [5] https://gitlab.com/jkleckner/spark-on-k8s-operator/container_registry
>


Re: `Target Version` management on correctness/data-loss Issues

2020-01-28 Thread Dongjoon Hyun
Thanks, Tom.

I agree that emails are good for urgent announcements and for reaching
agreement quickly. They are also more visible over a short time period.

However, some correctness issues are long-standing, and sometimes they
change their faces under different JIRA IDs. We can see the relationship
easily in JIRA, but it's difficult in an email thread. Also, email search
is not so helpful because each individual email is read-only and not
itemized.

BTW, to all:
I'm continuing these correctness threads from multiple perspectives because
our RC process seems to be flaky. If we have a flaky test, we try to
fix it. Why not the flaky RC process? An RC is designed to be okay to fail,
but that doesn't mean we shouldn't have an efficient RC process.

The main root cause of RC failures is our insufficient management of, and
agreement on, `Target Version`.

Bests,
Dongjoon.

On Tue, Jan 28, 2020 at 5:47 AM Tom Graves  wrote:

> I was just thinking of an info email (perhaps tagged with
> correctness/dataloss) to dev rather than an official vote; that way it's
> more visible, and if anyone sees it and disagrees with the targeting, it can
> be discussed on that thread.  It might also just bring more visibility to
> those important issues and get people interested in working on them sooner.
>
> Tom
>
> On Monday, January 27, 2020, 02:31:03 PM CST, Dongjoon Hyun <
> dongjoon.h...@gmail.com> wrote:
>
>
> Yes. That is what I pointed out in `Unfortunately, we didn't build a consensus
> on what is really blocked by that.` If you are suggesting a vote, do you
> mean a majority-win vote or a unanimous decision? Will it be a permanent
> decision?
>
> I think the other interesting thing here is how exactly to come to an
> agreement on whether it needs to be fixed in a particular release, like we
> have been discussing on SPARK-29701. This could be a matter of opinion, so
> should we do something like mail the dev list whenever one of these issues
> is tagged if it's not going to be backported to an affected release?
>
> The following seems to happen when the committers initially think something like
> "Seems behavioral to me and it's been consistent, so seems OK to skip for
> 2.4.5".
> For example, SPARK-27619: MapType should be prohibited in hash expressions.
>
> > A) I'm not clear on this one as to why affected and target would be
> different initially,
>
> BTW, in this email thread, I'm focusing on the `Target Version` management.
> That is the only way to detect changes in the community's decisions.
>
> Bests,
> Dongjoon.
>
> On Mon, Jan 27, 2020 at 11:12 AM Tom Graves  wrote:
>
> thanks for bringing this up.
>
> A) I'm not clear on this one as to why affected and target would be
> different initially, other than the reasons target versions != fixed
> versions.  Is the intention here just to say, if it's already been discussed
> and consensus was reached that it's not needed in a certain release?  The only
> other obvious time is in Spark releases that are no longer maintained.
>
> I think the other interesting thing here is how exactly to come to an
> agreement on whether it needs to be fixed in a particular release, like we
> have been discussing on SPARK-29701. This could be a matter of opinion, so
> should we do something like mail the dev list whenever one of these issues
> is tagged if it's not going to be backported to an affected release?
>
> Tom
> On Sunday, January 26, 2020, 11:22:13 PM CST, Dongjoon Hyun <
> dongjoon.h...@gmail.com> wrote:
>
>
> Hi, All.
>
After the 2.4.5 RC1 vote failure, I asked for your opinions about
correctness/dataloss issues (on mailing lists/JIRAs/PRs) in order to
collect the current status and public opinion widely in the community, to
build a consensus on this at this time.
>
Before talking about those issues, please remember that
>
> - Apache Spark 2.4.x is the only live version because 2.3.x is EOL and
> 3.0.0 is not released.
> - Apache Spark community has the following rule: "Correctness and data
> loss issues should be considered Blockers."
>
Unfortunately, we didn't build a consensus on what is really blocked by
that. In reality, it was just our resolution for quality, and it works a
little differently in practice.
>
> In this email, I want to talk about correctness/dataloss issues and
> observed public opinions. They fall into the following categories roughly.
>
> 1. Resolved in both 3.0.0 and 2.4.x
>- ex) SPARK-30447 Constant propagation nullability issue
>- No problem. However, this case sometimes goes to (2)
>
> 2. Resolved in both 3.0.0 and 2.4.x. But, reverted in 2.4.x later.
>- ex) SPARK-26021 -0.0 and 0.0 not treated consistently, doesn't match
> Hive
>- "We don't want to change the behavior in the maintenence release

[VOTE] Release Apache Spark 2.4.5 (RC2)

2020-02-02 Thread Dongjoon Hyun
Please vote on releasing the following candidate as Apache Spark version
2.4.5.

The vote is open until February 5th 11PM PST and passes if a majority +1
PMC votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 2.4.5
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.4.5-rc2 (commit
cee4ecbb16917fa85f02c635925e2687400aa56b):
https://github.com/apache/spark/tree/v2.4.5-rc2

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.5-rc2-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1340/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.5-rc2-docs/

The list of bug fixes going into 2.4.5 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12346042

This release is using the release script of the tag v2.4.5-rc2.

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks; in Java/Scala,
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).

===
What should happen to JIRA tickets still targeting 2.4.5?
===

The current list of open tickets targeted at 2.4.5 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target
Version/s" = 2.4.5

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.


Re: [VOTE] Release Apache Spark 2.4.5 (RC2)

2020-02-02 Thread Dongjoon Hyun
I'll start with my +1.

Today, I verified the artifacts with GPG, and built and tested RC2 with the
followings.

  - Profile: -Pyarn -Phadoop-2.7 -Pkubernetes -Pkinesis-asl -Phive
-Phive-thriftserver
  - OS: CentOS (7.5.1804)
  - Java: OpenJDK 1.8.0_242
 * All Scala/Java UTs and JDBC IT passed.
  - Python 2.7.17 (with numpy 1.16.4, scipy 1.2.2, pandas 0.19.2, pyarrow
0.8.0)
 * All PySpark UTs passed.
  - Python 3.7.6 (with numpy 1.16.4, scipy 1.2.2, pandas 0.23.2, pyarrow
0.11.0)
 * All PySpark UTs passed.
  - Tested with Amazon EKS
 Client Version: v1.17.2
 Server Version: v1.14.9-eks-c0eccc

Bests,
Dongjoon.


On Sun, Feb 2, 2020 at 9:30 PM Dongjoon Hyun 
wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.4.5.
>
> The vote is open until February 5th 11PM PST and passes if a majority +1
> PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.4.5
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.4.5-rc2 (commit
> cee4ecbb16917fa85f02c635925e2687400aa56b):
> https://github.com/apache/spark/tree/v2.4.5-rc2
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.5-rc2-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1340/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.5-rc2-docs/
>
> The list of bug fixes going into 2.4.5 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12346042
>
> This release is using the release script of the tag v2.4.5-rc2.
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks; in Java/Scala,
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.4.5?
> ===
>
> The current list of open tickets targeted at 2.4.5 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.4.5
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>


Re: [DISCUSSION] Esoteric Spark function `TRIM/LTRIM/RTRIM`

2020-02-17 Thread Dongjoon Hyun
Thank you for the feedback, Wenchen, Maxim, and Takeshi.

1. "we would also promote the SQL standard TRIM syntax"
2. "we could output a warning with a notice about the order of
parameters"
3. "it looks nice to make these (two parameters) trim functions
unrecommended in future releases"

Yes, in case of reverting SPARK-28093, we had better deprecate these
two-parameter SQL function invocations. This is because the current form is really
esoteric and even inconsistent inside Spark (the Scala API also follows the
PostgreSQL/Presto style, like the following):

def trim(e: Column, trimString: String)

If we keep this situation in the 3.0.0 release (a major release), it means
Apache Spark will carry it forever.
And it becomes worse and worse, because it leads more users to fall into
this trap.
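
As a minimal illustration of the inconsistency above (a hedged sketch, not
part of the original message; it assumes the pre-SPARK-28093 / 2.4.x ordering
in which the two-parameter SQL form takes the trim string first, while the
Scala API takes the source column first):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, trim}

object TrimOrderSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("trim-order-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq("xxSparkxx").toDF("s")
    df.createOrReplaceTempView("t")

    // Scala API: trim(srcColumn, trimString) -- the source string comes first.
    df.select(trim(col("s"), "x")).show()

    // Two-parameter SQL form (assumed pre-SPARK-28093 / 2.4.x ordering):
    // the trim string comes first, the opposite of the Scala API above.
    spark.sql("SELECT trim('x', s) FROM t").show()

    // SQL standard syntax, which avoids the ordering question entirely.
    spark.sql("SELECT trim(BOTH 'x' FROM s) FROM t").show()

    spark.stop()
  }
}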

There are a few ways to deprecate. I believe (3) can be a proper next step
in case of reverting, because (2) is infeasible and (1) is considered
insufficient since it is silent, just as SPARK-28093 was. We need a non-silent
(noisy) option in this case. Technically, (3) can be done in
`Analyzer.ResolveFunctions`.

1. Documentation-only: remove the example and add a migration guide.
2. Compile-time warning by annotation: this is not an option for SQL
functions invoked in SQL strings.
3. Runtime warning with a directional guide
   (log.warn("... USE TRIM(trimStr FROM str) INSTEAD")); a minimal sketch
follows below.

How do you think about (3)?

Bests,
Dongjoon.
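
For concreteness, here is a minimal, hypothetical sketch of what option (3)
above could look like. This is not Spark's actual implementation: the object
and method names are invented for illustration, and in a real patch the check
would live inside the analyzer (e.g., around `Analyzer.ResolveFunctions`, as
noted in the message above).

import java.util.concurrent.ConcurrentHashMap
import org.slf4j.LoggerFactory

// Hypothetical helper: emit a one-time runtime warning when a two-parameter
// TRIM/LTRIM/RTRIM invocation is seen, pointing users to the SQL standard
// TRIM(trimStr FROM str) syntax instead.
object TwoParamTrimWarning {
  private val log = LoggerFactory.getLogger(getClass)
  private val alreadyWarned = new ConcurrentHashMap[String, java.lang.Boolean]()

  def maybeWarn(funcName: String, numArgs: Int): Unit = {
    val name = funcName.toLowerCase
    val isTrimFamily = Set("trim", "ltrim", "rtrim").contains(name)
    // Warn only once per function name so repeated queries don't flood the logs.
    if (isTrimFamily && numArgs == 2 &&
        alreadyWarned.putIfAbsent(name, java.lang.Boolean.TRUE) == null) {
      log.warn(s"Two-parameter '$name' is deprecated because its parameter " +
        "order is inconsistent with other systems; USE TRIM(trimStr FROM str) INSTEAD.")
    }
  }
}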

On Sun, Feb 16, 2020 at 1:22 AM Takeshi Yamamuro 
wrote:

> The revert looks fine to me for keeping compatibility.
> Also, I think the different orders between the systems easily lead to
> mistakes, so, as Wenchen suggested, it looks nice to make these
> (two-parameter) trim functions unrecommended in future releases:
> https://github.com/apache/spark/pull/27540#discussion_r377682518
> Actually, I think the SQL TRIM syntax is enough for trim use cases...
>
> Bests,
> Takeshi
>
>
> On Sun, Feb 16, 2020 at 3:02 AM Maxim Gekk 
> wrote:
>
>> Also, if we look at the possible combinations of trim parameters:
>> 1. foldable srcStr + foldable trimStr
>> 2. foldable srcStr + non-foldable trimStr
>> 3. non-foldable srcStr + foldable trimStr
>> 4. non-foldable srcStr + non-foldable trimStr
>>
>> Case #2 seems rare, and #3 is probably the most common case.
>> Once we see the second case, we could output a warning with a notice about
>> the order of parameters.
>>
>> Maxim Gekk
>>
>> Software Engineer
>>
>> Databricks, Inc.
>>
>>
>> On Sat, Feb 15, 2020 at 5:04 PM Wenchen Fan  wrote:
>>
>>> It's unfortunate that we don't have a clear document to talk about
>>> breaking changes (I'm working on it BTW). I believe the general guidance
>>> is: *avoid breaking changes unless we have to*. For example, the
>>> previous result was so broken that we had to fix it, and moving to Scala 2.12
>>> made us break some APIs, etc.
>>>
>>> For this particular case, do we have to switch the parameter order? It's
>>> different from some systems and the order was not decided explicitly, but I
>>> don't think those are strong reasons. People from RDBMS should use the SQL
>>> standard TRIM syntax more often. People using prior Spark versions should
>>> have figured out the parameter order of Spark TRIM (there was no document)
>>> and adjusted their queries. There is no such standard that defines the
>>> parameter order of the TRIM function.
>>>
>>> In the long term, we would also promote the SQL standard TRIM syntax. I
>>> don't see enough benefit in "fixing" the parameter order to be worth making a
>>> breaking change.
>>>
>>> Thanks,
>>> Wenchen
>>>
>>>
>>>
>>> On Sat, Feb 15, 2020 at 3:44 AM Dongjoon Hyun 
>>> wrote:
>>>
>>>> Please note that the context is TRIM/LTRIM/RTRIM with two parameters
>>>> and the TRIM(trimStr FROM str) syntax.
>>>>
>>>> This thread is irrelevant to one-parameter TRIM/LTRIM/RTRIM.
>>>>
>>>> On Fri, Feb 14, 2020 at 11:35 AM Dongjoon Hyun 
>>>> wrote:
>>>>
>>>>> Hi, All.
>>>>>
>>>>> I'm sending this email because the Apache Spark committers had better
>>>>> have a consistent point of view for the upcoming PRs. And, the community
>>>>> policy is the way to lead the community members transparently and clearly
>>>>> for a long-term good.
>>>>>
>>>>> First of all, I want to emphasize that, like Apache Spark 2.0.0,
>>>>> Apache Spark 3.0.0 is going to achiev
