Re: I would like to add JDBCDialect to support Vertica database

2019-12-11 Thread Hyukjin Kwon
I am not so sure about it either. I think it is enough to expose JDBCDialect
as an API (which it seems it already is).
It brings some overhead to dev (e.g., testing and reviewing PRs related to
another third party).
Without a strong reason, such third-party integration might be better off as a
third-party library.

On Thu, Dec 12, 2019 at 12:58 AM Bryan Herger wrote:

> It kind of already is.  I was able to build the VerticaDialect as a sort
> of plugin as follows:
>
>
>
> Check out apache/spark tree
>
> Copy in VerticaDialect.scala
>
> Build with “mvn -DskipTests compile”
>
> package the compiled class plus companion object into a JAR
>
> Copy JAR to jars folder in Spark binary installation (optional, probably
> can set path in an extra --jars argument instead)
>
>
>
> Then run the following test in spark-shell after creating Vertica table
> and sample data:
>
>
>
>
> org.apache.spark.sql.jdbc.JdbcDialects.registerDialect(org.apache.spark.sql.jdbc.VerticaDialect)
>
> val jdbcDF = spark.read.format("jdbc")
>   .option("url", "jdbc:vertica://hpbox:5433/docker")
>   .option("dbtable", "test_alltypes")
>   .option("user", "dbadmin")
>   .option("password", "Vertica1!")
>   .load()
>
> jdbcDF.show()
>
> jdbcDF.write.mode("append").format("jdbc")
>   .option("url", "jdbc:vertica://hpbox:5433/docker")
>   .option("dbtable", "test_alltypes")
>   .option("user", "dbadmin")
>   .option("password", "Vertica1!")
>   .save()
>
> JdbcDialects.unregisterDialect(org.apache.spark.sql.jdbc.VerticaDialect)
>
>
>
> If it would be preferable to write documentation describing the above, I
> can do that instead.  The hard part is checking out the matching
> apache/spark tree then copying to the Spark cluster – I can install master
> branch and latest binary and apply patches since I have root on all my test
> boxes, but customers may not be able to.  Still, this provides another
> route to support new JDBC dialects.
>
>
>
> BryanH
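
For readers following along, here is a rough sketch of the pluggable-dialect approach described above, built on the public org.apache.spark.sql.jdbc.JdbcDialect API. It is not the actual VerticaDialect from the linked changeset; the object name, URL check, and type mappings are illustrative assumptions only.

import java.sql.Types

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types.{BinaryType, DataType, StringType}

// Hypothetical dialect that only remaps the type names discussed in this thread.
object MyVerticaDialect extends JdbcDialect {
  // Claim JDBC URLs that use the Vertica scheme.
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:vertica")

  // Map Catalyst types to database type names; returning None falls back to
  // the generic JDBC mapping.
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType => Some(JdbcType("VARCHAR(65000)", Types.VARCHAR))
    case BinaryType => Some(JdbcType("VARBINARY(65000)", Types.VARBINARY))
    case _          => None
  }
}

// Register it before reading or writing, as in the spark-shell test above.
JdbcDialects.registerDialect(MyVerticaDialect)

Packaged into a JAR and passed with --jars, this is essentially the plugin route described above, without rebuilding Spark itself.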
>
>
>
> *From:* Wenchen Fan [mailto:cloud0...@gmail.com]
> *Sent:* Wednesday, December 11, 2019 10:48 AM
> *To:* Xiao Li 
> *Cc:* Bryan Herger ; Sean Owen <
> sro...@gmail.com>; dev@spark.apache.org
> *Subject:* Re: I would like to add JDBCDialect to support Vertica database
>
>
>
> Can we make the JDBCDialect a public API that users can plugin? It looks
> like an end-less job to make sure Spark JDBC source supports all databases.
>
>
>
> On Wed, Dec 11, 2019 at 11:41 PM Xiao Li  wrote:
>
> You can follow how we test the other JDBC dialects. All JDBC dialects
> require the docker integration tests.
> https://github.com/apache/spark/tree/master/external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc
>
>
>
>
>
> On Wed, Dec 11, 2019 at 7:33 AM Bryan Herger 
> wrote:
>
> Hi, to answer both questions raised:
>
>
>
> Though Vertica is derived from Postgres, Vertica does not recognize type
> names TEXT, NVARCHAR, BYTEA, ARRAY, and also handles DATETIME differently
> enough to cause issues.  The major changes are to use type names and date
> format supported by Vertica.
>
>
>
> For testing, I have a SQL script plus Scala and PySpark scripts, but these
> require a Vertica database to connect to, so automated testing on a build
> server wouldn’t work.  It’s possible to include my test scripts and
> directions to run them manually, but I'm not sure where in the repo that would go.
> If automated testing is required, I can ask our engineers whether there
> exists something Mockito-like that could be included.
>
>
>
> Thanks, Bryan H
>
>
>
> *From:* Xiao Li [mailto:lix...@databricks.com]
> *Sent:* Wednesday, December 11, 2019 10:13 AM
> *To:* Sean Owen 
> *Cc:* Bryan Herger ; dev@spark.apache.org
> *Subject:* Re: I would like to add JDBCDialect to support Vertica database
>
>
>
> How can the dev community test it?
>
>
>
> Xiao
>
>
>
> On Wed, Dec 11, 2019 at 6:52 AM Sean Owen  wrote:
>
> It's probably OK, IMHO. The overhead of another dialect is small. Are
> there differences that require a new dialect? I assume so, and it might
> just be useful to summarize them if you open a PR.
>
> On Tue, Dec 10, 2019 at 7:14 AM Bryan Herger
>  wrote:
> >
> > Hi, I am a Vertica support engineer, and we have open support requests
> around NULL values and SQL type conversion with DataFrame read/write over
> JDBC when connecting to a Vertica database.  The stack traces point to
> issues with the generic JDBCDialect in Spark-SQL.
> >
> > I saw that other vendors (Teradata, DB2...) have contributed a
> JDBCDialect class to address JDBC compatibility, so I wrote up a dialect
> for Vertica.
> >
> > The changeset is on my fork of apache/spark at
> https://github.com/bryanherger/spark/commit/84d3014e4ead18146147cf299e8996c5c56b377d
> >
> > I have tested this against Vertica 9.3 and found that this changeset
> addresses both issues reported to us (issue with NULL values - setNull() -
> for valid java.sql.Types, and String to VARCHAR conversion)
> >
> > Is this an acceptable change?  If so, how should I go about submitting a
> pull request?
> >
> > Thanks, Bryan Herger
> > Vertica 

Re: Closing stale PRs with a GitHub Action

2019-12-08 Thread Hyukjin Kwon
It doesn't need to follow exactly the conditions I used before, as long as
GitHub Actions can provide other good options or conditions.
I just wanted to make sure the conditions are reasonable.

On Sat, Dec 7, 2019 at 11:23 AM Hyukjin Kwon wrote:

> lol how did you know I'm going to read this email Sean?
>
> When I manually identified the stale PRs, I used these conditions:
>
> 1. The author has been inactive for over a year. If a PR was simply waiting for a
> review, I excluded it from the stale PR list.
> 2. Ping once and see if there are any updates within 3 days.
> 3. If a PR met both conditions above, it was considered stale.
>
> Yeah, I agree with it. But I think the conditions for stale PRs matter.
> What kinds of conditions and actions does GitHub Actions support, and which
> of them do you plan to add?
>
> I didn't like closing and automating stale PRs, but I think it's time to
> consider it. The conditions have to be pretty reasonable, though,
> so that we give a proper reason to the author and don't end up closing
> some good and worthy PRs.
>
>
> On Sat, Dec 7, 2019 at 3:23 AM Sean Owen wrote:
>
>> We used to not be able to close PRs directly, but now we can, so I assume
>> this is as fine a way of doing so, if we want to. I don't think there's a
>> policy against it or anything.
>> Hyukjin how have you managed this one in the past?
>> I don't mind it being automated if the idle time is long and it posts
>> some friendly message about reopening if there is a material change in the
>> proposed PR, the problem, or interest in merging it.
>>
>> On Fri, Dec 6, 2019 at 11:20 AM Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> That's true, we do use Actions today. I wonder if Apache Infra allows
>>> Actions to close PRs vs. just updating commit statuses. I only ask because
>>> I remember permissions were an issue in the past when discussing tooling
>>> like this.
>>>
>>> In any case, I'd be happy to submit a PR adding this in if there are no
>>> concerns. We can hash out the details on the PR.
>>>
>>> On Fri, Dec 6, 2019 at 11:08 AM Sean Owen  wrote:
>>>
>>>> I think we can add Actions, right? they're used for the newer tests in
>>>> Github?
>>>> I'm OK closing PRs inactive for a 'long time', where that's maybe 6-12
>>>> months or something. It's standard practice and doesn't mean it can't be
>>>> reopened.
>>>> Often the related JIRA should be closed as well but we have done that
>>>> separately with bulk-close in the past.
>>>>
>>>> On Thu, Dec 5, 2019 at 3:24 PM Nicholas Chammas <
>>>> nicholas.cham...@gmail.com> wrote:
>>>>
>>>>> It’s that topic again. 
>>>>>
>>>>> We have almost 500 open PRs. A good chunk of them are more than a year
>>>>> old. The oldest open PR dates to summer 2015.
>>>>>
>>>>>
>>>>> https://github.com/apache/spark/pulls?q=is%3Apr+is%3Aopen+sort%3Acreated-asc
>>>>>
>>>>> GitHub has an Action for closing stale PRs.
>>>>>
>>>>> https://github.com/marketplace/actions/close-stale-issues
>>>>>
>>>>> What do folks think about deploying it? Does Apache Infra give us the
>>>>> ability to even deploy a tool like this?
>>>>>
>>>>> Nick
>>>>>
>>>>


Re: Closing stale PRs with a GitHub Action

2019-12-06 Thread Hyukjin Kwon
lol how did you know I'm going to read this email Sean?

When I manually identified the stale PRs, I used these conditions:

1. The author has been inactive for over a year. If a PR was simply waiting for a
review, I excluded it from the stale PR list.
2. Ping once and see if there are any updates within 3 days.
3. If a PR met both conditions above, it was considered stale.

Yeah, I agree with it. But I think the conditions for stale PRs matter.
What kinds of conditions and actions does GitHub Actions support, and which
of them do you plan to add?

I didn't like closing and automating stale PRs, but I think it's time to
consider it. The conditions have to be pretty reasonable, though,
so that we give a proper reason to the author and don't end up closing
some good and worthy PRs.


On Sat, Dec 7, 2019 at 3:23 AM Sean Owen wrote:

> We used to not be able to close PRs directly, but now we can, so I assume
> this is as fine a way of doing so, if we want to. I don't think there's a
> policy against it or anything.
> Hyukjin how have you managed this one in the past?
> I don't mind it being automated if the idle time is long and it posts some
> friendly message about reopening if there is a material change in the
> proposed PR, the problem, or interest in merging it.
>
> On Fri, Dec 6, 2019 at 11:20 AM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> That's true, we do use Actions today. I wonder if Apache Infra allows
>> Actions to close PRs vs. just updating commit statuses. I only ask because
>> I remember permissions were an issue in the past when discussing tooling
>> like this.
>>
>> In any case, I'd be happy to submit a PR adding this in if there are no
>> concerns. We can hash out the details on the PR.
>>
>> On Fri, Dec 6, 2019 at 11:08 AM Sean Owen  wrote:
>>
>>> I think we can add Actions, right? they're used for the newer tests in
>>> Github?
>>> I'm OK closing PRs inactive for a 'long time', where that's maybe 6-12
>>> months or something. It's standard practice and doesn't mean it can't be
>>> reopened.
>>> Often the related JIRA should be closed as well but we have done that
>>> separately with bulk-close in the past.
>>>
>>> On Thu, Dec 5, 2019 at 3:24 PM Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
 It’s that topic again. 

 We have almost 500 open PRs. A good chunk of them are more than a year
 old. The oldest open PR dates to summer 2015.


 https://github.com/apache/spark/pulls?q=is%3Apr+is%3Aopen+sort%3Acreated-asc

 GitHub has an Action for closing stale PRs.

 https://github.com/marketplace/actions/close-stale-issues

 What do folks think about deploying it? Does Apache Infra give us the
 ability to even deploy a tool like this?

 Nick

>>>


Revisiting Python / pandas UDF (continues)

2019-12-04 Thread Hyukjin Kwon
Hi all,

I would like to finish the pandas UDF redesign in Spark 3.0.
If you don't have concerns in general (see
https://issues.apache.org/jira/browse/SPARK-28264),
I would like to start soon after addressing the existing comments.

Please take a look and comment on the design docs.

Thanks!


Re: Slower than usual on PRs

2019-12-03 Thread Hyukjin Kwon
Yeah, please take care of your health first!

On Tue, Dec 3, 2019 at 1:32 PM Wenchen Fan wrote:

> Sorry to hear that. Hope you get better soon!
>
> On Tue, Dec 3, 2019 at 1:28 AM Holden Karau  wrote:
>
>> Hi Spark dev folks,
>>
>> Just an FYI I'm out dealing with recovering from a motorcycle accident so
>> my lack of (or slow) responses on PRs/docs is health related and please
>> don't block on any of my reviews. I'll do my best to find some OSS cycles
>> once I get back home.
>>
>> Cheers,
>>
>> Holden
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>


Re: Auto-linking Jira tickets to their PRs

2019-12-03 Thread Hyukjin Kwon
I think it's broken .. cc Josh Rosen

On Wed, Dec 4, 2019 at 10:25 AM Nicholas Chammas wrote:

> We used to have a bot or something that automatically linked Jira tickets
> to PRs that mentioned them in their title. I don't see that happening
> anymore. 
>
> Did we intentionally remove this functionality, or is it temporarily
> broken for some reason?
>
> Nick
>
>


Re: Adding JIRA ID as the prefix for the test case name

2019-11-21 Thread Hyukjin Kwon
I opened a PR - https://github.com/apache/spark-website/pull/232

On Tue, Nov 19, 2019 at 9:22 AM Hyukjin Kwon wrote:

> Let me document as below in few days:
>
> 1. For Python and Java, write a single comment that starts with JIRA ID
> and short description, e.g. (SPARK-X: test blah blah)
> 2. For R, use JIRA ID as a prefix for its test name.
>
> assuming everybody is happy.
>
> On Mon, Nov 18, 2019 at 11:36 AM Hyukjin Kwon wrote:
>
>> Actually there are not so many Java test cases in Spark (because Scala
>> runs on JVM as everybody knows)[1].
>>
>> Given that, I think we can avoid to put some efforts on this for now .. I
>> don't mind if somebody wants to give a shot since it looks good anyway but
>> to me I wouldn't spend so much time on this ..
>>
>> Let me just go ahead as I suggested if you don't mind. Anyone can give a
>> shot for Display Name - I'm willing to actively review and help.
>>
>> [1]
>> git ls-files '*Suite.java' | wc -l
>>  172
>> git ls-files '*Suite.scala' | wc -l
>> 1161
>>
>> On Mon, Nov 18, 2019 at 3:27 AM Steve Loughran wrote:
>>
>>> Test reporters do often contain some assumptions about the characters in
>>> the test methods. Historically JUnit XML reporters have never sanitised the
>>> method names so XML injection attacks have been fairly trivial. Haven't
>>> tried this for a while.
>>>
>>> That whole JUnit XML report "standard" was actually put together in the
>>> Ant project with  doing the postprocessing of the JUnit run.
>>> It was driven by the team's XSL skills than any overreaching strategic goal
>>> about how to present test results of tests which could run for hours and
>>> whose output you would really want to aggregate the locks from multiple
>>> machines and processes and present in awake you can actually navigate. With
>>> hindsight, a key failing is that we chose to store the test summaries (test
>>> count, failure count...) as attributes on the root XML mode. Which is why
>>> the whole DOM gets built up in the JUnit runner. Which is why when that
>>> JUnit process crashes, you get no report at all.
>>>
>>> It'd be straightforward to fix -except too much relies on that file
>>> now...important things will break. And the maven runner has historically
>>> never supported custom reporters, to let you experiment with it.
>>>
>>> Maybe this is an opportunity to change things.
>>>
>>> On Sun, Nov 17, 2019 at 1:42 AM Hyukjin Kwon 
>>> wrote:
>>>
>>>> DisplayName looks good in general but actually here I would like first
>>>> to find a existing pattern to document in guidelines given the actual
>>>> existing practice we all are used to. I'm trying to be very conservative
>>>> since this guidelines affect everybody.
>>>>
>>>> I think it might be better to discuss separately if we want to change
>>>> what we have been used to.
>>>>
>>>> Also, using arbitrary names might not be actually free due to such bug
>>>> like https://github.com/apache/spark/pull/25630 . It will need some
>>>> more efforts to investigate as well.
>>>>
>>>> On Fri, 15 Nov 2019, 20:56 Steve Loughran, 
>>>> wrote:
>>>>
>>>>>  Junit5: Display names.
>>>>>
>>>>> Goes all the way to the XML.
>>>>>
>>>>>
>>>>> https://junit.org/junit5/docs/current/user-guide/#writing-tests-display-names
>>>>>
>>>>> On Thu, Nov 14, 2019 at 6:13 PM Shixiong(Ryan) Zhu <
>>>>> shixi...@databricks.com> wrote:
>>>>>
>>>>>> Should we also add a guideline for non Scala tests? Other languages
>>>>>> (Java, Python, R) don't support using string as a test name.
>>>>>>
>>>>>> Best Regards,
>>>>>> Ryan
>>>>>>
>>>>>>
>>>>>> On Thu, Nov 14, 2019 at 4:04 AM Hyukjin Kwon 
>>>>>> wrote:
>>>>>>
>>>>>>> I opened a PR - https://github.com/apache/spark-website/pull/231
>>>>>>>
>>>>>>> On Wed, Nov 13, 2019 at 10:43 AM Hyukjin Kwon wrote:
>>>>>>>
>>>>>>>> > In general a test should be self descriptive and I don't think we
>>>>>>>> should be adding JIRA ticket references wholesale. Any action that the
>>>>>>>> reader has to 

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-19 Thread Hyukjin Kwon
We don't have an official Spark release with Hadoop 3 yet (except the preview), if I
am not mistaken.
I think it's more natural to wait one minor release term before switching this
...
How about we target Hadoop 3 as the default in Spark 3.1?


On Wed, Nov 20, 2019 at 7:40 AM Cheng Lian wrote:

> Hey Steve,
>
> In terms of Maven artifact, I don't think the default Hadoop version
> matters except for the spark-hadoop-cloud module, which is only meaningful
> under the hadoop-3.2 profile. All  the other spark-* artifacts published to
> Maven central are Hadoop-version-neutral.
>
> Another issue about switching the default Hadoop version to 3.2 is PySpark
> distribution. Right now, we only publish PySpark artifacts prebuilt with
> Hadoop 2.x to PyPI. I'm not sure whether bumping the Hadoop dependency to
> 3.2 is feasible for PySpark users. Or maybe we should publish PySpark
> prebuilt with both Hadoop 2.x and 3.x. I'm open to suggestions on this one.
>
> Again, as long as Hive 2.3 and Hadoop 3.2 upgrade can be decoupled via the
> proposed hive-2.3 profile, I personally don't have a preference over having
> Hadoop 2.7 or 3.2 as the default Hadoop version. But just for minimizing
> the release management work, in case we decided to publish other spark-*
> Maven artifacts from a Hadoop 2.7 build, we can still special case
> spark-hadoop-cloud and publish it using a hadoop-3.2 build.
>
> On Mon, Nov 18, 2019 at 8:39 PM Dongjoon Hyun 
> wrote:
>
>> I also agree with Steve and Felix.
>>
>> Let's have another thread to discuss Hive issue
>>
>> because this thread was originally for `hadoop` version.
>>
>> And, now we can have `hive-2.3` profile for both `hadoop-2.7` and
>> `hadoop-3.0` versions.
>>
>> We don't need to mix both.
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Mon, Nov 18, 2019 at 8:19 PM Felix Cheung 
>> wrote:
>>
>>> 1000% with Steve, the org.spark-project hive 1.2 will need a solution.
>>> It is old and rather buggy; and It’s been *years*
>>>
>>> I think we should decouple hive change from everything else if people
>>> are concerned?
>>>
>>> --
>>> *From:* Steve Loughran 
>>> *Sent:* Sunday, November 17, 2019 9:22:09 AM
>>> *To:* Cheng Lian 
>>> *Cc:* Sean Owen ; Wenchen Fan ;
>>> Dongjoon Hyun ; dev ;
>>> Yuming Wang 
>>> *Subject:* Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?
>>>
>>> Can I take this moment to remind everyone that the version of hive which
>>> spark has historically bundled (the org.spark-project one) is an orphan
>>> project put together to deal with Hive's shading issues and a source of
>>> unhappiness in the Hive project. Whatever gets shipped should do its best
>>> to avoid including that file.
>>>
>>> Postponing a switch to hadoop 3.x after spark 3.0 is probably the safest
>>> move from a risk minimisation perspective. If something has broken then it
>>> is you can start with the assumption that it is in the o.a.s packages
>>> without having to debug o.a.hadoop and o.a.hive first. There is a cost: if
>>> there are problems with the hadoop / hive dependencies those teams will
>>> inevitably ignore filed bug reports, for the same reason the Spark team will
>>> probably close 1.6-related JIRAs as WONTFIX. WONTFIX responses for the
>>> Hadoop 2.x line include any compatibility issues with Java 9+. Do bear that
>>> in mind. It's not been tested, it has dependencies on artifacts we know are
>>> incompatible, and as far as the Hadoop project is concerned: people should
>>> move to branch 3 if they want to run on a modern version of Java
>>>
>>> It would be really really good if the published spark maven artefacts
>>> (a) included the spark-hadoop-cloud JAR and (b) were dependent upon hadoop
>>> 3.x. That way people doing things with their own projects will get
>>> up-to-date dependencies and don't get WONTFIX responses themselves.
>>>
>>> -Steve
>>>
>>> PS: Discussion on hadoop-dev @ making Hadoop 2.10 the official "last
>>> ever" branch-2 release and then declare its predecessors EOL; 2.10 will be
>>> the transition release.
>>>
>>> On Sun, Nov 17, 2019 at 1:50 AM Cheng Lian 
>>> wrote:
>>>
>>> Dongjoon, I didn't follow the original Hive 2.3 discussion closely. I
>>> thought the original proposal was to replace Hive 1.2 with Hive 2.3, which
>>> seemed risky, and therefore we only introduced Hive 2.3 under the
>>> hadoop-3.2 profile without removing Hive 1.2. But maybe I'm totally wrong
>>> here...
>>>
>>> Sean, Yuming's PR https://github.com/apache/spark/pull/26533 showed
>>> that Hadoop 2 + Hive 2 + JDK 11 looks promising. My major motivation is not
>>> about demand, but risk control: coupling Hive 2.3, Hadoop 3.2, and JDK 11
>>> upgrade together looks too risky.
>>>
>>> On Sat, Nov 16, 2019 at 4:03 AM Sean Owen  wrote:
>>>
>>> I'd prefer simply not making Hadoop 3 the default until 3.1+, rather
>>> than introduce yet another build combination. Does Hadoop 2 + Hive 2
>>> work and is there demand for it?
>>>
>>> On Sat, Nov 16, 2019 at 3:52 AM Wenchen Fan  wrote:
>>> >
>>> > Do 

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-19 Thread Hyukjin Kwon
> Should Hadoop 2 + Hive 2 be considered to work on JDK 11?
This seems to be under investigation in Yuming's PR (
https://github.com/apache/spark/pull/26533), if I am not mistaken.

Oh, yes, what I meant by (default) was the default profiles we will use in
Spark.


On Wed, Nov 20, 2019 at 10:14 AM Sean Owen wrote:

> Should Hadoop 2 + Hive 2 be considered to work on JDK 11? I wasn't
> sure if 2.7 did, but honestly I've lost track.
> Anyway, it doesn't matter much as the JDK doesn't cause another build
> permutation. All are built targeting Java 8.
>
> I also don't know if we have to declare a binary release a default.
> The published POM will be agnostic to Hadoop / Hive; well, it will
> link against a particular version but can be overridden. That's what
> you're getting at?
>
>
> On Tue, Nov 19, 2019 at 7:11 PM Hyukjin Kwon  wrote:
> >
> > So, are we able to conclude our plans as below?
> >
> > 1. In Spark 3,  we release as below:
> >   - Hadoop 3.2 + Hive 2.3 + JDK8 build that also works JDK 11
> >   - Hadoop 2.7 + Hive 2.3 + JDK8 build that also works JDK 11
> >   - Hadoop 2.7 + Hive 1.2.1 (fork) + JDK8 (default)
> >
> > 2. In Spark 3.1, we target:
> >   - Hadoop 3.2 + Hive 2.3 + JDK8 build that also works JDK 11
> >   - Hadoop 2.7 + Hive 2.3 + JDK8 build that also works JDK 11 (default)
> >
> > 3. Avoid to remove "Hadoop 2.7 + Hive 1.2.1 (fork) + JDK8 (default)"
> combo right away after cutting branch-3 to see if Hive 2.3 is considered as
> stable in general.
> > I roughly suspect it would be a couple of months after Spark 3.0
> release (?).
> >
> > BTW, maybe we should officially note that "Hadoop 2.7 + Hive 1.2.1
> (fork) + JDK8 (default)" combination is deprecated anyway in Spark 3.
> >
>


Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-19 Thread Hyukjin Kwon
So, are we able to conclude our plans as below?

1. In Spark 3,  we release as below:
  - Hadoop 3.2 + Hive 2.3 + JDK8 build that also works JDK 11
  - Hadoop 2.7 + Hive 2.3 + JDK8 build that also works JDK 11
  - Hadoop 2.7 + Hive 1.2.1 (fork) + JDK8 (default)

2. In Spark 3.1, we target:
  - Hadoop 3.2 + Hive 2.3 + JDK8 build that also works JDK 11
  - Hadoop 2.7 + Hive 2.3 + JDK8 build that also works JDK 11 (default)

3. Avoid removing the "Hadoop 2.7 + Hive 1.2.1 (fork) + JDK8 (default)" combo
right away after cutting branch-3, to see whether Hive 2.3 is considered
stable in general.
I roughly suspect that would be a couple of months after the Spark 3.0
release (?).

BTW, maybe we should officially note that "Hadoop 2.7 + Hive 1.2.1 (fork) +
JDK8 (default)" combination is deprecated anyway in Spark 3.



On Wed, Nov 20, 2019 at 9:52 AM Cheng Lian wrote:

> Thanks for taking care of this, Dongjoon!
>
> We can target SPARK-20202 to 3.1.0, but I don't think we should do it
> immediately after cutting the branch-3.0. The Hive 1.2 code paths can only
> be removed once the Hive 2.3 code paths are proven to be stable. If it
> turned out to be buggy in Spark 3.1, we may want to further postpone
> SPARK-20202 to 3.2.0 by then.
>
> On Tue, Nov 19, 2019 at 2:53 PM Dongjoon Hyun 
> wrote:
>
>> Yes. It does. I meant SPARK-20202.
>>
>> Thanks. I understand that it can be considered like Scala version issue.
>> So, that's the reason why I put this as a `policy` issue from the
>> beginning.
>>
>> > First of all, I want to put this as a policy issue instead of a
>> technical issue.
>>
>> In the policy perspective, we should remove this immediately if we have a
>> solution to fix this.
>> For now, I set `Target Versions` of SPARK-20202 to `3.1.0` according to
>> the current discussion status.
>>
>> https://issues.apache.org/jira/browse/SPARK-20202
>>
>> And, if there is no other issues, I'll create a PR to remove it from
>> `master` branch when we cut `branch-3.0`.
>>
>> For additional `hadoop-2.7 with Hive 2.3` pre-built distribution, how do
>> you think about this, Sean?
>> The preparation is already started in another email thread and I believe
>> that is a keystone to prove `Hive 2.3` version stability
>> (which Cheng/Hyukjin/you asked).
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Tue, Nov 19, 2019 at 2:09 PM Cheng Lian  wrote:
>>
>>> It's kinda like Scala version upgrade. Historically, we only remove the
>>> support of an older Scala version when the newer version is proven to be
>>> stable after one or more Spark minor versions.
>>>
>>> On Tue, Nov 19, 2019 at 2:07 PM Cheng Lian 
>>> wrote:
>>>
 Hmm, what exactly did you mean by "remove the usage of forked `hive` in
 Apache Spark 3.0 completely officially"? I thought you wanted to remove the
 forked Hive 1.2 dependencies completely, no? As long as we still keep the
 Hive 1.2 in Spark 3.0, I'm fine with that. I personally don't have a
 particular preference between using Hive 1.2 or 2.3 as the default Hive
 version. After all, for end-users and providers who need a particular
 version combination, they can always build Spark with proper profiles
 themselves.

 And thanks for clarifying the Hive 2.3.5 issue. I didn't notice that
 it's due to the folder name.

 On Tue, Nov 19, 2019 at 11:15 AM Dongjoon Hyun 
 wrote:

> BTW, `hive.version.short` is a directory name. We are using 2.3.6 only.
>
> For directory name, we use '1.2.1' and '2.3.5' because we just delayed
> the renaming the directories until 3.0.0 deadline to minimize the diff.
>
> We can replace it immediately if we want right now.
>
>
>
> On Tue, Nov 19, 2019 at 11:11 AM Dongjoon Hyun <
> dongjoon.h...@gmail.com> wrote:
>
>> Hi, Cheng.
>>
>> This is irrelevant to JDK11 and Hadoop 3. I'm talking about JDK8
>> world.
>> If we consider them, it could be the followings.
>>
>> +------------+-----------------+-------------------+
>> |            | Hive 1.2.1 fork | Apache Hive 2.3.6 |
>> +------------+-----------------+-------------------+
>> | Legitimate | X               | O                 |
>> | JDK11      | X               | O                 |
>> | Hadoop3    | X               | O                 |
>> | Hadoop2    | O               | O                 |
>> | Functions  | Baseline        | More              |
>> | Bug fixes  | Baseline        | More              |
>> +------------+-----------------+-------------------+
>>
>> To stabilize Spark's Hive 2.3 usage, we should use it by ourselves
>> (including Jenkins/GitHubAction/AppVeyor).
>>
>> For me, AS-IS 3.0 is not enough for that. According to your advices,
>> to give more visibility to the whole community,
>>
>> 1. We need to give additional `hadoop-2.7 with Hive 2.3` pre-built
>> distribution
>> 2. We need to switch our default Hive usage to 2.3 in `master` for
>> 

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-18 Thread Hyukjin Kwon
I struggled hard to deal with this issue multiple times over a year, and
thankfully we finally decided to use the official version of Hive 2.3.x too
(thank you, Yuming, Alan, and everyone).
I think it is already huge progress that we have started to use the
official version of Hive.

I think we should have at least one minor release term to let users test
out Spark with Hive 2.3.x before switching it
to the default. My impression was that this was the decision made before at:
http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Upgrade-built-in-Hive-to-2-3-4-td26153.html

How about we try to make it the default in Spark 3.1, using this thread as a
reference? Doing it now seems too radical a change.


On Tue, Nov 19, 2019 at 2:11 PM Dongjoon Hyun wrote:

> Hi, All.
>
> First of all, I want to put this as a policy issue instead of a technical
> issue.
> Also, this is orthogonal from `hadoop` version discussion.
>
> Apache Spark community kept (not maintained) the forked Apache Hive
> 1.2.1 because there has been no other options before. As we see at
> SPARK-20202, it's not a desirable situation among the Apache projects.
>
> https://issues.apache.org/jira/browse/SPARK-20202
>
> Also, please note that we `kept`, not `maintained`, because we know it's
> not good.
> There have been several attempts to update that forked repository
> for several reasons (Hadoop 3 support is one example),
> but those attempts were also turned down.
>
> From Apache Spark 3.0, it seems that we have a new feasible option
> `hive-2.3` profile. What about moving forward in this direction further?
>
> For example, can we remove the usage of forked `hive` in Apache Spark 3.0
> completely officially? If someone still needs to use the forked `hive`, we
> can
> have a profile `hive-1.2`. Of course, it should not be a default profile
> in the community.
>
> I want to say this is a goal we should achieve someday.
> If we don't do anything, nothing happens. At least we need to prepare for this.
> Without any preparation, Spark 3.1+ will be the same.
>
> Shall we focus on what are our problems with Hive 2.3.6?
> If the only reason is that we didn't use it before, we can release
> another
> `3.0.0-preview` for that.
>
> Bests,
> Dongjoon.
>


Re: Adding JIRA ID as the prefix for the test case name

2019-11-18 Thread Hyukjin Kwon
Let me document this as below in a few days:

1. For Python and Java, write a single comment that starts with JIRA ID and
short description, e.g. (SPARK-X: test blah blah)
2. For R, use JIRA ID as a prefix for its test name.

assuming everybody is happy.

On Mon, Nov 18, 2019 at 11:36 AM Hyukjin Kwon wrote:

> Actually there are not so many Java test cases in Spark (because Scala
> runs on JVM as everybody knows)[1].
>
> Given that, I think we can avoid to put some efforts on this for now .. I
> don't mind if somebody wants to give a shot since it looks good anyway but
> to me I wouldn't spend so much time on this ..
>
> Let me just go ahead as I suggested if you don't mind. Anyone can give a
> shot for Display Name - I'm willing to actively review and help.
>
> [1]
> git ls-files '*Suite.java' | wc -l
>  172
> git ls-files '*Suite.scala' | wc -l
> 1161
>
> On Mon, Nov 18, 2019 at 3:27 AM Steve Loughran wrote:
>
>> Test reporters do often contain some assumptions about the characters in
>> the test methods. Historically JUnit XML reporters have never sanitised the
>> method names so XML injection attacks have been fairly trivial. Haven't
>> tried this for a while.
>>
>> That whole JUnit XML report "standard" was actually put together in the
>> Ant project with  doing the postprocessing of the JUnit run.
>> It was driven by the team's XSL skills than any overreaching strategic goal
>> about how to present test results of tests which could run for hours and
>> whose output you would really want to aggregate the locks from multiple
>> machines and processes and present in awake you can actually navigate. With
>> hindsight, a key failing is that we chose to store the test summaries (test
>> count, failure count...) as attributes on the root XML mode. Which is why
>> the whole DOM gets built up in the JUnit runner. Which is why when that
>> JUnit process crashes, you get no report at all.
>>
>> It'd be straightforward to fix -except too much relies on that file
>> now...important things will break. And the maven runner has historically
>> never supported custom reporters, to let you experiment with it.
>>
>> Maybe this is an opportunity to change things.
>>
>> On Sun, Nov 17, 2019 at 1:42 AM Hyukjin Kwon  wrote:
>>
>>> DisplayName looks good in general but actually here I would like first
>>> to find a existing pattern to document in guidelines given the actual
>>> existing practice we all are used to. I'm trying to be very conservative
>>> since this guidelines affect everybody.
>>>
>>> I think it might be better to discuss separately if we want to change
>>> what we have been used to.
>>>
>>> Also, using arbitrary names might not be actually free due to such bug
>>> like https://github.com/apache/spark/pull/25630 . It will need some
>>> more efforts to investigate as well.
>>>
>>> On Fri, 15 Nov 2019, 20:56 Steve Loughran, 
>>> wrote:
>>>
>>>>  Junit5: Display names.
>>>>
>>>> Goes all the way to the XML.
>>>>
>>>>
>>>> https://junit.org/junit5/docs/current/user-guide/#writing-tests-display-names
>>>>
>>>> On Thu, Nov 14, 2019 at 6:13 PM Shixiong(Ryan) Zhu <
>>>> shixi...@databricks.com> wrote:
>>>>
>>>>> Should we also add a guideline for non Scala tests? Other languages
>>>>> (Java, Python, R) don't support using string as a test name.
>>>>>
>>>>> Best Regards,
>>>>> Ryan
>>>>>
>>>>>
>>>>> On Thu, Nov 14, 2019 at 4:04 AM Hyukjin Kwon 
>>>>> wrote:
>>>>>
>>>>>> I opened a PR - https://github.com/apache/spark-website/pull/231
>>>>>>
>>>>>> On Wed, Nov 13, 2019 at 10:43 AM Hyukjin Kwon wrote:
>>>>>>
>>>>>>> > In general a test should be self descriptive and I don't think we
>>>>>>> should be adding JIRA ticket references wholesale. Any action that the
>>>>>>> reader has to take to understand why a test was introduced is one too 
>>>>>>> many.
>>>>>>> However in some cases the thing we are trying to test is very subtle 
>>>>>>> and in
>>>>>>> that case a reference to a JIRA ticket might be useful, I do still feel
>>>>>>> that this should be a backstop and that properly documenting your tests 
>>>>>>> is
>>>>>>> a much better w

Re: Adding JIRA ID as the prefix for the test case name

2019-11-17 Thread Hyukjin Kwon
Actually, there are not so many Java test cases in Spark (because Scala runs
on the JVM, as everybody knows)[1].

Given that, I think we can avoid putting effort into this for now. I
don't mind if somebody wants to give it a shot since it looks good anyway, but
I wouldn't spend so much time on this myself.

Let me just go ahead as I suggested, if you don't mind. Anyone can give
Display Name a shot - I'm willing to actively review and help.

[1]
git ls-files '*Suite.java' | wc -l
 172
git ls-files '*Suite.scala' | wc -l
1161

On Mon, Nov 18, 2019 at 3:27 AM Steve Loughran wrote:

> Test reporters do often contain some assumptions about the characters in
> the test methods. Historically JUnit XML reporters have never sanitised the
> method names so XML injection attacks have been fairly trivial. Haven't
> tried this for a while.
>
> That whole JUnit XML report "standard" was actually put together in the
> Ant project, with the junitreport task doing the postprocessing of the JUnit run.
> It was driven more by the team's XSL skills than by any overarching strategic goal
> about how to present the results of tests which could run for hours, and
> whose output you would really want to aggregate (logs from multiple
> machines and processes) and present in a way you can actually navigate. With
> hindsight, a key failing is that we chose to store the test summaries (test
> count, failure count...) as attributes on the root XML node. Which is why
> the whole DOM gets built up in the JUnit runner. Which is why, when that
> JUnit process crashes, you get no report at all.
>
> It'd be straightforward to fix -except too much relies on that file
> now...important things will break. And the maven runner has historically
> never supported custom reporters, to let you experiment with it.
>
> Maybe this is an opportunity to change things.
>
> On Sun, Nov 17, 2019 at 1:42 AM Hyukjin Kwon  wrote:
>
>> DisplayName looks good in general but actually here I would like first to
>> find a existing pattern to document in guidelines given the actual existing
>> practice we all are used to. I'm trying to be very conservative since this
>> guidelines affect everybody.
>>
>> I think it might be better to discuss separately if we want to change
>> what we have been used to.
>>
>> Also, using arbitrary names might not be actually free due to such bug
>> like https://github.com/apache/spark/pull/25630 . It will need some more
>> efforts to investigate as well.
>>
>> On Fri, 15 Nov 2019, 20:56 Steve Loughran, 
>> wrote:
>>
>>>  Junit5: Display names.
>>>
>>> Goes all the way to the XML.
>>>
>>>
>>> https://junit.org/junit5/docs/current/user-guide/#writing-tests-display-names
>>>
>>> On Thu, Nov 14, 2019 at 6:13 PM Shixiong(Ryan) Zhu <
>>> shixi...@databricks.com> wrote:
>>>
>>>> Should we also add a guideline for non Scala tests? Other languages
>>>> (Java, Python, R) don't support using string as a test name.
>>>>
>>>> Best Regards,
>>>> Ryan
>>>>
>>>>
>>>> On Thu, Nov 14, 2019 at 4:04 AM Hyukjin Kwon 
>>>> wrote:
>>>>
>>>>> I opened a PR - https://github.com/apache/spark-website/pull/231
>>>>>
>>>>> On Wed, Nov 13, 2019 at 10:43 AM Hyukjin Kwon wrote:
>>>>>
>>>>>> > In general a test should be self descriptive and I don't think we
>>>>>> should be adding JIRA ticket references wholesale. Any action that the
>>>>>> reader has to take to understand why a test was introduced is one too 
>>>>>> many.
>>>>>> However in some cases the thing we are trying to test is very subtle and 
>>>>>> in
>>>>>> that case a reference to a JIRA ticket might be useful, I do still feel
>>>>>> that this should be a backstop and that properly documenting your tests 
>>>>>> is
>>>>>> a much better way of dealing with this.
>>>>>>
>>>>>> Yeah, the test should be self-descriptive. I don't think adding a
>>>>>> JIRA prefix harms this point. Probably I should add this sentence in the
>>>>>> guidelines as well.
>>>>>> Adding a JIRA prefix just adds one extra hint to track down details.
>>>>>> I think it's fine to stick to this practice and make it simpler and clear
>>>>>> to follow.
>>>>>>
>>>>>> > 1. what if multiple JIRA IDs relating to the same test? we just
>>>>>> take the very firs

Re: Adding JIRA ID as the prefix for the test case name

2019-11-16 Thread Hyukjin Kwon
DisplayName looks good in general, but here I would first like to
find an existing pattern to document in the guidelines, given the actual existing
practice we are all used to. I'm trying to be very conservative since these
guidelines affect everybody.

I think it might be better to discuss separately whether we want to change what
we have been used to.

Also, using arbitrary names might not actually be free, due to bugs like
https://github.com/apache/spark/pull/25630 . It will need some more effort
to investigate as well.
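
For reference, a hedged sketch of what the JUnit 5 DisplayName suggestion below could look like from Scala, assuming a JUnit 5 (Jupiter) dependency; the suite, method, and JIRA ID are hypothetical placeholders.

import org.junit.jupiter.api.{Assertions, DisplayName, Test}

class ExampleDisplayNameSuite {
  // Reporters can surface the display name instead of the method name.
  @Test
  @DisplayName("SPARK-12345: null values are preserved by the JDBC writer")
  def nullValuesPreserved(): Unit = {
    Assertions.assertEquals(Seq(1), Seq(Some(1), None).flatten)
  }
}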

On Fri, 15 Nov 2019, 20:56 Steve Loughran, 
wrote:

>  Junit5: Display names.
>
> Goes all the way to the XML.
>
>
> https://junit.org/junit5/docs/current/user-guide/#writing-tests-display-names
>
> On Thu, Nov 14, 2019 at 6:13 PM Shixiong(Ryan) Zhu <
> shixi...@databricks.com> wrote:
>
>> Should we also add a guideline for non Scala tests? Other languages
>> (Java, Python, R) don't support using string as a test name.
>>
>> Best Regards,
>> Ryan
>>
>>
>> On Thu, Nov 14, 2019 at 4:04 AM Hyukjin Kwon  wrote:
>>
>>> I opened a PR - https://github.com/apache/spark-website/pull/231
>>>
>>> On Wed, Nov 13, 2019 at 10:43 AM Hyukjin Kwon wrote:
>>>
>>>> > In general a test should be self descriptive and I don't think we
>>>> should be adding JIRA ticket references wholesale. Any action that the
>>>> reader has to take to understand why a test was introduced is one too many.
>>>> However in some cases the thing we are trying to test is very subtle and in
>>>> that case a reference to a JIRA ticket might be useful, I do still feel
>>>> that this should be a backstop and that properly documenting your tests is
>>>> a much better way of dealing with this.
>>>>
>>>> Yeah, the test should be self-descriptive. I don't think adding a JIRA
>>>> prefix harms this point. Probably I should add this sentence in the
>>>> guidelines as well.
>>>> Adding a JIRA prefix just adds one extra hint to track down details. I
>>>> think it's fine to stick to this practice and make it simpler and clear to
>>>> follow.
>>>>
>>>> > 1. what if multiple JIRA IDs relating to the same test? we just take
>>>> the very first JIRA ID?
>>>> Ideally one JIRA should describe one issue and one PR should fix one
>>>> JIRA with a dedicated test.
>>>> Yeah, I think I would take the very first JIRA ID.
>>>>
>>>> > 2. are we going to have a full scan of all existing tests and attach
>>>> a JIRA ID to it?
>>>> Yea, let's don't do this.
>>>>
>>>> > It's a nice-to-have, not super essential, just because ...
>>>> It's been asked multiple times and each committer seems having a
>>>> different understanding on this.
>>>> It's not a biggie but wanted to make it clear and conclude this.
>>>>
>>>> > I'd add this only when a test specifically targets a certain issue.
>>>> Yes, so this one I am not sure. From what I heard, people adds the JIRA
>>>> in cases below:
>>>>
>>>> - Whenever the JIRA type is a bug
>>>> - When a PR adds a couple of tests
>>>> - Only when a test specifically targets a certain issue.
>>>> - ...
>>>>
>>>> Which one do we prefer and simpler to follow?
>>>>
>>>> Or I can combine as below (im gonna reword when I actually document
>>>> this):
>>>> 1. In general, we should add a JIRA ID as prefix of a test when a PR
>>>> targets to fix a specific issue.
>>>> In practice, it usually happens when a JIRA type is a bug or a PR
>>>> adds a couple of tests.
>>>> 2. Uses "SPARK-: test name" format
>>>>
>>>> If we have no objection with ^, let me go with this.
>>>>
>>>> On Wed, Nov 13, 2019 at 8:14 AM Sean Owen wrote:
>>>>
>>>>> Let's suggest "SPARK-12345:" but not go back and change a bunch of
>>>>> test cases.
>>>>> I'd add this only when a test specifically targets a certain issue.
>>>>> It's a nice-to-have, not super essential, just because in the rare
>>>>> case you need to understand why a test asserts something, you can go
>>>>> back and find what added it in the git history without much trouble.
>>>>>
>>>>> On Mon, Nov 11, 2019 at 10:46 AM Hyukjin Kwon 
>>>>> wrote:
>>>>> >
>>>>> 

Re: Adding JIRA ID as the prefix for the test case name

2019-11-14 Thread Hyukjin Kwon
Yeah, sounds good to have it.

In the case of R, it seems not quite common to write down the JIRA ID [1], but it looks
like some tests have the prefix in their names.
In the case of Python and Java, it seems we write a JIRA ID from time to time in the
comment right under the test method [2][3].

Given this pattern, I would like to suggest using the same format, but:

1. For Python and Java, write a single comment that starts with JIRA ID and
short description, e.g. (SPARK-X: test blah blah)
2. For R, use JIRA ID as a prefix for its test name.

[1] git grep -r "SPARK-" -- '*test*.R'
[2] git grep -r "SPARK-" -- '*Suite.java'
[3] git grep -r "SPARK-" -- '*test*.py'

Does that make sense? Adding Felix and Shivaram too.


On Fri, Nov 15, 2019 at 3:13 AM Shixiong(Ryan) Zhu wrote:

> Should we also add a guideline for non Scala tests? Other languages (Java,
> Python, R) don't support using string as a test name.
>
> Best Regards,
> Ryan
>
>
> On Thu, Nov 14, 2019 at 4:04 AM Hyukjin Kwon  wrote:
>
>> I opened a PR - https://github.com/apache/spark-website/pull/231
>>
>> On Wed, Nov 13, 2019 at 10:43 AM Hyukjin Kwon wrote:
>>
>>> > In general a test should be self descriptive and I don't think we
>>> should be adding JIRA ticket references wholesale. Any action that the
>>> reader has to take to understand why a test was introduced is one too many.
>>> However in some cases the thing we are trying to test is very subtle and in
>>> that case a reference to a JIRA ticket might be useful, I do still feel
>>> that this should be a backstop and that properly documenting your tests is
>>> a much better way of dealing with this.
>>>
>>> Yeah, the test should be self-descriptive. I don't think adding a JIRA
>>> prefix harms this point. Probably I should add this sentence in the
>>> guidelines as well.
>>> Adding a JIRA prefix just adds one extra hint to track down details. I
>>> think it's fine to stick to this practice and make it simpler and clear to
>>> follow.
>>>
>>> > 1. what if multiple JIRA IDs relating to the same test? we just take
>>> the very first JIRA ID?
>>> Ideally one JIRA should describe one issue and one PR should fix one
>>> JIRA with a dedicated test.
>>> Yeah, I think I would take the very first JIRA ID.
>>>
>>> > 2. are we going to have a full scan of all existing tests and attach a
>>> JIRA ID to it?
>>> Yea, let's don't do this.
>>>
>>> > It's a nice-to-have, not super essential, just because ...
>>> It's been asked multiple times and each committer seems having a
>>> different understanding on this.
>>> It's not a biggie but wanted to make it clear and conclude this.
>>>
>>> > I'd add this only when a test specifically targets a certain issue.
>>> Yes, so this one I am not sure. From what I heard, people adds the JIRA
>>> in cases below:
>>>
>>> - Whenever the JIRA type is a bug
>>> - When a PR adds a couple of tests
>>> - Only when a test specifically targets a certain issue.
>>> - ...
>>>
>>> Which one do we prefer and simpler to follow?
>>>
>>> Or I can combine as below (im gonna reword when I actually document
>>> this):
>>> 1. In general, we should add a JIRA ID as prefix of a test when a PR
>>> targets to fix a specific issue.
>>> In practice, it usually happens when a JIRA type is a bug or a PR
>>> adds a couple of tests.
>>> 2. Uses "SPARK-: test name" format
>>>
>>> If we have no objection with ^, let me go with this.
>>>
>>> On Wed, Nov 13, 2019 at 8:14 AM Sean Owen wrote:
>>>
>>>> Let's suggest "SPARK-12345:" but not go back and change a bunch of test
>>>> cases.
>>>> I'd add this only when a test specifically targets a certain issue.
>>>> It's a nice-to-have, not super essential, just because in the rare
>>>> case you need to understand why a test asserts something, you can go
>>>> back and find what added it in the git history without much trouble.
>>>>
>>>> On Mon, Nov 11, 2019 at 10:46 AM Hyukjin Kwon 
>>>> wrote:
>>>> >
>>>> > Hi all,
>>>> >
>>>> > Maybe it's not a big deal but it brought some confusions time to time
>>>> into Spark dev and community. I think it's time to discuss about when/which
>>>> format to add a JIRA ID as a prefix for the test case name in Scala test
>>>> cases.
>>>> >
>

Re: Adding JIRA ID as the prefix for the test case name

2019-11-14 Thread Hyukjin Kwon
I opened a PR - https://github.com/apache/spark-website/pull/231

On Wed, Nov 13, 2019 at 10:43 AM Hyukjin Kwon wrote:

> > In general a test should be self descriptive and I don't think we should
> be adding JIRA ticket references wholesale. Any action that the reader has
> to take to understand why a test was introduced is one too many. However in
> some cases the thing we are trying to test is very subtle and in that case
> a reference to a JIRA ticket might be useful, I do still feel that this
> should be a backstop and that properly documenting your tests is a much
> better way of dealing with this.
>
> Yeah, the test should be self-descriptive. I don't think adding a JIRA
> prefix harms this point. Probably I should add this sentence in the
> guidelines as well.
> Adding a JIRA prefix just adds one extra hint to track down details. I
> think it's fine to stick to this practice and make it simpler and clear to
> follow.
>
> > 1. what if multiple JIRA IDs relating to the same test? we just take the
> very first JIRA ID?
> Ideally one JIRA should describe one issue and one PR should fix one JIRA
> with a dedicated test.
> Yeah, I think I would take the very first JIRA ID.
>
> > 2. are we going to have a full scan of all existing tests and attach a
> JIRA ID to it?
> Yea, let's don't do this.
>
> > It's a nice-to-have, not super essential, just because ...
> It's been asked multiple times and each committer seems having a different
> understanding on this.
> It's not a biggie but wanted to make it clear and conclude this.
>
> > I'd add this only when a test specifically targets a certain issue.
> Yes, so this one I am not sure. From what I heard, people adds the JIRA in
> cases below:
>
> - Whenever the JIRA type is a bug
> - When a PR adds a couple of tests
> - Only when a test specifically targets a certain issue.
> - ...
>
> Which one do we prefer and simpler to follow?
>
> Or I can combine as below (im gonna reword when I actually document this):
> 1. In general, we should add a JIRA ID as prefix of a test when a PR
> targets to fix a specific issue.
> In practice, it usually happens when a JIRA type is a bug or a PR adds
> a couple of tests.
> 2. Uses "SPARK-: test name" format
>
> If we have no objection with ^, let me go with this.
>
> On Wed, Nov 13, 2019 at 8:14 AM Sean Owen wrote:
>
>> Let's suggest "SPARK-12345:" but not go back and change a bunch of test
>> cases.
>> I'd add this only when a test specifically targets a certain issue.
>> It's a nice-to-have, not super essential, just because in the rare
>> case you need to understand why a test asserts something, you can go
>> back and find what added it in the git history without much trouble.
>>
>> On Mon, Nov 11, 2019 at 10:46 AM Hyukjin Kwon 
>> wrote:
>> >
>> > Hi all,
>> >
>> > Maybe it's not a big deal but it brought some confusions time to time
>> into Spark dev and community. I think it's time to discuss about when/which
>> format to add a JIRA ID as a prefix for the test case name in Scala test
>> cases.
>> >
>> > Currently we have many test case names with prefixes as below:
>> >
>> > test("SPARK-X blah blah")
>> > test("SPARK-X: blah blah")
>> > test("SPARK-X - blah blah")
>> > test("[SPARK-X] blah blah")
>> > …
>> >
>> > It is a good practice to have the JIRA ID in general because, for
>> instance,
>> > it makes us put less efforts to track commit histories (or even when
>> the files
>> > are totally moved), or to track related information of tests failed.
>> > Considering Spark's getting big, I think it's good to document.
>> >
>> > I would like to suggest this and document it in our guideline:
>> >
>> > 1. Add a prefix into a test name when a PR adds a couple of tests.
>> > 2. Uses "SPARK-: test name" format which is used in our code base
>> most
>> >   often[1].
>> >
>> > We should make it simple and clear but closer to the actual practice.
>> So, I would like to listen to what other people think. I would appreciate
>> if you guys give some feedback about when to add the JIRA prefix. One
>> alternative is that, we only add the prefix when the JIRA's type is bug.
>> >
>> > [1]
>> > git grep -E 'test\("\SPARK-([0-9]+):' | wc -l
>> >  923
>> > git grep -E 'test\("\SPARK-([0-9]+) ' | wc -l
>> >  477
>> > git grep -E 'test\("\[SPARK-([0-9]+)\]' | wc -l
>> >   16
>> > git grep -E 'test\("\SPARK-([0-9]+) -' | wc -l
>> >   13
>> >
>> >
>> >
>>
>


Re: Adding JIRA ID as the prefix for the test case name

2019-11-12 Thread Hyukjin Kwon
> In general a test should be self descriptive and I don't think we should
be adding JIRA ticket references wholesale. Any action that the reader has
to take to understand why a test was introduced is one too many. However in
some cases the thing we are trying to test is very subtle and in that case
a reference to a JIRA ticket might be useful, I do still feel that this
should be a backstop and that properly documenting your tests is a much
better way of dealing with this.

Yeah, the test should be self-descriptive. I don't think adding a JIRA
prefix harms this point. Probably I should add this sentence in the
guidelines as well.
Adding a JIRA prefix just adds one extra hint to track down details. I
think it's fine to stick to this practice and make it simpler and clear to
follow.

> 1. what if multiple JIRA IDs relating to the same test? we just take the
very first JIRA ID?
Ideally one JIRA should describe one issue and one PR should fix one JIRA
with a dedicated test.
Yeah, I think I would take the very first JIRA ID.

> 2. are we going to have a full scan of all existing tests and attach a
JIRA ID to it?
Yeah, let's not do this.

> It's a nice-to-have, not super essential, just because ...
It's been asked multiple times, and each committer seems to have a different
understanding of this.
It's not a biggie, but I wanted to make it clear and conclude this.

> I'd add this only when a test specifically targets a certain issue.
Yes, this one I am not sure about. From what I have heard, people add the JIRA in
the cases below:

- Whenever the JIRA type is a bug
- When a PR adds a couple of tests
- Only when a test specifically targets a certain issue.
- ...

Which one do we prefer, and which is simpler to follow?

Or I can combine them as below (I'm going to reword this when I actually document it):
1. In general, we should add a JIRA ID as the prefix of a test when a PR
targets a specific issue.
In practice, this usually happens when the JIRA type is a bug or a PR adds
a couple of tests.
2. Use the "SPARK-: test name" format

If we have no objection with ^, let me go with this.

On Wed, Nov 13, 2019 at 8:14 AM Sean Owen wrote:

> Let's suggest "SPARK-12345:" but not go back and change a bunch of test
> cases.
> I'd add this only when a test specifically targets a certain issue.
> It's a nice-to-have, not super essential, just because in the rare
> case you need to understand why a test asserts something, you can go
> back and find what added it in the git history without much trouble.
>
> On Mon, Nov 11, 2019 at 10:46 AM Hyukjin Kwon  wrote:
> >
> > Hi all,
> >
> > Maybe it's not a big deal but it brought some confusions time to time
> into Spark dev and community. I think it's time to discuss about when/which
> format to add a JIRA ID as a prefix for the test case name in Scala test
> cases.
> >
> > Currently we have many test case names with prefixes as below:
> >
> > test("SPARK-X blah blah")
> > test("SPARK-X: blah blah")
> > test("SPARK-X - blah blah")
> > test("[SPARK-X] blah blah")
> > …
> >
> > It is a good practice to have the JIRA ID in general because, for
> instance,
> > it makes us put less efforts to track commit histories (or even when the
> files
> > are totally moved), or to track related information of tests failed.
> > Considering Spark's getting big, I think it's good to document.
> >
> > I would like to suggest this and document it in our guideline:
> >
> > 1. Add a prefix into a test name when a PR adds a couple of tests.
> > 2. Uses "SPARK-: test name" format which is used in our code base
> most
> >   often[1].
> >
> > We should make it simple and clear but closer to the actual practice.
> So, I would like to listen to what other people think. I would appreciate
> if you guys give some feedback about when to add the JIRA prefix. One
> alternative is that, we only add the prefix when the JIRA's type is bug.
> >
> > [1]
> > git grep -E 'test\("\SPARK-([0-9]+):' | wc -l
> >  923
> > git grep -E 'test\("\SPARK-([0-9]+) ' | wc -l
> >  477
> > git grep -E 'test\("\[SPARK-([0-9]+)\]' | wc -l
> >   16
> > git grep -E 'test\("\SPARK-([0-9]+) -' | wc -l
> >   13
> >
> >
> >
>


Re: Adding JIRA ID as the prefix for the test case name

2019-11-11 Thread Hyukjin Kwon
In a few days, I will write this in our guidelines, probably after rewording
it a bit:

1. Add a prefix to a test name when a PR adds a couple of tests.
2. Use the "SPARK-: test name" format.

Please let me know if you have a different opinion about what/when to
write the JIRA ID as the prefix.
I would like to make sure this simple rule is close to your actual practice.
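
As a concrete illustration of rule 2 in a ScalaTest suite (a sketch only; the suite name and SPARK-12345 are made-up placeholders, not a real ticket, and ScalaTest 3.1+ style is assumed):

import org.scalatest.funsuite.AnyFunSuite

class ExampleSuite extends AnyFunSuite {
  // The JIRA ID prefix ties the test back to the ticket that motivated it.
  test("SPARK-12345: null values are preserved by the JDBC writer") {
    assert(Seq(Some(1), None).flatten === Seq(1))
  }
}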


On Tue, Nov 12, 2019 at 8:41 AM Gengliang wrote:

> +1 for making it a guideline. This is helpful when the test cases are
> moved to a different file.
>
> On Mon, Nov 11, 2019 at 3:23 PM Takeshi Yamamuro 
> wrote:
>
>> +1 for having that consistent rule in test names.
>> This is a trivial problem, though; I think documenting this rule in the
>> contribution guide
>> might make reviewer overhead a little smaller.
>>
>> Bests,
>> Takeshi
>>
>> On Tue, Nov 12, 2019 at 1:46 AM Hyukjin Kwon  wrote:
>>
>>> Hi all,
>>>
>>> Maybe it's not a big deal, but it has brought some confusion into the Spark
>>> dev community from time to time. I think it's time to discuss when and in
>>> which format to add a JIRA ID as a prefix for the test case name in Scala
>>> test cases.
>>>
>>> Currently we have many test case names with prefixes as below:
>>>
>>>- test("SPARK-X blah blah")
>>>- test("SPARK-X: blah blah")
>>>- test("SPARK-X - blah blah")
>>>- test("[SPARK-X] blah blah")
>>>- …
>>>
>>> It is a good practice to have the JIRA ID in general because, for instance,
>>> it takes less effort to track commit histories (even when the files are
>>> moved), or to track information related to test failures.
>>> Considering that Spark is getting big, I think it's good to document this.
>>>
>>> I would like to suggest this and document it in our guideline:
>>>
>>> 1. Add a prefix to a test name when a PR adds a couple of tests.
>>> 2. Use the "SPARK-XXXXX: test name" format, which is used most often in our
>>>    code base [1].
>>>
>>> We should make it simple and clear but close to the actual practice.
>>> So, I would like to hear what other people think. I would appreciate it
>>> if you could give some feedback about when to add the JIRA prefix. One
>>> alternative is that we only add the prefix when the JIRA's type is a bug.
>>>
>>> [1]
>>> git grep -E 'test\("\SPARK-([0-9]+):' | wc -l
>>>  923
>>> git grep -E 'test\("\SPARK-([0-9]+) ' | wc -l
>>>  477
>>> git grep -E 'test\("\[SPARK-([0-9]+)\]' | wc -l
>>>   16
>>> git grep -E 'test\("\SPARK-([0-9]+) -' | wc -l
>>>   13
>>>
>>>
>>>
>>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>


Adding JIRA ID as the prefix for the test case name

2019-11-11 Thread Hyukjin Kwon
Hi all,

Maybe it's not a big deal, but it has brought some confusion into the Spark
dev community from time to time. I think it's time to discuss when and in which
format to add a JIRA ID as a prefix for the test case name in Scala test
cases.

Currently we have many test case names with prefixes as below:

   - test("SPARK-X blah blah")
   - test("SPARK-X: blah blah")
   - test("SPARK-X - blah blah")
   - test("[SPARK-X] blah blah")
   - …

It is a good practice to have the JIRA ID in general because, for instance,
it takes less effort to track commit histories (even when the files are
moved), or to track information related to test failures.
Considering that Spark is getting big, I think it's good to document this.

I would like to suggest this and document it in our guideline:

1. Add a prefix to a test name when a PR adds a couple of tests.
2. Use the "SPARK-XXXXX: test name" format, which is used most often in our
   code base [1].

We should make it simple and clear but close to the actual practice. So, I
would like to hear what other people think. I would appreciate it if you
could give some feedback about when to add the JIRA prefix. One alternative
is that we only add the prefix when the JIRA's type is a bug.

[1]
git grep -E 'test\("\SPARK-([0-9]+):' | wc -l
 923
git grep -E 'test\("\SPARK-([0-9]+) ' | wc -l
 477
git grep -E 'test\("\[SPARK-([0-9]+)\]' | wc -l
  16
git grep -E 'test\("\SPARK-([0-9]+) -' | wc -l
  13


Re: dev/merge_spark_pr.py broken on python 2

2019-11-10 Thread Hyukjin Kwon
Yeah.. let's stick to Python 3 in general ..
I plan to drop Python 2 completely right after Spark 3.0 release.

The exception you faced seems to be because run_cmd now produces unicode
instead of bytes in Python 2 with the merge script. Later, this unicode seems
to be implicitly cast to bytes by %-formatting - IIRC the implicit cast uses
the default encoding, which is ASCII in Python 2.


On Sat, 9 Nov 2019, 03:32 Marcelo Vanzin, 
wrote:

> I remember merging PRs with non-ascii chars in the past...
>
> Anyway, for these scripts, might be easier to just use python3 for
> everything, instead of trying to keep them working on two different
> versions.
>
> On Fri, Nov 8, 2019 at 10:28 AM Sean Owen  wrote:
> >
> > Ah OK. I think it's the same type of issue that the last change
> > actually was trying to fix for Python 2. Here it seems like the author
> > name might have non-ASCII chars?
> > I don't immediately know enough to know how to resolve that for Python
> > 2. Something with how raw_input works, I take it. You could 'fix' the
> > author name if that's the case, or just use python 3.
> >
> > On Fri, Nov 8, 2019 at 12:20 PM Marcelo Vanzin 
> wrote:
> > >
> > > Something related to non-ASCII characters. Worked fine with python 3.
> > >
> > > git branch -D PR_TOOL_MERGE_PR_26426_MASTER
> > > Traceback (most recent call last):
> > >   File "./dev/merge_spark_pr.py", line 577, in 
> > > main()
> > >   File "./dev/merge_spark_pr.py", line 552, in main
> > > merge_hash = merge_pr(pr_num, target_ref, title, body,
> pr_repo_desc)
> > >   File "./dev/merge_spark_pr.py", line 147, in merge_pr
> > > distinct_authors[0])
> > > UnicodeEncodeError: 'ascii' codec can't encode character u'\xf8' in
> > > position 65: ordinal not in range(128)
> > > M   docs/running-on-kubernetes.md
> > > Already on 'master'
> > > Your branch is up to date with 'apache-github/master'.
> > > error: cannot pull with rebase: Your index contains uncommitted
> changes.
> > > error: please commit or stash them.
> > >
> > > On Fri, Nov 8, 2019 at 10:17 AM Sean Owen  wrote:
> > > >
> > > > Hm, the last change was on Oct 1, and should have actually helped it
> > > > still work with Python 2:
> > > >
> https://github.com/apache/spark/commit/2ec3265ae76fc1e136e44c240c476ce572b679df#diff-c321b6c82ebb21d8fd225abea9b7b74c
> > > >
> > > > Hasn't otherwise changed in a while. What's the error?
> > > >
> > > > On Fri, Nov 8, 2019 at 11:37 AM Marcelo Vanzin
> > > >  wrote:
> > > > >
> > > > > Hey all,
> > > > >
> > > > > Something broke that script when running with python 2.
> > > > >
> > > > > I know we want to deprecate python 2, but in that case, scripts
> should
> > > > > at least be changed to use "python3" in the shebang line...
> > > > >
> > > > > --
> > > > > Marcelo
> > > > >
> > > > >
> -
> > > > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> > > > >
> > >
> > >
> > >
> > > --
> > > Marcelo
>
>
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [DISCUSS] Remove sorting of fields in PySpark SQL Row construction

2019-11-07 Thread Hyukjin Kwon
+1

On Wed, Nov 6, 2019 at 11:38 PM, Wenchen Fan wrote:

> Sounds reasonable to me. We should make the behavior consistent within
> Spark.
>
> On Tue, Nov 5, 2019 at 6:29 AM Bryan Cutler  wrote:
>
>> Currently, when a PySpark Row is created with keyword arguments, the
>> fields are sorted alphabetically. This has created a lot of confusion with
>> users because it is not obvious (although it is stated in the pydocs) that
>> they will be sorted alphabetically. Then later when applying a schema and
>> the field order does not match, an error will occur. Here is a list of some
>> of the JIRAs that I have been tracking all related to this issue:
>> SPARK-24915, SPARK-22232, SPARK-27939, SPARK-27712, and relevant discussion
>> of the issue [1].
>>
>> The original reason for sorting fields is because kwargs in python < 3.6
>> are not guaranteed to be in the same order that they were entered [2].
>> Sorting alphabetically ensures a consistent order. Matters are further
>> complicated with the flag __from_dict__ that allows the Row fields to
>> be referenced by name when made by kwargs, but this flag is not serialized
>> with the Row and leads to inconsistent behavior. For instance:
>>
>> >>> spark.createDataFrame([Row(A="1", B="2")], "B string, A string").first()
>> Row(B='2', A='1')
>> >>> spark.createDataFrame(spark.sparkContext.parallelize([Row(A="1", B="2")]),
>> ... "B string, A string").first()
>> Row(B='1', A='2')
>>
>> I think the best way to fix this is to remove the sorting of fields when
>> constructing a Row. For users with Python 3.6+, nothing would change
>> because these versions of Python ensure that kwargs stay in the
>> order entered. For users with Python < 3.6, using kwargs would check a
>> conf to either raise an error or fallback to a LegacyRow that sorts the
>> fields as before. With Python < 3.6 being deprecated now, this LegacyRow
>> can also be removed at the same time. There are also other ways to create
>> Rows that will not be affected. I have opened a JIRA [3] to capture this,
>> but I am wondering what others think about fixing this for Spark 3.0?
>>
>> [1] https://github.com/apache/spark/pull/20280
>> [2] https://www.python.org/dev/peps/pep-0468/
>> [3] https://issues.apache.org/jira/browse/SPARK-29748
>>
>>


Re: [VOTE] SPARK 3.0.0-preview (RC2)

2019-11-01 Thread Hyukjin Kwon
+1

On Fri, 1 Nov 2019, 15:36 Wenchen Fan,  wrote:

> The PR builder uses Hadoop 2.7 profile, which makes me think that 2.7 is
> more stable and we should make releases using 2.7 by default.
>
> +1
>
> On Fri, Nov 1, 2019 at 7:16 AM Xiao Li  wrote:
>
>> Spark 3.0 will still use the Hadoop 2.7 profile by default, I think.
>> Hadoop 2.7 profile is much more stable than Hadoop 3.2 profile.
>>
>> On Thu, Oct 31, 2019 at 3:54 PM Sean Owen  wrote:
>>
>>> This isn't a big thing, but I see that the pyspark build includes
>>> Hadoop 2.7 rather than 3.2. Maybe later we change the build to put in
>>> 3.2 by default.
>>>
>>> Otherwise, the tests all seems to pass with JDK 8 / 11 with all
>>> profiles enabled, so I'm +1 on it.
>>>
>>>
>>> On Thu, Oct 31, 2019 at 1:00 AM Xingbo Jiang 
>>> wrote:
>>> >
>>> > Please vote on releasing the following candidate as Apache Spark
>>> version 3.0.0-preview.
>>> >
>>> > The vote is open until November 3 PST and passes if a majority +1 PMC
>>> votes are cast, with
>>> > a minimum of 3 +1 votes.
>>> >
>>> > [ ] +1 Release this package as Apache Spark 3.0.0-preview
>>> > [ ] -1 Do not release this package because ...
>>> >
>>> > To learn more about Apache Spark, please see http://spark.apache.org/
>>> >
>>> > The tag to be voted on is v3.0.0-preview-rc2 (commit
>>> 007c873ae34f58651481ccba30e8e2ba38a692c4):
>>> > https://github.com/apache/spark/tree/v3.0.0-preview-rc2
>>> >
>>> > The release files, including signatures, digests, etc. can be found at:
>>> > https://dist.apache.org/repos/dist/dev/spark/v3.0.0-preview-rc2-bin/
>>> >
>>> > Signatures used for Spark RCs can be found in this file:
>>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>>> >
>>> > The staging repository for this release can be found at:
>>> >
>>> https://repository.apache.org/content/repositories/orgapachespark-1336/
>>> >
>>> > The documentation corresponding to this release can be found at:
>>> > https://dist.apache.org/repos/dist/dev/spark/v3.0.0-preview-rc2-docs/
>>> >
>>> > The list of bug fixes going into 3.0.0 can be found at the following
>>> URL:
>>> > https://issues.apache.org/jira/projects/SPARK/versions/12339177
>>> >
>>> > FAQ
>>> >
>>> > =
>>> > How can I help test this release?
>>> > =
>>> >
>>> > If you are a Spark user, you can help us test this release by taking
>>> > an existing Spark workload and running on this release candidate, then
>>> > reporting any regressions.
>>> >
>>> > If you're working in PySpark you can set up a virtual env and install
>>> > the current RC and see if anything important breaks, in the Java/Scala
>>> > you can add the staging repository to your projects resolvers and test
>>> > with the RC (make sure to clean up the artifact cache before/after so
>>> > you don't end up building with an out of date RC going forward).
>>> >
>>> > ===
>>> > What should happen to JIRA tickets still targeting 3.0.0?
>>> > ===
>>> >
>>> > The current list of open tickets targeted at 3.0.0 can be found at:
>>> > https://issues.apache.org/jira/projects/SPARK and search for "Target
>>> Version/s" = 3.0.0
>>> >
>>> > Committers should look at those and triage. Extremely important bug
>>> > fixes, documentation, and API tweaks that impact compatibility should
>>> > be worked on immediately.
>>> >
>>> > ==
>>> > But my bug isn't fixed?
>>> > ==
>>> >
>>> > In order to make timely releases, we will typically not hold the
>>> > release unless the bug in question is a regression from the previous
>>> > release. That being said, if there is something which is a regression
>>> > that has not been correctly targeted please ping me or a committer to
>>> > help target the issue.
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>
>> --
>>
>


Re: [DISCUSS] Deprecate Python < 3.6 in Spark 3.0

2019-10-28 Thread Hyukjin Kwon
+1 from me as well.

On Tue, Oct 29, 2019 at 5:34 AM, Xiangrui Meng wrote:

> +1. And we should start testing 3.7 and maybe 3.8 in Jenkins.
>
> On Thu, Oct 24, 2019 at 9:34 AM Dongjoon Hyun 
> wrote:
>
>> Thank you for starting the thread.
>>
>> In addition to that, we currently are testing Python 3.6 only in Apache
>> Spark Jenkins environment.
>>
>> Given that Python 3.8 is already out and Apache Spark 3.0.0 RC1 will
>> start next January
>> (https://spark.apache.org/versioning-policy.html), I'm +1 for the
>> deprecation (Python < 3.6) at Apache Spark 3.0.0.
>>
>> It's just a deprecation to prepare the next-step development cycle.
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Thu, Oct 24, 2019 at 1:10 AM Maciej Szymkiewicz <
>> mszymkiew...@gmail.com> wrote:
>>
>>> Hi everyone,
>>>
>>> While deprecation of Python 2 in 3.0.0 has been announced,
>>> there is no clear statement about specific continuing support of different
>>> Python 3 version.
>>>
>>> Specifically:
>>>
>>>- Python 3.4 has been retired this year.
>>>- Python 3.5 is already in the "security fixes only" mode and should
>>>be retired in the middle of 2020.
>>>
>>> Continued support of these two blocks the adoption of many new Python
>>> features (PEP 468), and it is hard to justify beyond 2020.
>>>
>>> Should these two be deprecated in 3.0.0 as well?
>>>
>>> --
>>> Best regards,
>>> Maciej
>>>
>>>


Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

2019-10-10 Thread Hyukjin Kwon
+1 (binding)

On Thu, Oct 10, 2019 at 5:11 PM, Takeshi Yamamuro wrote:

> Thanks for the great work, Gengliang!
>
> +1 for that.
> As I said before, the behaviour is pretty common in DBMSs, so the change
> helps DBMS users.
>
> Bests,
> Takeshi
>
>
> On Mon, Oct 7, 2019 at 5:24 PM Gengliang Wang <
> gengliang.w...@databricks.com> wrote:
>
>> Hi everyone,
>>
>> I'd like to call for a new vote on SPARK-28885
>>  "Follow ANSI store
>> assignment rules in table insertion by default" after revising the ANSI
>> store assignment policy(SPARK-29326
>> ).
>> When inserting a value into a column with a different data type, Spark
>> performs type coercion. Currently, we support 3 policies for the store
>> assignment rules: ANSI, legacy and strict, which can be set via the option
>> "spark.sql.storeAssignmentPolicy":
>> 1. ANSI: Spark performs the store assignment as per ANSI SQL. In
>> practice, the behavior is mostly the same as PostgreSQL. It disallows
>> certain unreasonable type conversions such as converting `string` to `int`
>> and `double` to `boolean`. It will throw a runtime exception if the value
>> is out-of-range(overflow).
>> 2. Legacy: Spark allows the store assignment as long as it is a valid
>> `Cast`, which is very loose. E.g., converting either `string` to `int` or
>> `double` to `boolean` is allowed. It is the current behavior in Spark 2.x
>> for compatibility with Hive. When inserting an out-of-range value to an
>> integral field, the low-order bits of the value are inserted (the same as
>> Java/Scala numeric type casting). For example, if 257 is inserted into a
>> field of Byte type, the result is 1.
>> 3. Strict: Spark doesn't allow any possible precision loss or data
>> truncation in store assignment, e.g., converting either `double` to `int`
>> or `decimal` to `double` is allowed. The rules are originally for Dataset
>> encoder. As far as I know, no mainstream DBMS is using this policy by
>> default.
>>
>> Currently, the V1 data source uses "Legacy" policy by default, while V2
>> uses "Strict". This proposal is to use "ANSI" policy by default for both V1
>> and V2 in Spark 3.0.
>>
>> This vote is open until Friday (Oct. 11).
>>
>> [ ] +1: Accept the proposal
>> [ ] +0
>> [ ] -1: I don't think this is a good idea because ...
>>
>> Thank you!
>>
>> Gengliang
>>
>
>
> --
> ---
> Takeshi Yamamuro
>
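
For reference, a minimal sketch of how the three store assignment policies
above differ in practice, assuming a Spark 3.0 build with this proposal
applied; the table name and values are made up, and the comments only restate
the behavior described in the vote text above:

// Hypothetical illustration; not taken from the Spark test suite.
spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI")
spark.sql("CREATE TABLE bytes_table (b TINYINT) USING parquet")
spark.sql("INSERT INTO bytes_table VALUES (257)")
// ANSI  : fails at runtime, because 257 overflows TINYINT
// LEGACY: stores the low-order bits, i.e. 1, as Spark 2.x does today
// STRICT: rejected, because int -> tinyint may lose data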


Re: Auto-closing PRs when there are no feedback or response from its author

2019-10-09 Thread Hyukjin Kwon
Yes, the problem was that it is difficult to automate. I think this has
been discussed twice(?) on the mailing list;
however, it ended up with nothing being done because automating it was
difficult.

I think that in the case of PRs, unlike JIRAs, there are more cases
that need manual judgement.

As an example, while JIRAs are generally easy to keep updated, I think
it might not be fair to ask authors to keep updating a PR and resolving its
conflicts while waiting indefinitely for review. For some large PRs,
it's kind of painful to always keep them updated.
It might be more reasonable to update them on request, when a committer has
some time to review.

> If there's little overhead to adoption, cool, though I doubt people
> will consistently use a new tag.

Yeah, this is a good point. But in fact the standard for when to use it is
quite simple - in a PR, leave a comment or review and add the tag.
In the case of Read the Docs, they seem to always add the tag whenever they
leave a comment or respond.

In fact, I myself am not sure how useful it would be, but to me it
looked worth trying. I remember we tried
such bots before and dropped them when they turned out not to be very useful in
practice.



On Wed, Oct 9, 2019 at 11:26 AM, Sean Owen wrote:

> I'm generally all for closing pretty old PRs. They can be reopened
> easily. Closing a PR (a particular proposal for how to resolve an
> issue) is less drastic than closing a JIRA (a description of an
> issue). Closing them just delivers the reality, that nobody is going
> to otherwise revisit it, and can actually prompt a few contributors to
> update or revisit their proposal.
>
> I wouldn't necessarily want to adopt new process or tools though. Is
> it not sufficient to auto-close PRs that have a merge conflict and
> haven't been updated in months? or just haven't been updated in a
> year? Those are probably manual-ish processes, but, don't need to
> happen more than a couple times a year.
>
> If there's little overhead to adoption, cool, though I doubt people
> will consistently use a new tag. I'd prefer any process or tool that
> implements the above.
>
>
> On Tue, Oct 8, 2019 at 8:19 PM Hyukjin Kwon  wrote:
> >
> > Hi all,
> >
> > I think we talked about this before. Roughly speaking, there are two
> cases of PRs:
> >   1. PRs waiting for review and 2. PRs waiting for author's reaction
> > We might not have to take an action but wait for reviewing for the first
> case.
> > However, we can ping and/or take an action for the second case.
> >
> > I noticed (at Read the Docs,
> https://github.com/readthedocs/readthedocs.org/blob/master/.github/no-response.yml)
> there's one bot integrated with Github app that does exactly what we want
> (see https://github.com/probot/no-response).
> >
> > 1. Maintainers (committers) can add a tag to a PR (e.g.,
> need-more-information)
> > 2. If the PR author responds with a comment or update, the bot removes
> the tag
> > 3. If the PR author does not respond, the bot closes the PR after
> waiting for the configured number of days.
> >
> > We already have a kind of simple mechanism for windowing the number of
> JIRAs. I think it's time to have such mechanism in Github PR as well.
> >
> > Although this repo doesn't look popular or widely used enough, seems
> exactly matched to what we want and less aggressive since this mechanism
> will only work when maintainers (committers) add a tag to a PR.
> >
> > WDYT guys?
> >
> > I cc'ed few people who I think were in the past similar discussions.
> >
>


Re: Auto-closing PRs when there are no feedback or response from its author

2019-10-09 Thread Hyukjin Kwon
> 1. Although we close old JIRA issues on EOL-version only, but some issues
doesn't have `Affected Versions` field  info at all.
>- https://issues.apache.org/jira/browse/SPARK-8542

For this case, actually, I thought we had resolved all such cases .. maybe some
of them slipped out of my hands.
A few years ago, we made the affected version a required field:
[image: Screen Shot 2019-10-09 at 3.36.15 PM.png]
It would be good to resolve them, at least to let reporters update the
affected versions, and all such JIRAs will be old JIRAs anyway.


> 2. Although we can do auto-close PRs that have a merge conflict and
haven't been updated in months, but some PRs don't have conflicts.
> - https://github.com/apache/spark/pull/7842 (Actually, this is the
oldest PR due to the above reason.)

Yeah, this is a good point. This might be one of the reasons to go with the
manual tagging approach, so such PRs can be identified case by case.



On Wed, Oct 9, 2019 at 3:02 PM, Dongjoon Hyun wrote:

> Thank you for keeping eyes on this difficult issue, Hyukjin.
>
> Although we try our best, there exist some corner cases always. For
> examples,
>
> 1. Although we close old JIRA issues on EOL-version only, but some issues
> doesn't have `Affected Versions` field  info at all.
> - https://issues.apache.org/jira/browse/SPARK-8542
>
> 2. Although we can do auto-close PRs that have a merge conflict and
> haven't been updated in months, but some PRs don't have conflicts.
> - https://github.com/apache/spark/pull/7842 (Actually, this is the
> oldest PR due to the above reason.)
>
> So, I'm +1 for trying to add a new manual tagging process
> because I believe it's better than no-activity status and that sounds
> softer than the direct closing due to the grace-period.
>
> Bests,
> Dongjoon.
>
>
> On Tue, Oct 8, 2019 at 7:26 PM Sean Owen  wrote:
>
>> I'm generally all for closing pretty old PRs. They can be reopened
>> easily. Closing a PR (a particular proposal for how to resolve an
>> issue) is less drastic than closing a JIRA (a description of an
>> issue). Closing them just delivers the reality, that nobody is going
>> to otherwise revisit it, and can actually prompt a few contributors to
>> update or revisit their proposal.
>>
>> I wouldn't necessarily want to adopt new process or tools though. Is
>> it not sufficient to auto-close PRs that have a merge conflict and
>> haven't been updated in months? or just haven't been updated in a
>> year? Those are probably manual-ish processes, but, don't need to
>> happen more than a couple times a year.
>>
>> If there's little overhead to adoption, cool, though I doubt people
>> will consistently use a new tag. I'd prefer any process or tool that
>> implements the above.
>>
>>
>> On Tue, Oct 8, 2019 at 8:19 PM Hyukjin Kwon  wrote:
>> >
>> > Hi all,
>> >
>> > I think we talked about this before. Roughly speaking, there are two
>> cases of PRs:
>> >   1. PRs waiting for review and 2. PRs waiting for author's reaction
>> > We might not have to take an action but wait for reviewing for the
>> first case.
>> > However, we can ping and/or take an action for the second case.
>> >
>> > I noticed (at Read the Docs,
>> https://github.com/readthedocs/readthedocs.org/blob/master/.github/no-response.yml)
>> there's one bot integrated with Github app that does exactly what we want
>> (see https://github.com/probot/no-response).
>> >
>> > 1. Maintainers (committers) can add a tag to a PR (e.g.,
>> need-more-information)
>> > 2. If the PR author responds with a comment or update, the bot removes
>> the tag
>> > 3. If the PR author does not respond, the bot closes the PR after
>> waiting for the configured number of days.
>> >
>> > We already have a kind of simple mechanism for windowing the number of
>> JIRAs. I think it's time to have such mechanism in Github PR as well.
>> >
>> > Although this repo doesn't look popular or widely used enough, seems
>> exactly matched to what we want and less aggressive since this mechanism
>> will only work when maintainers (committers) add a tag to a PR.
>> >
>> > WDYT guys?
>> >
>> > I cc'ed few people who I think were in the past similar discussions.
>> >
>>
>


Auto-closing PRs when there are no feedback or response from its author

2019-10-08 Thread Hyukjin Kwon
Hi all,

I think we talked about this before. Roughly speaking, there are two cases
of PRs:
  1. PRs waiting for review, and 2. PRs waiting for the author's reaction.
For the first case, we might not have to take any action but simply wait for
the review.
For the second case, however, we can ping the author and/or take an action.

I noticed (at Read the Docs,
https://github.com/readthedocs/readthedocs.org/blob/master/.github/no-response.yml)
there's one bot integrated with Github app that does exactly what we want
(see https://github.com/probot/no-response).

1. Maintainers (committers) can add a tag to a PR (e.g.,
need-more-information)
2. If the PR author responds with a comment or update, the bot removes the
tag
3. If the PR author does not respond, the bot closes the PR after waiting
for the configured number of days.

We already have a kind of simple mechanism for keeping the number of open
JIRAs in check. I think it's time to have such a mechanism for GitHub PRs as well.

Although this repo doesn't look very popular or widely used, it seems to match
exactly what we want, and it is less aggressive since this mechanism
will only work when maintainers (committers) add a tag to a PR.

WDYT guys?

I cc'ed a few people who I think were involved in similar discussions in the past.


Re: Resolving all JIRAs affecting EOL releases

2019-10-07 Thread Hyukjin Kwon
I am going to resolve those JIRAs now.

On Mon, Sep 9, 2019 at 9:46 AM, Hyukjin Kwon wrote:

> Yup, no worries. I roughly set the one week delay considering the official
> release date :D
>
> On Mon, 9 Sep 2019, 09:45 Dongjoon Hyun,  wrote:
>
>> Thank you, Hyukjin.
>>
>> +1 for closing according to 2.3.x EOL.
>>
>> For the timing, please do that after the official 2.3.4 release
>> announcement.
>>
>> Bests,
>> Dongjoon.
>>
>> On Sun, Sep 8, 2019 at 16:27 Sean Owen  wrote:
>>
>>> I think simply closing old issues with no activity in a long time is
>>> OK. The "Affected Version" is somewhat noisy, so not even particularly
>>> important to also query, but yeah I see some value in trying to limit
>>> the scope this way.
>>>
>>> On Sat, Sep 7, 2019 at 10:15 PM Hyukjin Kwon 
>>> wrote:
>>> >
>>> > HI all,
>>> >
>>> > We have resolved JIRAs that targets EOL releases (up to Spark 2.2.x)
>>> in order to make it
>>> > the manageable size before.
>>> > Since Spark 2.3.4 will be EOL release, I plan to do this again roughly
>>> in a week.
>>> >
>>> > The JIRAs that has not been updated for the last year, and having
>>> affect version of EOL releases will be:
>>> >   - Resolved as 'Incomplete' status
>>> >   - Has a 'bulk-closed' label.
>>> >
>>> > I plan to use this JQL
>>> >
>>> > project = SPARK
>>> >   AND status in (Open, "In Progress", Reopened)
>>> >   AND (
>>> > affectedVersion = EMPTY OR
>>> > NOT (affectedVersion in versionMatch("^3.*")
>>> >   OR affectedVersion in versionMatch("^2.4.*")
>>> > )
>>> >   )
>>> >   AND updated <= -52w
>>> >
>>> >
>>> > You could click this link and check.
>>> >
>>> >
>>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20(affectedVersion%20%3D%20EMPTY%20OR%20NOT%20(affectedVersion%20in%20versionMatch(%22%5E3.*%22)%20OR%20affectedVersion%20in%20versionMatch(%22%5E2.4.*%22)))%20AND%20updated%20%3C%3D%20-52w
>>> >
>>> > Please let me know if you guys have any concern or opinion on this.
>>> >
>>> > Thanks.
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>


Re: Spark 3.0 preview release feature list and major changes

2019-10-07 Thread Hyukjin Kwon
Cogroup Pandas UDF is missing:

SPARK-27463 Support Dataframe Cogroup via Pandas UDFs

Vectorized R execution is also missing:

SPARK-26759 Arrow optimization in SparkR's interoperability


On Tue, Oct 8, 2019 at 7:50 AM, Jungtaek Lim wrote:

> Thanks for bringing the nice summary of Spark 3.0 improvements!
>
> I'd like to add some items from structured streaming side,
>
> SPARK-28199  Move
> Trigger implementations to Triggers.scala and avoid exposing these to the
> end users (removal of deprecated)
> SPARK-23539  Add
> support for Kafka headers in Structured Streaming
> SPARK-25501  Add kafka
> delegation token support (there were follow-up issues to add
> functionalities like support multi clusters, etc.)
> SPARK-26848  Introduce
> new option to Kafka source: offset by timestamp (starting/ending)
> SPARK-28074  Log warn
> message on possible correctness issue for multiple stateful operations in
> single query
>
> and core side,
>
> SPARK-23155  New
> feature: apply custom log URL pattern for executor log URLs in SHS
> (follow-up issue expanded the functionality to Spark UI as well)
>
> FYI if we count on current work in progress, there's ongoing umbrella
> issue regarding rolling event log & snapshot (SPARK-28594
> ) which we struggle to
> get things done in Spark 3.0.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
>
> On Tue, Oct 8, 2019 at 7:02 AM Xingbo Jiang  wrote:
>
>> Hi all,
>>
>> I went over all the finished JIRA tickets targeted to Spark 3.0.0, here
>> I'm listing all the notable features and major changes that are ready to
>> test/deliver, please don't hesitate to add more to the list:
>>
>> SPARK-11215  Multiple
>> columns support added to various Transformers: StringIndexer
>>
>> SPARK-11150 
>> Implement Dynamic Partition Pruning
>>
>> SPARK-13677  Support
>> Tree-Based Feature Transformation
>>
>> SPARK-16692  Add
>> MultilabelClassificationEvaluator
>>
>> SPARK-19591  Add
>> sample weights to decision trees
>>
>> SPARK-19712  Pushing
>> Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc.
>>
>> SPARK-19827  R API
>> for Power Iteration Clustering
>>
>> SPARK-20286  Improve
>> logic for timing out executors in dynamic allocation
>>
>> SPARK-20636 
>> Eliminate unnecessary shuffle with adjacent Window expressions
>>
>> SPARK-22148  Acquire
>> new executors to avoid hang because of blacklisting
>>
>> SPARK-22796  Multiple
>> columns support added to various Transformers: PySpark QuantileDiscretizer
>>
>> SPARK-23128  A new
>> approach to do adaptive execution in Spark SQL
>>
>> SPARK-23674  Add
>> Spark ML Listener for Tracking ML Pipeline Status
>>
>> SPARK-23710  Upgrade
>> the built-in Hive to 2.3.5 for hadoop-3.2
>>
>> SPARK-24333  Add fit
>> with validation set to Gradient Boosted Trees: Python API
>>
>> SPARK-24417  Build
>> and Run Spark on JDK11
>>
>> SPARK-24615 
>> Accelerator-aware task scheduling for Spark
>>
>> SPARK-24920  Allow
>> sharing Netty's memory pool allocators
>>
>> SPARK-25250  Fix race
>> condition with tasks running when new attempt for same stage is created
>> leads to other task in the next attempt running on the same partition id
>> retry multiple times
>>
>> SPARK-25341  Support
>> rolling back a shuffle map stage and re-generate the shuffle files
>>
>> SPARK-25348  Data
>> source for binary files
>>
>> SPARK-25603 
>> Generalize Nested Column Pruning
>>
>> SPARK-26132 

Re: [DISCUSS] Spark 2.5 release

2019-09-22 Thread Hyukjin Kwon
+1 for Matei's suggestion as well.

On Sun, 22 Sep 2019, 14:59 Marco Gaido,  wrote:

> I agree with Matei too.
>
> Thanks,
> Marco
>
> Il giorno dom 22 set 2019 alle ore 03:44 Dongjoon Hyun <
> dongjoon.h...@gmail.com> ha scritto:
>
>> +1 for Matei's suggestion!
>>
>> Bests,
>> Dongjoon.
>>
>> On Sat, Sep 21, 2019 at 5:44 PM Matei Zaharia 
>> wrote:
>>
>>> If the goal is to get people to try the DSv2 API and build DSv2 data
>>> sources, can we recommend the 3.0-preview release for this? That would get
>>> people shifting to 3.0 faster, which is probably better overall compared to
>>> maintaining two major versions. There’s not that much else changing in 3.0
>>> if you already want to update your Java version.
>>>
>>> On Sep 21, 2019, at 2:45 PM, Ryan Blue 
>>> wrote:
>>>
>>> > If you insist we shouldn't change the unstable temporary API in 3.x .
>>> . .
>>>
>>> Not what I'm saying at all. I said we should carefully consider whether
>>> a breaking change is the right decision in the 3.x line.
>>>
>>> All I'm suggesting is that we can make a 2.5 release with the feature
>>> and an API that is the same as the one in 3.0.
>>>
>>> > I also don't get this backporting a giant feature to 2.x line
>>>
>>> I am planning to do this so we can use DSv2 before 3.0 is released. Then
>>> we can have a source implementation that works in both 2.x and 3.0 to make
>>> the transition easier. Since I'm already doing the work, I'm offering to
>>> share it with the community.
>>>
>>>
>>> On Sat, Sep 21, 2019 at 2:36 PM Reynold Xin  wrote:
>>>
 Because for example we'd need to move the location of InternalRow,
 breaking the package name. If you insist we shouldn't change the unstable
 temporary API in 3.x to maintain compatibility with 3.0, which is totally
 different from my understanding of the situation when you exposed it, then
 I'd say we should gate 3.0 on having a stable row interface.

 I also don't get this backporting a giant feature to 2.x line ... as
 suggested by others in the thread, DSv2 would be one of the main reasons
 people upgrade to 3.0. What's so special about DSv2 that we are doing this?
 Why not abandoning 3.0 entirely and backport all the features to 2.x?



 On Sat, Sep 21, 2019 at 2:31 PM, Ryan Blue  wrote:

> Why would that require an incompatible change?
>
> We *could* make an incompatible change and remove support for
> InternalRow, but I think we would want to carefully consider whether that
> is the right decision. And in any case, we would be able to keep 2.5 and
> 3.0 compatible, which is the main goal.
>
> On Sat, Sep 21, 2019 at 2:28 PM Reynold Xin 
> wrote:
>
> How would you not make incompatible changes in 3.x? As discussed the
> InternalRow API is not stable and needs to change.
>
> On Sat, Sep 21, 2019 at 2:27 PM Ryan Blue  wrote:
>
> > Making downstream to diverge their implementation heavily between
> minor versions (say, 2.4 vs 2.5) wouldn't be a good experience
>
> You're right that the API has been evolving in the 2.x line. But, it
> is now reasonably stable with respect to the current feature set and we
> should not need to break compatibility in the 3.x line. Because we have
> reached our goals for the 3.0 release, we can backport at least those
> features to 2.x and confidently have an API that works in both a 2.x
> release and is compatible with 3.0, if not 3.1 and later releases as well.
>
> > I'd rather say preparation of Spark 2.5 should be started after
> Spark 3.0 is officially released
>
> The reason I'm suggesting this is that I'm already going to do the
> work to backport the 3.0 release features to 2.4. I've been asked by
> several people when DSv2 will be released, so I know there is a lot of
> interest in making this available sooner than 3.0. If I'm already doing 
> the
> work, then I'd be happy to share that with the community.
>
> I don't see why 2.5 and 3.0 are mutually exclusive. We can work on 2.5
> while preparing the 3.0 preview and fixing bugs. For DSv2, the work is
> about complete so we can easily release the same set of features and API 
> in
> 2.5 and 3.0.
>
> If we decide for some reason to wait until after 3.0 is released, I
> don't know that there is much value in a 2.5. The purpose is to be a step
> toward 3.0, and releasing that step after 3.0 doesn't seem helpful to me.
> It also wouldn't get these features out any sooner than 3.0, as a 2.5
> release probably would, given the work needed to validate the incompatible
> changes in 3.0.
>
> > DSv2 change would be the major backward incompatibility which Spark
> 2.x users may hesitate to upgrade
>
> As I pointed out, DSv2 has been changing in the 2.x line, so this is
> expected. I don't think it will need incompatible changes in the 3.x line.
>

Re: Resolving all JIRAs affecting EOL releases

2019-09-08 Thread Hyukjin Kwon
Yup, no worries. I roughly set the one week delay considering the official
release date :D

On Mon, 9 Sep 2019, 09:45 Dongjoon Hyun,  wrote:

> Thank you, Hyukjin.
>
> +1 for closing according to 2.3.x EOL.
>
> For the timing, please do that after the official 2.3.4 release
> announcement.
>
> Bests,
> Dongjoon.
>
> On Sun, Sep 8, 2019 at 16:27 Sean Owen  wrote:
>
>> I think simply closing old issues with no activity in a long time is
>> OK. The "Affected Version" is somewhat noisy, so not even particularly
>> important to also query, but yeah I see some value in trying to limit
>> the scope this way.
>>
>> On Sat, Sep 7, 2019 at 10:15 PM Hyukjin Kwon  wrote:
>> >
>> > HI all,
>> >
>> > We have resolved JIRAs that targets EOL releases (up to Spark 2.2.x) in
>> order to make it
>> > the manageable size before.
>> > Since Spark 2.3.4 will be EOL release, I plan to do this again roughly
>> in a week.
>> >
>> > The JIRAs that has not been updated for the last year, and having
>> affect version of EOL releases will be:
>> >   - Resolved as 'Incomplete' status
>> >   - Has a 'bulk-closed' label.
>> >
>> > I plan to use this JQL
>> >
>> > project = SPARK
>> >   AND status in (Open, "In Progress", Reopened)
>> >   AND (
>> > affectedVersion = EMPTY OR
>> > NOT (affectedVersion in versionMatch("^3.*")
>> >   OR affectedVersion in versionMatch("^2.4.*")
>> > )
>> >   )
>> >   AND updated <= -52w
>> >
>> >
>> > You could click this link and check.
>> >
>> >
>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20(affectedVersion%20%3D%20EMPTY%20OR%20NOT%20(affectedVersion%20in%20versionMatch(%22%5E3.*%22)%20OR%20affectedVersion%20in%20versionMatch(%22%5E2.4.*%22)))%20AND%20updated%20%3C%3D%20-52w
>> >
>> > Please let me know if you guys have any concern or opinion on this.
>> >
>> > Thanks.
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: Resolving all JIRAs affecting EOL releases

2019-09-07 Thread Hyukjin Kwon
Thanks for checking it.

I think it's fine by two reasons below:

1. It has another condition for such cases - the one-year time range.
  Basically, such PRs have not been merged for one year, and I believe they
are not likely to be merged soon.
  The JIRA status will be updated when such PRs are merged anyway.

2. The JIRAs and PRs should ideally be kept updated. If the PR authors forgot
to update the affected versions,
this could be a good ping to update the affected versions in the JIRA,
which I believe is a good practice.

FWIW, setting 'In Progress' currently doesn't work properly, and it has been
like this for a few months.
I raised this issue several times at
http://apache-spark-developers-list.1001551.n3.nabble.com/In-Apache-Spark-JIRA-spark-dev-github-jira-sync-py-not-running-properly-td27077.html
because
it blocked me from searching JIRAs. I had to change my JQL to check JIRAs. It's
still not fixed, and I don't know who to ask about this.

If this is not going to be fixed, we might not have to care about 'In Progress'
anymore.


On Sun, Sep 8, 2019 at 1:31 PM, Takeshi Yamamuro wrote:

> Hi, Hyukjin,
>
> I checked the entries in the list and found that some of them have
> 'In Progress' status and open PRs (e.g., SPARK-25211
> <https://issues.apache.org/jira/browse/SPARK-25211>).
> Can we also close these PRs as part of the bulk close?
> (But we might need to check the corresponding PRs manually?)
>
> Bests,
> Takeshi
>
>
> On Sun, Sep 8, 2019 at 12:15 PM Hyukjin Kwon  wrote:
>
>> HI all,
>>
>> We have resolved JIRAs that targets EOL releases (up to Spark 2.2.x) in
>> order to make it
>> the manageable size before.
>> Since Spark 2.3.4 will be EOL release, I plan to do this again roughly in
>> a week.
>>
>> The JIRAs that has not been updated for the last year, and having affect
>> version of EOL releases will be:
>>   - Resolved as 'Incomplete' status
>>   - Has a 'bulk-closed' label.
>>
>> I plan to use this JQL
>>
>> project = SPARK
>>   AND status in (Open, "In Progress", Reopened)
>>   AND (
>> affectedVersion = EMPTY OR
>> NOT (affectedVersion in versionMatch("^3.*")
>>   OR affectedVersion in versionMatch("^2.4.*")
>> )
>>   )
>>   AND updated <= -52w
>>
>>
>> You could click this link and check.
>>
>>
>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20(affectedVersion%20%3D%20EMPTY%20OR%20NOT%20(affectedVersion%20in%20versionMatch(%22%5E3.*%22)%20OR%20affectedVersion%20in%20versionMatch(%22%5E2.4.*%22)))%20AND%20updated%20%3C%3D%20-52w
>>
>> Please let me know if you guys have any concern or opinion on this.
>>
>> Thanks.
>>
>
>
> --
> ---
> Takeshi Yamamuro
>


Resolving all JIRAs affecting EOL releases

2019-09-07 Thread Hyukjin Kwon
HI all,

We have previously resolved JIRAs that target EOL releases (up to Spark 2.2.x)
in order to keep
the backlog at a manageable size.
Since Spark 2.3.4 will be an EOL release, I plan to do this again in roughly a
week.

The JIRAs that have not been updated for the last year and have an affected
version of an EOL release will be:
  - Resolved with the 'Incomplete' status
  - Given a 'bulk-closed' label.

I plan to use this JQL

project = SPARK
  AND status in (Open, "In Progress", Reopened)
  AND (
affectedVersion = EMPTY OR
NOT (affectedVersion in versionMatch("^3.*")
  OR affectedVersion in versionMatch("^2.4.*")
)
  )
  AND updated <= -52w


You could click this link and check.

https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20(affectedVersion%20%3D%20EMPTY%20OR%20NOT%20(affectedVersion%20in%20versionMatch(%22%5E3.*%22)%20OR%20affectedVersion%20in%20versionMatch(%22%5E2.4.*%22)))%20AND%20updated%20%3C%3D%20-52w

Please let me know if you guys have any concern or opinion on this.

Thanks.


Re: DataSourceV2: pushFilters() is not invoked for each read call - spark 2.3.2

2019-09-06 Thread Hyukjin Kwon
I believe this issue was fixed in Spark 2.4.

Spark DataSource V2 is still being radically developed - it is not
complete yet.
So, I think the feasible options to get through this at the moment are:
  1. upgrade to a higher Spark version
  2. disable filter pushdown in your DataSource V2 implementation

I don't think the Spark community will backport or fix things in branch-2.3,
which will be an EOL release soon.
For each branch, DataSource V2 has totally different code.
Fixing it specifically in each branch would bring considerable overhead.
I believe that's usually the case for internal Spark forks as
well.
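
As a rough sketch of option 2 on Spark 2.3.x (illustrative only - the class
name and schema below are made up and are not the reporter's actual
MyDataSourceReader), a reader can simply decline to push any filter, so Spark
re-applies the filters itself and the shared reader instance cannot leak
pushed filters across scans:

import java.util.{Collections, List => JList}

import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.sources.v2.reader.{DataReaderFactory, DataSourceReader, SupportsPushDownFilters}
import org.apache.spark.sql.types.StructType

// Hypothetical sketch; a real implementation would return its actual partitions.
class NoPushdownReader extends DataSourceReader with SupportsPushDownFilters {
  override def readSchema(): StructType = new StructType().add("c1", "int")

  // Report every filter back as "not pushed" so Spark evaluates them after the scan.
  override def pushFilters(filters: Array[Filter]): Array[Filter] = filters
  override def pushedFilters(): Array[Filter] = Array.empty

  override def createDataReaderFactories(): JList[DataReaderFactory[Row]] =
    Collections.emptyList()
}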



On Fri, Sep 6, 2019 at 3:25 PM, Shubham Chaurasia wrote:

> Hi,
>
> I am using spark v2.3.2. I have an implementation of DSV2. Here is what is
> happening:
>
> 1) Obtained a dataframe using MyDataSource
>
> scala> val df1 = spark.read.format("com.shubham.MyDataSource").load
>> MyDataSource.MyDataSource
>> MyDataSource.createReader: Going to create a new MyDataSourceReader
>> MyDataSourceReader.MyDataSourceReader:
>> Instantiatedcom.shubham.reader.MyDataSourceReader@2b85edc7
>> MyDataSourceReader.readSchema:
>> com.shubham.reader.MyDataSourceReader@2b85edc7 baseSchema:
>> StructType(StructField(c1,IntegerType,true),
>> StructField(c2,IntegerType,true), StructField(c3,IntegerType,true))
>> df1: org.apache.spark.sql.DataFrame = [c1: int, c2: int ... 1 more field]
>>
>
> 2) show() on df1
>
>> scala> df1.show
>> MyDataSourceReader.pushedFilters: []
>> MyDataSourceReader.pushedFilters: []
>> MyDataSourceReader.pushedFilters: []
>> MyDataSourceReader.pushedFilters: []
>> MyDataSourceReader.pushedFilters: []
>> MyDataSourceReader.pushedFilters: []
>> MyDataSourceReader.pushedFilters: []
>> MyDataSourceReader.pushedFilters: []
>> MyDataSourceReader.pushedFilters: []
>> MyDataSourceReader.pushedFilters: []
>> MyDataSourceReader.pushedFilters: []
>> MyDataSourceReader.pushedFilters: []
>> MyDataSourceReader.pruneColumns:
>> StructType(StructField(c1,IntegerType,true),
>> StructField(c2,IntegerType,true), StructField(c3,IntegerType,true))
>> MyDataSourceReader.readSchema:
>> com.shubham.reader.MyDataSourceReader@2b85edc7 baseSchema:
>> StructType(StructField(c1,IntegerType,true),
>> StructField(c2,IntegerType,true), StructField(c3,IntegerType,true))
>> MyDataSourceReader.pushedFilters: []
>> MyDataSourceReader.pushedFilters: []
>> MyDataSourceReader.pushedFilters: []
>> MyDataSourceReader.pushedFilters: []
>> ===MyDataSourceReader.createBatchDataReaderFactories===
>> prunedSchema = StructType(StructField(c1,IntegerType,true),
>> StructField(c2,IntegerType,true), StructField(c3,IntegerType,true))
>> pushedFilters = []
>> ===MyDataSourceReader.createBatchDataReaderFactories===
>> +---+---+---+
>> | c1| c2| c3|
>> +---+---+---+
>> +---+---+---+
>>
>
> 3) val df2 = df1.filter($"c3" > 1)
>
>>
>> scala> df2.show
>> MyDataSourceReader.pushedFilters: []
>> MyDataSourceReader.pushedFilters: []
>> MyDataSourceReader.pushedFilters: []
>> MyDataSourceReader.pushedFilters: []
>> MyDataSourceReader.pushFilters: [IsNotNull(c3), GreaterThan(c3,1)]
>> MyDataSourceReader.pruneColumns:
>> StructType(StructField(c1,IntegerType,true),
>> StructField(c2,IntegerType,true), StructField(c3,IntegerType,true))
>> MyDataSourceReader.readSchema:
>> com.shubham.reader.MyDataSourceReader@2b85edc7 baseSchema:
>> StructType(StructField(c1,IntegerType,true),
>> StructField(c2,IntegerType,true), StructField(c3,IntegerType,true))
>> MyDataSourceReader.pushedFilters: [IsNotNull(c3), GreaterThan(c3,1)]
>> MyDataSourceReader.pushedFilters: [IsNotNull(c3), GreaterThan(c3,1)]
>> MyDataSourceReader.pushedFilters: [IsNotNull(c3), GreaterThan(c3,1)]
>> MyDataSourceReader.pushedFilters: [IsNotNull(c3), GreaterThan(c3,1)]
>> ===MyDataSourceReader.createBatchDataReaderFactories===
>> prunedSchema = StructType(StructField(c1,IntegerType,true),
>> StructField(c2,IntegerType,true), StructField(c3,IntegerType,true))
>> pushedFilters = [IsNotNull(c3), GreaterThan(c3,1)]
>> ===MyDataSourceReader.createBatchDataReaderFactories===
>> +---+---+---+
>> | c1| c2| c3|
>> +---+---+---+
>> +---+---+---+
>
>
> 4) Again df1.show() <=== As df2 is derived from df1 (and shares the same
> instance of MyDataSourceReader), this modifies pushedFilters even for df1
>
>> scala> df1.show
>> MyDataSourceReader.pushedFilters: [IsNotNull(c3), GreaterThan(c3,1)]
>> MyDataSourceReader.pushedFilters: [IsNotNull(c3), GreaterThan(c3,1)]
>> MyDataSourceReader.pushedFilters: [IsNotNull(c3), GreaterThan(c3,1)]
>> MyDataSourceReader.pushedFilters: [IsNotNull(c3), GreaterThan(c3,1)]
>> MyDataSourceReader.pushedFilters: [IsNotNull(c3), GreaterThan(c3,1)]
>> MyDataSourceReader.pushedFilters: [IsNotNull(c3), GreaterThan(c3,1)]
>> MyDataSourceReader.pushedFilters: [IsNotNull(c3), GreaterThan(c3,1)]
>> MyDataSourceReader.pushedFilters: [IsNotNull(c3), GreaterThan(c3,1)]
>> MyDataSourceReader.pushedFilters: [IsNotNull(c3), GreaterThan(c3,1)]
>> 

Re: [ANNOUNCE] Announcing Apache Spark 2.4.4

2019-09-01 Thread Hyukjin Kwon
YaY!

On Mon, Sep 2, 2019 at 1:27 PM, Wenchen Fan wrote:

> Great! Thanks!
>
> On Mon, Sep 2, 2019 at 5:55 AM Dongjoon Hyun 
> wrote:
>
>> We are happy to announce the availability of Spark 2.4.4!
>>
>> Spark 2.4.4 is a maintenance release containing stability fixes. This
>> release is based on the branch-2.4 maintenance branch of Spark. We
>> strongly
>> recommend all 2.4 users to upgrade to this stable release.
>>
>> To download Spark 2.4.4, head over to the download page:
>> http://spark.apache.org/downloads.html
>>
>> Note that you might need to clear your browser cache or
>> to use `Private`/`Incognito` mode according to your browsers.
>>
>> To view the release notes:
>> https://spark.apache.org/releases/spark-release-2-4-4.html
>>
>> We would like to acknowledge all community members for contributing to
>> this
>> release. This release would not have been possible without you.
>>
>> Dongjoon Hyun
>>
>


Re: [VOTE] Release Apache Spark 2.4.4 (RC3)

2019-08-28 Thread Hyukjin Kwon
+1 (from the last blocker PR)

On Thu, Aug 29, 2019 at 8:20 AM, Takeshi Yamamuro wrote:

> I checked that the tests passed again on the same env.
> It looks OK.
>
>
> On Thu, Aug 29, 2019 at 6:15 AM Marcelo Vanzin 
> wrote:
>
>> +1
>>
>> On Tue, Aug 27, 2019 at 4:06 PM Dongjoon Hyun 
>> wrote:
>> >
>> > Please vote on releasing the following candidate as Apache Spark
>> version 2.4.4.
>> >
>> > The vote is open until August 30th 5PM PST and passes if a majority +1
>> PMC votes are cast, with a minimum of 3 +1 votes.
>> >
>> > [ ] +1 Release this package as Apache Spark 2.4.4
>> > [ ] -1 Do not release this package because ...
>> >
>> > To learn more about Apache Spark, please see http://spark.apache.org/
>> >
>> > The tag to be voted on is v2.4.4-rc3 (commit
>> 7955b3962ac46b89564e0613db7bea98a1478bf2):
>> > https://github.com/apache/spark/tree/v2.4.4-rc3
>> >
>> > The release files, including signatures, digests, etc. can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v2.4.4-rc3-bin/
>> >
>> > Signatures used for Spark RCs can be found in this file:
>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>> >
>> > The staging repository for this release can be found at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1332/
>> >
>> > The documentation corresponding to this release can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v2.4.4-rc3-docs/
>> >
>> > The list of bug fixes going into 2.4.4 can be found at the following
>> URL:
>> > https://issues.apache.org/jira/projects/SPARK/versions/12345466
>> >
>> > This release is using the release script of the tag v2.4.4-rc3.
>> >
>> > FAQ
>> >
>> > =
>> > How can I help test this release?
>> > =
>> >
>> > If you are a Spark user, you can help us test this release by taking
>> > an existing Spark workload and running on this release candidate, then
>> > reporting any regressions.
>> >
>> > If you're working in PySpark you can set up a virtual env and install
>> > the current RC and see if anything important breaks, in the Java/Scala
>> > you can add the staging repository to your projects resolvers and test
>> > with the RC (make sure to clean up the artifact cache before/after so
>> > you don't end up building with a out of date RC going forward).
>> >
>> > ===
>> > What should happen to JIRA tickets still targeting 2.4.4?
>> > ===
>> >
>> > The current list of open tickets targeted at 2.4.4 can be found at:
>> > https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 2.4.4
>> >
>> > Committers should look at those and triage. Extremely important bug
>> > fixes, documentation, and API tweaks that impact compatibility should
>> > be worked on immediately. Everything else please retarget to an
>> > appropriate release.
>> >
>> > ==
>> > But my bug isn't fixed?
>> > ==
>> >
>> > In order to make timely releases, we will typically not hold the
>> > release unless the bug in question is a regression from the previous
>> > release. That being said, if there is something which is a regression
>> > that has not been correctly targeted please ping me or a committer to
>> > help target the issue.
>>
>>
>>
>> --
>> Marcelo
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
> --
> ---
> Takeshi Yamamuro
>


Re: JDK11 Support in Apache Spark

2019-08-27 Thread Hyukjin Kwon
YaY!

On Tue, Aug 27, 2019 at 3:36 PM, Dongjoon Hyun wrote:

> Hi, All.
>
> Thank you for your attention!
>
> UPDATE: We succeeded to build with JDK8 and test with JDK11.
>
> - https://github.com/apache/spark/pull/25587
> -
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4842
> (Scala/Java/Python/R)
>
> We are ready to release Maven artifacts as a single artifact for both JDK8
> and JDK11.
>
> According to this email thread, I believe this is the last piece to
> resolve the following issue.
>
> https://issues.apache.org/jira/browse/SPARK-24417 (Build and Run
> Spark on JDK11)
>
> To committers, please use `[test-hadoop3.2][test-java11]` to verify JDK11
> compatibility on the relevant PRs.
>
> Bests,
> Dongjoon.
>


Re: [VOTE] Release Apache Spark 2.4.4 (RC2)

2019-08-26 Thread Hyukjin Kwon
-1

Seems there's one critical correctness issue specifically in branch-2.4 ...
Please take a look at https://github.com/apache/spark/pull/25593

On Tue, Aug 27, 2019 at 2:38 PM, Takeshi Yamamuro wrote:

> Hi, Dongjoon
>
> I checked that all the test passed on my Mac/x86_64 env with:
> -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pkubernetes
> -Pkubernetes-integration-tests -Psparkr
>
> maropu@~/spark-2.4.4-rc2:$java -version
> java version "1.8.0_181"
> Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
>
> Bests,
> Takeshi
>
>
> On Tue, Aug 27, 2019 at 11:06 AM Sean Owen  wrote:
>
>> +1 as per response to RC1. The existing issues identified there seem
>> to have been fixed.
>>
>>
>> On Mon, Aug 26, 2019 at 2:45 AM Dongjoon Hyun 
>> wrote:
>> >
>> > Please vote on releasing the following candidate as Apache Spark
>> version 2.4.4.
>> >
>> > The vote is open until August 29th 1AM PST and passes if a majority +1
>> PMC votes are cast, with a minimum of 3 +1 votes.
>> >
>> > [ ] +1 Release this package as Apache Spark 2.4.4
>> > [ ] -1 Do not release this package because ...
>> >
>> > To learn more about Apache Spark, please see http://spark.apache.org/
>> >
>> > The tag to be voted on is v2.4.4-rc2 (commit
>> b7a15b69aca8a2fc3f308105e5978a69dff0f4fb):
>> > https://github.com/apache/spark/tree/v2.4.4-rc2
>> >
>> > The release files, including signatures, digests, etc. can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v2.4.4-rc2-bin/
>> >
>> > Signatures used for Spark RCs can be found in this file:
>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>> >
>> > The staging repository for this release can be found at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1327/
>> >
>> > The documentation corresponding to this release can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v2.4.4-rc2-docs/
>> >
>> > The list of bug fixes going into 2.4.4 can be found at the following
>> URL:
>> > https://issues.apache.org/jira/projects/SPARK/versions/12345466
>> >
>> > This release is using the release script of the tag v2.4.4-rc2.
>> >
>> > FAQ
>> >
>> > =
>> > How can I help test this release?
>> > =
>> >
>> > If you are a Spark user, you can help us test this release by taking
>> > an existing Spark workload and running on this release candidate, then
>> > reporting any regressions.
>> >
>> > If you're working in PySpark you can set up a virtual env and install
>> > the current RC and see if anything important breaks, in the Java/Scala
>> > you can add the staging repository to your projects resolvers and test
>> > with the RC (make sure to clean up the artifact cache before/after so
>> > you don't end up building with a out of date RC going forward).
>> >
>> > ===
>> > What should happen to JIRA tickets still targeting 2.4.4?
>> > ===
>> >
>> > The current list of open tickets targeted at 2.4.4 can be found at:
>> > https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 2.4.4
>> >
>> > Committers should look at those and triage. Extremely important bug
>> > fixes, documentation, and API tweaks that impact compatibility should
>> > be worked on immediately. Everything else please retarget to an
>> > appropriate release.
>> >
>> > ==
>> > But my bug isn't fixed?
>> > ==
>> >
>> > In order to make timely releases, we will typically not hold the
>> > release unless the bug in question is a regression from the previous
>> > release. That being said, if there is something which is a regression
>> > that has not been correctly targeted please ping me or a committer to
>> > help target the issue.
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
> --
> ---
> Takeshi Yamamuro
>


Re: Release Spark 2.3.4

2019-08-17 Thread Hyukjin Kwon
+1 too

On Sat, Aug 17, 2019 at 3:06 PM, Dilip Biswal wrote:

> +1
>
> Regards,
> Dilip Biswal
> Tel: 408-463-4980
> dbis...@us.ibm.com
>
>
>
> - Original message -
> From: John Zhuge 
> To: Xiao Li 
> Cc: Takeshi Yamamuro , Spark dev list <
> dev@spark.apache.org>, Kazuaki Ishizaki 
> Subject: [EXTERNAL] Re: Release Spark 2.3.4
> Date: Fri, Aug 16, 2019 4:33 PM
>
> +1
>
> On Fri, Aug 16, 2019 at 4:25 PM Xiao Li  wrote:
>
> +1
>
> On Fri, Aug 16, 2019 at 4:11 PM Takeshi Yamamuro 
> wrote:
>
> +1, too
>
> Bests,
> Takeshi
>
> On Sat, Aug 17, 2019 at 7:25 AM Dongjoon Hyun 
> wrote:
>
> +1 for 2.3.4 release as the last release for `branch-2.3` EOL.
>
> Also, +1 for next week release.
>
> Bests,
> Dongjoon.
>
>
> On Fri, Aug 16, 2019 at 8:19 AM Sean Owen  wrote:
>
> I think it's fine to do these in parallel, yes. Go ahead if you are
> willing.
>
> On Fri, Aug 16, 2019 at 9:48 AM Kazuaki Ishizaki 
> wrote:
> >
> > Hi, All.
> >
> > Spark 2.3.3 was released six months ago (15th February, 2019) at
> http://spark.apache.org/news/spark-2-3-3-released.html, and about 18
> months have passed since Spark 2.3.0 was released (28th February,
> 2018).
> > As of today (16th August), there are 103 commits (69 JIRAs) in
> `branch-2.3` since 2.3.3.
> >
> > It would be great if we can have Spark 2.3.4.
> > If it is ok, shall we start `2.3.4 RC1` concurrently with 2.4.4, or after
> 2.4.4 is released?
> >
> > An issue list in JIRA:
> https://issues.apache.org/jira/projects/SPARK/versions/12344844
> > A commit list in GitHub since the last release:
> https://github.com/apache/spark/compare/66fd9c34bf406a4b5f86605d06c9607752bd637a...branch-2.3
> > The 8 correctness issues resolved in branch-2.3:
> >
> https://issues.apache.org/jira/browse/SPARK-26873?jql=project%20%3D%2012315420%20AND%20fixVersion%20%3D%2012344844%20AND%20labels%20in%20(%27correctness%27)%20ORDER%20BY%20priority%20DESC%2C%20key%20ASC
> >
> > Best Regards,
> > Kazuaki Ishizaki
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
>
>
> --
> ---
> Takeshi Yamamuro
>
>
>
> --
> [image: Databricks Summit - Watch the talks]
> 
>
>
>
> --
> John Zhuge
>
>
>
> - To
> unsubscribe e-mail: dev-unsubscr...@spark.apache.org


Re: [DISCUSS] Migrate development scripts under dev/ from Python2 to Python 3

2019-08-15 Thread Hyukjin Kwon
Yeah, we will probably drop Python 2 entirely after 3.0.0. Python 2 is
already deprecated.
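
As a side note, the typing benefit Fokko mentions below is easy to picture with a small, hypothetical helper checked by mypy. This is made up for illustration and is not taken from the dev/ scripts:

from typing import Dict, List


def count_commits_by_author(log_lines: List[str]) -> Dict[str, int]:
    """Count commits per author from `git log --format=%ae`-style output."""
    counts = {}  # type: Dict[str, int]
    for line in log_lines:
        author = line.strip()
        if author:
            counts[author] = counts.get(author, 0) + 1
    return counts


print(count_commits_by_author(["a@x.org", "b@y.org", "a@x.org"]))
# {'a@x.org': 2, 'b@y.org': 1}
# mypy would flag a call like count_commits_by_author("not-a-list")
# ("str" is not "List[str]") before any human review happens.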

On Thu, 15 Aug 2019, 18:25 Driesprong, Fokko,  wrote:

> Sorry for the late reply, was a bit busy lately, but I still would like to
> share my thoughts on this.
>
> For Apache Airflow we're dropping support for Python 2 in the next major
> release. We're now supporting Python 3.5+. Mostly because:
>
>- Easier to maintain and test, and less if/else constructions for the
>different Python versions. Also, not having to test against Python 2.x
>reduces the build matrix.
>- Python 3 has support for typing. From Python 3.5 you can include
>provisional type hints. An excellent presentation by Guido himself:
>https://www.youtube.com/watch?v=2wDvzy6Hgxg. From Python 3.5 it is
>still provisional, but it is a really good idea. From Airflow we've noticed
>that using mypy is catching bugs early:
>   - This will put less stress on the (boring part of the) reviewing
>   process since a lot of this stuff is checked automatically.
>   - For new developers, it is easier to read the code because of the
>   annotations.
>   - Can be used as an input for generated documentation (or check if
>   it still in sync with the docstrings)
>   - Easier to extend the code since you know what kind of types you
>   can expect, and your IDE will also pick up the hinting.
>- Python 2.x will be EOL end this year
>
> I have a strong preference to migrate everything to Python 3.
>
> Cheers, Fokko
>
>
> Op wo 7 aug. 2019 om 12:14 schreef Weichen Xu :
>
>> All right we could support both Python 2 and Python 3 for spark 3.0.
>>
>> On Wed, Aug 7, 2019 at 6:10 PM Hyukjin Kwon  wrote:
>>
>>> We didn't drop Python 2 yet although it's deprecated. So I think It
>>> should support both Python 2 and Python 3 at the current status.
>>>
>>> On Wed, Aug 7, 2019 at 6:54 PM, Weichen Xu wrote:
>>>
>>>> Hi all,
>>>>
>>>> I would like to discuss the compatibility for dev scripts. Because we
>>>> already decided to deprecate python2 in spark 3.0, for development scripts
>>>> under dev/ , we have two choice:
>>>> 1) Migration from Python 2 to Python 3
>>>> 2) Support both Python 2 and Python 3
>>>>
>>>> I tend to option (2) which is more friendly to maintenance.
>>>>
>>>> Regards,
>>>> Weichen
>>>>
>>>


Re: [DISCUSS] Migrate development scripts under dev/ from Python2 to Python 3

2019-08-15 Thread Hyukjin Kwon
I mean Python 2 _will be_ deprecated in Spark 3.

On Thu, 15 Aug 2019, 18:37 Hyukjin Kwon,  wrote:

> Yeah, we will probably drop Python 2 entirely after 3.0.0. Python 2 is
> already deprecated.
>
> On Thu, 15 Aug 2019, 18:25 Driesprong, Fokko, 
> wrote:
>
>> Sorry for the late reply, was a bit busy lately, but I still would like
>> to share my thoughts on this.
>>
>> For Apache Airflow we're dropping support for Python 2 in the next major
>> release. We're now supporting Python 3.5+. Mostly because:
>>
>>- Easier to maintain and test, and less if/else constructions for the
>>different Python versions. Also, not having to test against Python 2.x
>>reduces the build matrix.
>>- Python 3 has support for typing. From Python 3.5 you can include
>>provisional type hints. An excellent presentation by Guido himself:
>>https://www.youtube.com/watch?v=2wDvzy6Hgxg. From Python 3.5 it is
>>still provisional, but it is a really good idea. From Airflow we've 
>> noticed
>>that using mypy is catching bugs early:
>>   - This will put less stress on the (boring part of the) reviewing
>>   process since a lot of this stuff is checked automatically.
>>   - For new developers, it is easier to read the code because of the
>>   annotations.
>>   - Can be used as an input for generated documentation (or check if
>>   it still in sync with the docstrings)
>>   - Easier to extend the code since you know what kind of types you
>>   can expect, and your IDE will also pick up the hinting.
>>- Python 2.x will be EOL end this year
>>
>> I have a strong preference to migrate everything to Python 3.
>>
>> Cheers, Fokko
>>
>>
>> Op wo 7 aug. 2019 om 12:14 schreef Weichen Xu > >:
>>
>>> All right we could support both Python 2 and Python 3 for spark 3.0.
>>>
>>> On Wed, Aug 7, 2019 at 6:10 PM Hyukjin Kwon  wrote:
>>>
>>>> We didn't drop Python 2 yet although it's deprecated. So I think It
>>>> should support both Python 2 and Python 3 at the current status.
>>>>
>>>> On Wed, Aug 7, 2019 at 6:54 PM, Weichen Xu wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I would like to discuss the compatibility for dev scripts. Because we
>>>>> already decided to deprecate python2 in spark 3.0, for development scripts
>>>>> under dev/ , we have two choice:
>>>>> 1) Migration from Python 2 to Python 3
>>>>> 2) Support both Python 2 and Python 3
>>>>>
>>>>> I tend to option (2) which is more friendly to maintenance.
>>>>>
>>>>> Regards,
>>>>> Weichen
>>>>>
>>>>


Re: Release Apache Spark 2.4.4

2019-08-14 Thread Hyukjin Kwon
Adding Shixiong

WDYT?

On Wed, Aug 14, 2019 at 2:30 PM, Terry Kim wrote:

> Can the following be included?
>
> [SPARK-27234][SS][PYTHON] Use InheritableThreadLocal for current epoch in
> EpochTracker (to support Python UDFs)
> 
>
> Thanks,
> Terry
>
> On Tue, Aug 13, 2019 at 10:24 PM Wenchen Fan  wrote:
>
>> +1
>>
>> On Wed, Aug 14, 2019 at 12:52 PM Holden Karau 
>> wrote:
>>
>>> +1
>>> Does anyone have any critical fixes they’d like to see in 2.4.4?
>>>
>>> On Tue, Aug 13, 2019 at 5:22 PM Sean Owen  wrote:
>>>
 Seems fine to me if there are enough valuable fixes to justify another
 release. If there are any other important fixes imminent, it's fine to
 wait for those.


 On Tue, Aug 13, 2019 at 6:16 PM Dongjoon Hyun 
 wrote:
 >
 > Hi, All.
 >
 > Spark 2.4.3 was released three months ago (8th May).
 > As of today (13th August), there are 112 commits (75 JIRAs) in
 `branch-2.4` since 2.4.3.
 >
 > It would be great if we can have Spark 2.4.4.
 > Shall we start `2.4.4 RC1` next Monday (19th August)?
 >
 > Last time, there was a request for K8s issue and now I'm waiting for
 SPARK-27900.
 > Please let me know if there is another issue.
 >
 > Thanks,
 > Dongjoon.

 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

 --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>


Re: Release Apache Spark 2.4.4

2019-08-13 Thread Hyukjin Kwon
+1

On Wed, Aug 14, 2019 at 9:13 AM, Takeshi Yamamuro wrote:

> Hi,
>
> Thanks for your notification, Dongjoon!
> I put some links for the other committers/PMCs to access the info easily:
>
> A commit list in GitHub since the last release:
> https://github.com/apache/spark/compare/5ac2014e6c118fbeb1fe8e5c8064c4a8ee9d182a...branch-2.4
> An issue list in JIRA:
> https://issues.apache.org/jira/projects/SPARK/versions/12345466#release-report-tab-body
> The 5 correctness issues resolved in branch-2.4:
>
> https://issues.apache.org/jira/browse/SPARK-27798?jql=project%20%3D%2012315420%20AND%20fixVersion%20%3D%2012345466%20AND%20labels%20in%20(%27correctness%27)%20ORDER%20BY%20priority%20DESC%2C%20key%20ASC
>
> Anyway, +1
>
> Best,
> Takeshi
>
> On Wed, Aug 14, 2019 at 8:25 AM DB Tsai  wrote:
>
>> +1
>>
>> On Tue, Aug 13, 2019 at 4:16 PM Dongjoon Hyun 
>> wrote:
>> >
>> > Hi, All.
>> >
>> > Spark 2.4.3 was released three months ago (8th May).
>> > As of today (13th August), there are 112 commits (75 JIRAs) in
>> `branch-2.4` since 2.4.3.
>> >
>> > It would be great if we can have Spark 2.4.4.
>> > Shall we start `2.4.4 RC1` next Monday (19th August)?
>> >
>> > Last time, there was a request for K8s issue and now I'm waiting for
>> SPARK-27900.
>> > Please let me know if there is another issue.
>> >
>> > Thanks,
>> > Dongjoon.
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
> --
> ---
> Takeshi Yamamuro
>


Re: Recognizing non-code contributions

2019-08-07 Thread Hyukjin Kwon
> Currently, I have heard some ideas or attitudes that I consider to be
overly motivated by fear of unlikely occurrences.
> And I've heard some statements disregard widely accepted principles of
inclusiveness at the Apache Software Foundation.
> But I suspect that there's more to the attitude of not including
non-coding committers at Spark.

I missed some of the context you mentioned. Yes, SVN and Commons look like good
examples.
Also, for clarification, I did not mean that we should absolutely never add
non-coding committers.
Spark already has a build/config committer and I am happy with that.

I was replying to "the risk is very small". Given my experience in Spark
dev, people (and I) make mistakes
which, for instance, can block the release for months. Sometimes it requires
rewriting whole PRs as a courtesy
(rather than, for instance, reverting them). This is already happening and it
brings some overhead to the dev.
Yes, maybe the volume matters for handling those issues.

The point I was trying to make was that the commit bit can be too strong a sword
and might have to be
granted and used with familiarity and caution.

For clarification, beyond the one concern above, I have no issue with the fact
that someone becomes a non-code committer, since Spark already has one.


On Wed, Aug 7, 2019 at 6:04 PM, Myrle Krantz wrote:

>
>
> On Tue, Aug 6, 2019 at 7:57 PM Sean Owen  wrote:
>
>> On Tue, Aug 6, 2019 at 11:45 AM Myrle Krantz  wrote:
>> > I had understood your position to be that you would be willing to make
>> at least some non-coding contributors to committers but that your "line" is
>> somewhat different than my own.   My response to you assumed that position
>> on your part.  I do not think it's good for a project to accept absolutely
>> no non-code committers.  If nothing else, it violates my sense of fairness,
>> both towards those contributors, and also towards the ASF which relies on a
>> pipeline of non-code contributors who come to us through the projects.
>>
>> Oh OK, I thought this argument was made repeatedly: someone who has
>> not and evidently will not ever commit anything to a project doesn't
>> seem to need the commit bit. Agree to disagree. That was the
>> 'non-code' definition?
>>
>
> That argument was made and acknowledged.  And then answered with:
> a.) the commit bit is only part of what makes a committer, and not the
> most important part.
> b.) including the commit bit in the package is harmless.  The risk
> associated with giving someone the commit bit who is not going to use it is
> lower than the risk associated with the average pull request.
> c.) creating a new package without the commit-bit creates significant
> effort and bears significant risks.
>
>
>> Someone who contributes docs to the project? Sure. We actually have
>> done this, albeit for build and config contributions. Agree.
>>
>> Pardon a complicated analogy to explain my thinking, but: let's say
>> the space of acceptable decisions on adding committers at the ASF
>> ranges from 1 (Super Aggressive) to 10 (Very Conservative). Most
>> project decisions probably fall in, say, 3 to 7. Here we're debating
>> whether a project should theoretically at times go all the way to 1,
>> or at most 2, and I think that's just not that important. We're pretty
>> much agreeing 2 is not out of the question, 1 we agree to disagree.
>>
>> Spark decisions here are probably 5-7 on average. I'd like it be like
>> 4-6 personally. I suspect the real inbound argument is: all projects
>> should be making all decisions in 1-3 or else it isn't The Apache Way.
>> I accept anecdotes that projects function well in that range, but,
>> Spark and Hadoop don't seem to (nor evidently Cassandra). I have a
>> hard time rationalizing this. These are, after all, some of the
>> biggest and most successful projects at Apache. At times it sounds
>> like concern trolling, to 'help' these projects not fall apart.
>>
>> If so, you read correctly that there is a significant difference of
>> opinion here, but that's what it is. Not the theoretical debate above.
>>
>
> I think this misrepresents where the "middle" is in Apache projects.  I
> think the middle is probably closer to where OfBiz is: occasionally
> offering non-coding contributors committership, but probably not with the
> frequency I would like.  But even that occasional committership for
> non-coding committers has been extraordinarily important for the ASF as an
> organization.  Sharan Foga started as a non-coding contributor for OfBiz,
> and is now VP of Community Development at the ASF, and organized the Apache
> Roadshow in Berlin last year (where Spark talks were well-received and
> probably helped your community). The OfBiz project did us all a huge favor
> by providing Sharan with the first step into our organization.
>
> What you are perceiving as an extreme is the SVN project in which all you
> have to do to receive committership is to ask.  Or the commons project in
> which in which every ASF member is automatically a committer.  Those
> 

Re: [DISCUSS] Migrate development scripts under dev/ from Python2 to Python 3

2019-08-07 Thread Hyukjin Kwon
We didn't drop Python 2 yet, although it's deprecated. So I think the scripts
should support both Python 2 and Python 3 for now.
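
A rough sketch of what keeping a dev/ script runnable on both versions usually looks like (a generic illustration, not code from the actual scripts; it assumes git is on the PATH):

from __future__ import print_function

import subprocess
import sys


def run(cmd):
    """Run a command and always return text, on both Python 2 and Python 3."""
    out = subprocess.check_output(cmd)
    if sys.version_info[0] >= 3:
        out = out.decode("utf-8")  # bytes -> str on Python 3
    return out


if __name__ == "__main__":
    print(run(["git", "rev-parse", "--short", "HEAD"]).strip())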

2019년 8월 7일 (수) 오후 6:54, Weichen Xu 님이 작성:

> Hi all,
>
> I would like to discuss the compatibility of the dev scripts. Because we
> already decided to deprecate Python 2 in Spark 3.0, for the development scripts
> under dev/ we have two choices:
> 1) Migrate from Python 2 to Python 3
> 2) Support both Python 2 and Python 3
>
> I lean toward option (2), which is friendlier for maintenance.
>
> Regards,
> Weichen
>


Re: Recognizing non-code contributions

2019-08-06 Thread Hyukjin Kwon
> I wonder which projects nominate non-coding-only committers, but I at least
know multiple projects. They all have that serious problem then.

I meant that I know multiple projects that don't do that, and, according to what
you said, they all have that serious problem.

On Wed, Aug 7, 2019 at 1:05 AM, Hyukjin Kwon wrote:

> Well, actually I am rather less conservative on adding committers. There
> are multiple people who are active in both non-coding and coding activities.
> I as an example am one of Korean meetup admin and my main focus was to
> management JIRA. In addition, review the PRs that are not being reviewed.
> As I said earlier at the very first time, I think committers should
> ideally be used to the dev at some degrees as primary. Other contributions
> should be counted.
>
> I wonder which project nominees non-coding only committers but I at least
> know multiple projects. They all have that serious problem then.
>
> On Wed, Aug 7, 2019 at 12:46 AM, Myrle Krantz wrote:
>
>>
>>
>> On Tue, Aug 6, 2019 at 5:36 PM Sean Owen  wrote:
>>
>>> You can tell there's a range of opinions here. I'm probably less
>>> 'conservative' about adding committers than most on the PMC, right or
>>> wrong, but more conservative than some at the ASF. I think there's
>>> room to inch towards the middle ground here and this is good
>>> discussion informing the thinking.
>>>
>>
>> That's not actually my current reading of the Spark community.  My
>> current reading based on the responses of Hyukjin, and Jungtaek, is that
>> your community wouldn't take a non-coding committer no matter how clear
>> their contributions are to the community, and that by extension such a
>> person could never become a PMC member.
>>
>> If my reading is correct (and the sample size *is* still quite small, and
>> only includes one PMC member), I see that as a serious problem.
>>
>> How do the other PMC members and community members see this?
>>
>> Best Regards,
>> Myrle
>>
>


Re: Recognizing non-code contributions

2019-08-06 Thread Hyukjin Kwon
Well, actually I am rather less conservative on adding committers. There
are multiple people who are active in both non-coding and coding activities.
I, as an example, am one of the Korean meetup admins, and my main focus has been
managing JIRA and, in addition, reviewing the PRs that are not being reviewed.
As I said at the very first time, I think committers should ideally
be used to the dev work to some degree as their primary contribution. Other
contributions should be counted as well.

I wonder which projects nominate non-coding-only committers, but I at least
know multiple projects. They all have that serious problem then.

On Wed, Aug 7, 2019 at 12:46 AM, Myrle Krantz wrote:

>
>
> On Tue, Aug 6, 2019 at 5:36 PM Sean Owen  wrote:
>
>> You can tell there's a range of opinions here. I'm probably less
>> 'conservative' about adding committers than most on the PMC, right or
>> wrong, but more conservative than some at the ASF. I think there's
>> room to inch towards the middle ground here and this is good
>> discussion informing the thinking.
>>
>
> That's not actually my current reading of the Spark community.  My current
> reading based on the responses of Hyukjin, and Jungtaek, is that your
> community wouldn't take a non-coding committer no matter how clear their
> contributions are to the community, and that by extension such a person
> could never become a PMC member.
>
> If my reading is correct (and the sample size *is* still quite small, and
> only includes one PMC member), I see that as a serious problem.
>
> How do the other PMC members and community members see this?
>
> Best Regards,
> Myrle
>


Re: Recognizing non-code contributions

2019-08-06 Thread Hyukjin Kwon
I usually make such judgements about the commit bit based upon community
activity in coding and reviewing.
If somebody has no such activity, I would have no way to
know about them;
I simply can't make a judgement about coding activity based upon non-coding
activity.

Those bugs and commit issues are pretty critical in this project, as I
described. I would rather try to decrease that
possibility, not increase it, even when such a "commit bit" is unnecessary.

We have found and discussed other, nicer ways to recognize them, for
instance, listing them somewhere else on the Spark website.
Once they are in that list, I suspect it's easier, and closer to
committership, to, say, get an Apache email if it matters.

Shall we avoid such possibilities altogether and go for those other, safer ways?
I think you also accept that the commit bit is unnecessary in this case.
So we shouldn't unnecessarily give it out, since it is in any case critical in
this project.

> Based on this argumentation you will never invite any committers or even
merge any pull requests.
BTW, how did you reach that conclusion? I want somebody who can review PRs
and fix such bugs, rather than somebody who is more likely to make such
mistakes.


On Tue, Aug 6, 2019 at 7:26 PM, Myrle Krantz wrote:

> Hey Hyukjin,
>
> Apologies for sending this to you twice.  : o)
>
> On Tue, Aug 6, 2019 at 9:55 AM Hyukjin Kwon  wrote:
>
>> Myrle,
>>
>> > We need to balance two sets of risks here.  But in the case of access
>> to our software artifacts, the risk is very small, and already has
>> *multiple* mitigating factors, from the fact that all changes are tracked
>> to an individual, to the fact that there are notifications sent when
>> changes are made, (and I'm going to stop listing the benefits of a modern
>> source control system here, because I know you are aware of them), on
>> through the fact that you have automated tests, and continuing through the
>> fact that there is a release process during which artifacts get checked
>> again.
>> > If someone makes a commit who you are not expecting to make a commit,
>> or in an area you weren't expecting changes in, you'll notice that, right?
>> > What you're talking about here is your security model for your source
>> repository.  But restricting access isn't really the right security model
>> for an open source project.
>>
>> I don't quite get the argument about commit bit. I _strongly_ disagree
>> about "the risk is very small,".
>> Not all of committers track all the changes. There are so many changes in
>> the upstream and it's already overhead to check all.
>> Do you know how many bugs Spark faces due to such lack of reviews that
>> entirely blocks the release sometimes, and how much it takes time to fix up
>> such commits?
>> We need expertise and familiarity to Spark.
>>
>
> Let's unroll that a bit.  Say that you invite a non-coding contributor to
> be a committer.  To make an inappropriate commit two things would have to
> happen: this person would have to decide to make the commit, and this
> person would have to set up access to the git repository, either by
> enabling gitbox integration, or accessing the apache git repository
> directly.  Before you invite them you make an estimation of the probability
> that they would do the first: that is decide to make an inappropriate
> commit.  You decide that that is fairly unlikely.  But for a non-coding
> contributor the chances of them actually going through the mechanics of
> making a commit is even more unlikely.  I think we can safely assume that
> the chance of someone who you've determined is committed to the community
> and knows their limits of doing this is simply 00.00%.
>
> That leaves the question of what the chance is that this person will leak
> their credentials to a malicious third party intent on introducing bugs
> into Spark code.  Do you believe there are such malicious third parties?
> How many attacks have there been on Spark committer credentials?  I believe
> the likelihood of this happening is 00.00% (but I am willing to be swayed
> by evidence otherwise -- should probably be discussed on the private@
> list though if it's out there.: o).
>
> But let's say I'm wrong about both of those probabilities.  Let's say the
> combined probability of one of those two things happening is actually
> 0.01%.  This is where the advantages of modern source control and tests
> come in.  Even if there's only a 50% chance that watching commits will
> catch the error, and only a further 50% chance that tests will catch the
> error, and only a further 50% chance that the error will be caught in
> release testing, those chances multiply out at 00.00125%.
>
> Based on those guestima

Re: Recognizing non-code contributions

2019-08-06 Thread Hyukjin Kwon
So, here's my thought:

1. Back to the original point: to recognize such people, I think we
can simply list them somewhere on the Spark website. For instance,

  Person A: Spark Book
  Person B: Meetup leader

I don't know if ASF allows this. Someone needs to check it.


2. If we need the in-between status officially (e.g. an Apache email or
something), it should be raised and discussed at the ASF level, not in a single
project here.


On Tue, Aug 6, 2019 at 4:55 PM, Hyukjin Kwon wrote:

> Myrle,
>
> > We need to balance two sets of risks here.  But in the case of access to
> our software artifacts, the risk is very small, and already has *multiple*
> mitigating factors, from the fact that all changes are tracked to an
> individual, to the fact that there are notifications sent when changes are
> made, (and I'm going to stop listing the benefits of a modern source
> control system here, because I know you are aware of them), on through the
> fact that you have automated tests, and continuing through the fact that
> there is a release process during which artifacts get checked again.
> > If someone makes a commit who you are not expecting to make a commit, or
> in an area you weren't expecting changes in, you'll notice that, right?
> > What you're talking about here is your security model for your source
> repository.  But restricting access isn't really the right security model
> for an open source project.
>
> I don't quite get the argument about commit bit. I _strongly_ disagree
> about "the risk is very small,".
> Not all of committers track all the changes. There are so many changes in
> the upstream and it's already overhead to check all.
> Do you know how many bugs Spark faces due to such lack of reviews that
> entirely blocks the release sometimes, and how much it takes time to fix up
> such commits?
> We need expertise and familiarity to Spark.
>
> It virtually means we will add some more overhead to audit each commit,
> even for committers'. Why should we bother add such overhead to harm the
> project?
> To me, this is the most important fact. I don't think we should just count
> the number of positive and negative ones.
>
> For other reasons, we can just add or discuss about the "this kind of
> in-between status Apache-wide", which is a bigger scope than here. You can
> ask it to ASF and discuss further.
>
>
> On Tue, Aug 6, 2019 at 3:14 PM, Myrle Krantz wrote:
>
>> Hey Sean,
>>
>> Even though we are discussing our differences, on the whole I don't think
>> we're that far apart in our positions.  Still the differences are where the
>> conversation is actually interesting, so here goes:
>>
>> On Mon, Aug 5, 2019 at 3:55 PM Sean Owen  wrote:
>>
>>> On Mon, Aug 5, 2019 at 3:50 AM Myrle Krantz  wrote:
>>> > So... events coordinators?  I'd still make them committers.  I guess
>>> I'm still struggling to understand what problem making people VIP's without
>>> giving them committership is trying to solve.
>>>
>>> We may just agree to disagree, which is fine, but I think the argument
>>> is clear enough: such a person has zero need for the commit bit.
>>> Turning it around, what are we trying to accomplish by giving said
>>> person a commit bit? I know people say there's no harm, but I think
>>> there is at least _some_ downside. We're widening access to change
>>> software artifacts, the main thing that we put ASF process and checks
>>> around for liability reasons. I know the point is trust, and said
>>> person is likely to understand to never use the commit bit, but it
>>> brings us back to the same place. I don't wish to convince anyone else
>>> of my stance, though I do find it more logical, just that it's
>>> reasonable within The Apache Way.
>>>
>>
>> We need to balance two sets of risks here.  But in the case of access to
>> our software artifacts, the risk is very small, and already has *multiple*
>> mitigating factors, from the fact that all changes are tracked to an
>> individual, to the fact that there are notifications sent when changes are
>> made, (and I'm going to stop listing the benefits of a modern source
>> control system here, because I know you are aware of them), on through the
>> fact that you have automated tests, and continuing through the fact that
>> there is a release process during which artifacts get checked again.
>>
>> If someone makes a commit who you are not expecting to make a commit, or
>> in an area you weren't expecting changes in, you'll notice that, right?
>>
>> What you're talking about here is your security model for your source
>> repository.

Re: Recognizing non-code contributions

2019-08-06 Thread Hyukjin Kwon
Myrle,

> We need to balance two sets of risks here.  But in the case of access to
our software artifacts, the risk is very small, and already has *multiple*
mitigating factors, from the fact that all changes are tracked to an
individual, to the fact that there are notifications sent when changes are
made, (and I'm going to stop listing the benefits of a modern source
control system here, because I know you are aware of them), on through the
fact that you have automated tests, and continuing through the fact that
there is a release process during which artifacts get checked again.
> If someone makes a commit who you are not expecting to make a commit, or
in an area you weren't expecting changes in, you'll notice that, right?
> What you're talking about here is your security model for your source
repository.  But restricting access isn't really the right security model
for an open source project.

I don't quite get the argument about the commit bit. I _strongly_ disagree
that "the risk is very small".
Not all committers track all the changes. There are so many changes in
the upstream and it's already an overhead to check them all.
Do you know how many bugs Spark faces due to such lack of review, which
sometimes entirely blocks the release, and how much time it takes to fix up
such commits?
We need expertise in and familiarity with Spark.

It virtually means we will add some more overhead to audit each commit,
even committers'. Why should we bother adding such overhead and harm the
project?
To me, this is the most important fact. I don't think we should just count
the number of positive and negative points.

For the other reasons, we can discuss "this kind of
in-between status Apache-wide", which is a bigger scope than this project. You
can raise it with the ASF and discuss it further.


On Tue, Aug 6, 2019 at 3:14 PM, Myrle Krantz wrote:

> Hey Sean,
>
> Even though we are discussing our differences, on the whole I don't think
> we're that far apart in our positions.  Still the differences are where the
> conversation is actually interesting, so here goes:
>
> On Mon, Aug 5, 2019 at 3:55 PM Sean Owen  wrote:
>
>> On Mon, Aug 5, 2019 at 3:50 AM Myrle Krantz  wrote:
>> > So... events coordinators?  I'd still make them committers.  I guess
>> I'm still struggling to understand what problem making people VIP's without
>> giving them committership is trying to solve.
>>
>> We may just agree to disagree, which is fine, but I think the argument
>> is clear enough: such a person has zero need for the commit bit.
>> Turning it around, what are we trying to accomplish by giving said
>> person a commit bit? I know people say there's no harm, but I think
>> there is at least _some_ downside. We're widening access to change
>> software artifacts, the main thing that we put ASF process and checks
>> around for liability reasons. I know the point is trust, and said
>> person is likely to understand to never use the commit bit, but it
>> brings us back to the same place. I don't wish to convince anyone else
>> of my stance, though I do find it more logical, just that it's
>> reasonable within The Apache Way.
>>
>
> We need to balance two sets of risks here.  But in the case of access to
> our software artifacts, the risk is very small, and already has *multiple*
> mitigating factors, from the fact that all changes are tracked to an
> individual, to the fact that there are notifications sent when changes are
> made, (and I'm going to stop listing the benefits of a modern source
> control system here, because I know you are aware of them), on through the
> fact that you have automated tests, and continuing through the fact that
> there is a release process during which artifacts get checked again.
>
> If someone makes a commit who you are not expecting to make a commit, or
> in an area you weren't expecting changes in, you'll notice that, right?
>
> What you're talking about here is your security model for your source
> repository.  But restricting access isn't really the right security model
> for an open source project.
>
>
>> > It also just occurred to me this morning: There are actually other
>> privileges which go along with the "commit-bit" other than the ability to
>> commit at will to the project's repos: people who are committers get an
>> Apache e-mail address, and they get discounted entry to ApacheCon.  People
>> who are committers also get added to our committers mailing list, and are
>> thus a little easier to integrate into our foundation-wide efforts.
>> >
>> > To apply this to the example above, the Apache e-mail address can make
>> it a tad easier for an event coordinator to conduct official business for a
>> project.
>>
>> Great points. Again if I'm making it up? a "VIP" should get an Apache
>> email address and discounts. Sure, why not put them on a committers@
>> list too for visibility.
>>
>
> In order to do that, you'd need to create this kind of in-between status
> Apache-wide.  I would be very much opposed to doing that 

Re: [DISCUSS] New sections in Github Pull Request description template

2019-07-31 Thread Hyukjin Kwon
I opened a PR: https://github.com/apache/spark/pull/25310. Please take a look.

On Mon, Jul 29, 2019 at 4:35 PM, Hyukjin Kwon wrote:

> Thanks, guys. Let me probably mimic the template and open a PR soon -
> currently I am stuck in some works. I will take a look in few days later.
>
> On Sat, Jul 27, 2019 at 3:32 AM, Bryan Cutler wrote:
>
>> The k8s template is pretty good. Under the behavior change section, it
>> would be good to add instructions to also describe previous and new
>> behavior as Hyukjin proposed.
>>
>> On Tue, Jul 23, 2019 at 10:07 PM Reynold Xin  wrote:
>>
>>> I like the spirit, but not sure about the exact proposal. Take a look at
>>> k8s':
>>> https://raw.githubusercontent.com/kubernetes/kubernetes/master/.github/PULL_REQUEST_TEMPLATE.md
>>>
>>>
>>>
>>> On Tue, Jul 23, 2019 at 8:27 PM, Hyukjin Kwon 
>>> wrote:
>>>
>>>> (Plus, it helps to track history too. Spark's commit logs are growing
>>>> and now it's pretty difficult to track the history and see what change
>>>> introduced a specific behaviour)
>>>>
>>>> On Wed, Jul 24, 2019 at 12:20 PM, Hyukjin Kwon wrote:
>>>>
>>>> Hi all,
>>>>
>>>> I would like to discuss about some new sections under "## What changes
>>>> were proposed in this pull request?":
>>>>
>>>> ### Do the changes affect _any_ user/dev-facing input or output?
>>>>
>>>> (Please answer yes or no. If yes, answer the questions below)
>>>>
>>>> ### What was the previous behavior?
>>>>
>>>> (Please provide the console output, description and/or reproducer about 
>>>> the previous behavior)
>>>>
>>>> ### What is the behavior the changes propose?
>>>>
>>>> (Please provide the console output, description and/or reproducer about 
>>>> the behavior the changes propose)
>>>>
>>>> See
>>>> https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE
>>>>  .
>>>>
>>>> From my experience so far in Spark community, and assuming from the
>>>> interaction with other
>>>> committers and contributors, It is pretty critical to know before/after
>>>> behaviour changes even if it
>>>> was a bug. In addition, I think this is requested by reviewers often.
>>>>
>>>> The new sections will make review process much easier, and we're able
>>>> to quickly judge how serious the changes are.
>>>> Given that Spark community still suffer from open PRs just queueing up
>>>> without review, I think this can help
>>>> both reviewers and PR authors.
>>>>
>>>> I do describe them often when I think it's useful and possible.
>>>> For instance see https://github.com/apache/spark/pull/24927 - I am
>>>> sure you guys have clear idea what the
>>>> PR fixes.
>>>>
>>>> I cc'ed some guys I can currently think of for now FYI. Please let me
>>>> know if you guys have any thought on this!
>>>>
>>>>
>>>


Re: [Discuss] Follow ANSI SQL on table insertion

2019-07-30 Thread Hyukjin Kwon
From my look, +1 on the proposal, considering ANSI and other DBMSes in
general.
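
For context, the "silently corrupted data (null values)" case discussed below is easy to reproduce with today's default behavior; a minimal PySpark sketch (the column alias is made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("cast-demo").getOrCreate()

# With the legacy behavior, an invalid cast silently becomes NULL instead of
# failing, which is the "corrupted data" concern raised in this thread.
spark.sql("SELECT CAST('abc' AS INT) AS i").show()
# +----+
# |   i|
# +----+
# |null|
# +----+

spark.stop()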

On Tue, Jul 30, 2019 at 3:21 PM, Wenchen Fan wrote:

> We can add a config for a certain behavior if it makes sense, but the most
> important thing we want to reach an agreement here is: what should be the
> default behavior?
>
> Let's explore the solution space of table insertion behavior first:
> At compile time,
> 1. always add cast
> 2. add cast following the ASNI SQL store assignment rule (e.g. string to
> int is forbidden but long to int is allowed)
> 3. only add cast if it's 100% safe
> At runtime,
> 1. return null for invalid operations
> 2. throw exceptions at runtime for invalid operations
>
> The standards to evaluate a solution:
> 1. How robust the query execution is. For example, users usually don't
> want to see the query fail midway.
> 2. How tolerant it is of user queries. For example, a user may want to write
> long values to an int column because he knows all the long values won't exceed
> the int range.
> 3. How clean the result is. For example, users usually don't want to see
> silently corrupted data (null values).
>
> The current Spark behavior for Data Source V1 tables: always add cast and
> return null for invalid operations. This maximizes standard 1 and 2, but
> the result is least clean and users are very likely to see silently
> corrupted data (null values).
>
> The current Spark behavior for Data Source V2 tables (new in Spark 3.0):
> only add cast if it's 100% safe. This maximizes standard 1 and 3, but many
> queries may fail to compile, even if these queries can run on other SQL
> systems. Note that, people can still see silently corrupted data because
> cast is not the only one that can return corrupted data. Simple operations
> like ADD can also return corrected data if overflow happens. e.g. INSERT
> INTO t1 (intCol) SELECT anotherIntCol + 100 FROM t2
>
> The proposal here: add cast following ANSI SQL store assignment rule, and
> return null for invalid operations. This maximizes standard 1, and also
> fits standard 2 well: if a query can't compile in Spark, it usually can't
> compile in other mainstream databases as well. I think that's tolerant
> enough. For standard 3, this proposal doesn't maximize it but can avoid
> many invalid operations already.
>
> Technically we can't make the result 100% clean at compile-time, we have
> to handle things like overflow at runtime. I think the new proposal makes
> more sense as the default behavior.
>
>
> On Mon, Jul 29, 2019 at 8:31 PM Russell Spitzer 
> wrote:
>
>> I understand Spark is making the decisions; I'm saying the actual final
>> effect of the null decision would be different depending on the insertion
>> target if the target has different behaviors for null.
>>
>> On Mon, Jul 29, 2019 at 5:26 AM Wenchen Fan  wrote:
>>
>>> > I'm a big -1 on null values for invalid casts.
>>>
>>> This is why we want to introduce the ANSI mode, so that invalid cast
>>> fails at runtime. But we have to keep the null behavior for a while, to
>>> keep backward compatibility. Spark returns null for invalid cast since the
>>> first day of Spark SQL, we can't just change it without a way to restore to
>>> the old behavior.
>>>
>>> I'm OK with adding a strict mode for the upcast behavior in table
>>> insertion, but I don't agree with making it the default. The default
>>> behavior should be either the ANSI SQL behavior or the legacy Spark
>>> behavior.
>>>
>>> > other modes should be allowed only with strict warning the behavior
>>> will be determined by the underlying sink.
>>>
>>> Seems there is some misunderstanding. The table insertion behavior is
>>> fully controlled by Spark. Spark decides when to add cast and Spark decided
>>> whether invalid cast should return null or fail. The sink is only
>>> responsible for writing data, not the type coercion/cast stuff.
>>>
>>> On Sun, Jul 28, 2019 at 12:24 AM Russell Spitzer <
>>> russell.spit...@gmail.com> wrote:
>>>
 I'm a big -1 on null values for invalid casts. This can lead to a lot
 of even more unexpected errors and runtime behavior since null is

 1. Not allowed in all schemas (Leading to a runtime error anyway)
 2. Is the same as delete in some systems (leading to data loss)

 And this would be dependent on the sink being used. Spark won't just be
 interacting with ANSI compliant sinks so I think it makes much more sense
 to be strict. I think Upcast mode is a sensible default and other modes
 should be allowed only with strict warning the behavior will be determined
 by the underlying sink.

 On Sat, Jul 27, 2019 at 8:05 AM Takeshi Yamamuro 
 wrote:

> Hi, all
>
> +1 for implementing this new store cast mode.
> From a viewpoint of DBMS users, this cast is pretty common for INSERTs
> and I think this functionality could
> promote migrations from existing DBMSs to Spark.
>
> The most important thing for DBMS users is that they could 

Re: [DISCUSS] New sections in Github Pull Request description template

2019-07-29 Thread Hyukjin Kwon
Thanks, guys. Let me probably mimic the template and open a PR soon -
currently I am stuck in some works. I will take a look in few days later.

On Sat, Jul 27, 2019 at 3:32 AM, Bryan Cutler wrote:

> The k8s template is pretty good. Under the behavior change section, it
> would be good to add instructions to also describe previous and new
> behavior as Hyukjin proposed.
>
> On Tue, Jul 23, 2019 at 10:07 PM Reynold Xin  wrote:
>
>> I like the spirit, but not sure about the exact proposal. Take a look at
>> k8s':
>> https://raw.githubusercontent.com/kubernetes/kubernetes/master/.github/PULL_REQUEST_TEMPLATE.md
>>
>>
>>
>> On Tue, Jul 23, 2019 at 8:27 PM, Hyukjin Kwon 
>> wrote:
>>
>>> (Plus, it helps to track history too. Spark's commit logs are growing
>>> and now it's pretty difficult to track the history and see what change
>>> introduced a specific behaviour)
>>>
>>> On Wed, Jul 24, 2019 at 12:20 PM, Hyukjin Kwon wrote:
>>>
>>> Hi all,
>>>
>>> I would like to discuss about some new sections under "## What changes
>>> were proposed in this pull request?":
>>>
>>> ### Do the changes affect _any_ user/dev-facing input or output?
>>>
>>> (Please answer yes or no. If yes, answer the questions below)
>>>
>>> ### What was the previous behavior?
>>>
>>> (Please provide the console output, description and/or reproducer about the 
>>> previous behavior)
>>>
>>> ### What is the behavior the changes propose?
>>>
>>> (Please provide the console output, description and/or reproducer about the 
>>> behavior the changes propose)
>>>
>>> See
>>> https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE
>>>  .
>>>
>>> From my experience so far in Spark community, and assuming from the
>>> interaction with other
>>> committers and contributors, It is pretty critical to know before/after
>>> behaviour changes even if it
>>> was a bug. In addition, I think this is requested by reviewers often.
>>>
>>> The new sections will make review process much easier, and we're able to
>>> quickly judge how serious the changes are.
>>> Given that Spark community still suffer from open PRs just queueing up
>>> without review, I think this can help
>>> both reviewers and PR authors.
>>>
>>> I do describe them often when I think it's useful and possible.
>>> For instance see https://github.com/apache/spark/pull/24927 - I am sure
>>> you guys have clear idea what the
>>> PR fixes.
>>>
>>> I cc'ed some guys I can currently think of for now FYI. Please let me
>>> know if you guys have any thought on this!
>>>
>>>
>>


Re: In Apache Spark JIRA, spark/dev/github_jira_sync.py not running properly

2019-07-25 Thread Hyukjin Kwon
Just FYI, I had to come up with a better JQL to filter out the JIRAs that
already have linked PRs.
In case it helps someone, I use this JQL now to look through the open JIRAs:

project = SPARK AND
status = Open AND
NOT issueFunction in linkedIssuesOfRemote("Github Pull Request *")
ORDER BY created DESC, priority DESC, updated DESC
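
If it helps, the same query can also be run outside the web UI through the `jira` Python client's search API; a small sketch (anonymous read access, and the server accepting the `issueFunction` clause over the API, are assumptions):

from jira import JIRA

jql = (
    'project = SPARK AND status = Open AND '
    'NOT issueFunction in linkedIssuesOfRemote("Github Pull Request *") '
    'ORDER BY created DESC, priority DESC, updated DESC'
)

client = JIRA(server="https://issues.apache.org/jira")
for issue in client.search_issues(jql, maxResults=20):
    print(issue.key, issue.fields.summary)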




On Fri, Jul 19, 2019 at 4:54 PM, Hyukjin Kwon wrote:

> That's a great explanation. Thanks I didn't know that.
>
> Josh, do you know who I should ping on this?
>
> On Fri, 19 Jul 2019, 16:52 Dongjoon Hyun,  wrote:
>
>> Hi, Hyukjin.
>>
>> In short, there are two bots. And, the current situation happens when
>> only one bot with `dev/github_jira_sync.py` works.
>>
>> And, `dev/github_jira_sync.py` is irrelevant to the JIRA status change
>> because it only use `add_remote_link` and `add_comment` API.
>> I know only this bot (in Apache Spark repository repo)
>>
>> AFAIK, `deb/github_jira_sync.py`'s activity is done under JIRA ID
>> `githubbot` (Name: `ASF GitHub Bot`).
>> And, the other bot's activity is done under JIRA ID `apachespark` (Name:
>> `Apache Spark`).
>> The other bot is the one which Josh mentioned before. (in
>> `databricks/spark-pr-dashboard` repo).
>>
>> The root cause will be the same. The API key used by the bot is rejected
>> by Apache JIRA and forwarded to CAPCHAR.
>>
>> Bests,
>> Dongjoon.
>>
>> On Thu, Jul 18, 2019 at 8:24 PM Hyukjin Kwon  wrote:
>>
>>> Hi all,
>>>
>>> Seems this issue is re-happening again. Seems the PR link is properly
>>> created in the corresponding JIRA but it doesn't change the JIRA's status
>>> from OPEN to IN-PROGRESS.
>>>
>>> See, for instance,
>>>
>>> https://issues.apache.org/jira/browse/SPARK-28443
>>> https://issues.apache.org/jira/browse/SPARK-28440
>>> https://issues.apache.org/jira/browse/SPARK-28436
>>> https://issues.apache.org/jira/browse/SPARK-28434
>>> https://issues.apache.org/jira/browse/SPARK-28433
>>> https://issues.apache.org/jira/browse/SPARK-28431
>>>
>>> Josh and Dongjoon, do you guys maybe have any idea?
>>>
>>> On Thu, Apr 25, 2019 at 3:09 PM, Hyukjin Kwon wrote:
>>>
>>>> Thank you so much Josh .. !!
>>>>
>>>> On Thu, Apr 25, 2019 at 3:04 PM, Josh Rosen wrote:
>>>>
>>>>> The code for this runs in http://spark-prs.appspot.com (see
>>>>> https://github.com/databricks/spark-pr-dashboard/blob/1e799c9e510fa8cdc9a6c084a777436bebeabe10/sparkprs/controllers/tasks.py#L137
>>>>> )
>>>>>
>>>>> I checked the AppEngine logs and it looks like we're getting error
>>>>> responses, possibly due to a credentials issue:
>>>>>
>>>>> Exception when starting progress on JIRA issue SPARK-27355 (
>>>>>> /base/data/home/apps/s~spark-prs/live.412416057856832734/sparkprs/controllers/tasks.py:142
>>>>>> <https://console.cloud.google.com/debug/fromlog?appModule=default=live=%2Fbase%2Fdata%2Fhome%2Fapps%2Fs~spark-prs%2Flive.412416057856832734%2Fsparkprs%2Fcontrollers%2Ftasks.py=142=5cc1483600029309a7af76d5=1556170805012269000=3=spark-prs=ac>)
>>>>>> Traceback (most recent call last): File
>>>>>> Traceback (most recent call last):
>>>>>> File 
>>>>>> "/base/data/home/apps/s~spark-prs/live.412416057856832734/sparkprs/controllers/tasks.py",
>>>>>> line 138
>>>>>> <https://console.cloud.google.com/debug/fromlog?appModule=default=live=%2Fbase%2Fdata%2Fhome%2Fapps%2Fs~spark-prs%2Flive.412416057856832734%2Fsparkprs%2Fcontrollers%2Ftasks.py=138=5cc1483600029309a7af76d5=1556170805012269000=3=spark-prs=ac>,
>>>>>> in update_pr start_issue_progress("%s-%s" % (app.config['JIRA_PROJECT'],
>>>>>> issue_number)) File
>>>>>> start_issue_progress("%s-%s" % (app.config['JIRA_PROJECT'],
>>>>>> issue_number))
>>>>>> File 
>>>>>> "/base/data/home/apps/s~spark-prs/live.412416057856832734/sparkprs/jira_api.py",
>>>>>> line 27
>>>>>> <https://console.cloud.google.com/debug/fromlog?appModule=default=live=%2Fbase%2Fdata%2Fhome%2Fapps%2Fs~spark-prs%2Flive.412416057856832734%2Fsparkprs%2Fjira_api.py=27=5cc1483600029309a7af76d5=1556170805012269000=3=spark-prs=ac>,
>>>>>> in start_issue_progress jira_client = get_jira_client() File
>>>>>> jira_client = get_jira_client()
>>>>>> File 
>>>

Re: [DISCUSS] New sections in Github Pull Request description template

2019-07-23 Thread Hyukjin Kwon
(Plus, it helps to track history too. Spark's commit logs are growing and
now it's pretty difficult to track the history and see what change
introduced a specific behaviour)

On Wed, Jul 24, 2019 at 12:20 PM, Hyukjin Kwon wrote:

> Hi all,
>
> I would like to discuss about some new sections under "## What changes
> were proposed in this pull request?":
>
> ### Do the changes affect _any_ user/dev-facing input or output?
>
> (Please answer yes or no. If yes, answer the questions below)
>
> ### What was the previous behavior?
>
> (Please provide the console output, description and/or reproducer about the 
> previous behavior)
>
> ### What is the behavior the changes propose?
>
> (Please provide the console output, description and/or reproducer about the 
> behavior the changes propose)
>
> See
> https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE
>  .
>
> From my experience so far in Spark community, and assuming from the
> interaction with other
> committers and contributors, It is pretty critical to know before/after
> behaviour changes even if it
> was a bug. In addition, I think this is requested by reviewers often.
>
> The new sections will make review process much easier, and we're able to
> quickly judge how serious the changes are.
> Given that Spark community still suffer from open PRs just queueing up
> without review, I think this can help
> both reviewers and PR authors.
>
> I do describe them often when I think it's useful and possible.
> For instance see https://github.com/apache/spark/pull/24927 - I am sure
> you guys have clear idea what the
> PR fixes.
>
> I cc'ed some guys I can currently think of for now FYI. Please let me know
> if you guys have any thought on this!
>
>


[DISCUSS] New sections in Github Pull Request description template

2019-07-23 Thread Hyukjin Kwon
Hi all,

I would like to discuss some new sections under "## What changes were
proposed in this pull request?":

### Do the changes affect _any_ user/dev-facing input or output?

(Please answer yes or no. If yes, answer the questions below)

### What was the previous behavior?

(Please provide the console output, description and/or reproducer
about the previous behavior)

### What is the behavior the changes propose?

(Please provide the console output, description and/or reproducer
about the behavior the changes propose)

See
https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE .

From my experience so far in the Spark community, and judging from the
interaction with other
committers and contributors, it is pretty critical to know before/after
behaviour changes even if the change
was a bug fix. In addition, I think this is requested by reviewers often.

The new sections will make the review process much easier, and we will be able to
quickly judge how serious the changes are.
Given that the Spark community still suffers from open PRs just queueing up
without review, I think this can help
both reviewers and PR authors.

I do describe them often when I think it's useful and possible.
For instance, see https://github.com/apache/spark/pull/24927 - I am sure you
guys have a clear idea of what the
PR fixes.

I cc'ed some guys I can currently think of for now FYI. Please let me know
if you guys have any thought on this!


Re: In Apache Spark JIRA, spark/dev/github_jira_sync.py not running properly

2019-07-19 Thread Hyukjin Kwon
That's a great explanation. Thanks, I didn't know that.

Josh, do you know who I should ping on this?

On Fri, 19 Jul 2019, 16:52 Dongjoon Hyun,  wrote:

> Hi, Hyukjin.
>
> In short, there are two bots, and the current situation happens when only
> the bot that runs `dev/github_jira_sync.py` works.
>
> `dev/github_jira_sync.py` is irrelevant to the JIRA status change
> because it only uses the `add_remote_link` and `add_comment` APIs.
> I only know about this bot (in the Apache Spark repository).
>
> AFAIK, `dev/github_jira_sync.py`'s activity is done under the JIRA ID
> `githubbot` (Name: `ASF GitHub Bot`),
> and the other bot's activity is done under the JIRA ID `apachespark` (Name:
> `Apache Spark`).
> The other bot is the one which Josh mentioned before (in the
> `databricks/spark-pr-dashboard` repo).
>
> The root cause will be the same: the API key used by the bot is rejected
> by Apache JIRA and forwarded to a CAPTCHA.
>
> Bests,
> Dongjoon.
>
> On Thu, Jul 18, 2019 at 8:24 PM Hyukjin Kwon  wrote:
>
>> Hi all,
>>
>> Seems this issue is re-happening again. Seems the PR link is properly
>> created in the corresponding JIRA but it doesn't change the JIRA's status
>> from OPEN to IN-PROGRESS.
>>
>> See, for instance,
>>
>> https://issues.apache.org/jira/browse/SPARK-28443
>> https://issues.apache.org/jira/browse/SPARK-28440
>> https://issues.apache.org/jira/browse/SPARK-28436
>> https://issues.apache.org/jira/browse/SPARK-28434
>> https://issues.apache.org/jira/browse/SPARK-28433
>> https://issues.apache.org/jira/browse/SPARK-28431
>>
>> Josh and Dongjoon, do you guys maybe have any idea?
>>
>> On Thu, Apr 25, 2019 at 3:09 PM, Hyukjin Kwon wrote:
>>
>>> Thank you so much Josh .. !!
>>>
>>> On Thu, Apr 25, 2019 at 3:04 PM, Josh Rosen wrote:
>>>
>>>> The code for this runs in http://spark-prs.appspot.com (see
>>>> https://github.com/databricks/spark-pr-dashboard/blob/1e799c9e510fa8cdc9a6c084a777436bebeabe10/sparkprs/controllers/tasks.py#L137
>>>> )
>>>>
>>>> I checked the AppEngine logs and it looks like we're getting error
>>>> responses, possibly due to a credentials issue:
>>>>
>>>> Exception when starting progress on JIRA issue SPARK-27355 (
>>>>> /base/data/home/apps/s~spark-prs/live.412416057856832734/sparkprs/controllers/tasks.py:142
>>>>> <https://console.cloud.google.com/debug/fromlog?appModule=default=live=%2Fbase%2Fdata%2Fhome%2Fapps%2Fs~spark-prs%2Flive.412416057856832734%2Fsparkprs%2Fcontrollers%2Ftasks.py=142=5cc1483600029309a7af76d5=1556170805012269000=3=spark-prs=ac>)
>>>>> Traceback (most recent call last): File
>>>>> Traceback (most recent call last):
>>>>> File 
>>>>> "/base/data/home/apps/s~spark-prs/live.412416057856832734/sparkprs/controllers/tasks.py",
>>>>> line 138
>>>>> <https://console.cloud.google.com/debug/fromlog?appModule=default=live=%2Fbase%2Fdata%2Fhome%2Fapps%2Fs~spark-prs%2Flive.412416057856832734%2Fsparkprs%2Fcontrollers%2Ftasks.py=138=5cc1483600029309a7af76d5=1556170805012269000=3=spark-prs=ac>,
>>>>> in update_pr start_issue_progress("%s-%s" % (app.config['JIRA_PROJECT'],
>>>>> issue_number)) File
>>>>> start_issue_progress("%s-%s" % (app.config['JIRA_PROJECT'],
>>>>> issue_number))
>>>>> File 
>>>>> "/base/data/home/apps/s~spark-prs/live.412416057856832734/sparkprs/jira_api.py",
>>>>> line 27
>>>>> <https://console.cloud.google.com/debug/fromlog?appModule=default=live=%2Fbase%2Fdata%2Fhome%2Fapps%2Fs~spark-prs%2Flive.412416057856832734%2Fsparkprs%2Fjira_api.py=27=5cc1483600029309a7af76d5=1556170805012269000=3=spark-prs=ac>,
>>>>> in start_issue_progress jira_client = get_jira_client() File
>>>>> jira_client = get_jira_client()
>>>>> File 
>>>>> "/base/data/home/apps/s~spark-prs/live.412416057856832734/sparkprs/jira_api.py",
>>>>> line 18
>>>>> <https://console.cloud.google.com/debug/fromlog?appModule=default=live=%2Fbase%2Fdata%2Fhome%2Fapps%2Fs~spark-prs%2Flive.412416057856832734%2Fsparkprs%2Fjira_api.py=18=5cc1483600029309a7af76d5=1556170805012269000=3=spark-prs=ac>,
>>>>> in get_jira_client app.config['JIRA_PASSWORD'])) File
>>>>> app.config['JIRA_PASSWORD']))
>>>>> File 
>>>>> "/base/data/home/apps/s~spark-prs/live.412416057856832734/lib/jira/client.py",
>>>>> line 472
>>>>> <https://co

Re: In Apache Spark JIRA, spark/dev/github_jira_sync.py not running properly

2019-07-18 Thread Hyukjin Kwon
Hi all,

It seems this issue is happening again. The PR link is properly
created in the corresponding JIRA, but the bot doesn't change the JIRA's status
from OPEN to IN-PROGRESS.

See, for instance,

https://issues.apache.org/jira/browse/SPARK-28443
https://issues.apache.org/jira/browse/SPARK-28440
https://issues.apache.org/jira/browse/SPARK-28436
https://issues.apache.org/jira/browse/SPARK-28434
https://issues.apache.org/jira/browse/SPARK-28433
https://issues.apache.org/jira/browse/SPARK-28431

Josh and Dongjoon, do you guys maybe have any idea?
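
(For reference, the missing piece is just a workflow transition against the JIRA REST API; a minimal sketch with the `jira` Python client follows. The credentials, issue key, and exact transition name are assumptions for illustration, not the dashboard's actual code.)

from jira import JIRA

# Hypothetical credentials and issue key, for illustration only.
client = JIRA(server="https://issues.apache.org/jira",
              basic_auth=("some-bot-user", "some-password"))

issue = client.issue("SPARK-28443")
# Look up the workflow transition that moves OPEN -> IN PROGRESS and apply it.
for transition in client.transitions(issue):
    if transition["name"].lower() == "start progress":
        client.transition_issue(issue, transition["id"])
        break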

On Thu, Apr 25, 2019 at 3:09 PM, Hyukjin Kwon wrote:

> Thank you so much Josh .. !!
>
> On Thu, Apr 25, 2019 at 3:04 PM, Josh Rosen wrote:
>
>> The code for this runs in http://spark-prs.appspot.com (see
>> https://github.com/databricks/spark-pr-dashboard/blob/1e799c9e510fa8cdc9a6c084a777436bebeabe10/sparkprs/controllers/tasks.py#L137
>> )
>>
>> I checked the AppEngine logs and it looks like we're getting error
>> responses, possibly due to a credentials issue:
>>
>> Exception when starting progress on JIRA issue SPARK-27355 (
>>> /base/data/home/apps/s~spark-prs/live.412416057856832734/sparkprs/controllers/tasks.py:142)
>>> Traceback (most recent call last):
>>>   File "/base/data/home/apps/s~spark-prs/live.412416057856832734/sparkprs/controllers/tasks.py", line 138, in update_pr
>>>     start_issue_progress("%s-%s" % (app.config['JIRA_PROJECT'], issue_number))
>>>   File "/base/data/home/apps/s~spark-prs/live.412416057856832734/sparkprs/jira_api.py", line 27, in start_issue_progress
>>>     jira_client = get_jira_client()
>>>   File "/base/data/home/apps/s~spark-prs/live.412416057856832734/sparkprs/jira_api.py", line 18, in get_jira_client
>>>     app.config['JIRA_PASSWORD']))
>>>   File "/base/data/home/apps/s~spark-prs/live.412416057856832734/lib/jira/client.py", line 472, in __init__
>>>     si = self.server_info()
>>>   File "/base/data/home/apps/s~spark-prs/live.412416057856832734/lib/jira/client.py", line 2133, in server_info
>>>     j = self._get_json('serverInfo')
>>>   File "/base/data/home/apps/s~spark-prs/live.412416057856832734/lib/jira/client.py", line 2549, in _get_json
>>>     r = self._session.get(url, params=params)
>>>   File "/base/data/home/apps/s~spark-prs/live.412416057856832734/lib/jira/resilientsession.py", line 151

Re: Contribution help needed for sub-tasks of an umbrella JIRA - port *.sql tests to improve coverage of Python, Pandas, Scala UDF cases

2019-07-09 Thread Hyukjin Kwon
It's alright - thanks for that.
Anyone can take a look. This is an open source project :D.

2019년 7월 9일 (화) 오후 8:18, Stavros Kontopoulos <
stavros.kontopou...@lightbend.com>님이 작성:

> I can try one and see how it goes, although not familiar with the area.
>
> Stavros
>
> On Tue, Jul 9, 2019 at 6:17 AM Hyukjin Kwon  wrote:
>
>> Hi all,
>>
>> I am currently targeting to improve Python, Pandas and Scala UDF test
>> cases by integrating our existing *.sql files at
>> https://issues.apache.org/jira/browse/SPARK-27921
>>
>> I would appreciate it if anyone who's interested in contributing to Spark
>> takes some of the sub-tasks. There are too many for me to do alone :-). I
>> am going through them one by one for now.
>>
>> I wrote some guides specifically for this umbrella JIRA, so if you follow
>> them closely step by step, I think the process itself isn't that
>> difficult.
>>
>> The most important guideline, which should be addressed carefully, is:
>> > 7. If there is a diff, analyze it, file or find the JIRA, and skip the
>> tests with comments.
>>
>> Thanks!
>>
>
>
>


Contribution help needed for sub-tasks of an umbrella JIRA - port *.sql tests to improve coverage of Python, Pandas, Scala UDF cases

2019-07-08 Thread Hyukjin Kwon
Hi all,

I am currently targeting to improve Python, Pandas and Scala UDF test
cases by integrating our existing *.sql files at
https://issues.apache.org/jira/browse/SPARK-27921

I would appreciate it if anyone who's interested in contributing to Spark
takes some of the sub-tasks. There are too many for me to do alone :-). I am
going through them one by one for now.

I wrote some guides specifically for this umbrella JIRA, so if you follow
them closely step by step, I think the process itself isn't that difficult.

The most important guideline, which should be addressed carefully, is:
> 7. If there is a diff, analyze it, file or find the JIRA, and skip the
tests with comments.

Thanks!


Re: Disabling `Merge Commits` from GitHub Merge Button

2019-07-01 Thread Hyukjin Kwon
+1

2019년 7월 2일 (화) 오전 9:39, Takeshi Yamamuro 님이 작성:

> I'm also using the script in both cases, anyway +1.
>
> On Tue, Jul 2, 2019 at 5:58 AM Sean Owen  wrote:
>
>> I'm using the merge script in both repos. I think that was the best
>> practice?
>> So, sure, I'm fine with disabling it.
>>
>> On Mon, Jul 1, 2019 at 3:53 PM Dongjoon Hyun 
>> wrote:
>> >
>> > Hi, Apache Spark PMC members and committers.
>> >
>> > We are using GitHub `Merge Button` in `spark-website` repository
>> > because it's very convenient.
>> >
>> > 1. https://github.com/apache/spark-website/commits/asf-site
>> > 2. https://github.com/apache/spark/commits/master
>> >
>> > In order to be consistent with our previous behavior,
>> > can we disable `Allow Merge Commits` from GitHub `Merge Button` setting
>> explicitly?
>> >
>> > I hope we can enforce it in both `spark-website` and `spark` repository
>> consistently.
>> >
>> > Bests,
>> > Dongjoon.
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
> --
> ---
> Takeshi Yamamuro
>


Re: Exposing JIRA issue types at GitHub PRs

2019-06-16 Thread Hyukjin Kwon
Labels look good and useful.

On Sat, 15 Jun 2019, 02:36 Dongjoon Hyun,  wrote:

> Now, you can see the exposed component labels (ordered by the number of
> PRs) here and click the component to search.
>
> https://github.com/apache/spark/labels?sort=count-desc
>
> Dongjoon.
>
>
> On Fri, Jun 14, 2019 at 1:15 AM Dongjoon Hyun 
> wrote:
>
>> Hi, All.
>>
>> JIRA and PR is ready for reviews.
>>
>> https://issues.apache.org/jira/browse/SPARK-28051 (Exposing JIRA issue
>> component types at GitHub PRs)
>> https://github.com/apache/spark/pull/24871
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Thu, Jun 13, 2019 at 10:48 AM Dongjoon Hyun 
>> wrote:
>>
>>> Thank you for the feedbacks and requirements, Hyukjin, Reynold, Marco.
>>>
>>> Sure, we can do whatever we want.
>>>
>>> I'll wait for more feedbacks and proceed to the next steps.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Wed, Jun 12, 2019 at 11:51 PM Marco Gaido 
>>> wrote:
>>>
 Hi Dongjoon,
 Thanks for the proposal! I like the idea. Maybe we can extend it to
 components too, and to some JIRA labels such as correctness, which may be
 worth highlighting in PRs too. My only concern is that in many cases JIRAs
 are not created very carefully, so they may be incorrect at the moment of
 PR creation and be updated later: keeping them in sync may be an extra
 effort.

 On Thu, 13 Jun 2019, 08:09 Reynold Xin,  wrote:

> Seems like a good idea. Can we test this with a component first?
>
> On Thu, Jun 13, 2019 at 6:17 AM Dongjoon Hyun 
> wrote:
>
>> Hi, All.
>>
>> Since we use both Apache JIRA and GitHub actively for Apache Spark
>> contributions, we have lots of JIRAs and PRs consequently. One specific
>> thing I've been longing to see is `Jira Issue Type` in GitHub.
>>
>> How about exposing JIRA issue types at GitHub PRs as GitHub `Labels`?
>> There are two main benefits:
>> 1. It helps the communication between the contributors and reviewers
>> with more information.
>> (In some cases, some people only visit GitHub to see the PR and
>> commits)
>> 2. `Labels` is searchable. We don't need to visit Apache Jira to
>> search PRs to see a specific type.
>> (For example, the reviewers can see and review 'BUG' PRs first by
>> using `is:open is:pr label:BUG`.)
>>
>> Of course, this can be done automatically without human intervention.
>> Since we already have GitHub Jenkins job to access JIRA/GitHub, that job
>> can add the labels from the beginning. If needed, I can volunteer to 
>> update
>> the script.
>>
>> To show the demo, I labeled several PRs manually. You can see the
>> result right now in Apache Spark PR page.
>>
>>   - https://github.com/apache/spark/pulls
>>
>> If you're surprised due to those manual activities, I want to
>> apologize for that. I hope we can take advantage of the existing GitHub
>> features to serve Apache Spark community in a way better than yesterday.
>>
>> How do you think about this specific suggestion?
>>
>> Bests,
>> Dongjoon
>>
>> PS. I saw that `Request Review` and `Assign` features are already
>> used for some purposes, but these feature are out of the scope in this
>> email.
>>
>


Re: [DISCUSS] Increasing minimum supported version of Pandas

2019-06-15 Thread Hyukjin Kwon
Oh btw, why is it 0.23.2, not 0.23.0 or 0.23.4?

On Sat, 15 Jun 2019, 06:56 Bryan Cutler,  wrote:

> Yeah, PyArrow is the only other PySpark dependency we check for a minimum
> version. We updated that not too long ago to be 0.12.1, which I think we
> are still good on for now.
>
> On Fri, Jun 14, 2019 at 11:36 AM Felix Cheung 
> wrote:
>
>> How about pyArrow?
>>
>> --
>> *From:* Holden Karau 
>> *Sent:* Friday, June 14, 2019 11:06:15 AM
>> *To:* Felix Cheung
>> *Cc:* Bryan Cutler; Dongjoon Hyun; Hyukjin Kwon; dev; shane knapp
>> *Subject:* Re: [DISCUSS] Increasing minimum supported version of Pandas
>>
>> Are there other Python dependencies we should consider upgrading at the
>> same time?
>>
>> On Fri, Jun 14, 2019 at 7:45 PM Felix Cheung 
>> wrote:
>>
>>> So to be clear, min version check is 0.23
>>> Jenkins test is 0.24
>>>
>>> I’m ok with this. I hope someone will test 0.23 on releases though
>>> before we sign off?
>>>
>> We should maybe add this to the release instruction notes?
>>
>>>
>>> --
>>> *From:* shane knapp 
>>> *Sent:* Friday, June 14, 2019 10:23:56 AM
>>> *To:* Bryan Cutler
>>> *Cc:* Dongjoon Hyun; Holden Karau; Hyukjin Kwon; dev
>>> *Subject:* Re: [DISCUSS] Increasing minimum supported version of Pandas
>>>
>>> excellent.  i shall not touch anything.  :)
>>>
>>> On Fri, Jun 14, 2019 at 10:22 AM Bryan Cutler  wrote:
>>>
>>>> Shane, I think 0.24.2 is probably more common right now, so if we were
>>>> to pick one to test against, I still think it should be that one. Our
>>>> Pandas usage in PySpark is pretty conservative, so it's pretty unlikely
>>>> that we will add something that would break 0.23.X.
>>>>
>>>> On Fri, Jun 14, 2019 at 10:10 AM shane knapp 
>>>> wrote:
>>>>
>>>>> ah, ok...  should we downgrade the testing env on jenkins then?  any
>>>>> specific version?
>>>>>
>>>>> shane, who is loathe (and i mean LOATHE) to touch python envs ;)
>>>>>
>>>>> On Fri, Jun 14, 2019 at 10:08 AM Bryan Cutler 
>>>>> wrote:
>>>>>
>>>>>> I should have stated this earlier, but when the user does something
>>>>>> that requires Pandas, the minimum version is checked against what was
>>>>>> imported and will raise an exception if it is a lower version. So I'm
>>>>>> concerned that using 0.24.2 might be a little too new for users running
>>>>>> older clusters. To give some release dates, 0.23.2 was released about a
>>>>>> year ago, 0.24.0 in January and 0.24.2 in March.
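
For reference, the minimum-version check mentioned above is essentially a
small guard evaluated when Pandas functionality is first used. A minimal
sketch of the idea (the function name and error wording are illustrative,
not necessarily the exact PySpark helper):

# Sketch only: fail fast if the imported pandas is older than the minimum
# that PySpark claims to support. Names and messages are illustrative.
from distutils.version import LooseVersion

MINIMUM_PANDAS_VERSION = "0.23.2"

def require_minimum_pandas_version():
    try:
        import pandas
    except ImportError as e:
        raise ImportError(
            "pandas >= %s must be installed: %s" % (MINIMUM_PANDAS_VERSION, e))
    if LooseVersion(pandas.__version__) < LooseVersion(MINIMUM_PANDAS_VERSION):
        raise ImportError(
            "pandas >= %s must be installed; however, your version was %s."
            % (MINIMUM_PANDAS_VERSION, pandas.__version__))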
>>>>>>
>> I think given that we're switching to requiring Python 3, and are also a
>> bit of a way from cutting a release, 0.24 could be OK as a min version
>> requirement.
>>
>>>
>>>>>>
>>>>>> On Fri, Jun 14, 2019 at 9:27 AM shane knapp 
>>>>>> wrote:
>>>>>>
>>>>>>> just to everyone knows, our python 3.6 testing infra is currently on
>>>>>>> 0.24.2...
>>>>>>>
>>>>>>> On Fri, Jun 14, 2019 at 9:16 AM Dongjoon Hyun <
>>>>>>> dongjoon.h...@gmail.com> wrote:
>>>>>>>
>>>>>>>> +1
>>>>>>>>
>>>>>>>> Thank you for this effort, Bryan!
>>>>>>>>
>>>>>>>> Bests,
>>>>>>>> Dongjoon.
>>>>>>>>
>>>>>>>> On Fri, Jun 14, 2019 at 4:24 AM Holden Karau 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I’m +1 for upgrading, although since this is probably the last
>>>>>>>>> easy chance we’ll have to bump version numbers easily I’d suggest 
>>>>>>>>> 0.24.2
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Jun 14, 2019 at 4:38 AM Hyukjin Kwon 
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I am +1 to go for 0.23.2 - it brings some overhead to test
>>>>>>>>>> PyArrow and pandas combinations. Spark 3 should be good time to 
>>>>>>>>>> increase.

Re: [DISCUSS] Increasing minimum supported version of Pandas

2019-06-13 Thread Hyukjin Kwon
I am +1 to go for 0.23.2 - supporting very old versions brings some overhead
to testing PyArrow and pandas combinations. Spark 3 should be a good time to
increase the minimum.

2019년 6월 14일 (금) 오전 9:46, Bryan Cutler 님이 작성:

> Hi All,
>
> We would like to discuss increasing the minimum supported version of
> Pandas in Spark, which is currently 0.19.2.
>
> Pandas 0.19.2 was released nearly 3 years ago and there are some
> workarounds in PySpark that could be removed if such an old version is not
> required. This will help to keep code clean and reduce maintenance effort.
>
> The change is targeted for Spark 3.0.0 release, see
> https://issues.apache.org/jira/browse/SPARK-28041. The current thought is
> to bump the version to 0.23.2, but we would like to discuss before making a
> change. Does anyone else have thoughts on this?
>
> Regards,
> Bryan
>


Re: Exposing JIRA issue types at GitHub PRs

2019-06-13 Thread Hyukjin Kwon
Yea, I think we can automate this process via, for instance,
https://github.com/apache/spark/blob/master/dev/github_jira_sync.py

+1 for this sort of automatic categorization and matching of metadata
between JIRA and GitHub.

Adding Josh and Sean as well.
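
As an illustration of what that automation could look like, here is a
minimal sketch that copies a JIRA issue type onto the corresponding GitHub
PR as a label. The environment variable names and the label mapping are
assumptions for the sketch, not the actual dev/github_jira_sync.py logic:

# Sketch only: mirror the JIRA issue type of the SPARK-XXXXX id in a PR
# title as a GitHub label. Env var names and label naming are assumptions.
import os
import re
import requests
from jira import JIRA

JIRA_SERVER = "https://issues.apache.org/jira"
GITHUB_API = "https://api.github.com/repos/apache/spark"

def sync_issue_type_label(pr_number, pr_title):
    match = re.search(r"SPARK-\d+", pr_title)
    if not match:
        return  # no JIRA id in the title; nothing to sync
    jira_client = JIRA(JIRA_SERVER, basic_auth=(os.environ["JIRA_USER"],
                                                os.environ["JIRA_PASSWORD"]))
    issue = jira_client.issue(match.group(0))
    label = issue.fields.issuetype.name.upper()  # e.g. "BUG", "IMPROVEMENT"
    resp = requests.post(
        "%s/issues/%d/labels" % (GITHUB_API, pr_number),
        json={"labels": [label]},
        headers={"Authorization": "token " + os.environ["GITHUB_TOKEN"]})
    resp.raise_for_status()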

On Thu, 13 Jun 2019, 13:17 Dongjoon Hyun,  wrote:

> Hi, All.
>
> Since we use both Apache JIRA and GitHub actively for Apache Spark
> contributions, we have lots of JIRAs and PRs consequently. One specific
> thing I've been longing to see is `Jira Issue Type` in GitHub.
>
> How about exposing JIRA issue types at GitHub PRs as GitHub `Labels`?
> There are two main benefits:
> 1. It helps the communication between the contributors and reviewers with
> more information.
> (In some cases, some people only visit GitHub to see the PR and
> commits)
> 2. `Labels` is searchable. We don't need to visit Apache Jira to search
> PRs to see a specific type.
> (For example, the reviewers can see and review 'BUG' PRs first by
> using `is:open is:pr label:BUG`.)
>
> Of course, this can be done automatically without human intervention.
> Since we already have GitHub Jenkins job to access JIRA/GitHub, that job
> can add the labels from the beginning. If needed, I can volunteer to update
> the script.
>
> To show the demo, I labeled several PRs manually. You can see the result
> right now in Apache Spark PR page.
>
>   - https://github.com/apache/spark/pulls
>
> If you're surprised due to those manual activities, I want to apologize
> for that. I hope we can take advantage of the existing GitHub features to
> serve Apache Spark community in a way better than yesterday.
>
> How do you think about this specific suggestion?
>
> Bests,
> Dongjoon
>
> PS. I saw that `Request Review` and `Assign` features are already used for
> some purposes, but these features are out of the scope of this email.
>


Re: Resolving all JIRAs affecting EOL releases

2019-05-20 Thread Hyukjin Kwon
I took action on those JIRAs.

The JIRAs that had not been updated for the last year and whose affected
versions are EOL releases are now:
  - Resolved with 'Incomplete' status
  - Labeled 'bulk-closed'.

Thanks guys.

2019년 5월 21일 (화) 오전 8:35, shane knapp 님이 작성:

> alright, i found 3 jiras that i was able to close:
>
>1. SPARK-19612 <https://issues.apache.org/jira/browse/SPARK-19612>
>2. SPARK-22996 <https://issues.apache.org/jira/browse/SPARK-22996>
>3. SPARK-22766 <https://issues.apache.org/jira/browse/SPARK-22766>
>
>
> On Sun, May 19, 2019 at 6:43 PM Hyukjin Kwon  wrote:
>
>> Thanks Shane .. the URL I linked somehow didn't work in other people
>> browser. Hope this link works:
>>
>>
>> https://issues.apache.org/jira/browse/SPARK-23492?jql=project%20%3D%20SPARK%20%0A%20%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%0A%20%20AND%20(%0A%20%20%20%20affectedVersion%20%3D%20EMPTY%20OR%0A%20%20%20%20NOT%20(affectedVersion%20in%20versionMatch(%22%5E3.*%22)%0A%20%20%20%20%20%20OR%20affectedVersion%20in%20versionMatch(%22%5E2.4.*%22)%0A%20%20%20%20%20%20OR%20affectedVersion%20in%20versionMatch(%22%5E2.3.*%22)%0A%20%20%20%20)%0A%20%20)%0A%20%20AND%20updated%20%3C%3D%20-52w
>>
>> I will take an action around this time tomorrow considering there were
>> some more changes to make at the last minute.
>>
>>
>> 2019년 5월 19일 (일) 오후 6:39, Hyukjin Kwon 님이 작성:
>>
>>> I will add one more condition for "updated". So, it will additionally
>>> avoid things updated within one year but left open against EOL releases.
>>>
>>> project = SPARK
>>>   AND status in (Open, "In Progress", Reopened)
>>>   AND (
>>> affectedVersion = EMPTY OR
>>> NOT (affectedVersion in versionMatch("^3.*")
>>>   OR affectedVersion in versionMatch("^2.4.*")
>>>   OR affectedVersion in versionMatch("^2.3.*")
>>> )
>>>   )
>>>   AND updated <= -52w
>>>
>>>
>>> https://issues.apache.org/jira/issues/?filter=12344168&jql=project%20%3D%20SPARK%20%0A%20%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%0A%20%20AND%20(%0A%20%20%20%20affectedVersion%20%3D%20EMPTY%20OR%0A%20%20%20%20NOT%20(affectedVersion%20in%20versionMatch(%22%5E3.*%22)%0A%20%20%20%20%20%20OR%20affectedVersion%20in%20versionMatch(%22%5E2.4.*%22)%0A%20%20%20%20%20%20OR%20affectedVersion%20in%20versionMatch(%22%5E2.3.*%22)%0A%20%20%20%20)%0A%20%20)%0A%20%20AND%20updated%20%3C%3D%20-52w
>>>
>>> This still reduces JIRAs under 1000 which I originally targeted.
>>>
>>>
>>>
>>> 2019년 5월 19일 (일) 오후 6:08, Sean Owen 님이 작성:
>>>
>>>> I'd only tweak this to perhaps not close JIRAs that have been updated
>>>> recently -- even just avoiding things updated in the last month. For
>>>> example this would close
>>>> https://issues.apache.org/jira/browse/SPARK-27758 which was opened
>>>> Friday (though, for other reasons it should probably be closed). Still I
>>>> don't mind it under the logic that it has been reported against 2.1.0.
>>>>
>>>> On the other hand, I'd go further and close _anything_ not updated in a
>>>> long time, like a year (or 2 if feeling conservative). That is there's
>>>> probably a lot of old cruft out there that wasn't marked with an Affected
>>>> Version, before that was required.
>>>>
>>>> On Sat, May 18, 2019 at 10:48 PM Hyukjin Kwon 
>>>> wrote:
>>>>
>>>>> Thanks guys.
>>>>>
>>>>> This thread got more than 3 PMC votes without any objection. I
>>>>> slightly edited JQL from Abdeali's suggestion (thanks, Abdeali).
>>>>>
>>>>>
>>>>> JQL:
>>>>>
>>>>> project = SPARK
>>>>>   AND status in (Open, "In Progress", Reopened)
>>>>>   AND (
>>>>> affectedVersion = EMPTY OR
>>>>> NOT (affectedVersion in versionMatch("^3.*")
>>>>>   OR affectedVersion in versionMatch("^2.4.*")
>>>>>   OR affectedVersion in versionMatch("^2.3.*")
>>>>> )
>>>>>   )
>>>>>
>>>>>
>>>>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20%0A%20%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%0A%20%20AND%20(%0A%20%20%2

Re: Resolving all JIRAs affecting EOL releases

2019-05-19 Thread Hyukjin Kwon
Thanks Shane .. the URL I linked somehow didn't work in other people's
browsers. Hope this link works:

https://issues.apache.org/jira/browse/SPARK-23492?jql=project%20%3D%20SPARK%20%0A%20%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%0A%20%20AND%20(%0A%20%20%20%20affectedVersion%20%3D%20EMPTY%20OR%0A%20%20%20%20NOT%20(affectedVersion%20in%20versionMatch(%22%5E3.*%22)%0A%20%20%20%20%20%20OR%20affectedVersion%20in%20versionMatch(%22%5E2.4.*%22)%0A%20%20%20%20%20%20OR%20affectedVersion%20in%20versionMatch(%22%5E2.3.*%22)%0A%20%20%20%20)%0A%20%20)%0A%20%20AND%20updated%20%3C%3D%20-52w

I will take action around this time tomorrow, considering there were some
more changes to make at the last minute.


2019년 5월 19일 (일) 오후 6:39, Hyukjin Kwon 님이 작성:

> I will add one more condition for "updated". So, it will additionally
> avoid things updated within one year but left open against EOL releases.
>
> project = SPARK
>   AND status in (Open, "In Progress", Reopened)
>   AND (
> affectedVersion = EMPTY OR
> NOT (affectedVersion in versionMatch("^3.*")
>   OR affectedVersion in versionMatch("^2.4.*")
>   OR affectedVersion in versionMatch("^2.3.*")
> )
>   )
>   AND updated <= -52w
>
>
> https://issues.apache.org/jira/issues/?filter=12344168&jql=project%20%3D%20SPARK%20%0A%20%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%0A%20%20AND%20(%0A%20%20%20%20affectedVersion%20%3D%20EMPTY%20OR%0A%20%20%20%20NOT%20(affectedVersion%20in%20versionMatch(%22%5E3.*%22)%0A%20%20%20%20%20%20OR%20affectedVersion%20in%20versionMatch(%22%5E2.4.*%22)%0A%20%20%20%20%20%20OR%20affectedVersion%20in%20versionMatch(%22%5E2.3.*%22)%0A%20%20%20%20)%0A%20%20)%0A%20%20AND%20updated%20%3C%3D%20-52w
>
> This still reduces JIRAs under 1000 which I originally targeted.
>
>
>
> 2019년 5월 19일 (일) 오후 6:08, Sean Owen 님이 작성:
>
>> I'd only tweak this to perhaps not close JIRAs that have been updated
>> recently -- even just avoiding things updated in the last month. For
>> example this would close
>> https://issues.apache.org/jira/browse/SPARK-27758 which was opened
>> Friday (though, for other reasons it should probably be closed). Still I
>> don't mind it under the logic that it has been reported against 2.1.0.
>>
>> On the other hand, I'd go further and close _anything_ not updated in a
>> long time, like a year (or 2 if feeling conservative). That is there's
>> probably a lot of old cruft out there that wasn't marked with an Affected
>> Version, before that was required.
>>
>> On Sat, May 18, 2019 at 10:48 PM Hyukjin Kwon 
>> wrote:
>>
>>> Thanks guys.
>>>
>>> This thread got more than 3 PMC votes without any objection. I slightly
>>> edited JQL from Abdeali's suggestion (thanks, Abdeali).
>>>
>>>
>>> JQL:
>>>
>>> project = SPARK
>>>   AND status in (Open, "In Progress", Reopened)
>>>   AND (
>>> affectedVersion = EMPTY OR
>>> NOT (affectedVersion in versionMatch("^3.*")
>>>   OR affectedVersion in versionMatch("^2.4.*")
>>>   OR affectedVersion in versionMatch("^2.3.*")
>>> )
>>>   )
>>>
>>>
>>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20%0A%20%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%0A%20%20AND%20(%0A%20%20%20%20affectedVersion%20%3D%20EMPTY%20OR%0A%20%20%20%20NOT%20(affectedVersion%20in%20versionMatch(%22%5E3.*%22)%0A%20%20%20%20%20%20OR%20affectedVersion%20in%20versionMatch(%22%5E2.4.*%22)%0A%20%20%20%20%20%20OR%20affectedVersion%20in%20versionMatch(%22%5E2.3.*%22)%0A%20%20%20%20)%0A%20%20)
>>>
>>>
>>> It means we will resolve all JIRAs that have EOL releases as affected
>>> versions, including no version specified in affected versions - this will
>>> reduce open JIRAs under 900.
>>>
>>> Looks I can use a bulk action feature in JIRA. Tomorrow at the similar
>>> time, I will
>>> - Label those JIRAs as 'bulk-closed'
>>> - Resolve them via `Incomplete` status.
>>>
>>> Please double check the list and let me know if you guys have any
>>> concern.
>>>
>>>
>>>
>>>
>>>
>>> 2019년 5월 18일 (토) 오후 12:22, Dongjoon Hyun 님이 작성:
>>>
>>>> +1, too.
>>>>
>>>> Thank you, Hyukjin!
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>>
>>>> On Fri, May 17, 2019 at 9:07 AM Imran Rashid
>>>>  wrote:
>>>>

Re: Resolving all JIRAs affecting EOL releases

2019-05-19 Thread Hyukjin Kwon
I will add one more condition on "updated". It will additionally skip issues
that were updated within the last year, even if they are open against EOL
releases.

project = SPARK
  AND status in (Open, "In Progress", Reopened)
  AND (
affectedVersion = EMPTY OR
NOT (affectedVersion in versionMatch("^3.*")
  OR affectedVersion in versionMatch("^2.4.*")
  OR affectedVersion in versionMatch("^2.3.*")
)
  )
  AND updated <= -52w

https://issues.apache.org/jira/issues/?filter=12344168&jql=project%20%3D%20SPARK%20%0A%20%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%0A%20%20AND%20(%0A%20%20%20%20affectedVersion%20%3D%20EMPTY%20OR%0A%20%20%20%20NOT%20(affectedVersion%20in%20versionMatch(%22%5E3.*%22)%0A%20%20%20%20%20%20OR%20affectedVersion%20in%20versionMatch(%22%5E2.4.*%22)%0A%20%20%20%20%20%20OR%20affectedVersion%20in%20versionMatch(%22%5E2.3.*%22)%0A%20%20%20%20)%0A%20%20)%0A%20%20AND%20updated%20%3C%3D%20-52w

This still reduces the number of open JIRAs to under 1000, which is what I
originally targeted.



2019년 5월 19일 (일) 오후 6:08, Sean Owen 님이 작성:

> I'd only tweak this to perhaps not close JIRAs that have been updated
> recently -- even just avoiding things updated in the last month. For
> example this would close https://issues.apache.org/jira/browse/SPARK-27758 
> which
> was opened Friday (though, for other reasons it should probably be closed).
> Still I don't mind it under the logic that it has been reported against
> 2.1.0.
>
> On the other hand, I'd go further and close _anything_ not updated in a
> long time, like a year (or 2 if feeling conservative). That is there's
> probably a lot of old cruft out there that wasn't marked with an Affected
> Version, before that was required.
>
> On Sat, May 18, 2019 at 10:48 PM Hyukjin Kwon  wrote:
>
>> Thanks guys.
>>
>> This thread got more than 3 PMC votes without any objection. I slightly
>> edited JQL from Abdeali's suggestion (thanks, Abdeali).
>>
>>
>> JQL:
>>
>> project = SPARK
>>   AND status in (Open, "In Progress", Reopened)
>>   AND (
>> affectedVersion = EMPTY OR
>> NOT (affectedVersion in versionMatch("^3.*")
>>   OR affectedVersion in versionMatch("^2.4.*")
>>   OR affectedVersion in versionMatch("^2.3.*")
>> )
>>   )
>>
>>
>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20%0A%20%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%0A%20%20AND%20(%0A%20%20%20%20affectedVersion%20%3D%20EMPTY%20OR%0A%20%20%20%20NOT%20(affectedVersion%20in%20versionMatch(%22%5E3.*%22)%0A%20%20%20%20%20%20OR%20affectedVersion%20in%20versionMatch(%22%5E2.4.*%22)%0A%20%20%20%20%20%20OR%20affectedVersion%20in%20versionMatch(%22%5E2.3.*%22)%0A%20%20%20%20)%0A%20%20)
>>
>>
>> It means we will resolve all JIRAs that have EOL releases as affected
>> versions, including no version specified in affected versions - this will
>> reduce open JIRAs under 900.
>>
>> Looks I can use a bulk action feature in JIRA. Tomorrow at the similar
>> time, I will
>> - Label those JIRAs as 'bulk-closed'
>> - Resolve them via `Incomplete` status.
>>
>> Please double check the list and let me know if you guys have any concern.
>>
>>
>>
>>
>>
>> 2019년 5월 18일 (토) 오후 12:22, Dongjoon Hyun 님이 작성:
>>
>>> +1, too.
>>>
>>> Thank you, Hyukjin!
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Fri, May 17, 2019 at 9:07 AM Imran Rashid
>>>  wrote:
>>>
>>>> +1, thanks for taking this on
>>>>
>>>> On Wed, May 15, 2019 at 7:26 PM Hyukjin Kwon 
>>>> wrote:
>>>>
>>>>> oh, wait. 'Incomplete' can still make sense in this way then.
>>>>> Yes, I am good with 'Incomplete' too.
>>>>>
>>>>> 2019년 5월 16일 (목) 오전 11:24, Hyukjin Kwon 님이 작성:
>>>>>
>>>>>> I actually recently used 'Incomplete'  a bit when the JIRA is
>>>>>> basically too poorly formed (like just copying and pasting an error) ...
>>>>>>
>>>>>> I was thinking about 'Unresolved' status or `Auto Closed' too. I
>>>>>> double checked they can be reopen as well after resolution.
>>>>>>
>>>>>> [image: Screen Shot 2019-05-16 at 10.35.14 AM.png]
>>>>>> [image: Screen Shot 2019-05-16 at 10.35.39 AM.png]
>>>>>>
>>>>>> 2019년 5월 16일 (목) 오전 11:04, Sean Owen 님이 작성:
>>>>>>
>>>>>>> Agree, anything without an Affected Version 

Re: Resolving all JIRAs affecting EOL releases

2019-05-18 Thread Hyukjin Kwon
Thanks guys.

This thread got more than 3 PMC votes without any objection. I slightly
edited JQL from Abdeali's suggestion (thanks, Abdeali).


JQL:

project = SPARK
  AND status in (Open, "In Progress", Reopened)
  AND (
affectedVersion = EMPTY OR
NOT (affectedVersion in versionMatch("^3.*")
  OR affectedVersion in versionMatch("^2.4.*")
  OR affectedVersion in versionMatch("^2.3.*")
)
  )

https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20%0A%20%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%0A%20%20AND%20(%0A%20%20%20%20affectedVersion%20%3D%20EMPTY%20OR%0A%20%20%20%20NOT%20(affectedVersion%20in%20versionMatch(%22%5E3.*%22)%0A%20%20%20%20%20%20OR%20affectedVersion%20in%20versionMatch(%22%5E2.4.*%22)%0A%20%20%20%20%20%20OR%20affectedVersion%20in%20versionMatch(%22%5E2.3.*%22)%0A%20%20%20%20)%0A%20%20)


It means we will resolve all JIRAs whose affected versions are EOL
releases, including those with no affected version specified - this will
reduce the number of open JIRAs to under 900.

It looks like I can use JIRA's bulk action feature. Tomorrow at around the
same time, I will:
- Label those JIRAs as 'bulk-closed'
- Resolve them with `Incomplete` status.

Please double check the list and let me know if you guys have any concerns.
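
For anyone curious how such a bulk action could be scripted instead of done
through JIRA's built-in bulk-change UI, a rough sketch with the Python jira
client is below. The credentials, the transition name and the batch size are
assumptions; the JQL is the one above:

# Sketch only: label matching issues 'bulk-closed' and resolve them as
# 'Incomplete'. Credentials and the "Resolve Issue" transition name are
# assumptions about the JIRA setup.
import os
from jira import JIRA

JQL = (
    'project = SPARK AND status in (Open, "In Progress", Reopened) '
    'AND (affectedVersion = EMPTY OR NOT (affectedVersion in versionMatch("^3.*") '
    'OR affectedVersion in versionMatch("^2.4.*") '
    'OR affectedVersion in versionMatch("^2.3.*"))) '
    'AND updated <= -52w')

jira_client = JIRA("https://issues.apache.org/jira",
                   basic_auth=(os.environ["JIRA_USER"], os.environ["JIRA_PASSWORD"]))

while True:
    # Resolved issues drop out of the result set, so keep re-querying.
    batch = jira_client.search_issues(JQL, maxResults=100)
    if not batch:
        break
    for issue in batch:
        labels = issue.fields.labels
        if "bulk-closed" not in labels:
            issue.update(fields={"labels": labels + ["bulk-closed"]})
        jira_client.transition_issue(issue, "Resolve Issue",
                                     resolution={"name": "Incomplete"})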





2019년 5월 18일 (토) 오후 12:22, Dongjoon Hyun 님이 작성:

> +1, too.
>
> Thank you, Hyukjin!
>
> Bests,
> Dongjoon.
>
>
> On Fri, May 17, 2019 at 9:07 AM Imran Rashid 
> wrote:
>
>> +1, thanks for taking this on
>>
>> On Wed, May 15, 2019 at 7:26 PM Hyukjin Kwon  wrote:
>>
>>> oh, wait. 'Incomplete' can still make sense in this way then.
>>> Yes, I am good with 'Incomplete' too.
>>>
>>> 2019년 5월 16일 (목) 오전 11:24, Hyukjin Kwon 님이 작성:
>>>
>>>> I actually recently used 'Incomplete'  a bit when the JIRA is basically
>>>> too poorly formed (like just copying and pasting an error) ...
>>>>
>>>> I was thinking about 'Unresolved' status or `Auto Closed' too. I double
>>>> checked they can be reopen as well after resolution.
>>>>
>>>> [image: Screen Shot 2019-05-16 at 10.35.14 AM.png]
>>>> [image: Screen Shot 2019-05-16 at 10.35.39 AM.png]
>>>>
>>>> 2019년 5월 16일 (목) 오전 11:04, Sean Owen 님이 작성:
>>>>
>>>>> Agree, anything without an Affected Version should be old enough to
>>>>> time out.
>>>>> I might use "Incomplete" or something as the status, as we haven't
>>>>> otherwise used that. Maybe that's simpler than a label. But, anything like
>>>>> that sounds good.
>>>>>
>>>>> On Wed, May 15, 2019 at 8:40 PM Hyukjin Kwon 
>>>>> wrote:
>>>>>
>>>>>> BTW, affected version became a required field (I don't remember when
>>>>>> exactly was .. I believe it's around when we work on Spark 2.3):
>>>>>>
>>>>>> [image: Screen Shot 2019-05-16 at 10.29.50 AM.png]
>>>>>>
>>>>>> So, including all EOL versions and affected versions not specified
>>>>>> will roughly work.
>>>>>> Using "Cannot Reproduce" as its status and 'bulk-closed' label makes
>>>>>> the best sense to me.
>>>>>>
>>>>>> Okie. I want to open this roughly for a week before taking an actual
>>>>>> action for this. If there's no more feedback, I will do as I said ^ next
>>>>>> week.
>>>>>>
>>>>>>
>>>>>> 2019년 5월 15일 (수) 오후 11:33, Josh Rosen 님이 작성:
>>>>>>
>>>>>>> +1 in favor of some sort of JIRA cleanup.
>>>>>>>
>>>>>>> My only request is that we attach some sort of 'bulk-closed' label
>>>>>>> to issues that we close via JIRA filter batch operations (and resolve 
>>>>>>> the
>>>>>>> issues as "Timed Out" / "Cannot Reproduce", not "Fixed"). Using a label
>>>>>>> makes it easier to audit what was closed, simplifying the process of
>>>>>>> identifying and re-opening valid issues caught in our dragnet.
>>>>>>>
>>>>>>>
>>>>>>> On Wed, May 15, 2019 at 7:19 AM Sean Owen  wrote:
>>>>>>>
>>>>>>>> I gave up looking through JIRAs a long time ago, so, big respect for
>>>>>>>> continuing to try to triage them. I am afraid we're missing a few

Re: Resolving all JIRAs affecting EOL releases

2019-05-15 Thread Hyukjin Kwon
I actually recently used 'Incomplete'  a bit when the JIRA is basically too
poorly formed (like just copying and pasting an error) ...

I was thinking about 'Unresolved' status or 'Auto Closed' too. I double
checked that they can be reopened as well after resolution.

[image: Screen Shot 2019-05-16 at 10.35.14 AM.png]
[image: Screen Shot 2019-05-16 at 10.35.39 AM.png]

2019년 5월 16일 (목) 오전 11:04, Sean Owen 님이 작성:

> Agree, anything without an Affected Version should be old enough to time
> out.
> I might use "Incomplete" or something as the status, as we haven't
> otherwise used that. Maybe that's simpler than a label. But, anything like
> that sounds good.
>
> On Wed, May 15, 2019 at 8:40 PM Hyukjin Kwon  wrote:
>
>> BTW, affected version became a required field (I don't remember when
>> exactly was .. I believe it's around when we work on Spark 2.3):
>>
>> [image: Screen Shot 2019-05-16 at 10.29.50 AM.png]
>>
>> So, including all EOL versions and affected versions not specified will
>> roughly work.
>> Using "Cannot Reproduce" as its status and 'bulk-closed' label makes the
>> best sense to me.
>>
>> Okie. I want to open this roughly for a week before taking an actual
>> action for this. If there's no more feedback, I will do as I said ^ next
>> week.
>>
>>
>> 2019년 5월 15일 (수) 오후 11:33, Josh Rosen 님이 작성:
>>
>>> +1 in favor of some sort of JIRA cleanup.
>>>
>>> My only request is that we attach some sort of 'bulk-closed' label to
>>> issues that we close via JIRA filter batch operations (and resolve the
>>> issues as "Timed Out" / "Cannot Reproduce", not "Fixed"). Using a label
>>> makes it easier to audit what was closed, simplifying the process of
>>> identifying and re-opening valid issues caught in our dragnet.
>>>
>>>
>>> On Wed, May 15, 2019 at 7:19 AM Sean Owen  wrote:
>>>
>>>> I gave up looking through JIRAs a long time ago, so, big respect for
>>>> continuing to try to triage them. I am afraid we're missing a few
>>>> important bug reports in the torrent, but most JIRAs are not
>>>> well-formed, just questions, stale, or simply things that won't be
>>>> added. I do think it's important to reflect that reality, and so I'm
>>>> always in favor of more aggressively closing JIRAs. I think this is
>>>> more standard practice, from projects like TensorFlow/Keras, pandas,
>>>> etc to just automatically drop Issues that don't see activity for N
>>>> days. We won't do that, but, are probably on the other hand far too
>>>> lax in closing them.
>>>>
>>>> Remember that JIRAs stay searchable and can be reopened, so it's not
>>>> like we lose much information.
>>>>
>>>> I'd close anything that hasn't had activity in 2 years (?), as a start.
>>>> I like the idea of closing things that only affect an EOL release,
>>>> but, many items aren't marked, so may need to cast the net wider.
>>>>
>>>> I think only then does it make sense to look at bothering to reproduce
>>>> or evaluate the 1000s that will still remain.
>>>>
>>>> On Wed, May 15, 2019 at 4:25 AM Hyukjin Kwon 
>>>> wrote:
>>>> >
>>>> > Hi all,
>>>> >
>>>> > I would like to propose to resolve all JIRAs that affects EOL
>>>> releases - 2.2 and below. and affected version
>>>> > not specified. I was rather against this way and considered this as
>>>> last resort in roughly 3 years ago
>>>> > when we discussed. Now I think we should go ahead with this. See
>>>> below.
>>>> >
>>>> > I have been talking care of this for so long time almost every day
>>>> those 3 years. The number of JIRAs
>>>> > keeps increasing and it does never go down. Now the number is going
>>>> over 2500 JIRAs.
>>>> > Did you guys know? in JIRA, we can only go through page by page up to
>>>> 1000 items. So, currently we're even
>>>> > having difficulties to go through every JIRA. We should manually
>>>> filter out and check each.
>>>> > The number is going over the manageable size.
>>>> >
>>>> > I am not suggesting this without anything actually trying. This is
>>>> what we have tried within my visibility:
>>>> >
>>>> >   1. In roughly 3 years ago, Sean tried to gather committers and even
>>>> non-committ

Re: Resolving all JIRAs affecting EOL releases

2019-05-15 Thread Hyukjin Kwon
oh, wait. 'Incomplete' can still make sense in this way then.
Yes, I am good with 'Incomplete' too.

2019년 5월 16일 (목) 오전 11:24, Hyukjin Kwon 님이 작성:

> I actually recently used 'Incomplete'  a bit when the JIRA is basically
> too poorly formed (like just copying and pasting an error) ...
>
> I was thinking about 'Unresolved' status or `Auto Closed' too. I double
> checked they can be reopen as well after resolution.
>
> [image: Screen Shot 2019-05-16 at 10.35.14 AM.png]
> [image: Screen Shot 2019-05-16 at 10.35.39 AM.png]
>
> 2019년 5월 16일 (목) 오전 11:04, Sean Owen 님이 작성:
>
>> Agree, anything without an Affected Version should be old enough to time
>> out.
>> I might use "Incomplete" or something as the status, as we haven't
>> otherwise used that. Maybe that's simpler than a label. But, anything like
>> that sounds good.
>>
>> On Wed, May 15, 2019 at 8:40 PM Hyukjin Kwon  wrote:
>>
>>> BTW, affected version became a required field (I don't remember when
>>> exactly was .. I believe it's around when we work on Spark 2.3):
>>>
>>> [image: Screen Shot 2019-05-16 at 10.29.50 AM.png]
>>>
>>> So, including all EOL versions and affected versions not specified will
>>> roughly work.
>>> Using "Cannot Reproduce" as its status and 'bulk-closed' label makes the
>>> best sense to me.
>>>
>>> Okie. I want to open this roughly for a week before taking an actual
>>> action for this. If there's no more feedback, I will do as I said ^ next
>>> week.
>>>
>>>
>>> 2019년 5월 15일 (수) 오후 11:33, Josh Rosen 님이 작성:
>>>
>>>> +1 in favor of some sort of JIRA cleanup.
>>>>
>>>> My only request is that we attach some sort of 'bulk-closed' label to
>>>> issues that we close via JIRA filter batch operations (and resolve the
>>>> issues as "Timed Out" / "Cannot Reproduce", not "Fixed"). Using a label
>>>> makes it easier to audit what was closed, simplifying the process of
>>>> identifying and re-opening valid issues caught in our dragnet.
>>>>
>>>>
>>>> On Wed, May 15, 2019 at 7:19 AM Sean Owen  wrote:
>>>>
>>>>> I gave up looking through JIRAs a long time ago, so, big respect for
>>>>> continuing to try to triage them. I am afraid we're missing a few
>>>>> important bug reports in the torrent, but most JIRAs are not
>>>>> well-formed, just questions, stale, or simply things that won't be
>>>>> added. I do think it's important to reflect that reality, and so I'm
>>>>> always in favor of more aggressively closing JIRAs. I think this is
>>>>> more standard practice, from projects like TensorFlow/Keras, pandas,
>>>>> etc to just automatically drop Issues that don't see activity for N
>>>>> days. We won't do that, but, are probably on the other hand far too
>>>>> lax in closing them.
>>>>>
>>>>> Remember that JIRAs stay searchable and can be reopened, so it's not
>>>>> like we lose much information.
>>>>>
>>>>> I'd close anything that hasn't had activity in 2 years (?), as a start.
>>>>> I like the idea of closing things that only affect an EOL release,
>>>>> but, many items aren't marked, so may need to cast the net wider.
>>>>>
>>>>> I think only then does it make sense to look at bothering to reproduce
>>>>> or evaluate the 1000s that will still remain.
>>>>>
>>>>> On Wed, May 15, 2019 at 4:25 AM Hyukjin Kwon 
>>>>> wrote:
>>>>> >
>>>>> > Hi all,
>>>>> >
>>>>> > I would like to propose to resolve all JIRAs that affects EOL
>>>>> releases - 2.2 and below. and affected version
>>>>> > not specified. I was rather against this way and considered this as
>>>>> last resort in roughly 3 years ago
>>>>> > when we discussed. Now I think we should go ahead with this. See
>>>>> below.
>>>>> >
>>>>> > I have been talking care of this for so long time almost every day
>>>>> those 3 years. The number of JIRAs
>>>>> > keeps increasing and it does never go down. Now the number is going
>>>>> over 2500 JIRAs.
>>>>> > Did you guys know? in JIRA, we can only go through page by page up
>>>>> to 1000 items. So, currently we're even
>>>

Re: Resolving all JIRAs affecting EOL releases

2019-05-15 Thread Hyukjin Kwon
BTW, affected version became a required field (I don't remember exactly
when .. I believe it was around when we were working on Spark 2.3):

[image: Screen Shot 2019-05-16 at 10.29.50 AM.png]

So, including all EOL versions plus issues with no affected version
specified will roughly work.
Using "Cannot Reproduce" as the status and a 'bulk-closed' label makes the
most sense to me.

Okie. I want to leave this open for roughly a week before taking an actual
action. If there's no more feedback, I will do as I said ^ next week.


2019년 5월 15일 (수) 오후 11:33, Josh Rosen 님이 작성:

> +1 in favor of some sort of JIRA cleanup.
>
> My only request is that we attach some sort of 'bulk-closed' label to
> issues that we close via JIRA filter batch operations (and resolve the
> issues as "Timed Out" / "Cannot Reproduce", not "Fixed"). Using a label
> makes it easier to audit what was closed, simplifying the process of
> identifying and re-opening valid issues caught in our dragnet.
>
>
> On Wed, May 15, 2019 at 7:19 AM Sean Owen  wrote:
>
>> I gave up looking through JIRAs a long time ago, so, big respect for
>> continuing to try to triage them. I am afraid we're missing a few
>> important bug reports in the torrent, but most JIRAs are not
>> well-formed, just questions, stale, or simply things that won't be
>> added. I do think it's important to reflect that reality, and so I'm
>> always in favor of more aggressively closing JIRAs. I think this is
>> more standard practice, from projects like TensorFlow/Keras, pandas,
>> etc to just automatically drop Issues that don't see activity for N
>> days. We won't do that, but, are probably on the other hand far too
>> lax in closing them.
>>
>> Remember that JIRAs stay searchable and can be reopened, so it's not
>> like we lose much information.
>>
>> I'd close anything that hasn't had activity in 2 years (?), as a start.
>> I like the idea of closing things that only affect an EOL release,
>> but, many items aren't marked, so may need to cast the net wider.
>>
>> I think only then does it make sense to look at bothering to reproduce
>> or evaluate the 1000s that will still remain.
>>
>> On Wed, May 15, 2019 at 4:25 AM Hyukjin Kwon  wrote:
>> >
>> > Hi all,
>> >
>> > I would like to propose to resolve all JIRAs that affects EOL releases
>> - 2.2 and below. and affected version
>> > not specified. I was rather against this way and considered this as
>> last resort in roughly 3 years ago
>> > when we discussed. Now I think we should go ahead with this. See below.
>> >
>> > I have been talking care of this for so long time almost every day
>> those 3 years. The number of JIRAs
>> > keeps increasing and it does never go down. Now the number is going
>> over 2500 JIRAs.
>> > Did you guys know? in JIRA, we can only go through page by page up to
>> 1000 items. So, currently we're even
>> > having difficulties to go through every JIRA. We should manually filter
>> out and check each.
>> > The number is going over the manageable size.
>> >
>> > I am not suggesting this without anything actually trying. This is what
>> we have tried within my visibility:
>> >
>> >   1. In roughly 3 years ago, Sean tried to gather committers and even
>> non-committers people to sort
>> > out this number. At that time, we were only able to keep this
>> number as is. After we lost this momentum,
>> > it kept increasing back.
>> >   2. At least I scanned _all_ the previous JIRAs at least more than two
>> times and resolved them. Roughly
>> > once a year. The rest of them are mostly obsolete but not enough
>> information to investigate further.
>> >   3. I strictly stick to "Contributing to JIRA Maintenance"
>> https://spark.apache.org/contributing.html and
>> > resolve JIRAs.
>> >   4. Promoting other people to comment on JIRA or actively resolve them.
>> >
>> > One of the facts I realised is the increasing number of committers
>> doesn't virtually help this much (although
>> > it might be helpful if somebody active in JIRA becomes a committer.)
>> >
>> > One of the important thing I should note is that, it's now almost
>> pretty difficult to reproduce and test the
>> > issues found in EOL releases. We should git clone, checkout, build and
>> test. And then, see if that issue
>> > still exists in upstream, and fix. This is non-trivial overhead.
>> >
>> > Therefore, I would like to propose resolving _all_ the JIRAs that
>> targets EOL releases - 2.2 and below.
>> > Please let me know if anyone has some concerns or objections.
>> >
>> > Thanks.
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: Resolving all JIRAs affecting EOL releases

2019-05-15 Thread Hyukjin Kwon
Yea, a more sophisticated condition is welcome. My only goal is to get this
down to a manageable size.

I would go for the option that reduces more tickets - under 1000 OPEN (and
REOPENED) tickets - so that we can at least go through them in one pass
without coming up with a non-duplicating filter.

On Wed, 15 May 2019, 19:33 Abdeali Kothari, 
wrote:

> Was thinking that getting an estimated statistic of the number of issues
> that would be closed if this is done would help.
>
> Open issues: 3882 (project = SPARK AND status in (Open, "In Progress",
> Reopened))
> Open + Does not affect 3.0+ = 2795
> Open + Does not affect 2.4+ = 2373
> Open + Does not affect 2.3+ = 1765
> Open + Does not affect 2.2+ = 1322
> Open + Does not affect 2.1+ = 967
> Open + Does not affect 2.0+ = 651
>
> Open + Does not affect 2.0+ + Priority in (Urgent, Blocker, Critical,
> High) [JQL1] = 838
> Open + Does not affect 2.0+ + Priority in (Urgent, Blocker, Critical,
> High, Major) = 206
> Open + Does not affect 2.2+ + Priority not in (Urgent, Blocker, Critical,
> High) [JQL2] = 1303
> Open + Does not affect 2.2+ + Priority not in (Urgent, Blocker, Critical,
> High, Major) = 397
> Open + Does not affect 2.3+ + Priority not in (Urgent, Blocker, Critical,
> High) = 1743
> Open + Does not affect 2.3+ + Priority not in (Urgent, Blocker, Critical,
> High, Major) = 550
>
> Resolving ALL seems a bit overkill to me.
> My current opinion seems like:
>  - Resolving "Open + Does not affect 2.0+" is something that should be
> done, as I doubt anyone would be looking at the 1.x versions anymore (651
> tasks)
>  - Resolving "Open + Does not affect 2.3+ + Priority not in (Urgent,
> Blocker, Critical, High, Major)" may be a good idea (an additional ~1k
> tasks)
> The issues with priority Urgent/Blocker/Critical should be triaged - as it
> may have something important.
>
>
> [JQL1]:
> project = SPARK
>  AND status in (Open, "In Progress", Reopened)
>  AND NOT (affectedVersion in versionMatch("^[2-3].*"))
>  AND priority NOT IN (Urgent, Blocker, Critical, High)
>
> [JQL2]:
> project = SPARK
>  AND status in (Open, "In Progress", Reopened)
>  AND NOT (affectedVersion in versionMatch("^3.*") OR affectedVersion in
> versionMatch("^2.4.*") OR affectedVersion in versionMatch("^2.3.*") OR
> affectedVersion in versionMatch("^2.2.*"))
>  AND priority NOT IN (Urgent, Blocker, Critical, High)
>
>
> On Wed, May 15, 2019, 14:55 Hyukjin Kwon  wrote:
>
>> Hi all,
>>
>> I would like to propose to resolve all JIRAs that affects EOL releases -
>> 2.2 and below. and affected version
>> not specified. I was rather against this way and considered this as last
>> resort in roughly 3 years ago
>> when we discussed. Now I think we should go ahead with this. See below.
>>
>> I have been talking care of this for so long time almost every day those
>> 3 years. The number of JIRAs
>> keeps increasing and it does never go down. Now the number is going over
>> 2500 JIRAs.
>> Did you guys know? in JIRA, we can only go through page by page up to
>> 1000 items. So, currently we're even
>> having difficulties to go through every JIRA. We should manually filter
>> out and check each.
>> The number is going over the manageable size.
>>
>> I am not suggesting this without anything actually trying. This is what
>> we have tried within my visibility:
>>
>>   1. In roughly 3 years ago, Sean tried to gather committers and even
>> non-committers people to sort
>> out this number. At that time, we were only able to keep this number
>> as is. After we lost this momentum,
>> it kept increasing back.
>>   2. At least I scanned _all_ the previous JIRAs at least more than two
>> times and resolved them. Roughly
>> once a year. The rest of them are mostly obsolete but not enough
>> information to investigate further.
>>   3. I strictly stick to "Contributing to JIRA Maintenance"
>> https://spark.apache.org/contributing.html and
>> resolve JIRAs.
>>   4. Promoting other people to comment on JIRA or actively resolve them.
>>
>> One of the facts I realised is the increasing number of committers
>> doesn't virtually help this much (although
>> it might be helpful if somebody active in JIRA becomes a committer.)
>>
>> One of the important thing I should note is that, it's now almost pretty
>> difficult to reproduce and test the
>> issues found in EOL releases. We should git clone, checkout, build and
>> test. And then, see if that issue
>> still exists in upstream, and fix. This is non-trivial overhead.
>>
>> Therefore, I would like to propose resolving _all_ the JIRAs that targets
>> EOL releases - 2.2 and below.
>> Please let me know if anyone has some concerns or objections.
>>
>> Thanks.
>>
>


Resolving all JIRAs affecting EOL releases

2019-05-15 Thread Hyukjin Kwon
Hi all,

I would like to propose resolving all JIRAs that affect EOL releases -
2.2 and below - as well as those with no affected version specified. I was
rather against this approach and considered it a last resort roughly 3 years
ago when we discussed it. Now I think we should go ahead with it. See below.

I have been taking care of this almost every day for those 3 years. The
number of JIRAs keeps increasing and never goes down. Now it is going over
2500 JIRAs.
Did you guys know? In JIRA, we can only page through up to 1000 items. So
currently we even have difficulty going through every JIRA; we have to
manually filter and check each one.
The number has grown beyond a manageable size.

I am not suggesting this without actually having tried anything. This is
what we have tried within my visibility:

  1. Roughly 3 years ago, Sean tried to gather committers and even
non-committers to sort out this number. At that time, we were only able to
keep the number as is. After we lost that momentum, it kept increasing
again.
  2. I scanned _all_ the previous JIRAs more than twice and resolved what I
could, roughly once a year. The rest of them are mostly obsolete but lack
enough information to investigate further.
  3. I strictly stick to "Contributing to JIRA Maintenance"
(https://spark.apache.org/contributing.html) and resolve JIRAs accordingly.
  4. Encouraging other people to comment on JIRAs or actively resolve them.

One of the facts I realised is that the increasing number of committers
doesn't really help with this (although it might help if somebody active in
JIRA becomes a committer).

One important thing I should note is that it's now quite difficult to
reproduce and test issues found in EOL releases. We have to git clone, check
out, build and test, and then see if the issue still exists upstream, and
fix it. This is non-trivial overhead.

Therefore, I would like to propose resolving _all_ the JIRAs that target
EOL releases - 2.2 and below.
Please let me know if anyone has any concerns or objections.

Thanks.


Re: In Apache Spark JIRA, spark/dev/github_jira_sync.py not running properly

2019-04-25 Thread Hyukjin Kwon
args)
>>   File "/base/data/home/apps/s~spark-prs/live.412416057856832734/lib/jira/resilientsession.py", line 57, in raise_on_error
>>     r.status_code, error, r.url, request=request, response=r, **kwargs)
>> JIRAError: JiraError HTTP 403 url:
>> https://issues.apache.org/jira/rest/api/2/serverInfo
>> text: CAPTCHA_CHALLENGE; login-url=https://issues.apache.org/jira/login.jsp
>
>
> It looks like ASF JIRA was throwing a captcha challenge at us, so I used
> the credentials to manually log in and complete the challenge.
>
> Hopefully that's enough to fix things, but to prevent re-occurrence we
> might need to change the login credential type from username + password to
> instead use OAuth tokens.
>
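
For reference, with the Python jira client the two credential styles look
roughly like the sketch below; the config key names are assumptions, not the
actual spark-pr-dashboard settings:

# Sketch only: basic auth (username + password, which can trigger JIRA's
# CAPTCHA challenge after failed logins) vs. OAuth 1.0a tokens.
# Config key names below are illustrative.
from jira import JIRA

JIRA_SERVER = "https://issues.apache.org/jira"

def get_jira_client_basic(config):
    return JIRA(JIRA_SERVER,
                basic_auth=(config["JIRA_USERNAME"], config["JIRA_PASSWORD"]))

def get_jira_client_oauth(config):
    with open(config["JIRA_RSA_KEY_PATH"]) as f:
        key_cert = f.read()
    return JIRA(JIRA_SERVER,
                oauth={"access_token": config["JIRA_ACCESS_TOKEN"],
                       "access_token_secret": config["JIRA_ACCESS_TOKEN_SECRET"],
                       "consumer_key": config["JIRA_CONSUMER_KEY"],
                       "key_cert": key_cert})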
> On Wed, Apr 24, 2019 at 10:42 PM Hyukjin Kwon  wrote:
>
>> Can anyone take a look for this one? OPEN status JIRAs are being rapidly
>> increased (from around 2400 to 2600)
>>
>> 2019년 4월 19일 (금) 오후 8:05, Hyukjin Kwon 님이 작성:
>>
>>> Hi all,
>>>
>>> Looks 'spark/dev/github_jira_sync.py' is not running correctly somewhere.
>>> Usually the JIRA's status should be updated to "IN PROGRESS" when
>>> somebody opens a PR against a JIRA.
>>> Looks now it only leaves a link and does not change JIRA's status.
>>>
>>> Can someone else who knows where it's running can check this?
>>>
>>> FWIW, I check every PR and JIRA almost every day but ever since this
>>> happened, this makes (at least to me) duplicately check the JIRAs.
>>> Previously, if I check all the PRs and JIRAs, they were not duplicated
>>> because JIRAs having PRs have different status, "IN PROGRESS" but now all
>>> JIRAs have "OPEN" status.
>>>
>>> Thanks.
>>>
>>


Re: In Apache Spark JIRA, spark/dev/github_jira_sync.py not running properly

2019-04-24 Thread Hyukjin Kwon
Can anyone take a look at this one? The number of OPEN JIRAs is rapidly
increasing (from around 2400 to 2600).

2019년 4월 19일 (금) 오후 8:05, Hyukjin Kwon 님이 작성:

> Hi all,
>
> Looks 'spark/dev/github_jira_sync.py' is not running correctly somewhere.
> Usually the JIRA's status should be updated to "IN PROGRESS" when
> somebody opens a PR against a JIRA.
> Looks now it only leaves a link and does not change JIRA's status.
>
> Can someone else who knows where it's running can check this?
>
> FWIW, I check every PR and JIRA almost every day but ever since this
> happened, this makes (at least to me) duplicately check the JIRAs.
> Previously, if I check all the PRs and JIRAs, they were not duplicated
> because JIRAs having PRs have different status, "IN PROGRESS" but now all
> JIRAs have "OPEN" status.
>
> Thanks.
>


Re: pyspark.sql.functions ide friendly

2019-04-19 Thread Hyukjin Kwon
+1 I'm good with changing too.

On Thu, 18 Apr 2019, 01:18 Reynold Xin,  wrote:

> Are you talking about the ones that are defined in a dictionary? If yes,
> that was actually not that great in hindsight (makes it harder to read &
> change), so I'm OK changing it.
>
> E.g.
>
> _functions = {
> 'lit': _lit_doc,
> 'col': 'Returns a :class:`Column` based on the given column name.',
> 'column': 'Returns a :class:`Column` based on the given column name.',
> 'asc': 'Returns a sort expression based on the ascending order of the
> given column name.',
> 'desc': 'Returns a sort expression based on the descending order of
> the given column name.',
> }
>
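
For context, that dictionary is consumed by a small factory at import time.
A simplified sketch of the pattern (approximate, not the exact PySpark
source) next to the explicit, IDE-friendly alternative being discussed:

# Simplified sketch of the runtime generation pattern; names are approximate.
from pyspark import SparkContext
from pyspark.sql.column import Column, _to_java_column

_functions = {
    'asc': 'Returns a sort expression based on the ascending order of the given column name.',
    'desc': 'Returns a sort expression based on the descending order of the given column name.',
}

def _create_function(name, doc=""):
    def _(col):
        sc = SparkContext._active_spark_context
        jc = getattr(sc._jvm.functions, name)(_to_java_column(col))
        return Column(jc)
    _.__name__ = name
    _.__doc__ = doc
    return _

for _name, _doc in _functions.items():
    globals()[_name] = _create_function(_name, _doc)

# The IDE-friendly alternative is to spell each wrapper out so linters and
# autocompletion can see a real definition:
def asc(col):
    """Returns a sort expression based on the ascending order of the given column name."""
    sc = SparkContext._active_spark_context
    return Column(sc._jvm.functions.asc(_to_java_column(col)))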
>
> On Wed, Apr 17, 2019 at 4:35 AM, Sean Owen  wrote:
>
>> I use IntelliJ and have never seen an issue parsing the pyspark
>> functions... you're just saying the linter has an optional inspection to
>> flag it? just disable that?
>> I don't think we want to complicate the Spark code just for this. They
>> are declared at runtime for a reason.
>>
>> On Wed, Apr 17, 2019 at 6:27 AM educh...@gmail.com 
>> wrote:
>>
>> Hi,
>>
>> I'm aware of various workarounds to make this work smoothly in various
>> IDEs, but wouldn't it be better to solve the root cause?
>>
>> I've seen the code and don't see anything that requires this level of
>> dynamic code; the translation is 99% trivial.
>>
>> On 2019/04/16 12:16:41, 880f0464 <880f0...@protonmail.com.INVALID>
>> wrote:
>>
>> Hi.
>>
>> That's a problem with Spark as such and in general can be addressed on
>> IDE to IDE basis - see for example https://stackoverflow.com/q/40163106
>> for some hints.
>>
>> Sent with ProtonMail Secure Email.
>>
>> ‐‐‐ Original Message ‐‐‐
>> On Tuesday, April 16, 2019 2:10 PM, educhana  wrote:
>>
>> Hi,
>>
>> Currently, using pyspark.sql.functions from an IDE like PyCharm causes the
>> linters to complain because the functions are declared at runtime.
>>
>> Would a PR fixing this be welcomed? Are there any problems/difficulties
>> I'm unaware of?
>>
>> --
>>
>>
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>> --
>>
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
>
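
For reference, a minimal sketch of the kind of change discussed in this
thread: replacing the dictionary-driven, runtime-generated wrappers with
plain def declarations that IDEs and linters can resolve statically. The
helper names used here (_to_java_column, SparkContext._active_spark_context)
follow the general pattern of the PySpark internals, but this is only an
illustration, not the actual patch.

from pyspark import SparkContext
from pyspark.sql.column import Column, _to_java_column


def asc(col):
    """Returns a sort expression based on the ascending order of the given column name."""
    sc = SparkContext._active_spark_context
    return Column(sc._jvm.functions.asc(_to_java_column(col)))


def desc(col):
    """Returns a sort expression based on the descending order of the given column name."""
    sc = SparkContext._active_spark_context
    return Column(sc._jvm.functions.desc(_to_java_column(col)))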


In Apache Spark JIRA, spark/dev/github_jira_sync.py not running properly

2019-04-19 Thread Hyukjin Kwon
Hi all,

It looks like 'spark/dev/github_jira_sync.py' is not running correctly
somewhere.
Usually a JIRA's status should be updated to "IN PROGRESS" when somebody
opens a PR against it.
Now it only leaves a link and does not change the JIRA's status.

Can someone who knows where it's running check this?

FWIW, I check every PR and JIRA almost every day, but ever since this
happened I (at least) have to check the JIRAs twice.
Previously, when I checked all the PRs and JIRAs, there was no duplication
because JIRAs with PRs had a different status, "IN PROGRESS", but now all
JIRAs have the "OPEN" status.

Thanks.
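
As a rough illustration, this is the kind of status change the sync script
is expected to make when a PR is linked, using the Python "jira" client. The
issue key, transition name, and credentials below are placeholders, not
details taken from the actual script:

from jira import JIRA

asf_jira = JIRA(server="https://issues.apache.org/jira",
                basic_auth=("username", "password"))

issue = asf_jira.issue("SPARK-12345")  # placeholder issue key

# Look up the transition that moves the issue from OPEN to IN PROGRESS and
# apply it, which is the step that seems to be silently skipped right now.
transitions = {t["name"]: t["id"] for t in asf_jira.transitions(issue)}
if issue.fields.status.name == "Open" and "Start Progress" in transitions:
    asf_jira.transition_issue(issue, transitions["Start Progress"])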


Re: Upgrading minimal PyArrow version to 0.12.x [SPARK-27276]

2019-03-28 Thread Hyukjin Kwon
Bryan, was there an actual decision on when to drop Python 3.4 in PyArrow? If
not, I think we might be able to increase the minimal Arrow version
separately.
If there was, it looks inevitable that we upgrade Jenkins's Python from 3.4
to 3.5.

On Fri, Mar 29, 2019 at 1:39 AM, Felix Cheung wrote:

> That’s not necessarily bad. I don’t know if we have plan to ever release
> any new 2.2.x, 2.3.x at this point and we can message this “supported
> version” of python change for any new 2.4 release.
>
> Besides we could still support python 3.4 - it’s just more complicated to
> test manually without Jenkins coverage.
>
>
> --
> *From:* shane knapp 
> *Sent:* Tuesday, March 26, 2019 12:11 PM
> *To:* Bryan Cutler
> *Cc:* dev
> *Subject:* Re: Upgrading minimal PyArrow version to 0.12.x [SPARK-27276]
>
> i'm pretty certain that i've got a solid python 3.5 conda environment
> ready to be deployed, but this isn't a minor change to the build system and
> there might be some bugs to iron out.
>
> another problem is that the current python 3.4 environment is hard-coded
> into both the build scripts on jenkins (all over the place) and the
> codebase (thankfully in only one spot):  export
> PATH=/home/anaconda/envs/py3k/bin:$PATH
>
> this means that every branch (master, 2.x, etc) will test against whatever
> version of python lives in that conda environment.  if we upgrade to 3.5,
> all branches will test against this version.  changing the build and test
> infra to support testing against 2.7, 3.4 or 3.5 based on branch is
> definitely non-trivial...
>
> thoughts?
>
>
>
>
> On Tue, Mar 26, 2019 at 11:39 AM Bryan Cutler  wrote:
>
>> Thanks Hyukjin.  The plan is to get this done for 3.0 only.  Here is a
>> link to the JIRA https://issues.apache.org/jira/browse/SPARK-27276.
>> Shane is also correct in that newer versions of pyarrow have stopped
>> support for Python 3.4, so we should probably have Jenkins test against 2.7
>> and 3.5.
>>
>> On Mon, Mar 25, 2019 at 9:44 PM Reynold Xin  wrote:
>>
>>> +1 on doing this in 3.0.
>>>
>>>
>>> On Mon, Mar 25, 2019 at 9:31 PM, Felix Cheung >> > wrote:
>>>
>>>> I’m +1 if 3.0
>>>>
>>>>
>>>> --
>>>> *From:* Sean Owen 
>>>> *Sent:* Monday, March 25, 2019 6:48 PM
>>>> *To:* Hyukjin Kwon
>>>> *Cc:* dev; Bryan Cutler; Takuya UESHIN; shane knapp
>>>> *Subject:* Re: Upgrading minimal PyArrow version to 0.12.x
>>>> [SPARK-27276]
>>>>
>>>> I don't know a lot about Arrow here, but seems reasonable. Is this for
>>>> Spark 3.0 or for 2.x? Certainly, requiring the latest for Spark 3
>>>> seems right.
>>>>
>>>> On Mon, Mar 25, 2019 at 8:17 PM Hyukjin Kwon 
>>>> wrote:
>>>> >
>>>> > Hi all,
>>>> >
>>>> > We really need to upgrade the minimal version soon. It's actually
>>>> slowing down the PySpark dev, for instance, by the overhead that sometimes
>>>> we need currently to test all multiple matrix of Arrow and Pandas. Also, it
>>>> currently requires to add some weird hacks or ugly codes. Some bugs exist
>>>> in lower versions, and some features are not supported in low PyArrow, for
>>>> instance.
>>>> >
>>>> > Per the recommendation of Bryan (an Apache Arrow and Spark committer,
>>>> FWIW), and in my opinion as well, we had better increase the minimal
>>>> version to 0.12.x. (Also, note that Pandas <> Arrow is an experimental
>>>> feature.)
>>>> >
>>>> > So, I and Bryan will proceed this roughly in few days if there isn't
>>>> objections assuming we're fine with increasing it to 0.12.x. Please let me
>>>> know if there are some concerns.
>>>> >
>>>> > For clarification, this requires some jobs in Jenkins to upgrade the
>>>> minimal version of PyArrow (I cc'ed Shane as well).
>>>> >
>>>> > PS: I roughly heard that Shane's busy for some work stuff .. but it's
>>>> kind of important in my perspective.
>>>> >
>>>>
>>>> -
>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>
>>>
>>>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: PySpark syntax vs Pandas syntax

2019-03-26 Thread Hyukjin Kwon
BTW, I am working on documentation related to this subject at
https://issues.apache.org/jira/browse/SPARK-26022 to describe the differences.

On Tue, Mar 26, 2019 at 3:34 PM, Reynold Xin wrote:

> We have some early stuff there but not quite ready to talk about it in
> public yet (I hope soon though). Will shoot you a separate email on it.
>
> On Mon, Mar 25, 2019 at 11:32 PM Abdeali Kothari 
> wrote:
>
>> Thanks for the reply Reynold - Has this shim project started ?
>> I'd love to contribute to it - as it looks like I have started making a
>> bunch of helper functions to do something similar for my current task and
>> would prefer not doing it in isolation.
>> Was considering making a git repo and pushing stuff there just today
>> morning. But if there's already folks working on it - I'd prefer
>> collaborating.
>>
>> Note - I'm not recommending we make the logical plan mutable (as I am
>> scared of that too!). I think there are other ways of handling that - but
>> we can go into details later.
>>
>> On Tue, Mar 26, 2019 at 11:58 AM Reynold Xin  wrote:
>>
>>> We have been thinking about some of these issues. Some of them are
>>> harder to do, e.g. Spark DataFrames are fundamentally immutable, and making
>>> the logical plan mutable is a significant deviation from the current
>>> paradigm that might confuse the hell out of some users. We are considering
>>> building a shim layer as a separate project on top of Spark (so we can make
>>> rapid releases based on feedback) just to test this out and see how well it
>>> could work in practice.
>>>
>>> On Mon, Mar 25, 2019 at 11:04 PM Abdeali Kothari <
>>> abdealikoth...@gmail.com> wrote:
>>>
 Hi,
 I was doing some spark to pandas (and vice versa) conversion because
 some of the pandas codes we have don't work on huge data. And some spark
 codes work very slow on small data.

 It was nice to see that pyspark had some similar syntax for the common
 pandas operations that the python community is used to.

 GroupBy aggs: df.groupby(['col2']).agg({'col2': 'count'}).show()
 Column selects: df[['col1', 'col2']]
 Row Filters: df[df['col1'] < 3.0]

 I was wondering about a bunch of other functions in pandas which seemed
 common. And thought there must've been a discussion about it in the
 community - hence started this thread.

 I was wondering whether there has been discussion on adding the
 following functions:

 *Column setters*:
 In Pandas:
 df['col3'] = df['col1'] * 3.0
 While I do the following in PySpark:
 df = df.withColumn('col3', df['col1'] * 3.0)

 *Column apply()*:
 In Pandas:
 df['col3'] = df['col1'].apply(lambda x: x * 3.0)
 While I do the following in PySpark:
 df = df.withColumn('col3', F.udf(lambda x: x * 3.0, 'float')(
 df['col1']))

 I understand that this one cannot be as simple as in pandas due to the
 output-type that's needed here. But could be done like:
 df['col3'] = df['col1'].apply((lambda x: x * 3.0), 'float')

 Multi column in pandas is:
 df['col3'] = df[['col1', 'col2']].apply(lambda x: x.col1 * 3.0)
 Maybe this can be done in pyspark as or if we can send a
 pyspark.sql.Row directly it would be similar (?):
 df['col3'] = df[['col1', 'col2']].apply((lambda col1, col2: col1 *
 3.0), 'float')

 *Rename*:
 In Pandas:
 df.rename(columns={...})
 While I do the following in PySpark:
 df.toDF(*[{'col2': 'col3'}.get(i, i) for i in df.columns])

 *To Dictionary*:
 In Pandas:
 df.to_dict(orient='list')
 While I do the following in PySpark:
 {f.name: [row[i] for row in df.collect()] for i, f in
 enumerate(df.schema.fields)}

 I thought I'd start the discussion with these and come back to some of
 the others I see that could be helpful.

 *Note*: (with the column functions in mind) I understand the concept
 of the DataFrame cannot be modified. And I am not suggesting we change that
 nor any underlying principle. Just trying to add syntactic sugar here.
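
For reference, a minimal sketch of the kind of syntactic sugar discussed
above, as a thin, hypothetical wrapper (not an existing API) that maps
pandas-style column assignment and rename onto an immutable PySpark
DataFrame:

from pyspark.sql import SparkSession, functions as F

class PandasLikeFrame(object):
    """A thin, hypothetical wrapper; the underlying Spark DataFrame stays immutable."""

    def __init__(self, sdf):
        self._sdf = sdf

    def __getitem__(self, key):
        # df['col1'] returns a Column; df[['col1', 'col2']] and df[cond] delegate as-is.
        return self._sdf[key]

    def __setitem__(self, name, col):
        # df['col3'] = df['col1'] * 3.0 -> re-point the wrapper at a new DataFrame.
        self._sdf = self._sdf.withColumn(name, col)

    def rename(self, columns):
        return PandasLikeFrame(
            self._sdf.toDF(*[columns.get(c, c) for c in self._sdf.columns]))

    def to_spark(self):
        return self._sdf

spark = SparkSession.builder.getOrCreate()
df = PandasLikeFrame(spark.range(5).withColumn("col1", F.rand()))
df["col3"] = df["col1"] * 3.0
df.to_spark().show()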




Upgrading minimal PyArrow version to 0.12.x [SPARK-27276]

2019-03-25 Thread Hyukjin Kwon
Hi all,

We really need to upgrade the minimal version soon. It's actually slowing
down PySpark development, for instance through the overhead of sometimes
having to test the full matrix of Arrow and Pandas versions. It also
currently requires adding some weird hacks or ugly code. Some bugs exist in
lower versions, and some features are not supported in older PyArrow, for
instance.

Per the recommendation of Bryan (an Apache Arrow and Spark committer, FWIW),
and in my opinion as well, we had better increase the minimal version to
0.12.x. (Also, note that Pandas <> Arrow is an experimental feature.)

So, Bryan and I will proceed with this in roughly a few days if there are no
objections, assuming we're fine with increasing it to 0.12.x. Please let me
know if there are any concerns.

For clarification, this requires some Jenkins jobs to upgrade their minimal
version of PyArrow (I cc'ed Shane as well).

PS: I heard that Shane's busy with some work stuff, but this is kind of
important from my perspective.
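
For context, a minimal sketch of the kind of minimum-version guard this
thread is about; the function name and messages here are illustrative rather
than the actual PySpark code:

from distutils.version import LooseVersion

def require_minimum_pyarrow_version(minimum="0.12.0"):
    """Raise an informative error when the installed PyArrow is too old."""
    try:
        import pyarrow
    except ImportError:
        raise ImportError(
            "PyArrow >= %s must be installed for Arrow optimizations" % minimum)
    if LooseVersion(pyarrow.__version__) < LooseVersion(minimum):
        raise ImportError(
            "PyArrow >= %s must be installed; found %s"
            % (minimum, pyarrow.__version__))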


Re: Request to disable a bot account, 'Thincrs' in JIRA of Apache Spark

2019-03-13 Thread Hyukjin Kwon
Thanks, I opened https://issues.apache.org/jira/browse/INFRA-18004

On Thu, Mar 14, 2019 at 8:35 AM, Marcelo Vanzin wrote:

> Go for it. I would do it now, instead of waiting, since there's been
> enough time for them to take action.
>
> On Wed, Mar 13, 2019 at 4:32 PM Hyukjin Kwon  wrote:
> >
> > Looks this bot keeps working. I am going to open a INFRA JIRA to block
> this bot in few days.
> > Please let me know if you guys have a different idea to prevent this.
> >
> >> On Wed, Mar 13, 2019 at 8:16 AM, Hyukjin Kwon wrote:
> >>
> >> Hi whom it may concern in Thincrs
> >>
> >>
> >>
> >> I am still observing this bot misuses Apache Spark’s JIRA board (see
> https://issues.apache.org/jira/secure/ViewProfile.jspa?name=Thincrs)
> >>
> >> I contacted you guys once before but I haven’t got any response related
> with it. Still, this bot in this specific company looks misusing Apahce
> JIRA board.
> >> If it continues, I think we should block this bot. Could you guys stop
> misusing this bot please?
> >>
> >>
> >>
> >> From: Hyukjin Kwon 
> >> Date: Tuesday, January 8, 2019 at 11:18 AM
> >> To: "h...@thincrs.com" 
> >> Subject: Request to disable a bot account, 'Thincrs' in JIRA of Apache
> Spark
> >>
> >>
> >>
> >> Hi all,
> >>
> >>
> >>
> >>
> >>
> >> We, Apache Spark community, lately noticed one bot named ‘Thincrs’ in
> Apache Spark’s JIRA:
> https://issues.apache.org/jira/issues/?jql=text%20~%20Thincrs
> >>
> >>
> >>
> >> Looks like this is a bot and it keeps leaving some comments such as:
> >>
> >>
> >>
> >>   A user of thincrs has selected this issue. Deadline: Xxx, Xxx X, 
> XX:XX
> >>
> >>
> >>
> >>
> >>
> >> This makes some noise to Apache Spark maintainers, committers,
> contributors and users. It was asked (by me) to Spark’s dev mailing list
> before:
> >>
> >>
> >>
> >>
> http://apache-spark-developers-list.1001551.n3.nabble.com/A-user-of-thincrs-has-selected-this-issue-Deadline-Xxx-Xxx-X--XX-XX-td25836.html
> >>
> >>
> >>
> >> And, one of PMCs in Apache Spark contacted to stop this bot if I am not
> mistaken.
> >>
> >>
> >>
> >>
> >>
> >> Lately, I noticed again this bot left a comment again as below:
> >>
> >>
> >>
> >>   Thincrs commented on SPARK-25823:
> >>
> >>   -
> >>
> >>
> >>
> >>   A user of thincrs has selected this issue. Deadline: Mon, Jan 14,
> 2019 10:32 PM
> >>
> >>
> >>
> >>
> >>
> >> This comment is not visible by one of Spark committer for now but
> leaving comments there send emails to all the people participating in the
> JIRA.
> >>
> >>
> >>
> >> Could you please stop this bot if it belongs to Thincrs please?
> >>
> >>
> >>
> >>
> >>
> >> Thanks.
>
>
>
> --
> Marcelo
>


Re: Request to disable a bot account, 'Thincrs' in JIRA of Apache Spark

2019-03-13 Thread Hyukjin Kwon
It looks like this bot is still active. I am going to open an INFRA JIRA to
block this bot in a few days.
Please let me know if you have a different idea for preventing this.

On Wed, Mar 13, 2019 at 8:16 AM, Hyukjin Kwon wrote:

> Hi whom it may concern in Thincrs
>
>
>
> I am still observing this bot misuses Apache Spark’s JIRA board (see
> https://issues.apache.org/jira/secure/ViewProfile.jspa?name=Thincrs)
>
> I contacted you once before but haven’t received any response. Still, this
> bot from your company appears to be misusing the Apache JIRA board.
> If this continues, I think we should block the bot. Could you please stop
> this misuse?
>
>
>
> *From: *Hyukjin Kwon 
> *Date: *Tuesday, January 8, 2019 at 11:18 AM
> *To: *"h...@thincrs.com" 
> *Subject: *Request to disable a bot account, 'Thincrs' in JIRA of Apache
> Spark
>
>
>
> Hi all,
>
>
>
>
>
> We, Apache Spark community, lately noticed one bot named ‘Thincrs’ in
> Apache Spark’s JIRA:
> https://issues.apache.org/jira/issues/?jql=text%20~%20Thincrs
>
>
>
> Looks like this is a bot and it keeps leaving some comments such as:
>
>
>
>   A user of thincrs has selected this issue. Deadline: Xxx, Xxx X, 
> XX:XX
>
>
>
>
>
> This makes some noise to Apache Spark maintainers, committers,
> contributors and users. It was asked (by me) to Spark’s dev mailing list
> before:
>
>
>
>
> http://apache-spark-developers-list.1001551.n3.nabble.com/A-user-of-thincrs-has-selected-this-issue-Deadline-Xxx-Xxx-X--XX-XX-td25836.html
>
>
>
> And, one of PMCs in Apache Spark contacted to stop this bot if I am not
> mistaken.
>
>
>
>
>
> Lately, I noticed again this bot left a comment again as below:
>
>
>
>   Thincrs commented on SPARK-25823:
>
>   -
>
>
>
>   A user of thincrs has selected this issue. Deadline: Mon, Jan 14, 2019
> 10:32 PM
>
>
>
>
>
> This comment has been hidden by one of the Spark committers for now, but
> leaving comments there sends emails to all the people participating in the
> JIRA.
>
>
>
> Could you please stop this bot if it belongs to Thincrs please?
>
>
>
>
>
> Thanks.
>


Re: [pyspark] dataframe map_partition

2019-03-10 Thread Hyukjin Kwon
Because dapply in R and the Scalar Pandas UDF in Python are similar and cover
each other. FWIW, this somewhat sounds like SPARK-26413 and SPARK-26412.


On Sat, Mar 9, 2019 at 12:32 PM, peng yu wrote:

> Cool, thanks for letting me know, but why not support dapply
> http://spark.apache.org/docs/2.0.0/api/R/dapply.html as supported in R,
> so we can just pass in a pandas dataframe
>
> On Fri, Mar 8, 2019 at 6:09 PM Li Jin  wrote:
>
>> Hi,
>>
>> Pandas UDF supports input as struct type. However, note that it will be
>> turned into python dict because pandas itself does not have native struct
>> type.
>> On Fri, Mar 8, 2019 at 2:55 PM peng yu  wrote:
>>
>>> Yeah, that seems most likely i have wanted, does the scalar Pandas UDF
>>> support input is a StructType too ?
>>>
>>> On Fri, Mar 8, 2019 at 2:25 PM Bryan Cutler  wrote:
>>>
 Hi Peng,

 I just added support for scalar Pandas UDF to return a StructType as a
 Pandas DataFrame in https://issues.apache.org/jira/browse/SPARK-23836.
 Is that the functionality you are looking for?

 Bryan

 On Thu, Mar 7, 2019 at 1:13 PM peng yu  wrote:

> right now, i'm using the colums-at-a-time mapping
> https://github.com/yupbank/tf-spark-serving/blob/master/tss/utils.py#L129
>
>
>
>
> On Thu, Mar 7, 2019 at 4:00 PM Sean Owen  wrote:
>
>> Maybe, it depends on what you're doing. It sounds like you are trying
>> to do row-at-a-time mapping, even on a pandas DataFrame. Is what
>> you're doing vectorized? may not help much.
>> Just make the pandas Series into a DataFrame if you want? and a single
>> col back to Series?
>>
>> On Thu, Mar 7, 2019 at 2:45 PM peng yu  wrote:
>> >
>> > pandas/arrow is for the memory efficiency, and mapPartitions is
>> only available to rdds, for sure i can do everything in rdd.
>> >
>> > But i thought that's the whole point of having pandas_udf, so my
>> program run faster and consumes less memory ?
>> >
>> > On Thu, Mar 7, 2019 at 3:40 PM Sean Owen  wrote:
>> >>
>> >> Are you just applying a function to every row in the DataFrame? you
>> >> don't need pandas at all. Just get the RDD of Row from it and map a
>> >> UDF that makes another Row, and go back to DataFrame. Or make a UDF
>> >> that operates on all columns and returns a new value.
>> mapPartitions is
>> >> also available if you want to transform an iterator of Row to
>> another
>> >> iterator of Row.
>> >>
>> >> On Thu, Mar 7, 2019 at 2:33 PM peng yu  wrote:
>> >> >
>> >> > it is very similar to SCALAR, but for SCALAR the output can't be
>> struct/row and the input has to be pd.Series, which doesn't support a 
>> row.
>> >> >
>> >> > I'm doing tensorflow batch inference in spark,
>> https://github.com/yupbank/tf-spark-serving/blob/master/tss/serving.py#L108
>> >> >
>> >> > Which i have to do the groupBy in order to use the apply
>> function, i'm wondering why not just enable apply to df ?
>> >> >
>> >> > On Thu, Mar 7, 2019 at 3:15 PM Sean Owen 
>> wrote:
>> >> >>
>> >> >> Are you looking for SCALAR? that lets you map one row to one
>> row, but
>> >> >> do it more efficiently in batch. What are you trying to do?
>> >> >>
>> >> >> On Thu, Mar 7, 2019 at 2:03 PM peng yu 
>> wrote:
>> >> >> >
>> >> >> > I'm looking for a mapPartition(pandas_udf) for  a
>> pyspark.Dataframe.
>> >> >> >
>> >> >> > ```
>> >> >> > @pandas_udf(df.schema, PandasUDFType.MAP)
>> >> >> > def do_nothing(pandas_df):
>> >> >> > return pandas_df
>> >> >> >
>> >> >> >
>> >> >> > new_df = df.mapPartition(do_nothing)
>> >> >> > ```
>> >> >> > pandas_udf only support scala or GROUPED_MAP.  Why not
>> support just Map?
>> >> >> >
>> >> >> > On Thu, Mar 7, 2019 at 2:57 PM Sean Owen 
>> wrote:
>> >> >> >>
>> >> >> >> Are you looking for @pandas_udf in Python? Or just
>> mapPartition? Those exist already
>> >> >> >>
>> >> >> >> On Thu, Mar 7, 2019, 1:43 PM peng yu 
>> wrote:
>> >> >> >>>
>> >> >> >>> There is a nice map_partition function in R `dapply`.  so
>> that user can pass a row to udf.
>> >> >> >>>
>> >> >> >>> I'm wondering why we don't have that in python?
>> >> >> >>>
>> >> >> >>> I'm trying to have a map_partition function with pandas_udf
>> supported
>> >> >> >>>
>> >> >> >>> thanks!
>>
>
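
To illustrate, here is a short sketch of the groupBy + apply workaround
mentioned above, using the GROUPED_MAP pandas_udf available in Spark 2.3/2.4;
the schema, grouping column, and transformation are illustrative. (Grouping
by spark_partition_id() instead of a data column is sometimes used to
approximate a per-partition apply, though that is an assumption about the
workload rather than a documented API.)

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.getOrCreate()
df = (spark.range(0, 100)
      .withColumn("g", (F.col("id") % 4).cast("int"))
      .withColumn("x", F.rand()))

@pandas_udf("id long, g int, x double", PandasUDFType.GROUPED_MAP)
def times_three(pdf):
    # pdf is a pandas DataFrame holding one whole group.
    pdf = pdf.copy()
    pdf["x"] = pdf["x"] * 3.0
    return pdf

new_df = df.groupBy("g").apply(times_three)
new_df.show()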


Re: [build system] Jenkins stopped working

2019-02-19 Thread Hyukjin Kwon
Thanks Shane!! <3

On Wed, Feb 20, 2019 at 10:13 AM, Wenchen Fan wrote:

> Thanks Shane!
>
> On Wed, Feb 20, 2019 at 6:48 AM shane knapp  wrote:
>
>> alright, i increased the httpd and proxy timeouts and kicked apache.
>> i'll keep an eye on things, but as of right now we're happily building.
>>
>> On Tue, Feb 19, 2019 at 2:25 PM shane knapp  wrote:
>>
>>> aand i had to issue another restart.  it's the ever annoying, and
>>> never quite clear as to why it's happening proxy/502 error.
>>>
>>> currently investigating.
>>>
>>> On Tue, Feb 19, 2019 at 9:21 AM shane knapp  wrote:
>>>
>>>> forgot to hit send before i went in to the office:  we're back up and
>>>> building!
>>>>
>>>> On Tue, Feb 19, 2019 at 8:06 AM shane knapp 
>>>> wrote:
>>>>
>>>>> yep, it got wedged.  issued a restart and it should be back up in a
>>>>> few minutes.
>>>>>
>>>>> On Tue, Feb 19, 2019 at 7:32 AM Parth Gandhi 
>>>>> wrote:
>>>>>
>>>>>> Yes, it seems to be down. The unit tests are not getting kicked off.
>>>>>>
>>>>>> Regards,
>>>>>> Parth Kamlesh Gandhi
>>>>>>
>>>>>>
>>>>>> On Tue, Feb 19, 2019 at 8:29 AM Hyukjin Kwon 
>>>>>> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> Looks Jenkins stopped working. Did I maybe miss a thread, or anybody
>>>>>>> didn't report this yet?
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>> --
>>>>> Shane Knapp
>>>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>>> https://rise.cs.berkeley.edu
>>>>>
>>>>
>>>>
>>>> --
>>>> Shane Knapp
>>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>> https://rise.cs.berkeley.edu
>>>>
>>>
>>>
>>> --
>>> Shane Knapp
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>
>>
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>


[build system] Jenkins stopped working

2019-02-19 Thread Hyukjin Kwon
Hi all,

It looks like Jenkins stopped working. Did I maybe miss a thread, or has
nobody reported this yet?

Thanks!


Re: Vectorized R gapply[Collect]() implementation

2019-02-14 Thread Hyukjin Kwon
Thanks guys <3.

FYI, I made PRs for collect and vectorized dapply too.
In my tests, they boost the speed by 1500%+ and 4600%+, respectively.

https://github.com/apache/spark/pull/23760
https://github.com/apache/spark/pull/23787


On Mon, Feb 11, 2019 at 4:45 AM, Felix Cheung wrote:

> This is super awesome!
>
>
> --
> *From:* Shivaram Venkataraman 
> *Sent:* Saturday, February 9, 2019 8:33 AM
> *To:* Hyukjin Kwon
> *Cc:* dev; Felix Cheung; Bryan Cutler; Liang-Chi Hsieh; Shivaram
> Venkataraman
> *Subject:* Re: Vectorized R gapply[Collect]() implementation
>
> Those speedups look awesome! Great work Hyukjin!
>
> Thanks
> Shivaram
>
> On Sat, Feb 9, 2019 at 7:41 AM Hyukjin Kwon  wrote:
> >
> > Guys, as continuation of Arrow optimization for R DataFrame to Spark
> DataFrame,
> >
> > I am trying to make a vectorized gapply[Collect] implementation as an
> experiment like vectorized Pandas UDFs
> >
> > It brought 820%+ performance improvement. See
> https://github.com/apache/spark/pull/23746
> >
> > Please come and take a look if you're interested in R APIs :D. I have
> already cc'ed some people I know but please come, review and discuss for
> both Spark side and Arrow side.
> >
> > This Arrow optimization job is being done under
> https://issues.apache.org/jira/browse/SPARK-26759 . Please feel free to
> take one if anyone of you is interested in it.
> >
> > Thanks.
>


Re: Time to cut an Apache 2.4.1 release?

2019-02-12 Thread Hyukjin Kwon
+1 for 2.4.1

On Tue, Feb 12, 2019 at 4:56 PM, Dongjin Lee wrote:

> > SPARK-23539 is a non-trivial improvement, so probably would not be
> back-ported to 2.4.x.
>
> Got it. It seems reasonable.
>
> Committers:
>
> Please don't omit SPARK-23539 from 2.5.0. Kafka community needs this
> feature.
>
> Thanks,
> Dongjin
>
> On Tue, Feb 12, 2019 at 1:50 PM Takeshi Yamamuro 
> wrote:
>
>> +1, too.
>> branch-2.4 accumulates too many commits..:
>>
>> https://github.com/apache/spark/compare/0a4c03f7d084f1d2aa48673b99f3b9496893ce8d...af3c7111efd22907976fc8bbd7810fe3cfd92092
>>
>> On Tue, Feb 12, 2019 at 12:36 PM Dongjoon Hyun 
>> wrote:
>>
>>> Thank you, DB.
>>>
>>> +1, Yes. It's time for preparing 2.4.1 release.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On 2019/02/12 03:16:05, Sean Owen  wrote:
>>> > I support a 2.4.1 release now, yes.
>>> >
>>> > SPARK-23539 is a non-trivial improvement, so probably would not be
>>> > back-ported to 2.4.x.SPARK-26154 does look like a bug whose fix could
>>> > be back-ported, but that's a big change. I wouldn't hold up 2.4.1 for
>>> > it, but it could go in if otherwise ready.
>>> >
>>> >
>>> > On Mon, Feb 11, 2019 at 5:20 PM Dongjin Lee 
>>> wrote:
>>> > >
>>> > > Hi DB,
>>> > >
>>> > > Could you add SPARK-23539[^1] into 2.4.1? I opened the PR[^2] a
>>> little bit ago, but it has not included in 2.3.0 nor get enough review.
>>> > >
>>> > > Thanks,
>>> > > Dongjin
>>> > >
>>> > > [^1]: https://issues.apache.org/jira/browse/SPARK-23539
>>> > > [^2]: https://github.com/apache/spark/pull/22282
>>> > >
>>> > > On Tue, Feb 12, 2019 at 6:28 AM Jungtaek Lim 
>>> wrote:
>>> > >>
>>> > >> Given SPARK-26154 [1] is a correctness issue and PR [2] is
>>> submitted, I hope it can be reviewed and included within Spark 2.4.1 -
>>> otherwise it will be a long-live correctness issue.
>>> > >>
>>> > >> Thanks,
>>> > >> Jungtaek Lim (HeartSaVioR)
>>> > >>
>>> > >> 1. https://issues.apache.org/jira/browse/SPARK-26154
>>> > >> 2. https://github.com/apache/spark/pull/23634
>>> > >>
>>> > >>
>>> > >>> On Tue, Feb 12, 2019 at 6:17 AM, DB Tsai wrote:
>>> > >>>
>>> > >>> Hello all,
>>> > >>>
>>> > >>> I am preparing to cut a new Apache 2.4.1 release as there are many
>>> bugs and correctness issues fixed in branch-2.4.
>>> > >>>
>>> > >>> The list of addressed issues are
>>> https://issues.apache.org/jira/browse/SPARK-26583?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.4.1%20order%20by%20updated%20DESC
>>> > >>>
>>> > >>> Let me know if you have any concern or any PR you would like to
>>> get in.
>>> > >>>
>>> > >>> Thanks!
>>> > >>>
>>> > >>>
>>> -
>>> > >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> > >>>
>>> > >
>>> > >
>>> > > --
>>> > > Dongjin Lee
>>> > >
>>> > > A hitchhiker in the mathematical world.
>>> > >
>>> > > github: github.com/dongjinleekr
>>> > > linkedin: kr.linkedin.com/in/dongjinleekr
>>> > > speakerdeck: speakerdeck.com/dongjin
>>> >
>>> > -
>>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> >
>>> >
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>
>
> --
> *Dongjin Lee*
>
> *A hitchhiker in the mathematical world.*
> *github:  github.com/dongjinleekr
> linkedin: kr.linkedin.com/in/dongjinleekr
> speakerdeck: speakerdeck.com/dongjin
> *
>


Vectorized R gapply[Collect]() implementation

2019-02-09 Thread Hyukjin Kwon
Guys, as a continuation of the Arrow optimization for R DataFrame to Spark
DataFrame conversion,

I am trying a vectorized gapply[Collect] implementation as an experiment,
like the vectorized Pandas UDFs.

It brought a 820%+ performance improvement. See
https://github.com/apache/spark/pull/23746

Please come and take a look if you're interested in the R APIs :D. I have
already cc'ed some people I know, but please come, review, and discuss both
the Spark side and the Arrow side.

This Arrow optimization work is being done under
https://issues.apache.org/jira/browse/SPARK-26759 . Please feel free to take
one of the tasks if any of you are interested.

Thanks.


Re: [VOTE] Release Apache Spark 2.3.3 (RC2)

2019-02-08 Thread Hyukjin Kwon
Sorry for the last minute vote.

+1

On Fri, Feb 8, 2019 at 10:15 AM, Takeshi Yamamuro wrote:

> Thanks, all.
>
> Yea, I think we don't need to block the release, too.
>
> > Jungtaek
> Thanks! That is very helpful!
> If you find something, please let me know.
>
> Best,
> Takeshi
>
> On Fri, Feb 8, 2019 at 1:10 AM Dongjoon Hyun 
> wrote:
>
>> +1 for 2.3.3 RC2.
>>
>> Thank you, Takeshi.
>>
>> And, +1 for 2.3.4 as 2.3.x EOL release.
>>
>> Cheers,
>> Dongjoon.
>>
>> On Thu, Feb 7, 2019 at 6:48 AM Sean Owen  wrote:
>>
>>> It wouldn't be wasted effort, as there is probably going to be a 2.3.4
>>> release before 2.3.x is EOL. At least, having reliable tests on
>>> Jenkins helps not miss problems with backports to 2.3.x. I seem to
>>> recall something was change in 2.4.x to help this but either didn't
>>> work or didn't apply to 2.3.x, so there may already be a clue in the
>>> 2.4.x branch about the issue.
>>>
>>> On Wed, Feb 6, 2019 at 9:34 PM Jungtaek Lim  wrote:
>>> >
>>> > Might be out of topic: regarding SPARK-24211 (flaky tests in
>>> StreamingJoinSuite) I might volunteer to take a look, but if things are not
>>> flaky with branch 2.4 and EOL on branch 2.3 is coming sooner (in some
>>> months), I wonder we still want to tackle it in any way.
>>> >
>>> > On Thu, Feb 7, 2019 at 2:21 PM, Sean Owen wrote:
>>> >>
>>> >> +1 from me. I built and tested the source release on the same env and
>>> >> this time not seeing failures. Good, no idea what happened.
>>> >>
>>> >> I updated Fix Version on JIRAs that were marked as 2.3.4 but went in
>>> >> before the RC2 tag.
>>> >>
>>> >> I'm kinda concerned that this test keeps failing in branch 2.3:
>>> >>
>>> >> org.apache.spark.sql.streaming.StreamingOuterJoinSuite.left outer join
>>> >> with non-key condition violated
>>> >>
>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/601/testReport/org.apache.spark.sql.streaming/StreamingOuterJoinSuite/left_outer_join_with_non_key_condition_violated/
>>> >>
>>> >> It's among the items tracked in
>>> >> https://issues.apache.org/jira/browse/SPARK-24211
>>> >> I don't think it needs to block a release as I think we believe it's
>>> >> just the test that's flaky, but I'm wondering whether people are
>>> >> seeing this fail when testing the release?
>>> >> I did not see it fail running my tests though.
>>> >>
>>> >>
>>> >> On Tue, Feb 5, 2019 at 5:07 PM Takeshi Yamamuro <
>>> linguin@gmail.com> wrote:
>>> >> >
>>> >> > Please vote on releasing the following candidate as Apache Spark
>>> version 2.3.3.
>>> >> >
>>> >> > The vote is open until February 8 6:00PM (PST) and passes if a
>>> majority +1 PMC votes are cast, with
>>> >> > a minimum of 3 +1 votes.
>>> >> >
>>> >> > [ ] +1 Release this package as Apache Spark 2.3.3
>>> >> > [ ] -1 Do not release this package because ...
>>> >> >
>>> >> > To learn more about Apache Spark, please see
>>> http://spark.apache.org/
>>> >> >
>>> >> > The tag to be voted on is v2.3.3-rc2 (commit
>>> 66fd9c34bf406a4b5f86605d06c9607752bd637a):
>>> >> > https://github.com/apache/spark/tree/v2.3.3-rc2
>>> >> >
>>> >> > The release files, including signatures, digests, etc. can be found
>>> at:
>>> >> > https://dist.apache.org/repos/dist/dev/spark/v2.3.3-rc2-bin/
>>> >> >
>>> >> > Signatures used for Spark RCs can be found in this file:
>>> >> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>>> >> >
>>> >> > The staging repository for this release can be found at:
>>> >> >
>>> https://repository.apache.org/content/repositories/orgapachespark-1298/
>>> >> >
>>> >> > The documentation corresponding to this release can be found at:
>>> >> > https://dist.apache.org/repos/dist/dev/spark/v2.3.3-rc2-docs/
>>> >> >
>>> >> > The list of bug fixes going into 2.3.3 can be found at the
>>> following URL:
>>> >> > https://issues.apache.org/jira/projects/SPARK/versions/12343759
>>> >> >
>>> >> > FAQ
>>> >> >
>>> >> > =
>>> >> > How can I help test this release?
>>> >> > =
>>> >> >
>>> >> > If you are a Spark user, you can help us test this release by taking
>>> >> > an existing Spark workload and running on this release candidate,
>>> then
>>> >> > reporting any regressions.
>>> >> >
>>> >> > If you're working in PySpark you can set up a virtual env and
>>> install
>>> >> > the current RC and see if anything important breaks, in the
>>> Java/Scala
>>> >> > you can add the staging repository to your projects resolvers and
>>> test
>>> >> > with the RC (make sure to clean up the artifact cache before/after
>>> so
>>> >> > you don't end up building with a out of date RC going forward).
>>> >> >
>>> >> > ===
>>> >> > What should happen to JIRA tickets still targeting 2.3.3?
>>> >> > ===
>>> >> >
>>> >> > The current list of open tickets targeted at 2.3.3 can be found at:
>>> >> > https://issues.apache.org/jira/projects/SPARK and 

Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-02-04 Thread Hyukjin Kwon
I should check the details and feasibility myself, but it sounds fine to me
if it doesn't require big extra effort.

On Tue, 5 Feb 2019, 4:15 am Xiao Li wrote:

> Yes. When our support/integration with Hive 2.x becomes stable, we can do
> it in Hadoop 2.x profile too, if needed. The whole proposal is to minimize
> the risk and ensure the release stability and quality.
>
> On Mon, Feb 4, 2019 at 12:01 PM, Hyukjin Kwon wrote:
>
>> Xiao, to check if I understood correctly, do you mean the below?
>>
>> 1. Use our fork with Hadoop 2.x profile for now, and use Hive 2.x with
>> Hadoop 3.x profile.
>> 2. Make another newer version of thrift server by Hive 2.x(?) in Spark
>> side.
>> 3. Target the transition to Hive 2.x completely and slowly later in the
>> future.
>>
>>
>>
>> On Tue, Feb 5, 2019 at 1:16 AM, Xiao Li wrote:
>>
>>> To reduce the impact and risk of upgrading Hive execution JARs, we can
>>> just upgrade the built-in Hive to 2.x when using the profile of Hadoop 3.x.
>>> The support of Hadoop 3 will be still experimental in our next release.
>>> That means, the impact and risk are very minimal for most users who are
>>> still using Hadoop 2.x profile.
>>>
>>> The code changes in Spark thrift server are massive. It is risky and
>>> hard to review. The original code of our Spark thrift server is from
>>> Hive-service 1.2.1. To reduce the risk of the upgrade, we can inline the
>>> new version. In the future, we can completely get rid of the thrift server,
>>> and build our own high-performant JDBC server.
>>>
>>> Does this proposal sound good to you?
>>>
>>> In the last two weeks, Yuming was trying this proposal. Now, he is on
>>> vacation. In China, today is already the lunar New Year. I would not expect
>>> he will reply this email in the next 7 days.
>>>
>>> Cheers,
>>>
>>> Xiao
>>>
>>>
>>>
>>> On Mon, Feb 4, 2019 at 7:56 AM, Sean Owen wrote:
>>>
>>>> I was unclear from this thread what the objection to these PRs is:
>>>>
>>>> https://github.com/apache/spark/pull/23552
>>>> https://github.com/apache/spark/pull/23553
>>>>
>>>> Would we like to specifically discuss whether to merge these or not? I
>>>> hear support for it, concerns about continuing to support Hive too,
>>>> but I wasn't clear whether those concerns specifically argue against
>>>> these PRs.
>>>>
>>>>
>>>> On Fri, Feb 1, 2019 at 2:03 PM Felix Cheung 
>>>> wrote:
>>>> >
>>>> > What’s the update and next step on this?
>>>> >
>>>> > We have real users getting blocked by this issue.
>>>> >
>>>> >
>>>> > 
>>>> > From: Xiao Li 
>>>> > Sent: Wednesday, January 16, 2019 9:37 AM
>>>> > To: Ryan Blue
>>>> > Cc: Marcelo Vanzin; Hyukjin Kwon; Sean Owen; Felix Cheung; Yuming
>>>> Wang; dev
>>>> > Subject: Re: [DISCUSS] Upgrade built-in Hive to 2.3.4
>>>> >
>>>> > Thanks for your feedbacks!
>>>> >
>>>> > Working with Yuming to reduce the risk of stability and quality. Will
>>>> keep you posted when the proposal is ready.
>>>> >
>>>> > Cheers,
>>>> >
>>>> > Xiao
>>>> >
>>>> >> On Wed, Jan 16, 2019 at 9:27 AM, Ryan Blue wrote:
>>>> >>
>>>> >> +1 for what Marcelo and Hyukjin said.
>>>> >>
>>>> >> In particular, I agree that we can't expect Hive to release a
>>>> version that is now more than 3 years old just to solve a problem for
>>>> Spark. Maybe that would have been a reasonable ask instead of publishing a
>>>> fork years ago, but I think this is now Spark's problem.
>>>> >>
>>>> >> On Tue, Jan 15, 2019 at 9:02 PM Marcelo Vanzin 
>>>> wrote:
>>>> >>>
>>>> >>> +1 to that. HIVE-16391 by itself means we're giving up things like
>>>> >>> Hadoop 3, and we're also putting the burden on the Hive folks to
>>>> fix a
>>>> >>> problem that we created.
>>>> >>>
>>>> >>> The current PR is basically a Spark-side fix for that bug. It does
>>>> >>> mean also upgrading Hive (which gives us Hadoop 3, yay!), but I
>>>> think
>>>> >>> it's

Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-02-04 Thread Hyukjin Kwon
Xiao, to check if I understood correctly, do you mean the below?

1. Use our fork with Hadoop 2.x profile for now, and use Hive 2.x with
Hadoop 3.x profile.
2. Make another newer version of thrift server by Hive 2.x(?) in Spark side.
3. Target the transition to Hive 2.x completely and slowly later in the
future.



On Tue, Feb 5, 2019 at 1:16 AM, Xiao Li wrote:

> To reduce the impact and risk of upgrading Hive execution JARs, we can
> just upgrade the built-in Hive to 2.x when using the profile of Hadoop 3.x.
> The support of Hadoop 3 will be still experimental in our next release.
> That means, the impact and risk are very minimal for most users who are
> still using Hadoop 2.x profile.
>
> The code changes in Spark thrift server are massive. It is risky and hard
> to review. The original code of our Spark thrift server is from
> Hive-service 1.2.1. To reduce the risk of the upgrade, we can inline the
> new version. In the future, we can completely get rid of the thrift server,
> and build our own high-performant JDBC server.
>
> Does this proposal sound good to you?
>
> In the last two weeks, Yuming was trying this proposal. Now, he is on
> vacation. In China, today is already the lunar New Year. I would not expect
> he will reply this email in the next 7 days.
>
> Cheers,
>
> Xiao
>
>
>
> On Mon, Feb 4, 2019 at 7:56 AM, Sean Owen wrote:
>
>> I was unclear from this thread what the objection to these PRs is:
>>
>> https://github.com/apache/spark/pull/23552
>> https://github.com/apache/spark/pull/23553
>>
>> Would we like to specifically discuss whether to merge these or not? I
>> hear support for it, concerns about continuing to support Hive too,
>> but I wasn't clear whether those concerns specifically argue against
>> these PRs.
>>
>>
>> On Fri, Feb 1, 2019 at 2:03 PM Felix Cheung 
>> wrote:
>> >
>> > What’s the update and next step on this?
>> >
>> > We have real users getting blocked by this issue.
>> >
>> >
>> > 
>> > From: Xiao Li 
>> > Sent: Wednesday, January 16, 2019 9:37 AM
>> > To: Ryan Blue
>> > Cc: Marcelo Vanzin; Hyukjin Kwon; Sean Owen; Felix Cheung; Yuming Wang;
>> dev
>> > Subject: Re: [DISCUSS] Upgrade built-in Hive to 2.3.4
>> >
>> > Thanks for your feedbacks!
>> >
>> > Working with Yuming to reduce the risk of stability and quality. Will
>> keep you posted when the proposal is ready.
>> >
>> > Cheers,
>> >
>> > Xiao
>> >
>> >> On Wed, Jan 16, 2019 at 9:27 AM, Ryan Blue wrote:
>> >>
>> >> +1 for what Marcelo and Hyukjin said.
>> >>
>> >> In particular, I agree that we can't expect Hive to release a version
>> that is now more than 3 years old just to solve a problem for Spark. Maybe
>> that would have been a reasonable ask instead of publishing a fork years
>> ago, but I think this is now Spark's problem.
>> >>
>> >> On Tue, Jan 15, 2019 at 9:02 PM Marcelo Vanzin 
>> wrote:
>> >>>
>> >>> +1 to that. HIVE-16391 by itself means we're giving up things like
>> >>> Hadoop 3, and we're also putting the burden on the Hive folks to fix a
>> >>> problem that we created.
>> >>>
>> >>> The current PR is basically a Spark-side fix for that bug. It does
>> >>> mean also upgrading Hive (which gives us Hadoop 3, yay!), but I think
>> >>> it's really the right path to take here.
>> >>>
>> >>> On Tue, Jan 15, 2019 at 6:32 PM Hyukjin Kwon 
>> wrote:
>> >>> >
>> >>> > Resolving HIVE-16391 means Hive to release 1.2.x that contains the
>> fixes of our Hive fork (correct me if I am mistaken).
>> >>> >
>> >>> > Just to be honest by myself and as a personal opinion, that
>> basically says Hive to take care of Spark's dependency.
>> >>> > Hive looks going ahead for 3.1.x and no one would use the newer
>> release of 1.2.x. In practice, Spark doesn't make a release 1.6.x anymore
>> for instance,
>> >>> >
>> >>> > Frankly, my impression was that it's, honestly, our mistake to fix.
>> Since Spark community is big enough, I was thinking we should try to fix it
>> by ourselves first.
>> >>> > I am not saying upgrading is the only way to get through this but I
>> think we should at least try first, and see what's next.
>> >>> >
>> >>> > It does, yes, sound more risky to upgrade it in our side but I
>> think it's worth to check and try it and see if it's possible.
>> >>> > I think this is a standard approach to upgrade the dependency than
>> using the fork or letting Hive side to release another 1.2.x.
>> >>> >
>> >>> > If we fail to upgrade it for critical or inevitable reasons
>> somehow, yes, we could find an alternative but that basically means
>> >>> > we're going to stay in 1.2.x for, at least, a long time (say ..
>> until Spark 4.0.0?).
>> >>> >
>> >>> > I know somehow it happened to be sensitive but to be just literally
>> honest to myself, I think we should make a try.
>> >>> >
>> >>>
>> >>>
>> >>> --
>> >>> Marcelo
>> >>
>> >>
>> >>
>> >> --
>> >> Ryan Blue
>> >> Software Engineer
>> >> Netflix
>>
>


Missing SparkR in CRAN

2019-01-24 Thread Hyukjin Kwon
Hi all,

I happened to find that SparkR is missing from CRAN. See
https://cran.r-project.org/web/packages/SparkR/index.html

I remember seeing some threads about this on the spark-dev mailing list a
long time ago, IIRC. Is a fix in progress somewhere, or is it something I
misunderstood?


Re: Removing old HiveMetastore(0.12~0.14) from Spark 3.0.0?

2019-01-22 Thread Hyukjin Kwon
Yea, I was thinking about that too. They are too old to keep. +1 for
removing them.

On Wed, Jan 23, 2019 at 11:30 AM, Dongjoon Hyun wrote:

> Hi, All.
>
> Currently, Apache Spark supports Hive Metastore(HMS) 0.12 ~ 2.3.
> Among them, HMS 0.x releases look very old since we are in 2019.
> If these are not used in the production any more, can we drop HMS 0.x
> supports in 3.0.0?
>
> hive-0.12.0 2013-10-10
> hive-0.13.0 2014-04-15
> hive-0.13.1 2014-11-16
> hive-0.14.0 2014-11-16
> ( https://archive.apache.org/dist/hive/ )
>
> In addition, if there is someone who is still using these HMS versions and
> has a plan to install and use Spark 3.0.0 with these HMS versions, could
> you reply this email thread? If there is a reason, that would be very
> helpful for me.
>
> Thanks,
> Dongjoon.
>


Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Hyukjin Kwon
Resolving HIVE-16391 means Hive releasing a 1.2.x that contains the fixes
from our Hive fork (correct me if I am mistaken).

To be honest, and as a personal opinion, that basically asks Hive to take
care of Spark's dependency.
Hive looks to be going ahead with 3.1.x, and no one would use a newer 1.2.x
release. In practice, Spark doesn't make 1.6.x releases anymore, for
instance.

Frankly, my impression is that this is our own mistake to fix. Since the
Spark community is big enough, I was thinking we should try to fix it by
ourselves first.
I am not saying upgrading is the only way to get through this, but I think
we should at least try first and see what's next.

Yes, it does sound riskier to upgrade it on our side, but I think it's worth
checking and trying to see if it's possible.
I think upgrading the dependency is a more standard approach than using the
fork or asking the Hive side to release another 1.2.x.

If we fail to upgrade it for critical or inevitable reasons somehow, yes, we
could find an alternative, but that basically means we're going to stay on
1.2.x for, at least, a long time (say, until Spark 4.0.0?).

I know this happened to be somewhat sensitive, but to be honest with myself,
I think we should give it a try.


Re: Ask for reviewing on Structured Streaming PRs

2019-01-13 Thread Hyukjin Kwon
But it's true that, IMHO, there's less activity in SS in general. That should
be noted. Maybe it's also because committers are busy with other stuff.

Yea, I agree that one actionable strategy for now might be to make the PR
description as clear as possible to make the review easier, and then ping
them in the PRs.


On Sun, 13 Jan 2019, 10:37 pm Sean Owen wrote:

> Jungtaek, the best strategy is to find who wrote the code you are
> modifying (use Github history or git blame) and ping them directly on
> the PR. I don't know this code well myself.
> It also helps if you can address why the functionality is important,
> and describe compatibility implications.
>
> Most PRs are not merged, note. Not commenting on this particular one,
> but it's not a 'bug' if it's not being merged.
>
> On Sun, Jan 13, 2019 at 12:29 AM Jungtaek Lim  wrote:
> >
> > I'm sorry but let me remind this, as non-SS PRs are being reviewed
> accordingly, whereas many of SS PRs (regardless of who create) are still
> not reviewed and merged in time.
> >
> >> On Thu, Jan 3, 2019 at 7:57 AM, Jungtaek Lim wrote:
> >>
> >> Spark devs, happy new year!
> >>
> >> I would like to remind this kindly, since there was actually no review
> after initiating the thread.
> >>
> >> Thanks,
> >> Jungtaek Lim (HeartSaVioR)
> >>
> >>> On Wed, Dec 12, 2018 at 11:12 PM, Vaclav Kosar wrote:
> >>>
> >>> I am also waiting for any finalization of my PR [3]. I seems that SS
> PRs are not being reviewed much these days.
> >>>
> >>> [3] https://github.com/apache/spark/pull/21919
> >>>
> >>>
> >>> On 12. 12. 18 14:37, Dongjin Lee wrote:
> >>>
> >>> If it is possible, could you review my PR on Kafka's header
> functionality[^1] also? It was added in Kafka 0.11.0.0 but still not
> supported in Spark.
> >>>
> >>> Thanks,
> >>> Dongjin
> >>>
> >>> [^1]: https://github.com/apache/spark/pull/22282
> >>> [^2]: https://issues.apache.org/jira/browse/KAFKA-4208
> >>>
> >>> On Wed, Dec 12, 2018 at 6:43 PM Jungtaek Lim 
> wrote:
> 
>  Hi devs,
> 
>  Would I kindly ask for reviewing on PRs for Structured Streaming? I
> have 5 open pull requests on SS side [1] (earliest PR was opened around 4
> months so far), and there looks like couple of PR for others [2] which
> looks good to be reviewed, too.
> 
>  Thanks in advance,
>  Jungtaek Lim (HeartSaVioR)
> 
>  1.
> https://github.com/apache/spark/pulls?utf8=%E2%9C%93=is%3Aopen+is%3Apr+author%3AHeartSaVioR+%5BSS%5D
>  2.
> https://github.com/apache/spark/pulls?utf8=%E2%9C%93=is%3Aopen+is%3Apr+%5BSS%5D+
> 
> >>>
> >>>
> >>> --
> >>> Dongjin Lee
> >>>
> >>> A hitchhiker in the mathematical world.
> >>>
> >>> github: github.com/dongjinleekr
> >>> linkedin: kr.linkedin.com/in/dongjinleekr
> >>> speakerdeck: speakerdeck.com/dongjin
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] SPARK 2.2.3 (RC1)

2019-01-10 Thread Hyukjin Kwon
+1

Thanks.

On Fri, Jan 11, 2019 at 7:01 AM, Takeshi Yamamuro wrote:

> ok, thanks for the check.
>
> best,
> takeshi
>
> On Fri, Jan 11, 2019 at 1:37 AM Dongjoon Hyun 
> wrote:
>
>> Hi, Takeshi.
>>
>> Yep. It's not a release blocker. We don't need that as Sean mentioned
>> already.
>> Since you are the release manager of 2.3.3, you may include that in the
>> scope of Spark 2.3.3 before it starts.
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Thu, Jan 10, 2019 at 5:44 AM Sean Owen  wrote:
>>
>>> Is that the right link? that is marked as a minor bug, maybe. From
>>> what you describe it's not a regression from 2.2.2 either.
>>>
>>> On Thu, Jan 10, 2019 at 6:37 AM Takeshi Yamamuro 
>>> wrote:
>>> >
>>> > Hi, Dongjoon,
>>> >
>>> > We don't need to include https://github.com/apache/spark/pull/23456
>>> in this release?
>>> > The query there fails in v2.x while it passes in v1.6.
>>> >
>>>
>>
>
> --
> ---
> Takeshi Yamamuro
>


Re: Noisy spark-website notifications

2018-12-19 Thread Hyukjin Kwon
Yea, that's a bit noisy .. I would just completely disable it, to be honest.
I filed https://issues.apache.org/jira/browse/INFRA-17469 before. I would
appreciate more input there :-)

On Thu, Dec 20, 2018 at 11:22 AM, Nicholas Chammas wrote:

> I'd prefer it if we disabled all git notifications for spark-website.
> Folks who want to stay on top of what's happening with the site can simply 
> watch
> the repo on GitHub , no?
>
> On Wed, Dec 19, 2018 at 10:00 PM Wenchen Fan  wrote:
>
>> +1, at least it should only send one email when a PR is merged.
>>
>> On Thu, Dec 20, 2018 at 10:58 AM Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> Can we somehow disable these new email alerts coming through for the
>>> Spark website repo?
>>>
>>> On Wed, Dec 19, 2018 at 8:25 PM GitBox  wrote:
>>>
 ueshin commented on a change in pull request #163: Announce the
 schedule of 2019 Spark+AI summit at SF
 URL:
 https://github.com/apache/spark-website/pull/163#discussion_r243130975



  ##
  File path: site/sitemap.xml
  ##
  @@ -139,657 +139,661 @@
  
  
  
 -  https://spark.apache.org/releases/spark-release-2-4-0.html
 
 +  
 http://localhost:4000/news/spark-ai-summit-apr-2019-agenda-posted.html
 

  Review comment:
Still remaining `localhost:4000` in this file.

 
 This is an automated message from the Apache Git Service.
 To respond to the message, please log on GitHub and use the
 URL above to go to the specific comment.

 For queries about this service, please contact Infrastructure at:
 us...@infra.apache.org


 With regards,
 Apache Git Services

 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org




Re: Apache Spark git repo moved to gitbox.apache.org

2018-12-18 Thread Hyukjin Kwon
Similar issues are going on in spark-website as well. I also filed a ticket
at https://issues.apache.org/jira/browse/INFRA-17469.

On Wed, Dec 12, 2018 at 9:02 AM, Reynold Xin wrote:

> I filed a ticket: https://issues.apache.org/jira/browse/INFRA-17403
>
> Please add your support there.
>
>
> On Tue, Dec 11, 2018 at 4:58 PM, Sean Owen  wrote:
>
>> I asked on the original ticket at
>> https://issues.apache.org/jira/browse/INFRA-17385 but no follow-up. Go
>> ahead and open a new INFRA ticket.
>>
>> On Tue, Dec 11, 2018 at 6:20 PM Reynold Xin  wrote:
>>
>>> Thanks, Sean. Which INFRA ticket is it? It's creating a lot of noise so
>>> I want to put some pressure myself there too.
>>>
>>>
>>> On Mon, Dec 10, 2018 at 9:51 AM, Sean Owen  wrote:
>>>
 Agree, I'll ask on the INFRA ticket and follow up. That's a lot of
 extra noise.

 On Mon, Dec 10, 2018 at 11:37 AM Marcelo Vanzin 
 wrote:

 Hmm, it also seems that github comments are being sync'ed to jira.
 That's gonna get old very quickly, we should probably ask infra to disable
 that (if we can't do it ourselves).
 On Mon, Dec 10, 2018 at 9:13 AM Sean Owen  wrote:

 Update for committers: now that my user ID is synced, I can
 successfully push to remote https://github.com/apache/spark directly.
 Use that as the 'apache' remote (if you like; gitbox also works). I
 confirmed the sync works both ways.

 As a bonus you can directly close pull requests when needed instead of
 using "Close Stale PRs" pull requests.

 On Mon, Dec 10, 2018 at 10:30 AM Sean Owen  wrote:

 Per the thread last week, the Apache Spark repos have migrated from
 https://git-wip-us.apache.org/repos/asf to
 https://gitbox.apache.org/repos/asf

 Non-committers:

 This just means repointing any references to the old repository to the
 new one. It won't affect you if you were already referencing
 https://github.com/apache/spark .

 Committers:

 Follow the steps at https://reference.apache.org/committer/github to
 fully sync your ASF and Github accounts, and then wait up to an hour for it
 to finish.

 Then repoint your git-wip-us remotes to gitbox in your git checkouts.
 For our standard setup that works with the merge script, that should be
 your 'apache' remote. For example here are my current remotes:

 $ git remote -v
 apache https://gitbox.apache.org/repos/asf/spark.git (fetch)
 apache https://gitbox.apache.org/repos/asf/spark.git (push)
 apache-github git://github.com/apache/spark (fetch)
 apache-github git://github.com/apache/spark (push)
 origin https://github.com/srowen/spark (fetch)
 origin https://github.com/srowen/spark (push)
 upstream https://github.com/apache/spark (fetch)
 upstream https://github.com/apache/spark (push)

 In theory we also have read/write access to github.com now too, but
 right now it hadn't yet worked for me. It may need to sync. This note just
 makes sure anyone knows how to keep pushing commits right now to the new
 ASF repo.

 Report any problems here!

 Sean

 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

 --
 Marcelo

 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

>>>
>


Re: How can I help?

2018-12-17 Thread Hyukjin Kwon
Please take a look at https://spark.apache.org/contributing.html . It
contains virtually all the information needed for contributing.

On Tue, Dec 18, 2018 at 3:54 AM, Raghunadh Madamanchi wrote:

> Hi,
>
> I am Raghu, and I live in Dallas, TX.
> I have 15+ years of experience in software development and design using
> Java-related technologies, Hadoop, Hive, etc.
>
> I want to get involved with this group by contributing my knowledge.
> Please let me know if you have something I can start working on.
>
> Regards,
> Raghu
>
>

