Personally I don't think it matters. Users can build arbitrary expressions/plans themselves with the internal API, and we never guarantee the result.
Removing these functions from the function registry is a small patch and easy to review, and to me it's better than a 1000+ LOC patch that removes the whole thing. Again, I don't have a strong opinion here. I'm OK with removing the entire thing if a PR is ready and well reviewed.

On Thu, Oct 25, 2018 at 11:00 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
> Thank you for the decision, All.
>
> As of now, to unblock this, it seems that we are trying to remove them from the function registry.
>
> https://github.com/apache/spark/pull/22821
>
> One problem here is that users can simply recover those functions like this:
>
> scala> spark.sessionState.functionRegistry.createOrReplaceTempFunction("map_filter",
>   x => org.apache.spark.sql.catalyst.expressions.MapFilter(x(0),x(1)))
>
> Technically, the PR looks like a compromise that unblocks the release while still allowing some users to use that feature.
>
> At first glance, I thought this was a workaround that ignores the discussion context. But that sounds like one of the practical ways for Apache Spark. (We had the Spark 2.0 Tech. Preview before.)
>
> I want to finalize the decision on the `map_filter` (and the three related functions) issue. Are we good to go with https://github.com/apache/spark/pull/22821?
>
> Bests,
> Dongjoon.
>
> PS. Also, there is a PR to completely remove them, too.
> https://github.com/cloud-fan/spark/pull/11
>
>
> On Wed, Oct 24, 2018 at 10:14 PM Xiao Li <lix...@databricks.com> wrote:
>
>> @Dongjoon Hyun <dongjoon.h...@gmail.com> Thanks! This is a blocking ticket. It returns a wrong result due to our undefined behavior. I agree we should revert the newly added map-oriented functions. In the 3.0 release, we need to define the behavior of duplicate keys in the MAP data type and fix all the related issues that are confusing to our end users.
>>
>> Thanks,
>>
>> Xiao
>>
>> On Wed, Oct 24, 2018 at 9:54 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>>> Ah, now I see the problem. `map_filter` has a very weird semantic that is neither "earlier entry wins" nor "later entry wins".
>>>
>>> I've opened https://github.com/apache/spark/pull/22821 to remove these newly added map-related functions from FunctionRegistry (for 2.4.0), so that they are invisible to end users and the weird behavior of the Spark map type with duplicated keys is not escalated. We should fix it ASAP in the master branch.
>>>
>>> If others are OK with it, I'll start a new RC after that PR is merged.
>>>
>>> Thanks,
>>> Wenchen
>>>
>>> On Thu, Oct 25, 2018 at 10:32 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>
>>>> For the first question, it's the `bin/spark-sql` result. I didn't check STS, but it will return the same result as `bin/spark-sql`.
>>>>
>>>> > I think map_filter is implemented correctly. map(1,2,1,3) is actually map(1,2) according to the "earlier entry wins" semantic. I don't think this will change in 2.4.1.
>>>>
>>>> For the second one, the `map_filter` issue is not about the `earlier entry wins` semantic. Please see the following example.
>>>>
>>>> spark-sql> SELECT m, map_filter(m, (k,v) -> v=2) c FROM (SELECT map_concat(map(1,2), map(1,3)) m);
>>>> {1:3}   {1:2}
>>>>
>>>> spark-sql> SELECT m, map_filter(m, (k,v) -> v=3) c FROM (SELECT map_concat(map(1,2), map(1,3)) m);
>>>> {1:3}   {1:3}
>>>>
>>>> spark-sql> SELECT m, map_filter(m, (k,v) -> v=4) c FROM (SELECT map_concat(map(1,2), map(1,3)) m);
>>>> {1:3}   {}
>>>>
>>>> In other words, in terms of the output result, `map_filter` works like a filter pushed down into the map, while users assume that `map_filter` works on top of the result of `m`.
>>>>
>>>> This is a function semantic issue.
>>>>
>>>>
>>>> On Wed, Oct 24, 2018 at 6:06 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>
>>>>> > spark-sql> select map(1,2,1,3); // Spark 2.4.0 RC4
>>>>> > {1:3}
>>>>>
>>>>> Are you running in the thrift-server? Then maybe this is caused by the bug in `Dataset.collect` as I mentioned above.
>>>>>
>>>>> I think map_filter is implemented correctly. map(1,2,1,3) is actually map(1,2) according to the "earlier entry wins" semantic. I don't think this will change in 2.4.1.
>>>>>
>>>>> On Thu, Oct 25, 2018 at 8:56 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>>
>>>>>> Thank you for the follow-ups.
>>>>>>
>>>>>> Then, in the end, Spark 2.4.1 will return `{1:2}`, differently from the following (including Spark/Scala)?
>>>>>>
>>>>>> I hoped to fix `map_filter`, but now Spark looks inconsistent in many ways.
>>>>>>
>>>>>> scala> sql("select map(1,2,1,3)").show // Spark 2.2.2
>>>>>> +---------------+
>>>>>> |map(1, 2, 1, 3)|
>>>>>> +---------------+
>>>>>> |    Map(1 -> 3)|
>>>>>> +---------------+
>>>>>>
>>>>>>
>>>>>> spark-sql> select map(1,2,1,3); // Spark 2.4.0 RC4
>>>>>> {1:3}
>>>>>>
>>>>>>
>>>>>> hive> select map(1,2,1,3); // Hive 1.2.2
>>>>>> OK
>>>>>> {1:3}
>>>>>>
>>>>>>
>>>>>> presto> SELECT map_concat(map(array[1],array[2]), map(array[1],array[3])); // Presto 0.212
>>>>>>  _col0
>>>>>> -------
>>>>>>  {1=3}
>>>>>>
>>>>>>
>>>>>> Bests,
>>>>>> Dongjoon.
>>>>>>
>>>>>>
>>>>>> On Wed, Oct 24, 2018 at 5:17 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Dongjoon,
>>>>>>>
>>>>>>> Thanks for reporting it! This is indeed a bug that needs to be fixed.
>>>>>>>
>>>>>>> The problem is not about the function `map_filter`, but about how map type values are created in Spark when there are duplicated keys.
>>>>>>>
>>>>>>> In programming languages like Java/Scala, when creating a map, the later entry wins, e.g. in Scala:
>>>>>>> scala> Map(1 -> 2, 1 -> 3)
>>>>>>> res0: scala.collection.immutable.Map[Int,Int] = Map(1 -> 3)
>>>>>>>
>>>>>>> scala> Map(1 -> 2, 1 -> 3).get(1)
>>>>>>> res1: Option[Int] = Some(3)
>>>>>>>
>>>>>>> However, in Spark, the earlier entry wins:
>>>>>>> scala> sql("SELECT map(1,2,1,3)[1]").show
>>>>>>> +------------------+
>>>>>>> |map(1, 2, 1, 3)[1]|
>>>>>>> +------------------+
>>>>>>> |                 2|
>>>>>>> +------------------+
>>>>>>>
>>>>>>> So for Spark users, Map(1 -> 2, 1 -> 3) should be equal to Map(1 -> 2).
>>>>>>>
>>>>>>> But there are several bugs in Spark:
>>>>>>>
>>>>>>> scala> sql("SELECT map(1,2,1,3)").show
>>>>>>> +----------------+
>>>>>>> | map(1, 2, 1, 3)|
>>>>>>> +----------------+
>>>>>>> |[1 -> 2, 1 -> 3]|
>>>>>>> +----------------+
>>>>>>> The displayed string of map values has a bug and we should deduplicate the entries. This is tracked by SPARK-25824.
>>>>>>>
>>>>>>>
>>>>>>> scala> sql("CREATE TABLE t AS SELECT map(1,2,1,3) as map")
>>>>>>> res11: org.apache.spark.sql.DataFrame = []
>>>>>>>
>>>>>>> scala> sql("select * from t").show
>>>>>>> +--------+
>>>>>>> |     map|
>>>>>>> +--------+
>>>>>>> |[1 -> 3]|
>>>>>>> +--------+
>>>>>>> The Hive map value converter has a bug; we should respect the "earlier entry wins" semantic. No ticket yet.
>>>>>>>
>>>>>>>
>>>>>>> scala> sql("select map(1,2,1,3)").collect
>>>>>>> res14: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
>>>>>>> The same bug happens in `collect`. No ticket yet.
>>>>>>>
>>>>>>> I'll create tickets and list all of them as known issues in 2.4.0.
>>>>>>>
>>>>>>> It's arguable whether the "earlier entry wins" semantic is reasonable. Fixing it is a behavior change and we can only apply it to the master branch.
>>>>>>>
>>>>>>> Going back to https://issues.apache.org/jira/browse/SPARK-25823, it's just a symptom of the Hive map value converter bug. I think it's a non-blocker.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Wenchen
>>>>>>>
>>>>>>> On Thu, Oct 25, 2018 at 5:31 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi, All.
>>>>>>>>
>>>>>>>> -0 due to the following issue. From Spark 2.4.0, users may get an incorrect result when they use the new `map_filter` with the `map_concat` function.
>>>>>>>>
>>>>>>>> https://issues.apache.org/jira/browse/SPARK-25823
>>>>>>>>
>>>>>>>> SPARK-25823 aims only to fix the data correctness issue from `map_filter`.
>>>>>>>>
>>>>>>>> PMC members are able to lower the priority. As always, I respect the PMC's decision.
>>>>>>>>
>>>>>>>> I'm sending this email to draw more attention to this bug and to warn the community about the new feature's limitation.
>>>>>>>>
>>>>>>>> Bests,
>>>>>>>> Dongjoon.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Oct 22, 2018 at 10:42 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Please vote on releasing the following candidate as Apache Spark version 2.4.0.
>>>>>>>>>
>>>>>>>>> The vote is open until October 26 PST and passes if a majority of +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>>>>>>>
>>>>>>>>> [ ] +1 Release this package as Apache Spark 2.4.0
>>>>>>>>> [ ] -1 Do not release this package because ...
>>>>>>>>>
>>>>>>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>>>>>>
>>>>>>>>> The tag to be voted on is v2.4.0-rc4 (commit e69e2bfa486d8d3b9d203b96ca9c0f37c2b6cabe):
>>>>>>>>> https://github.com/apache/spark/tree/v2.4.0-rc4
>>>>>>>>>
>>>>>>>>> The release files, including signatures, digests, etc.
>>>>>>>>> can be found at:
>>>>>>>>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/
>>>>>>>>>
>>>>>>>>> Signatures used for Spark RCs can be found in this file:
>>>>>>>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>>>>>>
>>>>>>>>> The staging repository for this release can be found at:
>>>>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1290
>>>>>>>>>
>>>>>>>>> The documentation corresponding to this release can be found at:
>>>>>>>>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-docs/
>>>>>>>>>
>>>>>>>>> The list of bug fixes going into 2.4.0 can be found at the following URL:
>>>>>>>>> https://issues.apache.org/jira/projects/SPARK/versions/12342385
>>>>>>>>>
>>>>>>>>> FAQ
>>>>>>>>>
>>>>>>>>> =========================
>>>>>>>>> How can I help test this release?
>>>>>>>>> =========================
>>>>>>>>>
>>>>>>>>> If you are a Spark user, you can help us test this release by taking an existing Spark workload and running it on this release candidate, then reporting any regressions.
>>>>>>>>>
>>>>>>>>> If you're working in PySpark, you can set up a virtual env, install the current RC, and see if anything important breaks. In Java/Scala, you can add the staging repository to your project's resolvers and test with the RC (make sure to clean up the artifact cache before/after so you don't end up building with an out-of-date RC going forward).
>>>>>>>>>
>>>>>>>>> ===========================================
>>>>>>>>> What should happen to JIRA tickets still targeting 2.4.0?
>>>>>>>>> ===========================================
>>>>>>>>>
>>>>>>>>> The current list of open tickets targeted at 2.4.0 can be found at:
>>>>>>>>> https://issues.apache.org/jira/projects/SPARK and search for "Target Version/s" = 2.4.0
>>>>>>>>>
>>>>>>>>> Committers should look at those and triage. Extremely important bug fixes, documentation, and API tweaks that impact compatibility should be worked on immediately. Everything else, please retarget to an appropriate release.
>>>>>>>>>
>>>>>>>>> ==================
>>>>>>>>> But my bug isn't fixed?
>>>>>>>>> ==================
>>>>>>>>>
>>>>>>>>> In order to make timely releases, we will typically not hold the release unless the bug in question is a regression from the previous release. That being said, if there is something which is a regression that has not been correctly targeted, please ping me or a committer to help target the issue.
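For anyone following the Java/Scala testing step described in the quoted vote email, a minimal sbt sketch along these lines should pull the staged artifacts. The staging repository URL is the one listed in the email; the resolver name and the assumption that the staged artifacts are published as version 2.4.0 under the usual org.apache.spark coordinates are mine, so adjust them if they differ:

// build.sbt -- sketch for testing an existing workload against the RC
// (resolver name and the "2.4.0" version string are assumptions, not confirmed)
resolvers += "Apache Spark 2.4.0 RC4 staging" at "https://repository.apache.org/content/repositories/orgapachespark-1290/"

// spark-sql pulls in spark-core transitively; add other modules as needed
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0"

As the FAQ in the vote email suggests, clean the local artifact cache before and after so a stale RC build is not picked up later.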