@Dongjoon Hyun <dongjoon.h...@gmail.com>  Thanks! This is a blocking
ticket. It returns a wrong result due to our undefined behavior for duplicate
map keys. I agree we should revert the newly added map-oriented functions. In
the 3.0 release, we need to define the behavior of duplicate keys in the MAP
data type and fix all the related issues that confuse our end users.

Thanks,

Xiao

On Wed, Oct 24, 2018 at 9:54 PM Wenchen Fan <cloud0...@gmail.com> wrote:

> Ah, now I see the problem. `map_filter` has very weird semantics: it follows
> neither "earlier entry wins" nor "later entry wins".
>
> I've opened https://github.com/apache/spark/pull/22821 to remove these
> newly added map-related functions from FunctionRegistry (for 2.4.0), so that
> they are invisible to end users and the weird behavior of Spark's map type
> with duplicated keys is not escalated. We should fix it ASAP in the master
> branch.
>
> If others are OK with it, I'll start a new RC after that PR is merged.
>
> Thanks,
> Wenchen
>
> On Thu, Oct 25, 2018 at 10:32 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
> wrote:
>
>> For the first question, it's the `bin/spark-sql` result. I didn't check
>> STS, but it will return the same result as `bin/spark-sql`.
>>
>> > I think map_filter is implemented correctly. map(1,2,1,3) is actually
>> map(1,2) according to the "earlier entry wins" semantic. I don't think
>> this will change in 2.4.1.
>>
>> For the second one, the `map_filter` issue is not about the `earlier entry
>> wins` behavior. Please see the following examples.
>>
>> spark-sql> SELECT m, map_filter(m, (k,v) -> v=2) c FROM (SELECT
>> map_concat(map(1,2), map(1,3)) m);
>> {1:3} {1:2}
>>
>> spark-sql> SELECT m, map_filter(m, (k,v) -> v=3) c FROM (SELECT
>> map_concat(map(1,2), map(1,3)) m);
>> {1:3} {1:3}
>>
>> spark-sql> SELECT m, map_filter(m, (k,v) -> v=4) c FROM (SELECT
>> map_concat(map(1,2), map(1,3)) m);
>> {1:3} {}
>>
>> In other words, in terms of the output result, `map_filter` behaves like a
>> filter pushed down to the map's underlying entries, while users assume that
>> `map_filter` works on top of the result of `m`.
>>
>> This is a function semantic issue.
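>>
>> To make the mismatch concrete, here is a minimal plain-Scala model of the
>> two readings (just an illustration, not Spark internals). The raw entries
>> carried by `map_concat(map(1,2), map(1,3))` are 1 -> 2 and 1 -> 3, while
>> the query above displays `m` as {1:3}:
>>
>> // what map_filter effectively sees vs. what the user sees displayed
>> val rawEntries = Seq(1 -> 2, 1 -> 3)   // underlying entries of m
>> val displayed  = Map(1 -> 3)           // how m is shown in the output above
>>
>> // reading 1: filter the deduplicated map the user sees (what users assume)
>> def expected(p: (Int, Int) => Boolean) = displayed.filter { case (k, v) => p(k, v) }
>>
>> // reading 2: filter the raw, un-deduplicated entries (what RC4 matches)
>> def actual(p: (Int, Int) => Boolean) = rawEntries.filter { case (k, v) => p(k, v) }
>>
>> expected((k, v) => v == 2)  // Map()        -- empty, yet the query returns {1:2}
>> actual((k, v) => v == 2)    // List((1,2))  -- matches the {1:2} above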
>>
>>
>> On Wed, Oct 24, 2018 at 6:06 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>>> > spark-sql> select map(1,2,1,3); // Spark 2.4.0 RC4
>>> > {1:3}
>>>
>>> Are you running in the thrift-server? Then maybe this is caused by the
>>> bug in `Dataset.collect` as I mentioned above.
>>>
>>> I think map_filter is implemented correctly. map(1,2,1,3) is actually
>>> map(1,2) according to the "earlier entry wins" semantic. I don't think
>>> this will change in 2.4.1.
>>>
>>> On Thu, Oct 25, 2018 at 8:56 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
>>> wrote:
>>>
>>>> Thank you for the follow-ups.
>>>>
>>>> Then, will Spark 2.4.1 return `{1:2}` in the end, differently from all of
>>>> the following (including earlier Spark and plain Scala)?
>>>>
>>>> I hoped to fix `map_filter`, but now Spark looks inconsistent in
>>>> many ways.
>>>>
>>>> scala> sql("select map(1,2,1,3)").show // Spark 2.2.2
>>>> +---------------+
>>>> |map(1, 2, 1, 3)|
>>>> +---------------+
>>>> |    Map(1 -> 3)|
>>>> +---------------+
>>>>
>>>>
>>>> spark-sql> select map(1,2,1,3); // Spark 2.4.0 RC4
>>>> {1:3}
>>>>
>>>>
>>>> hive> select map(1,2,1,3);  // Hive 1.2.2
>>>> OK
>>>> {1:3}
>>>>
>>>>
>>>> presto> SELECT map_concat(map(array[1],array[2]),
>>>> map(array[1],array[3])); // Presto 0.212
>>>>  _col0
>>>> -------
>>>>  {1=3}
>>>>
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>>
>>>> On Wed, Oct 24, 2018 at 5:17 PM Wenchen Fan <cloud0...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Dongjoon,
>>>>>
>>>>> Thanks for reporting it! This is indeed a bug that needs to be fixed.
>>>>>
>>>>> The problem is not about the function `map_filter`, but about how
>>>>> map type values are created in Spark when there are duplicated keys.
>>>>>
>>>>> In programming languages like Java/Scala, when creating a map, the
>>>>> later entry wins, e.g. in Scala:
>>>>> scala> Map(1 -> 2, 1 -> 3)
>>>>> res0: scala.collection.immutable.Map[Int,Int] = Map(1 -> 3)
>>>>>
>>>>> scala> Map(1 -> 2, 1 -> 3).get(1)
>>>>> res1: Option[Int] = Some(3)
>>>>>
>>>>> However, in Spark, the earlier entry wins:
>>>>> scala> sql("SELECT map(1,2,1,3)[1]").show
>>>>> +------------------+
>>>>> |map(1, 2, 1, 3)[1]|
>>>>> +------------------+
>>>>> |                 2|
>>>>> +------------------+
>>>>>
>>>>> So for Spark users, Map(1 -> 2, 1 -> 3) should be equal to Map(1 -> 2).
>>>>>
>>>>> But there are several bugs in Spark
>>>>>
>>>>> scala> sql("SELECT map(1,2,1,3)").show
>>>>> +----------------+
>>>>> | map(1, 2, 1, 3)|
>>>>> +----------------+
>>>>> |[1 -> 2, 1 -> 3]|
>>>>> +----------------+
>>>>> The displayed string of map values has a bug: we should deduplicate
>>>>> the entries. This is tracked by SPARK-25824.
>>>>>
>>>>>
>>>>> scala> sql("CREATE TABLE t AS SELECT map(1,2,1,3) as map")
>>>>> res11: org.apache.spark.sql.DataFrame = []
>>>>>
>>>>> scala> sql("select * from t").show
>>>>> +--------+
>>>>> |     map|
>>>>> +--------+
>>>>> |[1 -> 3]|
>>>>> +--------+
>>>>> The Hive map value converter has a bug; we should respect the "earlier
>>>>> entry wins" semantic. No ticket yet.
>>>>>
>>>>>
>>>>> scala> sql("select map(1,2,1,3)").collect
>>>>> res14: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
>>>>> The same bug happens in `collect`. No ticket yet.
>>>>>
>>>>> I'll create tickets and list all of them as known issues in 2.4.0.
>>>>>
>>>>> It's arguable whether the "earlier entry wins" semantic is reasonable.
>>>>> Fixing it is a behavior change, and we can only apply it to the master
>>>>> branch.
>>>>>
>>>>> Going back to https://issues.apache.org/jira/browse/SPARK-25823, it's
>>>>> just a symptom of the Hive map value converter bug. I think it's a
>>>>> non-blocker.
>>>>>
>>>>> Thanks,
>>>>> Wenchen
>>>>>
>>>>> On Thu, Oct 25, 2018 at 5:31 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi, All.
>>>>>>
>>>>>> -0 due to the following issue. Starting with Spark 2.4.0, users may get
>>>>>> an incorrect result when they use the new `map_filter` function together
>>>>>> with `map_concat`.
>>>>>>
>>>>>> https://issues.apache.org/jira/browse/SPARK-25823
>>>>>>
>>>>>> SPARK-25823 aims only to fix the data correctness issue in
>>>>>> `map_filter`.
>>>>>>
>>>>>> PMC members are able to lower the priority. As always, I respect the
>>>>>> PMC's decision.
>>>>>>
>>>>>> I'm sending this email to draw more attention to this bug and to warn
>>>>>> the community about the new feature's limitation.
>>>>>>
>>>>>> Bests,
>>>>>> Dongjoon.
>>>>>>
>>>>>>
>>>>>> On Mon, Oct 22, 2018 at 10:42 AM Wenchen Fan <cloud0...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>>>> version 2.4.0.
>>>>>>>
>>>>>>> The vote is open until October 26 PST and passes if a majority of +1
>>>>>>> PMC votes are cast, with a minimum of 3 +1 votes.
>>>>>>>
>>>>>>> [ ] +1 Release this package as Apache Spark 2.4.0
>>>>>>> [ ] -1 Do not release this package because ...
>>>>>>>
>>>>>>> To learn more about Apache Spark, please see
>>>>>>> http://spark.apache.org/
>>>>>>>
>>>>>>> The tag to be voted on is v2.4.0-rc4 (commit
>>>>>>> e69e2bfa486d8d3b9d203b96ca9c0f37c2b6cabe):
>>>>>>> https://github.com/apache/spark/tree/v2.4.0-rc4
>>>>>>>
>>>>>>> The release files, including signatures, digests, etc. can be found
>>>>>>> at:
>>>>>>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/
>>>>>>>
>>>>>>> Signatures used for Spark RCs can be found in this file:
>>>>>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>>>>
>>>>>>> The staging repository for this release can be found at:
>>>>>>>
>>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1290
>>>>>>>
>>>>>>> The documentation corresponding to this release can be found at:
>>>>>>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-docs/
>>>>>>>
>>>>>>> The list of bug fixes going into 2.4.0 can be found at the following
>>>>>>> URL:
>>>>>>> https://issues.apache.org/jira/projects/SPARK/versions/12342385
>>>>>>>
>>>>>>> FAQ
>>>>>>>
>>>>>>> =========================
>>>>>>> How can I help test this release?
>>>>>>> =========================
>>>>>>>
>>>>>>> If you are a Spark user, you can help us test this release by taking
>>>>>>> an existing Spark workload and running it on this release candidate,
>>>>>>> then reporting any regressions.
>>>>>>>
>>>>>>> If you're working in PySpark, you can set up a virtual env, install the
>>>>>>> current RC, and see if anything important breaks. In Java/Scala, you can
>>>>>>> add the staging repository to your project's resolvers and test with the
>>>>>>> RC, e.g. as sketched below (make sure to clean up the artifact cache
>>>>>>> before/after so you don't end up building with an out-of-date RC going
>>>>>>> forward).
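>>>>>>>
>>>>>>> For example, an sbt build could point at the staging repository roughly
>>>>>>> like this (a sketch, not official instructions; adjust the module and
>>>>>>> versions to whatever your project actually uses):
>>>>>>>
>>>>>>> // build.sbt -- compile/test against the 2.4.0 RC4 staging artifacts
>>>>>>> resolvers += "Apache Spark 2.4.0 RC4 staging" at
>>>>>>>   "https://repository.apache.org/content/repositories/orgapachespark-1290"
>>>>>>>
>>>>>>> // The staged artifacts are versioned 2.4.0; spark-sql is one example module.
>>>>>>> libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0" % Provided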
>>>>>>>
>>>>>>> ===========================================
>>>>>>> What should happen to JIRA tickets still targeting 2.4.0?
>>>>>>> ===========================================
>>>>>>>
>>>>>>> The current list of open tickets targeted at 2.4.0 can be found at:
>>>>>>> https://issues.apache.org/jira/projects/SPARK and search for
>>>>>>> "Target Version/s" = 2.4.0
>>>>>>>
>>>>>>> Committers should look at those and triage. Extremely important bug
>>>>>>> fixes, documentation, and API tweaks that impact compatibility should
>>>>>>> be worked on immediately. Everything else please retarget to an
>>>>>>> appropriate release.
>>>>>>>
>>>>>>> ==================
>>>>>>> But my bug isn't fixed?
>>>>>>> ==================
>>>>>>>
>>>>>>> In order to make timely releases, we will typically not hold the
>>>>>>> release unless the bug in question is a regression from the previous
>>>>>>> release. That being said, if there is something which is a regression
>>>>>>> that has not been correctly targeted please ping me or a committer to
>>>>>>> help target the issue.
>>>>>>>
>>>>>>
