Ah, now I see the problem. `map_filter` has a very strange semantic that is
neither "earlier entry wins" nor "later entry wins".

I've opened https://github.com/apache/spark/pull/22821 to remove these
newly added map-related functions from FunctionRegistry (for 2.4.0), so that
they are invisible to end users and the strange behavior of the Spark map
type with duplicated keys is not escalated. We should fix it ASAP in the
master branch.
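
For end users, the effect should simply be that these functions can no longer
be resolved. A rough sketch of what I'd expect in the spark-shell once the PR
is merged (the exact error message is an assumption, I haven't verified it
against the RC):

scala> sql("SELECT map_filter(map(1, 2), (k, v) -> v > 1)")
// expected to fail analysis with something like:
// org.apache.spark.sql.AnalysisException: Undefined function: 'map_filter'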

If others are OK with it, I'll start a new RC after that PR is merged.

Thanks,
Wenchen

On Thu, Oct 25, 2018 at 10:32 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
wrote:

> For the first question, that is the `bin/spark-sql` result. I didn't check
> STS, but it should return the same result as `bin/spark-sql`.
>
> > I think map_filter is implemented correctly. map(1,2,1,3) is actually
> > map(1,2) according to the "earlier entry wins" semantic. I don't think
> > this will change in 2.4.1.
>
> For the second one, the `map_filter` issue is not about the `earlier entry
> wins` semantic. Please see the following examples.
>
> spark-sql> SELECT m, map_filter(m, (k,v) -> v=2) c FROM (SELECT
> map_concat(map(1,2), map(1,3)) m);
> {1:3} {1:2}
>
> spark-sql> SELECT m, map_filter(m, (k,v) -> v=3) c FROM (SELECT
> map_concat(map(1,2), map(1,3)) m);
> {1:3} {1:3}
>
> spark-sql> SELECT m, map_filter(m, (k,v) -> v=4) c FROM (SELECT
> map_concat(map(1,2), map(1,3)) m);
> {1:3} {}
>
> In other words, in terms of the output result, `map_filter` acts like a
> pushed-down filter on the map's underlying entries, while users would assume
> that `map_filter` operates on the final value of `m`.
>
> This is a function semantic issue.
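>
> To restate the observed behavior as a rough model (this is only an
> illustration, not the actual Spark internals): the predicate appears to be
> applied to the concatenated entries before any deduplication, instead of to
> the final value of `m`.
>
> val entries = Seq(1 -> 2, 1 -> 3)          // map_concat(map(1,2), map(1,3)) before dedup
> entries.filter { case (_, v) => v == 2 }   // Seq((1,2)) -> {1:2}, as observed
> entries.filter { case (_, v) => v == 3 }   // Seq((1,3)) -> {1:3}, as observed
> entries.filter { case (_, v) => v == 4 }   // Seq()      -> {},    as observed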
>
>
> On Wed, Oct 24, 2018 at 6:06 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>
>> > spark-sql> select map(1,2,1,3); // Spark 2.4.0 RC4
>> > {1:3}
>>
>> Are you running the Thrift server? If so, maybe this is caused by the
>> bug in `Dataset.collect` that I mentioned above.
>>
>> I think map_filter is implemented correctly. map(1,2,1,3) is actually
>> map(1,2) according to the "earlier entry wins" semantic. I don't think
>> this will change in 2.4.1.
>>
>> On Thu, Oct 25, 2018 at 8:56 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
>> wrote:
>>
>>> Thank you for the follow-ups.
>>>
>>> Then, will Spark 2.4.1 eventually return `{1:2}`, differently from all of
>>> the following (including Spark/Scala)?
>>>
>>> I had hoped to fix `map_filter`, but now Spark looks inconsistent in many
>>> ways.
>>>
>>> scala> sql("select map(1,2,1,3)").show // Spark 2.2.2
>>> +---------------+
>>> |map(1, 2, 1, 3)|
>>> +---------------+
>>> |    Map(1 -> 3)|
>>> +---------------+
>>>
>>>
>>> spark-sql> select map(1,2,1,3); // Spark 2.4.0 RC4
>>> {1:3}
>>>
>>>
>>> hive> select map(1,2,1,3);  // Hive 1.2.2
>>> OK
>>> {1:3}
>>>
>>>
>>> presto> SELECT map_concat(map(array[1],array[2]),
>>> map(array[1],array[3])); // Presto 0.212
>>>  _col0
>>> -------
>>>  {1=3}
>>>
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Wed, Oct 24, 2018 at 5:17 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>
>>>> Hi Dongjoon,
>>>>
>>>> Thanks for reporting it! This is indeed a bug that needs to be fixed.
>>>>
>>>> The problem is not about the function `map_filter`, but about how map
>>>> type values are created in Spark when there are duplicated keys.
>>>>
>>>> In programming languages like Java/Scala, when creating a map, the later
>>>> entry wins. For example, in Scala:
>>>> scala> Map(1 -> 2, 1 -> 3)
>>>> res0: scala.collection.immutable.Map[Int,Int] = Map(1 -> 3)
>>>>
>>>> scala> Map(1 -> 2, 1 -> 3).get(1)
>>>> res1: Option[Int] = Some(3)
>>>>
>>>> However, in Spark, the earlier entry wins:
>>>> scala> sql("SELECT map(1,2,1,3)[1]").show
>>>> +------------------+
>>>> |map(1, 2, 1, 3)[1]|
>>>> +------------------+
>>>> |                 2|
>>>> +------------------+
>>>>
>>>> So for Spark users, Map(1 -> 2, 1 -> 3) should be equal to Map(1 -> 2).
>>>>
>>>> But there are several bugs in Spark:
>>>>
>>>> scala> sql("SELECT map(1,2,1,3)").show
>>>> +----------------+
>>>> | map(1, 2, 1, 3)|
>>>> +----------------+
>>>> |[1 -> 2, 1 -> 3]|
>>>> +----------------+
>>>> The displayed string of map values has a bug: we should deduplicate
>>>> the entries. This is tracked by SPARK-25824.
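>>>>
>>>> A minimal sketch of "earlier entry wins" deduplication (a hypothetical
>>>> helper for illustration only, not the actual fix):
>>>>
>>>> def dedupKeepEarlier[K, V](entries: Seq[(K, V)]): Seq[(K, V)] = {
>>>>   val seen = scala.collection.mutable.HashSet.empty[K]
>>>>   entries.filter { case (k, _) => seen.add(k) }  // add returns false for a duplicate key
>>>> }
>>>>
>>>> dedupKeepEarlier(Seq(1 -> 2, 1 -> 3))  // Seq((1,2)), consistent with map(1,2,1,3)[1] == 2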
>>>>
>>>>
>>>> scala> sql("CREATE TABLE t AS SELECT map(1,2,1,3) as map")
>>>> res11: org.apache.spark.sql.DataFrame = []
>>>>
>>>> scala> sql("select * from t").show
>>>> +--------+
>>>> |     map|
>>>> +--------+
>>>> |[1 -> 3]|
>>>> +--------+
>>>> The Hive map value converter has a bug: it should respect the "earlier
>>>> entry wins" semantic. No ticket yet.
>>>>
>>>>
>>>> scala> sql("select map(1,2,1,3)").collect
>>>> res14: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
>>>> The same bug happens in `collect`. No ticket yet.
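>>>>
>>>> This matches what plain Scala does when the duplicated entries are put
>>>> into a Map (a rough model of the conversion, not the actual converter
>>>> code):
>>>>
>>>> Seq(1 -> 2, 1 -> 3).toMap  // Map(1 -> 3), i.e. "later entry wins"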
>>>>
>>>> I'll create tickets and list all of them as known issues in 2.4.0.
>>>>
>>>> It's arguable whether the "earlier entry wins" semantic is reasonable.
>>>> Fixing it would be a behavior change, so we can only apply it to the
>>>> master branch.
>>>>
>>>> Going back to https://issues.apache.org/jira/browse/SPARK-25823, it's
>>>> just a symptom of the Hive map value converter bug. I think it's a
>>>> non-blocker.
>>>>
>>>> Thanks,
>>>> Wenchen
>>>>
>>>> On Thu, Oct 25, 2018 at 5:31 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi, All.
>>>>>
>>>>> -0 due to the following issue. In Spark 2.4.0, users may get an
>>>>> incorrect result when they use the new `map_filter` function together
>>>>> with `map_concat`.
>>>>>
>>>>> https://issues.apache.org/jira/browse/SPARK-25823
>>>>>
>>>>> SPARK-25823 aims only to fix the data correctness issue in
>>>>> `map_filter`.
>>>>>
>>>>> PMC members are able to lower the priority; as always, I respect the
>>>>> PMC's decision.
>>>>>
>>>>> I'm sending this email to draw more attention to this bug and to warn
>>>>> the community about the new feature's limitation.
>>>>>
>>>>> Bests,
>>>>> Dongjoon.
>>>>>
>>>>>
>>>>> On Mon, Oct 22, 2018 at 10:42 AM Wenchen Fan <cloud0...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>>> version 2.4.0.
>>>>>>
>>>>>> The vote is open until October 26 PST and passes if a majority of +1
>>>>>> PMC votes are cast, with a minimum of 3 +1 votes.
>>>>>>
>>>>>> [ ] +1 Release this package as Apache Spark 2.4.0
>>>>>> [ ] -1 Do not release this package because ...
>>>>>>
>>>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>>>
>>>>>> The tag to be voted on is v2.4.0-rc4 (commit
>>>>>> e69e2bfa486d8d3b9d203b96ca9c0f37c2b6cabe):
>>>>>> https://github.com/apache/spark/tree/v2.4.0-rc4
>>>>>>
>>>>>> The release files, including signatures, digests, etc. can be found
>>>>>> at:
>>>>>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/
>>>>>>
>>>>>> Signatures used for Spark RCs can be found in this file:
>>>>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>>>
>>>>>> The staging repository for this release can be found at:
>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1290
>>>>>>
>>>>>> The documentation corresponding to this release can be found at:
>>>>>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-docs/
>>>>>>
>>>>>> The list of bug fixes going into 2.4.0 can be found at the following
>>>>>> URL:
>>>>>> https://issues.apache.org/jira/projects/SPARK/versions/12342385
>>>>>>
>>>>>> FAQ
>>>>>>
>>>>>> =========================
>>>>>> How can I help test this release?
>>>>>> =========================
>>>>>>
>>>>>> If you are a Spark user, you can help us test this release by taking
>>>>>> an existing Spark workload and running it on this release candidate,
>>>>>> then reporting any regressions.
>>>>>>
>>>>>> If you're working in PySpark, you can set up a virtual env, install the
>>>>>> current RC, and see if anything important breaks. In Java/Scala, you can
>>>>>> add the staging repository to your project's resolvers and test with the
>>>>>> RC (make sure to clean up the artifact cache before/after so you don't
>>>>>> end up building with an out-of-date RC going forward).
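>>>>>>
>>>>>> For example, with sbt (a minimal sketch; the staging URL is the one
>>>>>> listed above, everything else is a placeholder for your own build):
>>>>>>
>>>>>> resolvers += "spark-2.4.0-rc4-staging" at
>>>>>>   "https://repository.apache.org/content/repositories/orgapachespark-1290"
>>>>>> libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0"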
>>>>>>
>>>>>> ===========================================
>>>>>> What should happen to JIRA tickets still targeting 2.4.0?
>>>>>> ===========================================
>>>>>>
>>>>>> The current list of open tickets targeted at 2.4.0 can be found at:
>>>>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>>>>> Version/s" = 2.4.0
>>>>>>
>>>>>> Committers should look at those and triage. Extremely important bug
>>>>>> fixes, documentation, and API tweaks that impact compatibility should
>>>>>> be worked on immediately. Please retarget everything else to an
>>>>>> appropriate release.
>>>>>>
>>>>>> ==================
>>>>>> But my bug isn't fixed?
>>>>>> ==================
>>>>>>
>>>>>> In order to make timely releases, we will typically not hold the
>>>>>> release unless the bug in question is a regression from the previous
>>>>>> release. That being said, if there is a regression that has not been
>>>>>> correctly targeted, please ping me or a committer to help target the
>>>>>> issue.
>>>>>>
>>>>>
