> spark-sql> select map(1,2,1,3); // Spark 2.4.0 RC4
> {1:3}

Are you running in the thrift-server? Then maybe this is caused by the bug in `Dataset.collect` as I mentioned above.
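For anyone who wants to check this locally, here is a minimal spark-shell sketch (assuming a 2.4.0 RC4 build; the expected outputs are the ones reported in this thread, not independently verified):

  // SQL-side lookup follows the "earlier entry wins" semantic
  spark.sql("SELECT map(1,2,1,3)[1]").show()     // prints 2

  // collect() goes through the row converter, which currently keeps the later entry
  spark.sql("SELECT map(1,2,1,3)").collect()     // reported as Array([Map(1 -> 3)])

If the thrift-server output shows {1:3} while the direct lookup shows 2, that matches the converter bug described below.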
I think map_filter is implemented correctly. map(1,2,1,3) is actually map(1,2) according to the "earlier entry wins" semantic. I don't think this will change in 2.4.1.

On Thu, Oct 25, 2018 at 8:56 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:

> Thank you for the follow-ups.
>
> Then, Spark 2.4.1 will return `{1:2}`, differently from the following
> (including Spark/Scala), in the end?
>
> I hoped to fix `map_filter`, but now Spark looks inconsistent in many ways.
>
> scala> sql("select map(1,2,1,3)").show // Spark 2.2.2
> +---------------+
> |map(1, 2, 1, 3)|
> +---------------+
> |    Map(1 -> 3)|
> +---------------+
>
>
> spark-sql> select map(1,2,1,3); // Spark 2.4.0 RC4
> {1:3}
>
>
> hive> select map(1,2,1,3); // Hive 1.2.2
> OK
> {1:3}
>
>
> presto> SELECT map_concat(map(array[1],array[2]), map(array[1],array[3])); // Presto 0.212
>  _col0
> -------
>  {1=3}
>
>
> Bests,
> Dongjoon.
>
>
> On Wed, Oct 24, 2018 at 5:17 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>
>> Hi Dongjoon,
>>
>> Thanks for reporting it! This is indeed a bug that needs to be fixed.
>>
>> The problem is not about the function `map_filter`, but about how map-type
>> values are created in Spark when there are duplicated keys.
>>
>> In programming languages like Java/Scala, when creating a map, the later
>> entry wins, e.g. in Scala:
>> scala> Map(1 -> 2, 1 -> 3)
>> res0: scala.collection.immutable.Map[Int,Int] = Map(1 -> 3)
>>
>> scala> Map(1 -> 2, 1 -> 3).get(1)
>> res1: Option[Int] = Some(3)
>>
>> However, in Spark, the earlier entry wins:
>> scala> sql("SELECT map(1,2,1,3)[1]").show
>> +------------------+
>> |map(1, 2, 1, 3)[1]|
>> +------------------+
>> |                 2|
>> +------------------+
>>
>> So for Spark users, Map(1 -> 2, 1 -> 3) should be equal to Map(1 -> 2).
>>
>> But there are several bugs in Spark:
>>
>> scala> sql("SELECT map(1,2,1,3)").show
>> +----------------+
>> | map(1, 2, 1, 3)|
>> +----------------+
>> |[1 -> 2, 1 -> 3]|
>> +----------------+
>> The displayed string of map values has a bug and we should deduplicate
>> the entries. This is tracked by SPARK-25824.
>>
>>
>> scala> sql("CREATE TABLE t AS SELECT map(1,2,1,3) as map")
>> res11: org.apache.spark.sql.DataFrame = []
>>
>> scala> sql("select * from t").show
>> +--------+
>> |     map|
>> +--------+
>> |[1 -> 3]|
>> +--------+
>> The Hive map value converter has a bug; we should respect the "earlier
>> entry wins" semantic. No ticket yet.
>>
>>
>> scala> sql("select map(1,2,1,3)").collect
>> res14: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
>> The same bug happens at `collect`. No ticket yet.
>>
>> I'll create tickets and list all of them as known issues in 2.4.0.
>>
>> It's arguable whether the "earlier entry wins" semantic is reasonable.
>> Fixing it is a behavior change and we can only apply it to the master branch.
>>
>> Going back to https://issues.apache.org/jira/browse/SPARK-25823, it's
>> just a symptom of the Hive map value converter bug. I think it's a
>> non-blocker.
>>
>> Thanks,
>> Wenchen
>>
>> On Thu, Oct 25, 2018 at 5:31 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>
>>> Hi, All.
>>>
>>> -0 due to the following issue. From Spark 2.4.0, users may get an
>>> incorrect result when they use the new `map_filter` with `map_concat` functions.
>>>
>>> https://issues.apache.org/jira/browse/SPARK-25823
>>>
>>> SPARK-25823 only aims to fix the data correctness issue from `map_filter`.
>>>
>>> PMC members are able to lower the priority. As always, I respect the
>>> PMC's decision.
>>>
>>> I'm sending this email to draw more attention to this bug and to give
>>> the community some warning about the new feature's limitation.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Mon, Oct 22, 2018 at 10:42 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>
>>>> Please vote on releasing the following candidate as Apache Spark version 2.4.0.
>>>>
>>>> The vote is open until October 26 PST and passes if a majority of +1 PMC
>>>> votes are cast, with a minimum of 3 +1 votes.
>>>>
>>>> [ ] +1 Release this package as Apache Spark 2.4.0
>>>> [ ] -1 Do not release this package because ...
>>>>
>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>
>>>> The tag to be voted on is v2.4.0-rc4 (commit e69e2bfa486d8d3b9d203b96ca9c0f37c2b6cabe):
>>>> https://github.com/apache/spark/tree/v2.4.0-rc4
>>>>
>>>> The release files, including signatures, digests, etc. can be found at:
>>>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/
>>>>
>>>> Signatures used for Spark RCs can be found in this file:
>>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>
>>>> The staging repository for this release can be found at:
>>>> https://repository.apache.org/content/repositories/orgapachespark-1290
>>>>
>>>> The documentation corresponding to this release can be found at:
>>>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-docs/
>>>>
>>>> The list of bug fixes going into 2.4.0 can be found at the following URL:
>>>> https://issues.apache.org/jira/projects/SPARK/versions/12342385
>>>>
>>>> FAQ
>>>>
>>>> =========================
>>>> How can I help test this release?
>>>> =========================
>>>>
>>>> If you are a Spark user, you can help us test this release by taking an
>>>> existing Spark workload, running it on this release candidate, and then
>>>> reporting any regressions.
>>>>
>>>> If you're working in PySpark, you can set up a virtual env, install the
>>>> current RC, and see if anything important breaks. In Java/Scala, you can
>>>> add the staging repository to your project's resolvers and test with the
>>>> RC (make sure to clean up the artifact cache before/after so you don't
>>>> end up building with an out-of-date RC going forward).
>>>>
>>>> ===========================================
>>>> What should happen to JIRA tickets still targeting 2.4.0?
>>>> ===========================================
>>>>
>>>> The current list of open tickets targeted at 2.4.0 can be found at:
>>>> https://issues.apache.org/jira/projects/SPARK and search for "Target Version/s" = 2.4.0
>>>>
>>>> Committers should look at those and triage. Extremely important bug
>>>> fixes, documentation, and API tweaks that impact compatibility should
>>>> be worked on immediately. Everything else, please retarget to an
>>>> appropriate release.
>>>>
>>>> ==================
>>>> But my bug isn't fixed?
>>>> ==================
>>>>
>>>> In order to make timely releases, we will typically not hold the
>>>> release unless the bug in question is a regression from the previous
>>>> release. That being said, if there is something which is a regression
>>>> that has not been correctly targeted, please ping me or a committer to
>>>> help target the issue.
>>>
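For the Java/Scala testing path in the FAQ above, one way to point a build at the RC staging repository is a small sbt stanza along these lines (a sketch only; the spark-sql module is just an example and should match whatever your project actually depends on):

  // build.sbt -- test a project against the 2.4.0 RC4 staging artifacts
  resolvers += "Apache Spark 2.4.0 RC4 staging" at
    "https://repository.apache.org/content/repositories/orgapachespark-1290/"

  libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0" % Provided

As noted above, clear the local artifact cache before and after testing so later builds don't keep resolving against the RC artifacts.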