Re: [VOTE] SPARK 2.4.0 (RC4)

Dongjoon Hyun Wed, 24 Oct 2018 17:57:01 -0700

Thank you for the follow-ups.

Then, Spark 2.4.1 will return `{1:2}` differently from the followings
(including Spark/Scala) in the end?


I hoped to fix the `map_filter`, but now Spark looks inconsistent in many
ways.

scala> sql("select map(1,2,1,3)").show // Spark 2.2.2
+---------------+
|map(1, 2, 1, 3)|
+---------------+
|    Map(1 -> 3)|
+---------------+


spark-sql> select map(1,2,1,3); // Spark 2.4.0 RC4
{1:3}


hive> select map(1,2,1,3);  // Hive 1.2.2
OK
{1:3}


presto> SELECT map_concat(map(array[1],array[2]), map(array[1],array[3]));
// Presto 0.212
 _col0
-------
 {1=3}


Bests,
Dongjoon.


On Wed, Oct 24, 2018 at 5:17 PM Wenchen Fan <[email protected]> wrote:

> Hi Dongjoon,
>
> Thanks for reporting it! This is indeed a bug that needs to be fixed.
>
> The problem is not about the function `map_filter`, but about how the map
> type values are created in Spark, when there are duplicated keys.
>
> In programming languages like Java/Scala, when creating map, the later
> entry wins. e.g. in scala
> scala> Map(1 -> 2, 1 -> 3)
> res0: scala.collection.immutable.Map[Int,Int] = Map(1 -> 3)
>
> scala> Map(1 -> 2, 1 -> 3).get(1)
> res1: Option[Int] = Some(3)
>
> However, in Spark, the earlier entry wins
> scala> sql("SELECT map(1,2,1,3)[1]").show
> +------------------+
> |map(1, 2, 1, 3)[1]|
> +------------------+
> |                 2|
> +------------------+
>
> So for Spark users, Map(1 -> 2, 1 -> 3) should be equal to Map(1 -> 2).
>
> But there are several bugs in Spark
>
> scala> sql("SELECT map(1,2,1,3)").show
> +----------------+
> | map(1, 2, 1, 3)|
> +----------------+
> |[1 -> 2, 1 -> 3]|
> +----------------+
> The displayed string of map values has a bug and we should deduplicate the
> entries, This is tracked by SPARK-25824.
>
>
> scala> sql("CREATE TABLE t AS SELECT map(1,2,1,3) as map")
> res11: org.apache.spark.sql.DataFrame = []
>
> scala> sql("select * from t").show
> +--------+
> |     map|
> +--------+
> |[1 -> 3]|
> +--------+
> The Hive map value convert has a bug, we should respect the "earlier entry
> wins" semantic. No ticket yet.
>
>
> scala> sql("select map(1,2,1,3)").collect
> res14: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
> Same bug happens at `collect`. No ticket yet.
>
> I'll create tickets and list all of them as known issues in 2.4.0.
>
> It's arguable if the "earlier entry wins" semantic is reasonable. Fixing
> it is a behavior change and we can only apply it to master branch.
>
> Going back to https://issues.apache.org/jira/browse/SPARK-25823, it's
> just a symptom of the hive map value converter bug. I think it's a
> non-blocker.
>
> Thanks,
> Wenchen
>
> On Thu, Oct 25, 2018 at 5:31 AM Dongjoon Hyun <[email protected]>
> wrote:
>
>> Hi, All.
>>
>> -0 due to the following issue. From Spark 2.4.0, users may get an
>> incorrect result when they use new `map_fitler` with `map_concat` functions.
>>
>> https://issues.apache.org/jira/browse/SPARK-25823
>>
>> SPARK-25823 is only aiming to fix the data correctness issue from
>> `map_filter`.
>>
>> PMC members are able to lower the priority. Always, I respect PMC's
>> decision.
>>
>> I'm sending this email to draw more attention to this bug and to give
>> some warning on the new feature's limitation to the community.
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Mon, Oct 22, 2018 at 10:42 AM Wenchen Fan <[email protected]> wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 2.4.0.
>>>
>>> The vote is open until October 26 PST and passes if a majority +1 PMC
>>> votes are cast, with
>>> a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.4.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v2.4.0-rc4 (commit
>>> e69e2bfa486d8d3b9d203b96ca9c0f37c2b6cabe):
>>> https://github.com/apache/spark/tree/v2.4.0-rc4
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1290
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-docs/
>>>
>>> The list of bug fixes going into 2.4.0 can be found at the following URL:
>>> https://issues.apache.org/jira/projects/SPARK/versions/12342385
>>>
>>> FAQ
>>>
>>> =========================
>>> How can I help test this release?
>>> =========================
>>>
>>> If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>> the current RC and see if anything important breaks, in the Java/Scala
>>> you can add the staging repository to your projects resolvers and test
>>> with the RC (make sure to clean up the artifact cache before/after so
>>> you don't end up building with a out of date RC going forward).
>>>
>>> ===========================================
>>> What should happen to JIRA tickets still targeting 2.4.0?
>>> ===========================================
>>>
>>> The current list of open tickets targeted at 2.4.0 can be found at:
>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>> Version/s" = 2.4.0
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should
>>> be worked on immediately. Everything else please retarget to an
>>> appropriate release.
>>>
>>> ==================
>>> But my bug isn't fixed?
>>> ==================
>>>
>>> In order to make timely releases, we will typically not hold the
>>> release unless the bug in question is a regression from the previous
>>> release. That being said, if there is something which is a regression
>>> that has not been correctly targeted please ping me or a committer to
>>> help target the issue.
>>>
>>

Re: [VOTE] SPARK 2.4.0 (RC4)

Reply via email to