Hi Dongjoon,

Thanks for reporting it! This is indeed a bug that needs to be fixed.

The problem is not with the `map_filter` function itself, but with how map
type values are created in Spark when there are duplicate keys.

In programming languages like Java/Scala, when creating a map, the later
entry wins. For example, in Scala:
scala> Map(1 -> 2, 1 -> 3)
res0: scala.collection.immutable.Map[Int,Int] = Map(1 -> 3)

scala> Map(1 -> 2, 1 -> 3).get(1)
res1: Option[Int] = Some(3)

However, in Spark, the earlier entry wins:
scala> sql("SELECT map(1,2,1,3)[1]").show
+------------------+
|map(1, 2, 1, 3)[1]|
+------------------+
|                 2|
+------------------+

So for Spark users, Map(1 -> 2, 1 -> 3) should be equal to Map(1 -> 2).
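
To spell out that semantic in plain Scala (an illustrative sketch only,
not Spark's internal code): building a map from duplicated keys keeps the
first value seen for each key.

def firstWins[K, V](entries: Seq[(K, V)]): Map[K, V] =
  entries.foldLeft(Map.empty[K, V]) { case (acc, (k, v)) =>
    if (acc.contains(k)) acc else acc + (k -> v)  // the earlier entry wins
  }

firstWins(Seq(1 -> 2, 1 -> 3))  // Map(1 -> 2), matching Spark's lookup above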

But there are several bugs in Spark:

scala> sql("SELECT map(1,2,1,3)").show
+----------------+
| map(1, 2, 1, 3)|
+----------------+
|[1 -> 2, 1 -> 3]|
+----------------+
The displayed string of map values has a bug: we should deduplicate the
entries. This is tracked by SPARK-25824.
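
A possible shape for the fix (a sketch against a parallel-array layout
like Spark's array-based map data, not the actual patch): deduplicate the
entries, keeping the first value per key and preserving order, before
rendering the string.

import scala.collection.mutable.LinkedHashMap

def dedupForDisplay[K, V](keys: Seq[K], values: Seq[V]): String = {
  val seen = LinkedHashMap.empty[K, V]
  keys.zip(values).foreach { case (k, v) =>
    if (!seen.contains(k)) seen += (k -> v)  // keep the first value per key
  }
  seen.map { case (k, v) => s"$k -> $v" }.mkString("[", ", ", "]")
}

dedupForDisplay(Seq(1, 1), Seq(2, 3))  // "[1 -> 2]"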


scala> sql("CREATE TABLE t AS SELECT map(1,2,1,3) as map")
res11: org.apache.spark.sql.DataFrame = []

scala> sql("select * from t").show
+--------+
|     map|
+--------+
|[1 -> 3]|
+--------+
The Hive map value converter has a bug; it should respect the "earlier
entry wins" semantics. No ticket yet.


scala> sql("select map(1,2,1,3)").collect
res14: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
The same bug occurs in `collect`. No ticket yet.
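
The first-wins rule is also easy to express for this path (illustrative
only, not the actual code): since Scala's toMap lets later pairs overwrite
earlier ones, reversing first makes the earliest entry win.

scala> Seq(1 -> 2, 1 -> 3).reverse.toMap
res0: scala.collection.immutable.Map[Int,Int] = Map(1 -> 2)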

I'll create tickets and list all of them as known issues in 2.4.0.

It's arguable whether the "earlier entry wins" semantics are reasonable.
Fixing this would be a behavior change, so we can only apply it to the
master branch.

Going back to https://issues.apache.org/jira/browse/SPARK-25823: it's just
a symptom of the Hive map value converter bug. I think it's a non-blocker.
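
For anyone who wants to reproduce the interaction Dongjoon mentioned, a
query of the following shape exercises `map_filter` over the duplicate-key
map produced by `map_concat` (a sketch for reproduction only; under the
"earlier entry wins" semantics, map_concat(map(1,2), map(1,3)) is
logically {1 -> 2}, so the filter should return an empty map, and any
other result indicates the bug):

scala> sql("SELECT map_filter(map_concat(map(1,2), map(1,3)), (k, v) -> v = 3)").show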

Thanks,
Wenchen

On Thu, Oct 25, 2018 at 5:31 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
wrote:

> Hi, All.
>
> -0 due to the following issue. From Spark 2.4.0, users may get an
> incorrect result when they use the new `map_filter` and `map_concat`
> functions together.
>
> https://issues.apache.org/jira/browse/SPARK-25823
>
> SPARK-25823 is only aiming to fix the data correctness issue from
> `map_filter`.
>
> PMC members are able to lower the priority. As always, I respect the
> PMC's decision.
>
> I'm sending this email to draw more attention to this bug and to warn
> the community about the new feature's limitations.
>
> Bests,
> Dongjoon.
>
>
> On Mon, Oct 22, 2018 at 10:42 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.4.0.
>>
>> The vote is open until October 26 PST and passes if a majority of +1
>> PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 2.4.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v2.4.0-rc4 (commit
>> e69e2bfa486d8d3b9d203b96ca9c0f37c2b6cabe):
>> https://github.com/apache/spark/tree/v2.4.0-rc4
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1290
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-docs/
>>
>> The list of bug fixes going into 2.4.0 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12342385
>>
>> FAQ
>>
>> =========================
>> How can I help test this release?
>> =========================
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark, you can set up a virtual env, install
>> the current RC, and see if anything important breaks. In Java/Scala,
>> you can add the staging repository to your project's resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out-of-date RC going forward).
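>>
>> For example, with sbt that could look like the sketch below (assuming
>> the staged artifacts are versioned 2.4.0; the staging repository URL is
>> the one listed above):
>>
>> resolvers += "spark-2.4.0-rc4-staging" at
>>   "https://repository.apache.org/content/repositories/orgapachespark-1290/"
>> libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0"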
>>
>> ===========================================
>> What should happen to JIRA tickets still targeting 2.4.0?
>> ===========================================
>>
>> The current list of open tickets targeted at 2.4.0 can be found at
>> https://issues.apache.org/jira/projects/SPARK by searching for "Target
>> Version/s" = 2.4.0.
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else should be retargeted to an
>> appropriate release.
>>
>> ==================
>> But my bug isn't fixed?
>> ==================
>>
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That said, if there is a regression that has not been
>> correctly targeted, please ping me or a committer to help target the
>> issue.
>>
>
