Hi Dongjoon,

Thanks for reporting it! This is indeed a bug that needs to be fixed.
The problem is not the `map_filter` function itself, but how map values are
created in Spark when there are duplicated keys. In programming languages
like Java and Scala, the later entry wins when a map is created with
duplicate keys, e.g. in Scala:

scala> Map(1 -> 2, 1 -> 3)
res0: scala.collection.immutable.Map[Int,Int] = Map(1 -> 3)

scala> Map(1 -> 2, 1 -> 3).get(1)
res1: Option[Int] = Some(3)

In Spark, however, the earlier entry wins:

scala> sql("SELECT map(1,2,1,3)[1]").show
+------------------+
|map(1, 2, 1, 3)[1]|
+------------------+
|                 2|
+------------------+

So for Spark users, Map(1 -> 2, 1 -> 3) should be equal to Map(1 -> 2). But
there are several bugs in Spark:

scala> sql("SELECT map(1,2,1,3)").show
+----------------+
| map(1, 2, 1, 3)|
+----------------+
|[1 -> 2, 1 -> 3]|
+----------------+

The displayed string of map values has a bug: we should deduplicate the
entries. This is tracked by SPARK-25824.

scala> sql("CREATE TABLE t AS SELECT map(1,2,1,3) as map")
res11: org.apache.spark.sql.DataFrame = []

scala> sql("select * from t").show
+--------+
|     map|
+--------+
|[1 -> 3]|
+--------+

The Hive map value converter has a bug: we should respect the "earlier entry
wins" semantics. No ticket yet.

scala> sql("select map(1,2,1,3)").collect
res14: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])

The same bug happens in `collect`. No ticket yet.

I'll create tickets and list all of them as known issues in 2.4.0.

It's arguable whether the "earlier entry wins" semantics are reasonable.
Fixing this is a behavior change, so we can only apply it to the master
branch.

Going back to https://issues.apache.org/jira/browse/SPARK-25823, it's just a
symptom of the Hive map value converter bug. I think it's a non-blocker.

Thanks,
Wenchen

On Thu, Oct 25, 2018 at 5:31 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:

> Hi, All.
>
> -0 due to the following issue. From Spark 2.4.0, users may get an
> incorrect result when they use the new `map_filter` with `map_concat`
> functions.
>
> https://issues.apache.org/jira/browse/SPARK-25823
>
> SPARK-25823 is only aiming to fix the data correctness issue from
> `map_filter`.
>
> PMC members are able to lower the priority. As always, I respect the
> PMC's decision.
>
> I'm sending this email to draw more attention to this bug and to give the
> community some warning about the new feature's limitation.
>
> Bests,
> Dongjoon.
>
>
> On Mon, Oct 22, 2018 at 10:42 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.4.0.
>>
>> The vote is open until October 26 PST and passes if a majority of +1 PMC
>> votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 2.4.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v2.4.0-rc4 (commit
>> e69e2bfa486d8d3b9d203b96ca9c0f37c2b6cabe):
>> https://github.com/apache/spark/tree/v2.4.0-rc4
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1290
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-docs/
>>
>> The list of bug fixes going into 2.4.0 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12342385
>>
>> FAQ
>>
>> =========================
>> How can I help test this release?
>> =========================
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload, running it on this release candidate, and
>> reporting any regressions.
>>
>> If you're working in PySpark, you can set up a virtual env, install
>> the current RC, and see if anything important breaks. In Java/Scala,
>> you can add the staging repository to your project's resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out-of-date RC going forward).
>>
>> ===========================================
>> What should happen to JIRA tickets still targeting 2.4.0?
>> ===========================================
>>
>> The current list of open tickets targeted at 2.4.0 can be found at
>> https://issues.apache.org/jira/projects/SPARK by searching for "Target
>> Version/s" = 2.4.0.
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else, please retarget to an
>> appropriate release.
>>
>> ==================
>> But my bug isn't fixed?
>> ==================
>>
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted, please ping me or a committer to
>> help target the issue.
>
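[Editor's note] The "later entry wins" behavior Wenchen describes for JVM
collections can be checked directly. A minimal Java sketch (the class name
`DuplicateKeyDemo` is just for illustration):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DuplicateKeyDemo {
    public static void main(String[] args) {
        // Like Scala's Map(1 -> 2, 1 -> 3), a Java map keeps the later
        // entry when the same key is inserted twice.
        Map<Integer, Integer> m = new LinkedHashMap<>();
        m.put(1, 2);
        m.put(1, 3); // overwrites the value from the first put
        System.out.println(m.get(1)); // prints 3; Spark 2.4's map(1,2,1,3)[1] returns 2
    }
}
```

This is the semantic gap at the heart of the thread: the same duplicate-key
input yields 3 on the JVM but 2 in Spark SQL.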