Thank you for the follow-ups. Then, Spark 2.4.1 will return `{1:2}` differently from the followings (including Spark/Scala) in the end?
I hoped to fix the `map_filter`, but now Spark looks inconsistent in many ways. scala> sql("select map(1,2,1,3)").show // Spark 2.2.2 +---------------+ |map(1, 2, 1, 3)| +---------------+ | Map(1 -> 3)| +---------------+ spark-sql> select map(1,2,1,3); // Spark 2.4.0 RC4 {1:3} hive> select map(1,2,1,3); // Hive 1.2.2 OK {1:3} presto> SELECT map_concat(map(array[1],array[2]), map(array[1],array[3])); // Presto 0.212 _col0 ------- {1=3} Bests, Dongjoon. On Wed, Oct 24, 2018 at 5:17 PM Wenchen Fan <cloud0...@gmail.com> wrote: > Hi Dongjoon, > > Thanks for reporting it! This is indeed a bug that needs to be fixed. > > The problem is not about the function `map_filter`, but about how the map > type values are created in Spark, when there are duplicated keys. > > In programming languages like Java/Scala, when creating map, the later > entry wins. e.g. in scala > scala> Map(1 -> 2, 1 -> 3) > res0: scala.collection.immutable.Map[Int,Int] = Map(1 -> 3) > > scala> Map(1 -> 2, 1 -> 3).get(1) > res1: Option[Int] = Some(3) > > However, in Spark, the earlier entry wins > scala> sql("SELECT map(1,2,1,3)[1]").show > +------------------+ > |map(1, 2, 1, 3)[1]| > +------------------+ > | 2| > +------------------+ > > So for Spark users, Map(1 -> 2, 1 -> 3) should be equal to Map(1 -> 2). > > But there are several bugs in Spark > > scala> sql("SELECT map(1,2,1,3)").show > +----------------+ > | map(1, 2, 1, 3)| > +----------------+ > |[1 -> 2, 1 -> 3]| > +----------------+ > The displayed string of map values has a bug and we should deduplicate the > entries, This is tracked by SPARK-25824. > > > scala> sql("CREATE TABLE t AS SELECT map(1,2,1,3) as map") > res11: org.apache.spark.sql.DataFrame = [] > > scala> sql("select * from t").show > +--------+ > | map| > +--------+ > |[1 -> 3]| > +--------+ > The Hive map value convert has a bug, we should respect the "earlier entry > wins" semantic. No ticket yet. > > > scala> sql("select map(1,2,1,3)").collect > res14: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)]) > Same bug happens at `collect`. No ticket yet. > > I'll create tickets and list all of them as known issues in 2.4.0. > > It's arguable if the "earlier entry wins" semantic is reasonable. Fixing > it is a behavior change and we can only apply it to master branch. > > Going back to https://issues.apache.org/jira/browse/SPARK-25823, it's > just a symptom of the hive map value converter bug. I think it's a > non-blocker. > > Thanks, > Wenchen > > On Thu, Oct 25, 2018 at 5:31 AM Dongjoon Hyun <dongjoon.h...@gmail.com> > wrote: > >> Hi, All. >> >> -0 due to the following issue. From Spark 2.4.0, users may get an >> incorrect result when they use new `map_fitler` with `map_concat` functions. >> >> https://issues.apache.org/jira/browse/SPARK-25823 >> >> SPARK-25823 is only aiming to fix the data correctness issue from >> `map_filter`. >> >> PMC members are able to lower the priority. Always, I respect PMC's >> decision. >> >> I'm sending this email to draw more attention to this bug and to give >> some warning on the new feature's limitation to the community. >> >> Bests, >> Dongjoon. >> >> >> On Mon, Oct 22, 2018 at 10:42 AM Wenchen Fan <cloud0...@gmail.com> wrote: >> >>> Please vote on releasing the following candidate as Apache Spark version >>> 2.4.0. >>> >>> The vote is open until October 26 PST and passes if a majority +1 PMC >>> votes are cast, with >>> a minimum of 3 +1 votes. >>> >>> [ ] +1 Release this package as Apache Spark 2.4.0 >>> [ ] -1 Do not release this package because ... >>> >>> To learn more about Apache Spark, please see http://spark.apache.org/ >>> >>> The tag to be voted on is v2.4.0-rc4 (commit >>> e69e2bfa486d8d3b9d203b96ca9c0f37c2b6cabe): >>> https://github.com/apache/spark/tree/v2.4.0-rc4 >>> >>> The release files, including signatures, digests, etc. can be found at: >>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/ >>> >>> Signatures used for Spark RCs can be found in this file: >>> https://dist.apache.org/repos/dist/dev/spark/KEYS >>> >>> The staging repository for this release can be found at: >>> https://repository.apache.org/content/repositories/orgapachespark-1290 >>> >>> The documentation corresponding to this release can be found at: >>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-docs/ >>> >>> The list of bug fixes going into 2.4.0 can be found at the following URL: >>> https://issues.apache.org/jira/projects/SPARK/versions/12342385 >>> >>> FAQ >>> >>> ========================= >>> How can I help test this release? >>> ========================= >>> >>> If you are a Spark user, you can help us test this release by taking >>> an existing Spark workload and running on this release candidate, then >>> reporting any regressions. >>> >>> If you're working in PySpark you can set up a virtual env and install >>> the current RC and see if anything important breaks, in the Java/Scala >>> you can add the staging repository to your projects resolvers and test >>> with the RC (make sure to clean up the artifact cache before/after so >>> you don't end up building with a out of date RC going forward). >>> >>> =========================================== >>> What should happen to JIRA tickets still targeting 2.4.0? >>> =========================================== >>> >>> The current list of open tickets targeted at 2.4.0 can be found at: >>> https://issues.apache.org/jira/projects/SPARK and search for "Target >>> Version/s" = 2.4.0 >>> >>> Committers should look at those and triage. Extremely important bug >>> fixes, documentation, and API tweaks that impact compatibility should >>> be worked on immediately. Everything else please retarget to an >>> appropriate release. >>> >>> ================== >>> But my bug isn't fixed? >>> ================== >>> >>> In order to make timely releases, we will typically not hold the >>> release unless the bug in question is a regression from the previous >>> release. That being said, if there is something which is a regression >>> that has not been correctly targeted please ping me or a committer to >>> help target the issue. >>> >>