> spark-sql> select map(1,2,1,3); // Spark 2.4.0 RC4
> {1:3}

Are you running in the thrift-server? Then maybe this is caused by the bug in `Dataset.collect` as I mentioned above.
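For anyone who wants to check this locally, here is a minimal spark-shell sketch (assuming a 2.4.0 RC4 build; the expected outputs are the ones reported in this thread, not independently verified):

  // SQL-side lookup follows the "earlier entry wins" semantic
  spark.sql("SELECT map(1,2,1,3)[1]").show()     // prints 2

  // collect() goes through the row converter, which currently keeps the later entry
  spark.sql("SELECT map(1,2,1,3)").collect()     // reported as Array([Map(1 -> 3)])

If the thrift-server output shows {1:3} while the direct lookup shows 2, that matches the converter bug described below.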
I think map_filter is implemented correctly. map(1,2,1,3) is actually map(1,2) according to the "earlier entry wins" semantic. I don't think this will change in 2.4.1.

On Thu, Oct 25, 2018 at 8:56 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:

> Thank you for the follow-ups.
>
> Then, Spark 2.4.1 will return `{1:2}`, differently from the following
> (including Spark/Scala), in the end?
>
> I hoped to fix `map_filter`, but now Spark looks inconsistent in many ways.
>
> scala> sql("select map(1,2,1,3)").show // Spark 2.2.2
> +---------------+
> |map(1, 2, 1, 3)|
> +---------------+
> |    Map(1 -> 3)|
> +---------------+
>
>
> spark-sql> select map(1,2,1,3); // Spark 2.4.0 RC4
> {1:3}
>
>
> hive> select map(1,2,1,3); // Hive 1.2.2
> OK
> {1:3}
>
>
> presto> SELECT map_concat(map(array[1],array[2]), map(array[1],array[3])); // Presto 0.212
>  _col0
> -------
>  {1=3}
>
>
> Bests,
> Dongjoon.
>
>
> On Wed, Oct 24, 2018 at 5:17 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>
>> Hi Dongjoon,
>>
>> Thanks for reporting it! This is indeed a bug that needs to be fixed.
>>
>> The problem is not about the function `map_filter`, but about how map-type
>> values are created in Spark when there are duplicated keys.
>>
>> In programming languages like Java/Scala, when creating a map, the later
>> entry wins, e.g. in Scala:
>> scala> Map(1 -> 2, 1 -> 3)
>> res0: scala.collection.immutable.Map[Int,Int] = Map(1 -> 3)
>>
>> scala> Map(1 -> 2, 1 -> 3).get(1)
>> res1: Option[Int] = Some(3)
>>
>> However, in Spark, the earlier entry wins:
>> scala> sql("SELECT map(1,2,1,3)[1]").show
>> +------------------+
>> |map(1, 2, 1, 3)[1]|
>> +------------------+
>> |                 2|
>> +------------------+
>>
>> So for Spark users, Map(1 -> 2, 1 -> 3) should be equal to Map(1 -> 2).
>>
>> But there are several bugs in Spark:
>>
>> scala> sql("SELECT map(1,2,1,3)").show
>> +----------------+
>> | map(1, 2, 1, 3)|
>> +----------------+
>> |[1 -> 2, 1 -> 3]|
>> +----------------+
>> The displayed string of map values has a bug and we should deduplicate
>> the entries. This is tracked by SPARK-25824.
>>
>>
>> scala> sql("CREATE TABLE t AS SELECT map(1,2,1,3) as map")
>> res11: org.apache.spark.sql.DataFrame = []
>>
>> scala> sql("select * from t").show
>> +--------+
>> |     map|
>> +--------+
>> |[1 -> 3]|
>> +--------+
>> The Hive map value converter has a bug; we should respect the "earlier
>> entry wins" semantic. No ticket yet.
>>
>>
>> scala> sql("select map(1,2,1,3)").collect
>> res14: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
>> The same bug happens at `collect`. No ticket yet.
>>
>> I'll create tickets and list all of them as known issues in 2.4.0.
>>
>> It's arguable whether the "earlier entry wins" semantic is reasonable.
>> Fixing it is a behavior change and we can only apply it to the master branch.
>>
>> Going back to https://issues.apache.org/jira/browse/SPARK-25823, it's
>> just a symptom of the Hive map value converter bug. I think it's a
>> non-blocker.
>>
>> Thanks,
>> Wenchen
>>
>> On Thu, Oct 25, 2018 at 5:31 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>
>>> Hi, All.
>>>
>>> -0 due to the following issue. From Spark 2.4.0, users may get an
>>> incorrect result when they use the new `map_filter` with `map_concat` functions.
>>>
>>> https://issues.apache.org/jira/browse/SPARK-25823
>>>
>>> SPARK-25823 only aims to fix the data correctness issue from `map_filter`.
>>>
>>> PMC members are able to lower the priority. As always, I respect the
>>> PMC's decision.
>>>
>>> I'm sending this email to draw more attention to this bug and to give
>>> the community some warning about the new feature's limitation.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Mon, Oct 22, 2018 at 10:42 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>
>>>> Please vote on releasing the following candidate as Apache Spark version 2.4.0.
>>>>
>>>> The vote is open until October 26 PST and passes if a majority of +1 PMC
>>>> votes are cast, with a minimum of 3 +1 votes.
>>>>
>>>> [ ] +1 Release this package as Apache Spark 2.4.0
>>>> [ ] -1 Do not release this package because ...
>>>>
>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>
>>>> The tag to be voted on is v2.4.0-rc4 (commit e69e2bfa486d8d3b9d203b96ca9c0f37c2b6cabe):
>>>> https://github.com/apache/spark/tree/v2.4.0-rc4
>>>>
>>>> The release files, including signatures, digests, etc. can be found at:
>>>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/
>>>>
>>>> Signatures used for Spark RCs can be found in this file:
>>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>
>>>> The staging repository for this release can be found at:
>>>> https://repository.apache.org/content/repositories/orgapachespark-1290
>>>>
>>>> The documentation corresponding to this release can be found at:
>>>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-docs/
>>>>
>>>> The list of bug fixes going into 2.4.0 can be found at the following URL:
>>>> https://issues.apache.org/jira/projects/SPARK/versions/12342385
>>>>
>>>> FAQ
>>>>
>>>> =========================
>>>> How can I help test this release?
>>>> =========================
>>>>
>>>> If you are a Spark user, you can help us test this release by taking an
>>>> existing Spark workload, running it on this release candidate, and then
>>>> reporting any regressions.
>>>>
>>>> If you're working in PySpark, you can set up a virtual env, install the
>>>> current RC, and see if anything important breaks. In Java/Scala, you can
>>>> add the staging repository to your project's resolvers and test with the
>>>> RC (make sure to clean up the artifact cache before/after so you don't
>>>> end up building with an out-of-date RC going forward).
>>>>
>>>> ===========================================
>>>> What should happen to JIRA tickets still targeting 2.4.0?
>>>> ===========================================
>>>>
>>>> The current list of open tickets targeted at 2.4.0 can be found at:
>>>> https://issues.apache.org/jira/projects/SPARK and search for "Target Version/s" = 2.4.0
>>>>
>>>> Committers should look at those and triage. Extremely important bug
>>>> fixes, documentation, and API tweaks that impact compatibility should
>>>> be worked on immediately. Everything else, please retarget to an
>>>> appropriate release.
>>>>
>>>> ==================
>>>> But my bug isn't fixed?
>>>> ==================
>>>>
>>>> In order to make timely releases, we will typically not hold the
>>>> release unless the bug in question is a regression from the previous
>>>> release. That being said, if there is something which is a regression
>>>> that has not been correctly targeted, please ping me or a committer to
>>>> help target the issue.
>>>
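For the Java/Scala testing path in the FAQ above, one way to point a build at the RC staging repository is a small sbt stanza along these lines (a sketch only; the spark-sql module is just an example and should match whatever your project actually depends on):

  // build.sbt -- test a project against the 2.4.0 RC4 staging artifacts
  resolvers += "Apache Spark 2.4.0 RC4 staging" at
    "https://repository.apache.org/content/repositories/orgapachespark-1290/"

  libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0" % Provided

As noted above, clear the local artifact cache before and after testing so later builds don't keep resolving against the RC artifacts.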