For the first question, that is the `bin/spark-sql` result. I didn't check STS, but it should return the same result as `bin/spark-sql`.
> I think map_filter is implemented correctly. map(1,2,1,3) is actually
> map(1,2) according to the "earlier entry wins" semantic. I don't think
> this will change in 2.4.1.

For the second one, the `map_filter` issue is not about the `earlier entry
wins` semantic. Please see the following examples.

spark-sql> SELECT m, map_filter(m, (k,v) -> v=2) c FROM (SELECT map_concat(map(1,2), map(1,3)) m);
{1:3}	{1:2}
spark-sql> SELECT m, map_filter(m, (k,v) -> v=3) c FROM (SELECT map_concat(map(1,2), map(1,3)) m);
{1:3}	{1:3}
spark-sql> SELECT m, map_filter(m, (k,v) -> v=4) c FROM (SELECT map_concat(map(1,2), map(1,3)) m);
{1:3}	{}

In other words, `map_filter` behaves like a filter pushed down into the
map's underlying entries, while users assume that `map_filter` works on top
of the result of `m`. This is a function semantics issue.

On Wed, Oct 24, 2018 at 6:06 PM Wenchen Fan <cloud0...@gmail.com> wrote:

> > spark-sql> select map(1,2,1,3); // Spark 2.4.0 RC4
> > {1:3}
>
> Are you running in the thrift-server? Then maybe this is caused by the bug
> in `Dataset.collect` as I mentioned above.
>
> I think map_filter is implemented correctly. map(1,2,1,3) is actually
> map(1,2) according to the "earlier entry wins" semantic. I don't think
> this will change in 2.4.1.
>
> On Thu, Oct 25, 2018 at 8:56 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
> wrote:
>
>> Thank you for the follow-ups.
>>
>> Then, in the end, Spark 2.4.1 will return `{1:2}`, differently from all
>> of the following (including Spark/Scala)?
>>
>> I hoped to fix `map_filter`, but now Spark looks inconsistent in many
>> ways.
>>
>> scala> sql("select map(1,2,1,3)").show // Spark 2.2.2
>> +---------------+
>> |map(1, 2, 1, 3)|
>> +---------------+
>> |    Map(1 -> 3)|
>> +---------------+
>>
>> spark-sql> select map(1,2,1,3); // Spark 2.4.0 RC4
>> {1:3}
>>
>> hive> select map(1,2,1,3); // Hive 1.2.2
>> OK
>> {1:3}
>>
>> presto> SELECT map_concat(map(array[1],array[2]),
>> map(array[1],array[3])); // Presto 0.212
>>  _col0
>> -------
>>  {1=3}
>>
>> Bests,
>> Dongjoon.
>>
>> On Wed, Oct 24, 2018 at 5:17 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>>> Hi Dongjoon,
>>>
>>> Thanks for reporting it! This is indeed a bug that needs to be fixed.
>>>
>>> The problem is not about the function `map_filter`, but about how
>>> map-type values are created in Spark when there are duplicated keys.
>>>
>>> In programming languages like Java/Scala, when creating a map, the
>>> later entry wins, e.g. in Scala:
>>>
>>> scala> Map(1 -> 2, 1 -> 3)
>>> res0: scala.collection.immutable.Map[Int,Int] = Map(1 -> 3)
>>>
>>> scala> Map(1 -> 2, 1 -> 3).get(1)
>>> res1: Option[Int] = Some(3)
>>>
>>> However, in Spark, the earlier entry wins:
>>>
>>> scala> sql("SELECT map(1,2,1,3)[1]").show
>>> +------------------+
>>> |map(1, 2, 1, 3)[1]|
>>> +------------------+
>>> |                 2|
>>> +------------------+
>>>
>>> So for Spark users, Map(1 -> 2, 1 -> 3) should be equal to Map(1 -> 2).
>>>
>>> But there are several bugs in Spark:
>>>
>>> scala> sql("SELECT map(1,2,1,3)").show
>>> +----------------+
>>> | map(1, 2, 1, 3)|
>>> +----------------+
>>> |[1 -> 2, 1 -> 3]|
>>> +----------------+
>>>
>>> The displayed string of map values has a bug and we should deduplicate
>>> the entries. This is tracked by SPARK-25824.
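To make the "earlier entry wins" deduplication described above concrete, here
is a minimal plain-Scala sketch; it is an added illustration of the semantic
Wenchen describes, not Spark's actual implementation:

// Sketch only: keep each key the first time it appears, scanning left to right.
def earlierEntryWins[K, V](entries: Seq[(K, V)]): Map[K, V] =
  entries.foldLeft(Map.empty[K, V]) {
    case (acc, (k, v)) => if (acc.contains(k)) acc else acc + (k -> v)
  }

earlierEntryWins(Seq(1 -> 2, 1 -> 3)) // Map(1 -> 2): what Spark's lookup implies
Map(1 -> 2, 1 -> 3)                   // Map(1 -> 3): Scala's later-entry-wins

The fold keeps the first value seen for each key, matching the
`map(1,2,1,3)[1] = 2` lookup behavior shown above.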
>>>
>>> scala> sql("CREATE TABLE t AS SELECT map(1,2,1,3) as map")
>>> res11: org.apache.spark.sql.DataFrame = []
>>>
>>> scala> sql("select * from t").show
>>> +--------+
>>> |     map|
>>> +--------+
>>> |[1 -> 3]|
>>> +--------+
>>>
>>> The Hive map value converter has a bug; we should respect the "earlier
>>> entry wins" semantic. No ticket yet.
>>>
>>> scala> sql("select map(1,2,1,3)").collect
>>> res14: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
>>>
>>> The same bug happens in `collect`. No ticket yet.
>>>
>>> I'll create tickets and list all of them as known issues in 2.4.0.
>>>
>>> It's arguable whether the "earlier entry wins" semantic is reasonable.
>>> Fixing it is a behavior change, and we can only apply it to the master
>>> branch.
>>>
>>> Going back to https://issues.apache.org/jira/browse/SPARK-25823, it's
>>> just a symptom of the Hive map value converter bug. I think it's a
>>> non-blocker.
>>>
>>> Thanks,
>>> Wenchen
>>>
>>> On Thu, Oct 25, 2018 at 5:31 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
>>> wrote:
>>>
>>>> Hi, All.
>>>>
>>>> -0 due to the following issue. From Spark 2.4.0, users may get an
>>>> incorrect result when they use the new `map_filter` with `map_concat`
>>>> functions.
>>>>
>>>> https://issues.apache.org/jira/browse/SPARK-25823
>>>>
>>>> SPARK-25823 only aims to fix the data correctness issue from
>>>> `map_filter`.
>>>>
>>>> PMC members are able to lower the priority. I always respect the PMC's
>>>> decision.
>>>>
>>>> I'm sending this email to draw more attention to this bug and to warn
>>>> the community about the new feature's limitation.
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>> On Mon, Oct 22, 2018 at 10:42 AM Wenchen Fan <cloud0...@gmail.com>
>>>> wrote:
>>>>
>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>> version 2.4.0.
>>>>>
>>>>> The vote is open until October 26 PST and passes if a majority of +1
>>>>> PMC votes are cast, with a minimum of 3 +1 votes.
>>>>>
>>>>> [ ] +1 Release this package as Apache Spark 2.4.0
>>>>> [ ] -1 Do not release this package because ...
>>>>>
>>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>>
>>>>> The tag to be voted on is v2.4.0-rc4 (commit
>>>>> e69e2bfa486d8d3b9d203b96ca9c0f37c2b6cabe):
>>>>> https://github.com/apache/spark/tree/v2.4.0-rc4
>>>>>
>>>>> The release files, including signatures, digests, etc. can be found at:
>>>>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/
>>>>>
>>>>> Signatures used for Spark RCs can be found in this file:
>>>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>>
>>>>> The staging repository for this release can be found at:
>>>>> https://repository.apache.org/content/repositories/orgapachespark-1290
>>>>>
>>>>> The documentation corresponding to this release can be found at:
>>>>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-docs/
>>>>>
>>>>> The list of bug fixes going into 2.4.0 can be found at the following
>>>>> URL:
>>>>> https://issues.apache.org/jira/projects/SPARK/versions/12342385
>>>>>
>>>>> FAQ
>>>>>
>>>>> =========================
>>>>> How can I help test this release?
>>>>> =========================
>>>>>
>>>>> If you are a Spark user, you can help us test this release by taking
>>>>> an existing Spark workload and running it on this release candidate,
>>>>> then reporting any regressions.
>>>>>
>>>>> If you're working in PySpark, you can set up a virtual env, install
>>>>> the current RC, and see if anything important breaks. In Java/Scala,
>>>>> you can add the staging repository to your project's resolvers and
>>>>> test with the RC (make sure to clean up the artifact cache
>>>>> before/after so you don't end up building with an out-of-date RC
>>>>> going forward).
>>>>>
>>>>> ===========================================
>>>>> What should happen to JIRA tickets still targeting 2.4.0?
>>>>> ===========================================
>>>>>
>>>>> The current list of open tickets targeted at 2.4.0 can be found at
>>>>> https://issues.apache.org/jira/projects/SPARK by searching for
>>>>> "Target Version/s" = 2.4.0.
>>>>>
>>>>> Committers should look at those and triage. Extremely important bug
>>>>> fixes, documentation, and API tweaks that impact compatibility should
>>>>> be worked on immediately. Everything else, please retarget to an
>>>>> appropriate release.
>>>>>
>>>>> ==================
>>>>> But my bug isn't fixed?
>>>>> ==================
>>>>>
>>>>> In order to make timely releases, we will typically not hold the
>>>>> release unless the bug in question is a regression from the previous
>>>>> release. That being said, if there is something that is a regression
>>>>> which has not been correctly targeted, please ping me or a committer
>>>>> to help target the issue.
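For Java/Scala users following the testing instructions above, a minimal
build.sbt sketch for resolving the RC artifacts could look like this (the
`spark-sql` dependency line is an example module choice, not a requirement):

// Resolve the RC artifacts from the staging repository listed in the vote email.
resolvers += "Spark 2.4.0 RC4 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1290"

// RC artifacts are staged under the final version number, 2.4.0.
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0"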