Personally I don't think it matters. Users can build arbitrary expressions/plans themselves with the internal API, and we never guarantee the result.
Removing these functions from the function registry is a small patch and easy to review, and to me it's better than a 1000+ LOC patch that removes the whole thing. Again, I don't have a strong opinion here. I'm OK with removing the entire thing if a PR is ready and well reviewed.

On Thu, Oct 25, 2018 at 11:00 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
> Thank you for the decision, All.
>
> As of now, to unblock this, it seems that we are trying to remove them from the function registry.
>
> https://github.com/apache/spark/pull/22821
>
> One problem here is that users can simply recover those functions like this:
>
> scala> spark.sessionState.functionRegistry.createOrReplaceTempFunction("map_filter",
>   x => org.apache.spark.sql.catalyst.expressions.MapFilter(x(0),x(1)))
>
> Technically, the PR looks like a compromise that unblocks the release while still allowing some users to use that feature.
>
> At first glance, I thought this was a workaround that ignores the discussion context. But that sounds like one of the practical ways for Apache Spark. (We had the Spark 2.0 Tech. Preview before.)
>
> I want to finalize the decision on the `map_filter` (and the three related functions) issue. Are we good to go with https://github.com/apache/spark/pull/22821?
>
> Bests,
> Dongjoon.
>
> PS. Also, there is a PR to completely remove them, too.
> https://github.com/cloud-fan/spark/pull/11
>
>
> On Wed, Oct 24, 2018 at 10:14 PM Xiao Li <lix...@databricks.com> wrote:
>
>> @Dongjoon Hyun <dongjoon.h...@gmail.com> Thanks! This is a blocking ticket. It returns a wrong result due to our undefined behavior. I agree we should revert the newly added map-oriented functions. In the 3.0 release, we need to define the behavior of duplicate keys in the MAP data type and fix all the related issues that are confusing to our end users.
>>
>> Thanks,
>>
>> Xiao
>>
>> On Wed, Oct 24, 2018 at 9:54 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>>> Ah, now I see the problem. `map_filter` has a very weird semantic that is neither "earlier entry wins" nor "later entry wins".
>>>
>>> I've opened https://github.com/apache/spark/pull/22821 to remove these newly added map-related functions from FunctionRegistry (for 2.4.0), so that they are invisible to end users and the weird behavior of the Spark map type with duplicated keys is not escalated. We should fix it ASAP in the master branch.
>>>
>>> If others are OK with it, I'll start a new RC after that PR is merged.
>>>
>>> Thanks,
>>> Wenchen
>>>
>>> On Thu, Oct 25, 2018 at 10:32 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>
>>>> For the first question, it's the `bin/spark-sql` result. I didn't check STS, but it will return the same result as `bin/spark-sql`.
>>>>
>>>> > I think map_filter is implemented correctly. map(1,2,1,3) is actually map(1,2) according to the "earlier entry wins" semantic. I don't think this will change in 2.4.1.
>>>>
>>>> For the second one, the `map_filter` issue is not about the `earlier entry wins` semantic. Please see the following example.
>>>>
>>>> spark-sql> SELECT m, map_filter(m, (k,v) -> v=2) c FROM (SELECT map_concat(map(1,2), map(1,3)) m);
>>>> {1:3}   {1:2}
>>>>
>>>> spark-sql> SELECT m, map_filter(m, (k,v) -> v=3) c FROM (SELECT map_concat(map(1,2), map(1,3)) m);
>>>> {1:3}   {1:3}
>>>>
>>>> spark-sql> SELECT m, map_filter(m, (k,v) -> v=4) c FROM (SELECT map_concat(map(1,2), map(1,3)) m);
>>>> {1:3}   {}
>>>>
>>>> In other words, in terms of the output result, `map_filter` works like a filter pushed down into the map, while users assume that `map_filter` works on top of the result of `m`.
>>>>
>>>> This is a function semantic issue.
>>>>
>>>>
>>>> On Wed, Oct 24, 2018 at 6:06 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>
>>>>> > spark-sql> select map(1,2,1,3); // Spark 2.4.0 RC4
>>>>> > {1:3}
>>>>>
>>>>> Are you running in the thrift-server? Then maybe this is caused by the bug in `Dataset.collect` as I mentioned above.
>>>>>
>>>>> I think map_filter is implemented correctly. map(1,2,1,3) is actually map(1,2) according to the "earlier entry wins" semantic. I don't think this will change in 2.4.1.
>>>>>
>>>>> On Thu, Oct 25, 2018 at 8:56 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>>
>>>>>> Thank you for the follow-ups.
>>>>>>
>>>>>> Then, in the end, Spark 2.4.1 will return `{1:2}`, differently from the following (including Spark/Scala)?
>>>>>>
>>>>>> I hoped to fix `map_filter`, but now Spark looks inconsistent in many ways.
>>>>>>
>>>>>> scala> sql("select map(1,2,1,3)").show // Spark 2.2.2
>>>>>> +---------------+
>>>>>> |map(1, 2, 1, 3)|
>>>>>> +---------------+
>>>>>> |    Map(1 -> 3)|
>>>>>> +---------------+
>>>>>>
>>>>>>
>>>>>> spark-sql> select map(1,2,1,3); // Spark 2.4.0 RC4
>>>>>> {1:3}
>>>>>>
>>>>>>
>>>>>> hive> select map(1,2,1,3); // Hive 1.2.2
>>>>>> OK
>>>>>> {1:3}
>>>>>>
>>>>>>
>>>>>> presto> SELECT map_concat(map(array[1],array[2]), map(array[1],array[3])); // Presto 0.212
>>>>>>  _col0
>>>>>> -------
>>>>>>  {1=3}
>>>>>>
>>>>>>
>>>>>> Bests,
>>>>>> Dongjoon.
>>>>>>
>>>>>>
>>>>>> On Wed, Oct 24, 2018 at 5:17 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Dongjoon,
>>>>>>>
>>>>>>> Thanks for reporting it! This is indeed a bug that needs to be fixed.
>>>>>>>
>>>>>>> The problem is not about the function `map_filter`, but about how map type values are created in Spark when there are duplicated keys.
>>>>>>>
>>>>>>> In programming languages like Java/Scala, when creating a map, the later entry wins, e.g. in Scala:
>>>>>>> scala> Map(1 -> 2, 1 -> 3)
>>>>>>> res0: scala.collection.immutable.Map[Int,Int] = Map(1 -> 3)
>>>>>>>
>>>>>>> scala> Map(1 -> 2, 1 -> 3).get(1)
>>>>>>> res1: Option[Int] = Some(3)
>>>>>>>
>>>>>>> However, in Spark, the earlier entry wins:
>>>>>>> scala> sql("SELECT map(1,2,1,3)[1]").show
>>>>>>> +------------------+
>>>>>>> |map(1, 2, 1, 3)[1]|
>>>>>>> +------------------+
>>>>>>> |                 2|
>>>>>>> +------------------+
>>>>>>>
>>>>>>> So for Spark users, Map(1 -> 2, 1 -> 3) should be equal to Map(1 -> 2).
>>>>>>>
>>>>>>> But there are several bugs in Spark:
>>>>>>>
>>>>>>> scala> sql("SELECT map(1,2,1,3)").show
>>>>>>> +----------------+
>>>>>>> | map(1, 2, 1, 3)|
>>>>>>> +----------------+
>>>>>>> |[1 -> 2, 1 -> 3]|
>>>>>>> +----------------+
>>>>>>> The displayed string of map values has a bug and we should deduplicate the entries. This is tracked by SPARK-25824.
>>>>>>>
>>>>>>>
>>>>>>> scala> sql("CREATE TABLE t AS SELECT map(1,2,1,3) as map")
>>>>>>> res11: org.apache.spark.sql.DataFrame = []
>>>>>>>
>>>>>>> scala> sql("select * from t").show
>>>>>>> +--------+
>>>>>>> |     map|
>>>>>>> +--------+
>>>>>>> |[1 -> 3]|
>>>>>>> +--------+
>>>>>>> The Hive map value converter has a bug; we should respect the "earlier entry wins" semantic. No ticket yet.
>>>>>>>
>>>>>>>
>>>>>>> scala> sql("select map(1,2,1,3)").collect
>>>>>>> res14: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
>>>>>>> The same bug happens in `collect`. No ticket yet.
>>>>>>>
>>>>>>> I'll create tickets and list all of them as known issues in 2.4.0.
>>>>>>>
>>>>>>> It's arguable whether the "earlier entry wins" semantic is reasonable. Fixing it is a behavior change and we can only apply it to the master branch.
>>>>>>>
>>>>>>> Going back to https://issues.apache.org/jira/browse/SPARK-25823, it's just a symptom of the Hive map value converter bug. I think it's a non-blocker.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Wenchen
>>>>>>>
>>>>>>> On Thu, Oct 25, 2018 at 5:31 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi, All.
>>>>>>>>
>>>>>>>> -0 due to the following issue. From Spark 2.4.0, users may get an incorrect result when they use the new `map_filter` with the `map_concat` function.
>>>>>>>>
>>>>>>>> https://issues.apache.org/jira/browse/SPARK-25823
>>>>>>>>
>>>>>>>> SPARK-25823 aims only to fix the data correctness issue from `map_filter`.
>>>>>>>>
>>>>>>>> PMC members are able to lower the priority. As always, I respect the PMC's decision.
>>>>>>>>
>>>>>>>> I'm sending this email to draw more attention to this bug and to warn the community about the new feature's limitation.
>>>>>>>>
>>>>>>>> Bests,
>>>>>>>> Dongjoon.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Oct 22, 2018 at 10:42 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Please vote on releasing the following candidate as Apache Spark version 2.4.0.
>>>>>>>>>
>>>>>>>>> The vote is open until October 26 PST and passes if a majority of +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>>>>>>>
>>>>>>>>> [ ] +1 Release this package as Apache Spark 2.4.0
>>>>>>>>> [ ] -1 Do not release this package because ...
>>>>>>>>>
>>>>>>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>>>>>>
>>>>>>>>> The tag to be voted on is v2.4.0-rc4 (commit e69e2bfa486d8d3b9d203b96ca9c0f37c2b6cabe):
>>>>>>>>> https://github.com/apache/spark/tree/v2.4.0-rc4
>>>>>>>>>
>>>>>>>>> The release files, including signatures, digests, etc.
>>>>>>>>> can be found at:
>>>>>>>>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/
>>>>>>>>>
>>>>>>>>> Signatures used for Spark RCs can be found in this file:
>>>>>>>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>>>>>>
>>>>>>>>> The staging repository for this release can be found at:
>>>>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1290
>>>>>>>>>
>>>>>>>>> The documentation corresponding to this release can be found at:
>>>>>>>>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-docs/
>>>>>>>>>
>>>>>>>>> The list of bug fixes going into 2.4.0 can be found at the following URL:
>>>>>>>>> https://issues.apache.org/jira/projects/SPARK/versions/12342385
>>>>>>>>>
>>>>>>>>> FAQ
>>>>>>>>>
>>>>>>>>> =========================
>>>>>>>>> How can I help test this release?
>>>>>>>>> =========================
>>>>>>>>>
>>>>>>>>> If you are a Spark user, you can help us test this release by taking an existing Spark workload and running it on this release candidate, then reporting any regressions.
>>>>>>>>>
>>>>>>>>> If you're working in PySpark, you can set up a virtual env, install the current RC, and see if anything important breaks. In Java/Scala, you can add the staging repository to your project's resolvers and test with the RC (make sure to clean up the artifact cache before/after so you don't end up building with an out-of-date RC going forward).
>>>>>>>>>
>>>>>>>>> ===========================================
>>>>>>>>> What should happen to JIRA tickets still targeting 2.4.0?
>>>>>>>>> ===========================================
>>>>>>>>>
>>>>>>>>> The current list of open tickets targeted at 2.4.0 can be found at:
>>>>>>>>> https://issues.apache.org/jira/projects/SPARK and search for "Target Version/s" = 2.4.0
>>>>>>>>>
>>>>>>>>> Committers should look at those and triage. Extremely important bug fixes, documentation, and API tweaks that impact compatibility should be worked on immediately. Everything else, please retarget to an appropriate release.
>>>>>>>>>
>>>>>>>>> ==================
>>>>>>>>> But my bug isn't fixed?
>>>>>>>>> ==================
>>>>>>>>>
>>>>>>>>> In order to make timely releases, we will typically not hold the release unless the bug in question is a regression from the previous release. That being said, if there is something which is a regression that has not been correctly targeted, please ping me or a committer to help target the issue.
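For anyone following the Java/Scala testing step described in the quoted vote email, a minimal sbt sketch along these lines should pull the staged artifacts. The staging repository URL is the one listed in the email; the resolver name and the assumption that the staged artifacts are published as version 2.4.0 under the usual org.apache.spark coordinates are mine, so adjust them if they differ:

// build.sbt -- sketch for testing an existing workload against the RC
// (resolver name and the "2.4.0" version string are assumptions, not confirmed)
resolvers += "Apache Spark 2.4.0 RC4 staging" at "https://repository.apache.org/content/repositories/orgapachespark-1290/"

// spark-sql pulls in spark-core transitively; add other modules as needed
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0"

As the FAQ in the vote email suggests, clean the local artifact cache before and after so a stale RC build is not picked up later.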