For the first question, that is the `bin/spark-sql` result. I didn't check STS, but it should return the same result as `bin/spark-sql`.
> I think map_filter is implemented correctly. map(1,2,1,3) is actually
> map(1,2) according to the "earlier entry wins" semantic. I don't think
> this will change in 2.4.1.

For the second one, the `map_filter` issue is not about the `earlier entry
wins` semantic. Please see the following examples.

spark-sql> SELECT m, map_filter(m, (k,v) -> v=2) c FROM (SELECT map_concat(map(1,2), map(1,3)) m);
{1:3}	{1:2}
spark-sql> SELECT m, map_filter(m, (k,v) -> v=3) c FROM (SELECT map_concat(map(1,2), map(1,3)) m);
{1:3}	{1:3}
spark-sql> SELECT m, map_filter(m, (k,v) -> v=4) c FROM (SELECT map_concat(map(1,2), map(1,3)) m);
{1:3}	{}

In other words, `map_filter` behaves like a filter pushed down into the
map's underlying entries, while users assume that `map_filter` works on top
of the result of `m`. This is a function semantics issue.

On Wed, Oct 24, 2018 at 6:06 PM Wenchen Fan <cloud0...@gmail.com> wrote:

> > spark-sql> select map(1,2,1,3); // Spark 2.4.0 RC4
> > {1:3}
>
> Are you running in the thrift-server? Then maybe this is caused by the bug
> in `Dataset.collect` as I mentioned above.
>
> I think map_filter is implemented correctly. map(1,2,1,3) is actually
> map(1,2) according to the "earlier entry wins" semantic. I don't think
> this will change in 2.4.1.
>
> On Thu, Oct 25, 2018 at 8:56 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
> wrote:
>
>> Thank you for the follow-ups.
>>
>> Then, in the end, Spark 2.4.1 will return `{1:2}`, differently from all
>> of the following (including Spark/Scala)?
>>
>> I hoped to fix `map_filter`, but now Spark looks inconsistent in many
>> ways.
>>
>> scala> sql("select map(1,2,1,3)").show // Spark 2.2.2
>> +---------------+
>> |map(1, 2, 1, 3)|
>> +---------------+
>> |    Map(1 -> 3)|
>> +---------------+
>>
>> spark-sql> select map(1,2,1,3); // Spark 2.4.0 RC4
>> {1:3}
>>
>> hive> select map(1,2,1,3); // Hive 1.2.2
>> OK
>> {1:3}
>>
>> presto> SELECT map_concat(map(array[1],array[2]),
>> map(array[1],array[3])); // Presto 0.212
>>  _col0
>> -------
>>  {1=3}
>>
>> Bests,
>> Dongjoon.
>>
>> On Wed, Oct 24, 2018 at 5:17 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>>> Hi Dongjoon,
>>>
>>> Thanks for reporting it! This is indeed a bug that needs to be fixed.
>>>
>>> The problem is not about the function `map_filter`, but about how
>>> map-type values are created in Spark when there are duplicated keys.
>>>
>>> In programming languages like Java/Scala, when creating a map, the
>>> later entry wins, e.g. in Scala:
>>>
>>> scala> Map(1 -> 2, 1 -> 3)
>>> res0: scala.collection.immutable.Map[Int,Int] = Map(1 -> 3)
>>>
>>> scala> Map(1 -> 2, 1 -> 3).get(1)
>>> res1: Option[Int] = Some(3)
>>>
>>> However, in Spark, the earlier entry wins:
>>>
>>> scala> sql("SELECT map(1,2,1,3)[1]").show
>>> +------------------+
>>> |map(1, 2, 1, 3)[1]|
>>> +------------------+
>>> |                 2|
>>> +------------------+
>>>
>>> So for Spark users, Map(1 -> 2, 1 -> 3) should be equal to Map(1 -> 2).
>>>
>>> But there are several bugs in Spark:
>>>
>>> scala> sql("SELECT map(1,2,1,3)").show
>>> +----------------+
>>> | map(1, 2, 1, 3)|
>>> +----------------+
>>> |[1 -> 2, 1 -> 3]|
>>> +----------------+
>>>
>>> The displayed string of map values has a bug and we should deduplicate
>>> the entries. This is tracked by SPARK-25824.
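To make the "earlier entry wins" deduplication described above concrete, here
is a minimal plain-Scala sketch; it is an added illustration of the semantic
Wenchen describes, not Spark's actual implementation:

// Sketch only: keep each key the first time it appears, scanning left to right.
def earlierEntryWins[K, V](entries: Seq[(K, V)]): Map[K, V] =
  entries.foldLeft(Map.empty[K, V]) {
    case (acc, (k, v)) => if (acc.contains(k)) acc else acc + (k -> v)
  }

earlierEntryWins(Seq(1 -> 2, 1 -> 3)) // Map(1 -> 2): what Spark's lookup implies
Map(1 -> 2, 1 -> 3)                   // Map(1 -> 3): Scala's later-entry-wins

The fold keeps the first value seen for each key, matching the
`map(1,2,1,3)[1] = 2` lookup behavior shown above.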
>>>
>>> scala> sql("CREATE TABLE t AS SELECT map(1,2,1,3) as map")
>>> res11: org.apache.spark.sql.DataFrame = []
>>>
>>> scala> sql("select * from t").show
>>> +--------+
>>> |     map|
>>> +--------+
>>> |[1 -> 3]|
>>> +--------+
>>>
>>> The Hive map value converter has a bug; we should respect the "earlier
>>> entry wins" semantic. No ticket yet.
>>>
>>> scala> sql("select map(1,2,1,3)").collect
>>> res14: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
>>>
>>> The same bug happens in `collect`. No ticket yet.
>>>
>>> I'll create tickets and list all of them as known issues in 2.4.0.
>>>
>>> It's arguable whether the "earlier entry wins" semantic is reasonable.
>>> Fixing it is a behavior change, and we can only apply it to the master
>>> branch.
>>>
>>> Going back to https://issues.apache.org/jira/browse/SPARK-25823, it's
>>> just a symptom of the Hive map value converter bug. I think it's a
>>> non-blocker.
>>>
>>> Thanks,
>>> Wenchen
>>>
>>> On Thu, Oct 25, 2018 at 5:31 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
>>> wrote:
>>>
>>>> Hi, All.
>>>>
>>>> -0 due to the following issue. From Spark 2.4.0, users may get an
>>>> incorrect result when they use the new `map_filter` with `map_concat`
>>>> functions.
>>>>
>>>> https://issues.apache.org/jira/browse/SPARK-25823
>>>>
>>>> SPARK-25823 only aims to fix the data correctness issue from
>>>> `map_filter`.
>>>>
>>>> PMC members are able to lower the priority. I always respect the PMC's
>>>> decision.
>>>>
>>>> I'm sending this email to draw more attention to this bug and to warn
>>>> the community about the new feature's limitation.
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>> On Mon, Oct 22, 2018 at 10:42 AM Wenchen Fan <cloud0...@gmail.com>
>>>> wrote:
>>>>
>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>> version 2.4.0.
>>>>>
>>>>> The vote is open until October 26 PST and passes if a majority of +1
>>>>> PMC votes are cast, with a minimum of 3 +1 votes.
>>>>>
>>>>> [ ] +1 Release this package as Apache Spark 2.4.0
>>>>> [ ] -1 Do not release this package because ...
>>>>>
>>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>>
>>>>> The tag to be voted on is v2.4.0-rc4 (commit
>>>>> e69e2bfa486d8d3b9d203b96ca9c0f37c2b6cabe):
>>>>> https://github.com/apache/spark/tree/v2.4.0-rc4
>>>>>
>>>>> The release files, including signatures, digests, etc. can be found at:
>>>>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/
>>>>>
>>>>> Signatures used for Spark RCs can be found in this file:
>>>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>>
>>>>> The staging repository for this release can be found at:
>>>>> https://repository.apache.org/content/repositories/orgapachespark-1290
>>>>>
>>>>> The documentation corresponding to this release can be found at:
>>>>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-docs/
>>>>>
>>>>> The list of bug fixes going into 2.4.0 can be found at the following
>>>>> URL:
>>>>> https://issues.apache.org/jira/projects/SPARK/versions/12342385
>>>>>
>>>>> FAQ
>>>>>
>>>>> =========================
>>>>> How can I help test this release?
>>>>> =========================
>>>>>
>>>>> If you are a Spark user, you can help us test this release by taking
>>>>> an existing Spark workload and running it on this release candidate,
>>>>> then reporting any regressions.
>>>>>
>>>>> If you're working in PySpark, you can set up a virtual env, install
>>>>> the current RC, and see if anything important breaks. In Java/Scala,
>>>>> you can add the staging repository to your project's resolvers and
>>>>> test with the RC (make sure to clean up the artifact cache
>>>>> before/after so you don't end up building with an out-of-date RC
>>>>> going forward).
>>>>>
>>>>> ===========================================
>>>>> What should happen to JIRA tickets still targeting 2.4.0?
>>>>> ===========================================
>>>>>
>>>>> The current list of open tickets targeted at 2.4.0 can be found at
>>>>> https://issues.apache.org/jira/projects/SPARK by searching for
>>>>> "Target Version/s" = 2.4.0.
>>>>>
>>>>> Committers should look at those and triage. Extremely important bug
>>>>> fixes, documentation, and API tweaks that impact compatibility should
>>>>> be worked on immediately. Everything else, please retarget to an
>>>>> appropriate release.
>>>>>
>>>>> ==================
>>>>> But my bug isn't fixed?
>>>>> ==================
>>>>>
>>>>> In order to make timely releases, we will typically not hold the
>>>>> release unless the bug in question is a regression from the previous
>>>>> release. That being said, if there is something that is a regression
>>>>> which has not been correctly targeted, please ping me or a committer
>>>>> to help target the issue.
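For Java/Scala users following the testing instructions above, a minimal
build.sbt sketch for resolving the RC artifacts could look like this (the
`spark-sql` dependency line is an example module choice, not a requirement):

// Resolve the RC artifacts from the staging repository listed in the vote email.
resolvers += "Spark 2.4.0 RC4 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1290"

// RC artifacts are staged under the final version number, 2.4.0.
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0"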