shengyao piao created SPARK-21581:
-------------------------------------
Summary: Spark 2.x distinct return incorrect result
Key: SPARK-21581
URL: https://issues.apache.org/jira/browse/SPARK-21581
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.2.0, 2.1.0, 2.0.0
Reporter: shengyao piao
Hi all,
I'm using Spark 2.x on CDH 5.11.
I have a JSON file as follows.
・sample.json
{"url": "http://example.hoge/staff1", "name": "staff1", "salary":600.0}
{"url": "http://example.hoge/staff2", "name": "staff2", "salary":700}
{"url": "http://example.hoge/staff3", "name": "staff3", "salary":800}
{"url": "http://example.hoge/staff4", "name": "staff4", "salary":900}
{"url": "http://example.hoge/staff5", "name": "staff5", "salary":1000.0}
{"url": "http://example.hoge/staff6", "name": "staff6", "salary":""}
{"url": "http://example.hoge/staff7", "name": "staff7", "salary":""}
{"url": "http://example.hoge/staff8", "name": "staff8", "salary":""}
{"url": "http://example.hoge/staff9", "name": "staff9", "salary":""}
{"url": "http://example.hoge/staff10", "name": "staff10", "salary":""}
I then read this file and call distinct on it.
・spark code
val s=spark.read.json("sample.json")
s.count
res13: Long = 10
s.distinct.count
res14: Long = 6 <- It should be 10
I know the incorrect result is caused by the mixed types in the salary field (numbers and empty strings).
But when I run the same code in Spark 1.6, the result is 10.
So I think this is a bug in Spark 2.x.
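To help narrow this down, here is a minimal sketch of how the inferred schema and the parsed rows could be inspected, followed by one possible workaround. It assumes spark-shell (where `spark` is predefined) and that sample.json is in the working directory. The names `inferred`, `allStrings`, and `typed` are made up for illustration, and reading every field as a string is only an assumption about what avoids the problem, not a confirmed fix.

```scala
// Run inside spark-shell, where `spark` (a SparkSession) is already defined.
// Assumes sample.json (the file shown above) is in the working directory.
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// 1. Inspect which type schema inference chose for salary and what the
//    parsed rows actually contain.
val inferred = spark.read.json("sample.json")
inferred.printSchema()
inferred.show(false)

// 2. Possible workaround (an assumption, not a confirmed fix): declare every
//    field as a string so that 600.0, 700 and "" all fit the declared type.
val allStrings = StructType(Seq(
  StructField("url", StringType, nullable = true),
  StructField("name", StringType, nullable = true),
  StructField("salary", StringType, nullable = true)
))
val typed = spark.read.schema(allStrings).json("sample.json")

typed.count
typed.distinct.count
```

If the counts agree once the schema is supplied explicitly, that would point at schema inference or parsing of the mixed salary values rather than at distinct itself.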