Michael Armbrust created SPARK-1994:
---------------------------------------
Summary: Weird data corruption bug when running Spark SQL on data
in HDFS
Key: SPARK-1994
URL: https://issues.apache.org/jira/browse/SPARK-1994
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.0.0
Reporter: Michael Armbrust
Priority: Blocker
[~adav] has a full reproduction, but in short he has found a case where the
first run of a query returns corrupted results and the second run does not.
The corruption does not occur when reading from HDFS a second time...
{code}
sql("SELECT lang, COUNT(*) AS cnt FROM tweetTable GROUP BY lang ORDER BY cnt DESC").collect.foreach(println)
[bg,16636]
[16266,16266]
[16223,16223]
[16161,16161]
[16047,16047]
[lt,11405]
[hu,11380]
[el,10845]
[da,10289]
[fi,10261]
[9897,9897]
[9765,9765]
[9751,9751]
{code}
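In the output above, the rows that look corrupted are the ones where the lang
column repeats the count (e.g. [16266,16266]). As a rough aid for comparing
runs, a minimal sketch of a check for that pattern could look like the
following. It assumes the collected rows have already been converted to
(lang, cnt) pairs; CorruptionCheck, isSuspicious and suspiciousRows are
hypothetical names and not part of Spark.

```scala
// Hypothetical helper, not part of Spark: flags rows whose grouping key
// appears to have been overwritten by the aggregate value.
object CorruptionCheck {
  // A row is suspicious when the lang column parses as a number
  // and that number equals the count in the second column.
  def isSuspicious(lang: String, cnt: Long): Boolean =
    scala.util.Try(lang.toLong).toOption.exists(_ == cnt)

  // Keep only the suspicious rows, preserving their original order.
  def suspiciousRows(rows: Seq[(String, Long)]): Seq[(String, Long)] =
    rows.filter { case (lang, cnt) => isSuspicious(lang, cnt) }
}
```

Running suspiciousRows on the first and second run of the query should show
whether the pattern is present only in the first run. A legitimate language
code would never parse as a number, so false positives seem unlikely, but
this only detects this one symptom and not corruption in general.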
--
This message was sent by Atlassian JIRA
(v6.2#6252)