Michael Armbrust created SPARK-1994:
---------------------------------------

             Summary: Weird data corruption bug when running Spark SQL on data in HDFS
                 Key: SPARK-1994
                 URL: https://issues.apache.org/jira/browse/SPARK-1994
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.0.0
            Reporter: Michael Armbrust
            Priority: Blocker


[~adav] has a full reproduction, but in short he has found a case where the
first run of a query returns corrupted results and the second run does not.
The corruption does not recur when the data is read from HDFS a second time...

{code}
sql("SELECT lang, COUNT(*) AS cnt FROM tweetTable GROUP BY lang ORDER BY cnt DESC").collect.foreach(println)
[bg,16636]
[16266,16266]
[16223,16223]
[16161,16161]
[16047,16047]
[lt,11405]
[hu,11380]
[el,10845]
[da,10289]
[fi,10261]
[9897,9897]
[9765,9765]
[9751,9751]
{code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)
