liangguoning created SPARK-11722:
------------------------------------
Summary: RDDs can differ between the original one and the
save-out-then-read-in one
Key: SPARK-11722
URL: https://issues.apache.org/jira/browse/SPARK-11722
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 1.5.1
Environment: redhat6.4 64bit; standalone-cluster ; 3 machines
Reporter: liangguoning
I found a bug in PySpark.
I ran some operations to create an RDD A, but found that its data differ
from the copy saved to HDFS and read back, called B.
I also printed all the detailed data inside my function and discovered that A
indeed contains one record that differs from B.
That record causes a different result under the same functions.
I got B in two steps: A.saveAsTextFile, then sc.textFile.
I also checked the raw data and found that B is the correct RDD.
---
I also built another RDD, A2, through sc.parallelize(A.collect()), and it gave
the same result as A.
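
To pin down which record differs, the collected contents of A and B can be compared as multisets of text lines. The following is only a minimal sketch, not code from my job: diff_records is a hypothetical helper name, the sample records are made up, and it assumes that saveAsTextFile writes the string form of each record, so both sides are compared via str().

```python
from collections import Counter


def diff_records(original, reloaded):
    """Return (only_in_original, only_in_reloaded) as Counters of text lines."""
    # Assumption: the saved file holds the string form of each record,
    # so both sides are normalised with str() before comparing.
    left = Counter(str(r) for r in original)
    right = Counter(str(r) for r in reloaded)
    # Counter subtraction keeps only positive counts, so duplicates are respected.
    return left - right, right - left


# With a live SparkContext this might be called as:
#   only_a, only_b = diff_records(A.collect(), B.collect())
# Made-up sample data standing in for A (tuples) and B (lines read back):
only_a, only_b = diff_records([(1, "x"), (2, "y")], ["(1, 'x')", "(2, 'z')"])
print(only_a)
print(only_b)
```

Both Counters being empty would mean the two RDDs hold the same lines; any remaining entries are the records that differ.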
Thanks
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)