[
https://issues.apache.org/jira/browse/SPARK-12030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15030633#comment-15030633
]
Maciej Bryński edited comment on SPARK-12030 at 11/28/15 7:26 PM:
------------------------------------------------------------------
Spark 1.6.0-preview2, started in standalone mode on 3 hosts (128 GB RAM,
20/40 cores each), with the following spark-defaults.conf:
{code}
spark.master spark://somehost:7077
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 20g
spark.executor.extraJavaOptions -XX:-UseGCOverheadLimit -XX:+UseG1GC
-Dhdp.version=current -XX:-UseCompressedOops
spark.driver.extraJavaOptions -XX:-UseGCOverheadLimit -XX:+UseG1GC
-Dhdp.version=current -XX:-UseCompressedOops
spark.executor.memory 30g
spark.storage.memoryFraction 0.2
spark.shuffle.memoryFraction 0.4
spark.driver.maxResultSize 4g
spark.cores.max 30
spark.executor.extraClassPath
spark/mysql-connector-java-5.1.35-bin.jar:spark/spark-csv_2.10-1.3.0-SNAPSHOT.jar:spark/commons-csv-1.2.jar:spark/aerospike-spark-0.3-SNAPSHOT-jar-with-dependencies.jar
spark.driver.extraClassPath
spark/mysql-connector-java-5.1.35-bin.jar:spark/spark-csv_2.10-1.3.0-SNAPSHOT.jar:spark/commons-csv-1.2.jar:spark/aerospike-spark-0.3-SNAPSHOT-jar-with-dependencies.jar
spark.kryoserializer.buffer.max 1g
spark.default.parallelism 400
spark.sql.autoBroadcastJoinThreshold 0
spark.yarn.am.extraJavaOptions -Dhdp.version=current
spark.executor.instances 10
spark.yarn.queue dwh
spark.python.worker.reuse false
spark.sql.shuffle.partitions 400
spark.io.compression.codec lz4
spark.eventLog.enabled true
spark.eventLog.dir history/
{code}
> Incorrect results when aggregate joined data
> --------------------------------------------
>
> Key: SPARK-12030
> URL: https://issues.apache.org/jira/browse/SPARK-12030
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.6.0
> Reporter: Maciej Bryński
> Priority: Critical
> Attachments: t1.tar.gz, t2.tar.gz
>
>
> I have the following issue.
> I created 2 DataFrames from JDBC (MySQL) and joined them (t1 has a foreign key fk1 to t2):
> {code}
> t1 = sqlCtx.read.jdbc("jdbc:mysql://XXX", "t1", "id1", 0, size1, 200).cache()
> t2 = sqlCtx.read.jdbc("jdbc:mysql://XXX", "t2", "id2", 0, size1, 200).cache()
> joined = t1.join(t2, t1.fk1 == t2.id2, "left_outer")
> {code}
> Important: both tables are cached, so the results should be the same on every
> query.
> Then I did some counts:
> {code}
> t1.count() -> 5900729
> t1.registerTempTable("t1")
> sqlCtx.sql("select distinct(id1) from t1").count() -> 5900729
> t2.count() -> 54298
> joined.count() -> 5900729
> {code}
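As a sanity check of the expectation here (no Spark required): a left outer join keeps every left-side row, so the set of distinct id1 values in the joined result must equal that of t1. A minimal sketch with toy, hypothetical data:

```python
# Toy left outer join over plain Python structures; all data is hypothetical.
def left_outer_join(left, right):
    # left: list of (id1, fk1) rows; right: dict mapping id2 -> payload
    return [(id1, fk1, right.get(fk1)) for id1, fk1 in left]  # None = no match

t1 = [(1, 10), (2, 10), (3, 99)]   # (id1, fk1); fk1=99 has no match in t2
t2 = {10: "a", 20: "b"}            # id2 -> payload

joined = left_outer_join(t1, t2)

# The invariant the report relies on: distinct id1 is preserved by the join.
assert {row[0] for row in joined} == {row[0] for row in t1}
```

Spark's left_outer join promises exactly this, which is why a distinct count between 5899000 and 5900000 instead of 5900729 indicates dropped rows rather than expected join behavior.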
> And here the magic begins: I counted the distinct id1 values in the joined table.
> {code}
> joined.registerTempTable("joined")
> sqlCtx.sql("select distinct(id1) from joined").count()
> {code}
> The results vary *(they differ on every run)* between 5899000 and
> 5900000, but are never equal to 5900729.
> In addition, I ran more queries:
> {code}
> sqlCtx.sql("select id1, count(*) from joined group by id1 having count(*) >
> 1").collect()
> {code}
> This returns some id1 values, yet for any such value (result in the query below), the following query returns only *1* row:
> {code}
> len(sqlCtx.sql("select * from joined where id1 = {}".format(result)).collect())
> {code}
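The contradiction can be restated without Spark: any key that `group by id1 having count(*) > 1` reports must, by definition, match more than one row in the same table. A toy sketch with hypothetical values:

```python
from collections import Counter

# Hypothetical stand-in for the id1 column of the joined table.
joined_id1 = [1, 2, 2, 3, 3, 3]

counts = Counter(joined_id1)
# Equivalent of: select id1, count(*) from joined group by id1 having count(*) > 1
duplicated = sorted(k for k, n in counts.items() if n > 1)

# Equivalent of: len(select * from joined where id1 = k) for each duplicated key k
for k in duplicated:
    assert sum(1 for v in joined_id1 if v == k) == counts[k] > 1
```

So getting *1* back for a key the HAVING query just reported is internally inconsistent, pointing at the same row-loss symptom as the varying distinct counts.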
> What's wrong?
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)