[jira] [Comment Edited] (SPARK-20489) Different results in local mode and yarn mode when working with dates (race condition with SimpleDateFormat?)

Rick Moritz (JIRA) Fri, 28 Apr 2017 01:18:39 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-20489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15988400#comment-15988400
 ]


Rick Moritz edited comment on SPARK-20489 at 4/28/17 8:17 AM:
--------------------------------------------------------------

The timezone thing would make sense, if there were consistently no results - 
but with more data, there are *some* results - more or less determined by the 
number of different timestamps in the data
. This leads me to believe that [~frosner] may be on to something.

 Also note, that the timestamp/date matching is made in the same cluster, and 
at most the master machine had it's timezone adapted, the executors are all in 
the same TZ.

Anyway, here is the output:
{{
+---+-----------------------+----------+
|id |loadDTS                |timestamp |
+---+-----------------------+----------+
|1  |2017-04-27T00:00:00.000|1493269200|
|1  |2017-04-26T00:00:00.000|1493269200|
|3  |2017-04-26T00:00:00.000|1493269200|
|3  |2017-04-27T00:00:00.000|1493269200|
|2  |2017-04-27T00:00:00.000|1493269200|
|2  |2017-04-26T00:00:00.000|1493269200|
+---+-----------------------+----------+
}}


was (Author: rpcmoritz):
The timezone thing would make sense, if there were consistently no results - 
but with more data, there are *some* results - more or less determined by the 
number of different timestamps in the data
. This leads me to believe that [~frosner] may be on to something.

 Also note, that the timestamp/date matching is made in the same cluster, and 
at most the master machine had it's timezone adapted, the executors are all in 
the same TZ.

Anyway, here is the output:
+---+-----------------------+----------+
|id |loadDTS                |timestamp |
+---+-----------------------+----------+
|1  |2017-04-27T00:00:00.000|1493269200|
|1  |2017-04-26T00:00:00.000|1493269200|
|3  |2017-04-26T00:00:00.000|1493269200|
|3  |2017-04-27T00:00:00.000|1493269200|
|2  |2017-04-27T00:00:00.000|1493269200|
|2  |2017-04-26T00:00:00.000|1493269200|
+---+-----------------------+----------+

> Different results in local mode and yarn mode when working with dates (race 
> condition with SimpleDateFormat?)
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-20489
>                 URL: https://issues.apache.org/jira/browse/SPARK-20489
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0, 2.0.1, 2.0.2
>         Environment: yarn-client mode in Zeppelin, Cloudera 
> Spark2-distribution
>            Reporter: Rick Moritz
>            Priority: Critical
>
> Running the following code (in Zeppelin, or spark-shell), I get different 
> results, depending on whether I am using local[*] -mode or yarn-client mode:
> {code:title=test case|borderStyle=solid}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> import spark.implicits._
> val counter = 1 to 2
> val size = 1 to 3
> val sampleText = spark.createDataFrame(
>     sc.parallelize(size)
>     .map(Row(_)),
>     StructType(Array(StructField("id", IntegerType, nullable=false))
>         )
>     )
>     .withColumn("loadDTS",lit("2017-04-25T10:45:02.2"))
>     
> val rddList = counter.map(
>             count => sampleText
>             .withColumn("loadDTS2", 
> date_format(date_add(col("loadDTS"),count),"yyyy-MM-dd'T'HH:mm:ss.SSS"))
>             .drop(col("loadDTS"))
>             .withColumnRenamed("loadDTS2","loadDTS")
>             .coalesce(4)
>             .rdd
>         )
> val resultText = spark.createDataFrame(
>     spark.sparkContext.union(rddList),
>     sampleText.schema
> )
> val testGrouped = resultText.groupBy("id")
> val timestamps = testGrouped.agg(
>     max(unix_timestamp($"loadDTS", "yyyy-MM-dd'T'HH:mm:ss.SSS")) as 
> "timestamp"
> )
> val loadDateResult = resultText.join(timestamps, "id")
> val filteredresult = loadDateResult.filter($"timestamp" === 
> unix_timestamp($"loadDTS", "yyyy-MM-dd'T'HH:mm:ss.SSS"))
> filteredresult.count
> {code}
> The expected result, *3* is what I obtain in local mode, but as soon as I run 
> fully distributed, I get *0*. If Increase size to {{1 to 32000}}, I do get 
> some results (depending on the size of counter) - none of which makes any 
> sense.
> Up to the application of the last filter, at first glance everything looks 
> okay, but then something goes wrong. Potentially this is due to lingering 
> re-use of SimpleDateFormats, but I can't get it to happen in a 
> non-distributed mode. The generated execution plan is the same in each case, 
> as expected.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Comment Edited] (SPARK-20489) Different results in local mode and yarn mode when working with dates (race condition with SimpleDateFormat?)

Reply via email to