[GitHub] spark pull request #18411: [SPARK-18004][SQL] Make sure the date or timestam...

SharpRay Tue, 27 Jun 2017 19:22:12 -0700

Github user SharpRay commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18411#discussion_r124439540
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/jdbc/OracleDialect.scala ---
    @@ -68,5 +69,13 @@ private case object OracleDialect extends JdbcDialect {
         case _ => None
       }
     
    +  override def beforeFetch(connection: Connection, properties: Map[String, String]): Unit = {
    +    // Set general date and timestamp format before query.
    +    val stmt = connection.createStatement()
    +    stmt.execute("alter session set NLS_DATE_FORMAT = 'YYYY-MM-DD'")
    +    stmt.execute("alter session set NLS_TIMESTAMP_FORMAT = 'YYYY-MM-DD HH24:MI:SS.FF'")
    --- End diff --
    
    I am not sure that making `compileValue` extensible per dialect would
    solve this problem.
    
    In my opinion, the current changes in OracleDialect's `beforeFetch`
    function should not break existing applications. Consider the following SQL:
    
    `select * from test_tm where ts < cast('2017-01-01' as timestamp)`
    
    and the physical plan is:
    
    `== Physical Plan ==
    *Scan JDBCRelation(test_tm) [numPartitions=1] [TS#0,TSTZ#1,DT#2] PushedFilters: [*IsNotNull(TS), *LessThan(TS,2017-06-27 21:22:35.0)], ReadSchema: struct<TS:timestamp,TSTZ:timestamp,DT:date>`
    
    The plan shows that the `LessThan` filter is pushed down to the underlying
    data source, i.e. Oracle. But when you run `collect` on this DataFrame,
    the output is:
    
    `17/06/27 21:54:53 ERROR executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0)
    java.sql.SQLDataException: ORA-01843: not a valid month
        at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:447)
        at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:396)
        at oracle.jdbc.driver.T4C8Oall.processError(T4C8Oall.java:951)
        at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:513)
        at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:227)
        at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:531)
        at oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:208)
        at oracle.jdbc.driver.T4CPreparedStatement.executeForDescribe(T4CPreparedStatement.java:886)
        at oracle.jdbc.driver.OracleStatement.executeMaybeDescribe(OracleStatement.java:1175)
        at oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1296)
        at oracle.jdbc.driver.OraclePreparedStatement.executeInternal(OraclePreparedStatement.java:3613)
        at oracle.jdbc.driver.OraclePreparedStatement.executeQuery(OraclePreparedStatement.java:3657)
        at oracle.jdbc.driver.OraclePreparedStatementWrapper.executeQuery(OraclePreparedStatementWrapper.java:1495)
        at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:301)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:108)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:341)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)`
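    
    To make the failure concrete, here is a rough sketch of how a pushed-down
    `LessThan` filter could end up as a quoted string literal in the WHERE
    clause sent to Oracle. The object and helper names are hypothetical and
    this is not Spark's actual code; it only assumes the value is rendered
    with `java.sql.Timestamp.toString`, which matches the literal shown in
    the plan.

```scala
import java.sql.Timestamp

// Hypothetical sketch, not Spark's implementation.
object PushedFilterSketch {
  // Render a filter value as a SQL literal. Timestamps are assumed to be
  // quoted using Timestamp.toString (yyyy-mm-dd hh:mm:ss.f...).
  def compileValue(value: Any): String = value match {
    case ts: Timestamp => "'" + ts.toString + "'"
    case s: String     => "'" + s.replace("'", "''") + "'"
    case other         => other.toString
  }

  // Build the WHERE-clause fragment for a LessThan filter.
  def lessThan(column: String, value: Any): String =
    s"$column < ${compileValue(value)}"

  def main(args: Array[String]): Unit = {
    val ts = Timestamp.valueOf("2017-06-27 21:22:35")
    println(lessThan("TS", ts)) // prints: TS < '2017-06-27 21:22:35.0'
  }
}
```

    If the literal reaches Oracle as a plain string like this, Oracle has to
    convert it implicitly using the session's NLS_TIMESTAMP_FORMAT.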
    
    The ORA-01843 error occurs because the default timestamp format in Oracle
    is `DD-MON-RR HH.MI.SSXFF AM`, which is not compatible with the
    java.sql.Timestamp format `yyyy-MM-dd HH:mm:ss.SS`. So I added the NLS*
    settings to fix this problem. These changes won't break the pushdown of
    timestamp/date-related predicates; they make those predicates execute
    correctly in Oracle.
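    
    For illustration, the session setup could be sketched as below. The
    object and method names are hypothetical; the two statements are the ones
    from the diff above. The quoted part of the diff does not show whether
    the statement is closed, so closing it in `finally` here is my own
    assumption about how one might write it, not a description of the PR.

```scala
import java.sql.{Connection, Statement}

// Hypothetical sketch of the session setup, not the code in the PR.
object OracleSessionSketch {
  // The two ALTER SESSION statements quoted in the diff.
  val sessionStatements: Seq[String] = Seq(
    "alter session set NLS_DATE_FORMAT = 'YYYY-MM-DD'",
    "alter session set NLS_TIMESTAMP_FORMAT = 'YYYY-MM-DD HH24:MI:SS.FF'"
  )

  // Apply the formats on the given connection, always releasing the statement.
  def applySessionFormats(connection: Connection): Unit = {
    val stmt: Statement = connection.createStatement()
    try {
      sessionStatements.foreach(sql => stmt.execute(sql))
    } finally {
      stmt.close()
    }
  }
}
```

    The settings apply only to the session of that connection, so they would
    need to run on each connection before the pushed-down query is executed.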



