Jason Moore created SPARK-17195:
-----------------------------------
Summary: Dealing with JDBC column nullability when it is not reliable
Key: SPARK-17195
URL: https://issues.apache.org/jira/browse/SPARK-17195
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.0.0
Reporter: Jason Moore
Starting with Spark 2.0.0, a column's "nullable" property must be correct for
code generation to work properly. Before 2.0.0, marking a column as
nullable = false still allowed null values to be operated on, but now
it results in:
{noformat}
Caused by: java.lang.NullPointerException
    at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:210)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
{noformat}
I'm all for the change towards more rigid behavior (enforcing correct
input). But the problem I'm facing now is that when I use JDBC to read from a
Teradata server, the column nullability is often not correct (particularly when
sub-queries are involved).
This is the line in question:
https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L140
I'm trying to work out what would be the way forward for me on this. I know
that it's really the fault of the Teradata database server not returning the
correct schema, but I'll need to make Spark itself or my application resilient
to this behavior.
One of the Teradata JDBC Driver tech leads has told me that "when the
rsmd.getSchemaName and rsmd.getTableName methods return an empty zero-length
string, then the other metadata values may not be completely accurate" - so one
option could be to treat the nullability (at least) the same way as the
"unknown" case (as nullable = true). For reference, see the rest of our
discussion here:
http://forums.teradata.com/forum/connectivity/teradata-jdbc-driver-returns-the-wrong-schema-column-nullability
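A minimal sketch of that option, written against the plain JDBC API rather than Spark's internals. The class name NullabilityGuard and the method effectiveNullable are hypothetical (they are not part of Spark or the Teradata driver). Treating "both names empty" as the trigger is my reading of the driver lead's comment, and treating a null name like an empty one is an extra assumption:

```java
import java.sql.ResultSetMetaData;
import java.sql.SQLException;

/** Hypothetical helper; not part of Spark or the Teradata JDBC driver. */
final class NullabilityGuard {
    private NullabilityGuard() {}

    /**
     * Pure decision logic: if the driver reports neither a schema name nor a
     * table name for the column, assume the remaining metadata is unreliable
     * and fall back to nullable = true (the same answer as the "unknown" case).
     */
    static boolean effectiveNullable(String schemaName, String tableName, int jdbcNullability) {
        boolean metadataSuspect = isBlank(schemaName) && isBlank(tableName);
        if (metadataSuspect) {
            return true;
        }
        // Otherwise trust the driver: only columnNoNulls means "not nullable";
        // columnNullable and columnNullableUnknown both map to nullable = true.
        return jdbcNullability != ResultSetMetaData.columnNoNulls;
    }

    /** Convenience overload reading the values for a 1-based column index. */
    static boolean effectiveNullable(ResultSetMetaData rsmd, int column) throws SQLException {
        return effectiveNullable(
                rsmd.getSchemaName(column),
                rsmd.getTableName(column),
                rsmd.isNullable(column));
    }

    // Assumption: a null name is treated the same as a zero-length one.
    private static boolean isBlank(String s) {
        return s == null || s.isEmpty();
    }
}
```

With this rule, a column coming out of a sub-query (empty schema and table names) would be reported as nullable even if the driver claims columnNoNulls, while a column with a known source table keeps whatever the driver reports.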
Any other thoughts?