Nannan Yu created SPARK-22706:
---------------------------------
Summary: Cannot read Teradata CLOB column type correctly in Spark
2.2.0
Key: SPARK-22706
URL: https://issues.apache.org/jira/browse/SPARK-22706
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.2.0
Environment: Spark 2.2.0
Teradata driver version: 16.20.00.02 (terajdbc4.jar)
I am testing with Mac
Reporter: Nannan Yu
When I try to read the table from Teradata dababase with URL like
"jdbc:teradata://IP/Database".
It works well for most of the column datatype.
But for the CLOB datatype, it cannot read the correct contest:
If I am using Spark: it will get some result like this:
-----------------------------------------------------------------------
DT_INT | DT_CLOB
------------------------------------------------------------------------------
1 | com.teradata.jdbc.jdk6_SQL_Clob@3cc9ee67
2 | com.teradata.jdbc.jdk6_SQL_Clob@7cf9ecad
--------------------------------------------------------------------------------
Actually, it should show the real Character Large Object as this:
-----------------------------------------------------------------------
DT_INT | DT_CLOB
------------------------------------------------------------------------------
1 | JESSIE
2 | CATHY
--------------------------------------------------------------------------------
Note: This issue is not for all the 'CLOB' column datatype, for example,
DB2-CLOB datatype works fine with Spark.
For teradata, we have this issue maybe related to this reason:
https://www.dwhpro.com/teradata-clob/
" The data of a CLOB column is stored {color:#f6c342}as a separate
sub-table{color}. " for TeraData CLOB?
>From the Spark SourceCode in the file:
>org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala
* case StringType* =>
(rs: ResultSet, row: InternalRow, pos: Int) =>
// TODO(davies): use getBytes for better performance, if the encoding
is UTF-8
row.update(pos, UTF8String.fromString(rs.getString(pos + 1)))
It seems we cannot use this way for the CLOB column from Teradata.
Actually, I tried to use some other way to read the CLOB data from Teradata, it
works well. Also it also works for the DB2-CLOB data.
case StringType =>
(rs: ResultSet, row: InternalRow, pos: Int) => {
val sqlType = rs.getMetaData.getColumnType(pos + 1)
if (java.sql.Types.CLOB == sqlType) {
val clobData = rs.getClob(pos + 1);
if (clobData != null)
row.update(pos, UTF8String.fromString(clobData.getSubString(1,
clobData.length().toInt)))
else
row.update(pos, UTF8String.fromString(rs.getString(pos + 1)))
}
else {
// TODO(davies): use getBytes for better performance, if the encoding
is UTF-8
row.update(pos, UTF8String.fromString(rs.getString(pos + 1)))
}
}
Should we handle the CLOB datatype differently with the normal 'StringType'?
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]