[jira] [Created] (DRILL-6994) TIMESTAMP type DOB column in Spark parquet is treated as VARBINARY in Drill

Khurram Faraaz (JIRA) Tue, 22 Jan 2019 15:54:33 -0800

Khurram Faraaz created DRILL-6994:
-------------------------------------

             Summary: TIMESTAMP type DOB column in Spark parquet is treated as 
VARBINARY in Drill
                 Key: DRILL-6994
                 URL: https://issues.apache.org/jira/browse/DRILL-6994
             Project: Apache Drill
          Issue Type: Bug
    Affects Versions: 1.14.0
            Reporter: Khurram Faraaz



A timestamp type column in a parquet file created from Spark is treated as 
VARBINARY by Drill 1.14.0., Trying to cast DOB column to DATE results in an 
Exception, although the monthOfYear field is in the allowed range.

Data used in the test
{noformat}
[test@md123 spark_data]# cat inferSchema_example.csv
Name,Department,years_of_experience,DOB
Sam,Software,5,1990-10-10
Alex,Data Analytics,3,1992-10-10
{noformat}

Create the parquet file using the above CSV file
{noformat}
[test@md123 bin]# ./spark-shell
19/01/22 21:21:34 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://md123.qa.lab:4040
Spark context available as 'sc' (master = local[*], app id = 
local-1548192099796).
Spark session available as 'spark'.
Welcome to
 ____ __
 / __/__ ___ _____/ /__
 _\ \/ _ \/ _ `/ __/ '_/
 /___/ .__/\_,_/_/ /_/\_\ version 2.3.1-mapr-SNAPSHOT
 /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import org.apache.spark.sql.\{DataFrame, SQLContext}
import org.apache.spark.sql.\{DataFrame, SQLContext}

scala> import org.apache.spark.\{SparkConf, SparkContext}
import org.apache.spark.\{SparkConf, SparkContext}

scala> val sqlContext: SQLContext = new SQLContext(sc)
warning: there was one deprecation warning; re-run with -deprecation for details
sqlContext: org.apache.spark.sql.SQLContext = 
org.apache.spark.sql.SQLContext@2e0163cb

scala> val df = 
sqlContext.read.format("com.databricks.spark.csv").option("header", 
"true").option("inferSchema", "true").load("/apps/inferSchema_example.csv")
df: org.apache.spark.sql.DataFrame = [Name: string, Department: string ... 2 
more fields]

scala> df.printSchema
test
 |-- Name: string (nullable = true)
 |-- Department: string (nullable = true)
 |-- years_of_experience: integer (nullable = true)
 |-- DOB: timestamp (nullable = true)

scala> df.write.parquet("/apps/infer_schema_example.parquet")

// Read the parquet file
scala> val data = sqlContext.read.parquet("/apps/infer_schema_example.parquet")
data: org.apache.spark.sql.DataFrame = [Name: string, Department: string ... 2 
more fields]

// Print the schema of the parquet file from Spark
scala> data.printSchema
test
 |-- Name: string (nullable = true)
 |-- Department: string (nullable = true)
 |-- years_of_experience: integer (nullable = true)
 |-- DOB: timestamp (nullable = true)

// Display the contents of parquet file on spark-shell
// register temp table and do a show on all records,to display.

scala> data.registerTempTable("employee")
warning: there was one deprecation warning; re-run with -deprecation for details

scala> val allrecords = sqlContext.sql("SELeCT * FROM employee")
allrecords: org.apache.spark.sql.DataFrame = [Name: string, Department: string 
... 2 more fields]

scala> allrecords.show()
+----+--------------+-------------------+-------------------+
|Name| Department|years_of_experience| DOB|
+----+--------------+-------------------+-------------------+
| Sam| Software| 5|1990-10-10 00:00:00|
|Alex|Data Analytics| 3|1992-10-10 00:00:00|
+----+--------------+-------------------+-------------------+
{noformat}

Querying the parquet file from Drill 1.14.0-mapr, results in the DOB column 
(timestamp type in Spark) being treated as VARBINARY.

{noformat}
apache drill 1.14.0-mapr
"a little sql for your nosql"
0: jdbc:drill:schema=dfs.tmp> select * from 
dfs.`/apps/infer_schema_example.parquet`;
+-------+-----------------+----------------------+--------------+
| Name | Department | years_of_experience | DOB |
+-------+-----------------+----------------------+--------------+
| Sam | Software | 5 | [B@2bef51f2 |
| Alex | Data Analytics | 3 | [B@650eab8 |
+-------+-----------------+----------------------+--------------+
2 rows selected (0.229 seconds)

// typeof(DOB) column returns a VARBINARY type, whereas the parquet schema in 
Spark for DOB: timestamp (nullable = true)

0: jdbc:drill:schema=dfs.tmp> select typeof(DOB) from 
dfs.`/apps/infer_schema_example.parquet`;
+------------+
| EXPR$0 |
+------------+
| VARBINARY |
| VARBINARY |
+------------+
2 rows selected (0.199 seconds)
{noformat}

// CAST to DATE type results in Exception, though the monthOfYear is in the 
range [1,12]

{noformat}
0: jdbc:drill:schema=dfs.tmp> select cast(DOB as DATE) from 
dfs.`/apps/infer_schema_example.parquet`;
Error: SYSTEM ERROR: IllegalFieldValueException: Value 0 for monthOfYear must 
be in the range [1,12]

Fragment 0:0

[Error Id: 536c67d8-77c4-4b36-8aec-743344141d31 on md123.qa.lab:31010] 
(state=,code=0)
{noformat}

Stack trace from drillbit.log

{noformat}
2019-01-22 22:13:27,334 [23b86a78-64fc-5873-87b5-7e95d9740e51:frag:0:0] ERROR 
o.a.d.e.w.fragment.FragmentExecutor - SYSTEM ERROR: IllegalFieldValueException: 
Value 0 for monthOfYear must be in the range [1,12]

Fragment 0:0

[Error Id: 536c67d8-77c4-4b36-8aec-743344141d31 on md123.qa.lab:31010]
org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: 
IllegalFieldValueException: Value 0 for monthOfYear must be in the range [1,12]

Fragment 0:0

[Error Id: 536c67d8-77c4-4b36-8aec-743344141d31 on md123.qa.lab:31010]
 at 
org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:633)
 ~[drill-common-1.14.0-mapr.jar:1.14.0-mapr]
 at 
org.apache.drill.exec.work.fragment.FragmentExecutor.sendFinalState(FragmentExecutor.java:361)
 [drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
 at 
org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup(FragmentExecutor.java:216)
 [drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
 at 
org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:327)
 [drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
 at 
org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38) 
[drill-common-1.14.0-mapr.jar:1.14.0-mapr]
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
[na:1.8.0_181]
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
[na:1.8.0_181]
 at java.lang.Thread.run(Thread.java:748) [na:1.8.0_181]
Caused by: org.joda.time.IllegalFieldValueException: Value 0 for monthOfYear 
must be in the range [1,12]
 at org.joda.time.field.FieldUtils.verifyValueBounds(FieldUtils.java:252) 
~[drill-hive-exec-shaded-1.14.0-mapr.jar:1.14.0-mapr]
 at 
org.joda.time.chrono.BasicChronology.getDateMidnightMillis(BasicChronology.java:612)
 ~[drill-hive-exec-shaded-1.14.0-mapr.jar:1.14.0-mapr]
 at 
org.joda.time.chrono.BasicChronology.getDateTimeMillis(BasicChronology.java:159)
 ~[drill-hive-exec-shaded-1.14.0-mapr.jar:1.14.0-mapr]
 at 
org.joda.time.chrono.AssembledChronology.getDateTimeMillis(AssembledChronology.java:120)
 ~[drill-hive-exec-shaded-1.14.0-mapr.jar:1.14.0-mapr]
 at 
org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.getDate(StringFunctionHelpers.java:210)
 ~[drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
 at 
org.apache.drill.exec.test.generated.ProjectorGen977.doEval(ProjectorTemplate.java:41)
 ~[na:na]
 at 
org.apache.drill.exec.test.generated.ProjectorGen977.projectRecords(ProjectorTemplate.java:67)
 ~[na:na]
 at 
org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.doWork(ProjectRecordBatch.java:231)
 ~[drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
 at 
org.apache.drill.exec.record.AbstractUnaryRecordBatch.innerNext(AbstractUnaryRecordBatch.java:117)
 ~[drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
 at 
org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext(ProjectRecordBatch.java:142)
 ~[drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
 at 
org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:172)
 ~[drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
 at 
org.apache.drill.exec.physical.impl.BasetestExec.next(BasetestExec.java:103) 
~[drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
 at 
org.apache.drill.exec.physical.impl.ScreenCreator$Screentest.innerNext(ScreenCreator.java:83)
 ~[drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
 at org.apache.drill.exec.physical.impl.BasetestExec.next(BasetestExec.java:93) 
~[drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
 at 
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:294)
 ~[drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
 at 
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:281)
 ~[drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
 at java.security.AccessController.doPrivileged(Native Method) ~[na:1.8.0_181]
 at javax.security.auth.Subject.doAs(Subject.java:422) ~[na:1.8.0_181]
 at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
 ~[hadoop-common-2.7.0-mapr-1808.jar:na]
 at 
org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:281)
 [drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
 ... 4 common frames omitted
{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Created] (DRILL-6994) TIMESTAMP type DOB column in Spark parquet is treated as VARBINARY in Drill

Reply via email to