Is there any dataType auto-convert or auto-detect feature in HiveContext?
All columns of a table are defined as string in the Hive metastore, yet one
column, total_price, with values like 123.45, gets recognized as dataType
Float in HiveContext...
Is this a feature or a bug? It really surprised me... how is it implemented?
If it is a feature, can I turn it off? I want to get a SchemaRDD with exactly
the same data types as defined in the Hive metadata. I know the column
total_price should contain float values, but it might not always: what happens
if there is a broken line somewhere in my huge CSV file, or if some
total_price is 9,123.45 or $123.45 or something like that?
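For illustration only (plain Scala in the spark-shell, made-up values), this is
why a silent float conversion worries me:

import scala.util.Try

// made-up values, roughly what a huge CSV can contain
val messy = Seq("123.45", "9,123.45", "$123.45", "")
messy.foreach { s =>
  // String.toFloat throws NumberFormatException for anything but a clean number
  println("'" + s + "' -> " + Try(s.toFloat).map(_.toString).getOrElse("not parseable as Float"))
}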
==============================================================
Here is an example from our environment:
MapR v3 cluster, latest Spark GitHub master (cloned yesterday), built with
sbt/sbt -Dhadoop.version=1.0.3-mapr-3.0.3 -Phive assembly
hive-site.xml is configured.
==============================================================
spark-shell session:
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql("use our_live_db")
hiveContext.sql("desc formatted et_fullorders").collect.foreach(println)
...
...
14/08/26 15:47:09 INFO SparkContext: Job finished: collect at
SparkPlan.scala:85, took 0.0305408 s
[# col_name data_type comment ]
[ ]
[sid string from deserializer ]
[request_id string from deserializer ]
[*times_dq string* from deserializer ]
[*total_price string* from deserializer ]
[order_id string from deserializer ]
[ ]
[# Partition Information ]
[# col_name data_type comment ]
[ ]
[wt_date string None ]
[country string None ]
[ ]
[# Detailed Table Information ]
[Database: our_live_db ]
[Owner: client02 ]
[CreateTime: Fri Jan 31 12:23:40 CET 2014 ]
[LastAccessTime: UNKNOWN ]
[Protect Mode: None ]
[Retention: 0 ]
[Location:
maprfs:/mapr/cluster01.xxx.net/common/external_tables/et_fullorders ]
[Table Type: EXTERNAL_TABLE ]
[Table Parameters: ]
[ EXTERNAL TRUE ]
[ transient_lastDdlTime 1391167420 ]
[ ]
[# Storage Information ]
[SerDe Library: com.bizo.hive.serde.csv.CSVSerde ]
[InputFormat: org.apache.hadoop.mapred.TextInputFormat ]
[OutputFormat:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat ]
[Compressed: No ]
[Num Buckets: -1 ]
[Bucket Columns: [] ]
[Sort Columns: [] ]
[Storage Desc Params: ]
[ separatorChar ; ]
[ serialization.format 1 ]
Then create a SchemaRDD from this table:
val result = hiveContext.sql("select sid, order_id, total_price, times_dq
from et_fullorders where wt_date='2014-04-14' and country='uk' limit 5")
OK, now printSchema...
scala> result.printSchema
root
|-- sid: string (nullable = true)
|-- order_id: string (nullable = true)
|-- *total_price: float* (nullable = true)
|-- *times_dq: timestamp* (nullable = true)
total_price was STRING in the metastore but is now FLOAT in the SchemaRDD,
and times_dq is now TIMESTAMP.
Really strange and surprising...
And even stranger:
scala> result.map(row => row.getString(2)).collect.foreach(println)
I got
240.00
45.83
21.67
95.83
120.83
but
scala> result.map(row => row.getFloat(2)).collect.foreach(println)
14/08/26 16:01:24 ERROR Executor: Exception in task 0.0 in stage 9.0 (TID 8)
java.lang.ClassCastException: java.lang.String cannot be cast to
java.lang.Float
at scala.runtime.BoxesRunTime.unboxToFloat(BoxesRunTime.java:114)
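As a defensive way to read the value, an untyped access might work (untested
sketch on my side; it assumes Row still behaves like a Seq[Any] in this build,
so row(i) hands back whatever object is really stored):

import scala.util.Try

result.map { row =>
  row(2) match {
    case s: String => Try(s.toFloat).getOrElse(Float.NaN) // the stored object is really a String
    case f: Float  => f                                   // in case it ever really is a float
    case _         => Float.NaN
  }
}.collect.foreach(println)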
==============================================================
By the way, the files in this external table are gzipped CSV files:
14/08/26 15:49:56 INFO HadoopRDD: Input split:
maprfs:/mapr/cluster01.xxx.net/common/external_tables/et_fullorders/wt_date=2014-04-14/country=uk/getFullOrders_2014-04-14.csv.gz:0+16990
And the data in it:
scala> result.collect.foreach(println)
[5000000001402123123,12344000123454,240.00,2014-04-14 00:03:49.082000]
[5000000001402110123,12344000123455,45.83,2014-04-14 00:04:13.639000]
[5000000001402129123,12344000123458,21.67,2014-04-14 00:09:12.276000]
[5000000001402092123,12344000132457,95.83,2014-04-14 00:09:42.228000]
[5000000001402135123,12344000123460,120.83,2014-04-14 00:12:44.742000]
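For comparison, here is a sketch that reads one of the raw files with plain
Spark core (the path and the ';' separator are taken from the output above;
sc.textFile should decompress the .gz transparently), just to confirm the
on-disk values are plain strings:

// read one raw file directly, bypassing Hive and the SerDe
val raw = sc.textFile("maprfs:/mapr/cluster01.xxx.net/common/external_tables/et_fullorders/wt_date=2014-04-14/country=uk/getFullOrders_2014-04-14.csv.gz")
raw.take(2).foreach { line =>
  // naive split on ';' (the separatorChar above); ignores the quoting that CSVSerde handles
  println(line.split(';').mkString(" | "))
}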
We use CSVSerde:
https://drone.io/github.com/ogrodnek/csv-serde/files/target/csv-serde-1.1.2-0.11.0-all.jar
Maybe this is the reason?
But then why are the 1st and 2nd columns not recognized as bigint or double
or something...?
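In the meantime, one workaround I am considering (an untested sketch; it
assumes a plain HiveQL CAST goes through HiveContext unchanged and overrides
whatever does the inference) is casting the columns back to string inside the
query itself:

val asStrings = hiveContext.sql("""
  select sid, order_id,
         cast(total_price as string) as total_price,
         cast(times_dq as string) as times_dq
  from et_fullorders
  where wt_date='2014-04-14' and country='uk' limit 5""")
asStrings.printSchema  // hoping all four columns come back as string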
Thanks for any ideas.