Is there any dataType auto-convert or auto-detect feature in HiveContext?
All columns of a table are defined as string in the Hive metastore, yet one
column, total_price, with values like 123.45, gets recognized as dataType
Float in HiveContext...
Is this a feature or a bug? It really surprised me... how is it implemented?
If it is a feature, can I turn it off? I want to get a SchemaRDD with exactly
the same data types as defined in the Hive metadata. I know the column
total_price should contain float values, but it might not always: what happens
if there is a broken line somewhere in my huge CSV file, or if some
total_price is 9,123.45 or $123.45 or something like that?
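For illustration only (plain Scala in the spark-shell, made-up values), this is
why a silent float conversion worries me:

import scala.util.Try

// made-up values, roughly what a huge CSV can contain
val messy = Seq("123.45", "9,123.45", "$123.45", "")
messy.foreach { s =>
  // String.toFloat throws NumberFormatException for anything but a clean number
  println("'" + s + "' -> " + Try(s.toFloat).map(_.toString).getOrElse("not parseable as Float"))
}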
==============================================================
Here is an example from our environment:
MapR v3 cluster, latest Spark GitHub master (cloned yesterday), built with
sbt/sbt -Dhadoop.version=1.0.3-mapr-3.0.3 -Phive assembly
hive-site.xml is configured.
==============================================================
spark-shell session:
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql("use our_live_db")
hiveContext.sql("desc formatted et_fullorders").collect.foreach(println)
...
...
14/08/26 15:47:09 INFO SparkContext: Job finished: collect at
SparkPlan.scala:85, took 0.0305408 s
[# col_name data_type comment ]
[ ]
[sid string from deserializer ]
[request_id string from deserializer ]
[*times_dq string* from deserializer ]
[*total_price string* from deserializer ]
[order_id string from deserializer ]
[ ]
[# Partition Information ]
[# col_name data_type comment ]
[ ]
[wt_date string None ]
[country string None ]
[ ]
[# Detailed Table Information ]
[Database: our_live_db ]
[Owner: client02 ]
[CreateTime: Fri Jan 31 12:23:40 CET 2014 ]
[LastAccessTime: UNKNOWN ]
[Protect Mode: None ]
[Retention: 0 ]
[Location:
maprfs:/mapr/cluster01.xxx.net/common/external_tables/et_fullorders ]
[Table Type: EXTERNAL_TABLE ]
[Table Parameters: ]
[ EXTERNAL TRUE ]
[ transient_lastDdlTime 1391167420 ]
[ ]
[# Storage Information ]
[SerDe Library: com.bizo.hive.serde.csv.CSVSerde ]
[InputFormat: org.apache.hadoop.mapred.TextInputFormat ]
[OutputFormat:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat ]
[Compressed: No ]
[Num Buckets: -1 ]
[Bucket Columns: [] ]
[Sort Columns: [] ]
[Storage Desc Params: ]
[ separatorChar ; ]
[ serialization.format 1 ]
Then create a SchemaRDD from this table:
val result = hiveContext.sql("select sid, order_id, total_price, times_dq
from et_fullorders where wt_date='2014-04-14' and country='uk' limit 5")
OK, now printSchema...
scala> result.printSchema
root
|-- sid: string (nullable = true)
|-- order_id: string (nullable = true)
|-- *total_price: float* (nullable = true)
|-- *times_dq: timestamp* (nullable = true)
total_price was STRING in the metastore but is now FLOAT in the SchemaRDD,
and times_dq is now TIMESTAMP.
Really strange and surprising...
And even stranger:
scala> result.map(row => row.getString(2)).collect.foreach(println)
I got
240.00
45.83
21.67
95.83
120.83
but
scala> result.map(row => row.getFloat(2)).collect.foreach(println)
14/08/26 16:01:24 ERROR Executor: Exception in task 0.0 in stage 9.0 (TID 8)
java.lang.ClassCastException: java.lang.String cannot be cast to
java.lang.Float
at scala.runtime.BoxesRunTime.unboxToFloat(BoxesRunTime.java:114)
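As a defensive way to read the value, an untyped access might work (untested
sketch on my side; it assumes Row still behaves like a Seq[Any] in this build,
so row(i) hands back whatever object is really stored):

import scala.util.Try

result.map { row =>
  row(2) match {
    case s: String => Try(s.toFloat).getOrElse(Float.NaN) // the stored object is really a String
    case f: Float  => f                                   // in case it ever really is a float
    case _         => Float.NaN
  }
}.collect.foreach(println)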
==============================================================
By the way, the files in this external table are gzipped CSV files:
14/08/26 15:49:56 INFO HadoopRDD: Input split:
maprfs:/mapr/cluster01.xxx.net/common/external_tables/et_fullorders/wt_date=2014-04-14/country=uk/getFullOrders_2014-04-14.csv.gz:0+16990
And the data in it:
scala> result.collect.foreach(println)
[5000000001402123123,12344000123454,240.00,2014-04-14 00:03:49.082000]
[5000000001402110123,12344000123455,45.83,2014-04-14 00:04:13.639000]
[5000000001402129123,12344000123458,21.67,2014-04-14 00:09:12.276000]
[5000000001402092123,12344000132457,95.83,2014-04-14 00:09:42.228000]
[5000000001402135123,12344000123460,120.83,2014-04-14 00:12:44.742000]
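For comparison, here is a sketch that reads one of the raw files with plain
Spark core (the path and the ';' separator are taken from the output above;
sc.textFile should decompress the .gz transparently), just to confirm the
on-disk values are plain strings:

// read one raw file directly, bypassing Hive and the SerDe
val raw = sc.textFile("maprfs:/mapr/cluster01.xxx.net/common/external_tables/et_fullorders/wt_date=2014-04-14/country=uk/getFullOrders_2014-04-14.csv.gz")
raw.take(2).foreach { line =>
  // naive split on ';' (the separatorChar above); ignores the quoting that CSVSerde handles
  println(line.split(';').mkString(" | "))
}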
We use CSVSerde:
https://drone.io/github.com/ogrodnek/csv-serde/files/target/csv-serde-1.1.2-0.11.0-all.jar
Maybe this is the reason?
But then why are the 1st and 2nd columns not recognized as bigint or double
or something...?
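In the meantime, one workaround I am considering (an untested sketch; it
assumes a plain HiveQL CAST goes through HiveContext unchanged and overrides
whatever does the inference) is casting the columns back to string inside the
query itself:

val asStrings = hiveContext.sql("""
  select sid, order_id,
         cast(total_price as string) as total_price,
         cast(times_dq as string) as times_dq
  from et_fullorders
  where wt_date='2014-04-14' and country='uk' limit 5""")
asStrings.printSchema  // hoping all four columns come back as string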
Thanks for any ideas.