Hi folks, I'm trying to use automatic partition discovery as described here:

https://databricks.com/blog/2015/03/24/spark-sql-graduates-from-alpha-in-spark-1-3.html

/data/year=2014/file.parquet
/data/year=2015/file.parquet
…
SELECT * FROM table WHERE year = 2015
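For reference, the way I read the blog post, the intended usage is roughly the following (a minimal sketch only, using the hypothetical /data layout, the "year" column and the "table" name from the blog example above; not something I have verified):

val hc = new org.apache.spark.sql.hive.HiveContext(sc)
// Point parquetFile at the base directory; the year=... subdirectories
// should be discovered and surface as a "year" column in the DataFrame.
val df = hc.parquetFile("/data")
df.registerTempTable("table")
hc.sql("SELECT * FROM table WHERE year = 2015").show()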



I have an official 1.3.1 CDH4 build and did the following:

scala> val hc = new org.apache.spark.sql.hive.HiveContext(sc)
hc: org.apache.spark.sql.hive.HiveContext =
org.apache.spark.sql.hive.HiveContext@2564dce9

scala> val df = hc.parquetFile("/r/warehouse/hive/pkey=0000-2013-12/another_path_part/part-00000-r-00000.snappy.parquet")

scala> df.columns
res0: Array[String] = Array(...)   <-- the data columns, but no pkey column among them

scala> df.registerTempTable("table")
scala> hc.sql("SELECT count(*) FROM table WHERE pkey='0000-2013-12'")
15/05/12 16:27:32 INFO ParseDriver: Parsing command: SELECT count(*)
FROM table WHERE pkey='0000-2013-12'
15/05/12 16:27:33 INFO ParseDriver: Parse Completed
org.apache.spark.sql.AnalysisException: cannot resolve 'pkey' given
input columns ....

So in my case the DataFrame built from the Parquet file did not pick up pkey as an
added, filterable column. Am I using this wrong, or is my on-disk structure not
what partition discovery expects?
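Part of what I'm unsure about is whether I should be pointing parquetFile at the
directory above the pkey=... directories rather than at a single leaf file. A rough
sketch of what I mean (untested; the path is just the prefix of the one above):

// Sketch only: load from the directory that contains the pkey=... subdirectories,
// so partition discovery can derive a pkey column from the directory names.
val df2 = hc.parquetFile("/r/warehouse/hive")
df2.registerTempTable("t")
hc.sql("SELECT count(*) FROM t WHERE pkey = '0000-2013-12'").show()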

Any insight appreciated.
