Hi folks, I'm trying to use automatic partition discovery as described here:
https://databricks.com/blog/2015/03/24/spark-sql-graduates-from-alpha-in-spark-1-3.html

The blog's example looks like this:

/data/year=2014/file.parquet
/data/year=2015/file.parquet
...

SELECT * FROM table WHERE year = 2015

I have an official 1.3.1 CDH4 build and did the following:

scala> val hc = new org.apache.spark.sql.hive.HiveContext(sc)
hc: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@2564dce9

scala> val df = hc.parquetFile("/r/warehouse/hive/pkey=0000-2013-12/another_path_part/part-00000-r-00000.snappy.parquet")

scala> df.columns
res0: Array[String] = Array(...)   <- the data columns, but no pkey column

scala> df.registerTempTable("table")

scala> hc.sql("SELECT count(*) FROM table WHERE pkey='0000-2013-12'")
15/05/12 16:27:32 INFO ParseDriver: Parsing command: SELECT count(*) FROM table WHERE pkey='0000-2013-12'
15/05/12 16:27:33 INFO ParseDriver: Parse Completed
org.apache.spark.sql.AnalysisException: cannot resolve 'pkey' given input columns ...

So in my case the DataFrame built from the Parquet file does not pick up pkey as an added, filterable column. Am I using this wrong, or is my on-disk structure not correct? Any insight appreciated.
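For reference, this is the minimal setup I understood the blog to describe. The directory layout, table name, and paths below are made up, and pointing parquetFile at the base directory (rather than at a single part file) is my assumption about how discovery is supposed to be triggered:

// Hypothetical layout, following the blog's example:
//   /data/year=2014/part-00000.parquet
//   /data/year=2015/part-00000.parquet

val hc = new org.apache.spark.sql.hive.HiveContext(sc)

// Load the base directory rather than one part file, so the year=...
// directories can be picked up as a partition column.
val df = hc.parquetFile("/data")

df.printSchema()                     // expecting the data columns plus a discovered `year` column
df.registerTempTable("events")
hc.sql("SELECT count(*) FROM events WHERE year = 2015").show()

The obvious differences from my session are that I loaded a single part file directly, and that my data has an extra directory level (another_path_part) that is not in key=value form -- I don't know whether either of those matters.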