Stuart Reynolds created SPARK-21392:
---------------------------------------

             Summary: Unable to infer schema when loading Parquet file
                 Key: SPARK-21392
                 URL: https://issues.apache.org/jira/browse/SPARK-21392
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.1.1
         Environment: Spark 2.1.1. python 2.7.6
            Reporter: Stuart Reynolds


The following boring code works
{{
response = "mi_or_chd_5"
colname = "f_1000"

outcome = sqlc.sql("""select eid,{response} as response from outcomes where {response} IS NOT NULL""".format(response=response))
outcome.write.parquet(response, mode="overwrite")

col = sqlc.sql("""select eid,{colname} as {colname} from baseline_denull where {colname} IS NOT NULL""".format(colname=colname))
col.write.parquet(colname, mode="overwrite")

>>> print outcome.schema
StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))

>>> print col.schema
StructType(List(StructField(eid,IntegerType,true),StructField(f_1000,DoubleType,true)))
}}

But then,
{{
outcome2 = sqlc.read.parquet(response)  # fail
col2 = sqlc.read.parquet(colname)  # fail
}}

fails with:
{{
AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'
}}

in
{{
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc in deco(*a, **kw)
}}

The documentation for Parquet says the format is self-describing, and the full schema was available when the Parquet file was saved. What gives?

Seems related to https://issues.apache.org/jira/browse/SPARK-16975, but that issue claims it was fixed in 2.0.1 and 2.1.0. (The current bug is in 2.1.1.)
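
One thing that may be worth ruling out, offered as a sketch rather than a diagnosis: this error is commonly reported when Spark finds no data files under the path it is asked to read, for example when the directory holds only a _SUCCESS marker or when a relative path resolves to a different location or filesystem than the one written to. The helper below is hypothetical (it is not part of PySpark). It assumes the output directories from the report are on the local filesystem, and that names beginning with "_" or "." are metadata or hidden files rather than data. It looks only at the top level of the directory, so it would not see files inside partition subdirectories.

```python
import os


def list_parquet_data_files(path):
    """Return the data files directly under a local Parquet output directory.

    Hypothetical helper, not part of PySpark. Assumes `path` is on the local
    filesystem and that names starting with "_" or "." (such as _SUCCESS) are
    metadata or hidden files rather than data. Returns an empty list if the
    directory does not exist.
    """
    if not os.path.isdir(path):
        return []
    return sorted(
        name for name in os.listdir(path)
        if not name.startswith(("_", "."))
        and os.path.isfile(os.path.join(path, name))
    )


if __name__ == "__main__":
    # Directory names taken from the report above.
    for path in ("mi_or_chd_5", "f_1000"):
        files = list_parquet_data_files(path)
        # Print the absolute path too, since a relative path may not point
        # where the writer actually put the files.
        print("%s (%s): %d data file(s)" % (path, os.path.abspath(path), len(files)))
```

If the directories do contain part files, the error message's own suggestion can be tried by passing the schema explicitly, e.g. {{sqlc.read.schema(outcome.schema).parquet(response)}}. If that returns an empty DataFrame, it would again suggest that no data files are being found at that path.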