[ https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Stuart Reynolds updated SPARK-21392:
------------------------------------
    Description: 
The following boring code works:

{code:none}
response = "mi_or_chd_5"
outcome = sqlc.sql("""select eid,{response} as response
from outcomes
where {response} IS NOT NULL""".format(response=response))
outcome.write.parquet(response, mode="overwrite")

>>> print outcome.schema
StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))
{code}

But then,
{code:none}
outcome2 = sqlc.read.parquet(response)  # fail
{code}
fails with:
{code:none}
AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'
{code}
in
{code:none}
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc in deco(*a, **kw)
{code}

The documentation for Parquet says the format is self-describing, and the full schema was available when the Parquet file was saved. What gives?

Seems related to https://issues.apache.org/jira/browse/SPARK-16975, which claims it was fixed in 2.0.1 and 2.1.0. (The current bug is in 2.1.1.)

  was:
The following boring code works:

{code:python}
response = "mi_or_chd_5"
outcome = sqlc.sql("""select eid,{response} as response
from outcomes
where {response} IS NOT NULL""".format(response=response))
outcome.write.parquet(response, mode="overwrite")

>>> print outcome.schema
StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))
{code}

But then,
{code:python}
outcome2 = sqlc.read.parquet(response)  # fail
{code}
fails with:
{code:python}
AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'
{code}
in
{code:python}
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc in deco(*a, **kw)
{code}

The documentation for Parquet says the format is self-describing, and the full schema was available when the Parquet file was saved. What gives?

Seems related to https://issues.apache.org/jira/browse/SPARK-16975, which claims it was fixed in 2.0.1 and 2.1.0. (The current bug is in 2.1.1.)

> Unable to infer schema when loading Parquet file
> ------------------------------------------------
>
>                 Key: SPARK-21392
>                 URL: https://issues.apache.org/jira/browse/SPARK-21392
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.1.1
>         Environment: Spark 2.1.1. python 2.7.6
>            Reporter: Stuart Reynolds
>              Labels: parquet, pyspark
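
One way to narrow down the failure described above is to check whether the output directory actually contains any Parquet data files, since schema inference has nothing to read if the write produced none. This is only a possible cause and is not confirmed anywhere in this report. The sketch below is a minimal diagnostic under stated assumptions: `list_parquet_data_files` is a hypothetical helper (not part of PySpark), and it inspects only the local filesystem, so it does not apply if the relative path `response` resolves to HDFS or another store.

```python
import os


def list_parquet_data_files(path):
    """Return the Parquet data files found under `path`, sorted.

    Hypothetical helper for illustration only. It walks the local
    filesystem and skips names starting with "_" or "." (for example
    _SUCCESS, _temporary, or .crc checksum files).
    """
    found = []
    for root, dirs, files in os.walk(path):
        # Prune hidden and metadata directories in place so os.walk skips them.
        dirs[:] = [d for d in dirs if not d.startswith(("_", "."))]
        for name in files:
            if name.startswith(("_", ".")):
                continue
            if name.endswith(".parquet"):
                found.append(os.path.join(root, name))
    return sorted(found)


if __name__ == "__main__":
    # "mi_or_chd_5" is the output path used in the report above.
    data_files = list_parquet_data_files("mi_or_chd_5")
    print("%d Parquet data file(s) found" % len(data_files))
```

If the list comes back empty, supplying the schema explicitly, as the error message suggests, may avoid inference altogether: `sqlc.read.schema(outcome.schema).parquet(response)`. Whether that addresses the underlying problem in this issue is an open question.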