Stuart Reynolds created SPARK-21392:
---------------------------------------

             Summary: Unable to infer schema when loading Parquet file
                 Key: SPARK-21392
                 URL: https://issues.apache.org/jira/browse/SPARK-21392
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.1.1
         Environment: Spark 2.1.1, Python 2.7.6

            Reporter: Stuart Reynolds


The following boring code works:

{{    response = "mi_or_chd_5"
    colname = "f_1000"

    outcome = sqlc.sql("""select eid,{response} as response
    from outcomes
    where {response} IS NOT NULL""".format(response=response))
    outcome.write.parquet(response, mode="overwrite")
    
    col = sqlc.sql("""select eid,{colname} as {colname}
    from baseline_denull
    where {colname} IS NOT NULL""".format(colname=colname))
    col.write.parquet(colname, mode="overwrite")

    >>> print outcome.schema
    
StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))

    >>> print col.schema
    
StructType(List(StructField(eid,IntegerType,true),StructField(f_1000,DoubleType,true)))
}}.
    
But then,
{{
outcome2 = sqlc.read.parquet(response)  # fails
col2 = sqlc.read.parquet(colname)       # fails
}}

fails with:

{{
AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'
}}

in 

{{
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc in deco(*a, **kw)
}}
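
The error message suggests specifying the schema manually, which should at least bypass the inference step. A minimal sketch of that workaround, assuming the {{outcome}} DataFrame from above is still in scope so its schema can be reused:

{{
# Workaround sketch: pass the schema explicitly so Spark skips inference.
# DataFrameReader.schema() applies the given StructType to the read.
outcome2 = sqlc.read.schema(outcome.schema).parquet(response)
}}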

The Parquet documentation says the format is self-describing, and the full 
schema was available when the parquet file was saved. What gives?

Seems related to SPARK-16975 (https://issues.apache.org/jira/browse/SPARK-16975), 
which claims it was fixed in 2.0.1 and 2.1.0. (The current bug is against 2.1.1.)
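
One thing that might help with triage: as far as I can tell, this same AnalysisException is also raised when the target directory contains no parquet part files at all (e.g. only a _SUCCESS marker), so it's worth confirming what the writes above actually produced. A quick check, assuming the paths are on the local filesystem rather than HDFS:

{{
import os
# List the output directories; schema inference needs at least one
# part-*.parquet file (with a footer) to succeed.
print sorted(os.listdir(response))
print sorted(os.listdir(colname))
}}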



