[
https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Stuart Reynolds updated SPARK-21392:
------------------------------------
Description:
The following boring code works until the final line, which reads the Parquet
output back:
{code:none}
response = "mi_or_chd_5"
sc = get_spark_context() # custom
sqlc = get_sparkSQLContextWithTables(sc, tables=["outcomes"]) # custom
rdd = sqlc.sql("SELECT eid,mi_or_chd_5 FROM outcomes")
print rdd.schema
#>> StructType(List(StructField(eid,IntegerType,true),StructField(mi_or_chd_5,ShortType,true)))
rdd.show()
#+-------+-----------+
#|    eid|mi_or_chd_5|
#+-------+-----------+
#|    226|       null|
#|    442|       null|
#|    978|          0|
#|    851|          0|
#|    428|          0|
rdd.write.parquet(response, mode="overwrite") # success!
rdd2 = sqlc.read.parquet(response) # fail
{code}
fails with:
{code:none}
AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'
{code}
in
{code:none}
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc in deco(*a, **kw)
{code}
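One thing worth checking, offered as a guess rather than a diagnosis: this
error is typically raised when the reader finds no data files under the path.
The sketch below is a hypothetical helper (list_data_files is not part of
PySpark) that lists what the write actually produced. It assumes the output
directory is on the local filesystem and that names starting with "_" or "."
(for example _SUCCESS or checksum files) are treated as metadata.

```python
import os


def list_data_files(output_dir):
    """Return files under output_dir that look like data files.

    Assumption: names starting with "_" or "." (such as _SUCCESS or
    .crc checksum files) are metadata and are skipped.
    """
    data_files = []
    for root, dirs, files in os.walk(output_dir):
        # Do not descend into hidden or metadata directories (e.g. _temporary).
        dirs[:] = [d for d in dirs if not d.startswith(("_", "."))]
        for name in sorted(files):
            if name.startswith(("_", ".")):
                continue
            data_files.append(os.path.join(root, name))
    return data_files


if __name__ == "__main__":
    # "mi_or_chd_5" is the output path used in the report.
    print(list_data_files("mi_or_chd_5"))
```

An empty list here would suggest the write produced no part files, which would
fit the schema inference failure; a non-empty list would point elsewhere.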
The Parquet documentation says the format is self-describing, and the full
schema was available when the Parquet file was saved. What gives?
The error doesn't happen if I add "LIMIT 10" to the SQL query. The full
selected table is 500k rows with one int column and one short column.
This seems related to https://issues.apache.org/jira/browse/SPARK-16975, which
claims to have been fixed in 2.0.1 and 2.1.0. (The current bug is in 2.1.1.)
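Since the error asks for a manually specified schema, a possible workaround is
sketched below. It uses DataFrameReader.schema(...) followed by .parquet(...);
the two helper names (read_parquet_with_schema, outcomes_schema) are
hypothetical and only for illustration. The schema mirrors the one printed by
rdd.schema above. This sidesteps inference only: if the write produced no data
files, the read would likely return an empty DataFrame rather than the 500k
rows.

```python
def read_parquet_with_schema(reader, path, schema):
    """Read Parquet data with a caller-supplied schema instead of inference.

    `reader` is expected to behave like pyspark's DataFrameReader
    (for example sqlc.read); `schema` is a pyspark StructType.
    """
    return reader.schema(schema).parquet(path)


def outcomes_schema():
    """Build the two-column schema shown by rdd.schema in the report."""
    # Imported lazily so this module still loads where pyspark is absent.
    from pyspark.sql.types import (
        IntegerType, ShortType, StructField, StructType)
    return StructType([
        StructField("eid", IntegerType(), True),
        StructField("mi_or_chd_5", ShortType(), True),
    ])
```

With the names from the report, usage would be
rdd2 = read_parquet_with_schema(sqlc.read, response, outcomes_schema()).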
was:
The following boring code works
{code:none}
response = "mi_or_chd_5"
sc = get_spark_context() # custom
sqlc = get_sparkSQLContextWithTables(sc, tables=["outcomes"]) # custom
rdd = sqlc.sql("SELECT eid,mi_or_chd_5 FROM outcomes")
print rdd.schema
#>> StructType(List(StructField(eid,IntegerType,true),StructField(mi_or_chd_5,ShortType,true)))
rdd.show()
#+-------+-----------+
#|    eid|mi_or_chd_5|
#+-------+-----------+
#|    216|       null|
#|    431|       null|
#|    978|          0|
#|    852|          0|
#|    418|          0|
rdd.write.parquet(response, mode="overwrite") # success!
rdd2 = sqlc.read.parquet(response) # fail
{code}
fails with:
{code:none}
AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'
{code}
in
{code:none}
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc in deco(*a, **kw)
{code}
The documentation for parquet says the format is self describing, and the full
schema was available when the parquet file was saved. What gives?
The error doesn't happen if I add "limit 10" to the sql query. The whole
selected table is 500k rows with an int and short column.
Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but which
claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)
> Unable to infer schema when loading large Parquet file
> ------------------------------------------------------
>
> Key: SPARK-21392
> URL: https://issues.apache.org/jira/browse/SPARK-21392
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.1.1
> Environment: Spark 2.1.1. python 2.7.6
> Reporter: Stuart Reynolds
> Labels: parquet, pyspark