[
https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Stuart Reynolds updated SPARK-21392:
------------------------------------
Description:
The following boring code works up until when I read in the parquet file.
{code:none}
import numpy as np
import pandas as pd
import pyspark
from pyspark import SQLContext, SparkContext, SparkConf
print pyspark.__version__
sc = SparkContext(conf=SparkConf().setMaster('local'))
df = pd.DataFrame({"mi":np.arange(100), "eid":np.arange(100)})
print df
sqlc = SQLContext(sc)
df = sqlc.createDataFrame(df)
df = df.createOrReplaceTempView("outcomes")
rdd = sqlc.sql("SELECT eid,mi FROM outcomes limit 5")
print rdd.schema
rdd.show()
rdd.write.parquet("mi", mode="overwrite")
rdd2 = sqlc.read.parquet("mi") # FAIL!
{code}
{code:none}
# print pyspark.__version__
2.2.0
# print df
eid mi
0 0 0
1 1 1
2 2 2
3 3 3
...
[100 rows x 2 columns]
# print rdd.schema
StructType(List(StructField(eid,LongType,true),StructField(mi,LongType,true)))
# rdd.show()
+---+---+
|eid| mi|
+---+---+
| 0| 0|
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
+---+---+
{code}
fails with:
{code:none}
rdd2 = sqlc.read.parquet("mixx")
File "/usr/local/lib/python2.7/dist-packages/pyspark/sql/readwriter.py", line
291, in parquet
return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
File "/usr/local/lib/python2.7/dist-packages/py4j/java_gateway.py", line
1133, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/local/lib/python2.7/dist-packages/pyspark/sql/utils.py", line 69,
in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'Unable to infer schema for Parquet. It
must be specified manually.;'
{code}
The documentation for parquet says the format is self describing, and the full
schema was available when the parquet file was saved. What gives?
Works with master='local', but fails with my cluster is specified.
was:
The following boring code works up until when I read in the parquet file.
{code:none}
import numpy as np
import pandas as pd
import pyspark
from pyspark import SQLContext, SparkContext, SparkConf
print pyspark.__version__
sc = SparkContext(conf=SparkConf().setMaster('local'))
df = pd.DataFrame({"mi":np.arange(100), "eid":np.arange(100)})
print df
sqlc = SQLContext(sc)
df = sqlc.createDataFrame(df)
df = df.createOrReplaceTempView("outcomes")
rdd = sqlc.sql("SELECT eid,mi FROM outcomes limit 5")
print rdd.schema
rdd.show()
rdd.write.parquet("mi", mode="overwrite")
rdd2 = sqlc.read.parquet("mi")
{code}
{code:none}
# print pyspark.__version__
2.2.0
# print df
eid mi
0 0 0
1 1 1
2 2 2
3 3 3
---
[100 rows x 2 columns]
# print rdd.schema
StructType(List(StructField(eid,LongType,true),StructField(mi,LongType,true)))
# rdd.show()
+---+---+
|eid| mi|
+---+---+
| 0| 0|
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
+---+---+
{code}
fails with:
{code:none}
rdd2 = sqlc.read.parquet("mixx")
File "/usr/local/lib/python2.7/dist-packages/pyspark/sql/readwriter.py", line
291, in parquet
return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
File "/usr/local/lib/python2.7/dist-packages/py4j/java_gateway.py", line
1133, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/local/lib/python2.7/dist-packages/pyspark/sql/utils.py", line 69,
in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'Unable to infer schema for Parquet. It
must be specified manually.;'
{code}
in
{code:none}
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
in deco(*a, **kw)
{code}
The documentation for parquet says the format is self describing, and the full
schema was available when the parquet file was saved. What gives?
Works with master='local', but fails with my cluster is specified.
> Unable to infer schema when loading large Parquet file
> ------------------------------------------------------
>
> Key: SPARK-21392
> URL: https://issues.apache.org/jira/browse/SPARK-21392
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.1.1, 2.2.0
> Environment: Spark 2.1.1. python 2.7.6
> Reporter: Stuart Reynolds
> Labels: parquet, pyspark
>
> The following boring code works up until when I read in the parquet file.
> {code:none}
> import numpy as np
> import pandas as pd
> import pyspark
> from pyspark import SQLContext, SparkContext, SparkConf
> print pyspark.__version__
> sc = SparkContext(conf=SparkConf().setMaster('local'))
> df = pd.DataFrame({"mi":np.arange(100), "eid":np.arange(100)})
> print df
> sqlc = SQLContext(sc)
> df = sqlc.createDataFrame(df)
> df = df.createOrReplaceTempView("outcomes")
> rdd = sqlc.sql("SELECT eid,mi FROM outcomes limit 5")
> print rdd.schema
> rdd.show()
> rdd.write.parquet("mi", mode="overwrite")
> rdd2 = sqlc.read.parquet("mi") # FAIL!
> {code}
> {code:none}
> # print pyspark.__version__
> 2.2.0
> # print df
> eid mi
> 0 0 0
> 1 1 1
> 2 2 2
> 3 3 3
> ...
> [100 rows x 2 columns]
> # print rdd.schema
> StructType(List(StructField(eid,LongType,true),StructField(mi,LongType,true)))
> # rdd.show()
> +---+---+
> |eid| mi|
> +---+---+
> | 0| 0|
> | 1| 1|
> | 2| 2|
> | 3| 3|
> | 4| 4|
> +---+---+
> {code}
>
> fails with:
> {code:none}
> rdd2 = sqlc.read.parquet("mixx")
> File "/usr/local/lib/python2.7/dist-packages/pyspark/sql/readwriter.py",
> line 291, in parquet
> return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
> File "/usr/local/lib/python2.7/dist-packages/py4j/java_gateway.py", line
> 1133, in __call__
> answer, self.gateway_client, self.target_id, self.name)
> File "/usr/local/lib/python2.7/dist-packages/pyspark/sql/utils.py", line
> 69, in deco
> raise AnalysisException(s.split(': ', 1)[1], stackTrace)
> pyspark.sql.utils.AnalysisException: u'Unable to infer schema for Parquet. It
> must be specified manually.;'
> {code}
> The documentation for parquet says the format is self describing, and the
> full schema was available when the parquet file was saved. What gives?
> Works with master='local', but fails with my cluster is specified.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]