Paul Pearce created SPARK-19381:
-----------------------------------
Summary: spark 2.1.0 raises unrelated (unhelpful) error for
parquet files beginning with '_'
Key: SPARK-19381
URL: https://issues.apache.org/jira/browse/SPARK-19381
Project: Spark
Issue Type: Bug
Affects Versions: 2.1.0
Reporter: Paul Pearce
Priority: Minor
Under Spark 2.1.0, attempting to read a Parquet file whose name begins with '_'
fails with the error
"Unable to infer schema for Parquet. It must be specified manually."
The bug is not the inability to read the file, but that the error message is
unrelated to the actual problem. The sessions below generate Parquet files
under Spark 2.0.0 and then attempt to read them under Spark 2.1.0.
Generation:
{code}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.0.cloudera1
      /_/
Using Python version 2.7.6 (default, Oct 26 2016 20:30:19)
SparkSession available as 'spark'.
>>> from pyspark.sql import Row
>>> df = spark.createDataFrame(sc.parallelize(range(1, 6)).map(lambda i:
...     Row(single=i, double=i ** 2)))
>>> df.write.parquet("debug.parquet")
>>> df.write.parquet("_debug.parquet")
{code}
Reading:
{code}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/
Using Python version 2.7.6 (default, Oct 26 2016 20:30:19)
SparkSession available as 'spark'.
>>> df = spark.read.parquet("debug.parquet")
>>> df = spark.read.parquet("_debug.parquet")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/apache/spark-2.1.0-bin-hadoop2.6/python/pyspark/sql/readwriter.py", line 274, in parquet
    return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
  File "/opt/apache/spark-2.1.0-bin-hadoop2.6/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/opt/apache/spark-2.1.0-bin-hadoop2.6/python/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'
{code}
I only realized the source of the problem after reading
https://issues.apache.org/jira/browse/SPARK-16975, which describes a similar
problem involving column names.
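
Until the error message improves, a caller could check the path before handing
it to spark.read.parquet. The sketch below is only an illustration and not
Spark's own logic: it assumes that a final path component starting with '_'
(or '.', by the usual hidden-file convention) is skipped when files are listed,
and the helper names is_hidden_name and check_parquet_path are hypothetical.

```python
import posixpath


def is_hidden_name(path):
    """Return True if the last path component starts with '_' or '.'.

    Assumption: such names are treated as hidden and skipped when the
    reader lists files, which would explain the schema-inference error.
    """
    name = posixpath.basename(path.rstrip("/"))
    return name.startswith(("_", "."))


def check_parquet_path(path):
    """Raise a descriptive error for a hidden-looking path, else return it."""
    if is_hidden_name(path):
        raise ValueError(
            "Path %r has a name starting with '_' or '.'; it may be treated "
            "as hidden and skipped, so no schema can be inferred." % path
        )
    return path


# Hypothetical usage inside a PySpark session:
#   df = spark.read.parquet(check_parquet_path("_debug.parquet"))
```

With such a check, the read of "_debug.parquet" would fail early with a message
that names the offending path rather than the schema-inference error.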