immerrr again created SPARK-16975:
-------------------------------------
Summary: Spark-2.0.0 unable to infer schema for parquet data
written by Spark-1.6.2
Key: SPARK-16975
URL: https://issues.apache.org/jira/browse/SPARK-16975
Project: Spark
Issue Type: Bug
Components: Input/Output
Affects Versions: 2.0.0
Environment: Ubuntu Linux 14.04
Reporter: immerrr again
Spark-2.0.0 seems to have some problems reading a parquet dataset generated by
1.6.2.
{code}
In [80]: spark.read.parquet('/path/to/data')
...
AnalysisException: u'Unable to infer schema for ParquetFormat at /path/to/data.
It must be specified manually;'
{code}
The dataset is ~150G and partitioned by the _locality_code column. None of the
partitions are empty. I have narrowed the failure down to the first 32
partitions of the data:
{code}
In [82]: spark.read.parquet(*subdirs[:32])
...
AnalysisException: u'Unable to infer schema for ParquetFormat at
/path/to/data/_locality_code=AQ,/path/to/data/_locality_code=AI. It must be
specified manually;'
{code}
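The `subdirs` list is not defined in the snippets here. A minimal sketch of how such a list might be built, assuming the partition directories sit directly under the dataset root on a local filesystem; the helper name `list_partition_dirs` is hypothetical, and the sorting is only for determinism (the order used in this report is evidently not alphabetical):

```python
import os


def list_partition_dirs(root, column="_locality_code"):
    """Return paths of the '<column>=<value>' directories directly under root.

    Assumption: a local filesystem with Hive-style partition directory names.
    Results are sorted by name so repeated calls give the same order.
    """
    prefix = column + "="
    return sorted(
        os.path.join(root, name)
        for name in os.listdir(root)
        if name.startswith(prefix) and os.path.isdir(os.path.join(root, name))
    )
```

With that helper, `subdirs = list_partition_dirs('/path/to/data')` would give a list usable in the calls shown in this report.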
Interestingly, it works OK if you remove any of the partitions from the list:
{code}
In [83]: for i in range(32): spark.read.parquet(*(subdirs[:i] + subdirs[i+1:32]))
{code}
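The leave-one-out check can also be written so that it records which omissions still fail instead of stopping at the first exception. This is only a sketch: the names `leave_one_out` and `failing_omissions` are hypothetical, and the reader is passed in as a callable so the logic does not depend on Spark itself:

```python
def leave_one_out(paths):
    """Yield (index, remaining_paths) with exactly one path dropped each time."""
    for i in range(len(paths)):
        yield i, paths[:i] + paths[i + 1:]


def failing_omissions(paths, read):
    """Return the indices whose omission still makes read(*remaining) raise."""
    failures = []
    for i, remaining in leave_one_out(paths):
        try:
            read(*remaining)
        except Exception:
            # Any exception counts as a failed read for this diagnostic.
            failures.append(i)
    return failures
```

Called as `failing_omissions(subdirs[:32], spark.read.parquet)`, an empty list would match the behaviour described here, where removing any single partition makes the read succeed.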
Another strange thing is that the schemas for the first and the last 31
partitions of the subset are identical:
{code}
In [84]: spark.read.parquet(*subdirs[:31]).schema.fields == spark.read.parquet(*subdirs[1:32]).schema.fields
Out[84]: True
{code}
That got me interested, so I tried passing the same partition 32 times:
{code}
In [87]: spark.read.parquet(*([subdirs[0]] * 32))
...
AnalysisException: u'Unable to infer schema for ParquetFormat at
/path/to/data/_locality_code=AQ,/path/to/data/_locality_code=AQ. It must be
specified manually;'
In [88]: spark.read.parquet(*([subdirs[15]] * 32))
...
AnalysisException: u'Unable to infer schema for ParquetFormat at
/path/to/data/_locality_code=AX,/path/to/data/_locality_code=AX. It must be
specified manually;'
In [89]: spark.read.parquet(*([subdirs[31]] * 32))
...
AnalysisException: u'Unable to infer schema for ParquetFormat at
/path/to/data/_locality_code=BE,/path/to/data/_locality_code=BE. It must be
specified manually;'
{code}
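Since repeating a single partition is enough to trigger the error, one way to probe where the failure starts is to try increasing repeat counts. The sketch below is an assumption-laden diagnostic, not something that was run for this report: `smallest_failing_count` and `max_count` are hypothetical names, and it scans linearly because nothing here guarantees that failures are monotonic in the number of paths:

```python
def smallest_failing_count(path, read, max_count=64):
    """Return the smallest n in 1..max_count for which read(*([path] * n)) raises.

    Returns None if every count up to max_count reads successfully.
    """
    for n in range(1, max_count + 1):
        try:
            read(*([path] * n))
        except Exception:
            # Any exception counts as a failed read for this diagnostic.
            return n
    return None
```

For example, `smallest_failing_count(subdirs[0], spark.read.parquet)` would report the first repeat count at which the read breaks for that partition.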
If I read the first partition, save it with 2.0 and read the result back in
the same manner, everything is fine:
{code}
In [100]: spark.read.parquet(subdirs[0]).write.parquet('spark-2.0-test')
16/08/09 11:03:37 WARN ParquetRecordReader: Can not initialize counter due to
context is not a instance of TaskInputOutputContext, but is
org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
In [101]: df = spark.read.parquet(*(['spark-2.0-test'] * 32))
{code}
I originally posted this to the user mailing list, but with these latest
findings it clearly seems like a bug.