Paul Nepywoda created PARQUET-294:
-------------------------------------
Summary: NPE in ParquetInputFormat.getSplits when no .parquet files exist
Key: PARQUET-294
URL: https://issues.apache.org/jira/browse/PARQUET-294
Project: Parquet
Issue Type: Bug
Affects Versions: 1.6.0
Reporter: Paul Nepywoda
{code}
JavaSparkContext context = ...

// Build an empty DataFrame with a single nullable string column.
JavaRDD<Row> rdd1 = context.parallelize(ImmutableList.<Row> of());
SQLContext sqlContext = new SQLContext(context);
StructType schema = DataTypes.createStructType(ImmutableList.of(
    DataTypes.createStructField("col1", DataTypes.StringType, true)));
DataFrame df = sqlContext.createDataFrame(rdd1, schema);

// Write the empty DataFrame out as Parquet.
String url = "file:///tmp/emptyRDD";
df.saveAsParquetFile(url);

// Read it back through ParquetInputFormat.
Configuration configuration =
    SparkHadoopUtil.get().newConfiguration(context.getConf());
JobConf jobConf = new JobConf(configuration);
ParquetInputFormat.setReadSupportClass(jobConf, RowReadSupport.class);
FileInputFormat.setInputPaths(jobConf, url);
JavaRDD<Row> rdd2 = context.newAPIHadoopRDD(
    jobConf, ParquetInputFormat.class, Void.class, Row.class).values();
rdd2.count();

// Write the rows that were read back to a second location, then read that too.
df = sqlContext.createDataFrame(rdd2, schema);
url = "file:///tmp/emptyRDD2";
df.saveAsParquetFile(url);
FileInputFormat.setInputPaths(jobConf, url);
JavaRDD<Row> rdd3 = context.newAPIHadoopRDD(
    jobConf, ParquetInputFormat.class, Void.class, Row.class).values();
rdd3.count();
{code}
The NPE happens here:
{code}
java.lang.NullPointerException
    at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:263)
    at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:245)
    at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95)
{code}
This stems from ParquetFileWriter.getGlobalMetaData returning null when there
are no footers to read.
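
A possible shape for a fix, sketched with stand-in types only: the class and method names below (EmptyFooterGuard, mergeFooters, and this getSplits) are hypothetical and are not Parquet's actual API. The sketch assumes that "no footers" should mean "no splits" rather than an exception, which may or may not be the behaviour the project wants.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are NOT Parquet's classes.
final class EmptyFooterGuard {

    /** Mimics the reported behaviour: merging zero footers yields null. */
    static String mergeFooters(List<String> footers) {
        if (footers.isEmpty()) {
            return null; // the kind of value a caller might dereference unchecked
        }
        return String.join("+", footers);
    }

    /** A null-safe caller: with no merged metadata, return no splits. */
    static List<String> getSplits(List<String> footers) {
        String globalMetaData = mergeFooters(footers);
        if (globalMetaData == null) {
            return Collections.emptyList();
        }
        List<String> splits = new ArrayList<>();
        for (String footer : footers) {
            splits.add("split(" + footer + ")");
        }
        return splits;
    }

    public static void main(String[] args) {
        // Empty input: zero splits instead of a NullPointerException.
        System.out.println(getSplits(Collections.<String>emptyList()).size()); // prints 0
        // Non-empty input: one split per footer.
        System.out.println(getSplits(Arrays.asList("a", "b"))); // prints [split(a), split(b)]
    }
}
```

The alternative would be to have the merge step return an empty metadata object instead of null, so that no caller needs the guard.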
--
This message was sent by Atlassian JIRA (v6.3.4#6332)