Marshall created SPARK-6161:
-------------------------------

             Summary: sqlCtx.parquetFile(dataFilePath) throws NPE when using s3, but OK when using local filesystem
                 Key: SPARK-6161
                 URL: https://issues.apache.org/jira/browse/SPARK-6161
             Project: Spark
          Issue Type: Question
          Components: Spark Submit
    Affects Versions: 1.2.1
         Environment: MacOSX 10.10, S3
            Reporter: Marshall


Using some examples from Spark Summit 2014 and Spark 1.2.1, we converted 15 
pipe-separated raw text files (with on average 100k lines each) individually 
to Parquet file format using the following code:

  // Apply the XXXXRecord bean schema to the raw RDD of parsed records
  JavaSchemaRDD schemaXXXXData = sqlCtx.applySchema(xxxxData, XXXXRecord.class);
  // Register a temp table and write the data back out in Parquet format
  schemaXXXXData.registerTempTable("xxxxdata");
  schemaXXXXData.saveAsParquetFile(output);
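
For reference, xxxxData above is a JavaRDD<XXXXRecord> built from the raw 
pipe-separated lines. A minimal sketch of that step (the field names and 
XXXXRecord setters below are hypothetical, not our actual schema):

  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.function.Function;

  // Hypothetical sketch: parse pipe-separated lines into XXXXRecord beans.
  // XXXXRecord is a JavaBean (getters/setters) so applySchema can reflect on it.
  JavaRDD<XXXXRecord> xxxxData = ctx.textFile(inputPath).map(
      new Function<String, XXXXRecord>() {
        public XXXXRecord call(String line) {
          String[] fields = line.split("\\|", -1); // -1 keeps trailing empty fields
          XXXXRecord record = new XXXXRecord();
          record.setFieldOne(fields[0]); // hypothetical setter
          record.setFieldTwo(fields[1]); // hypothetical setter
          return record;
        }
      });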

We took the output of each folder, renamed the part file to match the 
original filename plus a .parquet extension, and dropped them all into one 
directory.

We created a Java class that we then invoke using a 
spark-1.2.1/bin/spark-submit command...

      SparkConf sparkConf = new SparkConf().setAppName("XXXXX");
      JavaSparkContext ctx = new JavaSparkContext(sparkConf);
      JavaSQLContext sqlCtx = new JavaSQLContext(ctx);
       
      final String dataFilePath = "/tmp/xxxxprocessor/xxxxsamplefiles_parquet";
      //final String dataFilePath = inputPath;

      // Create a JavaSchemaRDD from the file(s) pointed to by path
      JavaSchemaRDD xxxxData = sqlCtx.parquetFile(dataFilePath);
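
After loading, the app registers the result and queries it; the query below 
is only an illustrative stand-in for what doWork() actually does:

      // Illustrative usage only; the real doWork() logic differs.
      xxxxData.registerTempTable("xxxxdata");
      JavaSchemaRDD results = sqlCtx.sql("SELECT COUNT(*) FROM xxxxdata");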

GOOD: when we run our spark app locally (specifying dataFilePath as the full 
filename of ONE specific parquet file on the local filesystem), all is well... 
the 'sqlCtx.parquetFile(dataFilePath);' call finds the file and proceeds.

GOOD: when we run our spark app locally (specifying dataFilePath as the 
directory that contains all the parquet files), all is well... the 
'sqlCtx.parquetFile(dataFilePath);' call reads each file in the 
dataFilePath directory and proceeds.

GOOD: if we do the same thing by uploading ONE of the parquet files to s3 and 
changing our app to use the s3 path (giving it the full filename of that ONE 
parquet file), all is good... the code finds the file and proceeds.

BAD: if we then upload all the parquet files to s3 and specify the s3 directory 
where all the parquet files are, we get an NPE:

 Exception in thread "main" java.lang.NullPointerException
    at org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsInputStream.close(NativeS3FileSystem.java:106)
    at java.io.BufferedInputStream.close(BufferedInputStream.java:472)
    at java.io.FilterInputStream.close(FilterInputStream.java:181)
    at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:428)
    at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:389)
    at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$3.apply(ParquetTypes.scala:457)
    at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$3.apply(ParquetTypes.scala:457)
    at scala.Option.map(Option.scala:145)
    at org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetTypes.scala:457)
    at org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:477)
    at org.apache.spark.sql.parquet.ParquetRelation.<init>(ParquetRelation.scala:65)
    at org.apache.spark.sql.api.java.JavaSQLContext.parquetFile(JavaSQLContext.scala:141)
    at com.aol.ido.spark.sql.XXXXFileIndexParquet.doWork(XXXFileIndexParquet.java:101)
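
Since reading ONE parquet file from s3 works, a possible (untested) workaround 
sketch is to load each file individually and combine them with UNION ALL via 
temp tables; the bucket and file names below are placeholders:

      // Untested workaround sketch: load each s3 parquet file on its own
      // (single-file s3 reads work per the GOOD scenario above) and combine
      // the results with SQL UNION ALL. Bucket/file names are placeholders.
      JavaSchemaRDD part1 = sqlCtx.parquetFile("s3n://my-bucket/xxxxsamplefiles_parquet/file1.parquet");
      JavaSchemaRDD part2 = sqlCtx.parquetFile("s3n://my-bucket/xxxxsamplefiles_parquet/file2.parquet");
      part1.registerTempTable("xxxx_part1");
      part2.registerTempTable("xxxx_part2");
      JavaSchemaRDD combined = sqlCtx.sql(
          "SELECT * FROM xxxx_part1 UNION ALL SELECT * FROM xxxx_part2");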

Wondering why specifying a 'dir' works on the local filesystem but not on S3...

BTW, we have done the above steps using JSON-formatted files, and all four 
scenarios work well:

      // Create a JavaSchemaRDD from the file(s) pointed to by path
      JavaSchemaRDD xxxxData = sqlCtx.jsonFile(dataFilePath);


