[ https://issues.apache.org/jira/browse/SPARK-6161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-6161.
------------------------------
    Resolution: Not A Problem

I think this is a question for user@ first, but in any case it appears to be an
S3 problem rather than a Spark one.

> sqlCtx.parquetFile(dataFilePath) throws NPE when using s3, but OK when using 
> local filesystem
> ---------------------------------------------------------------------------------------------
>
>                 Key: SPARK-6161
>                 URL: https://issues.apache.org/jira/browse/SPARK-6161
>             Project: Spark
>          Issue Type: Question
>          Components: Spark Submit
>    Affects Versions: 1.2.1
>         Environment: MacOSX 10.10, S3
>            Reporter: Marshall
>
> Using some examples from Spark Summit 2014 and Spark 1.2.1, we converted 15 
> pipe-separated raw text files (averaging roughly 100k lines each) individually 
> to Parquet format using the following code:
>   // Infer the schema from the JavaBean class and write the result as Parquet
>   JavaSchemaRDD schemaXXXXData = sqlCtx.applySchema(xxxxData, XXXXRecord.class);
>   schemaXXXXData.registerTempTable("xxxxdata");
>   schemaXXXXData.saveAsParquetFile(output);
> We took the output of each run, renamed the part file to the original 
> filename plus .parquet, and dropped them all into one directory.
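> For reference, a self-contained sketch of that conversion step (Spark 1.2 
> Java API; XXXXRecord is assumed to be a plain JavaBean, and the parser 
> XXXXRecord.fromPipeDelimited plus the paths are illustrative stand-ins, not 
> from the original report):
>   SparkConf conf = new SparkConf().setAppName("ToParquet");
>   JavaSparkContext ctx = new JavaSparkContext(conf);
>   JavaSQLContext sqlCtx = new JavaSQLContext(ctx);
>   // Parse one pipe-delimited text file into bean objects
>   JavaRDD<XXXXRecord> xxxxData = ctx.textFile("/tmp/raw/file01.txt")
>       .map(line -> XXXXRecord.fromPipeDelimited(line));
>   // Infer the schema from the bean class and write a Parquet directory
>   // (Spark creates part-* files plus metadata files inside it)
>   JavaSchemaRDD schemaXXXXData = sqlCtx.applySchema(xxxxData, XXXXRecord.class);
>   schemaXXXXData.saveAsParquetFile("/tmp/out/file01_parquet");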
> We then created a Java class that we invoke via a 
> spark-1.2.1/bin/spark-submit command:
>       SparkConf sparkConf = new SparkConf().setAppName("XXXXX");
>       JavaSparkContext ctx = new JavaSparkContext(sparkConf);
>       JavaSQLContext sqlCtx = new JavaSQLContext(ctx);
>
>       final String dataFilePath = "/tmp/xxxxprocessor/xxxxsamplefiles_parquet";
>       //final String dataFilePath = inputPath;
>       // Create a JavaSchemaRDD from the file(s) pointed to by path
>       JavaSchemaRDD xxxxData = sqlCtx.parquetFile(dataFilePath);
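> For the S3 runs, only the path scheme changes; a minimal sketch of the S3 
> variant (the bucket name and credential values are placeholders, and the 
> fs.s3n.* keys are the standard Hadoop settings for the s3n:// scheme that 
> appears in the stack trace below):
>       // s3n:// credentials can be set on the shared Hadoop configuration
>       ctx.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "<ACCESS_KEY>");
>       ctx.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "<SECRET_KEY>");
>       JavaSchemaRDD s3Data = sqlCtx.parquetFile("s3n://my-bucket/xxxxsamplefiles_parquet");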
> GOOD: when we run our Spark app locally, specifying dataFilePath as the full 
> filename of ONE specific Parquet file on the local filesystem, all is well: 
> the 'sqlCtx.parquetFile(dataFilePath);' call finds the file and proceeds.
> GOOD: when we run our Spark app locally, specifying dataFilePath as the 
> directory that contains all the Parquet files, all is well: the call reads 
> through each file in the dataFilePath directory and proceeds.
> GOOD: if we upload ONE of the Parquet files to S3 and change our app to use 
> the S3 path (giving it the full filename of that ONE Parquet file), all is 
> good: the code finds the file and proceeds.
> BAD: if we then upload all the Parquet files to S3 and specify the S3 
> directory where all the Parquet files live, we get an NPE:
>  Exception in thread "main" java.lang.NullPointerException
>     at org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsInputStream.close(NativeS3FileSystem.java:106)
>     at java.io.BufferedInputStream.close(BufferedInputStream.java:472)
>     at java.io.FilterInputStream.close(FilterInputStream.java:181)
>     at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:428)
>     at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:389)
>     at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$3.apply(ParquetTypes.scala:457)
>     at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$3.apply(ParquetTypes.scala:457)
>     at scala.Option.map(Option.scala:145)
>     at org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetTypes.scala:457)
>     at org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:477)
>     at org.apache.spark.sql.parquet.ParquetRelation.<init>(ParquetRelation.scala:65)
>     at org.apache.spark.sql.api.java.JavaSQLContext.parquetFile(JavaSQLContext.scala:141)
>     at com.aol.ido.spark.sql.XXXXFileIndexParquet.doWork(XXXFileIndexParquet.java:101)
> We're wondering why specifying a directory works locally but not on S3...
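> One way to see what Hadoop's S3 listing actually returns for that directory 
> is to list it directly with the FileSystem API (a diagnostic sketch; the 
> bucket path is a placeholder). Zero-byte directory-placeholder objects or 
> stray non-Parquet entries in the listing would show up here, and an attempt 
> to read a Parquet footer from such an entry could plausibly produce the NPE 
> above:
>       import java.net.URI;
>       import org.apache.hadoop.fs.FileStatus;
>       import org.apache.hadoop.fs.FileSystem;
>       import org.apache.hadoop.fs.Path;
>
>       String dir = "s3n://my-bucket/xxxxsamplefiles_parquet";
>       // Print every object Hadoop sees under the S3 "directory"
>       FileSystem fs = FileSystem.get(URI.create(dir), ctx.hadoopConfiguration());
>       for (FileStatus st : fs.listStatus(new Path(dir))) {
>           System.out.println(st.getPath() + "  len=" + st.getLen());
>       }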
> Note that we have run all four of the above scenarios with JSON-formatted 
> files instead, and all four work:
>       // Create a JavaSchemaRDD from the file(s) pointed to by path
>       JavaSchemaRDD xxxxData = sqlCtx.jsonFile(dataFilePath);


