[
https://issues.apache.org/jira/browse/SPARK-6161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean Owen resolved SPARK-6161.
------------------------------
Resolution: Not A Problem
I think this is a question for user@ first, and it also appears to be an S3
problem.
> sqlCtx.parquetFile(dataFilePath) throws NPE when using s3, but OK when using
> local filesystem
> ---------------------------------------------------------------------------------------------
>
> Key: SPARK-6161
> URL: https://issues.apache.org/jira/browse/SPARK-6161
> Project: Spark
> Issue Type: Question
> Components: Spark Submit
> Affects Versions: 1.2.1
> Environment: MacOSX 10.10, S3
> Reporter: Marshall
>
> Using some examples from Spark Summit 2014 and Spark 1.2.1, we converted 15
> pipe-separated raw text files (averaging about 100k lines each) individually
> to Parquet file format using the following code:
> JavaSchemaRDD schemaXXXXData = sqlCtx.applySchema(xxxxData,
> XXXXRecord.class);
> schemaXXXXData.registerTempTable("xxxxdata");
> schemaXXXXData.saveAsParquetFile(output);
> We took the results of each folder and renamed the part file to match the
> original filename plus .parquet and dropped them all into one directory.
> We created a Java class that we then invoke using a
> spark-1.2.1/bin/spark-submit command...
> SparkConf sparkConf = new SparkConf().setAppName("XXXXX");
> JavaSparkContext ctx = new JavaSparkContext(sparkConf);
> JavaSQLContext sqlCtx = new JavaSQLContext(ctx);
>
> final String dataFilePath =
> "/tmp/xxxxprocessor/xxxxsamplefiles_parquet";
> //final String dataFilePath = inputPath;
> // Create a JavaSchemaRDD from the file(s) pointed to by path
> JavaSchemaRDD xxxxData = sqlCtx.parquetFile(dataFilePath);
> GOOD: when we run our spark app locally (specifying dataFilePath as a full
> filename of ONE specific parquet on local filesystem), all is well... the
> 'sqlCtx.parquetFile(dataFilePath);' command finds the file and proceeds.
> GOOD: when we run our spark app locally (specifying dataFilePath as the
> directory that contains all the parquet files), all is well... the
> 'sqlCtx.parquetFile(dataFilePath);' command rips thru each file in the
> dataFilePath directory and proceeds.
> GOOD: if we do the same thing by uploading ONE of the parquet files to s3,
> and change our app to use the s3 path (giving it the full filename to ONE
> parquet file), all is good - code finds the file and proceeds...
> BAD: if we then upload all the parquet files to s3 and specify the s3
> directory where all the parquet files are, we get an NPE:
> Exception in thread "main" java.lang.NullPointerException
> at org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsInputStream.close(NativeS3FileSystem.java:106)
> at java.io.BufferedInputStream.close(BufferedInputStream.java:472)
> at java.io.FilterInputStream.close(FilterInputStream.java:181)
> at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:428)
> at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:389)
> at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$3.apply(ParquetTypes.scala:457)
> at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$3.apply(ParquetTypes.scala:457)
> at scala.Option.map(Option.scala:145)
> at org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetTypes.scala:457)
> at org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:477)
> at org.apache.spark.sql.parquet.ParquetRelation.<init>(ParquetRelation.scala:65)
> at org.apache.spark.sql.api.java.JavaSQLContext.parquetFile(JavaSQLContext.scala:141)
> at com.aol.ido.spark.sql.XXXXFileIndexParquet.doWork(XXXFileIndexParquet.java:101)
> Wondering why specifying a 'dir' works locally but not in S3...
> BTW, we have done above steps using json formatted files and all four
> scenarios work well.
> // Create a JavaSchemaRDD from the file(s) pointed to by path
> JavaSchemaRDD xxxxData = sqlCtx.jsonFile(dataFilePath);
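
The report says a single-file S3 path works while the directory path fails. One thing that might be worth trying, as an untested sketch, is to build one full path per Parquet file and hand each to sqlCtx.parquetFile separately instead of passing the directory. The class name ParquetPaths, the helper perFilePaths, the bucket name, and the file names below are all hypothetical. The s3n:// scheme is assumed only because the stack trace goes through NativeS3FileSystem.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ParquetPaths {
    // Hypothetical helper (not part of Spark or Hadoop): builds one full path
    // per ".parquet" file name under a directory-style prefix.
    public static List<String> perFilePaths(String dirPrefix, List<String> fileNames) {
        String base = dirPrefix.endsWith("/") ? dirPrefix : dirPrefix + "/";
        List<String> paths = new ArrayList<String>();
        for (String name : fileNames) {
            // Skip anything that is not one of the renamed .parquet files.
            if (name.endsWith(".parquet")) {
                paths.add(base + name);
            }
        }
        return paths;
    }

    public static void main(String[] args) {
        // Placeholder bucket and file names, for illustration only.
        List<String> paths = perFilePaths(
                "s3n://example-bucket/xxxxsamplefiles_parquet",
                Arrays.asList("a.txt.parquet", "b.txt.parquet", "_SUCCESS"));
        for (String p : paths) {
            // In the Spark app, each path would be passed to
            // sqlCtx.parquetFile(p), the single-file call the report
            // describes as working on S3.
            System.out.println(p);
        }
    }
}
```

Whether per-file loading actually avoids the NullPointerException is an assumption; the file names would also have to come from wherever the uploaded files are tracked, which the report does not describe.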