Re: PathFilter for newAPIHadoopFile?

2014-09-15 Thread Nat Padmanabhan
Hi Eric, Something along the lines of the following should work:

    val fs = getFileSystem(...) // standard Hadoop API call
    val filteredConcatenatedPaths = fs.listStatus(topLevelDirPath, pathFilter).map(_.getPath.toString).mkString(",")
    // pathFilter is an instance of org.apache.hadoop.fs.PathFilter

Re: PathFilter for newAPIHadoopFile?

2014-09-15 Thread Davies Liu
In PySpark, I think you could enumerate all the valid files, and create an RDD for each with newAPIHadoopFile(), then union them together. On Mon, Sep 15, 2014 at 5:49 AM, Eric Friedman eric.d.fried...@gmail.com wrote: I neglected to specify that I'm using pyspark. Doesn't look like these APIs have been
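A minimal sketch of the enumerate-then-union idea, assuming the list of candidate paths has already been obtained somehow. The helper names (is_data_file, union_of_files) are hypothetical, and the input format, key, and value class names are left as parameters because the thread does not state them. The rule of skipping names that start with "." or "_" is taken from the original question.

```python
import posixpath


def is_data_file(path):
    """Return True unless the file name starts with '.' or '_'."""
    name = posixpath.basename(path.rstrip("/"))
    return bool(name) and not name.startswith((".", "_"))


def union_of_files(sc, paths, input_format, key_class, value_class):
    """Build one RDD per data file and union them.

    sc is a pyspark.SparkContext; the three class-name strings are
    assumptions the caller must supply for their own data.
    """
    data_paths = [p for p in paths if is_data_file(p)]
    if not data_paths:
        raise ValueError("no data files left after filtering")
    rdds = [
        sc.newAPIHadoopFile(p, input_format, key_class, value_class)
        for p in data_paths
    ]
    return sc.union(rdds)
```

With many small files this creates many RDDs, so it is probably only practical for a modest number of paths.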

Re: PathFilter for newAPIHadoopFile?

2014-09-15 Thread Eric Friedman
That's a good idea and one I had considered too. Unfortunately I'm not aware of an API in PySpark for enumerating paths on HDFS. Have I overlooked one? On Mon, Sep 15, 2014 at 10:01 AM, Davies Liu dav...@databricks.com wrote: In PySpark, I think you could enumerate all the valid files, and

Re: PathFilter for newAPIHadoopFile?

2014-09-15 Thread Davies Liu
One way is to do it in bash with hadoop fs -ls; maybe you could end up with a bash script to do this. On Mon, Sep 15, 2014 at 1:01 PM, Eric Friedman eric.d.fried...@gmail.com wrote: That's a good idea and one I had considered too. Unfortunately I'm not aware of an API in PySpark
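The same listing can also be driven from Python instead of a bash script. The sketch below assumes the hadoop command is on the PATH and that each entry line of hadoop fs -ls output has eight whitespace-separated columns with the path last; the function names are hypothetical.

```python
import subprocess


def parse_ls_output(text):
    """Extract file paths from `hadoop fs -ls` output.

    Skips the "Found N items" header and directory entries
    (lines whose permission string starts with 'd').
    """
    paths = []
    for line in text.splitlines():
        # Limit the split so a path containing spaces stays in one piece.
        fields = line.split(None, 7)
        if len(fields) < 8 or fields[0].startswith("d"):
            continue
        paths.append(fields[7].rstrip())
    return paths


def list_files(directory):
    """Run `hadoop fs -ls <directory>` and return the file paths."""
    out = subprocess.check_output(["hadoop", "fs", "-ls", directory])
    return parse_ls_output(out.decode("utf-8"))
```

The returned paths could then be filtered and passed to newAPIHadoopFile() one at a time.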

Re: PathFilter for newAPIHadoopFile?

2014-09-15 Thread Davies Liu
Or maybe you could give this one a try: https://labs.spotify.com/2013/05/07/snakebite/ On Mon, Sep 15, 2014 at 2:51 PM, Davies Liu dav...@databricks.com wrote: One way is to do it in bash with hadoop fs -ls; maybe you could end up with a bash script to do this. On Mon, Sep 15,
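A rough sketch of how snakebite might be used for the listing, assuming its Client.ls() yields one dict per entry with 'path' and 'file_type' keys ('f' for files), as its documentation describes. The namenode host and port in the usage comment are placeholders, and hdfs_file_paths is a hypothetical helper name.

```python
def hdfs_file_paths(client, directory):
    """Return the paths of plain files directly under `directory`.

    `client` is expected to behave like snakebite.client.Client:
    client.ls([path]) yields dicts describing each entry.
    """
    return [
        entry["path"]
        for entry in client.ls([directory])
        if entry.get("file_type") == "f"
    ]


# Usage (requires snakebite and a reachable namenode; values are placeholders):
#   from snakebite.client import Client
#   client = Client("namenode.example.com", 8020)
#   paths = hdfs_file_paths(client, "/data")
```

Because the helper only relies on ls(), it can be exercised with a stand-in client before pointing it at a real cluster.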

PathFilter for newAPIHadoopFile?

2014-09-14 Thread Eric Friedman
Hi, I have a directory structure with parquet+avro data in it. There are a couple of administrative files (.foo and/or _foo) that I need to ignore when processing this data or Spark tries to read them as containing parquet content, which they do not. How can I set a PathFilter on the