Hi Eric,
Something along the lines of the following should work:

val fs = getFileSystem(...) // standard Hadoop API call
val filteredConcatenatedPaths = fs.listStatus(topLevelDirPath, pathFilter)
  .map(_.getPath.toString)
  .mkString(",") // pathFilter is an instance of org.apache.hadoop.fs.PathFilter
In PySpark, I think you could enumerate all the valid files, create an RDD for
each with newAPIHadoopFile(), and then union them together.
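
For the PySpark side, here is a rough sketch of that enumeration, going through
Spark's Py4J gateway to reach the Hadoop FileSystem API. Note that sc._jvm and
sc._jsc are internal handles, so treat this as an assumption rather than a
supported API, and the directory name and filter rule below are placeholders:

def list_data_files(sc, top_level_dir):
    # Reach the Hadoop FileSystem API through Spark's JVM gateway (internal handles).
    Path = sc._jvm.org.apache.hadoop.fs.Path
    fs = Path(top_level_dir).getFileSystem(sc._jsc.hadoopConfiguration())
    statuses = fs.listStatus(Path(top_level_dir))
    paths = [status.getPath().toString() for status in statuses]
    # Drop administrative files such as ".foo" or "_foo".
    return [p for p in paths if not p.rsplit("/", 1)[-1].startswith((".", "_"))]

files = list_data_files(sc, "/path/to/topLevelDir")  # placeholder directory
# One RDD per file, then union them; the newAPIHadoopFile() arguments depend on
# the input format actually used for the parquet+avro data, so they are elided here.
# rdds = [sc.newAPIHadoopFile(p, inputFormatClass, keyClass, valueClass) for p in files]
# data = sc.union(rdds)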
On Mon, Sep 15, 2014 at 5:49 AM, Eric Friedman
eric.d.fried...@gmail.com wrote:
I neglected to specify that I'm using PySpark. Doesn't look like these APIs
have been exposed there.
That's a good idea and one I had considered too. Unfortunately I'm not
aware of an API in PySpark for enumerating paths on HDFS. Have I
overlooked one?
On Mon, Sep 15, 2014 at 10:01 AM, Davies Liu dav...@databricks.com wrote:
In PySpark, I think you could enumerate all the valid files, and
There is one way to do it in bash: hadoop fs -ls; maybe you could
end up with a bash script that does this.
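
If you go the shell route, a small Python wrapper around hadoop fs -ls could do
the filtering. A rough sketch follows; the output parsing is approximate (the
exact ls format varies across Hadoop versions) and the directory is a placeholder:

import subprocess

def ls_hdfs(directory):
    # Roughly parse `hadoop fs -ls <dir>` output into a list of paths.
    out = subprocess.check_output(["hadoop", "fs", "-ls", directory]).decode()
    paths = []
    for line in out.splitlines():
        parts = line.split()
        # Data lines end with the full path; this skips the "Found N items" header.
        if parts and (parts[-1].startswith("/") or "://" in parts[-1]):
            paths.append(parts[-1])
    return paths

def data_files(directory):
    # Keep only files whose basename does not start with "." or "_".
    return [p for p in ls_hdfs(directory)
            if not p.rsplit("/", 1)[-1].startswith((".", "_"))]

print(",".join(data_files("/path/to/topLevelDir")))  # placeholder directory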
On Mon, Sep 15, 2014 at 1:01 PM, Eric Friedman
eric.d.fried...@gmail.com wrote:
That's a good idea and one I had considered too. Unfortunately I'm not
aware of an API in PySpark
Or maybe you could give this one a try:
https://labs.spotify.com/2013/05/07/snakebite/
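
A hedged sketch of what that might look like with snakebite; the NameNode host
and port are placeholders, and the exact fields returned by ls() may differ by
version:

from snakebite.client import Client

# Placeholder NameNode host/port -- adjust for the actual cluster.
client = Client("namenode-host", 8020)

# ls() yields one dict per entry; the 'path' key holds the full path.
paths = [entry["path"] for entry in client.ls(["/path/to/topLevelDir"])]

# Filter out administrative files like ".foo" or "_foo".
data_paths = [p for p in paths
              if not p.rsplit("/", 1)[-1].startswith((".", "_"))]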
On Mon, Sep 15, 2014 at 2:51 PM, Davies Liu dav...@databricks.com wrote:
There is one way to do it in bash: hadoop fs -ls; maybe you could
end up with a bash script that does this.
On Mon, Sep 15,
Hi,
I have a directory structure with parquet+avro data in it. There are a
couple of administrative files (.foo and/or _foo) that I need to ignore
when processing this data; otherwise Spark tries to read them as containing
parquet content, which they do not.
How can I set a PathFilter on the