Hi

I put a million files into a HAR archive on HDFS. I'd like to iterate over
their file paths and read them. (Basically they are PDFs, and I want to
transform them into text with Apache PDFBox.)

My first attempt was to list them with the hadoop command
`hdfs dfs -ls har:///user/<my_user>/har/pdf.har`, and this works fine.
However, when I try to replicate this in Spark, I get an error:

```
import org.apache.spark.deploy.SparkHadoopUtil
import org.apache.hadoop.fs.{FileSystem, Path}

val hconf = SparkHadoopUtil.get.newConfiguration(sc.getConf)
val hdfs = FileSystem.get(hconf)
val test = hdfs.listFiles(new Path("har:///user/<my_user>/har/pdf.har"), false)
```

which throws:

```
java.lang.IllegalArgumentException: Wrong FS:
har:/user/<my_user>/har/pdf.har, expected: hdfs://<my_cluster>:<my_port>
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:661)
```
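
Reading the stack trace, I suspect `FileSystem.get(hconf)` returns the cluster's default (hdfs://) filesystem, which then rejects the har:// path in checkPath. Would resolving the filesystem from the path itself be the right fix? A minimal sketch of what I mean (reusing `hconf` from above, untested):

```
import org.apache.hadoop.fs.Path

val harPath = new Path("har:///user/<my_user>/har/pdf.har")
val harFs   = harPath.getFileSystem(hconf)    // should resolve HarFileSystem from the scheme
val files   = harFs.listFiles(harPath, true)  // recursive listing of the archive
while (files.hasNext) {
  println(files.next().getPath)
}
```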

However, I am able to use `sc.textFile` without a problem:

```
val test = sc.textFile("har:///user/<my_user>/har/pdf.har").count
80000000
```  

--------------------------------------------------------------
1) Is this easily solvable?
2) Do I need to implement my own pdfFile reader, inspired by textFile? (A rough sketch of what I had in mind is below.)
3) If not, is HAR the best way? I have been looking at Avro too.
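
For question 2, here is roughly the kind of reader I had in mind, in case it clarifies what I am after: read each archived file as bytes with `sc.binaryFiles` and extract the text with PDFBox. This is only a sketch, assuming `binaryFiles` accepts a har:// URI and that PDFBox 2.x (`PDDocument.load`) is on the classpath; the path is the same placeholder as above.

```
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper

// (path, bytes) for every file in the archive, then PDF -> plain text
val texts = sc.binaryFiles("har:///user/<my_user>/har/pdf.har")
  .map { case (path, stream) =>
    val doc = PDDocument.load(stream.toArray())
    try {
      (path, new PDFTextStripper().getText(doc))
    } finally {
      doc.close()
    }
  }
```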

Thanks for any advice,

-- 
Nicolas
