Hi Alexander,
You may want to try the wholeTextFiles() method of SparkContext. Using that
you could just do something like this:
sc.wholeTextFiles("hdfs://input_dir")
  .saveAsSequenceFile("hdfs://output_dir")
wholeTextFiles() returns an RDD of (filename, content) pairs.
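Expanded slightly, a minimal sketch might look like this (the paths and the app name are placeholders, not your actual setup):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ConsolidateSmallFiles {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ConsolidateSmallFiles")
    val sc = new SparkContext(conf)

    // Each element is a (filename, fileContent) pair
    val files = sc.wholeTextFiles("hdfs://input_dir")

    // coalesce(1) forces a single output part file; fine for small data,
    // but note it funnels all the data through one task
    files.coalesce(1).saveAsSequenceFile("hdfs://output_dir")

    sc.stop()
  }
}
```

One caveat: saveAsSequenceFile writes one part file per partition, so without the coalesce you may still end up with several output files.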
Hi colleagues,
In Hadoop I have a lot of folders containing small files. I therefore read the
content of all the folders, union the small files, and write the unioned data
into a single folder containing one file. Afterwards I delete the small files
and their corresponding folders.
I see two