Re: Prevent possible out of memory when using read/union

2015-11-04 Thread Sujit Pal
Hi Alexander,

You may want to try the wholeTextFiles() method of SparkContext. Using that, you could do something like this:

    sc.wholeTextFiles("hdfs://input_dir").saveAsSequenceFile("hdfs://output_dir")

wholeTextFiles() returns an RDD of (filename, content) pairs.
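A slightly fuller sketch of this suggestion, for reference; the HDFS paths and the application boilerplate are placeholders, not from the thread:

    import org.apache.spark.SparkContext

    object CompactSmallFiles {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext()

        // wholeTextFiles reads each file as one record, yielding an
        // RDD[(String, String)] of (filePath, fileContent) pairs, so
        // small files are never split into per-line records.
        val files = sc.wholeTextFiles("hdfs://input_dir")

        // An RDD of pairs of Writable-convertible types (String maps
        // to Text) can be written directly as a Hadoop SequenceFile.
        files.saveAsSequenceFile("hdfs://output_dir")

        sc.stop()
      }
    }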

Prevent possible out of memory when using read/union

2015-11-04 Thread Alexander Lenz
Hi colleagues,

In Hadoop I have a lot of folders containing small files. I therefore read the content of all folders, union the small files, and write the unioned data into a single folder containing one file. Afterwards I delete the small files and the corresponding folders. I see two
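The workflow described above might look roughly like the following sketch; the folder names, the output path, and the use of coalesce(1) to force a single output file are my assumptions, not taken from the thread:

    import org.apache.spark.SparkContext

    object UnionSmallFiles {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext()

        // Hypothetical input folders, each holding many small files.
        val folders = Seq("hdfs://data/folder1", "hdfs://data/folder2")

        // sc.union merges all per-folder RDDs in one step, avoiding
        // the long lineage a chain of pairwise unions would build.
        val unioned = sc.union(folders.map(path => sc.textFile(path)))

        // coalesce(1) yields a single output file, as described above,
        // but funnels all data through a single task, which is where
        // memory pressure can appear for large inputs.
        unioned.coalesce(1).saveAsTextFile("hdfs://data/compacted")

        sc.stop()
      }
    }

Writing one file this way trades away parallelism for the single-file layout, which is why compacting via wholeTextFiles and a SequenceFile, as suggested in the reply above, can be the gentler option.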