Re: Prevent possible out of memory when using read/union

2015-11-04 Thread Sujit Pal
Hi Alexander,

You may want to try the wholeTextFiles() method of SparkContext. Using that, you could do something like this:

    sc.wholeTextFiles("hdfs://input_dir").saveAsSequenceFile("hdfs://output_dir")

wholeTextFiles() returns an RDD of (filename, content) pairs.
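A slightly fuller sketch of this suggestion, for reference; the HDFS paths and the application boilerplate are placeholders, not from the thread:

    import org.apache.spark.SparkContext

    object CompactSmallFiles {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext()

        // wholeTextFiles reads each file as one record, yielding an
        // RDD[(String, String)] of (filePath, fileContent) pairs, so
        // small files are never split into per-line records.
        val files = sc.wholeTextFiles("hdfs://input_dir")

        // An RDD of pairs of Writable-convertible types (String maps
        // to Text) can be written directly as a Hadoop SequenceFile.
        files.saveAsSequenceFile("hdfs://output_dir")

        sc.stop()
      }
    }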

Prevent possible out of memory when using read/union

2015-11-04 Thread Alexander Lenz
Hi colleagues,

In Hadoop I have a lot of folders containing small files. I therefore read the content of all folders, union the small files, and write the unioned data into a single folder containing one file. Afterwards I delete the small files and the corresponding folders. I see two
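The workflow described above might look roughly like the following sketch; the folder names, the output path, and the use of coalesce(1) to force a single output file are my assumptions, not taken from the thread:

    import org.apache.spark.SparkContext

    object UnionSmallFiles {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext()

        // Hypothetical input folders, each holding many small files.
        val folders = Seq("hdfs://data/folder1", "hdfs://data/folder2")

        // sc.union merges all per-folder RDDs in one step, avoiding
        // the long lineage a chain of pairwise unions would build.
        val unioned = sc.union(folders.map(path => sc.textFile(path)))

        // coalesce(1) yields a single output file, as described above,
        // but funnels all data through a single task, which is where
        // memory pressure can appear for large inputs.
        unioned.coalesce(1).saveAsTextFile("hdfs://data/compacted")

        sc.stop()
      }
    }

Writing one file this way trades away parallelism for the single-file layout, which is why compacting via wholeTextFiles and a SequenceFile, as suggested in the reply above, can be the gentler option.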