Hello Spark Dev Community,

A friend of mine is facing an issue while reading 20 GB of log files from a
directory on a cluster.
The approaches tried so far are below:

*1. This gives an out-of-memory error.*
val logRDD = sc.wholeTextFiles("file:/usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/logs/*")
val mappedRDD = logRDD.flatMap { x => x._2.split("[^A-Za-z']+") }.map { x => x.replaceAll("""\n""", " ") }
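
If wholeTextFiles has to stay, it also takes an optional minPartitions argument, which may spread the files over more tasks (each file is still loaded as one record, though). A rough sketch, with 200 as an arbitrary placeholder rather than a tested setting:

// Same read as above, but requesting more partitions so each task holds
// fewer whole files at once. 200 is a placeholder, not a recommendation.
val logRDDMorePartitions = sc.wholeTextFiles("file:/usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/logs/*", 200)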

*2. Individual files can be processed with the approach below.*
val textlogRDD = sc.textFile("file:///usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/logs/spark-hduser-org.apache.spark.deploy.master.Master-1-chetan-ThinkPad-E460.out")
val textMappedRDD = textlogRDD.flatMap { x => x.split("[^A-Za-z']+") }.map { y => y.replaceAll("""\n""", " ") }
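
As a side note, sc.textFile also accepts wildcards, so the same line-by-line reading could be pointed at the whole logs directory instead of a single file; a minimal sketch:

// Reads every matching file line by line, rather than one whole file per record.
val allLinesRDD = sc.textFile("file:///usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/logs/*")
val allWordsRDD = allLinesRDD.flatMap(_.split("[^A-Za-z']+"))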

*3. Could also be tried.*
val tempRDD = sc.wholeTextFiles("file:/usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/logs/*")
  .flatMap(files => files._2.split("[^A-Za-z']+").map(_.replaceAll("""\n""", " ")))

*Thoughts:*
1. If OutOfMemory is the issue, increasing driver-memory at spark-submit might help, because *collect()* would pull everything onto the driver node.
2. Or persist the RDD with StorageLevel.MEMORY_AND_DISK_SER_2 (serialized, spilling to disk) and then proceed further; a rough sketch of both ideas follows below.
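
For thought 1, the memory settings would go on spark-submit (e.g. --driver-memory and --executor-memory). For thought 2, a minimal persist sketch is below; the output path is a placeholder, and writing results out with saveAsTextFile instead of collect() is only an illustration of keeping data off the driver:

import org.apache.spark.storage.StorageLevel

// Keep the tokenized RDD serialized in memory, spilling to disk when needed,
// replicated on two nodes (the _2 suffix).
mappedRDD.persist(StorageLevel.MEMORY_AND_DISK_SER_2)

// Writing results to files instead of collect() keeps data off the driver.
mappedRDD.saveAsTextFile("file:///tmp/log-words-output")  // placeholder path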

Any suggestions, please?

Thanks
-Chetan
