I would recommend using the Akka Streams API for this. Here is a sample. I was able to process a 1GB file with around 1.5 million records in about *20MB* of memory. The file-read and console-write rates are different, but the streams API handles that through backpressure. This is not the fastest approach, but at least you won't run out of memory.
<https://lh6.googleusercontent.com/-zdX0n1pvueE/VLATDja3K4I/AAAAAAAAv18/BH7V1RAuxT8/s1600/1gb_file_processing.png>

    import java.io.FileInputStream
    import java.util.Scanner

    import akka.actor.ActorSystem
    import akka.stream.{FlowMaterializer, MaterializerSettings}
    import akka.stream.scaladsl.Source

    import scala.util.Try

    object StreamingFileReader extends App {

      val inputStream = new FileInputStream("/path/to/file")
      val sc = new Scanner(inputStream, "UTF-8")

      implicit val system = ActorSystem("Sys")
      val settings = MaterializerSettings(system)
      // Small buffers keep the stage-to-stage hand-off bounded.
      implicit val materializer = FlowMaterializer(
        settings.copy(maxInputBufferSize = 256, initialInputBufferSize = 256))

      // Wrap the Scanner in a proper Iterator so the stream completes
      // cleanly at end of file instead of nextLine() throwing.
      val fileSource = Source(() => new Iterator[String] {
        def hasNext = sc.hasNextLine
        def next()  = sc.nextLine()
      })

      import system.dispatcher

      fileSource.map { line =>
        line // do your per-line transformation here
      }.foreach(println).onComplete { _ =>
        Try {
          sc.close()
          inputStream.close()
        }
        system.shutdown()
      }
    }

On Friday, January 9, 2015 at 10:53:33 AM UTC-5, Allen Nie wrote:
>
> Hi,
>
> I am trying to process a CSV file with 40 million lines of data in it.
> It's a 5GB file. I'm trying to use Akka to parallelize the task.
> However, it seems like I can't stop the quick memory growth: it
> expanded from 1GB to almost 15GB (the limit I set) in under 5 minutes.
> This is the code in my main() method:
>
>     val inputStream = new FileInputStream("E:\\Allen\\DataScience\\train\\train.csv")
>     val sc = new Scanner(inputStream, "UTF-8")
>     var counter = 0
>     while (sc.hasNextLine) {
>       rowActors(counter % 20) ! Row(sc.nextLine())
>       counter += 1
>     }
>     sc.close()
>     inputStream.close()
>
> Someone pointed out that I was essentially creating 40 million Row
> objects, which naturally take up a lot of space. My row actor is not
> doing much: it just transforms each line into an array of integers
> (if you are familiar with the concept of vectorizing, that's what I'm
> doing). Then the transformed array gets printed out. Done.
> I originally thought there was a memory leak, but maybe I'm just not
> managing memory right. Can I get any wise suggestions from the Akka
> experts here?
>
> <http://i.stack.imgur.com/yQ4xx.png>
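To see why the quoted loop blows up: `!` never blocks, so the while loop reads lines far faster than 20 row actors can process them, and the 40 million Row messages pile up in unbounded mailboxes. The cure, whether you use Akka Streams or do it by hand, is a bounded buffer that makes the producer wait. Here is a minimal plain-Scala sketch of that idea (no Akka, names are mine, not from the thread), just to illustrate the mechanism:

```scala
import java.util.concurrent.ArrayBlockingQueue

// Bounded producer/consumer pipeline: put() blocks when the queue is
// full, so memory stays bounded no matter how large the input is.
// This is, in miniature, the backpressure the streams API gives you.
object BoundedPipeline {
  def process(lines: Iterator[String], capacity: Int)(handle: String => Unit): Unit = {
    val queue = new ArrayBlockingQueue[String](capacity)
    val Done  = "\u0000EOF" // sentinel; assumes it never occurs as a real line

    val consumer = new Thread(() => {
      var line = queue.take()
      while (line != Done) {
        handle(line)
        line = queue.take()
      }
    })
    consumer.start()

    lines.foreach(queue.put) // blocks when full -> producer slows down
    queue.put(Done)
    consumer.join()
  }
}
```

Usage would look like `BoundedPipeline.process(scala.io.Source.fromFile("/path/to/file").getLines(), 256)(println)`: at any instant at most `capacity` lines are held in memory, which is the property your actor version is missing.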
