Hi,
I am trying to process a CSV file with 40 million lines of data; the file
is about 5 GB. I'm trying to use Akka to parallelize the task, but I can't
stop the rapid memory growth: usage expanded from 1 GB to almost 15 GB
(the limit I set) in under 5 minutes. This is the code in my main() method:
import java.io.FileInputStream
import java.util.Scanner

val inputStream = new FileInputStream("E:\\Allen\\DataScience\\train\\train.csv")
val sc = new Scanner(inputStream, "UTF-8")
var counter = 0
while (sc.hasNextLine) {
  rowActors(counter % 20) ! Row(sc.nextLine())
  counter += 1
}
sc.close()
inputStream.close()
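For completeness, rowActors is just a fixed pool of 20 identical row actors
that I address round-robin via counter % 20. The setup looks roughly like
this (a simplified sketch; the system name and actor names are placeholders,
and RowActor is sketched further down):

import akka.actor.{ActorSystem, Props}

val system = ActorSystem("csv-processing")
// 20 identical worker actors, addressed round-robin via counter % 20
val rowActors = Vector.tabulate(20) { i =>
  system.actorOf(Props[RowActor], s"row-actor-$i")
}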
Someone pointed out that I am essentially creating 40 million Row objects,
which naturally take up a lot of space. My row actor is not doing much: it
simply transforms each line into an array of integers (if you are familiar
with the concept of vectorizing, that's what I'm doing) and then prints the
transformed array. Done. I originally thought there was a memory leak, but
maybe I'm just not managing memory correctly.
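In case it helps, this is roughly what the row actor does (a simplified
sketch; the class name and the parsing details are stand-ins for my real
code):

import akka.actor.Actor

case class Row(line: String)

class RowActor extends Actor {
  def receive = {
    case Row(line) =>
      // Vectorize: split the CSV line and parse every field as an Int
      val vector: Array[Int] = line.split(",").map(_.trim.toInt)
      // Print the result and drop it; nothing is retained afterwards
      println(vector.mkString(","))
  }
}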
Can I get any wise suggestions from the Akka experts here?
(Screenshot of the memory growth: http://i.stack.imgur.com/yQ4xx.png)