Garbage collection in Node can and should be automatic. When the process 
comes under memory pressure, the garbage collector will run. You can expose 
it and call it yourself if you absolutely must control when it runs, but in 
general you shouldn't need to do that. The garbage collector will clean up 
memory once there are no more references to it in your program -- I'm not 
sure offhand whether V8 uses a generational collector, mark-and-sweep, or 
something else, but you really don't need to understand the internals of 
how it works.
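
If you ever do want to trigger it by hand (useful for diagnostics only -- 
the crawler shouldn't depend on it), starting node with --expose-gc makes a 
global gc() function available:

// run with: node --expose-gc crawler.js
if (global.gc) {
  console.log('heap before:', process.memoryUsage().heapUsed);
  global.gc();  // force a collection
  console.log('heap after: ', process.memoryUsage().heapUsed);
}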

I see that you are adding every page you crawl to the 'html' array, but 
never removing entries from it once they're saved. That means the array 
grows without bound as you crawl more pages. On top of that, you save the 
entire contents of the array every time you crawl a page, so each save 
writes out more and more data.
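
I'm guessing the shape of the problem is roughly this (illustrative only -- 
these names aren't from your repo):

var html = [];                          // never emptied, so it grows forever

function onPageCrawled(url, body) {
  html.push({ url: url, body: body });  // every page body stays referenced here
  saveAll(html);                        // and the whole array is written out again
}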

There really should be no reason to keep more than one page in memory at a 
time (or at most as many pages as the concurrency you allow in your queue).

If I were to do it, I'd write it like this (in pseudocode, ignoring that 
node is async):

page := crawl_page('http://reddit.com');
links := get_links(page);
foreach link in links
  queue(link, crawl_page_and_save);

def crawl_page_and_save(link) ->
  page := crawl_page(link);
  save_page(page);  // store it to your database or file system -- a new entry per link
  // potentially queue more links found on the crawled page


You'll notice that I never add the page to any intermediate array. Once 
crawl_page_and_save is done, nothing references that page any more, so its 
memory can be reclaimed during the next garbage collection.
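
In real (async) node that might look something like the sketch below. It 
assumes the request and async modules plus save_page/get_links 
implementations of your own, so treat it as an outline rather than drop-in 
code:

var request = require('request');  // assumed: any HTTP client works here
var async   = require('async');

// a queue with concurrency 5: at most 5 page bodies are in memory at once
var queue = async.queue(crawl_page_and_save, 5);

function crawl_page_and_save(link, done) {
  request(link, function (err, res, body) {
    if (err) return done(err);
    save_page(link, body, function (err) {       // your mongo/filesystem write
      if (err) return done(err);
      get_links(body).forEach(function (next) {  // your link extractor
        queue.push(next);
      });
      done();  // once this callback returns, nothing holds on to body
    });
  });
}

queue.push('http://reddit.com');

A real crawler would also want to remember which URLs it has already 
visited and filter for internal links, but the important part is that once 
a page is saved and its callback finishes, the body is unreachable and the 
collector can reclaim it.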
  

On Wednesday, April 3, 2013 12:00:57 AM UTC-4, Jonathan Crowe wrote:
>
> I'm learning node and decided to build a web crawler. It works well 
> enough, but when I try to crawl a site like reddit I start running into 
> severe memory issues.
>
> The goal of the crawler would be to take a provided url, crawl the page, 
> gather all internal links and crawl them, then store all the html from all 
> pages into a mongo database or on the file system.
>
> Since I'm working with large amounts of data, it's important that I 
> understand garbage collection in node, but no matter what I do I can't seem 
> to improve the performance. Any chance one of you with more expertise could 
> take a look and help me figure out where my holes are?
>
> git repo here: https://github.com/jcrowe206/crawler
>
