Garbage collection in node is automatic. When the process comes under
memory pressure, the garbage collector will run. You can expose it and call
it yourself if you absolutely must control when it runs, but in general you
shouldn't need to do that. The garbage collector will only reclaim memory
once there are no more references to it in your program -- I'm not sure off
the top of my head whether V8 uses a generational collector, mark-and-sweep,
or something else, but you really don't have to understand the internals to
fix this kind of problem.
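If you ever do want to trigger it by hand, the hook is the --expose-gc
flag; a minimal sketch:

// run with: node --expose-gc crawler.js
if (global.gc) {
    console.log('heap before:', process.memoryUsage().heapUsed);
    global.gc();   // force a collection
    console.log('heap after: ', process.memoryUsage().heapUsed);
}

But that's a debugging tool, not a fix -- if memory keeps growing, something
in your code is still holding references.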
I see that you are pushing every page onto the 'html' array, but never
removing entries from it when you save. That means the array grows without
bound as you crawl more pages. Worse, you're also writing out the entire
contents of the array every time you crawl a page, so each save stores more
and more data.
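In other words, the shape of the problem is roughly this (a sketch of the
pattern as described, not your exact code):

var html = [];   // never emptied, so it grows forever

function onPageCrawled(url, body) {
    html.push({ url: url, body: body });   // every page body stays referenced
    saveAll(html);                         // stand-in for your save step: re-saves the whole array each time
}

As long as that array holds a reference to every page body, the garbage
collector is not allowed to free any of them.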
There really should be no reason to hold more than one page in memory at a
time (or, at most, as many pages as the concurrency you allow in your
queue).
If I were to do it, I'd write it like this (in pseudocode, ignoring that
node is async):

page := crawl_page('http://reddit.com');
links := get_links(page);
foreach link in links
    queue(link, crawl_page_and_save);

def crawl_page_and_save (link) ->
    page = crawl_page(link);
    save_page(page);  // store it to your database or file system -- one entry per link
    // potentially get more links from the crawled page and queue them too
You'll notice that I never add the page to any intermediate array. Once
crawl_page_and_save returns, nothing references the page any more, so the
memory it used can be reclaimed during the next garbage collection.
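A rough Node version of the same idea, assuming the request and cheerio
modules and a savePage stand-in for your mongo/file-system write (a sketch,
not a drop-in replacement for your repo):

var async   = require('async');    // concurrency-limited queue
var request = require('request');  // assumed HTTP client
var cheerio = require('cheerio');  // assumed HTML parser

var seen = {};  // urls already queued (urls only, not page bodies)

// at most 5 pages in flight -- and therefore in memory -- at once
var q = async.queue(crawlPageAndSave, 5);

function crawlPageAndSave(link, done) {
    request(link, function (err, res, body) {
        if (err) return done();

        savePage(link, body, function () {
            // queue any new internal links; the body itself is never stored in an array
            var $ = cheerio.load(body);
            $('a[href]').each(function () {
                var href = $(this).attr('href');
                // crude internal-link check, just for the sketch
                if (href && href.indexOf('http://reddit.com') === 0 && !seen[href]) {
                    seen[href] = true;
                    q.push(href);
                }
            });
            done();  // after this returns, `body` is unreachable and can be collected
        });
    });
}

function savePage(url, body, cb) {
    // stand-in: write one record per page to mongo or the file system here
    cb();
}

seen['http://reddit.com'] = true;
q.push('http://reddit.com');

The seen table still grows, but it only holds URLs rather than page bodies,
so memory use stays roughly proportional to the queue's concurrency rather
than to the number of pages crawled.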
On Wednesday, April 3, 2013 12:00:57 AM UTC-4, Jonathan Crowe wrote:
>
> I'm learning node and decided to build a web crawler. It works well
> enough, but when I try to crawl a site like reddit I start running into
> severe memory issues.
>
> The goal of the crawler would be to take a provided url, crawl the page,
> gather all internal links and crawl them, then store all the html from all
> pages into a mongo database or on the file system.
>
> Since I'm working with large amounts of data, it is important that I
> understand garbage collection in node, but no matter what I do I can't seem
> to improve the performance. Any chance one of you with more expertise could
> take a look and help me figure out where my holes are?
>
> git repo here: https://github.com/jcrowe206/crawler
>