So I took your advice and things got a little better. Then I took out the
jQuery module I was using to parse the HTML out of the request response and
everything started working! It seems jQuery was retaining references to
all the HTML it had parsed!
Thanks for the help!
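
In case it helps anyone else who hits this: the kind of parsing that seems
to avoid the problem looks roughly like the below (cheerio shown purely as
an example of a parser that doesn't hang on to the documents -- not a quote
from my actual code):

var cheerio = require('cheerio');   // example parser

function getLinks(body) {
    var $ = cheerio.load(body);     // parse just this one response
    var links = [];
    $('a[href]').each(function () {
        links.push($(this).attr('href'));
    });
    return links;  // nothing retains $ or the parsed document after this returns
}
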
On Wednesday, April 3, 2013 7:14:03 AM UTC-7, Paul wrote:
>
> Garbage collection in node can/should be automatic. When you run into
> memory pressure, the garbage collector will run. You can expose it and call
> it yourself if you absolutely must control when the garbage collector is
> run, but in general you shouldn't need to do that. The garbage collector
> will clean up memory when there are no more references to it in your
> program -- I'm not sure whether it uses a generational collector, mark and
> sweep, or whatever, but really you don't have to understand the internals
> of how it works.
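>
> (If you ever do want to trigger it by hand, the usual way is to start node
> with the --expose-gc flag, which makes a global gc() function available:
>
> // run with: node --expose-gc your-script.js
> if (global.gc) {
>     global.gc();  // force a collection -- normally you should never need this
> }
>
> but again, for a crawler this shouldn't be necessary.)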
>
> I see that you are adding all pages to the array 'html', but never popping
> them off when you go to save them. This means the contents of the array
> will continue to grow without bound as you crawl more pages. On top of
> that, you're storing the entire contents of the array every time you crawl
> a page, so each save writes more and more data.
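>
> In other words (a guess at the shape of your code, not a quote from it),
> something like this:
>
> var html = [];
>
> // on every page crawled:
> html.push(body);   // the array only ever grows -- nothing removes entries
> save(html);        // 'save' standing in for however you persist it -- the
>                    // whole, ever-larger array gets stored again each time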
>
> There really should be no reason to store more than one page in memory at
> a time (or, at most, as many pages as the concurrency you want to allow in
> your queue).
>
> If I were to do it, I'd write it like this (in pseudocode, ignoring that
> node is async):
>
> page := crawl_page('http://reddit.com');
> links := get_links(page);
> foreach link in links
>     queue(link, crawl_page_and_save);
>
> def crawl_page_and_save (link) ->
>     page := crawl_page(link);
>     // store it to your database or file system -- add a new entry for each link
>     save_page(page);
>     // potentially get more links from the crawled page
>
>
> You'll notice that I don't add anything to an intermediate array. After
> crawl_page_and_save is done, the memory it used will be freed up during
> the next garbage collection.
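>
> If it helps, here's the same idea in actual (async) node -- just a sketch,
> using the request, cheerio and async modules purely as examples; substitute
> whatever HTTP client, parser and queue you're already using:
>
> var request = require('request');  // example HTTP client
> var cheerio = require('cheerio');  // example HTML parser (no heavy DOM kept around)
> var async   = require('async');    // example queue with a concurrency limit
> var fs      = require('fs');
> var url     = require('url');
>
> var seen = {};
>
> // Only `concurrency` page bodies are in memory at any moment; each body
> // becomes garbage as soon as its worker calls done().
> var q = async.queue(function (link, done) {
>     request(link, function (err, res, body) {
>         if (err) return done();
>         // save_page: one file per link (or an insert into mongo instead)
>         fs.writeFile('page-' + encodeURIComponent(link) + '.html', body, function () {
>             var $ = cheerio.load(body);
>             $('a[href]').each(function () {
>                 var next = url.resolve(link, $(this).attr('href'));
>                 // only follow internal links we haven't queued yet
>                 if (url.parse(next).host === url.parse(link).host && !seen[next]) {
>                     seen[next] = true;
>                     q.push(next);
>                 }
>             });
>             done();  // body and $ go out of scope here and can be collected
>         });
>     });
> }, 5);  // crawl at most 5 pages at a time
>
> seen['http://reddit.com/'] = true;
> q.push('http://reddit.com/');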
>
>
> On Wednesday, April 3, 2013 12:00:57 AM UTC-4, Jonathan Crowe wrote:
>>
>> I'm learning node and decided to build a web crawler. It works well
>> enough, but when I try to crawl a site like reddit I start running into
>> severe memory issues.
>>
>> The goal of the crawler is to take a provided URL, crawl the page, gather
>> all internal links and crawl them, then store the HTML from every page in
>> a Mongo database or on the file system.
>>
>> Since I'm working with large amounts of data, it's important that I
>> understand garbage collection in node, but no matter what I do I can't
>> seem to improve the performance. Any chance one of you with more
>> expertise could take a look and help me figure out where my holes are?
>>
>> git repo here: https://github.com/jcrowe206/crawler
>>
>