Hi,

I had a play with your code and found a couple of things. It's
probably worth trying to avoid the global links_grabbed array, as
it just keeps growing as you crawl and every new link gets checked
against it with indexOf(). I know you need it to avoid parsing the
same page twice, but perhaps you could keep that state in a more
lightweight structure? I'd probably be tempted to use a Redis set
(or something similar).
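
If you did go down the Redis route, a rough sketch using the
node_redis client might look like this (the "links_grabbed" key
name is just my choice, and the membership check becomes
asynchronous, so the push has to move into a callback):

var redis = require('redis'),
    client = redis.createClient();

// Mark a URL as visited; SADD is a no-op if it's already in the set.
client.sadd('links_grabbed', link_check);

// Ask Redis whether we've seen a link before queueing it,
// instead of scanning an ever-growing in-process array.
client.sismember('links_grabbed', link, function(err, seen) {
    if( !err && !seen )
        links.push(link);
});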

Also, I'm not an expert on scraper, but I *seemed* to get a
performance improvement when I modified the code that pushes new
URLs onto your links stack.

I added a String() conversion: e.g.

...
links.push(String(link));
...

which meant I wasn't keeping the jQuery-wrapped link around on the stack.
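
For context, the change sits inside your each() callback; keeping
your original checks as-is, the only difference is the push:

jQuery('a').each(function() {
    var link = jQuery(this).attr("href").trim();

    if( link.indexOf("/")==0 )
        link = "http://" + checkDomain + link;

    if( links.indexOf(link)==-1 && links_grabbed.indexOf(link)==-1
        && ["#", ""].indexOf(link)==-1
        && (link.indexOf("http://" + checkDomain)==0
            || link.indexOf("https://" + checkDomain)==0) ) {
        // push a plain primitive string rather than anything still
        // tied to the jsdom document
        links.push(String(link));
    }
});

Hope that helps,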

Charles



On 2 July 2012 14:08, ec.developer <[email protected]> wrote:
> Hi all,
> I've created a small app which searches for Not Found [404] errors on a
> specified website. I use the node-scraper module
> (https://github.com/mape/node-scraper/), which uses Node's native request
> module and jsdom to parse the HTML.
> My app recursively searches for links on each webpage and then calls the
> scraping routine for each link it finds. The problem is that after scanning
> 100 pages (and collecting over 200 links still to be scanned), the RSS memory
> usage is over 200MB (and it keeps increasing on each iteration), so after
> scanning 300-400 pages I get a memory allocation error.
> The code is provided below.
> Any hints?
>
> var scraper = require('scraper'),
>     util = require('util');
>
> var checkDomain = process.argv[2].replace("https://", "").replace("http://", ""),
>     links = [process.argv[2]],
>     links_grabbed = [];
>
> var link_check = links.pop();
> links_grabbed.push(link_check);
> scraper(link_check, parseData);
>
> function parseData(err, jQuery, url)
> {
>     var ramUsage = bytesToSize(process.memoryUsage().rss);
>     process.stdout.write("\rLinks checked: " + (Object.keys(links_grabbed).length) + "/" + links.length + " [" + ramUsage + "] ");
>
>     if( err ) {
>         console.log("%s [%s], source - %s", err.uri, err.http_status, links_grabbed[err.uri].src);
>     }
>     else {
>         jQuery('a').each(function() {
>             var link = jQuery(this).attr("href").trim();
>
>             if( link.indexOf("/")==0 )
>                 link = "http://" + checkDomain + link;
>
>             if( links.indexOf(link)==-1 && links_grabbed.indexOf(link)==-1 && ["#", ""].indexOf(link)==-1 && (link.indexOf("http://" + checkDomain)==0 || link.indexOf("https://" + checkDomain)==0) )
>                 links.push(link);
>         });
>     }
>
>     if( links.length>0 ) {
>         var link_check = links.pop();
>         links_grabbed.push(link_check);
>         scraper(link_check, parseData);
>     }
>     else {
>         util.log("Scraping is done. Bye bye =)");
>         process.exit(0);
>     }
> }
>
