Can you try switching to cheerio instead of jsdom? I found that jsdom consumed far too much RAM and was slow, since it builds a full DOM for every page; cheerio just parses the markup and gives you a jQuery-style API over it.
On Wed, Jul 25, 2012 at 8:57 AM, ec.developer <[email protected]> wrote:

> After a longer test, got these results:
> Mem: 795 MB
> Requests running: 11
> Grabbed: 98050
> Links queue: 1160553
>
> Links grabbed and links queue are still stored in mongodb.
>
> On Monday, July 2, 2012 4:08:13 PM UTC+3, ec.developer wrote:
>
>> Hi all,
>> I've created a small app which searches for Not Found [404] errors on a
>> specified website. I use the node-scraper module
>> (https://github.com/mape/node-scraper/), which uses node's native request
>> module and jsdom for parsing the HTML. My app recursively searches for
>> links on each webpage and then calls the scraping routine for each link
>> it finds. The problem is that after scanning 100 pages (and collecting
>> over 200 links still to be scanned), RSS memory usage is over 200 MB, and
>> it keeps growing on each iteration. So after scanning 300-400 pages, I
>> get a memory allocation error.
>> The code is provided below.
>> Any hints?
>>
>> var scraper = require('scraper'),
>>     util = require('util');
>>
>> var checkDomain = process.argv[2].replace("https://", "").replace("http://", ""),
>>     links = [process.argv[2]],
>>     links_grabbed = [];
>>
>> var link_check = links.pop();
>> links_grabbed.push(link_check);
>> scraper(link_check, parseData);
>>
>> function parseData(err, jQuery, url)
>> {
>>     var ramUsage = bytesToSize(process.memoryUsage().rss);
>>     process.stdout.write("\rLinks checked: " + Object.keys(links_grabbed).length
>>         + "/" + links.length + " [" + ramUsage + "] ");
>>
>>     if (err) {
>>         console.log("%s [%s], source - %s", err.uri, err.http_status,
>>             links_grabbed[err.uri].src);
>>     }
>>     else {
>>         jQuery('a').each(function() {
>>             var link = jQuery(this).attr("href").trim();
>>
>>             if (link.indexOf("/") == 0)
>>                 link = "http://" + checkDomain + link;
>>
>>             if (links.indexOf(link) == -1 && links_grabbed.indexOf(link) == -1
>>                 && ["#", ""].indexOf(link) == -1
>>                 && (link.indexOf("http://" + checkDomain) == 0
>>                     || link.indexOf("https://" + checkDomain) == 0))
>>                 links.push(link);
>>         });
>>     }
>>
>>     if (links.length > 0) {
>>         var link_check = links.pop();
>>         links_grabbed.push(link_check);
>>         scraper(link_check, parseData);
>>     }
>>     else {
>>         util.log("Scraping is done. Bye bye =)");
>>         process.exit(0);
>>     }
>> }
