Ahhh, brilliant! Thank you. window.close() reduced the memory usage significantly, but it still leaks. Before closing the window I was able to check ~1000 pages; now I can check over 10000 pages, but after a while I still hit the memory allocation error.
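
One way to narrow down where the remaining growth comes from (a rough sketch, assuming the crawler can be restarted with an extra flag): run node with --expose-gc and force a garbage collection between pages. If RSS keeps climbing even right after a forced collection, the remaining leak is probably held on the native side (jsdom/contextify) rather than in plain JS references kept by the crawler.

// Sketch only, untested. Run with: node --expose-gc app.js <url>
// and call logMemory() between pages.
function logMemory() {
  if (global.gc) global.gc();   // global.gc is only available with --expose-gc
  var rssMb = Math.round(process.memoryUsage().rss / 1048576);
  console.log('RSS after forced GC: ' + rssMb + ' MB');
}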
On Monday, July 2, 2012 4:32:14 PM UTC+3, tim sebastian wrote:
>
> https://github.com/tmpvar/jsdom#how-it-works
>
> jsdom.env(html, function(errors, window) {
>   // free memory associated with the window
>   window.close();
> });
>
> On Mon, Jul 2, 2012 at 3:30 PM, tim sebastian <[email protected]> wrote:
>
>> node-scraper doesn't seem to be closing the jsdom window it creates, and
>> honestly I don't see a way to do so except by patching the node-scraper
>> module yourself to fix this issue.
>>
>> I'm not even sure that is the problem, but I had a similar issue working
>> with plain jsdom, and not closing the "window" that holds the whole
>> DOM tree was the reason.
>>
>> On Mon, Jul 2, 2012 at 3:08 PM, ec.developer <[email protected]> wrote:
>>
>>> Hi all,
>>> I've created a small app which searches for Not Found [404] errors on a
>>> specified website. I use the node-scraper module
>>> (https://github.com/mape/node-scraper/), which uses the request module
>>> and jsdom for parsing the HTML.
>>> My app recursively searches for links on each page and then calls the
>>> scraping code for each link it finds. The problem is that after scanning
>>> 100 pages (and collecting over 200 links still to be scanned) the RSS
>>> memory usage is over 200 MB, and it keeps growing on every iteration, so
>>> after scanning 300-400 pages I get a memory allocation error.
>>> The code is provided below.
>>> Any hints?
>>>
>>> var scraper = require('scraper'),
>>>     util = require('util');
>>>
>>> var checkDomain = process.argv[2].replace("https://", "").replace("http://", ""),
>>>     links = [process.argv[2]],
>>>     links_grabbed = [];
>>>
>>> var link_check = links.pop();
>>> links_grabbed.push(link_check);
>>> scraper(link_check, parseData);
>>>
>>> function parseData(err, jQuery, url)
>>> {
>>>   // report progress and current RSS usage
>>>   // (bytesToSize() is a small formatting helper, not shown here)
>>>   var ramUsage = bytesToSize(process.memoryUsage().rss);
>>>   process.stdout.write("\rLinks checked: " + links_grabbed.length + "/" +
>>>     links.length + " [" + ramUsage + "] ");
>>>
>>>   if (err) {
>>>     console.log("%s [%s]", err.uri, err.http_status);
>>>   }
>>>   else {
>>>     jQuery('a').each(function() {
>>>       var link = (jQuery(this).attr("href") || "").trim();
>>>
>>>       // resolve root-relative links against the domain being checked
>>>       if (link.indexOf("/") == 0)
>>>         link = "http://" + checkDomain + link;
>>>
>>>       // queue unseen, non-empty links that stay on the same domain
>>>       if (links.indexOf(link) == -1 && links_grabbed.indexOf(link) == -1 &&
>>>           ["#", ""].indexOf(link) == -1 &&
>>>           (link.indexOf("http://" + checkDomain) == 0 ||
>>>            link.indexOf("https://" + checkDomain) == 0))
>>>         links.push(link);
>>>     });
>>>   }
>>>
>>>   if (links.length > 0) {
>>>     var link_check = links.pop();
>>>     links_grabbed.push(link_check);
>>>     scraper(link_check, parseData);
>>>   }
>>>   else {
>>>     util.log("Scraping is done. Bye bye =)");
>>>     process.exit(0);
>>>   }
>>> }
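
For reference, a minimal sketch (untested; not the node-scraper API, and it assumes the request and jsdom modules are installed) of how the same loop could be written against jsdom directly, so the window can be closed explicitly after every page. It leaves out the relative-link resolution and same-domain filter from the original code, and it uses a plain object as a visited set instead of indexOf() on a growing array:

var request = require('request'),
    jsdom = require('jsdom');

var visited = {};                 // url -> true; O(1) lookup instead of indexOf()
var queue = [process.argv[2]];

function crawl() {
  if (queue.length === 0) {
    console.log('Scraping is done.');
    return;
  }
  var url = queue.pop();
  visited[url] = true;

  request(url, function (err, res, body) {
    if (err || res.statusCode !== 200) {
      console.log('%s [%s]', url, err ? err.message : res.statusCode);
      return crawl();
    }
    jsdom.env(body, function (errors, window) {
      // collect hrefs from the page and queue the ones we haven't seen
      var anchors = window.document.getElementsByTagName('a');
      for (var i = 0; i < anchors.length; i++) {
        var link = anchors[i].getAttribute('href');
        if (link && !visited[link] && queue.indexOf(link) === -1)
          queue.push(link);
      }
      window.close();             // free the DOM tree before the next page
      crawl();
    });
  });
}

crawl();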
