Thanks for Cheerio =)) I've replaced jsdom with Cheerio. Now, after 6000 pages checked, only ~200MB of RSS memory is used. It still keeps growing, but not as fast as it did before.
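For anyone hitting the same problem, the swap looks roughly like this (a minimal sketch, not the actual crawler code; I'm assuming request is used for fetching and example.com is just a placeholder):

var request = require('request'),
    cheerio = require('cheerio');

// example.com is only a placeholder URL for illustration
request('http://example.com/', function (err, res, body) {
  if (err) return console.error(err);

  // cheerio.load() gives a jQuery-like API over the HTML string
  // without building a full DOM the way jsdom does
  var $ = cheerio.load(body);

  $('a').each(function () {
    var href = $(this).attr('href');
    if (href) console.log(href.trim());
  });
});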
On Tuesday, July 3, 2012 7:13:54 AM UTC+3, node-code wrote:
> +1 for Cheerio.
>
> On Tue, Jul 3, 2012 at 9:42 AM, rhasson <[email protected]> wrote:
>> Have you looked at Cheerio (https://github.com/MatthewMueller/cheerio)?
>> I've been using it over JSDom and it's faster and lightweight. If you're
>> doing heavy scraping I would recommend checking it out.
>>
>> On Monday, July 2, 2012 10:30:25 AM UTC-4, ec.developer wrote:
>>> Charles, thanks for your suggestion. About the global links_grabbed - I'm
>>> sure there could be a better solution, but in my case it is not so
>>> significant. Just for testing, I tried storing 200 thousand large links
>>> in an array and then printed the memory used, and the amount was very
>>> small, so I didn't focus on this :)
>>>
>>> On Monday, July 2, 2012 4:44:24 PM UTC+3, Charles Care wrote:
>>>> Hi,
>>>>
>>>> I had a play with your code and found a couple of things. It's
>>>> probably worth trying to avoid the global variable links_grabbed, as
>>>> it's just getting larger and larger as you crawl. I know you need it
>>>> to avoid parsing the same site twice, but perhaps you could find a
>>>> more lightweight data structure? I'd probably be tempted to keep this
>>>> state in a Redis set (or something similar).
>>>>
>>>> Also, I'm not an expert on scraper, but I *seemed* to get a
>>>> performance improvement when I modified the code that pushed new urls
>>>> onto your links stack.
>>>>
>>>> I added a String() conversion, e.g.
>>>>
>>>> ...
>>>> links.push(String(link));
>>>> ...
>>>>
>>>> which meant I wasn't keeping the jQuery link around in a stack. Hope
>>>> that helps,
>>>>
>>>> Charles
>>>>
>>>> On 2 July 2012 14:08, ec.developer <[email protected]> wrote:
>>>> > Hi all,
>>>> > I've created a small app which searches for Not Found [404] errors
>>>> > on a specified website. I use the node-scraper module
>>>> > (https://github.com/mape/node-scraper/), which uses node's native
>>>> > request module and jsdom for parsing the HTML.
>>>> > My app recursively searches for links on each webpage and then calls
>>>> > the scraping code for each found link. The problem is that after
>>>> > scanning 100 pages (and collecting over 200 links to be scanned),
>>>> > the RSS memory usage is >200MB (and it still increases on each
>>>> > iteration), so after scanning 300-400 pages I get a memory
>>>> > allocation error.
>>>> > The code is provided below.
>>>> > Any hints?
>>>> >
>>>> > var scraper = require('scraper'),
>>>> >     util = require('util');
>>>> >
>>>> > var checkDomain = process.argv[2].replace("https://", "").replace("http://", ""),
>>>> >     links = [process.argv[2]],
>>>> >     links_grabbed = [];
>>>> >
>>>> > var link_check = links.pop();
>>>> > links_grabbed.push(link_check);
>>>> > scraper(link_check, parseData);
>>>> >
>>>> > function parseData(err, jQuery, url)
>>>> > {
>>>> >   var ramUsage = bytesToSize(process.memoryUsage().rss);
>>>> >   process.stdout.write("\rLinks checked: " + (Object.keys(links_grabbed).length) + "/" + links.length + " [" + ramUsage + "] ");
>>>> >
>>>> >   if( err ) {
>>>> >     console.log("%s [%s], source - %s", err.uri, err.http_status, links_grabbed[err.uri].src);
>>>> >   }
>>>> >   else {
>>>> >     jQuery('a').each(function() {
>>>> >       var link = jQuery(this).attr("href").trim();
>>>> >
>>>> >       if( link.indexOf("/")==0 )
>>>> >         link = "http://" + checkDomain + link;
>>>> >
>>>> >       if( links.indexOf(link)==-1 && links_grabbed.indexOf(link)==-1 && ["#", ""].indexOf(link)==-1 && (link.indexOf("http://" + checkDomain)==0 || link.indexOf("https://" + checkDomain)==0) )
>>>> >         links.push(link);
>>>> >     });
>>>> >   }
>>>> >
>>>> >   if( links.length>0 ) {
>>>> >     var link_check = links.pop();
>>>> >     links_grabbed.push(link_check);
>>>> >     scraper(link_check, parseData);
>>>> >   }
>>>> >   else {
>>>> >     util.log("Scraping is done. Bye bye =)");
>>>> >     process.exit(0);
>>>> >   }
>>>> > }
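On Charles's Redis suggestion quoted above: a minimal sketch of keeping the visited-link state in a Redis set, assuming the classic callback-style node_redis client (the links_grabbed key name and the visitIfNew helper are made up for illustration). SADD only adds a member if it isn't already in the set and replies with 1 when it was actually added, so the "have we seen this link?" check and the insert happen in one round trip:

var redis = require('redis'),
    client = redis.createClient();

function visitIfNew(link, visit) {
  // SADD replies 1 only the first time this member is added to the set
  client.sadd('links_grabbed', String(link), function (err, added) {
    if (err) return console.error(err);
    if (added === 1) visit(link);   // first time this URL is seen
  });
}

// e.g. visitIfNew(url, function (url) { scraper(url, parseData); });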
