+1 for Cheerio.
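A minimal sketch of the request + cheerio combination being recommended here, assuming "npm install request cheerio"; the URL is a placeholder, not from the thread:

    // Fetch a page with request and parse it with cheerio instead of
    // jsdom; cheerio exposes a jQuery-like API over the raw HTML string.
    var request = require('request');
    var cheerio = require('cheerio');

    request('http://example.com/', function (err, res, body) {
      if (err) return console.error(err);

      var $ = cheerio.load(body);         // parse the HTML once
      $('a').each(function () {
        var href = $(this).attr('href');  // plain string attribute
        if (href) console.log(href);
      });
    });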
On Tue, Jul 3, 2012 at 9:42 AM, rhasson <[email protected]> wrote:
> Have you looked at Cheerio (https://github.com/MatthewMueller/cheerio)?
> I've been using it over JSDom; it's faster and more lightweight. If
> you're doing heavy scraping I would recommend checking it out.
>
> On Monday, July 2, 2012 10:30:25 AM UTC-4, ec.developer wrote:
>>
>> Charles, thanks for your suggestion. About the global links_grabbed:
>> I'm sure there could be a better solution, but in my case it isn't so
>> significant. Just for testing, I tried storing 200 thousand long links
>> in an array and then printed the memory used; the amount was very
>> small, so I didn't focus on this. :)
>>
>> On Monday, July 2, 2012 4:44:24 PM UTC+3, Charles Care wrote:
>>>
>>> Hi,
>>>
>>> I had a play with your code and found a couple of things. It's
>>> probably worth trying to avoid the global variable links_grabbed, as
>>> it just gets larger and larger as you crawl. I know you need it to
>>> avoid parsing the same site twice, but perhaps you could find a more
>>> lightweight data structure? I'd probably be tempted to keep this
>>> state in a Redis set (or something similar).
>>>
>>> Also, I'm not an expert on scraper, but I *seemed* to get a
>>> performance improvement when I modified the code that pushed new
>>> urls onto your links stack.
>>>
>>> I added a String() conversion, e.g.
>>>
>>> ...
>>> links.push(String(link));
>>> ...
>>>
>>> which meant I wasn't keeping the jQuery link around in a stack. Hope
>>> that helps,
>>>
>>> Charles
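A minimal sketch of the Redis-set idea Charles mentions above, assuming the node_redis client ("npm install redis"); the key name "links_grabbed" and both helper names are illustrative, not from the thread:

    // Keep the visited-URL state in a Redis set instead of an
    // ever-growing in-memory array.
    var redis = require('redis');
    var client = redis.createClient();

    // Mark a URL as grabbed; SADD is a no-op if it is already a member.
    function markGrabbed(url, cb) {
      client.sadd('links_grabbed', String(url), cb);
    }

    // Check membership before queueing a URL for scraping.
    function alreadyGrabbed(url, cb) {
      client.sismember('links_grabbed', String(url), function (err, reply) {
        cb(err, reply === 1);
      });
    }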
>>> On 2 July 2012 14:08, ec.developer <[email protected]> wrote:
>>> > Hi all,
>>> > I've created a small app which searches for Not Found [404] errors
>>> > on a specified website. I use the node-scraper module
>>> > (https://github.com/mape/node-scraper/), which uses Node's native
>>> > request module and jsdom for parsing the HTML.
>>> > My app recursively searches for links on each webpage and then
>>> > calls the scraping routine for each found link. The problem is that
>>> > after scanning 100 pages (and collecting over 200 links still to be
>>> > scanned) the RSS memory usage is over 200MB, and it keeps growing
>>> > on each iteration. So after scanning 300-400 pages I get a memory
>>> > allocation error.
>>> > The code is provided below.
>>> > Any hints?
>>> >
>>> > var scraper = require('scraper'),
>>> >     util = require('util');
>>> >
>>> > var checkDomain = process.argv[2].replace("https://", "").replace("http://", ""),
>>> >     links = [process.argv[2]],
>>> >     links_grabbed = [];
>>> >
>>> > var link_check = links.pop();
>>> > links_grabbed.push(link_check);
>>> > scraper(link_check, parseData);
>>> >
>>> > function parseData(err, jQuery, url)
>>> > {
>>> >   var ramUsage = bytesToSize(process.memoryUsage().rss);
>>> >   process.stdout.write("\rLinks checked: " +
>>> >     Object.keys(links_grabbed).length + "/" + links.length +
>>> >     " [" + ramUsage + "] ");
>>> >
>>> >   if (err) {
>>> >     console.log("%s [%s], source - %s", err.uri, err.http_status,
>>> >       links_grabbed[err.uri].src);
>>> >   }
>>> >   else {
>>> >     jQuery('a').each(function() {
>>> >       var link = jQuery(this).attr("href").trim();
>>> >
>>> >       if (link.indexOf("/") == 0)
>>> >         link = "http://" + checkDomain + link;
>>> >
>>> >       if (links.indexOf(link) == -1 && links_grabbed.indexOf(link) == -1 &&
>>> >           ["#", ""].indexOf(link) == -1 &&
>>> >           (link.indexOf("http://" + checkDomain) == 0 ||
>>> >            link.indexOf("https://" + checkDomain) == 0))
>>> >         links.push(link);
>>> >     });
>>> >   }
>>> >
>>> >   if (links.length > 0) {
>>> >     var link_check = links.pop();
>>> >     links_grabbed.push(link_check);
>>> >     scraper(link_check, parseData);
>>> >   }
>>> >   else {
>>> >     util.log("Scraping is done. Bye bye =)");
>>> >     process.exit(0);
>>> >   }
>>> > }
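One note on the code above: it calls a bytesToSize helper that is never defined anywhere in the thread. A hypothetical stand-in, assuming it simply formats a byte count for display:

    // Hypothetical bytesToSize: formats a raw byte count (e.g. the RSS
    // figure from process.memoryUsage()) as a human-readable string.
    function bytesToSize(bytes) {
      if (bytes === 0) return '0 B';
      var units = ['B', 'KB', 'MB', 'GB', 'TB'];
      var i = Math.floor(Math.log(bytes) / Math.log(1024));
      return (bytes / Math.pow(1024, i)).toFixed(1) + ' ' + units[i];
    }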
