Casper/Phantom won't scale for concurrency as well as a node.js implementation does (although since we're limiting to 10 parallel sessions here that's not really an issue).
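Capping at a fixed number of parallel sessions in plain node only takes a few lines. This is a minimal sketch, not code from the thread; every name in it (fetchPage, enqueue, pump, MAX_PARALLEL) is hypothetical, with fetchPage standing in for whatever actually does the scraping:

```javascript
// Minimal sketch of capping crawl concurrency at 10 parallel sessions.
// fetchPage is a hypothetical stand-in for the real scraping call.
var MAX_PARALLEL = 10;
var queue = [];   // URLs waiting to be fetched
var running = 0;  // fetches currently in flight

function fetchPage(url, cb) {
  // placeholder: completes asynchronously on the next tick
  process.nextTick(function () { cb(null, url); });
}

function enqueue(url) {
  queue.push(url);
  pump();
}

function pump() {
  // start queued work until we hit the cap; each completion frees a slot
  while (running < MAX_PARALLEL && queue.length > 0) {
    running++;
    fetchPage(queue.shift(), function (err, url) {
      running--;
      pump();
    });
  }
}
```

Because each completed fetch calls pump() again, the pool refills itself until the queue drains, and no more than MAX_PARALLEL requests are ever in flight at once.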
I wrote about my hybrid approach here: http://baudehlo.wordpress.com/2012/06/05/web-scraping-with-node-js/

On Wed, Jul 25, 2012 at 3:26 PM, ec.developer <[email protected]> wrote:

> Thanks, will try phantomjs.
> But I'd still like to get an answer to where the memory leak is. I'd
> appreciate it if somebody could point me in the right direction.
> Thanks again.
>
> On Wednesday, July 25, 2012 7:14:55 PM UTC+3, Davis Ford wrote:
>>
>> I know this is a node.js list, but I've been doing a lot of scraping
>> myself lately -- filling out forms, etc., and grabbing the results -- and
>> I've had fantastic success with http://phantomjs.org and
>> http://casperjs.org (Casper is built on top of Phantom). They might be
>> worth trying if you aren't in love with your current approach.
>>
>> On Wed, Jul 25, 2012 at 11:28 AM, Matt <[email protected]> wrote:
>>
>>> Ah, no I didn't. Sorry I couldn't help more.
>>>
>>> On Wed, Jul 25, 2012 at 10:48 AM, ec.developer <[email protected]> wrote:
>>>
>>>> Have you checked the latest code? - http://pastebin.com/GyefDREM
>>>> I'm using cheerio instead of jsdom.
>>>>
>>>> On Wednesday, July 25, 2012 5:45:52 PM UTC+3, Matt Sergeant wrote:
>>>>>
>>>>> Can you try switching to cheerio instead of jsdom? I found that jsdom
>>>>> consumed way too much RAM and was slow.
>>>>>
>>>>> On Wed, Jul 25, 2012 at 8:57 AM, ec.developer <[email protected]> wrote:
>>>>>
>>>>>> After a longer test, I got these results:
>>>>>> Mem: 795 MB
>>>>>> Requests running: 11
>>>>>> Grabbed: 98050
>>>>>> Links queue: 1160553
>>>>>>
>>>>>> Grabbed links and the links queue are still stored in MongoDB.
>>>>>>
>>>>>> On Monday, July 2, 2012 4:08:13 PM UTC+3, ec.developer wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>> I've created a small app which searches for Not Found [404]
>>>>>>> errors on a specified website.
>>>>>>> I use the node-scraper module
>>>>>>> (https://github.com/mape/node-scraper/), which uses node's native
>>>>>>> request module and jsdom for parsing the HTML.
>>>>>>> My app recursively searches for links on each webpage, and then
>>>>>>> calls the scraping code for each link it finds. The problem is that
>>>>>>> after scanning 100 pages (and collecting over 200 links still to be
>>>>>>> scanned), RSS memory usage is over 200 MB and keeps growing on each
>>>>>>> iteration, so after scanning 300-400 pages I get a memory allocation
>>>>>>> error.
>>>>>>> The code is provided below.
>>>>>>> Any hints?
>>>>>>>
>>>>>>> var scraper = require('scraper'),
>>>>>>>     util = require('util');
>>>>>>>
>>>>>>> var checkDomain = process.argv[2].replace("https://", "").replace("http://", ""),
>>>>>>>     links = [process.argv[2]],
>>>>>>>     links_grabbed = [];
>>>>>>>
>>>>>>> var link_check = links.pop();
>>>>>>> links_grabbed.push(link_check);
>>>>>>> scraper(link_check, parseData);
>>>>>>>
>>>>>>> function parseData(err, jQuery, url)
>>>>>>> {
>>>>>>>   var ramUsage = bytesToSize(process.memoryUsage().rss);
>>>>>>>   process.stdout.write("\rLinks checked: " +
>>>>>>>     Object.keys(links_grabbed).length + "/" + links.length +
>>>>>>>     " [" + ramUsage + "] ");
>>>>>>>
>>>>>>>   if( err ) {
>>>>>>>     console.log("%s [%s], source - %s", err.uri, err.http_status,
>>>>>>>       links_grabbed[err.uri].src);
>>>>>>>   }
>>>>>>>   else {
>>>>>>>     jQuery('a').each(function() {
>>>>>>>       var link = jQuery(this).attr("href").trim();
>>>>>>>
>>>>>>>       if( link.indexOf("/")==0 )
>>>>>>>         link = "http://" + checkDomain + link;
>>>>>>>
>>>>>>>       if( links.indexOf(link)==-1 && links_grabbed.indexOf(link)==-1
>>>>>>>           && ["#", ""].indexOf(link)==-1
>>>>>>>           && (link.indexOf("http://" + checkDomain)==0
>>>>>>>               || link.indexOf("https://" + checkDomain)==0) )
>>>>>>>         links.push(link);
>>>>>>>     });
>>>>>>>   }
>>>>>>>
>>>>>>>   if( links.length>0 ) {
>>>>>>>     var link_check = links.pop();
>>>>>>>     links_grabbed.push(link_check);
>>>>>>>     scraper(link_check, parseData);
>>>>>>>   }
>>>>>>>   else {
>>>>>>>     util.log("Scraping is done. Bye bye =)");
>>>>>>>     process.exit(0);
>>>>>>>   }
>>>>>>> }
>>>>>>
>>>>>> --
>>>>>> Job Board: http://jobs.nodejs.org/
>>>>>> Posting guidelines:
>>>>>> https://github.com/joyent/node/wiki/Mailing-List-Posting-Guidelines
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "nodejs" group.
>>>>>> To post to this group, send email to [email protected]
>>>>>> To unsubscribe from this group, send email to
>>>>>> [email protected]
>>>>>> For more options, visit this group at
>>>>>> http://groups.google.com/group/nodejs?hl=en
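For anyone hitting the same growth pattern: one likely contributor in the code above is that `links` and `links_grabbed` grow without bound and every discovered link triggers two O(n) `indexOf` scans. A plain object used as a seen-set makes the dedup check O(1) and stores each URL only once. A rough sketch, not a drop-in patch for the posted code; the names here are hypothetical:

```javascript
// Sketch: O(1) frontier de-duplication using a plain object as a seen-set,
// replacing the two arrays scanned with indexOf on every discovered link.
var seen = {};   // every URL ever queued or visited
var queue = [];  // URLs still waiting to be scraped

// Returns true if the link was new and got queued, false otherwise.
function addLink(link) {
  if (link === "" || link === "#") return false;  // skip empty anchors
  if (seen.hasOwnProperty(link)) return false;    // already known: O(1) lookup
  seen[link] = true;
  queue.push(link);
  return true;
}
```

This doesn't shrink the working set (a crawl of a million links still holds a million keys), but it cuts the per-link cost from O(n) to O(1) and halves the URL storage, which is usually the difference between a crawl that slows to a halt and one that merely uses a lot of memory.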
