+1 for Cheerio.
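A minimal sketch of the request + cheerio combination being recommended here, assuming "npm install request cheerio"; the URL is a placeholder, not from the thread:

    // Fetch a page with request and parse it with cheerio instead of
    // jsdom; cheerio exposes a jQuery-like API over the raw HTML string.
    var request = require('request');
    var cheerio = require('cheerio');

    request('http://example.com/', function (err, res, body) {
      if (err) return console.error(err);

      var $ = cheerio.load(body);         // parse the HTML once
      $('a').each(function () {
        var href = $(this).attr('href');  // plain string attribute
        if (href) console.log(href);
      });
    });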
On Tue, Jul 3, 2012 at 9:42 AM, rhasson <[email protected]> wrote:
> Have you looked at Cheerio (https://github.com/MatthewMueller/cheerio)?
> I've been using it over JSDom; it's faster and more lightweight. If
> you're doing heavy scraping I would recommend checking it out.
>
> On Monday, July 2, 2012 10:30:25 AM UTC-4, ec.developer wrote:
>>
>> Charles, thanks for your suggestion. About the global links_grabbed:
>> I'm sure there could be a better solution, but in my case it isn't so
>> significant. Just for testing, I tried storing 200 thousand long links
>> in an array and then printed the memory used; the amount was very
>> small, so I didn't focus on this. :)
>>
>> On Monday, July 2, 2012 4:44:24 PM UTC+3, Charles Care wrote:
>>>
>>> Hi,
>>>
>>> I had a play with your code and found a couple of things. It's
>>> probably worth trying to avoid the global variable links_grabbed, as
>>> it just gets larger and larger as you crawl. I know you need it to
>>> avoid parsing the same site twice, but perhaps you could find a more
>>> lightweight data structure? I'd probably be tempted to keep this
>>> state in a Redis set (or something similar).
>>>
>>> Also, I'm not an expert on scraper, but I *seemed* to get a
>>> performance improvement when I modified the code that pushed new
>>> urls onto your links stack.
>>>
>>> I added a String() conversion, e.g.
>>>
>>> ...
>>> links.push(String(link));
>>> ...
>>>
>>> which meant I wasn't keeping the jQuery link around in a stack. Hope
>>> that helps,
>>>
>>> Charles
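A minimal sketch of the Redis-set idea Charles mentions above, assuming the node_redis client ("npm install redis"); the key name "links_grabbed" and both helper names are illustrative, not from the thread:

    // Keep the visited-URL state in a Redis set instead of an
    // ever-growing in-memory array.
    var redis = require('redis');
    var client = redis.createClient();

    // Mark a URL as grabbed; SADD is a no-op if it is already a member.
    function markGrabbed(url, cb) {
      client.sadd('links_grabbed', String(url), cb);
    }

    // Check membership before queueing a URL for scraping.
    function alreadyGrabbed(url, cb) {
      client.sismember('links_grabbed', String(url), function (err, reply) {
        cb(err, reply === 1);
      });
    }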
>>> On 2 July 2012 14:08, ec.developer <[email protected]> wrote:
>>> > Hi all,
>>> > I've created a small app which searches for Not Found [404] errors
>>> > on a specified website. I use the node-scraper module
>>> > (https://github.com/mape/node-scraper/), which uses Node's native
>>> > request module and jsdom for parsing the HTML.
>>> > My app recursively searches for links on each webpage and then
>>> > calls the scraping routine for each found link. The problem is that
>>> > after scanning 100 pages (and collecting over 200 links still to be
>>> > scanned) the RSS memory usage is over 200MB, and it keeps growing
>>> > on each iteration. So after scanning 300-400 pages I get a memory
>>> > allocation error.
>>> > The code is provided below.
>>> > Any hints?
>>> >
>>> > var scraper = require('scraper'),
>>> >     util = require('util');
>>> >
>>> > var checkDomain = process.argv[2].replace("https://", "").replace("http://", ""),
>>> >     links = [process.argv[2]],
>>> >     links_grabbed = [];
>>> >
>>> > var link_check = links.pop();
>>> > links_grabbed.push(link_check);
>>> > scraper(link_check, parseData);
>>> >
>>> > function parseData(err, jQuery, url)
>>> > {
>>> >   var ramUsage = bytesToSize(process.memoryUsage().rss);
>>> >   process.stdout.write("\rLinks checked: " +
>>> >     Object.keys(links_grabbed).length + "/" + links.length +
>>> >     " [" + ramUsage + "] ");
>>> >
>>> >   if (err) {
>>> >     console.log("%s [%s], source - %s", err.uri, err.http_status,
>>> >       links_grabbed[err.uri].src);
>>> >   }
>>> >   else {
>>> >     jQuery('a').each(function() {
>>> >       var link = jQuery(this).attr("href").trim();
>>> >
>>> >       if (link.indexOf("/") == 0)
>>> >         link = "http://" + checkDomain + link;
>>> >
>>> >       if (links.indexOf(link) == -1 && links_grabbed.indexOf(link) == -1 &&
>>> >           ["#", ""].indexOf(link) == -1 &&
>>> >           (link.indexOf("http://" + checkDomain) == 0 ||
>>> >            link.indexOf("https://" + checkDomain) == 0))
>>> >         links.push(link);
>>> >     });
>>> >   }
>>> >
>>> >   if (links.length > 0) {
>>> >     var link_check = links.pop();
>>> >     links_grabbed.push(link_check);
>>> >     scraper(link_check, parseData);
>>> >   }
>>> >   else {
>>> >     util.log("Scraping is done. Bye bye =)");
>>> >     process.exit(0);
>>> >   }
>>> > }
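One note on the code above: it calls a bytesToSize helper that is never defined anywhere in the thread. A hypothetical stand-in, assuming it simply formats a byte count for display:

    // Hypothetical bytesToSize: formats a raw byte count (e.g. the RSS
    // figure from process.memoryUsage()) as a human-readable string.
    function bytesToSize(bytes) {
      if (bytes === 0) return '0 B';
      var units = ['B', 'KB', 'MB', 'GB', 'TB'];
      var i = Math.floor(Math.log(bytes) / Math.log(1024));
      return (bytes / Math.pow(1024, i)).toFixed(1) + ' ' + units[i];
    }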
