Thank you Mo. I sincerely appreciate your guidance and contribution.

I will work on getting your nutch-selenium-grid plugin contributed
and working with Nutch 1.x.

Cheers,
Chris


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






-----Original Message-----
From: Mo Omer <[email protected]>
Date: Friday, February 13, 2015 at 11:10 AM
To: Chris Mattmann <[email protected]>
Cc: "[email protected]" <[email protected]>
Subject: Re: Vagrant Crushed When using Nutch-Selenium

>Hey all,
>
>When I had run nutch-selenium, it was in a config such that zombies were
>created from closing Firefox windows and they couldn't be reaped (again,
>due to the docker configuration I had).
>
>In a normal setup, it should not be an issue - if you're running 20
>threads in Nutch, that's potentially 20 open Firefox windows, which isn't
>good for 512 MB of memory.
>
>Selenium grid is much more efficient, in that browsers are opened, but
>tabs are used to fetch sites - and only those are closed.
>
>Additionally, ensure you're using Nutch 2.2.1.
>
>Feel free to fork, patch, tinker, and PR as needed.
>
>Chris, if you want to be added to contribs on the GitHub project, that's
>cool with me! Wish I could dedicate more time to this, but I don't
>foresee using Nutch again in the near future, and am now working on
>projects that require lots of reading and possibly patches to Caffe and
>OpenCL R-CNN projects.
>
>Tl;dr: 
>- no, this shouldn't be typical unless you're creating zombies like crazy
>and they're not being reaped (too many open file descriptors), running
>out of memory, or similar resource constraint.
>- selenium grid is TONs more efficient, but a bit more difficult to set
>up. I used it to crawl 100ks of sites.
>- unfortunately I can't commit more time to this, but if I can assist in
>any admin way, let me know.
>
>Thank you,
>
>Mo
>
>This message was drafted on a tiny touch screen; please forgive brevity &
>tpyos
>
>> On Feb 13, 2015, at 12:41 PM, "Mattmann, Chris A (3980)"
>><[email protected]> wrote:
>> 
>> Oh yes, please up your memory to at least 2 GB.
>> 
>> -----Original Message-----
>> From: Shuo Li <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Friday, February 13, 2015 at 10:38 AM
>> To: "[email protected]" <[email protected]>
>> Cc: Mo Omer <[email protected]>
>> Subject: Re: Vagrant Crushed When using Nutch-Selenium
>> 
>>> Hey Mo and Prof Mattmann,
>>> 
>>> 
>>> I will try to crawl the 3 websites in the homework tonight (NASA AMD,
>>> NSF ACADIS and NSIDC Arctic Data Explorer). I will let you know what's
>>> going on.
>>> 
>>> 
>>> Is memory an issue? My vagrant only has 512MB of memory.
>>> 
>>> 
>>> Regards,
>>> Shuo Li
>>> 
>>> 
>>> On Fri, Feb 13, 2015 at 10:25 AM, Mattmann, Chris A (3980)
>>> <[email protected]> wrote:
>>> 
>>> Hi Shuo,
>>> 
>>> Thanks for your email. I wonder if using selenium grid would
>>> help?
>>> 
>>> Please see this plugin:
>>> 
>>> https://github.com/momer/nutch-selenium-grid-plugin
>>> 
>>> 
>>> I’m CC’ing Mo, the author of the plugin, to see if he experienced
>>> this while running the original selenium plugin. Mo, did using
>>> selenium grid help with the issue that Shuo is experiencing below?
>>> 
>>> Mo: are you cool with porting the grid plugin to trunk, or with Lewis
>>> or me doing it (with full credit to you, of course)?
>>> 
>>> Cheers,
>>> Chris
>>> 
>>> 
>>> -----Original Message-----
>>> From: Shuo Li <[email protected]>
>>> Reply-To: "[email protected]" <[email protected]>
>>> Date: Friday, February 13, 2015 at 10:12 AM
>>> To: "[email protected]" <[email protected]>
>>> Subject: Vagrant Crushed When using Nutch-Selenium
>>> 
>>>> Hey guys,
>>>> 
>>>> 
>>>> I'm trying to use Nutch-Selenium to crawl nutch.apache.org
>>>> <http://nutch.apache.org>. However, my vagrant seems to have
>>>> crashed after a few minutes. I forced it to shut down, and it turns
>>>> out it only crawled 59 websites. My Nutch version is 1.10 and my OS is
>>>> Ubuntu Trusty, 14.04.
>>>> 
>>>> 
>>>> Is there anything I can provide to you guys? Does anybody else have
>>>> the same issue? Or is 59 websites the complete crawl?
>>>> 
>>>> 
>>>> Any suggestion would be appreciated.
>>>> 
>>>> 
>>>> Regards,
>>>> Shuo Li
>> 
