Thank you, Mo. I sincerely appreciate your guidance and contribution. I will work on getting your Nutch Selenium Grid plugin contributed and working with Nutch 1.x.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Mo Omer <[email protected]>
Date: Friday, February 13, 2015 at 11:10 AM
To: Chris Mattmann <[email protected]>
Cc: "[email protected]" <[email protected]>
Subject: Re: Vagrant Crushed When using Nutch-Selenium

>Hey all,
>
>When I had run nutch-selenium, it was in a config such that zombies were
>created from closing Firefox windows and they couldn't be reaped (again,
>due to the docker configuration I had).
>
>In a normal setup, it should not be an issue - if you're running 20
>threads in nutch that's potentially 20 open FF windows which isn't good
>for 512mb.
>
>Selenium grid is much more efficient, in that browsers are opened, but
>tabs are used to fetch sites - and only those are closed.
>
>Additionally, ensure you're using Nutch 2.2.1.
>
>Feel free to fork patch and tinker and PR as needed.
>
>Chris, if you want to be added to contribs on the GitHub project, that's
>cool with me! Wish I could dedicate more time to this, but I don't
>foresee using Nutch again in the near future, and am now working on
>projects that require lots of reading and possibly patches to Caffe and
>opencl r-CNN projects.
>
>Tl;dr:
>- no, this shouldn't be typical unless you're creating zombies like crazy
>and they're not being reaped (too many open file descriptors), running
>out of memory, or similar resource constraint.
>- selenium grid is TONs more efficient, but a bit more difficult to set
>up. I used it to crawl 100ks of sites.
>- unfortunately I can't commit more time to this, but if I can assist in
>any admin way, let me know.
>
>Thank you,
>
>Mo
>
>This message was drafted on a tiny touch screen; please forgive brevity &
>tpyos
>
>> On Feb 13, 2015, at 12:41 PM, "Mattmann, Chris A (3980)"
>><[email protected]> wrote:
>>
>> Oh yes, please up your memory to like at least 2Gb..
>>
>> -----Original Message-----
>> From: Shuo Li <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Friday, February 13, 2015 at 10:38 AM
>> To: "[email protected]" <[email protected]>
>> Cc: Mo Omer <[email protected]>
>> Subject: Re: Vagrant Crushed When using Nutch-Selenium
>>
>>> Hey Mo and Prof Mattmann,
>>>
>>> I will try to crawl the 3 websites in the homework tonight (NASA AMD, NSF
>>> ACADIS and NSIDC Arctic Data Explorer). I will let you know what's going
>>> on.
>>>
>>> Is memory an issue? My vagrant only has 512MB of memory.
>>>
>>> Regards,
>>> Shuo Li
>>>
>>> On Fri, Feb 13, 2015 at 10:25 AM, Mattmann, Chris A (3980)
>>> <[email protected]> wrote:
>>>
>>> Hi Shuo,
>>>
>>> Thanks for your email. I wonder if using selenium grid would
>>> help?
>>>
>>> Please see this plugin:
>>>
>>> https://github.com/momer/nutch-selenium-grid-plugin
>>>
>>> I’m CC’ing Mo, the author of the plugin, to see if he experienced
>>> this while running the original selenium plugin - Mo, did using
>>> selenium grid help the issue that Shuo is experiencing below?
>>>
>>> Mo: are you cool with porting the grid plugin, or if Lewis or
>>> I do it to trunk (with full credit to you of course)?
>>>
>>> Cheers,
>>> Chris
>>>
>>> -----Original Message-----
>>> From: Shuo Li <[email protected]>
>>> Reply-To: "[email protected]" <[email protected]>
>>> Date: Friday, February 13, 2015 at 10:12 AM
>>> To: "[email protected]" <[email protected]>
>>> Subject: Vagrant Crushed When using Nutch-Selenium
>>>
>>>> Hey guys,
>>>>
>>>> I'm trying to use Nutch-Selenium to crawl
>>>> nutch.apache.org <http://nutch.apache.org>.
>>>> However, my vagrant seems to have crashed after a few minutes. I forced
>>>> it to shut down, and it turns out it only crawled 59 websites. My nutch
>>>> version is 1.10 and my OS is Ubuntu Trusty, 14.04.
>>>>
>>>> Is there anything I can provide to you guys? Or does anybody have the
>>>> same issue? Or is 59 websites the complete crawl?
>>>>
>>>> Any suggestion would be appreciated.
>>>>
>>>> Regards,
>>>> Shuo Li
>>

