I am using fetcher.threads.per.queue = 30, by the way.

On Mon, Feb 16, 2015 at 12:08 AM, Jiaxin Ye <[email protected]> wrote:
> Hi Mo,
>
> I have a problem with the selenium plugin on Mac. I think I successfully
> set it up, but I have a question about the performance.
> I am using a Mac with an Intel Core i5 processor and 8GB RAM, but I found
> that each URL fetched takes about 1 second to open and close
> the Firefox window. Is that a normal speed, or is something wrong? And is it
> possible to install the selenium grid plugin on Mac? I will cry if you
> ask me to change machines now......
>
> Best,
> Jiaxin
>
> On Fri, Feb 13, 2015 at 2:09 PM, Mohammed Omer <[email protected]>
> wrote:
>
>> No worries man, glad everything works! Glad, since I was having hostname
>> issues with nutch/hbase just now as I quickly tried to get it working/fixed
>> for ya, ha.
>>
>> Mo
>>
>> On Fri, Feb 13, 2015 at 2:57 PM, Shuo Li <[email protected]> wrote:
>>
>>> Hey guys,
>>>
>>> After changing my RAM to 2GB, everything works fine. My bad. Thanks for
>>> your help.
>>>
>>> Regards,
>>> Shuo Li
>>>
>>> On Fri, Feb 13, 2015 at 11:34 AM, Mattmann, Chris A (3980) <
>>> [email protected]> wrote:
>>>
>>>> Thank you Mo. I sincerely appreciate your guidance and contribution.
>>>>
>>>> I will work to get your nutch selenium grid plugin contributed
>>>> to work with Nutch 1.x.
>>>>
>>>> Cheers,
>>>> Chris
>>>>
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> Chris Mattmann, Ph.D.
>>>> Chief Architect
>>>> Instrument Software and Science Data Systems Section (398)
>>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>> Office: 168-519, Mailstop: 168-527
>>>> Email: [email protected]
>>>> WWW: http://sunset.usc.edu/~mattmann/
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> Adjunct Associate Professor, Computer Science Department
>>>> University of Southern California, Los Angeles, CA 90089 USA
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>
>>>> -----Original Message-----
>>>> From: Mo Omer <[email protected]>
>>>> Date: Friday, February 13, 2015 at 11:10 AM
>>>> To: Chris Mattmann <[email protected]>
>>>> Cc: "[email protected]" <[email protected]>
>>>> Subject: Re: Vagrant Crushed When using Nutch-Selenium
>>>>
>>>> >Hey all,
>>>> >
>>>> >When I had run nutch-selenium, it was in a config such that zombies were
>>>> >created from closing Firefox windows and they couldn't be reaped (again,
>>>> >due to the docker configuration I had).
>>>> >
>>>> >In a normal setup, it should not be an issue - if you're running 20
>>>> >threads in nutch that's potentially 20 open FF windows, which isn't good
>>>> >for 512mb.
>>>> >
>>>> >Selenium grid is much more efficient, in that browsers are opened, but
>>>> >tabs are used to fetch sites - and only those are closed.
>>>> >
>>>> >Additionally, ensure you're using Nutch 2.2.1.
>>>> >
>>>> >Feel free to fork, patch, tinker, and PR as needed.
>>>> >
>>>> >Chris, if you want to be added to contribs on the GitHub project, that's
>>>> >cool with me! Wish I could dedicate more time to this, but I don't
>>>> >foresee using Nutch again in the near future, and am now working on
>>>> >projects that require lots of reading and possibly patches to Caffe and
>>>> >opencl r-CNN projects.
>>>> >
>>>> >Tl;dr:
>>>> >- no, this shouldn't be typical unless you're creating zombies like crazy
>>>> >and they're not being reaped (too many open file descriptors), running
>>>> >out of memory, or hitting a similar resource constraint.
>>>> >- selenium grid is TONs more efficient, but a bit more difficult to set
>>>> >up. I used it to crawl 100ks of sites.
>>>> >- unfortunately I can't commit more time to this, but if I can assist in
>>>> >any admin way, let me know.
>>>> >
>>>> >Thank you,
>>>> >
>>>> >Mo
>>>> >
>>>> >This message was drafted on a tiny touch screen; please forgive brevity &
>>>> >tpyos
>>>> >
>>>> >> On Feb 13, 2015, at 12:41 PM, "Mattmann, Chris A (3980)"
>>>> >><[email protected]> wrote:
>>>> >>
>>>> >> Oh yes, please up your memory to at least 2GB.
>>>> >>
>>>> >> -----Original Message-----
>>>> >> From: Shuo Li <[email protected]>
>>>> >> Reply-To: "[email protected]" <[email protected]>
>>>> >> Date: Friday, February 13, 2015 at 10:38 AM
>>>> >> To: "[email protected]" <[email protected]>
>>>> >> Cc: Mo Omer <[email protected]>
>>>> >> Subject: Re: Vagrant Crushed When using Nutch-Selenium
>>>> >>
>>>> >>> Hey Mo and Prof Mattmann,
>>>> >>>
>>>> >>> I will try to crawl the 3 websites in the homework tonight (NASA AMD,
>>>> >>> NSF ACADIS and NSIDC Arctic Data Explorer). I will let you know what's
>>>> >>> going on.
>>>> >>>
>>>> >>> Is memory an issue? My vagrant only has 512MB of memory.
>>>> >>>
>>>> >>> Regards,
>>>> >>> Shuo Li
>>>> >>>
>>>> >>> On Fri, Feb 13, 2015 at 10:25 AM, Mattmann, Chris A (3980)
>>>> >>> <[email protected]> wrote:
>>>> >>>
>>>> >>> Hi Shuo,
>>>> >>>
>>>> >>> Thanks for your email. I wonder if using selenium grid would
>>>> >>> help?
>>>> >>>
>>>> >>> Please see this plugin:
>>>> >>>
>>>> >>> https://github.com/momer/nutch-selenium-grid-plugin
>>>> >>>
>>>> >>> I’m CC’ing Mo, the author of the plugin, to see if he experienced
>>>> >>> this while running the original selenium plugin - Mo, did using
>>>> >>> selenium grid help the issue that Shuo is experiencing below?
>>>> >>>
>>>> >>> Mo: are you cool with porting the grid plugin, or with Lewis or
>>>> >>> me porting it to trunk (with full credit to you of course)?
>>>> >>>
>>>> >>> Cheers,
>>>> >>> Chris
>>>> >>>
>>>> >>> -----Original Message-----
>>>> >>> From: Shuo Li <[email protected]>
>>>> >>> Reply-To: "[email protected]" <[email protected]>
>>>> >>> Date: Friday, February 13, 2015 at 10:12 AM
>>>> >>> To: "[email protected]" <[email protected]>
>>>> >>> Subject: Vagrant Crushed When using Nutch-Selenium
>>>> >>>
>>>> >>>> Hey guys,
>>>> >>>>
>>>> >>>> I'm trying to use Nutch-Selenium to crawl
>>>> >>>> nutch.apache.org <http://nutch.apache.org>.
>>>> >>>> However, my vagrant seems to have crashed after a few minutes. I
>>>> >>>> forced it to shut down and it turns out it only crawled 59 websites.
>>>> >>>> My nutch version is 1.10 and my OS is Ubuntu Trusty, 14.04.
>>>> >>>>
>>>> >>>> Is there anything I can provide to you guys? Or does anybody have
>>>> >>>> the same issue? Or is 59 websites the complete crawl?
>>>> >>>>
>>>> >>>> Any suggestion would be appreciated.
>>>> >>>>
>>>> >>>> Regards,
>>>> >>>> Shuo Li
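
For anyone sizing a thread setting such as fetcher.threads.per.queue against available memory, here is a rough sketch of the arithmetic behind Mo's point that each fetcher thread can mean one open Firefox window. The function name and the per-browser and reserved-memory figures are assumptions made up for illustration, not measurements from this thread or values from Nutch; measure your own setup before relying on them.

```python
def max_browser_threads(total_ram_mb: int, reserved_mb: int, per_browser_mb: int) -> int:
    """Estimate how many browser windows fit in RAM.

    total_ram_mb:   RAM available to the machine or VM.
    reserved_mb:    assumed memory kept back for the OS and the Nutch JVM.
    per_browser_mb: assumed memory used by one open browser window.
    """
    if per_browser_mb <= 0:
        raise ValueError("per_browser_mb must be positive")
    usable_mb = total_ram_mb - reserved_mb
    # Integer division: a window that only partly fits does not count.
    return max(usable_mb // per_browser_mb, 0)


if __name__ == "__main__":
    # Hypothetical figures: 256 MB reserved, 150 MB per Firefox window.
    for ram_mb in (512, 2048, 8192):
        print(ram_mb, "MB ->", max_browser_threads(ram_mb, 256, 150), "windows")
```

If the estimate comes out below the configured thread count, lowering the thread count or adding memory (as Shuo did, going from 512MB to 2GB) is the first thing to try.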

