Thanks Mo, great advice.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-----Original Message-----
From: Jiaxin Ye <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Tuesday, February 17, 2015 at 2:49 PM
To: Mohammed Omer <[email protected]>
Cc: "[email protected]" <[email protected]>
Subject: Re: Vagrant Crushed When using Nutch-Selenium

>Thank you so much!! I am going to try it out tonight.
>
>On Tuesday, February 17, 2015, Mohammed Omer <[email protected]> wrote:
>
>Jiaxin,
>
>Each page takes about 3 seconds to crawl due to this piece of code - we
>allow selenium 3 seconds to grab the page [0]. Due to what I was
>crawling, I didn't want to wait for a specific element/class/id to show
>up. However, you can change it up if you want.
>Selenium documentation [1] has more info on Ex/Implicit waiting.
>
>Again, it's not the most efficient way to crawl; but, if you need JS to
>render, it's a backwards way that ensures it happens. Selenium Grid has
>the benefit of being able to handle more throughput, but at the end of
>the day we're waiting for a browser to go out and fetch the url.
>
>I've suggested that most items be configurable when merged into trunk
>[2], but I'll make a specific call-out to the wait time.
>
>Due to the way Selenium standalone works, it's wayyyyyy less efficient
>than a 'Grid' set-up (hub + nodes) [3], which is why I switched to that
>set-up.
>
>Wish I could help out more, but 30 threads might be too much. 5 threads,
>at a total fetch/parse time of 4 seconds per url, would still
>theoretically churn out > 100k urls per day. There are multiple tweaks
>that could be made to optimize for your system,
>I'd start with reducing thread count, as you might be saturating your
>system [4].
>
>Sorry I can't be of more help!
>
>Thank you,
>
>Mo
>
>[0]: https://github.com/momer/nutch-selenium/blob/master/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java#L48-L49
>[1]: http://docs.seleniumhq.org/docs/04_webdriver_advanced.jsp
>[2]: https://issues.apache.org/jira/browse/NUTCH-1933
>[3]: https://code.google.com/p/selenium/wiki/Grid2
>[4]: http://stackoverflow.com/a/4895271
>
>On Mon, Feb 16, 2015 at 2:13 AM, Jiaxin Ye <[email protected]> wrote:
>
>I am using fetcher.threads.per.queue = 30 by the way.
>
>On Mon, Feb 16, 2015 at 12:08 AM, Jiaxin Ye <[email protected]> wrote:
>
>Hi Mo,
>
>I have a problem with the selenium plugin on mac. I think I successfully
>set it up on mac but I have a question about the performance.
>I am using a Mac with Intel Core i5 processor and 8GB ram, but I found
>that each url fetched takes about 1 second to open and close
>the firefox window. Is that a normal speed, or is something wrong? And is it
>possible to install selenium grid plugin on Mac? I will cry if you
>ask me to change machine now......
>
>Best,
>Jiaxin
>
>On Fri, Feb 13, 2015 at 2:09 PM, Mohammed Omer <[email protected]> wrote:
>
>No worries man, glad everything works! Glad, since I was having hostname
>issues with nutch/hbase just now as I quickly tried to get it
>working/fixed for ya, ha.
>
>Mo
>
>On Fri, Feb 13, 2015 at 2:57 PM, Shuo Li <[email protected]> wrote:
>
>Hey guys,
>
>After changing my RAM to 2GB, everything works fine. My bad. Thanks for
>your help.
>
>Regards,
>Shuo Li
>
>On Fri, Feb 13, 2015 at 11:34 AM, Mattmann, Chris A (3980) <[email protected]> wrote:
>
>Thank you Mo. I sincerely appreciate your guidance and contribution.
>
>I will work to get your nutch selenium grid plugin contributed
>to work with Nutch 1.x.
>
>Cheers,
>Chris
>
>-----Original Message-----
>From: Mo Omer <[email protected]>
>Date: Friday, February 13, 2015 at 11:10 AM
>To: Chris Mattmann <[email protected]>
>Cc: "[email protected]" <[email protected]>
>Subject: Re: Vagrant Crushed When using Nutch-Selenium
>
>>Hey all,
>>
>>When I had run nutch-selenium, it was in a config such that zombies were
>>created from closing Firefox windows and they couldn't be reaped (again,
>>due to the docker configuration I had).
>>
>>In a normal setup, it should not be an issue - if you're running 20
>>threads in nutch that's potentially 20 open FF windows which isn't good
>>for 512mb.
>>
>>Selenium grid is much more efficient, in that browsers are opened, but
>>tabs are used to fetch sites - and only those are closed.
>>
>>Additionally, ensure you're using Nutch 2.2.1.
>>
>>Feel free to fork patch and tinker and PR as needed.
>>
>>Chris, if you want to be added to contribs on the GitHub project, that's
>>cool with me! Wish I could dedicate more time to this, but I don't
>>foresee using Nutch again in the near future, and am now working on
>>projects that require lots of reading and possibly patches to Caffe and
>>opencl r-CNN projects.
>>
>>Tl;dr:
>>- no, this shouldn't be typical unless you're creating zombies like crazy
>>and they're not being reaped (too many open file descriptors), running
>>out of memory, or similar resource constraint.
>>- selenium grid is TONs more efficient, but a bit more difficult to set
>>up. I used it to crawl 100ks of sites.
>>- unfortunately I can't commit more time to this, but if I can assist in
>>any admin way, let me know.
>>
>>Thank you,
>>
>>Mo
>>
>>This message was drafted on a tiny touch screen; please forgive brevity &
>>tpyos
>>
>>> On Feb 13, 2015, at 12:41 PM, "Mattmann, Chris A (3980)" <[email protected]> wrote:
>>>
>>> Oh yes, please up your memory to like at least 2Gb..
>>>
>>> -----Original Message-----
>>> From: Shuo Li <[email protected]>
>>> Reply-To: "[email protected]" <[email protected]>
>>> Date: Friday, February 13, 2015 at 10:38 AM
>>> To: "[email protected]" <[email protected]>
>>> Cc: Mo Omer <[email protected]>
>>> Subject: Re: Vagrant Crushed When using Nutch-Selenium
>>>
>>>> Hey Mo and Prof Mattmann,
>>>>
>>>> I will try to crawl the 3 websites in the homework tonight (NASA AMD, NSF
>>>> ACADIS and NSIDC Arctic Data Explorer). I will let you know what's going
>>>> on.
>>>>
>>>> Is memory an issue? My vagrant only has 512MB of memory.
>>>>
>>>> Regards,
>>>> Shuo Li
>>>>
>>>> On Fri, Feb 13, 2015 at 10:25 AM, Mattmann, Chris A (3980) <[email protected]> wrote:
>>>>
>>>> Hi Shuo,
>>>>
>>>> Thanks for your email. I wonder if using selenium grid would
>>>> help?
>>>>
>>>> Please see this plugin:
>>>>
>>>> https://github.com/momer/nutch-selenium-grid-plugin
>>>>
>>>> I’m CC’ing Mo, the author of the plugin, to see if he experienced
>>>> this while running the original selenium plugin - Mo, did using
>>>> selenium grid help the issue that Shuo is experiencing below?
>>>>
>>>> Mo: are you cool with porting the grid plugin to trunk, or with Lewis
>>>> or me doing it (with full credit to you, of course)?
>>>>
>>>> Cheers,
>>>> Chris
>>>>
>>>> -----Original Message-----
>>>> From: Shuo Li <[email protected]>
>>>> Reply-To: "[email protected]" <[email protected]>
>>>> Date: Friday, February 13, 2015 at 10:12 AM
>>>> To: "[email protected]" <[email protected]>
>>>> Subject: Vagrant Crushed When using Nutch-Selenium
>>>>
>>>>> Hey guys,
>>>>>
>>>>> I'm trying to use Nutch-Selenium to crawl nutch.apache.org
>>>>> <http://nutch.apache.org>. However, my vagrant seems to have crashed
>>>>> after a few minutes. I forced it to shut down and it turns out it
>>>>> only crawled 59 websites. My nutch version is 1.10 and my OS is Ubuntu
>>>>> Trusty, 14.04.
>>>>>
>>>>> Is there anything I can provide to you guys? Or does anybody have the
>>>>> same issue? Or is 59 websites the complete crawl?
>>>>>
>>>>> Any suggestion would be appreciated.
>>>>>
>>>>> Regards,
>>>>> Shuo Li
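
Mo's estimate earlier in the thread (5 threads at roughly 4 seconds of fetch/parse time per URL giving more than 100k URLs per day) can be checked with a quick back-of-the-envelope calculation. The sketch below is only illustrative: the function name urls_per_day is hypothetical and not part of Nutch or the selenium plugin, and it assumes every thread fetches continuously with no politeness delays, failures, or browser start-up overhead, so real crawls will come in lower.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def urls_per_day(threads: int, seconds_per_url: float) -> float:
    """Theoretical upper bound on URLs fetched per day.

    Assumes each thread fetches back to back at a constant
    seconds_per_url (an idealisation, not a measured figure).
    """
    if threads <= 0 or seconds_per_url <= 0:
        raise ValueError("threads and seconds_per_url must be positive")
    return threads * SECONDS_PER_DAY / seconds_per_url


if __name__ == "__main__":
    # The figures quoted in the thread: 5 threads, 4 seconds per URL.
    print(urls_per_day(5, 4))  # 108000.0
```

Under these assumptions, 5 threads already clear the 100k mark, which is why reducing the thread count from 30 costs less throughput than it might first appear if the machine is saturated.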

