No problem! How'd it work out? Mo
This message was drafted on a tiny touch screen; please forgive brevity & tpyos

> On Feb 22, 2015, at 6:19 PM, "Mattmann, Chris A (3980)"
> <[email protected]> wrote:
>
> Thanks Mo, great advice.
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> -----Original Message-----
> From: Jiaxin Ye <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Tuesday, February 17, 2015 at 2:49 PM
> To: Mohammed Omer <[email protected]>
> Cc: "[email protected]" <[email protected]>
> Subject: Re: Vagrant Crushed When using Nutch-Selenium
>
>> Thank you so much!! I am going to try it out tonight.
>>
>> On Tuesday, February 17, 2015, Mohammed Omer <[email protected]>
>> wrote:
>>
>> Jiaxin,
>>
>> Each page takes about 3 seconds to crawl due to this piece of code - we
>> allow selenium 3 seconds to grab the page [0]. Due to what I was
>> crawling, I didn't want to wait for a specific element/class/id to show
>> up. However, you can change it up if you want.
>> Selenium documentation [1] has more info on explicit/implicit waits.
>>
>> Again, it's not the most efficient way to crawl; but, if you need JS to
>> render, it's a backwards way that ensures it happens. Selenium Grid has
>> the benefit of being able to handle more throughput, but at the end of
>> the day we're waiting for a browser to go out and fetch the url.
>>
>> I've suggested that most items be configurable when merged into trunk
>> [2], but I'll make a specific call-out to the wait time.
>>
>> Due to the way Selenium standalone works, it's wayyyyyy less efficient
>> than a 'Grid' set-up (hub + nodes) [3], which is why I switched to that
>> set-up.
>>
>> Wish I could help out more, but 30 threads might be too much. 5 threads,
>> at a total fetch/parse time of 4 seconds per url, would still
>> theoretically churn out > 100k urls per day. There are multiple tweaks
>> that could be made to optimize for your system; I'd start with reducing
>> thread count, as you might be saturating your system [4].
>>
>> Sorry I can't be of more help!
>>
>> Thank you,
>>
>> Mo
>>
>> [0]: https://github.com/momer/nutch-selenium/blob/master/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java#L48-L49
>> [1]: http://docs.seleniumhq.org/docs/04_webdriver_advanced.jsp
>> [2]: https://issues.apache.org/jira/browse/NUTCH-1933
>> [3]: https://code.google.com/p/selenium/wiki/Grid2
>> [4]: http://stackoverflow.com/a/4895271
>>
>> On Mon, Feb 16, 2015 at 2:13 AM, Jiaxin Ye <[email protected]> wrote:
>>
>> I am using fetcher.threads.per.queue = 30 by the way.
>>
>> On Mon, Feb 16, 2015 at 12:08 AM, Jiaxin Ye <[email protected]> wrote:
>>
>> Hi Mo,
>>
>> I have a problem with the selenium plugin on Mac. I think I successfully
>> set it up on Mac but I have a question about the performance.
>> I am using a Mac with Intel Core i5 processor and 8GB ram, but I found
>> that each url fetched takes about 1 second to open and close
>> the firefox window.
>> Is it a normal speed? Or is anything wrong? And is it
>> possible to install selenium grid plugin on Mac? I will cry if you
>> ask me to change machine now......
>>
>> Best,
>> Jiaxin
>>
>> On Fri, Feb 13, 2015 at 2:09 PM, Mohammed Omer <[email protected]> wrote:
>>
>> No worries man, glad everything works! Glad, since I was having hostname
>> issues with nutch/hbase just now as I quickly tried to get it
>> working/fixed for ya, ha.
>>
>> Mo
>>
>> On Fri, Feb 13, 2015 at 2:57 PM, Shuo Li <[email protected]> wrote:
>>
>> Hey guys,
>>
>> After changing my RAM to 2GB, everything works fine. My bad. Thanks for
>> your help.
>>
>> Regards,
>> Shuo Li
>>
>> On Fri, Feb 13, 2015 at 11:34 AM, Mattmann, Chris A (3980)
>> <[email protected]> wrote:
>>
>> Thank you Mo. I sincerely appreciate your guidance and contribution.
>>
>> I will work to get your nutch selenium grid plugin contributed
>> to work with Nutch 1.x.
>>
>> Cheers,
>> Chris
>>
>> -----Original Message-----
>> From: Mo Omer <[email protected]>
>> Date: Friday, February 13, 2015 at 11:10 AM
>> To: Chris Mattmann <[email protected]>
>> Cc: "[email protected]" <[email protected]>
>> Subject: Re: Vagrant Crushed When using Nutch-Selenium
>>
>>> Hey all,
>>>
>>> When I had run nutch-selenium, it was in a config such that zombies were
>>> created from closing Firefox windows and they couldn't be reaped (again,
>>> due to the docker configuration I had).
>>>
>>> In a normal setup, it should not be an issue - if you're running 20
>>> threads in nutch that's potentially 20 open FF windows which isn't good
>>> for 512mb.
>>>
>>> Selenium grid is much more efficient, in that browsers are opened, but
>>> tabs are used to fetch sites - and only those are closed.
>>>
>>> Additionally, ensure you're using Nutch 2.2.1.
>>>
>>> Feel free to fork, patch, tinker, and PR as needed.
>>>
>>> Chris, if you want to be added to contribs on the GitHub project, that's
>>> cool with me!
>>> Wish I could dedicate more time to this, but I don't
>>> foresee using Nutch again in the near future, and am now working on
>>> projects that require lots of reading and possibly patches to Caffe and
>>> OpenCL R-CNN projects.
>>>
>>> Tl;dr:
>>> - no, this shouldn't be typical unless you're creating zombies like crazy
>>>   and they're not being reaped (too many open file descriptors), running
>>>   out of memory, or hitting a similar resource constraint.
>>> - selenium grid is TONs more efficient, but a bit more difficult to set
>>>   up. I used it to crawl 100ks of sites.
>>> - unfortunately I can't commit more time to this, but if I can assist in
>>>   any admin way, let me know.
>>>
>>> Thank you,
>>>
>>> Mo
>>>
>>> This message was drafted on a tiny touch screen; please forgive brevity &
>>> tpyos
>>>
>>>> On Feb 13, 2015, at 12:41 PM, "Mattmann, Chris A (3980)"
>>>> <[email protected]> wrote:
>>>>
>>>> Oh yes, please up your memory to like at least 2Gb..
>>>>
>>>> -----Original Message-----
>>>> From: Shuo Li <[email protected]>
>>>> Reply-To: "[email protected]" <[email protected]>
>>>> Date: Friday, February 13, 2015 at 10:38 AM
>>>> To: "[email protected]" <[email protected]>
>>>> Cc: Mo Omer <[email protected]>
>>>> Subject: Re: Vagrant Crushed When using Nutch-Selenium
>>>>
>>>>> Hey Mo and Prof Mattmann,
>>>>>
>>>>> I will try to crawl the 3 websites in the homework tonight (NASA AMD,
>>>>> NSF ACADIS and NSIDC Arctic Data Explorer). I will let you know what's
>>>>> going on.
>>>>>
>>>>> Is memory an issue? My vagrant only has 512MB of memory.
>>>>>
>>>>> Regards,
>>>>> Shuo Li
>>>>>
>>>>> On Fri, Feb 13, 2015 at 10:25 AM, Mattmann, Chris A (3980)
>>>>> <[email protected]> wrote:
>>>>>
>>>>> Hi Shuo,
>>>>>
>>>>> Thanks for your email. I wonder if using selenium grid would
>>>>> help?
>>>>>
>>>>> Please see this plugin:
>>>>> https://github.com/momer/nutch-selenium-grid-plugin
>>>>>
>>>>> I’m CC’ing Mo, the author of the plugin, to see if he experienced
>>>>> this while running the original selenium plugin - Mo, did using
>>>>> selenium grid help the issue that Shuo is experiencing below?
>>>>>
>>>>> Mo: are you cool with porting the grid plugin, or if Lewis or
>>>>> I do it to trunk (with full credit to you of course)?
>>>>>
>>>>> Cheers,
>>>>> Chris
>>>>>
>>>>> -----Original Message-----
>>>>> From: Shuo Li <[email protected]>
>>>>> Reply-To: "[email protected]" <[email protected]>
>>>>> Date: Friday, February 13, 2015 at 10:12 AM
>>>>> To: "[email protected]" <[email protected]>
>>>>> Subject: Vagrant Crushed When using Nutch-Selenium
>>>>>
>>>>>> Hey guys,
>>>>>>
>>>>>> I'm trying to use Nutch-Selenium to crawl nutch.apache.org.
>>>>>> However, my vagrant seems to have crashed after a few minutes. I
>>>>>> forced it to shut down and it turns out it only crawled 59
>>>>>> websites. My nutch version is 1.10 and my OS is Ubuntu Trusty,
>>>>>> 14.04.
>>>>>>
>>>>>> Is there anything I can provide to you guys? Or does anybody have
>>>>>> the same issue? Or is 59 websites the complete crawl?
>>>>>>
>>>>>> Any suggestion would be appreciated.
>>>>>>
>>>>>> Regards,
>>>>>> Shuo Li

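Mo's throughput figure earlier in this thread (5 threads at roughly 4 seconds per URL giving more than 100k URLs per day) can be checked with a quick back-of-the-envelope calculation. The Python sketch below is only an illustration: the function name urls_per_day is made up for this note, and it assumes every thread stays busy all day with no politeness delays, queue stalls, or failed fetches, so the result is an upper bound rather than a measured rate.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def urls_per_day(threads: int, seconds_per_url: float) -> int:
    """Theoretical upper bound on URLs fetched per day.

    Assumes (hypothetically) that each thread fetches one URL at a time
    and is never idle; real crawls will come in below this number.
    """
    if threads <= 0 or seconds_per_url <= 0:
        raise ValueError("threads and seconds_per_url must be positive")
    return int(threads * SECONDS_PER_DAY / seconds_per_url)


if __name__ == "__main__":
    # Figures quoted in the thread: 5 threads, ~4 s fetch/parse per URL.
    print(urls_per_day(5, 4))   # 108000
    # The 30-thread setting, assuming the per-URL time stayed at 4 s.
    print(urls_per_day(30, 4))  # 648000
```

The 30-thread number only holds if the machine can sustain 30 browser sessions at the same per-URL time, which the discussion above suggests it may not.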
