Going to implement more configuration in the plugin, but based on the student emails I think your advice helped :)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Mo Omer <[email protected]>
Date: Sunday, February 22, 2015 at 5:45 PM
To: Chris Mattmann <[email protected]>
Cc: "[email protected]" <[email protected]>
Subject: Re: Vagrant Crushed When using Nutch-Selenium

>No problem! How'd it work out?
>
>Mo
>
>This message was drafted on a tiny touch screen; please forgive brevity &
>tpyos
>
>> On Feb 22, 2015, at 6:19 PM, "Mattmann, Chris A (3980)"
>> <[email protected]> wrote:
>>
>> Thanks Mo, great advice.
>>
>> -----Original Message-----
>> From: Jiaxin Ye <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Tuesday, February 17, 2015 at 2:49 PM
>> To: Mohammed Omer <[email protected]>
>> Cc: "[email protected]" <[email protected]>
>> Subject: Re: Vagrant Crushed When using Nutch-Selenium
>>
>>> Thank you so much!! I am going to try it out tonight.
>>>
>>> On Tuesday, February 17, 2015, Mohammed Omer <[email protected]>
>>> wrote:
>>>
>>> Jiaxin,
>>>
>>> Each page takes about 3 seconds to crawl due to this piece of code - we
>>> allow selenium 3 seconds to grab the page [0]. Due to what I was
>>> crawling, I didn't want to wait for a specific element/class/id to show
>>> up. However, you can change it up if you want.
>>> Selenium documentation [1] has more info on Ex/Implicit waiting.
>>>
>>> Again, it's not the most efficient way to crawl; but, if you need JS to
>>> render, it's a backwards way that ensures it happens. Selenium Grid has
>>> the benefit of being able to handle more throughput, but at the end of
>>> the day we're waiting for a browser to
>>> go out and fetch the url.
>>>
>>> I've suggested that most items be configurable when merged into trunk
>>> [2], but I'll make a specific call-out to the wait time.
>>>
>>> Due to the way Selenium standalone works, it's wayyyyyy less efficient
>>> than a 'Grid' set-up (hub + nodes) [3], which is why I switched to that
>>> set-up.
>>>
>>> Wish I could help out more, but 30 threads might be too much. 5 threads,
>>> at a total fetch/parse time of 4 seconds per url, would still
>>> theoretically churn out > 100k urls per day. There are multiple tweaks
>>> that could be made to optimize for your system,
>>> I'd start with reducing thread count, as you might be saturating your
>>> system [4].
>>>
>>> Sorry I can't be of more help!
>>>
>>> Thank you,
>>>
>>> Mo
>>>
>>> [0]: https://github.com/momer/nutch-selenium/blob/master/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java#L48-L49
>>> [1]: http://docs.seleniumhq.org/docs/04_webdriver_advanced.jsp
>>> [2]: https://issues.apache.org/jira/browse/NUTCH-1933
>>> [3]: https://code.google.com/p/selenium/wiki/Grid2
>>> [4]: http://stackoverflow.com/a/4895271
>>>
>>> On Mon, Feb 16, 2015 at 2:13 AM, Jiaxin Ye <[email protected]> wrote:
>>>
>>> I am using fetcher.threads.per.queue = 30 by the way.
>>>
>>> On Mon, Feb 16, 2015 at 12:08 AM, Jiaxin Ye <[email protected]> wrote:
>>>
>>> Hi Mo,
>>>
>>> I have a problem about the selenium plugin on mac. I think I successfully
>>> set it up on mac but I have a question about the performance.
>>> I am using a Mac with Intel Core i5 processor and 8GB ram, but I found
>>> that each url fetched takes about 1 seconds to open and close
>>> the firefox window. Is it a normal speed? or anything is wrong? And is it
>>> possible to install selenium grid plugin on Mac? I will cry if you
>>> ask me to change machine now......
>>>
>>> Best,
>>> Jiaxin
>>>
>>> On Fri, Feb 13, 2015 at 2:09 PM, Mohammed Omer <[email protected]> wrote:
>>>
>>> No worries man, glad everything works! Glad, since I was having hostname
>>> issues with nutch/hbase just now as I quickly tried to get it
>>> working/fixed for ya, ha.
>>>
>>> Mo
>>>
>>> On Fri, Feb 13, 2015 at 2:57 PM, Shuo Li <[email protected]> wrote:
>>>
>>> Hey guys,
>>>
>>> After change my RAM to 2GB, everything works fine. My bad. Thanks for
>>> your help.
>>>
>>> Regards,
>>> Shuo Li
>>>
>>> On Fri, Feb 13, 2015 at 11:34 AM, Mattmann, Chris A (3980)
>>> <[email protected]> wrote:
>>>
>>> Thank you Mo. I sincerely appreciate your guidance and contribution.
>>>
>>> I will work to get your nutch selenium grid plugin contributed
>>> to work with Nutch 1.x.
>>>
>>> Cheers,
>>> Chris
>>>
>>> -----Original Message-----
>>> From: Mo Omer <[email protected]>
>>> Date: Friday, February 13, 2015 at 11:10 AM
>>> To: Chris Mattmann <[email protected]>
>>> Cc: "[email protected]" <[email protected]>
>>> Subject: Re: Vagrant Crushed When using Nutch-Selenium
>>>
>>>> Hey all,
>>>>
>>>> When I had run nutch-selenium, it was in a config such that zombies were
>>>> created from closing Firefox windows and they couldn't be reaped (again,
>>>> due to the docker configuration I had).
>>>>
>>>> In a normal setup, it should not be an issue - if you're running 20
>>>> threads in nutch that's potentially 20 open FF windows which isn't good
>>>> for 512mb.
>>>>
>>>> Selenium grid is much more efficient, in that browsers are opened, but
>>>> tabs are used to fetch sites - and only those are closed.
>>>>
>>>> Additionally, ensure you're using Nutch 2.2.1.
>>>>
>>>> Feel free to fork patch and tinker and PR as needed.
>>>>
>>>> Chris, if you want to be added to contribs on the GitHub project, that's
>>>> cool with me!
>>>> Wish I could dedicate more time to this, but I don't
>>>> foresee using Nutch again in the near future, and am now working on
>>>> projects that require lots of reading and possibly patches to Caffe and
>>>> opencl r-CNN projects.
>>>>
>>>> Tl;dr:
>>>> - no, this shouldn't be typical unless you're creating zombies like crazy
>>>>   and they're not being reaped (too many open file descriptors), running
>>>>   out of memory, or similar resource constraint.
>>>> - selenium grid is TONs more efficient, but a bit more difficult to set
>>>>   up. I used it to crawl 100ks of sites.
>>>> - unfortunately I can't commit more time to this, but if I can assist in
>>>>   any admin way, let me know.
>>>>
>>>> Thank you,
>>>>
>>>> Mo
>>>>
>>>> This message was drafted on a tiny touch screen; please forgive brevity &
>>>> tpyos
>>>>
>>>>> On Feb 13, 2015, at 12:41 PM, "Mattmann, Chris A (3980)"
>>>>> <[email protected]> wrote:
>>>>>
>>>>> Oh yes, please up your memory to like at least 2Gb..
>>>>>
>>>>> -----Original Message-----
>>>>> From: Shuo Li <[email protected]>
>>>>> Reply-To: "[email protected]" <[email protected]>
>>>>> Date: Friday, February 13, 2015 at 10:38 AM
>>>>> To: "[email protected]" <[email protected]>
>>>>> Cc: Mo Omer <[email protected]>
>>>>> Subject: Re: Vagrant Crushed When using Nutch-Selenium
>>>>>
>>>>>> Hey Mo and Prof Mattmann,
>>>>>>
>>>>>> I will try to crawl the 3 websites in the homework tonight (NASA AMD,
>>>>>> NSF ACADIS and NSIDC Arctic Data Explorer). I will let you know what's
>>>>>> going on.
>>>>>>
>>>>>> Is memory an issue? My vagrant only has 512MB of memory.
>>>>>>
>>>>>> Regards,
>>>>>> Shuo Li
>>>>>>
>>>>>> On Fri, Feb 13, 2015 at 10:25 AM, Mattmann, Chris A (3980)
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> Hi Shuo,
>>>>>>
>>>>>> Thanks for your email. I wonder if using selenium grid would
>>>>>> help?
>>>>>>
>>>>>> Please see this plugin:
>>>>>> https://github.com/momer/nutch-selenium-grid-plugin
>>>>>>
>>>>>> I’m CC’ing Mo the author of the plugin to see if he experienced
>>>>>> this while running the original selenium plugin - Mo did using
>>>>>> selenium grid help the issue that Shuo is experiencing below?
>>>>>>
>>>>>> Mo: are you cool with portion the grid plugin, or if Lewis or
>>>>>> I do it to trunk (with full credit to you of course?)
>>>>>>
>>>>>> Cheers,
>>>>>> Chris
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Shuo Li <[email protected]>
>>>>>> Reply-To: "[email protected]" <[email protected]>
>>>>>> Date: Friday, February 13, 2015 at 10:12 AM
>>>>>> To: "[email protected]" <[email protected]>
>>>>>> Subject: Vagrant Crushed When using Nutch-Selenium
>>>>>>
>>>>>>> Hey guys,
>>>>>>>
>>>>>>> I'm trying to use Nutch-Selenium to crawl nutch.apache.org
>>>>>>> <http://nutch.apache.org>.
>>>>>>> However, my vagrant seems
>>>>>>> crushed after a few minutes. I forced it to shut down and it turns
>>>>>>> out it only crawled 59 websites. My nutch version is 1.10 and my OS is
>>>>>>> Ubuntu Trusty, 14.04.
>>>>>>>
>>>>>>> Is there anything I can provide to you guys? Or is there anybody have
>>>>>>> the same issue? Or 59 websites is the complete crawling?
>>>>>>>
>>>>>>> Any suggestion would be appreciated.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Shuo Li
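
Note on the throughput figure quoted in this thread: Mo's estimate (5 threads at a total fetch/parse time of about 4 seconds per URL giving more than 100k URLs per day) can be checked with a small sketch. The function name urls_per_day is hypothetical, and the calculation assumes every thread stays busy for the full 24 hours at a fixed per-URL cost, so it is a theoretical ceiling and not a measurement of Nutch or Selenium.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def urls_per_day(threads: int, seconds_per_url: float) -> int:
    """Theoretical upper bound on URLs fetched per day.

    Assumes (hypothetically) that each thread is never idle and that
    every URL costs the same fixed fetch/parse time.
    """
    if threads <= 0 or seconds_per_url <= 0:
        raise ValueError("threads and seconds_per_url must be positive")
    return int(threads * SECONDS_PER_DAY / seconds_per_url)


if __name__ == "__main__":
    # Figures quoted in the thread: 5 threads, 4 seconds per URL.
    print(urls_per_day(5, 4))
```

With the thread's numbers this comes to 108,000 URLs per day, consistent with the "> 100k" figure. A real crawl would likely land lower once politeness delays, retries and memory pressure are taken into account.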

