Thanks Mo, great advice.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






-----Original Message-----
From: Jiaxin Ye <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Tuesday, February 17, 2015 at 2:49 PM
To: Mohammed Omer <[email protected]>
Cc: "[email protected]" <[email protected]>
Subject: Re: Vagrant Crushed When using Nutch-Selenium

>Thank you so much!! I am going to try it out tonight.
>
>On Tuesday, February 17, 2015, Mohammed Omer <[email protected]>
>wrote:
>
>Jiaxin, 
>
>
>Each page takes about 3 seconds to crawl because of this piece of code: we
>allow Selenium 3 seconds to grab the page [0]. Given what I was
>crawling, I didn't want to wait for a specific element/class/id to show
>up. However, you can change that if you want.
>The Selenium documentation [1] has more info on explicit and implicit waits.
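
To illustrate the trade-off described above: a fixed pause always costs the full 3 seconds, while an explicit wait polls for a condition and returns as soon as it holds, up to a timeout. The sketch below is plain Python, not the plugin's Java code and not Selenium's own API; `wait_until`, `make_page_ready`, and the timings are hypothetical names and values chosen only for illustration.

```python
import time


def wait_until(condition, timeout=3.0, poll_interval=0.1):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass.

    Returns the truthy value, or None if the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll_interval)


def make_page_ready(ready_after):
    """Hypothetical stand-in for 'has the element I need appeared yet?'.

    The returned callable starts answering True once `ready_after`
    seconds have elapsed since it was created.
    """
    start = time.monotonic()
    return lambda: time.monotonic() - start >= ready_after


if __name__ == "__main__":
    started = time.monotonic()
    found = wait_until(make_page_ready(0.3), timeout=3.0)
    # Report whether the condition was met and roughly how long we waited.
    print(found, round(time.monotonic() - started, 1))
```

With a page that becomes ready quickly, the polling version returns well before the 3-second ceiling; the fixed pause would not.
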
>
>
>Again, it's not the most efficient way to crawl; but if you need JS to
>render, it's a roundabout way of ensuring that happens. Selenium Grid has
>the benefit of being able to handle more throughput, but at the end of
>the day we're still waiting for a browser to go out and fetch the URL.
>
>
>I've suggested that most items be configurable when merged into trunk
>[2], but I'll make a specific call-out to the wait time.
>
>
>Due to the way Selenium standalone works, it's wayyyyyy less efficient
>than a 'Grid' set-up (hub + nodes) [3], which is why I switched to that
>set-up. 
>
>
>Wish I could help out more, but 30 threads might be too much. 5 threads,
>at a total fetch/parse time of 4 seconds per URL, would still
>theoretically churn out > 100k URLs per day. There are multiple tweaks
>that could be made to optimize for your system; I'd start with reducing
>the thread count, as you might be saturating your system [4].
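
The throughput figure above can be checked with a quick back-of-the-envelope calculation. This is only a sketch: it assumes every thread is busy all day and handles one URL at a time, and `urls_per_day` is a hypothetical helper, not part of Nutch.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def urls_per_day(threads, seconds_per_url):
    """Theoretical upper bound on daily fetches.

    Assumes each thread processes one URL at a time with no idle gaps,
    so real crawls will come in lower.
    """
    return int(threads * SECONDS_PER_DAY / seconds_per_url)


if __name__ == "__main__":
    # 5 threads at 4 seconds per URL: 5 * 86400 / 4 = 108000
    print(urls_per_day(5, 4))
```

So 5 threads at 4 seconds per URL gives a ceiling of 108,000 URLs per day, consistent with the "> 100k" estimate.
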
>
>
>Sorry I can't be of more help!
>
>
>Thank you,
>
>
>Mo
>
>
>[0]: https://github.com/momer/nutch-selenium/blob/master/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java#L48-L49
>[1]: http://docs.seleniumhq.org/docs/04_webdriver_advanced.jsp
>[2]: https://issues.apache.org/jira/browse/NUTCH-1933
>[3]: https://code.google.com/p/selenium/wiki/Grid2
>[4]: http://stackoverflow.com/a/4895271
>
>
>On Mon, Feb 16, 2015 at 2:13 AM, Jiaxin Ye <[email protected]> wrote:
>
>I am using fetcher.threads.per.queue = 30 by the way.
>
>On Mon, Feb 16, 2015 at 12:08 AM, Jiaxin Ye <[email protected]> wrote:
>
>Hi Mo,
>
>
>I have a question about the Selenium plugin on Mac. I think I set it up
>successfully, but I am unsure about the performance.
>I am using a Mac with an Intel Core i5 processor and 8GB of RAM, but I found
>that each URL fetched takes about 1 second to open and close
>the Firefox window. Is that a normal speed, or is something wrong? And is it
>possible to install the Selenium Grid plugin on a Mac? I will cry if you
>ask me to change machines now......
>
>
>Best,
>Jiaxin
>
>
>On Fri, Feb 13, 2015 at 2:09 PM, Mohammed Omer <[email protected]> wrote:
>
>No worries man, glad everything works! Glad, since I was having hostname
>issues with nutch/hbase just now as I quickly tried to get it
>working/fixed for ya, ha.
>
>Mo
>
>
>On Fri, Feb 13, 2015 at 2:57 PM, Shuo Li <[email protected]> wrote:
>
>Hey guys,
>
>
>After changing my RAM to 2GB, everything works fine. My bad. Thanks for
>your help.
>
>
>Regards,
>Shuo Li
>
>
>On Fri, Feb 13, 2015 at 11:34 AM, Mattmann, Chris A (3980)
><[email protected]> wrote:
>
>Thank you Mo. I sincerely appreciate your guidance and contribution.
>
>I will work to get your nutch selenium grid plugin contributed
>to work with Nutch 1.x.
>
>Cheers,
>Chris
>
>
>-----Original Message-----
>From: Mo Omer <[email protected]>
>Date: Friday, February 13, 2015 at 11:10 AM
>To: Chris Mattmann <[email protected]>
>Cc: "[email protected]" <[email protected]>
>Subject: Re: Vagrant Crushed When using Nutch-Selenium
>
>>Hey all,
>>
>>When I ran nutch-selenium, it was in a config where zombie processes were
>>created from closing Firefox windows and couldn't be reaped (again,
>>due to the Docker configuration I had).
>>
>>In a normal setup, it should not be an issue. But if you're running 20
>>threads in Nutch, that's potentially 20 open Firefox windows, which isn't
>>good for 512MB.
>>
>>Selenium grid is much more efficient, in that browsers are opened, but
>>tabs are used to fetch sites - and only those are closed.
>>
>>Additionally, ensure you're using Nutch 2.2.1.
>>
>>Feel free to fork, patch, tinker, and PR as needed.
>>
>>Chris, if you want to be added to contribs on the GitHub project, that's
>>cool with me! Wish I could dedicate more time to this, but I don't
>>foresee using Nutch again in the near future, and am now working on
>>projects that require lots of reading and possibly patches to Caffe and
>>OpenCL R-CNN projects.
>>
>>Tl;dr:
>>- no, this shouldn't be typical unless you're creating zombies like crazy
>>and they're not being reaped (too many open file descriptors), running
>>out of memory, or similar resource constraint.
>>- selenium grid is TONs more efficient, but a bit more difficult to set
>>up. I used it to crawl 100ks of sites.
>>- unfortunately I can't commit more time to this, but if I can assist in
>>any admin way, let me know.
>>
>>Thank you,
>>
>>Mo
>>
>>This message was drafted on a tiny touch screen; please forgive brevity &
>>tpyos
>>
>>> On Feb 13, 2015, at 12:41 PM, "Mattmann, Chris A (3980)"
>>> <[email protected]> wrote:
>>>
>>> Oh yes, please up your memory to at least 2GB.
>>>
>>> -----Original Message-----
>>> From: Shuo Li <[email protected]>
>>> Reply-To: "[email protected]" <[email protected]>
>>> Date: Friday, February 13, 2015 at 10:38 AM
>>> To: "[email protected]" <[email protected]>
>>> Cc: Mo Omer <[email protected]>
>>> Subject: Re: Vagrant Crushed When using Nutch-Selenium
>>>
>>>> Hey Mo and Prof Mattmann,
>>>>
>>>>
>>>> I will try to crawl the 3 websites in the homework tonight (NASA AMD,
>>>> NSF ACADIS and NSIDC Arctic Data Explorer). I will let you know what's
>>>> going on.
>>>>
>>>>
>>>> Is memory an issue? My Vagrant VM only has 512MB of memory.
>>>>
>>>>
>>>> Regards,
>>>> Shuo Li
>>>>
>>>>
>>>> On Fri, Feb 13, 2015 at 10:25 AM, Mattmann, Chris A (3980)
>>>> <[email protected]> wrote:
>>>>
>>>> Hi Shuo,
>>>>
>>>> Thanks for your email. I wonder if using selenium grid would
>>>> help?
>>>>
>>>> Please see this plugin:
>>>>
>>>> https://github.com/momer/nutch-selenium-grid-plugin
>>>>
>>>>
>>>> I’m CC’ing Mo, the author of the plugin, to see if he experienced
>>>> this while running the original selenium plugin. Mo, did using
>>>> selenium grid help with the issue that Shuo is experiencing below?
>>>>
>>>> Mo: are you cool with porting the grid plugin to trunk, or with Lewis
>>>> or me doing it (with full credit to you, of course)?
>>>>
>>>> Cheers,
>>>> Chris
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Shuo Li <[email protected]>
>>>> Reply-To: "[email protected]" <[email protected]>
>>>> Date: Friday, February 13, 2015 at 10:12 AM
>>>> To: "[email protected]" <[email protected]>
>>>> Subject: Vagrant Crushed When using Nutch-Selenium
>>>>
>>>>> Hey guys,
>>>>>
>>>>>
>>>>> I'm trying to use Nutch-Selenium to crawl nutch.apache.org
>>>>> <http://nutch.apache.org>. However, my Vagrant VM seems to have
>>>>> crashed after a few minutes. I forced it to shut down, and it turns
>>>>> out it only crawled 59 websites. My Nutch version is 1.10 and my OS is
>>>>> Ubuntu Trusty, 14.04.
>>>>>
>>>>>
>>>>> Is there anything I can provide to you guys? Has anybody had the
>>>>> same issue? Or is 59 websites the complete crawl?
>>>>>
>>>>>
>>>>> Any suggestion would be appreciated.
>>>>>
>>>>>
>>>>> Regards,
>>>>> Shuo Li