Going to implement more configuration in the plugin, but
based on the student emails I think your advice helped :)

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






-----Original Message-----
From: Mo Omer <[email protected]>
Date: Sunday, February 22, 2015 at 5:45 PM
To: Chris Mattmann <[email protected]>
Cc: "[email protected]" <[email protected]>
Subject: Re: Vagrant Crushed When using Nutch-Selenium

>No problem! How'd it work out?
>
>Mo
>
>This message was drafted on a tiny touch screen; please forgive brevity &
>tpyos
>
>> On Feb 22, 2015, at 6:19 PM, "Mattmann, Chris A (3980)"
>><[email protected]> wrote:
>> 
>> Thanks Mo, great advice.
>> 
>> -----Original Message-----
>> From: Jiaxin Ye <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Tuesday, February 17, 2015 at 2:49 PM
>> To: Mohammed Omer <[email protected]>
>> Cc: "[email protected]" <[email protected]>
>> Subject: Re: Vagrant Crushed When using Nutch-Selenium
>> 
>>> 
>>> 
>>> 
>>> Thank you so much!! I am going to try it out tonight.
>>> 
>>> On Tuesday, February 17, 2015, Mohammed Omer <[email protected]>
>>> wrote:
>>> 
>>> Jiaxin, 
>>> 
>>> 
>>> Each page takes about 3 seconds to crawl due to this piece of code - we
>>> allow selenium 3 seconds to grab the page [0]. Due to what I was
>>> crawling, I didn't want to wait for a specific element/class/id to show
>>> up. However, you can change it up if you want.
>>> Selenium documentation [1] has more info on Ex/Implicit waiting.
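
To illustrate the fixed-wait versus explicit-wait trade-off described above, here is a minimal Python sketch. It does not use Selenium's API: `wait_until` and `FakePage` are hypothetical names made up for illustration, and the 3-second timeout only mirrors the fixed wait mentioned above. The idea is that polling for a condition can return as soon as the content shows up, rather than always paying the full wait.

```python
import time


def wait_until(condition, timeout=3.0, poll_interval=0.1):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass.

    Returns the truthy value, or None if the timeout expires first.
    (Hypothetical helper; not part of Selenium or Nutch.)
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll_interval)


class FakePage:
    """Hypothetical stand-in for a page whose content appears after a delay."""

    def __init__(self, ready_after):
        self._ready_at = time.monotonic() + ready_after

    def find(self):
        # Pretend the element only exists once the page's JS has "rendered".
        if time.monotonic() >= self._ready_at:
            return "<div id='content'>"
        return None


if __name__ == "__main__":
    page = FakePage(ready_after=0.3)
    start = time.monotonic()
    element = wait_until(page.find, timeout=3.0)
    print(element, "found after", round(time.monotonic() - start, 2), "seconds")
```

A fixed wait always costs the full timeout per page; a polling wait costs only as long as the page needs, but requires knowing which element/class/id to look for, which is the trade-off noted above.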
>>> 
>>> 
>>> Again, it's not the most efficient way to crawl; but, if you need JS to
>>> render, it's a backwards way that ensures it happens. Selenium Grid has
>>> the benefit of being able to handle more throughput, but at the end of
>>> the day we're waiting for a browser to
>>> go out and fetch the url.
>>> 
>>> 
>>> I've suggested that most items be configurable when merged into trunk
>>> [2], but I'll make a specific call-out to the wait time.
>>> 
>>> 
>>> Due to the way Selenium standalone works, it's wayyyyyy less efficient
>>> than a 'Grid' set-up (hub + nodes) [3], which is why I switched to that
>>> set-up. 
>>> 
>>> 
>>> Wish I could help out more, but 30 threads might be too much. 5 threads,
>>> at a total fetch/parse time of 4 seconds per url, would still
>>> theoretically churn out > 100k urls per day. There are multiple tweaks
>>> that could be made to optimize for your system,
>>> I'd start with reducing thread count, as you might be saturating your
>>> system [4].
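
The "> 100k urls per day" figure above can be checked with a quick back-of-the-envelope calculation. This is only a sketch under the assumption that every thread stays busy all day and handles one url at a time; it ignores politeness delays, retries, and queue limits. The function name `urls_per_day` is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def urls_per_day(threads, seconds_per_url):
    """Theoretical ceiling on urls fetched per day.

    Assumes each thread is always busy and processes one url at a time.
    """
    if threads <= 0 or seconds_per_url <= 0:
        raise ValueError("threads and seconds_per_url must be positive")
    return int(threads * SECONDS_PER_DAY / seconds_per_url)


if __name__ == "__main__":
    # 5 threads at 4 seconds total fetch/parse time per url
    print(urls_per_day(5, 4))  # 108000
```

So 5 threads at 4 seconds per url gives a ceiling of 108,000 urls per day, consistent with the estimate above.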
>>> 
>>> 
>>> Sorry I can't be of more help!
>>> 
>>> 
>>> Thank you,
>>> 
>>> 
>>> Mo
>>> 
>>> 
>>> [0]: https://github.com/momer/nutch-selenium/blob/master/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java#L48-L49
>>> [1]: http://docs.seleniumhq.org/docs/04_webdriver_advanced.jsp
>>> [2]: https://issues.apache.org/jira/browse/NUTCH-1933
>>> [3]: https://code.google.com/p/selenium/wiki/Grid2
>>> [4]: http://stackoverflow.com/a/4895271
>>> 
>>> 
>>> On Mon, Feb 16, 2015 at 2:13 AM, Jiaxin Ye
>>> <[email protected]>
>>> wrote:
>>> 
>>> I am using fetcher.threads.per.queue = 30 by the way.
>>> 
>>> On Mon, Feb 16, 2015 at 12:08 AM, Jiaxin Ye
>>> <[email protected]>
>>> wrote:
>>> 
>>> Hi Mo,
>>> 
>>> 
>>> I have a question about the selenium plugin on Mac. I think I successfully
>>> set it up, but I am unsure about the performance.
>>> I am using a Mac with an Intel Core i5 processor and 8GB RAM, but I found
>>> that each url fetched takes about 1 second to open and close
>>> the Firefox window. Is that a normal speed, or is something wrong? And is it
>>> possible to install the selenium grid plugin on Mac? I will cry if you
>>> ask me to change machines now......
>>> 
>>> 
>>> Best,
>>> Jiaxin
>>> 
>>> 
>>> On Fri, Feb 13, 2015 at 2:09 PM, Mohammed Omer
>>> <[email protected]> wrote:
>>> 
>>> No worries man, glad everything works! Glad, since I was having hostname
>>> issues with nutch/hbase just now as I quickly tried to get it
>>> working/fixed for ya, ha.
>>> 
>>> Mo
>>> 
>>> 
>>> On Fri, Feb 13, 2015 at 2:57 PM, Shuo Li
>>> <[email protected]> wrote:
>>> 
>>> Hey guys,
>>> 
>>> 
>>> After changing my RAM to 2GB, everything works fine. My bad. Thanks for
>>> your help.
>>> 
>>> 
>>> Regards,
>>> Shuo Li
>>> 
>>> 
>>> On Fri, Feb 13, 2015 at 11:34 AM, Mattmann, Chris A (3980)
>>> <[email protected]> wrote:
>>> 
>>> Thank you Mo. I sincerely appreciate your guidance and contribution.
>>> 
>>> I will work to get your nutch selenium grid plugin contributed
>>> to work with Nutch 1.x.
>>> 
>>> Cheers,
>>> Chris
>>> 
>>> 
>>> -----Original Message-----
>>> From: Mo Omer <[email protected]>
>>> Date: Friday, February 13, 2015 at 11:10 AM
>>> To: Chris Mattmann <[email protected]>
>>> Cc: "[email protected]" <[email protected]>
>>> Subject: Re: Vagrant Crushed When using Nutch-Selenium
>>> 
>>>> Hey all,
>>>> 
>>>> When I had run nutch-selenium, it was in a config such that zombies were
>>>> created from closing Firefox windows and they couldn't be reaped (again,
>>>> due to the docker configuration I had).
>>>> 
>>>> In a normal setup, it should not be an issue - if you're running 20
>>>> threads in nutch that's potentially 20 open FF windows which isn't good
>>>> for 512mb.
>>>> 
>>>> Selenium grid is much more efficient, in that browsers are opened, but
>>>> tabs are used to fetch sites - and only those are closed.
>>>> 
>>>> Additionally, ensure you're using Nutch 2.2.1.
>>>> 
>>>> Feel free to fork, patch, tinker, and PR as needed.
>>>> 
>>>> Chris, if you want to be added to contribs on the GitHub project, that's
>>>> cool with me! Wish I could dedicate more time to this, but I don't
>>>> foresee using Nutch again in the near future, and am now working on
>>>> projects that require lots of reading and possibly patches to Caffe and
>>>> opencl r-CNN projects.
>>>> 
>>>> Tl;dr:
>>>> - no, this shouldn't be typical unless you're creating zombies like crazy
>>>> and they're not being reaped (too many open file descriptors), running
>>>> out of memory, or similar resource constraint.
>>>> - selenium grid is TONs more efficient, but a bit more difficult to set
>>>> up. I used it to crawl 100ks of sites.
>>>> - unfortunately I can't commit more time to this, but if I can assist in
>>>> any admin way, let me know.
>>>> 
>>>> Thank you,
>>>> 
>>>> Mo
>>>> 
>>>> This message was drafted on a tiny touch screen; please forgive brevity &
>>>> tpyos
>>>> 
>>>>> On Feb 13, 2015, at 12:41 PM, "Mattmann, Chris A (3980)"
>>>>> <[email protected]> wrote:
>>>>> 
>>>>> Oh yes, please up your memory to like at least 2Gb..
>>>>> 
>>>>> -----Original Message-----
>>>>> From: Shuo Li <[email protected]>
>>>>> Reply-To: "[email protected]" <[email protected]>
>>>>> Date: Friday, February 13, 2015 at 10:38 AM
>>>>> To: "[email protected]" <[email protected]>
>>>>> Cc: Mo Omer <[email protected]>
>>>>> Subject: Re: Vagrant Crushed When using Nutch-Selenium
>>>>> 
>>>>>> Hey Mo and Prof Mattmann,
>>>>>> 
>>>>>> 
>>>>>> I will try to crawl the 3 websites in the homework tonight (NASA AMD,
>>>>>> NSF ACADIS and NSIDC Arctic Data Explorer). I will let you know what's
>>>>>> going on.
>>>>>> 
>>>>>> 
>>>>>> Is memory an issue? My vagrant only has 512MB of memory.
>>>>>> 
>>>>>> 
>>>>>> Regards,
>>>>>> Shuo Li
>>>>>> 
>>>>>> 
>>>>>> On Fri, Feb 13, 2015 at 10:25 AM, Mattmann, Chris A (3980)
>>>>>> <[email protected]> wrote:
>>>>>> 
>>>>>> Hi Shuo,
>>>>>> 
>>>>>> Thanks for your email. I wonder if using selenium grid would
>>>>>> help?
>>>>>> 
>>>>>> Please see this plugin:
>>>>>> https://github.com/momer/nutch-selenium-grid-plugin
>>>>>> 
>>>>>> 
>>>>>> I’m CC’ing Mo, the author of the plugin, to see if he experienced
>>>>>> this while running the original selenium plugin - Mo, did using
>>>>>> selenium grid help the issue that Shuo is experiencing below?
>>>>>> 
>>>>>> Mo: are you cool with porting the grid plugin, or with Lewis or
>>>>>> me porting it to trunk (with full credit to you, of course)?
>>>>>> 
>>>>>> Cheers,
>>>>>> Chris
>>>>>> 
>>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: Shuo Li <[email protected]>
>>>>>> Reply-To: "[email protected]" <[email protected]>
>>>>>> Date: Friday, February 13, 2015 at 10:12 AM
>>>>>> To: "[email protected]" <[email protected]>
>>>>>> Subject: Vagrant Crushed When using Nutch-Selenium
>>>>>> 
>>>>>>> Hey guys,
>>>>>>> 
>>>>>>> 
>>>>>>> I'm trying to use Nutch-Selenium to crawl nutch.apache.org
>>>>>>> <http://nutch.apache.org>.
>>>>>>> However, my vagrant seems to have
>>>>>>> crashed after a few minutes. I forced it to shut down and it turns
>>>>>>> out it only crawled 59 websites. My nutch version is 1.10 and my OS is
>>>>>>> Ubuntu Trusty, 14.04.
>>>>>>> 
>>>>>>> 
>>>>>>> Is there anything I can provide to you guys? Does anybody have the
>>>>>>> same issue? Or is 59 websites the complete crawl?
>>>>>>> 
>>>>>>> 
>>>>>>> Any suggestion would be appreciated.
>>>>>>> 
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Shuo Li
>> 
