No problem! How'd it work out?

Mo

This message was drafted on a tiny touch screen; please forgive brevity & tpyos

> On Feb 22, 2015, at 6:19 PM, "Mattmann, Chris A (3980)" 
> <[email protected]> wrote:
> 
> Thanks Mo, great advice.
> 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: [email protected]
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 
> 
> 
> 
> 
> 
> -----Original Message-----
> From: Jiaxin Ye <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Tuesday, February 17, 2015 at 2:49 PM
> To: Mohammed Omer <[email protected]>
> Cc: "[email protected]" <[email protected]>
> Subject: Re: Vagrant Crushed When using Nutch-Selenium
> 
>> 
>> 
>> 
>> Thank you so much!! I am going to try it out tonight.
>> 
>> On Tuesday, February 17, 2015, Mohammed Omer <[email protected]>
>> wrote:
>> 
>> Jiaxin, 
>> 
>> 
>> Each page takes about 3 seconds to crawl due to this piece of code - we
>> allow selenium 3 seconds to grab the page [0]. Due to what I was
>> crawling, I didn't want to wait for a specific element/class/id to show
>> up. However, you can change it up if you want.
>> Selenium documentation [1] has more info on Ex/Implicit waiting.
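
A minimal, library-free sketch of the trade-off described above: a fixed wait always costs the full delay, while a condition-based wait returns as soon as the page looks ready. The names `wait_fixed` and `wait_until` are hypothetical and are not part of Selenium or the plugin; see the Selenium documentation [1] for the real waiting mechanisms.

```python
import time
from typing import Callable


def wait_fixed(seconds: float) -> None:
    """Fixed wait: always pause for the full duration, ready or not."""
    time.sleep(seconds)


def wait_until(condition: Callable[[], bool],
               timeout: float,
               poll_interval: float = 0.1) -> bool:
    """Condition-based wait: poll until condition() is true or timeout expires.

    Returns True if the condition became true in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)


# Simulated page that becomes "ready" 0.2 seconds after the fetch starts.
start = time.monotonic()
ready = wait_until(lambda: time.monotonic() - start >= 0.2, timeout=3.0)
print(ready)
```

With a 3-second ceiling, the condition-based version costs only as long as the page actually needs, at the price of having to know what "ready" means for the site being crawled.
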
>> 
>> 
>> Again, it's not the most efficient way to crawl; but, if you need JS to
>> render, it's a backwards way that ensures it happens. Selenium Grid has
>> the benefit of being able to handle more throughput, but at the end of
>> the day we're waiting for a browser to go out and fetch the url.
>> 
>> 
>> I've suggested that most items be configurable when merged into trunk
>> [2], but I'll make a specific call-out to the wait time.
>> 
>> 
>> Due to the way Selenium standalone works, it's wayyyyyy less efficient
>> than a 'Grid' set-up (hub + nodes) [3], which is why I switched to that
>> set-up. 
>> 
>> 
>> Wish I could help out more, but 30 threads might be too much. 5 threads,
>> at a total fetch/parse time of 4 seconds per url, would still
>> theoretically churn out > 100k urls per day. There are multiple tweaks
>> that could be made to optimize for your system, but I'd start with
>> reducing thread count, as you might be saturating your system [4].
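
The throughput estimate above can be checked with a few lines of arithmetic. This is only a sketch of the theoretical upper bound, assuming every thread stays busy all day and each url takes the same time; `urls_per_day` is a hypothetical helper name, not a Nutch setting.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def urls_per_day(threads: int, seconds_per_url: float) -> int:
    """Upper bound on urls fetched per day if all threads are always busy."""
    return int(threads * SECONDS_PER_DAY / seconds_per_url)


# 5 threads at 4 seconds per url, the figures quoted in the message.
print(urls_per_day(5, 4))  # 108000
```

Each thread handles 86,400 / 4 = 21,600 urls per day, so 5 threads give 108,000, which matches the "> 100k" figure.
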
>> 
>> 
>> Sorry I can't be of more help!
>> 
>> 
>> Thank you,
>> 
>> 
>> Mo
>> 
>> 
>> [0]: https://github.com/momer/nutch-selenium/blob/master/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java#L48-L49
>> [1]: http://docs.seleniumhq.org/docs/04_webdriver_advanced.jsp
>> [2]: https://issues.apache.org/jira/browse/NUTCH-1933
>> [3]: https://code.google.com/p/selenium/wiki/Grid2
>> [4]: http://stackoverflow.com/a/4895271
>> 
>> 
>> On Mon, Feb 16, 2015 at 2:13 AM, Jiaxin Ye <[email protected]> wrote:
>> 
>> I am using fetcher.threads.per.queue = 30 by the way.
>> 
>> On Mon, Feb 16, 2015 at 12:08 AM, Jiaxin Ye <[email protected]> wrote:
>> 
>> Hi Mo,
>> 
>> 
>> I have a question about the selenium plugin on Mac. I think I successfully
>> set it up, but I am unsure about the performance.
>> I am using a Mac with an Intel Core i5 processor and 8GB RAM, but I found
>> that each url fetched takes about 1 second to open and close
>> the Firefox window. Is that a normal speed, or is something wrong? And is it
>> possible to install the selenium grid plugin on Mac? I will cry if you
>> ask me to change machines now......
>> 
>> 
>> Best,
>> Jiaxin
>> 
>> 
>> On Fri, Feb 13, 2015 at 2:09 PM, Mohammed Omer <[email protected]> wrote:
>> 
>> No worries man, glad everything works! Glad, since I was having hostname
>> issues with nutch/hbase just now as I quickly tried to get it
>> working/fixed for ya, ha.
>> 
>> Mo
>> 
>> 
>> On Fri, Feb 13, 2015 at 2:57 PM, Shuo Li <[email protected]> wrote:
>> 
>> Hey guys,
>> 
>> 
>> After changing my RAM to 2GB, everything works fine. My bad. Thanks for
>> your help.
>> 
>> 
>> Regards,
>> Shuo Li
>> 
>> 
>> On Fri, Feb 13, 2015 at 11:34 AM, Mattmann, Chris A (3980) <[email protected]> wrote:
>> 
>> Thank you Mo. I sincerely appreciate your guidance and contribution.
>> 
>> I will work to get your nutch selenium grid plugin contributed
>> to work with Nutch 1.x.
>> 
>> Cheers,
>> Chris
>> 
>> 
>> 
>> -----Original Message-----
>> From: Mo Omer <[email protected]>
>> Date: Friday, February 13, 2015 at 11:10 AM
>> To: Chris Mattmann <[email protected]>
>> Cc: "[email protected]" <[email protected]>
>> Subject: Re: Vagrant Crushed When using Nutch-Selenium
>> 
>>> Hey all,
>>> 
>>> When I had run nutch-selenium, it was in a config such that zombies were
>>> created from closing Firefox windows and they couldn't be reaped (again,
>>> due to the docker configuration I had).
>>> 
>>> In a normal setup, it should not be an issue - if you're running 20
>>> threads in nutch that's potentially 20 open FF windows which isn't good
>>> for 512mb.
>>> 
>>> Selenium grid is much more efficient, in that browsers are opened, but
>>> tabs are used to fetch sites - and only those are closed.
>>> 
>>> Additionally, ensure you're using Nutch 2.2.1.
>>> 
>>> Feel free to fork, patch, tinker, and PR as needed.
>>> 
>>> Chris, if you want to be added to contribs on the GitHub project, that's
>>> cool with me! Wish I could dedicate more time to this, but I don't
>>> foresee using Nutch again in the near future, and am now working on
>>> projects that require lots of reading and possibly patches to Caffe and
>>> opencl r-CNN projects.
>>> 
>>> Tl;dr:
>>> - no, this shouldn't be typical unless you're creating zombies like crazy
>>> and they're not being reaped (too many open file descriptors), running
>>> out of memory, or similar resource constraint.
>>> - selenium grid is TONs more efficient, but a bit more difficult to set
>>> up. I used it to crawl 100ks of sites.
>>> - unfortunately I can't commit more time to this, but if I can assist in
>>> any admin way, let me know.
>>> 
>>> Thank you,
>>> 
>>> Mo
>>> 
>>> This message was drafted on a tiny touch screen; please forgive brevity &
>>> tpyos
>>> 
>>>> On Feb 13, 2015, at 12:41 PM, "Mattmann, Chris A (3980)" <[email protected]> wrote:
>>>> 
>>>> Oh yes, please up your memory to like at least 2Gb..
>>>> 
>>>> 
>>>> -----Original Message-----
>>>> From: Shuo Li <[email protected]>
>>>> Reply-To: "[email protected]" <[email protected]>
>>>> Date: Friday, February 13, 2015 at 10:38 AM
>>>> To: "[email protected]" <[email protected]>
>>>> Cc: Mo Omer <[email protected]>
>>>> Subject: Re: Vagrant Crushed When using Nutch-Selenium
>>>> 
>>>>> Hey Mo and Prof Mattmann,
>>>>> 
>>>>> 
>>>>> I will try to crawl the 3 websites in the homework tonight (NASA AMD,
>>>>> NSF ACADIS and NSIDC Arctic Data Explorer). I will let you know what's
>>>>> going on.
>>>>> 
>>>>> 
>>>>> Is memory an issue? My vagrant only has 512MB of memory.
>>>>> 
>>>>> 
>>>>> Regards,
>>>>> Shuo Li
>>>>> 
>>>>> 
>>>>> On Fri, Feb 13, 2015 at 10:25 AM, Mattmann, Chris A (3980) <[email protected]> wrote:
>>>>> 
>>>>> Hi Shuo,
>>>>> 
>>>>> Thanks for your email. I wonder if using selenium grid would
>>>>> help?
>>>>> 
>>>>> Please see this plugin:
>>>>> https://github.com/momer/nutch-selenium-grid-plugin
>>>>> 
>>>>> 
>>>>> I’m CC’ing Mo the author of the plugin to see if he experienced
>>>>> this while running the original selenium plugin - Mo did using
>>>>> selenium grid help the issue that Shuo is experiencing below?
>>>>> 
>>>>> Mo: are you cool with porting the grid plugin, or with Lewis or
>>>>> me doing it, to trunk (with full credit to you, of course)?
>>>>> 
>>>>> Cheers,
>>>>> Chris
>>>>> 
>>>>> 
>>>>> 
>>>>> -----Original Message-----
>>>>> From: Shuo Li <[email protected]>
>>>>> Reply-To: "[email protected]" <[email protected]>
>>>>> Date: Friday, February 13, 2015 at 10:12 AM
>>>>> To: "[email protected]" <[email protected]>
>>>>> Subject: Vagrant Crushed When using Nutch-Selenium
>>>>> 
>>>>>> Hey guys,
>>>>>> 
>>>>>> 
>>>>>> I'm trying to use Nutch-Selenium to crawl nutch.apache.org
>>>>>> <http://nutch.apache.org>. However, my vagrant seems to have
>>>>>> crashed after a few minutes. I forced it to shut down and it turns
>>>>>> out it only crawled 59 websites. My nutch version is 1.10 and my OS
>>>>>> is Ubuntu Trusty, 14.04.
>>>>>> 
>>>>>> 
>>>>>> Is there anything I can provide to you guys? Does anybody else have
>>>>>> the same issue? Or are 59 websites the complete crawl?
>>>>>> 
>>>>>> 
>>>>>> Any suggestion would be appreciated.
>>>>>> 
>>>>>> 
>>>>>> Regards,
>>>>>> Shuo Li
> 
