Re: How to run nutch server on distributed environment

2016-10-18 Thread lewis john mcgibbney
Hi Sachin,
Very late response, I know, but hopefully better late than never. Response
below.

On Fri, Sep 30, 2016 at 5:04 AM, <user-digest-h...@nutch.apache.org> wrote:

>
> From: Sachin Shaju <sachi...@mstack.com>
> To: user@nutch.apache.org
> Cc:
> Date: Thu, 29 Sep 2016 14:01:13 +0530
> Subject: How to run nutch server on distributed environment
> Hi,
>
> I have tested running nutch in server mode by starting it with the
> bin/nutch startserver command *locally*. Now I wonder whether I can start
> nutch in *server mode* on top of a hadoop cluster (in a distributed
> environment) and submit crawl requests to the server using the nutch REST
> API? Please help.
>
>
I am assuming you are running the Nutch master branch (as the command is
'startserver').
The answer is yes: as long as your YARN cluster is running well and your
memory settings are suited to your crawl datasets, you will be good. If I
were you I would spend a bit of time running test crawls with various fetch
lists and batch sizes, making sure that you have no memory issues and that
your containers are not killed by the ApplicationMaster.
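
As a rough starting point for those test crawls, here is a minimal Python
sketch that derives a JVM heap size from a container size. The 0.8 heap ratio
and the 2048/4096 MB figures are assumptions for illustration, not values
tuned for Nutch; the property names are the standard Hadoop MapReduce ones.
The helper names are hypothetical.

```python
def heap_mb_for_container(container_mb, heap_ratio=0.8):
    """Return a JVM heap size (MB) that leaves headroom inside a YARN container.

    heap_ratio is an assumed rule of thumb: the rest of the container is left
    for non-heap JVM memory (metaspace, thread stacks, native buffers).
    """
    if not 0 < heap_ratio < 1:
        raise ValueError("heap_ratio must be between 0 and 1")
    return int(container_mb * heap_ratio)


def task_memory_settings(map_mb, reduce_mb, heap_ratio=0.8):
    """Build a dict of MapReduce memory properties for one test run."""
    return {
        "mapreduce.map.memory.mb": str(map_mb),
        "mapreduce.map.java.opts": f"-Xmx{heap_mb_for_container(map_mb, heap_ratio)}m",
        "mapreduce.reduce.memory.mb": str(reduce_mb),
        "mapreduce.reduce.java.opts": f"-Xmx{heap_mb_for_container(reduce_mb, heap_ratio)}m",
    }


if __name__ == "__main__":
    # Example sizes only; vary these between test crawls.
    for name, value in task_memory_settings(2048, 4096).items():
        print(f"{name}={value}")
```

If containers still get killed with a given setting, raise the container size
or lower the fetch list and batch sizes and try again.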

On the Nutch side, please note that right now, when you POST seed list(s),
they are cached in /var/something/something on the server running
Nutchserver, NOT on HDFS. This means that you somehow need to get them onto
HDFS before you can use your seed list in the INJECT url_dir parameter.
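
One way to bridge that gap, sketched in Python under stated assumptions: copy
the locally cached seed directory to HDFS with the standard `hdfs dfs`
commands, then point the INJECT job's url_dir at the HDFS path. Both paths
below are placeholders (the real cache path is not spelled out here), the
helper names are hypothetical, and the JSON field names other than url_dir are
as I recall them from the REST API wiki page, so please check them there
before relying on this.

```python
import json
import posixpath

# Placeholders only: substitute the real cache directory and your own HDFS target.
LOCAL_SEED_DIR = "/path/to/local/seed-cache"
HDFS_SEED_DIR = "/user/nutch/seeds/crawl01"


def hdfs_upload_commands(local_dir, hdfs_dir):
    """Return the `hdfs dfs` command lines that copy a local seed dir to HDFS.

    Note: if hdfs_dir already exists, `-put` places local_dir inside it
    instead of creating it under that name.
    """
    parent = posixpath.dirname(hdfs_dir.rstrip("/"))
    return [
        ["hdfs", "dfs", "-mkdir", "-p", parent],
        ["hdfs", "dfs", "-put", "-f", local_dir, hdfs_dir],
    ]


def inject_job_payload(hdfs_dir, crawl_id="crawl01", conf_id="default"):
    """Build a JSON body for an INJECT job that reads seeds from HDFS.

    Field names other than url_dir are assumptions to verify against the wiki.
    """
    return {
        "type": "INJECT",
        "confId": conf_id,
        "crawlId": crawl_id,
        "args": {"url_dir": hdfs_dir},
    }


if __name__ == "__main__":
    for cmd in hdfs_upload_commands(LOCAL_SEED_DIR, HDFS_SEED_DIR):
        # On a cluster node, run each one with subprocess.run(cmd, check=True).
        print(" ".join(cmd))
    print(json.dumps(inject_job_payload(HDFS_SEED_DIR), indent=2))
```

The sketch only prints the commands and the payload, so it can be checked
without a cluster before anything is actually copied or submitted.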

If you need any help with this then simply consult the very helpful
documentation put together by Sujen at
https://wiki.apache.org/nutch/Nutch_1.X_RESTAPI
Let us know how you get on, as the REST API is very handy indeed. It would
be nice to build it into deployment managers such as Ambari in the future.

Lewis


How to run nutch server on distributed environment

2016-09-29 Thread Sachin Shaju
Hi,

I have tested running nutch in server mode by starting it with the
bin/nutch startserver command *locally*. Now I wonder whether I can start
nutch in *server mode* on top of a hadoop cluster (in a distributed
environment) and submit crawl requests to the server using the nutch REST
API? Please help.

Regards,
Sachin Shaju

sachi...@mstack.com
+919539887554
