Custom options in nutch crawl script

2016-09-29 Thread Sachin Shaju
I was trying to pass custom options to the *bin/crawl* script and encountered an issue. I passed a custom config to Nutch to ignore external outlinks in my crawl command, like this: *bin/crawl -i -D elastic.index=test -D db.ignore.external.links=true urls/ CrawlTest/ 3* But this is not working. Then I set
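
For reference, the *db.ignore.external.links* property used in the command above can also be set in the site configuration file rather than on the command line. The fragment below is a minimal sketch, assuming a standard Nutch 1.x layout where site-level overrides live in conf/nutch-site.xml. It is not taken from the original message, and it is not known whether this is what the poster went on to do.

```xml
<!-- conf/nutch-site.xml (assumed location of site-level overrides) -->
<configuration>
  <property>
    <name>db.ignore.external.links</name>
    <!-- Intended to mirror the -D db.ignore.external.links=true option above -->
    <value>true</value>
  </property>
</configuration>
```

A file-based setting applies to every job started from that installation, so it is worth checking that no other crawl relies on following external links.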

Re: Arch 1.9.2 is available

2016-09-29 Thread lewis john mcgibbney
Cool... thanks for posting. On Wed, Sep 28, 2016 at 1:36 AM, wrote: > > user Digest 28 Sep 2016 08:36:56 - Issue 2648 > > Topics (messages 32792 through 32792) > > Arch 1.9.2 is available > 32792 by: Arkadi.Kosmynin.csiro.au > > Administrivia: > >

Re: Open Graph metadata?

2016-09-29 Thread lewis john mcgibbney
Hi Ralf, Do you mean the Open Graph Protocol [0] markup here? If so, then if it is present within then it is already parsed out and stored within Parse [1] and can be accessed via Parse.getData(). Please use the ParserChecker to double-check this and, if necessary, post an example here so that I can be

Re: Nutch in production

2016-09-29 Thread Sachin Shaju
Can I have a link to this? Regards, Sachin Shaju sachi...@mstack.com +919539887554 On Thu, Sep 29, 2016 at 11:13 PM, Mattmann, Chris A (3980) < chris.a.mattm...@jpl.nasa.gov> wrote: > Yep also check out the work that Sujen Shah just merged (also on my team > at JPL and > USC) where you can

Re: Nutch in production

2016-09-29 Thread Mattmann, Chris A (3980)
Yep also check out the work that Sujen Shah just merged (also on my team at JPL and USC) where you can publish events to an ActiveMQ queue from Nutch crawling. That should allow all sorts of production dashboards and analytics. ++

Re: Nutch in production

2016-09-29 Thread Karanjeet Singh
Hi Sachin, Just a suggestion here - you can use Apache Kafka to generate and catch events which are mapped to incoming crawl requests, crawl status and much more. I have created a prototype for a production queue [0] which runs on top of a supercomputer (TACC Wrangler) and integrated it with

How to run nutch server on distributed environment

2016-09-29 Thread Sachin Shaju
Hi, I have tested running Nutch in server mode by starting it *locally* with the bin/nutch startserver command. Now I wonder whether I can start Nutch in *server mode* on top of a Hadoop cluster (in a distributed environment) and submit crawl requests to the server using the Nutch REST API? Please help.

Nutch in production

2016-09-29 Thread Sachin Shaju
Hi, I was experimenting with some crawl cycles in Nutch and would like to set up a distributed crawl environment. But I wonder how I can trigger Nutch for incoming crawl requests in a production system. I read about the Nutch REST API. Is that the real option that I have? Or can I run nutch as a

RE: Arch 1.9.2 is available

2016-09-29 Thread Arkadi.Kosmynin
You are welcome. > -Original Message- > From: lewis john mcgibbney [mailto:lewi...@apache.org] > Sent: Friday, 30 September 2016 2:22 AM > To: user@nutch.apache.org > Subject: Re: Arch 1.9.2 is available > > Cool... thanks for posting. > > On Wed, Sep 28, 2016 at 1:36 AM,