Re: Nutch/Hadoop Cluster

2023-01-17 Thread Sebastian Nagel

Hi Mike,

> It can be tedious to set up for the first time, and there are many components.

In case you prefer Linux packages, I can recommend Apache Bigtop, see
   https://bigtop.apache.org/
and for the list of package repositories
   https://downloads.apache.org/bigtop/stable/repos/

~Sebastian

On 1/15/23 01:06, Markus Jelsma wrote:
> Hello Mike,
>
>> Would it pay off for me to put a Hadoop cluster on top of the 3 servers?
>
> Yes, for all the reasons Hadoop exists. It can be tedious to set up
> for the first time, and there are many components. But at least you have
> three servers, which is more or less what ZooKeeper requires, and you will
> need ZooKeeper as well.
>
> Ideally you would have some additional VMs to run the controlling Hadoop
> programs and perhaps the Hadoop client nodes on. The workers can run on
> bare metal.
>
>> 1.) a server would not be integrated directly into the crawl process as a
>> master.
>
> What do you mean? Can you elaborate?
>
>> 2.) Can I run multiple crawl jobs on one server?
>
> Yes! Just have separate instances of Nutch home dirs on your Hadoop client
> nodes, each having their own configuration.
>
> Regards,
> Markus
>
> On Sat, 14 Jan 2023 at 18:42, Mike wrote:
>
>> Hi!
>>
>> I am now crawling the internet in local mode in parallel with up to 10
>> instances on 3 computers. Would it pay off for me to put a Hadoop cluster
>> on top of the 3 servers?
>>
>> 1.) a server would not be integrated directly into the crawl process as a
>> master.
>> 2.) Can I run multiple crawl jobs on one server?
>>
>> Thanks




Re: Nutch/Hadoop Cluster

2023-01-14 Thread Markus Jelsma
Hello Mike,

> Would it pay off for me to put a Hadoop cluster on top of the 3 servers?

Yes, for all the reasons Hadoop exists. It can be tedious to set up
for the first time, and there are many components. But at least you have
three servers, which is more or less what ZooKeeper requires, and you will
need ZooKeeper as well.
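
A minimal sketch of why three servers matter, assuming the usual
majority-quorum rule for a ZooKeeper ensemble. The helper names
quorum_size and tolerated_failures are made up for illustration.

```shell
#!/bin/sh
# Sketch only: majority-quorum arithmetic for a ZooKeeper ensemble.
# quorum_size and tolerated_failures are illustrative helper names.

# Smallest number of servers that forms a majority of an ensemble of size $1
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Number of servers that may fail while a majority is still available
tolerated_failures() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 2 3 5; do
  echo "$n server(s): quorum $(quorum_size "$n"), tolerates $(tolerated_failures "$n") failure(s)"
done
```

With two servers no failure is tolerated; with three, one server can fail
and the ensemble still has a majority.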

Ideally you would have some additional VMs to run the controlling Hadoop
programs and perhaps the Hadoop client nodes on. The workers can run on
bare metal.

> 1.) a server would not be integrated directly into the crawl process as a
master.

What do you mean? Can you elaborate?

> 2.) can I run multiple crawl jobs on one server?

Yes! Just have separate instances of Nutch home dirs on your Hadoop client
nodes, each having their own configuration.
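
One way to picture the separate Nutch home dirs is the sketch below. It
only prints the commands it would run. CRAWL_BASE, the job names and the
HDFS paths are hypothetical placeholders, not anything from this thread.

```shell
#!/bin/sh
# Sketch only: one Nutch home directory per crawl job on a Hadoop client node.
# CRAWL_BASE, the job names and the HDFS paths are hypothetical placeholders.
CRAWL_BASE="${CRAWL_BASE:-/opt/crawls}"

# Print the deploy-mode launcher that belongs to the Nutch home of job $1
job_launcher() {
  printf '%s/nutch-%s/runtime/deploy/bin/nutch\n' "$CRAWL_BASE" "$1"
}

# Print (rather than run) the inject command for every job
for job in news blogs; do
  echo "$(job_launcher "$job") inject /crawl/$job/crawldb /crawl/$job/seeds"
done
```

Each job then uses its own launcher and its own crawl directories, so the
jobs do not share a crawldb. As far as I know the deploy runtime packs the
configuration into the .job file at build time, so a home probably needs to
be rebuilt after its conf/ changes; treat that as an assumption to verify.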

Regards,
Markus

On Sat, 14 Jan 2023 at 18:42, Mike wrote:

> Hi!
>
> I am now crawling the internet in local mode in parallel with up to 10
> instances on 3 computers. Would it pay off for me to put a Hadoop cluster
> on top of the 3 servers?
>
> 1.) a server would not be integrated directly into the crawl process as a
> master.
> 2.) Can I run multiple crawl jobs on one server?
>
> Thanks
>


Nutch/Hadoop Cluster

2023-01-14 Thread Mike
Hi!

I am now crawling the internet in local mode in parallel with up to 10
instances on 3 computers. Would it pay off for me to put a Hadoop cluster
on top of the 3 servers?

1.) a server would not be integrated directly into the crawl process as a
master.
2.) Can I run multiple crawl jobs on one server?

Thanks


Re: Regarding Nutch Hadoop Cluster Setup in Deploy Mode

2020-09-08 Thread Sebastian Nagel
Hi Dimanshu,

Nutch is a community project. If you can, please take the time to be part
of the community and improve the documentation. Unlike for the source code,
the barrier for the wiki is low: anybody can and *is welcome* to register
and update the Nutch Wiki. As a 100% volunteer project we rely on
contributions from the community, including our users.

Thanks,
Sebastian

On 9/4/20 9:17 PM, Dimanshu Parihar wrote:
> Thanks Sebastian,
> This helps a lot. I got the point. They should change the documentation. A
> lot of people get confused because of that.
> 
> From: Sebastian Nagel<mailto:wastl.na...@googlemail.com.INVALID>
> Sent: Tuesday, August 11, 2020 4:56 PM
> To: user@nutch.apache.org<mailto:user@nutch.apache.org>
> Subject: Re: Regarding Nutch Hadoop Cluster Setup in Deploy Mode
> 
> Hi,
> 
> Nutch does not include a search component anymore. These steps are obsolete.
> 
> All you need is to set up your Hadoop cluster, then run
>    $NUTCH_HOME/runtime/deploy/bin/nutch ...
> (instead of .../runtime/local/bin/nutch ...)
>
> Alternatively, you could launch a Nutch tool, e.g. Injector,
> the following way:
>
> hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.15-SNAPSHOT.job \
>    org.apache.nutch.crawl.Injector ...
> 
> Best,
> Sebastian
> 
> 
> On 8/10/20 11:31 AM, Dimanshu Parihar wrote:
>> Hello Sir,
>> I have been using Nutch 1.17 in local mode and now want to switch to
>> deploy mode. For this, I followed the Apache Nutch Hadoop cluster setup
>> link, but I am stuck at the point given below:
>>
>> Problem:
>>
>> First copy the files from the nutch build to the deploy directory using
>> something like the following command:
>>
>> cp -R /path/to/build/* /nutch/search
>>
>> Then make sure that all of the shell scripts are in unix format and are
>> executable.
>>
>> dos2unix /nutch/search/bin/*.sh /nutch/search/bin/hadoop /nutch/search/bin/nutch
>>
>> chmod 700 /nutch/search/bin/*.sh /nutch/search/bin/hadoop /nutch/search/bin/nutch
>>
>> dos2unix /nutch/search/config/*.sh
>>
>> chmod 700 /nutch/search/config/*.sh
>>
>> Issue:
>>
>> I ran the ant command in the nutch folder, which created a runtime folder
>> and a build folder. I copied the build/* files to a search folder that I
>> created inside the nutch folder. But when I run these dos2unix commands,
>> they report that no bin/hadoop and bin/nutch files are found, which is
>> expected because my build folder did not contain these files.
>> So can you please clarify how I can follow these steps?
>> I have only one user, which is not the root user, under which I am setting
>> up all three: Hadoop, Solr and Nutch.
>>
> 
> 



RE: Regarding Nutch Hadoop Cluster Setup in Deploy Mode

2020-09-04 Thread Dimanshu Parihar
Thanks Sebastian,
This helps a lot. I got the point. They should change the documentation. A lot
of people get confused because of that.


From: Sebastian Nagel<mailto:wastl.na...@googlemail.com.INVALID>
Sent: Tuesday, August 11, 2020 4:56 PM
To: user@nutch.apache.org<mailto:user@nutch.apache.org>
Subject: Re: Regarding Nutch Hadoop Cluster Setup in Deploy Mode

Hi,

Nutch does not include a search component anymore. These steps are obsolete.

All you need is to set up your Hadoop cluster, then run
   $NUTCH_HOME/runtime/deploy/bin/nutch ...
(instead of .../runtime/local/bin/nutch ...)

Alternatively, you could launch a Nutch tool, e.g. Injector,
the following way:

hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.15-SNAPSHOT.job \
   org.apache.nutch.crawl.Injector ...

Best,
Sebastian


On 8/10/20 11:31 AM, Dimanshu Parihar wrote:
> Hello Sir,
> I have been using Nutch 1.17 in local mode and now want to switch to
> deploy mode. For this, I followed the Apache Nutch Hadoop cluster setup
> link, but I am stuck at the point given below:
>
> Problem:
>
> First copy the files from the nutch build to the deploy directory using
> something like the following command:
>
> cp -R /path/to/build/* /nutch/search
>
> Then make sure that all of the shell scripts are in unix format and are
> executable.
>
> dos2unix /nutch/search/bin/*.sh /nutch/search/bin/hadoop /nutch/search/bin/nutch
>
> chmod 700 /nutch/search/bin/*.sh /nutch/search/bin/hadoop /nutch/search/bin/nutch
>
> dos2unix /nutch/search/config/*.sh
>
> chmod 700 /nutch/search/config/*.sh
>
> Issue:
>
> I ran the ant command in the nutch folder, which created a runtime folder
> and a build folder. I copied the build/* files to a search folder that I
> created inside the nutch folder. But when I run these dos2unix commands,
> they report that no bin/hadoop and bin/nutch files are found, which is
> expected because my build folder did not contain these files.
> So can you please clarify how I can follow these steps?
> I have only one user, which is not the root user, under which I am setting
> up all three: Hadoop, Solr and Nutch.
>



Re: Regarding Nutch Hadoop Cluster Setup in Deploy Mode

2020-08-11 Thread Sebastian Nagel
Hi,

Nutch does not include a search component anymore. These steps are obsolete.

All you need is to set up your Hadoop cluster, then run
   $NUTCH_HOME/runtime/deploy/bin/nutch ...
(instead of .../runtime/local/bin/nutch ...)

Alternatively, you could launch a Nutch tool, e.g. Injector,
the following way:

hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.15-SNAPSHOT.job \
   org.apache.nutch.crawl.Injector ...
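
Since the name of the .job file depends on the Nutch version that was
built, a small wrapper can look it up instead of hard-coding it. The
following is only a sketch: find_job_file and hadoop_jar_cmd are
hypothetical helper names, the demo directory is a throwaway stand-in for
$NUTCH_HOME, and the command is printed rather than executed. The Injector
arguments (crawldb path, then seed URL directory) are example values.

```shell
#!/bin/sh
# Sketch only: build the "hadoop jar" command line for a Nutch tool class.
# find_job_file and hadoop_jar_cmd are illustrative helper names.

# Print the first apache-nutch-*.job file found under $1/runtime/deploy
find_job_file() {
  for fj_candidate in "$1"/runtime/deploy/apache-nutch-*.job; do
    if [ -f "$fj_candidate" ]; then
      echo "$fj_candidate"
      return 0
    fi
  done
  echo "no .job file under $1/runtime/deploy (has Nutch been built?)" >&2
  return 1
}

# Print (rather than run) the command that launches tool class $2
# from the Nutch home $1, passing on any remaining arguments
hadoop_jar_cmd() {
  hj_home=$1
  hj_class=$2
  shift 2
  hj_job=$(find_job_file "$hj_home") || return 1
  echo "hadoop jar $hj_job $hj_class $*"
}

# Demo against a throwaway directory that mimics the deploy layout
demo_home=$(mktemp -d)
mkdir -p "$demo_home/runtime/deploy"
touch "$demo_home/runtime/deploy/apache-nutch-1.17.job"
hadoop_jar_cmd "$demo_home" org.apache.nutch.crawl.Injector crawl/crawldb urls
```

On a real installation the first argument would be $NUTCH_HOME, and the
printed line is what would be run on a Hadoop client node.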

Best,
Sebastian


On 8/10/20 11:31 AM, Dimanshu Parihar wrote:
> Hello Sir,
> I have been using Nutch 1.17 in local mode and now want to switch to
> deploy mode. For this, I followed the Apache Nutch Hadoop cluster setup
> link, but I am stuck at the point given below:
>
> Problem:
>
> First copy the files from the nutch build to the deploy directory using
> something like the following command:
>
> cp -R /path/to/build/* /nutch/search
>
> Then make sure that all of the shell scripts are in unix format and are
> executable.
>
> dos2unix /nutch/search/bin/*.sh /nutch/search/bin/hadoop /nutch/search/bin/nutch
>
> chmod 700 /nutch/search/bin/*.sh /nutch/search/bin/hadoop /nutch/search/bin/nutch
>
> dos2unix /nutch/search/config/*.sh
>
> chmod 700 /nutch/search/config/*.sh
>
> Issue:
>
> I ran the ant command in the nutch folder, which created a runtime folder
> and a build folder. I copied the build/* files to a search folder that I
> created inside the nutch folder. But when I run these dos2unix commands,
> they report that no bin/hadoop and bin/nutch files are found, which is
> expected because my build folder did not contain these files.
> So can you please clarify how I can follow these steps?
> I have only one user, which is not the root user, under which I am setting
> up all three: Hadoop, Solr and Nutch.
>