Re: Nutch/Hadoop Cluster
Hi Mike,

On 1/15/23 01:06, Markus Jelsma wrote:
> It can be tedious to set up for the first time, and there are many components.

In case you prefer Linux packages, I can recommend Apache Bigtop, see
  https://bigtop.apache.org/
and for the list of package repositories
  https://downloads.apache.org/bigtop/stable/repos/

~Sebastian
Re: Nutch/Hadoop Cluster
Hello Mike,

On Sat, 14 Jan 2023 at 18:42, Mike wrote:
> would it pay off for me to put a hadoop cluster on top of the 3 servers.

Yes, for all the reasons Hadoop exists. It can be tedious to set up for the first time, and there are many components. But at least you have three servers, which ZooKeeper more or less requires, and you will need ZooKeeper too.

Ideally you would have some additional VMs to run the controlling Hadoop programs, and perhaps the Hadoop client nodes, on. The workers can run on bare metal.

> 1.) a server would not be integrated directly into the crawl process as a master.

What do you mean? Can you elaborate?

> 2.) can I run multiple crawl jobs on one server?

Yes! Just have separate instances of Nutch home dirs on your Hadoop client nodes, each with its own configuration.

Regards,
Markus
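As a rough illustration of the "separate Nutch home dirs" idea above, here is a minimal shell sketch. It only prints one launch command per Nutch home and does not start anything. The directory names (/opt/nutch-news, /opt/nutch-blogs) and the /crawls/... paths are made up for illustration, and it assumes each home was built separately so that each one carries its own configuration.

```shell
#!/bin/sh
# Sketch: one Nutch home per crawl, each launched through its own deploy script.
# All paths below are hypothetical; adjust them to your own layout.

# Print (not run) an "inject" command for every Nutch home given as argument.
print_crawl_cmds() {
    for home in "$@"; do
        # Use the directory name to keep each crawl's data apart.
        name=$(basename "$home")
        printf '%s/runtime/deploy/bin/nutch inject /crawls/%s/crawldb /crawls/%s/urls\n' \
            "$home" "$name" "$name"
    done
}
```

Calling `print_crawl_cmds /opt/nutch-news /opt/nutch-blogs` prints two commands, one per home, each pointing at its own crawldb and seed directory. Keeping the crawl data under separate paths is what prevents the jobs from stepping on each other.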
Nutch/Hadoop Cluster
Hi!

I am currently crawling the internet in local mode, running up to 10 instances in parallel on 3 computers. Would it pay off for me to put a Hadoop cluster on top of the 3 servers?

1.) A server would not be integrated directly into the crawl process as a master.
2.) Can I run multiple crawl jobs on one server?

Thanks
Re: Regarding Nutch Hadoop Cluster Setup in Deploy Mode
Hi Dimanshu,

Nutch is a community project. If you can, please take the time, be part of the community and improve the documentation. Unlike for the source code, the barrier for the wiki is low: anybody can and *is welcome* to register and update the Nutch Wiki. As a 100% volunteer project we rely on contributions from the community, including our users.

Thanks,
Sebastian

On 9/4/20 9:17 PM, Dimanshu Parihar wrote:
> Thanks Sebastian,
> This helps a lot. I got the point. They should change the documentation. A
> lot of people gets confused because of that.
RE: Regarding Nutch Hadoop Cluster Setup in Deploy Mode
Thanks Sebastian,

This helps a lot. I got the point. They should change the documentation. A lot of people get confused because of that.

From: Sebastian Nagel
Sent: Tuesday, August 11, 2020 4:56 PM
To: user@nutch.apache.org
Subject: Re: Regarding Nutch Hadoop Cluster Setup in Deploy Mode

> Hi,
>
> Nutch does not include a search component anymore. These steps are obsolete.
>
> All you need is to setup your Hadoop cluster, then run
>   $NUTCH_HOME/runtime/deploy/bin/nutch ...
> (instead of .../runtime/local/bin/nutch ...)
>
> Alternatively, you could launch a Nutch tool, eg. Injector
> the following way:
>
>   hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.15-SNAPSHOT.job \
>     org.apache.nutch.crawl.Injector ...
>
> Best,
> Sebastian
Re: Regarding Nutch Hadoop Cluster Setup in Deploy Mode
Hi,

Nutch does not include a search component anymore. These steps are obsolete.

All you need is to set up your Hadoop cluster, then run
  $NUTCH_HOME/runtime/deploy/bin/nutch ...
(instead of .../runtime/local/bin/nutch ...)

Alternatively, you could launch a Nutch tool, e.g. Injector, the following way:

  hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.15-SNAPSHOT.job \
    org.apache.nutch.crawl.Injector ...

Best,
Sebastian

On 8/10/20 11:31 AM, Dimanshu Parihar wrote:
> Hello Sir,
> I have been using Nutch 1.17 in local mode and now I wanted to shift from
> local mode to deploy mode. For this, I tried the Apache Nutch Hadoop cluster
> setup link but I am stuck at the point given below:
>
> Problem:
>
> First copy the files from the nutch build to the deploy directory using
> something like the following command:
>
>   cp -R /path/to/build/* /nutch/search
>
> Then make sure that all of the shell scripts are in unix format and are
> executable.
>
>   dos2unix /nutch/search/bin/*.sh /nutch/search/bin/hadoop /nutch/search/bin/nutch
>   chmod 700 /nutch/search/bin/*.sh /nutch/search/bin/hadoop /nutch/search/bin/nutch
>   dos2unix /nutch/search/config/*.sh
>   chmod 700 /nutch/search/config/*.sh
>
> Issue:
>
> I ran the ant command in the nutch folder, and a runtime folder and a build
> folder were created. I copied the build/* files to a search folder that I
> created in the nutch folder itself. But after running these dos2unix
> commands, it says no bin/hadoop and bin/nutch files are found there, which is
> obvious because my build folder didn't have these files.
> So can you please clarify how I can follow these steps?
> I have only 1 user, which is not the root user, under which I am setting up
> all 3: hadoop, solr and nutch.
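The deploy-mode invocation described in the reply above can be sketched in shell roughly as follows. This is only a sketch: it assumes NUTCH_HOME points at a Nutch 1.x source tree where "ant runtime" has already produced runtime/deploy, and it prints the "hadoop jar" command rather than running it. The helper names nutch_job_file and nutch_deploy_cmd are invented for this example and are not part of Nutch or Hadoop.

```shell
#!/bin/sh
# Sketch: build the "hadoop jar" command line for a Nutch tool in deploy mode.
# Helper names are hypothetical; nothing here talks to a real cluster.

# Print the path of the first *.job file found under <nutch_home>/runtime/deploy.
nutch_job_file() {
    for f in "$1"/runtime/deploy/*.job; do
        if [ -f "$f" ]; then
            printf '%s\n' "$f"
            return 0
        fi
    done
    echo "no .job file under $1/runtime/deploy (was 'ant runtime' run?)" >&2
    return 1
}

# Print (not run) the command that would launch a tool class via "hadoop jar".
# Usage: nutch_deploy_cmd <nutch_home> <tool_class> [tool args...]
nutch_deploy_cmd() {
    nutch_home=$1
    shift
    job=$(nutch_job_file "$nutch_home") || return 1
    printf 'hadoop jar %s %s\n' "$job" "$*"
}
```

For example, `nutch_deploy_cmd "$NUTCH_HOME" org.apache.nutch.crawl.Injector crawl/crawldb urls` would print the full command, where crawl/crawldb and urls are placeholder paths. Looking the job file up by glob avoids hard-coding a version such as 1.15-SNAPSHOT, which differs between builds.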