Re: Please share your experience of using Nutch in production
On 23 June 2014 01:44, Meraj A. Khan wrote:
> Gora,
>
> Thanks for sharing your admin perspective; rest assured I am not trying
> to circumvent any politeness requirements in any way. As I mentioned
> earlier, I am within the crawl-delay limits being set by the webmasters,
> if any. However, you have confirmed my hunch that I might have to reach
> out to individual webmasters to try and convince them not to block my IP
> address. [...]

If you are taking the reasonable precautions that you mentioned earlier, there is no reason that you should be getting banned by webmasters. Unless a crawler is actually causing issues for the site's performance, it might not even come to the attention of the webmaster at all.

> By being at a disadvantage, I meant at a disadvantage compared to major
> players like the Google, Bing, and Yahoo bots, which the webmasters
> probably would not block, and by "Nutch variant" I meant an instance of
> a customized crawler based on Nutch.

People are unlikely to ban Google et al., as there are clear benefits to having them search one's site. If you would like special privileges, such as being able to hit the site hard, you will have to convince the webmaster that your crawler also brings some such benefit to them.

Regards,
Gora
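[Editor's note] The thread above turns on staying within each site's Crawl-delay. As a rough illustration only (this is not Nutch's own robots.txt handling, which lives in its protocol plugins), here is a minimal, hypothetical parser that pulls a Crawl-delay value out of a robots.txt body for a given user-agent:

```java
import java.util.Locale;

public class CrawlDelayParser {
    // Returns the Crawl-delay (in seconds) declared for the matching
    // user-agent group in a robots.txt body, or the supplied default when
    // none is found. Simplified sketch: one directive per line, first
    // matching group wins.
    public static double crawlDelay(String robotsTxt, String agent, double dflt) {
        boolean inGroup = false;
        for (String raw : robotsTxt.split("\n")) {
            String line = raw.split("#", 2)[0].trim();   // strip comments
            String lower = line.toLowerCase(Locale.ROOT);
            if (lower.startsWith("user-agent:")) {
                String ua = line.substring("user-agent:".length()).trim();
                inGroup = ua.equals("*")
                        || agent.toLowerCase(Locale.ROOT).contains(ua.toLowerCase(Locale.ROOT));
            } else if (inGroup && lower.startsWith("crawl-delay:")) {
                try {
                    return Double.parseDouble(line.substring("crawl-delay:".length()).trim());
                } catch (NumberFormatException e) {
                    return dflt;                         // malformed value
                }
            }
        }
        return dflt;
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\nCrawl-delay: 5\nDisallow: /private/\n";
        System.out.println(CrawlDelayParser.crawlDelay(robots, "mybot", 1.0)); // prints 5.0
    }
}
```

Note that Crawl-delay is a de facto extension, not part of the original robots.txt standard, so not every site declares one.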
Re: File not found error
Okay, I got it working again. Not sure exactly what happened, but fsck didn't help. I noticed the last line showed "native method", so I moved the native binaries out of the /lib folder. Lo and behold, the next time I ran it, it used the Java libs and displayed the filename it was having a problem with. It was /tmp/hadoop-root/mapred/staging/root850517656/.staging, so I just went and moved the /tmp/hadoop-root directory, and then it started working again. Permissions looked fine, so it might have just been corrupt. Thanks for the help!

On Tue, Jun 24, 2014 at 9:03 PM, John Lafitte wrote:
> Well, I'm just using Nutch in local mode, no hdfs (as far as I know)...
> My latest thing is trying to determine if there is a filesystem issue.
> It's not really clear what file is not found. I have about 10 different
> configs; this is just one of them, and they all have the urls folder.
> The script worked for quite a while before this just started happening
> on its own. That's why I'm suspecting a filesystem error.
>
> On Tue, Jun 24, 2014 at 6:53 PM, kaveh minooie wrote:
>> you might want to check to see if
>>
>>> Injector: urlDir: di/urls
>>
>> still exists in your hdfs.
>>
>> On 06/24/2014 12:30 AM, John Lafitte wrote:
>>> Using Nutch 1.7
>>>
>>> Out of the blue all of my crawl jobs started failing a few days ago. I
>>> checked the user logs and nobody logged into the server, and there were
>>> no reboots or any other obvious issues. There is plenty of disk space.
>>> Here is the error I'm getting; any help is appreciated:
>>>
>>> Injector: starting at 2014-06-24 07:26:54
>>> Injector: crawlDb: di/crawl/crawldb
>>> Injector: urlDir: di/urls
>>> Injector: Converting injected urls to crawl db entries.
>>> Injector: ENOENT: No such file or directory
>>>     at org.apache.hadoop.io.nativeio.NativeIO.chmod(Native Method)
>>>     at org.apache.hadoop.fs.FileUtil.execSetPermission(FileUtil.java:701)
>>>     at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:656)
>>>     at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:514)
>>>     at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:349)
>>>     at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:193)
>>>     at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:126)
>>>     at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:942)
>>>     at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
>>>     at java.security.AccessController.doPrivileged(Native Method)
>>>     at javax.security.auth.Subject.doAs(Subject.java:416)
>>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
>>>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
>>>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
>>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
>>>     at org.apache.nutch.crawl.Injector.inject(Injector.java:281)
>>>     at org.apache.nutch.crawl.Injector.run(Injector.java:318)
>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>     at org.apache.nutch.crawl.Injector.main(Injector.java:308)
>>
>> --
>> Kaveh Minooie
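[Editor's note] The fix in this thread came down to a bad local staging directory under /tmp/hadoop-root. The path below is taken from the thread; the pre-check itself is only a diagnostic sketch of mine, not part of Nutch or Hadoop:

```java
import java.io.File;

public class StagingDirCheck {
    // Reports whether a local staging path exists, is a directory, and is
    // writable -- conditions whose failure in local mode can surface as the
    // opaque "ENOENT" above rather than a clear message.
    public static String check(File dir) {
        if (!dir.exists()) return "missing";
        if (!dir.isDirectory()) return "not a directory";
        if (!dir.canWrite()) return "not writable";
        return "ok";
    }

    public static void main(String[] args) {
        // The staging path from this thread (used by Hadoop in local mode).
        File staging = new File("/tmp/hadoop-root/mapred/staging");
        System.out.println(staging + ": " + check(staging));
    }
}
```

When all three conditions pass but jobs still fail, moving the directory aside so Hadoop recreates it fresh (as John did) is a reasonable next step.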
Re: File not found error
Well, I'm just using Nutch in local mode, no hdfs (as far as I know)... My latest thing is trying to determine if there is a filesystem issue. It's not really clear what file is not found. I have about 10 different configs; this is just one of them, and they all have the urls folder. The script worked for quite a while before this just started happening on its own. That's why I'm suspecting a filesystem error.

On Tue, Jun 24, 2014 at 6:53 PM, kaveh minooie wrote:
> you might want to check to see if
>
>> Injector: urlDir: di/urls
>
> still exists in your hdfs.
>
> On 06/24/2014 12:30 AM, John Lafitte wrote:
>> Using Nutch 1.7
>>
>> Out of the blue all of my crawl jobs started failing a few days ago. I
>> checked the user logs and nobody logged into the server, and there were
>> no reboots or any other obvious issues. There is plenty of disk space.
>> Here is the error I'm getting; any help is appreciated:
>>
>> Injector: starting at 2014-06-24 07:26:54
>> Injector: crawlDb: di/crawl/crawldb
>> Injector: urlDir: di/urls
>> Injector: Converting injected urls to crawl db entries.
>> Injector: ENOENT: No such file or directory
>>     at org.apache.hadoop.io.nativeio.NativeIO.chmod(Native Method)
>>     at org.apache.hadoop.fs.FileUtil.execSetPermission(FileUtil.java:701)
>>     at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:656)
>>     at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:514)
>>     at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:349)
>>     at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:193)
>>     at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:126)
>>     at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:942)
>>     at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
>>     at java.security.AccessController.doPrivileged(Native Method)
>>     at javax.security.auth.Subject.doAs(Subject.java:416)
>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
>>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
>>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
>>     at org.apache.nutch.crawl.Injector.inject(Injector.java:281)
>>     at org.apache.nutch.crawl.Injector.run(Injector.java:318)
>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>     at org.apache.nutch.crawl.Injector.main(Injector.java:308)
>
> --
> Kaveh Minooie
Re: File not found error
you might want to check to see if

> Injector: urlDir: di/urls

still exists in your hdfs.

On 06/24/2014 12:30 AM, John Lafitte wrote:
> Using Nutch 1.7
>
> Out of the blue all of my crawl jobs started failing a few days ago. I
> checked the user logs and nobody logged into the server, and there were
> no reboots or any other obvious issues. There is plenty of disk space.
> Here is the error I'm getting; any help is appreciated:
>
> Injector: starting at 2014-06-24 07:26:54
> Injector: crawlDb: di/crawl/crawldb
> Injector: urlDir: di/urls
> Injector: Converting injected urls to crawl db entries.
> Injector: ENOENT: No such file or directory
>     at org.apache.hadoop.io.nativeio.NativeIO.chmod(Native Method)
>     at org.apache.hadoop.fs.FileUtil.execSetPermission(FileUtil.java:701)
>     at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:656)
>     at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:514)
>     at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:349)
>     at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:193)
>     at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:126)
>     at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:942)
>     at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:416)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
>     at org.apache.nutch.crawl.Injector.inject(Injector.java:281)
>     at org.apache.nutch.crawl.Injector.run(Injector.java:318)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>     at org.apache.nutch.crawl.Injector.main(Injector.java:308)

--
Kaveh Minooie
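[Editor's note] The check suggested above amounts to verifying that the url directory still exists and holds at least one non-empty seed file. A local-filesystem sketch of that check (my own illustration; on a real HDFS cluster the equivalent would be listing the path with the Hadoop filesystem shell):

```java
import java.io.File;

public class UrlDirCheck {
    // Returns true if the given url directory exists and contains at least
    // one non-empty regular file (a plausible seed list). Diagnostic sketch
    // only -- Nutch's Injector does its own validation via the Hadoop
    // FileSystem API.
    public static boolean hasSeeds(File urlDir) {
        File[] files = urlDir.listFiles();
        if (files == null) return false;   // missing, or not a directory
        for (File f : files) {
            if (f.isFile() && f.length() > 0) return true;
        }
        return false;
    }
}
```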
Re: updatedb deletes all metadata except _csh_
Hi,

I already came up with similar changes to the code as in this patch. My only suggestion on the patch's code is to move the check for whether the url exists in the datastore under

    if (!additionsAllowed) {
      return;
    }

and to close the datastore.

Thanks.
Alex.

-----Original Message-----
From: Lewis John Mcgibbney
To: user
Sent: Tue, Jun 24, 2014 9:07 am
Subject: Re: updatedb deletes all metadata except _csh_

Hi Alex,

I am really sorry for not making the connection here.

On Tue, Jun 24, 2014 at 12:31 AM, wrote:
>
> So far, this looks like a bug in updatedb when filtering with batchId.
>
> I could only find one solution: to check if new pages are in the
> datastore and, if they are, skip them.
> Otherwise updatedb with option -all will also work.
> https://issues.apache.org/jira/browse/NUTCH-1679

If you can run with this patch, then please post your results here.
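[Editor's note] As I read Alex's suggestion, the point is ordering: the cheap additionsAllowed guard should run before any datastore lookup, so the existence check is only paid when it can matter. A minimal sketch of that control flow, with a HashMap standing in for the Gora datastore (all names here are hypothetical, not the actual NUTCH-1679 patch code):

```java
import java.util.HashMap;
import java.util.Map;

public class UpdateSkipSketch {
    // Decides whether updatedb should write a row for this url. The early
    // return on !additionsAllowed happens before the (potentially expensive)
    // datastore lookup; urls already present are skipped so their existing
    // metadata is not overwritten.
    public static boolean shouldWrite(String url, boolean additionsAllowed,
                                      Map<String, Object> datastore) {
        if (!additionsAllowed) {
            return false;                  // early exit: no lookup needed
        }
        // Only now consult the datastore; skip urls that already exist there.
        return !datastore.containsKey(url);
    }
}
```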
Re: updatedb deletes all metadata except _csh_
Hi Alex,

I am really sorry for not making the connection here.

On Tue, Jun 24, 2014 at 12:31 AM, wrote:
>
> So far, this looks like a bug in updatedb when filtering with batchId.
>
> I could only find one solution: to check if new pages are in the
> datastore and, if they are, skip them.
> Otherwise updatedb with option -all will also work.
> https://issues.apache.org/jira/browse/NUTCH-1679

If you can run with this patch, then please post your results here.
reg crawled pages with status=2
Hi,

Our requirement is that Nutch should not recrawl pages that have already been crawled, i.e., crawling should not happen for web pages whose status is '2' in the webpage table. It should not recrawl them and should not add their outlinks either.

Can you please let me know whether this is possible by changing some configuration parameters in nutch-site.xml?

Thanks and Regards
Deepa
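[Editor's note] The selection Deepa describes amounts to filtering the generate step so rows with status 2 are never picked up again. A purely illustrative sketch of that logic over a status-by-url view of the webpage table (this is not a Nutch configuration option; whether it can be achieved via nutch-site.xml alone is exactly the open question of this message):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class FetchedFilterSketch {
    static final int STATUS_FETCHED = 2;   // the status '2' from the question

    // From a map of url -> status, keep only urls whose status is not 2, so
    // already-fetched pages (and, by extension, their outlinks) would never
    // be re-generated.
    public static List<String> urlsToGenerate(Map<String, Integer> statusByUrl) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : statusByUrl.entrySet()) {
            if (e.getValue() != STATUS_FETCHED) {
                out.add(e.getKey());
            }
        }
        return out;
    }
}
```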
Incremental web crawling based on number of web pages
Hi,

I am going to change the crawler class so that it can crawl incrementally based on the number of web pages. Suppose the sum of all pages for a 2-depth crawl is around 5000 pages. Right now this class runs generate-fetch-update for all pages and, after finishing, sends them to Solr for indexing. I want to change this class so that it breaks these 5000 pages into 10 different generate-fetch-update cycles. Is that possible with Nutch? If yes, how can I do that?

Crawler source:

public class Crawler extends Configured implements Tool {

  public static final Logger LOG = LoggerFactory.getLogger(Crawler.class);

  private static String getDate() {
    return new SimpleDateFormat("MMddHHmmss").format(new Date(System
        .currentTimeMillis()));
  }

  /*
   * Perform complete crawling and indexing (to Solr) given a set of root urls
   * and the -solr parameter respectively. More information and Usage
   * parameters can be found below.
   */
  public static void main(String args[]) throws Exception {
    Configuration conf = NutchConfiguration.create();
    int res = ToolRunner.run(conf, new Crawler(), args);
    System.exit(res);
  }

  @Override
  public int run(String[] args) throws Exception {
    if (args.length < 1) {
      System.out
          .println("Usage: Crawl <urlDir> -solr <solrURL> [-dir d] [-threads n] [-depth i] [-topN N]");
      return -1;
    }

    Path rootUrlDir = null;
    Path dir = new Path("crawl-" + getDate());
    int threads = getConf().getInt("fetcher.threads.fetch", 10);
    int depth = 5;
    long topN = Long.MAX_VALUE;
    String solrUrl = null;

    for (int i = 0; i < args.length; i++) {
      if ("-dir".equals(args[i])) {
        dir = new Path(args[i + 1]);
        i++;
      } else if ("-threads".equals(args[i])) {
        threads = Integer.parseInt(args[i + 1]);
        i++;
      } else if ("-depth".equals(args[i])) {
        depth = Integer.parseInt(args[i + 1]);
        i++;
      } else if ("-topN".equals(args[i])) {
        topN = Integer.parseInt(args[i + 1]);
        i++;
      } else if ("-solr".equals(args[i])) {
        solrUrl = args[i + 1];
        i++;
      } else if (args[i] != null) {
        rootUrlDir = new Path(args[i]);
      }
    }

    JobConf job = new NutchJob(getConf());

    if (solrUrl == null) {
      LOG.warn("solrUrl is not set, indexing will be skipped...");
    } else {
      // for simplicity assume that SOLR is used
      // and pass its URL via conf
      getConf().set("solr.server.url", solrUrl);
    }

    FileSystem fs = FileSystem.get(job);

    if (LOG.isInfoEnabled()) {
      LOG.info("crawl started in: " + dir);
      LOG.info("rootUrlDir = " + rootUrlDir);
      LOG.info("threads = " + threads);
      LOG.info("depth = " + depth);
      LOG.info("solrUrl=" + solrUrl);
      if (topN != Long.MAX_VALUE)
        LOG.info("topN = " + topN);
    }

    Path crawlDb = new Path(dir + "/crawldb");
    Path linkDb = new Path(dir + "/linkdb");
    Path segments = new Path(dir + "/segments");
    // Path tmpDir = job.getLocalPath("crawl" + Path.SEPARATOR + getDate());

    Injector injector = new Injector(getConf());
    Generator generator = new Generator(getConf());
    Fetcher fetcher = new Fetcher(getConf());
    ParseSegment parseSegment = new ParseSegment(getConf());
    CrawlDb crawlDbTool = new CrawlDb(getConf());
    LinkDb linkDbTool = new LinkDb(getConf());

    // initialize crawlDb
    injector.inject(crawlDb, rootUrlDir);
    int i;
    for (i = 0; i < depth; i++) {           // generate new segment
      Path[] segs = generator.generate(crawlDb, segments, -1, topN,
          System.currentTimeMillis());
      if (segs == null) {
        LOG.info("Stopping at depth=" + i + " - no more URLs to fetch.");
        break;
      }
      fetcher.fetch(segs[0], threads);      // fetch it
      if (!Fetcher.isParsing(job)) {
        parseSegment.parse(segs[0]);        // parse it, if needed
      }
      crawlDbTool.update(crawlDb, segs, true, true); // update crawldb
    }
    if (i > 0) {
      linkDbTool.invert(linkDb, segments, true, true, false); // invert links
      // dedup should be added
      if (solrUrl != null) {
        // index
        FileStatus[] fstats = fs.listStatus(segments,
            HadoopFSUtil.getPassDirectoriesFilter(fs));
        IndexingJob indexer = new IndexingJob(getConf());
        boolean noCommit = false;
        indexer.index(crawlDb, linkDb,
            Arrays.asList(HadoopFSUtil.getPaths(fstats)), noCommit);
      }
      // merge should be added
      // clean should be added
    } else {
      LOG.warn("No URLs to fetch - check your seed list and URL filters.");
    }
    if (LOG.isInfoEnabled()) {
      LOG.info("crawl finished: " + dir);
    }
    return 0;
  }
}

Best regards.
--
A.Nazemian
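[Editor's note] One way to approach the question, assuming only that generator.generate honors its topN argument (which it does): cap topN at total/cycles and run a fixed number of generate-fetch-update cycles instead of a depth-bound loop. A self-contained sketch of the batch-size arithmetic, with the cycle loop shown as comments since it depends on the Nutch classes above:

```java
public class IncrementalBatches {
    // Per-cycle topN for splitting an estimated total page count across a
    // fixed number of generate-fetch-update cycles, e.g. 5000 pages over
    // 10 cycles -> 500 per cycle. Ceiling division so no pages are dropped.
    public static long perCycleTopN(long estimatedPages, int cycles) {
        return (estimatedPages + cycles - 1) / cycles;
    }

    public static void main(String[] args) {
        long topN = perCycleTopN(5000, 10);
        System.out.println("topN per cycle = " + topN);
        // In Crawler.run the depth loop would become something like
        // (hypothetical variable names, mirroring the class above):
        //   for (int c = 0; c < cycles; c++) {
        //     Path[] segs = generator.generate(crawlDb, segments, -1, topN,
        //         System.currentTimeMillis());
        //     if (segs == null) break;   // nothing left to fetch
        //     fetcher.fetch(segs[0], threads);
        //     if (!Fetcher.isParsing(job)) parseSegment.parse(segs[0]);
        //     crawlDbTool.update(crawlDb, segs, true, true);
        //   }
    }
}
```

Indexing to Solr per cycle, rather than once at the end, would then move the invert/index steps inside the loop as well.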
File not found error
Using Nutch 1.7

Out of the blue all of my crawl jobs started failing a few days ago. I checked the user logs and nobody logged into the server, and there were no reboots or any other obvious issues. There is plenty of disk space. Here is the error I'm getting; any help is appreciated:

Injector: starting at 2014-06-24 07:26:54
Injector: crawlDb: di/crawl/crawldb
Injector: urlDir: di/urls
Injector: Converting injected urls to crawl db entries.
Injector: ENOENT: No such file or directory
    at org.apache.hadoop.io.nativeio.NativeIO.chmod(Native Method)
    at org.apache.hadoop.fs.FileUtil.execSetPermission(FileUtil.java:701)
    at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:656)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:514)
    at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:349)
    at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:193)
    at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:126)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:942)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:416)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:281)
    at org.apache.nutch.crawl.Injector.run(Injector.java:318)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Injector.main(Injector.java:308)