Re: Please share your experience of using Nutch in production

2014-06-24 Thread Gora Mohanty
On 23 June 2014 01:44, Meraj A. Khan  wrote:
> Gora,
>
> Thanks for sharing your admin perspective. Rest assured, I am not trying
> to circumvent any politeness requirements in any way. As I mentioned
> earlier, I am within the crawl-delay limits set by the webmasters, if any.
> However, you have confirmed my hunch that I might have to reach out to
> individual webmasters to try and convince them not to block my IP address.
[...]

If you are taking the reasonable precautions that you mentioned earlier,
there is no reason that you should be banned by webmasters. Unless a crawler
is actually causing problems for site performance, it may not even come to
the webmaster's attention at all.
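
For what it is worth, these are the kind of politeness settings I have in
mind. A minimal, untested sketch follows, assuming the stock Nutch 1.x
fetcher properties; the agent name and values are only examples, and the
same properties would normally be set in nutch-site.xml:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;

/** Untested sketch: the usual politeness knobs, set programmatically.
 *  The same values would normally live in nutch-site.xml. */
public class PolitenessDefaults {
  public static Configuration politeConf() {
    Configuration conf = NutchConfiguration.create();
    conf.set("http.agent.name", "MyCrawler");       // identify your bot honestly (example name)
    conf.setFloat("fetcher.server.delay", 5.0f);    // seconds between requests to one host
    conf.setInt("fetcher.max.crawl.delay", 30);     // skip hosts demanding a longer Crawl-Delay
    conf.setInt("fetcher.threads.per.queue", 1);    // one fetch thread per host queue
    return conf;
  }
}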

> By being at a disadvantage , I meant at a disadvantage compared to major
> players like Google, Bing and Yahoo bots , whom the webmasters probably
> would not block access, and by Nutch variant , I meant an instance of a
> customized crawler based on Nutch.

People are unlikely to ban Google et al., as there are clear benefits to
having them index one's site. If you would like special privileges, such as
being able to hit a site hard, you will have to convince the webmaster that
your crawler also brings some such benefit to them.

Regards,
Gora


Re: File not found error

2014-06-24 Thread John Lafitte
Okay, I got it working again. I'm not sure exactly what happened, but fsck
didn't help. I noticed the last line of the stack trace showed "Native
Method", so I moved the native binaries out of the /lib folder. Lo and
behold, the next time I ran it, it used the Java libs and displayed the
filename it was having a problem with:
/tmp/hadoop-root/mapred/staging/root850517656/.staging. Given that, I just
moved the /tmp/hadoop-root directory aside, and then it started working
again. Permissions looked fine, so it might have just been corrupt.
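
In case anyone else hits the same thing, this is roughly the check I ended
up doing by hand. It is an untested sketch using the Hadoop local filesystem
API; the staging path is taken from the error quoted below, and the
hadoop.tmp.dir default is an assumption:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

/** Untested sketch: probe the local job staging directory that the
 *  Injector stack trace pointed at, and recreate it if it is missing. */
public class StagingDirCheck {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem local = FileSystem.getLocal(conf);
    // In local mode Hadoop stages jobs under hadoop.tmp.dir
    // (default /tmp/hadoop-${user.name}).
    Path staging = new Path(conf.get("hadoop.tmp.dir", "/tmp/hadoop-root"),
        "mapred/staging");
    if (!local.exists(staging)) {
      System.out.println(staging + " is missing; recreating it");
      local.mkdirs(staging, new FsPermission((short) 0700));
    } else {
      System.out.println(staging + " looks fine: " + local.getFileStatus(staging));
    }
  }
}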

Thanks for the help!


On Tue, Jun 24, 2014 at 9:03 PM, John Lafitte 
wrote:

> Well, I'm just using Nutch in local mode, no HDFS (as far as I know)... My
> latest step is trying to determine whether there is a filesystem issue. It's
> not really clear which file is not found. I have about 10 different
> configs; this is just one of them, and they all have the urls folder. The
> script worked for quite a while before this just started happening on its
> own. That's why I'm suspecting a filesystem error.
>
>
> On Tue, Jun 24, 2014 at 6:53 PM, kaveh minooie  wrote:
>
>> you might want to check to see if
>>
>> > Injector: urlDir: di/urls
>>
>> still exists in your HDFS.
>>
>>
>>
>>
>> On 06/24/2014 12:30 AM, John Lafitte wrote:
>>
>>> Using Nutch 1.7
>>>
>>> Out of the blue all of my crawl jobs started failing a few days ago.  I
>>> checked the user logs and nobody logged into the server and there were no
>>> reboots or any other obvious issues. There is plenty of disk space. Here
>>> is the error I'm getting, any help is appreciated:
>>>
>>> Injector: starting at 2014-06-24 07:26:54
>>> Injector: crawlDb: di/crawl/crawldb
>>> Injector: urlDir: di/urls
>>> Injector: Converting injected urls to crawl db entries.
>>> Injector: ENOENT: No such file or directory
>>>   at org.apache.hadoop.io.nativeio.NativeIO.chmod(Native Method)
>>>   at org.apache.hadoop.fs.FileUtil.execSetPermission(FileUtil.java:701)
>>>   at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:656)
>>>   at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:514)
>>>   at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:349)
>>>   at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:193)
>>>   at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:126)
>>>   at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:942)
>>>   at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
>>>   at java.security.AccessController.doPrivileged(Native Method)
>>>   at javax.security.auth.Subject.doAs(Subject.java:416)
>>>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
>>>   at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
>>>   at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
>>>   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
>>>   at org.apache.nutch.crawl.Injector.inject(Injector.java:281)
>>>   at org.apache.nutch.crawl.Injector.run(Injector.java:318)
>>>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>   at org.apache.nutch.crawl.Injector.main(Injector.java:308)
>>>
>>>
>> --
>> Kaveh Minooie
>>
>
>


Re: File not found error

2014-06-24 Thread John Lafitte
Well, I'm just using Nutch in local mode, no HDFS (as far as I know)... My
latest step is trying to determine whether there is a filesystem issue. It's
not really clear which file is not found. I have about 10 different
configs; this is just one of them, and they all have the urls folder. The
script worked for quite a while before this just started happening on its
own. That's why I'm suspecting a filesystem error.


On Tue, Jun 24, 2014 at 6:53 PM, kaveh minooie  wrote:

> you might want to check to see if
>
> > Injector: urlDir: di/urls
>
> still exists in your HDFS.
>
>
>
>
> On 06/24/2014 12:30 AM, John Lafitte wrote:
>
>> Using Nutch 1.7
>>
>> Out of the blue all of my crawl jobs started failing a few days ago.  I
>> checked the user logs and nobody logged into the server and there were no
>> reboots or any other obvious issues.  There is plenty of disk space.  Here
>> is the error I'm getting, any help is appreciated:
>>
>> Injector: starting at 2014-06-24 07:26:54
>> Injector: crawlDb: di/crawl/crawldb
>> Injector: urlDir: di/urls
>> Injector: Converting injected urls to crawl db entries.
>> Injector: ENOENT: No such file or directory
>>   at org.apache.hadoop.io.nativeio.NativeIO.chmod(Native Method)
>>   at org.apache.hadoop.fs.FileUtil.execSetPermission(FileUtil.java:701)
>>   at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:656)
>>   at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:514)
>>   at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:349)
>>   at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:193)
>>   at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:126)
>>   at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:942)
>>   at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
>>   at java.security.AccessController.doPrivileged(Native Method)
>>   at javax.security.auth.Subject.doAs(Subject.java:416)
>>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
>>   at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
>>   at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
>>   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
>>   at org.apache.nutch.crawl.Injector.inject(Injector.java:281)
>>   at org.apache.nutch.crawl.Injector.run(Injector.java:318)
>>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>   at org.apache.nutch.crawl.Injector.main(Injector.java:308)
>>
>>
> --
> Kaveh Minooie
>


Re: File not found error

2014-06-24 Thread kaveh minooie

you might want to check to see if

> Injector: urlDir: di/urls

still exists in your HDFS.
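
Something like this quick check would tell you. It is an untested sketch
that assumes the urlDir path from your log and whichever filesystem (local
or HDFS) your Nutch configuration resolves to:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.nutch.util.NutchConfiguration;

/** Untested sketch: verify that the injector's urlDir still exists on
 *  whichever filesystem the configuration resolves to. */
public class UrlDirCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    Path urlDir = new Path(args.length > 0 ? args[0] : "di/urls");
    FileSystem fs = urlDir.getFileSystem(conf);
    System.out.println(urlDir + (fs.exists(urlDir) ? " exists" : " is MISSING")
        + " on " + fs.getUri());
  }
}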



On 06/24/2014 12:30 AM, John Lafitte wrote:

Using Nutch 1.7

Out of the blue all of my crawl jobs started failing a few days ago.  I
checked the user logs and nobody logged into the server and there were no
reboots or any other obvious issues.  There is plenty of disk space.  Here
is the error I'm getting, any help is appreciated:

Injector: starting at 2014-06-24 07:26:54
Injector: crawlDb: di/crawl/crawldb
Injector: urlDir: di/urls
Injector: Converting injected urls to crawl db entries.
Injector: ENOENT: No such file or directory
  at org.apache.hadoop.io.nativeio.NativeIO.chmod(Native Method)
  at org.apache.hadoop.fs.FileUtil.execSetPermission(FileUtil.java:701)
  at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:656)
  at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:514)
  at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:349)
  at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:193)
  at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:126)
  at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:942)
  at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:416)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
  at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
  at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
  at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
  at org.apache.nutch.crawl.Injector.inject(Injector.java:281)
  at org.apache.nutch.crawl.Injector.run(Injector.java:318)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
  at org.apache.nutch.crawl.Injector.main(Injector.java:308)



--
Kaveh Minooie


Re: updatedb deletes all metadata except _csh_

2014-06-24 Thread alxsss
Hi,


I already came up with changes to the code similar to those in this patch.
My only suggestion for the patch's code is to move the check for whether the
url already exists in the datastore under


if (!additionsAllowed) {
  return;
}


and to close the datastore.
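
Roughly what I have in mind is the following. It is an untested sketch, not
the actual NUTCH-1679 patch; store, key and additionsAllowed stand in for
whatever the real updatedb reducer already has in scope, and the store would
still be closed via store.close() by the caller when the job finishes:

import java.io.IOException;
import org.apache.gora.store.DataStore;
import org.apache.nutch.storage.WebPage;

/** Untested sketch of the ordering suggested above: only touch the
 *  datastore when additions are allowed, and skip URLs that already exist. */
public class ExistingUrlGuard {
  /** @return true if the new row should be written to the datastore. */
  public static boolean shouldAdd(DataStore<String, WebPage> store,
      String key, boolean additionsAllowed) throws IOException {
    if (!additionsAllowed) {
      return false;                 // additions disabled: no lookup needed at all
    }
    WebPage existing = store.get(key);
    return existing == null;        // only add URLs that are not already stored
  }
}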


Thanks.
Alex.
-Original Message-
From: Lewis John Mcgibbney 
To: user 
Sent: Tue, Jun 24, 2014 9:07 am
Subject: Re: updatedb deletes all metadata except _csh_


Hi Alex,

I am really sorry for not making the connection here.

On Tue, Jun 24, 2014 at 12:31 AM,  wrote:

>
> So far, this looks like a bug in updatedb when filtering with batchId.
>
> I could only find one solution: check whether new pages are already in the
> datastore and, if they are, skip them.
> Otherwise, updatedb with the -all option will also work.
>

https://issues.apache.org/jira/browse/NUTCH-1679

If you can run with this patch, then please post your results here.

 


Re: updatedb deletes all metadata except _csh_

2014-06-24 Thread Lewis John Mcgibbney
Hi Alex,

I am really sorry for not making the connection here.

On Tue, Jun 24, 2014 at 12:31 AM,  wrote:

>
> So far, this looks like a bug in updatedb when filtering with batchId.
>
> I could only find one solution: check whether new pages are already in the
> datastore and, if they are, skip them.
> Otherwise, updatedb with the -all option will also work.
>

https://issues.apache.org/jira/browse/NUTCH-1679

If you can run with this patch, then please post your results here.


reg crawled pages with status=2

2014-06-24 Thread Deepa Jayaveer
Hi,
  our requirement is that Nutch should not recrawl pages that have already
been crawled, i.e., crawling should not happen for web pages whose status is
'2' in the webpage table. It should not refetch them and should not add
their outlinks either.

Can you please let me know whether this is possible by changing some
configuration parameters in nutch-site.xml?
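
For example, would setting something like the following be enough? This is
an untested sketch based on property names I believe exist
(db.fetch.interval.default and db.update.additions.allowed); the same
values would normally go into nutch-site.xml:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;

/** Untested sketch: properties that (I assume) approximate "do not refetch
 *  already-fetched pages and do not add their outlinks". */
public class NoRecrawlConf {
  public static Configuration create() {
    Configuration conf = NutchConfiguration.create();
    // Very long refetch interval so fetched (status=2) pages are not regenerated.
    conf.setInt("db.fetch.interval.default", 10 * 365 * 24 * 60 * 60); // ~10 years
    // Stop updatedb from adding newly discovered outlinks to the webpage table.
    conf.setBoolean("db.update.additions.allowed", false);
    return conf;
  }
}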

Thanks and Regards
Deepa




Incremental web crawling based on number of web pages

2014-06-24 Thread Ali Nazemian
Hi,
I am going to change the Crawler class so that it can crawl incrementally
based on the number of web pages. Suppose the total number of pages for a
depth-2 crawl is around 5000. Right now this class runs generate-fetch-update
for all pages and, after finishing, sends them to Solr for indexing. I want
to change this class so that it can break these 5000 pages into 10 separate
generate-fetch-update cycles. Is that possible with Nutch? If yes, how can I
do that? (I have sketched what I mean after the source below.)

Crawler source:

public class Crawler extends Configured implements Tool {
  public static final Logger LOG = LoggerFactory.getLogger(Crawler.class);

  private static String getDate() {
    return new SimpleDateFormat("MMddHHmmss").format(new Date(
        System.currentTimeMillis()));
  }

  /*
   * Perform complete crawling and indexing (to Solr) given a set of root urls
   * and the -solr parameter respectively. More information and Usage
   * parameters can be found below.
   */
  public static void main(String args[]) throws Exception {
    Configuration conf = NutchConfiguration.create();
    int res = ToolRunner.run(conf, new Crawler(), args);
    System.exit(res);
  }

  @Override
  public int run(String[] args) throws Exception {
    if (args.length < 1) {
      System.out.println("Usage: Crawl <urlDir> -solr <solrURL> [-dir d] "
          + "[-threads n] [-depth i] [-topN N]");
      return -1;
    }
    Path rootUrlDir = null;
    Path dir = new Path("crawl-" + getDate());
    int threads = getConf().getInt("fetcher.threads.fetch", 10);
    int depth = 5;
    long topN = Long.MAX_VALUE;
    String solrUrl = null;

    for (int i = 0; i < args.length; i++) {
      if ("-dir".equals(args[i])) {
        dir = new Path(args[i + 1]);
        i++;
      } else if ("-threads".equals(args[i])) {
        threads = Integer.parseInt(args[i + 1]);
        i++;
      } else if ("-depth".equals(args[i])) {
        depth = Integer.parseInt(args[i + 1]);
        i++;
      } else if ("-topN".equals(args[i])) {
        topN = Integer.parseInt(args[i + 1]);
        i++;
      } else if ("-solr".equals(args[i])) {
        solrUrl = args[i + 1];
        i++;
      } else if (args[i] != null) {
        rootUrlDir = new Path(args[i]);
      }
    }

    JobConf job = new NutchJob(getConf());

    if (solrUrl == null) {
      LOG.warn("solrUrl is not set, indexing will be skipped...");
    } else {
      // for simplicity assume that SOLR is used
      // and pass its URL via conf
      getConf().set("solr.server.url", solrUrl);
    }

    FileSystem fs = FileSystem.get(job);

    if (LOG.isInfoEnabled()) {
      LOG.info("crawl started in: " + dir);
      LOG.info("rootUrlDir = " + rootUrlDir);
      LOG.info("threads = " + threads);
      LOG.info("depth = " + depth);
      LOG.info("solrUrl=" + solrUrl);
      if (topN != Long.MAX_VALUE)
        LOG.info("topN = " + topN);
    }

    Path crawlDb = new Path(dir + "/crawldb");
    Path linkDb = new Path(dir + "/linkdb");
    Path segments = new Path(dir + "/segments");

    // Path tmpDir = job.getLocalPath("crawl" + Path.SEPARATOR + getDate());
    Injector injector = new Injector(getConf());
    Generator generator = new Generator(getConf());
    Fetcher fetcher = new Fetcher(getConf());
    ParseSegment parseSegment = new ParseSegment(getConf());
    CrawlDb crawlDbTool = new CrawlDb(getConf());
    LinkDb linkDbTool = new LinkDb(getConf());

    // initialize crawlDb
    injector.inject(crawlDb, rootUrlDir);
    int i;
    for (i = 0; i < depth; i++) { // generate new segment
      Path[] segs = generator.generate(crawlDb, segments, -1, topN,
          System.currentTimeMillis());
      if (segs == null) {
        LOG.info("Stopping at depth=" + i + " - no more URLs to fetch.");
        break;
      }
      fetcher.fetch(segs[0], threads); // fetch it
      if (!Fetcher.isParsing(job)) {
        parseSegment.parse(segs[0]); // parse it, if needed
      }
      crawlDbTool.update(crawlDb, segs, true, true); // update crawldb
    }
    if (i > 0) {
      linkDbTool.invert(linkDb, segments, true, true, false); // invert links
      // dedup should be added

      if (solrUrl != null) {
        // index
        FileStatus[] fstats = fs.listStatus(segments,
            HadoopFSUtil.getPassDirectoriesFilter(fs));

        IndexingJob indexer = new IndexingJob(getConf());
        boolean noCommit = false;
        indexer.index(crawlDb, linkDb,
            Arrays.asList(HadoopFSUtil.getPaths(fstats)), noCommit);
      }
      // merge should be added
      // clean should be added
    } else {
      LOG.warn("No URLs to fetch - check your seed list and URL filters.");
    }
    if (LOG.isInfoEnabled()) {
      LOG.info("crawl finished: " + dir);
    }
    return 0;
  }

}
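
To make it concrete, here is an untested sketch of how I imagine the depth
loop above could be split into smaller rounds by capping topN per cycle.
pagesPerCycle and maxRounds are made-up example values (5000 pages / 10
cycles = 500 per round), and indexing could be triggered after each round
instead of only at the end:

// Untested sketch: a drop-in replacement for the depth loop in run() above,
// splitting the crawl into fixed-size generate-fetch-update rounds.
private int runInRounds(JobConf job, Path crawlDb, Path segments,
    Generator generator, Fetcher fetcher, ParseSegment parseSegment,
    CrawlDb crawlDbTool, int threads) throws IOException {
  long pagesPerCycle = 500;   // assumption: 5000 pages split over 10 cycles
  int maxRounds = 10;
  int round;
  for (round = 0; round < maxRounds; round++) {
    Path[] segs = generator.generate(crawlDb, segments, -1, pagesPerCycle,
        System.currentTimeMillis());
    if (segs == null) {
      LOG.info("No more URLs to fetch after " + round + " rounds.");
      break;
    }
    fetcher.fetch(segs[0], threads);               // fetch this round's segment
    if (!Fetcher.isParsing(job)) {
      parseSegment.parse(segs[0]);                 // parse it, if needed
    }
    crawlDbTool.update(crawlDb, segs, true, true); // fold results back into crawldb
    // indexing after each round (or every few rounds) could go here
  }
  return round;                                    // number of rounds actually run
}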

Best regards.
-- 
A.Nazemian


File not found error

2014-06-24 Thread John Lafitte
Using Nutch 1.7

Out of the blue all of my crawl jobs started failing a few days ago.  I
checked the user logs and nobody logged into the server and there were no
reboots or any other obvious issues.  There is plenty of disk space.  Here
is the error I'm getting, any help is appreciated:

Injector: starting at 2014-06-24 07:26:54
Injector: crawlDb: di/crawl/crawldb
Injector: urlDir: di/urls
Injector: Converting injected urls to crawl db entries.
Injector: ENOENT: No such file or directory
  at org.apache.hadoop.io.nativeio.NativeIO.chmod(Native Method)
  at org.apache.hadoop.fs.FileUtil.execSetPermission(FileUtil.java:701)
  at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:656)
  at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:514)
  at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:349)
  at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:193)
  at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:126)
  at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:942)
  at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:416)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
  at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
  at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
  at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
  at org.apache.nutch.crawl.Injector.inject(Injector.java:281)
  at org.apache.nutch.crawl.Injector.run(Injector.java:318)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
  at org.apache.nutch.crawl.Injector.main(Injector.java:308)