Thanks for your reply. Unfortunately, I have to say that it did not help :(.
--
View this message in context:
http://lucene.472066.n3.nabble.com/New-script-bin-crawl-skipping-urls-different-batch-id--Y-tp4075441p4075665.html
Sent from the Nutch - User mailing list archive
Hello everybody,
I am trying to crawl a few websites from my seed.txt with the new crawl
script in Nutch 2.1, bin/crawl. The problem is that every time I run the
script, it does not fetch or parse anything (no URLs) and logs the message
Skipping [/here is concrete url/] different batch id ([/here is some batch id
I forgot to say that I am using Nutch in version 2.1 ...
OK, as I wrote, the problem was caused by an old version of Nutch (2.1).
After updating to 2.2.1 the message about the different batch id
disappeared, but I have a new problem now.
Every time I start the script bin/crawl, it fetches only the URLs from the
seed (no pages)
fetching http://www.museumhetvalkhof.nl
On Sun, Apr 28, 2013 at 8:33 AM, cervenkovab cervenko...@gmail.com wrote:
Hello,
I have the same problem with *Skipping some.relevant.page.com; different
batch id (null)* for a lot of pages. My configuration is almost the same as
below (only the OS is different and the storage is HBase).
I do the steps (inject), generate, fetch, and the skipping appears in the
parse phase. But I want those pages
- generate - inject - fetch
The second inject will leave entries in the db without fetchmarks seen by
the fetcher later.
--Roland
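
The scenario Roland describes can be pictured with a toy in-memory model. This is only a sketch under stated assumptions: the class `SecondInjectSketch`, its methods, and the batch id `batch-1` are hypothetical and are not part of Nutch. It assumes that generate stamps every unmarked row with the current batch id and that a later job skips any row whose mark differs from that batch id.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: a map stands in for the web table (hypothetical, not Nutch code).
final class SecondInjectSketch {
    // url -> mark; null means "never selected by a generate run"
    private final Map<String, String> marks = new LinkedHashMap<>();

    // Inject adds a row without any mark; existing rows are left untouched.
    void inject(String url) {
        marks.putIfAbsent(url, null);
    }

    // Assumption: generate stamps every unmarked row with the given batch id.
    void generate(String batchId) {
        for (Map.Entry<String, String> e : marks.entrySet()) {
            if (e.getValue() == null) {
                e.setValue(batchId);
            }
        }
    }

    // Rows a job running for batchId would skip: mark missing or different.
    List<String> skippedFor(String batchId) {
        List<String> skipped = new ArrayList<>();
        for (Map.Entry<String, String> e : marks.entrySet()) {
            if (!batchId.equals(e.getValue())) {
                skipped.add(e.getKey() + "; different batch id (" + e.getValue() + ")");
            }
        }
        return skipped;
    }

    public static void main(String[] args) {
        SecondInjectSketch db = new SecondInjectSketch();
        db.inject("http://a.example/");
        db.generate("batch-1");
        db.inject("http://b.example/"); // second inject, after generate
        // prints: Skipping http://b.example/; different batch id (null)
        for (String s : db.skippedFor("batch-1")) {
            System.out.println("Skipping " + s);
        }
    }
}
```

In this model the row added by the second inject never receives a mark, so it is reported with batch id (null), which matches the shape of the messages quoted in this thread.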
On Fri, Apr 26, 2013 at 12:30 AM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Additionally, why do we log.DEBUG that there is a different batch id
("Reparsing " + unreverseKey);
} else {
  if (!NutchJob.shouldProcess(mark, batchId)) {
    if (LOG.isDebugEnabled()) {
      LOG.debug("Skipping " + TableUtil.unreverseUrl(key)
          + "; different batch id (" + mark + ")");
    }
    return;
Any ideas? Is this a bug?
On Thu
with baseUrl=null, content=null.
Nutch is not parsing many URLs. I receive this message in the Nutch console:
Skipping http://myurlForParsing.it; different batch id (null)
How can I fix this?
This is actually something which I've wondered about for a while and it was
on my TODO list of things to address!!!
I want
://nlp.solutions.asia/?p=180.
I made the same changes in conf/nutch-site.xml (set threads to 50).
When I start the crawl (path: ~/Desktop/apache-nutch-2.1/runtime/local,
command: bin/nutch crawl urls -depth 5 -topN 1) I see the message:
Skipping http://www.domainname.com/category/viewvideo/111; different batch
/nutch generate -topN 1000
bin/nutch fetch -all
bin/nutch parse -all
When looking at the parse log, I'm seeing a bunch of different batch id
messages. These are all on urls that I did not inject into the database.
Any ideas what's causing this?
Thanks.
);
if (!NutchJob.shouldProcess(mark, batchId)) {
  if (LOG.isDebugEnabled()) {
    LOG.debug("Skipping " + TableUtil.unreverseUrl(key)
        + "; different batch id (" + mark + ")");
  }
  return;
}
since shouldProcess(mark, batchId) returns false if mark is null.
Then
bin/nutch parse -all
skips all
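
To make the null-mark behaviour described above concrete, here is a minimal sketch of such a check. It is a simplified stand-in, not the Nutch source: the class name `BatchMarkSketch` is hypothetical, plain `String` values replace whatever types Nutch uses, and treating the literal `-all` as "process every batch" is an assumption based on the `bin/nutch parse -all` command quoted in this thread.

```java
// Simplified stand-in for the check discussed above, NOT the Nutch implementation.
final class BatchMarkSketch {
    // Assumption: "-all" is the special batch id meaning "process every batch".
    static final String ALL_BATCHES = "-all";

    static boolean shouldProcess(String mark, String batchId) {
        if (mark == null) {
            // A row without a mark is skipped, even when "-all" is requested.
            return false;
        }
        // Otherwise process when all batches are requested or the ids match.
        return ALL_BATCHES.equals(batchId) || mark.equals(batchId);
    }

    public static void main(String[] args) {
        System.out.println(shouldProcess(null, ALL_BATCHES));   // false
        System.out.println(shouldProcess("1367", "1367"));      // true
        System.out.println(shouldProcess("1367", "9999"));      // false
        System.out.println(shouldProcess("1367", ALL_BATCHES)); // true
    }
}
```

Under these assumptions, the first call shows why a parse run with -all can still skip rows: the null check comes before the batch id is examined at all.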
Shen baishen.li...@gmail.com wrote:
I set up Nutch 2.x with a new instance of HBase. I ran the following
commands.
bin/nutch inject urls
bin/nutch generate -topN 1000
bin/nutch fetch -all
bin/nutch parse -all
When looking at the parse log, I'm seeing a bunch of different batch id
Subject: Re: Different batch id
Is there a specific place it's located? I turned on debugging, but I'm not
seeing a batch id.
On Mon, Jul 30, 2012 at 1:14 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Can you stick on debug logging and see what the batch ID's actually
Nope. I ran exactly the listed commands. And like I said, the ones that
show a different batch id were urls that I didn't inject. So no idea how
they got in there.
On Tue, Jul 31, 2012 at 1:44 PM, alx...@aim.com wrote:
Hi,
Most likely you ran the generate command a few times and did not run