hi,
I'm running the Generator on a crawldb that was injected with 1 million
urls.
I have no politeness issue (it's my own IIS server).
I tried 3 things:
1. make the URLPartitioner.java return hashcode of the whole url (instead of
the current host/domain method, because my host is the same for all
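A minimal sketch of the idea in item 1, assuming a standalone helper rather than the real URLPartitioner class. The class name, method name, and URLs below are hypothetical and only illustrate hashing the whole URL when every URL shares one host:

```java
class WholeUrlPartitionSketch {
    // Hypothetical helper, not part of Nutch: picks a partition from the
    // hash of the full URL string instead of its host or domain.
    static int partitionFor(String url, int numPartitions) {
        // Mask the sign bit so the result is never negative.
        return (url.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 4;
        int[] counts = new int[numPartitions];
        // Made-up URLs that all share a single host.
        for (int i = 0; i < 1000; i++) {
            String url = "http://myserver.example/page" + i + ".html";
            counts[partitionFor(url, numPartitions)]++;
        }
        for (int p = 0; p < numPartitions; p++) {
            System.out.println("partition " + p + ": " + counts[p] + " urls");
        }
    }
}
```

With a host-based key, all of these URLs would fall into the same partition. Hashing the full URL lets them spread across partitions, at the cost of the per-host grouping that politeness normally relies on.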
Hi,
In nutch-site.xml you can override generate.max.count to specify the
max number of urls you wish to allocate per single fetchlist. For your
1M URLs you can then logically partition the URLs as you see fit for
this task. This will give you more granular control over how your
fetchlists are
Hi Nishant,
On Sat, Jul 28, 2012 at 11:33 AM, nishant rathore
nishant.rathor...@gmail.com wrote:
I am new to nutch and building an application that uses nutch to crawl
certain folders.
Do you plan to crawl a local file system or the web graph? In the
former AFAIK there is no way to tell
Hi lewis,
from what I see in the code, generate.max.count is used for maximum number
of urls per host.
I don't have a maximum number of urls per host because I only have one host.
I'm using the topN and maxNumSegments.
But all this is relevant for the first step of the generator, in which the
Hello,
Has anyone been successful in hooking up Nutch 2 with Solr4?
I seem to have my config screwed up somehow. I've added the Nutch fields to
Solr's example schema and changed the field type from 'text' to
text_general
However when I index, I get the message
SolrIndexerJob:starting
Forgot to do Specs
VMWare Machine with CentOS 6.3
On Sun, Jul 29, 2012 at 1:53 PM, X3C TECH t...@x3chaos.com wrote:
Hello,
Has anyone been successful in hooking up Nutch 2 with Solr4?
I seem to have my config screwed up somehow. I've added the Nutch fields
to Solr's example schema and
Which storage do you use? Try solrindex with option -reindex.
-Original Message-
From: X3C TECH t...@x3chaos.com
To: user user@nutch.apache.org
Sent: Sun, Jul 29, 2012 10:58 am
Subject: Re: Nutch 2.0 Solr 4.0 Alpha
Forgot to do Specs
VMWare Machine with CentOS 6.3
On Sun, Jul 29,
Hi Iggy,
We usually start with asking what the log from your solr server is saying?
Additionally, did you know there is another schema for Nutch which
'should' work with Solr 4. Please see below
http://svn.apache.org/repos/asf/nutch/branches/2.x/conf/schema-solr4.xml
Lewis
On Sun, Jul 29,
Hi Alxsss,
I have, with the same result. I'm using HBase, which seems to work: if I
run list, it shows the parsed pages, timestamps and such
On Sun, Jul 29, 2012 at 11:03 AM, alx...@aim.com wrote:
Which storage do you use? Try solrindex with option -reindex.
-Original Message-
Hi Lewis,
Thanks for below, I just ran it on the new schema. Funny thing is in Solr's
example/logs directory there are no files at all, so I'm wondering if Nutch
is even hitting Solr. I ran a new crawl, and I'm now getting a Null Pointer
at indexing point I assume (right after parse).
This is the
Hi,
On Sun, Jul 29, 2012 at 7:33 PM, X3C TECH t...@x3chaos.com wrote:
I ran a new crawl, and I'm now getting a Null Pointer
at indexing point I assume (right after parse).
This is the hadoop log dump (end of it, as the whole dump is large)
What exactly are you doing here? inject, fetch,
Sorry
On Sun, Jul 29, 2012 at 7:53 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
What exactly are you doing here? inject, fetch, parse... then what?
inject, generate, fetch, parse...
Hi,
Looking at the code, it looks like your batchId is null. Not sure how that can
happen (since the SolrIndexerJob does check arguments).
Have you tried to call the SolrIndexerJob alone (outside the Crawler tool)?
Please do so and post commandline / nutch config / logs.
Cheers,
Mathijs
On
Hi Lewis,
the javadoc obviously belongs to the first method
generate(Path, Path, int, long, long)
This method also uses the two properties generate.filter
and generate.normalise. But this method is only referenced
by Crawl#run and Benchmark.
The third method (with the javadoc) is used
by
Lewis,
I just ran the crawl command with the solrindex argument. I'll try to rerun
it with single commands
On Sun, Jul 29, 2012 at 2:53 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Sorry
On Sun, Jul 29, 2012 at 7:53 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Mathijs,
I am going to try that now
On Sun, Jul 29, 2012 at 3:19 PM, Mathijs Homminga
mathijs.hommi...@kalooga.com wrote:
Hi,
Looking at the code, it looks like your batchId is null. Not sure how that
can happen (since the SolrIndexerJob does check arguments).
Have you tried to call the
Looking further through Solr's terminal output, as it's still not writing
logs
I found
SEVERE: org.apache.solr.common.SolrException: ERROR:
[doc=org.apache.wiki:http/nutch/] unknown field 'site'
Looking in schema for 4.0 from the svn link, there is in fact no 'site'
field.. It seems this is where