how do I evenly distribute the fetchlists partitions?

2012-07-29 Thread nutch.bu...@gmail.com
Hi, I'm running the Generator on a crawldb that was injected with 1 million urls. I have no politeness issue (it's my own IIS server). I tried 3 things: 1. make URLPartitioner.java return the hashcode of the whole url (instead of the current host/domain method, because my host is the same for all
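
The first approach mentioned in this message, hashing the whole URL instead of the host, can be sketched roughly as below. This is a standalone illustration and not the actual Nutch URLPartitioner code; the class name UrlHashPartitionSketch, the method partitionFor, and the sample URLs are hypothetical.

```java
public class UrlHashPartitionSketch {

    // Hypothetical helper: map a full URL string to one of numPartitions buckets.
    static int partitionFor(String url, int numPartitions) {
        // Mask the sign bit so the modulo result is never negative.
        return (url.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 4;
        int[] counts = new int[numPartitions];

        // All URLs share one host (assumed name), so only the path differs.
        for (int i = 0; i < 1000; i++) {
            String url = "http://myserver/page" + i + ".html";
            counts[partitionFor(url, numPartitions)]++;
        }

        // Print how many URLs landed in each partition.
        for (int p = 0; p < numPartitions; p++) {
            System.out.println("partition " + p + ": " + counts[p] + " urls");
        }
    }
}
```

Because the hash covers the path as well as the host, URLs from a single host can spread over all partitions; how even the spread is depends on the hash function and the URL set.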

Re: how do I evenly distribute the fetchlists partitions?

2012-07-29 Thread Lewis John Mcgibbney
Hi, In nutch-site.xml you can override generate.max.count to specify the max number of urls you wish to allocate per single fetchlist. For your 1M URLs you can then logically partition the URLs as you see fit for this task. This will give you more granular control over how your fetchlists are

Re: custom generator

2012-07-29 Thread Lewis John Mcgibbney
Hi Nishant, On Sat, Jul 28, 2012 at 11:33 AM, nishant rathore nishant.rathor...@gmail.com wrote: I am new to nutch and building an application that uses nutch to crawl certain folders. Do you plan to crawl a local file system or the web graph? In the former AFAIK there is no way to tell

Re: how do I evenly distribute the fetchlists partitions?

2012-07-29 Thread nutch.bu...@gmail.com
Hi Lewis, from what I see in the code, generate.max.count is used for the maximum number of urls per host. I don't have a maximum number of urls per host because I only have one host. I'm using the topN and maxNumSegments. But all this is relevant for the first step of the generator, in which the
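
To illustrate the point made in this message, here is a minimal sketch of what a per-host cap does. It is not Nutch's Generator code; the class PerHostCapSketch, the method capPerHost, and the sample URLs are hypothetical and exist only to show the idea.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PerHostCapSketch {

    // Hypothetical helper: keep at most maxPerHost URLs for each host,
    // preserving the input order.
    static List<String> capPerHost(List<String> urls, int maxPerHost) {
        Map<String, Integer> seenPerHost = new HashMap<>();
        List<String> selected = new ArrayList<>();
        for (String url : urls) {
            String host = URI.create(url).getHost();
            // Count this URL against its host, then check the cap.
            int count = seenPerHost.merge(host, 1, Integer::sum);
            if (count <= maxPerHost) {
                selected.add(url);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            urls.add("http://myserver/page" + i + ".html");
        }
        // With a single host, the cap limits the total selected,
        // but says nothing about how the URLs are split into partitions.
        System.out.println(capPerHost(urls, 3));
    }
}
```

With only one host, a cap of this kind reduces how many URLs are selected overall; it does not by itself spread them across fetchlist partitions.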

Nutch 2.0 Solr 4.0 Alpha

2012-07-29 Thread X3C TECH
Hello, Has anyone been successful in hooking up Nutch 2 with Solr4? I seem to have my config screwed up somehow. I've added the Nutch fields to Solr's example schema and changed the field type from 'text' to 'text_general'. However, when I index, I get the message SolrIndexerJob:starting

Re: Nutch 2.0 Solr 4.0 Alpha

2012-07-29 Thread X3C TECH
Forgot to do Specs VMWare Machine with CentOS 6.3 On Sun, Jul 29, 2012 at 1:53 PM, X3C TECH t...@x3chaos.com wrote: Hello, Has anyone been successful in hooking up Nutch 2 with Solr4? I seem to have my config screwed up somehow. I've added the Nutch fields to Solr's example schema and

Re: Nutch 2.0 Solr 4.0 Alpha

2012-07-29 Thread alxsss
Which storage do you use? Try solrindex with option -reindex. -Original Message- From: X3C TECH t...@x3chaos.com To: user user@nutch.apache.org Sent: Sun, Jul 29, 2012 10:58 am Subject: Re: Nutch 2.0 Solr 4.0 Alpha Forgot to do Specs VMWare Machine with CentOS 6.3 On Sun, Jul 29,

Re: Nutch 2.0 Solr 4.0 Alpha

2012-07-29 Thread Lewis John Mcgibbney
Hi Iggy, We usually start with asking what the log from your solr server is saying? Additionally, did you know there is another schema for Nutch which 'should' work with Solr 4. Please see below http://svn.apache.org/repos/asf/nutch/branches/2.x/conf/schema-solr4.xml Lewis On Sun, Jul 29,

Re: Nutch 2.0 Solr 4.0 Alpha

2012-07-29 Thread X3C TECH
Hi Alxsss, I have, with the same result. I'm using HBase, which seems to work: if I run list, it shows the parsed pages, timestamps and such. On Sun, Jul 29, 2012 at 11:03 AM, alx...@aim.com wrote: Which storage do you use? Try solrindex with option -reindex. -Original Message-

Re: Nutch 2.0 Solr 4.0 Alpha

2012-07-29 Thread X3C TECH
Hi Lewis, Thanks for below, I just ran it on the new schema. Funny thing is in Solr's example/logs directory there are no files at all, so I'm wondering if Nutch is even hitting Solr. I ran a new crawl, and I'm now getting a Null Pointer at indexing point I assume (right after parse). This is the

Re: Nutch 2.0 Solr 4.0 Alpha

2012-07-29 Thread Lewis John Mcgibbney
Hi, On Sun, Jul 29, 2012 at 7:33 PM, X3C TECH t...@x3chaos.com wrote: I ran a new crawl, and I'm now getting a Null Pointer at indexing point I assume (right after parse). This is the hadoop log dump (end of it, as the whole dump is large) What exactly are you doing here? inject, fetch,

Re: Nutch 2.0 Solr 4.0 Alpha

2012-07-29 Thread Lewis John Mcgibbney
Sorry On Sun, Jul 29, 2012 at 7:53 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: What exactly are you doing here? inject, fetch, parse... then what? inject, generate, fetch, parse...

Re: Nutch 2.0 Solr 4.0 Alpha

2012-07-29 Thread Mathijs Homminga
Hi, Looking at the code, it looks like your batchId is null. Not sure how that can happen (since the SolrIndexerJob does check arguments). Have you tried to call the SolrIndexerJob alone (outside the Crawler tool)? Please do so and post commandline / nutch config / logs. Cheers, Mathijs On

Re: Javadoc incorrect or missing code in 1.5.1 Generator

2012-07-29 Thread Sebastian Nagel
Hi Lewis, the javadoc obviously belongs to the first method generate(Path, Path, int, long, long) This method also uses the two properties generate.filter and generate#normalise. But this method is only referenced by Crawl#run and Benchmark. The third method (with the javadoc) is used by

Re: Nutch 2.0 Solr 4.0 Alpha

2012-07-29 Thread X3C TECH
Lewis, I just ran the crawl command with the solrindex argument. I'll try to rerun it with single commands On Sun, Jul 29, 2012 at 2:53 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Sorry On Sun, Jul 29, 2012 at 7:53 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:

Re: Nutch 2.0 Solr 4.0 Alpha

2012-07-29 Thread X3C TECH
Mathijs, I am going to try that now On Sun, Jul 29, 2012 at 3:19 PM, Mathijs Homminga mathijs.hommi...@kalooga.com wrote: Hi, Looking at the code, it looks like your batchId is null. Not sure how that can happen (since the SolrIndexerJob does check arguments). Have you tried to call the

Re: Nutch 2.0 Solr 4.0 Alpha

2012-07-29 Thread X3C TECH
Looking further through Solr's terminal output (as it's still not writing logs), I found SEVERE: org.apache.solr.common.SolrException: ERROR: [doc=org.apache.wiki:http/nutch/] unknown field 'site' Looking in the schema for 4.0 from the svn link, there is in fact no 'site' field. It seems this is where