Re: Nutch 1.5 - Error: Java heap space during MAP step of CrawlDb update

2012-06-20 Thread Ferdy Galema
Do you have a stacktrace of a failed task? On Wed, Jun 20, 2012 at 3:08 AM, sidbatra siddharthaba...@gmail.com wrote: I'm using Nutch 1.5 to crawl 30 sites in deploy mode on Amazon Elastic Map Reduce with 30 m1.small machines with the following settings: Parameter Value

Re: Nailing Down Nutch Parser Plugin Configuration

2012-06-20 Thread Lewis John Mcgibbney
What makes you think something is wrong with plugin.xml? I take it you have both ivy and build.xml correctly configured as well? On Tue, Jun 19, 2012 at 5:29 PM, jcfol...@pureperfect.com wrote: Any thoughts on this? I switched to using HTMLParseFilter since it seemed like a more appropriate
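For reference, a Nutch plugin descriptor for an HtmlParseFilter extension typically follows the sketch below; the plugin id, jar name, and class names are illustrative placeholders, not details from this thread:

    <?xml version="1.0" encoding="UTF-8"?>
    <plugin id="parse-myfilter" name="My Parse Filter"
            version="1.0.0" provider-name="org.example">
      <runtime>
        <!-- jar produced by the plugin's build.xml; the name must match -->
        <library name="parse-myfilter.jar">
          <export name="*"/>
        </library>
      </runtime>
      <requires>
        <import plugin="nutch-extensionpoints"/>
      </requires>
      <!-- register the class against the HtmlParseFilter extension point -->
      <extension id="org.example.parse.myfilter" name="My HtmlParseFilter"
                 point="org.apache.nutch.parse.HtmlParseFilter">
        <implementation id="MyParseFilter" class="org.example.parse.MyParseFilter"/>
      </extension>
    </plugin>

The plugin id must also be matched by the plugin.includes regex in nutch-site.xml before Nutch will load it.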

Re: Can't retrieve Tika parser for mime-type text/csv

2012-06-20 Thread Olivier LEVILLAIN
Up, please.

Re: Can't retrieve Tika parser for mime-type text/csv

2012-06-20 Thread Lewis John Mcgibbney
I can't get the URL with the parserchecker Have you tried application/csv instead? I would have thought parse-tika would have dealt with this anyway... On Wed, Jun 20, 2012 at 1:59 PM, Olivier LEVILLAIN olivier_levill...@coface.com wrote: Up, please.
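A quick way to test how a given mime type parses is the parsechecker tool; a sketch of forcing the type (flag names may vary slightly between releases, and the URL is illustrative):

    bin/nutch parsechecker -dumpText -forceAs application/csv http://example.com/data.csv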

Re: Can't retrieve Tika parser for mime-type text/csv

2012-06-20 Thread Olivier LEVILLAIN
Well, I didn't choose text/csv. Actually, I do not understand how Nutch chooses the mime type. For instance, for an RTF file it sometimes picks text/rtf and sometimes application/rtf (in exactly the same context). Is there a way to manually map file extensions to mime types with Nutch?
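Mime detection in Nutch 1.x is delegated to Tika, which weighs the server's Content-Type header, file-extension globs, and magic bytes, which may explain why the same file can come out as text/rtf or application/rtf. Extension-to-type mappings live in conf/tika-mimetypes.xml using standard Tika syntax; a sketch of a glob entry:

    <mime-type type="text/csv">
      <glob pattern="*.csv"/>
    </mime-type>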

RE: HTTP REFERER is missing

2012-06-20 Thread SebaZ
Markus Jelsma-2 wrote: Nutch cannot do this by default, and it is tricky to implement because there may not be one unique referrer per page. I don't really need a unique referrer. All I want is to inform the requested server of the URL on which the crawler found the link. There is a site whose admin informed me

RE: Nailing Down Nutch Parser Plugin Configuration

2012-06-20 Thread jcfolsom
I got it working actually. Thanks! Original Message Subject: Re: Nailing Down Nutch Parser Plugin Configuration From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com Date: Wed, June 20, 2012 6:53 am To: user@nutch.apache.org What makes you think something is wrong

robots.txt, disallow: with empty string

2012-06-20 Thread Magnús Skúlason
Hi, I have noticed that my Nutch crawler skips many sites with robots.txt files that look something like this:

    User-agent: *
    Disallow: /administrator/
    Disallow: /classes/
    Disallow: /components/
    Disallow: /editor/
    Disallow: /images/
    Disallow: /includes/
    Disallow: /language/
    Disallow: /mambots/
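For reference, under the robots exclusion standard an empty Disallow value excludes nothing, so a record like the one below permits the entire site; a crawler that skips such sites is effectively misreading the empty string as Disallow: /.

    User-agent: *
    Disallow: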

Re: Nutch 1.5 - Error: Java heap space during MAP step of CrawlDb update

2012-06-20 Thread sidbatra
Thanks for the reply. The MAP tasks are the ones failing, and most of them simply fail with:

    attempt_201206200559_0032_m_000313_0 task_201206200559_0032_m_000313 10.76.89.196 FAILED Error: Java heap space

Some of the MAP tasks have a trace as follows: attempt_201206200559_0032_m_000322_1
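A common remedy for heap exhaustion in map tasks on small instances is to raise the per-task JVM heap, e.g. in mapred-site.xml; the value below is only a plausible starting point for m1.small machines, not a setting taken from this thread:

    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx512m</value>
    </property>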

Re: Can't retrieve Tika parser for mime-type text/csv

2012-06-20 Thread Lewis John Mcgibbney
Hi Olivier, On Wed, Jun 20, 2012 at 2:29 PM, Olivier LEVILLAIN olivier_levill...@coface.com wrote: Actually, I do not understand how Nutch chooses the mime type. For instance, for an RTF file it sometimes picks text/rtf and sometimes application/rtf (in exactly the same context). Do you

Re: nutch-2.0 updatedb and parse commands

2012-06-20 Thread Lewis John Mcgibbney
Hi Alex, On Tue, Jun 19, 2012 at 6:49 PM, alx...@aim.com wrote: In the 1.X versions there is a -noAdditions option to the updatedb command and an -adddays option to the generate command. How can something similar be done in the 2.X version? Yeah, I suppose that this functionality could be added to Nutch
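For context, the 1.x invocations under discussion look roughly like this (paths and values are illustrative):

    bin/nutch updatedb crawl/crawldb crawl/segments/20120620123456 -noAdditions
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000 -adddays 27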

Nutch 1.5 On Hadoop?

2012-06-20 Thread jcfolsom
Hi All, I am trying to get Nutch running with some custom plugins on top of HDFS. It seems like in the runtime/deploy directory there is only a single .job file and a bin/nutch. I renamed the job to nutch-1.5.job as suggested in sidbatra's post on 6/18/12, but now I am getting: Caused by:
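In deploy mode the .job file is what gets submitted to the cluster; assuming a standard Nutch 1.5 layout, the manual equivalent is something like the line below (directory names are illustrative). The bundled runtime/deploy/bin/nutch script performs the same kind of hadoop jar submission when it finds the hadoop executable:

    hadoop jar nutch-1.5.job org.apache.nutch.crawl.Crawl urls -dir crawl -depth 3 -topN 1000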

Re: Nutch as mirroring tool

2012-06-20 Thread Lewis John Mcgibbney
Hi Vlad, On Mon, Jun 18, 2012 at 2:58 PM, Vlad Paunescu vlad.paune...@gmail.com wrote: - create a local directory structure which resembles the remote structure: is there any elegant way of using the existing Nutch API to accomplish this, or do I need to manually create the structure from the
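There is no ready-made mirroring API in Nutch, so the manual route means reading fetched content back out of a segment and writing it into a URL-shaped local tree. A minimal, untested Java sketch under that assumption (the class name and path handling are hypothetical; segment content directories store URL keys as Text and page bodies as Content records):

    import java.io.File;
    import java.io.FileOutputStream;
    import java.net.URL;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.protocol.Content;

    public class SegmentMirror {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // args[0] = segment directory, args[1] = local mirror root
        Path data = new Path(args[0], "content/part-00000/data");
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
        Text url = new Text();
        Content content = new Content();
        while (reader.next(url, content)) {
          URL u = new URL(url.toString());
          // map directory-like URLs onto an index.html file
          String p = u.getPath();
          if (p.isEmpty()) p = "/";
          if (p.endsWith("/")) p += "index.html";
          File out = new File(args[1] + "/" + u.getHost() + p);
          out.getParentFile().mkdirs();
          FileOutputStream os = new FileOutputStream(out);
          os.write(content.getContent());
          os.close();
        }
        reader.close();
      }
    }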

Re: Nutch 1.5 On Hadoop?

2012-06-20 Thread Lewis John Mcgibbney
What Configuration bean settings are you using for plugin.includes? Are there any unusual settings? Have you tried running a test crawl without your custom plugins to ensure that the core Nutch functionality is working OK? On Wed, Jun 20, 2012 at 9:04 PM, jcfol...@pureperfect.com wrote: Hi
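For comparison, the stock plugin.includes in nutch-default.xml is close to the value below; a custom plugin id has to be added to the regex (parse-myfilter here is a placeholder) via an override in nutch-site.xml:

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(html|tika|myfilter)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>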

Nutch and Solr Redundancy

2012-06-20 Thread Oakage
Okay, I've just started researching Nutch and understand that Nutch indexes its crawl and Solr indexes the documents it is given. So my questions are: 1. When Nutch sends its crawled data to Solr, does Solr reindex it or use Nutch's index? 2. If Nutch's index is sufficient, then how would I process

Re: Nutch and Solr Redundancy

2012-06-20 Thread Lewis John Mcgibbney
Hi Oakage, On Wed, Jun 20, 2012 at 9:08 PM, Oakage hnn...@uw.edu wrote: Okay, I've just started researching Nutch and understand that Nutch indexes its crawl and Solr indexes the documents it is given. Not quite. Nutch crawls and sends documents to Solr for indexing. Nutch DOES NOT
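The division of labour shows up in the indexing step itself: with Nutch 1.x, crawled segments are pushed to a running Solr instance roughly as below (the URL and paths are illustrative, and the exact argument order varies between releases):

    bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*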

RE: Nutch 1.5 - Error: Java heap space during MAP step of CrawlDb update

2012-06-20 Thread Markus Jelsma
The log you provided doesn't look like the actual mapper log. Can you check it out? The job has output for the main class, but also separate logs for each map and reduce task. -Original message- From: sidbatra siddharthaba...@gmail.com Sent: Wed 20-Jun-2012 20:29 To:

RE: Nutch and Solr Redundancy

2012-06-20 Thread Markus Jelsma
-Original message- From:Lewis John Mcgibbney lewis.mcgibb...@gmail.com Sent: Wed 20-Jun-2012 22:23 To: user@nutch.apache.org Subject: Re: Nutch and Solr Redundancy Hi Oakage, On Wed, Jun 20, 2012 at 9:08 PM, Oakage hnn...@uw.edu wrote: Okay I've just started researching

RE: robots.txt, disallow: with empty string

2012-06-20 Thread Markus Jelsma
If you're sure Nutch treats an empty Disallow string the same as "/", then please file an issue in Jira so we can track and fix it. Thanks. -Original message- From: Magnús Skúlason magg...@gmail.com Sent: Wed 20-Jun-2012 18:36 To: nutch-u...@lucene.apache.org Subject: robots.txt, disallow: with

RE: HTTP REFERER is missing

2012-06-20 Thread Markus Jelsma
If you are looking for inlinks to 404 URLs but cannot find them in the LinkDB, it sounds like you should check the db.ignore.* configuration directives. IIRC the LinkDB will not populate internal links. -Original message- From: SebaZ sebastian.zaborow...@gmail.com Sent: Wed
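The directives in question are defined in nutch-default.xml and overridden in nutch-site.xml; for example, to keep internal links in the LinkDB (db.ignore.internal.links defaults to true in 1.x):

    <property>
      <name>db.ignore.internal.links</name>
      <value>false</value>
    </property>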

RE: Deleting file: urls from crawldb that give 404 status

2012-06-20 Thread Markus Jelsma
Sounds like: https://issues.apache.org/jira/browse/NUTCH-1245 Also, with a recent Nutch you can index with a -deleteGone flag. It behaves similarly to SolrClean but only acts on records you have just fetched. -Original message- From: webdev1977 webdev1...@gmail.com Sent: Tue 19-Jun-2012 21:40
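For reference, the flag rides on the normal solrindex invocation, along these lines (paths are illustrative, and as noted the flag is only present in recent releases):

    bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/20120619* -deleteGone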

RE: Nutch 1.5 - Error: Java heap space during MAP step of CrawlDb update

2012-06-20 Thread sidbatra
Amazon EMR returned the two logs below for the MAP tasks; every MAP task had one or the other. I'll try to pull the syslogs directly from the machines this time. Do you know how to get more detailed logs from Amazon EMR? I'm running it again with the same configuration.

Re: using less resources

2012-06-20 Thread alxsss
I was thinking of using the Last-Modified header, but it may be absent. In that case we could use the signature of URLs at indexing time. I took a look at the code; it seems this is implemented but not working. I tested nutch-1.4 with a single URL; solrindexer always sends the same number of documents to

RE: Nutch 1.5 - Error: Java heap space during MAP step of CrawlDb update

2012-06-20 Thread sidbatra
Ok, here are the syslogs from the individual machines. They all have a stack trace similar to this:

    2012-06-21 00:28:40,838 WARN org.apache.hadoop.conf.Configuration (main): DEPRECATED: hadoop-site.xml found in the classpath. Usage of hadoop-site.xml is deprecated. Instead use core-site.xml,