Do you have a stacktrace of a failed task?
On Wed, Jun 20, 2012 at 3:08 AM, sidbatra siddharthaba...@gmail.com wrote:
I'm using Nutch 1.5 to crawl 30 sites in deploy mode on Amazon Elastic Map
Reduce with 30 m1.small machines with the following settings:
Parameter Value
What makes you think something is wrong with plugin.xml?
I take it you have both ivy and build.xml correctly configured as well?
On Tue, Jun 19, 2012 at 5:29 PM, jcfol...@pureperfect.com wrote:
Any thoughts on this? I switched to using HTMLParseFilter since it
seemed like a more appropriate
Up, please
I can't get the URL with the parserchecker
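(For reference, a hedged sketch of how that checker is usually invoked in
1.x; the URL is only a placeholder and assumes the parsechecker target in
bin/nutch:)
  bin/nutch parsechecker http://www.example.com/some-file.csv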
Have you tried application/csv instead?
I would have thought parse-tika would have dealt with this anyway...
On Wed, Jun 20, 2012 at 1:59 PM, Olivier LEVILLAIN
olivier_levill...@coface.com wrote:
Up, please
Well, I didn't choose text/csv.
Actually, I do not understand how Nutch chooses the mime type. For instance,
for an RTF file it sometimes takes text/rtf and sometimes application/rtf
(in exactly the same context).
Is there a way to manually map file extensions to mime types with Nutch?
I
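(A hedged sketch of an extension-to-type mapping: the XML below is Tika's
standard custom-mimetypes.xml format; whether Nutch 1.5's mime detector picks
such a file up from the classpath, e.g. under conf/, is an assumption to
verify.)
  <mime-info>
    <mime-type type="text/csv">
      <!-- map the .csv extension explicitly to text/csv -->
      <glob pattern="*.csv"/>
    </mime-type>
    <mime-type type="application/rtf">
      <glob pattern="*.rtf"/>
    </mime-type>
  </mime-info>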
Markus Jelsma-2 wrote
Nutch cannot do this by default, and it is tricky to implement because there
may not be one unique referrer per page.
I don't really need a unique referrer. All I want is to inform the requested
server of the URL on which the crawler found the link.
There is a site whose admin informed me
I got it working actually. Thanks!
Original Message
Subject: Re: Nailing Down Nutch Parser Plugin Configuration
From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com
Date: Wed, June 20, 2012 6:53 am
To: user@nutch.apache.org
What makes you think something is wrong
Hi,
I have noticed that my nutch crawler skips many sites with robots.txt
files that look something like this:
User-agent: *
Disallow: /administrator/
Disallow: /classes/
Disallow: /components/
Disallow: /editor/
Disallow: /images/
Disallow: /includes/
Disallow: /language/
Disallow: /mambots/
thanks for the reply.
The MAP tasks are the ones failing and most of them simply fail with:
attempt_201206200559_0032_m_000313_0 task_201206200559_0032_m_000313
10.76.89.196 FAILED
Error: Java heap space
Some of the MAP tasks have a trace as follows:
attempt_201206200559_0032_m_000322_1
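(As an aside, a hedged sketch of raising the per-task heap on a Hadoop
1.x-style cluster such as EMR at the time; mapred.child.java.opts is the
standard property, but the value shown is an assumption and has to fit within
an m1.small's roughly 1.7 GB of RAM:)
  <!-- mapred-site.xml, or the job configuration -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx768m</value>
  </property>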
Hi Oliver,
On Wed, Jun 20, 2012 at 2:29 PM, Olivier LEVILLAIN
olivier_levill...@coface.com wrote:
Actually, I do not understand how Nutch chooses the mime type. For instance,
for an RTF file it sometimes takes text/rtf and sometimes application/rtf
(in exactly the same context).
Do you
Hi Alex,
On Tue, Jun 19, 2012 at 6:49 PM, alx...@aim.com wrote:
In the 1.x version there is a -noAdditions option for the updatedb command
and an -adddays option for generate. How can something similar be done in the
2.x version?
Yeah I suppose that this functionality could be added to Nutch
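(For context, a hedged sketch of the 1.x usage being referenced; paths are
placeholders:)
  # 1.x: update the CrawlDB without adding newly discovered URLs
  bin/nutch updatedb crawldb segments/20120619123456 -noAdditions
  # 1.x: add 7 days to the current time when selecting URLs due for fetching
  bin/nutch generate crawldb segments -topN 1000 -adddays 7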
Hi All,
I am trying to get Nutch running with some custom plugins on top of
HDFS.
It seems like in the runtime/deploy directory there is only a single
.job file and a bin/nutch. I renamed the job to nutch-1.5.job as
suggested in sidbatra's post on 6/18/12, but now I am getting:
Caused by:
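(A hedged sketch of the rename step being referenced; the original job file
name depends on the build, so apache-nutch-1.5.job is an assumption:)
  cd runtime/deploy
  mv apache-nutch-1.5.job nutch-1.5.job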
Hi Vlad,
On Mon, Jun 18, 2012 at 2:58 PM, Vlad Paunescu vlad.paune...@gmail.com wrote:
- create a local directory structure which resembles the remote structure:
is there any elegant way of using the existing Nutch API to accomplish
this, or I need to manually create the structure from the
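(Not the poster's code: a hedged sketch of mirroring a remote directory
layout locally with the Hadoop FileSystem API that Nutch builds on; the class
name and argument handling are made up for illustration.)
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class MirrorDirs {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem remote = FileSystem.get(conf);      // HDFS, per core-site.xml
      FileSystem local = FileSystem.getLocal(conf);  // local file system
      // args[0]: remote directory to mirror, args[1]: local target directory
      for (FileStatus status : remote.listStatus(new Path(args[0]))) {
        if (status.isDir()) {
          local.mkdirs(new Path(args[1], status.getPath().getName()));
        }
      }
    }
  }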
What Configuration bean settings are you using for plugin.includes?
Are there any unusual settings? Have you tried running a test crawl
without your custom plugins to ensure that the core Nutch
functionality is working OK?
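(As a hedged illustration of what a typical plugin.includes override looks
like in nutch-site.xml; the regex below is close to the stock 1.5 default
plus a hypothetical custom plugin, not the poster's actual setting:)
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)|my-custom-plugin</value>
  </property>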
On Wed, Jun 20, 2012 at 9:04 PM, jcfol...@pureperfect.com wrote:
Hi
Okay, I've just started researching Nutch and know that Nutch indexes its
crawl and Solr indexes the documents it is given.
So my questions are:
1. When Nutch sends its crawled data to Solr, does Solr reindex it or use
Nutch's index?
2. If Nutch's index is sufficient, then how would I process
Hi Oakage,
On Wed, Jun 20, 2012 at 9:08 PM, Oakage hnn...@uw.edu wrote:
Okay, I've just started researching Nutch and know that Nutch indexes its
crawl and Solr indexes the documents it is given.
Not quite. Nutch crawls and sends documents to Solr for indexing.
Nutch DOES NOT
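(For illustration, a hedged sketch of the 1.x step that hands crawled
segments to Solr for indexing; the URL and paths are placeholders, and the
exact argument order varies slightly between 1.x releases:)
  bin/nutch solrindex http://localhost:8983/solr/ crawldb linkdb segments/20120620123456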
The log you provided doesn't look like the actual mapper log. Can you check
that? The job has output for the main class, but also separate logs for each
map and reduce task.
-Original message-
From:sidbatra siddharthaba...@gmail.com
Sent: Wed 20-Jun-2012 20:29
To:
-Original message-
From:Lewis John Mcgibbney lewis.mcgibb...@gmail.com
Sent: Wed 20-Jun-2012 22:23
To: user@nutch.apache.org
Subject: Re: Nutch and Solr Redundancy
Hi Oakage,
On Wed, Jun 20, 2012 at 9:08 PM, Oakage hnn...@uw.edu wrote:
Okay I've just started researching
If you're sure Nutch treats an empty string the same as / then please file an
issue in Jira so we can track and fix it.
Thanks
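(For reference, the robots.txt semantics in question; per the robots
exclusion standard an empty Disallow value permits everything, while a bare
slash disallows the whole site:)
  # empty value: nothing is disallowed, the whole site may be crawled
  User-agent: *
  Disallow:

  # a bare slash: the entire site is disallowed
  User-agent: *
  Disallow: /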
-Original message-
From:Magnús Skúlason magg...@gmail.com
Sent: Wed 20-Jun-2012 18:36
To: nutch-u...@lucene.apache.org
Subject: robots.txt, disallow: with
If you are looking for inlinks to 404 URLs but cannot find them in the
LinkDB, it sounds like you should check the db.ignore.* configuration
directives. IIRC the LinkDB is not populated with internal links by default.
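(A hedged sketch of the relevant nutch-site.xml override; the property names
are the stock db.ignore.* directives, but check your version's defaults:)
  <property>
    <name>db.ignore.internal.links</name>
    <value>false</value>
  </property>
  <property>
    <name>db.ignore.external.links</name>
    <value>false</value>
  </property>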
-Original message-
From:SebaZ sebastian.zaborow...@gmail.com
Sent: Wed
Sounds like:
https://issues.apache.org/jira/browse/NUTCH-1245
Also, with a recent Nutch you can index with a -deleteGone flag. It behaves
similarly to SolrClean, but only on records you just fetched.
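(For illustration, a hedged example of that flag on the 1.x solrindex
command; the URL and paths are placeholders:)
  bin/nutch solrindex http://localhost:8983/solr/ crawldb linkdb segments/20120619123456 -deleteGone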
-Original message-
From:webdev1977 webdev1...@gmail.com
Sent: Tue 19-Jun-2012 21:40
Amazon EMR returned the two logs below in the MAP task logs. All MAP tasks
had one of the two logs below.
I'll try to grab the syslogs directly from the machines this time. Do you
know how to get more detailed logs from Amazon EMR? I'm running it again
with the same configuration.
I was thinking of using the Last-Modified header, but it may be absent. In
that case we could use the signature of URLs at indexing time. I took a look
at the code; it seems it is implemented but not working. I tested nutch-1.4
with a single URL, and the solrindexer always sends the same number of documents to
Ok, here are the syslogs from the individual machines. They all have a stack
trace similar to this
2012-06-21 00:28:40,838 WARN org.apache.hadoop.conf.Configuration (main):
DEPRECATED: hadoop-site.xml found in the classpath. Usage of hadoop-site.xml
is deprecated. Instead use core-site.xml,