Already did that. But it doesn't allow me to delete urls from the list to
be crawled.
On Tue, Nov 1, 2011 at 5:56 AM, Ferdy Galema <ferdy.gal...@kalooga.com> wrote:
As for reading the crawldb, you can use
org.apache.nutch.crawl.CrawlDbReader.
This allows for dumping the crawldb into a
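For example (a sketch assuming the standard bin/nutch script and example paths; adjust to your layout), CrawlDbReader is exposed through the readdb command:

```shell
# Dump the crawldb contents to plain text for inspection
bin/nutch readdb crawl/crawldb -dump crawldb-dump

# Or just print summary statistics about the crawldb
bin/nutch readdb crawl/crawldb -stats
```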
I think you must add a regex to regex-urlfilter.txt. That way those URLs
will not be fetched by the fetcher.
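A sketch of what such an entry might look like in conf/regex-urlfilter.txt (the example pattern and domain are hypothetical; rules are applied top to bottom and a leading - excludes matching URLs):

```
# Exclude everything under a host we no longer want crawled
-^http://unwanted\.example\.com/

# Default rule: accept everything else
+.
```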
-----Original Message-----
From: Bai Shen baishen.li...@gmail.com
To: user user@nutch.apache.org
Sent: Tue, Nov 1, 2011 10:35 am
Subject: Re: Removing urls from crawl db
Already did
I'm getting an exception when I try to commit to Solr. Looking at the Solr
log, it shows that title is getting multiple values even though it's not a
multiValued field. None of my code does anything with the title, so I'm not
sure why this is happening.
How can I look at the pending commit and
Hi,
My problem might not be suitable for the nutch mailing list, but I
asked on Java mailing lists to no avail, and wonder if someone here
has experienced the same.
I am trying to crawl several hosts using Nutch (1.4) and store the
content in Solr with one host per index (core). I had posted this
It looks like the issue I'm encountering is the same one as here.
http://lucene.472066.n3.nabble.com/multiple-values-encountered-for-non-multiValued-field-title-td1446817.html
I'm not really sure the linked bug applies, since that involves the HTML
parser and I'm seeing this problem with a PDF
This should work around the problem in most cases. The parser can output two
titles, one of which is actually empty. This patch (in 1.4) skips empty titles.
If this doesn't work, you really have two _valid_ titles coming from your
document.
https://issues.apache.org/jira/browse/NUTCH-1004
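The idea behind the patch can be sketched like this (a hedged illustration, not the actual NUTCH-1004 code; class and method names here are made up):

```java
import java.util.ArrayList;
import java.util.List;

public class TitleFilter {
    // Keep only non-null, non-blank titles. The NUTCH-1004 patch applies
    // the same idea when the parser emits an empty title alongside a real one.
    public static List<String> skipEmptyTitles(List<String> titles) {
        List<String> kept = new ArrayList<>();
        for (String t : titles) {
            if (t != null && !t.trim().isEmpty()) {
                kept.add(t);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> titles = List.of("", "Actual Title", "  ");
        System.out.println(skipEmptyTitles(titles)); // prints [Actual Title]
    }
}
```

With the empty title dropped, only one value reaches the non-multiValued Solr field.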
It
Are you using any non-default or experimental JVM options? I've never seen
this happen anywhere with standard Sun JVMs.
Hmm, it may also be a memory problem. You have both Nutch and Tomcat + Solr
running on the same machine with limited RAM? 4GB allocated to Nutch, and how
much to Tomcat?
Remember that file descriptors take memory too; it adds up significantly if
there are many. Both Tomcat + Solr and Nutch can
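A quick way to sanity-check these limits (the heap sizes below are illustrative assumptions, not recommendations; NUTCH_HEAPSIZE is read by the bin/nutch script in MB):

```shell
# Check the per-process open file descriptor limit
ulimit -n

# Keep the combined Nutch and Tomcat heaps well within physical RAM,
# e.g. on a 4GB machine don't give 4GB to Nutch and more to Tomcat on top
export NUTCH_HEAPSIZE=1500        # heap for bin/nutch, in MB
export CATALINA_OPTS="-Xmx1024m"  # Tomcat (Solr) heap
```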
I think you must add a regex to regex-urlfilter.txt. That way those
URLs will not be fetched by the fetcher.
Yes, but if you use it when doing updatedb, the URL will disappear from the
crawldb entirely.
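A sketch of that invocation (paths and segment name are examples; the -filter flag applies the configured URL filters during the update, though flag availability can vary by Nutch version):

```shell
# Apply URL filters while merging segment data back into the crawldb;
# URLs rejected by regex-urlfilter.txt are dropped from the crawldb
bin/nutch updatedb crawl/crawldb crawl/segments/20111101123456 -filter
```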
-----Original Message-----
From: Bai Shen baishen.li...@gmail.com
To: user
No. This is the standard Sun (Oracle) JVM (Java version 1.6.0_27). I even
tried 1.6.0_24, with the same effect. Only the time it takes for
the crawler to hang and the JVM to crash varies. But then, it varies even
between different runs.
On Wed, Nov 2, 2011 at 2:05 AM, Markus Jelsma
I'm running the latest version of 1.4. We just rebuilt it last week. Is
that patch included?
And where would it get multiple titles from? How do I tell what the titles
are so I can see if they're valid or not?
On Tue, Nov 1, 2011 at 4:33 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
Hi,
Just as a side note, the latest 1.4 development version can be found at
trunk SVN repository
https://svn.apache.org/repos/asf/nutch/trunk/
On Tue, Nov 1, 2011 at 8:47 PM, Bai Shen baishen.li...@gmail.com wrote:
I'm running the latest version of 1.4. We just rebuilt it last week. Is
that
It seems like there would be a better way to do that.
I thought 1.4 was going to have a Luke-style capability with regard to its
data?
On Tue, Nov 1, 2011 at 4:45 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
I think you must add a regex to regex-urlfilter.txt . In that case those
It seems like there would be a better way to do that.
The problem is that there are many files storing URLs: the CrawlDB, LinkDB,
WebGraph DBs, and segment data. In Nutch 1.x there is no single place where
you can find a URL.
For example, if we find URL patterns we don't want we write additional
Hi Guys,
I am new to Nutch and I am trying to understand whether we can
crawl a website, index the content of the meta tags in the head
section, and pass this to Solr for indexing.
--
Thanks and regards,
Praveen Adivi
Java Developer
Yaskawa America
Ext: 7232
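For reference, a sketch of the nutch-site.xml fragment that exposes head metatags for indexing (this assumes the parse-metatags and index-metadata plugins are available in your Nutch build, which may not be the case in every 1.x release; property names and plugin list are illustrative):

```xml
<!-- Enable the metatag parser and metadata indexer alongside the defaults -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|metatags)|index-(basic|metadata)|indexer-solr</value>
</property>
<!-- Which head metatags to extract -->
<property>
  <name>metatags.names</name>
  <value>description,keywords</value>
</property>
<!-- Which extracted parse metadata to pass on to the index (and Solr) -->
<property>
  <name>index.parse.md</name>
  <value>metatag.description,metatag.keywords</value>
</property>
```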
I'm running the latest version of 1.4. We just rebuilt it last week. Is
that patch included?
Yes, so you actually have more than one non-zero-length title coming from
your parser. Please try the parsechecker tool and confirm, but I'm not sure
it is capable of showing multiple titles.
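The parsechecker tool can be run like this (the URL is a placeholder; it fetches and parses a single page and prints the extracted parse data, including the title):

```shell
# Fetch, parse, and print the parse result for one URL
bin/nutch parsechecker http://example.com/document.pdf
```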
Sounds like a memory issue. Can you check my other reply in this thread?
Hi,
I stopped using de-duplication in Nutch 0.9-1.2 because too many URLs were
being removed for no apparent reason. I did not report the problem to the
list, though. I am now working with version 1.4, tried de-duplication again,
and the problem still appears to be there. There are