Re: Removing urls from crawl db

2011-11-01 Thread Bai Shen
Already did that. But it doesn't allow me to delete URLs from the list to be crawled. On Tue, Nov 1, 2011 at 5:56 AM, Ferdy Galema ferdy.gal...@kalooga.com wrote: As for reading the crawldb, you can use org.apache.nutch.crawl.CrawlDbReader. This allows for dumping the crawldb into a
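For reference, CrawlDbReader is what backs the `readdb` command of the Nutch distribution; a sketch of how it is typically invoked (the paths are illustrative, not taken from the thread):

```sh
# Dump the whole crawldb as text for inspection (output directory is illustrative)
bin/nutch readdb crawl/crawldb -dump crawldb_dump

# Or look up the status of a single URL
bin/nutch readdb crawl/crawldb -url http://example.com/page.html
```

Note that, as Bai Shen points out, dumping only reads the crawldb; it does not remove entries from it.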

Re: Removing urls from crawl db

2011-11-01 Thread alxsss
I think you must add a regex to regex-urlfilter.txt. In that case those urls will not be fetched by the fetcher. -Original Message- From: Bai Shen baishen.li...@gmail.com To: user user@nutch.apache.org Sent: Tue, Nov 1, 2011 10:35 am Subject: Re: Removing urls from crawl db Already did
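For readers unfamiliar with how regex-urlfilter.txt behaves, here is a minimal Python sketch of the rule semantics (first matching rule wins; a leading `-` rejects, a leading `+` accepts, and a URL matching no rule is rejected). The rule patterns below are illustrative, not from the thread:

```python
import re

def url_filter(url, rules):
    """Apply Nutch-style regex-urlfilter rules.

    The first rule whose pattern matches the URL decides its fate:
    '+' keeps the URL, '-' drops it. URLs that match no rule are
    dropped, mirroring Nutch's RegexURLFilter default.
    """
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == '+'
    return False

# Illustrative rules: reject one path prefix, accept everything else.
rules = [
    r"-^http://example\.com/old/",
    r"+.",
]

print(url_filter("http://example.com/old/page.html", rules))  # rejected by the '-' rule
print(url_filter("http://example.com/new/page.html", rules))  # accepted by the '+' rule
```

As the rest of the thread makes clear, this only stops the fetcher from fetching matching URLs; removing them from the crawldb itself requires applying the filter during updatedb.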

Multiple values encountered for non multivalued field

2011-11-01 Thread Bai Shen
I'm getting an exception when I try to commit to Solr. Looking at the Solr log, it's showing that title is getting multiple values when it's not a multivalued field. None of my code does anything with the title, so I'm not sure why this is happening. How can I look at the pending commit and

Crawler stuck, crashes after fatal error in JRE

2011-11-01 Thread Sudip Datta
Hi, My problem might not be suitable for the nutch mailing list, but I asked on the Java mailing lists to no avail and wonder if someone here has experienced the same. I am trying to crawl several hosts using Nutch (1.4) and storing content on Solr with one host per index (core). I had posted this

Re: Multiple values encountered for non multivalued field

2011-11-01 Thread Bai Shen
It looks like the issue I'm encountering is the same one as here. http://lucene.472066.n3.nabble.com/multiple-values-encountered-for-non-multiValued-field-title-td1446817.html I'm not really sure what the linked bug is since that involves the HTML parser and I'm seeing this problem with a PDF

Re: Multiple values encountered for non multivalued field

2011-11-01 Thread Markus Jelsma
This should work around the problem in most cases. The parser can output two titles of which one is actually empty. This patch (in 1.4) skips empty titles. If this doesn't work you really have two _valid_ titles coming from your document. https://issues.apache.org/jira/browse/NUTCH-1004 It
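The NUTCH-1004 patch Markus references essentially drops empty title values before they reach the index. A rough Python sketch of that idea, under the assumption that the parser hands back a list of title strings (the real fix lives in Nutch's Java indexing code; this is only an illustration):

```python
def collapse_titles(titles):
    """Keep only non-empty titles, mimicking the NUTCH-1004 workaround
    of skipping empty title values before indexing.

    Returns the single surviving title (or "" if none survive), and
    raises if the document genuinely has several distinct non-empty
    titles, since Solr rejects those for a non-multivalued field.
    """
    non_empty = [t for t in titles if t and t.strip()]
    if len(non_empty) > 1:
        raise ValueError("multiple valid titles: %r" % non_empty)
    return non_empty[0] if non_empty else ""

print(collapse_titles(["My Document", ""]))  # the empty title is dropped
```

If the exception persists after the patch, the document really does carry two valid titles, which is the case the sketch's ValueError represents.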

Re: Crawler stuck, crashes after fatal error in JRE

2011-11-01 Thread Markus Jelsma
Are you using any non-default or experimental JVM options? I've never seen this happening anywhere with standard Sun JVMs. Hi, My problem might not be suitable for the nutch mailing list but I asked on java mailing lists but to no avail and wonder if someone here has experienced the same.

Re: Crawler stuck, crashes after fatal error in JRE

2011-11-01 Thread Markus Jelsma
Hmm, it may also be a memory problem. You have both Nutch and Tomcat + Solr running on the same machine with limited RAM? 4GB allocated to Nutch and how much to Tomcat? Remember that file descriptors take memory too, it adds up significantly if there are many. Both Tomcat + Solr and Nutch can
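If memory pressure turns out to be the cause, the heap given to Nutch's local scripts can be bounded via the NUTCH_HEAPSIZE environment variable (value in megabytes), leaving headroom for Tomcat and Solr on the same box. The numbers and paths below are illustrative:

```sh
# Cap Nutch's JVM heap at roughly 2 GB (illustrative), leaving RAM for Tomcat + Solr
export NUTCH_HEAPSIZE=2000
bin/nutch crawl urls -dir crawl -depth 3
```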

Re: Removing urls from crawl db

2011-11-01 Thread Markus Jelsma
I think you must add a regex to regex-urlfilter.txt . In that case those urls will not be fetched by fetcher. Yes, but if you use it when doing updatedb, it will disappear from the crawldb entirely. -Original Message- From: Bai Shen baishen.li...@gmail.com To: user
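To make Markus's point concrete: a reject rule in regex-urlfilter.txt, combined with filtering at updatedb time, removes matching URLs from the crawldb itself rather than merely skipping them at fetch time. A sketch with an illustrative pattern:

```
# regex-urlfilter.txt fragment -- reject URLs we no longer want (pattern is illustrative)
-^http://example\.com/unwanted/

# accept anything else
+.
```

Running `bin/nutch updatedb` with its `-filter` option then applies these rules to existing crawldb entries as well, dropping the matches.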

Re: Crawler stuck, crashes after fatal error in JRE

2011-11-01 Thread Sudip Datta
No. This is standard Sun (Oracle) JVM (Java version 1.6.0_27). I even tried with 1.6.0_24 but with the same effect. Only the time it takes for the crawler to hang and the JVM to crash varies. But then, it varies even between different runs. On Wed, Nov 2, 2011 at 2:05 AM, Markus Jelsma

Re: Multiple values encountered for non multivalued field

2011-11-01 Thread Bai Shen
I'm running the latest version of 1.4. We just rebuilt it last week. Is that patch included? And where would it get multiple titles from? How do I tell what the titles are so I can see if they're valid or not? On Tue, Nov 1, 2011 at 4:33 PM, Markus Jelsma markus.jel...@openindex.io wrote:

Re: Multiple values encountered for non multivalued field

2011-11-01 Thread Lewis John Mcgibbney
Hi, Just as a side note, the latest 1.4 development version can be found in the trunk SVN repository https://svn.apache.org/repos/asf/nutch/trunk/ On Tue, Nov 1, 2011 at 8:47 PM, Bai Shen baishen.li...@gmail.com wrote: I'm running the latest version of 1.4. We just rebuilt it last week. Is that

Re: Removing urls from crawl db

2011-11-01 Thread Bai Shen
It seems like there would be a better way to do that. I thought 1.4 was going to have a Luke-style capability with regard to its data? On Tue, Nov 1, 2011 at 4:45 PM, Markus Jelsma markus.jel...@openindex.io wrote: I think you must add a regex to regex-urlfilter.txt . In that case those

Re: Removing urls from crawl db

2011-11-01 Thread Markus Jelsma
It seems like there would be a better way to do that. The problem is that there are many files storing URLs: CrawlDB, LinkDB, WebGraph DBs, segment data. In Nutch 1.x there is no single place where you can find a URL. For example, if we find URL patterns we don't want we write additional

Question regarding meta tags

2011-11-01 Thread Praveen Adivi
Hi Guys, I am new to Nutch and I am trying to understand if we can crawl a website and index the content of the meta tags in the head section, and whether there is a way to pass this to Solr for indexing. -- Thanks and regards, Praveen Adivi Java Developer Yaskawa America Ext: 7232
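For context: later Nutch releases ship parse-metatags and index-metadata plugins for exactly this (at the time of this thread they may have required a patch). A hedged nutch-site.xml sketch, assuming those plugins are available and added to plugin.includes; the tag names chosen are illustrative:

```xml
<!-- nutch-site.xml fragment; assumes the parse-metatags and index-metadata
     plugins are enabled in plugin.includes -->
<property>
  <name>metatags.names</name>
  <value>description,keywords</value>
</property>
<property>
  <name>index.parse.md</name>
  <value>metatag.description,metatag.keywords</value>
</property>
```

With matching fields defined in the Solr schema, the extracted metatag values are then sent to Solr at indexing time.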

Re: Multiple values encountered for non multivalued field

2011-11-01 Thread Markus Jelsma
I'm running the latest version of 1.4 We just rebuilt it last week. Is that patch included? Yes, so you actually have more than one non-zero-length title coming from your parser. Please try the parsechecker tool and confirm, but I'm not sure it is capable of showing multiple titles.
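The parsechecker tool Markus mentions can be run against a single URL to see what the parser extracts; the URL below is illustrative:

```sh
# Show parse status and metadata for one document; -dumpText also prints the parsed text
bin/nutch parsechecker -dumpText http://example.com/document.pdf
```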

Re: Crawler stuck, crashes after fatal error in JRE

2011-11-01 Thread Markus Jelsma
Sounds like a memory issue. Can you check my other reply in this thread? No. This is standard Sun (Oracle) JVM (Java version 1.6.0_27). I even tried with 1.6.0_24 but with the same effect. Only the time it takes for the crawler to hang and jvm to crash varies. But then, it varies even between

De-duplication seems to work too aggressively

2011-11-01 Thread Arkadi.Kosmynin
Hi, I stopped using de-duplication in Nutch 0.9-1.2 versions because too many URLs were being removed for no apparent reason. I did not report the problem to the list though. I am working with version 1.4 now, tried de-duplication again, and the problem appears to be still there. There are