Re: Nutch script to crawl a whole domain

2012-08-09 Thread Niccolò Becchi
Hi, I think the best starting point could be: http://wiki.apache.org/nutch/Nutch_0.9_Crawl_Script_Tutorial You can modify the order of some steps. On Thu, Aug 9, 2012 at 1:26 AM, aabbcc wella_ge...@hotmail.it wrote: Hi, my problem is that I have a domain (e.g. http://*.apache.org) and I want to

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2012-08-09 Thread Ferdy Galema
Hi, Of course setting a bigger heap helps, but most of the time only temporarily. Can you see in the logs what type of documents are parsed? In the case of HTML documents crawled on the wild web, a single document can cause the heap to explode. By default the cyberneko parser (in HtmlParser) is

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2012-08-09 Thread Niccolò Becchi
Hi Ferdy, If you set these options on the JVM: -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/tmp then when you get the out-of-memory error, a heap dump taken at the instant of the problem is written to your filesystem. You can use http://www.eclipse.org/mat/ (an Eclipse extension) that is a
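
To confirm that the two options quoted above actually reached the JVM, a small check like the following can help. It is only a sketch: the class name `HeapDumpFlags` and its helper methods are hypothetical and not part of Nutch, and how the flags are passed to a given Nutch or Hadoop process depends on the deployment, which the message does not specify.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical helper (not part of Nutch): inspects JVM startup arguments
// for the heap dump options mentioned in the message above.
public class HeapDumpFlags {

    // True if the JVM was asked to write a heap dump on OutOfMemoryError.
    public static boolean hasHeapDumpOnOom(List<String> jvmArgs) {
        return jvmArgs.contains("-XX:+HeapDumpOnOutOfMemoryError");
    }

    // Returns the configured dump directory or file, or null if none was given.
    public static String heapDumpPath(List<String> jvmArgs) {
        String prefix = "-XX:HeapDumpPath=";
        for (String arg : jvmArgs) {
            if (arg.startsWith(prefix)) {
                return arg.substring(prefix.length());
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Arguments passed to this JVM at startup (excludes main() arguments).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("Heap dump on OOM enabled: " + hasHeapDumpOnOom(jvmArgs));
        System.out.println("Heap dump path: " + heapDumpPath(jvmArgs));
    }
}
```

Running it with `java -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/tmp HeapDumpFlags` should report the flag as enabled and the path as /var/tmp; the resulting .hprof file can then be opened in the Eclipse Memory Analyzer.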

Re: Happy 10th Birthday Nutch!

2012-08-09 Thread Ferdy Galema
Cheers! On Thu, Aug 9, 2012 at 9:56 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: Doug Cutting on twitter : https://twitter.com/cutting/status/233415059798372353 *RT @StefanGroschupf: Happy 10th birthday #Nutch! Registered at SourceForge August 2002. Turned out to be quite a game

Re: Nutch script to crawl a whole domain

2012-08-09 Thread Julien Nioche
The version of Nutch in the trunk has a useful crawl script in the bin dir which does all the typical steps of a crawl and sends the docs to SOLR for indexing at the end of each fetching round. The script is also more robust and can work in both local and deployed mode. HTH Julien On 9 August

RE: Happy 10th Birthday Nutch!

2012-08-09 Thread Markus Jelsma
Nice! -Original message- From:Ferdy Galema ferdy.gal...@kalooga.com Sent: Thu 09-Aug-2012 10:12 To: user@nutch.apache.org Cc: d...@nutch.apache.org Subject: Re: Happy 10th Birthday Nutch! Cheers! On Thu, Aug 9, 2012 at 9:56 AM, Julien Nioche lists.digitalpeb...@gmail.com

Nutch 2 encoding

2012-08-09 Thread Ake Tangkananond
Hi all, I just wonder if Nutch 2 is working fine with non-English characters in your deployment? Thai used to work fine for me in Nutch 1.5 but not in Nutch 2. Did I miss something? Is there anything I should check? Sorry for the silly questions, but thank you in advance. ;-) Regards, Ake

Re: Nutch 2 encoding

2012-08-09 Thread Ferdy Galema
It depends on the datastore and possibly the server. What store are you using? On Thu, Aug 9, 2012 at 4:05 PM, Ake Tangkananond iam...@gmail.com wrote: Hi all, I just wonder if Nutch 2 is working fine with non-English characters in your deployment? Thai used to work fine for me in

Re: Nutch 2 encoding

2012-08-09 Thread Ake Tangkananond
Hi, Sorry for the late reply. I was trying to figure it out myself but had no luck. I'm on HBase with a local deploy, version 0.90.6, r1295128, the working version as stated in the wiki: http://wiki.apache.org/nutch/Nutch2Tutorial Regards, Ake Tangkananond On 8/9/12 10:30 PM, Ferdy Galema

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2012-08-09 Thread Bai Shen
It was crawling HTML files when it started throwing the exception. Unfortunately, I didn't keep copies of the files or URLs. On Thu, Aug 9, 2012 at 3:07 AM, Ferdy Galema ferdy.gal...@kalooga.com wrote: Hi, Of course setting a bigger heap helps, but most of the time only temporarily. Can

Re: Nutch 2 encoding

2012-08-09 Thread Ake Tangkananond
Hi, I'm debugging. I inserted code to print out the encoding in the getParse function of HtmlParser.java and it printed utf-8. So I think it might be a data store problem. What else could be the cause? Could you advise what I should try next to have my Thai chars stored correctly in HBase?
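
One way to narrow this kind of problem down is to check whether the text survives a round trip through bytes with the charsets involved. The sketch below is an illustration only: the class `Utf8RoundTrip` is hypothetical, it does not touch Nutch or HBase, and a charset mismatch is merely one possible cause that this thread does not confirm.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration (not Nutch or HBase code): shows how Thai text
// is damaged when bytes are written with one charset and read with another.
public class Utf8RoundTrip {

    // Encodes text to bytes, decodes it again, and reports whether it survived.
    public static boolean survives(String text, Charset encode, Charset decode) {
        byte[] stored = text.getBytes(encode);
        return text.equals(new String(stored, decode));
    }

    public static void main(String[] args) {
        // The Thai phrase "phasa thai", written with Unicode escapes so the
        // example does not depend on the source file's encoding.
        String thai = "\u0e20\u0e32\u0e29\u0e32\u0e44\u0e17\u0e22";

        // Same charset on both sides: the text is preserved.
        System.out.println(survives(thai, StandardCharsets.UTF_8, StandardCharsets.UTF_8));
        // Written as UTF-8 but read as ISO-8859-1: the text is garbled.
        System.out.println(survives(thai, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1));
        // Written with a charset that cannot represent Thai: characters are lost.
        System.out.println(survives(thai, StandardCharsets.ISO_8859_1, StandardCharsets.ISO_8859_1));
    }
}
```

If the parser reports utf-8 but the stored value is wrong, comparing the raw bytes at each step (after parsing, before the write, after the read) against `thai.getBytes(StandardCharsets.UTF_8)` shows where the text first changes.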

Re: Nutch 2 encoding

2012-08-09 Thread alxsss
Hi, I use hbase-0.92.1 and do not have problems with utf-8 chars. What exactly is your problem? Alex. -Original Message- From: Ake Tangkananond iam...@gmail.com To: user user@nutch.apache.org Sent: Thu, Aug 9, 2012 11:12 am Subject: Re: Nutch 2 encoding Hi, I'm debugging. I

Re: Happy 10th Birthday Nutch!

2012-08-09 Thread Lewis John Mcgibbney
Nice one Julien. I'm going to update the site with this as it's a pretty huge milestone @Apache and a lot of projects and current developers owe a lot to the great work done by all you guys over the years. Thank you for sharing. Lewis On Thu, Aug 9, 2012 at 8:56 AM, Julien Nioche

Re: CHM Files and Tika

2012-08-09 Thread Sebastian Nagel
Hi Jan, confirmed: Nutch cannot parse, while Tika (same version used by Nutch) can parse chm. The chm parsers are in tika-parser*.jar which is contained in the Nutch package. Any ideas? Sebastian On 08/08/2012 12:03 PM, Jan Riewe wrote: Hey there, i try to parse CHM (Microsoft Help Files)

cache field in index-basic in 2.X

2012-08-09 Thread Lewis John Mcgibbney
Hi, Can someone please explain to me exactly what the caching field is actually caching in index-basic? I see the various fields in o.a.n.metadata.Nutch e.g. CACHING_FORBIDDEN_ALL, CACHING_FORBIDDEN_CONTENT, etc. but I am still not sure how the functionality works or indeed what the 'what' actually is!

SolrIndex command

2012-08-09 Thread marora
Hi there, I am a new Nutch user. I am using Nutch to crawl and then send crawl data to Solr. I have a question about the bin/nutch solrindex command. Which Tika libraries are used for indexing: the Tika libraries in Nutch, or does Nutch let Solr do the indexing so that Solr's Tika libraries are used? I think I

Re: Happy 10th Birthday Nutch!

2012-08-09 Thread Sebastian Nagel
Hi, I just discovered this nice but really old site http://nutch.sourceforge.net/docs/en/ with translations into a dozen languages. The proposition is still challenging. Sebastian On 08/09/2012 10:31 PM, Lewis John Mcgibbney wrote: Nice one Julien. I'm going to update the site with this

RE: CHM Files and Tika

2012-08-09 Thread Markus Jelsma
Hmm, I'm not sure, but maybe we don't include all Tika parser deps in our build.xml? -Original message- From:Sebastian Nagel wastl.na...@googlemail.com Sent: Thu 09-Aug-2012 23:18 To: user@nutch.apache.org Subject: Re: CHM Files and Tika Hi Jan, confirmed: Nutch cannot

Re: Happy 10th Birthday Nutch!

2012-08-09 Thread Mattmann, Chris A (388J)
Super cool. Proud to have been around since 2005 (7 of them!) :) Cheers, Chris On Aug 9, 2012, at 1:31 PM, Lewis John Mcgibbney wrote: Nice one Julien. I'm going to update the site with this as it's a pretty huge milestone @Apache and a lot of projects and current developers owe a lot to