date:20120809

Re: Nutch script to crawl a whole domain

2012-08-09 Thread Niccolò Becchi

Hi, I think the best starting point could be: http://wiki.apache.org/nutch/Nutch_0.9_Crawl_Script_Tutorial You can modify the order of some steps. On Thu, Aug 9, 2012 at 1:26 AM, aabbcc wella_ge...@hotmail.it wrote: Hi, my problem is that I have a domain (e.g. http://*.apache.org) and I want to

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2012-08-09 Thread Ferdy Galema

Hi, Of course setting a bigger heap helps, but most of the time only temporarily. Can you see in the logs what type of documents are parsed? In the case of HTML documents crawled on the wild web, a single document can cause the heap to explode. By default the cyberneko parser (in HtmlParser) is

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2012-08-09 Thread Niccolò Becchi

Hi Ferdy, If you run the JVM with these options: -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/tmp then, when the out-of-memory error occurs, you get a file on your filesystem with a heap dump taken at the instant of the problem. You can use http://www.eclipse.org/mat/ (an Eclipse extension) that is a
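As a minimal sketch (not part of the original thread), the snippet below checks whether the running JVM was actually started with the two heap dump options named above. It only inspects the startup arguments through the standard java.lang.management API; the class name HeapDumpFlagCheck and its helper methods are hypothetical names chosen for illustration, and options set by other means (for example at runtime) would not show up in this check.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical helper: reports whether the heap dump options from the
// message above appear among the JVM startup arguments.
class HeapDumpFlagCheck {

    // True if -XX:+HeapDumpOnOutOfMemoryError is among the given arguments.
    static boolean hasHeapDumpOnOom(List<String> jvmArgs) {
        return jvmArgs.contains("-XX:+HeapDumpOnOutOfMemoryError");
    }

    // Returns the value of -XX:HeapDumpPath=..., or null if it was not given.
    static String heapDumpPath(List<String> jvmArgs) {
        String prefix = "-XX:HeapDumpPath=";
        for (String arg : jvmArgs) {
            if (arg.startsWith(prefix)) {
                return arg.substring(prefix.length());
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Arguments passed to the JVM itself (not the program arguments).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("Heap dump on OOM enabled: " + hasHeapDumpOnOom(jvmArgs));
        System.out.println("Heap dump path: " + heapDumpPath(jvmArgs));
    }
}
```

Running it as `java -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/tmp HeapDumpFlagCheck` should report the flag as enabled; how these options reach the JVM in a given Nutch or Hadoop deployment depends on the setup and is not covered here.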

Re: Happy 10th Birthday Nutch!

2012-08-09 Thread Ferdy Galema

Cheers! On Thu, Aug 9, 2012 at 9:56 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: Doug Cutting on twitter : https://twitter.com/cutting/status/233415059798372353 *RT @StefanGroschupf: Happy 10th birthday #Nutch! Registered at SourceForge August 2002. Turned out to be quite a game

Re: Nutch script to crawl a whole domain

2012-08-09 Thread Julien Nioche

The version of Nutch in the trunk has a useful crawl script in the bin dir which does all the typical steps of a crawl and sends the docs to SOLR for indexing at the end of each fetching round. The script is also more robust and can work in both local and deployed mode. HTH Julien On 9 August

RE: Happy 10th Birthday Nutch!

2012-08-09 Thread Markus Jelsma

Nice! -Original message- From: Ferdy Galema ferdy.gal...@kalooga.com Sent: Thu 09-Aug-2012 10:12 To: user@nutch.apache.org Cc: d...@nutch.apache.org Subject: Re: Happy 10th Birthday Nutch! Cheers! On Thu, Aug 9, 2012 at 9:56 AM, Julien Nioche lists.digitalpeb...@gmail.com

Nutch 2 encoding

2012-08-09 Thread Ake Tangkananond

Hi all, I just wonder if Nutch 2 is working fine with non-English characters in your deployment? Thai language used to work fine for me in Nutch 1.5 but not in Nutch 2. Did I miss something? Is there anything I should check? Sorry for the silly questions, but thank you in advance. ;-) Regards, Ake

Re: Nutch 2 encoding

2012-08-09 Thread Ferdy Galema

It depends on the datastore and possibly the server. What store are you using? On Thu, Aug 9, 2012 at 4:05 PM, Ake Tangkananond iam...@gmail.com wrote: Hi all, I just wonder if Nutch 2 is working fine with non-English characters in your deployment? Thai language used to work fine for me in

Re: Nutch 2 encoding

2012-08-09 Thread Ake Tangkananond

Hi, Sorry for the late reply. I was trying to figure it out myself, but no luck so far. I'm on HBase with a local deploy, version 0.90.6, r1295128, the working version according to the wiki: http://wiki.apache.org/nutch/Nutch2Tutorial Regards, Ake Tangkananond On 8/9/12 10:30 PM, Ferdy Galema

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2012-08-09 Thread Bai Shen

It was crawling HTML files when it started throwing the exception. Unfortunately, I didn't keep copies of the files or URLs. On Thu, Aug 9, 2012 at 3:07 AM, Ferdy Galema ferdy.gal...@kalooga.com wrote: Hi, Of course setting a bigger heap helps, but most of the time only temporarily. Can

Re: Nutch 2 encoding

2012-08-09 Thread Ake Tangkananond

Hi, I'm debugging. I inserted code to print out the encoding in the getParse function of HtmlParser.java, and it printed utf-8. So I think it might be a data store problem. What else could be the cause? Could you advise what I should try next to have my Thai chars stored correctly in HBase?
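One generic way to narrow down this kind of problem is to check whether the text survives the conversion to bytes and back, since a store such as HBase keeps raw bytes. The sketch below is an illustration only, not a diagnosis of the issue in this thread: it uses nothing but the JDK, the class name Utf8RoundTrip and the method survives are hypothetical, and the sample string is an assumed piece of Thai text written with Unicode escapes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical check: does a string survive encoding to bytes and decoding
// again with a given charset? Thai text does with UTF-8 but not with
// ISO-8859-1, where unmappable characters are replaced during encoding.
class Utf8RoundTrip {

    static boolean survives(String text, Charset charset) {
        byte[] stored = text.getBytes(charset);      // what would be written to the store
        String restored = new String(stored, charset); // what would be read back
        return text.equals(restored);
    }

    public static void main(String[] args) {
        // Assumed sample: the Thai phrase for "Thai language", as Unicode escapes
        // so the source file encoding does not matter.
        String thai = "\u0E20\u0E32\u0E29\u0E32\u0E44\u0E17\u0E22";

        System.out.println("Platform default charset: " + Charset.defaultCharset());
        System.out.println("UTF-8 round trip ok: " + survives(thai, StandardCharsets.UTF_8));
        System.out.println("ISO-8859-1 round trip ok: " + survives(thai, StandardCharsets.ISO_8859_1));
    }
}
```

If the platform default charset printed here is not UTF-8, any code path that calls getBytes() or new String(bytes) without naming a charset is a candidate for the corruption; whether such a path exists in a given Nutch 2 setup would still need to be verified.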

Re: Nutch 2 encoding

2012-08-09 Thread alxsss

Hi, I use hbase-0.92.1 and do not have problems with utf-8 chars. What exactly is your problem? Alex. -Original Message- From: Ake Tangkananond iam...@gmail.com To: user user@nutch.apache.org Sent: Thu, Aug 9, 2012 11:12 am Subject: Re: Nutch 2 encoding Hi, I'm debugging. I

Re: Happy 10th Birthday Nutch!

2012-08-09 Thread Lewis John Mcgibbney

Nice one Julien. I'm going to update the site with this, as it's a pretty huge milestone @Apache and a lot of projects and current developers owe a lot to the great work done by all you guys over the years. Thank you for sharing. Lewis On Thu, Aug 9, 2012 at 8:56 AM, Julien Nioche

Re: CHM Files and Tika

2012-08-09 Thread Sebastian Nagel

Hi Jan, confirmed: Nutch cannot parse CHM, while Tika (same version used by Nutch) can. The CHM parsers are in tika-parser*.jar, which is contained in the Nutch package. Any ideas? Sebastian On 08/08/2012 12:03 PM, Jan Riewe wrote: Hey there, I try to parse CHM (Microsoft Help Files)

cache field in index-basic in 2.X

2012-08-09 Thread Lewis John Mcgibbney

Hi, Can someone please explain to me exactly what the caching field is actually caching in index-basic? I see the various fields in o.a.n.metadata.Nutch e.g. CACHING_FORBIDDEN_ALL, CACHING_FORBIDDEN_CONTENT, etc. but I am still not sure how the functionality or indeed the 'what' actually is!!!

SolrIndex command

2012-08-09 Thread marora

Hi There, I am a new Nutch user. I am using Nutch to crawl and then send crawl data to SOLR. I have a question about the bin/nutch solrindex command. Which Tika libraries are being used to index: is it the Tika libraries in Nutch, or does Nutch let SOLR index so it uses Solr's Tika libraries? I think I

Re: Happy 10th Birthday Nutch!

2012-08-09 Thread Sebastian Nagel

Hi, I just discovered this nice but really old site http://nutch.sourceforge.net/docs/en/ with translations into a dozen languages. The proposition is still challenging. Sebastian On 08/09/2012 10:31 PM, Lewis John Mcgibbney wrote: Nice one Julien. I'm going to update the site with this

RE: CHM Files and Tika

2012-08-09 Thread Markus Jelsma

Hmm, I'm not sure, but maybe we don't include all Tika parser deps in our build.xml? -Original message- From: Sebastian Nagel wastl.na...@googlemail.com Sent: Thu 09-Aug-2012 23:18 To: user@nutch.apache.org Subject: Re: CHM Files and Tika Hi Jan, confirmed: Nutch cannot

Re: Happy 10th Birthday Nutch!

2012-08-09 Thread Mattmann, Chris A (388J)

Super cool. Proud to have been around since 2005 (7 of them!) :) Cheers, Chris On Aug 9, 2012, at 1:31 PM, Lewis John Mcgibbney wrote: Nice one Julien. I'm going to update the site with this, as it's a pretty huge milestone @Apache and a lot of projects and current developers owe a lot to

19 matches
