Hi, I think the best starting point could be:
http://wiki.apache.org/nutch/Nutch_0.9_Crawl_Script_Tutorial
You can modify the order of some steps.
On Thu, Aug 9, 2012 at 1:26 AM, aabbcc wella_ge...@hotmail.it wrote:
Hi,
My problem is that I have a domain (e.g. http://*.apache.org) and I want to
Hi,
Setting a bigger heap certainly helps, but most of the time only
temporarily. Can you see in the logs what type of documents are parsed?
In case of html documents crawled on the wild web, a single document can
cause the heap to explode. By default the cyberneko parser (in HtmlParser)
is
Hi Ferdy,
When you get the out-of-memory error, if you have these options on the JVM:
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/tmp
You get a file on your filesystem with a heap dump taken at the instant of
the problem.
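A minimal sketch of how the two flags above could be added to an existing JVM option string. The helper name ensure_heap_dump_flags and the sample option "-Xmx1000m" are illustrative assumptions, not part of Nutch; how the resulting string reaches the JVM depends on your setup.

```python
# Hypothetical helper (not part of Nutch): make sure a JVM option string
# carries the two heap-dump flags quoted in the message above.

def ensure_heap_dump_flags(jvm_opts: str, dump_dir: str = "/var/tmp") -> str:
    """Return jvm_opts with the heap-dump-on-OOM flags appended if missing."""
    opts = jvm_opts.split()
    # Ask the JVM to write a heap dump when an OutOfMemoryError is thrown.
    if "-XX:+HeapDumpOnOutOfMemoryError" not in opts:
        opts.append("-XX:+HeapDumpOnOutOfMemoryError")
    # Only add a dump path if the caller has not already chosen one.
    if not any(o.startswith("-XX:HeapDumpPath=") for o in opts):
        opts.append(f"-XX:HeapDumpPath={dump_dir}")
    return " ".join(opts)


if __name__ == "__main__":
    print(ensure_heap_dump_flags("-Xmx1000m"))
```

Calling the helper twice on the same string leaves it unchanged, so it can be applied safely to options that may already contain the flags.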
You can use http://www.eclipse.org/mat/ (an Eclipse extension) that is a
Cheers!
On Thu, Aug 9, 2012 at 9:56 AM, Julien Nioche lists.digitalpeb...@gmail.com
wrote:
Doug Cutting on twitter :
https://twitter.com/cutting/status/233415059798372353
RT @StefanGroschupf: Happy 10th birthday #Nutch! Registered at sourceforce
august 2002. Turned out to be quite a game
The version of Nutch in the trunk has a useful crawl script in the bin dir
which does all the typical steps of a crawl and sends the docs to SOLR for
indexing at the end of each fetching round. The script is also more robust
and can work in both local and deployed mode
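The typical steps of one crawl round can be sketched as a list of command lines. This is only an illustration and not the script from the trunk: the function name crawl_round, the paths, the segment name and the Solr URL are hypothetical, and the argument order is written from memory of the Nutch 1.x command line, so check it against the usage output of your version.

```python
# Sketch of one crawl round as command lines (nothing is executed here).
# All paths, the segment name and the Solr URL are hypothetical placeholders.

def crawl_round(crawldb, segments_dir, segment, solr_url):
    """Return the commands for one generate/fetch/parse/update/index round."""
    seg = f"{segments_dir}/{segment}"
    return [
        ["bin/nutch", "generate", crawldb, segments_dir],    # select URLs to fetch
        ["bin/nutch", "fetch", seg],                         # download the pages
        ["bin/nutch", "parse", seg],                         # extract text and links
        ["bin/nutch", "updatedb", crawldb, seg],             # merge results into the crawldb
        ["bin/nutch", "solrindex", solr_url, crawldb, seg],  # send the docs to Solr
    ]


if __name__ == "__main__":
    for cmd in crawl_round("crawl/crawldb", "crawl/segments",
                           "20120809", "http://localhost:8983/solr"):
        print(" ".join(cmd))
```

Indexing at the end of each round, as described above, means documents become searchable before the whole crawl has finished.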
HTH
Julien
On 9 August
Nice!
-Original message-
From:Ferdy Galema ferdy.gal...@kalooga.com
Sent: Thu 09-Aug-2012 10:12
To: user@nutch.apache.org
Cc: d...@nutch.apache.org
Subject: Re: Happy 10th Birthday Nutch!
Cheers!
On Thu, Aug 9, 2012 at 9:56 AM, Julien Nioche lists.digitalpeb...@gmail.com
Hi all,
I just wonder if Nutch 2 is working fine with non-English characters in your
deployment? Thai used to work fine for me in Nutch 1.5 but not in
Nutch 2. Did I miss something? Is there anything I should check?
Sorry for the silly questions, and thank you in advance. ;-)
Regards,
Ake
It depends on the datastore and possibly the server? What store are you
using?
On Thu, Aug 9, 2012 at 4:05 PM, Ake Tangkananond iam...@gmail.com wrote:
Hi all,
I just wonder if Nutch 2 is working fine with non-English characters in
your
deployment? Thai language used to work fine for me in
Hi,
Sorry for the late reply. I was trying to figure it out myself, but with no luck.
I'm on HBase in a local deployment, version 0.90.6, r1295128, the working
version according to the wiki:
http://wiki.apache.org/nutch/Nutch2Tutorial
Regards,
Ake Tangkananond
On 8/9/12 10:30 PM, Ferdy Galema
It was crawling HTML files when it started throwing the exception.
Unfortunately, I didn't keep copies of the files or urls.
On Thu, Aug 9, 2012 at 3:07 AM, Ferdy Galema ferdy.gal...@kalooga.com wrote:
Hi,
Setting a bigger heap certainly helps, but most of the time only
temporarily. Can
Hi,
I'm debugging.
I inserted code to print the encoding in the getParse function of
HtmlParser.java, and it printed utf-8. So I think it might be a data
store problem. What else could be the cause? Could you advise what I
should try next to have my Thai characters stored correctly in HBase?
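One way to picture the kind of problem being described: the parser can report utf-8 correctly while the text is still damaged later, if the stored bytes are read back with a different charset. The sketch below is only an illustration of that effect; the sample string and the choice of Latin-1 as the wrong charset are assumptions, and it does not use HBase or Nutch.

```python
# Illustration only: a Thai string survives a UTF-8 round trip, but is
# damaged when the same bytes are decoded with the wrong charset.

# "Thai language" written in Thai, given as escapes so the source stays ASCII.
text = "\u0e20\u0e32\u0e29\u0e32\u0e44\u0e17\u0e22"

stored = text.encode("utf-8")        # bytes as a byte-oriented store would hold them
ok = stored.decode("utf-8")          # correct read: identical to the input
garbled = stored.decode("latin-1")   # wrong charset on read: mojibake


def looks_like_mojibake(s: str) -> bool:
    """Heuristic: Thai UTF-8 read as Latin-1 contains repeated 0xE0 characters."""
    return s.count("\u00e0") >= 2


if __name__ == "__main__":
    print(ok == text, looks_like_mojibake(ok), looks_like_mojibake(garbled))
```

If the stored value shows the pattern that looks_like_mojibake detects, the bytes are probably intact and only the reading side uses the wrong charset; if it shows question marks instead, the damage likely happened on write.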
Hi,
I use hbase-0.92.1 and do not have problems with UTF-8 characters. What exactly
is your problem?
Alex.
-Original Message-
From: Ake Tangkananond iam...@gmail.com
To: user user@nutch.apache.org
Sent: Thu, Aug 9, 2012 11:12 am
Subject: Re: Nutch 2 encoding
Hi,
I'm debugging.
I
Nice one Julien
I'm going to update the site with this as it's a pretty huge milestone
@Apache and a lot of projects and current developers owe a lot to the
great work done by all you guys over the years.
Thank you for sharing.
Lewis
On Thu, Aug 9, 2012 at 8:56 AM, Julien Nioche
Hi Jan,
confirmed: Nutch cannot parse CHM, while Tika (the same version used by
Nutch) can. The CHM parsers are in tika-parser*.jar, which is contained
in the Nutch package.
Any ideas?
Sebastian
On 08/08/2012 12:03 PM, Jan Riewe wrote:
Hey there,
I try to parse CHM (Microsoft Help Files)
Hi,
Can someone please explain to me exactly what the caching field is
actually caching in index-basic?
I see the various fields in o.a.n.metadata.Nutch, e.g.
CACHING_FORBIDDEN_ALL, CACHING_FORBIDDEN_CONTENT, etc., but I am still
not sure how the functionality works or indeed what it actually is!
Hi There,
I am a new Nutch user. I am using Nutch to crawl and then send crawl data
to Solr. I have a question about the bin/nutch solrindex command. Which Tika
libraries are used to index: is it the Tika libraries in Nutch, or
does Nutch let Solr index, so that it uses Solr's Tika libraries? I think I
Hi,
I just discovered this nice but really old site
http://nutch.sourceforge.net/docs/en/
with translations into a dozen languages.
The proposition is still challenging.
Sebastian
On 08/09/2012 10:31 PM, Lewis John Mcgibbney wrote:
Nice one Julien
I'm going to update the site with this
Hmm, I'm not sure, but maybe we don't include all Tika parser deps in our
build.xml?
-Original message-
From:Sebastian Nagel wastl.na...@googlemail.com
Sent: Thu 09-Aug-2012 23:18
To: user@nutch.apache.org
Subject: Re: CHM Files and Tika
Hi Jan,
confirmed: Nutch cannot
Super cool. Proud to have been around since 2005 (7 of them!)
:)
Cheers,
Chris
On Aug 9, 2012, at 1:31 PM, Lewis John Mcgibbney wrote:
Nice one Julien
I'm going to update the site with this as it's a pretty huge milestone
@Apache and a lot of projects and current developers owe a lot to