Hi
I'm fairly new to using Nutch. I have segments of 2,000 URLs, and I'm trying
to index them to Solr, but it throws an error message like this:
and I checked logs/hadoop.log and got something like:
So I thought that
http://forum.kompas.com/avatars/bangthoyib.gif?dateline=1261039714 is the
o
Hi,
If you want to index every page on example.org with the term 'java' you need
to crawl every page on example.org. Nutch will follow links and add the links
to its CrawlDB so you can crawl newly discovered pages in the next crawl cycle
(read the wiki tutorial on what a cycle is).
If you real
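The crawl cycle described in the reply above can be pictured with a toy model. This is a minimal sketch in Python, not Nutch code: the `LINKS` graph, the `crawl_cycle` function, and the example.org URLs are all made up for illustration. It only shows the idea that each cycle fetches the URLs already known, and that links discovered along the way wait in the CrawlDB until the next cycle.

```python
# Toy model of a crawl cycle; this is NOT Nutch code.
# LINKS is a hypothetical link graph standing in for a small site.
LINKS = {
    "http://example.org/": ["http://example.org/a", "http://example.org/b"],
    "http://example.org/a": ["http://example.org/c"],
    "http://example.org/b": [],
    "http://example.org/c": [],
}


def crawl_cycle(crawldb, fetched):
    """Run one cycle: fetch every known-but-unfetched URL, then record its outlinks."""
    to_fetch = sorted(crawldb - fetched)       # build the fetch list for this cycle
    for url in to_fetch:
        fetched.add(url)                       # stand-in for fetching and parsing
        crawldb.update(LINKS.get(url, []))     # newly discovered links join the CrawlDB
    return to_fetch


crawldb = {"http://example.org/"}  # seed URL
fetched = set()
cycles = []
while True:
    batch = crawl_cycle(crawldb, fetched)
    if not batch:                  # nothing new was discovered, so stop
        break
    cycles.append(batch)

for number, batch in enumerate(cycles, start=1):
    print(number, batch)
```

With this made-up graph the seed is fetched in the first cycle, the two pages it links to in the second, and the page found on one of those in the third.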
Hello.
I am new to Nutch.
I need to use Nutch to index data into Solr.
Let's say I need to crawl some newspaper search pages and index any article
regarding the word "java".
I understand that I would need to point Nutch to the search result page.
1) What I need from Nutch is not to crawl/index t
Hi everyone,
We'd like to use Nutch and Solr to replace an existing Verity search that's
become a bit long in the tooth. In our Verity search, we have a hack which
allows each document to have a machine-readable URL which is indexed (generally
an xml document), and a human-readable URL which we
Hi,
I am having issues crawling an intranet site with an (imho) odd redirect
mechanism. One part of the intranet website requires authentication
which Nutch can bypass by sending a special http.agent.name. This works fine.
The issue I am facing is that the server sends a redirect (302) after
su
Hello Lewis, Julien.
Solr isn't the problem. I know that Nutch doesn't include it, so I've added
it as a Maven dependency as well. Otherwise, I've got a Java project that
compiles and deploys a Solr server without problems. The thing is that the
parent project (which includes my version of Nutch, an
Hi everybody
I have Nutch 1.2. During the indexing process it fails when indexing protected
PDFs, and I don't know why. I searched the internet and found nothing. I checked
jai_codec.jar and jai_core.jar and they are fine; I even updated them. Could
anyone tell me what is going on or where to look, please?
Excellent :0)
Sorry Luis for my limited knowledge on the Maven side of things.
On Thu, Sep 15, 2011 at 5:15 PM, Julien Nioche <
lists.digitalpeb...@gmail.com> wrote:
> [... ]
>
>
> > We do not push any Nutch related stuff to the Sonatype Nexus Maven
> > Repository so you
> > can't therefore pull
[... ]
> We do not push any Nutch related stuff to the Sonatype Nexus Maven
> Repository so you
> can't therefore pull it and depend upon it in any way.
>
We do, and it is synced with Central; see:
- http://wiki.apache.org/nutch/NutchMavenSupport
- http://search.maven.org/#artifactdetails|org.ap
This looks pretty tricky. I am not experienced with using http-client in
general and we could do with getting a wiki page established to comment on
the re-direct policies and scenarios as there is quite a bit of confusion
within the community as to what some 'states' actually mean and how to
crawl/
Hi Luis,
There appear to be two things which stick out for me here. I'm not sure if
you know that Solr is not shipped with Nutch; therefore, to index you would
also need to include some kind of Solr image in your parent project. I
mention this because although the log at 2011-09-15 16:57:09 indicat
Hello.
I've downloaded the Nutch 1.3 version via Subversion and modified some classes a
little. My intention is to integrate with Maven the new artifacts created
from the new "hacked" Nutch version and integrate them with another Maven
project which has a dependency on the hacked version mentioned. Bo
I've run into a small issue with my deployment of Nutch. Some of the sites I
crawl use characters such as æøå in their URLs, and these never seem to
parse properly. Is there any way to get around this? I tried adding the
UTF-values (as '\u00e5' and so on) in regex-normalize.xml, but I suppose
they
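For reference, characters such as æøå have a standard percent-encoded form in URLs, built from their UTF-8 bytes. The sketch below shows that form using Python's standard library. It is only an illustration of the encoding, not Nutch configuration: the helper name `percent_encode_url` and the example URL are hypothetical, and whether a given Nutch normalizer can be made to produce this form is not something this snippet confirms.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def percent_encode_url(url: str) -> str:
    """Percent-encode non-ASCII characters in the path and query as UTF-8 bytes."""
    parts = urlsplit(url)
    # '%' is kept as safe so that already-escaped sequences are not encoded twice.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


# Hypothetical URL containing å, æ and ø.
print(percent_encode_url("http://example.org/blåbær?q=øl"))
# prints http://example.org/bl%C3%A5b%C3%A6r?q=%C3%B8l
```

The host name is left untouched here; non-ASCII host names use a different scheme (IDNA) and are outside this sketch.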