Hello,
I have installed nutch-1.2 on Fedora 14 with tomcat6. I added the path to the crawl
dir in the searcher.dir property in WEB-INF/classes/nutch-default.xml
as /home/user/nutch-1.2/crawl.
I see in the catalina.out file
WARN SearchBean - Neither file:///home/user/nutch-1.2/crawl/index nor
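For reference, the property being set would normally look like this (a sketch only; overrides conventionally go in nutch-site.xml rather than nutch-default.xml):
<property>
  <name>searcher.dir</name>
  <value>/home/user/nutch-1.2/crawl</value>
  <description>Path to the crawl directory that the search webapp reads.</description>
</property>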
Hello
I use nutch-1.2 with fedora 14 and try to index about 4000 domains. I use
bin/nutch crawl urls -dir crawl -depth 3 -topN -1 and have in
crawl-urlfilter.txt this
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*
I noticed that if a domain is entered like http://mydomain.com in
Hello,
I used nutch-1.2 to index a few domains. I noticed that nutch correctly crawled
all sub-pages of domains. By sub-pages I mean the following: for example, for a
domain mydomain.com, all the links inside it like
mydomain.com/show/photos/1, etc. I also noticed in our apache logs that
Hello,
Thank you for your response.
Let me give you more detail of the issue that I have.
First, definitions. Let's say I have my own domain that I host on a dedicated
server, and call it mydomain.com.
Next, call the following subdomains: answers.mydomain.com, mail.mydomain.com,
maps.mydomain.com.
Which command did you use? Merging segments is very expensive in resources, so
I try to avoid merging them.
-Original Message-
From: Marseld Dedgjonaj marseld.dedgjo...@ikubinfo.com
To: user user@nutch.apache.org
Sent: Tue, Jan 4, 2011 7:12 am
Subject: FW: Exception on segment
One more thing I just noticed is that Nutch search results do not display
information from meta tags.
Google and yahoo do.
In more detail, Nutch search results for the keyword mydomain.com display some
short text from the page mydomain.com. In contrast, google and yahoo search results
for the
Hello,
Just noticed that google actually has results from all subpages of mydomain.com
for the keyword mydomain.com, but they are hidden behind a link "show more results
from mydomain.com". Is there a way of putting more results from the same domain in
such a link in the Nutch rss feed, since I use
You can set fetching of external and internal links to false and increase the depth.
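A nutch-site.xml sketch of what is likely meant here, assuming the db.ignore.external.links / db.ignore.internal.links properties (note they are set to true to stop following such links):
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>Ignore outlinks that point to a different host.</description>
</property>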
-Original Message-
From: Churchill Nanje Mambe mambena...@afrovisiongroup.com
To: user user@nutch.apache.org
Sent: Wed, Jan 26, 2011 8:03 am
Subject: Re: Few questions from a newbie
even if the url
Hello,
I run the crawl command with -depth 7 -topN -1 on my linux box with 1.5 Mbps
internet, an amd 3.1ghz processor, 4GB memory, Fedora Linux 14, nutch 1.2. After
1-2 days nutch takes 98% of the cpu. My seed file includes about 3500 domains and I
set fetching of external links to false.
Is this normal? If
2nd, after testing fetching several pages from wikipedia, the search
query with e.g. bin/nutch org.apache.nutch.searcher.NutchBean apache
../wiki_dir returns
It returns a result for the keyword apache because that url has apache in it.
-topN 50), it actually fetches some pages e.g. `fetching
Hello,
I wondered if there is a way of adding to a solrindex made from nutch segments
another solrindex also made from nutch segments.
I have to index about 3000 domains, but 5 of them are newspaper sites. So, I
need to crawl-fetch-parse these 5 domains (with depth 2) and update the index every
That tutorial is applicable to the new version too.
-Original Message-
From: Marseld Dedgjonaj marseld.dedgjo...@ikubinfo.com
To: user user@nutch.apache.org; 'McGibbney, Lewis John'
lewis.mcgibb...@gcu.ac.uk
Sent: Tue, Mar 8, 2011 5:25 am
Subject: RE: Reload index without
Hello,
I wondered if nutch version 2 will be able to index image files?
Thanks.
Alex.
I meant to extract the image title, src link and alt from img tags, and not store
the image files. For a keyword search it must display the link, which automatically
displays the image itself in the search page.
Not sure what you mean by image content-based retrieval? Do image files have
tags like mp3 ones?
Hello,
Which version is this patch applicable to?
Thanks.
Alex.
-Original Message-
From: Alexis alexis.detregl...@gmail.com
To: user user@nutch.apache.org
Sent: Tue, Feb 8, 2011 9:59 am
Subject: Re: nutch crawl command takes 98% of cpu
Hi,
Thanks for all the feedback. It
Hello
I see in nutch-1.2/conf/regex-urlfilter.txt file the following lines
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
However, nutch fetches urls like
http://www.example.com/text/dev/faq/dev/content/2305/dev/content/246/
Thanks.
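A likely explanation: the rule only fires when the repeated segment recurs with exactly one other segment in between (/X/a/X/b/X/). A minimal Java sketch to check this, assuming find() semantics as in RegexURLFilter:
import java.util.regex.Pattern;

public class LoopRuleCheck {
  public static void main(String[] args) {
    // the default rule from regex-urlfilter.txt, minus the leading '-'
    Pattern p = Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/");
    String url = "http://www.example.com/text/dev/faq/dev/content/2305/dev/content/246/";
    // prints false: /dev repeats three times, but two segments sit between
    // its 2nd and 3rd occurrences, so the backreference never lines up
    System.out.println(p.matcher(url).find());
  }
}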
Hi,
If you download gora and build it with ant you get rid of one of the
dependencies
--unresolved dependency: org.apache.gora#gora-core;0.1: not found
if you change the gora version from 1.0 to 1.0-incubator in one of the ivy files,
but this one
--unresolved dependency:
Hi,
Did you build gora with ant? I checked it out from svn a few days ago, and ant for
gora gives an error
[ivy:resolve] ::::::::::::::::::::::::::::::::::::::::::::::
[ivy:resolve] ::          UNRESOLVED DEPENDENCIES         ::
[ivy:resolve] ::::::::::::::::::::::::::::::::::::::::::::::
It seems to me that you may have the same problem as before with the disk
space. This may happen because you do mergesegs. Try not to merge segments.
Alex.
-Original Message-
From: McGibbney, Lewis John lewis.mcgibb...@gcu.ac.uk
To: user user@nutch.apache.org
Sent: Wed, Apr
Hello,
Looks like I will have some spare time in the next month, so I may work on
writing this image indexing plugin. I wondered if there is a similar plugin to
leverage code from or to follow?
Thanks.
Alex.
-Original Message-
From: Andrzej Bialecki a...@getopt.org
To:
It seems you should move www.example.com example.com from line 3 to line 1,
uncomment line 3 and comment the other lines.
Alex.
-Original Message-
From: Alex alex.thegr...@ambix.net
To: user user@nutch.apache.org
Sent: Tue, Apr 26, 2011 4:18 am
Subject: Re: Hosts File Nutch
Hello,
I use nutch-1.2 to index about 3000 sites. One of them has about 1500 pdf files
which do not change over time.
I wondered if there is a way of configuring nutch not to fetch unchanged
documents again and again, but to keep the old index for them.
Thanks.
Alex.
Hi,
I took a look at the recrawl script and noticed that all the steps except url
injection are repeated on each subsequent indexing run, and wondered why we would
generate new segments.
Is it possible to do fetch and updatedb for all the previous $s1..$sn, then the
invertlinks and index steps?
Thanks.
Alex.
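For context, a sketch of the 1.x cycle such scripts automate (paths illustrative); each pass creates a new segment because generate writes a fresh fetchlist:
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
s=`ls -d crawl/segments/* | tail -1`   # the segment generate just created
bin/nutch fetch $s
bin/nutch updatedb crawl/crawldb $s
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*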
Hello,
I use nutch 1.2 and solr to index about 3500 domains. I noticed that search
results for two or more keywords are not ranked properly.
For example, for the keyword Lady Gaga, some results that have Lady are displayed
first, then some results with both keywords, etc. It seems to me that results
Hello,
One more question. Is there a way of adding new urls to a crawldb created in
previous crawls, so they are included in subsequent recrawls?
Thanks.
Alex.
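One common answer, assuming a 1.x layout: inject can be run against the existing crawldb between crawls, a sketch (paths illustrative):
bin/nutch inject crawl/crawldb new_urls/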
-Original Message-
From: lewis john mcgibbney lewis.mcgibb...@gmail.com
To: user user@nutch.apache.org; markus.jelsma
Check for errors in the solr log.
-Original Message-
From: Way Cool way1.wayc...@gmail.com
To: user user@nutch.apache.org
Sent: Tue, Jul 26, 2011 3:14 pm
Subject: Re: solrindex command` not working
The latest solr version is 3.3. Maybe you can try that.
On Tue, Jul 26, 2011 at 2:10 AM,
Hello,
I use nutch-1.2 with solr 1.4. Recently, I noticed that in a search for a domain
name, for example yahoo.com, yahoo.com is not in first place. Instead, other
sites that have yahoo.com in their content are in the first places. I tested this
issue with google. In its results the domain is in the
https://issues.apache.org/jira/browse/NUTCH-1044
-Original Message-
From: abhayd ajdabhol...@hotmail.com
To: nutch-user nutch-u...@lucene.apache.org
Sent: Wed, Aug 17, 2011 11:44 am
Subject: nutch redirect treatment
hi
I have seen similar posts in this forum but still not
As far as I understood, redirected urls are scored 0 and that is why the fetcher
does not pick them up in the earlier depths. They may be crawled starting at depth
4, depending on the size of the seed list.
-Original Message-
From: abhayd ajdabhol...@hotmail.com
To: nutch-user
Hi Lewis,
I stopped the fetcher and started it on the same segment again.
But before doing that I turned off the modem, and the fetcher started giving
UnknownHost exceptions.
It was not giving any error with the dsl failure, i.e. when I was not able to
connect to any sites. Again, this is nutch-1.2.
Thanks.
Alex.
It is a DNS problem, because it was giving a lot of UnknownHost exceptions. I
decreased the thread number to 5, but the DSL still fails periodically.
I wondered what the common internet connection is for fetching about 3500
domains. I currently have DSL at 3 Mbps.
Thanks.
Alex.
-Original
Hello,
I have tried to implement a spellchecker based on the index in nutch-solr by adding
a spell field to schema.xml and making it a copy of the content field. However,
this doubled the data folder size, and the spell field, as a copy of the content
field, appears in the xml feed, which is not necessary. Is it
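A schema.xml sketch that would avoid both problems, assuming a Solr 1.4 setup (the field type name is illustrative): an unstored field is still searchable but is not returned in the XML response and does not bloat the stored data.
<field name="spell" type="textSpell" indexed="true" stored="false"/>
<copyField source="content" dest="spell"/>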
Compared with nutch-1.2, I do not see any content folder under the segments.
Does this mean that we cannot set store.content to false in nutch-1.3?
Thanks.
Alex.
I see what is done in nutch results. Results are grouped with 1 doc in each
group. I need to group with 3 docs max in each group.
In Solr, it is impossible to paginate when grouping with more than 1 doc in
each group.
Google can do it with 5 docs in the first group, as I see.
Thanks.
Alex.
Hello,
I wondered if it is possible to restart a failed job in nutch-1.3 version.
I have this error
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
taskTracker/jobcache/
after fetching for 5 days. I know the reason for the error, but do not want to
restart the
Hello,
I tried the fetch command with the following config
<property>
  <name>fetcher.store.content</name>
  <value>false</value>
  <description>If true, fetcher will store content.</description>
</property>
<property>
  <name>fetcher.parse</name>
  <value>true</value>
  <description>If true, fetcher will
I think you must add a regex to regex-urlfilter.txt. In that case those urls
will not be fetched by the fetcher.
-Original Message-
From: Bai Shen baishen.li...@gmail.com
To: user user@nutch.apache.org
Sent: Tue, Nov 1, 2011 10:35 am
Subject: Re: Removing urls from crawl db
Already did
I think this patch is already included in the current version.
-Original Message-
From: mina tahereganji...@gmail.com
To: nutch-user nutch-u...@lucene.apache.org
Sent: Wed, Nov 2, 2011 7:08 pm
Subject: how use NUTCH-16 in my nutch 1.3?
i want to use NUTCH-61 in
Hello,
It is interesting to know how one can put a filter on outlinks. I mean, if I
have a regex, in which file should I put it?
For example, I want nutch to ignore outlinks ending with .info.
Thanks.
Alex.
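A sketch of what such a rule might look like, assuming conf/regex-urlfilter.txt is the file consulted when outlinks are filtered (the pattern is illustrative):
# reject urls whose host ends in .info
-^https?://[^/]*\.info(/|$)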
-Original Message-
From: Arkadi.Kosmynin arkadi.kosmy...@csiro.au
To:
If I understand you correctly, you are saying that even if my question is related
to the current thread, I must nevertheless open a new one?
-Original Message-
From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com
To: user user@nutch.apache.org
Sent: Thu, Dec 1, 2011 3:01 pm
Subject:
I think you should add this to nutch-site.xml
<property>
  <name>generate.max.count</name>
  <value>1000</value>
  <description>The maximum number of urls in a single
  fetchlist. -1 if unlimited. The urls are counted according
  to the value of the parameter generator.count.mode.
  </description>
</property>
It seems that the robots.txt on
libraries.mit.edu
has a lot of restrictions.
Alex.
-Original Message-
From: Chip Calhoun ccalh...@aip.org
To: user user@nutch.apache.org; 'markus.jel...@openindex.io'
markus.jel...@openindex.io
Sent: Tue, Dec 20, 2011 7:28 am
Subject: RE: Can't crawl
Hello,
I took a look at the source of the SolrDeleteDuplicates class. The patch is already
applied.
Any ideas what might be wrong? I issue this command
bin/nutch solrdedup http://127.0.0.1:8983/solr/
and the solr schema is the one that comes with nutch.
Thanks in advance.
Alex.
Hello,
I tried 1, 2, and -1 for the config http.redirect.max, but nutch still postpones
redirected urls to later depths.
What is the correct config setting to have nutch crawl redirected urls
immediately? I need it because I have a restriction that depth be at most 2.
Thanks.
Alex.
Hello,
I need to have different fetch intervals for the initial seed urls and the urls
extracted from them at depth 1. How can this be achieved? I tried the -adddays
option of the generate command, but it seems it cannot be used to solve this issue.
Thanks in advance.
Alex.
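One possibility, assuming this version's injector honors per-url metadata in the seed file: give the seeds their own interval at inject time. A sketch (metadata is tab-separated from the url; interval in seconds, value illustrative):
http://www.example.com/	nutch.fetchInterval=86400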
I need to make this a cron job, so I cannot make changes manually.
My problem is to index newspaper sites, but only the new links that are added
every day, and not fetch ones that have already been fetched.
Thanks.
Alex.
-Original Message-
From: Markus Jelsma
Hello,
As far as I understood, nutch recrawls urls when their fetch time has passed the
current time, regardless of whether those urls were modified or not.
Is there any initiative on restricting recrawls to only those urls that have a
modified time (MT) greater than the old MT?
In detail: if nutch has crawled
Hello,
It seems to me that all the options to the updatedb command that nutch 1.4 has have
been removed in nutch-2.0. I would like to know if this was done purposefully,
or whether they will be added later? Also, how can I create multiple docs using the
parse command? It seems there are not sufficient arguments to
Hi Lewis,
In the 1.X version there are the -noAdditions option to updatedb and the -adddays
option to the generate command. How can something similar be done in the 2.X version?
Here, http://wiki.apache.org/nutch/Nutch2Roadmap it is stated
Modify code so that parser can generate multiple documents
I was thinking of using the last modified header, but it may be absent. In that
case we could use the signature of urls at indexing time. I took a look at
the code; it seems it is implemented but not working. I tested nutch-1.4 with
a single url, and solrindexer always sends the same number of documents to
Hello,
I have tested nutch-2.0 with hbase and mysql, trying to index only one url with
depth 1.
I tried to fetch an html tag value and parse it into the metadata column of the
webpage object by adding a parse-tag plugin. I saw there is no metadata member
variable in the Parse class, so I used putToMetadata
Hi,
Thank you for clarifications.
Regarding the metadata, what would be a proper way of parsing and indexing
multivalued tags in nutch-2.0 then?
Thanks.
Alex.
-Original Message-
From: Ferdy Galema ferdy.gal...@kalooga.com
To: user user@nutch.apache.org
Sent: Wed, Jun 27, 2012 1:20
Hi,
I was planning to parse img tags from a url's content and put them in the metadata
field of the Webpage storage class in nutch2.0, to retrieve them later in the
indexing step.
However, since there is no metadata data type variable in the Parse class (compare
with outlinks), this cannot be done in nutch
Not sure if I understood correctly.
I did
Counters c = currentJob.getCounters();
System.out.println(c.toString());
With Mysql
DbUpdaterJob: starting
Counters: 20
DbUpdaterJob: starting
counter name=Counters: 20
FileSystemCounters
FILE_BYTES_READ=878298
I queried the webpage table and there are a few links in the outlinks column. As I
noted in the original letter, updatedb works with Hbase. This is the counters
output in the case of Hbase.
bin/nutch updatedb
DbUpdaterJob: starting
counter name=Counters: 20
FileSystemCounters
I tried your suggestion with sql server and everything works fine.
The issue that I had was with mysql, though.
mysql Ver 14.14 Distrib 5.5.18, for Linux (i686) using readline 5.1
After I restarted the mysql server and added the mysql root user to
gora.properties, updatedb adds outlinks as new
Which storage do you use? Try solrindex with the option -reindex.
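A sketch of that invocation against a local Solr, assuming the 2.0 command form:
bin/nutch solrindex http://127.0.0.1:8983/solr/ -reindex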
-Original Message-
From: X3C TECH t...@x3chaos.com
To: user user@nutch.apache.org
Sent: Sun, Jul 29, 2012 10:58 am
Subject: Re: Nutch 2.0 Solr 4.0 Alpha
Forgot to do Specs
VMWare Machine with CentOS 6.3
On Sun, Jul 29,
Why don't you test your regex, to see if it really catches the urls you want to
eliminate? It seems to me that your regex does not eliminate the type of urls
you specified.
Alex.
-Original Message-
From: Ian Piper ianpi...@tellura.co.uk
To: user user@nutch.apache.org
Sent: Mon, Jul
Hi,
Most likely you ran the generate command a few times and did not run updatedb. So,
each generate command assigned a different batchId to its own set of urls.
Alex.
-Original Message-
From: Bai Shen baishen.li...@gmail.com
To: user user@nutch.apache.org
Sent: Tue, Jul 31, 2012 10:26
Hello,
I noticed that the updatedb command must remove the gen, parse and fetch marks and
put the UPDATEDB_MARK mark,
according to the code
Utf8 mark = Mark.PARSE_MARK.removeMarkIfExist(page);
if (mark != null) {
  Mark.UPDATEDB_MARK.putMark(page, mark);
}
in DbUpdateReducer.java
This is directly related to the thread I opened yesterday. I think this is
a bug, since updatedb fails to put the update mark.
I have fixed it by modifying the code. I have a patch, but I am not sure if I can
send it as an attachment.
Alex.
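For reference, a sketch of one possible fix, assuming Mark.checkMark and Mark.removeMark exist as elsewhere in the 2.0 codebase; the idea is to read the mark before removing it, rather than relying on the removal's return value:
Utf8 mark = Mark.PARSE_MARK.checkMark(page);
if (mark != null) {
  Mark.UPDATEDB_MARK.putMark(page, mark);
  Mark.PARSE_MARK.removeMark(page);
}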
-Original Message-
From: Bai Shen
The current code putting the updb_mrk in DbUpdateReducer is as follows
Utf8 mark = Mark.PARSE_MARK.removeMarkIfExist(page);
if (mark != null) {
  Mark.UPDATEDB_MARK.putMark(page, mark);
}
the mark is always null, regardless of whether there is a PARSE_MARK or not.
This function calls
public
Hi,
I have found out that what happens after
bin/nutch generate -topN 1000
is that only 1000 of the urls are marked with gnmrk.
Then
bin/nutch fetch -all
skips all urls that do not have gnmrk
according to the code
Utf8 mark = Mark.GENERATE_MARK.checkMark(page);
if
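A sketch of how that guard plausibly continues (the actual batch-id comparison is cut off above; this is an assumption, not the literal source):
Utf8 mark = Mark.GENERATE_MARK.checkMark(page);
if (mark == null) {
  return; // no generate mark: the url was not generated, so fetch skips it
}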
Hi,
I use hbase-0.92.1 and do not have problems with utf-8 chars. What exactly is
your problem?
Alex.
-Original Message-
From: Ake Tangkananond iam...@gmail.com
To: user user@nutch.apache.org
Sent: Thu, Aug 9, 2012 11:12 am
Subject: Re: Nutch 2 encoding
Hi,
I'm debugging.
I
Hello,
I am getting the same error and here is the log
2012-08-11 13:33:08,223 ERROR http.Http - Failed with the following error:
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2271)
at
I was able to do jstack just before the program exited. The output is attached.
-Original Message-
From: alxsss alx...@aim.com
To: user user@nutch.apache.org
Sent: Sat, Aug 11, 2012 2:17 pm
Subject: Re: java.lang.OutOfMemoryError: GC overhead limit exceeded
Hello,
I am
Hello,
I get the following error when I do bin/nutch updatedb in nutch-2.0 with hbase
java.lang.ArrayIndexOutOfBoundsException: 1
at org.apache.nutch.util.TableUtil.unreverseUrl(TableUtil.java:98)
at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:54)
at
I found out that the key sent to
unreverseUrl in DbUpdateMapper.map was :index.php/http
This happened at depth 3, and I checked the seed file; there was no line in the
form of http:/index.php
Thanks.
Alex.
-Original Message-
From: Ferdy Galema ferdy.gal...@kalooga.com
To: user
Did you delete the old hbase jar from the lib dir?
Alex.
-Original Message-
From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com
To: user user@nutch.apache.org
Sent: Mon, Aug 13, 2012 10:16 am
Subject: Re: nutch 2.0 with hbase 0.94.0
Nutch contains no knowledge of which specific
Hi,
I noticed that the updatedb command goes over all urls, even if they have been
updated in the previous generate, fetch, updatedb stages.
As a result, updatedb takes a long time, depending on the number of rows in the
datastore.
I thought maybe this is redundant and it should be restricted to not
After fetching for about 18 hours the fetcher throws this error
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:701)
at
Hello,
I am using nutch-2.0 with hbase-0.92.1. I noticed that at depths 1, 2, 3 the
fetcher was fetching around 20K urls per hour. At depth 4 it fetches only 8K
urls per hour.
Any ideas what could cause this decrease in speed? I use local mode with 10
threads.
Thanks.
Alex.
This will work only for urls that have If-Modified-Since headers. But most urls
do not have this header.
Thanks.
Alex.
-Original Message-
From: Max Dzyuba max.dzy...@comintelli.com
To: Markus Jelsma markus.jel...@openindex.io; user user@nutch.apache.org
Sent: Fri, Aug 24, 2012
You can use the -reindex option, since the updt markers are not set properly in
the 2.0 release.
-Original Message-
From: Bai Shen baishen.li...@gmail.com
To: user user@nutch.apache.org
Sent: Mon, Sep 17, 2012 10:16 am
Subject: Re: Nutch 2 solrindex fails with no error
The problem appears
Hello,
updatedb in nutch-2.0 increases the fetch time of all pages regardless of whether
they have already been fetched or not.
For example, if updatedb is applied at depth 1 and page A is fetched and its
fetchTime is 30 days from now, then as a result of running updatedb at depth 2
the fetch time of page
It seems to me that if you run nutch in deploy mode and make changes to config
files, then you need to rebuild the .job file again, unless you specify the config
dir option in the hadoop command.
Alex.
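The rebuild step being described, as a sketch assuming the stock ant build:
ant clean job   # re-creates the .job file with the updated conf/ baked in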
-Original Message-
From: Christopher Gross cogr...@gmail.com
To: user user@nutch.apache.org
Hello,
I use nutch-2.0 with hadoop-0.20.2. The bin/nutch generate command takes 87% of
the cpu in deploy mode versus 18% in local mode.
Any ideas how to fix this issue?
Thanks.
Alex.
According to the code in bin/nutch, if you have a .job file in your NUTCH_HOME then
it means that you run it in deploy mode. If there is no .job file then you run it
in local mode, so you do not need to rebuild nutch each time you change conf
files.
Alex.
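Roughly the check being described, as a paraphrased shell sketch rather than the literal script:
# deploy mode if a job file sits next to bin/nutch, local mode otherwise
if ls $NUTCH_HOME/*.job >/dev/null 2>&1; then
  mode=deploy   # runs via: hadoop jar $NUTCH_HOME/*.job <class> ...
else
  mode=local    # runs classes straight from $NUTCH_HOME
fi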
-Original Message-
From:
Can you provide a few lines of the log or the url that gives the exception?
-Original Message-
From: CarinaBambina carina.rei...@yahoo.de
To: user user@nutch.apache.org
Sent: Tue, Oct 2, 2012 2:04 pm
Subject: Re: Error parsing html
Thanks for the reply. I'm now using Nutch 1.5.1, but
I checked the url you provided with parsechecker and it is parsed correctly.
You can check yourself by doing bin/nutch parsechecker yoururl. In your
implementation, can you check if the segment dir has the correct permissions?
Alex.
-Original Message-
From: CarinaBambina
Hello,
I am trying to use nutch-2.0, hadoop-1.0.3, and hbase-0.92.1 in pseudo-distributed
mode with iptables turned off. As soon as map reaches 100%, the fetcher works for a few
minutes and fails with the error
java.net.ConnectException: Connection refused
at
Hello,
Today, I closely followed all the hbase and hadoop logs. As soon as map reached
100%, reduce was at 33%. Then, when reduce reached 66%, I saw in hadoop's datanode
log the following error
2012-10-16 22:44:54,634 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
Hello,
I think the problem is with the storage, not nutch itself. It looks like generate
cannot read the status or fetch time (or gets null values) from mysql.
I had a bunch of issues with mysql storage and switched to hbase in the end.
Alex.
-Original Message-
From: Sebastian Nagel
Hello,
I meant that it could be a gora-mysql problem. In order to test it, you can run
nutch in local mode with the GeneratorJob debug log enabled. Put this
log4j.logger.org.apache.nutch.crawl.GeneratorJob=DEBUG,cmdstdout
in your conf/log4j.properties
and run the crawl cycle with updatedb. If gora-mysql
Hello,
I have also written this kind of plugin. But instead of putting the thumbnail files
in the solr index, they are put in a folder. Only the filenames are kept in the
solr index.
I wondered what the advantage is of putting thumbnail files in the solr index?
Thanks in advance.
Alex.
Thank you alxsss for the suggestion. It displays the actualSize and
inHeaderSize for every file and two more lines in the logs, but it did not give
much information even when I set parserJob to Debug.
I had the same problem when I re-compiled everything today. I have to run
the parse command
It is not clear what you are trying to achieve. We have done something similar in
regard to indexing img tags. We retrieve img tag data while parsing the html
page and keep it in metadata, and when parsing the img url itself we create a
thumbnail.
hth.
Alex.
-Original Message-
From:
Hi,
Unfortunately, my employer does not want me to disclose details of the plugin
at this time.
Alex.
-Original Message-
From: Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu
To: user user@nutch.apache.org
Sent: Wed, Nov 28, 2012 6:20 pm
Subject: Re: Access crawled content
move or copy that jar file to local/lib and try again.
hth.
Alex.
-Original Message-
From: Arcondo arcondo.dasi...@gmail.com
To: user user@nutch.apache.org
Sent: Fri, Jan 4, 2013 2:55 am
Subject: Re: Native Hadoop library not loaded and Cannot parse sites contents
Hope that
Which version of nutch is this? Did you follow the tutorial? I can help you if
you provide all the steps you did, starting with downloading nutch.
Alex.
-Original Message-
From: Arcondo Dasilva arcondo.dasi...@gmail.com
To: user user@nutch.apache.org
Sent: Fri, Jan 4, 2013 1:23 pm
Hi,
You can unjar the jar file and check if the class that parse complains about is
inside it. You can also try to put the content of the jar file under local/lib.
Maybe there is some read restriction. If this does not help, I can only suggest
starting again with a fresh copy of nutch.
Alex.
Hello,
I use the NodeWalker class at src/java/org/apache/nutch/util/NodeWalker.java
in one of our plugins. I noticed this comment
// Currently this class is not thread safe. It is assumed that only one
// thread will be accessing the NodeWalker at any given time.
above the class
I see that inlinks are saved as ol in hbase.
Alex.
-Original Message-
From: kiran chitturi chitturikira...@gmail.com
To: user user@nutch.apache.org
Sent: Wed, Jan 30, 2013 9:31 am
Subject: Re: Nutch 2.0 updatedb and gora query
Link to the reference (
What do you call inlinks? I call inlinks for mysite.com all urls such as
mysite.com/myhtml1.html, mysite.com/myhtml2.html, etc.
Currently they are saved as ol in hbase. From the hbase shell do this:
get 'webpage', 'com.mysite:http/' and check what the ol family looks like.
I have these config
<property>
Hi,
Not sure about solrdedup, but solrindex worked for me in nutch-1.4 with
solr-4.1.0.
Alex.
-Original Message-
From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com
To: user user@nutch.apache.org
Sent: Wed, Feb 6, 2013 6:13 pm
Subject: Re: Nutch 1.6 +solr 4.1.0
Hi,
We are
Hi,
So, you do not run hadoop, and the nutch job works in distributed mode?
Thanks.
Alex.
-Original Message-
From: k4200 k4...@kazu.tv
To: user user@nutch.apache.org
Sent: Wed, Feb 6, 2013 7:43 pm
Subject: Re: Nutch 2.1 + HBase cluster settings
Hi Lewis,
There seems to be a bug
Are you saying that your sites have the form siteA.mydomain.com,
siteB.mydomain.com, siteC.mydomain.com?
Alex.
-Original Message-
From: mbehlok m_beh...@hotmail.com
To: user user@nutch.apache.org
Sent: Wed, Feb 13, 2013 11:05 am
Subject: Nutch identifier while indexing.
Hello, I
Hello,
I noticed that nutch cannot retrieve the title and inlinks of one of the domains
in the seed list. However, if I run identical code from the server where this
domain is hosted, then it correctly parses it. The surprising thing is that in
both cases this url has
status: 2 (status_fetched)
Hi,
I noticed that for the other urls in the seed, inlinks are saved as ol. I checked
the code and figured out that this is done by the part that saves anchors.
So, in my case inlinks are saved as anchors in the ol field in hbase. But for
one of the urls, title and inlinks are not retrieved,
Hello,
I see that there are
<field dest="segment" source="segment"/>
<field dest="boost" source="boost"/>
<field dest="digest" source="digest"/>
<field dest="tstamp" source="tstamp"/>
fields in addition to title, host and content ones in nutch-2.x'