Hello Guys,
I'm trying to crawl mp3 files on a local filesystem and I get lots of
errors like this:
Error parsing: file:/home/wildan/personal/Musik/Indonesia/Ebit G
Ade/06 rembulan menangis.mp3: org.apache.nutch.parse.ParseException:
parser not found for contentType=application/octet-stream
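Not something this thread confirms, but one direction worth trying: that error means no parser plugin is registered for application/octet-stream. If your Nutch build ships the parse-mp3 plugin, adding it to plugin.includes in conf/nutch-site.xml may help (the rest of the plugin list below is illustrative; keep whatever your setup already uses):

```xml
<!-- Hypothetical nutch-site.xml fragment; plugin list is illustrative. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-file|urlfilter-regex|parse-(text|html|mp3)|index-basic|query-(basic|site|url)</value>
</property>
```

If the content type is still detected as application/octet-stream after that, the mime-type mapping rather than the parser list may be the real issue.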
Can you share the architecture you use? Are you also using Nutch for
the backend?
Regards,
Wildan
On Tue, Jan 27, 2009 at 4:53 PM, Sjaiful Bahri sba...@rocketmail.com wrote:
FYI,
Zipclue is designed to crawl news information on the
web effectively and efficiently.
http://zipclue.com
I don't know... ask the creator of Zipclue, Sjaiful Bachri.
On Thu, Feb 12, 2009 at 12:39 PM, Saurabh Bhutyani saur...@in.com wrote:
Hi Wildan, I can't find any news from the last 23 days when I do a search on
zipclue. What is the crawl frequency? Also, are you storing and displaying the
Hello Buddha,
Read the Nutch wiki; there is plenty of information there. If there is
something unclear, ask here.
Regards,
Wildan
On 2/15/09, buddha1021 buddha1...@yahoo.cn wrote:
hi:
How do I build clusters to search the web with Nutch?
Is there any documentation?
Thank you!
--
OpenThink Labs
Armando, Thanks for the tutorial!
On Wed, Feb 18, 2009 at 6:58 AM, Armando Gonçalves
mandinho...@gmail.com wrote:
Try wiki or this
http://computercranium.com/distributed-systems/distributed-search-using-nutch
--
---
OpenThink Labs
www.tobethink.com
Aligning IT and Education
021-99325243
Hello Nutch User,
I just read a tutorial on how to get information from a segment, and then I
got an error when running the readseg command:
Can anybody tell me why this happens?
wil...@tobethink:/opt/nutch-trunk$ ./bin/nutch crawl -dump
crawl-tobethink/segments/20090306002848/crawl started in:
Thanks Martina ...
Maybe I was a little sleepy when I wrote that command... :)
It works now. Thanks!
Regards,
Wildan
On Fri, Mar 6, 2009 at 7:28 PM, Koch Martina k...@huberverlag.de wrote:
Hi Wildan,
the example you posted doesn't show a readseg command. You're doing a crawl
which tries
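For reference, the readseg invocation that dumps a segment looks like this (paths here are illustrative, 0.8-style command):

```shell
# Dump a segment's contents to a text file for inspection.
bin/nutch readseg -dump crawl-tobethink/segments/20090306002848 dump-out
# The dump lands in a file named "dump" inside the output directory.
less dump-out/dump
```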
Just check out the code from the svn branch and build it yourself; I
think it's easy enough...
On Tue, Mar 17, 2009 at 5:21 PM, Mayank Kamthan mkamt...@gmail.com wrote:
Hello people,
Please provide a pointer to the 0.7 release. I need it urgently.
Thanks n regards,
Mayank.
On Mon, Mar 16,
Logging of the Fetcher output in 0.8-dev used to work (writing to the
corresponding tasktracker output log) but doesn't appear to work any more with
the nightly build from a couple of weeks ago and also the one from last
night.
I've enabled DEBUG for the first 4 logging properties in
Hi Sami,
In case it helps (since I've experienced the same issue), I'm running on a
multiple node setup and run dfs and the nutch commands same as Otis.
However, with my fix of hard-wiring the path of the hadoop.log file in
log4j.properties I get multiple machines and threads trying to write
Ed -
I'm seeing the same problem. If anyone has had a similar experience and
solved it, please let me know. In the meantime, I'll keep investigating and
post back if I figure out what's going wrong.
This may or may not matter, but I'm running everything on a single MP
machine w/o DFS.
Doug
Hi,
What would be the best way to perform crawling with two different
user-agents so as to compare the pages (requested with the two different
agents) returned by a server and accept/reject the url (for subseqent
parsing/indexing etc.)?
I believe the Google crawler used to do (still does?)
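Nutch has no built-in dual-agent comparison that I know of, but the idea can be sketched outside Nutch. The helper names below are made up for illustration, and the size-based comparison is a deliberately crude placeholder:

```python
# Sketch only: fetch one URL with two different User-Agent strings and
# flag pages whose responses differ enough to suggest agent-based cloaking.
# `fetch_with_agent` and `looks_cloaked` are hypothetical helper names.
from urllib.request import Request, urlopen

def fetch_with_agent(url: str, agent: str) -> bytes:
    # Plain HTTP GET with an explicit User-Agent header.
    req = Request(url, headers={"User-Agent": agent})
    with urlopen(req, timeout=30) as resp:
        return resp.read()

def looks_cloaked(body_a: bytes, body_b: bytes, threshold: float = 0.2) -> bool:
    # Crude size-based heuristic; a real comparison would diff the parsed text.
    bigger = max(len(body_a), len(body_b)) or 1
    return abs(len(body_a) - len(body_b)) / bigger > threshold
```

A URL that passes the check could then be accepted for the usual parsing/indexing steps, and rejected otherwise.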
Haven't seen anyone mention this on the lists yet but is probably of
interest to the community:
http://www.techcrunch.com/2006/12/23/wikipedia-to-launch-searchengine-exclusive-screenshot/
(The message below was posted to nutch-dev a few days ago.) Can anyone
(anonymous or otherwise) confirm whether it's possible to use Nutch 0.7 for
a 4-6 billion page search engine? Is this a typo or for real? Just curious
and if it's true what were the major issues e.g. time, RAM, (storage
Partial success on the way to installing Nutch 0.8.1 With Debian Etch.
http://mfgis.com/docs/nutchconfig.html
I would like to relate here my progress towards implementing Nutch
0.8.1 on Debian Etch in hope of receiving help at the stage where I
have become stuck.
So here goes:
Disclaimer: I
/07, Steve W. [EMAIL PROTECTED] wrote:
Partial success on the way to installing Nutch 0.8.1 With Debian Etch.
http://mfgis.com/docs/nutchconfig.html
I would like to relate here my progress towards implementing Nutch
0.8.1 on Debian Etch in hope of receiving help at the stage where I
have become
I documented my approach to this under Debian on the Nutch Wiki here:
http://wiki.apache.org/nutch/GettingNutchRunningWithDebian
Steve Walker
Middle Fork Geographic Information Services
http://mfgis.com
On 3/28/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
Hi,
Nutch Wiki used to have
I'm using Nutch 0.9.
It appears that Nutch is ignoring the http.content.limit number in the config
file. I have left this setting at the default (64K), and the httpclient
plugin logs that value (...httpclient.Http - http.content.limit = 65536),
yet Nutch is attempting to fetch a 115MB file.
I
and Nutch is able to do the right thing.
The default protocol-http plugin does not use the apache commons httpclient
stuff, and works correctly.
On 5/10/07, charlie w [EMAIL PROTECTED] wrote:
I'm using Nutch 0.9.
It appears that Nutch is ignoring the http.content.limit number in the
config file. I
Created as NUTCH-481.
On 5/11/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:
charlie w wrote:
The answer is that http.content.limit is indeed broken in the
protocol-httpclient plugin, though it doesn't really look like it's
entirely Nutch's fault
I've been using the Nutch and Hadoop tutorials on the respective wikis to
try to get Nutch to use Hadoop for crawling, and have worked through many
problems, but now have run up against something I can't work out.
Nutch version is 0.9, and Hadoop is 0.12.2.
To try to keep things simple, I have
I'm seeing a problem where pages are fetched, but are not indexed. I've
pared the crawl down to a very small example using the plain Nutch crawl
tool. It fails consistently with the same url (among others):
http://new.marketwire.com/2.0/rel.jsp?id=710360. The url redirects, so a
-depth option
can be expected?
Thanks,
C
On 7/24/07, charlie w [EMAIL PROTECTED] wrote:
I'm seeing a problem where pages are fetched, but are not indexed. I've
pared the crawl down to a very small example using the plain Nutch crawl
tool. It fails consistently with the same url (among others):
http
With regard to distributed search I see lots of discussion about splitting
the index, but no actual discussion about specifically how that's done.
I have a small, but growing, index. Is it possible to split my existing
index, and if so, how?
Thanks very much for the extended reply; lots of food for thought.
WRT the merge/index time on a large index, I kind of suspected this might be
the case. It's already taking a bit of time (albeit on a weak box) with my
relatively small index. In general the approach you outline sounds like
On 8/1/07, Dennis Kubes [EMAIL PROTECTED] wrote:
I am currently writing a python script to automate this whole process
from inject to pushing out to search servers. It should be done in a
day or two and I will post it on the wiki.
I'm very much looking forward to this. Reading the code
Ah, OK, I get it. Sadly for me, this precise approach is probably not going to
meet my requirements, but it really helps to get me going, and I think a
variation on it will suit me quite well. I'm very much looking forward to
seeing the script that automates this.
I have one minor quibble with
Is there documentation that explains how Nutch does locking? According to
the Lucene doc, the lock should go in java.io.tmpdir, but I never see
anything looking like a lock file appear there. I do see a file write.lock
in the directory where the Lucene index lives.
But strangely, that file is
For my purposes using Nutch, I need to implement my own Similarity class
(really I just extend NutchSimilarity). The similarity class is hardcoded
in the indexer and searcher to NutchSimilarity. It would be more convenient
if this was a configurable setting. I've made changes to the indexer and
Is there a way to get a nutch search server to reopen the index in which it
is searching? Failing that, is there a graceful way to restart the
individual search servers?
Thanks
Charlie
I have crawled a page with both English and Russian (I think) content
into my index but can't seem to get search results when using a
Russian search term.
The page is: http://englishrussia.com/?p=845
The search term is: воды
The term appears in one of the comments ('Comment by Henry').
I've
I have a question about the proper interpretation of a noindex robots
directive in a meta tag (<meta name="robots" content="noindex" />).
When Nutch fetches such a page, the content, title, etc. of the page
is not indexed, but the URL itself is. The document is searchable by
terms in the URL. That
:
charlie w wrote:
I have a question about the proper interpretation of a noindex robots
directive in a meta tag (<meta name="robots" content="noindex" />).
I couldn't find any unambiguous description of this tag in the official
documents (robotstxt.org or HTML 4.01). Should a crawler completely skip
This is in reference to the Nutch content segments
(segments/timestamp/parse_text, etc.), not the segments of a Lucene
index.
I am considering using SegmentMerger to combine a large number of
fetch segments into a single huge segment. Will doing so create a
performance problem when generating
If I use the SegmentMerger tool to merge many fetched content segments
(segments/timestamp/parse_text, etc.) into a single huge segment, do
I then create a performance problem when generating page summaries for
search hits? Are there contention or other issues reading these
fetched segments?
If
Hi,
is it possible to edit the index structure of nutch?
I have following problem:
The files will be indexed by Nutch, the frontend will be implemented with
Zend Framework 1.6.0 (Zend_Search_Lucene).
Zend_Search_Lucene IMO doesn't support the nutch index structure, so I can
only read the title,
with Luke and the nutch webapp I get
results.
Andrzej Bialecki wrote:
Matthias W. wrote:
Hi,
I want to use Nutch for crawling contents and Lucene webapp to search the
Nutch-created index.
I thought nutch creates a Lucene interoperable index, but when I'm
searching
the index with the Lucene
Hi,
every document saved in the nutch index has a unique Id !?
Is it possible to get search the index by this unique Id? (Like 'id:123')
Patrick Markiewicz wrote:
I'm not sure what you're using for searching, but wherever you
reference an analyzer in Lucene, you need to change that from
StandardAnalyzer to
AnalyzerFactory.get(NutchConfiguration.create().get("en")) (which may
require importing nutch-specific classes).
I
Hi,
I've got a text file with all the URLs to index; I don't want to crawl URLs
before indexing.
How do I do this?
Also I'm creating an index in a temporary folder and on success I want to
overwrite the old index.
How do I check in the shell script if the crawl (index) command was
successful?
--
Dennis Kubes-2 wrote:
Just having the urls isn't the same as having an index. You would still
need to crawl them. You can inject your url list into a clean crawldb
and fetch only those urls with the inject, generate, fetch commands.
Then you can use the index command to index them.
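The cycle Dennis describes can be sketched with the stock tools (0.8-style commands, all paths illustrative; the invertlinks and index argument order is as used elsewhere on this list):

```shell
bin/nutch inject crawldb urls          # urls/ holds your text file of URLs
bin/nutch generate crawldb segments    # create a fetchlist segment
s=`ls -d segments/2* | tail -1`        # pick the newest segment
bin/nutch fetch $s                     # fetch only the injected urls
bin/nutch updatedb crawldb $s          # record fetch status in the crawldb
bin/nutch invertlinks linkdb -dir segments
bin/nutch index indexes crawldb linkdb $s
```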
Hello,
I'm new to nutch and have successfully configured the fetching
application
but had some questions about its tomcat search component:
a. should indexes be stored under the webapps dir?
b. can these segments be read with a Luke type application?
c. are the pages being stored as html? if
Message
From: Matthias W. matthias.wang...@e-projecta.com
To: nutch-user@lucene.apache.org
Sent: Tuesday, January 13, 2009 7:17:50 AM
Subject: nutch crawling with java (not shellscript)
Hi,
is there a tutorial or can anyone explain if and how I can run the nutch
crawler via java
Matthias W. matthias.wang...@e-projecta.com
Ok thanks!
But I decided against using the nutch crawler.
It will be better to build the index directly with Lucene, because I
do not need to crawl.
(I'm also searching with Lucene.)
Now I use the PDFBox parser for PDF documents
Hello, I'm new to Nutch. How do I enable PDF indexing support?
conf/nutch-default
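Concretely, PDF parsing is handled by the parse-pdf plugin. A hedged example of enabling it via plugin.includes in conf/nutch-site.xml (the rest of the plugin list is illustrative; keep what your setup already uses):

```xml
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic|query-(basic|site|url)</value>
</property>
```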
Jérôme Charron wrote:
http.content.limit=542256565536 and file.content.limit=4541165536
still the same error:
where do you specify these values? in nutch-default or nutch-site?
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
I don't have a conf/nutch-site.xml
Jérôme Charron wrote:
conf/nutch-default
Check that they are not overridden in the conf/nutch-site
If no, sorry, no more idea for now :-(
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
java.lang.Integer.MAX_VALUE).
Regards
Jérôme
On 11/16/05, Håvard W. Kongsgård [EMAIL PROTECTED] wrote:
I have now added conf/nutch-site.xml but still have the same problem. Related
to the problem? http://sourceforge.net/forum/message.php?msg_id=3391668
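For anyone following along, the override goes in conf/nutch-site.xml. A sketch (values illustrative; -1 disables truncation for both properties):

```xml
<!-- nutch-site.xml fragment; -1 means "no limit" -->
<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>
<property>
  <name>file.content.limit</name>
  <value>-1</value>
</property>
```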
Hi, I am still testing nutch 0.7.1 but now I have another problem.
When I do a normal intranet crawl on some web folders with 2000 pdfs,
nutch only fetches 47 pdfs from each folder.
Do you mean http.content.limit? I have set it to -1 already.
There are no Content truncated at 65536 bytes. Parser can't handle
incomplete errors in the log.
Stefan Groschupf wrote:
Check the maximal content limit in nutch-default.xml
Am 22.11.2005 um 16:38 schrieb Håvard W. Kongsgård
If you want an out of the box solution with another search engine try
this link, http://www.searchtools.com/info/multimedia-search.html
But I don't know if any of them is open source :-(
Aled Jones wrote:
Hi
It's not very clear from the nutch site what Nutch can do with images.
Currently
Hello, I still have some questions about Nutch:
- I want to index about 50 – 100 sites with lots of documents; is it
best to use the Intranet Crawling or the Whole-web Crawling method?
- Is the crawl auto-updated in Nutch, or must I run a cron task?
So how do I update a crawl? The updating section of the FAQ is empty!
http://wiki.apache.org/nutch/FAQ#head-c721b23b43b15885f5ea7d8da62c1c40a37878e6
Doug Cutting wrote:
Håvard W. Kongsgård wrote:
- I want to index about 50 – 100 sites with lots of documents, is it
best use the Intranet
I have followed the media-style.com quick tutorial, but when I try to
fetch my segment the fetch is killed!
Have tried to set the system timer + 30 days, no anti-virus is running
on the systems.
System SUSE 9.2 and SUSE 10
# bin/nutch fetch segments/20060109014654/
060109 014714 parsing
status: 0.17515436 pages/s, 20.703678 kb/s, 15129.917
bytes/page
What is java.net.SocketTimeoutException?
Håvard W. Kongsgård wrote:
Is the fetcher not supposed to fetch all the docs from the urls
provided in the urls.txt file?
The fetch process only takes some seconds, and the whole
/set/print
In any case, those are just logging statements. What makes you think that
something crashed?
Stefan
Am 09.12.2005 um 17:44 schrieb Håvard W. Kongsgård:
But when I fetch the other domains (www.sf.net),
the output is only
060109 014715 http.agent = NutchCVS
writing code for stuff i need :-)
Thanks and Regards,
Pushpesh
On 12/19/05, Håvard W. Kongsgård [EMAIL PROTECTED] wrote:
Try using the whole-web fetching method instead of the crawl method.
http://lucene.apache.org/nutch/tutorial.html#Whole-web+Crawling
http://wiki.media-style.com/display
<property>
  <name>indexer.max.tokens</name>
  <value>1</value>
  <description>The maximum number of tokens that will be indexed for a single field
  in a document. This limits the amount of memory required for
  indexing, so that collections with very large files will not crash
  the indexing process by running out of memory.</description>
</property>
Hi, I am running a nutch server with a db containing 20 docs. When I
start tomcat and search for something the browser displays an empty site.
Is this a memory problem? How do I fix it?
System: 2,6 | Memory 1 GB | SUSE 9.2
of your tomcat nutch deployment.
regards
Dominik
Håvard W. Kongsgård wrote:
Hi, I am running a nutch server with a db containing 20 docs.
When I start tomcat and search for something the browser displays an
empty site.
Is this a memory problem? How do I fix it?
System: 2,6 | Memory 1 GB
Is there a bug in 0.7.1 that causes the fetcher.threads.per.host setting
to be ignored?
Nutch-site.xml:
<property>
  <name>fetcher.server.delay</name>
  <value>15.0</value>
  <description>The number of seconds the fetcher will delay between
  successive requests to the same server.</description>
</property>
Never mind, solved it.
For Tomcat 5, run:
export JAVA_OPTS="-Xmx128m -Xms128m"
Håvard W. Kongsgård wrote:
No, I use 0.7.1. I have tested nutch/tomcat with 20 000 docs, so I know
it works. Searching using site like china site:www.fas.org also works.
Dominik Friedrich wrote:
If you use the mapred
No cluster results are displayed next to the search results.
Is this because I turned clustering on after running the fetch and the
indexing?
nutch-site.xml
No, the current version of Nutch doesn't support password-protected sites;
sites that are password protected show up as HTTP error 404 in the Nutch log.
Andy Morris wrote:
Can nutch access password protected sites?
If so how?
Thanks,
Andy
Hi, I have set up a nutch (0.7.1) system running on multiple servers
following Stefan Groschupf's tutorial
(http://wiki.media-style.com/display/nutchDocu/setup+multiple+search+sever).
I already had a nutch index and a set of segments so I copied some
segments to different servers.
Now I want to add
If your old urls have not expired (30 days), then a bin/nutch generate
will process only the new urls.
Ennio Tosi wrote:
Hi, I created an index from an injected url. My problem is that if now
I inject another url in the webdb, the fetcher reprocesses the
starting url too... Is there a way to
I have been doing some testing on different nutch configurations to see
what slows down the fetching process on my servers (nutch 0.7.1).
My general experience is that the PDF parse process is Nutch's Achilles heel.
Nutch works fine on older computers, but with the combination of
PDFBox-0.7.2 or one of the nightly builds PDFBox-0.7.3-dev...
Steve Betts wrote:
I should have included the link, but I used PDFBox.
Thanks,
Steve Betts
[EMAIL PROTECTED]
937-477-1797
-Original Message-
From: Håvard W. Kongsgård [mailto:[EMAIL PROTECTED]
Sent: Wednesday, January 25
W. Kongsgård wrote:
Could you create a new version from the latest xpdf version?
I know that the older versions of pdftotext (before October 2005) had
some issues with PDF 1.6 (Acrobat 7).
Doug Cutting wrote:
Steve Betts wrote:
I am using PDFBox-0.7.2-log4j.jar. That doesn't make it run
I hope I am emailing the correct address, if not, I apologize.
I am installing on Windows Server 2000 SP4 and am willing to produce a
detailed (with installation notes) installation/setup document for windows
(PDF even) in exchange for your help with this issue.
Windows Nutch testing server (do
So you have been following the quick tutorial for nutch 0.8 and later at
media-style…
The author has left out the parse and updatedb parts.
After the fetch, simply run bin/nutch parse segments/2006xxx and then
bin/nutch updatedb crawldb segments/2006xxx.
Rafit Izhak_Ratzin wrote:
part the parsing is done in the mapping or in
the reducing of the fetch process?
Thanks again,
Rafit
From: Håvard W. Kongsgård [EMAIL PROTECTED]
Reply-To: nutch-user@lucene.apache.org
To: nutch-user@lucene.apache.org
Subject: Re: The parsing is part of the Map or part of the Reduce?
Date
Hi, I have a problem with last Friday's nightly build. When I try to fetch
my segment the fetch process freezes: Aborting with 10 hung threads.
After failing Nutch tries to run the same urls on another tasktracker
but again fails.
I have tried turning fetcher.parse off, protocol-httpclient,
I get the same error (15.02 nightly build)
Gal Nitzan wrote:
I am getting this error all the time. Can't start inject.
060215 183808 parsing file:/home/nutchuser/nutch/conf/hadoop-site.xml
Exception in thread main java.io.IOException: Cannot open
filename
I am unable to set java_home in bin/hadoop, is there a bug? I have used
nutch 0.7.1 with the same java path.
localhost: Error: JAVA_HOME is not set.
if [ -f $HADOOP_HOME/conf/hadoop-env.sh ]; then
source ${HADOOP_HOME}/conf/hadoop-env.sh
fi
# some Java parameters
if [ $JAVA_HOME !=
Thanks it worked. Is there any other path I need to set?
# The java implementation to use.
export JAVA_HOME=/usr/lib/java
Doug Cutting wrote:
Have you edited conf/hadoop-env.sh, and defined JAVA_HOME there?
Doug
Håvard W. Kongsgård wrote:
I am unable to set java_home in bin/hadoop
When searching with nutch the title of pdf documents is a url to the
file like:
http://www.ists.dartmouth.edu/library/wse0901.pdf
I have noticed that google and ultraseek creates a normal title like:
WebALPS: A Survey of E-Commerce Privacy and Security Applications
Is it possible to make nutch
Must I have index-more enabled to get the PDF titles to work?
I did a test with some pdf files; all pdf titles were ignored (nutch 0.7.1).
Håvard W. Kongsgård wrote:
It'd be nice if this was changed so that if a PDF has no title then
the first xx words become the new title.
(but it seems
Take a look at the Google search result of this rand publication
http://www.google.com/search?hs=z0nhl=enlr=client=firefox-arls=org.mozilla%3Aen-US%3Aofficialq=Implementing+Security+Improvement+Options+at+Los+Angeles+International+Airport+btnG=Search
The pdf document (RAND_DB468-1.sum.pdf) has
http://wiki.media-style.com/display/nutchDocu/Home
Roeland Weve wrote:
Hi,
I've installed Nutch 0.7.1 today on Windows XP with Cygwin and tried
to follow the tutorial at:
http://lucene.apache.org/nutch/tutorial.html
But this tutorial seems to be written for another version of Nutch.
What about upgrading from 0.7.1? Can I use my existing db and segments?
Piotr Kosiorowski wrote:
Hello all,
The 0.7.2 release of Nutch is now available. This is a bug fix release
for 0.7 branch. See CHANGES.txt
Hi, I am running nutch 0.7.2 on 3 servers (1 tomcat/db, 2 segment servers on
port 8081).
Is it possible to run bin/nutch dedup on multiple servers so that nutch removes
all duplicated pages?
Run bin/nutch dedup segments dedup.tmp
Dima Mazmanov wrote:
Hi all!! I'm running nutch-0.7.1.
Here is the result of my search.
ArGo Software Design Homepage [html] - 30.2 k - ... Look of our Web
Site Our web site has new look and ... link on the ...
So what filter settings do you use?
Like this: +^http://([a-z0-9]*\.)*bbc.co.uk/
Then you will get bbc.co.uk and www.bbc.co.uk, and
since this site is dynamic, the content might be different.
Have the same problem myself :-(
---
Well my script
Like this
+http://[^/]*\.(com|org|net|biz|mil|us|info|cc)/
-.*
see: http://www.mail-archive.com/nutch-user@lucene.apache.org/msg00479.html
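A quick way to sanity-check such patterns outside Nutch. This mimics urlfilter-regex semantics as I understand them (rules tried top to bottom, '+' accepts, '-' rejects, first matching rule wins); the script itself is just an illustration, not Nutch code:

```python
import re

# Mirror of a regex-urlfilter.txt rule list: sign + pattern, in order.
RULES = [
    ("+", re.compile(r"^http://([a-z0-9]*\.)*bbc.co.uk/")),
    ("-", re.compile(r".")),  # reject everything else
]

def accepts(url: str) -> bool:
    # First rule whose pattern is found anywhere in the url decides.
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False

print(accepts("http://www.bbc.co.uk/news"))  # True
print(accepts("http://example.com/"))        # False
```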
Dima Mazmanov wrote:
I'm not adding urls into urlfilter files.
Besides, I still don't understand how to allow only one zone in
urlfilter.
Let's say I
Don't know, but you can try upgrading to 0.7.2.
See Nutch Change Log:
http://svn.apache.org/viewcvs.cgi/lucene/nutch/branches/branch-0.7/CHANGES.txt?rev=390158
Dima Mazmanov wrote:
Hi,Håvard.
Thank you again for your help.
..mmm, there is one more thing I'm curious about...
The search
For Internet Explorer
http://www.favicon.com/ie.html
Firefox
Works for me in nutch 0.7.2
Is it the right size?
http://www.photoshopsupport.com/tutorials/jennifer/favicon.html
Bill Goffe wrote:
At http://ese.rfe.org I've had Nutch running for some time, but I have a minor
question: how to put
Kerry Wilson wrote:
Trying to use nutch on windows and the executables are shell scripts,
how do you use nutch on windows?
http://wiki.apache.org/nutch/GettingNutchRunningWithWindows
I am trying to get nutch/hadoop to run on 3 servers with SUSE linux.
I have followed the Nutch Hadoop Tutorial and everything works fine (I
can run bin/hadoop dfs -ls), but when I run "bin/nutch inject crawldb
urls" I get this error.
Exception in thread main
When I run “bin/nutch invertlinks linkdb segments” I get this error
Exception in thread main java.io.IOException: Input directory
/user/nutch/segments/parse_data in linux3:9000 is invalid.
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
at
When I try to index my second segment “bin/nutch index issep crawldb
linkdb segments/x” I get this error
Exception in thread main java.io.IOException: Output directory
/user/nutch/issep already exists.
at
Sami Siren wrote:
try “bin/nutch invertlinks linkdb -dir segments”
--
Sami Siren
Håvard W. Kongsgård wrote:
When I run “bin/nutch invertlinks linkdb segments” I get this error
Exception in thread main java.io.IOException: Input directory
/user/nutch/segments/parse_data in linux3:9000
http://www.mail-archive.com/nutch-user@lucene.apache.org/msg02394.html
Teruhiko Kurosaka wrote:
Can I use MapReduce to run Nutch on a multi CPU system?
Yes.
I want to run the index job on two (or four) CPUs
on a single system. I'm not trying to distribute the job
over
In Google the user can search in more than one specific site using OR
admission site:www.stanford.edu OR site: cmu.edu OR site:mit.edu OR
site:berkeley.edu
Is this possible in the nutch web gui?
Does /user/root/url exist? Have you uploaded the url folder to your dfs system?
bin/hadoop dfs -mkdir urls
bin/hadoop dfs -copyFromLocal urls.txt urls/urls.txt
or
bin/hadoop dfs -put <localsrc> <dst>
Mohan Lal wrote:
Hi all,
While I am trying to crawl using distributed machines it throws an error
What is the best way to create a master index on a nutch 0.8 / hadoop system?
Is it to merge all of the segments together and then create an index?
Or, like Roberto Navoni in his tutorial,
first index all the segments separately and then merge the indexes into
one master index?
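Both routes can be sketched with the stock tools. This is only a sketch (0.8-style commands, directory names illustrative), not a recommendation of one route over the other:

```shell
# Route 1: merge the segments first, then build one index over the result.
bin/nutch mergesegs merged_segments -dir segments
bin/nutch index indexes crawldb linkdb merged_segments/*

# Route 2: index per segment (as usual), then merge the part indexes.
bin/nutch merge master_index indexes
```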
I have a problem with my Nutch web gui sometimes returning empty pages
when I do a search. In Nutch 0.7 this was fixed by giving
ipc.client.timeout a higher value in my webapp/ROOT/
WEB-INF/classes/hadoop-site.xml but this has no effect in nutch 0.8.1,
the nutch web gui still times out after
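For reference, the 0.7-era override mentioned above is a plain Hadoop property in the webapp's hadoop-site.xml; a sketch (the value is illustrative):

```xml
<property>
  <name>ipc.client.timeout</name>
  <value>60000</value>
  <description>Timeout in milliseconds for IPC calls.</description>
</property>
```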
/crawl.1/segments/20060929120235
Indexer: done
Dedup: starting
Dedup: adding indexes in: crawl.1/indexes
Dedup: done
Adding /user/root/crawl.1/indexes/part-0
Adding /user/root/crawl.1/indexes/part-1
crawl finished: crawl.1
Thanks and Regards
Mohanlal
"Håvard W. Kongsgård"-2 wrote