problems when crawling mp3 files ...

2009-01-20 Thread W
Hello Guys, I am trying to crawl mp3 files on a local filesystem and get lots of errors like this: Error parsing: file:/home/wildan/personal/Musik/Indonesia/Ebit G Ade/06 rembulan menangis.mp3: org.apache.nutch.parse.ParseException: parser not found for contentType=application/octet-stream
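For readers hitting the same ParseException: the usual cause is that no parser is registered for the reported content type. A hedged sketch of a conf/nutch-site.xml override, assuming your Nutch build ships the parse-mp3 plugin; the rest of the value should be copied from your nutch-default.xml and extended, not replaced wholesale:

```xml
<!-- Sketch only: register extra parsers by extending plugin.includes.
     parse-mp3 is assumed to be present in this Nutch build. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-file|urlfilter-regex|parse-(text|html|mp3)|index-basic|query-(basic|site|url)</value>
</property>
```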

Re: Crawl News Web

2009-01-27 Thread W
Can you share the architecture you use? Are you also using Nutch for the backend? Regards, Wildan On Tue, Jan 27, 2009 at 4:53 PM, Sjaiful Bahri sba...@rocketmail.com wrote: FYI, Zipclue is designed to crawl news information on the web effectively and efficiently. http://zipclue.com

Re: Crawl News Web

2009-02-11 Thread W
I don't know; ask the creator of zipclue, Sjaiful Bachri. On Thu, Feb 12, 2009 at 12:39 PM, Saurabh Bhutyani saur...@in.com wrote: Hi Wildan, I don't find recent news from the last 23 days when I do a search on zipclue. What is the crawl frequency? Also, are you storing and displaying the

Re: How to build clusters?

2009-02-15 Thread W
Hello Buddha, Read the Nutch wiki; there is plenty of information there. If something is unclear, ask here. Regards, Wildan On 2/15/09, buddha1021 buddha1...@yahoo.cn wrote: hi: How do I build clusters to search the web through nutch? Any documentation? thank you! -- -- --- OpenThink Labs

Re: How to build clusters?

2009-02-17 Thread W
Armando, Thanks for the tutorial! On Wed, Feb 18, 2009 at 6:58 AM, Armando Gonçalves mandinho...@gmail.com wrote: Try wiki or this http://computercranium.com/distributed-systems/distributed-search-using-nutch -- --- OpenThink Labs www.tobethink.com Aligning IT and Education 021-99325243

readseg error

2009-03-06 Thread W
Hello Nutch users, I just read a tutorial on how to get information from a segment, and then got an error when running the readseg command. Can anybody tell me why this happens? wil...@tobethink:/opt/nutch-trunk$ ./bin/nutch crawl -dump crawl-tobethink/segments/20090306002848/crawl started in:

Re: readseg error

2009-03-06 Thread W
Thanks Martina ... Maybe I was a little sleepy when I wrote that command .. :) It works now ... Thanks! Regards, Wildan On Fri, Mar 6, 2009 at 7:28 PM, Koch Martina k...@huberverlag.de wrote: Hi Wildan, the example you posted doesn't show a readseg command. You're doing a crawl which tries
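For anyone landing on this thread: the fix Martina pointed out was that the posted command invoked the crawl tool with -dump instead of the readseg tool. A hedged sketch of the intended invocation (segment path taken from the original message; the output directory name is illustrative):

```shell
# Dump the segment's stored data (crawl_fetch, content, parse_text, ...)
# into a readable text dump for inspection.
bin/nutch readseg -dump crawl-tobethink/segments/20090306002848 segdump
less segdump/dump
```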

Re: nutch 0.7

2009-03-17 Thread W
Just check out the code from the svn branch and build it yourself; I think it's easy enough ... On Tue, Mar 17, 2009 at 5:21 PM, Mayank Kamthan mkamt...@gmail.com wrote: Hello ppl, Please provide a pointer to the 0.7 release. I need it urgently. Thanks and regards, Mayank. On Mon, Mar 16,

Problem with logging of Fetcher output in 0.8-dev

2006-07-25 Thread e w
Logging of the Fetcher output in 0.8-dev used to work (writing to the corresponding tasktracker output log), but it no longer appears to with the nightly build from a couple of weeks ago, nor with the one from last night. I've enabled DEBUG for the first 4 logging properties in

Re: [Nutch-general] log4j.properties bug(?)

2006-08-12 Thread e w
Hi Sami, In case it helps (since I've experienced the same issue): I'm running on a multiple-node setup and run dfs and the nutch commands the same as Otis. However, with my fix of hard-wiring the path of the hadoop.log file in log4j.properties, I get multiple machines and threads trying to write

Re: Problem with logging of Fetcher output in 0.8-dev

2006-08-23 Thread e w
Ed- I'm seeing the same problem. If anyone has had a similar experience and solved it, please let me know. In the meantime, I'll keep investigating and post back if I figure out what's going wrong. This may or may not matter, but I'm running everything on a single MP machine w/o DFS. Doug

Fetching with two different user agents

2006-11-13 Thread e w
Hi, What would be the best way to perform crawling with two different user-agents, so as to compare the pages (requested with the two different agents) returned by a server and accept/reject the url (for subsequent parsing/indexing etc.)? I believe the Google crawler used to do (still does?)

New Wikipedia search engine using Nutch

2006-12-25 Thread e w
Haven't seen anyone mention this on the lists yet but is probably of interest to the community: http://www.techcrunch.com/2006/12/23/wikipedia-to-launch-searchengine-exclusive-screenshot/

Re: Nutch Programmer Wanted

2007-01-07 Thread e w
(The message below was posted to nutch-dev a few days ago.) Can anyone (anonymous or otherwise) confirm whether it's possible to use Nutch 0.7 for a 4-6 billion page search engine? Is this a typo or for real? Just curious and if it's true what were the major issues e.g. time, RAM, (storage

Partial Success installing Nutch 0.8.1 under Debian Etch: Procedure and Question(s)

2007-01-24 Thread Steve W.
Partial success on the way to installing Nutch 0.8.1 with Debian Etch. http://mfgis.com/docs/nutchconfig.html I would like to relate here my progress towards implementing Nutch 0.8.1 on Debian Etch in hope of receiving help at the stage where I have become stuck. So here goes: Disclaimer: I

Partial Success installing Nutch 0.8.1 under Debian Etch: Procedure and Question(s)

2007-02-02 Thread Steve W.
Steve W. [EMAIL PROTECTED] wrote: Partial success on the way to installing Nutch 0.8.1 with Debian Etch. http://mfgis.com/docs/nutchconfig.html I would like to relate here my progress towards implementing Nutch 0.8.1 on Debian Etch in hope of receiving help at the stage where I have become

Re: 1 Nutch, multiple indices?

2007-03-28 Thread Steve W.
I documented my approach to this under Debian on the Nutch Wiki here: http://wiki.apache.org/nutch/GettingNutchRunningWithDebian Steve Walker Middle Fork Geographic Information Services http://mfgis.com On 3/28/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Hi, Nutch Wiki used to have

http content limit not working?

2007-05-10 Thread charlie w
I'm using Nutch 0.9. It appears that Nutch is ignoring the http.content.limit number in the config file. I have left this setting at the default (64K), and the httpclient plugin logs that value (...httpclient.Http - http.content.limit = 65536), yet Nutch is attempting to fetch a 115MB file. I

Re: http content limit not working?

2007-05-11 Thread charlie w
and Nutch is able to do the right thing. The default protocol-http plugin does not use the Apache Commons HttpClient stuff, and works correctly. On 5/10/07, charlie w [EMAIL PROTECTED] wrote: I'm using Nutch 0.9. It appears that Nutch is ignoring the http.content.limit number in the config file. I
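The setting under discussion lives in conf/nutch-site.xml; a sketch of the default value the poster mentions. Per the thread's conclusion (filed as NUTCH-481), the protocol-httpclient plugin ignored this limit at the time, while the default protocol-http plugin honored it:

```xml
<property>
  <name>http.content.limit</name>
  <!-- 64 KB cap on fetched content; -1 disables truncation -->
  <value>65536</value>
</property>
```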

Re: http content limit not working?

2007-05-11 Thread charlie w
Created as NUTCH-481. On 5/11/07, Andrzej Bialecki [EMAIL PROTECTED] wrote: charlie w wrote: The answer is that http.content.limit is indeed broken in the protocol-httpclient plugin, though it doesn't really look like it's entirely Nutch's fault

can't crawl with hadoop under cygwin

2007-07-16 Thread charlie w
I've been using the Nutch and Hadoop tutorials on the respective wikis to try to get Nutch to use Hadoop for crawling, and have worked through many problems, but now have run up against something I can't work out. Nutch version is 0.9, and Hadoop is 0.12.2. To try to keep things simple, I have

documents fetched but not indexed (Nutch 0.9)

2007-07-24 Thread charlie w
I'm seeing a problem where pages are fetched, but are not indexed. I've pared the crawl down to a very small example using the plain Nutch crawl tool. It fails consistently with the same url (among others): http://new.marketwire.com/2.0/rel.jsp?id=710360. The url redirects, so a -depth option

Re: documents fetched but not indexed (Nutch 0.9)

2007-07-25 Thread charlie w
can be expected? Thanks, C On 7/24/07, charlie w [EMAIL PROTECTED] wrote: I'm seeing a problem where pages are fetched, but are not indexed. I've pared the crawl down to a very small example using the plain Nutch crawl tool. It fails consistently with the same url (among others): http

spliting an index

2007-07-31 Thread charlie w
With regard to distributed search I see lots of discussion about splitting the index, but no actual discussion about specifically how that's done. I have a small, but growing, index. Is it possible to split my existing index, and if so, how?

Re: Nutch and distributed searching (w/ apologies)

2007-08-01 Thread charlie w
Thanks very much for the extended reply; lots of food for thought. WRT the merge/index time on a large index, I kind of suspected this might be the case. It's already taking a bit of time (albeit on a weak box) with my relatively small index. In general the approach you outline sounds like

Re: Nutch and distributed searching (w/ apologies)

2007-08-01 Thread charlie w
On 8/1/07, Dennis Kubes [EMAIL PROTECTED] wrote: I am currently writing a python script to automate this whole process from inject to pushing out to search servers. It should be done in a day or two and I will post it on the wiki. I'm very much looking forward to this. Reading the code

Re: Nutch and distributed searching (w/ apologies)

2007-08-02 Thread charlie w
Ah, OK, I get it. Sadly for me, this precise approach is probably not going to meet my requirements, but it really helps to get me going, and I think a variation on it will suit me quite well. I'm very much looking forward to seeing the script that automates this. I have one minor quibble with

index locking in nutch

2007-08-07 Thread charlie w
Is there documentation that explains how Nutch does locking? According to the Lucene doc, the lock should go in java.io.tmpdir, but I never see anything looking like a lock file appear there. I do see a file write.lock in the directory where the Lucene index lives. But strangely, that file is

NutchSimilarity

2007-08-09 Thread charlie w
For my purposes using Nutch, I need to implement my own Similarity class (really I just extend NutchSimilarity). The similarity class is hardcoded in the indexer and searcher to NutchSimilarity. It would be more convenient if this were a configurable setting. I've made changes to the indexer and

distributed search server

2007-09-26 Thread charlie w
Is there a way to get a nutch search server to reopen the index in which it is searching? Failing that, is there a graceful way to restart the individual search servers? Thanks Charlie

Problems with mixed English/Russian page

2007-11-26 Thread charlie w
I have crawled a page with both English and Russian (I think) content into my index but can't seem to get search results when using a Russian search term. The page is: http://englishrussia.com/?p=845 The search term is: воды The term appears in one of the comments ('Comment by Henry'). I've

semantics of meta noindex

2007-12-18 Thread charlie w
I have a question about the proper interpretation of a noindex robots directive in a meta tag (meta name=robots content=noindex /). When Nutch fetches such a page, the content, title, etc. of the page is not indexed, but the URL itself is. The document is searchable by terms in the URL. That

Re: semantics of meta noindex

2007-12-19 Thread charlie w
charlie w wrote: I have a question about the proper interpretation of a noindex robots directive in a meta tag (meta name=robots content=noindex /). I couldn't find any unambiguous description of this tag in the official documents (robotstxt.org or HTML 4.01). Should a crawler completely skip

large content/parse segments

2008-05-14 Thread charlie w
This is in reference to the Nutch content segments (segments/timestamp/parse_text, etc.), not the segments of a Lucene index. I am considering using SegmentMerger to combine a large number of fetch segments into a single huge segment. Will doing so create a performance problem when generating

Is there a performance penalty for merging content segments?

2008-05-29 Thread charlie w
If I use the SegmentMerger tool to merge many fetched content segments (segments/timestamp/parse_text, etc.) into a single huge segment, do I then create a performance problem when generating page summaries for search hits? Are there contention or other issues reading these fetched segments? If

Edit index structure

2008-09-11 Thread Matthias W.
Hi, is it possible to edit the index structure of nutch? I have the following problem: the files will be indexed by Nutch; the frontend will be implemented with Zend Framework 1.6.0 (Zend_Search_Lucene). Zend_Search_Lucene IMO doesn't support the nutch index structure, so I can only read the title,

Re: Using Nutch for crawling and Lucene for searching (Wildcard/Fuzzy)

2008-10-15 Thread Matthias W.
with Luke and the nutch webapp I get results. Andrzej Bialecki wrote: Matthias W. wrote: Hi, I want to use Nutch for crawling contents and Lucene webapp to search the Nutch-created index. I thought nutch creates a Lucene interoperable index, but when I'm searching the index with the Lucene

searching by Id

2008-10-21 Thread Matthias W.
Hi, every document saved in the nutch index has a unique Id!? Is it possible to search the index by this unique Id? (Like 'id:123')

RE: Using Nutch for crawling and Lucene for searching (Wildcard/Fuzzy)

2008-11-03 Thread Matthias W.
Patrick Markiewicz wrote: I'm not sure what you're using for searching, but wherever you reference an analyzer in Lucene, you need to change that from StandardAnalyzer to AnalyzerFactory.get(NutchConfiguration.create().get(en)) (which may require importing nutch-specific classes). I

my own crawlscript.sh

2008-12-05 Thread Matthias W.
Hi, I've got a textfile with all URLs to index; I don't want to crawl URLs before indexing. How do I do this? Also, I'm creating an index in a temporary folder, and on success I want to overwrite the old index. How do I check in the shell script if the crawl (index) command was successful?

Re: my own crawlscript.sh

2008-12-08 Thread Matthias W.
Dennis Kubes-2 wrote: Just having the urls isn't the same as having an index. You would still need to crawl them. You can inject your url list into a clean crawldb and fetch only those urls with the inject, generate, fetch commands. Then you can use the index command to index them.
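For the second half of the original question (detecting whether the crawl/index step succeeded before replacing the old index), a shell exit-status check is enough. A minimal sketch; `build_index` is a stand-in for the real `bin/nutch` pipeline, and all paths are illustrative:

```shell
#!/bin/sh
# Stand-in for the real pipeline, e.g.:
#   bin/nutch crawl urls.txt -dir "$1" -depth 1
build_index() {
  mkdir -p "$1" && echo "segments" > "$1/index"
}

TMP_INDEX="$(mktemp -d)/new-index"
LIVE_INDEX=./live-index

# Shell commands report success via their exit status, so only swap the
# new index in when the build step returned 0.
if build_index "$TMP_INDEX"; then
  rm -rf "$LIVE_INDEX"
  mv "$TMP_INDEX" "$LIVE_INDEX"
  echo "index swapped in"
else
  echo "crawl failed; keeping old index" >&2
fi
```

The same pattern works unchanged with the real inject/generate/fetch/index commands, since they too signal failure through a non-zero exit status.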

nutch questions

2008-12-12 Thread Peter W .
Hello, I'm new to nutch and have successfully configured the fetching application but had some questions about its tomcat search component: a. should indexes be stored under the webapps dir? b. can these segments be read with a Luke type application? c. are the pages being stored as html? if

Re: nutch crawling with java (not shellscript)

2009-01-14 Thread Matthias W.
Message From: Matthias W. matthias.wang...@e-projecta.com To: nutch-user@lucene.apache.org Sent: Tuesday, January 13, 2009 7:17:50 AM Subject: nutch crawling with java (not shellscript) Hi, is there a tutorial or can anyone explain if and how I can run the nutch crawler via java

Re: nutch crawling with java (not shellscript)

2009-01-14 Thread Matthias W.
Matthias W. matthias.wang...@e-projecta.com Ok, thanks! But I decided against using the nutch crawler. It will be better to build the index directly with Lucene, because I do not need to crawl. (I'm also searching with Lucene.) Now I use the parsers PDFBox for PDF-Documents

PDF indexing support?

2005-11-14 Thread Håvard W. Kongsgård
Hello, I am new to Nutch; how do I enable PDF indexing support?
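PDF support in this era of Nutch was a parse plugin; a hedged sketch of the usual answer, enabling parse-pdf through plugin.includes in conf/nutch-site.xml (copy the rest of the value from your nutch-default.xml rather than replacing it wholesale):

```xml
<!-- Sketch only: add parse-pdf to the enabled plugin list. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic|query-(basic|site|url)</value>
</property>
```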

Re: PDF indexing support?

2005-11-15 Thread Håvard W. Kongsgård
conf/nutch-default Jérôme Charron wrote: http.content.limit=542256565536 and file.content.limit=4541165536 still the same error: where do you specify these values? in nutch-default or nutch-site? Jérôme -- http://motrech.free.fr/ http://www.frutch.org/

Re: PDF indexing support?

2005-11-15 Thread Håvard W. Kongsgård
Don't have a conf/nutch-site.xml Jérôme Charron wrote: conf/nutch-default Check that they are not overridden in conf/nutch-site. If not, sorry, no more ideas for now :-( Jérôme -- http://motrech.free.fr/ http://www.frutch.org/

Re: PDF indexing support?

2005-11-16 Thread Håvard W. Kongsgård
java.lang.Integer.MAX_VALUE). Regards Jérôme On 11/16/05, Håvard W. Kongsgård [EMAIL PROTECTED] wrote: Have now added conf/nutch-site.xml but still the same problem. | Related to the problem? http://sourceforge.net/forum/message.php?msg_id=3391668 http://sourceforge.net/forum/message.php?msg_id

Intranet crawl folder

2005-11-22 Thread Håvard W. Kongsgård
Hi, I am still testing nutch 0.7.1, but now I have another problem. When I do a normal intranet crawl on some web folders with 2000 pdfs, nutch only fetches 47 pdfs from each folder.

Re: Intranet crawl folder

2005-11-22 Thread Håvard W. Kongsgård
Do you mean http.content.limit? I have set it to -1 already. There are no "Content truncated at 65536 bytes. Parser can't handle incomplete" errors in the log. Stefan Groschupf wrote: Check the maximal content limit in nutch-default.xml On 22.11.2005 at 16:38, Håvard W. Kongsgård wrote

Re: Images

2005-11-22 Thread Håvard W. Kongsgård
If you want an out of the box solution with another search engine try this link, http://www.searchtools.com/info/multimedia-search.html But I don't know if any of them is open source :-( Aled Jones wrote: Hi It's not very clear from the nutch site what can nutch do with images. Currently

Crawl auto updated in nutch?

2005-11-25 Thread Håvard W. Kongsgård
Hello, I still have some questions about nutch - I want to index about 50–100 sites with lots of documents; is it best to use the Intranet Crawling or the Whole-web Crawling method? - Is the crawl auto-updated in nutch, or must I run a cron task

Re: Crawl auto updated in nutch?

2005-11-28 Thread Håvard W. Kongsgård
So how do I update a crawl? The updating section of the FAQ is empty! http://wiki.apache.org/nutch/FAQ#head-c721b23b43b15885f5ea7d8da62c1c40a37878e6 Doug Cutting wrote: Håvard W. Kongsgård wrote: - I want to index about 50–100 sites with lots of documents; is it best to use the Intranet
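Since the FAQ section was empty at the time, a hedged sketch of one manual recrawl cycle with the 0.7-era command names (db/segment paths are illustrative; verify the exact arguments against `bin/nutch` with no arguments on your build):

```shell
# One recrawl cycle: pick due URLs, fetch them, fold results back into
# the webdb, then index and de-duplicate the new segment.
bin/nutch generate db segments
seg=$(ls -d segments/* | tail -1)   # the segment just generated
bin/nutch fetch "$seg"
bin/nutch updatedb db "$seg"
bin/nutch index "$seg"
bin/nutch dedup segments dedup.tmp
```

Running this from cron is what gives the "auto-updated" behavior asked about above.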

Re: Crawl auto updated in nutch?

2005-11-29 Thread Håvard W. Kongsgård
So how do I update a crawl? The updating section of the FAQ is empty :-( http://wiki.apache.org/nutch/FAQ#head-c721b23b43b15885f5ea7d8da62c1c40a37878e6 Doug Cutting wrote: Håvard W. Kongsgård wrote: - I want to index about 50–100 sites with lots of documents; is it best to use the Intranet

Problem with fetching segment

2005-12-08 Thread Håvard W. Kongsgård
I have followed the media-style.com quick tutorial, but when I try to fetch my segment the fetch is killed! I have tried setting the system timer +30 days; no anti-virus is running on the systems. Systems: SUSE 9.2 and SUSE 10 # bin/nutch fetch segments/20060109014654/ 060109 014714 parsing

Re: Problem with fetching segment

2005-12-09 Thread Håvard W. Kongsgård
status: 0.17515436 pages/s, 20.703678 kb/s, 15129.917 bytes/page -.-.-.-.-.- What is java.net.SocketTimeoutException? Håvard W. Kongsgård wrote: Is the fetcher not supposed to fetch all the docs from the urls provided in the urls.txt file? The fetch process only takes a few seconds, and the whole

Re: Problem with fetching segment

2005-12-13 Thread Håvard W. Kongsgård
In any case, those are just logging statements; what makes you think that something crashed? Stefan On 09.12.2005 at 17:44, Håvard W. Kongsgård wrote: But when I fetch the other domains www.sf.net http://www.sf.net/, the output is only 060109 014715 http.agent = NutchCVS

Re: is nutch recrawl possible?

2005-12-19 Thread Håvard W. Kongsgård
writing code for stuff i need :-) Thanks and Regards, Pushpesh On 12/19/05, Håvard W. Kongsgård [EMAIL PROTECTED] wrote: Try using the whole-web fetching method instead of the crawl method. http://lucene.apache.org/nutch/tutorial.html#Whole-web+Crawling http://wiki.media-style.com/display

Re: Out of memory exception-while updating

2005-12-20 Thread Håvard W. Kongsgård
<property> <name>indexer.max.tokens</name> <value>1</value> <description>The maximum number of tokens that will be indexed for a single field in a document. This limits the amount of memory required for indexing, so that collections with very large files will not crash the indexing process by

Search result is an empty site

2006-01-09 Thread Håvard W. Kongsgård
Hi, I am running a nutch server with a db containing 20 docs. When I start tomcat and search for something, the browser displays an empty site. Is this a memory problem? How do I fix it? System: 2,6 | Memory 1 GB | SUSE 9.2

Re: Search result is an empty site

2006-01-09 Thread Håvard W. Kongsgård
of your tomcat nutch deployment. regards Dominik Håvard W. Kongsgård wrote: Hi, I am running a nutch server with a db containing 20 docs. When I start tomcat and search for something the browser displays an empty site. Is this a memory problem, how do I fix it? System: 2,6 | Memory 1 GB

fetcher.threads.per.host bug in 0.7.1?

2006-01-09 Thread Håvard W. Kongsgård
Is there a bug in 0.7.1 that causes the fetcher.threads.per.host setting to be ignored? nutch-site.xml: <property> <name>fetcher.server.delay</name> <value>15.0</value> <description>The number of seconds the fetcher will delay between successive requests to the same server.</description> </property>

Re: Search result is an empty site

2006-01-09 Thread Håvard W. Kongsgård
Never mind, solved it: for Tomcat 5, run export JAVA_OPTS="-Xmx128m -Xms128m" Håvard W. Kongsgård wrote: No, I use 0.7.1. I have tested nutch/tomcat with 20 000 docs, so I know it works. Searching using site, like china site:www.fas.org, also works. Dominik Friedrich wrote: If you use the mapred

No cluster results

2006-01-09 Thread Håvard W. Kongsgård
"No cluster results" is displayed next to the search results. Is this because I turned clustering on after running the fetch and the indexing? nutch-site.xml

Re: Access pasword protected sites?

2006-01-13 Thread Håvard W. Kongsgård
No, the current version of nutch doesn't support password-protected sites; sites that are password protected = http error 404 in the nutch log. Andy Morris wrote: Can nutch access password protected sites? If so, how? Thanks, Andy

Nutch system running on multiple servers | fetcher

2006-01-17 Thread Håvard W. Kongsgård
Hi, I have set up a nutch (0.7.1) system running on multiple servers following Stefan Groschupf's tutorial (http://wiki.media-style.com/display/nutchDocu/setup+multiple+search+sever). I already had a nutch index and a set of segments, so I copied some segments to different servers. Now I want to add

Re: Injecting new url

2006-01-24 Thread Håvard W. Kongsgård
If your old urls have not expired (30 days), then a bin/nutch generate will process only the new urls. Ennio Tosi wrote: Hi, I created an index from an injected url. My problem is that if I now inject another url into the webdb, the fetcher reprocesses the starting url too... Is there a way to

Parsing PDF Nutch Achilles heel?

2006-01-25 Thread Håvard W. Kongsgård
I have been doing some testing on different nutch configurations to see what slows down the fetching process on my servers (nutch 0.7.1). My general experience is that the PDF parse process is Nutch's Achilles heel. Nutch works fine on older computers, but with the combination of

Re: Parsing PDF Nutch Achilles heel?

2006-01-25 Thread Håvard W. Kongsgård
PDFBox-0.7.2 or one of the nightly builds PDFBox-0.7.3-dev... Steve Betts wrote: I should have included the link, but I used PDFBox. Thanks, Steve Betts [EMAIL PROTECTED] 937-477-1797 -Original Message- From: Håvard W. Kongsgård [mailto:[EMAIL PROTECTED] Sent: Wednesday, January 25

Re: Parsing PDF Nutch Achilles heel?

2006-01-26 Thread Håvard W. Kongsgård
W. Kongsgård wrote: Could you create a new version from the latest xpdf version? I know that the older versions of pdftotext (before October 2005) had some issues with PDF 1.6 (Acrobat 7). Doug Cutting wrote: Steve Betts wrote: I am using PDFBox-0.7.2-log4j.jar. That doesn't make it run

Connecting the search to the Db

2006-01-27 Thread Lawrence W Thorne
I hope I am emailing the correct address, if not, I apologize. I am installing on Windows Server 2000 SP4 and am willing to produce a detailed (with installation notes) installation/setup document for windows (PDF even) in exchange for your help with this issue. Windows Nutch testing server (do

RE: readdb command error, help

2006-01-28 Thread Lawrence W Thorne
I hope I am emailing the correct address, if not, I apologize. I am installing on Windows Server 2000 SP4 and am willing to produce a detailed (with installation notes) installation/setup document for windows (PDF even) in exchange for your help with this issue. Windows Nutch testing server (do

Re: The parsing is part of the Map or part of the Reduce?

2006-01-28 Thread Håvard W. Kongsgård
So you have been following the quick tutorial for nutch 0.8 and later at media-style… The author has left out the parse and updatedb parts. After the fetch, simply run bin/nutch parse segment/2006 and then bin/nutch updatedb crawldb segment/2006xxx. Rafit Izhak_Ratzin wrote:

Re: The parsing is part of the Map or part of the Reduce?

2006-01-28 Thread Håvard W. Kongsgård
part the parsing is done in the mapping or in the reducing of the fetch process? Thanks again, Rafit From: Håvard W. Kongsgård [EMAIL PROTECTED] Reply-To: nutch-user@lucene.apache.org To: nutch-user@lucene.apache.org Subject: Re: The parsing is part of the Map or part of the Reduce? Date

Hung threads

2006-01-29 Thread Håvard W. Kongsgård
Hi, I have a problem with last Friday's nightly build. When I try to fetch my segment, the fetch process freezes: Aborting with 10 hung threads. After failing, Nutch tries to run the same urls on another tasktracker but fails again. I have tried turning fetcher.parse off, protocol-httpclient,

Re: Nutch inject problem with hadoop - Missing /tmp/hadoop/mapred/system

2006-02-15 Thread Håvard W. Kongsgård
I get the same error (15.02 nightly build) Gal Nitzan wrote: I am getting this error all the time. Cant start inject. 060215 183808 parsing file:/home/nutchuser/nutch/conf/hadoop-site.xml Exception in thread main java.io.IOException: Cannot open filename

Problem/bug setting java_home in hadoop nightly 16.02.06

2006-02-16 Thread Håvard W. Kongsgård
I am unable to set java_home in bin/hadoop; is this a bug? I have used nutch 0.7.1 with the same java path. localhost: Error: JAVA_HOME is not set. if [ -f $HADOOP_HOME/conf/hadoop-env.sh ]; then source ${HADOOP_HOME}/conf/hadoop-env.sh fi # some Java parameters if [ $JAVA_HOME !=

Re: Problem/bug setting java_home in hadoop nightly 16.02.06

2006-02-17 Thread Håvard W. Kongsgård
Thanks, it worked. Is there any other path I need to set? # The java implementation to use. export JAVA_HOME=/usr/lib/java Doug Cutting wrote: Have you edited conf/hadoop-env.sh and defined JAVA_HOME there? Doug Håvard W. Kongsgård wrote: I am unable to set java_home in bin/hadoop

Pdf document title in nutch search

2006-02-20 Thread Håvard W. Kongsgård
When searching with nutch, the title of a pdf document is a url to the file, like: http://www.ists.dartmouth.edu/library/wse0901.pdf I have noticed that google and ultraseek create a normal title, like: WebALPS: A Survey of E-Commerce Privacy and Security Applications Is it possible to make nutch

Re: Pdf document title in nutch search

2006-02-20 Thread Håvard W. Kongsgård
Must I have index-more enabled to get the pdf titles to work? I did a test with some pdf files; all pdf titles were ignored (nutch 0.7.1). Håvard W. Kongsgård wrote: It'd be nice if this were changed so that if a PDF has no title then the first xx words become the new title. (but it seems

Re: Pdf document title in nutch search

2006-02-21 Thread Håvard W. Kongsgård
Take a look at the Google search result of this rand publication http://www.google.com/search?hs=z0nhl=enlr=client=firefox-arls=org.mozilla%3Aen-US%3Aofficialq=Implementing+Security+Improvement+Options+at+Los+Angeles+International+Airport+btnG=Search The pdf document (RAND_DB468-1.sum.pdf) has

Re: nutch 0.7.1 where is the tutorial? crawldb not found?

2006-02-25 Thread Håvard W. Kongsgård
http://wiki.media-style.com/display/nutchDocu/Home Roeland Weve wrote: Hi, I've installed Nutch 0.7.1 today on Windows XP with Cygwin and tried to follow the tutorial at: http://lucene.apache.org/nutch/tutorial.html But this tutorial seems to be written for another version of Nutch.

Re: Nutch 0.7.2 release | upgrading from 0.7.1?

2006-04-02 Thread Håvard W. Kongsgård
What about upgrading from 0.7.1? Can I use my existing db and segments? Piotr Kosiorowski wrote: Hello all, The 0.7.2 release of Nutch is now available. This is a bug fix release for 0.7 branch. See CHANGES.txt

How to run bin/nutch dedup when running multiple servers

2006-04-15 Thread Håvard W. Kongsgård
Hi, I am running nutch 0.7.2 on 3 servers (1 tomcat/db, 2 segment servers on port 8081); is it possible to run bin/nutch dedup on multiple servers so that nutch removes all duplicated pages?

Re: Nutch shows same results multiple times.

2006-04-18 Thread Håvard W. Kongsgård
Run bin/nutch dedup segments dedup.tmp Dima Mazmanov wrote: Hi all! I'm running nutch-0.7.1. Here is a result of my search. ArGo Software Design Homepage [html] - 30.2 k - ... Look of our Web Site Our web site has new look and ... link on the ...

Re: Nutch shows same results multiple times.

2006-04-19 Thread Håvard W. Kongsgård
So what filter settings do you use? Like this: +^http://([a-z0-9]*\.)*bbc.co.uk/ Then you will get bbc.co.uk and www.bbc.co.uk, and since this site is dynamic, content might be different. I have the same problem myself :-( --- Well my script
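The host-matching behavior described here can be checked outside Nutch with any extended-regex tool. A small sketch; note the unescaped dots in the quoted pattern are kept verbatim, so they match any character:

```shell
# Test which URLs the quoted crawl-urlfilter pattern accepts.
pattern='^http://([a-z0-9]*\.)*bbc.co.uk/'
for url in 'http://bbc.co.uk/' 'http://www.bbc.co.uk/' 'http://news.example.com/'; do
  if printf '%s\n' "$url" | grep -Eq "$pattern"; then
    echo "accept: $url"
  else
    echo "reject: $url"
  fi
done
```

As the thread says, both the bare and the www. form of the host pass the filter, which is why the same page can show up twice in results.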

Re: Nutch shows same results multiple times.

2006-04-20 Thread Håvard W. Kongsgård
Like this: +http://[^/]*\.(com|org|net|biz|mil|us|info|cc)/ -.* see: http://www.mail-archive.com/nutch-user@lucene.apache.org/msg00479.html Dima Mazmanov wrote: I'm not adding urls into the urlfilter files. Besides, I still don't understand how to allow only one zone in the urlfilter. Let's say I

Re: Nutch shows same results multiple times.

2006-04-20 Thread Håvard W. Kongsgård
Don't know, but you can try upgrading to 0.7.2. See the Nutch change log: http://svn.apache.org/viewcvs.cgi/lucene/nutch/branches/branch-0.7/CHANGES.txt?rev=390158 Dima Mazmanov wrote: Hi, Håvard. Thank you again for your help. ..mmm. there is one more thing I'm curious about... The search

Re: favicon?

2006-04-21 Thread Håvard W. Kongsgård
For Internet Explorer: http://www.favicon.com/ie.html In Firefox it works for me in nutch 0.7.2. Is it the right size? http://www.photoshopsupport.com/tutorials/jennifer/favicon.html Bill Goffe wrote: At http://ese.rfe.org I've had Nutch running for some time, but I have a minor question: how to put

Re: Nutch on Windows

2006-07-14 Thread Håvard W. Kongsgård
Kerry Wilson wrote: Trying to use nutch on windows and the executables are shell scripts, how do you use nutch on windows? http://wiki.apache.org/nutch/GettingNutchRunningWithWindows

Nutch 0.8 java 1.4/1.5

2006-07-17 Thread Håvard W. Kongsgård
I am trying to get nutch/hadoop to run on 3 servers with SUSE Linux. I have followed the Nutch Hadoop Tutorial and everything works fine (I can run bin/hadoop dfs –ls), but when I run “bin/nutch inject crawldb urls” I get this error. Exception in thread main

Generate linkDb | hadoop/nutch 0.8

2006-07-20 Thread Håvard W. Kongsgård
When I run “bin/nutch invertlinks linkdb segments” I get this error Exception in thread main java.io.IOException: Input directory /user/nutch/segments/parse_data in linux3:9000 is invalid. at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274) at

Indexing segment | nutch 0.8/hadoop

2006-07-20 Thread Håvard W. Kongsgård
When I try to index my second segment “bin/nutch index issep crawldb linkdb segments/x” I get this error Exception in thread main java.io.IOException: Output directory /user/nutch/issep already exists. at

Re: Generate linkDb | hadoop/nutch 0.8

2006-07-20 Thread Håvard W. Kongsgård
Sami Siren wrote: try “bin/nutch invertlinks linkdb -dir segments” -- Sami Siren Håvard W. Kongsgård wrote: When I run “bin/nutch invertlinks linkdb segments” I get this error Exception in thread main java.io.IOException: Input directory /user/nutch/segments/parse_data in linux3:9000

Re: Best performance approach for single MP machine?

2006-07-20 Thread Håvard W. Kongsgård
http://www.mail-archive.com/nutch-user@lucene.apache.org/msg02394.html Teruhiko Kurosaka wrote: Can I use MapReduce to run Nutch on a multi CPU system? Yes. I want to run the index job on two (or four) CPUs on a single system. I'm not trying to distribute the job over

How to search for multiple site:

2006-09-27 Thread Håvard W. Kongsgård
In Google the user can search in more than one specific site using OR: admission site:www.stanford.edu OR site:cmu.edu OR site:mit.edu OR site:berkeley.edu Is this possible in the nutch web gui?

Re: Problem in Distributed crawling using nutch 0.8

2006-09-28 Thread Håvard W. Kongsgård
Does /user/root/url exist? Have you uploaded the url folder to your dfs system? bin/hadoop dfs -mkdir urls bin/hadoop dfs -copyFromLocal urls.txt urls/urls.txt or bin/hadoop dfs -put localsrc dst Mohan Lal wrote: Hi all, While I am trying to crawl using distributed machines it throws an error

Indexing in nutch 0.8 / hadoop

2006-09-28 Thread Håvard W. Kongsgård
What is the best way to create a master index on a nutch 0.8 / hadoop system? Is it to merge all of the segments together and then create an index? Or, as Roberto Navoni does in his tutorial, first index all the segments separately and then merge the indexes into one master index?
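Both options the poster weighs can be sketched with the 0.8-era bin/nutch tools. A hedged sketch only (directory names are illustrative; verify the exact argument order against `bin/nutch` with no arguments on your build):

```shell
# Option A (index per segment, then merge the indexes):
bin/nutch index indexes crawldb linkdb segments/*
bin/nutch merge master-index indexes

# Option B (merge segments first, then build one index):
bin/nutch mergesegs merged-segments -dir segments
bin/nutch index index crawldb linkdb merged-segments/*
```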

Tomcat 5 / Nutch web gui timeout blank page

2006-09-28 Thread Håvard W. Kongsgård
I have a problem with my Nutch web gui sometimes returning empty pages when I do a search. In Nutch 0.7 this was fixed by giving ipc.client.timeout a higher value in my webapp/ROOT/WEB-INF/classes/hadoop-site.xml, but this has no effect in nutch 0.8.1; the nutch web gui still times out after

Re: Problem in Distributed crawling using nutch 0.8

2006-09-29 Thread Håvard W. Kongsgård
/crawl.1/segments/20060929120235 Indexer: done Dedup: starting Dedup: adding indexes in: crawl.1/indexes Dedup: done Adding /user/root/crawl.1/indexes/part-0 Adding /user/root/crawl.1/indexes/part-1 crawl finished: crawl.1 Thanks and Regards Mohanlal "Håvard W. Kongsgård"-2 wrote