Re: Multiple indexes on a single server instance.
Yes, you nailed it. I am not sure if it is doable; I am still trying to figure that out. My problem is that I capture the same or similar data from all sites, so I should be able to apply those extra points.

Stefan Neufeind [EMAIL PROTECTED] wrote:

sudhendra seshachala wrote: I am experiencing a similar problem. What I have done is as follows. I have a different parse plugin for each site (I have 3 sites to crawl and fetch data), but I capture the data into the same format, which I call a data repository. I have one index plugin which indexes the data repository and one query plugin on the data repository, so I don't have to run multiple instances; I just run one instance of the search engine. However, the parse configuration is different for each site, so I run a different crawler for each site, then index and merge all of them. So far the results are good, if not "wow". I still have to figure out a way of ranking the pages. For example, I would like to be able to apply ranking on the data repository. Let me know if I was clear...

Hi, not sure if I got you right with your last point, but it just came to my mind: it would be nice to be able to have something like "if it's from indexA, give it 100 extra points; if from indexB, give it 50 extra points", or "if from indexA, give it 20% extra weight" or so. But I don't believe this is easily doable. Or is it? I have a similar problem with languages: give priority to documents in German and English, but somewhere after those results also list documents in other languages. So I'd need to be able to give extra points on a per-language basis, based on the indexed language field, right? Regards, Stefan

Stefan Groschupf wrote: I'm not sure what you are planning to do, but you can just switch a symbolic link on your disk, driven by a cron job, to switch between indexes at a given time. You may need to touch the web.xml to restart the searcher.
If you try to search different kinds of indexes at the same time, I suggest merging the indexes and having a kind of key field for each of them. For example, add a field named indexName to each of your indexes and put A, B or C as its value. Then you can merge your indexes. At runtime you just need a query filter that appends indexName:A or indexName:B to the query string. Does this somehow help to solve your problem? Stefan

On 23.05.2006 at 15:26, TJ Roberts wrote: I have five different indexes, each with its own special configuration. I would like to be able to switch between the different indexes dynamically on a single instance of Nutch running on Jakarta Tomcat. Is this possible, or do I have to run five instances of Nutch, one for each index?

Sudhi Seshachala http://sudhilogs.blogspot.com/
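Stefan's index-key suggestion above can be sketched as plain string manipulation. Note this is a hypothetical standalone helper to illustrate the idea, not the actual Nutch QueryFilter API (which operates on Query objects rather than strings):

```java
// Minimal sketch of the index-key idea: restrict a merged index to one
// source by appending an indexName clause to the user's query string.
// Hypothetical helper, not the real Nutch QueryFilter API.
public class IndexNameFilter {

    // e.g. expand("apache lucene", "A") -> "apache lucene indexName:A"
    public static String expand(String userQuery, String indexName) {
        return userQuery + " indexName:" + indexName;
    }

    public static void main(String[] args) {
        System.out.println(expand("apache lucene", "A"));
    }
}
```

In a real deployment the indexName field would be added by an indexing plugin at index time, and the expansion would happen inside a query plugin so users never see the extra clause.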
Re: Multiple indexes on a single server instance.
Yes, that is what I am trying, but for some reason it is not working. Do these fields have to be lower case only?

Andrzej Bialecki [EMAIL PROTECTED] wrote:

Stefan Neufeind wrote: sudhendra seshachala wrote: I am experiencing a similar problem. What I have done is as follows. I have a different parse plugin for each site (I have 3 sites to crawl and fetch data), but I capture the data into the same format, which I call a data repository. I have one index plugin which indexes the data repository and one query plugin on the data repository, so I don't have to run multiple instances; I just run one instance of the search engine. However, the parse configuration is different for each site, so I run a different crawler for each site, then index and merge all of them. So far the results are good, if not "wow". I still have to figure out a way of ranking the pages. For example, I would like to be able to apply ranking on the data repository. Let me know if I was clear...

Hi, not sure if I got you right with your last point, but it just came to my mind: it would be nice to be able to have something like "if it's from indexA, give it 100 extra points; if from indexB, give it 50 extra points", or "if from indexA, give it 20% extra weight" or so. But I don't believe this is easily doable. Or is it? I have a similar problem with languages: give priority to documents in German and English, but somewhere after those results also list documents in other languages. So I'd need to be able to give extra points on a per-language basis, based on the indexed language field, right?

This is not only doable, but fairly easy: just add these fields to the index through a custom IndexingFilter plugin, and then implement a corresponding QueryPlugin that will expand your query appropriately. The prioritization you describe is equivalent to adding a non-required, non-prohibited clause to a Lucene query. Please see how it's done in the existing index-more/query-more and index-basic/query-basic plugins.
-- Best regards, Andrzej Bialecki
Information Retrieval, Semantic Web, Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com

Sudhi Seshachala http://sudhilogs.blogspot.com/
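Andrzej's "non-required, non-prohibited clause" corresponds to a SHOULD clause in a Lucene BooleanQuery: matching documents score higher, but non-matching ones are not excluded. As a rough string-level sketch of the per-language boost Stefan asked about (a hypothetical helper using Lucene query-parser boost syntax; the real Nutch query plugins build Query objects instead):

```java
import java.util.List;

// Rough sketch of a per-language boost: documents in the preferred
// languages score higher, but nothing is excluded from the results.
// Hypothetical helper using Lucene query-parser syntax as plain strings.
public class LanguageBoost {

    // Appends optional (SHOULD) clauses, e.g. " lang:de^2.0 lang:en^2.0"
    public static String boost(String userQuery, List<String> langs, double weight) {
        StringBuilder sb = new StringBuilder(userQuery);
        for (String lang : langs) {
            sb.append(" lang:").append(lang).append('^').append(weight);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(boost("nutch plugin", List.of("de", "en"), 2.0));
    }
}
```

Because the extra clauses are optional, documents in other languages still appear, just further down the ranking, which is exactly the behavior Stefan described.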
Re: changing ranking
If someone has to adopt the plugin, it has to go with a new crawl. Will there be a way to apply these scoring mechanisms to existing, already fetched, indexed and merged pages too? Can you please shed some light? Thanks

Andrzej Bialecki [EMAIL PROTECTED] wrote:

Ken Krugler wrote: Eugen Kochuev wrote: Hello Andrzej, "Please see the scoring API - you can write a plugin that manipulates page scores according to your own idea." Thanks a lot for your answer, but could you please shed some more light on the scoring technique used in Nutch? As I can see from the source code, Nutch uses something similar to the PageRank algorithm, propagating page scores through outlinks, but only one iteration is used (while PageRank requires several iterations to converge).

That's a bit of a complicated subject - I could either explain this in very general terms, or suggest that you read the paper that underlies the current Nutch implementation (with a twist). Please see the comment in OPICScoringFilter.java for the link to the paper.

I've started writing up a description of the changes that I think need to be made to Nutch to really implement the OPIC algorithm, as described by the "Adaptive On-Line Page Importance Computation" paper (ACM 1-58113-680-3/03/0005). Should I just open a JIRA issue, and dump what might be a pretty long write-up into it?

Yes, please do - I'd love to implement this in that original form, even if it would go into another plugin ...

-- Best regards, Andrzej Bialecki
Information Retrieval, Semantic Web, Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
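For readers who don't want to dig up the paper right away, the core OPIC step from the Abiteboul et al. paper referenced above is simple: each page holds some "cash"; when the page is fetched, its cash is banked into its history and split evenly among its outlinks, and accumulated history estimates importance. A toy sketch of that one step (illustrative only, not Nutch's actual OPICScoringFilter):

```java
import java.util.*;

// Toy sketch of one OPIC step (Abiteboul et al.): when a page is fetched,
// its "cash" is added to its history and split evenly among its outlinks.
// Importance estimates are proportional to accumulated history.
// Illustrative only, not Nutch's OPICScoringFilter.
public class OpicStep {
    public final Map<String, Double> cash = new HashMap<>();
    public final Map<String, Double> history = new HashMap<>();

    public void fetch(String page, List<String> outlinks) {
        double c = cash.getOrDefault(page, 0.0);
        history.merge(page, c, Double::sum);          // bank the cash
        cash.put(page, 0.0);
        if (!outlinks.isEmpty()) {
            double share = c / outlinks.size();       // split among outlinks
            for (String out : outlinks) cash.merge(out, share, Double::sum);
        }
    }
}
```

This also hints at why rescoring already-fetched pages is awkward: the cash distribution happens as pages are fetched, so applying a new scoring plugin to an existing crawl means replaying or recomputing those updates over the stored link graph.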
Re: [Nutch-general] Re: Extending Nutch talk, May 11th, Palo Alto, CA
Someone needs to send me the timings and how long the conference will run. Maybe I can just get 10 numbers. The conference has a service whereby the session can be recorded and downloaded. Please suggest ASAP, or better still (in the US) call me at 281 516 2495 or (408 203 9960) so that we can finalize. I would really like to be in, and I hope I have decent support too :) Thanks Sudhi

TDLN [EMAIL PROTECTED] wrote: +1 I would be interested as well. Rgrds, Thomas Delnoij

On 5/10/06, [EMAIL PROTECTED] wrote: +1 to this! I won't be in San Francisco on the 11th, but would be interested in seeing/listening either in real time or to a recorded version. Thanks, Otis

- Original Message - From: sudhendra seshachala To: nutch-user@lucene.apache.org Sent: Tuesday, May 9, 2006 8:18:36 PM Subject: [Nutch-general] Re: Extending Nutch talk, May 11th, Palo Alto, CA

Is there a way to pod/video cast this, or at least a conference call (just listening mode)? I have a personal account; maybe I can sponsor the listening-mode conference. Please let me know if I can be of any assistance. It will really help folks who are outside the Bay Area. There is (hi-tech) life outside the Bay Area too, in the US. Thanks

Stefan Groschupf wrote: Hi Nutch Users, Doug already mentioned it on the developers list (thanks!), but for those of you that do not subscribe to the developer list: the next CommerceNet Thursday Tech Talk will be about Extending Nutch. I'll present a few slides about the plugin system and metadata 'flow' in Nutch. http://events.commerce.net/?p=58 I would be glad to hear about experience, needs and thoughts on this topic from Nutch users around the Bay Area. :) Cheers, Stefan

Sudhi Seshachala http://sudhilogs.blogspot.com/
Re: Extending Nutch talk, May 11th, Palo Alto, CA
Is there a way to pod/video cast this, or at least a conference call (just listening mode)? I have a personal account; maybe I can sponsor the listening-mode conference. Please let me know if I can be of any assistance. It will really help folks who are outside the Bay Area. There is (hi-tech) life outside the Bay Area too, in the US. Thanks

Stefan Groschupf [EMAIL PROTECTED] wrote: Hi Nutch Users, Doug already mentioned it on the developers list (thanks!), but for those of you that do not subscribe to the developer list: the next CommerceNet Thursday Tech Talk will be about Extending Nutch. I'll present a few slides about the plugin system and metadata 'flow' in Nutch. http://events.commerce.net/?p=58 I would be glad to hear about experience, needs and thoughts on this topic from Nutch users around the Bay Area. :) Cheers, Stefan

Sudhi Seshachala http://sudhilogs.blogspot.com/
Nutch ADMIN -GUI Mirror
I have hosted the bundle at the following URL. http://68.178.249.66/nutch-admin/nutch-0.8-dev_guiBundle_05_02_06.tar.gz I hope it helps. Thanks Sudhi

Sudhi Seshachala http://sudhilogs.blogspot.com/
Re: GUI
It just got completed a few days back. You could beta test it by downloading from http://68.178.249.66/nutch-admin/nutch-0.8-dev_guiBundle_05_02_06.tar.gz It is still in the early stages, so I would not rank it as stable. Thanks

Markus Franz [EMAIL PROTECTED] wrote: Hello! Are there any powerful and stable (or almost stable) administration GUIs for Nutch? Did you test them? Regards, Markus -- Danziger Weg 2, 97350 Mainbernheim, Germany -- +491626077635 [EMAIL PROTECTED]

Sudhi Seshachala http://sudhilogs.blogspot.com/
Re: Admin Gui beta test (was Re: ATB: Heritrix)
Hi Stefan, I would be willing to host the app. I have a virtual dedicated server from GoDaddy with Fedora Core 2, the Apache web server and Tomcat running. The IP address is http://68.178.249.66 Right now the web server serves a default page (hosted by GoDaddy), but I can make sure the admin GUI is running. I might need some help, but it should not be a problem at all. Thanks Sudhi

Stefan Groschupf [EMAIL PROTECTED] wrote: Hi there, since building the GUI is somewhat complicated, I was thinking about providing a ready-to-use binary. This might help to get some more beta testers, which we are currently looking for. Any thoughts? However, I'm afraid that this would hit my server too hard and I would have to pay for the traffic. :-/ Does anyone have an idea where we can mirror this file for free? Any volunteer is very welcome. Thanks. Stefan

On 28.04.2006 at 15:14, Aled Jones wrote: Thanks for your replies guys. I hadn't realised that the admin GUI was already in development. We should be able to cope till it gets released ;-) Thanks again Aled

-Original Message- From: Dan Morrill [mailto:[EMAIL PROTECTED]] Sent: 28 April 2006 14:07 To: nutch-user@lucene.apache.org Subject: RE: Heritrix

Aled, I used Heritrix before going over to Nutch. While it is an excellent program with lots of good things to offer, it didn't quite meet my needs, and when designing the architecture it had too many dependencies for me to be comfortable with. If you want to run an internet archive, though, Heritrix cannot be beat; if you want to run a search engine, Nutch is a good choice. My personal opinion. r/d

-Original Message- From: Aled Jones [mailto:[EMAIL PROTECTED]] Sent: Friday, April 28, 2006 1:59 AM To: nutch-user@lucene.apache.org Subject: Heritrix

Hi, has anyone used Heritrix (http://crawler.archive.org/) as a crawler? How does it compare with the Nutch crawler? Can Nutch serve its crawled results?
Main reason I'm interested is that it has a web UI that might make maintenance for the IT guys easier, although I know that some of you are working on an interface. Cheers Aled

### This message has been scanned by F-Secure Anti-Virus for Microsoft Exchange. For more information, connect to http://www.f-secure.com/ ** This e-mail and any attachments are strictly confidential and intended solely for the addressee. They may contain information which is covered by legal, professional or other privilege. If you are not the intended addressee, you must not copy the e-mail or the attachments, or use them for any purpose or disclose their contents to any other person. To do so may be unlawful. If you have received this transmission in error, please notify us as soon as possible and delete the message and attachments from all places in your computer where they are stored. Although we have scanned this e-mail and any attachments for viruses, it is your responsibility to ensure that they are actually virus free. **

- blog: http://www.find23.org company: http://www.media-style.com

Sudhi Seshachala http://sudhilogs.blogspot.com/
Re: Beagle and Nutch
For searches it still uses Lucene (dotLucene). How could it be much different, except that Beagle uses C# rather than Java as Nutch does? I would be very interested in how it performs, though, and how easy it is to set it all up. Thanks Sudhi

Andrew Libby [EMAIL PROTECTED] wrote: Has anyone attempted to accomplish the same things with Nutch that are being accomplished by the Beagle project (http://beaglewiki.org/Main_Page)? I'm very interested in working with something like Beagle; however, I'm using Nutch for other things I'm doing, and am looking for any excuse to get deeper into it for learning purposes. Thanks. Andy -- Andrew Libby [EMAIL PROTECTED] http://philadelphiariders.com/

Sudhi Seshachala http://sudhilogs.blogspot.com/
Re: Where to put the nutch-site.xml ?
There are two ways: bundle it in the jar file itself, which is in the WEB-INF/lib folder, or add it to the conf folder under MODULE_NAME/WEB-INF/conf. A Tomcat restart is required if you modify the conf folder or the jar in the lib folder. Hope this helps.

ahmed ghouzia [EMAIL PROTECTED] wrote: Dear nutchers, I completed a successful crawling process with nutch 0.7.1, and I am trying to make the search work.
1. I have my {segments, db and index} resulting from the last crawl located at /home/ahmed/Desktop/Downloads/nutch/bin/crawl.testagain and Tomcat is located at /home/ahmed/Desktop/Downloads/tomcat/
2. I have edited nutch-site.xml so that searcher.dir refers to /home/ahmed/Desktop/Downloads/nutch/bin/crawl.testagain
3. Then I put a copy of nutch-site.xml at ~/tomcat/webapps/nutch-0.7.1/WEB-INF/classes
4. Then I restarted Tomcat and tried to browse http://localhost:8080/ but it gave the following error: HTTP Status 500 - No Context configured to process this request
5. I think that the problem is the location to put nutch-site.xml in. Where exactly can I put it?

Sudhi Seshachala http://sudhilogs.blogspot.com/
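For reference, the searcher.dir override described in this thread would look roughly like the following in nutch-site.xml. This is a sketch: the path is the poster's own crawl directory, and the `<configuration>` root element is the 0.8-style format (0.7.x releases used a differently named root element, so check your nutch-default.xml for the exact shape):

```xml
<?xml version="1.0"?>
<!-- nutch-site.xml: overrides properties from nutch-default.xml.
     The path below is the crawl directory from this thread; adjust it. -->
<configuration>
  <property>
    <name>searcher.dir</name>
    <value>/home/ahmed/Desktop/Downloads/nutch/bin/crawl.testagain</value>
  </property>
</configuration>
```

Whichever location you choose (WEB-INF/classes or inside the jar), the file must be on the webapp's classpath before Tomcat starts, which is why a restart is needed after changing it.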
Re: Adding Level to Website Parse Data
Dennis, I am in the same dilemma as you are. Here are my thoughts:
1. I am planning to write a plugin to do it, where the plugin can be modified based on the site map and levels.
2. The Fetcher itself can be modified, but then merging the code with the latest contributions, fixes and enhancements from the community will be very hard.
3. Another way is to write a prefetcher which fetches all the URLs from a site and populates the file; the Nutch crawler can then be triggered to crawl the prefetched URLs. Within the prefetched URL pages, any unnecessary URLs not to be crawled will have to be ignored.
I am still looking for a way to do this. Please share your thoughts. Thanks

Dennis Kubes [EMAIL PROTECTED] wrote: I am trying to modify Nutch to add a level to the website parse data. What I mean by this is: suppose you start parsing a website at its homepage; that would be level one. Any links in the same site from the homepage would be level two, links from those pages would be level three, and so on. I am only counting links in the same site. How would I go about modifying Nutch to handle this? I was thinking that I would have to modify the Fetcher, adding the level to the parse metadata. What I am not getting is how I would get the link level initially. I was thinking I would have to modify something in the Generator but didn't know what. Dennis

Sudhi Seshachala http://sudhilogs.blogspot.com/
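The "level" Dennis describes is just breadth-first-search depth from the seed page, counting only same-site links. A standalone sketch of the bookkeeping (hypothetical code outside Nutch; in Nutch itself the level would ride along in the parse metadata as the threads above discuss):

```java
import java.util.*;

// Standalone sketch of per-page link depth: the seed is level 1, pages it
// links to are level 2, and so on, counting only links within the same site.
// Hypothetical helper; in Nutch the level would be stored in parse metadata.
public class LinkLevels {

    // graph: page -> outlinks (all assumed to be same-site here)
    public static Map<String, Integer> levels(Map<String, List<String>> graph, String seed) {
        Map<String, Integer> level = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        level.put(seed, 1);
        queue.add(seed);
        while (!queue.isEmpty()) {
            String page = queue.remove();
            for (String out : graph.getOrDefault(page, List.of())) {
                if (!level.containsKey(out)) {        // first (shortest) path wins
                    level.put(out, level.get(page) + 1);
                    queue.add(out);
                }
            }
        }
        return level;
    }
}
```

One design point this makes visible: a page reachable by several paths gets the depth of the shortest one, so whatever component assigns levels (Fetcher, Generator, or a prefetcher) has to decide whether a later, shallower discovery should overwrite an earlier, deeper one.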
Re: Saving Metadata to Mysql
Sorry to jump in. We have a doc id associated with each document when we index. We could store that doc id in a MySQL table and use it to query the Nutch database. When parsing, capture the things needed as part of the metadata, index the metadata, and store the associated docId in MySQL. Does that give any ideas? Please do share your concerns; I am working on similar stuff where eventually we have to adopt a database. Thanks

John Reidy [EMAIL PROTECTED] wrote: I am looking at something similar. I would guess the place to put it is the indexer. As I understand it, the parser runs for just about everything fetched; however, the indexer is only run for pages you want to index. I am also looking at having static objects (e.g. a connection) that are initialised when the plugin is loaded, ideally through the startup method. Regards John

Hey all, I have written a custom HTML parser and indexer. I would like to save some information that I have gathered during the parse in a MySQL DB. I imagine there could be some performance hit here (e.g. connecting to the DB). What's the best place to add code to save this information - the parser or the indexer? -Mike -- View this message in context: http://www.nabble.com/Saving-Metadata-to-Mysql-t1389216.html#a3732992 Sent from the Nutch - User forum at Nabble.com.

Sudhi Seshachala http://sudhilogs.blogspot.com/
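On Mike's performance worry (connecting to the DB from the indexer), one common mitigation is to buffer (docId, metadata) pairs and flush them in batches rather than touching the database per document. A stdlib-only sketch with a hypothetical flush callback standing in for the real JDBC batch insert:

```java
import java.util.*;
import java.util.function.Consumer;

// Sketch of batched writes from an indexing plugin: rows accumulate in
// memory and are handed to a flush callback (in real use, a JDBC batch
// insert over a shared connection) once the batch is full.
// All names here are hypothetical.
public class MetadataBatcher {
    private final int batchSize;
    private final Consumer<List<String[]>> flusher;   // e.g. a JDBC executeBatch
    private final List<String[]> buffer = new ArrayList<>();

    public MetadataBatcher(int batchSize, Consumer<List<String[]>> flusher) {
        this.batchSize = batchSize;
        this.flusher = flusher;
    }

    public void add(String docId, String metadata) {
        buffer.add(new String[] { docId, metadata });
        if (buffer.size() >= batchSize) flush();
    }

    public void flush() {                             // call once more at plugin shutdown
        if (!buffer.isEmpty()) {
            flusher.accept(new ArrayList<>(buffer));
            buffer.clear();
        }
    }
}
```

This pairs naturally with John's static-connection idea: one connection opened at plugin load, batched writes during indexing, and a final flush when indexing completes.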
RE: Nutch 500 Error
Check nutch-default.xml; there should be a property searcher.dir. Provide the path to the index folder. Better still, copy the property node, paste it into nutch-site.xml, and provide the path to the index folder there. For example, if the crawl is stored as:
home/nutch/crawl
  crawldb
  segments
  index
  indexes
point searcher.dir to home/nutch/crawl. Hope this helps. Thanks Sudhi

Paul Stewart [EMAIL PROTECTED] wrote: Thanks - I was doing the java command wrong... Back to my original problem - I re-ran through the entire tutorial to ensure I was doing it right, and it seems proper. How do I tell Nutch where to look specifically in the code for the segments and indexes in case it is in the wrong place? All the best, Paul

-Original Message- From: sudhendra seshachala [mailto:[EMAIL PROTECTED]] Sent: Thursday, April 06, 2006 12:02 PM To: nutch-user@lucene.apache.org Subject: RE: Nutch 500 Error

It should be java -version, I think.

Paul Stewart wrote: Thanks for the reply... I apologize as I'm very new to the Java world. :) I am running the following: Fedora Core 4, Apache Tomcat 5.5.16 (binary download from the Tomcat site, installed to /usr/local/tomcat5), jre1.5.0_06 (binary download from the Sun site to /usr/java/jre1.5.0_06). Weird though - when I try to do a java -v I get this now:
[EMAIL PROTECTED] jre1.5.0_06]# export JAVA_HOME=/usr/java/jre1.5.0_06/
[EMAIL PROTECTED] jre1.5.0_06]# /usr/java/jre1.5.0_06/bin/java -v
Unrecognized option: -v
Could not create the Java virtual machine.
Is this my actual problem possibly? Or is this the wrong Java version to be running? When I downloaded 1.4.x, Tomcat told me it didn't support anything but 1.5.x. Thanks again for your patience... Paul

-Original Message- From: TDLN [mailto:[EMAIL PROTECTED]] Sent: Thursday, April 06, 2006 7:16 AM To: nutch-user@lucene.apache.org Subject: Re: Nutch 500 Error

What version are you on?
If you trace the NullPointerException back to the code, the NutchBean.init method is where it expects to find the index and segments, so either they're missing (did you follow the tutorial and merge your segment indexes?) or it is looking in the wrong place. That's what I think. Rgrds, Thomas

On 4/6/06, Paul Stewart wrote: Thanks. Tried that... Same error:

HTTP Status 500 - type Exception report
description The server encountered an internal error () that prevented it from fulfilling this request.
exception org.apache.jasper.JasperException
org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:510)
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:393)
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:314)
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:264)
javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
root cause java.lang.NullPointerException
org.apache.nutch.searcher.NutchBean.init(NutchBean.java:96)
org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:82)
org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:72)
org.apache.nutch.searcher.NutchBean.get(NutchBean.java:64)
org.apache.jsp.search_jsp._jspService(search_jsp.java:112)
org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:97)
javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:332)
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:314)
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:264)
javax.servlet.http.HttpServlet.service(HttpServlet.java:802)

-Original Message- From: TDLN [mailto:[EMAIL PROTECTED]] Sent: Thursday, April 06, 2006 3:30 AM To: nutch-user@lucene.apache.org Subject: Re: Nutch 500 Error

My guess is you have to override the searcher.dir property in nutch-site.xml and have it point to your crawl dir.
Rgrds, Thomas

Sudhi Seshachala http://sudhilogs.blogspot.com/
RE: Nutch 500 Error
It should be java -version, I think.

Paul Stewart [EMAIL PROTECTED] wrote: Thanks for the reply... I apologize as I'm very new to the Java world. :) I am running the following: Fedora Core 4, Apache Tomcat 5.5.16 (binary download from the Tomcat site, installed to /usr/local/tomcat5), jre1.5.0_06 (binary download from the Sun site to /usr/java/jre1.5.0_06). Weird though - when I try to do a java -v I get this now:
[EMAIL PROTECTED] jre1.5.0_06]# export JAVA_HOME=/usr/java/jre1.5.0_06/
[EMAIL PROTECTED] jre1.5.0_06]# /usr/java/jre1.5.0_06/bin/java -v
Unrecognized option: -v
Could not create the Java virtual machine.
Is this my actual problem possibly? Or is this the wrong Java version to be running? When I downloaded 1.4.x, Tomcat told me it didn't support anything but 1.5.x. Thanks again for your patience... Paul

-Original Message- From: TDLN [mailto:[EMAIL PROTECTED]] Sent: Thursday, April 06, 2006 7:16 AM To: nutch-user@lucene.apache.org Subject: Re: Nutch 500 Error

What version are you on? If you trace the NullPointerException back to the code, the NutchBean.init method is where it expects to find the index and segments, so either they're missing (did you follow the tutorial and merge your segment indexes?) or it is looking in the wrong place. That's what I think. Rgrds, Thomas

On 4/6/06, Paul Stewart wrote: Thanks. Tried that... Same error:

HTTP Status 500 - type Exception report
description The server encountered an internal error () that prevented it from fulfilling this request.
exception org.apache.jasper.JasperException
org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:510)
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:393)
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:314)
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:264)
javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
root cause java.lang.NullPointerException
org.apache.nutch.searcher.NutchBean.init(NutchBean.java:96)
org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:82)
org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:72)
org.apache.nutch.searcher.NutchBean.get(NutchBean.java:64)
org.apache.jsp.search_jsp._jspService(search_jsp.java:112)
org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:97)
javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:332)
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:314)
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:264)
javax.servlet.http.HttpServlet.service(HttpServlet.java:802)

-Original Message- From: TDLN [mailto:[EMAIL PROTECTED]] Sent: Thursday, April 06, 2006 3:30 AM To: nutch-user@lucene.apache.org Subject: Re: Nutch 500 Error

My guess is you have to override the searcher.dir property in nutch-site.xml and have it point to your crawl dir. Rgrds, Thomas

Sudhi Seshachala http://sudhilogs.blogspot.com/
Re: Crawling the local file system with Nutch - Document-
I just modified search.jsp. Basically, I set the content type based on the document type I was querying; the rest is handled by the protocol and the browser. I can send the code if you would like. Thanks

kauu [EMAIL PROTECTED] wrote: Thanks for your idea! But I have a question: how do I modify search.jsp and the cached servlet to view Word and PDF documents as demanded by the user, seamlessly?

On 4/1/06, Vertical Search wrote: Nutchians, I have tried to document the sequence of steps to adopt Nutch to crawl and search the local file system on a Windows machine. I have been able to do it successfully using nutch 0.8-dev. The configuration is as follows: Inspiron 630m, Intel Pentium M 760 (2GHz/2MB cache/533MHz), Genuine Windows XP Professional. If someone can review it, it would be very helpful.

Crawling the local filesystem with Nutch. Platform: Microsoft / nutch 0.8-dev. For a Linux version, please refer to http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch - that link did help me get it off the ground. I have been working on adopting Nutch in a vertical domain. All of a sudden, I was asked to develop a proof of concept to adopt Nutch to crawl and search the local file system. Initially I did face some problems, but some mail archives helped me proceed further. The intention is to provide an overview of the steps to crawl local file systems and search through the browser. I downloaded the Nutch nightly, then:
1. Create an environment variable such as NUTCH_HOME (not mandatory, but it helps).
2. Extract the downloaded nightly build.
3. Create a folder, e.g. c:/LocalSearch, and copy the following folders and libraries into it: bin/, conf/, the *.job, *.jar and *.war files, urls/, and the plugins folder.
4. Modify nutch-site.xml to include the plugin folder.
5. Modify nutch-site.xml to include the plugins needed. For example: plugin.includes = protocol-file|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url) and file.content.limit = -1
6.
Modify crawl-urlfilter.txt. Remember we have to crawl the local file system, hence we modify the entries as follows:
# skip http:, ftp: and mailto: urls
##-^(file|ftp|mailto):
-^(http|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]
# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
# accept anything else
+.*
7. The urls folder: create a file listing all the URLs to be crawled and save it under the urls folder. The directories should be in file:// format. Example entries:
file://c:/resumes/word
file://c:/resumes/pdf
#file:///data/readings/semanticweb/
Nutch recognises that the third line does not contain a valid file URL and skips it, as suggested by the link above.
8. Ignoring the parent directories. As suggested in the Linux flavor of the local-fs crawl, I modified the code in org.apache.nutch.protocol.file.FileResponse.getDirAsHttpResponse(java.io.File f). I changed the following line:
this.content = list2html(f.listFiles(), path, "/".equals(path) ? false : true);
to
this.content = list2html(f.listFiles(), path, false);
and recompiled.
9. Compile the changes. I just compiled the whole source code base; it did not take more than 2 minutes.
10. Crawling the file system: on my desktop I have a shortcut to cygdrive; cd ../../cygdrive/c/$NUTCH_HOME and execute bin/nutch crawl urls -dir c:/localfs/database. Voila, that is it. After 20 minutes, the files were indexed, merged and all done.
11. Extract the nutch-0.8-dev.war file to /webapps/ROOT, open nutch-site.xml and add the searcher.dir property pointing to c:/localfs/database, the path to the root of the crawl.
This directory is searched (in order) for either the file search-servers.txt (containing a list of distributed search servers), the directory index (containing merged indexes), or the directory segments (containing segment indexes).
12. Searching locally was a bit slow, so I changed the hosts.ini file to map the machine name to localhost. That sped up search considerably.
13. Modified search.jsp and the cached servlet to view Word and PDF as demanded by the user, seamlessly.
I hope this helps folks who are trying to adopt Nutch for the local file system. Personally, I believe corporations should adopt Nutch rather than buying a Google appliance :) -- www.babatu.com

Sudhi Seshachala http://sudhilogs.blogspot.com/
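Step 5's flattened snippet lost its XML tags in the archive; as proper nutch-site.xml properties it would look roughly like this (a sketch, assuming the 0.8-style `<configuration>` root element used by the nightly build the post describes):

```xml
<?xml version="1.0"?>
<!-- Sketch of step 5 as well-formed nutch-site.xml properties. -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-file|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url)</value>
  </property>
  <property>
    <name>file.content.limit</name>
    <!-- -1 disables truncation so large local documents are fetched whole -->
    <value>-1</value>
  </property>
</configuration>
```

Note that protocol-file replaces the default protocol-http entry, which is what makes the file:// URLs in step 7 fetchable.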
RE: Problems Installing
Rename the file as ROOT.war (all upper case). Then http://localhost:8080 should work.

Paul Stewart [EMAIL PROTECTED] wrote: Thanks for the reply... I re-did what you mentioned below and it re-installed just fine (I'm running Fedora Core 4 and installed with yum using RPMs). Even when I rename it, I must access it now via http://www.myserver..:8080/root or else I get a 404 Not Found... When I try to do a search I get the same error. Any other thoughts? :) Paul

-Original Message- From: Dan Morrill [mailto:[EMAIL PROTECTED] Sent: Sunday, April 02, 2006 2:17 PM To: nutch-user@lucene.apache.org Subject: RE: Problems Installing

Did you: 1. remove the root.war from tomcat? 2. rename nutch.war to root.war and dump that into webapps under tomcat? 3. did it install ok (can you see the exploded pages under webapps/root)? Just checking; this is how I fixed the same issue under windows. r/d

-Original Message- From: Paul Stewart [mailto:[EMAIL PROTECTED] Sent: Sunday, April 02, 2006 11:00 AM To: nutch-user@lucene.apache.org Subject: Problems Installing

Hi there... I am trying to get Nutch running and have done a trial indexing run successfully etc... Now I'm running into issues that may be more Tomcat related than Nutch:

HTTP Status 500 - type Exception report. description: The server encountered an internal error () that prevented it from fulfilling this request.
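The rename steps above can be scripted; a sketch, where /tmp/tomcat-demo stands in for a real CATALINA_HOME and the war name is illustrative:

```shell
# Tomcat only serves a webapp at "/" when it is deployed as ROOT.
CATALINA_HOME=/tmp/tomcat-demo
mkdir -p "$CATALINA_HOME/webapps"
touch "$CATALINA_HOME/webapps/nutch-0.7.1.war"   # stand-in for the Nutch war
# remove the stock ROOT app, then deploy the Nutch war as ROOT.war
rm -rf "$CATALINA_HOME/webapps/ROOT" "$CATALINA_HOME/webapps/ROOT.war"
mv "$CATALINA_HOME/webapps/nutch-0.7.1.war" "$CATALINA_HOME/webapps/ROOT.war"
```

After a Tomcat restart the exploded app should appear under webapps/ROOT and be reachable at http://localhost:8080/.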
exception

org.apache.jasper.JasperException
  org.apache.jasper.servlet.JspServletWrapper.service(javax.servlet.http.HttpServletRequest, javax.servlet.http.HttpServletResponse, boolean) (/usr/lib/libjasper5-compiler-5.0.30.jar.so)
  org.apache.jasper.servlet.JspServlet.serviceJspFile(javax.servlet.http.HttpServletRequest, javax.servlet.http.HttpServletResponse, java.lang.String, java.lang.Throwable, boolean) (/usr/lib/libjasper5-compiler-5.0.30.jar.so)
  org.apache.jasper.servlet.JspServlet.service(javax.servlet.http.HttpServletRequest, javax.servlet.http.HttpServletResponse) (/usr/lib/libjasper5-compiler-5.0.30.jar.so)
  javax.servlet.http.HttpServlet.service(javax.servlet.ServletRequest, javax.servlet.ServletResponse) (/usr/lib/libservletapi5-5.0.30.jar.so)
  org.apache.catalina.valves.ErrorReportValve.invoke(org.apache.catalina.Request, org.apache.catalina.Response, org.apache.catalina.ValveContext) (/usr/lib/libcatalina-5.0.30.jar.so)
  org.apache.coyote.tomcat5.CoyoteAdapter.service(org.apache.coyote.Request, org.apache.coyote.Response) (/usr/lib/libcatalina-5.0.30.jar.so)
  org.apache.coyote.http11.Http11Processor.process(java.io.InputStream, java.io.OutputStream) (/usr/lib/libtomcat-http11-5.0.30.jar.so)
  org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(org.apache.tomcat.util.net.TcpConnection, java.lang.Object[]) (/usr/lib/libtomcat-http11-5.0.30.jar.so)
  org.apache.tomcat.util.net.TcpWorkerThread.runIt(java.lang.Object[]) (/tmp/libtomcat-util-5.0.30.jar.socuf3wu.so)
  org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run() (/tmp/libtomcat-util-5.0.30.jar.socuf3wu.so)
  java.lang.Thread.run() (/usr/lib/libgcj.so.6.0.0)

root cause

java.lang.NullPointerException
  org.apache.nutch.searcher.NutchBean.init(java.io.File, java.io.File) (Unknown Source)
  org.apache.nutch.searcher.NutchBean.NutchBean(java.io.File) (Unknown Source)
  org.apache.nutch.searcher.NutchBean.NutchBean() (Unknown Source)
  org.apache.nutch.searcher.NutchBean.get(javax.servlet.ServletContext) (Unknown Source)
  org.apache.jsp.search_jsp._jspService(javax.servlet.http.HttpServletRequest, javax.servlet.http.HttpServletResponse) (Unknown Source)
  org.apache.jasper.runtime.HttpJspBase.service(javax.servlet.http.HttpServletRequest, javax.servlet.http.HttpServletResponse) (/usr/lib/libjasper5-runtime-5.0.30.jar.so)
  javax.servlet.http.HttpServlet.service(javax.servlet.ServletRequest, javax.servlet.ServletResponse) (/usr/lib/libservletapi5-5.0.30.jar.so)
  org.apache.jasper.servlet.JspServletWrapper.service(javax.servlet.http.HttpServletRequest, javax.servlet.http.HttpServletResponse, boolean) (/usr/lib/libjasper5-compiler-5.0.30.jar.so)
  org.apache.jasper.servlet.JspServlet.serviceJspFile(javax.servlet.http.HttpServletRequest, javax.servlet.http.HttpServletResponse, java.lang.String, java.lang.Throwable, boolean) (/usr/lib/libjasper5-compiler-5.0.30.jar.so)
  org.apache.jasper.servlet.JspServlet.service(javax.servlet.http.HttpServletRequest, javax.servlet.http.HttpServletResponse) (/usr/lib/libjasper5-compiler-5.0.30.jar.so)
  javax.servlet.http.HttpServlet.service(javax.servlet.ServletRequest, javax.servlet.ServletResponse) (/usr/lib/libservletapi5-5.0.30.jar.so)
  org.apache.catalina.valves.ErrorReportValve.invoke(org.apache.catalina.Request, org.apache.catalina.Response, org.apache.catalina.ValveContext) (/usr/lib/libcatalina-5.0.30.jar.so)
Re: nutch config setup to crawl/query for word/pdf files
OOPS, my bad. I was seeing 0.8-dev. Michael Ji [EMAIL PROTECTED] wrote: hi Sudhendra: I didn't see a file with such a name (parse-plugins.xml) in the nutch/conf/ folder; should I create it myself? Is there any tutorial I could follow to set it up? thanks, Michael, --- sudhendra seshachala wrote: Have you checked parse-plugins.xml in conf/? Thanks Sudhi Seshachala http://sudhilogs.blogspot.com/
Re: Removing urls from webdb
I guess the problem is with the package name: the class lives in src/java/org/apache/nutch/tools/PruneDB, but you ran org.apache.nutch.toos.PruneDB — note the missing "l" in "toos". Can you please verify again? It seems to be a typo. Thanks

keren nutch [EMAIL PROTECTED] wrote: Hi Matt, Thanks for the reply. I put PruneDB.java in src/java/org/apache/nutch/tools and ran ant. But when I run 'nutch org.apache.nutch.toos.PruneDB db -s', I get the error: Exception in thread main java.lang.NoClassDefFoundError: org/apache/nutch/tools/PruneDB. Please let me know where I'm wrong. Keren

Matt Kangas wrote: I'm puzzled by the claim that "It takes ~4 hours to remove a url from the webdb." If you're removing them one at a time, yes, because you have to rewrite the entire webdb for any change. But you want to process them in bulk. So it should only take: (time to rewrite webdb) + (time to process 11M urls through the URLFilter chain) = 4 hrs + X. X depends on the complexity of your URLFilter chain. You only need RegexURLFilter with two patterns defined (a minus for the bad site, and a plus for all else). Using my PruneDBTool, as discussed earlier, you can eliminate all of those urls in a single pass over the webdb. http://blog.busytonight.com/2006/03/nutch_07_prunedb_tool.html HTH, --Matt

On Mar 22, 2006, at 12:55 PM, keren nutch wrote: Actually, we have 11,000,000 urls in the webdb. Keren

Insurance Squared Inc. wrote: We've got a website that is causing our crawler to slow down (from 20mbits down to 3-5) - 400K pages that are basically not available, we're just getting 404's. I'd like to remove them from the DB to get our crawl speed back up again. Here's what our developer told me - I'm stumped; that seems really odd. Is there a better way to remove a URL so that it doesn't get crawled? Running nutch 0.71 on a dual xeon with 8 gigs of ram. - There are more than 400,000 urls in the webdb. It takes ~4 hours to remove a url from the webdb.
That means that it'll take ~1,600,000 hours (~66,666 days, or ~ months, ~185 years) to remove 400,000 CAA urls from the webdb. Do you really want to remove them in this way? -- Matt Kangas / [EMAIL PROTECTED]

Sudhi Seshachala http://sudhilogs.blogspot.com/
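Matt's two-pattern RegexURLFilter could be sketched like this in regex-urlfilter.txt (the hostname is illustrative):

```
# drop every URL from the one bad host...
-^http://([a-z0-9]*\.)*badsite\.example\.com/
# ...and keep everything else
+.*
```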
Re: crawling pdf and word file
In nutch-default.xml, include the plugins for Word and PDF as below:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url|jobs)</value>
  <description>Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. In any case you need at least include the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins.</description>
</property>

But the recommendation is to put the property in nutch-site.xml. Hope this helps.

Michael Ji [EMAIL PROTECTED] wrote: hi there, Is there any specific setting that needs to be added in a configuration file in order to crawl and index pdf and word files? thanks, Michael,

Sudhi Seshachala http://sudhilogs.blogspot.com/
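Since nutch-default.xml gets replaced on upgrade, the recommended nutch-site.xml override might look like this (a sketch; the plugin list is illustrative and should match what your site actually needs):

```
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url)</value>
  <description>Adds parse-msword and parse-pdf to the default plugin set.</description>
</property>
```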
Crawling sites with Encoded URLs
Hi, I have been trying to crawl sites with encoded URLs. I am trying to escape the characters in crawl-urlfilter.txt, but for some reason it does not seem to be working... One solution is to extend the crawler... are there any other options? :) Please let me know. Thanks Sudhi

Sudhi Seshachala http://sudhilogs.blogspot.com/
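For what it's worth, the usual gotcha here is that regex metacharacters such as "?" in an encoded URL must be backslash-escaped in crawl-urlfilter.txt; a sketch (the URL is illustrative):

```
# accept a dynamic page whose URL contains regex metacharacters
+^http://www\.example\.com/search\.php\?q=.*
# note: the stock filter also skips URLs containing query characters
# as probable queries; that line must be removed or relaxed first
```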
Multi dimensional searches
I have been using Nutch for learning purposes, to see how it works. So far I have been fairly successful in actually getting it up and running for some sites on my local machine. I sincerely thank the vibrant group helping me and many others. I have some questions or issues, however, that you all might want to consider. The idea is to build a niche search based on some parameters such as location (city, state, zip code and radius). I believe Nutch is a fabulous way to build a location-based search. Again, the location-based search is just one dimension; there are other dimensions as well. I noticed there is a GeoPosition plugin. Has anyone used this plugin in the US? Just wanted to see how I could re-use the framework. Furthermore, has anyone built a two-dimensional search? For instance, someone searching "Hotels" should get all the hotels globally. But someone searching "hotels in San Jose, CA" should get only hotels located in the city of San Jose. Then someone searching "hotels in San Jose 95129" should get only hotels located in that area, and a 5-10 mile radius by default. I can always write a unidimensional search like just "hotels": I crawl the hotels database and get it indexed (with some filtering based on what I do not want included). If I have to also build the other search dimensions, is there a rule book to follow? Has anyone done it before? Any kind of insights or thoughts would be very helpful. Thanks Sudhi

Sudhi Seshachala http://sudhilogs.blogspot.com/
Re: project vitality?
I could not agree with Doug more. This is one of the best. I am trying UIMA too... though UIMA also uses Lucene... as of today it is still a framework and community in its early stages. In fact the nightly builds have good improvements over 0.71. Any serious user or adopter should be trying a snapshot of the nightly build. Doug, it would be better if there were an official 0.8 release, or at least an RC, before a major 1.0 release. I am a newbie, so let me know about ideas on releasing 0.8. Thanks Sudhi

Doug Cutting [EMAIL PROTECTED] wrote: Richard Braman wrote: I think it is still very much at proof of concept stage. I think it is close, but as you have mentioned, the website is severely out of date and the information and documentation on it lacks luster. It stands to reason that if the documentation lacks luster the project must be dead! Seriously, this is an active project. It is not yet 1.0, so don't expect polish. If it doesn't look easily usable to you then perhaps it is not. It's still for early adopters. The commit list shows a fair amount of activity: http://www.mail-archive.com/nutch-commits%40lucene.apache.org/maillist.html Lots of public sites are using Nutch. Some are listed at http://wiki.apache.org/nutch/PublicServers, but many are not, like http://search.bittorrent.com/. I have tried to get the tutorial and faqs updated, but I haven't heard back. This is an all-volunteer project. If you find a bug, please file a bug report, so that other folks are aware of it. Better yet, if you have a solution or improvement, please construct a patch file (even for documentation) and attach it to a bug report. On the wiki, anyone can make themselves an account and update documentation. We don't boss folks around here, or complain. We pitch in and help. Doug

Sudhi Seshachala http://sudhilogs.blogspot.com/
Re: Exception from crawl command
Okay. Have you tried the 0.8 version? It seems to be more stable than the 0.7.x you are using. It is a bit different too, with Hadoop and Nutch being separate. I had a few issues using 0.7.x, but with the nightly build (0.8) I was up to speed comparatively sooner. I hope this helps. I am not trying to dodge the problem, just that the next release is more stable; moreover, there is no backward compatibility for 0.8.x (that is what I read in one of the mail archives). You are better off using 0.8. Thanks Sudhi

[EMAIL PROTECTED] wrote: Hi, sorry for the fumbled reply. I've tried deleting the directory and starting the crawl from scratch a number of times, with very similar results. The system seems to be generating the exception after the fetch block of the output, at an apparently arbitrary depth. It leaves the directory with a db folder containing:

Mar 2 09:30 dbreadlock
Mar 2 09:31 dbwritelock
Mar 2 09:30 webdb
Mar 2 09:31 webdb.new

The webdb.new folder contains:

Mar 2 09:30 pagesByURL
Mar 2 09:30 stats
Mar 2 09:31 tmp

I have the following set in my nutch-site.xml file:

urlnormalizer.class = org.apache.nutch.net.RegexUrlNormalizer (Name of the class used to normalize URLs.)
urlnormalizer.regex.file = regex-normalize.xml (Name of the config file used by the RegexUrlNormalizer class.)
http.content.limit = -1 (The length limit for downloaded content, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all.)
plugin.includes = nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic|query-(basic|site|url) (Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. In any case you need at least include the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins.)

I don't think any of this should cause the problem.
I'm going to try reinstalling and setting everything up again, but if anyone has any idea what the problem might be then please let me know. cheers, Julian.

--- sudhendra seshachala wrote: Delete the folder/database and then re-issue the crawl command. The database/folder gets created when crawl is used. I am a recent user too... but I did get the same message and I corrected it by deleting the folder. If anyone has better ideas, please share. Thanks

[EMAIL PROTECTED] wrote: Hi, I've been experimenting with nutch and lucene; everything was working fine, but now I'm getting an exception thrown from the crawl command. The command manages a few fetch cycles but then I get the following message:

060301 161128 status: segment 20060301161046, 38 pages, 0 errors, 856591 bytes, 41199 ms
060301 161128 status: 0.92235243 pages/s, 162.43396 kb/s, 22541.87 bytes/page
060301 161129 Updating C:\PF\nutch-0.7.1\LIVE\db
060301 161129 Updating for C:\PF\nutch-0.7.1\LIVE\segments\20060301161046
060301 161129 Processing document 0
060301 161130 Finishing update
060301 161130 Processing pagesByURL: Sorted 952 instructions in 0.02 seconds.
060301 161130 Processing pagesByURL: Sorted 47600.0 instructions/second
java.io.IOException: already exists: C:\PF\nutch-0.7.1\LIVE\db\webdb.new\pagesByURL
        at org.apache.nutch.io.MapFile$Writer.(MapFile.java:86)
        at org.apache.nutch.db.WebDBWriter$CloseProcessor.closeDown(WebDBWriter.java:549)
        at org.apache.nutch.db.WebDBWriter.close(WebDBWriter.java:1544)
        at org.apache.nutch.tools.UpdateDatabaseTool.close(UpdateDatabaseTool.java:321)
        at org.apache.nutch.tools.UpdateDatabaseTool.main(UpdateDatabaseTool.java:371)
        at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:141)
Exception in thread main

Does anyone have any ideas what the problem is likely to be? I am running nutch 0.7.1. thanks, Julian.

Sudhi Seshachala http://sudhilogs.blogspot.com/
Sudhi Seshachala http://sudhilogs.blogspot.com/
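The workaround Sudhi describes (delete the half-written db state, then re-issue the crawl) could be sketched as follows; /tmp/nutch-live stands in for the real crawl directory:

```shell
# Clear the stale webdb.new and lock files left by an interrupted update,
# so the next crawl/updatedb run can recreate them cleanly.
CRAWL_DIR=/tmp/nutch-live
mkdir -p "$CRAWL_DIR/db/webdb.new"   # simulate the leftover state
rm -rf "$CRAWL_DIR/db/webdb.new" \
       "$CRAWL_DIR/db/dbreadlock" \
       "$CRAWL_DIR/db/dbwritelock"
```

After this, re-running bin/nutch crawl (or the update step) starts from a consistent webdb.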
Running the crawl.. can any one point me to step by step guide ?
I built the nightly build after creating the folders, but when I run a crawl I get the following errors. I am using Cygwin. I am not able to figure out what input is missing... can anyone help?

$ bin/nutch crawl urls.txt -dir c:/SearchEngine/Database
060228 100707 parsing jar:file:/C:/SearchEngine/nutch-nightly/lib/hadoop-0.1-dev.jar!/hadoop-default.xml
060228 100707 parsing file:/C:/SearchEngine/nutch-nightly/conf/nutch-default.xml
060228 100707 parsing file:/C:/SearchEngine/nutch-nightly/conf/crawl-tool.xml
060228 100707 parsing jar:file:/C:/SearchEngine/nutch-nightly/lib/hadoop-0.1-dev.jar!/mapred-default.xml
060228 100707 parsing file:/C:/SearchEngine/nutch-nightly/conf/nutch-site.xml
060228 100708 parsing file:/C:/SearchEngine/nutch-nightly/conf/hadoop-site.xml
060228 100708 crawl started in: c:\SearchEngine\Database
060228 100708 rootUrlDir = urls.txt
060228 100708 threads = 10
060228 100708 depth = 5
060228 100708 Injector: starting
060228 100708 Injector: crawlDb: c:\SearchEngine\Database\crawldb
060228 100708 Injector: urlDir: urls.txt
060228 100708 Injector: Converting injected urls to crawl db entries.
060228 100708 parsing jar:file:/C:/SearchEngine/nutch-nightly/lib/hadoop-0.1-dev.jar!/hadoop-default.xml
060228 100708 parsing file:/C:/SearchEngine/nutch-nightly/conf/nutch-default.xml
060228 100708 parsing file:/C:/SearchEngine/nutch-nightly/conf/crawl-tool.xml
060228 100708 parsing jar:file:/C:/SearchEngine/nutch-nightly/lib/hadoop-0.1-dev.jar!/mapred-default.xml
060228 100708 parsing jar:file:/C:/SearchEngine/nutch-nightly/lib/hadoop-0.1-dev.jar!/mapred-default.xml
060228 100708 parsing file:/C:/SearchEngine/nutch-nightly/conf/nutch-site.xml
060228 100708 parsing file:/C:/SearchEngine/nutch-nightly/conf/hadoop-site.xml
060228 100708 parsing jar:file:/C:/SearchEngine/nutch-nightly/lib/hadoop-0.1-dev.jar!/hadoop-default.xml
060228 100708 parsing file:/C:/SearchEngine/nutch-nightly/conf/nutch-default.xml
060228 100708 parsing file:/C:/SearchEngine/nutch-nightly/conf/crawl-tool.xml
060228 100708 parsing jar:file:/C:/SearchEngine/nutch-nightly/lib/hadoop-0.1-dev.jar!/mapred-default.xml
060228 100708 parsing jar:file:/C:/SearchEngine/nutch-nightly/lib/hadoop-0.1-dev.jar!/mapred-default.xml
060228 100708 parsing jar:file:/C:/SearchEngine/nutch-nightly/lib/hadoop-0.1-dev.jar!/mapred-default.xml
060228 100708 parsing file:/C:/SearchEngine/nutch-nightly/conf/nutch-site.xml
060228 100708 parsing file:/C:/SearchEngine/nutch-nightly/conf/hadoop-site.xml
060228 100708 Running job: job_ofko1u
060228 100708 parsing jar:file:/C:/SearchEngine/nutch-nightly/lib/hadoop-0.1-dev.jar!/hadoop-default.xml
060228 100708 parsing jar:file:/C:/SearchEngine/nutch-nightly/lib/hadoop-0.1-dev.jar!/mapred-default.xml
060228 100708 parsing c:\SearchEngine\Database\local\localRunner\job_ofko1u.xml
060228 100708 parsing file:/C:/SearchEngine/nutch-nightly/conf/hadoop-site.xml
java.io.IOException: No input directories specified in: Configuration: defaults: hadoop-default.xml, mapred-default.xml, c:\SearchEngine\Database\local\localRunner\job_ofko1u.xml final: hadoop-site.xml
        at org.apache.hadoop.mapred.InputFormatBase.listFiles(InputFormatBase.java:84)
        at org.apache.hadoop.mapred.InputFormatBase.getSplits(InputFormatBase.java:94)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:70)
060228 100709 map 0% reduce 0%
Exception in thread main java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:310)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:114)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:104)

Sudhi Seshachala http://sudhilogs.blogspot.com/
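For what it's worth, in the 0.8 mapred code the injector treats its argument as a directory of seed files, so passing urls.txt as a single file can produce exactly this "No input directories specified" error. A sketch of the usual fix (the seed URL is illustrative):

```shell
# Put the seed list inside a directory, then point the crawl
# at the directory rather than the file.
mkdir -p urls
echo "http://lucene.apache.org/nutch/" > urls/seeds.txt
# then re-run, e.g.:
# bin/nutch crawl urls -dir c:/SearchEngine/Database
```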
Nutch 0.8 -building WAR file
Hi there, I got the nightly build, and if I try to run "ant war" I get the following error:

BUILD FAILED
C:\kool\nutch-nightly\build.xml:94: The following error occurred while executing this line:
C:\kool\nutch-nightly\src\plugin\build.xml:9: The following error occurred while executing this line:
C:\kool\nutch-nightly\src\plugin\clustering-carrot2\build.xml:26: The following error occurred while executing this line:
C:\kool\nutch-nightly\src\plugin\build-plugin.xml:97: srcdir C:\kool\nutch-nightly\src\plugin\nutch-extensionpoints\src\java does not exist!

I guess I am missing something. Can someone point me in the right direction, where I can get the missing pieces?

Sudhi Seshachala http://sudhilogs.blogspot.com/
Whole Web Indexing
Is invertlinks supported or not? I am using nutch 0.7.1 and getting a "no class def found" error. Or should I use a compiled version? Can someone help me here?

Whole-web: Indexing. Before indexing we first invert all of the links, so that we may index incoming anchor text with the pages:

bin/nutch invertlinks crawl/linkdb crawl/segments

To index the segments we use the index command, as follows:

bin/nutch index indexes crawl/linkdb crawl/segments/*

Sudhi Seshachala http://sudhilogs.blogspot.com/
Nutch 0.8 version required..
The latest version I could see in the SVN is 0.7.1. Where can I get 0.8? Source code is even better. Could I just grab it from the nightly builds? Please let me know. Thanks

Sudhi Seshachala http://sudhilogs.blogspot.com/
Re: Nutch 0.8 version required..
Thanks Stefan. But when I compiled, the jar size was just 318 kB for 0.8-dev, whereas the 0.7.1 release was 718 kB. Am I missing something? Sudhi

Stefan Groschupf [EMAIL PROTECTED] wrote: http://cvs.apache.org/dist/lucene/nutch/nightly/ Am 24.02.2006 um 01:44 schrieb sudhendra seshachala: The latest version I could see in the SVN is 0.7.1. Where can I get 0.8? Source code is even better. Could I just grab it from the nightly builds? Please let me know. Thanks Sudhi Seshachala http://sudhilogs.blogspot.com/ - blog: http://www.find23.org company: http://www.media-style.com

Sudhi Seshachala http://sudhilogs.blogspot.com/
Nutch and HTTrack Crawler
Is there a way I could use HTTrack for crawling and Nutch for just searching? Has anybody done this before, and is there a comparison between the crawlers? How easy or tough is it to customize the Nutch crawler for a specific vertical? I know a crude way of writing a crawler, but was wondering if anyone has actually done a custom crawler. Thanks Sudhi

Sudhi Seshachala http://sudhilogs.blogspot.com/