RE: Merging indexes -- please help....
Hi, I encountered the same problem on 0.8. See my post http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg04103.html. Does anyone have any idea? Is it a bug or a configuration issue? Please let me know. Thanks. Olive

From: Dan Morrill [EMAIL PROTECTED]
Subject: RE: Merging indexes -- please help
Date: Mon, 3 Apr 2006 05:18:34 -0700

Hi, I noticed that it didn't like the drive designation (Windows Cygwin environment). If you do

    ./nutch merge -local /STG1/index /STG1/indexes

that may work better; let me know. Cheers /r/ dan

-----Original Message-----
From: Vertical Search [mailto:[EMAIL PROTECTED]]
Sent: Sunday, April 02, 2006 7:07 PM
Subject: Re: Merging indexes -- please help

Okay. I had two sets of crawls, E:/STG1 and E:/STG2. I used the dedup command to remove duplicates. Then, based on what was available in the mail archives and the responses I got, the commands I used to merge were:

    bin/nutch merge E:/STG1/index E:/STG1/indexes
    bin/nutch merge E:/STG1/index E:/STG2/indexes

In nutch-site.xml I have searcher.dir set to E:/STG1. I get absolutely no results. The console output is as follows. Can someone shed some light on this, please?

    INFO: creating new bean
    Apr 2, 2006 8:58:36 PM org.apache.nutch.searcher.NutchBean init
    INFO: opening merged index in E:\Hoodukoo\STG5\index
    Apr 2, 2006 8:58:36 PM org.apache.nutch.searcher.NutchBean init
    INFO: opening segments in E:\Hoodukoo\STG5\segments
    Apr 2, 2006 8:58:36 PM org.apache.hadoop.conf.Configuration getConfResourceAsReader
    INFO: found resource common-terms.utf8 at file:/C:/xampp/tomcat/webapps/hoodukoo/WEB-INF/classes/common-terms.utf8
    Apr 2, 2006 8:58:36 PM org.apache.nutch.searcher.NutchBean init
    INFO: opening linkdb in E:\Hoodukoo\STG5\linkdb
    Apr 2, 2006 8:58:36 PM org.apache.jsp.search_jsp _jspService
    INFO: query request from 127.0.0.1
    Apr 2, 2006 8:58:36 PM org.apache.jsp.search_jsp _jspService
    INFO: query: site
    Apr 2, 2006 8:58:36 PM org.apache.nutch.searcher.NutchBean search
    INFO: searching for 20 raw hits
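A minimal illustration of Dan's suggestion, with everything in Cygwin notation instead of Windows drive letters (the /cygdrive/e mount point and the install path are assumptions based on the thread, not taken from it):

    # from the Nutch install directory, in a Cygwin shell
    cd /cygdrive/c/nutch
    # merge the per-part indexes under STG1/indexes into a single index,
    # using Cygwin-style paths rather than E:/...
    bin/nutch merge -local /cygdrive/e/STG1/index /cygdrive/e/STG1/indexes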
Meta-Refresh Question
Silly question, but Nutch won't follow meta-refreshes, will it? Dennis
Re: Merging indexes -- please help....
Sorry, I too have faced the same problem. I am in the process of releasing a demo (for management) over this weekend; I will try to work on the merging stuff after that. It is a very important piece, and I have to get it to work if I am to succeed in adopting Nutch for a vertical domain. Furthermore, I could not get the PruneIndexTool up and running. It asks for a query; I wonder if someone can share the query file, or the format the tool expects. It goes without saying that I am very thankful to the folks here for extending their help. Thanks

On 4/4/06, Olive g [EMAIL PROTECTED] wrote: [quoted message snipped; see above]
Re: Merging indexes -- please help....
We too have deadlines :(. I would appreciate it very much if someone could provide more insight. Is it a bug or a configuration issue? How can we even do incremental crawls on 0.8 with these issues? Should I send email to the developer mailing list? Would that help? Gurus, please help.

From: Vertical Search [EMAIL PROTECTED]
Subject: Re: Merging indexes -- please help
Date: Tue, 4 Apr 2006 10:11:51 -0500
[quoted thread snipped; see above]
Re: Merging indexes -- please help....
You might want to try this, but I am not sure if it works :-) Please make backups first!! This is a workaround. I assume that you have two working indexes, i.e. CrawlA and CrawlB (ready to go, working like a charm via the browser :-), and I am taking for granted that all the directories like index, indexes, segments etc. are inside CrawlA and CrawlB.

Now make a new directory called CrawlC, with a crawldb/current tree inside it:

    mkdir -p CrawlC/crawldb/current

Now copy the crawldb parts:

    cp -r CrawlA/crawldb/current/part-0 CrawlC/crawldb/current/part-0
    cp -r CrawlB/crawldb/current/part-0 CrawlC/crawldb/current/part-1

NOTE the part-1. Now make a directory segments under CrawlC and copy the segments:

    mkdir CrawlC/segments
    cp -r CrawlA/segments/* CrawlC/segments/
    cp -r CrawlB/segments/* CrawlC/segments/

Now you should have two directories under CrawlC: crawldb and segments. Proceed (from inside CrawlC) with:

    bin/nutch invertlinks linkdb segments/*
    bin/nutch index indexes crawldb linkdb segments/*
    bin/nutch dedup indexes
    bin/nutch merge index indexes

Change your searcher.dir in nutch-site.xml and give it a go. Cheers

On 4/4/06, Olive g [EMAIL PROTECTED] wrote: [quoted thread snipped; see above]
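For more than two crawls, the same copy trick generalizes, as long as the copied part directories end up numbered sequentially (Andrzej confirms that requirement downthread). A rough sketch, assuming local-filesystem paths and the part-0/part-1 naming used above (match whatever part names your Hadoop version actually produces):

    # hypothetical consolidation of several crawl dirs into CrawlAll
    mkdir -p CrawlAll/crawldb/current CrawlAll/segments
    i=0
    for src in CrawlA CrawlB CrawlC; do
        for p in "$src"/crawldb/current/part-*; do
            cp -r "$p" "CrawlAll/crawldb/current/part-$i"   # renumber sequentially
            i=$((i+1))
        done
        cp -r "$src"/segments/* CrawlAll/segments/
    done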
Re: Query on merged indexes returned 0 hit - test case included (Nutch 0.8)
Olive g wrote:

Hi Andrzej and other gurus who might be reading this message :-): I ran some tests and somehow my query returned 0 hits against merged indexes. Here is my test case; it's a bit long, thank you in advance for your patience:

1. Crawled the first 100 urls: ~/nutch/search/bin/nutch crawl urls-001-100 -dir test1 -depth 1 > test1.log
2. Set searcher.dir to test1.
3. Queried for movie: ~/nutch/search/bin/nutch org.apache.nutch.searcher.NutchBean movie -- it returned 64 hits (a web search via Tomcat returned the same result).
4. Crawled the second 100 urls: ~/nutch/search/bin/nutch crawl urls-101-200 -dir test2 -depth 1 > test2.log
5. Set searcher.dir to test2.
6. Queried for movie: it returned 55 hits (a web search via Tomcat returned the same result).
7. Attempted to merge using: ../search/bin/nutch merge test3 test1 test2 > merge-test3 -- it returned an error:

    Exception in thread "main" java.rmi.RemoteException: java.io.IOException: Cannot open filename /user/root/test1/crawldb/segments
        at org.apache.hadoop.dfs.NameNode.open(NameNode.java:120)

8. Attempted to merge again using: ../search/bin/nutch merge test4 test1/indexes test2/indexes > merge-test4 -- it merged successfully with no errors.
9. Set searcher.dir to test4.
10. Queried for movie: ~/nutch/search/bin/nutch org.apache.nutch.searcher.NutchBean movie -- it returned 0 hits (a web search via Tomcat returned the same result):

    060403 201545 10 opening segments in test4/segments
    060403 201545 10 found resource common-terms.utf8 at file:/root/nutch/search/conf/common-terms.utf8
    060403 201545 10 opening linkdb in test4/linkdb
    Total hits: 0

It appeared to be looking for test4/segments and test4/linkdb, which did not exist?

Well, the short answer is that you cannot at the moment merge crawldbs or linkdbs. As a consequence, you cannot use multiple outputs of 'nutch crawl' together (because NutchBean needs to reference a single linkdb during searching). This is technically possible, but simply not implemented (yet).

-- Best regards, Andrzej Bialecki (http://www.sigram.com, contact: info at sigram dot com)
Re: Query on merged indexes returned 0 hit - test case included (Nutch 0.8)
Thank you! Zaheed sent out a workaround in another thread, as follows. Do you think this would work (on Nutch 0.8 with DFS)? Also, when do you expect to port the feature to 0.8? (I know it's not the highest priority for you :)) But really, merging indexes is critical for incremental crawls. Is it possible that it can be implemented sooner? Please ... our project depends on this ... Thanks again for your help! Olive

From: Zaheed Haque [EMAIL PROTECTED]
Sent: Tuesday, April 4, 2006 4:12 PM
Subject: Re: Merging indexes -- please help
[Zaheed's workaround snipped; see the earlier message above]

From: Andrzej Bialecki [EMAIL PROTECTED]
Subject: Re: Query on merged indexes returned 0 hit - test case included (Nutch 0.8)
Date: Tue, 04 Apr 2006 18:29:07 +0200
[quoted reply snipped; see above]
RE: nutch config setup to crawl/query for word/pdf files
You don't need pdf|msword for index- and query-. There are no such plugins. -kuro
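That is, document formats matter only to the parse plugins; the index-* and query-* plugin families are format-independent. An illustrative plugin.includes value along those lines (the exact plugin list here is an assumption; adjust it to your own setup):

    plugin.includes = protocol-http|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url)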
Re: Query on merged indexes returned 0 hit - test case included (Nutch 0.8)
Olive g wrote: Thank you! Zaheed sent out a workaround in another thread as follows. Do you think this would work (on Nutch 0.8 w/ DFS)?

Yes, it should work. This is a cheap way to merge two DBs - thanks, Zaheed! Just remember to rename the part-x dirs so that they are sequential.

Also, when do you expect to port the feature to 0.8 (I know it's not the highest priority for you :)) - but really, merging indexes is critical for incremental crawls. Is it possible that it can be implemented sooner? Please ... Our project depends on this ...

These features (incremental updates, merging indexes) are already supported if you use the individual command-line tools and a single DB. So, I'm not planning to do anything about it.

-- Best regards, Andrzej Bialecki (http://www.sigram.com, contact: info at sigram dot com)
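For readers hitting this later: the single-DB route Andrzej describes is just the individual tools run in sequence against one crawldb. A sketch of one incremental round, assuming a crawl directory named crawl and 0.8-era command signatures (the paths are illustrative; check the arguments against your build's bin/nutch usage output):

    # once, at the start: seed the crawldb
    bin/nutch inject crawl/crawldb urls

    # one incremental round against the single crawldb
    bin/nutch generate crawl/crawldb crawl/segments
    s=`ls -d crawl/segments/* | tail -1`        # the segment just generated
    bin/nutch fetch $s
    bin/nutch updatedb crawl/crawldb $s

    # rebuild the link database and the index over all segments
    # (remove the old linkdb/indexes/index first if they exist)
    bin/nutch invertlinks crawl/linkdb crawl/segments/*
    bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
    bin/nutch dedup crawl/indexes
    bin/nutch merge crawl/index crawl/indexes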
RE: Meta-Refresh Question
I searched through the code, and the problem is that the URL returned for the meta-refresh is like this:

    http://www.oneforever.com/tohomepage.do;jsessionid=F3C8BBAC224990A9214A1785E5001AFD

which matches this RegexURLFilter pattern (because of the = sign):

    -[?*!@=]

So my question is: should the URL be cleaned up inside HttpBase, where it is grabbed from the page content, or would it be better to put in a URL filter to match before it gets eliminated by the filter above? Dennis

-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, April 04, 2006 9:56 AM
Subject: Re: Meta-Refresh Question

Dennis Kubes wrote: Silly question, but Nutch won't follow meta-refreshes, will it?

It should have; parse-html has support for this (ParseStatus.SUCCESS_REDIRECT), and it did work in 0.7, but now I can see that one of the necessary pieces (in Fetcher) didn't make it to 0.8. Please create a JIRA issue so that it doesn't escape our attention. Thank you!

-- Best regards, Andrzej Bialecki (http://www.sigram.com, contact: info at sigram dot com)
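A third option that may avoid touching either HttpBase or the filter ordering is a URL-normalizer rule that strips the session id before the filters run. A sketch, assuming the regex URL normalizer is enabled and that conf/regex-normalize.xml in your checkout uses this pattern/substitution format (verify both against your 0.8 tree):

    <regex-normalize>
      <!-- hypothetical rule: drop ;jsessionid=... from URLs before filtering -->
      <regex>
        <pattern>(?i);jsessionid=[0-9a-f]+</pattern>
        <substitution></substitution>
      </regex>
    </regex-normalize>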
Separate search and index servers?
I currently have Nutch 0.8 set up with two HDFS machines that store and process searches, and another machine that is both the HDFS index server and the machine running Tomcat to run searches against. Is it possible to separate the search machine from the index machine? I want to put the index machine on a highly available HA cluster using the Linux Heartbeat HA system, since it always needs to be around. I then want to create a set of search machines that a load balancer will feed searches to, and these machines will in turn send requests to the HDFS machines. Does this make sense?
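For what it's worth, Nutch's own split between these roles is the distributed search setup: each index machine runs a search server over its local index, and the Tomcat machines point at them via search-servers.txt rather than at an index directory. A sketch with made-up hosts, port, and paths (check the DistributedSearch class and its arguments against your 0.8 build):

    # on each index machine: serve the local crawl (index + segments + linkdb)
    bin/nutch org.apache.nutch.searcher.DistributedSearch\$Server 9999 /data/crawl

    # on each Tomcat machine: searcher.dir points at a directory containing
    # search-servers.txt with one "host port" line per index server, e.g.:
    #   index1.example.com 9999
    #   index2.example.com 9999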
Re: Crawling a file but not indexing it
Okay, that sounds good. Two questions:

* If I don't want to index a document, then from BasicIndexingFilter.filter, should I just return the document I receive? Or should I return null? Or something else?
* What change(s) do I have to make to HtmlParser? It seems like I can use the Parser object as-is, e.g. parse.getData().get(index) to get the meta-data value for index. What am I missing?

Thanks for the pointers! Ben

On 4/3/06, TDLN [EMAIL PROTECTED] wrote:

It depends on whether you control the seed pages or not; if you do, you could tag them index=no and skip them during indexing. You would have to change HtmlParser and BasicIndexingFilter. Rgrds, Thomas

On 4/4/06, Benjamin Higgins [EMAIL PROTECTED] wrote:

Hello, I've gone through the documentation and tried searching the mailing list archives. I bet this has come up before, but I just couldn't find it, so if someone could point me to a past discussion, that would be great. What I want to do is to crawl HTML files for links, but not actually index those files. I ask because I have several seed pages that are not meant for human consumption, so I never want them to show up in search results. How can this be accomplished? Thanks in advance, Ben
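If the seed pages are under your control, the standard robots meta tag is also worth testing before writing code: parse-html has meta-tag handling, though whether the 0.8 indexer honors noindex out of the box should be verified on a test page first (this snippet is a suggestion, not a confirmed 0.8 behavior):

    <!-- in each seed page: follow the links, but ask that the page itself not be indexed -->
    <meta name="robots" content="noindex,follow">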
Re: Crawling the local file system with Nutch - Document-
I just modified search.jsp - basically, I set the content type based on the document type being queried; the rest is handled by the protocol and the browser. I can send the code if you would like. Thanks

kauu [EMAIL PROTECTED] wrote: Thanks for your idea!! But I have a question: how do I modify search.jsp and the cached servlet to view Word and PDF as demanded by the user, seamlessly?

On 4/1/06, Vertical Search wrote:

Nutchians, I have tried to document the sequence of steps to adapt Nutch to crawl and search the local file system on a Windows machine. I have been able to do it successfully using Nutch 0.8-dev. The configuration: Inspiron 630m, Intel Pentium M 760 (2GHz/2MB cache/533MHz), Genuine Windows XP Professional. If someone can review it, that will be very helpful.

Crawling the local filesystem with Nutch (platform: Microsoft / Nutch 0.8-dev). For a Linux version, please refer to http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch - that link did help me get it off the ground. I have been working on adopting Nutch in a vertical domain, and all of a sudden I was asked to develop a proof of concept for crawling and searching the local file system. Initially I did face some problems, but some mail archives helped me proceed further. The intention is to provide an overview of the steps to crawl local file systems and search through the browser. I downloaded the Nutch nightly, then:

1. Create the environment variable NUTCH_HOME (not mandatory, but it helps).

2. Extract the downloaded nightly build.

3. Create a folder, e.g. c:/LocalSearch, and copy into it: bin/, conf/, the *.job, *.jar and *.war files, urls/, and the plugins folder.

4. Modify nutch-site.xml to include the plugins folder.

5. Modify nutch-site.xml to set the plugin includes and content limit. An example is as follows:

    plugin.includes = protocol-file|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url)
    file.content.limit = -1

6. Modify crawl-urlfilter.txt. Remember, we have to crawl the local file system, hence we have to modify the entries as follows:

    # skip http:, ftp: and mailto: urls
    ## -^(file|ftp|mailto):
    -^(http|ftp|mailto):

    # skip image and other suffixes we can't yet parse
    -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

    # skip URLs containing certain characters as probable queries, etc.
    -[?*!@=]

    # accept hosts in MY.DOMAIN.NAME
    # +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

    # accept anything else
    +.*

7. In the urls folder, create a file listing the URLs to be crawled, in file:// format. Example entries:

    file://c:/resumes/word
    file://c:/resumes/pdf
    #file:///data/readings/semanticweb/

    Nutch recognises that the third line does not contain a valid file URL and skips it, as noted in the link above.

8. Ignore the parent directories. As suggested in the Linux flavour of the local-fs crawl, I modified org.apache.nutch.protocol.file.FileResponse.getDirAsHttpResponse(java.io.File f), changing the line

    this.content = list2html(f.listFiles(), path, "/".equals(path) ? false : true);

    to

    this.content = list2html(f.listFiles(), path, false);

    and recompiled.

9. Compile the changes. I just compiled the whole source code base; it did not take more than 2 minutes.

10. Crawl the file system. On my desktop I have a shortcut to the Cygwin shell; from there:

    cd ../../cygdrive/c/$NUTCH_HOME
    bin/nutch crawl urls -dir c:/localfs/database

    Voila, that is it. After 20 minutes the files were indexed, merged and all done.

11. Extract the nutch-0.8-dev.war file to the webapps/ROOT folder, open its nutch-site.xml and add the searcher.dir property to point at the crawl output:

    searcher.dir = c:/localfs/database
    (Path to the root of the crawl. This directory is searched, in order, for either the file search-servers.txt containing a list of distributed search servers, a directory "index" containing merged indexes, or a directory "segments" containing segment indexes.)

12. Searching locally was a bit slow, so I edited the hosts file to map the machine name to localhost. That sped up searching considerably.

13. Modify search.jsp and the cached servlet to view Word and PDF as demanded by the user, seamlessly.

I hope this helps folks who are trying to adopt Nutch for the local file system. Personally, I believe corporations should adopt Nutch rather than buying a Google appliance :)

-- www.babatu.com

Sudhi Seshachala http://sudhilogs.blogspot.com/
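Before wiring up Tomcat, the result can be sanity-checked from the command line with the NutchBean invocation used elsewhere in this digest (assuming conf/nutch-site.xml already has searcher.dir set to c:/localfs/database; the query term is just an example):

    # should print a "Total hits:" count for a term present in the crawled files
    bin/nutch org.apache.nutch.searcher.NutchBean resume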
stackoverflow
Hi, I am getting a stack overflow error when I run the CrawlTool with a depth of 5. Is the depth too high, resulting in the stack overflow? Or am I messing up some other parameter? The URL file contains a single URL. Thanks, Rajesh
Re: Query on merged indexes returned 0 hit - more issues
Hi gurus, I tried the workaround and I found some more issues. It appears to me that invertlinks does not work properly with more than 5 input parts. For example, the following command (with the number of map tasks set to 5 and the number of reduce tasks set to 5, using DFS, Nutch 0.8):

    ../search/bin/nutch invertlinks test5/linkdb test5/segments/20060403192429 test5/segments/20060403193814 > linkdb-test5

generated basically the same error for all 5 reduce tasks:

    java.rmi.RemoteException: java.io.IOException: Could not complete write to file /user/root/test5/linkdb/362527374/part-0/.data.crc by DFSClient_441718647
        at java.lang.Throwable.<init>(Throwable.java:57)
        at java.lang.Throwable.<init>(Throwable.java:68)
        at org.apache.hadoop.dfs.NameNode.complete(NameNode.java:205)

The contents of test5/segments/20060403192429/content/ are:

    /user/root/test5/segments/20060403192429/content/part-0 123617
    /user/root/test5/segments/20060403192429/content/part-1 141105
    /user/root/test5/segments/20060403192429/content/part-2 168565
    /user/root/test5/segments/20060403192429/content/part-3 179788
    /user/root/test5/segments/20060403192429/content/part-4 70356

and the contents of test5/segments/20060403193814/content/ are:

    /user/root/test5/segments/20060403193814/content/part-0 103014
    /user/root/test5/segments/20060403193814/content/part-1 159010
    /user/root/test5/segments/20060403193814/content/part-2 92892
    /user/root/test5/segments/20060403193814/content/part-3 103847
    /user/root/test5/segments/20060403193814/content/part-4 102626

In the example above there are 10 input parts in two segments. I noticed that this doesn't happen when there are no more than 5 input parts, and it consistently happens when there are more than 5, even if they are in the same segment. The urgency of this problem is that it prevents incremental crawling, whether by merging segments or by incremental-depth crawling, because after 5 more incremental crawls we have 6 parts. Please let me know what you think. Thank you! Olive

From: Andrzej Bialecki [EMAIL PROTECTED]
Subject: Re: Query on merged indexes returned 0 hit - test case included (Nutch 0.8)
Date: Tue, 04 Apr 2006 19:20:43 +0200
[quoted reply snipped; see above]
Re: Adaptive fetch
Hi, has the patch for Adaptive Refetch been released? Considering an intranet setting, using Nutch to index a large number of static HTML pages, I expect this feature to play a crucial role. Please update me on this. Thanks, D.Saravanaraj

On 3/31/06, Andrzej Bialecki [EMAIL PROTECTED] wrote:

Raghavendra Prabhu wrote: I believe we had a recent mail about a problem with redirection as well (with this patch applied). And as you said, more people testing the patch would be better. Considering that this has the highest votes among add-on features, it is a critical one, I guess.

Ok, I'll bring this patch up to date over the weekend.

-- Best regards, Andrzej Bialecki (http://www.sigram.com, contact: info at sigram dot com)
nutch-svn: Reduce operations: can't open map output
Hi list, I downloaded the nightly build of Nutch+Hadoop and have been trying to get it working on a small cluster of machines. I have it working properly on a single machine; however, when I try to have my map and reduce tasks run on the cluster slaves, I get the following exception:

    060405 000219 SEVERE Can't open map output: /home2/nutch/filesystem/mapreduce/local/part-2.out/task_m_1v749p
    java.io.FileNotFoundException: /home2/nutch/filesystem/mapreduce/local/part-2.out/task_m_1v749p
        at org.apache.hadoop.fs.LocalFileSystem.openRaw(LocalFileSystem.java:114)
    (rest of stack trace snipped)

Oddly enough, this map task ran on the same machine which produced the above error message. This is the output from the map task on that machine:

    060405 000210 task_m_1v749p Child starting
    060405 000211 Server connection on port 50050 from 127.0.0.1: starting
    060405 000211 task_m_1v749p Client connection to 0.0.0.0:50050: starting
    060405 000211 task_m_1v749p Client connection to 10.10.0.3:9000: starting
    060405 000211 task_m_1v749p Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
    060405 000211 Server connection on port 50050 from 127.0.0.1: starting
    060405 000211 task_m_1v749p Client connection to 0.0.0.0:50050: starting
    060405 000211 task_m_1v749p 1.0% /user/nutch/urls/urllist.txt:2+2
    060405 000211 Task task_m_1v749p is done.
    (parsing lines snipped for brevity)

All map tasks finish with the output above; however, none of my reduce tasks are finishing. The problem exists whether the map task and the corresponding reduce task which depends on the map's output run on the same machine or on different machines. In both cases I see an IPC timeout exception being thrown one minute (60,000 ms, as specified in the hadoop-default.xml file) after the above FileNotFoundException is generated. Does anyone have any pointers as to where I should look to determine why the map output is not being generated, or is not able to be accessed? Regards, -Shawn
RE: stackoverflow
Hi Rajesh, will you be able to tell me about deploying Nutch? Thanks in advance.

-----Original Message-----
From: Rajesh Munavalli [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, April 05, 2006 3:48 AM
Subject: stackoverflow
[quoted message snipped; see above]