Adaptive fetch
Hi Andrzej,

Can you put in the latest version of the diff for the adaptive fetch? We seem to have a problem patching against the latest release. This should help us test it.

Rgds
Prabhu
Re: [Nutch-general] Re: Using Nutch with Ferret (ruby)
On Mar 30, 2006, at 4:10 PM, mike c wrote:
> Hi Erik, Thanks for pointing this out - as I just got Ferret working with indexes created using Nutch. Any recommendations on how to address this issue?

This is a particularly insidious issue. Java Lucene is not using pure UTF-8, whereas ports like Ferret are. But changing Java Lucene is a big deal, and apparently it does introduce a (slight) performance hit. The plan is for Java Lucene to be corrected in this regard at some point in the future, perhaps as soon as Lucene 2.0. For now, though, I don't know of a way to address this issue. I gave up on Ferret for the time being because of this incompatibility and am now prototyping with Solr, while still using my custom XML-RPC search server. Erik

On 3/30/06, Erik Hatcher [EMAIL PROTECTED] wrote:
> There is one notable incompatibility between Ferret and Java Lucene: the UTF-8 issue that has surfaced with regard to Java Lucene. All can be well between Java Lucene and Ferret until characters in another range are indexed, and then Ferret will blow up trying to search the index. Maybe this has been worked around in a more recent version of Ferret than I've tried? Erik

On Mar 30, 2006, at 2:50 PM, mike c wrote:
> Thanks. I'll try it out. In the meantime, if I get Ferret working I'll post an update. -Mike

On 3/30/06, Steven Yelton [EMAIL PROTECTED] wrote:
> I use WEBrick instead of Tomcat to query and serve search results. I used Ruby's 'rjb' to bridge the gap: http://raa.ruby-lang.org/project/rjb/ There may be more direct ways (ruby-lucene), but this was quick and easy and still has decent performance. Steven

mike c wrote:
> Hi all, I was wondering if anyone is using Nutch (for crawling) with Ferret (indexing/searching). My front-end is built using Ruby on Rails - that's why I'm asking. I have the Nutch crawler up and running fine, but can't seem to figure out how to integrate the two. Any help is appreciated. Regards, Mike
Re: [Nutch-general] Re: Using Nutch with Ferret (ruby)
Any easy link to the bug report of this UTF-8 Lucene issue?

On 3/31/06, Erik Hatcher [EMAIL PROTECTED] wrote:
> This is a particularly insidious issue. Java Lucene is not using pure UTF-8, whereas ports like Ferret are. [rest of the quoted thread trimmed; see the previous message]

--
Minds are like parachutes, they work best when open.
Bruno Patini Furtado
Software Developer
webpage: http://bpfurtado.net
software development blog: http://bpfurtado.livejournal.com
Re: Adaptive fetch
Raghavendra Prabhu wrote:
> Hi Andrzej, Can you put in the latest version of the diff for the adaptive fetch? We seem to have a problem patching against the latest release. This should help us test it.

The patch is probably out of sync; there have been many (trivial) changes in the meantime. The best option would be to commit this functionality, if enough people consider it of sufficiently good quality. What prevents me from doing this is that I don't use this version on a regular basis - the original version is good enough for my use, even though it is not ideal. And I have a feeling that not too many people have really reviewed this patch. So, IMHO these patches need more testing, because the potential for disruption is rather large.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web; Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
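For anyone who wants to check whether the posted diff still applies before testing, a dry run against a current checkout is a quick first step. A minimal sketch - the repository URL and the patch filename adaptive-fetch.diff are assumptions, so substitute whatever the actual attachment is called:

    # get a fresh copy of the current Nutch trunk
    svn co http://svn.apache.org/repos/asf/lucene/nutch/trunk nutch-trunk
    cd nutch-trunk

    # dry run: reports failing hunks without modifying any files
    patch -p0 --dry-run < adaptive-fetch.diff

    # if the dry run is clean, apply for real and rebuild
    patch -p0 < adaptive-fetch.diff
    ant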
Re: Adaptive fetch
I believe we had a recent mail about a problem with redirection as well (with this patch applied). And as you said, more people testing the patch would be better. Considering that this has the highest votes of the add-on features, I guess it is a critical one.

Rgds
Prabhu

On 3/31/06, Andrzej Bialecki [EMAIL PROTECTED] wrote:
> The patch is probably out of sync; there have been many (trivial) changes in the meantime. [rest of the quoted message trimmed; see the previous message]
Re: Adaptive fetch
Raghavendra Prabhu wrote:
> I believe we had a recent mail about a problem with redirection as well (with this patch applied). And as you said, more people testing the patch would be better. Considering that this has the highest votes of the add-on features, I guess it is a critical one.

Ok, I'll bring this patch up to date over the weekend.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web; Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
Re: Multiple crawls how to get them to work together
Can you share the script with everyone?

On 3/31/06, Berlin Brown [EMAIL PROTECTED] wrote:
> Do you have that shell script?

On 3/30/06, Dan Morrill [EMAIL PROTECTED] wrote:
> Hi folks, it worked - it worked great. I made a shell script to do the work for me. Thank you, thank you, and again, thank you. r/d

-----Original Message----- From: Dan Morrill, Sent: Thursday, March 30, 2006 5:12 AM, To: nutch-user@lucene.apache.org, Subject: RE: Multiple crawls how to get them to work together

> Aled, I'll try that today - excellent, and thanks for the heads-up on the db directory. I'll let you know how it goes. r/d

-----Original Message----- From: Aled Jones, Sent: Thursday, March 30, 2006 12:24 AM, To: nutch-user@lucene.apache.org, Subject: Re: Multiple crawls how to get them to work together

Hi Dan

I'll presume you've done the crawls already. Each resulting crawl folder should have three subfolders: db, index and segments. Create your search.dir folder and create a segments folder inside it. Each segments folder in each crawl folder should contain folders with timestamps as names. Copy the contents of crawlA/segments, crawlB/segments and crawlC/segments (i.e., the timestamp-named folders) into search.dir/segments.

Next, delete the duplicates from the segments by running:

    bin/nutch dedup -local search.dir/segments

Then merge the segments to create an index folder:

    bin/nutch merge -local search.dir/index search.dir/segments/*

You should now have two folders in your search.dir:

    search.dir/segments
    search.dir/index

That's all you need for serving pages (the db folder is only used when fetching). Now just set the searcher.dir property value in nutch-site.xml to the location of search.dir.

That's how I've been doing it, although it may not be the right way. :-) Hope this helps.

Cheers
Aled

-----Original Message----- From: Dan Morrill, Sent: 29 March 2006 18:06, To: nutch-user@lucene.apache.org, Subject: Multiple crawls how to get them to work together

Hi folks, I have 3 crawls: crawlA, crawlB, and crawlC. I would like all of them to be available to the search.jsp page. I went through the site, saw merge, index, make new db, and followed all the directions I could find, but still no resolution on this one. I intend to have 2 or 3 boxes each make a crawl, then somehow merge the crawls together and form a master under search.dir, and update it on a regular basis. Unfortunately, the instructions to date have all been tried and have not worked, and there are no indexmerger or indexsegments directives in Nutch 0.7.1. Any support ideas, direct pointers, or step-by-step instructions on how to do this (beyond what is in the tutorials, because that has been tried already) would be welcome.

Cheers/r/dan

--
www.babatu.com
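Since the script itself didn't make it into the thread, here is a minimal sketch of what it might look like, built directly from Aled's steps above (the crawl folder names crawlA/crawlB/crawlC and the script name are assumptions; the commands use the Nutch 0.7.x syntax quoted in the thread):

    #!/bin/sh
    # merge-crawls.sh - combine several Nutch 0.7.x crawls into one searchable dir
    # (hypothetical helper based on the steps described above)

    SEARCH_DIR=search.dir
    mkdir -p "$SEARCH_DIR/segments"

    # copy the timestamp-named segment folders from each crawl
    for crawl in crawlA crawlB crawlC; do
        cp -r "$crawl"/segments/* "$SEARCH_DIR/segments/"
    done

    # remove duplicate pages across the merged segments
    bin/nutch dedup -local "$SEARCH_DIR/segments"

    # build a single merged index from all segments
    bin/nutch merge -local "$SEARCH_DIR/index" "$SEARCH_DIR/segments"/*

    # finally, point searcher.dir in nutch-site.xml at search.dir and restart Tomcat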
Log Analysis
What open source tools do people like for analyzing Nutch search log files? I'm specifically looking to find the most frequent search terms. The reports are for internal consumption, to help understand what people are looking for and to make sure they're finding it. Thanks, Jake.
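If the logs are line-oriented, plain Unix tools go a long way before reaching for a dedicated package. A minimal sketch, assuming each query is logged on its own line in a form like "query: <terms>" - the log path and the pattern are assumptions, so adjust them to your actual log format:

    # top 20 most frequent queries, case-folded
    grep 'query:' logs/nutch-search.log \
      | sed 's/.*query: *//' \
      | tr '[:upper:]' '[:lower:]' \
      | sort | uniq -c | sort -rn | head -20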
Crawling the local file system with Nutch - Document
Nutchians,

I have tried to document the sequence of steps to adapt Nutch to crawl and search the local file system on a Windows machine. I have been able to do it successfully using Nutch 0.8-dev. The configuration is as follows: Inspiron 630m, Intel(R) Pentium(R) M Processor 760 (2GHz/2MB Cache/533MHz), Genuine Windows XP Professional. If someone can review it, that will be very helpful.

Crawling the local filesystem with Nutch. Platform: Microsoft Windows / Nutch 0.8-dev. For a Linux version, please refer to http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch - that link did help me get it off the ground.

I have been working on adapting Nutch in a vertical domain. All of a sudden, I was asked to develop a proof of concept for using Nutch to crawl and search the local file system. Initially I did face some problems, but some mail archives helped me proceed further. The intention here is to provide an overview of the steps to crawl local file systems and search them through the browser. I downloaded the Nutch nightly build, then:

1. Create an environment variable such as NUTCH_HOME. (Not mandatory, but it helps.)

2. Extract the downloaded nightly build. Don't build yet.

3. Create a folder, e.g. c:/LocalSearch, and copy the following folders and libraries into it:
   1. bin/
   2. conf/
   3. *.job, *.jar and *.war files
   4. urls/ (the URLs folder)
   5. the plugins folder

4. Modify nutch-site.xml to include the plugins folder.

5. Modify nutch-site.xml to set the plugin includes. An example is as follows:

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
    <!-- Put site-specific property overrides in this file. -->
    <nutch-conf>
      <property>
        <name>plugin.includes</name>
        <value>protocol-file|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url)</value>
      </property>
      <property>
        <name>file.content.limit</name>
        <value>-1</value>
      </property>
    </nutch-conf>

6. Modify crawl-urlfilter.txt. Remember we have to crawl the local file system, so the entries change as follows:

    # skip http:, ftp: and mailto: urls
    -^(http|ftp|mailto):

    # skip image and other suffixes we can't yet parse
    -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

    # skip URLs containing certain characters as probable queries, etc.
    -[?*!@=]

    # accept hosts in MY.DOMAIN.NAME
    #+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

    # accept anything else
    +.*

7. The urls folder: create a file listing all the URLs to be crawled and save it under the urls folder. The directories should be in file:// format. Example entries:

    file://c:/resumes/word
    file:///c:/resumes/word
    file://c:/resumes/pdf
    file:///c:/resumes/pdf
    #file:///data/readings/semanticweb/

   Nutch recognises when a line does not contain a valid file URL and skips it, as noted in the link above.

8. Ignoring the parent directories. As suggested in the Linux flavour of the local-filesystem crawl write-up, I modified the code in org.apache.nutch.protocol.file.FileResponse.getDirAsHttpResponse(java.io.File f). I changed the following line:

    this.content = list2html(f.listFiles(), path, "/".equals(path) ? false : true);

   to:

    this.content = list2html(f.listFiles(), path, false);

   and recompiled.

9. Compile the changes. I just compiled the whole source code base; it did not take more than 2 minutes.

10. Crawl the file system. On my desktop I have a shortcut to Cygwin; cd to $NUTCH_HOME under /cygdrive/c and execute:

    bin/nutch crawl urls -dir c:/localfs/database

   Voila, that is it. After 20 minutes the files were fetched, indexed and merged - all done.

11. Extract the nutch-0.8-dev.war file to TOMCAT_HOME/webapps/ROOT. Open nutch-site.xml there and add the following snippet to point at the search folder:

    <property>
      <name>searcher.dir</name>
      <value>c:/localfs/database</value>
      <description>
        Path to root of crawl. This directory is searched (in order) for
        either the file search-servers.txt, containing a list of distributed
        search servers, or the directory "index" containing merged indexes,
        or the directory "segments" containing segment indexes.
      </description>
    </property>

12. Searching locally was a bit slow, so I changed the hosts.ini file to map the machine name to localhost. That sped up search considerably.

13. Modified search.jsp and the cached-content servlet so that users can view Word and PDF documents seamlessly.

I hope this helps folks who are trying to adopt Nutch for the local file system. Personally, I believe corporations should adopt Nutch rather than buying a Google appliance :)
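Once the crawl finishes, the index can be sanity-checked from the command line before setting up Tomcat. A sketch using Nutch's command-line search bean (assuming conf/nutch-site.xml already has searcher.dir pointing at c:/localfs/database; "resume" is just a sample query term):

    # run a test query directly against the crawl directory
    bin/nutch org.apache.nutch.searcher.NutchBean resume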
Re: Crawling the local file system with Nutch - Document
Thanks for your idea! But I have a question: how do you modify search.jsp and the cached servlet so that users can view Word and PDF documents seamlessly?

On 4/1/06, Vertical Search [EMAIL PROTECTED] wrote:
> Nutchians, I have tried to document the sequence of steps to adapt Nutch to crawl and search the local file system on a Windows machine. [full write-up quoted in the previous message; trimmed]

--
www.babatu.com