Adaptive fetch

2006-03-31 Thread Raghavendra Prabhu
Hi Andrzej

Can you put in the latest version of the diff for the adaptive fetch?

Because we seem to have problems patching against the latest release.

This should help us test it.

Rgds
Prabhu


Re: [Nutch-general] Re: Using Nutch with Ferret (ruby)

2006-03-31 Thread Erik Hatcher

On Mar 30, 2006, at 4:10 PM, mike c wrote:

Hi Erik,
Thanks for pointing this out - as I just got Ferret working with
indexes created using Nutch.  Any recommendations on how to address
this issue?


This is a particularly insidious issue.  Java Lucene is not using  
pure UTF-8, whereas ports like Ferret are.  But changing Java Lucene  
is a big deal and does introduce a (slight) performance hit  
apparently.  The plan is for Java Lucene to be corrected in this  
regard at some point in the future, perhaps as soon as Lucene 2.0.


But for now, I don't know of a way to address this issue.  I gave up  
on Ferret for the time being because of this incompatibility and am  
now prototyping with Solr while still using my custom XML-RPC search  
server for now.


Erik





-Mike

On 3/30/06, Erik Hatcher [EMAIL PROTECTED] wrote:

There is one incompatibility between Ferret and Java Lucene of note.
It is the UTF-8 issue that has surfaced with regards to Java
Lucene.  All can be well between Java Lucene and Ferret, until
characters in another range are indexed, and then Ferret will blow up
trying to search the index.  Maybe this has been worked around in a
more recent version of Ferret than I've tried?

Erik


On Mar 30, 2006, at 2:50 PM, mike c wrote:


Thanks.  I'll try it out.  In the mean time, if I get Ferret working
I'll post an update.

-Mike

On 3/30/06, Steven Yelton [EMAIL PROTECTED] wrote:
I use WEBrick instead of tomcat to query and serve search results.  I
used ruby's 'rjb' to bridge the gap.

http://raa.ruby-lang.org/project/rjb/

There may be more direct ways (ruby-lucene), but this was quick and
easy and still has decent performance.

Steven

mike c wrote:


Hi all,
I was wondering if anyone is using Nutch (for crawling) with Ferret
(indexing / searching).  Basically, my front-end is built using Ruby
on Rails, that's why I'm asking.  I have the Nutch crawler up and
running fine, but can't seem to figure out how to integrate the two.

Any help is appreciated.

Regards,
Mike








Re: [Nutch-general] Re: Using Nutch with Ferret (ruby)

2006-03-31 Thread Bruno Patini Furtado
Is there an easy link to the bug report for this UTF-8 Lucene issue?

On 3/31/06, Erik Hatcher [EMAIL PROTECTED] wrote:

 On Mar 30, 2006, at 4:10 PM, mike c wrote:
  Hi Erik,
  Thanks for pointing this out - as I just got Ferret working with
  indexes created using Nutch.  Any recommendations on how to address
  this issue?

 This is a particularly insidious issue.  Java Lucene is not using
 pure UTF-8, whereas ports like Ferret are.  But changing Java Lucene
 is a big deal and does introduce a (slight) performance hit
 apparently.  The plan is for Java Lucene to be corrected in this
 regard at some point in the future, perhaps as soon as Lucene 2.0.

 But for now, I don't know of a way to address this issue.  I gave up
 on Ferret for the time being because of this incompatibility and am
 now prototyping with Solr while still using my custom XML-RPC search
 server for now.

 Erik



 




--
Minds are like parachutes, they work best when open.

Bruno Patini Furtado
Software Developer
webpage: http://bpfurtado.net
software development blog: http://bpfurtado.livejournal.com


Re: Adaptive fetch

2006-03-31 Thread Andrzej Bialecki

Raghavendra Prabhu wrote:

Hi Andrzej

Can you put in the latest version of the diff for the adaptive fetch?

Because we seem to have problems patching against the latest release.

This should help us test it.
  


The patch is probably out of sync; there have been many (trivial) 
changes in the meantime. The best option would be to commit this 
functionality, if enough people consider it of sufficiently good 
quality. What prevents me from doing this is that I don't use this 
version on a regular basis - the original version is good enough for my 
use, even though not ideal. And I have a feeling that not too many 
people have really reviewed this patch.


So, IMHO these patches need more testing, because the potential for 
disruption is rather large.


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: Adaptive fetch

2006-03-31 Thread Raghavendra Prabhu
I believe there was a recent mail about a problem with redirection as well
(with this patch applied).

And as you said, more people testing the patch would be better.

Considering that this has the highest votes among the add-on features, it is a
critical one, I guess.


Rgds
Prabhu

On 3/31/06, Andrzej Bialecki [EMAIL PROTECTED] wrote:

 Raghavendra Prabhu wrote:
  Hi Andrzej
 
  Can you put in the latest version of the diff for the adaptive fetch?
 
  Because we seem to have problems patching against the latest release.
 
  This should help us test it.
 

 The patch is probably out of sync; there have been many (trivial)
 changes in the meantime. The best option would be to commit this
 functionality, if enough people consider it of sufficiently good
 quality. What prevents me from doing this is that I don't use this
 version on a regular basis - the original version is good enough for my
 use, even though not ideal. And I have a feeling that not too many
 people have really reviewed this patch.

 So, IMHO these patches need more testing, because the potential for
 disruption is rather large.

 --
 Best regards,
 Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com





Re: Adaptive fetch

2006-03-31 Thread Andrzej Bialecki

Raghavendra Prabhu wrote:

I believe there was a recent mail about a problem with redirection as well
(with this patch applied).

And as you said, more people testing the patch would be better.

Considering that this has the highest votes among the add-on features, it is a
critical one, I guess.
  


Ok, I'll bring this patch up to date over the weekend.

--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: Multiple crawls how to get them to work together

2006-03-31 Thread kauu
Can you share the script with everyone?

On 3/31/06, Berlin Brown [EMAIL PROTECTED] wrote:

 Do you have that shell script?

 On 3/30/06, Dan Morrill [EMAIL PROTECTED] wrote:
  Hi folks,
 
  It worked, it worked great, I made a shell script to do the work for me.
  Thank you, thank you, and again, thank you.
 
  r/d
 
  -Original Message-
  From: Dan Morrill [mailto:[EMAIL PROTECTED]
  Sent: Thursday, March 30, 2006 5:12 AM
  To: nutch-user@lucene.apache.org
  Subject: RE: Multiple crawls how to get them to work together
 
  Aled,
 
  I'll try that today, excellent, and thanks for the heads up on the db
  directory. I'll let you know how it goes.
 
  r/d
 
 
 
  -Original Message-
  From: Aled Jones [mailto:[EMAIL PROTECTED]
  Sent: Thursday, March 30, 2006 12:24 AM
  To: nutch-user@lucene.apache.org
  Subject: ATB: Multiple crawls how to get them to work together
 
  Hi Dan
 
  I'll presume you've done the crawls already..
 
  Each resulting crawl folder should have three folders: db, index and
  segments.

  Create your search.dir folder and create a segments folder in it.

  The segments folder in each crawl folder should contain folders with
  timestamps as the names.  Copy the contents of:

  crawlA/segments
  crawlB/segments
  crawlC/segments

  (i.e. the folders with timestamps as names) into:

  search.dir/segments
 
  Next, delete the duplicates from the segments by running the command:
 
  bin/nutch dedup -local search.dir/segments
 
  Then you need to merge the segments to create an index folder, so run
  the command:
 
  bin/nutch merge -local search.dir/index search.dir/segments/*
 
  You should now have two folders in your search.dir:
  search.dir/segments
  search.dir/index
 
  That's all you need for serving pages (db folder is only used when
  fetching).
 
  Now just set the searcher.dir property value in nutch-site.xml to be the
  location of search.dir
 
  That's how I've been doing it, although it may not be the right way.
  :-) Hope this helps.
 
  Cheers
  Aled
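
In case anyone wants to script this, below is a minimal sketch that simply
strings Aled's steps together. It is not Dan's actual script; the
crawlA/crawlB/crawlC folder names and the search.dir location are just the
examples used above, so adjust the paths to your own layout.

# sketch: merge three crawls into one searchable search.dir
mkdir -p search.dir/segments

# copy the timestamped segment folders from each crawl
cp -r crawlA/segments/* crawlB/segments/* crawlC/segments/* search.dir/segments/

# delete duplicate pages across the copied segments
bin/nutch dedup -local search.dir/segments

# merge the segments into a single index
bin/nutch merge -local search.dir/index search.dir/segments/*

# finally, point searcher.dir in nutch-site.xml at search.dir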
 
 
   -Neges Wreiddiol-/-Original Message-
   Oddi wrth/From: Dan Morrill [mailto:[EMAIL PROTECTED]
   Anfonwyd/Sent: 29 March 2006 18:06
   At/To: nutch-user@lucene.apache.org
   Copi/Cc: [EMAIL PROTECTED]
   Pwnc/Subject: Multiple crawls how to get them to work together
  
   Hi folks,
  
  
  
   I have 3 crawls, crawlA, crawlB, and crawlC. I would like all
   of them to be available to the search.jsp page.
  
  
  
   I went through the site, saw merge, index, make new db, and
   followed all the directions that I could find, but still no
   resolution on this one. So what I need are some ideas on
   where to proceed from here. I intend to have 2 or
   3 boxes make a crawl, then somehow merge the crawls together
   and form a master under search.dir. I would also want to
   update this one on a regular basis.
  
  
  
   Unfortunately, the instructions to date have all been tried,
   and have all led to the idea not working. There are also no
   indexmerger or indexsegments directives in nutch 0.7.1. Any
   support ideas, direct pointers, or even step-by-step
   instructions on how to do this would be appreciated (outside
   of what is in the tutorials, because that has been tried
   already, including the suggestions in the user mailing list).
  
  
  
   Cheers/r/dan
  
  
  
  
  
  
  
  
 
 




--
www.babatu.com


Log Analysis

2006-03-31 Thread Vanderdray, Jacob
What open source tools do people like for analyzing nutch search
log files?  I'm specifically looking to find the most frequent search
terms.  The reports are for internal consumption, to help understand what
people are looking for and to make sure they're finding it.

Thanks,
Jake.
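
For what it's worth, the most frequent terms can usually be pulled straight
out of the servlet container's access log with standard shell tools. A
minimal sketch, assuming searches are logged as requests to search.jsp with
a query= parameter (adjust the log path and parameter name to your setup):

# sketch: top 20 search terms from a Tomcat-style access log
grep -o 'query=[^& "]*' access_log |
  cut -d= -f2 |
  sort | uniq -c | sort -rn | head -20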


Crawling the local file system with Nutch - Document-

2006-03-31 Thread Vertical Search
Nutchians,
I have tried to document the sequence of steps to adopt nutch to crawl and
search the local file system on a Windows machine.
I have been able to do it successfully using nutch 0.8 Dev.
The configuration is as follows:
*Inspiron 630m
Intel(r) Pentium(r) M Processor 760 (2GHz/2MB Cache/533MHz, Genuine Windows XP
Professional)*
*If someone can review it, that would be very helpful.*

Crawling the local filesystem with nutch
Platform: Microsoft / nutch 0.8 Dev
For a linux version, please refer to
http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch
The link did help me get it off the ground.

I have been working on adopting nutch in a vertical domain. All of a sudden,
I was asked to develop a proof of concept
to adopt nutch to crawl and search the local file system.
Initially I did face some problems, but some mail archives did help me
proceed further.
The intention is to provide an overview of the steps to crawl local file
systems and search through the browser.

I downloaded the nutch nightly build.
1. Create an environment variable such as NUTCH_HOME. (Not mandatory, but
helps)
2. Extract the downloaded nightly build. Don't build it yet.
3. Create a folder -- c:/LocalSearch -- and copy the following folders and
libraries into it:
 1. bin/
 2. conf/
 3. *.job, *.jar and *.war files
 4. urls/ folder
 5. plugins folder
4. Modify the nutch-site.xml to include the plugin folder.
5. Modify the nutch-site.xml to set the plugin includes. An example is as
follows:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<nutch-conf>
<property>
  <name>plugin.includes</name>
  <value>protocol-file|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url)</value>
</property>
<property>
  <name>file.content.limit</name> <value>-1</value>
</property>
</nutch-conf>

6. Modify crawl-urlfilter.txt
Remember we have to crawl the local file system. Hence we have to modify the
entries as follows

#skip http:, ftp:,  mailto: urls
##-^(file|ftp|mailto):

-^(http|ftp|mailto):

#skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

#skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

#accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

#accept anything else
+.*

7. urls folder
Create a file for all the urls to be crawled. The file should have the urls
as below;
save the file under the urls folder.

The directories should be in file:// format. Example entries were as
follows

file://c:/resumes/word
file://c:/resumes/pdf

#file:///data/readings/semanticweb/

Nutch recognises that the third line does not contain a valid file-url and
skips it
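
From the Cygwin shell, the urls file can be created like this (urls/local.txt
is just an example name; any file under the urls folder works):

# sketch: create the urls file with the directories to crawl
mkdir -p urls
cat > urls/local.txt <<'EOF'
file://c:/resumes/word
file://c:/resumes/pdf
EOF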

8. Ignoring the parent directories. As suggested by the link above (the
linux flavor of the local fs crawl), I modified the code in
org.apache.nutch.protocol.file.FileResponse.getDirAsHttpResponse(
java.io.File f).

I changed the following line:

this.content = list2html(f.listFiles(), path, "/".equals(path) ? false : true);

to

this.content = list2html(f.listFiles(), path, false);

and recompiled.

9. Compile the changes. I just compiled the whole source code base; it did not
take more than 2 minutes.

10. Crawling the file system.
On my desktop I have a shortcut to the Cygwin shell; double-click it, then:
pwd
cd ../../cygdrive/c/$NUTCH_HOME

Execute:
bin/nutch crawl urls -dir c:/localfs/database

Voila, that is it. After 20 minutes, the files were indexed, merged and all
done.

11. Extracted the nutch-0.8-dev.war file to the TOMCAT_HOME/webapps/ROOT
folder.

Opened the nutch-site.xml and added the following snippet to reflect the
search folder
<property>
  <name>searcher.dir</name>
  <value>c:/localfs/database</value>
  <description>
  Path to root of crawl.  This directory is searched (in
  order) for either the file search-servers.txt, containing a list of
  distributed search servers, or the directory index containing
  merged indexes, or the directory segments containing segment
  indexes.
  </description>
</property>

12. Searching locally was a bit slow, so I changed the hosts.ini file to map
the machine name to localhost. That sped up search considerably.
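
For reference, the mapping in step 12 is a single hosts-file entry along
these lines (mymachine stands in for the actual machine name):

127.0.0.1    localhost    mymachine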

13. Modified the search.jsp and the cached servlet so that users can view
Word and PDF documents seamlessly on demand.


I hope this helps folks who are trying to adopt nutch for the local file
system. Personally, I believe corporations should adopt nutch rather than
buying a Google appliance :)


Re: Crawling the local file system with Nutch - Document-

2006-03-31 Thread kauu
Thanks for your idea!
But I have a question:
how do you modify the search.jsp and cached servlet so that users can view
Word and PDF documents seamlessly on demand?





--
www.babatu.com