Re: Indexing distant web sites

Karl Øie Mon, 04 Nov 2002 05:40:48 -0800

Oh, sorry, perhaps I was not making myself clear here...

You will have to use the crawler to retrieve the content and store it locally for indexing. That means setting up your crawler to fetch a site and store every HTML page's content to disk, then running Lucene on the locally stored HTML pages, and afterwards deleting those pages. You will also need a way to get the original URL from the crawler and store it in Lucene as well, as a keyword field.
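The store-to-disk variant could be sketched roughly like this. It is only an illustration and not taken from any real crawler: DiskSpool, PageIndexer and newSpoolDir are made-up names, and the index() callback merely marks the place where you would build a Lucene Document (URL as a keyword field, page text as a text field) and hand it to your IndexWriter. The sketch keeps the original URL by writing it as the first line of each stored file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class DiskSpool {
    /** Hypothetical hook: this is where the Lucene indexing call would go. */
    public interface PageIndexer { void index(String url, String html); }

    /** Creates a temporary directory to hold the fetched pages. */
    public static Path newSpoolDir() {
        try {
            return Files.createTempDirectory("crawl-spool");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Called by the crawler for each fetched page: first line is the URL, the rest is the HTML. */
    public static Path store(Path dir, String url, String html) {
        // Simplification: file name derived from the URL hash; collisions are not handled.
        Path file = dir.resolve(Integer.toHexString(url.hashCode()) + ".html");
        try {
            Files.write(file, (url + "\n" + html).getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return file;
    }

    /** Indexes every stored page together with its original URL, then deletes the file. */
    public static int indexAndDelete(Path dir, PageIndexer indexer) {
        try {
            // Collect the file list first so that deleting does not interfere with the listing.
            List<Path> files = new ArrayList<>();
            try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.html")) {
                for (Path p : stream) {
                    files.add(p);
                }
            }
            for (Path file : files) {
                String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
                int nl = text.indexOf('\n');
                indexer.index(text.substring(0, nl), text.substring(nl + 1));
                Files.delete(file);
            }
            return files.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The crawler would call store() for every page it fetches, and indexAndDelete() would run once the crawl is finished.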

A much more efficient way is to have the crawler fetch one page, keep it in memory, run Lucene on it, discard the buffer, and then move on to the next page.
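As a rough sketch of that one-page-at-a-time loop (again only an illustration, not the Cocoon or websphinx code): Fetcher and PageIndexer are hypothetical interfaces, where fetch() stands for whatever HTTP retrieval your crawler does and index() for the Lucene call. Link extraction is deliberately naive: a regular expression over href attributes that only follows absolute links starting with the start URL.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StreamingCrawl {
    /** Hypothetical: returns the HTML for a URL, or null if it cannot be fetched. */
    public interface Fetcher { String fetch(String url); }

    /** Hypothetical hook: this is where the Lucene indexing call would go. */
    public interface PageIndexer { void index(String url, String html); }

    // Naive link extraction; a real crawler would use an HTML parser.
    private static final Pattern HREF =
            Pattern.compile("href=\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    /** Fetches, indexes and discards one page at a time; returns the number of pages indexed. */
    public static int crawl(String startUrl, int maxPages, Fetcher fetcher, PageIndexer indexer) {
        Deque<String> queue = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        queue.add(startUrl);
        seen.add(startUrl);
        int indexed = 0;
        while (!queue.isEmpty() && indexed < maxPages) {
            String url = queue.poll();
            String html = fetcher.fetch(url);   // the page lives only in this local variable
            if (html == null) {
                continue;
            }
            indexer.index(url, html);           // index it together with its original URL
            indexed++;
            Matcher m = HREF.matcher(html);
            while (m.find()) {
                String link = m.group(1);
                // Stay on the same site and never queue a URL twice.
                if (link.startsWith(startUrl) && seen.add(link)) {
                    queue.add(link);
                }
            }
            // html goes out of scope here, so nothing is ever written to disk
        }
        return indexed;
    }
}
```

Nothing touches the disk except the index itself, and the URL is available at the moment the page is indexed, so no separate URL bookkeeping is needed.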

If you want to take a look at a real Lucene + crawler implementation, you can check out the Cocoon project at http://xml.apache.org/cocoon/index.html :

Lucene integration:

> http://cvs.apache.org/viewcvs.cgi/xml-cocoon2/src/java/org/apache/cocoon/components/search/

Crawler implementation:

> http://cvs.apache.org/viewcvs.cgi/xml-cocoon2/src/java/org/apache/cocoon/components/crawler/

This implementation indexes XML, but the principle is the same...

Best regards, Karl Øie

On Monday, Nov 4, 2002, at 14:29 Europe/Oslo, Friaa Nafaa wrote:

> Thank you. I installed this crawler and ran it, but I would like to index the web site, not just list the links the crawler visited. Is there a way to search a web page with Lucene while using this crawler to visit the pages? Thanks.
>
> --- On Mon 11/04, Karl Marx < [EMAIL PROTECTED] > wrote:
>
>> From: Karl Marx [mailto: [EMAIL PROTECTED]]
>> To: [EMAIL PROTECTED]
>> Date: Mon, 4 Nov 2002 12:31:50 +0100
>> Subject: Re: Indexing distant web sites
>>
>> As stated in the official FAQ, Lucene doesn't implement a web crawler. You can, however, use a self-made crawler or customize a crawler framework like websphinx (http://www-2.cs.cmu.edu/~rcm/websphinx/) to retrieve HTML documents from a site and then feed them to Lucene.
>>
>> Best regards, Karl Øie
>>
>> On Monday, Nov 4, 2002, at 11:49 Europe/Oslo, Friaa Nafaa wrote:
>>
>>> Hello, is there any way to index web sites with Lucene, assuming we know only the URL of the site? In local use we pass Lucene the full directory tree of our site (containing all the documents) and begin the indexing operation, but what do I do when I would like to index a distant site on the web? For example, I installed Lucene on my computer and I would like to index the site http://www.excite.com ... Thanks

