We are using Lucene (http://www.lucene.com/) as a search engine on our
Intranet project with good results. It handles indexing terms and running
searches on them. You would need to write or obtain a separate spider to
parse the pages and load the terms into it, though, as it doesn't cover
that part of the search process.

We have extensive metadata tagged onto all documents on our Intranet, and
Lucene handles all of that nicely, allowing searches per field etc. In our
case everything is stored in a DB, so it's just a matter of indexing the
database contents into Lucene on a periodic basis to allow searching. There
is example code included with Lucene that shows how to set up an indexer
that recurses through a directory structure and indexes all the files of a
certain type etc. It wouldn't be too hard to add a spider if you knew what
you were doing either.
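
To give an idea of how little code that takes, indexing DB contents comes
down to roughly the following. This is only a sketch: the JDBC driver,
connection URL, and table/column names are made up, and the package names
are the org.apache.lucene ones, so adjust them to whatever your Lucene
release uses.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import java.sql.*;

// Periodically rebuilds the search index from DB contents.
public class DbIndexer {
    public static void main(String[] args) throws Exception {
        // Third argument 'true' creates the index from scratch.
        IndexWriter writer =
            new IndexWriter("/opt/search/index", new StandardAnalyzer(), true);

        Class.forName("hypothetical.jdbc.Driver");  // your real driver here
        Connection conn =
            DriverManager.getConnection("jdbc:hypothetical:intranet");
        Statement stmt = conn.createStatement();
        ResultSet rs =
            stmt.executeQuery("SELECT id, title, author, body FROM documents");

        while (rs.next()) {
            Document doc = new Document();
            // Keyword fields are stored untokenized, which is what makes
            // the per-field metadata searches work.
            doc.add(Field.Keyword("id", rs.getString("id")));
            doc.add(Field.Keyword("author", rs.getString("author")));
            // Text fields are tokenized for full-text search.
            doc.add(Field.Text("title", rs.getString("title")));
            doc.add(Field.Text("body", rs.getString("body")));
            writer.addDocument(doc);
        }
        rs.close(); stmt.close(); conn.close();

        writer.optimize();
        writer.close();
    }
}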

The other nice thing is that Lucene is open source and Java based, so
hooking it into any existing JSP-based system is very easy.
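
For example, a JSP page could delegate its searches to a small helper class
along these lines (again only a sketch; the index path and field names match
the indexer sketch above and are assumptions):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;
import java.util.ArrayList;
import java.util.List;

// Thin wrapper a JSP page can call,
// e.g. siteSearch.search(request.getParameter("q")).
public class SiteSearch {
    public List search(String userQuery) throws Exception {
        Searcher searcher = new IndexSearcher("/opt/search/index");
        // Parse the user's query against the default full-text field.
        Query query =
            QueryParser.parse(userQuery, "body", new StandardAnalyzer());
        Hits hits = searcher.search(query);
        List titles = new ArrayList();
        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);  // hits come back best match first
            titles.add(doc.get("title"));
        }
        searcher.close();
        return titles;
    }
}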

Julian Doherty
Information Systems Analyst
Education Review Office




From: Geert Van Damme <geert.vandamme@darling.be>
Sent by: A mailing list about Java Server Pages specification and reference <JSP-INTEREST@java.sun.com>
To: [EMAIL PROTECTED]
Subject: Re: A Poor Man's JSP compatible Search Engine Implementation
Date: 05/17/01 10:50 PM
Please respond to: geert.vandamme

Well, there's definitely a need for a poor man's search engine.
But I think you need to keep two things separate:
1) the search engine
2) the spider (to insert into the engine's DB). Well, you don't need to
call it a spider; it's more an interface to gather info. Spidering is only
one of the possibilities.

I think 1 can be pretty general. It has an obvious interface, and there
might be a few different implementations (e.g. one based on a DB, another
on a file); see the sketch below.
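
Something as small as this would do for the interface (my sketch, not an
existing API):

import java.util.List;

// One engine contract, several possible backends (DB-based, file-based, ...).
public interface SearchEngine {
    // Add or replace a document under a unique id.
    void index(String id, String title, String contents);

    // Return the ids of matching documents, best match first.
    List search(String query);
}

The spider (or DB exporter, or whatever gathers the info) only ever talks
to index(), so it can vary per project without touching the engine.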

But the spider will depend heavily on the project. In some projects, all
the info is in the DB; in that case, there's no need to crawl through the
pages at all.

> Search engines are now a web site / app fundamental. If you don't have
> one, or don't bother offering "search this site", you look sad. Likewise
> if you have one already and it doesn't work properly, is very limited,
> and/or tries to do too much
> ( "http://www.javascript.com/" or "http://www.docjs.com/" )
>
> I was just looking at Java Network Programming by Elliotte Rusty Harold
> (O'Reilly, 2nd Edition), where there is an example of using the Swing HTML
> Document / Parser package `javax.swing.text.html.*' to trawl through a web
> page and print the hyperlinks. That gave me a brilliant idea for
> implementing a poor man's search engine. You could take this example as
> the basis for a poor man's JSP search engine. But maybe I don't want to
> reinvent the wheel for the Nth time in this new century.
>
> Now, if you forget about indexing words for the moment, I can see two
> methods of doing this for JSP pages.
>
> 1) Look at the file system and grab "*.jsp" using FileReader or something.
> Effectively you are looking at raw JSP, which is a mixture of Java
> and HTML.
>
> Pros:
> You don't have to deal with security.
> It's going to be fast on the same server.
>
> Cons:
> You potentially get mixtures of Java and HTML.
> You're restricted to the web server.

And you might miss the most crucial information. I really think this is a
bad idea.
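
For reference, the Swing parser trick quoted above boils down to something
like this. It's a rough sketch of the javax.swing.text.html approach; the
class name is mine:

import java.io.*;
import java.net.URL;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Prints every hyperlink found in the page at the given URL.
public class LinkLister {
    public static void main(String[] args) throws IOException {
        URL url = new URL(args[0]);
        Reader in = new InputStreamReader(url.openStream());
        HTMLEditorKit.ParserCallback callback =
            new HTMLEditorKit.ParserCallback() {
                public void handleStartTag(HTML.Tag tag,
                                           MutableAttributeSet attrs,
                                           int pos) {
                    if (tag == HTML.Tag.A) {
                        Object href = attrs.getAttribute(HTML.Attribute.HREF);
                        if (href != null) System.out.println(href);
                    }
                }
            };
        // 'true' tells the parser to ignore the page's charset declaration.
        new ParserDelegator().parse(in, callback, true);
    }
}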


>
> 2) Grab the JSP pages using a java.net.URL object; effectively you
> are surfing your own web app.
>
> Pros:
>
> Most portable solution
>
> Cons:
>
> Big problem: all web apps have some form of security built in. Usually it
> is either custom form based or (intranet-wise) web app realm based.
> How do you get your search engine to authenticate itself through its
> own web app? This is important for e-commerce sites where you want
> to be able to list pages that are deemed protected resources.
>

Not only that, but there's also the problem that many pages need parameters
(either POSTed or passed via GET), and the combination of all JSP pages
with all their parameter values might make your search engine DB explode.
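
That said, the authentication problem isn't necessarily fatal: the spider
could log in once and then replay the session cookie on every request. A
rough sketch, assuming your login flow hands you a cookie value first (the
class and the cookie plumbing here are mine, not an existing API):

import java.io.*;
import java.net.URL;
import java.net.URLConnection;

// Fetches pages as an authenticated user by replaying a session cookie.
public class PageFetcher {
    private final String sessionCookie;  // e.g. "JSESSIONID=..." from login

    public PageFetcher(String sessionCookie) {
        this.sessionCookie = sessionCookie;
    }

    public String fetch(URL url) throws IOException {
        URLConnection conn = url.openConnection();
        // Present the session so protected pages render as for a logged-in user.
        conn.setRequestProperty("Cookie", sessionCookie);
        BufferedReader in =
            new BufferedReader(new InputStreamReader(conn.getInputStream()));
        StringBuffer page = new StringBuffer();
        String line;
        while ((line = in.readLine()) != null) {
            page.append(line).append('\n');
        }
        in.close();
        return page.toString();
    }
}

And to keep the parameter explosion in check, the spider would have to
normalize URLs (e.g. drop or whitelist query strings) before adding them
to its queue.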

I like the idea, but I really think it's not all that obvious. Especially
the spider stuff.

Geert Van Damme

===========================================================================
To unsubscribe: mailto [EMAIL PROTECTED] with body: "signoff
JSP-INTEREST".
For digest: mailto [EMAIL PROTECTED] with body: "set JSP-INTEREST
DIGEST".
Some relevant FAQs on JSP/Servlets can be found at:

 http://java.sun.com/products/jsp/faq.html
 http://www.esperanto.org.nz/jsp/jspfaq.html
 http://www.jguru.com/jguru/faq/faqpage.jsp?name=JSP
 http://www.jguru.com/jguru/faq/faqpage.jsp?name=Servlets
