PHP isn't totally bad for a search engine. Here's my story.
I was in a bit of a predicament when I first started work, because I had to develop a search engine for online video objects. My company is essentially a video re-purposing venture: we take reams of analog, tape-based videos, encode them into something like MPEG or ASF or whatever, create clip indexes (i.e. a 30-minute video is broken up into ten 3-minute clips, with each clip described and categorized) and provide a search engine and interface to watch clips or videos on the web via a broadband connection. (An added bonus is that you can create your own "video" via personal playlists -- you can take 10 different clips from 10 different videos and run them together into one playlist, all online. In a few months, you'll be able to create your own clips if you don't like our predefined ones.)

Anyway, the search engine thing was my deal. I'm the only programmer (*period*) on our team, and I basically had to write a search engine, web site backend, admin interface and all that jazz for our app alone. I was hired March 6, 2001 or so, and I had until, oh, April 15, 2001 to do it. Plus there were a few conditions -- like, it should be portable and inexpensive. PHP seemed like a good choice -- it was portable (Win32, *ix, whatever), it was cheap ($0) and it isn't too bad for rapid development.

So off I went. I did manage to finish the search engine and back end by April 15, but it was a mess. It wasn't exactly a stellar search engine, more of a proof of concept, which was the whole point of the project -- to show that we could provide high quality streaming video through a browser with a relatively good interface.

After the proof-of-concept project, we started to get serious, and I dropped most of the code base and started again from scratch. Pretty much the entire search engine is now in PHP, the sole exceptions being the keyword indexer (Perl, as PHP was a lot slower doing the indexing) and a few extensions to the PHP engine.
The search engine itself is fairly fast -- it can do a keyword search on a collection of nearly 8,000 video objects in an average of 0.02 to 0.20 seconds or so, depending on the complexity of the query. Its features include:

* "Boolean"-type searches. Okay, not really, as in you can't use AND, OR and NOT, but you can use +/- style prefixes like in mnoGoSearch and whatnot. Words are automatically OR'ed, +'d words are AND'ed and -'d words are NOT'ed.

* Decent search times. On a PIII 500 with 128 MB of RAM, it still averages less than 0.20 seconds for a collection of 8,000 video objects and over 100,000 keywords.

* Filtering. We're mostly an education-based site, so you can filter by things like subject (Physics, Chemistry, etc.) and grades (Kindergarten, grade 10, college, etc.)

* Spellchecking and somewhat fuzzy searches. Spellchecks work okay, but the fuzzy searching is kind of lame (it's just Porter stemming). I might shoehorn in something like Metaphone-type stuff eventually.

* Search ranking. Yes, keywords are given weights, everything is ranked and all that jazz. You know, inverse document frequencies, collection distribution, all that stuff. In the end, video objects returned in a search are given a ranking of 1 to 4 based on how well they match your query. It's not terribly advanced, and could use some tuning, but it's surprising how well it works.

* XML-based. The search engine itself runs as its own daemon, either on its own server or alongside the web site, and just waits for connections via a UNIX domain socket or a TCP socket. When it receives a query, it sends back an XML document containing the search results. This is especially nice -- you can use it with anything for any purpose, not just a web site, i.e. you can build a native app for Windows and still use the search engine, and just format the results via XSL or whatever.

There are a lot of other nifty features, like being able to do remote admin via telnet or whatever.
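
To make the +/- prefix and ranking bullets a bit more concrete, here is a minimal sketch of the general idea. It is written in Python purely for illustration and is not the actual engine (which is PHP plus a Perl indexer). The function names, the IDF formula, and the way scores are squeezed into a 1-to-4 ranking (with 4 assumed to be the best match) are all assumptions made for this example, not details of the real implementation.

```python
import math
from collections import defaultdict


def build_index(docs):
    """Build an inverted index: keyword -> set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return dict(index)


def parse_query(query):
    """Split a query into (+required, -excluded, optional) term lists."""
    required, excluded, optional = [], [], []
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.append(token[1:])
        elif token.startswith("-") and len(token) > 1:
            excluded.append(token[1:])
        else:
            optional.append(token)
    return required, excluded, optional


def search(index, num_docs, query):
    """Return [(doc_id, rank)] with rank in 1..4 (4 = best, an assumption)."""
    required, excluded, optional = parse_query(query)

    if required:
        # +terms are AND'ed: a document must contain every one of them.
        candidates = set.intersection(*(set(index.get(t, ())) for t in required))
    else:
        # Plain terms are OR'ed: any one of them is enough.
        candidates = set().union(*(index.get(t, ()) for t in optional))

    # -terms are NOT'ed: drop any document containing them.
    for term in excluded:
        candidates -= index.get(term, set())
    if not candidates:
        return []

    def idf(term):
        # Illustrative inverse document frequency: rarer terms weigh more.
        return math.log(1 + num_docs / len(index[term]))

    positive = [t for t in required + optional if t in index]
    best_possible = sum(idf(t) for t in positive)

    scored = []
    for doc_id in candidates:
        score = sum(idf(t) for t in positive if doc_id in index[t])
        # Map the fraction of the best possible score onto 1..4.
        rank = min(4, max(1, math.ceil(4 * score / best_possible)))
        scored.append((score, doc_id, rank))

    # Highest score first; ties broken by document id for stable output.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(doc_id, rank) for _, doc_id, rank in scored]
```

With a toy collection such as `{1: "physics motion newton laws", 3: "physics light optics", 4: "newton biography history"}`, a query like `physics +newton -history` keeps only documents that contain "newton", throws out anything mentioning "history", and lets "physics" raise the score of whatever is left. A real engine would also fold in per-keyword weights, stemming and the subject/grade filters, none of which are modeled here.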
But in the end, it's still just a decent search engine and definitely not Google or even htdig. It's very focused on our specific task, the searching of online educational videos, so using something like htdig would have required a lot of hacking to get it to where we wanted it.

So the moral, I guess, is: sure, you can make a half-decent search engine out of PHP. Ours gets the job done. But remember, I only had, like, a scant few months to write one, plus a web-based app to go around it, and I was alone on this one. PHP was great for RAD, and the damn thing even works to boot. My search engine could handle a web site easily enough, maybe even a group of sites, but it would totally suck ass as a WWW indexer/spider-type search engine.

So there ya go.

J

Greg Schnippel wrote:
>
>> * On 15-01-02 at 12:09
>> * Yogesh Mahadnac said....
>>
>>> Hi all! I want to develop a search engine in PHP for a
>>> portal that I'm working on at the moment, and I'd be glad if
>>> someone could please show me how to do it, or if anyone knows
>>> of a link where i can find a tutorial for that.
>>
>> I don't think PHP is really a very good language for a genuine www
>> search engine. (although it works very well on site-wide basis)
>> I'm sure more knowledgeable people than I can make some alternative
>> suggestions but I'm certain that PHP won't be the best tool
>> for the job.
>
> I would concur with what everyone else is saying. If you need a search
> engine and you have system-level access on your machine, your best
> bet is to set up either htdig or mnogosearch (open source search
> engine packages) because they already have done the hard work of
> figuring out fuzzy matching and search ranking.
>
> http://www.htdig.org/
> http://mnogosearch.org/
>
> Alternatively, if you are using a database you can use some tricky sql
> statements to search your records for the user's search query. Here's
> a good tutorial that should get you started on this route:
>
> http://www.devshed.com/Server_Side/PHP/Search_Engine/page1.html
>
>
> -schnippy

--
PHP General Mailing List (http://www.php.net/)