On Sun, 23 Aug 1998, Cort wrote:
>
> On 23-Aug-98 Gevaerts Frank wrote:
> > Does anyone know if there is a faster way to search a directory tree for a
> > certain word than "find /dir/ -name "*.txt"|xargs grep -l "theword" " ?
> > I want to do this to make a search engine for my local (LAN) website.
> > Since the site is over 50 megs (I mirror a lot of stuff for local use),
> > and the server is a 486, the search takes more time than I would like.
> >
> > Is it possible to achieve the same effect using some database, while
> > allowing a search for _ALL_ words, not just a few predefined keywords? If
> > so, how?
>
> Yes, you can use the "locate" database.
>
> It isn't a typical database in the sense of MySQL or Oracle. It is just a
> simple utility used for finding files quickly. When I tested it against find,
> locate found the files many times faster (about 10 times faster on my system,
> but you shouldn't expect the same improvement). But what it gains
> in speed, it loses in features. You will also need to update the database
> whenever you add or remove a file.
>
> An even faster solution would be to do the search in advance. Run "find /dir/
> -name "*.txt" > searchresult" and run "cat searchresult|xargs grep -l "theword"
> " when you need to search the files. There probably wouldn't be much
> improvement over using locate.
Doing this instead of a find at each search gives 2:33 instead of 2:43, so
it isn't really worth it.
I also tested catting all searchable files into one big file (40 megs) and
grepping that. This takes 1:45, but if I choose this way, I'll still have
to use some kind of database to map line numbers back to the real files,
so I don't know if I gain much.
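If I do go the big-file route, the mapping doesn't really need a full
database; an index file built while concatenating would probably do. A
rough sketch of what I mean (the /tmp names are made up):

: > /tmp/allfiles.txt
: > /tmp/allfiles.idx
find /dir/ -name "*.txt" | while read f; do
    # record how many lines come before this file in the big file
    echo "`wc -l < /tmp/allfiles.txt` $f" >> /tmp/allfiles.idx
    cat "$f" >> /tmp/allfiles.txt
done

grep -n "theword" /tmp/allfiles.txt
# map a hit that grep reported on, say, line 123 back to its file
awk -v n=123 '$1 < n { f = $2 } END { print f }' /tmp/allfiles.idx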
> Neither of these two solutions speed up the grep part. Using a typical database
> might help, but I have never used any before. Another solution is to implement
> something yourself, that will almost certainly improve performance, but it is
> time consuming and may be less flexible. Let me know if you wish to try this,
> I'm always on the lookout for tiny projects to play around with.
>
> I'm not too sure what you mean by allowing a search for all words, but I have
> included a simple CGI script which should demonstrate something to that effect.
> This is my first attempt at writing a CGI script and I have based it on the
> finger CGI script that comes with apache. Please forgive the lameness of this
> script. Incidentally, the database shouldn't be in /tmp/ but I'm not sure where
> it should be placed instead for use from a CGI script.
Basically, I do the same thing, but I have a few more options right now
(case sensitive / whole words ...). I also added a boolean ADD option,
which takes the output of the first grep as the input files for the next
one, and so on, so it effectively ANDs the words together. This doesn't
take as much time, provided you put the least frequently occurring word
first.
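Roughly, the ADD search boils down to something like this (the word list
and path here are only an example):

files=`find /dir/ -name "*.txt"`
for word in rareword lesscommonword commonword; do
    [ -z "$files" ] && break    # nothing matches all the words so far
    files=`echo "$files" | xargs grep -l "$word"`
done
echo "$files"

With the rarest word first, the later greps only have to look at a
handful of files.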
> -- START DEMO CGI SCRIPT --
> #!/bin/sh
>
> echo Content-type: text/html
> echo
>
> if [ $# = 0 ]; then
> cat << EOM
> <TITLE>Locate Gateway</TITLE>
> <H1>Locate Gateway</H1>
>
> <ISINDEX>
>
> This is a gateway to "locate". Type a search string in your browser's
> search dialog.<P>
> EOM
> else
> echo \<PRE\>
> locate -d /tmp/testlocatedb "*.txt" | xargs grep -l "$*"
> fi
> -- END DEMO CGI SCRIPT --
>
> Cort
> [EMAIL PROTECTED]
>
I think there are two possible ways to improve: writing a basic grep
replacement which doesn't take a lot of options, and thus should be
faster, or putting every word occurring in the searchable files in a
database, together with a URL for each file. Of course, common words like
"the" should not be taken into account. I'm afraid the database solution
will take lots of disk space, however (I didn't count, of course, but
there should be a few thousand different words in that tree), and will
take a lot of time to update if I add some files.
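Just to get an idea, such an index could be built with standard tools,
something like this (only a sketch; the /tmp names are made up and it
does nothing about stop words or disk space yet):

# one "word filename" pair per line, lowercased, duplicates per file removed
find /dir/ -name "*.txt" | while read f; do
    tr -cs 'A-Za-z' '\n' < "$f" | tr 'A-Z' 'a-z' | sort -u | grep -v '^$' \
        | sed "s|\$| $f|"
done | sort > /tmp/wordindex

# list the files that contain "theword"
awk '$1 == "theword" { print $2 }' /tmp/wordindex

It would still grow with one line per distinct word per file, so the disk
space worry stands, but looking up a word in it should be much quicker
than grepping 40 megs of text.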
Frank