[gentoo-user] Re: [OT sphinx] Any users of sphinx here

Harry Putnam Sat, 12 Jun 2010 16:05:56 -0700

Brandon Vargo <[email protected]> writes:

> do. When I go to find code that I have written, I do not remember
> variable names, lines of code, etc that I can match with a regular
> expression. Thus, that kind of search is pointless for me. I remember
> what the code does, the project for which I wrote the code, and
> approximately where the code is located within the project. I remember
> function calls for libraries that I probably used. If I cannot find what
> I am looking for, I use grep on the name of a function call I remember,
> or I have a ctags file containing all the information I need about
> function definitions.


Again, thanks for a thorough answer... just a note on the above
comment.

I often find myself searching for a technique... NOT variable names or
sub function names because who knows what I might call stuff in any
particular script.

For example... I once was shown how to compile an element of @ARGV as
a regular expression in Perl, in one step:

   my $what_re = qr/@{[shift]}/;

I liked that and have used it many times... but only recently could I
remember at a moment's notice how to write it.
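
For anyone who has not seen the idiom: `[shift]' builds an anonymous
array holding whatever shift removes (from @ARGV at file scope, from
@_ inside a sub), and `@{ ... }' dereferences it so the value
interpolates into qr//.  A minimal sketch, with a hypothetical helper
name (compile_first) and made-up sample lines so it runs without any
command line arguments:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: compile the first argument as a regex in one step.
# Inside a sub, shift takes from @_; at file scope it would take from @ARGV.
sub compile_first {
    my $what_re = qr/@{[shift]}/;
    return $what_re;
}

my $re = compile_first('ha+sh');

# Made-up sample lines, only to show the compiled regex in use.
for my $line ('my %haash;', 'no match here') {
    print "hit: $line\n" if $line =~ $re;
}
```

Note that the argument is treated as a pattern, not as a literal
string, so regex metacharacters in it keep their special meaning.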

I used `grep -r' or `egrep -r' as you've mentioned; now I use my own
Perl script (recently written [since posting original query]) that
uses regex and File::Find, where the user feeds the regex and the
approximate location to begin the search on the cmd line.

In my case that would be an NFS share, /projects/reader/perl, which
is kept in my ENV as $perlp.

So:
  script.pl 'qr/.*?@' $perlp

Will find a number of examples of using that particular technique.
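
A rough sketch of how such a script might be put together with
File::Find follows.  It is not the actual script; the sub name
(search_tree) and the file:line:text output format are assumptions
made for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

# Hypothetical sketch: walk $top and collect every line matching $pattern.
sub search_tree {
    my ($pattern, $top) = @_;
    my $re = qr/$pattern/;          # compile the user's regex once
    my @hits;
    find(
        {
            no_chdir => 1,          # keep full paths in $File::Find::name
            wanted   => sub {
                my $file = $File::Find::name;
                return unless -f $file;            # skip directories etc.
                open my $fh, '<', $file or return; # skip unreadable files
                while (my $line = <$fh>) {
                    next unless $line =~ $re;
                    chomp $line;
                    # $. is the current line number of $fh
                    push @hits, join(':', $file, $., $line);
                }
                close $fh;
            },
        },
        $top
    );
    return @hits;
}

# Usage: script.pl REGEX DIR
if (@ARGV == 2) {
    print "$_\n" for search_tree(@ARGV);
}
```

Every file is read in full on every search, which is why an approach
like this gets slow on large trees.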

What prompted my query here, was looking for a way to search several
thousand html pages that are a collection of Perl books on CD.

These are 2 of the O'Reilly Perl CDbooks.  (I spent $150 for the first
one, and I think the second was a little cheaper; it was years ago.)
The books on CD have built-in search tools, but those only work on
Windows and aren't up to much anyway.

I've since downloaded the data from the CDs onto an OpenSolaris ZFS
server and access them through NFS.

I was attempting to use `webglimpse'
(http://webglimpse.net/download.php) for the task, hence the interest
in indexing.  But I suspect that a particular technique I read about,
but have forgotten how to code, is best searched for using regular
expressions.  This would be long after I've forgotten which section
or even which book I read about it in.

The tool I've written can be made to strip HTML if necessary and can
be made to include (by regex) only certain kinds of filenames.  It
uses no index, so it is pretty slow... but it is still very useful
and has full Perl regex capability.
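
Those two options could look something like the sketch below.  The
sub names (name_wanted, strip_tags) are hypothetical, and the tag
stripping is a crude one-regex approximation that will trip over
comments, scripts and tags split across lines; it is not how the
actual tool does it:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true if the filename matches the user's name regex.
sub name_wanted {
    my ($name, $name_re) = @_;
    return $name =~ $name_re ? 1 : 0;
}

# Hypothetical helper: crude tag removal, good enough for simple markup
# where each tag opens and closes on the same line.
sub strip_tags {
    my ($line) = @_;
    $line =~ s/<[^>]*>//g;
    return $line;
}

# Made-up sample input.
for my $name ('idx_p.htm', 'cover.gif') {
    print "$name: ", (name_wanted($name, qr/\.html?$/) ? 'search' : 'skip'), "\n";
}
print strip_tags('<b>4.8.2. Dereferencing</b>'), "\n";
```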

It returns up to 4 lines for each hit: the line with the hit, 2 lines
above it and 1 below (where possible), along with the page number and
the absolute filename where the hit was found.
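
The context window can be sketched as below, assuming the lines of a
file are already in memory.  The sub name (with_context), the output
layout and the shortened sample lines are made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sketch: for each matching line, return it together with
# up to 2 lines before and 1 line after, each prefixed by its line number.
sub with_context {
    my ($re, @lines) = @_;
    my @blocks;
    for my $i (0 .. $#lines) {
        next unless $lines[$i] =~ $re;
        my $first = $i >= 2       ? $i - 2 : 0;         # clamp at start of file
        my $last  = $i < $#lines  ? $i + 1 : $#lines;   # clamp at end of file
        push @blocks,
            [ map { sprintf '%5d  %s', $_ + 1, $lines[$_] } $first .. $last ];
    }
    return @blocks;
}

# Made-up sample page.
my @page = (
    'dereferencing with',
    'modulus operator',
    'prototype symbol (hash)',
    '%= (assignment) operator',
);
for my $block (with_context(qr/hash/, @page)) {
    print "$_\n" for @$block;
    print " ---\n";
}
```

Hits near the start or end of a file simply get fewer context lines,
which is the "where possible" part.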

Here is an example search being timed:
-------        ---------       ---=---       ---------      -------- 
(I purposely picked something that would be found many times)

 time ./pgrep3  /var/www/localhost/htdocs/lcweb/cdbk+/AllPerl/ hash

 (So above we are searching a collection from the O'Reilly CDbooks for
 the term `hash'..)
 
 (Just one example of the thousands of lines returned)
  [...]

   /var/www/localhost/htdocs/lcweb/cdbk+/AllPerl/perlnut/index/idx_p.htm
  135         dereferencing with : [104]4.8.2. Dereferencing
  136         modulus operator : [105]4.5.3. Arithmetic Operators
  137         prototype symbol (hash) : [106]4.7.5. Prototypes
  138         %= (assignment) operator : [107]4.5.6. Assignment Operators
 ---

 [...]

 Total files searched: 522 
 Total lines searched: 431689
 real    1m48.344s
 user    1m25.234s
 sys     0m14.336s

-------        ---------       ---=---       ---------      -------- 
Almost 2 minutes to search 431689 lines, or roughly 4000 lines per
second.

So it is slow, maybe even very slow by comparison to tools using an
indexed search.

I don't really mind the sloth, but of course it would not scale much
beyond the amount of data I'm using it on now. I do like the precise
search capability and the plentiful context. All of the above is also
possible with grep, egrep... and friends, of course, but only with
quite a lot more cmdline manipulation and piping.

I'm currently working on using something like this basic search script
to return URLs linking to the page and lines found, and working the
whole thing into something that can be carried out with a web browser.

Something pretty similar to webglimpse, I guess, but without the
benefit of indexing.

Also, webglimpse relies on glimpse, which is not capable of full regex
search but does have a rich mixture of regex, regex-like and boolean
query capability.
