[gentoo-user] Re: [OT sphinx] Any users of sphinx here

2010-06-12 Thread Harry Putnam
Brandon Vargo brandon.va...@gmail.com writes:

 do. When I go to find code that I have written, I do not remember
 variable names, lines of code, etc that I can match with a regular
 expression. Thus, that kind of search is pointless for me. I remember
 what the code does, the project for which I wrote the code, and
 approximately where the code is located within the project. I remember
 function calls for libraries that I probably used. If I cannot find what
 I am looking for, I use grep on the name of a function call I remember,
 or I have a ctags file containing all the information I need about
 function definitions.

Again, thanks for a thorough answer... just a note on the above
comment.

I often find myself searching for a technique... NOT variable names or
sub/function names, because who knows what I might call things in any
particular script.

For example... I was once shown how to compile an element of @ARGV
as a regular expression in Perl, in one step:

   my $what_re = qr/@{[shift]}/;

I liked that and have used it many times... but only recently have I
been able to remember, at a moment's notice, how to write it.
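
In case the idiom looks opaque, here is a minimal self-contained sketch
of why it works (the sample lines and the default pattern are my own
invention for the demo): @{[ ... ]} interpolates an arbitrary expression
inside qr//, so shift pulls the pattern string off @ARGV and qr//
compiles it in the same statement:

```perl
use strict;
use warnings;

@ARGV = ('ha.h') unless @ARGV;   # demo-only default pattern

# The one-step compile: @{[ EXPR ]} interpolates EXPR inside qr//,
# and shift at file scope defaults to @ARGV.
my $what_re = qr/@{[shift]}/;

# Apply the compiled regex to some sample lines.
my @lines   = ('my %h = ();', 'a hash of things', 'nothing here');
my @matches = grep { $_ =~ $what_re } @lines;

print "$_\n" for @matches;
```

Run with no arguments it matches `a hash of things' against the default
pattern; any pattern you pass on the cmd line is compiled the same way.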

I used to use `grep -r' or `egrep -r' as you've mentioned; now I use
my own perl script (recently written [since posting the original query])
that uses regexes and File::Find, where the user feeds the regex and the
approximate location to begin the search on the cmd line.

In my case that would be an NFS share, /projects/reader/perl, which is
kept in my environment as $perlp.

So:
  script.pl 'qr/.*?@' $perlp

Will find a number of examples of using that particular technique.

What prompted my query here, was looking for a way to search several
thousand html pages that are a collection of Perl books on CD.

These are two of the O'Reilly Perl CD books.  (I spent $150 on the first
one, and I think the second was a little cheaper; it was years ago.) The
books on CD have built-in search tools, but those only work on a
Windows OS and aren't up to much anyway.

I've since copied the data from the CDs onto an OpenSolaris ZFS
server and access it through NFS.

I was attempting to use `webglimpse'
(http://webglimpse.net/download.php) for the task, hence the interest
in indexing.  But I suspect a search for a particular technique I read
about, but have forgotten how to code, would best be done with
regular expressions.  This would be long after I've forgotten
which section, or even which book, I read about it in.

The tool I've written can be made to strip HTML if necessary, and can
be made to include (by regex) only certain kinds of filenames, but it
uses no index and is consequently pretty slow... still, it is very
useful and fully capable of Perl regexes.

It returns up to 4 lines: the line with the hit, 2 lines above,
and 1 below (where possible), along with the page number and the
absolute filename where the hit was found.
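
In rough outline (the actual script isn't posted here, so this is only a
reconstructed sketch; the sub name search_tree is made up, and
page-number extraction is omitted), the core of such a tool can be
little more than File::Find plus a context window around each hit:

```perl
use strict;
use warnings;
use File::Find;

# Walk $dir recursively; for each line of each file matching $re,
# record the filename, 1-based line number, and a context window of
# up to 2 lines above and 1 line below the hit (where possible).
sub search_tree {
    my ($dir, $re) = @_;
    my @hits;
    find(sub {
        return unless -f $_;
        open my $fh, '<', $_ or return;
        my @lines = <$fh>;
        close $fh;
        for my $i (0 .. $#lines) {
            next unless $lines[$i] =~ /$re/;
            my $from = $i >= 2 ? $i - 2 : 0;
            my $to   = $i + 1 <= $#lines ? $i + 1 : $#lines;
            push @hits, {
                file    => $File::Find::name,
                line    => $i + 1,
                context => [ @lines[$from .. $to] ],
            };
        }
    }, $dir);
    return @hits;
}

# Example use, patterned on the invocation above (path hypothetical):
#   my @hits = search_tree($ENV{perlp}, qr/qr\/.*?\@/);
#   printf "%s:%d\n%s---\n", $_->{file}, $_->{line},
#          join('', @{ $_->{context} }) for @hits;
```

Reading each file into memory whole keeps the context logic trivial;
for the file sizes involved here that trade-off seems fine.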

Here is an example search being timed:
----   ---=---   -   
(I purposely picked something that would be found many times)

 time ./pgrep3  /var/www/localhost/htdocs/lcweb/cdbk+/AllPerl/ hash

 (So above we are searching a collection from the O'Reilly CD books for
 the term `hash'.)
 
 (Just one example of the thousands of lines returned)
  [...]

   /var/www/localhost/htdocs/lcweb/cdbk+/AllPerl/perlnut/index/idx_p.htm
  135 dereferencing with : [104]4.8.2. Dereferencing
  136 modulus operator : [105]4.5.3. Arithmetic Operators
  137 prototype symbol (hash) : [106]4.7.5. Prototypes
  138 %= (assignment) operator : [107]4.5.6. Assignment Operators
 ---

 [...]

 Total files searched: 522
 Total lines searched: 431689
 real    1m48.344s
 user    1m25.234s
 sys     0m14.336s

----   ---=---   -   
Almost 2 minutes to search 431689 lines

So it is slow, maybe even very slow compared to tools using an
indexed search.

I don't really mind the sloth, but of course it would not scale much
beyond my current scope of use. I do like the precise search
capability and the plentiful context. All of the above is also
possible with grep, egrep, and friends, of course, but only with
quite a lot more cmdline manipulation and piping.

I'm currently working on extending this basic search script to return
URLs linking to the pages and lines found, and on working the whole
thing into something that can be driven from a web browser.

Something pretty similar to webglimpse, I guess but without the
benefit of indexing.

Also, webglimpse relies on glimpse, which is not capable of full regex
search but does have a rich mixture of regex, regex-like, and boolean
query capabilities.




[gentoo-user] Re: [OT sphinx] Any users of sphinx here

2010-06-12 Thread Harry Putnam
Brandon Vargo brandon.va...@gmail.com writes:

 [1]: http://www.google.com/codesearch
 [2]: http://beagle-project.org/

Acckk, I forgot to thank you for the URLs you posted... thanks.




Re: [gentoo-user] Re: [OT sphinx] Any users of sphinx here

2010-06-07 Thread Brandon Vargo
On Sun, 2010-06-06 at 15:37 -0500, Harry Putnam wrote:
 Brandon Vargo brandon.va...@gmail.com writes:
 
  As an example of how it works, suppose I am making a news website and
  have a bunch of news posts, each of which has an author, category, and
 
 Thank you Brandon for such a nice thorough answer... Yeah, looks like
 I'm barking up the wrong tree.
 
 I know about htdig.. Not much, though.  As far as I remember it didn't
 have much in the way of a search interface... something like google.
 Whereas webglimpse has a rich set of search terms, including some
 regular expressions and regular-expression-like operators... all the
 same tools as glimpse (and agrep).  So many, in fact, that it can be a
 bit daunting to try to become proficient with them.
 
 Maybe you can enlighten me about htdig... it's been yrs since I tried
 htdig.

Sorry, it has been awhile since I have used it as well.

 Even webglimpse fails, though, when it comes to trying to search for
 snippets of code like Perl or C.  Nobody wants the sloth and CPU
 overhead of serious regular expression searching, and that may be the
 only (good) way to search for things like /, {, $, (, [, !, @ etc.,
 as one would need to find those types of code snippets. Also, I guess
 it would be pretty hard to build an index with that in mind.

Certainly it is a hard problem to index for arbitrary regular
expressions. Even Google's code search [1] is not terribly good at it.
However, I also do not think it is something most people will want to
do. When I go to find code that I have written, I do not remember
variable names, lines of code, etc that I can match with a regular
expression. Thus, that kind of search is pointless for me. I remember
what the code does, the project for which I wrote the code, and
approximately where the code is located within the project. I remember
function calls for libraries that I probably used. If I cannot find what
I am looking for, I use grep on the name of a function call I remember,
or I have a ctags file containing all the information I need about
function definitions.

I suggest, for code, you just organize whatever you have in a sane
directory structure. Or, even better, you can put your code in a central
place using a version control system (SVN, git, hg, CVS, etc), where it
is organized in a way that makes sense to you. After all, it sounds like
this is for your personal use, so use something that makes you happy.
Personally, I have a series of git repositories that I use to keep track
of my code and some of my documents.

 I keep thinking some good developer will come out with a tool aimed at
 websites like those that might be found on a home LAN (in scope)...
 where regular expression searching wouldn't be so far out.
 
 Or maybe there just is no herd of people who are competent in regular
 expression searching, and hence no audience for such a tool.

I do not think the problem is a lack of people with knowledge of regular
expressions, but rather the lack of a need for such a product. Many
people, at least those I know, do not think "Oh, I want to search for
xyz; I'll write a regular expression to search for what I want across
all my data." Instead, they have a directory structure of organized
documents that makes finding that particular document or series of
documents on xyz easy. When that fails, there are the find and locate
commands for terminal users, which support regex searching on filenames;
desktop search tools such as Beagle [2]; and of course grep.

Certainly it would be really nice to have a search tool that would
produce results for "show me all the code on this computer used for
validating HTTP POST requests in Python for a submitted HTML form,
preferably using Django". If you find one, let me know, as I would love
to try it. In the meantime, `grep -RE 'form|POST'
projects/python/django/project_xyz` works fairly well once I figure out
that what I want is probably in that directory. (grep -E, or egrep,
supports extended regular expressions; -R is recursive.) Or I just go
search through the documentation, if available.

Maybe someone here can suggest something better for code searching.
For everything else, use Beagle/something similar or a web-based search
engine you can install locally if you really want to be able to search
through your documents. Maybe there is something better for that too; I
do not know. I still use directories and git repositories in said
directories, where appropriate, as it is more efficient for me. Of
course your mileage may vary.

[1]: http://www.google.com/codesearch
[2]: http://beagle-project.org/

Regards,

Brandon Vargo




[gentoo-user] Re: [OT sphinx] Any users of sphinx here

2010-06-06 Thread Harry Putnam
Brandon Vargo brandon.va...@gmail.com writes:

 As an example of how it works, suppose I am making a news website and
 have a bunch of news posts, each of which has an author, category, and

Thank you Brandon for such a nice thorough answer... Yeah, looks like
I'm barking up the wrong tree.

I know about htdig.. Not much, though.  As far as I remember it didn't
have much in the way of a search interface... something like google.
Whereas webglimpse has a rich set of search terms, including some
regular expressions and regular-expression-like operators... all the
same tools as glimpse (and agrep).  So many, in fact, that it can be a
bit daunting to try to become proficient with them.

Maybe you can enlighten me about htdig... it's been yrs since I tried
htdig.

Even webglimpse fails, though, when it comes to trying to search for
snippets of code like Perl or C.  Nobody wants the sloth and CPU
overhead of serious regular expression searching, and that may be the
only (good) way to search for things like /, {, $, (, [, !, @ etc.,
as one would need to find those types of code snippets. Also, I guess
it would be pretty hard to build an index with that in mind.

I keep thinking some good developer will come out with a tool aimed at
websites like those that might be found on a home LAN (in scope)...
where regular expression searching wouldn't be so far out.

Or maybe there just is no herd of people who are competent in regular
expression searching, and hence no audience for such a tool.




[gentoo-user] Re: [OT sphinx] Any users of sphinx here

2010-06-05 Thread Hans de Graaff
On Fri, 04 Jun 2010 17:52:05 -0500, Harry Putnam wrote:

 Googling led to a tool called Sphinx that apparently is coupled with a
 database tool like MySQL.  It is advertised as the kind of search tool
 I'm after, and it has a perl front-end also available in portage
 (dev-perl/Sphinx-Search).
 
 They call it a `full text search engine', but never really say what
 that means.

It means that you can dump a lot of text documents into it (based on 
html pages, database records, actual documents, etc). Sphinx efficiently 
indexes all the text in them, and then allows you to retrieve it again, 
supporting things that are useful for searching in text, such as stemming.

It can use MySQL, but MySQL isn't needed to use it.

It should be able to help you with the task you want to solve, although 
I'm not familiar with the capabilities of the Sphinx-Search front-end/
binding.
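
As a toy illustration of the basic idea (nothing here is Sphinx's actual
implementation, and the sub names build_index/lookup are invented): a
full-text engine builds an inverted index up front, mapping each word to
the documents containing it, so a query becomes a hash lookup instead of
a scan over every document:

```perl
use strict;
use warnings;

# Build an inverted index: word => list of document ids containing it.
# Real engines add stemming, ranking, on-disk storage, etc.
sub build_index {
    my (%docs) = @_;
    my %index;
    while (my ($id, $text) = each %docs) {
        my %seen;
        for my $word (map { lc } $text =~ /\w+/g) {
            push @{ $index{$word} }, $id unless $seen{$word}++;
        }
    }
    return \%index;
}

# Query: a single hash lookup, no document scanning.
sub lookup {
    my ($index, $term) = @_;
    return sort @{ $index->{ lc $term } || [] };
}

my $idx = build_index(
    doc1 => 'Perl hashes and arrays',
    doc2 => 'Indexing text with Sphinx',
    doc3 => 'More about Perl regexes',
);
print join(',', lookup($idx, 'Perl')), "\n";   # doc1,doc3
```

The up-front indexing cost is what buys the fast queries; that is the
trade-off the unindexed grep-style tools discussed earlier avoid.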

Kind regards,

Hans