[gentoo-user] Re: [OT sphinx] Any users of sphinx here
Brandon Vargo brandon.va...@gmail.com writes:

> do. When I go to find code that I have written, I do not remember variable names, lines of code, etc that I can match with a regular expression. Thus, that kind of search is pointless for me. I remember what the code does, the project for which I wrote the code, and approximately where the code is located within the project. I remember function calls for libraries that I probably used. If I cannot find what I am looking for, I use grep on the name of a function call I remember, or I have a ctags file containing all the information I need about function definitions.

Again, thanks for a thorough answer... just a note on the above comment.

I often find myself searching for a technique... NOT variable names or sub function names, because who knows what I might call stuff in any particular script.

For example... I once was shown how to compile an element of @ARGV as a regular expression in perl, in one step:

  my $what_re = qr/@{[shift]}/;

I liked that and have used it many times... but only recently could I remember at a moment's notice how to write it.

I used `grep -r' or `egrep -r' as you've mentioned; now I use my own perl script (recently written [since posting original query]) that uses regex and File::Find, where the user feeds the regex and the approximate location to begin the search on the cmd line. In my case that would be an nfs share, /projects/reader/perl, which is kept in my ENV as $perlp.

So:

  script.pl 'qr/.*?@' $perlp

will find a number of examples of using that particular technique.

What prompted my query here was looking for a way to search several thousand html pages that are a collection of Perl books on CD. These are 2 of the O'Reilly Perl CDbooks. (I spent $150 for the first one, and I think the second was a little cheaper; it was yrs ago.)

The books on CD have built-in search tools, but those only work on a Windows OS and aren't up to much anyway. I've since downloaded the data from the CDs onto an opensolaris zfs server and access them through NFS.

I was attempting to use `webglimpse' (http://webglimpse.net/download.php) for the task, hence the interest in indexing.

But I suspect a search for a particular technique I read about, but have forgotten how to code, would be best done using regular expressions. This would be long after I've forgotten which section or even which book I read about it in.

The tool I've written can be made to strip html if necessary and can be made to include (by regex) only certain kinds of filenames, but it uses no index, so consequently it is pretty slow... but still very useful and fully perl regex capable.

It returns up to 4 lines per hit: the line with the hit plus 2 lines of context above and 1 below (where possible), along with the page number and the absolute filename where the hit was found.

Here is an example search being timed:

---- ---=--- -
(I purposely picked something that would be found many times)

  time ./pgrep3 /var/www/localhost/htdocs/lcweb/cdbk+/AllPerl/ hash

(So above we are searching a collection from the O'Reilly CDbooks for the term `hash'.)

(Just one example of the thousands of lines returned)

[...]
/var/www/localhost/htdocs/lcweb/cdbk+/AllPerl/perlnut/index/idx_p.htm
135 dereferencing with : [104]4.8.2. Dereferencing
136 modulus operator : [105]4.5.3. Arithmetic Operators
137 prototype symbol (hash) : [106]4.7.5. Prototypes
138 %= (assignment) operator : [107]4.5.6. Assignment Operators
---
[...]

Total files searched: 522
Total lines searched: 431689

real 1m48.344s
user 1m25.234s
sys  0m14.336s
---- ---=--- -

Almost 2 minutes to search 431689 lines.

So it is slow, maybe even very slow by comparison to tools using an indexed search. I don't really mind the sloth, but of course it would not scale much beyond the scope of use I'm putting it to. I do like the precision search capability and plenty of context.

All of the above is also possible with grep, egrep... and friends too, of course, but only with quite a lot more cmdline manipulation and piping.

I'm currently working on using something like this basic search script to return URLs linking to the page and lines found, and working the whole thing into something that can be carried out with a web browser. Something pretty similar to webglimpse, I guess, but without the benefit of indexing.

Also, webglimpse relies on glimpse, which is not capable of full regex search but does have a rich mixture of regex, regex-like and boolean query capability.
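For comparison, the per-hit context described above (two lines before the hit, one after, with file name and line number) can be approximated with grep alone. This is only a sketch, assuming GNU grep; the sample files are made up so the example is self-contained, and it is not a reconstruction of the pgrep3 script.

```shell
# Build a tiny throwaway tree (hypothetical files, for illustration only).
demo=$(mktemp -d)
mkdir -p "$demo/perlnut"
printf '%s\n' 'line one' 'line two' 'my %hash = ();' 'line four' 'line five' \
  > "$demo/perlnut/ch04.htm"
printf '%s\n' 'a hash in a text file' > "$demo/perlnut/notes.txt"

# -r recurse, -n show line numbers, -E extended regex,
# -B 2 / -A 1 print two lines of context above and one below each hit,
# --include restricts the search to file names matching the glob.
result=$(grep -rn -E -B 2 -A 1 --include='*.htm' 'hash' "$demo")
printf '%s\n' "$result"

rm -rf "$demo"
```

In the output, the matching line is marked with `:` after the file name and line number, while context lines are marked with `-`. Like the Perl script, this rescans every file on every query, so it has the same scaling limits.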
[gentoo-user] Re: [OT sphinx] Any users of sphinx here
Brandon Vargo brandon.va...@gmail.com writes:

> [1]: http://www.google.com/codesearch
> [2]: http://beagle-project.org/

Acckk, I forgot to thank you for the URLs you posted... thanks.
Re: [gentoo-user] Re: [OT sphinx] Any users of sphinx here
On Sun, 2010-06-06 at 15:37 -0500, Harry Putnam wrote:
> Brandon Vargo brandon.va...@gmail.com writes:
>
>> As an example of how it works, suppose I am making a news website and have a bunch of news posts, each of which has an author, category, and
>
> Thank you brandon for such a nice thorough answer...
>
> Yeah, looks like I'm barking up the wrong tree.
>
> I know about htdig.. Not much though. As far as I remember it didn't have much in the way of a search interface... something like google. Whereas webglimpse has a rich set of search terms, including some regular expressions and regular-expression-like operators... all the same tools as glimpse (and agrep). So many in fact it can be a bit daunting to try to become proficient with.
>
> Maybe you can enlighten me about htdig... it's been yrs since I tried htdig.

Sorry, it has been a while since I have used it as well.

> Even webglimpse fails though when it comes to trying to search for snippets of code like perl or C etc. Nobody wants the sloth and cpu overhead of serious regular expression searching, and that may be the only (good) way to search for things like /,{,$,(,[,!,@ etc etc like one would need to find types of code snippets. Also I guess it would be pretty hard to build an index with that in mind.

Certainly it is a hard problem to index for arbitrary regular expressions. Even Google's code search [1] is not terribly good at it. However, I also do not think it is something most people will want to do. When I go to find code that I have written, I do not remember variable names, lines of code, etc that I can match with a regular expression. Thus, that kind of search is pointless for me. I remember what the code does, the project for which I wrote the code, and approximately where the code is located within the project. I remember function calls for libraries that I probably used. If I cannot find what I am looking for, I use grep on the name of a function call I remember, or I have a ctags file containing all the information I need about function definitions.

I suggest, for code, you just organize whatever you have in a sane directory structure. Or, even better, you can put your code in a central place using a version control system (SVN, git, hg, CVS, etc), where it is organized in a way that makes sense to you. After all, it sounds like this is for your personal use, so use something that makes you happy. Personally, I have a series of git repositories that I use to keep track of my code and some of my documents.

> I keep thinking some good developer will come out with a tool aimed at websites like might be found on a home lan (in scope)... where regular expression searching wouldn't be so far out.
>
> Or maybe there just is no herd of people who are competent in regular expression searching, and hence no audience for such a tool

I do not think the problem is a lack of people with knowledge of regular expressions, but rather the lack of a need for such a product. Many people, at least those I know, do not think "Oh, I want to search for xyz; I'll write a regular expression to search for what I want across all my data." Instead, they have a directory structure of organized documents that makes finding that particular document or series of documents on xyz easy. When that fails, there are the find and locate commands for terminal users, which support regex searching in filenames, desktop search tools such as Beagle [2], and of course grep.

Certainly it would be really nice to have a search tool that would produce results for "show me all the code on this computer used for validating HTTP POST requests in Python for a submitted HTML form, preferably using Django." If you find one, let me know, as I would love to try it. In the meantime, `grep -RE 'form|POST' projects/python/django/project_xyz` works fairly well once I figure out that what I want is probably in that directory. (grep -E, or egrep, supports extended regular expressions; -R is recursive.) Or, I just go search through the documentation, if available. Maybe someone here can suggest something better for code searching.

For everything else, use Beagle/something similar or a web-based search engine you can install locally if you really want to be able to search through your documents. Maybe there is something better for that too; I do not know. I still use directories and git repositories in said directories, where appropriate, as it is more efficient for me. Of course your mileage may vary.

[1]: http://www.google.com/codesearch
[2]: http://beagle-project.org/

Regards,
Brandon Vargo
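The two fallbacks mentioned above, find for file names and grep -RE for file contents, might be combined roughly as follows. This is a sketch only; the project tree, file names and contents are hypothetical and are created just so the commands have something to search.

```shell
# Throwaway project tree; every file name and line of content is made up.
proj=$(mktemp -d)
mkdir -p "$proj/django/project_xyz"
printf '%s\n' 'def handle(request):' \
  '    if request.method == "POST":' \
  '        form = ContactForm(request.POST)' \
  > "$proj/django/project_xyz/views.py"
printf '%s\n' 'DEBUG = True' > "$proj/django/project_xyz/settings.py"

# find -regex matches the pattern against the whole path, not just the base name.
by_name=$(find "$proj" -type f -regex '.*/views[^/]*\.py')

# grep: -R recurses, -E enables extended regex, -l lists only the matching files.
by_content=$(grep -RlE 'form|POST' "$proj")

printf 'by name:    %s\nby content: %s\n' "$by_name" "$by_content"
rm -rf "$proj"
```

Both searches land on the same file here, which is the point of the approach: a sane directory layout narrows the search, and grep does the rest.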
[gentoo-user] Re: [OT sphinx] Any users of sphinx here
Brandon Vargo brandon.va...@gmail.com writes:

> As an example of how it works, suppose I am making a news website and have a bunch of news posts, each of which has an author, category, and

Thank you brandon for such a nice thorough answer...

Yeah, looks like I'm barking up the wrong tree.

I know about htdig.. Not much though. As far as I remember it didn't have much in the way of a search interface... something like google. Whereas webglimpse has a rich set of search terms, including some regular expressions and regular-expression-like operators... all the same tools as glimpse (and agrep). So many in fact it can be a bit daunting to try to become proficient with.

Maybe you can enlighten me about htdig... it's been yrs since I tried htdig.

Even webglimpse fails though when it comes to trying to search for snippets of code like perl or C etc. Nobody wants the sloth and cpu overhead of serious regular expression searching, and that may be the only (good) way to search for things like /,{,$,(,[,!,@ etc etc like one would need to find types of code snippets. Also I guess it would be pretty hard to build an index with that in mind.

I keep thinking some good developer will come out with a tool aimed at websites like might be found on a home lan (in scope)... where regular expression searching wouldn't be so far out.

Or maybe there just is no herd of people who are competent in regular expression searching, and hence no audience for such a tool
[gentoo-user] Re: [OT sphinx] Any users of sphinx here
On Fri, 04 Jun 2010 17:52:05 -0500, Harry Putnam wrote:

> Googling led to a tool called Sphinx that apparently is coupled with a database tool like mysql. It is advertised as the kind of search tool I'm after and has a perl front-end also available in portage (dev-perl/Sphinx-Search).
>
> They call it a `full text search engine', but never really say what that means.

It means that you can dump a lot of text documents into it (based on html pages, database records, actual documents, etc). Sphinx efficiently indexes all the text in it, and then allows you to retrieve it again, supporting things that are useful for searching in text, such as stemming.

It can use MySQL, but this isn't needed to use it. It should be able to help you with the task you want to solve, although I'm not familiar with the capabilities of the Sphinx-Search front-end/binding.

Kind regards,

Hans
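To make "indexes all the text" and "stemming" a little more concrete, here is a toy inverted index built with standard shell tools. It is only an illustration of the general idea and says nothing about how Sphinx is implemented. The two sample documents are hypothetical, and stripping a trailing "es" or "s" is a crude stand-in for a real stemmer.

```shell
# Two tiny "documents" (hypothetical content) to index.
docs=$(mktemp -d)
printf '%s\n' 'Hashes and arrays' > "$docs/a.txt"
printf '%s\n' 'Sorting a hash by value' > "$docs/b.txt"

# Build the index once: one "term<TAB>file" line per distinct pair.
# Words are split on non-letters, lowercased, and crudely "stemmed".
index=$(
  for f in "$docs"/*.txt; do
    tr -cs '[:alpha:]' '\n' < "$f" |
      tr '[:upper:]' '[:lower:]' |
      sed -E 's/(es|s)$//' |
      awk -v file="${f##*/}" 'NF { print $0 "\t" file }'
  done | sort -u
)

# Query time: look the term up in the index instead of rescanning the documents.
hits=$(printf '%s\n' "$index" | awk -F '\t' '$1 == "hash" { print $2 }')
printf '%s\n' "$hits"

rm -rf "$docs"
```

Because "Hashes" is reduced to "hash" at indexing time, a query for "hash" finds both files. That is the kind of match stemming buys, and the lookup never touches the original documents, which is where an indexed search gets its speed.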