Re: [gentoo-user] [OT sphinx] Any users of sphinx here

2010-06-05 Thread Brandon Vargo
On Fri, 2010-06-04 at 17:52 -0500, Harry Putnam wrote:
> I've been looking for a perl based search tool that uses some kind of
> indexing to index and render searchable my home library of software
> manuals and the like.  Quite a few html pages involved, maybe 15-16,000.
>
> Webglimpse is something I've worked with before and know a bit about
> but thought I might like to see what else is available.
>
> Googling led to a tool called Sphinx that apparently is coupled with
> a database tool like mysql.  It is advertised as the kind of search
> tool I'm after and has a perl front-end also available in portage
> (dev-perl/Sphinx-Search).
>
> The trouble is I haven't been able to figure out the first thing about
> using it.  The overview and introduction, like a lot of such
> documents, fail to give a really basic idea of what the tool does.
>
> They call it a `full text search engine', but never really say what
> that means.
>
> There are 12-15 FEATURES listed, and none appear to describe sensibly
> what they really do.
>
> The FAQ is a string of questions about using sql.. really.
>
> So far I haven't found a good statement of what the darn thing really
> does or how to aim it at data.
>
> The manual is probably great if you already know a lot about using
> sphinx but very thin for my case.
>
> I've not even been able to get a rough idea of how to aim the darn
> thing at the desired (local LAN) web site.
>
> Or, to show how thin it really is or how dumb I really am, I've been
> unable to tell if it can even do what I want to do.
>
> I've posted on a sphinx list on gmane... but it appears to be only
> moderately active and I haven't gotten any replies...
>
> I hoped someone here may be familiar with sphinx and willing to coach
> me a bit or at least let me know if it can even do what I want to do.
>
> I'd also be interested to hear about any other perl based search tools
> that involve indexing and some kind of versatile search query
> capability, like regular expressions.

If you can put your HTML pages into a database, Sphinx might be able to
help you with your issue. Basically what Sphinx does is let you search
databases. You specify one or more SQL sources of data and associated
queries, and Sphinx provides an API (or an emulated SQL server) that
makes searching easy. Sphinx is for full text database searching; it
does not index files or websites directly. (Note that this is not
actually true; it can search XML files directly, but you still specify
XML attributes instead of database columns, etc, so it is treating the
XML as a data store and not as a generic document.) I recall reading
that Craigslist uses Sphinx to search their database of listings.

As an example of how it works, suppose I am making a news website and
have a bunch of news posts, each of which has an author, category, and
text. With Sphinx, I can set up a source -- let's call it news_catalog --
that will index this data. news_catalog will be associated with an SQL
query that will allow Sphinx to access the data it needs to index. Let's
use SELECT id, author, category, text FROM catalog as our query. Note
that catalog is a table or view in your database, though this query can
also use complex joins, etc, as long as the database supports it. Via
the Sphinx API, I can say I want to search for Europe | America and it
will return a list of news articles containing the terms Europe,
America, or both, as a pipe is the or operator. It actually returns a
list of ids which correspond to the id I specified in my query; a unique
integer key must always be the first column in the query. My application is
responsible for fetching the actual data from the original database
using that id and presenting the data in a useful way to the user.
Extended query syntax allows for other boolean operators, searching
specific fields, strict order, exact match, field start/end, etc. The
documentation has lots of examples; look at
http://www.sphinxsearch.com/docs/current.html for the current reference
manual.
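
To make that division of labour concrete, here is a minimal Python
sketch of the same flow. It does not use Sphinx at all: the toy
inverted index, the function names (build_index, search_any,
fetch_rows) and the sample rows are all made up for illustration, and
sqlite3 stands in for the real database. The only things taken from
the example above are the catalog table, the SELECT query, and the
"Europe | America" search. The point is just that the search step
hands back ids and the application fetches the rows itself.

```python
import sqlite3

# Stand-in for the application's own database (hypothetical sample data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE catalog ("
    "id INTEGER PRIMARY KEY, author TEXT, category TEXT, text TEXT)"
)
conn.executemany(
    "INSERT INTO catalog VALUES (?, ?, ?, ?)",
    [
        (1, "alice", "world", "Elections across Europe this week"),
        (2, "bob", "sports", "Local team wins the cup"),
        (3, "carol", "world", "Trade talks between Europe and America"),
    ],
)


def build_index(conn):
    """Toy inverted index (word -> set of ids). This is NOT Sphinx."""
    index = {}
    # Same shape as the source query in the text: unique id first.
    query = "SELECT id, author, category, text FROM catalog"
    for doc_id, author, category, text in conn.execute(query):
        for word in f"{author} {category} {text}".lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index


def search_any(index, query):
    """Return sorted ids of rows matching any '|'-separated term."""
    ids = set()
    for term in query.split("|"):
        ids |= index.get(term.strip().lower(), set())
    return sorted(ids)


def fetch_rows(conn, ids):
    """The application's job: turn ids back into displayable rows."""
    if not ids:
        return []
    placeholders = ",".join("?" * len(ids))
    sql = (
        "SELECT id, author, text FROM catalog "
        f"WHERE id IN ({placeholders}) ORDER BY id"
    )
    return conn.execute(sql, ids).fetchall()


index = build_index(conn)
matching_ids = search_any(index, "Europe | America")
rows = fetch_rows(conn, matching_ids)
for row in rows:
    print(row)
```

A real setup would replace build_index and search_any with the Sphinx
indexer and a call through one of its client APIs; fetch_rows is the
part that stays in your application either way.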

If you have a bunch of HTML files on a disk or website that you want to
index and search, I do not think Sphinx is the software you want. Yes,
you could load your data into a database and then use Sphinx, but that
does not seem like the best solution. Sphinx provides the API for use in
your application; it does not provide a user interface. As an
alternative, I recommend you look at something like ht://Dig
(htdig.org), which will search HTML pages directly in addition to PDF,
Word, Excel, Powerpoint, etc with the help of external converters. It
also includes a user interface. After glancing at webglimpse, with which
I am not familiar, it looks like it does something similar to ht://Dig.
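
For completeness, the "load your data into a database" route could
look roughly like the Python sketch below. Everything in it is an
assumption made for illustration: the pages table, the function names
(html_to_text, load_html_tree), and the use of sqlite3 as the store.
Sphinx itself would need a source type it supports, such as MySQL, so
treat this only as an outline of the extra step involved.

```python
import sqlite3
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def html_to_text(markup):
    """Reduce an HTML document to a single line of plain text."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.parts)


def load_html_tree(conn, root):
    """Store the text of every *.html file under root; return the count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "id INTEGER PRIMARY KEY, path TEXT UNIQUE, text TEXT)"
    )
    count = 0
    for path in sorted(Path(root).rglob("*.html")):
        markup = path.read_text(encoding="utf-8", errors="replace")
        conn.execute(
            "INSERT OR REPLACE INTO pages (path, text) VALUES (?, ?)",
            (str(path), html_to_text(markup)),
        )
        count += 1
    conn.commit()
    return count
```

With the pages in a table like this, a query such as SELECT id, text
FROM pages could serve as the source query, but you would still have to
write the search front end yourself, which is why a tool that reads
HTML directly is likely the easier path.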

Regards,

Brandon Vargo




[gentoo-user] [OT sphinx] Any users of sphinx here

2010-06-04 Thread Harry Putnam
I've been looking for a perl based search tool that uses some kind of
indexing to index and render searchable my home library of software
manuals and the like.  Quite a few html pages involved, maybe 15-16,000.

Webglimpse is something I've worked with before and know a bit about
but thought I might like to see what else is available.

Googling led to a tool called Sphinx that apparently is coupled with
a database tool like mysql.  It is advertised as the kind of search
tool I'm after and has a perl front-end also available in portage 
(dev-perl/Sphinx-Search).

The trouble is I haven't been able to figure out the first thing about
using it.  The overview and introduction, like a lot of such
documents, fail to give a really basic idea of what the tool does.

They call it a `full text search engine', but never really say what
that means.

There are 12-15 FEATURES listed, and none appear to describe sensibly
what they really do.

The FAQ is a string of questions about using sql.. really.

So far I haven't found a good statement of what the darn thing really
does or how to aim it at data.

The manual is probably great if you already know a lot about using
sphinx but very thin for my case.

I've not even been able to get a rough idea of how to aim the darn
thing at the desired (local LAN) web site.

Or, to show how thin it really is or how dumb I really am, I've been
unable to tell if it can even do what I want to do.

I've posted on a sphinx list on gmane... but it appears to be only
moderately active and I haven't gotten any replies...

I hoped someone here may be familiar with sphinx and willing to coach
me a bit or at least let me know if it can even do what I want to do.

I'd also be interested to hear about any other perl based search tools
that involve indexing and some kind of versatile search query
capability, like regular expressions.