Thanks for the nice summary. Couple of other things that comes to my mind are: 1. Not columns are indexed in a database; searching on a non-indexed field would be very expensive. 2. Some dbs do support free form indexes but it is an offline index. Good thing is that regular sql can be used to take advantage of full text indexes; however, there are performance issues as data gets moved between full-text and the search core. I remember seeing an article that talks in detail about this; will post to this group once I find it :-) 3. with lucene type approach, you can potentially support any type of querying on any field; typically database search restricts search on specific fields and often custom coded to generate appropriate sql (depends largely on the implementation though).
Chandra --- Kan Deng <[EMAIL PROTECTED]> wrote: > For many applications, search engine and database > are > competitive solutions. One has to consider in depth > to > choose either search engine or database, and in some > cases, the border is blurred. > > There are several issues to consider when selecting > database or Lucene or both as the solution to a > specific application. > > 1. Data change frequency. > > Suppose there is a "restaurant" table, which > contains a "current customers" column. Since the > content of this column is updated very frequently, > Lucene is not a good choice. > > However, for "menu" column, Lucene definitely a > strong candidate for indexing and searching. > > > 2. Query. > > Suppose the "restaurant" table which contains a > "menu" column, there are two ways to store the > content > into the "menu" column, one is put all the dishes > into > it as a big string delimited by comma. Another way > is > to separate each dish into different rows, so that > one > single restaurant may have multiple rows. > > If the menu is stored as a big string, then > conventional RDBMS is not capable of doing the > search. > Because "select * from restaurant where menu like > '%BBQ%'" will induce a full table scan. > > However, if the dishes are stored in separated > rows, RDBMS is among the candidates. More analysis > needs to be considered before making the decision. > > > 3. Performance. > > Intuitively, B+tree based search sounds faster > than > inverted-index search. However, since inverted-index > organizes the data in a sorted way, its search > performance doesn't fall behind B+tree search for > too > far. Imagine there is an inverted-index, "0, 1, 2, > 3, > 4, 5, 6, 7, 8", the "hopping" algorithm can be > implemented in a way identical to the binary search, > if starting with "4", then either "2" or "6", so on. > Lucene uses fixed interval hopping, mainly because > it > is convenient to implement with satisfying > performance, even though theoritically not so good > as > binary search or B+tree search. > > The above comparison assume B+tree and > inverted-index are loaded fully into RAM. In case > the > index is so big that it is mandatory to swap between > RAM and disk, inverted-index tends to excel B+tree > index. The reason is that B+tree index occupies more > space, which induces more disk IO, and consequently > reduces the overall performance. > > > 4. Infrastructure. > > RDBMS products consume a lot of computational > resource, if the user doesn't want to break into the > RDBMS's source code and get rid of the unwanted > functionality like SQL language. As a result, > usually > RDBMS is running on a standalone machine, and use > JDBC/ODBC to interact with application which is > running on another machine, via network. Keep in > mind > that average it costs 10-1000 milliseconds to > transmit > data through network, depending on the size of your > data. If one wants his server running fast and > scaling > well like Google.com, even 10 millisecond is a big > issue that he has to think hard how to save. > > Therefore, for better performance, maybe it is > more > appropriate to put the application logic on the same > machine as the data repository. However, if one > wants > to use conventional RDBMS as a blackbox, the one > tier > solution may not work well because RDBMS consumes so > much computational resource that there is little > remained for application logic to use. > > As a best practice, an embedded database is more > favorable than the standalone RDBMS blackbox, > especially for non-critical data repository, and > usually with better performance. > > The comparison between embedded database and > Lucene > is subtle. Sometimes it is reasonable to merge > Lucene > with embedded database. > > > 5. Integrate Lucene with embedded database. > > One reason is to integrate Lucene with embedded > database is to support blind query like "select * > from > restaurant where menu like '%BBQ%'". > > It is possible to integrate Lucene with embedded > database, but a difficulty is how to update the > inverted-indexes frequently. This is an interesting > problem, however, to my best knowledge, there is no > good solution for the time being. > > > Kan > > > --- "Peter A. Daly" <[EMAIL PROTECTED]> wrote: > > > In many cases that essentially require traditional > > RDBMS but also require > > lucene like functionality, I would use the > database > > as the primary data > > store. I would then either: > > 1. Update the lucene index using data from the > > database based on a > > scheduled process. > > 2. As records are added, add them to both Lucene > > and the database. > > > > It's a little extra work (and space,) but you get > > the best of both worlds. > > > > -Pete > > > > On 1/17/06, John Powers <[EMAIL PROTECTED]> > > wrote: > > > > > > Would you say as a best practice that you can > use > > both? When would > > > you and when wouldn't you? I trust databases > more > > then free files, so I > > > am putting my more sensitive and volatile data > in > > the database. If you > > > built a commenting system.. like a blog or an > > flickr type app, would > > > just a lucene solution be best? The problem > > with both of course is > > > syncing.. > > > > > > > > __________________________________________________ > Do You Yahoo!? > Tired of spam? Yahoo! Mail has the best spam > protection around > http://mail.yahoo.com > > --------------------------------------------------------------------- > To unsubscribe, e-mail: > [EMAIL PROTECTED] > For additional commands, e-mail: > [EMAIL PROTECTED] > > __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]