Nuno wrote:
> Hi Charley, very interesting indeed your little MV
> query engine, how have you implemented it? In other
> words what is the underlying data store?
>
> 1) Is it an RDBMS?
> 2) Is it a file or multiple files in the FS? If so
> what is the format of those files, XML?

Our data store is simple text files (so that's option "2", but not in XML). Structured data is "standard" tab-delimited or comma-delimited text, similar to what you'd get from an "export" from a spreadsheet or database. The goal was simplicity for manual access, for access by custom utilities, and for simple interchange among commercial systems.

But this is an academic development effort, and we must admit that we're not really optimized for queries and data reduction, which is what SQL and XSLT are built for. Rather, we're a storage and processing format for "unstructured" data (free-form prose and multi-media), and are now making some additions for "structured" data since that's typically a part of any content base (and it really helps for publishing "tables").

> In the case of 1) do you use the LIKE statement to
> apply pattern matching rules such as us.% (<=>
> us.*)? If not, how do you make it happen minimizing
> joins over multiple tables (that is the problem of
> Tony's approach IMO)?

Our approach is a little bit different in that we're trying to do content selection based on "slicing", similar to what you do in Perl or Ruby. We enable slicing of items by 'offset', 'name', or 'expression', and either 'name' or 'expression' can yield more than one result. Slice operations can be nested, so the result is list processing similar to what you might see in Perl/Ruby, or even something like LISP.

As for your specific question about "a.b.c", and whether we use something like the SQL "LIKE" operator: I guess not really. Since we are implementing our own "data language" (academics can do whatever they want ;-), the language permits hierarchical "indexing" that preserves "nested level" or placement, like 'a.*.b.*'. (Yes, it might be a minor distinction to state we are "indexing/slicing" as opposed to "querying", but there are a few areas where we've found that distinction useful.)

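To make the slicing idea a bit more concrete, here is a rough sketch in Python. It is not our actual data language: the names `ITEMS`, `matches`, and `slice_items` are made up for illustration, and the rule that '*' matches exactly one level of a dotted name is an assumption chosen to show what "preserving nested level" could mean.

```python
# Hypothetical records, as if already parsed from a tab-delimited text file.
ITEMS = [
    {"name": "us.ma.springfield", "title": "Springfield, MA"},
    {"name": "us.il.springfield", "title": "Springfield, IL"},
    {"name": "us.ma.boston", "title": "Boston, MA"},
    {"name": "ca.on.toronto", "title": "Toronto, ON"},
]


def matches(pattern, name):
    """Compare a dotted pattern and a dotted name level by level.

    Assumption: '*' stands for exactly one level, so 'us.*' does NOT
    match 'us.ma.springfield' (unlike SQL's LIKE 'us.%').
    """
    p_parts = pattern.split(".")
    n_parts = name.split(".")
    if len(p_parts) != len(n_parts):
        return False
    return all(p == "*" or p == n for p, n in zip(p_parts, n_parts))


def slice_items(items, key):
    """Slice a list of items by offset, name, or expression.

    - int      -> offset; yields a one-element list
    - str      -> name or level-preserving pattern; may yield many
    - callable -> expression (predicate); may yield many
    The result is always a list, so slices can be nested.
    """
    if isinstance(key, int):
        return [items[key]]
    if isinstance(key, str):
        return [it for it in items if matches(key, it["name"])]
    if callable(key):
        return [it for it in items if key(it)]
    raise TypeError("key must be an int, a str, or a callable")


# Slice by name pattern, then by expression, then by offset (nested).
in_us = slice_items(ITEMS, "us.*.*")
springfields = slice_items(in_us, lambda it: it["name"].endswith(".springfield"))
first = slice_items(springfields, 0)
print([it["title"] for it in first])
```

Because every slice returns a list, the three kinds of slice compose freely, which is the list-processing flavor mentioned above.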
> In case of the second, do you maintain indexes for
> fast string matching of megabytes (at least) of
> info? If so, how do you maintain them? If not
> then do you use BTrees or something similar?
>
> >coverage.location = 'us.ma.springfield'
> >coverage.country = 'us'
> >coverage.state = 'ma'
> >coverage.city = 'springfield'

Our "content of record" (the master set) is always the simple text files. After they are parsed, we perform all processing on the in-memory representation, which is a logical web of content cross-references. All processing for item selection is based on that in-memory cache. (So, no, we don't really track offsets or the like into strings; instead we split the string into a hierarchy.)

But, again, our hierarchy processing is a bit immature because we haven't decided at what level we should impose "type checking" of a finite structure on content. (Our issue is that we can have a single finite structure of content that's "well-known", like Usenet's 'comp.lang.c++.std', but if you want to merge two *different* well-known hierarchies from two *different* content bases, how do you permit mapping one so that it overlaps a subset of relevant nodes in the other?) Until we get a better handle on that, we're going slowly.

> Yes, I also thought about implementing some
> syntactic sugar over SQL queries.
> ...<snip, examples>...
> But for now, we are not planning to support this
> syntactic sugar mainly because users will define
> queries schemas using GUI and with a UI steamed from
> faceted classification that basically does more or
> less the same. Our goal is to configure the system
> without the need for coding including query schemas
> (search interfaces).

I agree that it's a good idea to stay away from SQL whenever possible. Users neither like it nor understand it, and it's a shame that you really need a lot of training to write good SQL queries (including understanding intimate details of the storage schema if you care about query performance).
Those were the reasons that we did not start with SQL as our basis for content selection, but went to indexing/slicing (with names and expressions) instead. So, it seems to me that permitting content identification/selection in a visual way (with a tool or GUI), or in a simplified syntax that's not as wordy or confusing as SQL, would be good.

--charley
[EMAIL PROTECTED]

--
http://cms-list.org/
more signal, less noise.