RE: [cms-list] managing metadata;

Charley Bay Sun, 09 Feb 2003 14:46:58 -0800

Nuno wrote:
> Hi Charley, very interesting indeed your little MV
> query engine, how have you implemented it? In other
> words what is the underlying data store?
> 
> 1) Is it an RDBMS?
> 2) Is it a file or multiple files in the FS? If so
> what is the format of those files, XML?


Our data store is simple text files (so that's option
"2", but not in XML).  Structured data is "standard"
tab-delimited or comma-delimited, similar to what
you'd get from a standard "export" from a spreadsheet
or database.  The goal was simplicity for manual
access or access by custom utilities, or simple 
interchange among commercial systems.
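
As an illustration only (not the actual engine), a tab-delimited
file of this kind can be read with a few lines of Python.  The
column names and sample rows below are invented for the sketch,
and `load_records` is a hypothetical helper name.

```python
import csv
import io

# Hypothetical sample data standing in for one tab-delimited text file.
SAMPLE = (
    "id\ttitle\tlocation\n"
    "1\tIntro\tus.ma.springfield\n"
    "2\tNews\tus.ca.fresno\n"
)

def load_records(text, delimiter="\t"):
    """Parse delimited text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))

records = load_records(SAMPLE)
print(records[0]["location"])
```

Passing `delimiter=","` would handle the comma-delimited variant
in the same way.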

But, this is an academic development effort and we
must admit that we're not really optimized for
queries and data reduction, which is what SQL and
XSLT are designed for.  Rather, we're a storage and
processing format for "unstructured" data (free-form
prose and multimedia), and are now making some
additions for "structured" data since that's typically
a part of any content base (and it really helps for
publishing "tables").

> In the case of 1) do you use the LIKE statement to
> apply pattern matching rules such as us.% (<=>
> us.*)? If not, how do you make it happen minimizing
> joins over multiple tables (that is the problem of
> Tony's approach IMO)?

Our approach is a little bit different in that we're
trying to do content selection based on "slicing",
similar to what you do in Perl or Ruby.  We enable
slicing of items by 'offset' or 'name' or 
'expression', and either 'name' or 'expression' can 
yield more than one result.  Slice operations can be
nested, so the result is list processing similar to
what you might see in Perl/Ruby, or even something
like LISP.
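
A rough Python sketch of the idea, under stated assumptions:
items are modeled as (name, value) pairs, an 'expression' is
taken to be a regular expression over names, and `slice_items`
is a hypothetical function name.  The real data language is not
shown in this message, so this only mirrors the behavior
described above.

```python
import re

def slice_items(items, selector):
    """Select from a list of (name, value) pairs.

    selector may be an int (offset), a str (exact name), or a
    compiled regex (expression matched against names).  The result
    is always a list, because a name or an expression may match
    more than one item.
    """
    if isinstance(selector, int):
        in_range = -len(items) <= selector < len(items)
        return [items[selector]] if in_range else []
    if isinstance(selector, str):
        return [item for item in items if item[0] == selector]
    return [item for item in items if selector.search(item[0])]

# Hypothetical content items; note that the name "note" repeats.
items = [
    ("intro", "a"),
    ("note", "b"),
    ("chapter1", "c"),
    ("note", "d"),
    ("chapter2", "e"),
]

by_offset = slice_items(items, 0)
by_name = slice_items(items, "note")
by_expr = slice_items(items, re.compile(r"^chapter"))
# Nested slice: last item among those whose name starts with "chapter".
nested = slice_items(by_expr, -1)
```

Because every slice returns a list, the output of one slice can
be fed straight into the next, which is what gives the
list-processing flavor.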

In specific answer to your question of "a.b.c", and
whether we use something like the SQL "LIKE" operator,
I guess not really.  Since we are implementing our
own "data language" (academics can do whatever they
want ;-), the language permits hierarchical "indexing"
that preserves "nested level" or placement like
'a.*.b.*'.  (Yes, it might be a minor distinction to
state we are "indexing/slicing" as opposed to
"querying", but there are a few areas where we've
found that distinction useful.)
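
To make the level-preserving idea concrete, here is a minimal
Python sketch.  It assumes that '*' matches exactly one level of
a dotted name; the message does not define the wildcard
precisely, so that is an assumption, and `matches` is a
hypothetical helper.

```python
def matches(pattern, path):
    """Return True if dotted `path` matches dotted `pattern` level by level.

    '*' matches exactly one level, so the nesting depth of the
    pattern and the path must agree (an assumption of this sketch).
    """
    pattern_parts = pattern.split(".")
    path_parts = path.split(".")
    if len(pattern_parts) != len(path_parts):
        return False
    return all(p == "*" or p == s for p, s in zip(pattern_parts, path_parts))

same_depth = matches("a.*.b.*", "a.x.b.y")
wrong_depth = matches("a.*.b.*", "a.b.y")
one_level = matches("us.*", "us.ma")
two_levels = matches("us.*", "us.ma.springfield")
```

By contrast, an SQL pattern such as LIKE 'us.%' matches any
string starting with "us.", regardless of how many levels
follow, so it does not preserve placement in the hierarchy.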

> In case of the second, do you maintain indexes for
> fast string matching of megabytes (at least) of
> info? If so, how do you maintain them? If not
> then do you use BTrees or something similar?
> 
> >coverage.location = 'us.ma.springfield'
> >coverage.country = 'us'
> >coverage.state = 'ma'
> >coverage.city = 'springfield'

Our "content of record" (the master set) is always
the simple text files.  After they are parsed, we
perform all processing on the in-memory representation
which is a logical web of content cross-references.
All processing for item selection is based on the
in-memory cache.  (So, no, we don't really track
offsets or the like into strings; instead we split
each string into a hierarchy.)
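
A small Python sketch of splitting dotted strings into an
in-memory hierarchy, using nested dicts.  This is an assumption
about one possible representation, not a description of the
actual cache; `insert_path` and `children` are hypothetical
names.

```python
def insert_path(tree, dotted):
    """Insert a dotted name such as 'us.ma.springfield' into a nested dict."""
    node = tree
    for part in dotted.split("."):
        node = node.setdefault(part, {})
    return tree

def children(tree, dotted):
    """Return the sorted child names under a dotted prefix, or [] if absent."""
    node = tree
    for part in dotted.split("."):
        node = node.get(part)
        if node is None:
            return []
    return sorted(node)

tree = {}
for location in ["us.ma.springfield", "us.ma.boston", "us.il.springfield"]:
    insert_path(tree, location)

cities_in_ma = children(tree, "us.ma")
states_in_us = children(tree, "us")
```

Selection then walks the tree one level at a time instead of
scanning or pattern-matching the original strings.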

But, again, our hierarchy processing is a bit 
immature because we haven't decided at what level
we should impose "type checking" of a finite structure
on content.  (The issue is that we can have a single
finite structure of content that's "well-known", like
Usenet's 'comp.lang.c++.std', but if you want to merge
two *different* well-known hierarchies from two
*different* content bases, how do you permit mapping
of one to overlap a subset of relevant nodes in the
other?)  Until we get a better handle on that, we're
going slowly.

> Yes, I also thought about implementing some
> syntactic sugar over SQL queries.
> ...<snip, examples>...
> But for now, we are not planning to support this
> syntactic sugar mainly because users will define
> query schemas using a GUI, with a UI stemming from
> faceted classification that basically does more or
> less the same. Our goal is to configure the system
> without the need for coding including query schemas
> (search interfaces).

I agree that it's a good idea to stay away from SQL
whenever possible.  Users neither like it nor
understand it, and it's a shame that you really need a
lot of training to write good SQL queries (including
understanding intimate details of the storage schema
if you care about query performance).  Those were the
reasons that we did not start with SQL as our basis
for content selection, but went to indexing/slicing
(with names and expressions) instead.

So, it seems to me that if you permit content
identification/selection in a visual way (with a tool
or GUI) or with a simplified syntax that's not as
wordy or confusing as SQL, that would be good.

--charley
[EMAIL PROTECTED]


--
http://cms-list.org/
more signal, less noise.
