Hi Umit,
Thanks for commenting.
> I do think that scientific/hierarchical file formats like HDF5 and
> RDBMS system have their specific use cases and I don't think it makes
> sense to replace one with the other.
>
Right. But data modeling is notoriously difficult to get right when complex
and changing schemas are involved, and here we have three paradigms
involved (relational, object, hierarchical), so I am trying to spell out
what is different. In my particular case the database holds both immutable
recorded data and analyses that are continuously produced by expensive
functions; these results need to be cached and tightly integrated with the
experimental recordings.
> I do also think that you shouldn't try to apply RDBMS principles to
> HDF5 like formats and also vice versa. Working with NoSQL DBs
> (key/value store) or GraphDB is different than working with a RDBMS.
> The same applies to HDF5 and RDBMS.
>
Got it. My email is about pinpointing those differences.
> PyTables introduces some RDBMS-like concepts (tables) but in the end
> it is based on HDF5 and to get the best performance you have to have
> some knowledge about the underlying HDF5 file structure and its
> concepts.
> The way you should store data in HDF5 really depends on how you will
> access it (if you want to get the best performance). Sometimes this
> means that you have to store the data redundantly in two different
> ways if you have two orthogonal ways to access it.
>
This is a valuable point that I would like to include in the document.
Such redundancy is unacceptable in a relational context because the copies
can get out of sync very quickly, especially with multiple concurrent users
(which HDF5 does not support). In PyTables, where columns are normally
written by a single user in a single operation, it is less of a problem. So
keeping different 'views' of the data in different parts of the tree is a
usage pattern that causes few problems in HDF5, whereas it is strongly
discouraged in SQL.
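
To make the redundant-layout pattern concrete, here is a minimal sketch. It assumes a recent PyTables (3.x, snake_case API) and NumPy; the file name, the Part description and the sample rows are made up for illustration, borrowing the car-parts example from the original message. The same data is written twice in one session: once as a flat table and once split per model under a group.

```python
import os
import tempfile

import numpy as np
import tables


class Part(tables.IsDescription):
    # Hypothetical schema; pos= fixes the column order.
    part_id = tables.Int32Col(pos=0)
    model = tables.StringCol(16, pos=1)


# Made-up (part_id, model) pairs.
rows = [(1, b"sedan"), (2, b"sedan"), (2, b"cabrio"), (3, b"cabrio")]

path = os.path.join(tempfile.mkdtemp(), "parts.h5")
with tables.open_file(path, mode="w") as h5:
    # Layout A: one flat table, convenient for scanning every part.
    flat = h5.create_table("/", "parts", Part)
    flat.append(rows)
    flat.flush()

    # Layout B: the same data again, one array of part IDs per model.
    group = h5.create_group("/", "model")
    for name in (b"sedan", b"cabrio"):
        ids = np.array([pid for pid, m in rows if m == name], dtype="int32")
        h5.create_array(group, name.decode(), ids)

    # Read the sedan parts back through both layouts.
    data = flat.read()
    from_flat = sorted(data["part_id"][data["model"] == b"sedan"].tolist())
    from_tree = sorted(h5.root.model.sedan.read().tolist())

print(from_flat, from_tree)
```

Nothing in the file ties the two layouts together, so keeping them consistent is entirely up to the code that writes them.
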
On the other hand, HDF5 has no analog of stored queries in the form of
views, which makes it harder to keep the logical data layout independent of
the physical one.
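
For comparison, this is roughly what a view gives you on the SQL side. The sketch uses Python's built-in sqlite3 module; the table and view names are hypothetical. The view stores the query rather than the rows, so it can never go stale.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE parts (part_id INTEGER, model TEXT);
    INSERT INTO parts VALUES (1, 'sedan'), (2, 'sedan'), (2, 'cabrio');
    -- The view is a stored query over parts, not a copy of its rows.
    CREATE VIEW model_sedan AS
        SELECT part_id FROM parts WHERE model = 'sedan';
""")

query = "SELECT part_id FROM model_sedan ORDER BY part_id"
before = [row[0] for row in con.execute(query)]

# A later insert into the base table shows up in the view automatically.
con.execute("INSERT INTO parts VALUES (3, 'sedan')")
after = [row[0] for row in con.execute(query)]

print(before, after)
con.close()
```
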
Does all of the above make sense?
> However nobody said you can't combine these different storage systems.
> For example you can use a RDBMS system to store meta-information and
> make use of relationships, constraints, foreign keys, etc but store
> the "raw data" (that is not suited for an RDBMS system) in HDF5 or
> PyTables respectively and just relate them by using unique identifier.
>
Yes, this is what many people are doing.
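
A minimal sketch of that combination, assuming sqlite3 for the metadata and a recent PyTables (3.x) for the raw data. The recording table, its columns and the identifiers are invented for illustration; the only link between the two stores is the shared rec_id string.

```python
import os
import sqlite3
import tempfile

import numpy as np
import tables

path = os.path.join(tempfile.mkdtemp(), "recordings.h5")

# Metadata lives in SQL, where keys and constraints are available.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE recording (rec_id TEXT PRIMARY KEY, subject TEXT, rate_hz REAL)"
)

# Raw samples live in HDF5, one array per recording, named by rec_id.
with tables.open_file(path, mode="w") as h5:
    recordings = [("rec0001", "mouse_a"), ("rec0002", "mouse_b")]
    for i, (rid, subject) in enumerate(recordings):
        con.execute("INSERT INTO recording VALUES (?, ?, ?)", (rid, subject, 1000.0))
        h5.create_array("/", rid, np.full(5, float(i)))

# Query the metadata relationally, then fetch the raw data by identifier.
(rec_id,) = con.execute(
    "SELECT rec_id FROM recording WHERE subject = ?", ("mouse_b",)
).fetchone()
with tables.open_file(path, mode="r") as h5:
    samples = h5.get_node("/", rec_id).read()

print(rec_id, samples)
con.close()
```

Neither store enforces the link, so a dangling rec_id on either side has to be caught by application code.
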
> Some databases like PostgreSQL even support retrieving data from non
> SQL sources (flat files, XML, etc)
> (http://en.wikipedia.org/wiki/PostgreSQL#Foreign_Data_Wrappers) which
> might be of interest
>
Yes, we had a discussion about a similar mechanism with SQLite's 'Virtual
tables' a few days ago.
> P.S.: AFAIK Postgresql supports schemas which might be comparable to
> groups in HDF5.
Would you care to elaborate a bit more on that?
Cheers,
Álvaro.
>
> On Wed, Apr 25, 2012 at 11:41 PM, Alvaro Tejero Cantero <alv...@minin.es>
> wrote:
> > Hello list,
> >
> > The relational model has a strong foundation and I have spent a few hours
> > thinking about what in PyTables is structurally different from it. Here
> > are my thoughts. I would be delighted if you could add/comment/correct on
> > these ideas. This could eventually help people with a relational/SQL
> > background who want to know how to best use the idioms of PyTables for
> > their data modeling.
> >
> > ---
> >
> > I make a distinction between relational and SQL (see CJ Date’s "SQL and
> > relational theory" for more on that).
> >
> > From a purely structural point of view, the following differences are
> > apparent:
> >
> > relations vs. sequences. Relations are sets (i.e. not ordered) of tuples
> > (again, not ordered).
> >
> > rows: In PyTables, every container has an implicit row number, meaning
> > there is always a candidate key and order matters. Although strictly an
> > implementation-level concern, row numbers in PyTables are not stored but
> > computed, thanks to the on-disk adjacency of the records. This is
> > important for large datasets, where adding storage of row numbers means
> > roughly a doubling of disk space.
> >
> > columns: In PyTables columns are ordered. That is not the case in a
> > purely relational system but it is the case in SQL.
> >
> > Flat tablespace vs. hierarchical tablespace. SQL tables live in a global
> > namespace. PyTables objects can be put inside Groups. Each approach can
> > be mapped onto the other by name mangling. Groups in PyTables are like
> > tables of tables -- for each node in a group there is a full table (or
> > another group...). This introduces a possible ambiguity in data modeling:
> >
> > Consider a table of car parts: one column is Part ID and the other is
> > Model ID, indicating which car models a particular part is used in. In
> > PyTables you can construct the same table /or/ create a /model group and
> > create one table per model consisting of a single column of Part IDs,
> > e.g. /model/sedan, /model/cabrio, etc. The same is possible in a
> > relational setting (dividing the tables according to one attribute, and
> > naming them according to the attribute value, e.g. model_sedan,
> > model_cabrio...). The defining difference is that in SQL the interface to
> > manipulate that list is the same (it is a table), whereas in PyTables one
> > listing is a Table object and the other is a list of Nodes, and the API
> > for both is a bit different.
> >
> > Attributes of tables and integrity. Any Node (Groups and Tables included)
> > can receive a limited amount of metadata in PyTables, by using the
> > attached AttributeSet. In SQL, metadata is limited to some keywords, to
> > be used upon table creation, that establish constraints on the columns
> > that have a functional significance. A prime example of this is
> > identifying foreign keys. SQL allows this information to be used
> > structurally at the time of joins, whereas in PyTables one is free to
> > implement this or any other navigational scheme in a customized way using
> > the attributes of the table.
> >
> > When designing such a scheme it has to be remembered that PyTables tables
> > always have an implicit column containing the row numbers, and this is
> > likely to be used as a key.
> >
> >
> > NOTE: I intentionally excluded here implementation issues whenever they
> > are not related to structural ones, e.g. SQL tables do not play well with
> > NumPy containers and are thus ill-suited for big data with Python. Another
> > example would be all the features related to transactions/concurrency and
> > authorization, which are orthogonal to the data model.
> >
> >
> > -á.
> >
> >
> ------------------------------------------------------------------------------
> > Live Security Virtual Conference
> > Exclusive live event will cover all the ways today's security and
> > threat landscape has changed and how IT managers can respond. Discussions
> > will include endpoint security, mobile security and the latest in malware
> > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
> > _______________________________________________
> > Pytables-users mailing list
> > Pytables-users@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/pytables-users
> >