Good points. Just some additional comments:

I do think that scientific/hierarchical file formats like HDF5 and RDBMSs each have their specific use cases, and I don't think it makes sense to replace one with the other. I also think that you shouldn't try to apply RDBMS principles to HDF5-like formats, or vice versa. Working with NoSQL DBs (key/value stores) or graph DBs is different from working with an RDBMS. The same applies to HDF5 and RDBMSs.

PyTables introduces some RDBMS-like concepts (tables), but in the end it is based on HDF5, and to get the best performance you need some knowledge of the underlying HDF5 file structure and its concepts. The way you should store data in HDF5 really depends on how you will access it (if you want the best performance). Sometimes this means that you have to store the data redundantly in two different ways if you have two orthogonal ways of accessing it.

However, nobody said you can't combine these different storage systems. For example, you can use an RDBMS to store meta-information and make use of relationships, constraints, foreign keys, etc., but store the "raw data" (which is not suited to an RDBMS) in HDF5 or PyTables, and relate the two by a unique identifier. Some databases, such as PostgreSQL, even support retrieving data from non-SQL sources (flat files, XML, etc.) through foreign data wrappers (http://en.wikipedia.org/wiki/PostgreSQL#Foreign_Data_Wrappers), which might be of interest.

cheers
Ümit

P.S.: AFAIK PostgreSQL supports schemas, which might be comparable to groups in HDF5.

On Wed, Apr 25, 2012 at 11:41 PM, Alvaro Tejero Cantero <alv...@minin.es> wrote:
> Hello list,
>
> The relational model has a strong foundation and I have spent a few hours
> thinking about what in PyTables is structurally different from it. Here are
> my thoughts. I would be delighted if you could add/comment/correct on these
> ideas. This could eventually help people with a relational/SQL background
> who want to know how to best use the idioms of PyTables for their data
> modeling
>
> ---
>
> I make a distinction between relational and SQL (see CJ Date’s "SQL and
> relational theory" for more on that).
>
> From a purely structural point of view, the following differences are
> apparent:
>
> relations vs. sequences. Relations are sets (i.e. not ordered) of tuples
> (again, not ordered).
>
> rows: In PyTables, every container has an implicit row number, meaning there
> is always a candidate key and order matters. Although strictly an
> implementation-level concern, row numbers in PyTables are not stored but
> computed, thanks to the on-disk adjacency of the records. This is important
> for large datasets, where adding storage of row numbers means roughly a
> doubling of disk space.
> columns: In PyTables columns are ordered. That is not the case in a purely
> relational system but it is the case in SQL.
>
> Flat tablespace vs. hierarchical tablespace. SQL tables live in a global
> namespace. PyTables objects can be put inside Groups. Each approach can be
> mapped onto the other by name mangling. Groups in PyTables are like tables
> of tables -- for each node in a group there is a full table (or another
> group...). This introduces a possible ambiguity in data modeling:
>
> Consider a table of car parts, one column is Part ID and the other is Model
> ID, indicating which car models a particular part is built into. In PyTables
> you can construct the same table /or/ create a /model group and create one
> table per model consisting of a single column of Part IDs, e.g. /model/sedan,
> /model/cabrio... etc. The same is possible in a relational setting (dividing
> the tables according to one attribute, and naming them according to the
> attribute value, e.g. model_sedan, model_cabrio...). The defining difference
> is that in the relational setting the interface to manipulate that list is
> the same (it is a table), whereas in PyTables one listing is a Table object
> and the other is a list of Nodes, and the API for both is a bit different.
>
> Attributes of tables and integrity. Any Node (Groups and Tables included)
> can receive a limited amount of metadata in PyTables, by using the attached
> attributeset. In SQL, metadata is limited to some keywords, to be used upon
> table creation, that establish constraints on the columns that have a
> functional significance.
> A prime example of this is identifying foreign
> keys. SQL allows this information to be used structurally at the time of
> joins, whereas in PyTables one is free to implement this or any other
> navigational scheme in a customized way using the attributes of the table.
>
> When designing such a scheme it has to be remembered that PyTables tables
> always have an implicit column containing the row numbers, and this is
> likely to be used as a key.
>
>
> NOTE: I intentionally excluded here implementation issues whenever they are
> not related to structural ones, e.g. SQL tables do not play well with NumPy
> containers and are thus ill-suited for big data with Python. Another example
> would be all the features related to transactions/concurrency and
> authorization, which are orthogonal to the data model.
>
>
> -á.
>
> ------------------------------------------------------------------------------
> Live Security Virtual Conference
> Exclusive live event will cover all the ways today's security and
> threat landscape has changed and how IT managers can respond. Discussions
> will include endpoint security, mobile security and the latest in malware
> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
> _______________________________________________
> Pytables-users mailing list
> Pytables-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/pytables-users
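
For anyone who wants to try the car-parts example from the quoted message, here is a minimal sketch of the two layouts in a relational setting, using Python's built-in sqlite3 module. The sample data, the table name `parts`, and the helper function names are made up for illustration; only the model_sedan / model_cabrio naming comes from the message. It is meant as a sketch under those assumptions, not as a recommended schema.

```python
import sqlite3

# Hypothetical sample data: (part_id, model_id) pairs.
PARTS = [(101, "sedan"), (102, "sedan"), (102, "cabrio"), (103, "cabrio")]


def build_single_table(conn):
    """Layout A: one table with a Part ID and a Model ID column."""
    conn.execute("CREATE TABLE parts (part_id INTEGER, model_id TEXT)")
    conn.executemany("INSERT INTO parts VALUES (?, ?)", PARTS)


def build_table_per_model(conn):
    """Layout B: one single-column table per model (name mangling)."""
    for model in sorted({m for _, m in PARTS}):
        # Table names cannot be bound as SQL parameters. The names here come
        # from our own fixed sample data, so string formatting is acceptable.
        conn.execute(f'CREATE TABLE "model_{model}" (part_id INTEGER)')
        conn.executemany(
            f'INSERT INTO "model_{model}" VALUES (?)',
            [(p,) for p, m in PARTS if m == model],
        )


def parts_for_model_a(conn, model):
    """Layout A: the model is a value, so it is filtered with WHERE."""
    rows = conn.execute(
        "SELECT part_id FROM parts WHERE model_id = ? ORDER BY part_id",
        (model,),
    )
    return [r[0] for r in rows]


def parts_for_model_b(conn, model):
    """Layout B: the model is part of the table name."""
    rows = conn.execute(f'SELECT part_id FROM "model_{model}" ORDER BY part_id')
    return [r[0] for r in rows]


def list_models_b(conn):
    """Layout B: the list of models is recovered from SQLite's catalog,
    which is itself queried like any other table."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND substr(name, 1, 6) = 'model_' "
        "ORDER BY name"
    )
    return [r[0][len("model_"):] for r in rows]


conn = sqlite3.connect(":memory:")
build_single_table(conn)
build_table_per_model(conn)

print(parts_for_model_a(conn, "sedan"))
print(parts_for_model_b(conn, "sedan"))
print(list_models_b(conn))
```

In both layouts everything, including the list of per-model tables, is reached through SELECT. In PyTables the second layout would instead mean walking the children of a /model group, which is the API difference the quoted message points out.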