I'm continuing this thread on the dev list. -á.
On Fri, Apr 13, 2012 at 21:17, Anthony Scopatz <scop...@gmail.com> wrote: > > On Fri, Apr 13, 2012 at 12:30 PM, Alvaro Tejero Cantero <alv...@minin.es> > wrote: >> >> Hi Anthony, >> >> >> >> >> How does hierarchical help here? do you create a 'singer_name'/song >> >> table? or a 'genre name'/song ?. Most of the time the physical layout >> >> in the form of a hierarchy is just an annoyance. >> > >> > I have to say that I disagree. The hierarchical features make it so >> > that >> > the data maps well to both Python objects and file systems. I feel that >> > both of these are more natural to work with than having to construct a >> > query of joins, groupbys, etc which reconstructs my data. So while this >> > is just my opinion, I feel that hierarchies are much more natural to >> > work >> > with. >> > >> >> I try to see the difference; I think I could be helped by an example. >> SQL gives you views to hide the data layout. You will just have 'a >> table' that you can read by rows. Nothing else. What happens in the >> case outlined before if I want to group songs by genre? if I encode >> the relation in the hierarchy, I will have to write the code that >> generates the view, but since there are many ways to create the >> hierarchy, the code will be specific to one data layout... I'd love to >> be wrong here. > > > Yes, this is exactly correct. You do have to write the code that generates > the view based on how your file is organized. However, encoding this > information explicitly in the hierarchy can make look up faster than if you > have to search and join several large tables. > > Because PyTables makes writing this easy in most cases, I don't mind. > >> >> >> > Also, my sense is that there would be a fair bit of overhead in this >> >> > interface >> >> > layer, which might not get you the speed boost you desire. I could >> >> > be >> >> > wrong >> >> > about this though. >> >> >> >> I think you're right in the wrapping of the results via the Python >> >> interface to SQLite. I suspect you're not about the queries executed >> >> in the virtual table, because that is left for you to implement and >> >> thus you could turn the query terms (that are handed over to you) into >> >> in-kernel expressions if you so wish (http://www.sqlite.org/vtab.html) >> > >> > >> > This was informed by my experience with SQLAlchemy which in some >> > situations added an excessively long computation times. With the >> > PyTables >> > infrastructure, we would at least have the option of writing the >> > performance >> > critical parts in C or Cython... >> >> You are talking here about writing specific /queries/ in Cython? just >> for clarification. > > > I am talking about writing the interface code between SQLite and PyTables > in C or (more likely) Cython. The queries themselves still are written in > SQL. > >> >> >> >> > If I saw a proof-of-concept implementation, I may grok better the >> >> > purpose. >> >> > Do you have any code to share? >> >> >> >> No, but I have an example ER diagram which is only part of what I >> >> need. You are welcome to have a look at it[2] >> >> > Sorry the text in this image is too small for me to read. >> >> I uploaded a larger version on the same location. > > > Thanks! > >> >> >> > Writing data-specific relational layers for your applications on top of >> > HDF5 with PyTables is not hard (IMHO). Add in the features of NumPy >> > to perform in-memory manipulations and you have pretty much everything >> > that you need. I think this is why we don't have formal implementation >> > of the SQ Language in PyTables. >> >> Yes, fair enough. There is just no canonical way of doing it. >> >> What do you think of storing in the .attrs something like "This column >> of this table matches (in the sense of foreign key) that column in >> that table" ? >> >> Or would you store these relations in a global repository of sorts - a >> specific table? > > > Whenever I have wanted to mimic relation behavior in HDF5 I have used > the second method where you store a table(s) of relations somewhere in your > file and make sure that your 'data' tables have appropriate 'primary key' > columns. > > Attributes are an interesting idea but I would advise against it since space > is > limited [9] and access is slow [10]. > >> >> >> > I guess what i don't understand still is why - if you wanted to do this >> > - >> > would >> > you use the SQLite vtabs? This seems to have the worst of the SQL world >> > in terms of vendor lock in, compatibility with other SQL >> > implementations, >> > etc. >> >> That is true. I did a quick search and couldn't find if/how e.g >> Postgres has such an extensibility mechanism. >> >> At the same time, there's some commonality between SQLite and >> PyTables: single-file, no concurrency approach. If you need a >> full-fledged RDBMS with authorization etc. you are in another league >> and some abstractions may be difficult to map to PyTables. > > > This is a good point that I hadn't considered. > >> >> >> And RDBMS have received, recently, features that are of great interest >> for scientific users - for example, indexes optimized for spatial or >> interval queries[7][8]. >> >> > Instead, why not just write a SQLAlchemy dialect [6] that is backed by >> > PyTables? >> >> I considered this. I don't know how difficult it is. Do you think that >> this would be the way to go for implementing a thin relational layer >> on top of PyTables? >> >> As I have no practical experience with SQLAlchemy, I cannot foresee >> e.g. those performance drops that you were pinpointing above. > > > I am under the impression that a SQLite vtabs implementation would be > faster, but less general, than SQLAlchemy. But this is why I was asking > the question about "Why vtabs?" in the first place. I guess it comes down > to whoever implements it ;) > >> >> >> > Yes, this isn't 'self-contained' in that we know have a dependency on >> > SQLAlchemy. >> > However, if done right this would be an *optional* dependency. Are >> > there >> > reasons >> > to not do this that I am missing? I think that including something like >> > this as a >> > subpackage in PyTables is more reasonable than interfacing with SQLite >> > in specific. >> >> More reasonable in a general sense, I don't know. The mirror statement >> would be to say that adding support for Numpy containers to the Python >> database adapter would be a reasonable thing to do. >> >> >> > Thanks for fielding my questions here. >> >> A pleasure. I am trying to wrap my head around all the possibilities >> here. I think a documented PyTables use-case for a moderately complex >> scientific database could do a lot for its story. > > > I agree. I think that having something like what you propose available > would > be really interesting. It would be great to be able to say "And we support > SQL > queries if you need them!" > > However, I am concerned about how this would affect the existing PyTables > code base in terms of maintenance, compatibility with existing objects, > build system dependencies, etc. > > A lot of this decisions get made by the person who actually writes it. Thus > I > was asking if you had any code available. If I saw a partial > implementation > I could review it. > > So to answer your initial question, we would be interested in looking at a > SQL > interface layer for HDF5 using PyTables. At that point we could discuss > what it would take to integrate it back in upstream. However, since I am > not > personally all that interested in SQL, I probably wouldn't be the one to > write > this subpackage ;). > > If you are interested in implementing it in one of the two main ways we > discussed > but don't know which to pursue we can try to work that out here or on the > dev list. > If you really want to try vtabs or SQLAlchemy, I encourage you to try and > let us > know how it goes and if you have questions or need help! > > Be Well > Anthony > > [9] http://www.hdfgroup.org/HDF5/doc1.6/UG/13_Attributes.html#SpecIssues > [10] http://www.hdfgroup.org/HDF5/doc/UG/UG_frame13Attributes.html > >> >> >> Cheers, >> >> Álvaro. >> >> [7] http://www.sqlite.org/rtree.html >> [8] http://www.postgresql.org/docs/8.1/static/gist.html >> >> > Be Well >> > Anthony >> > >> > >> > [6] http://docs.sqlalchemy.org/en/latest/#dialect-documentation >> > >> >> >> >> >> >> Cheers, >> >> >> >> Álvaro. >> >> -- >> >> [1] http://www.rmunn.com/sqlalchemy-tutorial/tutorial.html >> >> [2] http://dl.dropbox.com/u/2467197/ER-simple.png (yellow tables link >> >> to HDF5 data, or other tables with the real measurements, white tables >> >> are computed). >> >> [3] http://www.scidb.org/ >> >> [4] See p.26-29 and 32 >> >> >> >> >> >> http://www.itea-wsmr.org/ITEA%20Papers%20%20Presentations/2006%20ITEA%20Papers%20and%20Presentations/folk_HDF5_databases_pres.pdf >> >> [5] >> >> >> >> https://github.com/numpy/numpy/blob/master/numpy/lib/recfunctions.py#L826 >> >> >> >> >> >> > Be Well >> >> > Anthony >> >> > >> >> > On Thu, Apr 12, 2012 at 11:03 AM, Alvaro Tejero Cantero >> >> > <alv...@minin.es> >> >> > wrote: >> >> >> >> >> >> Hi, >> >> >> >> >> >> The topic of introducing some kind of relational management in >> >> >> PyTables comes up with certain frequency. >> >> >> >> >> >> Would it be possible to combine the virtues of RDBMS and hdf5's >> >> >> speed >> >> >> via a mechanism such as SQLite Virtual Tables? >> >> >> >> >> >> http://www.sqlite.org/vtab.html >> >> >> >> >> >> I wonder if the required x* functions could be written for PyTables, >> >> >> or if it being in Python is an obstacle to this kind of interfacing >> >> >> with SQLite. >> >> >> >> >> >> Something like that would be a truly powerful solution in use cases >> >> >> that don't require concurrency. >> >> >> >> >> >> Cheers, >> >> >> >> >> >> -á. >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> ------------------------------------------------------------------------------ >> >> >> For Developers, A Lot Can Happen In A Second. >> >> >> Boundary is the first to Know...and Tell You. >> >> >> Monitor Your Applications in Ultra-Fine Resolution. Try it FREE! >> >> >> http://p.sf.net/sfu/Boundary-d2dvs2 >> >> >> _______________________________________________ >> >> >> Pytables-users mailing list >> >> >> Pytables-users@lists.sourceforge.net >> >> >> https://lists.sourceforge.net/lists/listinfo/pytables-users >> >> > >> >> > >> >> > >> >> > >> >> > >> >> > ------------------------------------------------------------------------------ >> >> > For Developers, A Lot Can Happen In A Second. >> >> > Boundary is the first to Know...and Tell You. >> >> > Monitor Your Applications in Ultra-Fine Resolution. Try it FREE! >> >> > http://p.sf.net/sfu/Boundary-d2dvs2 >> >> > _______________________________________________ >> >> > Pytables-users mailing list >> >> > Pytables-users@lists.sourceforge.net >> >> > https://lists.sourceforge.net/lists/listinfo/pytables-users >> >> > >> >> >> >> >> >> >> >> ------------------------------------------------------------------------------ >> >> For Developers, A Lot Can Happen In A Second. >> >> Boundary is the first to Know...and Tell You. >> >> Monitor Your Applications in Ultra-Fine Resolution. Try it FREE! >> >> http://p.sf.net/sfu/Boundary-d2dvs2 >> >> _______________________________________________ >> >> Pytables-users mailing list >> >> Pytables-users@lists.sourceforge.net >> >> https://lists.sourceforge.net/lists/listinfo/pytables-users >> > >> > >> > >> > >> > ------------------------------------------------------------------------------ >> > For Developers, A Lot Can Happen In A Second. >> > Boundary is the first to Know...and Tell You. >> > Monitor Your Applications in Ultra-Fine Resolution. Try it FREE! >> > http://p.sf.net/sfu/Boundary-d2dvs2 >> > _______________________________________________ >> > Pytables-users mailing list >> > Pytables-users@lists.sourceforge.net >> > https://lists.sourceforge.net/lists/listinfo/pytables-users >> > >> >> >> ------------------------------------------------------------------------------ >> For Developers, A Lot Can Happen In A Second. >> Boundary is the first to Know...and Tell You. >> Monitor Your Applications in Ultra-Fine Resolution. Try it FREE! >> http://p.sf.net/sfu/Boundary-d2dvs2 >> _______________________________________________ >> Pytables-users mailing list >> Pytables-users@lists.sourceforge.net >> https://lists.sourceforge.net/lists/listinfo/pytables-users > > > > ------------------------------------------------------------------------------ > For Developers, A Lot Can Happen In A Second. > Boundary is the first to Know...and Tell You. > Monitor Your Applications in Ultra-Fine Resolution. Try it FREE! > http://p.sf.net/sfu/Boundary-d2dvs2 > _______________________________________________ > Pytables-users mailing list > Pytables-users@lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/pytables-users > ------------------------------------------------------------------------------ For Developers, A Lot Can Happen In A Second. Boundary is the first to Know...and Tell You. Monitor Your Applications in Ultra-Fine Resolution. Try it FREE! http://p.sf.net/sfu/Boundary-d2dvs2 _______________________________________________ Pytables-users mailing list Pytables-users@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/pytables-users