I'm continuing this thread on the dev list.

-á.



On Fri, Apr 13, 2012 at 21:17, Anthony Scopatz <scop...@gmail.com> wrote:
>
> On Fri, Apr 13, 2012 at 12:30 PM, Alvaro Tejero Cantero <alv...@minin.es>
> wrote:
>>
>> Hi Anthony,
>>
>>
>>
>> >> How does hierarchical help here? do you create a 'singer_name'/song
>> >> table? or a 'genre name'/song ?. Most of the time the physical layout
>> >> in the form of a hierarchy is just an annoyance.
>> >
>> > I have to say that I disagree.  The hierarchical features make it so
>> > that
>> > the data maps well to both Python objects and file systems.  I feel that
>> > both of these are more natural to work with than having to construct a
>> > query of joins, groupbys, etc which reconstructs my data. So while this
>> > is just my opinion, I feel that hierarchies are much more natural to
>> > work
>> > with.
>> >
>>
>> I try to see the difference; I think I could be helped by an example.
>> SQL gives you views to hide the data layout. You will just have 'a
>> table' that you can read by rows. Nothing else. What happens in the
>> case outlined before if I want to group songs by genre? if I encode
>> the relation in the hierarchy, I will have to write the code that
>> generates the view, but since there are many ways to create the
>> hierarchy, the code will be specific to one data layout... I'd love to
>> be wrong here.
>
>
> Yes, this is exactly correct.  You do have to write the code that generates
> the view based on how your file is organized.  However, encoding this
> information explicitly in the hierarchy can make look up faster than if you
> have to  search and join several large tables.
>
> Because PyTables makes writing this easy in most cases, I don't mind.
>
>>
>> >> > Also, my sense is that there would be a fair bit of overhead in this
>> >> > interface
>> >> > layer, which might not get you the speed boost you desire.  I could
>> >> > be
>> >> > wrong
>> >> > about this though.
>> >>
>> >> I think you're right in the wrapping of the results via the Python
>> >> interface to SQLite. I suspect you're not about the queries executed
>> >> in the virtual table, because that is left for you to implement and
>> >> thus you could turn the query terms (that are handed over to you) into
>> >> in-kernel expressions if you so wish (http://www.sqlite.org/vtab.html)
>> >
>> >
>> > This was informed by my experience with SQLAlchemy which in some
>> > situations added an excessively long computation times.  With the
>> > PyTables
>> > infrastructure, we would at least have the option of writing the
>> > performance
>> > critical parts in C or Cython...
>>
>> You are talking here about writing specific /queries/ in Cython? just
>> for clarification.
>
>
> I am talking about writing the interface code between SQLite and PyTables
> in C or (more likely) Cython.  The queries themselves still are written in
> SQL.
>
>>
>>
>> >> > If I saw a proof-of-concept implementation, I may grok better the
>> >> > purpose.
>> >> > Do you have any code to share?
>> >>
>> >> No, but I have an example ER diagram which is only part of what I
>> >> need.  You are welcome to have a look at it[2]
>>
>> > Sorry the text in this image is too small for me to read.
>>
>> I uploaded a larger version on the same location.
>
>
> Thanks!
>
>>
>>
>> > Writing data-specific relational layers for your applications on top of
>> > HDF5 with PyTables is not hard (IMHO).  Add in the features of NumPy
>> > to perform in-memory manipulations and you have pretty much everything
>> > that you need.  I think this is why we don't have formal implementation
>> > of the SQ Language in PyTables.
>>
>> Yes, fair enough. There is just no canonical way of doing it.
>>
>> What do you think of storing in the .attrs something like "This column
>> of this table matches (in the sense of foreign key) that column in
>> that table" ?
>>
>> Or would you store these relations in a global repository of sorts - a
>> specific table?
>
>
> Whenever I have wanted to mimic relation behavior in HDF5 I have used
> the second method where you store a table(s) of relations somewhere in your
> file and make sure that your 'data' tables have appropriate 'primary key'
> columns.
>
> Attributes are an interesting idea but I would advise against it since space
> is
> limited [9] and access is slow [10].
>
>>
>>
>> > I guess what i don't understand still is why - if you wanted to do this
>> > -
>> > would
>> > you use the SQLite vtabs?  This seems to have the worst of the SQL world
>> > in terms of vendor lock in, compatibility with other SQL
>> > implementations,
>> > etc.
>>
>> That is true. I did a quick search and couldn't find if/how e.g
>> Postgres has such an extensibility mechanism.
>>
>> At the same time, there's some commonality between SQLite and
>> PyTables: single-file, no concurrency approach. If you need a
>> full-fledged RDBMS with authorization etc. you are in another league
>> and some abstractions may be difficult to map to PyTables.
>
>
> This is a good point that I hadn't considered.
>
>>
>>
>> And RDBMS have received, recently, features that are of great interest
>> for scientific users - for example, indexes optimized for spatial or
>> interval queries[7][8].
>>
>> > Instead, why not just write a SQLAlchemy dialect [6] that is backed by
>> > PyTables?
>>
>> I considered this. I don't know how difficult it is. Do you think that
>> this would be the way to go for implementing a thin relational layer
>> on top of PyTables?
>>
>> As I have no practical experience with SQLAlchemy, I cannot foresee
>> e.g. those performance drops that you were pinpointing above.
>
>
> I am under the impression that a SQLite vtabs implementation would be
> faster, but less general, than SQLAlchemy.  But this is why I was asking
> the question about "Why vtabs?" in the first place.  I guess it comes down
> to whoever implements it ;)
>
>>
>>
>> > Yes, this isn't 'self-contained' in that we know have a dependency on
>> > SQLAlchemy.
>> > However, if done right this would be an *optional* dependency.  Are
>> > there
>> > reasons
>> > to not do this that I am missing?  I think that including something like
>> > this as a
>> > subpackage in PyTables is more reasonable than interfacing with SQLite
>> > in specific.
>>
>> More reasonable in a general sense, I don't know. The mirror statement
>> would be to say that adding support for Numpy containers to the Python
>> database adapter would be a reasonable thing to do.
>>
>>
>> > Thanks for fielding my questions here.
>>
>> A pleasure. I am trying to wrap my head around all the possibilities
>> here. I think a documented PyTables use-case for a moderately complex
>> scientific database could do a lot for its story.
>
>
> I agree.  I think that having something like what you propose available
> would
> be really interesting.  It would be great to be able to say "And we support
> SQL
> queries if you need them!"
>
> However, I am concerned about how this would affect the existing PyTables
> code base in terms of maintenance, compatibility with existing objects,
> build system dependencies, etc.
>
> A lot of this decisions get made by the person who actually writes it.  Thus
> I
> was asking if you had any code available.  If I saw a partial
> implementation
> I could review it.
>
> So to answer your initial question, we would be interested in looking at a
> SQL
> interface layer for HDF5 using PyTables.  At that point we could discuss
> what it would take to integrate it back in upstream.  However, since I am
> not
> personally all that interested in SQL, I probably wouldn't be the one to
> write
> this subpackage ;).
>
> If you are interested in implementing it in one of the two main ways we
> discussed
> but don't know which to pursue we can try to work that out here or on the
> dev list.
> If you really want to try vtabs or SQLAlchemy, I encourage you to try and
> let us
> know how it goes and if you have questions or need help!
>
> Be Well
> Anthony
>
> [9] http://www.hdfgroup.org/HDF5/doc1.6/UG/13_Attributes.html#SpecIssues
> [10] http://www.hdfgroup.org/HDF5/doc/UG/UG_frame13Attributes.html
>
>>
>>
>> Cheers,
>>
>> Álvaro.
>>
>> [7] http://www.sqlite.org/rtree.html
>> [8] http://www.postgresql.org/docs/8.1/static/gist.html
>>
>> > Be Well
>> > Anthony
>> >
>> >
>> > [6] http://docs.sqlalchemy.org/en/latest/#dialect-documentation
>> >
>> >>
>> >>
>> >> Cheers,
>> >>
>> >> Álvaro.
>> >> --
>> >> [1] http://www.rmunn.com/sqlalchemy-tutorial/tutorial.html
>> >> [2] http://dl.dropbox.com/u/2467197/ER-simple.png (yellow tables link
>> >> to HDF5 data, or other tables with the real measurements, white tables
>> >> are computed).
>> >> [3] http://www.scidb.org/
>> >> [4] See p.26-29 and 32
>> >>
>> >>
>> >> http://www.itea-wsmr.org/ITEA%20Papers%20%20Presentations/2006%20ITEA%20Papers%20and%20Presentations/folk_HDF5_databases_pres.pdf
>> >> [5]
>> >>
>> >> https://github.com/numpy/numpy/blob/master/numpy/lib/recfunctions.py#L826
>> >>
>> >>
>> >> > Be Well
>> >> > Anthony
>> >> >
>> >> > On Thu, Apr 12, 2012 at 11:03 AM, Alvaro Tejero Cantero
>> >> > <alv...@minin.es>
>> >> > wrote:
>> >> >>
>> >> >> Hi,
>> >> >>
>> >> >> The topic of introducing some kind of relational management in
>> >> >> PyTables comes up with certain frequency.
>> >> >>
>> >> >> Would it be possible to combine the virtues of RDBMS and hdf5's
>> >> >> speed
>> >> >> via a mechanism such as SQLite Virtual Tables?
>> >> >>
>> >> >> http://www.sqlite.org/vtab.html
>> >> >>
>> >> >> I wonder if the required x* functions could be written for PyTables,
>> >> >> or if it being in Python is an obstacle to this kind of interfacing
>> >> >> with SQLite.
>> >> >>
>> >> >> Something like that would be a truly powerful solution in use cases
>> >> >> that don't require concurrency.
>> >> >>
>> >> >> Cheers,
>> >> >>
>> >> >> -á.
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >> ------------------------------------------------------------------------------
>> >> >> For Developers, A Lot Can Happen In A Second.
>> >> >> Boundary is the first to Know...and Tell You.
>> >> >> Monitor Your Applications in Ultra-Fine Resolution. Try it FREE!
>> >> >> http://p.sf.net/sfu/Boundary-d2dvs2
>> >> >> _______________________________________________
>> >> >> Pytables-users mailing list
>> >> >> Pytables-users@lists.sourceforge.net
>> >> >> https://lists.sourceforge.net/lists/listinfo/pytables-users
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > ------------------------------------------------------------------------------
>> >> > For Developers, A Lot Can Happen In A Second.
>> >> > Boundary is the first to Know...and Tell You.
>> >> > Monitor Your Applications in Ultra-Fine Resolution. Try it FREE!
>> >> > http://p.sf.net/sfu/Boundary-d2dvs2
>> >> > _______________________________________________
>> >> > Pytables-users mailing list
>> >> > Pytables-users@lists.sourceforge.net
>> >> > https://lists.sourceforge.net/lists/listinfo/pytables-users
>> >> >
>> >>
>> >>
>> >>
>> >> ------------------------------------------------------------------------------
>> >> For Developers, A Lot Can Happen In A Second.
>> >> Boundary is the first to Know...and Tell You.
>> >> Monitor Your Applications in Ultra-Fine Resolution. Try it FREE!
>> >> http://p.sf.net/sfu/Boundary-d2dvs2
>> >> _______________________________________________
>> >> Pytables-users mailing list
>> >> Pytables-users@lists.sourceforge.net
>> >> https://lists.sourceforge.net/lists/listinfo/pytables-users
>> >
>> >
>> >
>> >
>> > ------------------------------------------------------------------------------
>> > For Developers, A Lot Can Happen In A Second.
>> > Boundary is the first to Know...and Tell You.
>> > Monitor Your Applications in Ultra-Fine Resolution. Try it FREE!
>> > http://p.sf.net/sfu/Boundary-d2dvs2
>> > _______________________________________________
>> > Pytables-users mailing list
>> > Pytables-users@lists.sourceforge.net
>> > https://lists.sourceforge.net/lists/listinfo/pytables-users
>> >
>>
>>
>> ------------------------------------------------------------------------------
>> For Developers, A Lot Can Happen In A Second.
>> Boundary is the first to Know...and Tell You.
>> Monitor Your Applications in Ultra-Fine Resolution. Try it FREE!
>> http://p.sf.net/sfu/Boundary-d2dvs2
>> _______________________________________________
>> Pytables-users mailing list
>> Pytables-users@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/pytables-users
>
>
>
> ------------------------------------------------------------------------------
> For Developers, A Lot Can Happen In A Second.
> Boundary is the first to Know...and Tell You.
> Monitor Your Applications in Ultra-Fine Resolution. Try it FREE!
> http://p.sf.net/sfu/Boundary-d2dvs2
> _______________________________________________
> Pytables-users mailing list
> Pytables-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/pytables-users
>

------------------------------------------------------------------------------
For Developers, A Lot Can Happen In A Second.
Boundary is the first to Know...and Tell You.
Monitor Your Applications in Ultra-Fine Resolution. Try it FREE!
http://p.sf.net/sfu/Boundary-d2dvs2
_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users

Reply via email to