Re: Database/DBD Bridging?

2011-12-13 Thread Reinier Post
On Tue, 13 December 2011 at 15:35:14 (+0100), rehs...@googlemail.com (Jens
Rehsack) wrote:
 2011/10/4 Reinier Post r...@win.tue.nl:
[...]
  For instance, I have a simple script that needs to count the number of
  distinct values in columns.  This seems like a pretty elementary thing
  to do, but after a lot of tweaking and tinkering I've found that
 
   - SELECT COUNT(DISTINCT ...) is invalid in Jet (Access MDB)
   - SELECT COUNT(*) FROM (SELECT DISTINCT ...) is invalid elsewhere
   - SELECT DISTINCT ... is accepted but just does SELECT in DBD::CSV
 
 How old are your DBD::CSV / SQL::Statement versions?
 Of course, SELECT DISTINCT works well in DBD::CSV.

Indeed, SELECT DISTINCT appears to work just fine now,
and I can't recall the conditions under which it appeared to fail,
so feel free to attribute it to faulty testing on my part.

 /Jens

-- 
Reinier


Re: Database/DBD Bridging?

2011-11-23 Thread Jens Rehsack
2011/10/4 Reinier Post r...@win.tue.nl:
 On Fri, Sep 23, 2011 at 11:19:32PM -0400, Brendan Byrd wrote:

 [...], the Perl community in general needs this.  Developers need to just
 grab some modules via CPAN, describe the relational models (if they don't
 already exist), and have the flow of data *just work!* [...]

 I would love to have some idea of how to write multi-DBMS SQL
 applications with DBI (or with anything else).

 The main problem I see is the differences in SQL between DB backends.
 They are large, and SQL::Statement just adds another dialect.

 For a start it would be nice to have a matrix up somewhere listing
 DBMSes against SQL constructs, so I can look up, for instance, against
 which types of DBD backends which SQL syntax will actually work
 (i.e. not only be accepted but also take effect).
 You can't do much that will work everywhere.

 For instance, I have a simple script that needs to count the number of
 distinct values in columns.  This seems like a pretty elementary thing
 to do, but after a lot of tweaking and tinkering I've found that

  - SELECT COUNT(DISTINCT ...) is invalid in Jet (Access MDB)
  - SELECT COUNT(*) FROM (SELECT DISTINCT ...) is invalid elsewhere
  - SELECT DISTINCT ... is accepted but just does SELECT in DBD::CSV

How old are your DBD::CSV / SQL::Statement versions?
Of course, SELECT DISTINCT works well in DBD::CSV.

/Jens

 The safe thing is to just do a SELECT and deduplicate in Perl,
 but it would be nice to do more within the backends.

 Once this is in place it would be nice to start to 'plug the holes'
 by defining a common SQL and translations into backend-specific SQL.

 This can be used to define a DBD::Any that can be used to talk to many
 different backends.  (To some extent this can be approximated by a
 SQL processor that supports external table data, such as Access or Pg.)

 Of course there are many other differences to resolve, e.g. in treatment
 of character sets.  These issues can be dealt with in a similar way.

 The nice thing about such a matrix is that it will already be useful
 on its own, even when it's unfinished.

 --
 Reinier Post
 TU Eindhoven



Re: Database/DBD Bridging?

2011-10-04 Thread Reinier Post
On Fri, Sep 23, 2011 at 11:19:32PM -0400, Brendan Byrd wrote:

 [...], the Perl community in general needs this.  Developers need to just
 grab some modules via CPAN, describe the relational models (if they don't
 already exist), and have the flow of data *just work!* [...]

I would love to have some idea of how to write multi-DBMS SQL
applications with DBI (or with anything else).

The main problem I see is the differences in SQL between DB backends.
They are large, and SQL::Statement just adds another dialect.

For a start it would be nice to have a matrix up somewhere listing
DBMSes against SQL constructs, so I can look up, for instance, against
which types of DBD backends which SQL syntax will actually work
(i.e. not only be accepted but also take effect).
You can't do much that will work everywhere.

For instance, I have a simple script that needs to count the number of
distinct values in columns.  This seems like a pretty elementary thing
to do, but after a lot of tweaking and tinkering I've found that

  - SELECT COUNT(DISTINCT ...) is invalid in Jet (Access MDB)
  - SELECT COUNT(*) FROM (SELECT DISTINCT ...) is invalid elsewhere
  - SELECT DISTINCT ... is accepted but just does SELECT in DBD::CSV

The safe thing is to just do a SELECT and deduplicate in Perl,
but it would be nice to do more within the backends.
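Roughly, the safe approach looks like this (the DSN, table and column names
here are only placeholders; any DBD works the same way):

use strict;
use warnings;
use DBI;

# Connect to a directory of CSV files via DBD::CSV (illustrative DSN).
my $dbh = DBI->connect('dbi:CSV:f_dir=./data', undef, undef, { RaiseError => 1 });

# Plain SELECT of the column, then deduplicate in Perl with a hash.
my $sth = $dbh->prepare('SELECT person FROM people');
$sth->execute;

my %seen;
while (my ($value) = $sth->fetchrow_array) {
    $seen{$value}++ if defined $value;
}
printf "%d distinct values in people.person\n", scalar keys %seen;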

Once this is in place it would be nice to start to 'plug the holes'
by defining a common SQL and translations into backend-specific SQL.

This can be used to define a DBD::Any that can be used to talk to many
different backends.  (To some extent this can be approximated by a
SQL processor that supports external table data, such as Access or Pg.)

Of course there are many other differences to resolve, e.g. in treatment
of character sets.  These issues can be dealt with in a similar way.

The nice thing about such a matrix is that it will already be useful
on its own, even when it's unfinished.

-- 
Reinier Post
TU Eindhoven


Re: Database/DBD Bridging?

2011-10-04 Thread Tim Bunce
On Tue, Oct 04, 2011 at 02:20:10PM +0200, Reinier Post wrote:
 On Fri, Sep 23, 2011 at 11:19:32PM -0400, Brendan Byrd wrote:
 
 For a start it would be nice to have a matrix up somewhere listing
 DBMSes against SQL constructs, so I can look up, for instance, against
 which types of DBD backends which SQL syntax will actually work

A quick google turned up 
http://www.grantondata.com/community/dbrosettastone.html
There may well be others/better/newer.

 For instance, I have a simple script that needs to count the number of
 distinct values in columns.  This seems like a pretty elementary thing
 to do, but after a lot of tweaking and tinkering I've found that
 
   - SELECT COUNT(DISTINCT ...) is invalid in Jet (Access MDB)
   - SELECT COUNT(*) FROM (SELECT DISTINCT ...) is invalid elsewhere
   - SELECT DISTINCT ... is accepted but just does SELECT in DBD::CSV

A good ORM, like DBIx::Class, will handle most differences in SQL dialects
for you, and give you a way to include fragments of literal SQL if you
need to.
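For example, assuming a DBIx::Class schema class My::Schema with a Book
result source (all class, table and column names here are hypothetical), a
distinct count and a literal-SQL escape hatch look roughly like this:

use strict;
use warnings;
use My::Schema;   # hypothetical DBIx::Class schema class

my $schema = My::Schema->connect('dbi:Pg:dbname=books', 'user', 'pass');

# DBIC generates dialect-appropriate SQL for the distinct count.
my $n_authors = $schema->resultset('Book')->search(
    undef,
    { columns => ['author'], distinct => 1 },
)->count;

# A literal SQL fragment (with a bind value) can still be dropped in where
# a particular backend needs something special.
my $n_long = $schema->resultset('Book')->search(
    \[ 'length(title) > ?', 10 ],
)->count;

print "$n_authors distinct authors, $n_long long titles\n";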

Tim.


Re: Database/DBD Bridging?

2011-09-26 Thread Brendan Byrd
On Sat, Sep 24, 2011 at 12:30 AM, Darren Duncan dar...@darrenduncan.net wrote:

 Brendan Byrd wrote:

  Yeah, that sounds right.  So would this eventually become its own DBD
 module?


 Yes and no.  It would not natively be a DBD module, but a separate module
 can exist that is a DBD module which wraps it.  Kind of like how you have
 both SQLite and DBD::SQLite, say.


So, it's more or less its own meta-database that talks to other database
systems.  Wouldn't this (or at least parts of it) essentially be a
next-generation SQL::Statement engine?  After all, it has a pretty good
framework, but it doesn't support everything.  Modules like DBD::CSV or
DBD::File would benefit from having fully SQL:2003-compatible SQL support.

I'm still having trouble understanding the process, though.  I mean, I
looked through the three module categories.  There actually isn't much code
there.  But, you have a *LOT* of documentation.  I mean, you have two
separate module sections just for documentation, totaling 32,876 words!  The
Muldis::D::Basics POD is 3,742 words alone.  Hell, *Old Man and the Sea* is
26,560 words.  You've literally already written a book of documentation.  I
dare say that if you had spent that time on the actual code, it would be done by
now.

How would this work?  Let's say I have the two databases in the primary
email I sent.  How would that work in Perl?  Do I set up Muldis first with
the two databases, and then log in via DBI to the DBD, which DBIC would
interface against?  Does the DBD basically talk to Muldis to do the dirty
work, or is there some work that the DBD would need to take care of (like
actually talking to the databases)?


  Does it use DBI methods to figure out the specs of the system?

 For example, you were saying less capable things like DBD::CSV.  Is that
 determined by querying get_info for the ODBC/ANSI capability data?


 It would use whatever means make sense, which might be starting with the
 DBI methods for some basic functionality and then doing SELECT from the
 INFORMATION_SCHEMA to provide enhanced functionality.


Yeah, I guess DBI doesn't have the entire schema in its methods, but neither
do a lot of DBMSes.

-- 
Brendan Byrd/SineSwiper sineswi...@gmail.com
Computer tech, PERL wizard, and all-around Internet guru


Re: Database/DBD Bridging?

2011-09-24 Thread Brendan Byrd
The problem with PostgreSQL's SQL/MED is that it's not Perl, and it won't
work for some of the more abstract objects available as DBD.  I would like
to tie this DBD::FederatedDB into DBIC, so that it can search and insert
everything on-the-fly.  Shoving everything into RAM isn't right, either,
since DBD::AnyData can already do that.  The whole point of having the
databases process the rows one at a time is so that it can handle 10 million
row tables without a full wasteful dump.  It looks like Set::Relation can
work out great for sucking in table_info/row_info data, and can be used as
the temp cache as fractured rows come in.

I would be highly interested in developing this with you.  I'm spread pretty
thin with several other Perl modules, so I otherwise wouldn't tackle it
right now.  But, if you already have something started, we can try to finish
it, and that's much better than starting from scratch alone.

Do you have a repository for this new module yet?  What are you calling it?
I take it the module is building off of SQL::Statement?

On Fri, Sep 23, 2011 at 12:27 AM, Darren Duncan dar...@darrenduncan.net wrote:

 Brendan,

 Taking into account David's response ...

 I have several answers for you:

 1.  The functionality you are talking about is often referred to as a
 federated database, where you have one database engine that the client talks
 to which turns around and farms some of the work off to various other
 database engines, coordinating between them and doing some extra work
 itself.  What you initially proposed is essentially using Perl as the
 client-most database engine, and what David proposed is using PostgreSQL as
 the client-most engine instead.

 2.  If PostgreSQL 9.1's SQL/MED abilities will do what you need, then use
 it.

 3.  If immense scalability isn't needed and you want to do this in Perl
 then I recommend you look at my Set::Relation CPAN module, which will handle
 a lot of the difficult work for you.  Specifically, you would still manage
 for yourself the parceling out of queries to the various back-end DBMSs like
 Oracle or Excel or whatever, but Set::Relation will then take care of all
 the drudgery of taking the various rowsets returned from those and combine
 them into the various end-result queries you actually wanted to do.  Each
 Set::Relation object contains a rowset and you can use its several dozen
 methods to do relational joins or aggregates or antijoins or various other
 things, all the functionality of SQL, but in Perl.  Its main limiting factor
 is that it is entirely RAM-based, though this also makes it simpler.  So you
 can do this right now.

 4.  I am presently implementing a relational DBMS in Perl which provides
 all the functionality you described and more, including query language
 support that lets you write code like you demonstrated.  Strictly speaking
 the initial version is fully self-contained for simplicity, but a
 subsequent/spinoff version would add the ability to farm out to other
 database engines as per SQL/MED, and this *is* designed to scale.  It even
 uses the paradigm you mention, where each underlying engine is essentially a
 namespace in which tables live, and you can join between them as if they
 were all local; or to be more accurate, each database *connection* has its
 own namespace and the underlying engine is just a quality of that
 connection, like with how DBI lets you have multiple connections with the
 same driver.

 So if this sounds like something you want to help create, please talk with
 me.

 -- Darren Duncan

 Brendan Byrd wrote:

 Okay, this is a big blue sky idea, but like all things open-source, it comes
 out of a need.  I'm trying to merge together Excel (or CSV), Oracle, Fusion
 Tables, JSON, and SNMP for various data points and outputs.  DBIC seems to
 work great for a large database with a bunch of tables, but what about a
 bunch of databases?  I've searched and searched, and nobody seems to have
 designed a DBD for multiple DBDs.  There's DBD::Multi and Multiplex, but
 those are merely for replication.  This would require reparsing of SQL
 statements.

 So, let's call this module idea DBD::IntegrateDB or MultiDB.  It would be a
 module built from SQL::Statement (using the typical Embed instructions), so
 it would use that module's SQL Engine for parsing and processing SQL.

 We'll use a simple example of two databases: one Oracle, and one MySQL.
 This module loads both of them in with a DBI->connect string.  Then the dev
 runs a prepare on the following SQL:

 SELECT
   book, title, b.person, age, dob
 FROM
   ora.books b
   INNER JOIN mysql.people p ON (
  b.person_id = p.person_id
   )

 So, ora.books is on the Oracle DB, and mysql.people is on the MySQL DB.
 The parser for this MultiDB would:

   1. Use SQL::Parser to break down the SQL statement.
   2. Figure out who owns what, in terms of tables and columns.  (Complain
   about ambiguous columns if it has to.)
   3. Use table_info calls to the 

Re: Database/DBD Bridging?

2011-09-24 Thread Brendan Byrd
On Fri, Sep 23, 2011 at 5:01 PM, Darren Duncan dar...@darrenduncan.net wrote:


  The problem with PostgreSQL's SQL/MED is that it's not Perl, and it won't
 work for some of the more abstract objects available as DBD.


 You may want to look into PL/Perl then, using Perl inside Postgres, to
 bring together some of these things, if it will work for you.


Yeah, but that's mostly a hack.  It's not really portable and frankly, I
personally don't need this module.  Granted, my conversations started out
that way, but I acknowledge that the module doesn't exist and I can do
things just fine merging them together via Perl as I have been.

However, the Perl community in general needs this.  Developers need to just
grab some modules via CPAN, describe the relational models (if they don't
already exist), and have the flow of data *just work!*  We need a new way of
thinking how this data interacts.  It is ALL related, else you wouldn't be
using it.  Why do I have to parse through some JSON object and dump it
into variables just to put them into some SQL database if a) the JSON object
is already a bunch of variables, and b) they are both universally well-known
standards?  Me gluing the two together for my one program for this one
instance is basically just hard coding because no other method exists.
JSON, Oracle, CSV, Google, SNMP, Excel, Access, MySQL, XML, HTML, CGI (yes,
I went there)... Make it all available on one interface!

DBIC needs this.  The examples they provide aren't real world.  Hey,
just grab this SQLite database, create your own tables any way you like, and
map those into DBIC.  Because Yo Dawg, I heard you like relationships, so we
put relationships in your relationships, so that you can waste time while
you waste time.  (Yes, I went there, too...)  After all, if you are
designing the database, you can make it as simple or as complex as you want
it, and if it's complex, just make views or materialized views to simplify
things.  Granted, I'm belittling DBIC's ability to relate things in a single
database (and 4NF in general), but *real problems* with data come from
multiple sources.  Sources that you can't admin.  Sources with some really
funky languages.



   I would like to tie this DBD::FederatedDB into DBIC, so that it can
 search and insert everything on-the-fly.  Shoving everything into RAM isn't
 right, either, since DBD::AnyData can already do that.  The whole point of
 having the databases process the rows one at a time is so that it can handle
 10 million row tables without a full wasteful dump.


 Another thing to ask is whether what you're doing here is a batch process
 where some performance matters are less of an issue, or whether it is more
 on demand or more performance sensitive.


Both?  I can have both, can't I?

But, seriously, the primary goal is the ability to support mostly
everything.  (Really, the goal is to get something into DBD that can go into
DBIC as a single data source.)  It would start out as just something that
would join two tables together and expand from there.

However, speed would be a good secondary goal.  I think that sort of thing
is achievable not from XS code, but from intelligent decisions about how to
pull the data.  Weighing the cardinality of the keys in the table with how
it links together.  If one table is 20 rows and another is 500K rows, then
just pull the entire 20 row table into memory and dump it into a temporary
table on the other system, so that you can do an actual JOIN over there.
Those kinds of decisions are much of the logic of an RDBMS itself.
Also, the module would have a natural speed advantage because of parallelization.
Most of these different systems are going to be on different servers, so at
least the data flow is going to be faster than one server pulling both
tables.  (In general.  Obviously, RDBMS have their own internal operations
to make things as fast as possible.)
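Roughly, in plain DBI, that decision looks like this (the DSNs, table and
column names are invented for the example):

use strict;
use warnings;
use DBI;

my $small_dbh = DBI->connect('dbi:CSV:f_dir=./data', undef, undef, { RaiseError => 1 });
my $big_dbh   = DBI->connect('dbi:mysql:database=hr', 'user', 'pass', { RaiseError => 1 });

# 1. Pull the 20-row table entirely into memory ...
my $small = $small_dbh->selectall_arrayref('SELECT person_id, book, title FROM books');

# 2. ... dump it into a temporary table on the big side ...
$big_dbh->do('CREATE TEMPORARY TABLE tmp_books
              (person_id INT, book VARCHAR(64), title VARCHAR(255))');
my $ins = $big_dbh->prepare('INSERT INTO tmp_books (person_id, book, title) VALUES (?, ?, ?)');
$ins->execute(@$_) for @$small;

# 3. ... and let that RDBMS do the actual JOIN, so the 500K-row table never
#    has to leave its own server.
my $rows = $big_dbh->selectall_arrayref(q{
    SELECT t.book, t.title, p.person, p.age, p.dob
      FROM tmp_books t INNER JOIN people p ON t.person_id = p.person_id
});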


   It looks

 like Set::Relation can work out great for sucking in table_info/row_info
 data, and can be used as the temp cache as fractured rows come in.


 Perhaps, although Set::Relation is more about making database operations
 like join etc available in Perl, so you'll want to be using such various
 tools to take advantage of it.  But then no one besides myself has used it
 yet that I know of, and others often think of tool uses beyond the creator.


Well, that's pretty much what I would be using it for: making database
operations available by defining the relational model in Set::Relation, and
have it handle exactly how the rows are going to interact.


 If you mean the more robust/scalable solution, then that has 2 main parts,
 which is a standard query language specification, Muldis D, plus multiple
 implementations.  It corresponds to but is distinct from the ecosystem of
 there being an ISO SQL standard and its implementations in various DBMSs.

 The query language, Muldis D, is not SQL but it is relevant here because it
 is designed to correspond to 

Re: Database/DBD Bridging?

2011-09-23 Thread Darren Duncan
I only got a copy of this message directly and not also via the list as 
expected, since you addressed it to the list, but anyway ...


Brendan Byrd wrote on 2011 Sep 22 at 6:25am PST/UTC-8:
The problem with PostgreSQL's SQL/MED is that it's not Perl, and it 
won't work for some of the more abstract objects available as DBD.


You may want to look into PL/Perl then, using Perl inside Postgres, to bring 
together some of these things, if it will work for you.


 I 
would like to tie this DBD::FederatedDB into DBIC, so that it can search 
and insert everything on-the-fly.  Shoving everything into RAM isn't 
right, either, since DBD::AnyData can already do that.  The whole point 
of having the databases process the rows one at a time is so that it can 
handle 10 million row tables without a full wasteful dump.


Another thing to ask is whether what you're doing here is a batch process where 
some performance matters are less of an issue, or whether it is more on demand 
or more performance sensitive.


  It looks
like Set::Relation can work out great for sucking in table_info/row_info 
data, and can be used as the temp cache as fractured rows come in.


Perhaps, although Set::Relation is more about making database operations like 
join etc available in Perl, so you'll want to be using such various tools to 
take advantage of it.  But then no one besides myself has used it yet that I 
know of, and others often think of tool uses beyond the creator.


I would be highly interested in developing this with you.  I'm spread 
pretty thin with several other Perl modules, so I otherwise wouldn't 
tackle it right now.  But, if you already have something started, we can 
try to finish it, and that's much better than starting from scratch alone.


Do you have a repository for this new module yet?  What are you calling 
it?  I take it the module is building off of SQL::Statement?

snip

If you mean the more robust/scalable solution, then that has 2 main parts, which 
is a standard query language specification, Muldis D, plus multiple 
implementations.  It corresponds to but is distinct from the ecosystem of there 
being an ISO SQL standard and its implementations in various DBMSs.


The query language, Muldis D, is not SQL but it is relevant here because it is 
designed to correspond to SQL and to be an intermediary form for 
generating/parsing SQL or translating between SQL dialects, or between SQL and 
other languages like Perl.  (This means all SQL, including stored procedures.)


This essentially is exactly what you want to do, have a common query syntax 
where behind the scenes some is turned into SQL that is pushed to back-end 
DBMSs, and some of which is turned into Perl to do local processing.  The great 
thing is as a user you don't have to know where it executes, but just that the 
implementation will pick the best way to handle particular code.  I think of an 
analogy like LLVM that can compile selectively to a CPU or a GPU. 
Automatically, more capable DBMSs like Postgres get more work pushed to them to 
do natively, and less capable things like DBD::CSV or whatever have less pushed 
to them and more done in Perl.


The language spec is in github at https://github.com/muldis/Muldis-D and it is 
also published on CPAN in the pure-pod distribution Muldis-D, but the CPAN copy 
has fallen behind at the moment.


The implementations I haven't started yet, or I did but canceled those efforts 
so as to do it differently, so you can't run anything yet.  But I know in my head 
exactly how I intend to do it.


I intend to make a few more large updates to the Muldis D spec before starting 
in earnest on the implementation, so as to make that simpler and easier to do (it 
is substantially complete other than some large refinements); some clues to this 
direction are in the file TODO_DRAFT on GitHub.


For timetable, if I could focus on this project I could have something usable in 
a few months; however, I also have a separate paying job that I'm currently 
focusing on which doesn't leave much time for the new project, though I hope to 
get more time to work on it maybe in mid-late October.


If you are still interested in working on this, or you just want to follow it, 
please join the (low traffic) discussion list muldis-db-us...@mm.darrenduncan.net .


FYI, this project is quite serious, not pie in the sky, and it has interest from 
some significant people in the industry, such as C.J. Date (well known for An 
Introduction to Database Systems that sold over 800K copies), and one of his 
latest co-authored books in 2010 explicitly covers part of my project with a 
chapter.


-- Darren Duncan


Re: Database/DBD Bridging?

2011-09-23 Thread Darren Duncan

Brendan Byrd wrote:

On Fri, Sep 23, 2011 at 5:01 PM, Darren Duncan dar...@darrenduncan.net wrote:
This essentially is exactly what you want to do, have a common query
syntax where behind the scenes some is turned into SQL that is
pushed to back-end DBMSs, and some of which is turned into Perl to
do local processing.  The great thing is as a user you don't have to
know where it executes, but just that the implementation will pick
the best way to handle particular code.  I think of an analogy like
LLVM that can compile selectively to a CPU or a GPU. Automatically,
more capable DBMSs like Postgres get more work pushed to them to do
natively, and less capable things like DBD::CSV or whatever have
less pushed to them and more done in Perl.
 
Yeah, that sounds right.  So would this eventually become its own DBD 
module?


Yes and no.  It would not natively be a DBD module, but a separate module can 
exist that is a DBD module which wraps it.  Kind of like how you have both 
SQLite and DBD::SQLite, say.


  Does it use DBI methods to figure out the specs of the system?
For example, you were saying less capable things like DBD::CSV.  Is 
that determined by querying get_info for the ODBC/ANSI capability data?


It would use whatever means make sense, which might be starting with the DBI 
methods for some basic functionality and then doing SELECT from the 
INFORMATION_SCHEMA to provide enhanced functionality.
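For instance (the DSN and schema name are only placeholders), the two levels 
might look like this:

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=bridge', 'user', 'pass', { RaiseError => 1 });

# Basic functionality from the DBI methods themselves.
printf "backend: %s %s\n",
    $dbh->get_info(17),    # SQL_DBMS_NAME
    $dbh->get_info(18);    # SQL_DBMS_VER

# Enhanced detail from INFORMATION_SCHEMA, where the backend provides one.
my $cols = $dbh->selectall_arrayref(q{
    SELECT table_name, column_name, data_type
      FROM information_schema.columns
     WHERE table_schema = 'public'
}, { Slice => {} });

printf "%s.%s is %s\n", @{$_}{qw(table_name column_name data_type)} for @$cols;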


Of course.  Something like this is huge, but it's also hugely important 
to make sure it gets into the hands of the Perl community.


Absolutely.

-- Darren Duncan


Database/DBD Bridging?

2011-09-22 Thread Brendan Byrd
Okay, this is a big blue sky idea, but like all things open-source, it comes
out of a need.  I'm trying to merge together Excel (or CSV), Oracle, Fusion
Tables, JSON, and SNMP for various data points and outputs.  DBIC seems to
work great for a large database with a bunch of tables, but what about a
bunch of databases?  I've searched and searched, and nobody seems to have
designed a DBD for multiple DBDs.  There's DBD::Multi and Multiplex, but
those are merely for replication.  This would require reparsing of SQL
statements.

So, let's call this module idea DBD::IntegrateDB or MultiDB.  It would be a
module built from SQL::Statement (using the typical Embed instructions), so
it would use that module's SQL Engine for parsing and processing SQL.

We'll use a simple example of two databases: one Oracle, and one MySQL.
This module loads both of them in with a DBI->connect string.  Then the dev
runs a prepare on the following SQL:

SELECT
   book, title, b.person, age, dob
FROM
   ora.books b
   INNER JOIN mysql.people p ON (
  b.person_id = p.person_id
   )

So, ora.books is on the Oracle DB, and mysql.people is on the MySQL DB.
The parser for this MultiDB would:

   1. Use SQL::Parser to break down the SQL statement (sketched below).
   2. Figure out who owns what, in terms of tables and columns.  (Complain
   about ambiguous columns if it has to.)
   3. Use table_info calls to the separate DBI interfaces, including number
   of rows, cardinality (if available), etc.
   4. Store the joining information.
   5. Prepare two *separate* SQL statements, one for each DB.  They would no
   longer be JOIN queries, but standard queries against the individual tables
   (including person_id, which wasn't included in the original SELECT statement).
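A rough sketch of steps 1 and 3 (the DSNs are placeholders, and the ora./mysql.
prefixes are left off the parsed statement, since mapping them onto connections
is part of the design work):

use strict;
use warnings;
use DBI;
use SQL::Parser;

# Step 1: break the statement down.
my $parser = SQL::Parser->new('ANSI');
$parser->parse(q{
    SELECT book, title, person, age, dob
      FROM books INNER JOIN people ON books.person_id = people.person_id
}) or die "could not parse statement\n";

my $struct = $parser->structure;
print "tables involved: @{ $struct->{table_names} }\n";

# Step 3: ask each backend what it knows about its tables.
my %dbh = (
    ora   => DBI->connect('dbi:Oracle:orcl',       'user', 'pass', { RaiseError => 1 }),
    mysql => DBI->connect('dbi:mysql:database=hr', 'user', 'pass', { RaiseError => 1 }),
);
for my $name (sort keys %dbh) {
    my $tables = $dbh{$name}->table_info(undef, undef, '%', 'TABLE');
    while (my $t = $tables->fetchrow_hashref) {
        print "$name has table $t->{TABLE_NAME}\n";
    }
}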

Then when the statement is executed (again sketched below):

   1. The two SQL statements are executed for each DB.
   2. The fetch_row sub would process each row one at a time for each DB.
   3. If two IDs match, send a row back.  Otherwise, cache the data and wait
   for something to match.
   4. Repeat until the rows are exhausted on one or both sides.  (One side
   for INNER, both sides for OUTER.)
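And a stripped-down version of the execute/fetch side, reading the smaller
side completely first instead of the full cache-and-wait dance (again, the
DSNs and columns are invented):

use strict;
use warnings;
use DBI;

my $ora   = DBI->connect('dbi:Oracle:orcl',       'user', 'pass', { RaiseError => 1 });
my $mysql = DBI->connect('dbi:mysql:database=hr', 'user', 'pass', { RaiseError => 1 });

# The two rewritten, join-free statements (person_id added to both).
my $books  = $ora->prepare('SELECT person_id, book, title FROM books');
my $people = $mysql->prepare('SELECT person_id, person, age, dob FROM people');
$_->execute for $books, $people;

# Cache the people side keyed on the join column ...
my %person;
while (my ($id, @rest) = $people->fetchrow_array) {
    $person{$id} = \@rest;
}

# ... then stream the books side and emit a joined row whenever the IDs
# match (INNER JOIN semantics; an OUTER join would also emit the misses).
while (my ($id, $book, $title) = $books->fetchrow_array) {
    my $p = $person{$id} or next;
    print join("\t", $book, $title, @$p), "\n";
}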

Does anything like that exist?  I'm not saying it's an easy operation, but
if something like that could just start off with simple JOINs at first, it
would be a miracle module.  Imagine linking with more abstract DBI modules:
Oracle to CSV to MySQL to Teradata to Sys to Sponge.  Tell me you're not
excited at the prospect of eventually creating free open-source frameworks
competitive with multi-million-dollar Business Intelligence software.  Heck,
DBIC is getting there, but the data mining and integration piece is missing.

-- 
Brendan Byrd/SineSwiper sineswi...@gmail.com
Computer tech, PERL wizard, and all-around Internet guru


Re: Database/DBD Bridging?

2011-09-22 Thread David E . Wheeler
On Sep 21, 2011, at 7:53 PM, Brendan Byrd wrote:

 Okay, this is a big blue sky idea, but like all things open-source, it comes
 out of a need.  I'm trying to merge together Excel (or CSV), Oracle, Fusion
 Tables, JSON, and SNMP for various data points and outputs.  DBIC seems to
 work great for a large database with a bunch of tables, but what about a
 bunch of databases?  I've searched and searched, and nobody seemed to have
 designed a DBD for multiple DBDs.  There's DBD::Multi and Multiplex, but
 that's merely for replication.  This would require reparsing of SQL
 statements.

I think it'd be simpler to use something like SQL/MED. In PostgreSQL 9.1, for 
example, you can install foreign data wrappers so that you can create tables 
inside PostgreSQL that actually point to CSV, MySQL, Oracle, Redis, or any 
number of other sources. It's read-only right now, though that's often all 
people need. Some example FDWs:

 http://www.postgresql.org/docs/current/static/file-fdw.html
 http://pgxn.org/tag/fdw
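A rough sketch of the setup driven from Perl through DBD::Pg (the file path,
column list and database names are made up, and it assumes PostgreSQL 9.1+
with file_fdw installed):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=bridge', 'user', 'pass', { RaiseError => 1 });

# One-time setup: expose a CSV file as a (read-only) foreign table.
$dbh->do('CREATE EXTENSION file_fdw');
$dbh->do('CREATE SERVER csv_files FOREIGN DATA WRAPPER file_fdw');
$dbh->do(q{
    CREATE FOREIGN TABLE people_csv (person_id integer, person text, dob date)
    SERVER csv_files
    OPTIONS (filename '/data/people.csv', format 'csv', header 'true')
});

# After that, Postgres does the cross-source join itself.
my $rows = $dbh->selectall_arrayref(q{
    SELECT b.title, p.person
      FROM books b JOIN people_csv p ON b.person_id = p.person_id
});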

ISTR that SQL Server has had similar capabilities for a long time, and I know 
Access has. Hell, ODBC might get you a long way toward it.

In short, I think that a multi-backend DBD is probably the wrong level at which 
to do this, waay too much work.

Best,

David



Re: Database/DBD Bridging?

2011-09-22 Thread Darren Duncan

Brendan,

Taking into account David's response ...

I have several answers for you:

1.  The functionality you are talking about is often referred to as a federated 
database, where you have one database engine that the client talks to which 
turns around and farms some of the work off to various other database engines, 
coordinating between them and doing some extra work itself.  What you initially 
proposed is essentially using Perl as the client-most database engine, and what 
David proposed is using PostgreSQL as the client-most engine instead.


2.  If PostgreSQL 9.1's SQL/MED abilities will do what you need, then use it.

3.  If immense scalability isn't needed and you want to do this in Perl then I 
recommend you look at my Set::Relation CPAN module, which will handle a lot of 
the difficult work for you.  Specifically, you would still manage for yourself 
the parceling out of queries to the various back-end DBMSs like Oracle or Excel 
or whatever, but Set::Relation will then take care of all the drudgery of taking 
the various rowsets returned from those and combine them into the various 
end-result queries you actually wanted to do.  Each Set::Relation object 
contains a rowset and you can use its several dozen methods to do relational 
joins or aggregates or antijoins or various other things, all the functionality 
of SQL, but in Perl.  Its main limiting factor is that it is entirely RAM-based, 
though this also makes it simpler.  So you can do this right now.


4.  I am presently implementing a relational DBMS in Perl which provides all the 
functionality you described and more, including query language support that lets 
you write code like you demonstrated.  Strictly speaking the initial version is 
fully self-contained for simplicity, but a subsequent/spinoff version would add 
the ability to farm out to other database engines as per SQL/MED, and this *is* 
designed to scale.  It even uses the paradigm you mention, where each underlying 
engine is essentially a namespace in which tables live, and you can join between 
them as if they were all local; or to be more accurate, each database 
*connection* has its own namespace and the underlying engine is just a quality 
of that connection, like with how DBI lets you have multiple connections with 
the same driver.


So if this sounds like something you want to help create, please talk with me.

-- Darren Duncan

Brendan Byrd wrote:

Okay, this is a big blue sky idea, but like all things open-source, it comes
out of a need.  I'm trying to merge together Excel (or CSV), Oracle, Fusion
Tables, JSON, and SNMP for various data points and outputs.  DBIC seems to
work great for a large database with a bunch of tables, but what about a
bunch of databases?  I've searched and searched, and nobody seems to have
designed a DBD for multiple DBDs.  There's DBD::Multi and Multiplex, but
those are merely for replication.  This would require reparsing of SQL
statements.

So, let's call this module idea DBD::IntegrateDB or MultiDB.  It would be a
module built from SQL::Statement (using the typical Embed instructions), so
it would use that module's SQL Engine for parsing and processing SQL.

We'll use a simple example of two databases: one Oracle, and one MySQL.
This module loads both of them in with a DBI->connect string.  Then the dev
runs a prepare on the following SQL:

SELECT
   book, title, b.person, age, dob
FROM
   ora.books b
   INNER JOIN mysql.people p ON (
  b.person_id = p.person_id
   )

So, ora.books is on the Oracle DB, and mysql.people is on the MySQL DB.
The parser for this MultiDB would:

   1. Use SQL::Parser to break down the SQL statement.
   2. Figure out who owns what, in terms of tables and columns.  (Complain
   about ambiguous columns if it has to.)
   3. Use table_info calls to the separate DBI interfaces, including number
   of rows, cardinality (if available), etc.
   4. Store the joining information.
   5. Prepare two *separate* SQL statements, one for each DB.  They would no
   longer be JOIN queries, but standard queries against the individual tables
   (including person_id, which wasn't included in the original SELECT statement).

Then when the statement is executed:

   1. The two SQL statements are executed for each DB.
   2. The fetch_row sub would process each row one at a time for each DB.
   3. If two IDs match, send a row back.  Otherwise, cache the data and wait
   for something to match.
   4. Repeat until the rows are exhausted on one or both sides.  (One side
   for INNER, both sides for OUTER.)

Does anything like that exist?  I'm not saying it's an easy operation, but
if something like that could just start off with simple JOINs at first, it
would be a miracle module.  Imagine linking with more abstract DBI modules:
Oracle to CSV to MySQL to Teradata to Sys to Sponge.  Tell me you're not
excited at the prospect of eventually creating free open-source frameworks
competitive with multi-million-dollar Business Intelligence software.  Heck,