Re: Add Unicode Support to the DBI

2011-09-22 Thread Martin J. Evans

On 21/09/11 21:52, Greg Sabino Mullane wrote:



...

And maybe that's the default. But I should be able to tell it to be pedantic 
when the data is known to be bad (see, for example, data from an 
SQL_ASCII-encoded PostgreSQL database).

...

DBD::Pg's approach is currently broken. Greg is working on fixing it, but for 
compatibility reasons the fix is non-trivial (and the API might be, too). In a 
perfect world DBD::Pg would just always do the right thing, as the database 
tells it what encodings to use when you connect (and *all* data is encoded as 
such, not just certain data types). But the world is not perfect; there's a 
lot of legacy stuff.

Greg, care to add any other details?


My thinking on this has changed a bit. See the DBD::Pg in git head for a 
sample, but basically,
DBD::Pg is going to:

* Flip the flag on if the client_encoding is UTF-8 (and server_encoding is not 
SQL_ASCII)
* Flip it off if not

The single switch will be pg_unicode_flag, which will basically override the 
automatic choice above, just in case you really want your SQL_ASCII byte soup 
marked as utf8 for some reason, or (more likely) you want your data unmarked 
as utf8 despite being so.

This does rely on PostgreSQL doing the right thing when it comes to 
encoding/decoding/storing
all the encodings, but I'm pretty sure it's doing well in that regard.
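
Spelled out as code, the decision rule described above might look like this (a 
sketch only, not DBD::Pg's actual implementation; pg_unicode_flag is the 
attribute named above, and the encoding values are assumed already fetched):

    # Sketch of the automatic choice plus the pg_unicode_flag override.
    sub pg_should_mark_utf8 {
        my ($h, $client_encoding, $server_encoding) = @_;

        # An explicit pg_unicode_flag always overrides the automatic choice.
        return $h->{pg_unicode_flag} if defined $h->{pg_unicode_flag};

        # Otherwise: flag on iff the client encoding is UTF-8 and the
        # server encoding is not SQL_ASCII.
        return $client_encoding eq 'UTF8' && $server_encoding ne 'SQL_ASCII';
    }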

...

Since nobody has actually defined a specific interface yet, let me throw out a
straw man. It may look familiar :)

===
* $h->{unicode_flag}

If this is set on, data returned from the database is assumed to be UTF-8, and
the utf8 flag will be set. DBDs will decode the data as needed.


There is more than one way to encode unicode - not everyone uses UTF-8 
(although some encodings don't support all of unicode).

unicode is not encoded as UTF-8 in ODBC using the wide APIs.

Using the wide ODBC APIs returns data in UCS-2 encoding and DBD::ODBC decodes 
it. Using the ANSI APIs, data is returned as octets and is whatever it is - it 
may be ASCII, it may be UTF-8 encoded (only in 2 cases I know of, and I believe 
they are flawed anyway), or it may be something else, in which case the 
application needs to know what it is. In the case of octets which are UTF-8 
encoded, DBD::ODBC has no idea that is the case unless you tell it, and it 
will then set the UTF-8 flag (but see later).


If this is set off, the utf8 flag will never be set, and no decoding will be 
done
on data coming back from the database.

If this is not set (undefined), the underlying DBD is responsible for doing the
correct thing. In other words, the behaviour is undefined.
===
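
To make the straw man concrete, here is a rough sketch of what a driver's fetch 
path might do with it, assuming the wire data is UTF-8 octets (the helper name 
is invented for illustration):

    use Encode qw(decode);

    # One fetched value, the three unicode_flag states from the straw man.
    sub apply_unicode_flag {
        my ($h, $octets) = @_;

        # undefined: behaviour is up to the DBD, so leave the value alone here
        return $octets unless defined $h->{unicode_flag};

        # off: raw octets, utf8 flag never set
        return $octets unless $h->{unicode_flag};

        # on: data is assumed UTF-8, so genuinely decode it (this sets the
        # utf8 flag and croaks on invalid input; LEAVE_SRC keeps $octets intact)
        return decode('UTF-8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);
    }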

I don't think this will fit into DBD::Pg's current implementation perfectly, as
we wouldn't want people to simply leave $h->{unicode_flag} on, as that would
force SQL_ASCII text to have utf8 flipped on. Perhaps we simply never, ever
allow that.


I'm not that familiar with Postgres (I've used it a few times and not to any 
great degree) and I used MySQL for a while years ago. I occasionally use 
SQLite. I do use DBD::Oracle and DBD::ODBC all the time. I'm still struggling 
to see the problem that needs fixing. Is it just that some people would like a 
DBI flag which tells the DBD (see the sketch below):

1) decode any data coming back from the database strictly, such that if it is 
invalid you die
2) decode any data coming back from the database loosely (e.g., utf8 vs UTF-8)
3) don't decode the data from the database at all
4) don't decode the data; the DBD knows it is, say, UTF-8 encoded and simply 
sets the UTF-8 flag (which from what I read is horribly flawed but seems to 
work for me)

and the reverse.
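
For anyone unfamiliar with the distinction, the four options map onto Encode 
roughly like this (a sketch; option 4 uses Encode::_utf8_on, the Perl-space 
equivalent of the SvUTF8_on XS call):

    use Encode qw(decode);

    my $octets = "caf\xC3\xA9";    # the UTF-8 bytes for "café"

    # 1) strict decode: croaks if the bytes are not valid UTF-8
    my $strict = decode('UTF-8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);

    # 2) loose decode: Perl's lax "utf8", malformed bytes become U+FFFD
    my $loose = decode('utf8', $octets);

    # 3) no decoding at all: hand the octets to the application untouched

    # 4) no decoding, just flip the flag; cheap, but trusts the bytes
    my $flagged = $octets;
    Encode::_utf8_on($flagged);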

DBD::Oracle does 1 some of the time and does 4 the rest of the time, e.g. 
error messages are fully decoded from UTF-8 IF Oracle is sending UTF-8, and it 
does 4 on most of the column data IF Oracle is sending UTF-8.

DBD::ODBC does nothing via the ANSI APIs unless the odbc_utf8 flag is turned 
on, in which case it does 4 (and it only does this because there is apparently 
a version of the Postgres ODBC driver out there somewhere that returns UTF-8 
encoded data; I've never seen it, I just accepted the patch).

DBD::ODBC does 1 if using the wide APIs and it has little choice since no one 
would want to accept UCS2 and have to decode it all the time.

My point being, doesn't the DBD know how the data is encoded when it gets it 
from the database? And it would hopefully know what the database needs when 
sending data. Perhaps in some conditions the DBD does not know this and needs 
to be told (I could imagine SQLite reading/writing straight to files, for 
instance, might want to know to open the file with a UTF-8 layer).

So is the problem that sometimes a DBD does not know how to encode data being 
sent to the database, or how/whether to decode data coming back from the 
database? And if that is the case, do we need some settings in DBI to tell a DBD?

Martin
--
Martin J. Evans
Easysoft Limited
http://www.easysoft.com


Re: Add Unicode Support to the DBI

2011-09-22 Thread Martin J. Evans

David, I forgot to answer your post first and ended up putting most of my 
comments in a reply to Greg's posting - sorry, it was a long night last night. 
Some further comments below:

On 21/09/11 19:44, David E. Wheeler wrote:

On Sep 10, 2011, at 3:08 AM, Martin J. Evans wrote:


I'm not sure any change is required to DBI to support unicode. As
far as I'm aware unicode already works with DBI if the DBDs do the
right thing.


Right, but the problem is that, IME, none of them do the right
thing. As I said, I've submitted encoding-related bug reports for
every DBD I've used in production code. And they all have different
interfaces for tweaking things.


So it is these different interfaces for tweaking things you'd like to 
standardise, I presume.


If you stick to the rule that all data Perl receives must be
decoded and all data Perl exports must be encoded it works
(ignoring any issues in Perl itself).


Er, was there supposed to be a ", then …" statement there?


I bow to Tom's experience, but I'm still not sure how that applies
to DBI; so long as the interface between the database and Perl
always encodes and decodes, the issues Tom describes are all
Perl ones - no?


The trouble is that:

1. They don't always encode or decode
2. When they do, they tend to get subtle bits wrong
3. And they all have different interfaces and philosophies for doing so


Surely Oracle should return the data encoded as you asked for it
and if it did not Oracle is broken. I'd still like to see this case
and then we can see if Oracle is broken and if there is a fix for
it.


Oh I don't doubt that Oracle is broken.


In some places DBD::Oracle does sv_utf8_decode(scalar) or
SvUTF8_on(scalar) (depending on your Perl) and in some places it
just does SvUTF8_on(scalar). I believe the latter is much quicker
as the data is not checked. Many people (myself included) are
particularly interested in DBD::Oracle being fast and if all the
occurrences were changed to decode I'd patch that out in my copy as
I know the data I receive is UTF-8 encoded.


IME it needs an "assume Oracle is broken" knob. That is, I should
have the option to enforce encoding and decoding, rather than just
flipping SvUTF8. And I think that such an interface should be
standardized in the DBI, along with detailed information for driver
authors on how to get it right.


ok, I get that.


See above. I'd like the chance to go with speed and take the
consequences, rather than go slower but know incorrect UTF-8 is
spotted.


And maybe that's the default. But I should be able to tell it to be
pedantic when the data is known to be bad (see, for example data from
an SQL_ASCII-encoded PostgreSQL database).


I thought UTF-8 when used in Perl used the strict definition and
utf8 used Perl's looser definition - see
http://search.cpan.org/~dankogai/Encode-2.44/Encode.pm#UTF-8_vs._utf8_vs._UTF8





That's right. So if I want to ensure that I'm getting strict encoding
in my database, it needs to encode and decode, not simply flip
SvUTF8.


Don't DBDs do this now? I know the encoding of the data I receive
in DBD::ODBC and decode it when I get it and encode it when I send
it, and I believe that is what DBD::Oracle does as well. There is
one exception in ODBC for drivers which don't truly abide by the ODBC
spec and send 8 bit data back UTF-8 encoded (see later).


There is no single API for configuring this in the DBI, and I argue
there should be.


I've spent a lot of effort getting unicode working in DBD::ODBC
(for UNIX and with patches from Alexander Foken for Windows) which
is implemented in an awkward fashion in ODBC. I'd like to hear from
DBD authors what support they already have and how it is
implemented so we can see what ground is already covered and where
the problems were.


DBD::Pg's approach is currently broken. Greg is working on fixing it,
but for compatibility reasons the fix is non-trivial (and the API
might be, too). In a perfect world DBD::Pg would just always do the
right thing, as the database tells it what encodings to use when you
connect (and *all* data is encoded as such, not just certain data
types). But the world is not perfect, there's a lot of legacy stuff.

Greg, care to add any other details?


as I remain unconvinced a problem exists other than incorrectly
coded DBDs. I'm happy to collate that information. As a start I'll
describe DBD::ODBC:

1. ODBC has 2 sets of APIs, SQLxxxA (each char is 8 bits) and
SQLxxxW (each char is 16 bits, UCS-2). This is how Microsoft did
it, and yes, I know that does not support all of unicode, but code
pages get involved too.

2. You select which API you are using with a macro when you compile
your application, so you cannot change your mind. You can in theory
call SQLxxxA or SQLxxxW functions directly, but if you use SQLxxx
you get the A or W variant depending on what the macro is set to.
Problem: DBD::ODBC has to be built one way or the other.

3. When using the SQLxxxA functions you can still bind
columns/parameters as wide 

Database/DBD Bridging?

2011-09-22 Thread Brendan Byrd
Okay, this is a big blue sky idea, but like all things open-source, it comes
out of a need.  I'm trying to merge together Excel (or CSV), Oracle, Fusion
Tables, JSON, and SNMP for various data points and outputs.  DBIC seems to
work great for a large database with a bunch of tables, but what about a
bunch of databases?  I've searched and searched, and nobody seemed to have
designed a DBD for multiple DBDs.  There's DBD::Multi and Multiplex, but
that's merely for replication.  This would require reparsing of SQL
statements.

So, let's call this module idea DBD::IntegrateDB or MultiDB.  It would be a
module built from SQL::Statement (using the typical Embed instructions), so
it would use that module's SQL Engine for parsing and processing SQL.

We'll use a simple example of two databases: one Oracle, and one MySQL.
This module loads both of them in with a DBI->connect string.  Then the dev
runs a prepare on the following SQL:

SELECT
   book, title, b.person, age, dob
FROM
   ora.books b
   INNER JOIN mysql.people p ON (
      b.person_id = p.person_id
   )

So, ora.books is on the Oracle DB, and mysql.people is on the MySQL DB.
The parser for this MultiDB would:

   1. Use SQL::Parser to break down the SQL statement (see the sketch after
   this list).
   2. Figure out who owns what, in terms of tables and columns.  (Complain
   about ambiguous columns if it has to.)
   3. Use table_info calls to the separate DBI interfaces, including number
   of rows, cardinality (if available), etc.
   4. Store the joining information.
   5. Prepare two *separate* SQL statements, one for each DB.  These would no
   longer be JOIN queries, but standard queries for the tables (including
   person_id, which wasn't included in the original SELECT statement).
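
For step 1, the parsing side already exists; a minimal sketch (the structure 
keys shown vary a little between SQL::Statement releases, and schema-qualified 
names like ora.books would need handling of their own, so plain table names 
are used here):

    use strict;
    use warnings;
    use SQL::Parser;

    my $parser = SQL::Parser->new('ANSI', { RaiseError => 1 });

    $parser->parse(q{
        SELECT book, title, b.person, age, dob
        FROM ora_books b INNER JOIN mysql_people p
             ON b.person_id = p.person_id
    });

    # Inspect what the parser extracted; dump $parser->structure to see
    # the exact keys your installed version provides.
    my $structure = $parser->structure;
    printf "command: %s\n", $structure->{command};
    printf "tables:  %s\n", join ', ', @{ $structure->{table_names} || [] };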

Then when the statement is executed (a simplified sketch follows this list):

   1. The two SQL statements are executed, one for each DB.
   2. The fetch_row sub would process each row one at a time for each DB.
   3. If two IDs match, send a row back.  Otherwise, cache the data and wait
   for something to match.
   4. Repeat until the rows are exhausted on one or both sides.  (One side
   for INNER, both sides for OUTER.)
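
The matching step, reduced to its simplest form, is an ordinary hash join; here 
is a sketch for the INNER JOIN case (DSNs and credentials are placeholders, and 
it builds one whole side in memory rather than doing the symmetric 
row-at-a-time caching described above):

    use strict;
    use warnings;
    use DBI;

    my $ora_dbh = DBI->connect('dbi:Oracle:ORCL', 'user', 'pass',
                               { RaiseError => 1 });
    my $my_dbh  = DBI->connect('dbi:mysql:database=test', 'user', 'pass',
                               { RaiseError => 1 });

    # The two rewritten single-table statements from step 5 above.
    my $books  = $ora_dbh->prepare('SELECT book, title, person_id FROM books');
    my $people = $my_dbh->prepare('SELECT person_id, person, age, dob FROM people');
    $books->execute;
    $people->execute;

    # Build a lookup on the join column from one side ...
    my %people_by_id;
    while (my $row = $people->fetchrow_hashref) {
        push @{ $people_by_id{ $row->{person_id} } }, $row;
    }

    # ... then stream the other side, emitting a row on each match.
    while (my $row = $books->fetchrow_hashref) {
        for my $match (@{ $people_by_id{ $row->{person_id} } || [] }) {
            print join("\t", @{$row}{qw(book title)},
                             @{$match}{qw(person age dob)}), "\n";
        }
    }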

Does anything like that exist?  I'm not saying it's an easy operation, but
if something like that could just start off with simple JOINs at first, it
would be a miracle module.  Imagine linking with more abstract DBI modules:
Oracle to CSV to MySQL to Teradata to Sys to Sponge.  Tell me you're not
excited at the prospect of eventually creating free open-source competitive
frameworks to multi-million dollar Business Intelligence software.  Heck,
DBIC is getting there, but the data mining and integration piece is missing.

-- 
Brendan Byrd/SineSwiper sineswi...@gmail.com
Computer tech, PERL wizard, and all-around Internet guru


Re: Database/DBD Bridging?

2011-09-22 Thread David E . Wheeler
On Sep 21, 2011, at 7:53 PM, Brendan Byrd wrote:

 Okay, this is a big blue sky idea, but like all things open-source, it comes
 out of a need.  I'm trying to merge together Excel (or CSV), Oracle, Fusion
 Tables, JSON, and SNMP for various data points and outputs.  DBIC seems to
 work great for a large database with a bunch of tables, but what about a
 bunch of databases?  I've searched and searched, and nobody seemed to have
 designed a DBD for multiple DBDs.  There's DBD::Multi and Multiplex, but
 that's merely for replication.  This would require reparsing of SQL
 statements.

I think it'd be simpler to use something like SQL/MED. In PostgreSQL 9.1, for 
example, you can install foreign data wrappers so that you can create tables 
inside PostgreSQL that actually point to CSV, MySQL, Oracle, Redis, or any 
number of other sources. It's read-only right now, though that's often all 
people need. Some example FDWs:

 http://www.postgresql.org/docs/current/static/file-fdw.html
 http://pgxn.org/tag/fdw

ISTR that SQL Server has had similar capabilities for a long time, and I know 
Access has. Hell, ODBC might get you a long way toward it.

In short, I think that a multi-backend DBD is probably the wrong level at which 
to do this, waay too much work.

Best,

David



Re: Add Unicode Support to the DBI

2011-09-22 Thread Martin J. Evans

On 22/09/2011 17:36, David E. Wheeler wrote:

On Sep 22, 2011, at 2:26 AM, Martin J. Evans wrote:


There is more than one way to encode unicode - not everyone uses UTF-8; 
although some encodings don't support all of unicode.

Yeah, maybe it should be utf8_flag instead.

see below.

unicode is not encoded as UTF-8 in ODBC using the wide APIs.

Using the wide ODBC APIs returns data in UCS-2 encoding and DBD::ODBC decodes 
it. Using the ANSI APIs, data is returned as octets and is whatever it is - it 
may be ASCII, it may be UTF-8 encoded (only in 2 cases I know of, and I believe 
they are flawed anyway), or it may be something else, in which case the 
application needs to know what it is. In the case of octets which are UTF-8 
encoded, DBD::ODBC has no idea that is the case unless you tell it, and it 
will then set the UTF-8 flag (but see later).

Right. There needs to be a way to tell the DBI what encoding the server sends 
and expects to be sent. If it's not UTF-8, then the utf8_flag option is kind of 
useless.
I think this was my point above, i.e., why utf8? Databases accept and 
supply a number of encodings, so why have a flag called utf8? Are we 
going to have ucs2, utf16, utf32 flags as well? Surely, it makes more 
sense to have a flag where you can set the encoding in the same form 
Encode uses.

I'm not that familiar with Postgres (I've used it a few times and not to any 
great degree) and I used MySQL for a while years ago. I occasionally use 
SQLite. I do use DBD::Oracle and DBD::ODBC all the time. I'm still struggling 
to see the problem that needs fixing. Is it just that some people would like a 
DBI flag which tells the DBD:

1) decode any data coming back from the database strictly, such that if it is 
invalid you die
2) decode any data coming back from the database loosely (e.g., utf8 vs UTF-8)
3) don't decode the data from the database at all
4) don't decode the data; the DBD knows it is, say, UTF-8 encoded and simply 
sets the UTF-8 flag (which from what I read is horribly flawed but seems to 
work for me)

and the reverse.

Yes, with one API for all drivers, if possible, and guidelines for how it 
should work (when to encode and decode, what to encode and decode, when to just 
flip the utf8 flag on and off, etc.).

ok

DBD::Oracle does 1 some of the time and does 4 the rest of the time, e.g. 
error messages are fully decoded from UTF-8 IF Oracle is sending UTF-8, and it 
does 4 on most of the column data IF Oracle is sending UTF-8.

Yeah, but to enable it *you set a bloody environment variable*. WHAT?
Unless I'm mistaken as to what you refer to, I believe that is a feature 
of the Oracle client libraries and not one of DBD::Oracle, so there is 
little we can do about that.



My point being, doesn't the DBD know how the data is encoded when it gets it 
from the database? And it would hopefully know what the database needs when 
sending data. Perhaps in some conditions the DBD does not know this and needs 
to be told (I could imagine SQLite reading/writing straight to files, for 
instance, might want to know to open the file with a UTF-8 layer).

Or to turn it off, so you can just pass the encoded UTF-8 through to the file 
without the decode/encode round-trip.


So is the problem that sometimes a DBD does not know how to encode data being 
sent to the database, or how/whether to decode data coming back from the 
database? And if that is the case, do we need some settings in DBI to tell a DBD?

That's an issue, yes, but the main issue is that all the drivers do it 
differently, sometimes with different semantics, and lack all the functionality 
one might want (e.g., your examples 1-4).

Best,

David



So to try and move forward, we'd be talking about a flag or flags which say:

1 encode the data sent to the database like this (which could be nothing)
2 decode the data retrieved from the database like this (which could 
be nothing, but if not nothing it could be using strict or loose for the 
UTF-8 and utf8 case)
3 don't decode but use SvUTF8_on (a specific case since Perl uses that 
internally and a number of databases return UTF-8) - the one that seems 
to work but that I worry about
4 do what the DBD thinks is best - whatever the behaviour is now?

and what about when it conflicts with your locale/LANG?

and what about PERL_UNICODE flags, do they come into this?

and what about when the DBD knows you are wrong because the database 
says it is returning data in encoding X but you ask for Y?


and for DBD::ODBC built for the unicode API, am I expected to try and decode 
UCS2 as X just because the flag tells me to, when I know it will not work? 
Seems like it only applies to the ANSI API in DBD::ODBC, where the data 
could be UTF-8 encoded in a few (possibly broken, see 
http://www.martin-evans.me.uk/node/20#unicode) cases.


I still think it would help to name some specific cases per DBD of flags 
in use and why they exist:


DBD::ODBC has an odbc_utf8_on flag to say that data returned by the 
database when using the ANSI APIs is 

Re: Add Unicode Support to the DBI

2011-09-22 Thread David E. Wheeler
On Sep 22, 2011, at 11:14 AM, Martin J. Evans wrote:

 Right. There needs to be a way to tell the DBI what encoding the server 
 sends and expects to be sent. If it's not UTF-8, then the utf8_flag option 
 is kind of useless.
 I think this was my point above, i.e., why utf8? Databases accept and supply 
 a number of encodings, so why have a flag called utf8? Are we going to have 
 ucs2, utf16, utf32 flags as well? Surely, it makes more sense to have a flag 
 where you can set the encoding in the same form Encode uses.

Yes, I agreed with you. :-)

 Unless I'm mistaken as to what you refer to I believe that is a feature of 
 the Oracle client libraries and not one of DBD::Oracle so there is little we 
 can do about that.

Sure you can. I set something via the DBI interface and the DBD sets the 
environment variable for the Oracle client libraries.
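
Presumably that means NLS_LANG, the Oracle client variable whose charset 
component drives this; something like the sketch below, where ora_charset is 
an invented attribute name, not a real DBD::Oracle one:

    use strict;
    use warnings;
    use DBI;

    # What the driver could do internally if asked for ora_charset =>
    # 'AL32UTF8': set NLS_LANG before the client library initialises.
    $ENV{NLS_LANG} ||= 'AMERICAN_AMERICA.AL32UTF8';

    my $dbh = DBI->connect('dbi:Oracle:ORCL', 'user', 'pass',
                           { RaiseError => 1 });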

 So to try and move forward, we'd be talking about a flag or flags which say:
 
 1 encode the data sent to the database like this (which could be nothing)
 2 decode the data retrieved from the database like this (which could be 
 nothing, but if not nothing it could be using strict or loose for the UTF-8 
 and utf8 case)
 3 don't decode but use SvUTF8_on (a specific case since Perl uses that 
 internally and a number of databases return UTF-8) - the one that seems to 
 work but that I worry about
 4 do what the DBD thinks is best - whatever the behaviour is now?

Yes.

 and what about when it conflicts with your locale/LANG?

So what?

 and what about PERL_UNICODE flags, do they come into this?

What are those?

 and what about when the DBD knows you are wrong because the database says it 
 is returning data in encoding X but you ask for Y.

Throw an exception or a warning.

 and for DBD::ODBC built for the unicode API, am I expected to try and decode 
 UCS2 as X just because the flag tells me to, when I know it will not work? 
 Seems like it only applies to the ANSI API in DBD::ODBC, where the data could 
 be UTF-8 encoded in a few (possibly broken, see 
 http://www.martin-evans.me.uk/node/20#unicode) cases.

If the user does something that makes no sense, tell them it makes no sense. 
Die if necessary.

 I still think it would help to name some specific cases per DBD of flags in 
 use and why they exist:
 
 DBD::ODBC has an odbc_utf8_on flag to say that data returned by the database 
 when using the ANSI APIs is UTF-8 encoded and currently it calls SvUTF8_on on 
 that data (I've never used or verified it works myself but the person 
 supplying the patch said it had a purpose with a particular Postgres based 
 database).

That's what the new DBD::Pg flag that Greg's working on does, too.

 Beyond that DBD::ODBC has no other flags as it knows in the unicode/wide APIs 
 the data is UCS2 encoded and it checks it is valid when decoding it. 
 Similarly when sending data to the database in the wide APIs it takes the 
 Perl scalar and encodes it in UCS2.

Yeah, ideally, by default, if the DBD knows the encoding used by the database, 
it should just DTRT. There are backward compatibility issues with that for 
DBD::Pg, though. So there probably should be a knob to say "don't do any 
encoding or decoding at all", because a lot of older apps likely expect that.

 DBD::Oracle to my knowledge has no special flags; it just attempts to do the 
 right thing but it favours speed so most data that is supposed to be UTF-8 
 encoded has SvUTF8_on set but in one case (error messages) it properly and 
 strictly decodes the message so long as your Perl is recent enough else it 
 uses SvUTF8_on.
 
 So, what are the other flags in use, and what purpose do they fulfill?

I think we could really just start with one flag, "encoding". By default the 
DBD should just try to do the right thing. If "encoding" is set to ":raw" then 
it should do no encoding or decoding. If it's set to ":utf8" it should just 
turn the flag on or off. If it's set to an actual encoding it should encode and 
decode. I think that would be a good start.
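
As a sketch of what a driver might do per fetched value under that proposal 
(nothing implements this today; the attribute name and helper are illustrative):

    use strict;
    use warnings;
    use Encode qw(find_encoding);

    sub apply_encoding {
        my ($h, $octets) = @_;
        my $enc = $h->{encoding};

        return $octets unless defined $enc;   # default: driver's current DTRT
        return $octets if $enc eq ':raw';     # no encoding or decoding at all

        if ($enc eq ':utf8') {                # just turn the flag on
            Encode::_utf8_on($octets);
            return $octets;
        }

        # a real encoding name: strictly decode via Encode
        my $encoding = find_encoding($enc)
            or die "unknown encoding '$enc'";
        return $encoding->decode($octets, Encode::FB_CROAK);
    }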

Best,

David






Re: Add Unicode Support to the DBI

2011-09-22 Thread David E. Wheeler
On Sep 22, 2011, at 11:57 AM, Martin J. Evans wrote:

 ok, except what the Oracle client libraries accept does not match Encode's 
 accepted strings, so someone would have to come up with some sort of mapping 
 between the two.

Yes. That's one of the consequences of providing a single interface to multiple 
databases.

 and what about when it conflicts with your locale/LANG?
 So what?
 I'm not so sure this is a "So what" as Perl itself uses locale settings in 
 some cases - just thought it needed mentioning for consideration.

I'm not really concerned about locales at this point. I tend to leave 
collation, for example, up to the database. Right now I'm strictly concerned 
about encoding.

 and what about PERL_UNICODE flags, do they come into this?
 What are those?
 See http://perldoc.perl.org/perlrun.html
 
 In particular, UTF-8 is the default PerlIO layer for input streams, of which 
 reading data from a database could be considered one?

That'd be cool, but it's not currently implemented that way, obviously. DBI and 
PerlIO are completely independent AFAIK, and the DBI doesn't look like a file 
handle.

 ok, I'm thinking through the ramifications of this.
 
 To add to the list, I see DBD::SQLite has sqlite_unicode: "strings coming 
 from the database and passed to the collation function will be properly 
 tagged with the utf8 flag; but this only works if the sqlite_unicode 
 attribute is set before the first call to a perl collation sequence" and "The 
 current FTS3 implementation in SQLite is far from complete with respect to 
 utf8 handling: in particular, variable-length characters are not treated 
 correctly by the builtin functions offsets() and snippet()".
 
 and DBD::CSV has
 
 f_encoding => "utf8",
 
 DBD::mysql has mysql_enable_utf8, documented as "This attribute determines 
 whether DBD::mysql should assume strings stored in the database are utf8. 
 This feature defaults to off."
 
 I could not find any special flags for DBD::DB2.
 
 DBD::Sybase has syb_enable_utf8: "If this attribute is set then DBD::Sybase 
 will convert UNIVARCHAR, UNICHAR, and UNITEXT data to Perl's internal utf-8 
 encoding when they are retrieved. Updating a unicode column will cause Sybase 
 to convert any incoming data from utf-8 to its internal utf-16 encoding."


Yeah, so I think that can be generalized.
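
Side by side, the survey above is the same intent spelled four different ways 
(DSNs and credentials here are placeholders), which is exactly what a single 
DBI-level attribute could replace:

    use strict;
    use warnings;
    use DBI;

    my $sqlite = DBI->connect('dbi:SQLite:dbname=test.db', '', '',
                              { sqlite_unicode    => 1 });
    my $csv    = DBI->connect('dbi:CSV:f_dir=.', '', '',
                              { f_encoding        => 'utf8' });
    my $mysql  = DBI->connect('dbi:mysql:database=test', 'user', 'pass',
                              { mysql_enable_utf8 => 1 });
    my $sybase = DBI->connect('dbi:Sybase:server=SYB', 'user', 'pass',
                              { syb_enable_utf8   => 1 });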

Best,

David




Re: Database/DBD Bridging?

2011-09-22 Thread Darren Duncan

Brendan,

Taking into account David's response ...

I have several answers for you:

1.  The functionality you are talking about is often referred to as a 
"federated database", where you have one database engine that the client talks 
to, which turns around and farms some of the work off to various other database 
engines, coordinating between them and doing some extra work itself.  What you 
initially proposed is essentially using Perl as the client-most database 
engine, and what David proposed is using PostgreSQL as the client-most engine 
instead.


2.  If PostgreSQL 9.1's SQL/MED abilities will do what you need, then use it.

3.  If immense scalability isn't needed and you want to do this in Perl then I 
recommend you look at my Set::Relation CPAN module, which will handle a lot of 
the difficult work for you.  Specifically, you would still manage for yourself 
the parceling out of queries to the various back-end DBMSs like Oracle or Excel 
or whatever, but Set::Relation will then take care of all the drudgery of 
taking the various rowsets returned from those and combining them into the 
various end-result queries you actually wanted to do.  Each Set::Relation object 
contains a rowset and you can use its several dozen methods to do relational 
joins or aggregates or antijoins or various other things, all the functionality 
of SQL, but in Perl.  Its main limiting factor is that it is entirely RAM-based, 
though this also makes it simpler.  So you can do this right now.


4.  I am presently implementing a relational DBMS in Perl which provides all the 
functionality you described and more, including query language support that lets 
you write code like you demonstrated.  Strictly speaking the initial version is 
fully self-contained for simplicity, but a subsequent/spinoff version would add 
the ability to farm out to other database engines as per SQL/MED, and this *is* 
designed to scale.  It even uses the paradigm you mention, where each underlying 
engine is essentially a namespace in which tables live, and you can join between 
them as if they were all local; or to be more accurate, each database 
*connection* has its own namespace and the underlying engine is just a quality 
of that connection, like with how DBI lets you have multiple connections with 
the same driver.


So if this sounds like something you want to help create, please talk with me.

-- Darren Duncan

Brendan Byrd wrote:

Okay, this is a big blue sky idea, but like all things open-source, it comes
out of a need.  I'm trying to merge together Excel (or CSV), Oracle, Fusion
Tables, JSON, and SNMP for various data points and outputs.  DBIC seems to
work great for a large database with a bunch of tables, but what about a
bunch of databases?  I've searched and searched, and nobody seemed to have
designed a DBD for multiple DBDs.  There's DBD::Multi and Multiplex, but
that's merely for replication.  This would require reparsing of SQL
statements.

...