The problem with PostgreSQL's SQL/MED is that it's not Perl, and it won't
work for some of the more abstract data sources available as DBDs.  I would
like to tie this DBD::FederatedDB into DBIC, so that it can search and
insert everything on-the-fly.  Shoving everything into RAM isn't right,
either, since DBD::AnyData can already do that.  The whole point of having
the databases process the rows one at a time is so that it can handle
10-million-row tables without a wasteful full-table dump.  It looks like
Set::Relation would work out great for sucking in table_info/row_info data,
and could be used as the temp cache as fractured rows come in.
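
Here's roughly what I'm picturing for the metadata side, as a sketch (the
DSN and credentials are hypothetical; table_info is stock DBI):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:Oracle:mydb', 'user', 'pass');

    # Pull the catalog from one back end; this is the kind of data
    # that would get sucked into the Set::Relation-based dictionary.
    my $tables = $dbh->table_info(undef, undef, '%', 'TABLE');
    while (my $t = $tables->fetchrow_hashref) {
        printf "%s.%s\n", $t->{TABLE_SCHEM} || '', $t->{TABLE_NAME};
    }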

I would be highly interested in developing this with you.  I'm spread pretty
thin with several other Perl modules, so I otherwise wouldn't tackle it
right now.  But, if you already have something started, we can try to finish
it, and that's much better than starting from scratch alone.

Do you have a repository for this new module yet?  What are you calling it?
I take it the module is building off of SQL::Statement?

On Fri, Sep 23, 2011 at 12:27 AM, Darren Duncan <dar...@darrenduncan.net> wrote:

> Brendan,
>
> Taking into account David's response ...
>
> I have several answers for you:
>
> 1.  The functionality you are talking about is often referred to as a
> federated database, where you have one database engine that the client talks
> to which turns around and farms some of the work off to various other
> database engines, coordinating between them and doing some extra work
> itself.  What you initially proposed is essentially using Perl as the
> client-most database engine, and what David proposed is using PostgreSQL as
> the client-most engine instead.
>
> 2.  If PostgreSQL 9.1's SQL/MED abilities will do what you need, then use
> it.
>
> 3.  If immense scalability isn't needed and you want to do this in Perl
> then I recommend you look at my Set::Relation CPAN module, which will handle
> a lot of the difficult work for you.  Specifically, you would still manage
> for yourself the parceling out of queries to the various back-end DBMSs like
> Oracle or Excel or whatever, but Set::Relation will then take care of all
> the drudgery of taking the various rowsets returned from those and
> combining them into the end results of the queries you actually wanted to
> run.  Each
> Set::Relation object contains a rowset and you can use its several dozen
> methods to do relational joins or aggregates or antijoins or various other
> things, all the functionality of SQL, but in Perl.  Its main limiting factor
> is that it is entirely RAM-based, though this also makes it simpler.  So you
> can do this right now.
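>
> For example, a quick sketch of that usage from memory (the sample tuples
> are made up):
>
>   use Set::Relation::V2;
>
>   my $books = Set::Relation::V2->new( members => [
>       { title => 'Dune', person_id => 1 },
>       { title => 'Emma', person_id => 2 },
>   ] );
>   my $people = Set::Relation::V2->new( members => [
>       { person_id => 1, person => 'Frank' },
>   ] );
>
>   # Natural join on the common person_id attribute.
>   my $joined = $books->join( $people );
>   print scalar @{ $joined->members }, " matched tuple(s)\n";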
>
> 4.  I am presently implementing a relational DBMS in Perl which provides
> all the functionality you described and more, including query language
> support that lets you write code like you demonstrated.  Strictly speaking
> the initial version is fully self-contained for simplicity, but a
> subsequent/spinoff version would add the ability to farm out to other
> database engines as per SQL/MED, and this *is* designed to scale.  It even
> uses the paradigm you mention, where each underlying engine is essentially a
> namespace in which tables live, and you can join between them as if they
> were all local; or to be more accurate, each database *connection* has its
> own namespace and the underlying engine is just a quality of that
> connection, like with how DBI lets you have multiple connections with the
> same driver.
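>
> A trivial DBI sketch of that analogy (hypothetical DSNs):
>
>   use DBI;
>
>   my $conn_a = DBI->connect('dbi:Oracle:hr');     # one driver ...
>   my $conn_b = DBI->connect('dbi:Oracle:sales');  # ... two namespaces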
>
> So if this sounds like something you want to help create, please talk with
> me.
>
> -- Darren Duncan
>
> Brendan Byrd wrote:
>
>> Okay, this is a big blue sky idea, but like all things open-source, it
>> comes
>> out of a need.  I'm trying to merge together Excel (or CSV), Oracle,
>> Fusion
>> Tables, JSON, and SNMP for various data points and outputs.  DBIC seems to
>> work great for a large database with a bunch of tables, but what about a
>> bunch of databases?  I've searched and searched, and nobody seems to have
>> designed a DBD for multiple DBDs.  There's DBD::Multi and DBD::Multiplex,
>> but those are merely for replication.  This would require reparsing of
>> SQL statements.
>>
>> So, let's call this module idea DBD::IntegrateDB or MultiDB.  It would be
>> a module built from SQL::Statement (using the typical Embed instructions),
>> so it would use that module's SQL engine for parsing and processing SQL.
>>
>> We'll use a simple example of two databases: one Oracle, and one MySQL.
>> This module loads both of them in with a DBI->connect string.  Then the
>> dev runs a prepare on the following SQL:
>>
>> SELECT
>>   book, title, p.person, age, dob
>> FROM
>>   ora.books b
>>   INNER JOIN mysql.people p ON (
>>      b.person_id = p.person_id
>>   )
>>
>> So, "ora.books" is on the Oracle DB, and "mysql.people" is on the MySQL
>> DB.
>> The parser for this MultiDB would:
>>
>>   1. Use SQL::Parser to break down the SQL statement.
>>   2. Figure out who owns what, in terms of tables and columns.  (Complain
>>   about ambiguous columns if it has to.)
>>   3. Use table_info calls to the separate DBI interfaces, including
>>   number of rows, cardinality (if available), etc.
>>   4. Store the joining information.
>>   5. Prepare two *separate* SQL statements, one for each DB.  These would
>>   no longer be JOIN queries, but standard queries for the tables
>>   (including person_id, which wasn't included in the original SELECT
>>   statement).  A sketch of this step follows below.
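>>
>> For instance, step 5 might come out roughly like this (just a sketch;
>> the DSNs and credentials are hypothetical):
>>
>>   use strict;
>>   use warnings;
>>   use DBI;
>>
>>   my $ora_dbh   = DBI->connect('dbi:Oracle:mydb', 'user', 'pass');
>>   my $mysql_dbh = DBI->connect('dbi:mysql:mydb',  'user', 'pass');
>>
>>   # The JOIN itself is gone; person_id is added to the books side
>>   # so the rows can be matched up later.
>>   my $books_sth  = $ora_dbh->prepare(
>>       'SELECT book, title, person_id FROM books');
>>   my $people_sth = $mysql_dbh->prepare(
>>       'SELECT person_id, person, age, dob FROM people');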
>>
>> Then when the statement is executed:
>>
>>   1. The two SQL statements are executed, one against each DB.
>>   2. The fetch_row sub would process each row one at a time for each DB.
>>   3. If two IDs match, send a row back.  Otherwise, cache the data and
>>   wait for something to match.
>>   4. Repeat until the rows are exhausted on one or both sides.  (One side
>>   for INNER, both sides for OUTER.)  A sketch of this loop follows below.
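>>
>> A rough sketch of that loop for the INNER case (hash-based caching;
>> assumes the two statement handles from the sketch above have been
>> executed, and send_row is just a stand-in for handing a joined row back):
>>
>>   my (%book_cache, %person_cache);
>>   sub send_row { my ($r) = @_; print join(', ', %$r), "\n" }  # stand-in
>>
>>   while (my $b = $books_sth->fetchrow_hashref('NAME_lc')) {
>>       my $id = $b->{person_id};
>>       if (my $p = $person_cache{$id}) {
>>           send_row({ %$p, %$b });          # IDs matched: emit a row
>>       }
>>       else {
>>           push @{ $book_cache{$id} }, $b;  # cache until a match shows up
>>       }
>>       # ... interleave fetches from $people_sth the same way, checking
>>       # %book_cache and filling %person_cache, until both sides are done.
>>   }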
>>
>> Does anything like that exist?  I'm not saying it's an easy operation,
>> but if something like that could just start off with simple JOINs at
>> first, it would be a miracle module.  Imagine linking with more abstract
>> DBI modules: Oracle to CSV to MySQL to Teradata to Sys to Sponge.  Tell
>> me you're not excited at the prospect of eventually creating free
>> open-source competitive frameworks to multi-million-dollar "Business
>> Intelligence" software.  Heck, DBIC is getting there, but the data mining
>> and integration piece is missing.
>>
>>
>


-- 
Brendan Byrd/SineSwiper <sineswi...@gmail.com>
Computer tech, Perl wizard, and all-around Internet guru
