Hi,

I have a requirement to read lobs via DBD::ODBC in chunks but there is
currently no way of doing this. I see DBI defines blob_read something
like this:

blob_read => { U =>[4,5,'$field, $offset, $len [, \\$buf [,$bufoffset]]'] }

which I'm guessing means read the lob at column $field from offset
$offset for $len bytes into $buf and, if $bufoffset is specified, read it
into $buf at that offset. I'm presuming $bufoffset is there to
support reading the lob in chunks and appending to $buf?
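
For illustration, this is roughly how I imagine a chunked read would look
if a DBD implemented that prototype. It is purely a sketch based on my
guess above - the column numbering, the chunk size and the assumption that
it returns the data read are all mine, not anything documented:

    my $sth = $dbh->prepare(q/select id, document from documents/);
    $sth->execute;
    $sth->fetch;                      # position on the row first
    my ($lob, $offset, $chunk_size) = ('', 0, 4096);
    while (1) {
        # assumes blob_read returns the chunk read (undef or '' at the end)
        my $frag = $sth->blob_read(2, $offset, $chunk_size);
        last unless defined($frag) && length($frag);
        $lob .= $frag;                # or pass \$buf and $bufoffset and let
        $offset += length($frag);     # blob_read do the appending
    }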

However, I cannot find any DBD which actually uses it, nor can I find
any documentation on it; e.g., DBD::Oracle has ora_lob_read and DBD::Pg
has pg_lo_read, neither of which uses blob_read. I also have a minor
issue with it being called blob_read rather than lob_read, as my
requirement is actually to read large objects which are strings, not
binary objects.

Some background:

There are a few peculiarities in ODBC:

1. DBD::ODBC currently binds columns in ODBC using SQLBindCol. DBD::ODBC
allocates enough space for each column (or LongReadLen), binds the
column as the defaulted or requested type and then calls SQLFetch - all
the columns for a row are retrieved in that one call. DBD::ODBC does not
currently use SQLGetData, which allows the caller to retrieve column
data in chunks. However, the SQLGetData interface does not explicitly
tell you when you have retrieved all of the column: it works by filling
your buffer on each call and telling you how many bytes it could have
returned (had you passed a bigger buffer), returning
SQL_SUCCESS_WITH_INFO while the column is incomplete and SQL_SUCCESS on
the final call, when the bytes returned will be equal to or less than
your buffer length. Fortunately, it does allow a further call to
SQLGetData after all of the column has been retrieved, which returns
SQL_NO_DATA, so you can detect the end without having to compare the
returned byte count with the requested length.

2. you can only read lobs via SQLGetData from the start to the end (or
part way through) and you cannot jump in at an offset - of course, a DBD
could throw away the unwanted bits.

3. some databases do not allow the reading of columns via SQLGetData out
of order when other columns are bound (e.g., MS SQL Server) i.e., you
can bind columns 1-5 then use SQLGetData for column 6 but you cannot
bind columns 1,2,4,5 and use SQLGetData to read column 3 - as that would
be out of order.

4. you cannot partially read column N with SQLGetData, switch to read
part of column N+1 and then return to column N - you have to read all
(or as much as you want) of column N, then move on to column N+1 (and
so on); you cannot go back to column N.

In addition, the $bufoffset argument to DBI's blob_read causes me a
problem since I don't know whether we are dealing in character lengths
or byte lengths. Some explanation is required. ODBC does not really do
UTF-8 encoding but Perl does. ODBC has SQLWxxx functions which take and
return wide characters (UTF-16 or UCS-2 encoded). When DBD::ODBC is
built with unicode support (mostly on Windows, but an increasing number
of people on UNIX are using it too) it calls the SQLWxxx functions or
binds columns as SQL_WCHAR and converts the returned data to UTF-8
encoding in the bound Perl scalars (which can end up bigger than the
supplied buffer length, as UTF-8 encoding can take up to 6 bytes per
character).

Obviously, if you call blob_read with a length of N and that means N
bytes then I have a problem, since there is no way of knowing in advance
how many bytes the returned wide characters will occupy once UTF-8
encoded (up to 6 bytes per character). Calling blob_read with a buffer
length of 1000 means at most 500 wide characters, which could require
500 * 6 = 3000 bytes to store the UTF-8 encoded result.
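
To make the expansion concrete, here is a small Perl illustration
(nothing ODBC-specific, just Encode; the Euro sign is an arbitrary
example character):

    use Encode qw(encode);

    # a 1000 byte buffer of wide (UCS-2) characters holds 500 characters
    my $chrs  = 500;
    my $data  = "\x{20AC}" x $chrs;                 # 500 Euro signs
    my $bytes = length(encode('UTF-8', $data));
    print "$chrs chrs -> $bytes bytes UTF-8 encoded\n";
    # prints "500 chrs -> 1500 bytes UTF-8 encoded"; other characters can
    # need up to 6 bytes each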

I appreciate that the $bufoffset in blob_read allows someone to retrieve
lobs in chunks and have them appended to the current scalar but that can
also be achieved by concatenation.

Also, blob_read does not allow a type to be specified. This is perhaps
not important if blob_read is JUST for BLOBs and not CLOBs, but my case
is mostly CLOBs. In ODBC, if you request a blob as SQL_CHAR you get the
blob data returned as hex encoded bytes, so chr(32) becomes 0x20 (4
bytes as a string).

As a result of the above, I don't see any way of supporting the
existing blob_read in DBI and, as it is undocumented anyway and no DBDs
seem to use it, I've put a test implementation together as:

odbc_lob_read(sth, colno, buf, length, \%attrs)

where \%attrs can contain TYPE => SQL_type.

By default the column is retrieved as it is defined in the db (if the
column is a binary type), as SQL_WCHAR (wide characters translated into
UTF-8 encoding in Perl) in a unicode build, and as SQL_CHAR otherwise.
You can override the type in %attrs.

If you build with unicode support then length means the size in bytes of
the buffer supplied to SQLGetData, so each call returns at most length/2
wide characters; if those characters need more than length bytes once
UTF-8 encoded, the buf scalar is grown to fit. In other words, in a
unicode build length effectively limits the number of characters
retrieved and the call may return more bytes than the length specified.

It currently does not support retrieving a chunk of a lob and having it
appended/concatenated to an existing scalar. Part of the reason for this
is that there is nothing to stop someone calling lob_read with different
types for successive chunks of the same column, and column data cannot
sensibly be both UTF-8 encoded and not UTF-8 encoded (strictly it can,
but it makes no sense to me for it to be).
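
To show what I mean, this is roughly how the test implementation gets
called at the moment. The table and column names are made up, buf is
passed as a scalar reference, and the assumption that the call returns
the number of bytes read and 0 at the end is just how I have it coded
right now - all of this may change:

    use DBI qw(:sql_types);

    my $sth = $dbh->prepare(q/select id, document from documents/);
    $sth->execute;
    my $id;
    $sth->bind_col(1, \$id);          # bound columns come before the lob
    $sth->fetch;

    my ($doc, $chunk) = ('', '');
    # read the clob 64K (buffer bytes) at a time, asking for wide characters
    while (my $len = $sth->odbc_lob_read(2, \$chunk, 65536,
                                         { TYPE => SQL_WCHAR })) {
        $doc .= $chunk;
    }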

I appreciate this RFC is quite long and I'm guessing few people will get
to the end. Increasingly, I'm working with several DBDs and the
differences between them (beyond SQL differences) are proving VERY
frustrating. It would be nice to try and sort out lob reading and,
whilst I'm on a roll, SQLMoreResults/more_results, last_insert_id,
lob_write, binding with a specific type etc., but I guess since some of
these have been covered before and we didn't get too far in the past
that perhaps there is not a lot of interest in it. I'd like to fit
DBD::ODBC to DBI's blob_read but I don't think it works for DBD::ODBC as
it stands, and I note other DBDs have provided their own, substantially
different, lob reading methods so I'm guessing they found that too.

With Perl 6 perhaps on the horizon and Tim's current interest in JDBC I
think we are well advised to sort some of this out in Perl 5 first. It
makes little difference to me right now, as I'll implement what I need
now and move on, but I'm prepared to work to a DBI interface for the
common goal. Some of this is not going to go away - you only have to
watch Perl Monks to see the frustration with last_insert_id, lob
reading, binding columns with types etc. across DBDs to know this is
proving troublesome for the average user.

Martin
--
Martin J. Evans
Easysoft Limited
http://www.easysoft.com
