Re: Question about TableFunctions in Derby

Rick Hillegas Fri, 21 Aug 2009 12:07:38 -0700

Hi Chris,

Thanks for the quick response. Some comments inline...


Chris Goodacre wrote:

Ok  - I think I get the basics of your approach:   rather than alter the 
signature of the table function (which is arbitrary and bound as part of the 
create function statement), you are (optionally) asking the developer to 
implement a hook that derby calls prior to actually requesting data (the 
initScan method).    So far, so good.  I like the approach.

I have a couple of areas I want to make sure I understand:

(a) The FormatableBitSet would be a column-ordered based set of flags 
indicating whether the column was included in the select statement, correct?   
The column index would presumably be based on the order of occurrence of the 
column in the returns table () subclause of the create function statement.   
This introduces a dependency into my code, but I don't have enough experience 
to know whether I just got lucky not needing the dependency yet.  Maybe other 
common implementations of the table function/result set already need to know 
what order the columns were defined in the table function.

Would it be better if referencedColumns were a String[]? That's finetoo. There's only one subtlety with that approach: what you would getwould be an array of SQL identifiers. That means that the strings wouldbe uppercased unless you bothered to double-quote the column names atCREATE FUNCTION time.


Would some other type be better for referencedColumns?

(b) I get steps 1 & 2, but in step 3, are you saying that would be an 
alteration of the current create function statement (like either of the examples 
below), or a separate step entirely, or neither?

I was just being pedantic. The CREATE FUNCTION statement would notchange. It would look exactly like it does today.


Hope we're converging on something useful.

Thanks,
-Rick

e.g1: create function legacy_realty_data()returns table (postal_code varchar(10),price numeric(10,2))language javaparameter style DERBY_SCANNABLE_JDBC_RESULT_SET // flags implementation of this DerbyScan interface

                no sql
                external name 'com.cg.example.LegacyHousingDataFunction.read'

e.g2: create function legacy_realty_data()returns table (postal_code varchar(10),price numeric(10,2))language javaparameter style DERBY_JDBC_RESULT_SET

                no sql

external name 'com.cg.example.LegacyHousingDataFunction.read' returns DerbyScan

Do I understand what you're proposing?

-chris

----- Original Message ----
From: Rick Hillegas <[email protected]>
To: Derby Discussion <[email protected]>
Sent: Friday, August 21, 2009 10:12:42 AM
Subject: Re: Question about TableFunctions in Derby

Hi Chris,

Thanks for following up on this. How does something like the following sound? 
I'm going to phrase this in terms of existing classes. I suspect that we'd want 
to use some simpler classes if we implemented this--for instance, I'm not happy 
with ScanQualifier: as your example points out, ScanQualifier pulls in the 
Derby type system, which is a lot of machinery you don't need to worry about. 
But here's a sketch:

1) Derby would expose some interface which your ResultSet would implement:

public interface DerbyScan
{
   /**
     * Setup that is called before Derby calls next() to get the first row
     *
     * @param referencedColumns Columns which Derby may retrieve. These are 
column positions as declared at CREATE FUNCTION time.
     * @param restriction Array of simple comparisons to constant values. Each 
comparison applies to a single column.
     */
   public void initScan( FormatableBitSet referencedColumns, ScanQualifier[] 
restriction ) throws SQLException;
}

2) You would code something like this:

public class MyResultSet implements ResultSet, DerbyScan { ... }

3) And your CREATE FUNCTION statement would bind a function to a method like 
the following, which you would code also:

public static MyResultSet legacyRealtyData() throws SQLException { ... }

Is this headed in the right direction?

Thanks,
-Rick

Chris Goodacre wrote:

Rick,
Sorry it's taken me so long to reply on this.  I just today got back to this in 
earnest.   I'll try to walk through an example, imagining that I have an array 
of ScanQualifiers that gets passed to my table function's method, just to make 
sure I understand this.

public static ResultSet read(ScanQualifier[] qualifiers) {
   // ... impl
}

So, if I were to go back to my original example:

select house_number, street, city from table (legacy_realty_data()) where price 
< 500000

a) I think that an array with only a single ScanQualifier object would be 
passed to my read(...) method.
b) I can see where the operator for the ScanQualifier object would be some 
negative number
c) The column id would reference the column # (basically) of the price column 
from the table definition of my CREATE FUNCTION statement.
d) The result of getOrderable() on the scanqualifier object would return me a 
DataValueDescriptor.  e) I could interrogate the DataValueDescriptor to get the 
value (500000) in a type/manner that I could use to pass on to my legacy system

I could use this information to restrict the number of rows that come back.   
That's good.

It would still be nice if I could restrict the number of columns I'm requesting 
up front.   It's expensive to go back and forth to this system, so I would 
rather make one read (all relevant rows, all relevant columns) and take the 
chance that the user only uses some of the rows from the result set.

Would it be possible to use a ScanQualifier (or something like it) to inform the table procedure methods which specific (non-calculated) columns are in the query?-chris

----- Original Message ----
From: Rick Hillegas <[email protected]>
To: Derby Discussion <[email protected]>
Sent: Monday, July 20, 2009 3:08:31 PM
Subject: Re: Question about TableFunctions in Derby

Hi Chris,

Reducing the number of column probes may be possible without any changes to 
Derby: When your ResultSet is asked to get a column, it can remember that 
request. On later rows, your ResultSet can ask the external data source for all 
of the column positions it has remembered so far. In the query you gave, this 
would play out like this:

1) On the first row, your ResultSet would make 3 calls to the external data 
source, one for each column. But the ResultSet would remember which columns 
were requested.

2) For each of the remaining N-1 rows, your ResultSet would call the external 
data source only once, asking the external data source for all three columns in 
a single batch. That batch could then be cached and the individual columns 
could be returned to Derby when Derby called the getXXX() methods.

Positioning and restricting the rows themselves (the WHERE clause fragments) is 
tricker. It probably requires help from Derby, as you suggest. We could design 
some interface by which Derby would pass the ResultSet a list of 
org.apache.derby.iapi.sql.execute.ScanQualifier. Your ResultSet could then 
forward those directives to the external data source.

What do you think?
-Rick

Chris Goodacre wrote:

Rick, thanks for your suggestions.   Perhaps I am being obtuse, but when you 
say ...

"Since you have only asked for 3 columns, that's all that Derby will
request from the ResultSet instantiated by your table function. That
is, Derby is only going to call ResultSet.getXXX() on the house_number,
street, and city columns. That should behave efficiently provided that
your ResultSet is smart enough to only fault-in columns for which a
getXXX() is called."

Does that mean that I make a separate request to the legacy system each time 
getXXX() is called - i.e. lazily initialize each column in the result set?    I 
think this has to be the only way to do it, since I don't know which columns 
will be requested at the time the read() method of my tablefunction is invoked. 
 Making (in this case) 3 calls to the legacy system to get 1 column for N rows 
is certainly better than making 1 call to the legacy system to get 1000 columns 
for N rows and then throwing away 997*N values/cells, but still not quite as 
nice as I'd like.  If I were making a wish - I'd wish for some sort of parsed 
representation of the query get passed to the read method (or to some other 
method - similar to, or even as part of, the query optimization interface).    
Ideally, this structured representation would have the list of columns 
belonging to the table function from the select list, and the where clause 
components specific to the table

 function only (i.e. mytablefunction.price > 50000 but NOT mytablefunction.price 
< myrealtable.value).

In the absence of this, when the VTIResultSet class passes the ActivationHolder 
to the derby class which invokes the read() method reflectively, why can't that 
class pass the activation context (it knows it is dealing with a derby table 
function, it knows the class name, it has access to the result set descriptor, 
if not the where clause) pass this information along to the user's table 
function class?   I would happily implement an interface in this class (not 
sure why read() has to be static) to get this information prior to resultset 
instantiation.

-chris



----- Original Message ----
From: Rick Hillegas <[email protected]>
To: Derby Discussion <[email protected]>
Sent: Monday, July 20, 2009 10:55:33 AM
Subject: Re: Question about TableFunctions in Derby

Hi Chris,

Some comments inline...

Chris Goodacre wrote:

I've read the Derby developer's guide and Rick Hillegas's informative white
paper
(http://developers.sun.com/javadb/reference/whitepapers/sampleTableFunctions/doc/TableFunctionsWhitePaper.html)
on Table Functions, but am still struggling with the following issue:

I am trying to create an RDB abstraction for a large CICS/VSAM-based legacy
system and blend it with our newer, RDB-based tier. This seems like a good
application of TableFunctions. The VSAM data is made available to me via an
IP-based proprietary messaging interface. There are lots of different files
here, but due to some historical forces, most of the data I'm interested in
resides in 4 VSAM files.

Unfortunately, each of those VSAM files has over a 1000 fields in it.

Now eventually, it might be possible to fully model a single VSAM file into
(for the sake of argument) 50 tables; each table/row representing a small slice
of a single VSAM record.

In the meantime, for both this proof-of-concept and as a migration path to our
existing clients, I'd like to represent each VSAM file as a table (subject to
the 1024 column SQL limitation per table). This will be a highly-denormalized
and decidedly non-relational view of the data, but it will be easy to
demonstrate and immediately recognizable to our customers.

However, I can't seem to get around the problem of data granularity. For
example, if my customer executes:

select house_number, street, city from table (legacy_realty_data()) where price
< 500000

Since you have only asked for 3 columns, that's all that Derby will request 
from the ResultSet instantiated by your table function. That is, Derby is only 
going to call ResultSet.getXXX() on the house_number, street, and city columns. 
That should behave efficiently provided that your ResultSet is smart enough to 
only fault-in columns for which a getXXX() is called.

The WHERE clause is a little trickier. You are right, Derby will read all rows 
from the ResultSet and throw away the rows which don't satisfy the WHERE 
clause. What you want to do is push the qualification through the table 
function to the external data source. I don't see any way to do this other than 
adding some more arguments to your table function. For instance, if you could 
push the qualification through to the external data source, then you could get 
efficient behavior from something like the following:

select house_number, street, city
from table( legacy_realty_data( 500000 ) ) s;

Hope this helps,
-Rick

I don't appear to have any visibility to the actual query inside my 
legacy_realty_data TableFunction, so I have to go get all 1000 fields for however 
many listings are present where price< 500000 even though only three columns 
will be requested.  Am I missing something?  Aside from having the user repeat the 
columns as parameters to the table function (which looks awkward to say the 
least), I can't see a way around this based on my limited knowledge of Derby.

Is there a way to only retrieve the columns that the user is querying for?

Looking forward to your help/advice.

-chris

Re: Question about TableFunctions in Derby

Reply via email to