Rick, I think a String[] would be just fine. While a String[] is probably slower than a BitSet, I did a quick test looking at different comparisons, and matching 205 strings out of 1,646,085 (even using equalsIgnoreCase) only takes about 39 ms on my machine.
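As a rough illustration of the kind of quick test described above (the class name, sizes, and timing harness here are illustrative stand-ins, not the original benchmark code), a brute-force `equalsIgnoreCase` scan looks like this:

```java
import java.util.ArrayList;
import java.util.List;

public class MatchBench {
    // Count candidates that case-insensitively match any target column name.
    static int countMatches(String[] targets, List<String> candidates) {
        int hits = 0;
        for (String c : candidates) {
            for (String t : targets) {
                if (t.equalsIgnoreCase(c)) { hits++; break; }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String[] targets = { "ZIPCODE", "PRICE", "CITY" };
        List<String> candidates = new ArrayList<>();
        for (int i = 0; i < 1_000_000; i++) candidates.add("COL_" + i);
        candidates.add("zipcode"); // matches despite differing case

        long start = System.nanoTime();
        int hits = countMatches(targets, candidates);
        long ms = (System.nanoTime() - start) / 1_000_000;
        System.out.println(hits + " matches in " + ms + " ms");
    }
}
```

Even at a million candidates the linear scan finishes in tens of milliseconds, which is why the String[] representation seems acceptable here.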
I think we've completed the first example. I have one other that I'd like to look at:

    select h.price
    from TABLE ( legacy_realty_data() ) h, chicago_zipcodes z
    where h.zipcode = z.zipcode

In this case, I would get a single String in the referencedColumns array. What (if anything) would be in the ScanQualifier[]?

chris

----- Original Message ----
From: Rick Hillegas <[email protected]>
To: Derby Discussion <[email protected]>
Sent: Friday, August 21, 2009 2:07:01 PM
Subject: Re: Question about TableFunctions in Derby

Hi Chris,

Thanks for the quick response. Some comments inline...

Chris Goodacre wrote:
> Ok - I think I get the basics of your approach: rather than alter the signature of the table function (which is arbitrary and bound as part of the CREATE FUNCTION statement), you are (optionally) asking the developer to implement a hook that Derby calls prior to actually requesting data (the initScan method). So far, so good. I like the approach.
>
> I have a couple of areas I want to make sure I understand:
>
> (a) The FormatableBitSet would be a column-ordered set of flags indicating whether each column was included in the SELECT statement, correct? The column index would presumably be based on the order of occurrence of the column in the RETURNS TABLE () subclause of the CREATE FUNCTION statement. This introduces a dependency into my code, but I don't have enough experience to know whether I just got lucky not needing the dependency yet. Maybe other common implementations of the table function/result set already need to know what order the columns were defined in the table function.
>
Would it be better if referencedColumns were a String[]? That's fine too. There's only one subtlety with that approach: what you would get would be an array of SQL identifiers. That means that the strings would be uppercased unless you bothered to double-quote the column names at CREATE FUNCTION time. Would some other type be better for referencedColumns?
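The uppercasing subtlety mentioned above follows the usual SQL identifier rule: unquoted identifiers fold to upper case, while double-quoted identifiers keep their exact spelling. A small sketch of that folding rule (this helper is illustrative, not Derby code) shows what a String[] referencedColumns would carry for each spelling:

```java
import java.util.Locale;

public class IdentifierFolding {
    // Mimics SQL identifier handling: unquoted identifiers fold to upper
    // case; double-quoted identifiers keep their exact spelling.
    static String fold(String raw) {
        if (raw.length() >= 2 && raw.startsWith("\"") && raw.endsWith("\"")) {
            return raw.substring(1, raw.length() - 1);
        }
        return raw.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(fold("zipcode"));     // ZIPCODE
        System.out.println(fold("\"zipcode\"")); // zipcode
    }
}
```

So a table function comparing those strings against its own column names would either need to quote the names at CREATE FUNCTION time or compare case-insensitively, as in the timing test earlier in the thread.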
> (b) I get steps 1 & 2, but in step 3, are you saying that would be an alteration of the current CREATE FUNCTION statement (like either of the examples below), or a separate step entirely, or neither?
>
I was just being pedantic. The CREATE FUNCTION statement would not change. It would look exactly like it does today.

Hope we're converging on something useful.

Thanks,
-Rick

> e.g. 1:
>
>   create function legacy_realty_data()
>   returns table ( postal_code varchar(10),
>                   price numeric(10,2) )
>   language java
>   parameter style DERBY_SCANNABLE_JDBC_RESULT_SET // flags implementation of this DerbyScan interface
>   no sql
>   external name 'com.cg.example.LegacyHousingDataFunction.read'
>
> e.g. 2:
>
>   create function legacy_realty_data()
>   returns table ( postal_code varchar(10),
>                   price numeric(10,2) )
>   language java
>   parameter style DERBY_JDBC_RESULT_SET
>   no sql
>   external name 'com.cg.example.LegacyHousingDataFunction.read'
>   returns DerbyScan
>
> Do I understand what you're proposing?
> -chris
>
> ----- Original Message ----
> From: Rick Hillegas <[email protected]>
> To: Derby Discussion <[email protected]>
> Sent: Friday, August 21, 2009 10:12:42 AM
> Subject: Re: Question about TableFunctions in Derby
>
> Hi Chris,
>
> Thanks for following up on this. How does something like the following sound? I'm going to phrase this in terms of existing classes. I suspect that we'd want to use some simpler classes if we implemented this--for instance, I'm not happy with ScanQualifier: as your example points out, ScanQualifier pulls in the Derby type system, which is a lot of machinery you don't need to worry about. But here's a sketch:
>
> 1) Derby would expose some interface which your ResultSet would implement:
>
>   public interface DerbyScan
>   {
>       /**
>        * Setup that is called before Derby calls next() to get the first row.
>        *
>        * @param referencedColumns Columns which Derby may retrieve. These are column positions as declared at CREATE FUNCTION time.
>        * @param restriction Array of simple comparisons to constant values. Each comparison applies to a single column.
>        */
>       public void initScan( FormatableBitSet referencedColumns, ScanQualifier[] restriction ) throws SQLException;
>   }
>
> 2) You would code something like this:
>
>   public class MyResultSet implements ResultSet, DerbyScan { ... }
>
> 3) And your CREATE FUNCTION statement would bind a function to a method like the following, which you would code also:
>
>   public static MyResultSet legacyRealtyData() throws SQLException { ... }
>
> Is this headed in the right direction?
>
> Thanks,
> -Rick
>
> Chris Goodacre wrote:
>
>> Rick,
>> Sorry it's taken me so long to reply on this. I just today got back to this in earnest. I'll try to walk through an example, imagining that I have an array of ScanQualifiers that gets passed to my table function's method, just to make sure I understand this.
>>
>>   public static ResultSet read(ScanQualifier[] qualifiers) {
>>       // ... impl
>>   }
>>
>> So, if I were to go back to my original example:
>>
>>   select house_number, street, city from table (legacy_realty_data()) where price < 500000
>>
>> a) I think that an array with only a single ScanQualifier object would be passed to my read(...) method.
>> b) I can see where the operator for the ScanQualifier object would be some negative number.
>> c) The column id would reference the column number (basically) of the price column from the table definition of my CREATE FUNCTION statement.
>> d) The result of getOrderable() on the ScanQualifier object would return a DataValueDescriptor.
>> e) I could interrogate the DataValueDescriptor to get the value (500000) in a type/manner that I could use to pass on to my legacy system.
>>
>> I could use this information to restrict the number of rows that come back. That's good.
>>
>> It would still be nice if I could restrict the number of columns I'm requesting up front.
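The proposed initScan hook and the qualifier walkthrough above can be sketched with simplified stand-ins: plain java.util.BitSet in place of FormatableBitSet, and a toy Qualifier record in place of ScanQualifier (which, as noted, drags in the Derby type system). Everything here except the names DerbyScan/initScan is an assumption made for illustration:

```java
import java.util.BitSet;

public class ScanHookSketch {
    // Toy stand-in for ScanQualifier: a column position (as declared at
    // CREATE FUNCTION time), an operator, and the constant compared against.
    record Qualifier(int columnId, String operator, Object constant) {}

    // Simplified shape of the proposed hook; the real version would take
    // FormatableBitSet and ScanQualifier[] and throw SQLException.
    interface DerbyScanLike {
        void initScan(BitSet referencedColumns, Qualifier[] restriction);
    }

    static class LegacyRealtyScan implements DerbyScanLike {
        BitSet wanted;
        Integer maxPrice; // restriction to push to the legacy source

        public void initScan(BitSet referencedColumns, Qualifier[] restriction) {
            this.wanted = referencedColumns;
            for (Qualifier q : restriction) {
                // Step (e): interrogate the constant in a usable type.
                if (q.columnId() == 2 && "<".equals(q.operator())) {
                    this.maxPrice = (Integer) q.constant();
                }
            }
            // A real implementation would now issue one filtered, projected
            // read against the legacy system before the first next() call.
        }
    }

    public static void main(String[] args) {
        BitSet cols = new BitSet();
        cols.set(2); // suppose price is the second declared column
        LegacyRealtyScan scan = new LegacyRealtyScan();
        scan.initScan(cols, new Qualifier[] { new Qualifier(2, "<", 500000) });
        System.out.println(scan.maxPrice); // 500000
    }
}
```

The point of the sketch is only the shape of the contract: Derby hands the projection and the simple restrictions to the ResultSet before the first next(), and the implementation translates them into one request to the external source.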
>> It's expensive to go back and forth to this system, so I would rather make one read (all relevant rows, all relevant columns) and take the chance that the user only uses some of the rows from the result set.
>>
>> Would it be possible to use a ScanQualifier (or something like it) to inform the table function's methods which specific (non-calculated) columns are in the query?
>> -chris
>>
>> ----- Original Message ----
>> From: Rick Hillegas <[email protected]>
>> To: Derby Discussion <[email protected]>
>> Sent: Monday, July 20, 2009 3:08:31 PM
>> Subject: Re: Question about TableFunctions in Derby
>>
>> Hi Chris,
>>
>> Reducing the number of column probes may be possible without any changes to Derby: when your ResultSet is asked to get a column, it can remember that request. On later rows, your ResultSet can ask the external data source for all of the column positions it has remembered so far. In the query you gave, this would play out like this:
>>
>> 1) On the first row, your ResultSet would make 3 calls to the external data source, one for each column. But the ResultSet would remember which columns were requested.
>>
>> 2) For each of the remaining N-1 rows, your ResultSet would call the external data source only once, asking it for all three columns in a single batch. That batch could then be cached and the individual columns returned to Derby when Derby called the getXXX() methods.
>>
>> Positioning and restricting the rows themselves (the WHERE clause fragments) is trickier. It probably requires help from Derby, as you suggest. We could design some interface by which Derby would pass the ResultSet a list of org.apache.derby.iapi.sql.execute.ScanQualifier. Your ResultSet could then forward those directives to the external data source.
>>
>> What do you think?
>> -Rick
>>
>> Chris Goodacre wrote:
>>
>>> Rick, thanks for your suggestions.
>>> Perhaps I am being obtuse, but when you say ...
>>>
>>> "Since you have only asked for 3 columns, that's all that Derby will request from the ResultSet instantiated by your table function. That is, Derby is only going to call ResultSet.getXXX() on the house_number, street, and city columns. That should behave efficiently provided that your ResultSet is smart enough to only fault-in columns for which a getXXX() is called."
>>>
>>> Does that mean that I make a separate request to the legacy system each time getXXX() is called - i.e. lazily initialize each column in the result set? I think this has to be the only way to do it, since I don't know which columns will be requested at the time the read() method of my table function is invoked. Making (in this case) 3 calls to the legacy system to get 1 column for N rows is certainly better than making 1 call to the legacy system to get 1000 columns for N rows and then throwing away 997*N values/cells, but still not quite as nice as I'd like. If I were making a wish, I'd wish for some sort of parsed representation of the query to be passed to the read method (or to some other method - similar to, or even as part of, the query optimization interface). Ideally, this structured representation would have the list of columns belonging to the table function from the select list, and the WHERE clause components specific to the table function only (i.e. mytablefunction.price > 50000 but NOT mytablefunction.price < myrealtable.value).
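The remember-and-batch strategy Rick describes above (individual probes on the first row, one batched call per later row) can be sketched against a toy data source. The LegacySource interface and the round-trip counter here are illustrative assumptions, not Derby or CICS APIs:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class ColumnCachingScan {
    // Toy stand-in for the external system: fetch given columns of one row.
    interface LegacySource {
        Map<Integer, Object> fetch(int row, Set<Integer> columns);
    }

    private final LegacySource source;
    private final Set<Integer> remembered = new TreeSet<>();
    private Map<Integer, Object> currentRow = new HashMap<>();
    private int row = -1;
    int roundTrips = 0; // how many calls were made to the legacy system

    ColumnCachingScan(LegacySource source) { this.source = source; }

    // Advance to the next row, pre-fetching every remembered column in one batch.
    void next() {
        row++;
        currentRow = new HashMap<>();
        if (!remembered.isEmpty()) {
            currentRow.putAll(source.fetch(row, remembered));
            roundTrips++;
        }
    }

    // Lazily fault in a column; remember it for subsequent rows.
    Object getColumn(int col) {
        if (!currentRow.containsKey(col)) {
            currentRow.putAll(source.fetch(row, Set.of(col)));
            roundTrips++;
            remembered.add(col);
        }
        return currentRow.get(col);
    }

    public static void main(String[] args) {
        LegacySource src = (row, cols) -> {
            Map<Integer, Object> out = new HashMap<>();
            for (int c : cols) out.put(c, "r" + row + "c" + c);
            return out;
        };
        ColumnCachingScan scan = new ColumnCachingScan(src);
        for (int r = 0; r < 3; r++) {
            scan.next();
            scan.getColumn(1); scan.getColumn(2); scan.getColumn(3);
        }
        // First row: 3 individual probes; each of the 2 later rows: 1 batch.
        System.out.println(scan.roundTrips); // 5
    }
}
```

For N rows and k referenced columns this makes k + (N - 1) round trips instead of k * N, without Derby having to reveal the select list at all.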
>>> In the absence of this, when the VTIResultSet class passes the ActivationHolder to the Derby class which invokes the read() method reflectively, why can't that class (it knows it is dealing with a Derby table function, it knows the class name, and it has access to the result set descriptor, if not the WHERE clause) pass this information along to the user's table function class? I would happily implement an interface in this class (I'm not sure why read() has to be static) to get this information prior to ResultSet instantiation.
>>>
>>> -chris
>>>
>>> ----- Original Message ----
>>> From: Rick Hillegas <[email protected]>
>>> To: Derby Discussion <[email protected]>
>>> Sent: Monday, July 20, 2009 10:55:33 AM
>>> Subject: Re: Question about TableFunctions in Derby
>>>
>>> Hi Chris,
>>>
>>> Some comments inline...
>>>
>>> Chris Goodacre wrote:
>>>
>>>> I've read the Derby developer's guide and Rick Hillegas's informative white paper (http://developers.sun.com/javadb/reference/whitepapers/sampleTableFunctions/doc/TableFunctionsWhitePaper.html) on Table Functions, but am still struggling with the following issue:
>>>>
>>>> I am trying to create an RDB abstraction for a large CICS/VSAM-based legacy system and blend it with our newer, RDB-based tier. This seems like a good application of TableFunctions. The VSAM data is made available to me via an IP-based proprietary messaging interface. There are lots of different files here, but due to some historical forces, most of the data I'm interested in resides in 4 VSAM files.
>>>>
>>>> Unfortunately, each of those VSAM files has over 1000 fields in it.
>>>>
>>>> Now eventually, it might be possible to fully model a single VSAM file into (for the sake of argument) 50 tables, each table/row representing a small slice of a single VSAM record.
>>>>
>>>> In the meantime, for both this proof-of-concept and as a migration path for our existing clients, I'd like to represent each VSAM file as a table (subject to the 1024-column SQL limitation per table). This will be a highly denormalized and decidedly non-relational view of the data, but it will be easy to demonstrate and immediately recognizable to our customers.
>>>>
>>>> However, I can't seem to get around the problem of data granularity. For example, if my customer executes:
>>>>
>>>>   select house_number, street, city from table (legacy_realty_data()) where price < 500000
>>>>
>>> Since you have only asked for 3 columns, that's all that Derby will request from the ResultSet instantiated by your table function. That is, Derby is only going to call ResultSet.getXXX() on the house_number, street, and city columns. That should behave efficiently provided that your ResultSet is smart enough to only fault-in columns for which a getXXX() is called.
>>>
>>> The WHERE clause is a little trickier. You are right: Derby will read all rows from the ResultSet and throw away the rows which don't satisfy the WHERE clause. What you want to do is push the qualification through the table function to the external data source. I don't see any way to do this other than adding some more arguments to your table function. For instance, if you could push the qualification through to the external data source, then you could get efficient behavior from something like the following:
>>>
>>>   select house_number, street, city
>>>   from table( legacy_realty_data( 500000 ) ) s;
>>>
>>> Hope this helps,
>>> -Rick
>>>
>>>> I don't appear to have any visibility into the actual query inside my legacy_realty_data TableFunction, so I have to go get all 1000 fields for however many listings are present where price < 500000, even though only three columns will be requested. Am I missing something?
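The workaround Rick sketches above — passing the restriction as an ordinary function argument — would pair a CREATE FUNCTION declaring an extra parameter with a static method that filters at the source. The method body, data, and class name below are illustrative stand-ins; a real implementation would query the legacy system and wrap the rows in a java.sql.ResultSet:

```java
import java.util.ArrayList;
import java.util.List;

public class LegacyRealtyFunction {
    // Stand-in rows: { house_number, price }. Imagined usage:
    //
    //   create function legacy_realty_data( maxPrice int ) returns table (...)
    //   ...
    //   select house_number, street, city
    //   from table( legacy_realty_data( 500000 ) ) s;
    static List<int[]> readBelow(int maxPrice) {
        List<int[]> all = List.of(new int[] {101, 450_000},
                                  new int[] {102, 750_000},
                                  new int[] {103, 325_000});
        List<int[]> kept = new ArrayList<>();
        for (int[] row : all) {
            if (row[1] < maxPrice) kept.add(row); // filter at the source
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(readBelow(500_000).size()); // 2
    }
}
```

The trade-off, as Chris notes later, is that the restriction becomes part of the function's signature rather than something Derby pushes down automatically.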
>>>> Aside from having the user repeat the columns as parameters to the table function (which looks awkward, to say the least), I can't see a way around this based on my limited knowledge of Derby.
>>>>
>>>> Is there a way to only retrieve the columns that the user is querying for?
>>>>
>>>> Looking forward to your help/advice.
>>>>
>>>> -chris
