Rick, your response is exactly as I would have expected.  I assumed the 
filtering of data based on the join would have to occur "above" the individual 
queries (though admittedly, I'm no expert in RDBMS theory).  

So how do we / I proceed at this point?   I'm new to this project/list.  Shall 
I enter this as an enhancement request in Jira?   Do you want/need any help 
with the actual enhancement?

-chris   



----- Original Message ----
From: Rick Hillegas <[email protected]>
To: Derby Discussion <[email protected]>
Sent: Monday, August 24, 2009 10:29:47 AM
Subject: Re: Question about TableFunctions in Derby

Hi Chris,

A response inline...

Chris Goodacre wrote:
> Rick,
>
> I think a String[] would be just fine.   While a String[] is probably 
> slower than a BitSet, I did a quick test looking at different 
> comparisons, and matching 205 strings out of 1,646,085 (even using 
> equalsIgnoreCase) only takes about 39 ms on my machine.
>
> I think we've completed the first example.  I have one other that I'd like to 
> look at:
>
> select h.price from TABLE ( legacy_realty_data() ) h, chicago_zipcodes z
> where h.zipcode = z.zipcode
>
>
> In this case, I would get a single String in referencedColumns array.  What 
> (if anything) would be in the ScanQualifier[]?
>  
I think that the ScanQualifier[] argument would be null in the first 
increment of this work. The optimizer has a strong bias to put the table 
function in the outermost slot of the join order. In order to limit the 
table function scan with the "h.zipcode = z.zipcode" clause, the 
optimizer would need some better heuristics for figuring out when to put 
the table function in an inner slot. That could be a follow-on effort.
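
Concretely, that means an initScan() implementation (per the interface 
sketched further down in this thread) should treat a null restriction as 
"no pushed-down row filtering". A minimal sketch:

   public void initScan( FormatableBitSet referencedColumns,
                         ScanQualifier[] restriction ) throws SQLException
   {
       // In this first increment Derby passes no qualifiers for join
       // clauses like "h.zipcode = z.zipcode", so restriction may simply
       // be null; in that case, scan all rows and let Derby filter.
       if ( restriction == null ) { /* nothing to forward to the legacy source */ }
   }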

Hope this seems sensible and useful to you,
-Rick
> chris
>
>
>
> ----- Original Message ----
> From: Rick Hillegas <[email protected]>
> To: Derby Discussion <[email protected]>
> Sent: Friday, August 21, 2009 2:07:01 PM
> Subject: Re: Question about TableFunctions in Derby
>
> Hi Chris,
>
> Thanks for the quick response. Some comments inline...
>
> Chris Goodacre wrote:
>  
>> Ok  - I think I get the basics of your approach:   rather than alter the 
>> signature of the table function (which is arbitrary and bound as part of the 
>> create function statement), you are (optionally) asking the developer to 
>> implement a hook that derby calls prior to actually requesting data (the 
>> initScan method).    So far, so good.  I like the approach.
>>
>> I have a couple of areas I want to make sure I understand:
>>
>> (a) The FormatableBitSet would be a set of flags, ordered by column 
>> position, indicating whether each column was included in the select 
>> statement, correct?  The column index would presumably be based on the 
>> order of occurrence of the column in the returns table () subclause of 
>> the create function statement.  This introduces a dependency into my 
>> code, but I don't have enough experience to know whether I just got 
>> lucky not needing the dependency yet.  Maybe other common 
>> implementations of the table function/result set already need to know 
>> what order the columns were defined in the table function.
>>  
>>    
> Would it be better if referencedColumns were a String[]? That's fine too. 
> There's only one subtlety with that approach: what you would get would be an 
> array of SQL identifiers. That means that the strings would be uppercased 
> unless you bothered to double-quote the column names at CREATE FUNCTION time.
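>
> For example, a hypothetical matching loop on your side would need to 
> allow for those casing rules:
>
>    // Unless double-quoted at CREATE FUNCTION time, the identifier comes
>    // back uppercased: "POSTAL_CODE" rather than "postal_code".
>    for ( String col : referencedColumns )
>    {
>        if ( col.equalsIgnoreCase( "postal_code" ) ) { /* fetch this field */ }
>    }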
>
> Would some other type be better for referencedColumns?
>  
>> (b) I get steps 1 & 2, but in step 3, are you saying that would be an 
>> alteration of the current create function statement (like either of the 
>> examples below), or a separate step entirely, or neither?
>>  
>>    
> I was just being pedantic. The CREATE FUNCTION statement would not change. It 
> would look exactly like it does today.
>
> Hope we're converging on something useful.
>
> Thanks,
> -Rick
>  
>>      e.g1: create function legacy_realty_data()
>>            returns table
>>            (
>>                 postal_code varchar(10),
>>                 price numeric(10,2)
>>            )
>>            language java
>>            parameter style DERBY_SCANNABLE_JDBC_RESULT_SET // flags implementation of this DerbyScan interface
>>            no sql
>>            external name 'com.cg.example.LegacyHousingDataFunction.read'
>>
>>      e.g2: create function legacy_realty_data()
>>            returns table
>>            (
>>                 postal_code varchar(10),
>>                 price numeric(10,2)
>>            )
>>            language java
>>            parameter style DERBY_JDBC_RESULT_SET
>>            no sql
>>            external name 'com.cg.example.LegacyHousingDataFunction.read' returns DerbyScan
>>
>> Do I understand what you're proposing?  
>> -chris
>>
>> ----- Original Message ----
>> From: Rick Hillegas <[email protected]>
>> To: Derby Discussion <[email protected]>
>> Sent: Friday, August 21, 2009 10:12:42 AM
>> Subject: Re: Question about TableFunctions in Derby
>>
>> Hi Chris,
>>
>> Thanks for following up on this. How does something like the following 
>> sound? I'm going to phrase this in terms of existing classes. I suspect that 
>> we'd want to use some simpler classes if we implemented this--for instance, 
>> I'm not happy with ScanQualifier: as your example points out, ScanQualifier 
>> pulls in the Derby type system, which is a lot of machinery you don't need 
>> to worry about. But here's a sketch:
>>
>> 1) Derby would expose some interface which your ResultSet would implement:
>>
>> public interface DerbyScan
>> {
>>     /**
>>      * Setup that is called before Derby calls next() to get the first row.
>>      *
>>      * @param referencedColumns Columns which Derby may retrieve. These are
>>      *        column positions as declared at CREATE FUNCTION time.
>>      * @param restriction Array of simple comparisons to constant values.
>>      *        Each comparison applies to a single column.
>>      */
>>     public void initScan( FormatableBitSet referencedColumns,
>>                           ScanQualifier[] restriction ) throws SQLException;
>> }
>>
>> 2) You would code something like this:
>>
>> public class MyResultSet implements ResultSet, DerbyScan { ... }
>>
>> 3) And your CREATE FUNCTION statement would bind a function to a method like 
>> the following, which you would code also:
>>
>> public static MyResultSet legacyRealtyData() throws SQLException { ... }
>>
>>
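>> Pulling those pieces together, a minimal sketch of what your initScan() 
>> might do (assuming FormatableBitSet exposes getLength()/isSet(); the 
>> requestedColumns set is just your own bookkeeping):
>>
>>    private final Set<Integer> requestedColumns = new HashSet<Integer>();
>>
>>    public void initScan( FormatableBitSet referencedColumns,
>>                          ScanQualifier[] restriction ) throws SQLException
>>    {
>>        // remember which declared column positions this query touches
>>        for ( int i = 0; i < referencedColumns.getLength(); i++ )
>>        {
>>            if ( referencedColumns.isSet( i ) ) { requestedColumns.add( i ); }
>>        }
>>
>>        // restriction holds simple column-vs-constant comparisons, or may
>>        // be null if no qualification could be pushed down
>>    }
>>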
>> Is this headed in the right direction?
>>
>> Thanks,
>> -Rick
>>
>> Chris Goodacre wrote:
>>  
>>    
>>> Rick,
>>> Sorry it's taken me so long to reply on this.  I just today got back to 
>>> this in earnest.   I'll try to walk through an example, imagining that I 
>>> have an array of ScanQualifiers that gets passed to my table function's 
>>> method, just to make sure I understand this.
>>>
>>> public static ResultSet read(ScanQualifier[] qualifiers) {
>>>    // ... impl
>>> }
>>>
>>> So, if I were to go back to my original example:
>>>
>>> select house_number, street, city from table (legacy_realty_data()) where 
>>> price < 500000
>>>
>>> a) I think that an array with only a single ScanQualifier object would 
>>> be passed to my read(...) method.
>>> b) I can see where the operator for the ScanQualifier object would be 
>>> some negative number.
>>> c) The column id would reference the column # (basically) of the price 
>>> column from the table definition of my CREATE FUNCTION statement.
>>> d) The result of getOrderable() on the ScanQualifier object would 
>>> return a DataValueDescriptor.
>>> e) I could interrogate the DataValueDescriptor to get the value 
>>> (500000) in a type/manner that I could use to pass on to my legacy 
>>> system.
>>>
>>> I could use this information to restrict the number of rows that come back. 
>>>   That's good.
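>>>
>>> If I've got that right, the interrogation might look something like 
>>> this sketch (getColumnId/getOperator/getOrderable come from the 
>>> Qualifier interface; addFilter() is a made-up call into my legacy 
>>> layer, and exception handling is elided):
>>>
>>>    for ( ScanQualifier q : qualifiers )
>>>    {
>>>        int columnId = q.getColumnId();  // column # from CREATE FUNCTION
>>>        int operator = q.getOperator();  // e.g. Orderable.ORDER_OP_LESSTHAN
>>>        DataValueDescriptor value = q.getOrderable();
>>>
>>>        legacyQuery.addFilter( columnId, operator, value.getInt() ); // 500000
>>>    }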
>>>
>>> It would still be nice if I could restrict the number of columns I'm 
>>> requesting up front.   It's expensive to go back and forth to this system, 
>>> so I would rather make one read (all relevant rows, all relevant columns) 
>>> and take the chance that the user only uses some of the rows from the 
>>> result set.
>>>
>>> Would it be possible to use a ScanQualifier (or something like it) to 
>>> inform the table procedure methods which specific (non-calculated) 
>>> columns are in the query?
>>>
>>> -chris
>>>
>>>
>>>
>>> ----- Original Message ----
>>> From: Rick Hillegas <[email protected]>
>>> To: Derby Discussion <[email protected]>
>>> Sent: Monday, July 20, 2009 3:08:31 PM
>>> Subject: Re: Question about TableFunctions in Derby
>>>
>>> Hi Chris,
>>>
>>> Reducing the number of column probes may be possible without any changes to 
>>> Derby: When your ResultSet is asked to get a column, it can remember that 
>>> request. On later rows, your ResultSet can ask the external data source for 
>>> all of the column positions it has remembered so far. In the query you 
>>> gave, this would play out like this:
>>>
>>> 1) On the first row, your ResultSet would make 3 calls to the external data 
>>> source, one for each column. But the ResultSet would remember which columns 
>>> were requested.
>>>
>>> 2) For each of the remaining N-1 rows, your ResultSet would call the 
>>> external data source only once, asking the external data source for all 
>>> three columns in a single batch. That batch could then be cached and the 
>>> individual columns could be returned to Derby when Derby called the 
>>> getXXX() methods.
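>>>
>>> A sketch of that strategy, with made-up stand-ins for the calls to the 
>>> external data source:
>>>
>>>    private final Set<Integer> rememberedColumns = new HashSet<Integer>();
>>>    private Map<Integer, Object> rowCache = new HashMap<Integer, Object>();
>>>
>>>    public boolean next() throws SQLException
>>>    {
>>>        boolean hasRow = legacySource.advance();  // hypothetical
>>>        if ( hasRow && !rememberedColumns.isEmpty() )
>>>        {
>>>            // one batched call for every column remembered so far
>>>            rowCache = legacySource.fetch( rememberedColumns );  // hypothetical
>>>        }
>>>        return hasRow;
>>>    }
>>>
>>>    public String getString( int columnIndex ) throws SQLException
>>>    {
>>>        rememberedColumns.add( columnIndex );
>>>        if ( !rowCache.containsKey( columnIndex ) )
>>>        {
>>>            // first request for this column: fault it in individually
>>>            rowCache.put( columnIndex, legacySource.fetchOne( columnIndex ) );
>>>        }
>>>        return (String) rowCache.get( columnIndex );
>>>    }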
>>>
>>> Positioning and restricting the rows themselves (the WHERE clause 
>>> fragments) is trickier. It probably requires help from Derby, as you 
>>> suggest. We could design some interface by which Derby would pass the 
>>> ResultSet a list of org.apache.derby.iapi.sql.execute.ScanQualifier. Your 
>>> ResultSet could then forward those directives to the external data source.
>>>
>>> What do you think?
>>> -Rick
>>>
>>>
>>> Chris Goodacre wrote:
>>>      
>>>      
>>>> Rick, thanks for your suggestions.   Perhaps I am being obtuse, but when 
>>>> you say ...
>>>>
>>>> "Since you have only asked for 3 columns, that's all that Derby will
>>>> request from the ResultSet instantiated by your table function. That
>>>> is, Derby is only going to call ResultSet.getXXX() on the house_number,
>>>> street, and city columns. That should behave efficiently provided that
>>>> your ResultSet is smart enough to only fault-in columns for which a
>>>> getXXX() is called."
>>>>
>>>> Does that mean that I make a separate request to the legacy system 
>>>> each time getXXX() is called - i.e. lazily initialize each column in 
>>>> the result set?  I think this has to be the only way to do it, since I 
>>>> don't know which columns will be requested at the time the read() 
>>>> method of my table function is invoked.  Making (in this case) 3 calls 
>>>> to the legacy system to get 1 column for N rows is certainly better 
>>>> than making 1 call to the legacy system to get 1000 columns for N rows 
>>>> and then throwing away 997*N values/cells, but it is still not quite 
>>>> as nice as I'd like.
>>>>
>>>> If I were making a wish, I'd wish for some sort of parsed 
>>>> representation of the query to get passed to the read method (or to 
>>>> some other method - similar to, or even as part of, the query 
>>>> optimization interface).  Ideally, this structured representation 
>>>> would have the list of columns belonging to the table function from 
>>>> the select list, and the where clause components specific to the table 
>>>> function only (i.e. mytablefunction.price > 50000 but NOT 
>>>> mytablefunction.price < myrealtable.value).
>>>>
>>>> In the absence of this, when the VTIResultSet class passes the 
>>>> ActivationHolder to the derby class which invokes the read() method 
>>>> reflectively, why can't that class (it knows it is dealing with a 
>>>> derby table function, it knows the class name, and it has access to 
>>>> the result set descriptor, if not the where clause) pass this 
>>>> information along to the user's table function class?  I would happily 
>>>> implement an interface in this class (not sure why read() has to be 
>>>> static) to get this information prior to result set instantiation.
>>>>
>>>> -chris
>>>>
>>>>
>>>>
>>>> ----- Original Message ----
>>>> From: Rick Hillegas <[email protected]>
>>>> To: Derby Discussion <[email protected]>
>>>> Sent: Monday, July 20, 2009 10:55:33 AM
>>>> Subject: Re: Question about TableFunctions in Derby
>>>>
>>>> Hi Chris,
>>>>
>>>> Some comments inline...
>>>>
>>>> Chris Goodacre wrote:
>>>>            
>>>>        
>>>>> I've read the Derby developer's guide and Rick Hillegas's informative 
>>>>> white paper 
>>>>> (http://developers.sun.com/javadb/reference/whitepapers/sampleTableFunctions/doc/TableFunctionsWhitePaper.html)
>>>>>  on Table Functions, but am still struggling with the following issue:
>>>>>
>>>>> I am trying to create an RDB abstraction for a large CICS/VSAM-based 
>>>>> legacy system and blend it with our newer, RDB-based tier.  This seems 
>>>>> like a good application of TableFunctions.  The VSAM data is made 
>>>>> available to me via an IP-based proprietary messaging interface.  There 
>>>>> are lots of different files here, but due to some historical forces, most 
>>>>> of the data I'm interested in resides in 4 VSAM files.
>>>>>
>>>>> Unfortunately, each of those VSAM files has over 1,000 fields in it.
>>>>>
>>>>> Now eventually, it might be possible to fully model a single VSAM file 
>>>>> into (for the sake of argument) 50 tables; each table/row representing a 
>>>>> small slice of a single VSAM record.
>>>>>
>>>>> In the meantime, for both this proof-of-concept and as a migration path 
>>>>> to our existing clients, I'd like to represent each VSAM file as a table 
>>>>> (subject to the 1024 column SQL limitation per table).  This will be a 
>>>>> highly-denormalized and decidedly non-relational view of the data, but it 
>>>>> will be easy to demonstrate and immediately recognizable to our customers.
>>>>>
>>>>> However, I can't seem to get around the problem of data granularity.  For 
>>>>> example, if my customer executes:
>>>>>
>>>>> select house_number, street, city from table (legacy_realty_data()) where 
>>>>> price < 500000
>>>>>                    
>>>>>          
>>>> Since you have only asked for 3 columns, that's all that Derby will 
>>>> request from the ResultSet instantiated by your table function. That is, 
>>>> Derby is only going to call ResultSet.getXXX() on the house_number, 
>>>> street, and city columns. That should behave efficiently provided that 
>>>> your ResultSet is smart enough to only fault-in columns for which a 
>>>> getXXX() is called.
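>>>>
>>>> A lazily-faulting getString(), for instance, might look like this 
>>>> (sketch only; fetchColumn() and currentRecordKey are made-up hooks 
>>>> into your messaging interface, and columnCache would be cleared on 
>>>> each call to next()):
>>>>
>>>>    private final Map<Integer, Object> columnCache = new HashMap<Integer, Object>();
>>>>
>>>>    public String getString( int columnIndex ) throws SQLException
>>>>    {
>>>>        if ( !columnCache.containsKey( columnIndex ) )
>>>>        {
>>>>            // only now do we pay for this column on this row
>>>>            columnCache.put( columnIndex, fetchColumn( currentRecordKey, columnIndex ) );
>>>>        }
>>>>        return (String) columnCache.get( columnIndex );
>>>>    }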
>>>>
>>>> The WHERE clause is a little trickier. You are right, Derby will read all 
>>>> rows from the ResultSet and throw away the rows which don't satisfy the 
>>>> WHERE clause. What you want to do is push the qualification through the 
>>>> table function to the external data source. I don't see any way to do this 
>>>> other than adding some more arguments to your table function. For 
>>>> instance, if you could push the qualification through to the external data 
>>>> source, then you could get efficient behavior from something like the 
>>>> following:
>>>>
>>>> select house_number, street, city
>>>> from table( legacy_realty_data( 500000 ) ) s;
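>>>>
>>>> where the bound Java method would take a matching argument, e.g. 
>>>> (assuming an INTEGER parameter in the CREATE FUNCTION statement):
>>>>
>>>>    public static ResultSet legacyRealtyData( int maxPrice ) throws SQLException
>>>>    {
>>>>        // forward maxPrice to the external data source
>>>>    }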
>>>>
>>>> Hope this helps,
>>>> -Rick
>>>>
>>>>            
>>>>        
>>>>> I don't appear to have any visibility to the actual query inside my 
>>>>> legacy_realty_data TableFunction, so I have to go get all 1000 fields for 
>>>>> however many listings are present where price < 500000 even though only 
>>>>> three columns will be requested.  Am I missing something?  Aside from 
>>>>> having the user repeat the columns as parameters to the table function 
>>>>> (which looks awkward to say the least), I can't see a way around this 
>>>>> based on my limited knowledge of Derby.
>>>>>
>>>>> Is there a way to only retrieve the columns that the user is querying for?
>>>>>
>>>>> Looking forward to your help/advice.
>>>>>
>>>>> -chris
>>>>>
>>>>>                    
>>>>>          
