Rick, your response is exactly as I would have expected. I assumed the filtering of data based on the join would have to occur "above" the individual queries (though admittedly, I'm no expert in RDBMS theory).
So how do we/I proceed at this point? I'm new to this project/list. Shall I enter this as an enhancement request in Jira? Do you want/need any help with the actual enhancement?

-chris


----- Original Message ----
From: Rick Hillegas <[email protected]>
To: Derby Discussion <[email protected]>
Sent: Monday, August 24, 2009 10:29:47 AM
Subject: Re: Question about TableFunctions in Derby

Hi Chris,

A response inline...

Chris Goodacre wrote:
> Rick,
>
> I think a String[] would be just fine. While I am sure that a String[] is
> probably slower than a BitSet, I did a quick test, looking at different
> comparisons, and matching 205 strings out of 1,646,085 (even using
> equalsIgnoreCase) only takes about 39 ms on my machine.
>
> I think we've completed the first example. I have one other that I'd like
> to look at:
>
>   select h.price from TABLE ( legacy_realty_data() ) h, chicago_zipcodes z
>   where h.zipcode = z.zipcode
>
> In this case, I would get a single String in the referencedColumns array.
> What (if anything) would be in the ScanQualifier[]?

I think that the ScanQualifier[] argument would be null in the first
increment of this work. The optimizer has a strong bias to put the table
function in the outermost slot of the join order. In order to limit the
table function scan with the "h.zipcode = z.zipcode" clause, the optimizer
would need some better heuristics for figuring out when to put the table
function in an inner slot. That could be a follow-on effort.

Hope this seems sensible and useful to you,
-Rick

> chris
>
> ----- Original Message ----
> From: Rick Hillegas <[email protected]>
> To: Derby Discussion <[email protected]>
> Sent: Friday, August 21, 2009 2:07:01 PM
> Subject: Re: Question about TableFunctions in Derby
>
> Hi Chris,
>
> Thanks for the quick response. Some comments inline...
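To make the proposal discussed in this thread concrete, here is a minimal, self-contained sketch of how a table function class might consume the proposed initScan() hook. Note that DerbyScan/initScan are only proposed in this thread and are not part of Derby; ProposedScan and LegacyRealtyScan below are hypothetical stand-ins, and the restriction parameter is typed as Object[] to avoid pulling in ScanQualifier.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Stand-in for the proposed hook (not a real Derby interface):
// Derby would call initScan() before the first call to next().
interface ProposedScan {
    // columnNames: SQL identifiers, uppercased unless double-quoted at
    // CREATE FUNCTION time; restriction may be null in the first increment.
    void initScan(String[] columnNames, Object[] restriction);
}

// Hypothetical table-function backing class showing one way to consume
// the hook: remember which columns were referenced so that only those
// need to be fetched from the legacy source.
class LegacyRealtyScan implements ProposedScan {
    private final Set<String> referenced = new LinkedHashSet<>();
    private boolean fetchAllColumns = true;

    @Override
    public void initScan(String[] columnNames, Object[] restriction) {
        referenced.clear();
        if (columnNames != null) {
            for (String name : columnNames) {
                // Normalize, since unquoted SQL identifiers arrive uppercased.
                referenced.add(name.toUpperCase());
            }
        }
        // If Derby tells us nothing, conservatively fetch every column.
        fetchAllColumns = referenced.isEmpty();
    }

    boolean isReferenced(String column) {
        return fetchAllColumns || referenced.contains(column.toUpperCase());
    }

    Set<String> referencedColumns() {
        return Collections.unmodifiableSet(referenced);
    }
}
```

Under the "first increment" Rick describes, the restriction argument would always arrive as null, so this sketch only narrows the columns, not the rows.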
> Chris Goodacre wrote:
>> Ok - I think I get the basics of your approach: rather than alter the
>> signature of the table function (which is arbitrary and bound as part
>> of the create function statement), you are (optionally) asking the
>> developer to implement a hook that Derby calls prior to actually
>> requesting data (the initScan method). So far, so good. I like the
>> approach.
>>
>> I have a couple of areas I want to make sure I understand:
>>
>> (a) The FormatableBitSet would be a column-ordered set of flags
>> indicating whether the column was included in the select statement,
>> correct? The column index would presumably be based on the order of
>> occurrence of the column in the returns table () subclause of the
>> create function statement. This introduces a dependency into my code,
>> but I don't have enough experience to know whether I just got lucky
>> not needing the dependency yet. Maybe other common implementations of
>> the table function/result set already need to know what order the
>> columns were defined in the table function.
>
> Would it be better if referencedColumns were a String[]? That's fine
> too. There's only one subtlety with that approach: what you would get
> would be an array of SQL identifiers. That means that the strings would
> be uppercased unless you bothered to double-quote the column names at
> CREATE FUNCTION time.
>
> Would some other type be better for referencedColumns?
>
>> (b) I get steps 1 & 2, but in step 3, are you saying that would be an
>> alteration of the current create function statement (like either of
>> the examples below), or a separate step entirely, or neither?
>
> I was just being pedantic. The CREATE FUNCTION statement would not
> change. It would look exactly like it does today.
>
> Hope we're converging on something useful.
> Thanks,
> -Rick
>
>> e.g1: create function legacy_realty_data()
>>       returns table ( postal_code varchar(10),
>>                       price numeric(10,2) )
>>       language java
>>       parameter style DERBY_SCANNABLE_JDBC_RESULT_SET // flags implementation of this DerbyScan interface
>>       no sql
>>       external name 'com.cg.example.LegacyHousingDataFunction.read'
>>
>> e.g2: create function legacy_realty_data()
>>       returns table ( postal_code varchar(10),
>>                       price numeric(10,2) )
>>       language java
>>       parameter style DERBY_JDBC_RESULT_SET
>>       no sql
>>       external name 'com.cg.example.LegacyHousingDataFunction.read' returns DerbyScan
>>
>> Do I understand what you're proposing?
>> -chris
>>
>> ----- Original Message ----
>> From: Rick Hillegas <[email protected]>
>> To: Derby Discussion <[email protected]>
>> Sent: Friday, August 21, 2009 10:12:42 AM
>> Subject: Re: Question about TableFunctions in Derby
>>
>> Hi Chris,
>>
>> Thanks for following up on this. How does something like the following
>> sound? I'm going to phrase this in terms of existing classes. I
>> suspect that we'd want to use some simpler classes if we implemented
>> this--for instance, I'm not happy with ScanQualifier: as your example
>> points out, ScanQualifier pulls in the Derby type system, which is a
>> lot of machinery you don't need to worry about. But here's a sketch:
>>
>> 1) Derby would expose some interface which your ResultSet would
>> implement:
>>
>>   public interface DerbyScan
>>   {
>>       /**
>>        * Setup that is called before Derby calls next() to get the first row
>>        *
>>        * @param referencedColumns Columns which Derby may retrieve. These are column positions as declared at CREATE FUNCTION time.
>>        * @param restriction Array of simple comparisons to constant values. Each comparison applies to a single column.
>>        */
>>       public void initScan( FormatableBitSet referencedColumns, ScanQualifier[] restriction ) throws SQLException;
>>   }
>>
>> 2) You would code something like this:
>>
>>   public class MyResultSet implements ResultSet, DerbyScan { ... }
>>
>> 3) And your CREATE FUNCTION statement would bind a function to a
>> method like the following, which you would code also:
>>
>>   public static MyResultSet legacyRealtyData() throws SQLException { ... }
>>
>> Is this headed in the right direction?
>>
>> Thanks,
>> -Rick
>>
>> Chris Goodacre wrote:
>>> Rick,
>>> Sorry it's taken me so long to reply on this. I just today got back
>>> to this in earnest. I'll try to walk through an example, imagining
>>> that I have an array of ScanQualifiers that gets passed to my table
>>> function's method, just to make sure I understand this.
>>>
>>>   public static ResultSet read(ScanQualifier[] qualifiers) {
>>>       // ... impl
>>>   }
>>>
>>> So, if I were to go back to my original example:
>>>
>>>   select house_number, street, city from table (legacy_realty_data()) where price < 500000
>>>
>>> a) I think that an array with only a single ScanQualifier object
>>> would be passed to my read(...) method.
>>> b) I can see where the operator for the ScanQualifier object would be
>>> some negative number.
>>> c) The column id would reference the column # (basically) of the
>>> price column from the table definition of my CREATE FUNCTION
>>> statement.
>>> d) The result of getOrderable() on the ScanQualifier object would
>>> return me a DataValueDescriptor.
>>> e) I could interrogate the DataValueDescriptor to get the value
>>> (500000) in a type/manner that I could use to pass on to my legacy
>>> system.
>>>
>>> I could use this information to restrict the number of rows that come
>>> back. That's good.
>>>
>>> It would still be nice if I could restrict the number of columns I'm
>>> requesting up front. It's expensive to go back and forth to this
>>> system, so I would rather make one read (all relevant rows, all
>>> relevant columns) and take the chance that the user only uses some of
>>> the rows from the result set.
>>>
>>> Would it be possible to use a ScanQualifier (or something like it) to
>>> inform the table function's method which specific (non-calculated)
>>> columns are in the query?
>>> -chris
>>>
>>> ----- Original Message ----
>>> From: Rick Hillegas <[email protected]>
>>> To: Derby Discussion <[email protected]>
>>> Sent: Monday, July 20, 2009 3:08:31 PM
>>> Subject: Re: Question about TableFunctions in Derby
>>>
>>> Hi Chris,
>>>
>>> Reducing the number of column probes may be possible without any
>>> changes to Derby: when your ResultSet is asked to get a column, it
>>> can remember that request. On later rows, your ResultSet can ask the
>>> external data source for all of the column positions it has
>>> remembered so far. In the query you gave, this would play out like
>>> this:
>>>
>>> 1) On the first row, your ResultSet would make 3 calls to the
>>> external data source, one for each column. But the ResultSet would
>>> remember which columns were requested.
>>>
>>> 2) For each of the remaining N-1 rows, your ResultSet would call the
>>> external data source only once, asking the external data source for
>>> all three columns in a single batch. That batch could then be cached
>>> and the individual columns could be returned to Derby when Derby
>>> called the getXXX() methods.
>>>
>>> Positioning and restricting the rows themselves (the WHERE clause
>>> fragments) is trickier. It probably requires help from Derby, as you
>>> suggest. We could design some interface by which Derby would pass the
>>> ResultSet a list of org.apache.derby.iapi.sql.execute.ScanQualifier.
>>> Your ResultSet could then forward those directives to the external
>>> data source.
>>>
>>> What do you think?
>>> -Rick
>>>
>>> Chris Goodacre wrote:
>>>> Rick, thanks for your suggestions. Perhaps I am being obtuse, but
>>>> when you say ...
>>>>
>>>>   "Since you have only asked for 3 columns, that's all that Derby
>>>>   will request from the ResultSet instantiated by your table
>>>>   function. That is, Derby is only going to call ResultSet.getXXX()
>>>>   on the house_number, street, and city columns. That should behave
>>>>   efficiently provided that your ResultSet is smart enough to only
>>>>   fault-in columns for which a getXXX() is called."
>>>>
>>>> ... does that mean that I make a separate request to the legacy
>>>> system each time getXXX() is called - i.e. lazily initialize each
>>>> column in the result set? I think this has to be the only way to do
>>>> it, since I don't know which columns will be requested at the time
>>>> the read() method of my table function is invoked. Making (in this
>>>> case) 3 calls to the legacy system to get 1 column for N rows is
>>>> certainly better than making 1 call to the legacy system to get 1000
>>>> columns for N rows and then throwing away 997*N values/cells, but it
>>>> is still not quite as nice as I'd like. If I were making a wish, I'd
>>>> wish for some sort of parsed representation of the query to get
>>>> passed to the read method (or to some other method - similar to, or
>>>> even as part of, the query optimization interface). Ideally, this
>>>> structured representation would have the list of columns belonging
>>>> to the table function from the select list, and the where clause
>>>> components specific to the table function only (i.e.
>>>> mytablefunction.price > 50000 but NOT
>>>> mytablefunction.price < myrealtable.value).
>>>> In the absence of this, when the VTIResultSet class passes the
>>>> ActivationHolder to the Derby class which invokes the read() method
>>>> reflectively, why can't that class (it knows it is dealing with a
>>>> Derby table function, it knows the class name, and it has access to
>>>> the result set descriptor, if not the where clause) pass the
>>>> activation context along to the user's table function class? I would
>>>> happily implement an interface in this class (not sure why read()
>>>> has to be static) to get this information prior to resultset
>>>> instantiation.
>>>>
>>>> -chris
>>>>
>>>> ----- Original Message ----
>>>> From: Rick Hillegas <[email protected]>
>>>> To: Derby Discussion <[email protected]>
>>>> Sent: Monday, July 20, 2009 10:55:33 AM
>>>> Subject: Re: Question about TableFunctions in Derby
>>>>
>>>> Hi Chris,
>>>>
>>>> Some comments inline...
>>>>
>>>> Chris Goodacre wrote:
>>>>> I've read the Derby developer's guide and Rick Hillegas's
>>>>> informative white paper
>>>>> (http://developers.sun.com/javadb/reference/whitepapers/sampleTableFunctions/doc/TableFunctionsWhitePaper.html)
>>>>> on Table Functions, but am still struggling with the following
>>>>> issue:
>>>>>
>>>>> I am trying to create an RDB abstraction for a large CICS/VSAM-based
>>>>> legacy system and blend it with our newer, RDB-based tier. This
>>>>> seems like a good application of TableFunctions. The VSAM data is
>>>>> made available to me via an IP-based proprietary messaging
>>>>> interface. There are lots of different files here, but due to some
>>>>> historical forces, most of the data I'm interested in resides in 4
>>>>> VSAM files.
>>>>>
>>>>> Unfortunately, each of those VSAM files has over 1,000 fields in it.
>>>>>
>>>>> Now eventually, it might be possible to fully model a single VSAM
>>>>> file into (for the sake of argument) 50 tables; each table/row
>>>>> representing a small slice of a single VSAM record.
>>>>> In the meantime, for both this proof-of-concept and as a migration
>>>>> path to our existing clients, I'd like to represent each VSAM file
>>>>> as a table (subject to the 1024-column SQL limitation per table).
>>>>> This will be a highly-denormalized and decidedly non-relational
>>>>> view of the data, but it will be easy to demonstrate and
>>>>> immediately recognizable to our customers.
>>>>>
>>>>> However, I can't seem to get around the problem of data
>>>>> granularity. For example, if my customer executes:
>>>>>
>>>>>   select house_number, street, city from table (legacy_realty_data()) where price < 500000
>>>>
>>>> Since you have only asked for 3 columns, that's all that Derby will
>>>> request from the ResultSet instantiated by your table function. That
>>>> is, Derby is only going to call ResultSet.getXXX() on the
>>>> house_number, street, and city columns. That should behave
>>>> efficiently provided that your ResultSet is smart enough to only
>>>> fault-in columns for which a getXXX() is called.
>>>>
>>>> The WHERE clause is a little trickier. You are right, Derby will
>>>> read all rows from the ResultSet and throw away the rows which don't
>>>> satisfy the WHERE clause. What you want to do is push the
>>>> qualification through the table function to the external data
>>>> source. I don't see any way to do this other than adding some more
>>>> arguments to your table function.
>>>> For instance, if you could push the qualification through to the
>>>> external data source, then you could get efficient behavior from
>>>> something like the following:
>>>>
>>>>   select house_number, street, city
>>>>   from table( legacy_realty_data( 500000 ) ) s;
>>>>
>>>> Hope this helps,
>>>> -Rick
>>>>
>>>>> I don't appear to have any visibility to the actual query inside my
>>>>> legacy_realty_data TableFunction, so I have to go get all 1000
>>>>> fields for however many listings are present where price < 500000
>>>>> even though only three columns will be requested. Am I missing
>>>>> something? Aside from having the user repeat the columns as
>>>>> parameters to the table function (which looks awkward to say the
>>>>> least), I can't see a way around this based on my limited knowledge
>>>>> of Derby.
>>>>>
>>>>> Is there a way to only retrieve the columns that the user is
>>>>> querying for?
>>>>>
>>>>> Looking forward to your help/advice.
>>>>>
>>>>> -chris
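The column-caching strategy Rick outlines in this thread (fault columns in one at a time on the first row, then fetch all remembered columns in a single batch on later rows) can be sketched in a self-contained way. LegacySource and ColumnCachingRow below are hypothetical stand-ins for the legacy messaging interface and the caching logic a ResultSet implementation would carry; they are an illustration of the strategy, not Derby code.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Stand-in for the expensive legacy call; each invocation models one
// round trip to the external system.
interface LegacySource {
    Map<String, Object> fetch(int row, Set<String> columns);
}

// Sketch of the caching strategy a table function's ResultSet could use.
class ColumnCachingRow {
    private final LegacySource source;
    private final Set<String> remembered = new LinkedHashSet<>();
    private final Map<String, Object> cache = new HashMap<>();
    private int currentRow = -1;
    int roundTrips = 0;  // instrumentation for this example

    ColumnCachingRow(LegacySource source) { this.source = source; }

    // Would be called when Derby advances the scan (ResultSet.next()).
    void nextRow() {
        currentRow++;
        cache.clear();
        if (currentRow > 0 && !remembered.isEmpty()) {
            // Later rows: one batched round trip for all remembered columns.
            cache.putAll(source.fetch(currentRow, remembered));
            roundTrips++;
        }
    }

    // Would be called from the getXXX() methods.
    Object getColumn(String name) {
        if (!cache.containsKey(name)) {
            // First request for this column on this scan: fault it in
            // individually and remember it for subsequent rows.
            cache.putAll(source.fetch(currentRow, Set.of(name)));
            remembered.add(name);
            roundTrips++;
        }
        return cache.get(name);
    }
}
```

For the three-column query in the thread, this costs 3 round trips on the first row and 1 per row afterward, instead of 3 per row.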
