Rick, I think a String[] would be just fine. While a String[] is probably slower than a BitSet, I did a quick test looking at different comparisons, and matching 205 strings out of 1,646,085 (even using equalsIgnoreCase) only takes about 39 ms on my machine.
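As a rough illustration of the kind of quick test described above (the class name, sizes, and timing harness here are illustrative stand-ins, not the original benchmark code), a brute-force `equalsIgnoreCase` scan looks like this:

```java
import java.util.ArrayList;
import java.util.List;

public class MatchBench {
    // Count candidates that case-insensitively match any target column name.
    static int countMatches(String[] targets, List<String> candidates) {
        int hits = 0;
        for (String c : candidates) {
            for (String t : targets) {
                if (t.equalsIgnoreCase(c)) { hits++; break; }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String[] targets = { "ZIPCODE", "PRICE", "CITY" };
        List<String> candidates = new ArrayList<>();
        for (int i = 0; i < 1_000_000; i++) candidates.add("COL_" + i);
        candidates.add("zipcode"); // matches despite differing case

        long start = System.nanoTime();
        int hits = countMatches(targets, candidates);
        long ms = (System.nanoTime() - start) / 1_000_000;
        System.out.println(hits + " matches in " + ms + " ms");
    }
}
```

Even at a million candidates the linear scan finishes in tens of milliseconds, which is why the String[] representation seems acceptable here.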
I think we've completed the first example. I have one other that I'd like to look at:

    select h.price
    from TABLE ( legacy_realty_data() ) h, chicago_zipcodes z
    where h.zipcode = z.zipcode

In this case, I would get a single String in the referencedColumns array. What (if anything) would be in the ScanQualifier[]?

chris

----- Original Message ----
From: Rick Hillegas <[email protected]>
To: Derby Discussion <[email protected]>
Sent: Friday, August 21, 2009 2:07:01 PM
Subject: Re: Question about TableFunctions in Derby

Hi Chris,

Thanks for the quick response. Some comments inline...

Chris Goodacre wrote:
> Ok - I think I get the basics of your approach: rather than alter the signature of the table function (which is arbitrary and bound as part of the CREATE FUNCTION statement), you are (optionally) asking the developer to implement a hook that Derby calls prior to actually requesting data (the initScan method). So far, so good. I like the approach.
>
> I have a couple of areas I want to make sure I understand:
>
> (a) The FormatableBitSet would be a column-ordered set of flags indicating whether each column was included in the SELECT statement, correct? The column index would presumably be based on the order of occurrence of the column in the RETURNS TABLE () subclause of the CREATE FUNCTION statement. This introduces a dependency into my code, but I don't have enough experience to know whether I just got lucky not needing the dependency yet. Maybe other common implementations of the table function/result set already need to know what order the columns were defined in the table function.
>
Would it be better if referencedColumns were a String[]? That's fine too. There's only one subtlety with that approach: what you would get would be an array of SQL identifiers. That means that the strings would be uppercased unless you bothered to double-quote the column names at CREATE FUNCTION time. Would some other type be better for referencedColumns?
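The uppercasing subtlety mentioned above follows the usual SQL identifier rule: unquoted identifiers fold to upper case, while double-quoted identifiers keep their exact spelling. A small sketch of that folding rule (this helper is illustrative, not Derby code) shows what a String[] referencedColumns would carry for each spelling:

```java
import java.util.Locale;

public class IdentifierFolding {
    // Mimics SQL identifier handling: unquoted identifiers fold to upper
    // case; double-quoted identifiers keep their exact spelling.
    static String fold(String raw) {
        if (raw.length() >= 2 && raw.startsWith("\"") && raw.endsWith("\"")) {
            return raw.substring(1, raw.length() - 1);
        }
        return raw.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(fold("zipcode"));     // ZIPCODE
        System.out.println(fold("\"zipcode\"")); // zipcode
    }
}
```

So a table function comparing those strings against its own column names would either need to quote the names at CREATE FUNCTION time or compare case-insensitively, as in the timing test earlier in the thread.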
> (b) I get steps 1 & 2, but in step 3, are you saying that would be an alteration of the current CREATE FUNCTION statement (like either of the examples below), or a separate step entirely, or neither?
>
I was just being pedantic. The CREATE FUNCTION statement would not change. It would look exactly like it does today.

Hope we're converging on something useful.

Thanks,
-Rick

> e.g. 1:
>
>   create function legacy_realty_data()
>   returns table ( postal_code varchar(10),
>                   price numeric(10,2) )
>   language java
>   parameter style DERBY_SCANNABLE_JDBC_RESULT_SET // flags implementation of this DerbyScan interface
>   no sql
>   external name 'com.cg.example.LegacyHousingDataFunction.read'
>
> e.g. 2:
>
>   create function legacy_realty_data()
>   returns table ( postal_code varchar(10),
>                   price numeric(10,2) )
>   language java
>   parameter style DERBY_JDBC_RESULT_SET
>   no sql
>   external name 'com.cg.example.LegacyHousingDataFunction.read'
>   returns DerbyScan
>
> Do I understand what you're proposing?
> -chris
>
> ----- Original Message ----
> From: Rick Hillegas <[email protected]>
> To: Derby Discussion <[email protected]>
> Sent: Friday, August 21, 2009 10:12:42 AM
> Subject: Re: Question about TableFunctions in Derby
>
> Hi Chris,
>
> Thanks for following up on this. How does something like the following sound? I'm going to phrase this in terms of existing classes. I suspect that we'd want to use some simpler classes if we implemented this--for instance, I'm not happy with ScanQualifier: as your example points out, ScanQualifier pulls in the Derby type system, which is a lot of machinery you don't need to worry about. But here's a sketch:
>
> 1) Derby would expose some interface which your ResultSet would implement:
>
>   public interface DerbyScan
>   {
>       /**
>        * Setup that is called before Derby calls next() to get the first row.
>        *
>        * @param referencedColumns Columns which Derby may retrieve. These are column positions as declared at CREATE FUNCTION time.
>        * @param restriction Array of simple comparisons to constant values. Each comparison applies to a single column.
>        */
>       public void initScan( FormatableBitSet referencedColumns, ScanQualifier[] restriction ) throws SQLException;
>   }
>
> 2) You would code something like this:
>
>   public class MyResultSet implements ResultSet, DerbyScan { ... }
>
> 3) And your CREATE FUNCTION statement would bind a function to a method like the following, which you would code also:
>
>   public static MyResultSet legacyRealtyData() throws SQLException { ... }
>
> Is this headed in the right direction?
>
> Thanks,
> -Rick
>
> Chris Goodacre wrote:
>
>> Rick,
>> Sorry it's taken me so long to reply on this. I just today got back to this in earnest. I'll try to walk through an example, imagining that I have an array of ScanQualifiers that gets passed to my table function's method, just to make sure I understand this.
>>
>>   public static ResultSet read(ScanQualifier[] qualifiers) {
>>       // ... impl
>>   }
>>
>> So, if I were to go back to my original example:
>>
>>   select house_number, street, city from table (legacy_realty_data()) where price < 500000
>>
>> a) I think that an array with only a single ScanQualifier object would be passed to my read(...) method.
>> b) I can see where the operator for the ScanQualifier object would be some negative number.
>> c) The column id would reference the column number (basically) of the price column from the table definition of my CREATE FUNCTION statement.
>> d) The result of getOrderable() on the ScanQualifier object would return a DataValueDescriptor.
>> e) I could interrogate the DataValueDescriptor to get the value (500000) in a type/manner that I could use to pass on to my legacy system.
>>
>> I could use this information to restrict the number of rows that come back. That's good.
>>
>> It would still be nice if I could restrict the number of columns I'm requesting up front.
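The proposed initScan hook and the qualifier walkthrough above can be sketched with simplified stand-ins: plain java.util.BitSet in place of FormatableBitSet, and a toy Qualifier record in place of ScanQualifier (which, as noted, drags in the Derby type system). Everything here except the names DerbyScan/initScan is an assumption made for illustration:

```java
import java.util.BitSet;

public class ScanHookSketch {
    // Toy stand-in for ScanQualifier: a column position (as declared at
    // CREATE FUNCTION time), an operator, and the constant compared against.
    record Qualifier(int columnId, String operator, Object constant) {}

    // Simplified shape of the proposed hook; the real version would take
    // FormatableBitSet and ScanQualifier[] and throw SQLException.
    interface DerbyScanLike {
        void initScan(BitSet referencedColumns, Qualifier[] restriction);
    }

    static class LegacyRealtyScan implements DerbyScanLike {
        BitSet wanted;
        Integer maxPrice; // restriction to push to the legacy source

        public void initScan(BitSet referencedColumns, Qualifier[] restriction) {
            this.wanted = referencedColumns;
            for (Qualifier q : restriction) {
                // Step (e): interrogate the constant in a usable type.
                if (q.columnId() == 2 && "<".equals(q.operator())) {
                    this.maxPrice = (Integer) q.constant();
                }
            }
            // A real implementation would now issue one filtered, projected
            // read against the legacy system before the first next() call.
        }
    }

    public static void main(String[] args) {
        BitSet cols = new BitSet();
        cols.set(2); // suppose price is the second declared column
        LegacyRealtyScan scan = new LegacyRealtyScan();
        scan.initScan(cols, new Qualifier[] { new Qualifier(2, "<", 500000) });
        System.out.println(scan.maxPrice); // 500000
    }
}
```

The point of the sketch is only the shape of the contract: Derby hands the projection and the simple restrictions to the ResultSet before the first next(), and the implementation translates them into one request to the external source.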
>> It's expensive to go back and forth to this system, so I would rather make one read (all relevant rows, all relevant columns) and take the chance that the user only uses some of the rows from the result set.
>>
>> Would it be possible to use a ScanQualifier (or something like it) to inform the table function's methods which specific (non-calculated) columns are in the query?
>> -chris
>>
>> ----- Original Message ----
>> From: Rick Hillegas <[email protected]>
>> To: Derby Discussion <[email protected]>
>> Sent: Monday, July 20, 2009 3:08:31 PM
>> Subject: Re: Question about TableFunctions in Derby
>>
>> Hi Chris,
>>
>> Reducing the number of column probes may be possible without any changes to Derby: when your ResultSet is asked to get a column, it can remember that request. On later rows, your ResultSet can ask the external data source for all of the column positions it has remembered so far. In the query you gave, this would play out like this:
>>
>> 1) On the first row, your ResultSet would make 3 calls to the external data source, one for each column. But the ResultSet would remember which columns were requested.
>>
>> 2) For each of the remaining N-1 rows, your ResultSet would call the external data source only once, asking it for all three columns in a single batch. That batch could then be cached and the individual columns returned to Derby when Derby called the getXXX() methods.
>>
>> Positioning and restricting the rows themselves (the WHERE clause fragments) is trickier. It probably requires help from Derby, as you suggest. We could design some interface by which Derby would pass the ResultSet a list of org.apache.derby.iapi.sql.execute.ScanQualifier. Your ResultSet could then forward those directives to the external data source.
>>
>> What do you think?
>> -Rick
>>
>> Chris Goodacre wrote:
>>
>>> Rick, thanks for your suggestions.
>>> Perhaps I am being obtuse, but when you say ...
>>>
>>> "Since you have only asked for 3 columns, that's all that Derby will request from the ResultSet instantiated by your table function. That is, Derby is only going to call ResultSet.getXXX() on the house_number, street, and city columns. That should behave efficiently provided that your ResultSet is smart enough to only fault-in columns for which a getXXX() is called."
>>>
>>> Does that mean that I make a separate request to the legacy system each time getXXX() is called - i.e. lazily initialize each column in the result set? I think this has to be the only way to do it, since I don't know which columns will be requested at the time the read() method of my table function is invoked. Making (in this case) 3 calls to the legacy system to get 1 column for N rows is certainly better than making 1 call to the legacy system to get 1000 columns for N rows and then throwing away 997*N values/cells, but still not quite as nice as I'd like. If I were making a wish, I'd wish for some sort of parsed representation of the query to be passed to the read method (or to some other method - similar to, or even as part of, the query optimization interface). Ideally, this structured representation would have the list of columns belonging to the table function from the select list, and the WHERE clause components specific to the table function only (i.e. mytablefunction.price > 50000 but NOT mytablefunction.price < myrealtable.value).
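The remember-and-batch strategy Rick describes above (individual probes on the first row, one batched call per later row) can be sketched against a toy data source. The LegacySource interface and the round-trip counter here are illustrative assumptions, not Derby or CICS APIs:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class ColumnCachingScan {
    // Toy stand-in for the external system: fetch given columns of one row.
    interface LegacySource {
        Map<Integer, Object> fetch(int row, Set<Integer> columns);
    }

    private final LegacySource source;
    private final Set<Integer> remembered = new TreeSet<>();
    private Map<Integer, Object> currentRow = new HashMap<>();
    private int row = -1;
    int roundTrips = 0; // how many calls were made to the legacy system

    ColumnCachingScan(LegacySource source) { this.source = source; }

    // Advance to the next row, pre-fetching every remembered column in one batch.
    void next() {
        row++;
        currentRow = new HashMap<>();
        if (!remembered.isEmpty()) {
            currentRow.putAll(source.fetch(row, remembered));
            roundTrips++;
        }
    }

    // Lazily fault in a column; remember it for subsequent rows.
    Object getColumn(int col) {
        if (!currentRow.containsKey(col)) {
            currentRow.putAll(source.fetch(row, Set.of(col)));
            roundTrips++;
            remembered.add(col);
        }
        return currentRow.get(col);
    }

    public static void main(String[] args) {
        LegacySource src = (row, cols) -> {
            Map<Integer, Object> out = new HashMap<>();
            for (int c : cols) out.put(c, "r" + row + "c" + c);
            return out;
        };
        ColumnCachingScan scan = new ColumnCachingScan(src);
        for (int r = 0; r < 3; r++) {
            scan.next();
            scan.getColumn(1); scan.getColumn(2); scan.getColumn(3);
        }
        // First row: 3 individual probes; each of the 2 later rows: 1 batch.
        System.out.println(scan.roundTrips); // 5
    }
}
```

For N rows and k referenced columns this makes k + (N - 1) round trips instead of k * N, without Derby having to reveal the select list at all.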
>>> In the absence of this, when the VTIResultSet class passes the ActivationHolder to the Derby class which invokes the read() method reflectively, why can't that class (it knows it is dealing with a Derby table function, it knows the class name, and it has access to the result set descriptor, if not the WHERE clause) pass this information along to the user's table function class? I would happily implement an interface in this class (I'm not sure why read() has to be static) to get this information prior to ResultSet instantiation.
>>>
>>> -chris
>>>
>>> ----- Original Message ----
>>> From: Rick Hillegas <[email protected]>
>>> To: Derby Discussion <[email protected]>
>>> Sent: Monday, July 20, 2009 10:55:33 AM
>>> Subject: Re: Question about TableFunctions in Derby
>>>
>>> Hi Chris,
>>>
>>> Some comments inline...
>>>
>>> Chris Goodacre wrote:
>>>
>>>> I've read the Derby developer's guide and Rick Hillegas's informative white paper (http://developers.sun.com/javadb/reference/whitepapers/sampleTableFunctions/doc/TableFunctionsWhitePaper.html) on Table Functions, but am still struggling with the following issue:
>>>>
>>>> I am trying to create an RDB abstraction for a large CICS/VSAM-based legacy system and blend it with our newer, RDB-based tier. This seems like a good application of TableFunctions. The VSAM data is made available to me via an IP-based proprietary messaging interface. There are lots of different files here, but due to some historical forces, most of the data I'm interested in resides in 4 VSAM files.
>>>>
>>>> Unfortunately, each of those VSAM files has over 1000 fields in it.
>>>>
>>>> Now eventually, it might be possible to fully model a single VSAM file into (for the sake of argument) 50 tables, each table/row representing a small slice of a single VSAM record.
>>>>
>>>> In the meantime, for both this proof-of-concept and as a migration path for our existing clients, I'd like to represent each VSAM file as a table (subject to the 1024-column SQL limitation per table). This will be a highly denormalized and decidedly non-relational view of the data, but it will be easy to demonstrate and immediately recognizable to our customers.
>>>>
>>>> However, I can't seem to get around the problem of data granularity. For example, if my customer executes:
>>>>
>>>>   select house_number, street, city from table (legacy_realty_data()) where price < 500000
>>>>
>>> Since you have only asked for 3 columns, that's all that Derby will request from the ResultSet instantiated by your table function. That is, Derby is only going to call ResultSet.getXXX() on the house_number, street, and city columns. That should behave efficiently provided that your ResultSet is smart enough to only fault-in columns for which a getXXX() is called.
>>>
>>> The WHERE clause is a little trickier. You are right: Derby will read all rows from the ResultSet and throw away the rows which don't satisfy the WHERE clause. What you want to do is push the qualification through the table function to the external data source. I don't see any way to do this other than adding some more arguments to your table function. For instance, if you could push the qualification through to the external data source, then you could get efficient behavior from something like the following:
>>>
>>>   select house_number, street, city
>>>   from table( legacy_realty_data( 500000 ) ) s;
>>>
>>> Hope this helps,
>>> -Rick
>>>
>>>> I don't appear to have any visibility into the actual query inside my legacy_realty_data TableFunction, so I have to go get all 1000 fields for however many listings are present where price < 500000, even though only three columns will be requested. Am I missing something?
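The workaround Rick sketches above — passing the restriction as an ordinary function argument — would pair a CREATE FUNCTION declaring an extra parameter with a static method that filters at the source. The method body, data, and class name below are illustrative stand-ins; a real implementation would query the legacy system and wrap the rows in a java.sql.ResultSet:

```java
import java.util.ArrayList;
import java.util.List;

public class LegacyRealtyFunction {
    // Stand-in rows: { house_number, price }. Imagined usage:
    //
    //   create function legacy_realty_data( maxPrice int ) returns table (...)
    //   ...
    //   select house_number, street, city
    //   from table( legacy_realty_data( 500000 ) ) s;
    static List<int[]> readBelow(int maxPrice) {
        List<int[]> all = List.of(new int[] {101, 450_000},
                                  new int[] {102, 750_000},
                                  new int[] {103, 325_000});
        List<int[]> kept = new ArrayList<>();
        for (int[] row : all) {
            if (row[1] < maxPrice) kept.add(row); // filter at the source
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(readBelow(500_000).size()); // 2
    }
}
```

The trade-off, as Chris notes later, is that the restriction becomes part of the function's signature rather than something Derby pushes down automatically.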
>>>> Aside from having the user repeat the columns as parameters to the table function (which looks awkward, to say the least), I can't see a way around this based on my limited knowledge of Derby.
>>>>
>>>> Is there a way to only retrieve the columns that the user is querying for?
>>>>
>>>> Looking forward to your help/advice.
>>>>
>>>> -chris
