Excellent. Looks like we are on the same page now.

Re: HashSet (HashMap?) vs. []. A map definitely performs better when you are looking up a value by its key. This may be the case when we are assembling column descriptors inside the builder. That operation is done once (OK, maybe N times, where N is the number of columns).

However, processing the ResultSet is a different story. There's no lookup by key. For each ResultSet row we need to apply ALL column descriptors one by one to get the values out. So with a HashSet/HashMap we'd have this:

   Iterator<ColumnDescriptor> it = map.values().iterator();
   while(it.hasNext()) {
       ColumnDescriptor column = it.next();
       ....
   }

With a ColumnDescriptor[] we have this:

   for (int i = 0; i < length; i++) {
       ColumnDescriptor column = columns[i];
   }

Both loops are done M times, where M is the number of rows in the ResultSet. In the worst case scenario, M is much larger than N. In the first case, we call three extra methods (iterator, hasNext, next) and create at least one extra object (Iterator). So the second case is marginally faster. Now if you multiply that nanosecond or whatever difference by a few million, it can become more significant.

So essentially when talking about this refactoring we need to separate the first step of preparing the columns, and the second step of using them.
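To make the two steps concrete, here is a minimal sketch, not the actual Cayenne code. The class and method names (ColumnPipelineSketch, prepare, countCells) and the one-field ColumnDescriptor are hypothetical stand-ins. The preparation step uses a map, so lookups by key are cheap and duplicates collapse; the result is then frozen into an array for the per-row loop.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ColumnPipelineSketch {

    // Hypothetical stand-in for the real column descriptor; only a name is modeled.
    static final class ColumnDescriptor {
        final String name;

        ColumnDescriptor(String name) {
            this.name = name;
        }
    }

    // Step 1 (done once, in the builder): collect columns keyed by name so that
    // duplicates collapse and insertion order is kept, then freeze into an array.
    static ColumnDescriptor[] prepare(String... names) {
        Map<String, ColumnDescriptor> byName = new LinkedHashMap<>();
        for (String name : names) {
            byName.putIfAbsent(name, new ColumnDescriptor(name));
        }
        return byName.values().toArray(new ColumnDescriptor[0]);
    }

    // Step 2 (done for every row): plain indexed loop over the array,
    // with no Iterator object and no iterator(), hasNext() or next() calls.
    static long countCells(ColumnDescriptor[] columns, int rows) {
        long cells = 0;
        int length = columns.length;
        for (int row = 0; row < rows; row++) {
            for (int i = 0; i < length; i++) {
                ColumnDescriptor column = columns[i];
                if (column != null) {
                    cells++; // placeholder for reading the column value from the row
                }
            }
        }
        return cells;
    }

    public static void main(String[] args) {
        ColumnDescriptor[] columns = prepare("ID", "NAME", "ID");
        System.out.println(columns.length + " columns, " + countCells(columns, 1000) + " cells");
    }
}
```

With three names passed in and one of them a duplicate, prepare() returns two descriptors, and the row loop touches N*M = 2 * 1000 cells.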

Andrus


On Oct 12, 2009, at 12:38 PM, Evgeny Ryabitskiy wrote:

It doesn't matter how this is represented *inside* the builder class, as
the builder is used only once per query. On the other hand, coming out of the builder it must be optimized, as access to the column descriptors array is performed N*M times during each result set processing, where N is the width of the result set, and M is its length. I.e. it can be a very large number (up to tens or hundreds of millions of calls). Every small optimization matters
here.

So.. I was talking exactly about optimization... A HashedSet instead of
an array can be faster, because we perform several scans over the whole
array of ColumnDescriptors. It is also safer, because we don't get
duplicate columns. And.. I didn't get your position on this idea.

This is something I don't know. We need to check about a dozen drivers from the different vendors that we support to verify that. This is just a getter
in the interface. Implementors could've made it anything.


I have looked through the JTDS driver (not a dozen, but at least one).
Its ResultSet has all the information about columns (just a private final
ColInfo[] columns).
When getMetaData is called, it constructs a new object that holds a
reference to the array of columns from the ResultSet.
Looks like there is no problem with JTDS.


The problem is that if we don't set ResultSetMetadata, as in the current
(trunk) version, then without ResultSetMetadata we don't know all the
columns..

Not true. We don't know all the columns for SQLTemplate queries. For all other types of queries we DO know all the columns, as Cayenne generates SQL from scratch for those queries. I think this one place is where we have the
biggest mismatch in our views of the implementation.

ah... now I see. You are right that was a mismatch in our views. I
will work on it in the evening.

Another thing to check here is actually reading column data from the returned ResultSetMetadata, as lazy
resolving of it can be postponed a step further.

Again, in JTDS it's just an array of ColInfo (like our
ColumnDescriptor); it's passed to the RowSet through the constructor from
the protocol implementation.


Evgeny Ryabitskiy.

