Re: [jira] Updated: (CAY-1282) Use #result as optional directive for only few columns (not all)

Andrus Adamchik Mon, 12 Oct 2009 03:59:17 -0700

Excellent. Looks like we are on the same page now.

Re: HashSet (HashMap?) vs. []. The map definitely performs better when you are looking up a value by its key. This may be the case when we are assembling column descriptors inside the builder. This operation is done 1 time (ok maybe it is done N times, where N is the number of columns).

However processing the ResultSet is a different story. There's no lookup by key. For each ResultSet row we need to apply ALL column descriptors one by one to get the values out. So with a HashSet/HashMap we'd have this:


   Iterator<ColumnDescriptor> it = map.values().iterator();
   while(it.hasNext()) {
       ColumnDescriptor column = it.next();
       ....
   }

With a ColumnDescriptor[] we have this:

   for (int i = 0; i < length; i++) {
       ColumnDescriptor column = columns[i];
   }

Both loops are done M times, where M is the number of rows in the ResultSet. In the worst case scenario, M is much larger than N. In the first case, we call three extra methods (iterator, hasNext, next) and create at least one extra object (Iterator). So the second case is marginally faster. Now if you multiply that nanosecond or whatever difference by a few million, it can become more significant.

So essentially when talking about this refactoring we need to separate the first step of preparing the columns, and the second step of using them.
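
To make the two steps concrete, here is a minimal sketch, not the actual Cayenne code. The ColumnDescriptor below is a stripped-down stand-in that only carries a name, and ColumnsBuilder, TwoStepColumns and countReads are hypothetical names made up for illustration. The builder keeps descriptors in a map keyed by column name (cheap lookup, no duplicates) and hands out a plain array that the per-row loop can walk by index:

```java
import java.util.LinkedHashMap;
import java.util.Map;

class TwoStepColumns {

    // Step 2: executed for every row, so it walks a plain array by index.
    // Returns the number of descriptor reads, i.e. N * M.
    static int countReads(ColumnDescriptor[] columns, int rowCount) {
        int reads = 0;
        int length = columns.length;
        for (int row = 0; row < rowCount; row++) {
            for (int i = 0; i < length; i++) {
                ColumnDescriptor column = columns[i];
                // a real implementation would read the value for 'column' here
                reads++;
            }
        }
        return reads;
    }

    public static void main(String[] args) {
        ColumnDescriptor[] columns = new ColumnsBuilder()
                .add("ID").add("NAME").add("ID") // the second "ID" is ignored
                .build();
        System.out.println(columns.length);
        System.out.println(countReads(columns, 1000));
    }
}

// Hypothetical stand-in for a column descriptor; only the name is modeled.
final class ColumnDescriptor {
    final String name;

    ColumnDescriptor(String name) {
        this.name = name;
    }
}

// Hypothetical builder. Step 1: used once per query.
final class ColumnsBuilder {
    // Keyed by column name: fast lookup and no duplicate columns.
    // LinkedHashMap preserves the order in which columns were added.
    private final Map<String, ColumnDescriptor> byName = new LinkedHashMap<>();

    ColumnsBuilder add(String name) {
        byName.putIfAbsent(name, new ColumnDescriptor(name));
        return this;
    }

    // Freeze the map into an array for the per-row loop.
    ColumnDescriptor[] build() {
        return byName.values().toArray(new ColumnDescriptor[0]);
    }
}
```

The map exists only while the columns are being prepared; once build() returns, the row-processing code never sees it.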


Andrus


On Oct 12, 2009, at 12:38 PM, Evgeny Ryabitskiy wrote:

It doesn't matter how this is represented *inside* the builder class, as
the builder is used only once per query. On the other hand, coming out of
the builder it must be optimized, as access to the column descriptors
array is performed N*M times during each result set processing, where N is
the width of the result set, and M is its length. I.e. it can be a very
large number (up to tens or hundreds of millions of calls). Every small
optimization matters here.
So.. I was talking exactly about optimization... HashedSet array can
be faster because we perform several scans over the whole array of
ColumnDescriptors. And it is safer because we don't get duplicates for
columns. And.. I didn't get your position on this idea.
This is something I don't know. We need to check about a dozen drivers
from different vendors that we support to verify that. This is just a
getter in the interface. Implementors could've made it anything.
I have looked through the JTDS driver (not a dozen but at least one).
ResultSet has all the information about columns (just private final
ColInfo[] columns).
When getMetaData is performed, it constructs a new object that has a
reference to the array of columns from the ResultSet.
Looks like there is no problem with JTDS.
The problem that if we don't set ResultSetMetadata like in current
(trunk) version, without ResultSetMetadata  we don't know all
columns..
Not true. We don't know all the columns for SQLTemplate queries. For all
other types of queries we DO know all the columns, as Cayenne generates SQL
from scratch for those queries. I think this one place is where we have the
biggest mismatch in our views of the implementation.
ah... now I see. You are right that was a mismatch in our views. I
will work on it in the evening.
Another thing to check here is actually reading column data from the
returned ResultSetMetadata, as lazy resolving of it can be postponed a
step further.
Again in JTDS it's just an array of ColInfo (like our
ColumnDescriptor); it's passed to RowSet through the constructor from
the protocol implementation.


Evgeny Ryabitskiy.
