Re: Having a code ref executed when result set is all fetched

Elizabeth Mattijsen Tue, 27 Mar 2012 07:17:44 -0700

How time flies when you're having some days off...

On Mar 15, 2012, at 1:35 PM, Tim Bunce wrote:
> On Wed, Mar 14, 2012 at 01:35:21PM +0100, Elizabeth Mattijsen wrote:
>> Maybe some background first:
>> 
>> I'm working on code that does queries on large tables.  To be able to 
>> support the growth of these tables, we need to shard the tables to multiple 
>> database servers.  One way to shard this evenly, would be to shard per 
>> second of data: even seconds go to one database, odd seconds go to another.
>> 
>> We have a lot of code that creates minutely / hourly / daily aggregations 
>> (in Perl, the database just cannot do the kind of aggregations that we need) 
>> by querying these tables.
>> 
>> When moving to sharded tables, I would like to keep the changes to the code 
>> as minimal as possible.
>> 
>> By monkeypatching DBI::st, I've been able to reduce the code change to only 
>> having to add another loop construct to the code, so something like:
>> 
>> =============================
>> my %visitors;
>> my $values;
>> $visitors{ $values->[0] }= undef
>>  while $values= $sth->fetchrow_arrayref;
>> printf "Found %d unique visitors\n", scalar keys %visitors;
>> =============================
> 
> BTW, using bind_col() would be faster:
> 
>    $sth->bind_col(1, \my $visitor_id);
>    $visitors{ $visitor_id } = undef
>        while $sth->fetchrow_arrayref;


Thanks for the tip.  The code really was just a simplification of a more 
generic meme, in which multiple columns are being fetched.  The point was that 
the aggregation needs to be done in Perl.


> though it should be faster still to do a SELECT DISTINCT on the server
> for that particular case.

Even in this particular case, one cannot be sure.  As long as any temporary 
result remains in RAM in the server, probably.  As soon as any temporary result 
is stored on disk, the Perl aggregation approach wins hands down.  I've seen 
that happen on N-million row records with the number of unique occurrences 
reaching the number of records.  Fast network + RAM beats (rotating) disk 
almost any time.


>> Becomes:
>> =============================
>> my %visitors;
>> my $values;
>> while ( $sth= $sth->next ) {
>>    $visitors{ $values->[0] }= undef
>>      while $values= $sth->fetchrow_arrayref;
>> }
>> printf "Found %d unique visitors\n", scalar keys %visitors;
>> =============================
> 
> There's actually an (almost completely undocumented) more_results()
> method that's meant for that kind of thing. I was rather surprised
> to find it wasn't documented as I'm pretty sure we thrashed out the
> semantics some time ago.
> 
> [later] Wow, "some time ago" was 2004 :)
> See 
> http://markmail.org/thread/i7drouwtpdiybjzn#query:+page:1+mid:p4dy4uv3t2kpku56+state:results
> The "example implementation" in there assumed $sth->prepare works.
> Given current realities the more_results method would be like:
>    $sth = $sth->{Database}->prepare($sql);
>    $sth->execute;
>    return $sth; # caller must use this returned handle

The "next" method looks a lot like the "more_results" method, but with one very 
important difference.  The "more_results" method depends on database driver 
support for multiple result sets.  In my case, each statement could be executed 
using a *different* database handle, potentially on a different database 
instance, potentially with a different database driver.

In other words: "more_results" is too limited in its scope to be useful for 
what I'm trying to accomplish.


>> Then I found out about "swap_inner_handles" and realised that *if* I
>> could catch the depletion of the result set, I could do the $sth->next
>> logic using swap_inner_handles.  And make it completely invisible to
>> the outside world.
> 
> swap_inner_handle (aka "brain surgery for handles") is kind'a scary.
> But yes, it's worth exploring.
> 
>    my $new_sth = $old_sth->{Database}->prepare($sql);
>    $new_sth->swap_inner_handle($old_sth);
>    $old_sth->execute;
>    return $old_sth; # call can ignore this
> 
> 
>> However, I've not been able to find a way to catch that state change.
>> 
>> I have tried:
>> 
>>  Stealing $sth->finish (on the assumption that that would be called when the 
>> result set depletes)
> 
> Compiled drivers would tend to call the drivers own internal function
> for that.

That's what I saw, indeed.  Glad to see that confirmed.


>>  Stealing DBI::st::STORE (on the assumption that the tie interface would be 
>> used to reset the Active attribute)
> Again, compiled drivers tend to take shortcuts (in this case via the
> DBIc_ACTIVE_off macro defined in DBIXS.h).

Ok, if I interpret this correctly, then a change in that macro that would allow 
support for a code reference to be called, would be the only change needed.  
And a recompile of any database driver, of course.   But that would be 
transparent for anybody installing from source, right?


>> Neither of these work, unfortunately.
>> 
>> 
>> So I wonder, would it make sense to add a statement handle attribute
>> like "last_record_seen" which would contain a code ref?  And have that
>> code execute with the statement handle as its only parameter whenever
>> the result set of a statement handle gets depleted, but before
>> fetchrow_arrayref returns undef?  If that would exist, then it would
>> simply be a matter of putting the swap_inner_handle logic in that code.
>> 
>> something like:
>> 
>> $sth->{last_record_seen}= sub {
>>    my ($old)= @_;
>>    $old->swap_inner_handle($new);  # wherever $new comes from, not important 
>> here
>> }
>> 
>> Then it would be possible to completely hide the sharding of the
>> databases from our code.  Which would make life a lot easier on a lot
>> of people now and in the future.
> 
> It's not quite that simple if you're trying to avoid code changes in the
> application - i.e., if a while(fetch) loop should transparently return
> rows from multiple statements. The problem is the control flow.
> 
> At what point in the dispatcher does last_record_seen get called
> and how does it, or the dispatcher, then arrange to (re)call the
> fetch method to get the next row?
> 
> Hooking into the Active flag being turned off wouldn't be safe
> since the driver is not expecting to be reentered at that point.

I guess my use of swap_inner_handle would circumvent that.  Nothing changes for 
the old inner statement handle.  And the new inner statement handle would 
simply become active.  The driver would not need to know about this, as this is 
already at statement handle level.  Or am I missing something?


> The existing Callback mechanism 
> http://search.cpan.org/~timb/DBI/DBI.pm#Callbacks
> only works on the pre-call side of the dispatcher, currently.
> Extending it to allow callbacks on the post-call (return) side is
> certainly a possibility that Ive outlined previously. To avoid being
> too expensive for your needs the mechanism would have to support an
> optional "only call this callback if the return value is false" flag.
> 
> So perhaps we'd end up with something like this
> 
>    $sth->{Callbacks}{fetchrow_arrayref} = [
>        undef,  # pre-call hook
>        undef,  # post-call hook for true returns
>        sub {   # post-call hook for false returns
>            ...
>        }
>    ];

Thinking about this some more, the way forward for this seems to be adding 
support for another special callback key, e.g. 'Active.off' (by adapting the 
DBIc_ACTIVE_off macro).  Its parameters would be the statement handle on which 
the last record was just fetched (causing the Active attribute to be set to 
"off"), and the method name that was used to exhaust the resultset.


>> Actually, this callback would need to return whether or not to call the 
>> original fetch code again.  Something like:
>> 
>> $sth->{last_record_seen}= sub {
>>   my ($old)= @_;
>>   if ($new) {   # wherever $new comes from, not important here
>>       $old->swap_inner_handle($new);
>>       return 1;  # please try again
>>   }
>>   return 0; # we're really done, thank you
>> }
> 
> I don't know how easy or safe it would be to introduce a loop into the
> dispatcher to allow that. Maybe not too hard. It may be a better API to
> treat the return value of the post-call callback as the value to be
> returned to the application, so the callback would end with
> 
>    return $sth->fetchrow_arrayref;

I guess that would need to be:

  $sth->{Callbacks}->{Active.off}= sub {
      my ( $old, $method )= @_;
      if ($new) {  # wherever $new comes from
          $old->swap_inner_handle($new);
          return $old->$method;
      }
      return;
  };

Or is there in the end just one method that can exhaust a resultset?



Liz

Re: Having a code ref executed when result set is all fetched

Reply via email to