Tom Lane <[email protected]> wrote:

> Well, yeah, that's the problem. You can certainly maintain your own 
> persistent data structure somewhere, but then it's entirely on your head to 
> manage it and avoid memory leakage/bloating as you process more and more 
> data. The mechanisms I pointed you at provide a structure that makes sure 
> space gets reclaimed when there's no longer a reference to it, but if you go 
> the roll-your-own route then it's a lot messier.

Of course, I'm the sole owner and admin of every instance of PostgreSQL involved, 
but a great many things have always been, and will remain, on my head if I "gots 
it wrongly", so yes, rolling my own would be an option of last resort. I was 
able to "see" how the aggregate memory mechanism works, but I confess I still 
need to connect the dots in the expanded-datum case. I know you're seeing 
something there that I don't yet recognise, so I'll look again.

> A mechanism that might work well enough is a transaction-lifespan hash table. 
> You could look at, for example, uncommitted_enum_types in pg_enum.c for 
> sample code.

And at that, naturally.

Trying hard not to get too far ahead of myself here, perhaps I should try 
running into the foreseen performance issue before addressing it. How about I 
make a trivial implementation of the aggregate and of the user-defined type the 
aggregate would calculate (skipping over the complex optimised processing, the 
compression, and the advanced logic that ensures canonical compressed values), 
and put that up for review first?
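
To make that concrete, the transition function I have in mind would be roughly 
the following, with the decoded state kept in whatever context 
AggCheckCallContext() reports, so the executor reclaims it once the final 
function has run. MyTypeState, mytype_state_create() and mytype_state_add() are 
placeholders for my own code, and the aggregate's state type would be declared 
as internal:

/*
 * A minimal sketch of a trivial transition function.  The decoded state is
 * allocated in the aggregate's memory context so it is reused across
 * transition calls and freed by the executor after the final function runs.
 * MyTypeState, mytype_state_create() and mytype_state_add() are hypothetical.
 */
#include "postgres.h"
#include "fmgr.h"

typedef struct MyTypeState MyTypeState;     /* hypothetical decoded aggregate state */

extern MyTypeState *mytype_state_create(void);                  /* hypothetical */
extern void mytype_state_add(MyTypeState *state, int32 value);  /* hypothetical */

PG_FUNCTION_INFO_V1(mytype_agg_trans);

Datum
mytype_agg_trans(PG_FUNCTION_ARGS)
{
    MemoryContext   aggcontext;
    MemoryContext   oldcontext;
    MyTypeState    *state;

    if (!AggCheckCallContext(fcinfo, &aggcontext))
        elog(ERROR, "mytype_agg_trans called in non-aggregate context");

    if (PG_ARGISNULL(0))
    {
        /* first row: build the decoded state in the aggregate's own context */
        oldcontext = MemoryContextSwitchTo(aggcontext);
        state = mytype_state_create();
        MemoryContextSwitchTo(oldcontext);
    }
    else
        state = (MyTypeState *) PG_GETARG_POINTER(0);

    if (!PG_ARGISNULL(1))
        mytype_state_add(state, PG_GETARG_INT32(1));

    PG_RETURN_POINTER(state);
}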

If all goes to script, the result ought to be that the aggregate nicely reuses 
the decoded version of its state in memory and forgets it existed once the 
final function has run. But each user-defined type function that gets called 
would still decode the stored byte array, do its bit on the data, and encode it 
back to a byte array again.
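
For those ordinary (non-aggregate) functions the naive pattern would look 
something like this, assuming a bytea-like flat representation; decode_mytype(), 
mytype_decoded_add() and encode_mytype() are again placeholders for my own code:

/*
 * A minimal sketch of the naive per-call pattern: detoast the stored byte
 * array, decode it, do its bit on the data, and flatten it again.
 * MyTypeDecoded, decode_mytype(), mytype_decoded_add() and encode_mytype()
 * are hypothetical, not existing APIs.
 */
#include "postgres.h"
#include "fmgr.h"

typedef struct MyTypeDecoded MyTypeDecoded;                      /* hypothetical */

extern MyTypeDecoded *decode_mytype(bytea *flat);                /* hypothetical */
extern void mytype_decoded_add(MyTypeDecoded *d, int32 member);  /* hypothetical */
extern bytea *encode_mytype(MyTypeDecoded *d);                   /* hypothetical */

PG_FUNCTION_INFO_V1(mytype_add_member);

Datum
mytype_add_member(PG_FUNCTION_ARGS)
{
    bytea          *flat = PG_GETARG_BYTEA_P(0);    /* detoasted stored value */
    int32           member = PG_GETARG_INT32(1);
    MyTypeDecoded  *decoded;

    decoded = decode_mytype(flat);          /* byte array -> in-memory form */
    mytype_decoded_add(decoded, member);    /* operate on the decoded data  */

    PG_RETURN_BYTEA_P(encode_mytype(decoded));  /* back to a flat byte array */
}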

It won't be optimal, but it will be simple, and it will go a long way towards 
ensuring I have the whole type and aggregate ecosystem set up as it should be. 
It will also settle any doubts as to whether membership tests should use "IN" 
or "ANY" semantics.

The next step would be to use, say, the transaction memory context to retain 
the decoded version in memory between calls to different (non-aggregate) UDT 
functions made in the same transaction. Beyond finding the appropriate context 
to use, the challenge, as I understand it, would be to identify the particular 
instance of the UDT in memory, which is where you're suggesting a hash table. 
With fewer outstanding vagaries and uncertainties the way forward might be 
quite obvious by then, but from where I stand right now I can only think of 
worse ways than hashing, so it will most likely be hashing.
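
Roughly what I picture, assuming a dynahash table living in 
TopTransactionContext and keyed by a hash of the flat bytes (the mytype names 
are, as before, placeholders):

/*
 * A minimal sketch of a transaction-lifespan cache of decoded values, kept
 * in a dynahash table in TopTransactionContext and keyed by a hash of the
 * flat byte-array representation.  MyTypeDecoded and the entry layout are
 * hypothetical.
 */
#include "postgres.h"
#include "access/xact.h"
#include "common/hashfn.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"

typedef struct MyTypeDecoded MyTypeDecoded;     /* hypothetical decoded form */

typedef struct DecodedCacheEntry
{
    uint32          key;        /* hash of the flat representation             */
    bytea          *flat;       /* copy kept so a full comparison stays possible */
    MyTypeDecoded  *decoded;    /* cached in-memory form, NULL until decoded   */
} DecodedCacheEntry;

static HTAB *decoded_cache = NULL;
static bool  xact_callback_registered = false;

/*
 * Forget the cache pointer at end of transaction; its memory goes away with
 * TopTransactionContext, so there is nothing else to free.
 */
static void
decoded_cache_xact_cb(XactEvent event, void *arg)
{
    if (event == XACT_EVENT_COMMIT || event == XACT_EVENT_ABORT)
        decoded_cache = NULL;
}

static HTAB *
get_decoded_cache(void)
{
    if (decoded_cache == NULL)
    {
        HASHCTL     ctl;

        memset(&ctl, 0, sizeof(ctl));
        ctl.keysize = sizeof(uint32);
        ctl.entrysize = sizeof(DecodedCacheEntry);
        ctl.hcxt = TopTransactionContext;
        decoded_cache = hash_create("mytype decoded cache", 16, &ctl,
                                    HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);

        if (!xact_callback_registered)
        {
            RegisterXactCallback(decoded_cache_xact_cb, NULL);
            xact_callback_registered = true;
        }
    }
    return decoded_cache;
}

/*
 * Find (or create) the cache slot for one flat value; the caller decodes into
 * entry->decoded on first use and must still compare entry->flat against the
 * input to guard against hash collisions.
 */
static DecodedCacheEntry *
lookup_decoded_slot(bytea *flat)
{
    uint32      key = hash_bytes((const unsigned char *) VARDATA_ANY(flat),
                                 VARSIZE_ANY_EXHDR(flat));
    bool        found;
    DecodedCacheEntry *entry;

    entry = (DecodedCacheEntry *) hash_search(get_decoded_cache(), &key,
                                              HASH_ENTER, &found);
    if (!found)
    {
        entry->flat = NULL;
        entry->decoded = NULL;
    }
    return entry;
}

Because the table and its entries live in TopTransactionContext they disappear 
with the transaction, so the callback only has to forget the stale pointer.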

That said, the two most important reality checks we'd have to consider at that 
point would be:

a) whether there are ever enough of these values (as in more than three) 
present in the same memory context to warrant the hash-calculation overhead, 
given that a full comparison will be required anyway, and

b) whether the special characteristic of my UDT values, where identical values 
have, by definition, identical structure in memory, offers enough opportunity 
to use a simple reference count to decide whether a value can/should be changed 
in place, whether the changed value should be written to its own fresh slot, or 
whether it just needs an increased reference count on memory that already 
represents that value, as sketched below.
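
To illustrate what I mean in b), the decision logic would amount to little more 
than this (again entirely hypothetical):

/*
 * A minimal sketch of the reference-count idea: because identical values are,
 * by definition, identical in memory, a decoded value can be shared, and a
 * simple refcount decides whether a change may be made in place or needs a
 * fresh slot.  Everything here is hypothetical.
 */
#include "postgres.h"

typedef struct MyTypeDecoded
{
    int         refcount;       /* number of holders sharing this value */
    /* ... decoded payload ... */
} MyTypeDecoded;

extern MyTypeDecoded *mytype_decoded_copy(const MyTypeDecoded *src);   /* hypothetical */

/*
 * An operation whose result is a value we already hold needs no new memory:
 * just bump the reference count on the existing representation.
 */
static MyTypeDecoded *
mytype_share(MyTypeDecoded *value)
{
    value->refcount++;
    return value;
}

/*
 * Called before modifying a value in place: the sole holder may change it
 * where it is, anyone else writes the changed value to its own fresh slot.
 */
static MyTypeDecoded *
mytype_prepare_for_update(MyTypeDecoded *value)
{
    MyTypeDecoded *copy;

    if (value->refcount == 1)
        return value;                   /* modify in place */

    copy = mytype_decoded_copy(value);  /* fresh slot for the changed value  */
    copy->refcount = 1;
    value->refcount--;                  /* old value keeps its other holders */
    return copy;
}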

Once we've cracked that nut for the ordinary UDT functions, we could look for a 
way to share memory between the aggregate functions and the other UDT functions 
as well.

Either way, I've got a boat load more reading and writing to do. Thank you ever 
so much for your time and attention. It is a great kindness, well appreciated.

Regards,
Marthin Laubscher
 



