Christoph Zwerschke wrote:

> On 13.09.2016 at 08:35, raf wrote:
> >I suspect that if people aren't complaining about it, that might
> >be because array columns probably aren't a very popular datatype
> >in SQL databases. The only reason I'm using them is because it
> >was the easiest and quickest way to migrate data from a database
> >designed probably in the 1980s. But perhaps they're more popular
> >than I imagine. I really don't know.
> 
> The question for me is not how frequently arrays are used, but how
> frequently arrays with non-default start indices are used. Note that the
> ARRAY constructor in Postgres doesn't even have a way to set a different
> start index (as far as I know).

Yes, they're usually created by something like:

  update some_table set array_column[index_other_than_1] = value...

I've never really thought of them as having a non-default start
index. I've always just thought of the '[#:#]={}' notation as a
Postgres-specific "compression" format which needed to be
"decompressed" when fetched but that seems not to be the case
(and it doesn't explain negative start indexes).
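
For reference, here's a minimal way to produce one of these (a
sketch using PyGreSQL's classic pg.DB interface; the database,
table and column names are made up):

  import pg

  db = pg.DB(dbname='test')
  db.query("create temp table t (a int[])")
  db.query("insert into t (a) values ('{}')")
  # assigning to an index other than 1 gives the array a
  # non-default lower bound
  db.query("update t set a[0] = 42")
  # psql shows the stored value as [0:0]={42}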

> >Optional is completely fine. If the C version can be made to
> >optionally insert the None values, that would be awesome but I
> >was surprised that the speed difference between that and my
> >parser in Python didn't make a huge difference to the overall
> >speed of the tests. But I wasn't testing array parsing so maybe
> >that's not too surprising.
> >
> >Actually, I just wrote a test to select an array of 26 ints
> >(with a default start index) 10000 times and the pygresql-5
> >version with the C array parser took 9s and the pygresql-4
> >version with my Python array parser took 4s. That can't be
> >right. There must be some other differences that explain it.
> >The parsing of the array is probably only a small part of the
> >overall database query.
> 
> If there's really a performance degradation I'd like to fix that. Maybe you
> can create a small reproducible test case.
> 
> Note that you can test the array parser in isolation:
> http://www.pygresql.org/contents/pg/module.html#cast-array-record-fast-parsers-for-arrays-and-records
> 
> It also parses multidimensional arrays and it should be pretty fast.

I suspect that the speed difference is not because of the array
parser. There must be some other reason, so I've attached the test
program I used rather than one that just tests the parser in
isolation. The array parser in the attachment is slow and only
handles 1-dimensional arrays. I think it's safe to say that the
C parser would be much faster.
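
In case it's useful, an isolated timing test like the one you
suggest would look something like this (assuming pg.cast_array
behaves as documented at the link above):

  import timeit
  import pg

  # the same 26-int array as in my test, as a Postgres literal
  literal = '{%s}' % ','.join(str(i) for i in range(26))

  def parse():
      return pg.cast_array(literal, int)

  # parse it 10000 times, as in my test, but with no database
  # round trips involved
  print(timeit.timeit(parse, number=10000))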

> Maybe the method should create a subclass of list that corresponds to a
> one-dimensional Postgres array. It should behave like a normal list, but have
> settable (and changeable) start and end indices (if you change one, you
> also change the other) and return None for out of range indices. That way
> you could even properly convert an array like '[-2:-2]={1}'.
> Multidimensional arrays could be created by nesting these.
> 
> I'm thinking of something along these lines:
> 
> class PgList(list):
> 
>     def __init__(self, value, lower=1):
>         list.__init__(self, value)
>         self._lower = lower
> 
>     @property
>     def lower(self):
>         return self._lower if self else None
> 
>     @property
>     def upper(self):
>         return self._lower + len(self) - 1 if self else None
> 
>     def __getitem__(self, key):
>         if self and self._lower <= key < self._lower + len(self):
>             return list.__getitem__(self, key - self._lower)
> 
> We could also support mutability and slicing, but that will quickly become
> complicated.

I don't want things to get too complicated.
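
That said, if I'm reading your PgList sketch right, it already
handles the '[-2:-2]={1}' example, which is reassuring:

  a = PgList([1], lower=-2)
  print(a.lower, a.upper)  # -2 -2
  print(a[-2])             # 1
  print(a[1])              # None (out of range)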

> Would it help you if we return these kind of lists?

I'm not sure. I'd be happy for it to return None for indexes
between 0 and the "real" start, but I wouldn't want it to return
None for indexes past the end, even though that would mimic
Postgres behaviour. I'd rather it behaved like a Python list
with enough None values inserted at the beginning to make the
indexes match (although off by one, of course, since Python
counts from 0). In other words, I'd want len(a) in Python to
return the same value as array_upper(a, 1) in Postgres. But it
sounds like that's too tacky, which is fair enough. Just because
it's what I want doesn't mean it's what anyone else would want.

If inserting None values into an ordinary Python list is not an
option, my next thought was that the client could somehow
request non-default behaviour when fetching arrays: if the start
index is 1, a Python list is returned (as is the case now), but
if the start index is not 1, a 2-tuple is returned instead,
containing the Postgres start index as one item and the list
that would normally be returned as the other item.

That way, the information about the start index isn't discarded,
but PyGreSQL doesn't need to do very much with it. Then, when
fetching a Postgres array, I can detect the arrival of a tuple
instead of a list and replace it with the None-padded list that
I want, while other clients could do something else with it that
doesn't waste memory.
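
On my side, replacing the tuple would then just be something
like this (a sketch; pad_array is a made-up helper name, and
negative start indexes would need more thought):

  def pad_array(value):
      """Turn a (start_index, list) 2-tuple back into a plain
      Python list, padded with Nones so that len() matches
      array_upper(a, 1) in Postgres."""
      if isinstance(value, tuple):
          lower, items = value
          return [None] * (lower - 1) + list(items)
      return value  # already a plain list, start index was 1

  # e.g. the array '[3:5]={1,2,3}' arriving as (3, [1, 2, 3]):
  print(pad_array((3, [1, 2, 3])))  # [None, None, 1, 2, 3]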

As long as this behaviour had to be explicitly requested, it
wouldn't affect anyone who currently relies on the existing
behaviour.

You could also accept such tuples when sending an array to
Postgres to create arrays with non-standard start indexes.
That may not be necessary, but it would make sense.
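
On the way in, a tuple could presumably be turned into the
explicit-bounds literal that Postgres itself accepts as input
(a sketch, ignoring quoting and escaping of the elements):

  def format_array(value):
      """Format a (start_index, list) 2-tuple as a Postgres
      array literal with explicit bounds."""
      lower, items = value
      upper = lower + len(items) - 1
      elements = ','.join(str(item) for item in items)
      return '[%d:%d]={%s}' % (lower, upper, elements)

  print(format_array((0, [42, 43])))  # [0:1]={42,43}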

> -- Chris

cheers,
raf
