Christoph Zwerschke wrote:
> Am 12.09.2016 um 08:22 schrieb raf:
> >>The "[2:2]=" prefix only says that Postgres should use the start and end
> >>index 2 for that index, but that's only for accessing the array internally
> >>in Postgres and does not translate to Python, since Python lists always have
> >>the start index 0.
> >
> >That's a surprisingly unhelpful point of view. If a Postgres
> >array has a value in item n then the python equivalent of that
> >array must have the same value in item n-1. To think otherwise
> >is madness. It's like saying that the position in an array is
> >meaningless. It really isn't.
>
> Let's take a step back here.
>
> The problem is that the Postgres array type doesn't directly translate to
> the Python list type, since Postgres arrays carry additional information,
> namely a start index that can differ from the default (i.e. 1), and they
> show different behavior, e.g. by returning null for an out-of-bounds index
> while Python lists will raise an IndexError.
>
> There are different ways to deal with that:
>
> 1) Simply ignore the start index and return only the array items.
>
> This is how it's currently done and what I think is the simplest and most
> straight-forward solution. This is also how psycopg2 and pg8000 do it, and I
> don't see many people complaining about this. It's not perfect, but it's not
> "madness" either (please avoid such words in Internet discussions, as it's
> very easy to take them personally and be offended, and unnecessary flamewars
> will arise).
I'm sorry for being an idiot. I should know better. I really
didn't mean to offend. It was meant to be more of a misquote,
as in "That way lies madness", i.e. not an accusation but a
warning. :-) Please accept my apologies.
I suspect that if people aren't complaining about it, that might
be because array columns probably aren't a very popular datatype
in SQL databases. The only reason I'm using them is that it
was the easiest and quickest way to migrate data from a database
designed probably in the 1980s. But perhaps they're more popular
than I imagine. I really don't know.
Option 1 may be simple and straight-forward but the loss of
crucial position information means it's not a solution (for my
needs).
> 2) Compensate for the start index by adding the corresponding number of
> "None"s at the front of the list.
>
> This is your suggestion. I agree, that's an interesting idea. But I also see
> some problems:
>
> a) PyGres would become incompatible with other drivers (see above)
I can understand wanting compatibility. But I hope the other
drivers have ways to override their interpretation of array
semantics.
> b) If you go that route, you could arguably also insert a None item in front
> of arrays with the default start index of 1. In other words, it's not
> immediately clear whether the default offset of 1 should be compensated with
> None items as well, or whether only non-standard offsets should be
> compensated. It's also not immediately clear whether None values at the
> start of a Python list should be automatically translated into a start index
> when storing a Python list in Postgres, or whether they should be kept as
> null values.
No, I think that it is clear that the first/lowest/smallest
index into a Postgres array is 1 and the first/lowest/smallest
index into a Python list is 0 (and yes, I know you can have
'[-2:-2]={1}' but that way lies...you know).
> c) If you retrieve an array with a large start index, the list in Python
> takes up a lot of memory (unlike in Postgres). If you re-insert that list
> into Postgres, the database will grow larger as well. This might become a
> particular problem with multi-dimensional arrays. In other words, round-trips
> to the database might not only change but even inflate the data, which would
> not happen with 1).
That's true, but it's necessary to preserve the positions of array
elements. To me, what I think of as the semantic correctness of the
array datatype is far more important than space optimization.
For my use, there would never be more than about 50 nulls at the
start of an array. If it were a billion, that would be a
problem, but that's not likely to happen, certainly not in my
databases.
> d) While method 1) destroys info about the start index, method 2) destroys
> info about the length of the array. Depending on the type of data, this info
> might be just as relevant as the info about the start index.
Ooh, I hadn't thought of that. But really, I think the position
of each item in an array is what matters. Postgres arrays aren't
just a bag of values. They are arrays whose items are accessed
via an index. If I store an array in one column and an index
into that array in another column, and then fetch both of those
columns into Python, I should be able to use the fetched
index - 1 to access the same item in the fetched array in Python
as I would if I used the index to access the item in Postgres.
Those array semantics seem far more important to me than the
positional distance between the explicitly set item with the
lowest index and the explicitly set item with the highest index.
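For example, here's a made-up illustration (not actual data from
my schema) of the index mapping I mean:

    # Postgres: arr = '[2:3]={a,b}', idx = 3, so arr[idx] is 'b'.
    # In Python, if the start index is compensated with Nones:
    arr = [None, 'a', 'b']        # what I'd want back for '[2:3]={a,b}'
    idx = 3                       # fetched from the index column
    assert arr[idx - 1] == 'b'    # same item as arr[idx] in Postgres

    # With option 1 (start index discarded):
    arr = ['a', 'b']
    # arr[idx - 1] is now an IndexError -- the position info is gone.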
But of course, the length may be extremely important to someone
else and I just can't think of a situation where that's the case.
That could just be a limitation of my imagination. :-)
> >Look, even Postgres agrees with me:
> >
> > select cast('[2:2]={1}' as integer[]) = cast('{1}' as integer[])
> > ?column?
> > ----------
> > f
> > (1 row)
> >
> >See, they are not the same thing. :-)
>
> Yes, but
>
> select cast('[2:2]={1}' as integer[])
> = cast('{NULL,1}' as integer[])
>
> returns false as well :-) They are also not the same thing, but would be
> returned as the same thing by your method.
Touché :-)
But I would see them as the same thing in all important ways.
> Another argument against your interpretation of [2:2]={1} as [None, 1]: you
> would expect [1] concatenated with [None, 1] to be [1, None, 1]; however,
> Postgres concatenates them to [1, 1]. Please try it:
>
> select cast('[1:1]={1}' as integer[])
> || cast('[2:2]={1}' as integer[])
Wow! I never would have thought that would happen!
> >You say above that the index notation doesn't translate to Python.
> >I'm saying that it absolutely has to translate to Python. If not,
> >it's indistinguishable from data corruption (just a lot more consistent).
>
> I would not say that. You just need to be aware of the fact that the start
> index gets lost, and compensate for it manually in Python, e.g. by querying
> array_lower(x) in addition to x if you know that your array x can have
> different start indices.
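To spell out what that manual compensation would look like (table
and column names here are invented for illustration), a query like

    select x from t

would have to become something like

    select x, array_lower(x, 1) from t

and then on the Python side:

    x, lower = row             # each fetched row
    if lower is None:          # array_lower() is null for empty arrays
        lower = 1
    x = [None] * (lower - 1) + x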
I really don't want to have to change nearly every single piece
of SQL just to keep the existing behaviour. Being stuck with an old
version (or monkey patching the new version) is the only
practical solution.
What you are suggesting is similar to what I've been doing for
years (except that no SQL needed to change): parsing '[2:2]={1}'
and using the starting index to set up the array the way I need
it. But that was only possible because the start index wasn't
lost. It was right there. Now it's lost because pygresql-5
discards it.
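For the curious, the idea was roughly this (a simplified sketch that
only handles one-dimensional integer arrays; my real parser does
more):

    import re

    def parse_pg_int_array(s):
        """Parse a Postgres integer array literal such as '[2:2]={1}'
        into a Python list, padding with Nones so that Postgres index
        n maps to Python index n - 1."""
        lower = 1
        m = re.match(r'\[(-?\d+):(-?\d+)\]=', s)
        if m:
            lower = int(m.group(1))
            s = s[m.end():]
        body = s.strip()[1:-1]  # strip the surrounding braces
        items = [None if item == 'NULL' else int(item)
                 for item in body.split(',')] if body else []
        return [None] * (lower - 1) + items

    parse_pg_int_array('[2:2]={1}')  # -> [None, 1]
    parse_pg_int_array('{1,2,3}')    # -> [1, 2, 3]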
> >I can't magically change the selection SQL to identify these arrays
> >and insert nulls into them on the way out. Or at least I can't anymore.
> >With pygresql-4 I would receive the string version and parse it into
> >a correct python list. With pygresql-5 I no longer have that option.
>
> You're right. While you can disable the conversion in pg, there is currently
> no official way to do it in pgdb. This is an oversight.
>
> However, you can use the following monkeypatch:
>
> pgdb.Typecasts.create_array_cast = lambda self, basecast: str
>
> So, for the time being, you can use that or compensate manually in Python by
> querying array_lower().
Thanks! That's perfect! Almost. I mean, I suppose it's a shame
that a monkey patch is required, and a documented API for
selecting that behaviour would be better, but all I really care
about is getting my arrays back. With this, all my tests pass
again.
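In case it helps anyone else reading the archives, this is roughly
how I'm using it ('mydb' and parse_pg_int_array are my own names,
not part of PyGreSQL):

    import pgdb

    # Disable pgdb's array conversion so array columns come back as
    # their raw string representation, e.g. '[2:2]={1}'.
    pgdb.Typecasts.create_array_cast = lambda self, basecast: str

    con = pgdb.connect(database='mydb')
    cur = con.cursor()
    cur.execute("select cast('[2:2]={1}' as integer[])")
    raw = cur.fetchone()[0]        # '[2:2]={1}' as a plain string
    arr = parse_pg_int_array(raw)  # -> [None, 1] with the parser above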
> >Even if you completely disagree with me, please accept that this
> >is incredibly important to me. My database is filled with arrays
> >and many are like these. If you don't add some way for me to
> >restore the old behaviour or fix the new behaviour, I will not
> >be able to upgrade to pygresql-5 ever.
>
> I do not completely disagree, but I hope you understand my objections.
> I wouldn't want to make your method the default for the reasons listed
> above, but maybe we can make it optional (it's a bit difficult because the
> array syntax parser is written in C for performance reasons), and I will
> certainly add an official way to completely switch off array conversion in
> pgdb, or use a custom array parser.
Optional is completely fine. If the C version can be made to
optionally insert the None values, that would be awesome, but I
was surprised that using it instead of my Python parser didn't
make a huge difference to the overall speed of the tests. But I
wasn't testing array parsing, so maybe that's not too surprising.
Actually, I just wrote a test to select an array of 26 ints
(with a default start index) 10000 times and the pygresql-5
version with the C array parser took 9s and the pygresql-4
version with my Python array parser took 4s. That can't be
right. There must be some other differences that explain it.
The parsing of the array is probably only a small part of the
overall database query.
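The test was essentially just this (simplified; 'mydb' stands in
for the real database and the query approximates what I selected):

    import time
    import pgdb

    con = pgdb.connect(database='mydb')
    cur = con.cursor()
    start = time.time()
    for _ in range(10000):
        cur.execute('select array_agg(i) from generate_series(1, 26) as i')
        cur.fetchone()
    print('%.1fs' % (time.time() - start))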
> I will also mention this issue with the start index in the docs.
>
> -- Chris
Many thanks for your help (and for tolerating my rudeness).
cheers,
raf
_______________________________________________
PyGreSQL mailing list
[email protected]
https://mail.vex.net/mailman/listinfo.cgi/pygresql