On 12.09.2016 at 08:22, raf wrote:
>> The "[2:2]=" prefix only says that Postgres should use 2 as both the lower
>> and upper bound of that array, but that's only relevant for accessing the
>> array internally in Postgres and does not translate to Python, since Python
>> lists always start at index 0.

> That's a surprisingly unhelpful point of view. If a Postgres
> array has a value in item n then the Python equivalent of that
> array must have the same value in item n-1. To think otherwise
> is madness. It's like saying that the position in an array is
> meaningless. It really isn't.

Let's take a step back here.

The problem is that the Postgres array type doesn't directly translate to the Python list type: Postgres arrays carry additional information, namely a start index that can differ from the default of 1, and they also behave differently, e.g. by returning null for an out-of-bounds index where a Python list raises an IndexError.
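
To make the difference concrete, here is a minimal sketch (the connection parameters are illustrative, adjust them for your setup):

  import pgdb

  cur = pgdb.connect(database='test').cursor()

  # Postgres quietly yields null for a subscript outside the array bounds:
  cur.execute("select ('[2:2]={1}'::integer[])[1]")
  print(cur.fetchone()[0])  # None -- index 1 is below the lower bound 2

  # A Python list raises instead of returning a null value:
  try:
      [1][1]
  except IndexError as e:
      print(e)  # list index out of range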

There are different ways to deal with that:

1) Simply ignore the start index and return only the array items.

This is how it's currently done and what I think is the simplest and most straightforward solution. It's also how psycopg2 and pg8000 do it, and I don't see many people complaining about it. It's not perfect, but it's not "madness" either (please avoid such words in Internet discussions, since they are very easy to take personally, and unnecessary flame wars will arise).
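
For illustration, this is what you currently get back (a sketch, using the same illustrative connection as above):

  cur.execute("select '[2:2]={1,2,3}'::integer[]")
  print(cur.fetchone()[0])  # [1, 2, 3] -- the start index 2 is dropped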

2) Compensate for the start index by padding the front of the list with the corresponding number of None values.

This is your suggestion. I agree that it's an interesting idea, but I also see some problems (a small sketch of the idea follows the list below):

a) PyGres would become incompatible with other drivers (see above)

b) If you go that route, you could arguably also insert a None item in front of arrays with the default start index of 1. In other words, it's not immediately clear whether the default offset of 1 should be compensated with None items as well, or only non-standard offsets. It's also not immediately clear whether None values at the start of a Python list should automatically be translated into a start index when storing the list in Postgres, or whether they should be kept as null values.

c) If you retrieve an array with a large start index, the resulting Python list takes up a lot of memory (unlike in Postgres, which just stores the bounds). If you re-insert that list into Postgres, the database will grow as well. This could become a problem particularly with multi-dimensional arrays. In other words, round trips to the database might not only change the data, but even inflate it, which would not happen with 1).

d) While method 1) destroys info about the start index, method 2) destroys info about the length of the array. Depending on the type of data, this info might be just as relevant as the info about the start index.
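
Here is the small sketch of method 2 mentioned above (pad_array is a hypothetical helper, not part of PyGreSQL):

  def pad_array(items, lower):
      """Compensate for a Postgres start index by padding with None values."""
      return [None] * (lower - 1) + items

  pad_array([1], 2)      # [None, 1] for '[2:2]={1}'
  pad_array([1], 10**6)  # a million-item list for a one-element array (see c)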

> Look, even postgres agrees with me:
>
>   select cast('[2:2]={1}' as integer[]) = cast('{1}' as integer[])
>    ?column?
>   ----------
>    f
>   (1 row)
>
> See, they are not the same thing. :-)

Yes, but

select cast('[2:2]={1}' as integer[])
  = cast('{NULL,1}' as integer[])

returns false as well :-) They are also not the same thing, but your method would return them as the same thing.

Another argument against interpreting [2:2]={1} as [None, 1]: you would expect that [1] concatenated with [None, 1] gives [1, None, 1]; however, Postgres concatenates them to [1, 1]. Please try it:

select cast('[1:1]={1}' as integer[])
   || cast('[2:2]={1}' as integer[])

> You say above that the index notation doesn't translate to Python.
> I'm saying that it absolutely has to translate to Python. If not,
> it's indistinguishable from data corruption (just a lot more consistent).

I would not say that. You just need to be aware of the fact that the start index gets lost, and compensate for it manually in Python, e.g. by querying array_lower(x) in addition to x if you know that your array x can have different start indices.
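
For example (a sketch; mytable and its array column x are made-up names):

  import pgdb

  cur = pgdb.connect(database='test').cursor()  # illustrative connection
  cur.execute("select array_lower(x, 1), x from mytable")
  for lower, items in cur.fetchall():
      # array_lower() returns null for empty arrays; the default lower bound is 1
      restored = [None] * ((lower or 1) - 1) + items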

> I can't magically change the selection sql to identify these arrays
> and insert nulls into them on the way out. Or at least I can't anymore.
> With pygresql-4 I would receive the string version and parse it into
> a correct python list. With pygresql-5 I no longer have that option.

You're right. While you can disable the conversion in pg, there is currently no official way to do it in pgdb. This is an oversight.

However, you can use the following monkeypatch:

# make pgdb return array values as their raw string representation:
pgdb.Typecasts.create_array_cast = lambda self, basecast: str

So, for the time being, you can use that or compensate manually in Python by querying array_lower().
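
With the patch in place, arrays come back as strings that you can parse yourself, as with PyGreSQL 4 (a sketch; the connection parameters are illustrative):

  import pgdb

  # apply the patch before opening the connection:
  pgdb.Typecasts.create_array_cast = lambda self, basecast: str

  cur = pgdb.connect(database='test').cursor()
  cur.execute("select '[2:2]={1}'::integer[]")
  print(cur.fetchone()[0])  # '[2:2]={1}' -- the start index is preserved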

> Even if you completely disagree with me, please accept that this
> is incredibly important to me. My database is filled with arrays
> and many are like these. If you don't add some way for me to
> restore the old behaviour or fix the new behaviour, I will not
> be able to upgrade to pygresql-5 ever.

I do not completely disagree, but I hope you understand my objections.
I wouldn't want to make your method the default for the reasons listed above, but maybe we can make it optional (this is a bit difficult because the array syntax parser is written in C for performance reasons). I will certainly add an official way to completely switch off the array conversion in pgdb, or to use a custom array parser.

I will also mention this issue with the start index in the docs.

-- Chris
_______________________________________________
PyGreSQL mailing list
[email protected]
https://mail.vex.net/mailman/listinfo.cgi/pygresql
