Re: [PyGreSQL] sql array notation parsing bug in pygresql-5+

raf Mon, 19 Sep 2016 17:30:46 -0700

Christoph Zwerschke wrote:

> Am 15.09.2016 um 04:34 schrieb raf:
> >> The question for me is not how frequently arrays are used, but how
> >> frequently arrays with non-default start indices are used. Note that the
> >> ARRAY constructor in Postgres doesn't even have a way to set a different
> >> start index (as far as I know).
> >
> > Yes, they're usually created by something like:
> >
> >   update some_table set array_column[index_other_than_1] = value...
> 
> Right, if you use an index < 1 or update an empty array, the start index
> would be changed. I still don't think this is something that is frequently
> done, but yes, it's possible.
> 
> > I've never really thought of them as having a non-default start
> > index. I've always just thought of the '[#:#]={}' notation as a
> > Postgres-specific "compression" format which needed to be
> > "decompressed" when fetched but that seems not to be the case
> > (and it doesn't explain negative start indexes).
> 
> Not really compression, since all empty values are returned as NULL.
> Actually the end index is redundant in this notation.
> 
> > I suspect that the speed difference is not because of the array
> > parser. There must some other reason so I've attached the test
> > program I used rather than one that just tests the parser in
> > isolation. The array parser in the attachment is slow and only
> > handles 1-dimensional arrays. I think it's safe to say that the
> > C parser would be much faster.
> 
> I think so. But it may point to a performance degradation elsewhere, so I'd
> like to check this. Did you forget to attach the test?


Oops. Sorry about that. Here it is.

> > I'm not sure. I'd be happy for it to return None for indexes
> > between 0 and the "real" start but I wouldn't want it to return
> > None for indexes past the end even though that would mimic
> > Postgres behaviour. I'd rather it behaved like a Python list
> > with enough None values inserted at the beginning to make the
> > indexes match (although being off-by-one of course). In other
> > words, I'd want len(a) in Python to return the same value as
> > array_upper(a, 1) in Postgres. But it sounds like that's too
> > tacky which is fair enough. Just because it's what I want
> > doesn't mean that's what anyone else would want.
> 
> Exactly. As you see, there are many different ways to implement this and if
> we do it one way, there is always somebody who will complain.
> Therefore it is best to implement only the straightforward, obvious way, but
> allow for customization.
> 
> Again, the problem here is that Postgres arrays are different beasts than
> lists in Python, much more similar to Arrays in JavaScript.
> 
> > If inserting None values into an ordinary Python list is not an
> > option, my next thought was maybe the client can request
> > non-optional behaviour somehow that means that, when fetching
> > arrays, if the start index is 1, a Python list is returned (as
> > is the case now) but if the start index is not 1, then a 2-tuple
> > is returned instead containing the Postgres start index as one
> > item and the list that would normally be returned as the other
> > item.
> 
> I was also thinking along these lines, but the return value should always be
> of the same type, otherwise code will always have to handle both cases,
> making it more complicated, or risk raising errors. Also keep in mind you
> can also have multidimensional arrays with more than one start index.
> 
> I'd rather return a list subtype with the start index as an additional
> attribute. However, the question is then how that list should behave when
> retrieving items. Since this is not obvious, we should make it customizable.
> 
> So the idea is that we provide a function for changing the base Python class
> used for PG arrays, which should be a subclass of list. The array parser
> would then only set an additional "lower" attribute in instances of that
> class, and it's up to the class implementation how this is handled when
> items of the array are returned.
> 
> Would that be reasonable solution?

That sounds fine as long as the client doesn't have to provide the list
subclass (unless they really want to).

Then, when fetching arrays, I can test for the existence of the "lower"
attribute and convert it to what the rest of my code is expecting.

> -- Christoph

cheers,
raf

#!/usr/bin/env python

# Select a 26-element integer array from Postgres 10000 times.
# If we have PyGreSQL-4, it is returned as a string and we parse it in Python.
# If we have PyGreSQL-5, it is returned as an array having been parsed in C.
# On my laptop, with PyGreSQL-4.2.2 it takes 4s and with PyGreSQL-5.0.1 it takes 9s.
# That's probably not due to the array parsing.

import pgdb
import re

def test():
	connection = pgdb.connect(host='XXX', database='XXX', user='XXX', password='XXX')
	cursor = connection.cursor()
	done = 0
	sql = "select cast('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26}' as integer[]) from pay_group where code = '61*ADV'"
	for i in range(10000):
		cursor.execute(sql)
		rows = cursor.fetchall()
		a = rows[0][0]
		if isinstance(rows[0][0], basestring):
			aa = get_int_array(a)
			if not done:
				done = 1
				print('PyGreSQL-4 Python array parser')
		else:
			aa = a
			if not done:
				done = 1
				print('PyGreSQL-5 C array parser')
	print('%r %s' % (aa, type(aa)))

def r(pattern, opts=''):
	'''
	r(pattern, opts='') -> compiled regular expression object

	Like re.compile() but with more compact options. opts is a string containing
	any of the characters 'ilmsxu' each of which corresponds to re.I, re.L etc.
	If the pattern is a unicode object, the option re.U is automatically included.
	usage: pattern = r('([a-e])\d', 'i').
	'''
	# Not a string? Must already be compiled
	if not isinstance(pattern, basestring):
		return pattern
	flags, odict = 0, { 'i': re.I, 'l': re.L, 'm': re.M, 's': re.S, 'x': re.X, 'u': re.U }
	for o in opts:
		flags |= odict[o]
	if isinstance(pattern, unicode):
		flags |= re.U
	recomp = re.compile(pattern, flags)
	return recomp

def m(pattern, text, opts='', pos=0, endpos=None):
	'''
	m(pattern, text, opts='', pos=0, endpos=None) -> re.MatchObject

	Like re.search() but with more compact options. See r().

	usage: m(pattern, text)
	'''
	if endpos is None:
		endpos = len(text)
	return r(pattern, opts).search(text, pos, endpos)

def get_int_array(val):
	if val is None:
		return None
	return [get_int(v) for v in _get_array(val)]

def get_int(val):
	'''Interpret val as a Postgres int value (really a string) and
	return the int equivalent. Called by subclasses in get_data().'''
	# If coming from postgres via python (pgdb module)
	if val is None:
		return None
	if isinstance(val, (int, long)):
		return val
	if isinstance(val, basestring):
		try:
			return int(val)
		except ValueError:
			pass
	raise Exception('Unexpected int syntax: %r' % val)

def _get_array(val):
	'''Parses the string val containing a Postges SQL array literal.
	Returns an array of strings which can be further parsed by other
	methods. If val is already an array, it is returned unaltered.
	Any elements that are the unquoted string "NULL" will be replaced
	by None. Only 1-dimensional arrays are supported.'''
	# If coming from python
	if isinstance(val, list):
		return val
	# If coming from postgres via python (pgdb module 2008)
	if val is None:
		return None
	if val == '{}':
		return []
	# Convert '[17:35]={...}' into complete '{...}' before parsing
	match = m('^\[(\d+):(\d+)\]={', val) if val[0] == '[' else None
	if match is not None:
		itext, ftext = match.group(1), match.group(2)
		i = int(itext)
		val = '{' + ','.join(['null'] * (i - 1)) + ',' + val[len(itext) + len(ftext) + 5:]
	if val[0] == '{' and val[-1] == '}':
		if set(val[1:-1]).isdisjoint(set('{}"\'')):
			# 1D, no quotes, easy
			return [None if v == 'NULL' or v == 'null' else v for v in val[1:-1].split(',')]
		# 1D still, but handle quotes
		start_state, out_state, inq_state, in_state, end_state = 'start', 'out', 'inq', 'in', 'end'
		i, end, state, vals, v, q = 0, len(val), start_state, [], '', ''
		while i < end:
			c = val[i]
			# print('i %d c %s %s q %s v %s vals %s' % (i, c, state, q, v, repr(vals)))
			if state is start_state:
				# Before {
				if c == '{':
					# After {
					state, vals = out_state, []
				else:
					raise Exception('Unexpected array syntax: %s (pos %d)' % (val, i))
			elif state is out_state:
				# Outside elements
				if c == '"':
					# Start quoted string element
					state, q, v = inq_state, c, ''
				elif c == ',':
					# Append previous element
					vals.append(v); v = ''
				elif c == '}':
					# After }
					state = end_state; vals.append(v); v = ''
				else:
					# Start unquoted element
					state, v = in_state, c
			elif state is inq_state:
				# Inside quoted element
				if c == '\\':
					# Backslash-quoted character
					if i == end - 1:
						# No next character!
						raise Exception('Unexpected array syntax: %s (pos %d)' % (val, i))
					v += val[i + 1]; i += 1
				elif c == q:
					# End of quoted string literal
					state = out_state; q = ''
				else:
					# Other character in quoted string literal
					v += c
			elif state is in_state:
				# Inside unquoted element
				if c == ',':
					# End of element
					state = out_state; vals.append(v); v = ''
				elif c == '}':
					# End of last element
					state = end_state; vals.append(v); v = ''
				else:
					# Other character in unquoted element
					v += c
			elif state is end_state:
				# Trailing characters after }
				raise Exception('Unexpected array syntax: %s (pos %d)' % (val, i))
			i += 1
		if state is not end_state:
			raise Exception('Unexpected array syntax: %s (end)' % val)
		return [None if v == 'NULL' or v == 'null' else v for v in vals]
	raise Exception('Unexpected array syntax: %s' % val)

if __name__ == '__main__':
	test()

_______________________________________________
PyGreSQL mailing list
PyGreSQL@vex.net
https://mail.vex.net/mailman/listinfo.cgi/pygresql

Re: [PyGreSQL] sql array notation parsing bug in pygresql-5+

Reply via email to