Talin wrote:
> Talin <talin <at> acm.org> writes:
>
>> I decided to take some of the ideas discussed in the string formatting
>> thread, add a few touches of my own, and write up a PEP.
>>
>> http://viridia.org/python/doc/PEP_AdvancedStringFormatting.txt
>>
>> (I've also submitted the PEP via the normal channels.)
>
> No responses? I'm surprised...

You should have copied the PEP into the email... it was a whole click away, thus easier to ignore ;)

The scope of this PEP will be restricted to proposals of built-in string formatting operations (in other words, methods of the built-in string type.) This does not obviate the need for more sophisticated string-manipulation modules in the standard library such as string.Template. In any case, string.Template will not be discussed here, except to say that this proposal will most likely have some overlapping functionality with that module.

s/module/class/

The '%' operator is primarily limited by the fact that it is a binary operator, and therefore can take at most two arguments. One of those arguments is already dedicated to the format string, leaving all other variables to be squeezed into the remaining argument. The current practice is to use either a dictionary or a list as the second argument, but as many people have commented [1], this lacks flexibility. The "all or nothing" approach (meaning that one must choose between only positional arguments, or only named arguments) is felt to be overly constraining.

A dictionary, *tuple*, or a single object. That a tuple is special is sometimes confusing (in most other places lists can be substituted for tuples), and that the single object can be anything but a dictionary or tuple can also be confusing. I've seen nervous people avoid the single object form entirely, often relying on the syntactically unappealing single-item tuple ('' % (x,)).

Brace characters ('curly braces') are used to indicate a replacement field within the string:

    "My name is {0}".format( 'Fred' )

While I've argued in an earlier thread that $var is more conventional, honestly I don't care (except that %(var)s is not very nice). A couple of other people also preferred $var, but I don't know if they have particularly strong opinions either.

The result of this is the string:

    "My name is Fred"

The element within the braces is called a 'field name', and can either be a number, in which case it indicates a positional argument, or a name, in which case it indicates a keyword argument.

Braces can be escaped using a backslash:

    "My name is {0} :-\{\}".format( 'Fred' )

Which would produce:

    "My name is Fred :-{}"

Does } have to be escaped? Or just optionally escaped? I assume this is not a change to string literals, so we're relying on '\{' producing the same thing as '\\{' (which of course it does).

Each field can also specify an optional set of 'conversion specifiers'. Conversion specifiers follow the field name, with a colon (':') character separating the two:

    "My name is {0:8}".format( 'Fred' )

The meaning and syntax of the conversion specifiers depends on the type of object that is being formatted; however, many of the built-in types will recognize a standard set of conversion specifiers.

The conversion specifier consists of a sequence of zero or more characters, each of which can consist of any printable character except for a non-escaped '}'.

The format() method does not attempt to interpret the conversion specifiers in any way; it merely passes all of the characters between the first colon ':' and the matching right brace ('}') to the various underlying formatters (described later.)

Thus you can't nest formatters, e.g., {0:pad(23):xmlquote}, unless the underlying object understands that. Which is probably unlikely.

Potentially : could be special, but \: would pass the ':' to the underlying formatter. Then {x:pad(23):xmlquote} would mean format(format(x, 'pad(23)'), 'xmlquote')

Also, I note that {} doesn't naturally nest in this specification, you have to quote those as well. E.g.: {0:\{a:b\}}. But I don't really see why you'd be inclined to use {} in a formatter anyway ([] and () seem much more likely).
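
To make the field syntax discussed above concrete, here is a rough sketch of how a template could be split into literal text and (field name, conversion specifier) pairs. parse_template is a hypothetical helper, not anything the PEP defines. It assumes a stray unescaped '}' outside a field is passed through as literal text (the PEP doesn't say either way), and it deliberately ignores nested {} fields.

```python
def parse_template(template):
    """Split a template into ('text', s) and ('field', name, spec) tuples.

    Illustrative only: handles backslash-escaped braces, but not nested
    {} fields, and treats a stray '}' in literal text as ordinary text.
    """
    parts = []
    literal = []
    i, n = 0, len(template)
    while i < n:
        ch = template[i]
        if ch == "\\" and i + 1 < n and template[i + 1] in "{}":
            # Escaped brace in literal text: keep the brace, drop the backslash.
            literal.append(template[i + 1])
            i += 2
        elif ch == "{":
            # Collect everything up to the first unescaped '}'.
            field = []
            j = i + 1
            while j < n and template[j] != "}":
                if template[j] == "\\" and j + 1 < n and template[j + 1] in "{}":
                    field.append(template[j + 1])
                    j += 2
                else:
                    field.append(template[j])
                    j += 1
            if j >= n:
                raise ValueError("unterminated replacement field")
            if literal:
                parts.append(("text", "".join(literal)))
                literal = []
            # Only the first colon separates the name from the specifiers;
            # the rest is handed to the underlying formatter untouched.
            name, _, spec = "".join(field).partition(":")
            parts.append(("field", name, spec))
            i = j + 1
        else:
            literal.append(ch)
            i += 1
    if literal:
        parts.append(("text", "".join(literal)))
    return parts
```

With this reading, {0:pad(23):xmlquote} yields the field name "0" and the single specifier string "pad(23):xmlquote", so any piping would have to be done by whatever receives that string.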
Also, some parsing will be required in these formatters, e.g., pad(23) is not parsed in any way and so it's up to the formatter to handle that (and may use different rules than normal Python syntax).

When using the 'fformat' variant, it is possible to omit the field name entirely, and simply include the conversion specifiers:

    "My name is {:pad(23)}"

This syntax is used to send special instructions to the custom formatter object (such as instructing it to insert padding characters up to a given column.) The interpretation of this 'empty' field is entirely up to the custom formatter; no standard interpretation will be defined in this PEP.

If a custom formatter is not being used, then it is an error to omit the field name.

This sounds similar to (?i) in a regex. I can't think of a good use-case, though, since most commands would be localized to a specific formatter or to the formatting object constructor. {:pad(23)} seems like a bad example. {:upper}? Also, it applies globally (or does it?); that is, the formatter can't detect what markers come after the command, and which come before. So {:upper} seems like a bad example.

Standard Conversion Specifiers:

For most built-in types, the conversion specifiers will be the same or similar to the existing conversion specifiers used with the '%' operator. Thus, instead of '%02.2x', you will say '{0:2.2x}'.

There are a few differences however:

- The trailing letter is optional - you don't need to say '2.2d', you can instead just say '2.2'. If the letter is omitted, then the value will be converted into its 'natural' form (that is, the form that it would take if str() or unicode() were called on it) subject to the field length and precision specifiers (if supplied.)
- Variable field width specifiers use a nested version of the {} syntax, allowing the width specifier to be either a positional or keyword argument:

    "{0:{1}.{2}d}".format( a, b, c )

  (Note: It might be easier to parse if these used a different type of delimiter, such as parens - avoiding the need to create a regex that handles the recursive case.)

Ah... that's an interesting way to use nested {}. I like that.

A class that wishes to implement a custom interpretation of its conversion specifiers can implement a __format__ method:

    class AST:
        def __format__( self, specifiers ):
            ...

The 'specifiers' argument will be either a string object or a unicode object, depending on the type of the original format string. The __format__ method should test the type of the specifiers parameter to determine whether to return a string or unicode object. It is the responsibility of the __format__ method to return an object of the proper type.

If nested/piped formatting was allowed (like {0:trun(23):xmlquote}) then it would be good if it could return any object, and str/unicode was called on that object ultimately.

I don't know if it would be considered an abuse of formatting, but maybe a_dict.__format__('x') could return a_dict['x']. Probably not a good idea.

The string.format() will format each field using the following steps:

1) First, see if there is a custom formatter. If one has been supplied, see if it wishes to override the normal formatting for this field. If so, then use the formatter's format() function to convert the field data.

2) Otherwise, see if the value to be formatted has a __format__ method. If it does, then call it.

3) Otherwise, check the internal formatter within string.format that contains knowledge of certain builtin types.

If it is a language change, could all those types have __format__ methods added? Is there any way for the object to accept or decline to do formatting?

4) Otherwise, call str() or unicode() as appropriate.

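The four-step lookup above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the PEP's implementation: format_field, BUILTIN_FORMATTERS and fmt_hook are hypothetical names. The sketch looks for a fmt_hook method rather than __format__ because in current Python every object already has a __format__ method, which would make steps 3 and 4 unreachable. The 'builder' is modeled as a plain list.

```python
def _percent_style(value, specifier, default_letter):
    # Reuse '%' formatting; the trailing type letter is optional in the PEP,
    # so supply a default one when the specifier does not end in a letter.
    if not specifier or not specifier[-1].isalpha():
        specifier += default_letter
    return ("%" + specifier) % value


# Step 3's "internal formatter": a small table keyed by exact built-in type.
BUILTIN_FORMATTERS = {
    int: lambda v, s: _percent_style(v, s, "d"),
    float: lambda v, s: _percent_style(v, s, "g"),
}


def format_field(value, specifier, custom_formatter=None):
    """Format one field, following the lookup order of steps 1 to 4."""
    # Step 1: a custom formatter, if supplied, gets first refusal.
    if custom_formatter is not None:
        builder = []  # stand-in for the PEP's mutable string builder
        if custom_formatter.format(value, specifier, builder):
            return "".join(builder)
    # Step 2: the value's own hook (hypothetical stand-in for __format__).
    hook = getattr(value, "fmt_hook", None)
    if hook is not None:
        return hook(specifier)
    # Step 3: built-in knowledge of certain types.
    builtin = BUILTIN_FORMATTERS.get(type(value))
    if builtin is not None:
        return builtin(value, specifier)
    # Step 4: fall back to str().
    return str(value)
```

A custom formatter here is any object whose format(value, specifier, builder) appends to builder and returns True, or returns False to decline, which is one way to read the accept/decline question raised above.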
Is there a global repr() formatter, like %r? Potentially {0:repr} could be implemented the same way by convention, including in object.__format__?

Custom Formatters:

If the fformat function is used, a custom formatter object must be supplied. The only requirement is that it have a format() method with the following signature:

    def format( self, value, specifier, builder )

This function will be called once for each interpolated value. The parameter values will be:

'value' - the value to be formatted.

'specifier' - a string or unicode object containing the conversion specifiers from the template string.

'builder' - contains the partially constructed string, in whatever form is most efficient - most likely the builder value will be a mutable array or buffer which can be efficiently appended to, and which will eventually be converted into an immutable string.

What's the use case for this argument?

The formatter should examine the type of the object and the specifier string, and decide whether or not it wants to handle this field. If it decides not to, then it should return False to indicate that the default formatting for that field should be used; otherwise, it should call builder.append() (or whatever is the appropriate method) to concatenate the converted value to the end of the string, and return True.

Well, I guess this is the use case, but it feels a bit funny to me. A concrete use case would be appreciated.

Optional Feature: locals() support

This feature is ancillary to the main proposal.
Often when debugging, it may be convenient to simply use locals() as a dictionary argument:

    print "Error in file {file}, line {line}".format( **locals() )

This particular use case could be even more useful if it were possible to specify attributes directly in the format string:

    print "Error in file {parser.file}, line {parser.line}" \
        .format( **locals() )

It is probably not desirable to support execution of arbitrary expressions within string fields - history has shown far too many security holes that leveraged the ability of scripting languages to do this.

A fairly high degree of convenience for relatively small risk can be obtained by supporting the getattr (.) and getitem ([]) operators. While it is certainly possible that these operators can be overloaded in a way that a maliciously written string could exploit their behavior in nasty ways, it is fairly rare that those operators do anything more than retargeting to another container. On the other hand, the ability of a string to execute function calls would be quite dangerous by comparison.

It could be a keyword option to enable this. Though all the keywords are kind of taken. This itself wouldn't be an issue if ** wasn't going to be used so often.

And/or the custom formatter could do the lookup, and so a formatter may or may not do getattr's.

One other thing that could be done to make the debugging case more convenient would be to allow the locals() dict to be omitted entirely. Thus, a format function with no arguments would instead use the current scope as a dictionary argument:

    print "Error in file {p.file}, line {p.line}".format()

An alternative would be to dedicate a special method name, other than 'format' - say, 'interpolate' or 'lformat' - for this behavior.

It breaks some conventions to have a method that looks into the parent frame; but the use cases are very strong for this.

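For the getattr/getitem-only lookup described above, a minimal sketch of resolving a field name like parser.file without evaluating arbitrary expressions might look like this. resolve_field is a hypothetical name, and the rule that an all-digit subscript is an integer index while anything else is a string key is an assumption of the sketch, not something the PEP specifies.

```python
import re

# One lookup step: either ".attr" or "[key]". Nothing else is accepted.
_STEP = re.compile(r"\.([A-Za-z_]\w*)|\[([^\]]+)\]")


def resolve_field(field_name, namespace):
    """Resolve 'a.b[c]'-style names using only getattr and getitem."""
    head = re.match(r"[A-Za-z_]\w*", field_name)
    if head is None:
        raise ValueError("bad field name: %r" % field_name)
    value = namespace[head.group()]
    pos = head.end()
    while pos < len(field_name):
        step = _STEP.match(field_name, pos)
        if step is None:
            # Calls, arithmetic, etc. are rejected rather than evaluated.
            raise ValueError("unsupported syntax in field name: %r" % field_name)
        attr, key = step.groups()
        if attr is not None:
            value = getattr(value, attr)
        elif key.isdigit():
            value = value[int(key)]  # assumption: digits mean an index
        else:
            value = value[key]  # assumption: anything else is a string key
        pos = step.end()
    return value
```

Passing locals() as the namespace gives the debugging case from the PEP text; a name such as parser.file.upper() raises ValueError instead of calling anything.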
Also, if attribute access was a keyword argument potentially that could be turned on by default when using the form that pulled from locals().

Unlike a string prefix, you can't tell that the template string itself was directly in the source code, so this could encourage some potential security holes (though it's not necessarily insecure).

This would require some stack-frame hacking in order that format be able to get access to the scope of the calling function.

Other, more radical proposals include backquoting (`), or a new string prefix character (let's say 'f' for 'format'):

    print f"Error in file {p.file}, line {p.line}"

This prefix character could of course be combined with any of the other existing prefix characters (r, u, etc.)

This does address the security issue. The 'f' reads better than the '$' prefix previously suggested, IMHO. Syntax highlighting can also be applied this way.

(This also has the benefit of allowing Python programmers to quip that they can use "print f debugging", just like C programmers.)

Alternate Syntax

Naturally, one of the most contentious issues is the syntax of the format strings, and in particular the markup conventions used to indicate fields.

Rather than attempting to exhaustively list all of the various proposals, I will cover the ones that are most widely used already.

- Shell variable syntax: $name and $(name) (or in some variants, ${name}). This is probably the oldest convention out there, and is used by Perl and many others. When used without the braces, the length of the variable is determined by lexically scanning until an invalid character is found.

  This scheme is generally used in cases where interpolation is implicit - that is, in environments where any string can contain interpolation variables, and no special substitution function need be invoked.
  In such cases, it is important to prevent the interpolation behavior from occurring accidentally, so the '$' (which is otherwise a relatively uncommonly-used character) is used to signal when the behavior should occur.

  It is my (Talin's) opinion, however, that in cases where the formatting is explicitly invoked, less care needs to be taken to prevent accidental interpolation, in which case a lighter and less unwieldy syntax can be used.

I don't think accidental problems with $ are that big a deal. They don't occur that often, and it's pretty obvious to the eye when they exist. "$lengthin" is pretty clearly not right compared to "${length}in". However, nervous shell programmers often use ${} everywhere, regardless of need, so this is likely to introduce style differences between programmers (some will always use ${}, some will remove {}'s whenever possible).

However, it can be reasonably argued that {} is just as readable and easy to work with as $, and it avoids the need to do any changes as you reformat the string (possibly introducing or removing ambiguity), or add formatting.

- Printf and its cousins ('%'), including variations that add a field index, so that fields can be interpolated out of order.

- Other bracket-only variations. Various MUDs have used brackets (e.g. [name]) to do string interpolation. The Microsoft .Net libraries use braces {}, and a syntax which is very similar to the one in this proposal, although the syntax for conversion specifiers is quite different. [2]

Many languages use {}, including PHP and Ruby, and even $ uses it on some level. The details differ, but {} exists nearly everywhere in some fashion.

- Backquoting. This method has the benefit of minimal syntactical clutter, however it lacks many of the benefits of a function call syntax (such as complex expression arguments, custom formatters, etc.)

It doesn't have any natural nesting, nor any way to immediately see the difference between opening and closing an expression.
It also implies a relation to shell ``, which evaluates the contents. I don't see any benefit to backquotes.

Personally I'm very uncomfortable with using str.format(**args) for all named substitution. It removes the possibility of non-enumerable dictionary-like objects, and requires a dictionary copy whenever an actual dictionary is used.

In the case of positional arguments it is currently an error if you don't use all your positional arguments with %. Would it be an error in this case?

Should the custom formatter get any opportunity to finalize the formatted string (e.g., "here's the finished string, give me what you want to return")?

-- 
Ian Bicking / [EMAIL PROTECTED] / http://blog.ianbicking.org