On Thu, Mar 26, 2020 at 6:34 AM Peter J. Holzer <hjp-pyt...@hjp.at> wrote:
>
> On 2020-03-23 06:00:41 +1100, Chris Angelico wrote:
> > Second point, and related to the above. The regex that defines break
> > points, as found in the source code, is:
> >
> > wordsep_re = re.compile(r'''
> >     ( # any whitespace
> >       %(ws)s+
> >     | # em-dash between words
> >       (?<=%(wp)s) -{2,} (?=\w)
> >     | # word, possibly hyphenated
> >       %(nws)s+? (?:
> >         # hyphenated word
> >           -(?: (?<=%(lt)s{2}-) | (?<=%(lt)s-%(lt)s-))
> >           (?= %(lt)s -? %(lt)s)
> >         | # end of word
> >           (?=%(ws)s|\Z)
> >         | # em-dash
> >           (?<=%(wp)s) (?=-{2,}\w)
> >         )
> >     )''' % {'wp': word_punct, 'lt': letter,
> >             'ws': whitespace, 'nws': nowhitespace},
> >
> > It's built primarily out of small matches with long assertions, eg
> > "match a hyphen, as long as it's preceded by two letters or a letter
> > and a hyphen".
>
> Do you need that fancy logic? Could you only break on white-space
> instead? It won't wrap "tetrabromo-phenolsulfonephthalein" in that case
> but since you mentioned its for a twitter client, most users probably
> won't mind (and those who do mind will probably insist that the
> algorithm should be able to split it into tetrabromo-phenolsulfone-
> phthalein, if that's where the line end is, as it was here purely by
> lucky accident). A regexp for whitespace is pretty simple.
>
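For concreteness, a whitespace-only splitter could look something like
the sketch below. The names whitespace_split_re and split_chunks are
made up for illustration and are not taken from the textwrap source;
the capturing group simply makes re.split keep the whitespace runs as
their own chunks.

```python
import re

# Hypothetical name: one capturing group, so re.split keeps the
# whitespace runs in the result instead of discarding them.
whitespace_split_re = re.compile(r"(\s+)")


def split_chunks(text):
    """Split text into alternating word and whitespace chunks."""
    # Drop the empty strings re.split yields at leading/trailing whitespace.
    return [chunk for chunk in whitespace_split_re.split(text) if chunk]


chunks = split_chunks("tetrabromo-phenolsulfonephthalein is long")
```

With this, the hyphenated word stays in one chunk, so it can only move
to the next line as a whole.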
If I *just* want to break on whitespace, I can do that (set both flags
to False, off it goes). And in fact, that's what I've done so far, and
it's working reasonably well. But I was hoping to be more flexible than
that.

ChrisA
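A minimal sketch of the whitespace-only setup, assuming "both flags"
means the break_long_words and break_on_hyphens options of
textwrap.TextWrapper (the post does not name them). The sample text and
the width of 30 are invented for the example.

```python
import textwrap

# Sample text invented for this example.
text = "The indicator tetrabromo-phenolsulfonephthalein changes colour with pH."

# Assumption: these are the two flags meant above. With both off, the
# wrapper only breaks at whitespace and never splits inside a word.
wrapper = textwrap.TextWrapper(
    width=30,
    break_long_words=False,
    break_on_hyphens=False,
)

lines = wrapper.wrap(text)
for line in lines:
    print(line)
```

The long word ends up on a line of its own and overflows the requested
width rather than being split, which is the trade-off being discussed.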