[issue24601] bytes and unicode splitlines() methods differ on what is a line break

Gregory P. Smith Thu, 09 Jul 2015 19:18:51 -0700

New submission from Gregory P. Smith:

For bytes, \v (0x0b) is not considered a line break.  For unicode, it is.
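
A minimal way to see the difference on Python 3 (the sample values and 
variable names here are only illustrative):

```python
# The same content as bytes and as str, with a \v (0x0b) in the middle.
sample_bytes = b"one\x0btwo\nthree"
sample_text = sample_bytes.decode("ascii")

bytes_lines = sample_bytes.splitlines()  # \v is left alone, only \n splits
text_lines = sample_text.splitlines()    # \v is treated as a line break

print(bytes_lines)
print(text_lines)
```

The bytes result keeps b'one\x0btwo' as a single line, while the str result 
has 'one' and 'two' as separate lines.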


This traces back to the Objects/stringlib/ code, where unicode defers to the 
decision made by Objects/unicodeobject.c's ascii_linebreak table, which contains 
7 line breaks in the 0..127 character range:

static unsigned char ascii_linebreak[] = {
    0, 0, 0, 0, 0, 0, 0, 0,
/*         0x000A, * LINE FEED */
/*         0x000B, * LINE TABULATION */
/*         0x000C, * FORM FEED */
/*         0x000D, * CARRIAGE RETURN */
    0, 0, 1, 1, 1, 1, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0,
/*         0x001C, * FILE SEPARATOR */
/*         0x001D, * GROUP SEPARATOR */
/*         0x001E, * RECORD SEPARATOR */
    0, 0, 0, 0, 1, 1, 1, 0,


Whereas Objects/stringlib/stringdefs.h, used by bytes, only considers \r and \n.
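
The two sets can be checked from Python without reading the C source.  This 
is only a sketch; ascii_breaks is a hypothetical helper written for this 
message, not something in the stdlib:

```python
def ascii_breaks(make_sample):
    """Return the code points in 0..127 that splitlines() treats as a break.

    make_sample(cp) must build "a" + <char cp> + "b" in the type under test;
    the character is a line break exactly when the sample splits in two.
    """
    return [cp for cp in range(128) if len(make_sample(cp).splitlines()) == 2]

str_breaks = ascii_breaks(lambda cp: "a" + chr(cp) + "b")
bytes_breaks = ascii_breaks(lambda cp: b"a" + bytes([cp]) + b"b")

print("str:  ", [hex(cp) for cp in str_breaks])
print("bytes:", [hex(cp) for cp in bytes_breaks])
```

On Python 3 the str list has the 7 entries from the table above, and the 
bytes list has only 0xa and 0xd.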

I think these should be consistent.  But making this change likely breaks 
existing code in weird ways.

This does come up when porting from 2 to 3: a str ('') value containing one of 
those other characters was not split by splitlines in 2.x but is split by 
splitlines in 3.x.
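
For ported code that relied on the 2.x str behavior, one possible workaround 
is to split the UTF-8 encoded form and decode the pieces.  This is a sketch 
rather than a proposed fix; splitlines_ascii is a made-up name, and it 
assumes that going through UTF-8 is acceptable for the data in question:

```python
def splitlines_ascii(text, keepends=False):
    """Split a str only on \\n, \\r and \\r\\n, as bytes.splitlines() does."""
    # In UTF-8 the bytes 0x0A and 0x0D only ever encode "\n" and "\r",
    # so splitting the encoded form cannot cut a multi-byte character.
    # "surrogatepass" lets lone surrogates round-trip instead of raising.
    raw = text.encode("utf-8", "surrogatepass")
    return [part.decode("utf-8", "surrogatepass")
            for part in raw.splitlines(keepends)]

ported = splitlines_ascii("a\x0bb\nc\r\nd\x1ce")
ported_keepends = splitlines_ascii("a\x0bb\nc\r\nd\x1ce", keepends=True)
print(ported)
print(ported_keepends)
```

With this helper \v, \f and \x1c..\x1e stay inside the line, as does 
anything above 127 that str.splitlines() would also split on.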

----------
messages: 246538
nosy: gregory.p.smith
priority: normal
severity: normal
status: open
title: bytes and unicode splitlines() methods differ on what is a line break

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue24601>
_______________________________________