discussion: can't iterate over lines in `UploadedFile` with `\r\n` or `\r` line endings

Tai Lee Thu, 28 Aug 2008 20:25:43 -0700

http://code.djangoproject.com/ticket/8149


As mentioned in the ticket, `UploadedFile.__iter__` iterates over a
`StringIO` object to yield each line of the uploaded file (including
line endings). Unfortunately the current version of `StringIO` only
treats `\n` as a line ending, but iterating over a file object will
yield lines regardless of the line ending type. I believe that future
versions of `StringIO` will work the same way as file objects do now.
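
To make the difference concrete, here is a small sketch. It assumes
Python 3's `io.StringIO` rather than the `StringIO` module discussed in
this thread, and the sample string is made up for illustration. With the
default `newline` setting only `\n` ends a line; with `newline=""` all
three endings are recognised and left untranslated.

```python
import io

# Hypothetical sample text mixing all three line-ending styles.
data = "one\r\ntwo\rthree\n"

# Default (newline="\n"): only "\n" terminates a line, so the bare "\r"
# does not split "two" from "three".
default_lines = list(io.StringIO(data))

# newline="": "\r\n", "\r" and "\n" all terminate a line, and the
# endings are returned exactly as they appeared in the input.
universal_lines = list(io.StringIO(data, newline=""))

print(default_lines)
print(universal_lines)
```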

The problem with the current implementation is that we can't know what
line ending uploaded text files will use, so anybody trying to iterate
across lines of an uploaded file might get multiple lines or even the
entire chunk or file in a single iteration.

In order for users to reliably iterate through each line in an
uploaded text file, they will need to write their own iterator to
account for each possibility.

A few possible workarounds for users that I can think of are:

1) Save the file to a temporary location on disk and open it as a file
object, then iterate through that. This can be cumbersome because, by
default, files under 2.5 MB are held in memory while larger files are
already written to a temporary location on disk, so the two cases need
different handling.

2) Write a new iterator which includes the chunk/buffer logic of
`UploadedFile.__iter__` but treats `\r\n` and `\r` as line endings as
well as `\n`.

3) Load the entire file into memory and split it (if you don't need to
retain the line endings).
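
Workaround 2 could look roughly like the sketch below. It is not the
code from the ticket: `iter_lines` is a hypothetical helper, and it
assumes the chunks are already text strings (byte chunks would need
decoding first, or a bytes pattern). The one subtle case is a `\r` at
the very end of a chunk, which may be the first half of a `\r\n` pair
split across two chunks, so it is held back until more data arrives.

```python
import re

# Matches any of the three line endings; "\r\n" is tried first so it is
# never split into two separate endings.
_EOL = re.compile(r"\r\n|\r|\n")


def iter_lines(chunks):
    """Yield lines (with their endings) from an iterable of text chunks."""
    pending = ""
    for chunk in chunks:
        pending += chunk
        start = 0
        for match in _EOL.finditer(pending):
            # A lone "\r" at the end of the buffer might be followed by
            # "\n" in the next chunk, so wait before emitting it.
            if match.end() == len(pending) and match.group() == "\r":
                break
            yield pending[start:match.end()]
            start = match.end()
        pending = pending[start:]
    # Whatever is left is a final line without a terminator (or a
    # trailing "\r" that turned out to be a complete ending).
    if pending:
        yield pending


# Hypothetical chunks; note the "\r\n" split across the last two.
chunks = ["one\r\ntwo\rthr", "ee\nfour\r", "\nfive"]
lines = list(iter_lines(chunks))
print(lines)
```

In a view, something like `iter_lines(uploaded_file.chunks())` is the
intended use, provided the chunks are text.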

A few possible solutions on the Django side could be:

1) Subclass `StringIO` and override `readline` to work with other line
endings. This could be useful in other areas of Django, and could be
considered similar to making the decimal module available to Python
2.3, by making future functionality of `StringIO` available now.

2) Rewrite `UploadedFile.__iter__` to not use `StringIO`. Some
alternatives might be to parse the string in a similar way to
`StringIO.readline`, or to use `re.findall` (with a gross pattern like
the one found in the patch attached to the ticket), or to use
`re.split` with a slightly less offensive pattern such as
`re.split(r'(\r\n|\r|\n)', ...)` which would yield lines and line
endings alternately.
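
As a rough illustration of the `re.split` idea, the sketch below joins
each line back up with the ending that follows it. The helper name
`split_keep_endings` and the sample text are made up for this example,
and it works on a whole string rather than on chunks.

```python
import re


def split_keep_endings(text):
    """Split text into lines, keeping each line's original ending."""
    # The capturing group makes re.split return the separators too, so
    # the result alternates: line, ending, line, ending, ..., last line.
    parts = re.split(r"(\r\n|\r|\n)", text)
    lines = []
    for i in range(0, len(parts), 2):
        ending = parts[i + 1] if i + 1 < len(parts) else ""
        line = parts[i] + ending
        # Text that ends with a line ending leaves an empty final piece.
        if line:
            lines.append(line)
    return lines


result = split_keep_endings("first\r\nsecond\rthird\nfourth")
print(result)
```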

Personally, I don't think it's a rare edge case for users to want to
accept text file uploads from unknown sources, and they should be able
to iterate over each line of an uploaded text file without rewriting
that functionality in their own code.


