#36897: Optimize repercent_broken_unicode() performance
-------------------------------------+-------------------------------------
Reporter: Tarek Nakkouch             | Type: Cleanup/optimization
Status: new                          | Component: Utilities
Version: 6.0                         | Severity: Normal
Keywords:                            | Triage Stage: Unreviewed
Has patch: 0                         | Needs documentation: 0
Needs tests: 0                       | Patch needs improvement: 0
Easy pickings: 0                     | UI/UX: 0
-------------------------------------+-------------------------------------
The `repercent_broken_unicode()` function in `django/utils/encoding.py`
performs poorly on URLs that contain many consecutive invalid UTF-8 bytes.
The bottleneck is twofold: a `UnicodeDecodeError` is raised and caught for
each invalid byte sequence, and each iteration creates an intermediate
bytes object through concatenation:
{{{#!python
changed_parts = []
while True:
    try:
        path.decode()
    except UnicodeDecodeError as e:
        repercent = quote(path[e.start : e.end], safe=b"/#%[]=:;$&()+,!?*@'~")
        # creates new bytes object
        changed_parts.append(path[: e.start] + repercent.encode())
        path = path[e.end :]
    else:
        return b"".join(changed_parts) + path
}}}
== Suggested optimization ==
The simplest solution is to append the byte parts to the list separately
instead of concatenating them with the `+` operator, which avoids creating
intermediate bytes objects. This provides a ~40% improvement while keeping
the same exception-based approach:
{{{#!python
changed_parts = []
while True:
    try:
        path.decode()
    except UnicodeDecodeError as e:
        repercent = quote(path[e.start : e.end], safe=b"/#%[]=:;$&()+,!?*@'~")
        changed_parts.append(path[: e.start])
        changed_parts.append(repercent.encode())
        path = path[e.end :]
    else:
        changed_parts.append(path)
        return b"".join(changed_parts)
}}}
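One way to sanity-check the list-append variant is to wrap both loops in
standalone functions and compare their output and timing. The sketch below
is only an illustration: the function names `repercent_concat` and
`repercent_append`, the `SAFE` constant, and the sample input are
hypothetical and not part of Django, and the timings it prints will vary by
machine and Python version.

```python
import timeit
from urllib.parse import quote

# Same safe set as in the ticket's snippets; the constant name is hypothetical.
SAFE = b"/#%[]=:;$&()+,!?*@'~"


def repercent_concat(path):
    """Loop from the ticket: concatenates prefix and quoted bytes."""
    changed_parts = []
    while True:
        try:
            path.decode()
        except UnicodeDecodeError as e:
            repercent = quote(path[e.start : e.end], safe=SAFE)
            # Builds an intermediate bytes object on every invalid sequence.
            changed_parts.append(path[: e.start] + repercent.encode())
            path = path[e.end :]
        else:
            return b"".join(changed_parts) + path


def repercent_append(path):
    """Suggested variant: appends the two parts separately."""
    changed_parts = []
    while True:
        try:
            path.decode()
        except UnicodeDecodeError as e:
            repercent = quote(path[e.start : e.end], safe=SAFE)
            changed_parts.append(path[: e.start])
            changed_parts.append(repercent.encode())
            path = path[e.end :]
        else:
            changed_parts.append(path)
            return b"".join(changed_parts)


if __name__ == "__main__":
    # Hypothetical worst case: a long run of invalid bytes.
    sample = b"/path/" + b"\xff" * 2000
    assert repercent_concat(sample) == repercent_append(sample)
    for fn in (repercent_concat, repercent_append):
        seconds = timeit.timeit(lambda: fn(sample), number=20)
        print(f"{fn.__name__}: {seconds:.4f}s")
```

Both functions should return identical bytes for any input; only the number
of temporary objects created per invalid sequence differs.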
Alternatively, a manual UTF-8 validation approach could eliminate the
exception overhead entirely by scanning byte-by-byte and checking UTF-8
patterns to identify invalid sequences without raising exceptions. This
would reduce processing time by ~80%, though the implementation is more
complex.
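To make the exception-free idea concrete, here is a minimal sketch of a
manual scan. It is not a proposed patch: `repercent_manual`,
`_sequence_rule`, and `repercent_reference` are hypothetical names, the
byte ranges follow the well-formed UTF-8 table from the Unicode Standard,
and no performance claim is made for this pure-Python version. It relies on
the assumption that every byte inside an invalid sequence is >= 0x80 and
therefore always percent-encoded by `quote()`, so it emits `%XX` directly.

```python
from urllib.parse import quote

SAFE = b"/#%[]=:;$&()+,!?*@'~"  # safe set from the ticket


def repercent_reference(path):
    """Exception-based loop from the ticket, kept only as a baseline."""
    parts = []
    while True:
        try:
            path.decode()
        except UnicodeDecodeError as e:
            parts.append(path[: e.start])
            parts.append(quote(path[e.start : e.end], safe=SAFE).encode())
            path = path[e.end :]
        else:
            parts.append(path)
            return b"".join(parts)


def _sequence_rule(lead):
    """Return (length, low, high) for a UTF-8 lead byte, or None if invalid.

    low/high bound the second byte; later bytes must be in 0x80..0xBF.
    """
    if 0xC2 <= lead <= 0xDF:
        return 2, 0x80, 0xBF
    if lead == 0xE0:
        return 3, 0xA0, 0xBF  # rejects overlong 3-byte forms
    if lead == 0xED:
        return 3, 0x80, 0x9F  # rejects UTF-16 surrogates
    if 0xE1 <= lead <= 0xEF:
        return 3, 0x80, 0xBF
    if lead == 0xF0:
        return 4, 0x90, 0xBF  # rejects overlong 4-byte forms
    if 0xF1 <= lead <= 0xF3:
        return 4, 0x80, 0xBF
    if lead == 0xF4:
        return 4, 0x80, 0x8F  # rejects code points above U+10FFFF
    return None  # 0x80..0xC1 and 0xF5..0xFF never start a sequence


def repercent_manual(path):
    """Percent-encode invalid UTF-8 bytes without raising exceptions."""
    out = bytearray()
    i, n = 0, len(path)
    while i < n:
        lead = path[i]
        if lead < 0x80:  # ASCII passes through unchanged
            out.append(lead)
            i += 1
            continue
        rule = _sequence_rule(lead)
        matched = 1  # bytes accepted so far, starting with the lead byte
        length = 0
        if rule is not None:
            length, low, high = rule
            while matched < length and i + matched < n:
                lo, hi = (low, high) if matched == 1 else (0x80, 0xBF)
                if not lo <= path[i + matched] <= hi:
                    break
                matched += 1
        if matched == length:
            out += path[i : i + length]  # complete, well-formed sequence
        else:
            for b in path[i : i + matched]:
                out += b"%%%02X" % b  # same uppercase form quote() emits
        i += matched
    return bytes(out)
```

Comparing `repercent_manual()` against `repercent_reference()` on inputs
such as truncated sequences, overlong forms, and encoded surrogates is a
reasonable first check before measuring any speed difference.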
--
Ticket URL: <https://code.djangoproject.com/ticket/36897>