#36897: Optimize repercent_broken_unicode() performance
-------------------------------------+-------------------------------------
     Reporter:  Tarek Nakkouch       |                     Type:
                                     |  Cleanup/optimization
       Status:  new                  |                Component:  Utilities
      Version:  6.0                  |                 Severity:  Normal
     Keywords:                       |             Triage Stage:
                                     |  Unreviewed
    Has patch:  0                    |      Needs documentation:  0
  Needs tests:  0                    |  Patch needs improvement:  0
Easy pickings:  0                    |                    UI/UX:  0
-------------------------------------+-------------------------------------
 The `repercent_broken_unicode()` function in `django/utils/encoding.py`
 is slow when processing URLs with many consecutive invalid UTF-8 bytes.
 Each invalid byte raises and catches a `UnicodeDecodeError`, and each
 iteration builds an intermediate bytes object through concatenation:

 {{{#!python
 changed_parts = []
 while True:
     try:
         path.decode()
     except UnicodeDecodeError as e:
          repercent = quote(path[e.start : e.end], safe=b"/#%[]=:;$&()+,!?*@'~")
         # creates new bytes object
         changed_parts.append(path[: e.start] + repercent.encode())
         path = path[e.end :]
     else:
         return b"".join(changed_parts) + path
 }}}

 == Suggested optimization ==

 The simplest solution is to append byte parts separately to the list
 instead of concatenating them with the `+` operator, avoiding creation of
 intermediate bytes objects. This provides ~40% improvement while keeping
 the same exception-based approach:

 {{{#!python
 changed_parts = []
 while True:
     try:
         path.decode()
     except UnicodeDecodeError as e:
          repercent = quote(path[e.start : e.end], safe=b"/#%[]=:;$&()+,!?*@'~")
         changed_parts.append(path[: e.start])
         changed_parts.append(repercent.encode())
         path = path[e.end :]
     else:
         changed_parts.append(path)
         return b"".join(changed_parts)
 }}}
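
 For anyone who wants to reproduce the comparison, here is a minimal
 self-contained sketch that wraps both loops in functions. The names
 `repercent_concat`, `repercent_append` and `SAFE` are hypothetical and
 exist only for this illustration; the sketch covers just the loop quoted
 in this ticket and omits anything else the real function may do. Timings
 will vary by machine and input, so none are quoted here.

```python
import timeit
from urllib.parse import quote

# Same safe set as in the snippets above; the constant name is hypothetical.
SAFE = b"/#%[]=:;$&()+,!?*@'~"


def repercent_concat(path):
    """Loop as quoted in the ticket: one `+` concatenation per invalid run."""
    changed_parts = []
    while True:
        try:
            path.decode()
        except UnicodeDecodeError as e:
            repercent = quote(path[e.start : e.end], safe=SAFE)
            # Builds a new intermediate bytes object on every iteration.
            changed_parts.append(path[: e.start] + repercent.encode())
            path = path[e.end :]
        else:
            return b"".join(changed_parts) + path


def repercent_append(path):
    """Suggested variant: append the pieces separately and join once."""
    changed_parts = []
    while True:
        try:
            path.decode()
        except UnicodeDecodeError as e:
            repercent = quote(path[e.start : e.end], safe=SAFE)
            changed_parts.append(path[: e.start])
            changed_parts.append(repercent.encode())
            path = path[e.end :]
        else:
            changed_parts.append(path)
            return b"".join(changed_parts)


if __name__ == "__main__":
    # Worst case described in the ticket: many consecutive invalid bytes.
    payload = b"/path/" + b"\xff" * 2000
    assert repercent_concat(payload) == repercent_append(payload)
    for fn in (repercent_concat, repercent_append):
        elapsed = timeit.timeit(lambda: fn(payload), number=20)
        print(f"{fn.__name__}: {elapsed:.4f}s")
```

 Both functions should return identical bytes for any input, since the
 only difference is how the pieces are collected before the final join.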

 Alternatively, a manual UTF-8 validation approach could eliminate the
 exception overhead entirely: scan the input byte by byte and check it
 against the UTF-8 encoding patterns to identify invalid sequences without
 raising exceptions. This would reduce processing time by ~80%, though the
 implementation is more complex.
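
 A rough sketch of what such a manual scan might look like follows. It is
 not a proposed patch: the function names are hypothetical, the byte
 ranges follow the well-formed UTF-8 sequences defined by the Unicode
 standard, and no timing is claimed for it. It percent-encodes invalid
 bytes one at a time, which should produce the same output as the
 exception-based loop because every byte inside an invalid run is
 >= 0x80 and therefore always escaped.

```python
def _valid_sequence_length(data, i):
    """Return the length of the well-formed UTF-8 sequence at data[i], or 0."""
    b0 = data[i]
    if b0 < 0x80:
        return 1  # ASCII
    # Pick the sequence length and the allowed range of the second byte.
    if 0xC2 <= b0 <= 0xDF:
        length, lo, hi = 2, 0x80, 0xBF
    elif b0 == 0xE0:
        length, lo, hi = 3, 0xA0, 0xBF  # excludes overlong forms
    elif b0 == 0xED:
        length, lo, hi = 3, 0x80, 0x9F  # excludes surrogates
    elif 0xE1 <= b0 <= 0xEF:
        length, lo, hi = 3, 0x80, 0xBF
    elif b0 == 0xF0:
        length, lo, hi = 4, 0x90, 0xBF  # excludes overlong forms
    elif 0xF1 <= b0 <= 0xF3:
        length, lo, hi = 4, 0x80, 0xBF
    elif b0 == 0xF4:
        length, lo, hi = 4, 0x80, 0x8F  # excludes code points > U+10FFFF
    else:
        return 0  # 0x80..0xC1 and 0xF5..0xFF never start a sequence
    if i + length > len(data):
        return 0  # truncated sequence at end of input
    if not lo <= data[i + 1] <= hi:
        return 0
    for j in range(i + 2, i + length):
        if not 0x80 <= data[j] <= 0xBF:
            return 0
    return length


def repercent_manual(path):
    """Percent-encode invalid UTF-8 bytes in `path` without raising exceptions."""
    parts = []
    clean_start = 0  # start of the current run of valid bytes
    i = 0
    n = len(path)
    while i < n:
        length = _valid_sequence_length(path, i)
        if length:
            i += length
            continue
        # Flush the valid run, then escape the single invalid byte.
        parts.append(path[clean_start:i])
        parts.append(b"%%%02X" % path[i])
        i += 1
        clean_start = i
    parts.append(path[clean_start:])
    return b"".join(parts)


if __name__ == "__main__":
    print(repercent_manual(b"/caf\xc3\xa9/\xff\xfe"))
```

 Whether a pure-Python scan like this actually beats the exception-based
 loop would need to be measured on representative inputs.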
-- 
Ticket URL: <https://code.djangoproject.com/ticket/36897>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.
