Hello again!
While hunting down a bug today I found some code that was slowing down
gPodder's loading time :D
The original bug was that I was still seeing numeric entities like &#8217; in
the episode descriptions. This is because the Python codepoint2name dict
doesn't include all of the possible Unicode characters. So I replaced the old
code with a regex that converts the codepoint numbers directly to Unicode
characters. In a quick benchmark, the old code accounted for 3.28 sec of load
time, whereas the new code takes < 0.1 sec :)
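If you want to get a feel for the difference, something along these lines
reproduces it (Python 2; the sample string, the loop count and the old_way /
new_way names are made up for illustration -- the 3.28 sec figure above came
from real feed data in gPodder):

    # Rough timing sketch (Python 2). Sample string and loop count are arbitrary.
    import re
    import time
    import htmlentitydefs

    sample = 'It&#8217;s a &#8220;quoted&#8221; description. ' * 200

    def old_way(text):
        # old approach: walk the whole codepoint2name dict for every description
        d = htmlentitydefs.codepoint2name
        for key in d.keys():
            text = text.replace('&#' + str(key) + ';',
                                '&' + unicode(d[key], 'iso-8859-1') + ';')
        return text

    unicode_ent_re = re.compile(r'&#(\d{2,4});')

    def new_way(text):
        # new approach: convert the codepoint number straight to a character
        return unicode_ent_re.sub(lambda m: unichr(int(m.group(1))), text)

    for func in (old_way, new_way):
        start = time.time()
        for _ in xrange(20):
            func(sample)
        print '%s: %.3f sec' % (func.__name__, time.time() - start)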
Here are some examples of feeds that include these kinds of entities:
- http://feeds.feedburner.com/doctorow_podcast
- http://feeds.feedburner.com/nlo
Now what's really cool is that when I launch gPodder, it's ready to go in less
than 2 seconds! (on an Intel E6300)
Let me know what you guys think,
nick
--- gpodder-r615/src/gpodder/util.py 2008-03-19 23:15:38.000000000 -0400
+++ gpodder-r615-dev/src/gpodder/util.py 2008-03-19 23:28:32.000000000 -0400
@@ -309,14 +309,13 @@
# strips html from a string (fix for <description> tags containing html)
rexp = re.compile( "<[^>]*>")
stripstr = rexp.sub( '', html)
- # replaces numeric entities with entity names
- dict = htmlentitydefs.codepoint2name
- for key in dict.keys():
- stripstr = stripstr.replace( '&#'+str(key)+';', '&'+unicode( dict[key], 'iso-8859-1')+';')
+ # replace unicode entities with the characters they represent
+ unicode_ent_re = re.compile( '&#(\d{2,4});' )
+ stripstr = unicode_ent_re.sub( lambda x: unichr(int(x.group(1))), stripstr )
# strips html entities
dict = htmlentitydefs.entitydefs
- for key in dict.keys():
- stripstr = stripstr.replace( '&'+unicode(key,'iso-8859-1')+';', unicode(dict[key], 'iso-8859-1'))
+ html_ent_re = re.compile( '&(.{2,8});' )
+ stripstr = html_ent_re.sub( lambda x: unicode(dict.get(x.group(1),''), 'iso-8859-1'), stripstr )
return stripstr
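P.S. For anyone who wants to see what the patched substitutions do to a
description, here is a small standalone version of just the two entity passes
(Python 2; strip_entities and the sample string are only for demonstration --
the real code lives in util.py and also strips the HTML tags first):

    # Standalone illustration of the two entity substitutions in the patch.
    import re
    import htmlentitydefs

    def strip_entities(stripstr):
        # numeric entities: convert the codepoint number directly to a character
        stripstr = re.sub(r'&#(\d{2,4});',
                          lambda m: unichr(int(m.group(1))), stripstr)
        # named entities: look them up in entitydefs, drop unknown ones
        ents = htmlentitydefs.entitydefs
        stripstr = re.sub(r'&(.{2,8});',
                          lambda m: unicode(ents.get(m.group(1), ''), 'iso-8859-1'),
                          stripstr)
        return stripstr

    print repr(strip_entities(u'Cory&#8217;s &quot;free&quot; &amp; open podcast'))
    # u'Cory\u2019s "free" & open podcast'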