#13543: Enhancements to the Sitemaps Framework so it works better for large 
sites
--------------------------+-------------------------------------------------
 Reporter:  mlissner      |       Owner:  nobody    
   Status:  new           |   Milestone:            
Component:  Contrib apps  |     Version:  SVN       
 Keywords:                |       Stage:  Unreviewed
Has_patch:  0             |  
--------------------------+-------------------------------------------------
 I've been using the sitemaps framework on my site for a little while now,
 and it occurs to me that, while the current implementation is good, maybe
 there are ways it could be improved. Forgive me if any of this has been
 mentioned elsewhere; I did some looking and didn't find it, so hopefully
 this is the right place for such a discussion.

 The problems I see with the sitemaps are three:
 1. Generating a sitemap takes a LOT of IO, DB and CPU.
 2. There is no way to trigger an update when certain pages change.
 3. There is no good caching mechanism for them.

 I'll explain. The first problem for sitemaps is that they are generally
 large pages with lots of calls to the DB. The max for Google is 50K
 URLs per sitemap, which means at least 50K things pulled from the DB. If
 you have custom date-modified and custom priority fields for each of
 these, that's 150K values from the DB. Bam, one of your DB threads is
 tied up.
 If you have an indexed sitemap set up, it's even possible for a crawler to
 request a bunch of your sitemaps simultaneously, in which case, bam, all
 your DB threads are tied up.
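
 To put rough numbers on that, here is a back-of-the-envelope sketch. The
 50K limit is Google's; counting three values per URL (location,
 date-modified, priority) is my assumption from the paragraph above, and
 the function name and the 120,000-URL site are made up for illustration.

```python
import math

MAX_URLS_PER_SITEMAP = 50_000  # Google's per-sitemap limit


def sitemap_cost(total_urls, values_per_url=3):
    """Return (number of sitemap pages, values read to build the largest page).

    values_per_url=3 assumes location, date-modified and priority each
    come from the DB, which is an assumption, not a measurement.
    """
    pages = math.ceil(total_urls / MAX_URLS_PER_SITEMAP)
    largest_page = min(total_urls, MAX_URLS_PER_SITEMAP) * values_per_url
    return pages, largest_page


pages, largest_page = sitemap_cost(120_000)
print(pages, largest_page)  # 3 150000
```

 If a crawler fetches every page of the index at once, that cost is paid
 once per page, all at the same time.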

 The second problem is for sites that create content that doesn't change
 very often. As an owner of such a site, my sitemaps don't ever change
 aside from the one that is the last page in the index, which is where the
 new content is listed. All the old sitemaps almost never change. The
 result of this is that they almost never need to be regenerated.
 Occasionally they do, but very rarely. This is the case for lots of types
 of websites:
  - blogs
  - news sites (ahem, lawrence.com)
  - e-commerce, where new products come, but old ones are largely static
  - reference sites (like mine)

 The third problem I mentioned above is that there is no good caching
 mechanism for the sitemaps. They can be cached by one of the caching
 mechanisms listed in the caching documentation, but since they can be
 quite big, and since they often don't change, using a third of your RAM-
 based cache for your sitemaps is not a great option. Since there's no way
 to choose a different cache backend for a different page on the site, this
 becomes challenging.

 So... I'm not exactly brimming with ideas on how to change these things,
 but I thought I would mention them formally here and see what discussion
 ensued. My only two ideas both center on a caching mechanism of some kind:
  - assuming that the database or the filesystem is the best type of cache
 for these (since they're large and mostly static), a method to easily set
 the cache backend for sitemaps would be incredibly useful.
  - having a system of triggers for sitemap regeneration would also be
 amazing, so that rather than actually generating the sitemap whenever it
 is pinged, instead it could be generated only when there is a change. (I
 suppose this could be configured in the views that create new content, but
 that seems a little hackish.)
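
 To make the two ideas concrete, here is a minimal sketch of what I have
 in mind: a filesystem-backed cache for rendered sitemap pages, plus an
 explicit "trigger" that throws a page away when its content changes. It
 is plain Python, not a patch against the framework; SitemapFileCache,
 render_page and the file naming are all hypothetical names I made up.

```python
import os
import tempfile


class SitemapFileCache:
    """Keep each rendered sitemap page as a file; rebuild only on demand."""

    def __init__(self, directory, render_page):
        self.directory = directory
        # Hypothetical callable: page number -> sitemap XML string.
        self.render_page = render_page
        os.makedirs(directory, exist_ok=True)

    def _path(self, page):
        return os.path.join(self.directory, "sitemap-%d.xml" % page)

    def get(self, page):
        """Serve the cached file if present, otherwise render and store it."""
        path = self._path(page)
        if not os.path.exists(path):
            with open(path, "w", encoding="utf-8") as f:
                f.write(self.render_page(page))
        with open(path, encoding="utf-8") as f:
            return f.read()

    def invalidate(self, page):
        """The 'trigger': call this when content listed on a page changes."""
        try:
            os.remove(self._path(page))
        except FileNotFoundError:
            pass  # nothing cached yet, so nothing to drop


# Usage: count how often the expensive render actually runs.
render_calls = []


def fake_render(page):
    render_calls.append(page)
    return "<urlset><!-- page %d --></urlset>" % page


cache = SitemapFileCache(tempfile.mkdtemp(), fake_render)
cache.get(1)
cache.get(1)         # second request is served from disk
cache.invalidate(1)  # content on page 1 changed
cache.get(1)         # rendered again
print(len(render_calls))  # 2
```

 Rather than calling invalidate() from the views, it could presumably be
 hooked to a model signal such as post_save, so that saving new content
 drops only the last page of the index. I haven't tried that.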

 I'm curious what people's thoughts on this are, and perhaps what their
 solutions are. The current sitemap framework is great in its simplicity,
 but for real sitemaps on big sites I don't think it works all that well.
 Maybe I'm missing something.

-- 
Ticket URL: <http://code.djangoproject.com/ticket/13543>
Django <http://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.

