Hey Jeff, I haven't tried to solve this problem in practice, so this amounts to thinking out loud; maybe it is helpful, maybe not. ;)
So each chunk is limited to 50k URLs and 10 MB uncompressed. In other words, you've got two constraints to consider. In very simplistic testing, about 50k URLs, which came to right around 10 MB of generated sitemap data, compressed down to around 4 MB. I'd say how many you can safely fit in one sitemap depends pretty heavily on your URL schema.

As I recall, there is no ordering to your sitemap, right? I've got no clue how this might impact SEO type stuff.

I'm assuming this is an issue because you've got well over 50k URLs; presumably a full "sitemap index" file is sufficient to list 2,500,000,000 URLs (i.e. 50k * 50k).

If you want to compute this thing on demand, maybe you could use the datastore stats to get an idea of how many entities you have (presumably this is the driver), then figure out how many sitemaps you'll need. Then, here's the iffy part: perhaps you could use something similar to how mapreduce works to split the key-space into roughly equal shards. Basically you'd start fetching keys ordered by __scatter__, but you could need a *lot* of keys to make this work right. And I mean a lot, probably several hundred thousand keys. Then split those up such that you have roughly 50k entities per shard. You would need to compute the numbers based on the total count. This is just sounding messy, hard, and possibly unreliable. Precomputing is sounding better to me. ;) If you know something about the keyspace, you could probably do better.

How do you get the stuff you want to include in the sitemap? Is it fairly static, always coming in, or periodically added in very large batches?

If it is pretty static, you can just regenerate it periodically. Getting the total count is easy enough; then you'll just need to allocate URLs to sitemap files. You could do this linearly, or break it up in some way, possibly in a way similar to mapreduce: split the key-space into shards, then within each shard start allocating to sitemaps.
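To make the "allocate URLs to sitemap files" step a bit more concrete, here is a rough Python sketch of the linear version. It starts a new sitemap whenever either limit (50k URLs or 10 MB uncompressed) would be exceeded. The function names (`pack_sitemaps`, `render`, `url_entry`) are made up for illustration, each entry carries only a `<loc>` element, and 10 MB is taken as 10 * 1024 * 1024 bytes; treat all of that as assumptions rather than a tested implementation.

```python
from xml.sax.saxutils import escape

MAX_URLS = 50_000                 # URL limit per sitemap, as discussed above
MAX_BYTES = 10 * 1024 * 1024      # assumed meaning of "10 MB uncompressed"

HEADER = (b'<?xml version="1.0" encoding="UTF-8"?>\n'
          b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
FOOTER = b'</urlset>\n'


def url_entry(url):
    """Render one <url> element (only <loc>, for simplicity) as UTF-8 bytes."""
    return ("<url><loc>%s</loc></url>\n" % escape(url)).encode("utf-8")


def pack_sitemaps(urls, max_urls=MAX_URLS, max_bytes=MAX_BYTES):
    """Yield lists of URLs, each small enough to fit in a single sitemap.

    A new batch is started as soon as adding the next URL would break
    either the URL-count limit or the uncompressed-size limit.
    """
    overhead = len(HEADER) + len(FOOTER)
    batch, size = [], overhead
    for url in urls:
        entry_len = len(url_entry(url))
        if batch and (len(batch) >= max_urls or size + entry_len > max_bytes):
            yield batch
            batch, size = [], overhead
        batch.append(url)
        size += entry_len
    if batch:
        yield batch  # the last batch is usually only partly full


def render(batch):
    """Build the uncompressed sitemap document for one batch."""
    return HEADER + b"".join(url_entry(u) for u in batch) + FOOTER


if __name__ == "__main__":
    # Hypothetical URLs, just to exercise the packing logic.
    urls = ("http://example.com/item/%d" % i for i in range(120_000))
    for n, batch in enumerate(pack_sitemaps(urls)):
        print(n, len(batch), len(render(batch)))
```

With short URLs like these the 50k count is the limit that bites first; with long URLs the byte limit would kick in earlier, which is why the sketch tracks both.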
If you really want to maximize URLs per sitemap, you could probably come up with a way to combine the leftovers from each shard into one (or more) full sitemaps.

If you're periodically getting the data in large batches, do it in chunks. Keep track of your last partially full sitemap and keep adding to it until it is full. Once it is full, write it out to the blobstore and open a new sitemap to start filling. Keep the current active sitemap in the datastore so you can add new URLs to it, and serve the others from the blobstore.

Just some random thoughts...

Robert

On Thu, Feb 23, 2012 at 08:30, Jeff Schnitzer <[email protected]> wrote:
> The sitemaps protocol, like most things on the internet, seems to be
> slightly retarded.
>
> Ok, the rule for sitemaps is "no more than 50k items". What do you do when
> you have more? You need to break it down into 50k chunks and reference them
> from a sitemap index... but how do you get 50k chunks?
>
> The obvious answer is: Take the max id, divide by 50k, and that's the
> number of sitemaps in your index. The first sitemap is 0-49,999, the second
> is 50,000-99,999, etc. That works as long as your entities have simple
> numeric ids. What happens when they don't, either because they have
> ancestors or because they have string keys?
>
> This would be a lot easier if the last line in a sitemap could be a pointer
> to another sitemap, allowing them to chain... but no, the internet was
> designed with the idea that your website is a bunch of files on a disk.
> Cursors shmursors.
>
> What do you do? Precalculate your maps into blobs?
>
> Jeff
>
> --
> You received this message because you are subscribed to the Google Groups
> "Google App Engine" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/google-appengine?hl=en.
