Hey Jeff,
  I haven't tried to solve this problem in practice, so this amounts
to thinking out loud; maybe it is helpful, maybe not. ;)

  So each chunk is limited to 50k urls and 10 MB uncompressed.  In
other words, you've got two constraints to consider.  In very
simplistic testing, I can compress about 50k urls that are right
around 10 MB of generated sitemap data down to around 4 MB.  I'd say it
depends pretty heavily on your url schema how many you'll be able to
safely fit in one sitemap.  As I recall, there is no ordering to your
sitemap, right?  I've got no clue how this might impact SEO type
stuff.  I'm assuming this is an issue because you've got well over 50k
urls; presumably a full "sitemap index" file is sufficient to list
2,500,000,000 urls (i.e. 50k * 50k).
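
  To make the two limits concrete, here is a rough Python sketch of
packing urls into sitemaps so that neither the url count nor the
uncompressed size is exceeded.  The names (chunk_urls, entry) and the
bare-bones <url><loc> entry format are my own assumptions for
illustration; a real sitemap entry may carry more fields and so take
more bytes per url.

```python
from xml.sax.saxutils import escape

MAX_URLS = 50_000               # per-sitemap url limit
MAX_BYTES = 10 * 1024 * 1024    # 10 MB uncompressed, taken as 10 MiB here
INDEX_CAPACITY = MAX_URLS * MAX_URLS  # 50k sitemaps * 50k urls each

HEADER = ('<?xml version="1.0" encoding="UTF-8"?>\n'
          '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
FOOTER = '</urlset>\n'


def entry(url):
    """Render one minimal sitemap entry (assumed format: <loc> only)."""
    return "<url><loc>%s</loc></url>\n" % escape(url)


def chunk_urls(urls, max_urls=MAX_URLS, max_bytes=MAX_BYTES):
    """Yield lists of urls; each list fits in one sitemap under both limits."""
    overhead = len(HEADER.encode("utf-8")) + len(FOOTER.encode("utf-8"))
    chunk, size = [], overhead
    for url in urls:
        n = len(entry(url).encode("utf-8"))
        # Start a new sitemap when either constraint would be violated.
        if chunk and (len(chunk) >= max_urls or size + n > max_bytes):
            yield chunk
            chunk, size = [], overhead
        chunk.append(url)
        size += n
    if chunk:
        yield chunk
```

  With short urls the 50k count is the limit you hit first; with long
urls the byte limit kicks in earlier, which is why the url schema
matters.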

  If you want to compute this thing on-demand, maybe you could use the
datastore stats to get an idea of how many entities you have
(presumably this is the driver), then figure out how many sitemaps
you'll need.  Then, here's the iffy part, perhaps you could use
something similar to how mapreduce works to shard the key-space into
roughly equal shards.  Basically you'd start fetching keys ordered by
__scatter__, but you could need a *lot* of keys to make this work
right.  And I mean a lot, probably several hundred k worth of keys.
Then split those up such that you have roughly 50k entities per shard.
You would need to compute the numbers based on the total count.  This
is just sounding messy, hard, and possibly unreliable.  Precomputing
is sounding better to me.  ;)  If you know something about the
keyspace, you could probably do better.
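
  For what it's worth, the splitting step might look something like
the sketch below.  It makes no datastore calls: sample_keys stands in
for whatever keys you fetched ordered by __scatter__, and total_count
for the number from the datastore stats.  The function names and the
headroom factor (aiming below 50k per shard because the shards are
only roughly equal) are assumptions of mine, not anything the
mapreduce library prescribes.

```python
def shards_needed(total_count, per_sitemap=50_000, headroom=0.8):
    """Number of shards so each holds about headroom * per_sitemap entities."""
    target = int(per_sitemap * headroom)
    return -(-total_count // target)  # ceiling division


def split_points(sample_keys, num_shards):
    """Pick up to num_shards - 1 boundary keys from a sample of keys.

    The sample is sorted and cut at evenly spaced positions; shard i then
    covers keys in [bounds[i - 1], bounds[i]).
    """
    keys = sorted(sample_keys)
    if num_shards < 2 or not keys:
        return []
    bounds = [keys[i * len(keys) // num_shards] for i in range(1, num_shards)]
    # A small sample can produce repeated boundaries; drop the duplicates.
    return sorted(set(bounds))
```

  The fewer sampled keys you have per shard, the less even the shards
will be, which is the part that makes me nervous about doing this
on demand.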

  How do you get the stuff you want to include in the sitemap?  Is it
fairly static, or always coming in, or is it periodically added in
very large batches?

  If it is pretty static, you can just regenerate it periodically.
Getting the total count is easy enough, then you'll just need to
allocate urls to sitemap files.  You could do this linearly, or break
it up in some way.  Possibly in a way similar to map-reduce.  Just split
the key-space into shards, then within each shard start
allocating to sitemaps.  If you're really wanting to maximize urls /
sitemap, you could probably come up with a way to combine the
left-overs from each shard into one (or more) full sitemaps.
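
  Here is a small sketch of the left-over idea, assuming each shard
has already been turned into a list of urls.  The allocate name is
made up, and for brevity it only looks at the url count, not the
size limit.

```python
def allocate(shards, per_sitemap=50_000):
    """Cut each shard into full sitemaps and pool the remainders.

    shards is a list of url lists.  Returns a list of sitemaps (url lists);
    every sitemap is full except possibly the very last one.
    """
    sitemaps, leftovers = [], []
    for urls in shards:
        full = len(urls) - len(urls) % per_sitemap
        for i in range(0, full, per_sitemap):
            sitemaps.append(urls[i:i + per_sitemap])
        leftovers.extend(urls[full:])  # partial tail of this shard
    # Pack the pooled tails into as few sitemaps as possible.
    for i in range(0, len(leftovers), per_sitemap):
        sitemaps.append(leftovers[i:i + per_sitemap])
    return sitemaps
```

  The trade-off is that the pooled sitemaps no longer map to a single
shard, so regenerating one shard means regenerating those too.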

  If you're periodically getting the data in large batches, do it in
chunks.  Keep track of your last partially-full sitemap and start
adding to it until it is full.  Once it is full, write it off to the
blobstore and open a new sitemap to start filling.  Keep the current
active sitemap in the datastore so you can add new urls to it, and
serve the others from the blobstore.
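
  Roughly, the bookkeeping for that could look like the sketch below.
It is plain Python with no App Engine calls: the store dict is a
hypothetical stand-in for the blobstore, the active list stands in
for the partially-full sitemap you'd keep in the datastore, and the
SitemapWriter name is invented for the example.

```python
class SitemapWriter:
    """Fill one active sitemap at a time; write it out when it is full."""

    def __init__(self, store, max_urls=50_000):
        self.store = store        # stand-in for the blobstore (dict-like)
        self.max_urls = max_urls
        self.active = []          # stand-in for the active sitemap entity
        self.next_id = 0

    def add_batch(self, urls):
        """Append a batch of urls, flushing each time the sitemap fills up."""
        for url in urls:
            self.active.append(url)
            if len(self.active) >= self.max_urls:
                self._flush()

    def _flush(self):
        # Persist the full sitemap and open a fresh one.
        self.store["sitemap-%d" % self.next_id] = list(self.active)
        self.next_id += 1
        self.active = []
```

  Each batch just picks up where the previous one stopped, so only
the active sitemap is ever rewritten.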


Just some random thoughts...



Robert



On Thu, Feb 23, 2012 at 08:30, Jeff Schnitzer <[email protected]> wrote:
> The sitemaps protocol, like most things on the internet, seems to be
> slightly retarded.
>
> Ok, the rule for sitemaps is "no more than 50k items".  What do you do when
> you have more?  You need to break it down into 50k chunks and reference them
> from a sitemap index... but how do you get 50k chunks?
>
> The obvious answer is:  Take the max id, divide by 50k, and that's the
> number of sitemaps in your index.  The first sitemap is 0-49,999, the second
> is 50,000-99,999, etc.  That works as long as your entities have simple
> numeric ids.  What happens when they don't, either because they have
> ancestors or because they have string keys?
>
> This would be a lot easier if the last line in a sitemap could be a pointer
> to another sitemap, allowing them to chain... but no, the internet was
> designed with the idea that your website is a bunch of files on a disk.
>  Cursors shmursors.
>
> What do you do?  Precalculate your maps into blobs?
>
> Jeff
>
> --
> You received this message because you are subscribed to the Google Groups
> "Google App Engine" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/google-appengine?hl=en.
