I have a large number of pages stored in my app's database that are
normally accessible only via the search engine. This is a perfectly
ordinary use case for a sitemap, so that is what I've built.

I have a few million of these pages, so I dynamically generate a
sitemap index that in turn points to individual sitemaps. This was
easy, because I could just take the id of the last page and divide it
by 50,000 (the maximum number of URLs the sitemap protocol allows per
sitemap) to determine how many sitemaps to link to.
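
The arithmetic above can be sketched in plain Ruby. This is only an
illustration, not the app's actual code: the constant, the method
names, and the /sitemaps/N.xml URL layout are all made up. It assumes
sitemap n covers the ids from 50,000 * n up to (but not including)
50,000 * (n + 1), which is why the count is the integer quotient plus
one rather than the bare quotient.

```ruby
# Maximum number of URLs the sitemap protocol allows per sitemap file.
URLS_PER_SITEMAP = 50_000

# Id range covered by sitemap number n (zero-based, end excluded).
def sitemap_id_range(n, per_file = URLS_PER_SITEMAP)
  (n * per_file)...((n + 1) * per_file)
end

# Number of sitemaps needed so that last_id falls inside the final range.
def sitemap_count(last_id, per_file = URLS_PER_SITEMAP)
  return 0 if last_id.nil? || last_id < 1
  last_id / per_file + 1
end

# Builds the sitemap index. base_url and the /sitemaps/N.xml layout are
# placeholders for whatever the real app uses.
def sitemap_index_xml(last_id, base_url)
  entries = (0...sitemap_count(last_id)).map do |n|
    "  <sitemap><loc>#{base_url}/sitemaps/#{n}.xml</loc></sitemap>"
  end
  [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    *entries,
    "</sitemapindex>"
  ].join("\n") + "\n"
end
```

Under these assumptions a last id of 3,000,000 yields 61 sitemaps,
because id 3,000,000 itself lands in the range for sitemap number 60.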

Then, because each page has a URL slug that has to be read from the
database, each sitemap is generated by calling find_each with the id
restricted to a range of 50,000 ids, starting at 50,000 multiplied by
the sitemap number.
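
A rough sketch of the rendering step, kept independent of Rails so it
can run on its own. The method name and base_url are hypothetical. In
the app, the slugs would come from the ranged find_each call described
above, for example something like
Page.where(id: range).find_each { |page| ... }, where Page is an
assumed model name.

```ruby
# Renders one <urlset> document from any enumerable of slugs.
# base_url is a placeholder for the site's real host.
def sitemap_xml(slugs, base_url)
  lines = [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
  ]
  slugs.each do |slug|
    # encode(xml: :text) escapes &, < and > so the URL is valid XML text.
    loc = "#{base_url}/#{slug}".encode(xml: :text)
    lines << "  <url><loc>#{loc}</loc></url>"
  end
  lines << "</urlset>"
  lines.join("\n") + "\n"
end

# Stand-in data; the real slugs would be read from the database.
puts sitemap_xml(["first-page", "second-page"], "https://example.com")
```

Since only the slug is needed, selecting just the id and slug columns
(or plucking the slugs) instead of loading full records may reduce the
load of each query, though that has not been measured here.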

While a simple idea in theory, this produces an enormous amount of
database load in practice. The search engine crawlers would sometimes
spike the database for up to 6 seconds with a single query, even with
find_each.

I have since reduced the 50,000 down to 10,000, with the result that I
now seem to have constant database load from the search engines, at
about 500ms to 1s.

This doesn't strike me as a very good situation. My first instinct is
to cache the sitemaps somehow, perhaps by pre-generating them and
refreshing them once a week or so, to minimize the strain on the
database. However, this seems fairly difficult on Heroku, since the
web servers have no persistent local storage that could hold the
generated (somewhat large) sitemap files.

I suppose I could rig something up to upload the weekly sitemap cache
to Amazon S3 or something like that, but I've never heard of anyone
storing their sitemap off site before. Would it cause any kind of
problem for the search engines to have to leave a domain to fetch that
domain's sitemap? Overall, that seems like a somewhat awkward
arrangement as well.
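
For concreteness, the weekly pre-generation idea might look roughly
like the sketch below. Everything in it is an assumption: MemoryStore
is a hypothetical in-memory stand-in for the remote storage (an S3
bucket wrapper would need the same put and get methods), the key
layout is made up, and the block passed to regenerate_sitemaps
represents whatever builds the XML for one sitemap number.

```ruby
require "zlib"

# Hypothetical stand-in for remote storage such as an S3 bucket.
# Any object that responds to put(key, body) and get(key) would do.
class MemoryStore
  def initialize
    @objects = {}
  end

  def put(key, body)
    @objects[key] = body
  end

  def get(key)
    @objects.fetch(key)
  end

  def keys
    @objects.keys
  end
end

# Builds every sitemap once and stores the gzipped XML, so crawler
# requests can be answered from the store instead of the database.
# The block receives a sitemap number and returns that sitemap's XML.
def regenerate_sitemaps(store, count)
  count.times do |n|
    xml = yield(n)
    store.put("sitemaps/#{n}.xml.gz", Zlib.gzip(xml))
  end
end

# Example run with placeholder XML instead of real database content.
store = MemoryStore.new
regenerate_sitemaps(store, 3) { |n| "<urlset><!-- sitemap #{n} --></urlset>" }
```

A scheduled job would call regenerate_sitemaps once a week, and the
web process would then only read the stored files.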

What do people think? How would you handle this?
