Funny you should ask. We had a similar problem at Veoh. I think that this kind of problem is relatively common.
Taking video viewing as the poster child, the meta-data about videos comes in a couple of flavors:

- The title and description and publisher info
- A pointer to the actual video bits
- The view counts
- Rating data
- The history of who viewed the video (for recommendation systems and such)
- Stats about how people play the video (just 4 seconds? all the way through? From the primary interface? From embedded references?)

We can categorize this data on a couple of different axes. One is update rate:

- Some of this data is very rarely updated (the video pointer and publisher info).
- Some is updated more commonly, but still pretty rarely (title and description).
- Some is updated fairly often (ratings).
- And some is updated ALL the time (view counts especially, but view history and view stats as well).

Another categorization is based on how you plan to search the data:

- Title and description and length and publisher and date published (users searching using the search box and advanced search)
- Play history and ratings (recommendation systems doing off-line analysis)
- Not usually searched (encoding, number of audio tracks, size in bytes and so on)

Usually, high volume sites have to store data differently depending on size, change-rate and purpose. Then you abstract different search and storage decisions with an access layer.

For what you are doing, you should put into lucene only those things which are low change rate and which must be searched. You should put high mutation rate data into something like memcache with some persistent back-store. Very large data items such as the video itself should be in an entirely different kind of store (at Veoh we used a very heavily hacked version of danga's mogile).

Your two phase update trick will work reasonably well in the short-term, but if your traffic is growing quickly it won't last very long because the full update will be so nasty.
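To make the access-layer idea a bit more concrete, here is a rough Python sketch. It is only an illustration under stated assumptions, not what we ran at Veoh: the class names, the field list and the flush policy are all hypothetical, and plain dictionaries stand in for the Lucene index, the memcache-style counter store and the large-object store.

```python
# Hypothetical sketch: every name here is illustrative, and in-memory dicts
# stand in for the real search index, cache, back-store and blob store.

# Assumed set of low-change fields that users actually search on.
SEARCHABLE_FIELDS = {"title", "description", "publisher", "length", "date_published"}


class SearchIndex:
    """Stand-in for the search index: low-change, searchable fields only."""

    def __init__(self):
        self.docs = {}
        self.reindex_count = 0  # how many times a document was (re)indexed

    def index(self, video_id, fields):
        self.docs[video_id] = dict(fields)
        self.reindex_count += 1

    def search(self, term):
        # Naive substring match over all indexed field values.
        term = term.lower()
        return sorted(
            vid for vid, doc in self.docs.items()
            if any(term in str(value).lower() for value in doc.values())
        )


class CounterStore:
    """Stand-in for a cache in front of a persistent back-store."""

    def __init__(self, flush_every=3):
        self.cache = {}
        self.backing = {}
        self.flush_every = flush_every
        self._pending = 0

    def incr(self, key, amount=1):
        current = self.cache.get(key, self.backing.get(key, 0))
        self.cache[key] = current + amount
        self._pending += 1
        if self._pending >= self.flush_every:
            self.flush()
        return self.cache[key]

    def get(self, key):
        return self.cache.get(key, self.backing.get(key, 0))

    def flush(self):
        # Write cached counters through to the persistent store in one batch.
        self.backing.update(self.cache)
        self._pending = 0


class VideoAccessLayer:
    """Routes each kind of data to the store that suits its change rate."""

    def __init__(self):
        self.index = SearchIndex()
        self.counters = CounterStore()
        self.blobs = {}  # stand-in for a separate large-object store

    def publish(self, video_id, metadata, video_bytes):
        self.blobs[video_id] = video_bytes
        searchable = {k: v for k, v in metadata.items() if k in SEARCHABLE_FIELDS}
        self.index.index(video_id, searchable)

    def record_view(self, video_id):
        # High mutation rate: never touches the search index.
        return self.counters.incr(f"views:{video_id}")

    def search(self, term):
        # Merge index hits with live counters at read time.
        return [(vid, self.counters.get(f"views:{vid}")) for vid in self.index.search(term)]
```

The point of the sketch is that record_view never causes a reindex: view counts are joined onto the search hits when they are read, so the index only changes when a title or description does.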
On Wed, Apr 1, 2009 at 1:06 AM, sunnyfr <[email protected]> wrote:

>
> Yep but we won't change the system now :(
> Or maybe I can have two kinds of schema ?
> One which is the new video during the day so just new datas and the other
> one by night which update all caracteristic of videos ? full update
> nightly
> and light new update during the day ?
> what do you think ??
> Because the other caracteristics are not that important but used for
> filters, most view, comment ...
>
