My apologies. I made a confusing thinko in the announcement about the ArchCom office hour in 2.5 hours. I wrote: "schema change for page content language" (last week's RFC discussion)
...when I meant: "image and oldimage tables" (what we'd like to revisit this week) Corrected version below (highlighted with "***") Rob On Wed, Aug 31, 2016 at 12:20 AM, Rob Lanphier <ro...@wikimedia.org> wrote: > Hi everyone, > > For this week's office hour, we'd like to discuss [T589: ***RfC: image and > oldimage tables***][1] Timo brought this up on the list a > couple of weeks ago, and didn't get much of a response. (But thank you > Jaime for responding!) We discussed this one fairly recently ([July > 13][2]). This area of our system is still in need of simplification > and optimization. > > The discussion is scheduled for 2016-08-31 UTC: > Time: Wednesday 21 UTC (2pm PDT, 23 CEST) > Place: #wikimedia-office > Phab event: [E266][3] > [ArchCom/Status][4] > > Rob > > [1]: <https://phabricator.wikimedia.org/T589> > [3]: <https://phabricator.wikimedia.org/E228> July 13 meeting > [3]: <https://phabricator.wikimedia.org/E266> Upcoming meeting > [4]: <https://www.mediawiki.org/wiki/Architecture_committee/Status> > ---------- Forwarded message ---------- > From: Krinkle <krinklem...@gmail.com> > Date: Wed, Aug 10, 2016 at 1:54 PM > Subject: [Wikitech-l] Schema migration for 'image' and 'oldimage' tables > To: Wikimedia developers <wikitech-l@lists.wikimedia.org> > > > TL;DR: Participate on T589 and help decide what the upcoming schema change > should entail, and how we'll migrate existing data. > > Hey all, > > Couple weeks ago we dedicated an IRC office hour to > https://phabricator.wikimedia.org/T589 (RFC: image and oldimage tables). > > Updated draft at: > https://www.mediawiki.org/wiki/Requests_for_comment/image_and_oldimage_tables > > We clarified scope and purpose of this particular RFC. Other issues are > still important but considered orthogonal, and to be dealt with parallelly > (or at a later time). > > Revised problem statement: > > 1. File revisions should have unique identifiers (better than "current file > title + upload timestamp". (Subject to race conditions, hard to > index/query, etc.) > 2. Uploading new file revisions must not involve rows moving across tables, > or rows being replaced. > > Participants agreed with the revised problem statement, it makes sense not > to merely add primary keys to the existing tables ("Proposal 1" on the RFC > draft), as that wouldn't adequately solve the Problem 2. > > The second proposal was to separate information about image revision from > the image entity itself. Similar to the page/revisions tables. This was > generally accepted as a good idea, but details are still to be determined. > > The general idea is that all revision-specific information (except for a > pointer to the current revision) would no longer live in the 'image' table. > Instead, information about all (for both current and past revisions) would > live in the same table (instead of being moved around from one table to > another when it's no longer the current one). > > Details at: > https://www.mediawiki.org/wiki/Requests_for_comment/image_and_oldimage_tables#2._Separate_entity_and_versioning > > Some open questions I'd like to see discussed on Phabricator (or here on > wikitech): > > 1. Which fields do we keep in the 'image' table (img_id, img_name, > img_latest, anything else?). > > All fields currently being queried from both tables, will probably only > stay in the image revision table. But there are a few fields that we > intentionally only want to query about current versions. For example > 'img_sha1'. For duplicate-detection, we need to only consider the latest > revisions. Doing this by keeping img_sha1 means uploading a new revision > will involve updating two fields instead of one (img_latest and img_sha1). > This isn't unprecedented as we do this for page as well > (WikiPage::updateRevisionOn; page_latest, page_touched, page_is_redirect, > page_len). > > Are there other fields we need to keep besides img_sha1? Or should we can > solve the img_sha1 use case in a different manner? > > 2. img_metadata > > This field is a blob of serialised PHP (typically representing the Exif > data of an image). > > Tim (correct me if I got it wrong) mentioned we could potentially make > migration easier by changing img_metadata to be stored in a separate table > and change the img_metadata field (in the image revision table) to instead > be a pointer to a primary key. > > This could potentially be done separately later, but if it helps migration, > we should consider doing it now. > > How will this interact with file deletion? Will it be difficult to garbage > collect this? Do we need to? (We don't seem to do it for the 'text' table / > external store; is it worth moving this an external store?) > > 3. Migration > > If we rename both tables (image/oldimage -> file/filerevision), we'd have > the ability to run migration in the background without interfering with the > live site, and without requiring a long read-only period and/or duplicate > and additional code complexity to be developed. > > Is there a way we can do the migration without creating two new tables? > Using the oldimage table as import destination for current rows isn't > straight forward as existing scan queries would need to skip the current > rows somehow while in the midst of this migration. Seems possible, but is > it worth the complexity? (We'd need extra code that knows about that > migration field, and how long do we keep that code? Also complicates > migration for third-parties using update.php). > > Is creating the new tables separately viable for the scale of Wikimedia > Commons? (and dropping the old ones once finished). Is this a concern from > a DBA perspective with regards to storage space? (We'd temporarily need > about twice the space for these tables). So far I understood that it > wouldn't be a problem per se, but that there are also other options we can > explore for Wikimedia. For example we could use a separate set of slaves > and alter those while depooled (essentially using entirely separate set of > db slaves instead of a separate table within each slave). > > Do we create the new table(s) separately and switch over once it's caught > up? This would require doing multiple passes as we depool slaves one by one > (we've done that before at Wikimedia). Switch-over could be done by > migrating before the software upgrade, with a very short read-only period > after the last pass is finished. It wouldn't require maintaining multiple > code paths, which is attractive. > > Other ideas? > > -- Timo > _______________________________________________ > Wikitech-l mailing list > Wikitech-l@lists.wikimedia.org > https://lists.wikimedia.org/mailman/listinfo/wikitech-l _______________________________________________ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l