Re: [Wikitech-l] 2016W35 ArchCom-RFC meeting: Revisiting image/oldimage (T589)

Rob Lanphier Wed, 31 Aug 2016 11:32:11 -0700

My apologies.  I made a confusing thinko in the announcement about the
ArchCom office hour in 2.5 hours.  I wrote:
"schema change for page content language" (last week's RFC discussion)


...when I meant:
"image and oldimage tables" (what we'd like to revisit this week)

Corrected version below (highlighted with "***")

Rob

On Wed, Aug 31, 2016 at 12:20 AM, Rob Lanphier <ro...@wikimedia.org> wrote:
> Hi everyone,
>
> For this week's office hour, we'd like to discuss [T589: ***RfC: image and 
> oldimage tables***][1]  Timo brought this up on the list a
> couple of weeks ago, and didn't get much of a response. (But thank you
> Jaime for responding!)  We discussed this one fairly recently ([July
> 13][2]).  This area of our system is still in need of simplification
> and optimization.
>
> The discussion is scheduled for 2016-08-31 UTC:
> Time: Wednesday 21 UTC (2pm PDT, 23 CEST)
> Place: #wikimedia-office
> Phab event: [E266][3]
> [ArchCom/Status][4]
>
> Rob
>
> [1]: <https://phabricator.wikimedia.org/T589>
> [3]: <https://phabricator.wikimedia.org/E228> July 13 meeting
> [3]: <https://phabricator.wikimedia.org/E266> Upcoming meeting
> [4]: <https://www.mediawiki.org/wiki/Architecture_committee/Status>
> ---------- Forwarded message ----------
> From: Krinkle <krinklem...@gmail.com>
> Date: Wed, Aug 10, 2016 at 1:54 PM
> Subject: [Wikitech-l] Schema migration for 'image' and 'oldimage' tables
> To: Wikimedia developers <wikitech-l@lists.wikimedia.org>
>
>
> TL;DR: Participate on T589 and help decide what the upcoming schema change
> should entail, and how we'll migrate existing data.
>
> Hey all,
>
> Couple weeks ago we dedicated an IRC office hour to
> https://phabricator.wikimedia.org/T589 (RFC: image and oldimage tables).
>
> Updated draft at:
> https://www.mediawiki.org/wiki/Requests_for_comment/image_and_oldimage_tables
>
> We clarified scope and purpose of this particular RFC. Other issues are
> still important but considered orthogonal, and to be dealt with parallelly
> (or at a later time).
>
> Revised problem statement:
>
> 1. File revisions should have unique identifiers (better than "current file
> title + upload timestamp". (Subject to race conditions, hard to
> index/query, etc.)
> 2. Uploading new file revisions must not involve rows moving across tables,
> or rows being replaced.
>
> Participants agreed with the revised problem statement, it makes sense not
> to merely add primary keys to the existing tables ("Proposal 1" on the RFC
> draft), as that wouldn't adequately solve the Problem 2.
>
> The second proposal was to separate information about image revision from
> the image entity itself. Similar to the page/revisions tables. This was
> generally accepted as a good idea, but details are still to be determined.
>
> The general idea is that all revision-specific information (except for a
> pointer to the current revision) would no longer live in the 'image' table.
> Instead, information about all (for both current and past revisions) would
> live in the same table (instead of being moved around from one table to
> another when it's no longer the current one).
>
> Details at:
> https://www.mediawiki.org/wiki/Requests_for_comment/image_and_oldimage_tables#2._Separate_entity_and_versioning
>
> Some open questions I'd like to see discussed on Phabricator (or here on
> wikitech):
>
> 1. Which fields do we keep in the 'image' table (img_id, img_name,
> img_latest, anything else?).
>
> All fields currently being queried from both tables, will probably only
> stay in the image revision table. But there are a few fields that we
> intentionally only want to query about current versions. For example
> 'img_sha1'. For duplicate-detection, we need to only consider the latest
> revisions. Doing this by keeping img_sha1 means uploading a new revision
> will involve updating two fields instead of one (img_latest and img_sha1).
> This isn't unprecedented as we do this for page as well
> (WikiPage::updateRevisionOn; page_latest, page_touched, page_is_redirect,
> page_len).
>
> Are there other fields we need to keep besides img_sha1? Or should we can
> solve the img_sha1 use case in a different manner?
>
> 2. img_metadata
>
> This field is a blob of serialised PHP (typically representing the Exif
> data of an image).
>
> Tim (correct me if I got it wrong) mentioned we could potentially make
> migration easier by changing img_metadata to be stored in a separate table
> and change the img_metadata field (in the image revision table) to instead
> be a pointer to a primary key.
>
> This could potentially be done separately later, but if it helps migration,
> we should consider doing it now.
>
> How will this interact with file deletion? Will it be difficult to garbage
> collect this? Do we need to? (We don't seem to do it for the 'text' table /
> external store; is it worth moving this an external store?)
>
> 3. Migration
>
> If we rename both tables (image/oldimage -> file/filerevision), we'd have
> the ability to run migration in the background without interfering with the
> live site, and without requiring a long read-only period and/or duplicate
> and additional code complexity to be developed.
>
> Is there a way we can do the migration without creating two new tables?
> Using the oldimage table as import destination for current rows isn't
> straight forward as existing scan queries would need to skip the current
> rows somehow while in the midst of this migration. Seems possible, but is
> it worth the complexity? (We'd need extra code that knows about that
> migration field, and how long do we keep that code? Also complicates
> migration for third-parties using update.php).
>
> Is creating the new tables separately viable for the scale of Wikimedia
> Commons? (and dropping the old ones once finished). Is this a concern from
> a DBA perspective with regards to storage space? (We'd temporarily need
> about twice the space for these tables). So far I understood that it
> wouldn't be a problem per se, but that there are also other options we can
> explore for Wikimedia. For example we could use a separate set of slaves
> and alter those while depooled (essentially using entirely separate set of
> db slaves instead of a separate table within each slave).
>
> Do we create the new table(s) separately and switch over once it's caught
> up? This would require doing multiple passes as we depool slaves one by one
> (we've done that before at Wikimedia). Switch-over could be done by
> migrating before the software upgrade, with a very short read-only period
> after the last pass is finished. It wouldn't require maintaining multiple
> code paths, which is attractive.
>
> Other ideas?
>
> -- Timo
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] 2016W35 ArchCom-RFC meeting: Revisiting image/oldimage (T589)

Reply via email to