Hi all,

For those who don't know me, I'm one of the GSoC students this year. My mentor is ^demon, and my project is to enhance support for metadata in uploaded files. Similar to the recent thread on interwiki transclusions, I thought I'd ask for comments about what I propose to do.

Currently metadata is stored in the img_metadata field of the image table as a serialized PHP array. While this works fine for the primary use case - listing the metadata in a little box on the image description page - it's not very flexible. It's impossible to do queries like "get a list of images with some specific metadata property equal to some specific value", or "get a list of images ordered by what software edited them".

So as part of my project I would like to move the metadata to its own table. However, I think the structure of the table will need to be a little more complicated than just <page id>, <name>, <value> triples, since ideally it would be able to store XMP metadata, which can contain nested structures. XMP metadata is pretty much the most complex metadata format currently popular (for metadata stored inside images anyways), and can store pretty much all other types of metadata. It's also the only format that can store multi-lingual content, which is a definite plus as those commons folks love their languages. Thus I think it would be wise to make the table store information in a manner that is rather close to the XMP data model.

So basically my proposed metadata table looks like:

*meta_id - primary key, auto-incrementing integer
*meta_page - foreign key for page_id - what image is this for
*meta_type - type of entry - simple value or some sort of compound structure. XMP supports ordered/unordered lists, associative-array-type structures, and alternative arrays (things like arrays listing the value of the property in different languages).
*meta_schema - XMP uses different namespaces to prevent name collisions. Exif properties have their own namespace, IPTC properties have their own namespace, etc.
*meta_name - the name of the property
*meta_value - the value of the property (or null for some compound things, see below)
*meta_ref - a reference to a meta_id of a different row for nested structures, or null if not applicable (or 0 perhaps)
*meta_qualifies - boolean to denote if this property is a qualifier (in XMP there are normal properties and qualifiers)

(see http://www.mediawiki.org/wiki/User:Bawolff/metadata_table for a longer explanation of the table structure)

Now, before everyone says eww, nested structures in a db are inefficient and whatnot, I don't think it's that bad (however I'm new to the whole scalability thing, so hopefully someone more knowledgeable than me will confirm or deny that). The XMP specification specifically says that there is no artificial limit on nesting depth, however in general practice it's not nested very deeply. Furthermore, in most cases the tree structure can be safely ignored. Consider:

*Use-case 1 (primary use case): displaying a metadata info box on an image page. Most of the time that'd be translating specific names and values into HTML table cells. The tree structure is totally unnecessary. For example, the Exif property DateTimeOriginal can only appear once per image (also it can only appear at the root of the tree structure, but that's beside the point). There is no need to reconstruct the tree, just look through all the props for the one you need. If the tree structure is important it can be reconstructed on the PHP side, and would typically be only the part of the tree that is relevant, not the entire nested structure.
*Use-case 2 (secondary use case): get a list of images ordered by some property starting at foo, or get a list of images where property bar = baz. In this case it's a simple select. It does not matter where in the tree structure the property is.

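To make the two use cases concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than MediaWiki's actual database layer. The table name image_metadata, the column types, the meta_type labels, the short namespace prefixes standing in for full XMP namespace URIs, and the choice to have meta_ref point at the parent row are all assumptions made for illustration; the proposal itself does not fix any of them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical rendering of the proposed table; names and types are assumed.
conn.execute("""
    CREATE TABLE image_metadata (
        meta_id        INTEGER PRIMARY KEY AUTOINCREMENT,
        meta_page      INTEGER NOT NULL,  -- page_id of the image
        meta_type      TEXT NOT NULL,     -- assumed labels: 'simple', 'alt', ...
        meta_schema    TEXT NOT NULL,     -- namespace (short prefix used here)
        meta_name      TEXT NOT NULL,
        meta_value     TEXT,              -- NULL for compound entries
        meta_ref       INTEGER,           -- assumed: meta_id of the parent row
        meta_qualifies INTEGER NOT NULL DEFAULT 0
    )
""")


def add(page, mtype, schema, name, value=None, ref=None, qualifies=0):
    """Insert one metadata row and return its meta_id."""
    cur = conn.execute(
        "INSERT INTO image_metadata (meta_page, meta_type, meta_schema,"
        " meta_name, meta_value, meta_ref, meta_qualifies)"
        " VALUES (?, ?, ?, ?, ?, ?, ?)",
        (page, mtype, schema, name, value, ref, qualifies),
    )
    return cur.lastrowid


# Made-up sample data: image 1 has a nested language-alternative title.
add(1, "simple", "exif", "DateTimeOriginal", "2010-05-01T12:00:00")
add(1, "simple", "xmp", "CreatorTool", "GIMP")
title = add(1, "alt", "dc", "title")                    # compound, no value
item = add(1, "simple", "dc", "title", "Sunset", ref=title)
add(1, "simple", "xml", "lang", "en", ref=item, qualifies=1)
add(2, "simple", "xmp", "CreatorTool", "Photoshop")
add(3, "simple", "xmp", "CreatorTool", "GIMP")

# Use-case 1: every displayable property of one image, ignoring the tree.
info_box = conn.execute(
    "SELECT meta_schema, meta_name, meta_value FROM image_metadata"
    " WHERE meta_page = ? AND meta_value IS NOT NULL AND meta_qualifies = 0"
    " ORDER BY meta_id",
    (1,),
).fetchall()

# Use-case 2a: images where a property equals a given value.
gimp_pages = [row[0] for row in conn.execute(
    "SELECT meta_page FROM image_metadata"
    " WHERE meta_schema = 'xmp' AND meta_name = 'CreatorTool'"
    " AND meta_value = ? ORDER BY meta_page",
    ("GIMP",),
)]

# Use-case 2b: images ordered by the software that edited them.
by_software = conn.execute(
    "SELECT meta_page, meta_value FROM image_metadata"
    " WHERE meta_schema = 'xmp' AND meta_name = 'CreatorTool'"
    " ORDER BY meta_value, meta_page"
).fetchall()

print(info_box)
print(gimp_pages)
print(by_software)
```

None of the three queries touches meta_ref, so the nesting costs nothing in these cases. An index covering meta_schema, meta_name and meta_value would presumably be needed for use-case 2 on a real wiki, but that is not shown here.
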
Thus, all the nestedness of XMP is preserved (so we could re-output it into XMP form if we so desired), and there is no evil joining of the metadata table with itself over and over again (or at all). From what I understand, self-joining to reconstruct nested structures is what makes them inefficient in databases.

I also think this schema would be future-proof because it can store pretty much all metadata we can think of. We can also extend it with custom properties we make up that are guaranteed to not conflict with anything (the X in XMP is for extensible).

As a side note, based on my rather informal survey of commons (aka the couple of people who happened to be on #wikimedia-commons at that moment), another use case people think would be cool and useful is metadata intersections, and metadata-category intersections. I'm not planning to do this as part of my project, as I believe that would have performance issues. However, doing a metadata table like this does leave the possibility open for people to do such intersection things on the toolserver or in a DPL-like extension.

I'd love to get some feedback on this. Is this a reasonable approach for me to take?

Thanks for reading.

--
-bawolff

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l