Hi Nick,

You're opening a very interesting topic about the foundation of our metadata concept :)

I agree with you that metadata is not the best place to store the thumbnail result. Until now, our metadata has been a simple map of key:value pairs. This structure is not really flexible in some cases. For example, suppose we want to store author information, where each author has a first name and a last name. Ideally, we could have something like a struct:

Person:
  FirstName
  LastName
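
To make the limitation concrete, here is a minimal sketch in plain Java. It does not use Tika's Metadata class; the Person type, the flatten helper and the "author.N.field" key naming are all hypothetical, chosen only to show what a structured value has to become when the only container is a flat key:value map.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical structured value; not part of any existing Tika API.
final class Person {
    final String firstName;
    final String lastName;

    Person(String firstName, String lastName) {
        this.firstName = firstName;
        this.lastName = lastName;
    }
}

final class StructuredMetadataSketch {
    // With a flat map, each field needs its own composite key,
    // e.g. "author.0.firstName" (an assumed naming scheme).
    static Map<String, String> flatten(List<Person> authors) {
        Map<String, String> flat = new LinkedHashMap<>();
        for (int i = 0; i < authors.size(); i++) {
            Person p = authors.get(i);
            flat.put("author." + i + ".firstName", p.firstName);
            flat.put("author." + i + ".lastName", p.lastName);
        }
        return flat;
    }

    public static void main(String[] args) {
        List<Person> authors = new ArrayList<>();
        authors.add(new Person("Hong-Thai", "Nguyen"));
        authors.add(new Person("Nick", "Burch"));
        // Prints one entry per field per author.
        System.out.println(flatten(authors));
    }
}
```

The grouping of first and last name survives only as a naming convention in the keys, which is the inflexibility described above.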

Another example is our future thumbnail support. We could have a 'thumbnail' metadata entry with a hierarchical structure like:

Thumbnail:
  Dimension
    Width
    Length
  MimeType
  Extension
  Pages
  Description

That would need a huge refactoring of our core model.

Another solution is to keep the thumbnail result as a list, List<byte[]>, instead of a single value. Each element is the thumbnail of one page. If the list has only one element, it means there is only a thumbnail of the first page.

Hong-Thai

-----Original Message-----
From: Nick Burch [mailto:apa...@gagravarr.org]
Sent: Thursday, 9 January 2014 12:11
To: dev@tika.apache.org
Subject: RE: Extract thumbnail from openxml office files

On Thu, 9 Jan 2014, Hong-Thai Nguyen wrote:
> By searching on issues, I found the issue already created:
> https://issues.apache.org/jira/browse/TIKA-90

I'm not sure if the metadata is the right place to return this. Some formats offer a small thumbnail, others can offer a small thumbnail for every page, and at least one can include a full-size image of the first page.

Would we not be better off exposing these embedded renderings via the existing embedded resources handling, with some sort of handy way to identify what something is (eg this is a full-size PNG of page 1, this is a jpg thumbnail of page 3)?

Nick
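
As a rough illustration of the two ideas in this thread, the sketch below takes the List<byte[]> convention (element i is the thumbnail of page i + 1) and attaches the kind of label Nick asks for. It is plain Java and does not touch Tika's embedded resource handling; the Rendering class, its fields and fromPageList are hypothetical names invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical descriptor for one embedded rendering; not a Tika class.
final class Rendering {
    final int page;         // 1-based page number
    final String mimeType;  // e.g. "image/jpeg"
    final boolean fullSize; // false means a small thumbnail
    final byte[] data;

    Rendering(int page, String mimeType, boolean fullSize, byte[] data) {
        this.page = page;
        this.mimeType = mimeType;
        this.fullSize = fullSize;
        this.data = data;
    }

    // Human-readable label, e.g. "thumbnail image/jpeg of page 3".
    String describe() {
        return (fullSize ? "full-size " : "thumbnail ") + mimeType + " of page " + page;
    }
}

final class ThumbnailListSketch {
    // Interprets the list by position: index 0 is page 1, index 1 is page 2, ...
    static List<Rendering> fromPageList(List<byte[]> thumbnails, String mimeType) {
        List<Rendering> result = new ArrayList<>();
        for (int i = 0; i < thumbnails.size(); i++) {
            result.add(new Rendering(i + 1, mimeType, false, thumbnails.get(i)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<byte[]> pages = new ArrayList<>();
        pages.add(new byte[] {1, 2, 3}); // placeholder bytes, not a real image
        for (Rendering r : fromPageList(pages, "image/jpeg")) {
            System.out.println(r.describe());
        }
    }
}
```

A bare List<byte[]> carries the page number only through position and says nothing about MIME type or size, so a descriptor of this kind (or an equivalent set of properties on each embedded resource) would be needed either way.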