[ https://issues.apache.org/jira/browse/TIKA-930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13284925#comment-13284925 ]
Jörg Ehrlich commented on TIKA-930:
-----------------------------------

Some answers to Ray's comments:

Creator:
The DublinCore creator is usually considered the creator of the intellectual property, not the creator of the file. That is what the "creator tool" property is for. So we should stick with the "creator" property and not use "author" or any other additional key.

Rating:
I think we had better not use anything more generic here. The generic approaches taken in the past are the reason why we have this huge mess of incompatible applications today. There is a strong reason why the Metadata Working Group introduced this definition as it is. A lot of important applications understand and use this definition today. And didn't we say we wanted to use only something that is clearly defined?

Geographic:
Have you found any files or file types that actually use the W3C approach to store geolocation data? All I have seen so far use Exif to store it :)

> Consolidation of Some Tika Core Properties
> ------------------------------------------
>
>                 Key: TIKA-930
>                 URL: https://issues.apache.org/jira/browse/TIKA-930
>             Project: Tika
>          Issue Type: Improvement
>          Components: metadata
>    Affects Versions: 1.2
>            Reporter: Ray Gauss II
>
> There are a few properties in TikaCoreProperties which overlap and I think we
> should minimize ambiguity by consolidating them into a single composite
> property with the clearest name, the most general specification referenced as
> its primary property, and the others and deprecated strings as its
> secondaries.
>
> Here's the proposed pseudo-code for the changes:
>
> Remove TikaCoreProperties.SUBJECT
> TikaCoreProperties.KEYWORDS <- DublinCore.SUBJECT, { Office.KEYWORDS,
> MSOffice.KEYWORDS, Metadata.SUBJECT }
>
> Remove TikaCoreProperties.DATE
> TikaCoreProperties.CREATION_DATE <- DublinCore.DATE, { Office.CREATION_DATE,
> MSOffice.CREATION_DATE, Metadata.DATE }
>
> Remove TikaCoreProperties.MODIFIED
> TikaCoreProperties.SAVE_DATE <- DublinCore.MODIFIED, { Office.SAVE_DATE,
> MSOffice.LAST_SAVED, Metadata.MODIFIED, "Last-Modified" }
>
> and an example of the Java changes:
>
> {code:title=TikaCoreProperties.java *Before*}
>     /**
>      * @see DublinCore#SUBJECT
>      */
>     public static final Property SUBJECT =
>         Property.composite(DublinCore.SUBJECT,
>             new Property[] { Property.internalText(Metadata.SUBJECT) });
>
>     /**
>      * @see Office#KEYWORDS
>      */
>     public static final Property KEYWORDS =
>         Property.composite(Office.KEYWORDS,
>             new Property[] { Property.internalTextBag(MSOffice.KEYWORDS) });
> {code}
>
> would become
>
> {code:title=TikaCoreProperties.java *After*}
>     /**
>      * @see DublinCore#SUBJECT
>      * @see Office#KEYWORDS
>      */
>     public static final Property KEYWORDS =
>         Property.composite(DublinCore.SUBJECT,
>             new Property[] {
>                 Office.KEYWORDS,
>                 Property.internalTextBag(MSOffice.KEYWORDS),
>                 Property.internalText(Metadata.SUBJECT)
>             });
> {code}
>
> Since this would require a bit of refactoring for parsers that use the
> properties being removed I thought it best to get some feedback before
> working up a full patch.
>
> Does this seem like a reasonable approach?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
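
For readers unfamiliar with the composite idea in the quoted proposal, here is a minimal, self-contained sketch of the intended effect: one consolidated key with a primary name and several legacy secondaries, where a single write becomes visible under every name. It does not use Tika at all. CompositeKey, Store, and the key strings ("dublincore.subject" and so on) are hypothetical placeholders made up for illustration, and the fan-out-on-write behaviour is an assumption about how such a composite is meant to work, not a copy of Tika's implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CompositeKeySketch {

    /** Hypothetical stand-in for a composite property: one primary name plus legacy aliases. */
    static final class CompositeKey {
        final String primary;
        final String[] secondaries;

        CompositeKey(String primary, String... secondaries) {
            this.primary = primary;
            this.secondaries = secondaries;
        }
    }

    /** Hypothetical stand-in for a metadata store keyed by plain strings. */
    static final class Store {
        private final Map<String, String> values = new LinkedHashMap<>();

        /** Assumed behaviour: a write goes to the primary name and to every secondary. */
        void set(CompositeKey key, String value) {
            values.put(key.primary, value);
            for (String alias : key.secondaries) {
                values.put(alias, value);
            }
        }

        String get(String name) {
            return values.get(name);
        }
    }

    /** Placeholder names only; these are not Tika's actual key strings. */
    static final CompositeKey KEYWORDS = new CompositeKey(
            "dublincore.subject",
            "office.keywords", "msoffice.keywords", "legacy.subject");

    public static void main(String[] args) {
        Store store = new Store();
        store.set(KEYWORDS, "tika, metadata");

        // The same value is readable under the primary and under each legacy name.
        System.out.println(store.get("dublincore.subject"));
        System.out.println(store.get("msoffice.keywords"));
        System.out.println(store.get("legacy.subject"));
    }
}
```

Under that assumption, parsers would only need to write the single consolidated key, while existing callers that still read one of the older names would keep working.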