[ https://issues.apache.org/jira/browse/TIKA-930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ray Gauss II resolved TIKA-930.
-------------------------------

    Resolution: Fixed
    Fix Version/s: 1.2

Fixed in r1356560.

This ended up being a fairly large commit. Feel free to revert or re-open this issue if I've messed something up.

I've included the commit message here as it describes the majority of the changes:

- Added the Dublin Core Terms namespace and prefix
- Changed DublinCore.CREATOR to multi-valued property
- Consolidated TikaCoreProperties.AUTHOR to TikaCoreProperties.CREATOR
- Removed TikaCoreProperties.LAST_AUTHOR and added TikaCoreProperties.MODIFIER
- Added DublinCore.CREATED
- Consolidated TikaCoreProperties.DATE and TikaCoreProperties.CREATION_DATE to TikaCoreProperties.CREATED
- Consolidated TikaCoreProperties.SAVE_DATE to TikaCoreProperties.MODIFIED
- Updated DublinCore.MODIFIED to correct terms namespace
- Added OpenOfficeXMLCore.SUBJECT
- Consolidated TikaCoreProperties.SUBJECT to TikaCoreProperties.KEYWORDS
- Added several temporary transition properties to TikaCoreProperties to ease migrating previous use of 'subject' to more specific properties and maintain backwards compatibility
  * For most mail-related parsers/handlers, transition subject to dc:title
  * For most office-related parsers/handlers, transition subject to OO cp:subject
- Added TikaCoreProperties.CREATOR_TOOL
- Added TikaCoreProperties.METADATA_DATE
- Added TikaCoreProperties.RATING
- Changed XMP to use common namespace delimiter
- Added Open Office word processing namespace and prefix to OfficeOpenXMLExtended
- Added OfficeOpenXMLExtended.COMMENTS
- Added TikaCoreProperties.COMMENTS which is a composite of OfficeOpenXMLExtended.COMMENTS, ClimateForecast.COMMENT and MSOffice.COMMENTS
- Deprecated MSOffice.Comments
- Changed OpenDocumentMetaParser to accommodate TikaCoreProperties since the XML it processes treats dc:date and dc:subject differently than DcXMLParser
- Change nextMetadata in TextExtractor to a Property rather than String key
- Changed DcXmlParser to use namespace already defined in DublinCore
- Updated parsers to reflect TikaCoreProperties changes
- Updated tika-xmp to reflect TikaCoreProperties changes
- Registered dcterms namespace in XMPMetadataTest
- Updated tests to reflect new changes and added some tests for backwards compatibility

> Consolidation of Some Tika Core Properties
> ------------------------------------------
>
>                 Key: TIKA-930
>                 URL: https://issues.apache.org/jira/browse/TIKA-930
>             Project: Tika
>          Issue Type: Improvement
>          Components: metadata
>    Affects Versions: 1.2
>            Reporter: Ray Gauss II
>             Fix For: 1.2
>
>
> There are a few properties in TikaCoreProperties which overlap and I think we should minimize ambiguity by consolidating them into a single composite property with the clearest name, the most general specification referenced as its primary property, and the others and deprecated strings as its secondaries.
> Here's the proposed pseudo-code for the changes:
>   Remove TikaCoreProperties.SUBJECT
>   TikaCoreProperties.KEYWORDS <- DublinCore.SUBJECT, { Office.KEYWORDS, MSOffice.KEYWORDS, Metadata.SUBJECT }
>   Remove TikaCoreProperties.DATE
>   TikaCoreProperties.CREATION_DATE <- DublinCore.DATE, { Office.CREATION_DATE, MSOffice.CREATION_DATE, Metadata.DATE }
>   Remove TikaCoreProperties.MODIFIED
>   TikaCoreProperties.SAVE_DATE <- DublinCore.MODIFIED, { Office.SAVE_DATE, MSOffice.LAST_SAVED, Metadata.MODIFIED, "Last-Modified" }
> and an example of the Java changes:
> {code:title=TikaCoreProperties.java *Before*}
>     /**
>      * @see DublinCore#SUBJECT
>      */
>     public static final Property SUBJECT =
>         Property.composite(DublinCore.SUBJECT,
>             new Property[] { Property.internalText(Metadata.SUBJECT) });
>
>     /**
>      * @see Office#KEYWORDS
>      */
>     public static final Property KEYWORDS =
>         Property.composite(Office.KEYWORDS,
>             new Property[] { Property.internalTextBag(MSOffice.KEYWORDS) });
> {code}
> would become
> {code:title=TikaCoreProperties.java *After*}
>     /**
>      * @see DublinCore#SUBJECT
>      * @see Office#KEYWORDS
>      */
>     public static final Property KEYWORDS =
>         Property.composite(DublinCore.SUBJECT,
>             new Property[] {
>                 Office.KEYWORDS,
>                 Property.internalTextBag(MSOffice.KEYWORDS),
>                 Property.internalText(Metadata.SUBJECT)
>             });
> {code}
> Since this would require a bit of refactoring for parsers that use the properties being removed I thought it best to get some feedback before working up a full patch.
> Does this seem like a reasonable approach?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
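
For readers unfamiliar with the composite-property idea in the description, here is a minimal, self-contained sketch of why consolidating onto one primary key with secondary aliases keeps older lookups working. It does not use the Tika API: `CompositeProperty` and `SimpleMetadata` are hypothetical stand-ins, and the key strings are just the constant names from the issue used as placeholders, not the real metadata keys.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CompositePropertySketch {

    /** Hypothetical stand-in: one primary key plus the legacy keys it replaces. */
    static final class CompositeProperty {
        final String primary;
        final List<String> secondaries;

        CompositeProperty(String primary, String... secondaries) {
            this.primary = primary;
            this.secondaries = Arrays.asList(secondaries);
        }
    }

    /** Hypothetical stand-in for a metadata store keyed by plain strings. */
    static final class SimpleMetadata {
        private final Map<String, String> values = new LinkedHashMap<>();

        /** Writes the value under the primary key and under every secondary key. */
        void set(CompositeProperty property, String value) {
            values.put(property.primary, value);
            for (String alias : property.secondaries) {
                values.put(alias, value);
            }
        }

        /** Returns the stored value, or null if the key was never written. */
        String get(String key) {
            return values.get(key);
        }
    }

    // Mirrors the "After" shape from the issue: DublinCore.SUBJECT is primary,
    // the other three are secondaries. The strings are placeholders (assumption).
    static final CompositeProperty KEYWORDS = new CompositeProperty(
            "DublinCore.SUBJECT",
            "Office.KEYWORDS", "MSOffice.KEYWORDS", "Metadata.SUBJECT");

    public static void main(String[] args) {
        SimpleMetadata metadata = new SimpleMetadata();
        metadata.set(KEYWORDS, "tika, metadata");

        // A caller that still reads a legacy key sees the same value.
        System.out.println(metadata.get("DublinCore.SUBJECT"));
        System.out.println(metadata.get("MSOffice.KEYWORDS"));
    }
}
```

A parser writes once through the consolidated property, and code that still reads one of the older keys gets the same value, which is the backwards-compatibility behaviour the description is aiming for.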