[ 
https://issues.apache.org/jira/browse/TIKA-930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13284794#comment-13284794
 ] 

Ray Gauss II commented on TIKA-930:
-----------------------------------

I'm not sure what our policy is on using standards that aren't yet ratified, 
but I'm in favor of using the most generic standard out there.

For all of these, the composite's secondary properties array is really just a 
means of providing backwards compatibility.  Individual parsers can set 
multiple metadata properties with the same value if they desire.
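
To illustrate the backwards-compatibility point, here is a minimal sketch of how a composite key could fan one value out to a primary name plus its legacy secondary names. It is not Tika's Property class: CompositeKey, its set method, and the string names passed in main are all hypothetical stand-ins chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a composite property; NOT Tika's Property class.
final class CompositeKey {
    final String primary;
    final List<String> secondaries;

    CompositeKey(String primary, String... secondaries) {
        this.primary = primary;
        this.secondaries = List.of(secondaries);
    }

    // Writes the value under the primary name and under every secondary name,
    // so readers that still look up a legacy name keep working.
    void set(Map<String, String> metadata, String value) {
        metadata.put(primary, value);
        for (String name : secondaries) {
            metadata.put(name, value);
        }
    }
}

final class CompositeKeyDemo {
    public static void main(String[] args) {
        // The names below are illustrative placeholders only.
        CompositeKey keywords =
                new CompositeKey("primary:keywords", "legacy-keywords", "legacy-subject");
        Map<String, String> metadata = new LinkedHashMap<>();
        keywords.set(metadata, "tika, metadata");
        System.out.println(metadata);
    }
}
```

A parser that wants yet another name populated can simply write that name itself with the same value, which is why the secondaries list only needs to cover the deprecated names.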

Individual property comments:

Creator/Author:
DublinCore seems to be a bit vague here, but I believe most users treat 
DublinCore.CREATOR as the creator of the file.  Author is the creator of the 
intellectual property that the file represents.  IPTC.CREATOR, which references 
DublinCore.CREATOR, does further define it as the IP creator.  I think 
something that describes the IP creator should stay in TikaCoreProperties, 
distinct from the file creator, but naming it AUTHOR isn't as general as 
something like INTELLECTUAL_PROPERTY_CREATOR would be.

I'm not sure INITIAL_AUTHOR and LAST_AUTHOR need to be included in 
TikaCoreProperties though.  Those seem like something individual parsers should 
set.

Creation date:
If we go with the newer DC namespace then DublinCore.CREATED should be the 
primary for TikaCoreProperties.CREATION_DATE.  Individual parsers can also set 
XMP.CREATE_DATE if they want and it doesn't need to be included here.

Modification date:
I was just trying to consolidate the naming convention.  If everyone thinks 
'modified' is more standard vocabulary, that's fine, but then 
TikaCoreProperties.CREATION_DATE should become TikaCoreProperties.CREATED. 
Individual parsers can also set XMP.MODIFY_DATE if they want, so it doesn't 
need to be included here.

Creator tool:
Sounds reasonable, and likely a common need.

Rating:
Sounds like a common need, though XMP.RATING's definition of a rating (an 
externalReal that is either -1 or in [0..5]) may not be generic enough for 
inclusion here.  I'd be interested to hear others' thoughts on this.
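
For reference, the -1 or [0..5] constraint mentioned above can be sketched as a simple range check. RatingCheck and isValidXmpRating are hypothetical names and are not part of Tika or any XMP library; the sketch only shows how narrow the accepted value set is.

```java
// Hypothetical helper; not part of Tika or any XMP library.
final class RatingCheck {
    // Returns true when the value is exactly -1 or lies in the closed range [0, 5].
    static boolean isValidXmpRating(double rating) {
        if (Double.isNaN(rating)) {
            return false;
        }
        return rating == -1.0 || (rating >= 0.0 && rating <= 5.0);
    }

    public static void main(String[] args) {
        System.out.println(isValidXmpRating(-1));   // accepted
        System.out.println(isValidXmpRating(3.5));  // accepted
        System.out.println(isValidXmpRating(7));    // rejected by this check
    }
}
```

A format that rates on a different scale, say 1 to 10 or 0 to 100, would need its values mapped into this range first, which is the concern about genericity.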

Metadata date:
Sounds reasonable, and likely a common need.

Geo coordinates:
The W3C properties are the most generic and make sense for all file types.  
Individual parsers can also set EXIF properties.  Renaming Geographic to 
W3CGeographic may make sense.

Copyright:
Agreed.
                
> Consolidation of Some Tika Core Properties
> ------------------------------------------
>
>                 Key: TIKA-930
>                 URL: https://issues.apache.org/jira/browse/TIKA-930
>             Project: Tika
>          Issue Type: Improvement
>          Components: metadata
>    Affects Versions: 1.2
>            Reporter: Ray Gauss II
>
> There are a few properties in TikaCoreProperties which overlap and I think we 
> should minimize ambiguity by consolidating them into a single composite 
> property with the clearest name, the most general specification referenced as 
> its primary property, and the others and deprecated strings as its 
> secondaries.
> Here's the proposed pseudo-code for the changes:
>
> Remove TikaCoreProperties.SUBJECT
> TikaCoreProperties.KEYWORDS <- DublinCore.SUBJECT,
>     { Office.KEYWORDS, MSOffice.KEYWORDS, Metadata.SUBJECT }
>
> Remove TikaCoreProperties.DATE
> TikaCoreProperties.CREATION_DATE <- DublinCore.DATE,
>     { Office.CREATION_DATE, MSOffice.CREATION_DATE, Metadata.DATE }
>
> Remove TikaCoreProperties.MODIFIED
> TikaCoreProperties.SAVE_DATE <- DublinCore.MODIFIED,
>     { Office.SAVE_DATE, MSOffice.LAST_SAVED, Metadata.MODIFIED, "Last-Modified" }
>
> and an example of the Java changes:
> {code:title=TikaCoreProperties.java *Before*}
>     /**
>      * @see DublinCore#SUBJECT
>      */
>     public static final Property SUBJECT = Property.composite(DublinCore.SUBJECT,
>             new Property[] { Property.internalText(Metadata.SUBJECT) });
>       
>     /**
>      * @see Office#KEYWORDS
>      */
>     public static final Property KEYWORDS = Property.composite(Office.KEYWORDS,
>             new Property[] { Property.internalTextBag(MSOffice.KEYWORDS) });
> {code}
> would become
> {code:title=TikaCoreProperties.java *After*}
>     /**
>      * @see DublinCore#SUBJECT
>      * @see Office#KEYWORDS
>      */
>     public static final Property KEYWORDS = Property.composite(DublinCore.SUBJECT,
>             new Property[] {
>                 Office.KEYWORDS,
>                 Property.internalTextBag(MSOffice.KEYWORDS),
>                 Property.internalText(Metadata.SUBJECT)
>             });
> {code}
> Since this would require a bit of refactoring for parsers that use the 
> properties being removed I thought it best to get some feedback before 
> working up a full patch.
> Does this seem like a reasonable approach?
