On 12/10/15 14:22, Nick Burch wrote:
> Currently, it's very easy for a new user of Tika to get the metadata
> they want out, they can just fetch a simple string value to get
> started with. You can, when you learn more, start getting more richly
> typed values out, but the quickstart is simple. Some libraries make it
> so that you have to learn the full rich metadata structure right from
> the get-go, which causes problems for new users. Whatever we do to
> help the power users, we need to not ruin it for the beginners!
What would be the approach for more richly typed values? Would they be
an extension of the current model, or a second model existing in
parallel with the first one?
> For the discussion on "what should a richer Tika metadata system be
> based on", I think TIKA-1607 is where that is taking place, plus some
> related threads on-list.
Thanks for the link. TIKA-1607 seems to be about associating arbitrary
java.lang.Object values with property keys. But isn't that a little
opaque? I mean, if a user gets an instance of a class they don't know,
how do they extract information from it?
> In the short term, if there are some key parts of that standard for
> geospatial metadata that we don't currently handle, and could do
> easily with the current setup, then we should raise a JIRA + get a
> sample file + add the support
Regarding ISO 19115 support, the main question to me seems to be how to
handle a tree structure. The current Tika metadata structure seems to
be like a Map<String,String[]> (please correct me if I'm wrong), while
ISO 19115 is more like a Map<String,Node> where each Node can contain
child nodes, thus forming a tree. The following example in Tika:
Creator…………………… Jon Smith
Publisher……………… A company
Title………………………… Anything
would be in the ISO 19115 model (note how the creator and publisher are
grouped under the same "responsible party" node):
Citation
  ├─Title………………………………………………… Anything
  └─Cited responsible party
      [1]
      ├─Role…………………………………………… Author
      └─Individual
          └─Name…………………………………… Jon Smith
      [2]
      ├─Role…………………………………………… Publisher
      └─Organisation
          └─Name…………………………………… A company
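
To make the contrast between the two shapes concrete, here is a minimal
Java sketch. The Node and MetadataShapes classes are hypothetical names
invented for this illustration; they are not Tika API and not taken from
any ISO 19115 library. Children are kept in a list rather than a map so
that the "Cited responsible party" entry can repeat.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical tree node (an assumption for illustration, not a Tika class).
class Node {
    final String name;
    final String value;                          // null for non-leaf nodes
    final List<Node> children = new ArrayList<>();

    Node(String name, String value) {
        this.name = name;
        this.value = value;
    }

    // Appends a child and returns this node so calls can be chained.
    Node add(Node child) {
        children.add(child);
        return this;
    }

    // Returns the first direct child with the given name, or null if none.
    Node child(String childName) {
        for (Node c : children) {
            if (c.name.equals(childName)) {
                return c;
            }
        }
        return null;
    }
}

class MetadataShapes {
    // Flat shape: each key maps to one or more string values.
    static Map<String, String[]> flat() {
        Map<String, String[]> m = new LinkedHashMap<>();
        m.put("Creator", new String[] {"Jon Smith"});
        m.put("Publisher", new String[] {"A company"});
        m.put("Title", new String[] {"Anything"});
        return m;
    }

    // Tree shape: creator and publisher become two "responsible party" nodes.
    static Node tree() {
        Node author = new Node("Cited responsible party", null)
            .add(new Node("Role", "Author"))
            .add(new Node("Individual", null)
                .add(new Node("Name", "Jon Smith")));
        Node publisher = new Node("Cited responsible party", null)
            .add(new Node("Role", "Publisher"))
            .add(new Node("Organisation", null)
                .add(new Node("Name", "A company")));
        return new Node("Citation", null)
            .add(new Node("Title", "Anything"))
            .add(author)
            .add(publisher);
    }

    public static void main(String[] args) {
        System.out.println(flat().get("Creator")[0]);
        Node firstParty = tree().children.get(1);
        System.out.println(firstParty.child("Role").value + ": "
            + firstParty.child("Individual").child("Name").value);
    }
}
```

In the flat shape the role is encoded in the key itself; in the tree
shape the role is just another value under the party node.
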
The tree structure makes it possible to add other information, such as
email addresses and phone numbers, without confusion about whether an
address applies to the creator or to the publisher. Of course a flat
structure could prefix property names (e.g. "creator_address",
"publisher_address", etc.), but this would result in a lot of keys. For
example, ISO 19115 defines 20 standard roles (resourceProvider,
custodian, owner, user, distributor, originator, pointOfContact,
principalInvestigator, processor, publisher, author, sponsor, coAuthor,
collaborator, editor, mediator, rightsHolder, contributor, funder,
stakeholder), and each of them can be associated with about 30
properties under the "Cited responsible party" node (name,
positionName, phone, city, administrativeArea, postalCode, country,
hoursOfService, contactInstruction, onlineResource, etc.). Would Tika
want to handle that amount of data, and if so, is a flat structure
really appropriate?
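
To give a feel for how the prefixing approach grows, here is a small
Java sketch that builds every "<role>_<property>" key. The FlatKeyCount
class and the underscore naming scheme are assumptions made up for this
illustration, not an existing Tika convention. With 20 roles and about
30 properties, the same loop would produce roughly 20 x 30 = 600 keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not Tika API): enumerates prefixed flat keys.
class FlatKeyCount {
    // Builds one "<role>_<property>" key per role/property pair.
    static List<String> flatKeys(List<String> roles, List<String> properties) {
        List<String> keys = new ArrayList<>();
        for (String role : roles) {
            for (String property : properties) {
                keys.add(role + "_" + property);
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        // A small subset of the ISO 19115 roles and properties listed above.
        List<String> roles = Arrays.asList("author", "publisher", "owner");
        List<String> properties = Arrays.asList("name", "phone", "city", "country");
        List<String> keys = flatKeys(roles, properties);
        System.out.println(keys.size() + " keys, e.g. " + keys.get(0));
    }
}
```
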
Martin