[ https://issues.apache.org/jira/browse/TIKA-2274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15887004#comment-15887004 ]
Tim Allison commented on TIKA-2274:
-----------------------------------
I'm only seeing one title extracted if I use your example with trunk.
{noformat}
@Test
public void testMultipleTitles() throws Exception {
    String[] titles = getXML("testHTML_multipleTitles.html")
            .metadata.getValues(TikaCoreProperties.TITLE);
    assertEquals(1, titles.length);
}
{noformat}
As you point out, and if I remember correctly, dc:title must be single-valued
(aside from the multiple languages, but that's another issue).
I'm not against namespacing <meta name="title"> so that we capture the various
titles as long as we leave dc:title as it is. We did something similar with
PDFs to capture differences between the XMP and the "regular" metadata.
What's your recommendation for a namespace?
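To make the idea concrete, here is a rough sketch of how the two titles could be kept apart. It uses a plain map rather than Tika's own classes, and the "html:meta:" prefix, the class name, and the method names are all placeholders I made up for illustration, not an existing Tika API or an agreed namespace.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Standalone sketch: models metadata as key -> list of values.
// It does NOT use Tika classes; all names here are hypothetical.
class NamespacedTitleSketch {
    static final String DC_TITLE = "dc:title";
    // Placeholder prefix only; the namespace is still an open question.
    static final String META_PREFIX = "html:meta:";

    private final Map<String, List<String>> values = new LinkedHashMap<>();

    // <title> text replaces any earlier value, so dc:title stays single-valued.
    void handleTitleElement(String text) {
        List<String> single = new ArrayList<>();
        single.add(text);
        values.put(DC_TITLE, single);
    }

    // <meta name="..."> content is stored under a prefixed key, so
    // <meta name="title"> can never collide with dc:title.
    void handleMetaTag(String name, String content) {
        values.computeIfAbsent(META_PREFIX + name, k -> new ArrayList<>()).add(content);
    }

    // Returns all values for a key, or an empty array if the key is absent.
    String[] getValues(String key) {
        return values.getOrDefault(key, new ArrayList<>()).toArray(new String[0]);
    }

    public static void main(String[] args) {
        NamespacedTitleSketch md = new NamespacedTitleSketch();
        md.handleTitleElement("Some title");
        md.handleMetaTag("title", "some other title");
        System.out.println(md.getValues(DC_TITLE).length);
        System.out.println(md.getValues(META_PREFIX + "title")[0]);
    }
}
```

With the HTML from the issue description, dc:title would hold only "Some title", and the <meta> value would still be available under the prefixed key.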
> <title> and <meta name="title"> metadata collision
> --------------------------------------------------
>
> Key: TIKA-2274
> URL: https://issues.apache.org/jira/browse/TIKA-2274
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.14
> Reporter: Matthew Caruana Galizia
> Priority: Minor
> Labels: html
>
> In several different corpuses I've found HTML files which look like the
> following:
> {code}
> <html>
> <head>
> <title>Some title</title>
> <meta name="title" content="some other title">
> </head>
> ...
> </html>
> {code}
> This causes the "title" property in the metadata to have two values set, when
> one would expect that this field is not multivalued.
> Perhaps some fields from <meta> tags, like this one, should be namespaced.