
[jira] [Commented] (TIKA-2274) <title> and <meta name="title"> metadata collision</span></a></span> </h1> <p class="darkgray font13"> <span class="sender pipe"><a href="/search?l=dev@tika.apache.org&q=from:%22Tim+Allison+%5C%28JIRA%5C%29%22" rel="nofollow"><span itemprop="author" itemscope itemtype="http://schema.org/Person"><span itemprop="name">Tim Allison (JIRA)</span></span></a></span> <span class="date"><a href="/search?l=dev@tika.apache.org&q=date:20170227" rel="nofollow">Mon, 27 Feb 2017 17:27:02 -0800</a></span> </p> </div> <div itemprop="articleBody" class="msgBody">  <pre> [ <a rel="nofollow" href="https://issues.apache.org/jira/browse/TIKA-2274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15887004#comment-15887004">https://issues.apache.org/jira/browse/TIKA-2274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15887004#comment-15887004</a> ] </pre><pre>
Tim Allison commented on TIKA-2274:
-----------------------------------

I'm only seeing one title extracted if I use your example with trunk.

{noformat}
    @Test
    public void testMultipleTitles() throws Exception {
        // Expect a single title value for the reporter's example file.
        String[] titles = getXML("testHTML_multipleTitles.html").metadata.getValues(TikaCoreProperties.TITLE);
        assertEquals(1, titles.length);
    }
{noformat}

As you point out, and if I remember correctly, dc:title must be single-valued (aside from multiple languages, but that's another issue).

I'm not against namespacing <meta name="title"> so that we capture the various titles, as long as we leave dc:title as it is. We did something similar with PDFs to capture differences between the XMP and the "regular" metadata.

What's your recommendation for a namespace?
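
To make the namespacing idea concrete, here is a minimal sketch. It is not Tika's implementation: a plain map stands in for Tika's Metadata object, and the class name, the method names, and the "html-meta:" prefix are all hypothetical, since no namespace has been chosen in this thread. It only shows dc:title staying single-valued while the <meta name="title"> value is kept under a separate key.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative stand-in for a metadata store; NOT part of Tika.
final class NamespacedTitles {
    // Hypothetical key names; the thread has not settled on a namespace.
    static final String DC_TITLE = "dc:title";
    static final String META_PREFIX = "html-meta:";

    // Multi-valued store: each key maps to the list of values seen for it.
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    // Text of the <title> element: dc:title keeps only the first value it sees.
    void onTitleElement(String text) {
        List<String> titles = values.computeIfAbsent(DC_TITLE, k -> new ArrayList<>());
        if (titles.isEmpty()) {
            titles.add(text);
        }
    }

    // Each <meta name="..." content="..."> is stored under the prefixed key,
    // so it can never collide with dc:title.
    void onMetaTag(String name, String content) {
        values.computeIfAbsent(META_PREFIX + name, k -> new ArrayList<>()).add(content);
    }

    // Returns the values for a key, or an empty list if the key was never set.
    List<String> get(String key) {
        return values.getOrDefault(key, Collections.emptyList());
    }

    public static void main(String[] args) {
        NamespacedTitles md = new NamespacedTitles();
        md.onTitleElement("Some title");
        md.onMetaTag("title", "some other title");
        System.out.println(DC_TITLE + " = " + md.get(DC_TITLE));
        System.out.println(META_PREFIX + "title = " + md.get(META_PREFIX + "title"));
    }
}
```

With the reporter's example, the <title> text ends up as the only dc:title value and the <meta> content is still available under the prefixed key, which is the behaviour the test above checks for on the dc:title side.
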
> <title> and <meta name="title"> metadata collision
> --------------------------------------------------
>
> Key: TIKA-2274
> URL: <a rel="nofollow" href="https://issues.apache.org/jira/browse/TIKA-2274">https://issues.apache.org/jira/browse/TIKA-2274</a>
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.14
> Reporter: Matthew Caruana Galizia
> Priority: Minor
> Labels: html
>
> In several different corpuses I've found HTML files which look like the
> following:
> {code}
> <html>
> <head>
> <title>Some title</title>
> <meta name="title" content="some other title">
> </head>
> ...
> </html>
> {code}
> This causes the "title" property in the metadata to have two values set, when
> one would expect that this field is not multivalued.
> Perhaps some fields from <meta> tags, like this one, should be namespaced.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
</pre> </div> <div class="overflow msgActions 
margintopdouble"> </div> </div> </div> </body> </html>