[
https://issues.apache.org/jira/browse/TIKA-3963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17683158#comment-17683158
]
Tim Allison commented on TIKA-3963:
-----------------------------------
The documentation should convey that we are dropping "author" in favor of
"dc:creator" in cases where the parser maps the value to both keys. However,
the HtmlParser appears not to have done that mapping, which is a bit of a
surprise to me.
This is the output from 1.28.5:
{noformat}
[{"Content-Encoding":"UTF-8","Content-Length":"436","Content-Type":"text/html;
charset\u003dUTF-8","X-Parsed-By":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.html.HtmlParser"],"X-TIKA:content":"\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nAll
meta information goes in the head
section...\n\n\n","X-TIKA:content_handler":"ToTextContentHandler","X-TIKA:embedded_depth":"0","X-TIKA:parse_time_millis":"149","author":"John
Doe","description":"Free Web
tutorials","keywords":"HTML,CSS,XML,JavaScript","resourceName":"author.html","title":"OldMetaTitle","viewport":"width\u003ddevice-width,
initial-scale\u003d1.0"}]
{noformat}
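Until the parser does this mapping itself, a client could fill in the dc:x keys after the fact. The following is only a minimal sketch of that idea, not anything Tika ships: the LEGACY_TO_DC table and the normalize() helper are hypothetical names, the "title" to "dc:title" pairing is assumed from the issue title, and the sample input is a trimmed copy of the output above.

```python
import json

# Hypothetical lookup table: legacy HTML meta keys -> Dublin Core keys.
# These pairs are an assumption for illustration, not Tika's own mapping.
LEGACY_TO_DC = {
    "author": "dc:creator",
    "title": "dc:title",
}


def normalize(metadata):
    """Return a copy of one metadata dict with dc:x keys filled in.

    A legacy value is copied only when the dc:x key is absent, so a
    value that the parser already mapped is never overwritten.
    """
    result = dict(metadata)
    for legacy_key, dc_key in LEGACY_TO_DC.items():
        if legacy_key in result and dc_key not in result:
            result[dc_key] = result[legacy_key]
    return result


# Trimmed copy of the /rmeta output shown above (one dict per document).
rmeta_output = (
    '[{"author": "John Doe", "title": "OldMetaTitle", '
    '"resourceName": "author.html"}]'
)

normalized = [normalize(doc) for doc in json.loads(rmeta_output)]
print(normalized[0]["dc:creator"], "|", normalized[0]["dc:title"])
```

The sketch leaves the legacy keys in place, so existing consumers of "author" and "title" keep working while new ones read the dc:x keys.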
> HTML author and title aren't mapped to their dc:x counterparts
> --------------------------------------------------------------
>
> Key: TIKA-3963
> URL: https://issues.apache.org/jira/browse/TIKA-3963
> Project: Tika
> Issue Type: Bug
> Components: metadata
> Affects Versions: 2.6.0
> Environment: Tika server on Windows
> Curl client on WSL Ubuntu instance
>
> Reporter: Josh Burchard
> Priority: Major
> Attachments: author.html
>
>
> The 2.x migration doc
> ([here|https://cwiki.apache.org/confluence/display/TIKA/Migrating+to+Tika+2.0.0])
> mentions that author and title are generally, and automatically, mapped to
> their dc:x equivalents when returned by Tika 2.x. That doesn't seem to be
> happening for HTML files. Can this be fixed?
> [^author.html]
> {{$ curl -X PUT --upload-file /mnt/c/tmp/author.html --header
> "Content-Disposition: attachment; filename=\"author.html\"" --header
> "Accept:Application/json" http://localhost:9998/rmeta/text | python -m
> json.tool}}
> {{ % Total % Received % Xferd Average Speed Time Time Time
> Current}}
> {{ Dload Upload Total Spent Left
> Speed}}
> {{100 1152 100 716 100 436 685 417 0:00:01 0:00:01 --:--:--
> 1102}}
> {{[}}
> {{ {}}
> {{ "Content-Encoding": "UTF-8",}}
> {{ "Content-Length": "436",}}
> {{ "Content-Type": "text/html; charset=UTF-8",}}
> {{ "X-TIKA:Parsed-By": [}}
> {{ "org.apache.tika.parser.DefaultParser",}}
> {{ "org.apache.tika.parser.html.HtmlParser"}}
> {{ ],}}
> {{ "X-TIKA:Parsed-By-Full-Set": [}}
> {{ "org.apache.tika.parser.DefaultParser",}}
> {{ "org.apache.tika.parser.html.HtmlParser"}}
> {{ ],}}
> {{ "X-TIKA:content": "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nAll meta
> information goes in the head section...\n\n\n",}}
> {{ "X-TIKA:content_handler": "ToTextContentHandler",}}
> {{ "X-TIKA:embedded_depth": "0",}}
> {{ "X-TIKA:parse_time_millis": "886",}}
> {{{color:#FF0000} "author": "John Doe",{color}}}
> {{ "description": "Free Web tutorials",}}
> {{ "keywords": "HTML,CSS,XML,JavaScript",}}
> {{ "resourceName": "author.html",}}
> {{{color:#FF0000} "title": "OldMetaTitle",{color}}}
> {{ "viewport": "width=device-width, initial-scale=1.0"}}
> {{ }}}
> {{]}}