Josh Burchard created TIKA-3963:
-----------------------------------
Summary: HTML author and title aren't mapped to their dc:x
counterparts
Key: TIKA-3963
URL: https://issues.apache.org/jira/browse/TIKA-3963
Project: Tika
Issue Type: Bug
Components: metadata
Affects Versions: 2.6.0
Environment: Tika server on Windows
Curl client on WSL Ubuntu instance
Reporter: Josh Burchard
Attachments: author.html
The 2.x migration doc
([here|https://cwiki.apache.org/confluence/display/TIKA/Migrating+to+Tika+2.0.0])
says that author and title are, in general, automatically mapped to their
dc:x equivalents in Tika 2.x output. That does not seem to happen for HTML
files: in the /rmeta/text output below, the values still appear under the
plain "author" and "title" keys (highlighted in red), and no dc:x keys are
returned at all. Can this be fixed?
[^author.html]
{{$ curl -X PUT --upload-file /mnt/c/tmp/author.html --header "Content-Disposition: attachment; filename=\"author.html\"" --header "Accept:Application/json" http://localhost:9998/rmeta/text | python -m json.tool}}
{{ % Total % Received % Xferd Average Speed Time Time Time Current}}
{{ Dload Upload Total Spent Left Speed}}
{{100 1152 100 716 100 436 685 417 0:00:01 0:00:01 --:--:-- 1102}}
{{[}}
{{ {}}
{{ "Content-Encoding": "UTF-8",}}
{{ "Content-Length": "436",}}
{{ "Content-Type": "text/html; charset=UTF-8",}}
{{ "X-TIKA:Parsed-By": [}}
{{ "org.apache.tika.parser.DefaultParser",}}
{{ "org.apache.tika.parser.html.HtmlParser"}}
{{ ],}}
{{ "X-TIKA:Parsed-By-Full-Set": [}}
{{ "org.apache.tika.parser.DefaultParser",}}
{{ "org.apache.tika.parser.html.HtmlParser"}}
{{ ],}}
{{ "X-TIKA:content": "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nAll meta information goes in the head section...\n\n\n",}}
{{ "X-TIKA:content_handler": "ToTextContentHandler",}}
{{ "X-TIKA:embedded_depth": "0",}}
{{ "X-TIKA:parse_time_millis": "886",}}
{{{color:#FF0000} "author": "John Doe",{color}}}
{{ "description": "Free Web tutorials",}}
{{ "keywords": "HTML,CSS,XML,JavaScript",}}
{{ "resourceName": "author.html",}}
{{{color:#FF0000} "title": "OldMetaTitle",{color}}}
{{ "viewport": "width=device-width, initial-scale=1.0"}}
{{ }}}
{{]}}
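
To make the gap easier to check against other files, here is a minimal Python
sketch that scans an /rmeta JSON response for legacy keys that come back
without a dc:x counterpart. The mapping table (author to dc:creator, title to
dc:title) is my assumption about what the migration doc intends, and the
function name unmapped_legacy_keys is purely illustrative; neither is part of
Tika.

```python
import json

# Assumed legacy-key -> Dublin Core key mapping (an assumption, not taken
# from Tika's source); only "author" and "title" appear in this report.
LEGACY_TO_DC = {"author": "dc:creator", "title": "dc:title"}


def unmapped_legacy_keys(rmeta_json):
    """Return {legacy_key: expected_dc_key} for every legacy key that is
    present in some metadata record while its dc:x counterpart is absent."""
    records = json.loads(rmeta_json)  # /rmeta returns a JSON list of records
    missing = {}
    for record in records:
        for legacy, dc in LEGACY_TO_DC.items():
            if legacy in record and dc not in record:
                missing[legacy] = dc
    return missing


# Trimmed-down copy of the response shown above.
sample = json.dumps([{
    "Content-Type": "text/html; charset=UTF-8",
    "author": "John Doe",
    "title": "OldMetaTitle",
}])

print(unmapped_legacy_keys(sample))
# {'author': 'dc:creator', 'title': 'dc:title'}
```

An empty result would mean every legacy key in the response already has its
assumed dc:x counterpart next to it.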
--
This message was sent by Atlassian Jira
(v8.20.10#820010)