[ 
https://issues.apache.org/jira/browse/TIKA-3963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Burchard updated TIKA-3963:
--------------------------------
    Description: 
The 2.x migration doc 
([here|https://cwiki.apache.org/confluence/display/TIKA/Migrating+to+Tika+2.0.0])
 mentions that author is generally, and automatically, mapped to its 
dc:creator equivalent when returned by Tika 2.x.  That doesn't seem to be 
happening for HTML files. Can this be fixed?

[^author.html]

{{$ curl -X PUT --upload-file /mnt/c/tmp/author.html --header 
"Content-Disposition: attachment; filename=\"author.html\"" --header 
"Accept:Application/json" http://localhost:9998/rmeta/text | python -m 
json.tool}}
{{  % Total    % Received % Xferd  Average Speed   Time    Time     Time  
Current}}
{{                                 Dload  Upload   Total   Spent    Left  
Speed}}
{{100  1152  100   716  100   436    685    417  0:00:01  0:00:01 --:--:--  
1102}}
{{[}}
{{    {}}
{{        "Content-Encoding": "UTF-8",}}
{{        "Content-Length": "436",}}
{{        "Content-Type": "text/html; charset=UTF-8",}}
{{        "X-TIKA:Parsed-By": [}}
{{            "org.apache.tika.parser.DefaultParser",}}
{{            "org.apache.tika.parser.html.HtmlParser"}}
{{        ],}}
{{        "X-TIKA:Parsed-By-Full-Set": [}}
{{            "org.apache.tika.parser.DefaultParser",}}
{{            "org.apache.tika.parser.html.HtmlParser"}}
{{        ],}}
{{        "X-TIKA:content": "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nAll meta 
information goes in the head section...\n\n\n",}}
{{        "X-TIKA:content_handler": "ToTextContentHandler",}}
{{        "X-TIKA:embedded_depth": "0",}}
{{        "X-TIKA:parse_time_millis": "886",}}
{{{color:#ff0000}        "author": "John Doe",{color}}}
{{        "description": "Free Web tutorials",}}
{{        "keywords": "HTML,CSS,XML,JavaScript",}}
{{        "resourceName": "author.html",}}
{{        "title": "OldMetaTitle",}}
{{        "viewport": "width=device-width, initial-scale=1.0"}}
{{    }}}
{{]}}
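
As a possible client-side workaround while the HTML output lacks dc:creator, the /rmeta JSON could be patched after it is returned. The following is only a minimal Python sketch, assuming the response shape shown above (a list of flat metadata objects); the function name add_dc_aliases and the mapping table are hypothetical and are not part of Tika:

```python
import json

# Hypothetical mapping from legacy HTML meta names to Dublin Core keys.
# This table is an assumption for illustration, not Tika's own mapping.
LEGACY_TO_DC = {"author": "dc:creator"}


def add_dc_aliases(records, mapping=LEGACY_TO_DC):
    """Return copies of the rmeta records with missing Dublin Core keys filled in.

    An existing Dublin Core value is never overwritten, and the input
    records are left unmodified.
    """
    patched_records = []
    for record in records:
        patched = dict(record)
        for legacy_key, dc_key in mapping.items():
            if legacy_key in patched and dc_key not in patched:
                patched[dc_key] = patched[legacy_key]
        patched_records.append(patched)
    return patched_records


# Trimmed-down copy of the response above, used as sample input.
sample = json.loads(
    '[{"author": "John Doe", "title": "OldMetaTitle", '
    '"resourceName": "author.html"}]'
)
patched = add_dc_aliases(sample)
print(json.dumps(patched, indent=4))
```

The legacy author key is kept alongside the copied value, so consumers that still read it keep working.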

  was:
The 2.x migration doc 
([here|https://cwiki.apache.org/confluence/display/TIKA/Migrating+to+Tika+2.0.0])
 mentions that author and title are generally, and automatically, mapped to 
their dc:x equivalents when returned by Tika 2.x.  That doesn't seem to be 
happening for HTML files. Can this be fixed?

[^author.html]

{{$ curl -X PUT --upload-file /mnt/c/tmp/author.html --header 
"Content-Disposition: attachment; filename=\"author.html\"" --header 
"Accept:Application/json" http://localhost:9998/rmeta/text | python -m 
json.tool}}
{{  % Total    % Received % Xferd  Average Speed   Time    Time     Time  
Current}}
{{                                 Dload  Upload   Total   Spent    Left  
Speed}}
{{100  1152  100   716  100   436    685    417  0:00:01  0:00:01 --:--:--  
1102}}
{{[}}
{{    {}}
{{        "Content-Encoding": "UTF-8",}}
{{        "Content-Length": "436",}}
{{        "Content-Type": "text/html; charset=UTF-8",}}
{{        "X-TIKA:Parsed-By": [}}
{{            "org.apache.tika.parser.DefaultParser",}}
{{            "org.apache.tika.parser.html.HtmlParser"}}
{{        ],}}
{{        "X-TIKA:Parsed-By-Full-Set": [}}
{{            "org.apache.tika.parser.DefaultParser",}}
{{            "org.apache.tika.parser.html.HtmlParser"}}
{{        ],}}
{{        "X-TIKA:content": "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nAll meta 
information goes in the head section...\n\n\n",}}
{{        "X-TIKA:content_handler": "ToTextContentHandler",}}
{{        "X-TIKA:embedded_depth": "0",}}
{{        "X-TIKA:parse_time_millis": "886",}}
{{{color:#FF0000}        "author": "John Doe",{color}}}
{{        "description": "Free Web tutorials",}}
{{        "keywords": "HTML,CSS,XML,JavaScript",}}
{{        "resourceName": "author.html",}}
{{{color:#FF0000}        "title": "OldMetaTitle",{color}}}
{{        "viewport": "width=device-width, initial-scale=1.0"}}
{{    }}}
{{]}}


> HTML author isn't mapped to its dc:creator counterpart
> ------------------------------------------------------
>
>                 Key: TIKA-3963
>                 URL: https://issues.apache.org/jira/browse/TIKA-3963
>             Project: Tika
>          Issue Type: Bug
>          Components: metadata
>    Affects Versions: 2.6.0
>         Environment: Tika server on Windows
> Curl client on WSL Ubuntu instance
>  
>            Reporter: Josh Burchard
>            Priority: Major
>         Attachments: author.html
>
>
> The 2.x migration doc 
> ([here|https://cwiki.apache.org/confluence/display/TIKA/Migrating+to+Tika+2.0.0])
>  mentions that author is generally, and automatically, mapped to its 
> dc:creator equivalent when returned by Tika 2.x.  That doesn't seem to be 
> happening for HTML files. Can this be fixed?
> [^author.html]
> {{$ curl -X PUT --upload-file /mnt/c/tmp/author.html --header 
> "Content-Disposition: attachment; filename=\"author.html\"" --header 
> "Accept:Application/json" http://localhost:9998/rmeta/text | python -m 
> json.tool}}
> {{  % Total    % Received % Xferd  Average Speed   Time    Time     Time  
> Current}}
> {{                                 Dload  Upload   Total   Spent    Left  
> Speed}}
> {{100  1152  100   716  100   436    685    417  0:00:01  0:00:01 --:--:--  
> 1102}}
> {{[}}
> {{    {}}
> {{        "Content-Encoding": "UTF-8",}}
> {{        "Content-Length": "436",}}
> {{        "Content-Type": "text/html; charset=UTF-8",}}
> {{        "X-TIKA:Parsed-By": [}}
> {{            "org.apache.tika.parser.DefaultParser",}}
> {{            "org.apache.tika.parser.html.HtmlParser"}}
> {{        ],}}
> {{        "X-TIKA:Parsed-By-Full-Set": [}}
> {{            "org.apache.tika.parser.DefaultParser",}}
> {{            "org.apache.tika.parser.html.HtmlParser"}}
> {{        ],}}
> {{        "X-TIKA:content": "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nAll meta 
> information goes in the head section...\n\n\n",}}
> {{        "X-TIKA:content_handler": "ToTextContentHandler",}}
> {{        "X-TIKA:embedded_depth": "0",}}
> {{        "X-TIKA:parse_time_millis": "886",}}
> {{{color:#ff0000}        "author": "John Doe",{color}}}
> {{        "description": "Free Web tutorials",}}
> {{        "keywords": "HTML,CSS,XML,JavaScript",}}
> {{        "resourceName": "author.html",}}
> {{        "title": "OldMetaTitle",}}
> {{        "viewport": "width=device-width, initial-scale=1.0"}}
> {{    }}}
> {{]}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
