[ 
https://issues.apache.org/jira/browse/TIKA-4411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17950183#comment-17950183
 ] 

Tilman Hausherr edited comment on TIKA-4411 at 5/8/25 5:50 AM:
---------------------------------------------------------------

There's still a problem with getting the correct file in a ZIP archive. For 
example, 
commoncrawl3/Z6/Z6KHRTPXFKHFABJIIWQUICBXYGORLJ3Z
is reported as text/html, but it's a ZIP file (probably a Firefox/Thunderbird 
plugin or dictionary).

The actual file is 
{{chrome\firedictionary.jar\content\firedictionary\view\WordHistoryAndExcerpt.html}}

I looked at the recursive JSON output; 3.1.0 has
{code:java}
  "resourceName" : "content/firedictionary/view/WordHistoryAndExcerpt.html",
  "X-TIKA:content" : "\r\n  \r\n    \r\n      \t \r\n      \t\r\n        \n\r\n 
     \r\n      \t \r\n    \n\r\n    \r\n    \r\n      \t \r\n      \t\r\n       
  Word history and excerpts\r\n      \r\n      \t \r\n      \t \r\n      \t 
\r\n    \n\r\n\r\n    \r\n      \t \r\n      \r\n      \t\r\n        \n\r\n     
 \r\n      \r\n      \t \r\n      \r\n      \t\r\n        \n\n\n\r\n        
\n\n\n\r\n        \n\n\n\r\n        \n\n\n\r\n      \r\n      \r\n      \t \r\n 
   \n\r\n  \n\r\n"
{code}
3.2.0 has just
{code:java}
  "resourceName" : "content/firedictionary/view/WordHistoryAndExcerpt.html"
{code}
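
To check a whole report for this kind of regression, here is a minimal sketch. It assumes the recursive JSON from each version is a list of metadata objects that use the "resourceName" and "X-TIKA:content" keys shown in the excerpts above. The function name and the sample data are hypothetical and the content string is shortened.

```python
import json

CONTENT_KEY = "X-TIKA:content"
NAME_KEY = "resourceName"


def lost_content(old_entries, new_entries):
    """Return resourceNames that have content in old_entries but not in new_entries.

    Assumption: resourceName is unique within one file's metadata list; if it is
    not, the last entry with a given name wins in the lookup table below.
    """
    new_by_name = {entry.get(NAME_KEY): entry for entry in new_entries}
    lost = []
    for entry in old_entries:
        if CONTENT_KEY not in entry:
            continue  # nothing to lose for this entry
        new_entry = new_by_name.get(entry.get(NAME_KEY))
        if new_entry is not None and CONTENT_KEY not in new_entry:
            lost.append(entry.get(NAME_KEY))
    return lost


# Hypothetical sample data mirroring the two excerpts (content shortened).
old_json = (
    '[{"resourceName": "content/firedictionary/view/WordHistoryAndExcerpt.html",'
    ' "X-TIKA:content": "Word history and excerpts"}]'
)
new_json = (
    '[{"resourceName": "content/firedictionary/view/WordHistoryAndExcerpt.html"}]'
)

print(lost_content(json.loads(old_json), json.loads(new_json)))
```

Entries that are missing entirely from the newer output are not reported by this sketch; it only flags entries that are still listed but have lost their content.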


> Run the 3.2.0 release process
> -----------------------------
>
>                 Key: TIKA-4411
>                 URL: https://issues.apache.org/jira/browse/TIKA-4411
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: reports-3.2.0-pre-rc1.tgz
>
>




