Sebastian Nagel created TIKA-3489:
-------------------------------------
Summary: Robots.txt files frequently identified as message/rfc822
Key: TIKA-3489
URL: https://issues.apache.org/jira/browse/TIKA-3489
Project: Tika
Issue Type: Bug
Components: mime
Affects Versions: 1.27, 1.26, 1.25
Reporter: Sebastian Nagel
Attachments: robots.txt
The Tika MIME detector identifies a robots.txt file as "message/rfc822" if the
file starts with a "User-Agent" rule and also contains a second "User-Agent"
rule close to the beginning, e.g.:
{noformat}
User-Agent: goodbot
Disallow:
User-Agent: badbot
Disallow: /
{noformat}
The change
[7769a2b|https://github.com/apache/tika/commit/7769a2b4fba2b4af7127eba0c7694f663fd97a13]
requires that two different clauses match. However, the two occurrences of
"User-Agent:" (at the start of the file and after a newline) are counted as
two different clauses rather than as equivalent matches of the same one.
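
To make the difference concrete, here is a minimal, self-contained sketch of the
two counting strategies. It is not Tika's actual detection code: the class name,
the method names, and the header list are all hypothetical and only illustrate
why the same header seen at the start and after a newline should count once.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class HeaderClauseSketch {
    // Assumed header names for illustration only; not Tika's real magic list.
    static final List<String> HEADERS =
            List.of("user-agent:", "from:", "date:", "message-id:", "received:");

    // Counting as described in the report: a header at the start of the text
    // and the same header after a newline are recorded as two different clauses.
    static int countPositionalClauses(String text) {
        String lower = text.toLowerCase(Locale.ROOT);
        Set<String> matched = new HashSet<>();
        for (String header : HEADERS) {
            if (lower.startsWith(header)) {
                matched.add("start:" + header);
            }
            if (lower.contains("\n" + header)) {
                matched.add("newline:" + header);
            }
        }
        return matched.size();
    }

    // Alternative counting: both positions of one header are equivalent,
    // so each header name is counted at most once.
    static int countDistinctHeaders(String text) {
        String lower = text.toLowerCase(Locale.ROOT);
        Set<String> matched = new HashSet<>();
        for (String header : HEADERS) {
            if (lower.startsWith(header) || lower.contains("\n" + header)) {
                matched.add(header);
            }
        }
        return matched.size();
    }

    public static void main(String[] args) {
        String robots = "User-Agent: goodbot\nDisallow:\nUser-Agent: badbot\nDisallow: /\n";
        System.out.println("positional clauses: " + countPositionalClauses(robots));
        System.out.println("distinct headers:   " + countDistinctHeaders(robots));
    }
}
```

With a "two different clauses" threshold, the positional count lets the
robots.txt sample above pass as a mail message, while the distinct-header count
does not. A real message with several different headers passes under either
strategy.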
--
This message was sent by Atlassian Jira
(v8.3.4#803005)