Sebastian Nagel created TIKA-3489:
-------------------------------------

             Summary: Robots.txt files frequently identified as message/rfc822
                 Key: TIKA-3489
                 URL: https://issues.apache.org/jira/browse/TIKA-3489
             Project: Tika
          Issue Type: Bug
          Components: mime
    Affects Versions: 1.27, 1.26, 1.25
            Reporter: Sebastian Nagel
         Attachments: robots.txt

The Tika MIME detector identifies a robots.txt file as "message/rfc822" if the 
file starts with a "User-Agent" rule and also contains a second "User-Agent" 
rule not too far from the beginning, e.g.:
{noformat}
User-Agent: goodbot
Disallow:

User-Agent: badbot
Disallow: /
{noformat}

The change 
[7769a2b|https://github.com/apache/tika/commit/7769a2b4fba2b4af7127eba0c7694f663fd97a13]
 requires that two different clauses match. However, the two occurrences 
of "User-Agent:" (one at the start of the file, one after a newline) are 
counted as two different matches rather than as equivalent matches of the 
same header.
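
To illustrate the difference between the two ways of counting, here is a minimal, self-contained sketch. It does not use Tika and is not Tika's implementation: the class name, the method names, the header list, and the "at least two clauses" threshold are assumptions made for this example only.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class HeaderClauseSketch {
    // Illustrative header names only (assumption); not the list Tika actually uses.
    static final List<String> MAIL_HEADERS = Arrays.asList(
            "From", "Received", "Message-ID", "Date", "User-Agent", "MIME-Version");

    /** Counts "Name:" at offset 0 and "\nName:" later on as two separate clauses. */
    static int countClausesNaive(String text) {
        int matched = 0;
        for (String h : MAIL_HEADERS) {
            if (text.startsWith(h + ":")) matched++;      // clause: header at start of file
            if (text.contains("\n" + h + ":")) matched++; // clause: header after a newline
        }
        return matched;
    }

    /** Counts both positions of the same header name as one equivalent match. */
    static int countDistinctHeaders(String text) {
        Set<String> seen = new HashSet<>();
        for (String h : MAIL_HEADERS) {
            if (text.startsWith(h + ":") || text.contains("\n" + h + ":")) {
                seen.add(h.toLowerCase(Locale.ROOT));
            }
        }
        return seen.size();
    }

    /** Hypothetical threshold: at least two matched clauses mean "looks like mail". */
    static boolean looksLikeMail(int matchedClauses) {
        return matchedClauses >= 2;
    }

    public static void main(String[] args) {
        String robots = "User-Agent: goodbot\nDisallow:\n\nUser-Agent: badbot\nDisallow: /\n";
        System.out.println("naive:    " + looksLikeMail(countClausesNaive(robots)));
        System.out.println("distinct: " + looksLikeMail(countDistinctHeaders(robots)));
    }
}
```

With the robots.txt sample above, the naive count reaches the threshold because the same header is seen in two positions, while the distinct-name count does not. A real mail message with two different headers reaches the threshold either way.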



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
