[ https://issues.apache.org/jira/browse/TIKA-4431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17956103#comment-17956103 ]
Tim Allison commented on TIKA-4431:
-----------------------------------

Thank you for opening this. It looks like we're truncating the potential uri anchor here: https://github.com/apache/tika/blob/main/tika-core/src/main/java/org/apache/tika/mime/MimeTypes.java#L545

And yet here (https://github.com/apache/tika/blob/main/tika-core/src/main/java/org/apache/tika/detect/NameDetector.java#L117), because of TIKA-3783, we're careful not to remove the anchor chunk if there's a period after it. :/

> Mime Type Detection Error with File Name containing Number Sign
> ----------------------------------------------------------------
>
>                 Key: TIKA-4431
>                 URL: https://issues.apache.org/jira/browse/TIKA-4431
>             Project: Tika
>          Issue Type: Bug
>          Components: core
>            Reporter: knoobie
>            Priority: Major
>
> I noticed that changing the file name to include a number sign / hashtag (#)
> changes the mime type detection.
> For example, "Lorem-Ipsum.csv" correctly parses to "text/csv" but once
> "Lorem-Ipsum#123.csv" is given (with the same file content) the parser
> detects "text/plain".
>
> {code:java}
> import static org.assertj.core.api.Assertions.assertThat;
> import java.nio.charset.StandardCharsets;
> import org.apache.tika.Tika;
> import org.junit.jupiter.api.Test;
>
> public class ApacheTikaTest {
>
>   @Test
>   void detect_normalFileName() {
>     var tika = new Tika();
>     var fileName = "Lorem-Ipsum.csv";
>     var data = """
>         Lorem;Ipsum;
>         1 ;2 ;
>         3 ;4 ;
>         """;
>     assertThat(tika.detect(data.getBytes(StandardCharsets.UTF_8), fileName))
>         .isEqualTo("text/csv");
>   }
>
>   @Test
>   void detect_FileNameWithHashtag() {
>     var tika = new Tika();
>     var fileName = "Lorem-Ipsum#123.csv";
>     var data = """
>         Lorem;Ipsum;
>         1 ;2 ;
>         3 ;4 ;
>         """;
>     assertThat(tika.detect(data.getBytes(StandardCharsets.UTF_8), fileName))
>         // Fails with result: 'text/plain'
>         .isEqualTo("text/csv");
>   }
> }
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
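
The two behaviors contrasted in the comment above can be sketched with the JDK alone. This is an illustration, not Tika's code: the class `FragmentSketch` and both method names are hypothetical, and the assumption that the truncation comes from parsing the resource name as a `java.net.URI` is mine, not something the comment states. The first method keeps only the URI path, so everything after `#` is lost; the second drops a trailing `#fragment` only when no period follows it.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical illustration; not taken from MimeTypes or NameDetector.
final class FragmentSketch {

    // Parses the name as a URI and keeps only its path,
    // so anything after '#' is treated as a fragment and dropped.
    static String nameViaUriPath(String resourceName) {
        try {
            String path = new URI(resourceName).getPath();
            return path != null ? path : resourceName;
        } catch (URISyntaxException e) {
            return resourceName; // not parseable as a URI: use the name unchanged
        }
    }

    // Drops a trailing "#fragment" only when no period follows the last '#',
    // so an extension that comes after the '#' survives.
    static String stripFragmentUnlessDotFollows(String name) {
        int hash = name.lastIndexOf('#');
        if (hash != -1 && name.indexOf('.', hash) == -1) {
            return name.substring(0, hash);
        }
        return name;
    }

    public static void main(String[] args) {
        String name = "Lorem-Ipsum#123.csv";
        System.out.println(nameViaUriPath(name));                // Lorem-Ipsum
        System.out.println(stripFragmentUnlessDotFollows(name)); // Lorem-Ipsum#123.csv
    }
}
```

With the first approach the ".csv" extension is gone before any name-based lookup can see it, which would be consistent with the "text/plain" result in the report; the second approach leaves the reported file name intact while still stripping a plain anchor such as "report.html#section".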