[ https://issues.apache.org/jira/browse/TIKA-4431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

knoobie updated TIKA-4431:
--------------------------
    Environment:     (was: {code:xml}
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-core</artifactId>
  <version>3.1.0</version>
</dependency>
{code})

> Mime Type Detection Error with File Naming containing Number Sign
> ------------------------------------------------------------------
>
>                 Key: TIKA-4431
>                 URL: https://issues.apache.org/jira/browse/TIKA-4431
>             Project: Tika
>          Issue Type: Bug
>          Components: core
>            Reporter: knoobie
>            Priority: Major
>
> I noticed that changing the file name to include a number sign / hashtag (#)
> changes the MIME type detection.
> For example, "Lorem-Ipsum.csv" is correctly detected as "text/csv", but once
> "Lorem-Ipsum#123.csv" is given (with the same file content), the detection
> returns "text/plain".
>
> {code:java}
> import static org.assertj.core.api.Assertions.assertThat;
>
> import java.nio.charset.StandardCharsets;
>
> import org.apache.tika.Tika;
> import org.junit.jupiter.api.Test;
>
> public class ApacheTikaTest {
>
>   // Baseline: a plain file name with a .csv extension is detected as text/csv.
>   @Test
>   void detect_normalFileName() {
>     var tika = new Tika();
>     var fileName = "Lorem-Ipsum.csv";
>     var data = """
>         Lorem;Ipsum;
>         1 ;2 ;
>         3 ;4 ;
>         """;
>     assertThat(tika.detect(data.getBytes(StandardCharsets.UTF_8), fileName))
>         .isEqualTo("text/csv");
>   }
>
>   // Same content, but the file name contains '#' before the extension.
>   @Test
>   void detect_FileNameWithHashtag() {
>     var tika = new Tika();
>     var fileName = "Lorem-Ipsum#123.csv";
>     var data = """
>         Lorem;Ipsum;
>         1 ;2 ;
>         3 ;4 ;
>         """;
>     assertThat(tika.detect(data.getBytes(StandardCharsets.UTF_8), fileName))
>         // Fails with result: 'text/plain'
>         .isEqualTo("text/csv");
>   }
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)