[ 
https://issues.apache.org/jira/browse/TIKA-4431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

knoobie updated TIKA-4431:
--------------------------
    Summary: Mime Type Detection Error with File Name containing Number Sign   
(was: Mime Type Detection Error with File Naming containing Number Sign )

> Mime Type Detection Error with File Name containing Number Sign 
> ----------------------------------------------------------------
>
>                 Key: TIKA-4431
>                 URL: https://issues.apache.org/jira/browse/TIKA-4431
>             Project: Tika
>          Issue Type: Bug
>          Components: core
>            Reporter: knoobie
>            Priority: Major
>
> I noticed that including a number sign / hashtag (#) in the file name 
> changes the MIME type detection.
> For example, "Lorem-Ipsum.csv" is correctly detected as "text/csv", but 
> when "Lorem-Ipsum#123.csv" is given (with the same file content) the 
> detector returns "text/plain".
>  
> {code:java}
> import static org.assertj.core.api.Assertions.assertThat;
> import java.nio.charset.StandardCharsets;
> import org.apache.tika.Tika;
> import org.junit.jupiter.api.Test;
> public class ApacheTikaTest {
>   @Test
>   void detect_normalFileName() {
>     var tika = new Tika();
>     var fileName = "Lorem-Ipsum.csv";
>     var data = """
>       Lorem;Ipsum;
>       1    ;2    ;
>       3    ;4    ;
>       """;
>     assertThat(tika.detect(data.getBytes(StandardCharsets.UTF_8), fileName))
>       .isEqualTo("text/csv");
>   }
>   @Test
>   void detect_FileNameWithHashtag() {
>     var tika = new Tika();
>     var fileName = "Lorem-Ipsum#123.csv";
>     var data = """
>       Lorem;Ipsum;
>       1    ;2    ;
>       3    ;4    ;
>       """;
>     assertThat(tika.detect(data.getBytes(StandardCharsets.UTF_8), fileName))
>       // Fails with result: 'text/plain'
>       .isEqualTo("text/csv");
>   }
> }
> {code}
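
One possible explanation, not confirmed anywhere in this report: if the detector interprets the supplied name as a URI, everything after the '#' counts as a fragment, so the ".csv" extension would never reach the file-name lookup and detection would fall back to the content alone. The sketch below uses only java.net.URI from the JDK and makes no Tika calls. The names UriFragmentDemo, pathOf and escapeHash are hypothetical, and whether a percent-encoded name actually works around the problem in Tika is an assumption that would need to be verified against the affected version.

```java
import java.net.URI;

final class UriFragmentDemo {

    // Returns the (decoded) path component that java.net.URI extracts
    // from a resource name. Anything after '#' is parsed as a fragment.
    static String pathOf(String resourceName) {
        return URI.create(resourceName).getPath();
    }

    // Hypothetical workaround helper: percent-encode '#' so that it is
    // no longer treated as the fragment delimiter. Other reserved
    // characters (for example '?' or '%') are deliberately not handled.
    static String escapeHash(String fileName) {
        return fileName.replace("#", "%23");
    }

    public static void main(String[] args) {
        // Plain name: the whole string is the path.
        System.out.println(pathOf("Lorem-Ipsum.csv"));
        // Name with '#': the path stops before the number sign.
        System.out.println(pathOf("Lorem-Ipsum#123.csv"));
        // The extension ends up in the fragment instead.
        System.out.println(URI.create("Lorem-Ipsum#123.csv").getFragment());
        // After escaping, the decoded path is the full file name again.
        System.out.println(pathOf(escapeHash("Lorem-Ipsum#123.csv")));
    }
}
```

If this is indeed what happens, the extension is lost before any glob matching on "*.csv" can take place, which would be consistent with the "text/plain" result reported above.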



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
