[ 
https://issues.apache.org/jira/browse/TIKA-2422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16076541#comment-16076541
 ] 

Hudson commented on TIKA-2422:
------------------------------

FAILURE: Integrated in Jenkins build Tika-trunk #1317 (See 
[https://builds.apache.org/job/Tika-trunk/1317/])
TIKA-2422 -- improve detection of Graphviz *.dot format (snagel: 
[https://github.com/apache/tika/commit/8d8e818cedd6727a9ff43572a31aad83b9537350])
* (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
TIKA-2422 -- improve detection of Graphviz *.dot format - allow leading 
(snagel: 
[https://github.com/apache/tika/commit/da7ade6350edf603e0caef03827eddc357e636ab])
* (edit) tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java
* (edit) tika-parsers/src/test/resources/test-documents/testGRAPHVIZg.dot
* (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
* (add) tika-parsers/src/test/resources/test-documents/testGRAPHVIZdc.dot
* (edit) tika-parsers/src/test/resources/test-documents/testGRAPHVIZd.dot


> Improve detection of Graphviz *.dot format
> ------------------------------------------
>
>                 Key: TIKA-2422
>                 URL: https://issues.apache.org/jira/browse/TIKA-2422
>             Project: Tika
>          Issue Type: Improvement
>          Components: detector, mime
>            Reporter: Sebastian Nagel
>            Priority: Minor
>             Fix For: 1.16
>
>
> Detection of the Graphviz document format could be improved by adding either
> - *.dot as a glob pattern (conflicts with the more frequent MS Word 
> templates), or
> - a magic pattern which matches the start of the [.dot 
> language|http://www.graphviz.org/content/dot-language] grammar, e.g. 
> {{^\s*(?:strict\s+)?(?:di)?graph\b}}
> Seen with Common Crawl data (see also discussions on 
> [user@tika|https://lists.apache.org/thread.html/1e4f4b6c249618a446f2e92f56ef90e6bfa0dfe51ce10197461df3d9@%3Cuser.tika.apache.org%3E]
>  and 
> [dev@poi|https://lists.apache.org/thread.html/7e0c25a389a03011eabce81e933f17a6093390138f4890fa77c36a59@%3Cdev.poi.apache.org%3E]):
>  the web server sends "text/vnd.graphviz" (often wrong) and Tika detects 
> "application/msword" (sometimes wrong); see the [WARC 
> file|https://commoncrawl-dev.s3.amazonaws.com/tika-content-type-detection/test/tika_dot_graphviz_msword.warc.gz].
>  
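
A minimal sketch of how the regex quoted in the description behaves, assuming plain java.util.regex rather than Tika's own mime magic handling. The class name DotMagicSketch and the method looksLikeDot are hypothetical and not part of Tika, and the pattern actually committed to tika-mimetypes.xml may differ from the one shown here.

```java
import java.util.regex.Pattern;

final class DotMagicSketch {
    // Regex as quoted in the issue description (an illustration only,
    // not necessarily the magic that ended up in tika-mimetypes.xml).
    private static final Pattern DOT_START =
            Pattern.compile("^\\s*(?:strict\\s+)?(?:di)?graph\\b");

    /** Returns true if the text starts like a Graphviz graph declaration. */
    static boolean looksLikeDot(String text) {
        // Without MULTILINE, '^' anchors at the start of the input only,
        // so find() succeeds only when the declaration leads the text.
        return DOT_START.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeDot("digraph G { a -> b; }"));      // true
        System.out.println(looksLikeDot("  strict graph { a -- b }"));  // true
        System.out.println(looksLikeDot("graphics are not graphs"));    // false
    }
}
```

The trailing \b keeps words such as "graphics" from matching, and the leading \s* tolerates whitespace before the keyword. Content that starts with anything else, such as the binary header of an MS Word template, is not matched by this check.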



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
