[
https://issues.apache.org/jira/browse/TIKA-2422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16076278#comment-16076278
]
ASF GitHub Bot commented on TIKA-2422:
--------------------------------------
sebastian-nagel opened a new pull request #190: TIKA-2422 -- improve detection
of Graphviz *.dot format
URL: https://github.com/apache/tika/pull/190
> Improve detection of Graphviz *.dot format
> ------------------------------------------
>
> Key: TIKA-2422
> URL: https://issues.apache.org/jira/browse/TIKA-2422
> Project: Tika
> Issue Type: Improvement
> Components: detector, mime
> Reporter: Sebastian Nagel
> Priority: Minor
>
> Detection of Graphviz documents could be improved by adding either
> - *.dot as a glob pattern (which conflicts with the more frequent MS Word
> templates), or
> - a magic pattern that matches the start of the [.dot
> language|http://www.graphviz.org/content/dot-language] grammar, e.g.
> {{^\s*(?:strict\s+)?(?:di)?graph\b}}
> Seen with Common Crawl data (see also the discussions on
> [user@tika|https://lists.apache.org/thread.html/1e4f4b6c249618a446f2e92f56ef90e6bfa0dfe51ce10197461df3d9@%3Cuser.tika.apache.org%3E]
> and
> [dev@poi|https://lists.apache.org/thread.html/7e0c25a389a03011eabce81e933f17a6093390138f4890fa77c36a59@%3Cdev.poi.apache.org%3E]):
> the web server sends "text/vnd.graphviz" (often wrong) and Tika detects
> "application/msword" (sometimes wrong); see the [WARC
> file|https://commoncrawl-dev.s3.amazonaws.com/tika-content-type-detection/test/tika_dot_graphviz_msword.warc.gz].
>
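
The magic pattern quoted in the issue description can be tried out in isolation. The following is only a minimal sketch in Python, not Tika's detector code and not the change made in the pull request: the regular expression is copied from the description, while the helper name `looks_like_dot` and the sample inputs are hypothetical and chosen purely for illustration.

```python
import re

# Regex copied from the issue description: optional leading whitespace,
# an optional "strict" keyword, then "graph" or "digraph" as a whole word.
DOT_MAGIC = re.compile(r"^\s*(?:strict\s+)?(?:di)?graph\b")


def looks_like_dot(text: str) -> bool:
    """Hypothetical helper: True if the text starts like a Graphviz graph."""
    return DOT_MAGIC.match(text) is not None


if __name__ == "__main__":
    # Illustrative inputs only; they are not taken from the WARC file.
    samples = [
        "digraph G { a -> b; }",
        "  strict graph {\n  a -- b;\n}",
        "graph{}",
        "graphics are fun",          # "graph" is not a whole word here
        "Dear Sir or Madam,",        # arbitrary non-Graphviz text
    ]
    for sample in samples:
        print(repr(sample), looks_like_dot(sample))
```

As written, the pattern only matches when the graph keyword is the first non-whitespace token, so a file that begins with a comment would not be caught by this sketch.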