[
https://issues.apache.org/jira/browse/TIKA-2422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16076416#comment-16076416
]
ASF GitHub Bot commented on TIKA-2422:
--------------------------------------
sebastian-nagel commented on issue #190: TIKA-2422 -- improve detection of
Graphviz *.dot format
URL: https://github.com/apache/tika/pull/190#issuecomment-313378665
OK, unit test added. After testing a couple of .dot files, I also allowed
C++-style comments before the (di)graph magic keyword.
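
To illustrate what "allowing C++-style comments before the keyword" could mean, here is a minimal Python sketch. It is not the change made in the pull request; it only extends the regex quoted in the issue description. The names LEADING, DOT_MAGIC_WITH_COMMENTS and looks_like_dot_with_comments are hypothetical, and the sketch assumes the start of the file has already been decoded to text.

```python
import re

# Hypothetical prefix: any run of whitespace, "//" line comments, or
# "/* ... */" block comments. A line comment must run to the end of its
# line (or of the input), so a keyword inside a comment is not matched.
LEADING = r"(?:\s|//[^\n]*(?:\n|\Z)|/\*.*?\*/)*"

# The keyword part is the pattern from the issue description.
DOT_MAGIC_WITH_COMMENTS = re.compile(
    r"^" + LEADING + r"(?:strict\s+)?(?:di)?graph\b",
    re.DOTALL,  # let ".*?" span newlines inside block comments
)


def looks_like_dot_with_comments(head: str) -> bool:
    """Return True if the text starts with optional comments, then (di)graph."""
    return DOT_MAGIC_WITH_COMMENTS.match(head) is not None
```

For example, looks_like_dot_with_comments("// generated\ndigraph G {}") accepts the input, while a file consisting only of the comment "// digraph in comment only" is rejected.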
> Improve detection of Graphviz *.dot format
> ------------------------------------------
>
> Key: TIKA-2422
> URL: https://issues.apache.org/jira/browse/TIKA-2422
> Project: Tika
> Issue Type: Improvement
> Components: detector, mime
> Reporter: Sebastian Nagel
> Priority: Minor
>
> Detection of Graphviz document formats could be improved by adding either
> - *.dot as a glob pattern (conflicts with the more frequent MS Word
> templates), or
> - a magic pattern that matches the [.dot
> language|http://www.graphviz.org/content/dot-language] grammar, e.g.
> {{^\s*(?:strict\s+)?(?:di)?graph\b}}
> Seen with Common Crawl data (see also discussions on
> [user@tika|https://lists.apache.org/thread.html/1e4f4b6c249618a446f2e92f56ef90e6bfa0dfe51ce10197461df3d9@%3Cuser.tika.apache.org%3E]
> and
> [dev@poi|https://lists.apache.org/thread.html/7e0c25a389a03011eabce81e933f17a6093390138f4890fa77c36a59@%3Cdev.poi.apache.org%3E]):
> the web server sends "text/vnd.graphviz" (often wrong) and Tika detects
> "application/msword" (sometimes wrong); see the [WARC
> file|https://commoncrawl-dev.s3.amazonaws.com/tika-content-type-detection/test/tika_dot_graphviz_msword.warc.gz].
>
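
The magic pattern proposed in the description can be tried out with a short Python sketch. The regex is taken verbatim from the issue; the names DOT_MAGIC and looks_like_dot are hypothetical, and applying the pattern to already-decoded text is an assumption of this example, not a statement about how Tika evaluates magic patterns.

```python
import re

# Regex quoted in the issue: optional leading whitespace, optional "strict",
# then "graph" or "digraph" as a whole word.
DOT_MAGIC = re.compile(r"^\s*(?:strict\s+)?(?:di)?graph\b")


def looks_like_dot(head: str) -> bool:
    """Return True if the start of the text matches the proposed pattern."""
    return DOT_MAGIC.match(head) is not None
```

With this pattern, "digraph G { a -> b; }" and "strict graph {}" match, while "graphics" does not because of the trailing \b. A file that opens with a comment line is not matched by this basic form.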