Ji-Hyun Oh created TIKA-1634:
--------------------------------
Summary: Detecting problem with Matlab source code
Key: TIKA-1634
URL: https://issues.apache.org/jira/browse/TIKA-1634
Project: Tika
Issue Type: Improvement
Components: mime
Affects Versions: 1.8
Reporter: Ji-Hyun Oh
Priority: Trivial
Both Matlab source code and Objective-C source code use the same suffix,
.m. Because the suffix alone cannot tell them apart, Matlab relies on an
additional magic match value in tika-mimetypes.xml.
In tika-mimetypes.xml Matlab is defined as:
<mime-type type="text/x-matlab">
  <_comment>Matlab source code</_comment>
  <magic priority="50">
    <match value="function [" type="string" offset="0"/>
  </magic>
  <!-- <glob pattern="*.m"/> - conflicts with text/x-objcsrc -->
  <sub-class-of type="text/plain"/>
</mime-type>
However, Matlab code does not always start with "function [". Therefore, some
Matlab files are detected as text/x-objcsrc. Based on the source code
collected from NOAA Paleoclimatology Software Resources, many Matlab files
begin with "function" (without a following bracket) or with a "%" comment, so
the definition could use match values like these (problematic files are
attached as an example):
<mime-type type="text/x-matlab">
  <_comment>Matlab source code</_comment>
  <magic priority="50">
    <match value="function" type="string" offset="0"/>
    <match value="%" type="string" offset="0"/>
  </magic>
  <!-- <glob pattern="*.m"/> - conflicts with text/x-objcsrc -->
  <sub-class-of type="text/plain"/>
</mime-type>
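
To illustrate the difference between the two definitions, here is a rough
Python sketch of string matching at offset 0. It is not Tika's detection
code. It only mimics the idea of a "string" match at a fixed offset and
assumes that sibling <match> elements inside one <magic> block act as
alternatives. The helper names and the three sample file contents are
made up for illustration; they are not the attached files.

```python
def matches_at_offset(data: bytes, value: str, offset: int = 0) -> bool:
    """Return True if `value` occurs in `data` exactly at `offset`."""
    needle = value.encode("ascii")
    return data[offset:offset + len(needle)] == needle


def any_match(data: bytes, values: list[str]) -> bool:
    """Assumption: sibling match values are alternatives (logical OR)."""
    return any(matches_at_offset(data, v) for v in values)


# Match values taken from the two definitions in this issue.
CURRENT_RULES = ["function ["]
PROPOSED_RULES = ["function", "%"]

# Hypothetical file beginnings, invented for this sketch.
SAMPLES = {
    "multi_output": b"function [a, b] = f(x)\n",
    "single_output": b"function y = f(x)\n",
    "leading_comment": b"% compute something\nx = 1;\n",
}

# For each sample: (matched by current rule, matched by proposed rules)
results = {
    name: (any_match(data, CURRENT_RULES), any_match(data, PROPOSED_RULES))
    for name, data in SAMPLES.items()
}

for name, (current, proposed) in results.items():
    print(f"{name}: current={current} proposed={proposed}")
```

In this sketch only the first sample satisfies the current "function ["
value, while all three satisfy the proposed values.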