[ 
https://issues.apache.org/jira/browse/TIKA-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14557350#comment-14557350
 ] 

Nick Burch commented on TIKA-1634:
----------------------------------

In r1681351, I've added two more matches, of lower priority as they have a 
higher false-positive chance. One covers functions with a single output or no 
output, the other tries to spot the comments at the top of the file. Our three 
test Matlab files (your two and my own "hello world" one) now detect correctly.
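
To illustrate the idea of prioritised matches at offset 0, here is a rough 
Python sketch. It is not the Tika implementation, and the regular expressions 
are hypothetical stand-ins, not the actual patterns committed in r1681351. The 
names RULES, best_match and detect are made up for this example.

```python
import re

# Hypothetical rules for illustration only, NOT the patterns from r1681351.
# Each rule is (priority, mime type, regex tested at offset 0 of the text).
RULES = [
    # Existing high-priority match: function with a bracketed output list.
    (50, "text/x-matlab", re.compile(r"function \[")),
    # Lower priority: single output ("function y = f(x)") or
    # no output ("function f(x)" or a bare "function f").
    (20, "text/x-matlab",
     re.compile(r"function[ \t]+\w+[ \t]*(=|\(|$)", re.MULTILINE)),
    # Lower priority: file starts with a '%' comment line.
    (20, "text/x-matlab", re.compile(r"%")),
]


def best_match(text):
    """Return (priority, mime) of the highest-priority matching rule, or None."""
    best = None
    for priority, mime, pattern in RULES:
        # pattern.match() anchors at the start of the text, i.e. offset 0.
        if pattern.match(text) and (best is None or priority > best[0]):
            best = (priority, mime)
    return best


def detect(text, default="text/plain"):
    """Return the detected mime type, falling back to a default."""
    found = best_match(text)
    return found[1] if found else default
```

With these assumed rules, a file starting with "function [" is matched at 
priority 50, while a file starting with "function y = f(x)" or with a "%" 
comment is only matched at priority 20, so a more specific magic for another 
type can still win. A leading "%" is a weak signal on its own, since 
PostScript files also begin with "%!", which is why such a match would sit at 
a lower priority.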

Could you try your wider set of Matlab files with these magics in place, and 
close the issue if they all detect correctly now?

> Detecting problem with Matlab source code
> -----------------------------------------
>
>                 Key: TIKA-1634
>                 URL: https://issues.apache.org/jira/browse/TIKA-1634
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime
>    Affects Versions: 1.8
>            Reporter: Ji-Hyun Oh
>            Priority: Trivial
>         Attachments: BARCAST_MainCode.m, Matlab_mime-type_test.xlsx, wtsgaus.m
>
>
> Both Matlab source code and Objective-C source code use the same suffix, 
> which is .m. Therefore, Matlab relies on an additional magic match value. 
> In tika-mimetypes.xml Matlab is defined as:
>   <mime-type type="text/x-matlab">
>     <_comment>Matlab source code</_comment>
>     <magic priority="50">
>       <match value="function [" type="string" offset="0"/>
>     </magic>
>     <!-- <glob pattern="*.m"/> - conflicts with text/x-objcsrc -->
>     <sub-class-of type="text/plain"/>
>   </mime-type>
> However, Matlab code does not always start with "function [". Therefore, 
> some Matlab files are detected as text/x-objcsrc. Based on the source code 
> collected from NOAA Paleoclimatology Software Resources, many Matlab files 
> would need match values like these (problematic files are attached as examples):
>   <mime-type type="text/x-matlab">
>     <_comment>Matlab source code</_comment>
>     <magic priority="50">
>       <match value="function" type="string" offset="0"/>
>       <match value="%" type="string" offset="0"/>
>     </magic>
>     <!-- <glob pattern="*.m"/> - conflicts with text/x-objcsrc -->
>     <sub-class-of type="text/plain"/>
>   </mime-type>
> Conducted several detection tests using different Matlab packages obtained 
> from NOAA Paleoclimatology Software Resources, with and without 
> custom-mimetypes.xml. Results are attached. As a result, a total of 121 Matlab 
> files are detected correctly with custom-mimetypes.xml, while 55 Matlab 
> files are detected as Matlab files without custom-mimetypes.xml (= only with 
> the current match value). However, these match values for Matlab source code 
> might only be common in the Paleoclimatology community. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
