[ 
https://issues.apache.org/jira/browse/TIKA-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14563642#comment-14563642
 ] 

Ji-Hyun Oh commented on TIKA-1634:
----------------------------------

I tested the newly updated magics against my set of Matlab files. With the 
updated magics, only one file was not detected as Matlab (see the updated .xls 
file for the results). That file starts with:

%% SET the initial values for the Bayesian Anova.
load Data_INPUT

So we also added one more match value:
<match value="%%" type="string" offset="0"/>
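
As a rough illustration only (this is not Tika's implementation), the Python 
sketch below mimics what a string match at offset 0 checks: whether the file's 
first bytes equal the match value. The helper name matches_at_offset is 
hypothetical, and the sample bytes are just the two opening lines quoted above.

```python
def matches_at_offset(data: bytes, value: bytes, offset: int = 0) -> bool:
    """Return True if `value` occurs in `data` starting exactly at `offset`.

    Hypothetical helper for illustration; it only imitates the idea of a
    type="string" match with a fixed offset.
    """
    return data[offset:offset + len(value)] == value


# Opening lines of the file that was not detected (copied from this comment).
sample = b"%% SET the initial values for the Bayesian Anova.\nload Data_INPUT\n"

# The original match value from tika-mimetypes.xml does not apply to this file.
print(matches_at_offset(sample, b"function ["))  # False

# The newly added match value does.
print(matches_at_offset(sample, b"%%"))          # True
```

Because the comparison is anchored at offset 0, a file with leading whitespace 
or a blank line before the first %% would not match in this sketch.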

With that, I am closing this issue. 




> Detecting problem with Matlab source code
> -----------------------------------------
>
>                 Key: TIKA-1634
>                 URL: https://issues.apache.org/jira/browse/TIKA-1634
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime
>    Affects Versions: 1.8
>            Reporter: Ji-Hyun Oh
>            Priority: Trivial
>         Attachments: BARCAST_MainCode.m, wtsgaus.m
>
>
> Matlab source code and Objective-C source code share the same suffix, .m. 
> Therefore, Matlab has an additional match value in tika-mimetypes.xml, where 
> it is defined as:
>   <mime-type type="text/x-matlab">
>     <_comment>Matlab source code</_comment>
>     <magic priority="50">
>       <match value="function [" type="string" offset="0"/>
>     </magic>
>     <!-- <glob pattern="*.m"/> - conflicts with text/x-objcsrc -->
>     <sub-class-of type="text/plain"/>
>   </mime-type>
> However, Matlab code does not always start with "function [". Therefore, 
> some Matlab files are detected as text/x-objcsrc. Based on the source code 
> collected from NOAA Paleoclimatology Software Resources, many Matlab files 
> start with match values like these (problematic files are attached as examples):
>   <mime-type type="text/x-matlab">
>     <_comment>Matlab source code</_comment>
>     <magic priority="50">
>       <match value="function" type="string" offset="0"/>
>       <match value="%" type="string" offset="0"/>
>     </magic>
>     <!-- <glob pattern="*.m"/> - conflicts with text/x-objcsrc -->
>     <sub-class-of type="text/plain"/>
>   </mime-type>
> I ran several detection tests on different Matlab packages obtained from 
> NOAA Paleoclimatology Software Resources, with and without 
> custom-mimetypes.xml. Results are attached. In total, 103 Matlab files are 
> detected correctly with custom-mimetypes.xml, while 42 Matlab files are 
> detected as Matlab without it (that is, with only the current match value). 
> However, these match values for Matlab source code may be common only in the 
> Paleoclimatology community. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
