Hi Tika folks,

I am trying to detect Matlab source code using Tika. I have two Matlab files
written in two different styles. Both of them have the .m suffix.
The main problem is that Matlab source code and Objective-C source code
share the same suffix. Therefore, Matlab has an additional magic match value
in tika-mimetypes.xml.

In tika-mimetypes.xml Matlab is defined as:
  <mime-type type="text/x-matlab">
    <_comment>Matlab source code</_comment>
    <magic priority="50">
      <match value="function [" type="string" offset="0"/>
    </magic>
    <!-- <glob pattern="*.m"/> - conflicts with text/x-objcsrc -->
    <sub-class-of type="text/plain"/>
  </mime-type>

When I used Tika to detect their file types:

bash-3.2$ java -jar tika-app-1.7.jar -d /Users/ohjihyun/missingness_patterns.m
text/x-matlab

bash-3.2$ java -jar tika-app-1.7.jar -d /Users/ohjihyun/BARCAST_MainCode.m
text/x-objcsrc

The first file starts like this:

function [np, kavlr, kmisr, prows, mp, iptrn] = missingness_patterns(X)

The second one starts with this:

%% CONTROL CODE FOR FULLY BAYESIAN SPATIO-TEMPORAL TEMPERATURE RECONSTRUCTION
%EVERYTHING IS MODULAR TO ALLOW FOR EASY DEBUGGING AND ADAPTATION
% _vNewModel_Oct08: change the formalism to reflect new model (Beta_1 now
% normal). Allows for multiple proxies
clear all; close all;

Therefore, I created my own custom-mimetypes.xml (using freedesktop.org.xml
as a reference):

  <mime-type type="text/x-matlab">
    <_comment>Matlab source code</_comment>
    <magic priority="50">
      <match value="function [" type="string" offset="0"/>
      <match value="%" type="string" offset="0"/>
    </magic>
    <!-- <glob pattern="*.m"/> - conflicts with text/x-objcsrc -->
    <sub-class-of type="text/plain"/>
  </mime-type>
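
To make my expectation concrete, here is a minimal Java sketch of how I understand an offset-0 string match to behave. It is only an illustration under my own assumptions, not Tika's actual implementation; the class name MagicSketch and the method matchesAt are hypothetical names I made up for this example.

```java
import java.nio.charset.StandardCharsets;

final class MagicSketch {

    // Hypothetical helper: returns true if the bytes of `value`
    // appear in `data` starting exactly at `offset`.
    static boolean matchesAt(byte[] data, String value, int offset) {
        byte[] pattern = value.getBytes(StandardCharsets.US_ASCII);
        if (offset < 0 || data.length - offset < pattern.length) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if (data[offset + i] != pattern[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // First lines of the two files quoted above.
        byte[] first = "function [np, kavlr, kmisr, prows, mp, iptrn] = missingness_patterns(X)"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] second = "%% CONTROL CODE FOR FULLY BAYESIAN SPATIO-TEMPORAL TEMPERATURE RECONSTRUCTION"
                .getBytes(StandardCharsets.US_ASCII);

        System.out.println(matchesAt(first, "function [", 0));  // true
        System.out.println(matchesAt(second, "function [", 0)); // false
        System.out.println(matchesAt(second, "%", 0));          // true
    }
}
```

By this reasoning the second file misses the stock "function [" rule, and I would expect the extra "%" rule to catch it if my custom file were actually being loaded.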

Then I ran:
java -classpath \
/Users/ohjihyun/tika-1.7/tika-core/src/main/resources/org/apache/tika/mime:tika-app-1.8-SNAPSHOT.jar \
org.apache.tika.cli.TikaCLI --detect \
/Users/ohjihyun/Desktop/matlab_sample/BARCAST_MainCode.m

But I still got:
text/x-objcsrc

Am I using custom-mimetypes.xml incorrectly?
Or is it only possible to use custom-mimetypes.xml when adding a new file format?
I have attached the Matlab files. Both were obtained from
NOAA Paleoclimatology Software Resources
(https://www.ncdc.noaa.gov/cdo/f?p=517:20:232249663152501:pg_R_58156058819335387:NO&pg_min_row=16&pg_max_rows=15&pg_rows_fetched=15).

Thank you always for your help.

Cheers,
Ji-Hyun
