Hi Tika folks,
I am trying to detect Matlab source code with Tika. I have two Matlab files
written in two different styles. Both of them have the .m suffix.
The main problem is that Matlab and Objective-C source files share the same
suffix. Because of that, the Matlab entry in tika-mimetypes.xml has no glob
and relies on a magic match value instead.
In tika-mimetypes.xml Matlab is defined as:
<mime-type type="text/x-matlab">
  <_comment>Matlab source code</_comment>
  <magic priority="50">
    <match value="function [" type="string" offset="0"/>
  </magic>
  <!-- <glob pattern="*.m"/> - conflicts with text/x-objcsrc -->
  <sub-class-of type="text/plain"/>
</mime-type>
When I used Tika to detect the file types, I got:
bash-3.2$ java -jar tika-app-1.7.jar -d /Users/ohjihyun/missingness_patterns.m
text/x-matlab
bash-3.2$ java -jar tika-app-1.7.jar -d /Users/ohjihyun/BARCAST_MainCode.m
text/x-objcsrc
The first file starts like this:
function [np, kavlr, kmisr, prows, mp, iptrn] = missingness_patterns(X)
The second one starts with this:
%% CONTROL CODE FOR FULLY BAYESIAN SPATIO-TEMPORAL TEMPERATURE RECONSTRUCTION
%EVERYTHING IS MODULAR TO ALLOW FOR EASY DEBUGGING AND ADAPTATION
% _vNewModel_Oct08: change the formalism to reflect new model (Beta_1 now
% normal). Allows for multiple proxies
clear all; close all;
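To make the difference between the two files concrete, here is a simplified sketch of what a string match at offset 0 amounts to. It is only an illustration and not Tika's actual implementation; the names MagicSketch and matchesAt are made up for this example.

```java
import java.nio.charset.StandardCharsets;

// Simplified illustration only: this is NOT how Tika is implemented internally.
class MagicSketch {

    // Returns true if the bytes of `value` appear in `content` starting exactly at `offset`.
    static boolean matchesAt(byte[] content, String value, int offset) {
        byte[] magic = value.getBytes(StandardCharsets.US_ASCII);
        if (offset < 0 || content.length - offset < magic.length) {
            return false; // not enough bytes left to hold the magic value
        }
        for (int i = 0; i < magic.length; i++) {
            if (content[offset + i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // First lines of the two files, as quoted above.
        byte[] first = "function [np, kavlr, kmisr, prows, mp, iptrn] = missingness_patterns(X)"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] second = "%% CONTROL CODE FOR FULLY BAYESIAN SPATIO-TEMPORAL TEMPERATURE RECONSTRUCTION"
                .getBytes(StandardCharsets.US_ASCII);

        System.out.println(matchesAt(first, "function [", 0));  // prints true
        System.out.println(matchesAt(second, "function [", 0)); // prints false
        System.out.println(matchesAt(second, "%", 0));          // prints true
    }
}
```

Under this simplified model, the existing "function [" rule can only ever match the first file, while a "%" rule at offset 0 would match the second.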
Therefore, I created my own custom-mimetypes.xml (using freedesktop.org.xml
as a reference):
<mime-type type="text/x-matlab">
  <_comment>Matlab source code</_comment>
  <magic priority="50">
    <match value="function [" type="string" offset="0"/>
    <match value="%" type="string" offset="0"/>
  </magic>
  <!-- <glob pattern="*.m"/> - conflicts with text/x-objcsrc -->
  <sub-class-of type="text/plain"/>
</mime-type>
Then I ran:
java -classpath \
  /Users/ohjihyun/tika-1.7/tika-core/src/main/resources/org/apache/tika/mime:tika-app-1.8-SNAPSHOT.jar \
  org.apache.tika.cli.TikaCLI --detect \
  /Users/ohjihyun/Desktop/matlab_sample/BARCAST_MainCode.m
But I still got:
text/x-objcsrc
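One thing I can check on my side is whether the file is visible on the classpath at all. The sketch below assumes that Tika looks the file up as the classpath resource org/apache/tika/mime/custom-mimetypes.xml; that resource name is my understanding and not something I have confirmed, and ClasspathCheck and isOnClasspath are just illustrative names.

```java
import java.net.URL;

// Diagnostic sketch: reports whether a resource can be resolved from the classpath.
class ClasspathCheck {

    // Assumed resource name; not confirmed against the Tika source.
    static final String CUSTOM_TYPES = "org/apache/tika/mime/custom-mimetypes.xml";

    // Returns true if the system class loader can resolve the named resource.
    static boolean isOnClasspath(String resourceName) {
        URL url = ClassLoader.getSystemResource(resourceName);
        return url != null;
    }

    public static void main(String[] args) {
        System.out.println(CUSTOM_TYPES + " visible: " + isOnClasspath(CUSTOM_TYPES));
    }
}
```

Running it with the same -classpath value as in my command would show whether the directory I passed makes the file resolvable under that name.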
Am I using custom-mimetypes.xml incorrectly?
Or can custom-mimetypes.xml only be used to define a new file format?
I have attached the Matlab files. Both were obtained from
NOAA Paleoclimatology Software Resources
(https://www.ncdc.noaa.gov/cdo/f?p=517:20:232249663152501:pg_R_58156058819335387:NO&pg_min_row=16&pg_max_rows=15&pg_rows_fetched=15).
Thank you, as always, for your help.
Cheers,
Ji-Hyun