Hi Ken,
Thank you very much for your comment.
Could you let me know what kind of previous project you are looking into?

Ji-Hyun
________________________________________
From: Ken Krugler [[email protected]]
Sent: Wednesday, April 22, 2015 7:38 AM
To: [email protected]
Subject: RE: Detection problem: Parsing scientific source codes for 
geoscientists

I'm looking into whether detection & parsing code from a previous project could 
be open-sourced.

If that happened, we'd get support for many, many languages - though not GrADS 
or NCL.

But the infrastructure would be there to easily add support for any missing 
languages.

-- Ken

> From: Oh, Ji-Hyun (329F-Affiliate)
> Sent: April 21, 2015 10:54:16am PDT
> To: [email protected]
> Subject: Detection problem: Parsing scientific source codes for geoscientists
>
> Hi Tika friends,
>
> I am currently engaged in a project funded by the National Science Foundation. 
> Our goal is to develop a research-friendly environment where geoscientists 
> like me can easily find the source code they need. According to a survey, 
> scientists spend a considerable amount of their time processing data rather 
> than doing actual science. In my experience as a climate scientist, there is a 
> set of analysis tools that are used most frequently in atmospheric science, so 
> it would be helpful if these tools could be shared easily among scientists. 
> The difficulty is that the tools are written in various scientific languages, 
> so we are trying to provide metadata for the source code stored in public 
> repositories to help scientists select code for their own use.
>
> As a first step, I listed the file formats that are widely used in climate 
> science (a sketch of how additional types could be registered with Tika 
> follows the list):
>
> FORTRAN (.f, .f90, .f77)
> Python (.py)
> R (.R)
> Matlab (.m)
> GrADS (Grid Analysis and Display System) (.gs)
> NCL (NCAR Command Language) (.ncl)
> IDL (Interactive Data Language) (.pro)
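>
> For formats that have no type definition in Tika, one way to get name-based 
> detection is a custom-mimetypes.xml file placed on the classpath at 
> org/apache/tika/mime/custom-mimetypes.xml. A minimal sketch for the last 
> three formats (the type names text/x-grads, text/x-ncl and text/x-idl are 
> placeholders I made up, not existing Tika types):
>
>   <?xml version="1.0" encoding="UTF-8"?>
>   <mime-info>
>     <!-- glob (filename) based detection only; no magic bytes defined -->
>     <mime-type type="text/x-grads">
>       <glob pattern="*.gs"/>
>     </mime-type>
>     <mime-type type="text/x-ncl">
>       <glob pattern="*.ncl"/>
>     </mime-type>
>     <mime-type type="text/x-idl">
>       <glob pattern="*.pro"/>
>     </mime-type>
>   </mime-info>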
>
> I checked that Fortran and Matlab are included in tika-mimetypes.xml, but when 
> I used Tika to obtain the content type of the files (with suffixes .f, .f90, 
> .m), Tika detected these files as text/plain:
>
> ohjihyun% tika -m spctime.f
>
> Content-Encoding: ISO-8859-1
> Content-Length: 16613
> Content-Type: text/plain; charset=ISO-8859-1
> X-Parsed-By: org.apache.tika.parser.DefaultParser
> X-Parsed-By: org.apache.tika.parser.txt.TXTParser
> resourceName: spctime.f
>
> ohjihyun% tika -m wavelet.m
> Content-Encoding: ISO-8859-1
> Content-Length: 5868
> Content-Type: text/plain; charset=ISO-8859-1
> X-Parsed-By: org.apache.tika.parser.DefaultParser
> X-Parsed-By: org.apache.tika.parser.txt.TXTParser
> resourceName: wavelet.m
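>
> A quick way to check whether this is a filename (glob) detection issue rather 
> than a parser issue is to ask Tika what it detects from the resource name 
> alone. A minimal sketch, assuming the Tika 1.x facade is on the classpath:
>
>   import org.apache.tika.Tika;
>
>   public class DetectByName {
>       public static void main(String[] args) throws Exception {
>           Tika tika = new Tika();
>           // detect(String) looks only at the resource name (glob patterns),
>           // so it shows whether *.f and *.m map to a specific type at all
>           System.out.println(tika.detect("spctime.f"));
>           System.out.println(tika.detect("wavelet.m"));
>           System.out.println(tika.detect("UrlParser.java")); // text/x-java-source
>       }
>   }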
>
> I checked that Tika gives the correct content type (text/x-java-source) for a 
> Java file:
> ohjihyun% tika -m UrlParser.java
> Content-Encoding: ISO-8859-1
> Content-Length: 2178
> Content-Type: text/x-java-source
> LoC: 70
> X-Parsed-By: org.apache.tika.parser.DefaultParser
> X-Parsed-By: org.apache.tika.parser.code.SourceCodeParser
> resourceName: UrlParser.java
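>
> To see which parser the default configuration assigns to each type (and so why 
> the Java file is handled by SourceCodeParser while the others fall back to 
> TXTParser), the default parser map can be inspected. A sketch, assuming 
> Tika 1.x:
>
>   import java.util.Map;
>
>   import org.apache.tika.config.TikaConfig;
>   import org.apache.tika.mime.MediaType;
>   import org.apache.tika.parser.CompositeParser;
>   import org.apache.tika.parser.Parser;
>
>   public class WhichParser {
>       public static void main(String[] args) throws Exception {
>           CompositeParser defaultParser =
>                   (CompositeParser) TikaConfig.getDefaultConfig().getParser();
>           // media type -> parser registered for it in the default config
>           Map<MediaType, Parser> parsers = defaultParser.getParsers();
>           System.out.println(parsers.get(MediaType.parse("text/x-java-source")));
>           System.out.println(parsers.get(MediaType.parse("text/x-fortran")));
>           System.out.println(parsers.get(MediaType.parse("text/plain")));
>       }
>   }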
>
> Do I need to build a parser for each file format to get an exact content-type, 
> the way Java has SourceCodeParser?
> Thank you in advance for your insightful comments.
>
> Ji-Hyun

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr