Thanks for sharing your work, Ji-Hyun. Glad Ken, Lewis and Nick have replied. Thanks!
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Ken Krugler <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, April 22, 2015 at 10:38 AM
To: "[email protected]" <[email protected]>
Subject: RE: Detection problem: Parsing scientific source codes for geoscientists

>I'm looking into whether detection & parsing code from a previous project
>could be open-sourced.
>
>If that happened, we'd get support for many, many languages - though not
>GrADS or NCAR.
>
>But the infrastructure would be there to easily add support for any
>missing languages.
>
>-- Ken
>
>> From: Oh, Ji-Hyun (329F-Affiliate)
>> Sent: April 21, 2015 10:54:16am PDT
>> To: [email protected]
>> Subject: Detection problem: Parsing scientific source codes for
>>geoscientists
>>
>> Hi Tika friends,
>>
>> I am currently engaged in a project funded by National Science
>>Foundation. Our goal is to develop a research-friendly environment where
>>geoscientists, like me, can easily find source codes they need.
>>According to a survey, scientists spend a considerable amount of their
>>time in processing data instead of doing actual science. Based on my
>>experience as a climate scientist, there exist most frequently/typically
>>used analysis tools in atmospheric science. Therefore, it could be
>>helpful if these tools can be easily shared among scientists.
>>The thing is that the tools are written in various scientific languages,
>>so we are trying to provide the metadata of source codes stored in public
>>repositories to help scientists select source code for their own usages.
>>
>> For the first step, I listed the file formats that are widely used in
>>climate science.
>>
>> FORTRAN (.f, .f90, .f77)
>> Python (.py)
>> R (.R)
>> Matlab (.m)
>> GrADS (Grid Analysis and Display System) (.gs)
>> NCL (NCAR Command Language) (.ncl)
>> IDL (Interactive Data Language) (.pro)
>>
>> I checked that Fortran and Matlab are included in tika-mimetypes.xml, but
>>when I used Tika to obtain the content type of the files (with suffix .f,
>>.f90, .m), Tika detected these files as text/plain:
>>
>> ohjihyun% tika -m spctime.f
>>
>> Content-Encoding: ISO-8859-1
>> Content-Length: 16613
>> Content-Type: text/plain; charset=ISO-8859-1
>> X-Parsed-By: org.apache.tika.parser.DefaultParser
>> X-Parsed-By: org.apache.tika.parser.txt.TXTParser
>> resourceName: spctime.f
>>
>> ohjihyun% tika -m wavelet.m
>> Content-Encoding: ISO-8859-1
>> Content-Length: 5868
>> Content-Type: text/plain; charset=ISO-8859-1
>> X-Parsed-By: org.apache.tika.parser.DefaultParser
>> X-Parsed-By: org.apache.tika.parser.txt.TXTParser
>> resourceName: wavelet.m
>>
>> I checked that Tika gives the correct content type (text/x-java-source)
>>for a Java file:
>> ohjihyun% tika -m UrlParser.java
>> Content-Encoding: ISO-8859-1
>> Content-Length: 2178
>> Content-Type: text/x-java-source
>> LoC: 70
>> X-Parsed-By: org.apache.tika.parser.DefaultParser
>> X-Parsed-By: org.apache.tika.parser.code.SourceCodeParser
>> resourceName: UrlParser.java
>>
>> Should I build a parser for each file format to get an exact
>>content-type, as Java has SourceCodeParser?
>> Thank you in advance for your insightful comments.
>>
>> Ji-Hyun
>
>--------------------------
>Ken Krugler
>+1 530-210-6378
>http://www.scaleunlimited.com
>custom big data solutions & training
>Hadoop, Cascading, Cassandra & Solr
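
For readers following the thread, the name-based part of the detection question above can be sketched roughly as follows. This is only an illustration in plain Python, not how Tika implements detection: the `guess_source_type` helper and the `SOURCE_TYPES` table are hypothetical, and the media type strings are assumed names for the sake of the example rather than values taken from tika-mimetypes.xml.

```python
from pathlib import PurePath

# Hypothetical suffix-to-media-type table for the languages listed in the
# question. The type names are illustrative assumptions, not values taken
# from tika-mimetypes.xml.
SOURCE_TYPES = {
    ".f": "text/x-fortran",
    ".f77": "text/x-fortran",
    ".f90": "text/x-fortran",
    ".py": "text/x-python",
    ".r": "text/x-rsrc",
    ".m": "text/x-matlab",
    ".gs": "text/x-grads",
    ".ncl": "text/x-ncl",
    ".pro": "text/x-idl",
}


def guess_source_type(filename, default="text/plain"):
    """Return a media type based only on the file name suffix."""
    # Compare case-insensitively so that "script.R" and "PLOT.NCL" match.
    suffix = PurePath(filename).suffix.lower()
    return SOURCE_TYPES.get(suffix, default)


if __name__ == "__main__":
    for name in ("spctime.f", "wavelet.m", "UrlParser.java"):
        print(name, "->", guess_source_type(name))
```

A suffix table like this is only a first pass. Suffixes such as .m (also used by Objective-C) and .pro (also used by Qt project files) are ambiguous, so a real detector would likely need to combine the name with a look at the file content.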
