Hi Trevor,

1. I assume the benchmark was using a pre-2.0 version of Tika, yes?

It would be great to try out the current support in the 2.0 branch, as a comparison with what we had previously. Also, details on the test corpus used would be useful.

2. I started using the ServiceLoader pattern to support dynamic loading of language detectors.

There's a bit more work to move the common support classes (LanguageWriter, etc.) from the specific implementation sub-project into core.

Once that's done you should be able to try out directly adding your integration with Text.jl.

-- Ken

> From: Trevor Claude Lewis
> Sent: February 23, 2016 10:55:46am PST
> To: [email protected]
> Cc: Mattmann, Chris A (3980); Ramirez, Paul M (398M);
> [email protected]
> Subject: Integrating Tika with MITLL Text.jl library for language detection
>
> Hi all,
>
> I am Trevor and I am a grad student at USC currently working with Prof.
> Chris Mattmann and Paul Ramirez, on integrating Tika with MIT Lincoln Lab’s
> Text.jl library for language detection.
> https://issues.apache.org/jira/browse/TIKA-1696
>
> Since, Text.jl is written in Julia I have created a Julia HTTP Server which
> accepts PUT request data and returns the language of the data as a JSON
> string.
> https://github.com/trevorlewis/csci572dr.git
>
> I have also benchmarked the results of the Julia HTTP Server to identify
> language with Tika 1.11 language detector.
> https://docs.google.com/spreadsheets/d/1cW6S2WpiN08pZ3UMVGMyQkO-fotUiUyGRemCrbC1miY/edit?usp=sharing
>
> I was also looking at the work done by Ken Krugler on Tika's 2.x branch
> language detection and I was planning to fork that project and add the
> Text.jl implementation.
> https://issues.apache.org/jira/browse/TIKA-1723
>
> I wanted to gather any input and feedback on this project.
>
>
> Thanks,
>
> Trevor Lewis
> [email protected]

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
