Hi Trevor,

1. I assume the benchmark was using a pre-2.0 version of Tika, yes?

It would be great to try out the current support in the 2.0 branch, as a 
comparison with what we had previously.

Also, details on the test corpus used would be useful.

2. I started using the ServiceLoader pattern to support dynamic loading of 
language detectors.

There's a bit more work to do to move the common support classes 
(LanguageWriter, etc.) from the specific implementation sub-project into core.

Once that's done, you should be able to try adding your Text.jl integration 
directly.
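
As a rough sketch of the ServiceLoader approach from point 2: the interface and 
class names below (LangDetector, PlaceholderDetector, DetectorLoaderSketch) are 
made up for illustration and are not the actual Tika 2.0 API. Only 
java.util.ServiceLoader itself is real. The placeholder detector just returns 
"und" (undetermined) and stands in for something like a client of the Text.jl 
HTTP server.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

// Entry point. Every type in this file is hypothetical, not Tika's real API.
class DetectorLoaderSketch {

    // Collect every LangDetector implementation that ServiceLoader can find
    // on the classpath. Providers are registered by listing their class names
    // in a file named META-INF/services/<fully qualified interface name>.
    static List<LangDetector> loadDetectors() {
        List<LangDetector> found = new ArrayList<>();
        for (LangDetector detector : ServiceLoader.load(LangDetector.class)) {
            found.add(detector);
        }
        return found;
    }

    // Use the first discovered detector, or the fallback if none is registered.
    static LangDetector firstOrFallback(List<LangDetector> detectors,
                                        LangDetector fallback) {
        return detectors.isEmpty() ? fallback : detectors.get(0);
    }

    public static void main(String[] args) {
        List<LangDetector> detectors = loadDetectors();
        LangDetector detector = firstOrFallback(detectors, new PlaceholderDetector());
        System.out.println("Discovered " + detectors.size()
                + " detector(s); using " + detector.name());
        System.out.println("Result: " + detector.detect("Hello, world"));
    }
}

// Hypothetical service interface that each detector would implement.
interface LangDetector {
    String name();

    // Returns a language code for the given text.
    String detect(String text);
}

// Stand-in implementation. A real provider would need to be a public class
// with a public no-argument constructor so ServiceLoader can instantiate it.
class PlaceholderDetector implements LangDetector {
    public String name() {
        return "placeholder";
    }

    public String detect(String text) {
        return "und"; // "undetermined"; no real detection happens here
    }
}
```

With no provider file on the classpath the loader finds nothing and the sketch 
falls back to the placeholder. A new detector would be picked up by adding its 
class name to the matching META-INF/services file, without changing the loading 
code.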

-- Ken

> From: Trevor Claude Lewis
> Sent: February 23, 2016 10:55:46am PST
> To: [email protected]
> Cc: Mattmann, Chris A (3980); Ramirez, Paul M (398M); 
> [email protected]
> Subject: Integrating Tika with MITLL Text.jl library for language detection
> 
> Hi all,
> 
> I am Trevor, a grad student at USC currently working with Prof. Chris
> Mattmann and Paul Ramirez on integrating Tika with MIT Lincoln Lab’s
> Text.jl library for language detection.
> https://issues.apache.org/jira/browse/TIKA-1696
> 
> Since Text.jl is written in Julia, I have created a Julia HTTP server that
> accepts PUT request data and returns the language of the data as a JSON
> string.
> https://github.com/trevorlewis/csci572dr.git
> 
> I have also benchmarked the Julia HTTP server's language identification
> against the Tika 1.11 language detector.
> https://docs.google.com/spreadsheets/d/1cW6S2WpiN08pZ3UMVGMyQkO-fotUiUyGRemCrbC1miY/edit?usp=sharing
> 
> I have also been looking at Ken Krugler's language detection work on Tika's
> 2.x branch, and I am planning to fork that project and add the Text.jl
> implementation.
> https://issues.apache.org/jira/browse/TIKA-1723
> 
> I wanted to gather any input and feedback on this project.
> 
> 
> Thanks,
> 
> Trevor Lewis
> [email protected]

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr




