[jira] [Commented] (TIKA-1696) Language Identification with Text Processing Toolkit from MITLL

2016-03-10 Thread Paul Ramirez (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15190104#comment-15190104
 ] 

Paul Ramirez commented on TIKA-1696:


Trevor has a patch to make this work with Tika 1.11. He mentioned that he 
posted the patch, but I'm not seeing it here. I'll hit him up, as it may just 
be that he posted it in his GitHub repo.

> Language Identification with Text Processing Toolkit from MITLL
> ---
>
> Key: TIKA-1696
> URL: https://issues.apache.org/jira/browse/TIKA-1696
> Project: Tika
> Issue Type: New Feature
> Components: languageidentifier
> Reporter: Paul Ramirez
> Assignee: Chris A. Mattmann
> Fix For: 1.13
>
>
> The aim here is to extend the methods for language identification within 
> text. MIT Lincoln Labs has an open source library [1] written in Julia. 
> Having spoken with the MITLL guys, there is a possibility that there is a 
> Scala version of this library, which would make it easier to package in with 
> Tika. 
> At this point I'm not quite sure how many languages this library supports by 
> default, but it can be extended when provided some training data.
> [1] https://github.com/mit-nlp/Text.jl



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1696) Language Identification with Text Processing Toolkit from MITLL

2015-07-29 Thread Paul Ramirez (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646530#comment-14646530
 ] 

Paul Ramirez commented on TIKA-1696:


The algorithm that is used is described here:

https://en.wikipedia.org/wiki/Margin_Infused_Relaxed_Algorithm
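
For readers new to MIRA, here is a minimal sketch of a single-best MIRA-style update applied to language identification. It is an illustration only and not the MITLL implementation: the character-trigram features, the class name MiraClassifier, the max_step cap, and the fixed margin of 1 are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Count overlapping character n-grams of a padded, lowercased string."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


class MiraClassifier:
    """Multiclass linear classifier with a single-best MIRA-style update.

    Hypothetical sketch: assumes at least two labels and a required margin of 1.
    """

    def __init__(self, labels, max_step=1.0):
        self.labels = list(labels)
        self.max_step = max_step  # cap on the per-example step size
        self.weights = {label: defaultdict(float) for label in self.labels}

    def score(self, label, feats):
        """Dot product between one label's weight vector and the features."""
        w = self.weights[label]
        return sum(w.get(f, 0.0) * v for f, v in feats.items())

    def predict(self, text):
        feats = char_ngrams(text)
        return max(self.labels, key=lambda label: self.score(label, feats))

    def update(self, text, gold):
        """Make the smallest weight change that gives `gold` a margin of 1."""
        feats = char_ngrams(text)
        # Highest-scoring incorrect label is the one to push away from.
        rival = max(
            (label for label in self.labels if label != gold),
            key=lambda label: self.score(label, feats),
        )
        loss = 1.0 - (self.score(gold, feats) - self.score(rival, feats))
        if loss <= 0.0:
            return  # margin already satisfied, leave the weights alone
        # Two weight vectors move, so the squared norm of the change is 2*|x|^2.
        norm_sq = 2.0 * sum(v * v for v in feats.values())
        tau = min(self.max_step, loss / norm_sq)
        for f, v in feats.items():
            self.weights[gold][f] += tau * v
            self.weights[rival][f] -= tau * v
```

Training would amount to calling update(text, label) over labeled sentences for a few passes, then calling predict(text) on unseen text. How well that works depends entirely on the training data, which is the point made in the issue description about extending the library to more languages.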



[jira] [Commented] (TIKA-1696) Language Identification with Text Processing Toolkit from MITLL

2015-07-23 Thread Paul Ramirez (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639522#comment-14639522
 ] 

Paul Ramirez commented on TIKA-1696:


Ken, thanks for the fast feedback and references. I've not dug into this much, 
so it may take a couple of weeks to get something up here to test. As I dig 
into this I'll update the Jira issue with more details to help drive 
discussion. I'll also look to get the MITLL guys posting here, as they would 
be better able to describe the details. 

What wasn't clear on TIKA-369 is whether yalder was going to come back into 
Tika. The intent here is to get to a patch integrating their code so it can be 
tested in the same way that Tika's current approach was tested. Hopefully that 
patch will help answer the questions above. 

They are forwarding me some research papers so I can come up to speed on this 
too, so as I gain knowledge I'll flesh it out here. 

Do you think this should instead happen on TIKA-369?



[jira] [Created] (TIKA-1696) Language Identification with Text Processing Toolkit from MITLL

2015-07-23 Thread Paul Ramirez (JIRA)
Paul Ramirez created TIKA-1696:
--

 Summary: Language Identification with Text Processing Toolkit from 
MITLL
 Key: TIKA-1696
 URL: https://issues.apache.org/jira/browse/TIKA-1696
 Project: Tika
  Issue Type: New Feature
  Components: languageidentifier
Reporter: Paul Ramirez
 Fix For: 1.10


The aim here is to extend the methods for language identification within text. 
MIT Lincoln Labs has an open source library [1] written in Julia. Having spoken 
with the MITLL guys, there is a possibility that there is a Scala version of 
this library, which would make it easier to package in with Tika. 

At this point I'm not quite sure how many languages this library supports by 
default, but it can be extended when provided some training data.

[1] https://github.com/mit-nlp/Text.jl





[jira] [Commented] (TIKA-1688) Tika Version in Metadata

2015-07-17 Thread Paul Ramirez (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631480#comment-14631480
 ] 

Paul Ramirez commented on TIKA-1688:


Looked at a later version of Tika, and the format to follow would appear to be:

X-TIKA-Version: 1.9

 Tika Version in Metadata
 

 Key: TIKA-1688
 URL: https://issues.apache.org/jira/browse/TIKA-1688
 Project: Tika
  Issue Type: Improvement
Reporter: Paul Ramirez
Priority: Minor
 Fix For: 1.10


 Could this be added as X-Tika:version, so that downstream there would be 
 traceability to the extraction based on version?





[jira] [Created] (TIKA-1688) Tika Version in Metadata

2015-07-16 Thread Paul Ramirez (JIRA)
Paul Ramirez created TIKA-1688:
--

 Summary: Tika Version in Metadata
 Key: TIKA-1688
 URL: https://issues.apache.org/jira/browse/TIKA-1688
 Project: Tika
  Issue Type: Improvement
Reporter: Paul Ramirez
Priority: Minor
 Fix For: 1.10


Could this be added as X-Tika:version, so that downstream there would be 
traceability to the extraction based on version?





[jira] [Commented] (TIKA-1688) Tika Version in Metadata

2015-07-16 Thread Paul Ramirez (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630171#comment-14630171
 ] 

Paul Ramirez commented on TIKA-1688:


Here is an example of metadata dump with Tika 1.6:

$ tika -m court-txt.tgz 
Content-Length: 171453
Content-Type: application/gzip
X-Parsed-By: org.apache.tika.parser.DefaultParser
X-Parsed-By: org.apache.tika.parser.pkg.CompressorParser
resourceName: court-txt.tgz

The suggestion is that the metadata would include X-Tika-Version: 1.6 in this 
case.
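
To make the traceability idea concrete, here is a small sketch of what a downstream consumer could do with such a key. It does not touch Tika itself: parse_metadata_dump and with_version are hypothetical helpers written for this example, and the X-Tika-Version key is simply the name suggested above, not a confirmed Tika metadata key.

```python
from collections import defaultdict

# Metadata dump copied from the `tika -m` output shown above.
DUMP = """\
Content-Length: 171453
Content-Type: application/gzip
X-Parsed-By: org.apache.tika.parser.DefaultParser
X-Parsed-By: org.apache.tika.parser.pkg.CompressorParser
resourceName: court-txt.tgz
"""


def parse_metadata_dump(dump):
    """Parse 'Key: Value' lines into a dict of lists, since keys may repeat."""
    metadata = defaultdict(list)
    for line in dump.splitlines():
        key, sep, value = line.partition(": ")
        if sep:  # skip blank or malformed lines
            metadata[key.strip()].append(value.strip())
    return dict(metadata)


def with_version(metadata, version, key="X-Tika-Version"):
    """Return a copy of the metadata with the extractor version recorded.

    The key name is an assumption taken from the suggestion in this thread.
    """
    stamped = {k: list(v) for k, v in metadata.items()}
    stamped[key] = [version]
    return stamped


if __name__ == "__main__":
    stamped = with_version(parse_metadata_dump(DUMP), "1.6")
    for name, values in stamped.items():
        for value in values:
            print(f"{name}: {value}")
```

With the version stored next to the other keys, a downstream index could filter or re-extract documents that were processed by an older release.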



[jira] [Commented] (TIKA-1518) Docker with Tika Server

2015-01-26 Thread Paul Ramirez (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292435#comment-14292435
 ] 

Paul Ramirez commented on TIKA-1518:


Missed this over the weekend while playing with Docker, but yes, 
[~chrismattmann], it looks to be exactly what I was thinking. +1 to leaving 
this open until it's in the Apache Tika codebase. Dave, I will definitely use 
this for a project and commit updates to it.

 Docker with Tika Server
 ---

 Key: TIKA-1518
 URL: https://issues.apache.org/jira/browse/TIKA-1518
 Project: Tika
  Issue Type: New Feature
Reporter: Paul Ramirez
 Fix For: 1.8


 This version should be able to demonstrate as many of Apache Tika's 
 capabilities as possible. For instance, it could include GDAL, Tesseract, and 
 FFmpeg to show parsers which require installation of other dependencies. In 
 addition, this should help move TIKA-1301 forward and should leverage the 
 suggestion made by [~lewismc] of a script which can pull down the latest 
 version of Apache Tika.





[jira] [Commented] (TIKA-1518) Docker with Tika Server

2015-01-15 Thread Paul Ramirez (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279927#comment-14279927
 ] 

Paul Ramirez commented on TIKA-1518:


Thanks Konstantin for the example. If you have the time that would be awesome.



[jira] [Created] (TIKA-1518) Docker with Tika Server

2015-01-15 Thread Paul Ramirez (JIRA)
Paul Ramirez created TIKA-1518:
--

 Summary: Docker with Tika Server
 Key: TIKA-1518
 URL: https://issues.apache.org/jira/browse/TIKA-1518
 Project: Tika
  Issue Type: New Feature
Reporter: Paul Ramirez
 Fix For: 1.8


This version should be able to demonstrate as many of Apache Tika's 
capabilities as possible. For instance, it could include GDAL, Tesseract, and 
FFmpeg to show parsers which require installation of other dependencies. In 
addition, this should help move TIKA-1301 forward and should leverage the 
suggestion made by [~lewismc] of a script which can pull down the latest 
version of Apache Tika.





[jira] [Commented] (TIKA-1301) Establish TikaServer on Apache hosted VM

2015-01-15 Thread Paul Ramirez (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278419#comment-14278419
 ] 

Paul Ramirez commented on TIKA-1301:


In the spirit of fun, and because I'm going to do it for work, I'd like to 
propose building a Docker container for Tika with the server running. Chris 
pointed me at this issue, which seems related and could be tackled jointly. 
It seems like the first step would be to create a new issue for a Docker 
container.

@Lewis do you have a pointer to those scripts?

 Establish TikaServer on Apache hosted VM
 

 Key: TIKA-1301
 URL: https://issues.apache.org/jira/browse/TIKA-1301
 Project: Tika
  Issue Type: Bug
  Components: server
Reporter: Lewis John McGibbney
 Fix For: 1.8


 Over in Any23, Infra recently provisioned us with a nice shiny new VM to run 
 our service on
 http://any23.org
 I would like to do the same for Tika. I have some scripts on the Any23 VM 
 which will pull stable nightly tika-server snapshots and deploy them to the 
 VM. This is really nice for devs and users alike.





[jira] [Commented] (TIKA-1518) Docker with Tika Server

2015-01-15 Thread Paul Ramirez (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278424#comment-14278424
 ] 

Paul Ramirez commented on TIKA-1518:


As I build a patch, what component should this go into? Any suggestions on 
things that will need to be a part of this beyond the dependencies I've listed?





[jira] [Commented] (TIKA-1319) Translation

2014-06-05 Thread Paul Ramirez (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018949#comment-14018949
 ] 

Paul Ramirez commented on TIKA-1319:


The refactored interfaces look good. +1

 Translation
 ---

 Key: TIKA-1319
 URL: https://issues.apache.org/jira/browse/TIKA-1319
 Project: Tika
  Issue Type: New Feature
Reporter: Tyler Palsulich
Assignee: Chris A. Mattmann
Priority: Minor

 I just opened up a review on reviews.apache.org -- 
 https://reviews.apache.org/r/22219/. I copied the description below. 
 This patch adds basic language translation functionality to Tika. Translation 
 is provided by a Microsoft API, but accessed through Apache 2 licensed 
 com.memetix.microsoft-translator-java-api 
 (https://code.google.com/p/microsoft-translator-java-api/ ). If a user wants 
 to use the translation feature, they have to add a client id and client 
 secret to the 
 tika-core/src/main/resources/org/apache/tika/language/translator.properties 
 file (see http://msdn.microsoft.com/en-us/library/hh454950.aspx ). I added 
 com.memetix as a dependency in tika-core. I put the Translator class in 
 org.apache.tika.language. There is no integration with the server or CLI, 
 yet. Further, only Strings are translated right now -- if you pass in a full 
 document with xml tags, the structure will be mangled. But, I think that 
 would be a cool feature -- translate the body, title, subtitle, etc, but not 
 the structural elements. 
 There is still more work to do, but I wanted some more eyes on this to make 
 sure I'm heading in the right direction and this is a desired feature. Let me 
 know what you think!
 There are two simple unit tests for now which translate hello to French 
 (salut). One takes both the source and target languages; the other takes 
 just the target language (and detects the source language automatically).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (TIKA-1319) Translation

2014-06-04 Thread Paul Ramirez (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14017953#comment-14017953
 ] 

Paul Ramirez commented on TIKA-1319:


Looks good. A couple of questions came to mind. If a translator is passed in, 
would it make sense to integrate translation into the methods that extract the 
text directly, instead of having an upper-level client do that? Following on 
from that, to keep backwards compatibility, would it then make sense to have 
some property to explicitly turn this behavior on and off? Are there other 
translator services that one would want to add here down the line? If so, 
maybe an abstraction layer for the actual underlying service or code doing the 
translation should be added. That said, I don't want to invent use cases, and 
this is a great start.
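
To illustrate the shape of what I'm asking about, here is a rough sketch of an abstraction layer plus an explicit on/off property. It is written in Python for brevity and is not the patch's API: Translator, UppercaseTranslator, TextExtractor, and translate_enabled are all hypothetical names, and the extraction step is a placeholder.

```python
from abc import ABC, abstractmethod


class Translator(ABC):
    """Hypothetical abstraction over whichever service does the translation."""

    @abstractmethod
    def translate(self, text, target, source=None):
        """Return `text` translated into `target`; detect source if None."""


class UppercaseTranslator(Translator):
    """Stand-in backend for illustration only; it does no real translation."""

    def translate(self, text, target, source=None):
        return text.upper()


class TextExtractor:
    """Placeholder extractor that can optionally translate what it extracts."""

    def __init__(self, translator=None, translate_enabled=False, target="fr"):
        self.translator = translator
        # Off by default, so existing callers see unchanged behavior.
        self.translate_enabled = translate_enabled
        self.target = target

    def extract(self, raw):
        text = raw.strip()  # placeholder for the real extraction logic
        if self.translate_enabled and self.translator is not None:
            return self.translator.translate(text, self.target)
        return text
```

With this arrangement, swapping in another translation service means adding another Translator subclass, and callers that never set translate_enabled keep getting untranslated text.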
