[jira] [Commented] (TIKA-1696) Language Identification with Text Processing Toolkit from MITLL
[ https://issues.apache.org/jira/browse/TIKA-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15190104#comment-15190104 ]

Paul Ramirez commented on TIKA-1696:

Trevor has a patch to make this work with Tika 1.11. He mentioned that he posted the patch, but I'm not seeing it here. I'll hit him up, as it may just be that he posted it in his GitHub repo.

> Language Identification with Text Processing Toolkit from MITLL
> ---
>
> Key: TIKA-1696
> URL: https://issues.apache.org/jira/browse/TIKA-1696
> Project: Tika
> Issue Type: New Feature
> Components: languageidentifier
> Reporter: Paul Ramirez
> Assignee: Chris A. Mattmann
> Fix For: 1.13
>
> The aim here is to extend the methods for language identification within text. MIT Lincoln Labs has an open source library [1] written in Julia. Having spoken with the MITLL guys there is a possibility that there is a scala version of this library which would make it easier to package in with Tika.
> At this point I'm not quite sure how many languages this library supports by default but it can be extended when provided some training data.
> [1] https://github.com/mit-nlp/Text.jl

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1696) Language Identification with Text Processing Toolkit from MITLL
[ https://issues.apache.org/jira/browse/TIKA-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646530#comment-14646530 ]

Paul Ramirez commented on TIKA-1696:

The algorithm that is used is described here:
https://en.wikipedia.org/wiki/Margin_Infused_Relaxed_Algorithm

> Language Identification with Text Processing Toolkit from MITLL
> ---
>
> Key: TIKA-1696
> URL: https://issues.apache.org/jira/browse/TIKA-1696
> Project: Tika
> Issue Type: New Feature
> Components: languageidentifier
> Reporter: Paul Ramirez
> Fix For: 1.10
>
> The aim here is to extend the methods for language identification within text. MIT Lincoln Labs has an open source library [1] written in Julia. Having spoken with the MITLL guys there is a possibility that there is a scala version of this library which would make it easier to package in with Tika.
> At this point I'm not quite sure how many languages this library supports by default but it can be extended when provided some training data.
> [1] https://github.com/mit-nlp/Text.jl
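For readers unfamiliar with the algorithm linked above, here is a minimal sketch of a 1-best MIRA-style update applied to language identification. It is not taken from Text.jl or from any Tika patch: the character-bigram features, the class name MiraClassifier, the clipping constant c, and the toy training sentences are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Count character n-grams of the padded, lower-cased text (assumed feature set)."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def score(weights, feats):
    """Dot product between one label's weight vector and a feature vector."""
    return sum(weights.get(f, 0.0) * v for f, v in feats.items())


class MiraClassifier:
    """Hypothetical multiclass classifier trained with a 1-best MIRA-style update."""

    def __init__(self, labels, c=1.0):
        self.c = c  # upper bound on the step size
        self.weights = {label: defaultdict(float) for label in labels}

    def predict(self, text):
        feats = char_ngrams(text)
        return max(self.weights, key=lambda lab: score(self.weights[lab], feats))

    def update(self, text, gold):
        feats = char_ngrams(text)
        # Highest-scoring label other than the correct one.
        rival = max((lab for lab in self.weights if lab != gold),
                    key=lambda lab: score(self.weights[lab], feats))
        margin = score(self.weights[gold], feats) - score(self.weights[rival], feats)
        loss = max(0.0, 1.0 - margin)
        if loss == 0.0:
            return  # margin already satisfied, weights stay unchanged
        sq_norm = sum(v * v for v in feats.values())
        # Smallest step that restores a margin of 1, clipped at c.
        tau = min(self.c, loss / (2.0 * sq_norm))
        for f, v in feats.items():
            self.weights[gold][f] += tau * v
            self.weights[rival][f] -= tau * v


def train(examples, labels, epochs=20):
    """Run several passes of updates over (text, label) pairs."""
    clf = MiraClassifier(labels)
    for _ in range(epochs):
        for text, label in examples:
            clf.update(text, label)
    return clf


# Toy training data, invented for this sketch.
EXAMPLES = [
    ("the cat sat on the mat", "en"),
    ("le chat est sur le tapis", "fr"),
    ("where is the train station", "en"),
    ("la gare est pres du port", "fr"),
]
```

Calling train(EXAMPLES, ["en", "fr"]) returns a classifier whose predict method labels new text. Extending the set of languages, as the issue description mentions, would amount to supplying more labelled training sentences.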
[jira] [Commented] (TIKA-1696) Language Identification with Text Processing Toolkit from MITLL
[ https://issues.apache.org/jira/browse/TIKA-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639522#comment-14639522 ]

Paul Ramirez commented on TIKA-1696:

Ken, thanks for the fast feedback and references. I've not dug into this much, so it may take a couple of weeks to get something up here to test. As I dig into this I'll update the Jira issue with more details to help drive discussion. I'll also look to get the MITLL guys posting here, as they would be better able to describe the details. What wasn't clear on TIKA-369 is whether yalder was going to come back into Tika.

The intent here is to get to a patch integrating their code so it can be tested in the same way that Tika's current approach was tested. Hopefully that patch will help answer the questions above. They are forwarding me some research papers so I can come up to speed on this too, and as I gain knowledge I'll flesh it out here.

Do you think this should instead happen on TIKA-369?

> Language Identification with Text Processing Toolkit from MITLL
> ---
>
> Key: TIKA-1696
> URL: https://issues.apache.org/jira/browse/TIKA-1696
> Project: Tika
> Issue Type: New Feature
> Components: languageidentifier
> Reporter: Paul Ramirez
> Fix For: 1.10
>
> The aim here is to extend the methods for language identification within text. MIT Lincoln Labs has an open source library [1] written in Julia. Having spoken with the MITLL guys there is a possibility that there is a scala version of this library which would make it easier to package in with Tika.
> At this point I'm not quite sure how many languages this library supports by default but it can be extended when provided some training data.
> [1] https://github.com/mit-nlp/Text.jl
[jira] [Created] (TIKA-1696) Language Identification with Text Processing Toolkit from MITLL
Paul Ramirez created TIKA-1696:
---

Summary: Language Identification with Text Processing Toolkit from MITLL
Key: TIKA-1696
URL: https://issues.apache.org/jira/browse/TIKA-1696
Project: Tika
Issue Type: New Feature
Components: languageidentifier
Reporter: Paul Ramirez
Fix For: 1.10

The aim here is to extend the methods for language identification within text. MIT Lincoln Labs has an open source library [1] written in Julia. Having spoken with the MITLL guys there is a possibility that there is a scala version of this library which would make it easier to package in with Tika.
At this point I'm not quite sure how many languages this library supports by default but it can be extended when provided some training data.
[1] https://github.com/mit-nlp/Text.jl
[jira] [Commented] (TIKA-1688) Tika Version in Metadata
[ https://issues.apache.org/jira/browse/TIKA-1688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631480#comment-14631480 ]

Paul Ramirez commented on TIKA-1688:

I looked at a later version of Tika, and the format to follow would appear to be:

    X-TIKA-Version: 1.9

> Tika Version in Metadata
> ---
>
> Key: TIKA-1688
> URL: https://issues.apache.org/jira/browse/TIKA-1688
> Project: Tika
> Issue Type: Improvement
> Reporter: Paul Ramirez
> Priority: Minor
> Fix For: 1.10
>
> Could this be added as X-Tika:version? That way, downstream there would be traceability to extraction based on version.
[jira] [Created] (TIKA-1688) Tika Version in Metadata
Paul Ramirez created TIKA-1688:
---

Summary: Tika Version in Metadata
Key: TIKA-1688
URL: https://issues.apache.org/jira/browse/TIKA-1688
Project: Tika
Issue Type: Improvement
Reporter: Paul Ramirez
Priority: Minor
Fix For: 1.10

Could this be added as X-Tika:version? That way, downstream there would be traceability to extraction based on version.
[jira] [Commented] (TIKA-1688) Tika Version in Metadata
[ https://issues.apache.org/jira/browse/TIKA-1688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630171#comment-14630171 ]

Paul Ramirez commented on TIKA-1688:

Here is an example of a metadata dump with Tika 1.6:

    $ tika -m court-txt.tgz
    Content-Length: 171453
    Content-Type: application/gzip
    X-Parsed-By: org.apache.tika.parser.DefaultParser
    X-Parsed-By: org.apache.tika.parser.pkg.CompressorParser
    resourceName: court-txt.tgz

The suggestion is that the metadata would also include X-Tika-Version: 1.6 in this case.

> Tika Version in Metadata
> ---
>
> Key: TIKA-1688
> URL: https://issues.apache.org/jira/browse/TIKA-1688
> Project: Tika
> Issue Type: Improvement
> Reporter: Paul Ramirez
> Priority: Minor
> Fix For: 1.10
>
> Could this be added as X-Tika:version? That way, downstream there would be traceability to extraction based on version.
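To make the traceability idea concrete, here is a small sketch of how a downstream consumer might stamp and read back a version entry in a metadata dump like the one above. It does not use Tika's API: the function names parse_dump, stamp_version and find_version are hypothetical, and the key X-Tika-Version is simply the spelling suggested in the comment, not necessarily what any Tika release emits.

```python
# The metadata lines from the Tika 1.6 dump quoted in the comment above.
SAMPLE_DUMP = """\
Content-Length: 171453
Content-Type: application/gzip
X-Parsed-By: org.apache.tika.parser.DefaultParser
X-Parsed-By: org.apache.tika.parser.pkg.CompressorParser
resourceName: court-txt.tgz
"""

# Assumed key name, taken from the suggestion in the comment.
VERSION_KEY = "X-Tika-Version"


def parse_dump(text):
    """Parse 'Key: value' lines into a list of pairs, keeping repeated keys."""
    entries = []
    for line in text.splitlines():
        if ": " in line:
            key, value = line.split(": ", 1)
            entries.append((key, value))
    return entries


def stamp_version(entries, version, key=VERSION_KEY):
    """Return a copy of the entries with the version appended if it is missing."""
    if any(k == key for k, _ in entries):
        return list(entries)
    return list(entries) + [(key, version)]


def find_version(entries, key=VERSION_KEY):
    """Return the recorded version, or None when the entry is absent."""
    for k, value in entries:
        if k == key:
            return value
    return None


stamped = stamp_version(parse_dump(SAMPLE_DUMP), "1.6")
```

A list of pairs is used instead of a dictionary because the dump repeats X-Parsed-By, and a dictionary would silently drop one of the two parsers.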
[jira] [Commented] (TIKA-1518) Docker with Tika Server
[ https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292435#comment-14292435 ]

Paul Ramirez commented on TIKA-1518:

Missed this over the weekend while playing with Docker, but yes, [~chrismattmann], that looks to be exactly what I was thinking. +1 to leaving this open until it's in the Apache Tika codebase. Dave, I will definitely use this for a project and commit updates to it.

> Docker with Tika Server
> ---
>
> Key: TIKA-1518
> URL: https://issues.apache.org/jira/browse/TIKA-1518
> Project: Tika
> Issue Type: New Feature
> Reporter: Paul Ramirez
> Fix For: 1.8
>
> This version should be able to demonstrate as many of Apache Tika's capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to show parsers which require installation of other dependencies. In addition, this should help move TIKA-1301 forward and should leverage the suggestion made by [~lewismc] of a script which can pull down the latest version of Apache Tika.
[jira] [Commented] (TIKA-1518) Docker with Tika Server
[ https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279927#comment-14279927 ]

Paul Ramirez commented on TIKA-1518:

Thanks Konstantin for the example. If you have the time that would be awesome.

> Docker with Tika Server
> ---
>
> Key: TIKA-1518
> URL: https://issues.apache.org/jira/browse/TIKA-1518
> Project: Tika
> Issue Type: New Feature
> Reporter: Paul Ramirez
> Fix For: 1.8
>
> This version should be able to demonstrate as many of Apache Tika's capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to show parsers which require installation of other dependencies. In addition, this should help move TIKA-1301 forward and should leverage the suggestion made by [~lewismc] of a script which can pull down the latest version of Apache Tika.
[jira] [Created] (TIKA-1518) Docker with Tika Server
Paul Ramirez created TIKA-1518:
---

Summary: Docker with Tika Server
Key: TIKA-1518
URL: https://issues.apache.org/jira/browse/TIKA-1518
Project: Tika
Issue Type: New Feature
Reporter: Paul Ramirez
Fix For: 1.8

This version should be able to demonstrate as many of Apache Tika's capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to show parsers which require installation of other dependencies. In addition, this should help move TIKA-1301 forward and should leverage the suggestion made by [~lewismc] of a script which can pull down the latest version of Apache Tika.
[jira] [Commented] (TIKA-1301) Establish TikaServer on Apache hosted VM
[ https://issues.apache.org/jira/browse/TIKA-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278419#comment-14278419 ]

Paul Ramirez commented on TIKA-1301:

In the spirit of fun, and because I'm going to do it for work, I'd like to propose building a Docker container for Tika with the server running. Chris pointed me at this issue, which seems related and could be tackled jointly. It seems like the first step would be to create a new issue for a Docker container. @Lewis, do you have a pointer to those scripts?

> Establish TikaServer on Apache hosted VM
> ---
>
> Key: TIKA-1301
> URL: https://issues.apache.org/jira/browse/TIKA-1301
> Project: Tika
> Issue Type: Bug
> Components: server
> Reporter: Lewis John McGibbney
> Fix For: 1.8
>
> Over in Any23, Infra recently provisioned us with a nice shiny new VM to run our service on: http://any23.org
> I would like to do the same for Tika.
> I have some scripts on the Any23 VM which will pull stable nightly tika-server snapshots and deploy them to the VM.
> This is really nice for both devs and users alike.
[jira] [Commented] (TIKA-1518) Docker with Tika Server
[ https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278424#comment-14278424 ]

Paul Ramirez commented on TIKA-1518:

As I build a patch, what component should this go into? Any suggestions on things that will need to be part of this beyond the dependencies I've listed?

> Docker with Tika Server
> ---
>
> Key: TIKA-1518
> URL: https://issues.apache.org/jira/browse/TIKA-1518
> Project: Tika
> Issue Type: New Feature
> Reporter: Paul Ramirez
> Fix For: 1.8
>
> This version should be able to demonstrate as many of Apache Tika's capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to show parsers which require installation of other dependencies. In addition, this should help move TIKA-1301 forward and should leverage the suggestion made by [~lewismc] of a script which can pull down the latest version of Apache Tika.
[jira] [Commented] (TIKA-1319) Translation
[ https://issues.apache.org/jira/browse/TIKA-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018949#comment-14018949 ]

Paul Ramirez commented on TIKA-1319:

The refactored interfaces look good. +1

> Translation
> ---
>
> Key: TIKA-1319
> URL: https://issues.apache.org/jira/browse/TIKA-1319
> Project: Tika
> Issue Type: New Feature
> Reporter: Tyler Palsulich
> Assignee: Chris A. Mattmann
> Priority: Minor
>
> I just opened up a review on reviews.apache.org -- https://reviews.apache.org/r/22219/. I copied the description below.
>
> This patch adds basic language translation functionality to Tika. Translation is provided by a Microsoft API, but accessed through Apache 2 licensed com.memetix.microsoft-translator-java-api (https://code.google.com/p/microsoft-translator-java-api/ ). If a user wants to use the translation feature, they have to add a client id and client secret to the tika-core/src/main/resources/org/apache/tika/language/translator.properties file (see http://msdn.microsoft.com/en-us/library/hh454950.aspx ).
>
> I added com.memetix as a dependency in tika-core. I put the Translator class in org.apache.tika.language. There is no integration with the server or CLI, yet. Further, only Strings are translated right now -- if you pass in a full document with xml tags, the structure will be mangled. But, I think that would be a cool feature -- translate the body, title, subtitle, etc, but not the structural elements.
>
> There is still more work to do, but I wanted some more eyes on this to make sure I'm heading in the right direction and this is a desired feature. Let me know what you think!
>
> There are two simple unit tests for now which translate hello to French (salut). One for inputting the source and target languages, one for inputting just the target language (and detecting the source language automatically).

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-1319) Translation
[ https://issues.apache.org/jira/browse/TIKA-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14017953#comment-14017953 ]

Paul Ramirez commented on TIKA-1319:

Looks good. A couple of questions came to mind.

If a translator is passed in, would it make sense to integrate it into the methods that extract the text directly, instead of having an upper-level client do that? Following on from that, to keep backwards compatibility, would it then make sense to have some property to explicitly turn this behavior on and off?

Are there other translator services that one would want to add in here down the line? If so, maybe an abstraction layer for the actual underlying service or code doing the translation should be added.

That said, I don't want to invent use cases, and this is a great start.

> Translation
> ---
>
> Key: TIKA-1319
> URL: https://issues.apache.org/jira/browse/TIKA-1319
> Project: Tika
> Issue Type: New Feature
> Reporter: Tyler Palsulich
> Assignee: Chris A. Mattmann
> Priority: Minor
>
> I just opened up a review on reviews.apache.org -- https://reviews.apache.org/r/22219/. I copied the description below.
>
> This patch adds basic language translation functionality to Tika. Translation is provided by a Microsoft API, but accessed through Apache 2 licensed com.memetix.microsoft-translator-java-api (https://code.google.com/p/microsoft-translator-java-api/ ). If a user wants to use the translation feature, they have to add a client id and client secret to the tika-core/src/main/resources/org/apache/tika/language/translator.properties file (see http://msdn.microsoft.com/en-us/library/hh454950.aspx ).
>
> I added com.memetix as a dependency in tika-core. I put the Translator class in org.apache.tika.language. There is no integration with the server or CLI, yet. Further, only Strings are translated right now -- if you pass in a full document with xml tags, the structure will be mangled. But, I think that would be a cool feature -- translate the body, title, subtitle, etc, but not the structural elements.
>
> There is still more work to do, but I wanted some more eyes on this to make sure I'm heading in the right direction and this is a desired feature. Let me know what you think!
>
> There are two simple unit tests for now which translate hello to French (salut). One for inputting the source and target languages, one for inputting just the target language (and detecting the source language automatically).
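The two questions in the comment above (an abstraction layer over the translation service, and an explicit switch that keeps the old behavior by default) can be sketched roughly as follows. This is not the patch under review and not Tika's API: the names Translator, DictionaryTranslator and extract_text, and the translate flag, are hypothetical, and the dictionary backend is only a toy stand-in for a real service.

```python
from abc import ABC, abstractmethod


class Translator(ABC):
    """Hypothetical abstraction over whichever service performs the translation."""

    @abstractmethod
    def translate(self, text, source, target):
        """Return text translated from the source language to the target language."""


class DictionaryTranslator(Translator):
    """Toy backend used for illustration; a real one would call a remote service."""

    def __init__(self, table):
        # Maps (source, target, lower-cased text) to the translated text.
        self.table = table

    def translate(self, text, source, target):
        # Fall back to the untranslated text when no entry exists.
        return self.table.get((source, target, text.lower()), text)


def extract_text(raw, translator=None, translate=False, source="en", target="fr"):
    """Stand-in for a text-extraction method with an opt-in translation step.

    Translation only happens when a translator is supplied and the flag is set,
    so existing callers keep the untranslated behavior.
    """
    text = raw.strip()
    if translate and translator is not None:
        return translator.translate(text, source, target)
    return text


backend = DictionaryTranslator({("en", "fr", "hello"): "salut"})
```

With this shape, extract_text(" hello ", backend) still returns the original text, and only extract_text(" hello ", backend, translate=True) goes through the backend. Adding another service later would mean adding another Translator subclass without touching the extraction code.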