tika-trunk-jdk1.7 - Build # 733 - Failure
The Apache Jenkins build system has built tika-trunk-jdk1.7 (build #733) Status: Failure Check console output at https://builds.apache.org/job/tika-trunk-jdk1.7/733/ to view the results.
Re: tika-trunk-jdk1.7 - Build # 733 - Failure
This was due to the SVN issues that infra was dealing with last night. I'll go ahead and spin RC #2 shortly.

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++
Re: Configuring parsers and translators
Hi Nick,

I've been mulling this over since you sent the first message. But, I'm afraid I don't have a good solution or developed ideas.

I agree, it would be very nice to consolidate all configuration for all parsers in the server and app. Is it feasible to put everything into tika-config? Then Parser implementations would read the config to pull out their own configuration. Or, would it be better to keep some configuration separate?

Documentation would be an issue if every parser defines its own metadata keys... But, it might be an improvement since we don't have free-form properties and configuration files.

Tyler

On Sat, Jun 6, 2015 at 12:30 PM Nick Burch apa...@gagravarr.org wrote:

Anyone have any thoughts on this?

On Fri, 8 May 2015, Nick Burch wrote:

Hi All

This came up in TIKA-1623, but I thought it might be better brought out to the list for discussion.

To configure parsers on a per-document basis, such as setting PDF spacing tolerances, or telling Tesseract what language it should be OCRing for, we have the *Config objects. You create one of these, use the setters to configure it for your document, pop it onto the ParseContext, and it's used when processing your document.

To configure parsers and translators on a per-JVM basis, to apply to all documents processed, it's a bit less consistent. At least some look for a properties file with a specific name, usually in the tika namespace, and grab their settings / keys / etc. out of that. At least some expect to find a *Config with their program path on it, even though that remains constant between documents. None of them support getting their settings from the Tika Config.

As part of our evolution of parser preferences, we're moving towards people either being able to set their preferences in code, or being able to supply a Tika Config XML which sets their parser preferences or overrides certain bits of the default. The code option works for people who want to declare certain specific things; the Tika Config one gives the same functionality but allows a consistent and clean way to set it between Tika App, Tika Server and Java code.

Another related example is the External Parser support. Because you can have multiple External Parser instances in your setup, one per format / program, we look for all the org/apache/tika/parser/external/tika-external-parsers.xml files on the classpath, and create parser instances based on the definitions in there.

What do we think about setting executable paths and keys/logins for parsers like OCR, Strings, Translators etc.? Always on ParseContext? Properties? Custom XML config? Tika config XML? Other? Combination?

Nick
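[Editor's sketch] The per-document pattern Nick describes (create a *Config object, set it up, put it on the parse context) can be illustrated with a type-keyed lookup. The Java below is a minimal sketch only: OcrSettings, SketchContext and SketchOcrParser are hypothetical stand-ins invented for illustration, not Tika's actual classes, and the fallback to a parser-wide default is an assumption about how a per-JVM setting might combine with a per-document one.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-document settings object, standing in for a Tika *Config class. */
class OcrSettings {
    private String language = "eng";

    public String getLanguage() { return language; }

    public void setLanguage(String language) { this.language = language; }
}

/** Minimal type-keyed context; a simplified stand-in, not Tika's ParseContext. */
class SketchContext {
    private final Map<Class<?>, Object> entries = new HashMap<>();

    public <T> void set(Class<T> key, T value) {
        entries.put(key, value);
    }

    /** Returns the value stored for the key, or the fallback when none was set. */
    public <T> T get(Class<T> key, T fallback) {
        Object value = entries.get(key);
        return value != null ? key.cast(value) : fallback;
    }
}

/** Hypothetical parser that looks up its settings in the context on every call. */
class SketchOcrParser {
    // Assumed per-JVM default, used when the caller supplies nothing per document.
    private final OcrSettings jvmDefaults = new OcrSettings();

    public String describe(SketchContext context) {
        OcrSettings settings = context.get(OcrSettings.class, jvmDefaults);
        return "OCR language: " + settings.getLanguage();
    }
}
```

The design point is that the parser never stores per-document state itself: whatever the caller places in the context for one document does not leak into the next one.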
Re: Configuring parsers and translators
On Sat, 6 Jun 2015, Tyler Palsulich wrote:

(Devil's advocate hat slightly on.) My one hesitation about putting it all into tika-config is that the default might get to be a monstrosity -- difficult for new users to use.

Assuming you don't want any translators, have no non-standard paths to external parsers, and are happy with the default parser orderings, then your default config would be:

<properties/>

(The plan so far remains to use the service loader to find parsers, detectors and friends, with the config only being used when you want to override parsers or parser orderings.)

My main worry with putting it all into config XML is that we accidentally end up re-inventing Spring badly...

Nick
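[Editor's sketch] The "empty config means defaults, explicit entries mean override" idea from the message above can be sketched with the JDK's XML parser. This is an illustration under assumptions, not how TikaConfig actually loads its settings: the element name parser and the attribute class are assumed here, the class name ConfigOverrideSketch is hypothetical, and the list of discovered defaults is passed in rather than found through the service loader.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class ConfigOverrideSketch {

    /**
     * Returns the parser class names listed in the config XML, or the
     * discovered defaults when the config lists none (e.g. an empty root).
     */
    static List<String> resolveParsers(String configXml, List<String> discoveredDefaults) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(configXml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse config XML", e);
        }
        // Assumed element name; collects every <parser> element below the root.
        NodeList nodes = doc.getDocumentElement().getElementsByTagName("parser");
        if (nodes.getLength() == 0) {
            return discoveredDefaults; // nothing overridden, keep discovery results
        }
        List<String> configured = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            configured.add(((Element) nodes.item(i)).getAttribute("class"));
        }
        return configured;
    }
}
```

With an empty root element the discovered list comes back unchanged, which matches the intent that new users should not need to write anything beyond the minimal file.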
[jira] [Commented] (TIKA-1652) Tika Server should allow config file override from the command line like Tika App
[ https://issues.apache.org/jira/browse/TIKA-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14575993#comment-14575993 ]

Chris A. Mattmann commented on TIKA-1652:
-----------------------------------------

+1, agreed. I'll wrap them both up shortly.

Tika Server should allow config file override from the command line like Tika App
---------------------------------------------------------------------------------

Key: TIKA-1652
URL: https://issues.apache.org/jira/browse/TIKA-1652
Project: Tika
Issue Type: Bug
Components: server
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Fix For: 1.9

Tika-app's TikaCLI allows a command line parameter, --config, to override the Tika config at the command line. For whatever reason, Tika-server doesn't. It should, since the omission leads to a different control flow for how things get created. I first saw this when testing the CTAKESParser (TIKA-1645) in Tika-server.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
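[Editor's sketch] A command line override of the kind requested in this issue comes down to scanning the arguments for a config flag before anything else is constructed. The Java below is a hypothetical sketch: the class ConfigArgSketch is invented for illustration, and the --config=<path> form is taken from the tika-app usage in this thread and is only assumed to carry over to the server.

```java
class ConfigArgSketch {
    static final String CONFIG_PREFIX = "--config=";

    /** Returns the path given via --config=<path>, or null when the flag is absent. */
    static String findConfigPath(String[] args) {
        for (String arg : args) {
            if (arg.startsWith(CONFIG_PREFIX)) {
                return arg.substring(CONFIG_PREFIX.length());
            }
        }
        return null;
    }
}
```

A caller would fall back to the default configuration when the result is null, so that both launchers build their parsers through the same code path.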
[jira] [Commented] (TIKA-1652) Tika Server should allow config file override from the command line like Tika App
[ https://issues.apache.org/jira/browse/TIKA-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14575986#comment-14575986 ]

Tyler Palsulich commented on TIKA-1652:
---------------------------------------

I think this is a duplicate of TIKA-1426?

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Configuring parsers and translators
Hey Tyler,

I hear you, but balance that against all the hidden things here, there, and everywhere that I constantly keep discovering, and having to pore through lines of TikaConfig: service loaders, class loaders. When things work right, no problem. When something goes wrong, it's a HUGE waste of time.

Cheers,
Chris

-----Original Message-----
From: Tyler Palsulich tpalsul...@gmail.com
Reply-To: dev@tika.apache.org dev@tika.apache.org
Date: Saturday, June 6, 2015 at 3:59 PM
To: dev@tika.apache.org dev@tika.apache.org
Subject: Re: Configuring parsers and translators

(Devil's advocate hat slightly on.) My one hesitation about putting it all into tika-config is that the default might get to be a monstrosity -- difficult for new users to use.

Tyler

On Sat, Jun 6, 2015 at 3:48 PM Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote:

I think it would be great to have all this in the Tika Config. The one thing then is to provide an example default config and to make it *hugely* clear, rather than all the levels of indirection that we currently have going on, which make it super hard when there is a config error (SPI, swallowing print messages, etc.)
[jira] [Commented] (TIKA-1645) Extraction of biomedical information using CTAKESParser
[ https://issues.apache.org/jira/browse/TIKA-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14575996#comment-14575996 ]

Chris A. Mattmann commented on TIKA-1645:
-----------------------------------------

I got this working with both tika-app and tika-server. See TIKA-1652 for a needed fix for this to work. I'm going to go ahead and commit this since I fully documented how to install and test it, and it's working well for me. It should be fine for 1.9 since it's not enabled by default and you have to do quite a bit to get it running. I'd love unit tests at some point, but that's not a blocker to getting this great piece of code into 1.9. Thanks for the great work [~gostep] and [~selina]!

Extraction of biomedical information using CTAKESParser
-------------------------------------------------------

Key: TIKA-1645
URL: https://issues.apache.org/jira/browse/TIKA-1645
Project: Tika
Issue Type: Improvement
Components: parser
Reporter: Giuseppe Totaro
Assignee: Chris A. Mattmann
Labels: patch
Fix For: 1.10
Attachments: CTAKESConfig.properties, TIKA-1645.patch, TIKA-1645.v02.patch, tika-config.xml

As mentioned in [TIKA-1642|https://issues.apache.org/jira/browse/TIKA-1642], [CTAKESContentHandler|https://github.com/giuseppetotaro/CTAKESContentHadler] is preliminary work to integrate [Apache cTAKES|http://ctakes.apache.org/] into Tika, allowing users to extract biomedical information from clinical text. Essentially, this work includes a wrapper for CAS serializers that dump the identified annotations into XML-based formats.

Attached is a new patch that includes the CTAKESParser, a new parser that decorates the AutoDetectParser and relies on a new version of CTAKESContentHandler, based on feedback from [TIKA-1642|https://issues.apache.org/jira/browse/TIKA-1642]. This parser generates the same output as AutoDetectParser and, in addition, metadata containing the clinical annotations detected by cTAKES.

To run a cTAKES AnalysisEngine through the Tika CTAKESParser, you first need to install the latest stable release of cTAKES (3.2.2), following the instructions in the [User Install Guide|https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+3.2+User+Install+Guide]. Then, you can launch Tika as follows:

{noformat}
CTAKES_HOME=/usr/local/apache-ctakes-3.2.2
java -cp tika-app-1.10-SNAPSHOT.jar:${CTAKES_HOME}/desc:${CTAKES_HOME}/resources:${CTAKES_HOME}/lib/*:/path/to/CTAKESConfig org.apache.tika.cli.TikaCLI --config=/path/to/tika-config.xml /path/to/input
{noformat}

In the example above, {{/path/to/CTAKESConfig}} is the parent directory of the file {{org/apache/tika/parser/ctakes/CTAKESConfig.properties}}, which contains the configuration properties used to build the cTAKES AnalysisEngine; {{tika-config.xml}} is a custom configuration file for Tika that lists the MIME types that CTAKESParser will parse. Attached are examples of both {{CTAKESConfig.properties}} and {{tika-config.xml}} for parsing ISA-Tab files using cTAKES.

You need [UMLS credentials|https://uts.nlm.nih.gov/home.html] in order to use the UMLS-based components of cTAKES.

I would really appreciate your feedback.

Thanks [~selina], [~chrismattmann] and [~lewismc] for supporting me on this work.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
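[Editor's sketch] The reason the parent directory of org/apache/tika/parser/ctakes/CTAKESConfig.properties goes on the classpath is that the file is then reachable as a classpath resource. The Java below is a minimal sketch of that lookup using only the JDK; it is not the actual CTAKESConfig code. The class name ClasspathPropertiesSketch is hypothetical, and falling back to caller-supplied defaults when the resource is missing is an assumption made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Properties;

class ClasspathPropertiesSketch {
    // Resource path as given in the issue description; no leading slash for ClassLoader lookups.
    static final String RESOURCE = "org/apache/tika/parser/ctakes/CTAKESConfig.properties";

    /**
     * Loads the properties resource from the given class loader. Keys missing
     * from the file, or the whole file being absent, fall back to the defaults.
     */
    static Properties load(ClassLoader loader, Properties defaults) {
        Properties props = new Properties(defaults);
        try (InputStream in = loader.getResourceAsStream(RESOURCE)) {
            if (in != null) {
                props.load(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }
}
```

If the directory is left off the classpath, getResourceAsStream returns null and only the defaults apply, which is one way a misconfigured launch can fail quietly.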
[jira] [Assigned] (TIKA-1645) Extraction of biomedical information using CTAKESParser
[ https://issues.apache.org/jira/browse/TIKA-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann reassigned TIKA-1645:
--------------------------------------

Assignee: Chris A. Mattmann (was: Giuseppe Totaro)

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-1645) Extraction of biomedical information using CTAKESParser
[ https://issues.apache.org/jira/browse/TIKA-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann resolved TIKA-1645.
------------------------------------

Resolution: Fixed
Fix Version/s: 1.9 (was: 1.10)

Contributed! Thanks [~gostep] and [~selina]!

{noformat}
bash-3.2$ svn commit -m "Fix for TIKA-1645 TIKA-1642: Extraction of biomedical information using CTAKESParser contributed by Selina Chu, Giuseppe Totaro and mattmann."
Sending        CHANGES.txt
Sending        tika-bundle/pom.xml
Sending        tika-parsers/pom.xml
Adding         tika-parsers/src/main/java/org/apache/tika/parser/ctakes
Adding         tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESAnnotationProperty.java
Adding         tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESConfig.java
Adding         tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESContentHandler.java
Adding         tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESParser.java
Adding         tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESSerializer.java
Adding         tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESUtils.java
Sending        tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
Transmitting file data ..
Committed revision 1683968.
{noformat}

Please note, improvements are welcome. I know Giuseppe is working on an ExternalParser version of this and some other improvements. Selina is working on unit tests.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-1642) Integrate cTAKES into Tika
[ https://issues.apache.org/jira/browse/TIKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann resolved TIKA-1642.
------------------------------------

Resolution: Fixed
Fix Version/s: 1.9
Assignee: Chris A. Mattmann (was: Giuseppe Totaro)

- fixed!

{noformat}
bash-3.2$ svn commit -m "Fix for TIKA-1645 TIKA-1642: Extraction of biomedical information using CTAKESParser contributed by Selina Chu, Giuseppe Totaro and mattmann."
Sending        CHANGES.txt
Sending        tika-bundle/pom.xml
Sending        tika-parsers/pom.xml
Adding         tika-parsers/src/main/java/org/apache/tika/parser/ctakes
Adding         tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESAnnotationProperty.java
Adding         tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESConfig.java
Adding         tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESContentHandler.java
Adding         tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESParser.java
Adding         tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESSerializer.java
Adding         tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESUtils.java
Sending        tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
Transmitting file data ..
Committed revision 1683968.
{noformat}

Integrate cTAKES into Tika
--------------------------

Key: TIKA-1642
URL: https://issues.apache.org/jira/browse/TIKA-1642
Project: Tika
Issue Type: Improvement
Components: parser
Reporter: Selina Chu
Assignee: Chris A. Mattmann
Fix For: 1.9

[~gostep] has written a preliminary version of [CTAKESContentHandler|https://github.com/giuseppetotaro/CTAKESContentHadler] to integrate [Apache cTAKES|http://ctakes.apache.org/] into Tika. The CTAKESContentHandler allows the following steps to be performed in Tika:
* create an AnalysisEngine based on a given XML descriptor;
* create a CAS (Common Analysis System) appropriate for this AnalysisEngine;
* populate the CAS with the text extracted by Tika;
* run the AnalysisEngine against the plain text added to the CAS;
* write out the results in the given format (XML, XCAS, XMI, etc.).

It would be a great improvement if we could parse the output of cTAKES and create a list of metadata which describes the terms found in the annotation index and their corresponding tokens. For instance, using the AggregatePlaintextFastUMLSProcessor analysis engine, we can utilize the UMLS database to obtain the annotations related to DiseaseDisorderMention, and I would like to be able to produce a list of the words in the input text that are annotated as DiseaseDisorderMention.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
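[Editor's sketch] The improvement requested in TIKA-1642, a list of the words annotated with a given type such as DiseaseDisorderMention, amounts to grouping the text covered by each annotation span under its type name. The Java below is a sketch under assumptions: SpanAnnotation and AnnotationMetadataSketch are hypothetical stand-ins using plain begin/end character offsets, not the UIMA or cTAKES annotation classes, and no analysis engine is actually run.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical annotation: a type name plus begin (inclusive) and end (exclusive) offsets. */
class SpanAnnotation {
    final String type;
    final int begin;
    final int end;

    SpanAnnotation(String type, int begin, int end) {
        this.type = type;
        this.begin = begin;
        this.end = end;
    }
}

class AnnotationMetadataSketch {

    /** Groups the text covered by each annotation under its type, in encounter order. */
    static Map<String, List<String>> coveredTextByType(String text, List<SpanAnnotation> annotations) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (SpanAnnotation a : annotations) {
            result.computeIfAbsent(a.type, k -> new ArrayList<>())
                  .add(text.substring(a.begin, a.end));
        }
        return result;
    }
}
```

Each map entry could then be written out as one multi-valued metadata field keyed by the annotation type, which is one plausible shape for the metadata the issue asks for.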
[jira] [Commented] (TIKA-1642) Integrate cTAKES into Tika
[ https://issues.apache.org/jira/browse/TIKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14576051#comment-14576051 ]

Hudson commented on TIKA-1642:
------------------------------

ABORTED: Integrated in tika-trunk-jdk1.7 #734 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/734/])
Fix for TIKA-1645 TIKA-1642: Extraction of biomedical information using CTAKESParser contributed by Selina Chu, Giuseppe Totaro and mattmann. (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1683968)
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-bundle/pom.xml
* /tika/trunk/tika-parsers/pom.xml
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ctakes
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESAnnotationProperty.java
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESConfig.java
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESContentHandler.java
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESParser.java
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESSerializer.java
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESUtils.java
* /tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1652) Tika Server should allow config file override from the command line like Tika App
[ https://issues.apache.org/jira/browse/TIKA-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14576053#comment-14576053 ]

Hudson commented on TIKA-1652:
------------------------------

ABORTED: Integrated in tika-trunk-jdk1.7 #734 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/734/])
Fix for TIKA-1652, TIKA-1426: Tika Server should allow config file override from the command line like Tika App (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1683966)
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaServerCli.java

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1426) Let's allow users to specify a tika config file on the commandline for tika-app and tika-server
[ https://issues.apache.org/jira/browse/TIKA-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14576052#comment-14576052 ]

Hudson commented on TIKA-1426:
------------------------------

ABORTED: Integrated in tika-trunk-jdk1.7 #734 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/734/])
Fix for TIKA-1652, TIKA-1426: Tika Server should allow config file override from the command line like Tika App (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1683966)
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaServerCli.java

Let's allow users to specify a tika config file on the commandline for tika-app and tika-server
-----------------------------------------------------------------------------------------------

Key: TIKA-1426
URL: https://issues.apache.org/jira/browse/TIKA-1426
Project: Tika
Issue Type: Improvement
Components: cli, server
Reporter: Tim Allison
Assignee: Chris A. Mattmann
Priority: Minor
Fix For: 1.9

It would be handy to be able to specify a tika-config file when using tika-app and tika-server. I added this capability to tika-app as part of TIKA-1418. I should have opened a separate issue at the time (mea culpa). This present issue covers both tika-app and tika-server.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1645) Extraction of biomedical information using CTAKESParser
[ https://issues.apache.org/jira/browse/TIKA-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14576054#comment-14576054 ]

Hudson commented on TIKA-1645:
------------------------------

ABORTED: Integrated in tika-trunk-jdk1.7 #734 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/734/])
Fix for TIKA-1645 TIKA-1642: Extraction of biomedical information using CTAKESParser contributed by Selina Chu, Giuseppe Totaro and mattmann. (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1683968)
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-bundle/pom.xml
* /tika/trunk/tika-parsers/pom.xml
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ctakes
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESAnnotationProperty.java
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESConfig.java
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESContentHandler.java
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESParser.java
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESSerializer.java
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESUtils.java
* /tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Configuring parsers and translators
(Devil's advocate hat slightly on.) My one hesitation about putting it all into tika-config is that the default might get to be a monstrosity -- difficult for new users to use. Tyler On Sat, Jun 6, 2015 at 3:48 PM Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: I think it would be great to have all this in the Tika Config. The one thing then is to provide an example default config and to make it *hugely* clear rather than all the levels of indirection that we currently have going on which makes it super hard when there is a config error (SPI, swallowing print messages, etc.) ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: Tyler Palsulich tpalsul...@gmail.com Reply-To: dev@tika.apache.org dev@tika.apache.org Date: Saturday, June 6, 2015 at 3:45 PM To: dev@tika.apache.org dev@tika.apache.org Subject: Re: Configuring parsers and translators Hi Nick, I've been mulling this over since you sent the first message. But, I'm afraid I don't have a good solution or developed ideas. I agree, it would be very nice to consolidate all configuration for all parsers in the server and app. Is it feasible to put everything into tika-config? Then Parser implementations would read the config to pull out their own configuration. Or, would it be better to keep some configuration separate? Documentation would be an issue if every parser defines its own metadata keys... But, it might be an improvement since we don't have free form properties and configuration files. Tyler On Sat, Jun 6, 2015 at 12:30 PM Nick Burch apa...@gagravarr.org wrote: Anyone have any thoughts on this? 
On Fri, 8 May 2015, Nick Burch wrote: Hi All This came up in TIKA-1623, but I thought it might be better brought out to the list for discussion To configure parsers on a per-document basis, such as setting PDF spacing tolerances, or telling Tesseract what language it should be OCRing for, we have the *Config objects. You create one of these, use the setters to configure it for your document, pop it onto the Parse context and it's used when processing your document To configure parsers and translators on a per-JVM basis, to apply to all documents processed, it's a bit less consistent. At least some look for a properties file with a specific name, usually in the tika namespace, and grab their settings / keys / etc out of that. At least some expect to find a *Config with their program path on it, even though that remains constant between documents. None of them support getting their settings from the Tika Config As part of our evolution of parser preferences, we're moving towards people either being able to set their preferences in code, or being able to supply a Tika Config xml which sets their parser preferences or overrides certain bits of the default. The code option works for people who want to declare certain specific things, the Tika Config one gives the same functionality but allows a consistent and clean way to set it between Tika App, Tika Server and java code. Another related example is the External Parser support. Because you can have multiple External Parser instances in your setup, one per format / program, we look for all the org/apache/tika/parser/external/tika-external-parsers.xml files on the classpath, and create parser instances based on definitions in there What do we think about setting executable paths and keys/logins for parsers like OCR, Strings, Translators etc? Always on ParseContext? Properties? Custom xml config? Tika config xml? Other? Combination? Nick
[jira] [Resolved] (TIKA-1426) Let's allow users to specify a tika config file on the commandline for tika-app and tika-server
[ https://issues.apache.org/jira/browse/TIKA-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann resolved TIKA-1426.
-
Resolution: Fixed
Fix Version/s: (was: 1.10) 1.9
Assignee: Chris A. Mattmann

- Fixed:
{noformat}
bash-3.2$ svn commit -m "Fix for TIKA-1652, TIKA-1426: Tika Server should allow config file override from the command line like Tika App" CHANGES.txt tika-server/src/main/java/org/apache/tika/server/TikaServerCli.java
Sending        CHANGES.txt
Sending        tika-server/src/main/java/org/apache/tika/server/TikaServerCli.java
Transmitting file data ..
Committed revision 1683966.
bash-3.2$
{noformat}

Let's allow users to specify a tika config file on the commandline for tika-app and tika-server
---
Key: TIKA-1426
URL: https://issues.apache.org/jira/browse/TIKA-1426
Project: Tika
Issue Type: Improvement
Components: cli, server
Reporter: Tim Allison
Assignee: Chris A. Mattmann
Priority: Minor
Fix For: 1.9

It would be handy to be able to specify a tika-config file when using tika-app and tika-server. I added this capability to tika-app as part of TIKA-1418. I should have opened a separate issue at the time (mea culpa). This present issue covers both tika-app and tika-server.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: [VOTE] Release Apache Tika 1.9 Candidate #1
Hey Chris, On 1 Jun 2015, at 06:38, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: Please vote on releasing this package as Apache Tika 1.9. The vote is open for the next 72 hours and passes if a majority of at least three +1 Tika PMC votes are cast. [ ] +1 Release this package as Apache Tika 1.9 [ ] -1 Do not release this package because… Thanks for preparing this, lots of great stuff in this one. +1 from me. Cheers, Dave
Re: svn commit: r1683969 - /tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
Also the lovely thing here too is that since cTAKESParser is a decorator for AutoDetectParser there is magical infinite recursion if it’s enabled via SPI. TODO: make this a LOT cleaner in 1.10+.

++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++

-Original Message-
From: jpluser mattm...@apache.org
Reply-To: dev@tika.apache.org dev@tika.apache.org
Date: Saturday, June 6, 2015 at 5:50 PM
To: comm...@tika.apache.org comm...@tika.apache.org
Subject: svn commit: r1683969 - /tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser

Author: mattmann
Date: Sun Jun 7 00:50:23 2015
New Revision: 1683969
URL: http://svn.apache.org/r1683969
Log: CTAKESParser: don't enable via SPI since enabled via config.

Modified:
tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser

Modified: tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
URL: http://svn.apache.org/viewvc/tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser?rev=1683969&r1=1683968&r2=1683969&view=diff
==
--- tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser (original)
+++ tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser Sun Jun 7 00:50:23 2015
@@ -65,4 +65,3 @@ org.apache.tika.parser.isatab.ISArchiveP
 org.apache.tika.parser.geoinfo.GeographicInformationParser
 org.apache.tika.parser.geo.topic.GeoParser
 org.apache.tika.parser.external.CompositeExternalParser
-org.apache.tika.parser.ctakes.CTAKESParser
\ No newline at end of file
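For anyone wondering how a decorator ends up recursing, here is a minimal sketch of one plausible mechanism, assuming the wrapped auto-detecting parser instantiates every parser it discovers. All types below (Parser, Factory, AutoDetect, Decorator) are hypothetical stand-ins, not Tika's classes, and the depth guard exists only so the sketch fails fast instead of overflowing the stack.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-ins only; none of these are Tika's real classes. */
final class DecoratorRecursionSketch {
    static final int MAX_DEPTH = 50; // guard so the demo fails fast instead of overflowing the stack

    interface Parser { String parse(String text); }

    /** Plays the role of a service entry: it can build a parser on demand. */
    interface Factory { Parser create(List<Factory> registry, int depth); }

    static final class PlainParser implements Parser {
        @Override public String parse(String text) { return "plain:" + text; }
    }

    /** Instantiates every registered parser, the way a discovery-based composite would. */
    static final class AutoDetect implements Parser {
        private final List<Parser> delegates = new ArrayList<>();

        AutoDetect(List<Factory> registry, int depth) {
            if (depth > MAX_DEPTH) {
                throw new IllegalStateException("runaway parser construction");
            }
            for (Factory f : registry) {
                delegates.add(f.create(registry, depth + 1));
            }
        }

        @Override public String parse(String text) {
            if (delegates.isEmpty()) throw new IllegalStateException("no parsers registered");
            return delegates.get(0).parse(text);
        }
    }

    /** Decorates AutoDetect; if it is itself in the registry, construction never bottoms out. */
    static final class Decorator implements Parser {
        private final Parser wrapped;

        Decorator(List<Factory> registry, int depth) {
            this.wrapped = new AutoDetect(registry, depth);
        }

        @Override public String parse(String text) { return wrapped.parse(text) + " +annotations"; }
    }

    /** Returns true when the decorator can be built without hitting the depth guard. */
    static boolean buildsCleanly(boolean decoratorInRegistry) {
        List<Factory> registry = new ArrayList<>();
        registry.add((r, d) -> new PlainParser());
        if (decoratorInRegistry) {
            registry.add((r, d) -> new Decorator(r, d));
        }
        try {
            new Decorator(registry, 0);
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("decorator outside registry: " + buildsCleanly(false));
        System.out.println("decorator inside registry:  " + buildsCleanly(true));
    }
}
```

Under these assumptions, keeping the decorator out of the discovered list (as the commit above does for the services file) is what breaks the cycle.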
Re: Configuring parsers and translators
I think it would be great to have all this in the Tika Config. The one thing then is to provide an example default config and to make it *hugely* clear rather than all the levels of indirection that we currently have going on which makes it super hard when there is a config error (SPI, swallowing print messages, etc.) ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: Tyler Palsulich tpalsul...@gmail.com Reply-To: dev@tika.apache.org dev@tika.apache.org Date: Saturday, June 6, 2015 at 3:45 PM To: dev@tika.apache.org dev@tika.apache.org Subject: Re: Configuring parsers and translators Hi Nick, I've been mulling this over since you sent the first message. But, I'm afraid I don't have a good solution or developed ideas. I agree, it would be very nice to consolidate all configuration for all parsers in the server and app. Is it feasible to put everything into tika-config? Then Parser implementations would read the config to pull out their own configuration. Or, would it be better to keep some configuration separate? Documentation would be an issue if every parser defines its own metadata keys... But, it might be an improvement since we don't have free form properties and configuration files. Tyler On Sat, Jun 6, 2015 at 12:30 PM Nick Burch apa...@gagravarr.org wrote: Anyone have any thoughts on this? On Fri, 8 May 2015, Nick Burch wrote: Hi All This came up in TIKA-1623, but I thought it might be better brought out to the list for discussion To configure parsers on a per-document basis, such as setting PDF spacing tolerances, or telling Tesseract what language it should be OCRing for, we have the *Config objects. 
You create one of these, use the setters to configure it for your document, pop it onto the Parse context and it's used when processing your document To configure parsers and translators on a per-JVM basis, to apply to all documents processed, it's a bit less consistent. At least some look for a properties file with a specific name, usually in the tika namespace, and grab their settings / keys / etc out of that. At least some expect to find a *Config with their program path on it, even though that remains constant between documents. None of them support getting their settings from the Tika Config As part of our evolution of parser preferences, we're moving towards people either being able to set their preferences in code, or being able to supply a Tika Config xml which sets their parser preferences or overrides certain bits of the default. The code option works for people who want to declare certain specific things, the Tika Config one gives the same functionality but allows a consistent and clean way to set it between Tika App, Tika Server and java code. Another related example is the External Parser support. Because you can have multiple External Parser instances in your setup, one per format / program, we look for all the org/apache/tika/parser/external/tika-external-parsers.xml files on the classpath, and create parser instances based on definitions in there What do we think about setting executable paths and keys/logins for parsers like OCR, Strings, Translators etc? Always on ParseContext? Properties? Custom xml config? Tika config xml? Other? Combination? Nick
[jira] [Commented] (TIKA-1645) Extraction of biomedical information using CTAKESParser
[ https://issues.apache.org/jira/browse/TIKA-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14575997#comment-14575997 ]

Chris A. Mattmann commented on TIKA-1645:
-
Documentation: https://wiki.apache.org/tika/cTAKESParser

Extraction of biomedical information using CTAKESParser
---
Key: TIKA-1645
URL: https://issues.apache.org/jira/browse/TIKA-1645
Project: Tika
Issue Type: Improvement
Components: parser
Reporter: Giuseppe Totaro
Assignee: Chris A. Mattmann
Labels: patch
Fix For: 1.10
Attachments: CTAKESConfig.properties, TIKA-1645.patch, TIKA-1645.v02.patch, tika-config.xml

As mentioned in [TIKA-1642|https://issues.apache.org/jira/browse/TIKA-1642], [CTAKESContentHandler|https://github.com/giuseppetotaro/CTAKESContentHadler] is preliminary work to integrate [Apache cTAKES|http://ctakes.apache.org/] into Tika, allowing users to extract biomedical information from clinical text. Essentially, this work includes a wrapper for CAS serializers that dump the identified annotations into XML-based formats.

Attached is a new patch that includes the CTAKESParser, a new parser that decorates the AutoDetectParser and relies on a new version of CTAKESContentHandler, based on feedback from [TIKA-1642|https://issues.apache.org/jira/browse/TIKA-1642]. This parser generates the same output as AutoDetectParser and, in addition, metadata containing the clinical annotations detected by cTAKES.

To run a cTAKES AnalysisEngine through the Tika CTAKESParser, you first need to install the latest stable release of cTAKES (3.2.2), following the instructions in the [User Install Guide|https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+3.2+User+Install+Guide]. Then, you can launch Tika as follows:
{noformat}
CTAKES_HOME=/usr/local/apache-ctakes-3.2.2
java -cp tika-app-1.10-SNAPSHOT.jar:${CTAKES_HOME}/desc:${CTAKES_HOME}/resources:${CTAKES_HOME}/lib/*:/path/to/CTAKESConfig org.apache.tika.cli.TikaCLI --config=/path/to/tika-config.xml /path/to/input
{noformat}
In the example above, {{/path/to/CTAKESConfig}} is the parent directory of the file {{org/apache/tika/parser/ctakes/CTAKESConfig.properties}}, which contains the configuration properties used to build the cTAKES AnalysisEngine; {{tika-config.xml}} is a custom configuration file for Tika that lists the mimetypes for which CTAKESParser will perform parsing.

Attached are examples of both {{CTAKESConfig.properties}} and {{tika-config.xml}} for parsing ISA-Tab files using cTAKES. You need [UMLS credentials|https://uts.nlm.nih.gov/home.html] in order to use the UMLS-based components of cTAKES.

I would really appreciate your feedback. Thanks [~selina], [~chrismattmann] and [~lewismc] for supporting me on this work.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-1652) Tika Server should allow config file override from the command line like Tika App
[ https://issues.apache.org/jira/browse/TIKA-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann resolved TIKA-1652.
-
Resolution: Fixed

- Fixed:
{noformat}
bash-3.2$ svn commit -m "Fix for TIKA-1652, TIKA-1426: Tika Server should allow config file override from the command line like Tika App" CHANGES.txt tika-server/src/main/java/org/apache/tika/server/TikaServerCli.java
Sending        CHANGES.txt
Sending        tika-server/src/main/java/org/apache/tika/server/TikaServerCli.java
Transmitting file data ..
Committed revision 1683966.
bash-3.2$
{noformat}

Note that when the Tika config is specified on the command line, the parser named in tika-config.xml (or a DefaultParser if none is named) is created first. With the environment variable and/or system property approach, the DefaultParser is created directly, regardless of what the config says. This can cause big-time havoc if, say, you have a parser that decorates AutoDetectParser like cTAKESParser does. In fact, the only way for it to work correctly with SPI and all the surrounding config magic is to specify the config from the command line, which this fix enables.

Tika Server should allow config file override from the command line like Tika App
-
Key: TIKA-1652
URL: https://issues.apache.org/jira/browse/TIKA-1652
Project: Tika
Issue Type: Bug
Components: server
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Fix For: 1.9

Tika-app's TikaCLI allows a command line parameter, --config, to override the Tika config at the command line. For whatever reason, Tika-server doesn't. It should, since the override causes a different control flow for things to get created. I first saw this when testing the CTAKESParser (TIKA-1645) in Tika-server.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
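To make the three config sources in the note above concrete, here is a rough sketch of how a launcher might pick between them. This is not the code in TikaServerCli: the class and method names are hypothetical, the property name tika.config and variable name TIKA_CONFIG are assumptions, and so is the precedence order. Only the --config=/path form is taken from the TikaCLI invocation quoted elsewhere in this thread.

```java
import java.util.Collections;
import java.util.Map;

/** Hypothetical helper; names and precedence are assumptions, not Tika's actual behaviour. */
final class ConfigSourceSketch {
    enum Source { COMMAND_LINE, SYSTEM_PROPERTY, ENVIRONMENT, DEFAULTS }

    static final class Choice {
        final Source source;
        final String path; // null when falling back to built-in defaults

        Choice(Source source, String path) {
            this.source = source;
            this.path = path;
        }
    }

    /** Picks the first config location found: command line, then property, then environment. */
    static Choice choose(String[] args, Map<String, String> props, Map<String, String> env) {
        String flag = "--config=";
        for (String arg : args) {
            if (arg.startsWith(flag)) {
                return new Choice(Source.COMMAND_LINE, arg.substring(flag.length()));
            }
        }
        String fromProperty = props.get("tika.config"); // assumed property name
        if (fromProperty != null) {
            return new Choice(Source.SYSTEM_PROPERTY, fromProperty);
        }
        String fromEnvironment = env.get("TIKA_CONFIG"); // assumed variable name
        if (fromEnvironment != null) {
            return new Choice(Source.ENVIRONMENT, fromEnvironment);
        }
        return new Choice(Source.DEFAULTS, null);
    }

    public static void main(String[] args) {
        Choice c = choose(args, Collections.<String, String>emptyMap(), System.getenv());
        System.out.println(c.source + " -> " + c.path);
    }
}
```

Returning the source alongside the path matters here because, per the note, which source supplied the config changes which parser gets constructed first.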
[VOTE] Release Apache Tika 1.9 Candidate #2
Hi Folks, A second candidate for the Tika 1.9 release is available at: https://dist.apache.org/repos/dist/dev/tika/ The release candidate is a zip archive of the sources in: http://svn.apache.org/repos/asf/tika/tags/1.9-rc2/ The SHA1 checksum of the archive is 9b78c9e9ce9640b402b7fef8e30f3cdbe384f44c. In addition, a staged maven repository is available here: https://repository.apache.org/content/repositories/orgapachetika-1011/ Please vote on releasing this package as Apache Tika 1.9. The vote is open for the next 72 hours and passes if a majority of at least three +1 Tika PMC votes are cast. [ ] +1 Release this package as Apache Tika 1.9 [ ] -1 Do not release this package because… Cheers, Chris P.S. Of course here is my +1. ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
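For voters who want to check the SHA1 above without extra tooling, a small sketch using only the JDK follows. The class name is made up for illustration, and the archive path and expected checksum are passed in as arguments rather than assumed.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Illustrative checksum helper; not part of the release tooling. */
final class Sha1CheckSketch {

    /** Lower-case hex encoding of a byte array. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    /** SHA-1 digest of the given bytes as lower-case hex. */
    static String sha1Hex(byte[] data) {
        try {
            return toHex(MessageDigest.getInstance("SHA-1").digest(data));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available in this JVM", e);
        }
    }

    /** Reads the whole archive into memory (fine for a source zip) and compares checksums. */
    static boolean matches(Path archive, String expectedHex) throws IOException {
        return sha1Hex(Files.readAllBytes(archive)).equalsIgnoreCase(expectedHex.trim());
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: Sha1CheckSketch <archive> <expected-sha1>");
            return;
        }
        System.out.println(matches(Paths.get(args[0]), args[1]) ? "checksum OK" : "checksum MISMATCH");
    }
}
```

Run it with the downloaded archive as the first argument and the checksum from this mail as the second.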
[jira] [Created] (TIKA-1652) Tika Server should allow config file override from the command line like Tika App
Chris A. Mattmann created TIKA-1652:
---
Summary: Tika Server should allow config file override from the command line like Tika App
Key: TIKA-1652
URL: https://issues.apache.org/jira/browse/TIKA-1652
Project: Tika
Issue Type: Bug
Components: server
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Fix For: 1.9

Tika-app's TikaCLI allows a command line parameter, --config, to override the Tika config at the command line. For whatever reason, Tika-server doesn't. It should, since the override causes a different control flow for things to get created. I first saw this when testing the CTAKESParser (TIKA-1645) in Tika-server.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Configuring parsers and translators
Anyone have any thoughts on this? On Fri, 8 May 2015, Nick Burch wrote: Hi All This came up in TIKA-1623, but I thought it might be better brought out to the list for discussion To configure parsers on a per-document basis, such as setting PDF spacing tolerances, or telling Tesseract what language it should be OCRing for, we have the *Config objects. You create one of these, use the setters to configure it for your document, pop it onto the Parse context and it's used when processing your document To configure parsers and translators on a per-JVM basis, to apply to all documents processed, it's a bit less consistent. At least some look for a properties file with a specific name, usually in the tika namespace, and grab their settings / keys / etc out of that. At least some expect to find a *Config with their program path on it, even though that remains constant between documents. None of them support getting their settings from the Tika Config As part of our evolution of parser preferences, we're moving towards people either being able to set their preferences in code, or being able to supply a Tika Config xml which sets their parser preferences or overrides certain bits of the default. The code option works for people who want to declare certain specific things, the Tika Config one gives the same functionality but allows a consistent and clean way to set it between Tika App, Tika Server and java code. Another related example is the External Parser support. Because you can have multiple External Parser instances in your setup, one per format / program, we look for all the org/apache/tika/parser/external/tika-external-parsers.xml files on the classpath, and create parser instances based on definitions in there What do we think about setting executable paths and keys/logins for parsers like OCR, Strings, Translators etc? Always on ParseContext? Properties? Custom xml config? Tika config xml? Other? Combination? Nick
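As a point of reference for the options listed above, the per-document pattern (create a config, set it up, pop it onto the context) combined with a per-JVM fallback could look roughly like the sketch below. Everything here is a hypothetical stand-in: OcrConfig and Context are not Tika's classes, and the "eng" default is invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-ins to illustrate the pattern; not Tika's ParseContext or *Config classes. */
final class PerDocumentConfigSketch {

    /** Settings object with setters, in the spirit of the "*Config" objects. */
    static final class OcrConfig {
        private String language = "eng"; // invented default

        String getLanguage() { return language; }

        void setLanguage(String language) { this.language = language; }
    }

    /** Type-keyed holder for per-document objects. */
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) { entries.put(key, value); }

        <T> T get(Class<T> key, T fallback) {
            Object value = entries.get(key);
            return value == null ? fallback : key.cast(value);
        }
    }

    /** A parser would use the per-document config if present, else the JVM-wide one. */
    static String ocrLanguageFor(Context context, OcrConfig jvmDefault) {
        return context.get(OcrConfig.class, jvmDefault).getLanguage();
    }

    public static void main(String[] args) {
        OcrConfig jvmDefault = new OcrConfig(); // e.g. loaded once from a central config

        OcrConfig french = new OcrConfig();
        french.setLanguage("fra");
        Context perDocument = new Context();
        perDocument.set(OcrConfig.class, french);

        System.out.println(ocrLanguageFor(perDocument, jvmDefault));
        System.out.println(ocrLanguageFor(new Context(), jvmDefault));
    }
}
```

The sketch leaves open where the JVM-wide default comes from (properties file, custom XML, or Tika config XML), which is the question this thread is asking.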