tika-trunk-jdk1.7 - Build # 733 - Failure

2015-06-06 Thread Apache Jenkins Server
The Apache Jenkins build system has built tika-trunk-jdk1.7 (build #733)

Status: Failure

Check console output at https://builds.apache.org/job/tika-trunk-jdk1.7/733/ to 
view the results.

Re: tika-trunk-jdk1.7 - Build # 733 - Failure

2015-06-06 Thread Mattmann, Chris A (3980)
This was due to the SVN issues that infra was dealing
with last night.

I’ll go ahead and spin RC #2 shortly.

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++




-Original Message-
From: Apache Jenkins Server jenk...@builds.apache.org
Reply-To: dev@tika.apache.org dev@tika.apache.org
Date: Saturday, June 6, 2015 at 1:00 AM
To: dev@tika.apache.org dev@tika.apache.org
Subject: tika-trunk-jdk1.7 - Build # 733 - Failure

The Apache Jenkins build system has built tika-trunk-jdk1.7 (build #733)

Status: Failure

Check console output at
https://builds.apache.org/job/tika-trunk-jdk1.7/733/ to view the results.



Re: Configuring parsers and translators

2015-06-06 Thread Tyler Palsulich
Hi Nick,

I've been mulling this over since you sent the first message. But, I'm
afraid I don't have a good solution or developed ideas.

I agree, it would be very nice to consolidate all configuration for all
parsers in the server and app.

Is it feasible to put everything into tika-config? Then Parser
implementations would read the config to pull out their own configuration.
Or, would it be better to keep some configuration separate? Documentation
would be an issue if every parser defines its own metadata keys... But, it
might be an improvement since we don't have free form properties and
configuration files.

Tyler

On Sat, Jun 6, 2015 at 12:30 PM Nick Burch apa...@gagravarr.org wrote:

> Anyone have any thoughts on this?
>
> On Fri, 8 May 2015, Nick Burch wrote:
>> Hi All
>>
>> This came up in TIKA-1623, but I thought it might be better brought out
>> to the list for discussion
>>
>> To configure parsers on a per-document basis, such as setting PDF
>> spacing tolerances, or telling Tesseract what language it should be
>> OCRing for, we have the *Config objects. You create one of these, use
>> the setters to configure it for your document, pop it onto the Parse
>> context and it's used when processing your document
>>
>> To configure parsers and translators on a per-JVM basis, to apply to all
>> documents processed, it's a bit less consistent. At least some look for
>> a properties file with a specific name, usually in the tika namespace,
>> and grab their settings / keys / etc out of that. At least some expect
>> to find a *Config with their program path on it, even though that
>> remains constant between documents. None of them support getting their
>> settings from the Tika Config
>>
>>
>> As part of our evolution of parser preferences, we're moving towards
>> people either being able to set their preferences in code, or being able
>> to supply a Tika Config xml which sets their parser preferences or
>> overrides certain bits of the default. The code option works for people
>> who want to declare certain specific things, the Tika Config one gives
>> the same functionality but allows a consistent and clean way to set it
>> between Tika App, Tika Server and java code.
>>
>> Another related example is the External Parser support. Because you can
>> have multiple External Parser instances in your setup, one per format /
>> program, we look for all the
>> org/apache/tika/parser/external/tika-external-parsers.xml files on the
>> classpath, and create parser instances based on definitions in there
>>
>>
>> What do we think about setting executable paths and keys/logins for
>> parsers like OCR, Strings, Translators etc? Always on ParseContext?
>> Properties? Custom xml config? Tika config xml? Other? Combination?
>>
>> Nick
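
The per-document pattern described in the quoted message (create a config object, call its setters, put it on the parse context, let the parser pick it up) can be sketched in plain Java. This is only an illustration under assumptions, not Tika code: OcrSettings and DocumentContext are hypothetical stand-ins for a *Config object and the Parse context, and the default language value is made up.

```java
import java.util.HashMap;
import java.util.Map;

final class PerDocumentConfigSketch {

    /** Hypothetical per-document settings object (stand-in for a *Config class). */
    static final class OcrSettings {
        private String language = "eng"; // assumed default, for illustration only

        String getLanguage() { return language; }

        void setLanguage(String language) { this.language = language; }
    }

    /** Hypothetical type-keyed context (stand-in for the Parse context). */
    static final class DocumentContext {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) { entries.put(key, value); }

        <T> T get(Class<T> key, T fallback) {
            Object value = entries.get(key);
            return value == null ? fallback : key.cast(value);
        }
    }

    /** What a parser would do: look up its settings, falling back to defaults. */
    static String languageFor(DocumentContext context) {
        return context.get(OcrSettings.class, new OcrSettings()).getLanguage();
    }

    public static void main(String[] args) {
        DocumentContext context = new DocumentContext();
        System.out.println(languageFor(context)); // no settings supplied: default

        OcrSettings settings = new OcrSettings();
        settings.setLanguage("fra");
        context.set(OcrSettings.class, settings);
        System.out.println(languageFor(context)); // per-document override
    }
}
```

Keying the context by class means each parser can fetch its own settings without knowing about anyone else's, which is also why settings that never change between documents (program paths, logins) sit awkwardly in it.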
 



Re: Configuring parsers and translators

2015-06-06 Thread Nick Burch

On Sat, 6 Jun 2015, Tyler Palsulich wrote:
> (Devil's advocate hat slightly on.) My one hesitation about putting it
> all into tika-config is that the default might get to be a monstrosity
> -- difficult for new users to use.


Assuming you don't want any translators, and have no non-standard paths to 
external parsers, and are happy with default parser orderings, then your 
default config would be:


<properties/>

(The plan so far is still to use the service loader to find parsers, 
detectors and friends, with the config only being used when you want to 
override parsers or parser orderings)
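
The split described here (discovery by default, config only as an override) can be sketched with the JDK's java.util.ServiceLoader. This is a rough illustration under assumptions, not how TikaConfig is implemented: Handler is a hypothetical service interface, and "an explicit list replaces discovery" is just one possible override rule.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

final class DiscoverySketch {

    /** Hypothetical service interface; stands in for a parser-like plugin. */
    interface Handler {
        String name();
    }

    /** Names of every Handler registered through META-INF/services on the classpath. */
    static List<String> discovered() {
        List<String> names = new ArrayList<>();
        for (Handler handler : ServiceLoader.load(Handler.class)) {
            names.add(handler.name());
        }
        return names;
    }

    /** An explicit, non-empty configured list wins; otherwise fall back to discovery. */
    static List<String> effective(List<String> configured) {
        return configured.isEmpty() ? discovered() : configured;
    }

    public static void main(String[] args) {
        List<String> none = new ArrayList<>();
        System.out.println("default config -> " + effective(none));

        List<String> explicit = new ArrayList<>();
        explicit.add("pdf");
        explicit.add("ocr");
        System.out.println("override config -> " + effective(explicit));
    }
}
```

With this shape, an empty config leaves behaviour entirely to whatever is registered on the classpath, which is what keeps the default config as small as the one shown above.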



My main worry with putting it all into config xml is that we accidentally 
end up re-inventing spring badly...


Nick


[jira] [Commented] (TIKA-1652) Tika Server should allow config file override from the command line like Tika App

2015-06-06 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14575993#comment-14575993
 ] 

Chris A. Mattmann commented on TIKA-1652:
-

+1, agreed. I'll wrap them both up shortly.

 Tika Server should allow config file override from the command line like Tika 
 App
 -

 Key: TIKA-1652
 URL: https://issues.apache.org/jira/browse/TIKA-1652
 Project: Tika
  Issue Type: Bug
  Components: server
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.9


 Tika-app's TikaCLI allows a command line parameter, --config, to override the 
 Tika config at the command line. For whatever reason, Tika-server doesn't. It 
 should, since it causes a different control flow for things to get created. I 
 first saw this when testing the CTAKESParser (TIKA-1645) in Tika-server.
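
The request amounts to pulling a config path out of the server's argument list before anything else gets created. A minimal sketch follows, assuming the same --config=<path> spelling used in the tika-app invocation quoted in TIKA-1645. ConfigArgSketch and configPath are hypothetical names, and this is not the code in TikaServerCli.

```java
final class ConfigArgSketch {

    private static final String PREFIX = "--config=";

    /** Returns the path passed as --config=<path>, or null when the flag is absent. */
    static String configPath(String[] args) {
        for (String arg : args) {
            if (arg.startsWith(PREFIX)) {
                return arg.substring(PREFIX.length());
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String path = configPath(args);
        if (path == null) {
            System.out.println("No --config given, using the default configuration");
        } else {
            System.out.println("Loading configuration from " + path);
        }
    }
}
```

Resolving the flag first and building everything else from the resulting configuration keeps the app and the server on the same control flow.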



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1652) Tika Server should allow config file override from the command line like Tika App

2015-06-06 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14575986#comment-14575986
 ] 

Tyler Palsulich commented on TIKA-1652:
---

I think this is a duplicate of TIKA-1426?



Re: Configuring parsers and translators

2015-06-06 Thread Mattmann, Chris A (3980)
Hey Tyler,

I hear you, but balance that against all the hidden things here
and there, and everywhere, that I constantly keep discovering and
having to pore through lines of TikaConfig - service loaders, class
loaders.

When things work right, no problem. When something goes wrong,
it's a HUGE waste of time.

Cheers,
Chris





-Original Message-
From: Tyler Palsulich tpalsul...@gmail.com
Reply-To: dev@tika.apache.org dev@tika.apache.org
Date: Saturday, June 6, 2015 at 3:59 PM
To: dev@tika.apache.org dev@tika.apache.org
Subject: Re: Configuring parsers and translators

(Devil's advocate hat slightly on.) My one hesitation about putting it all
into tika-config is that the default might get to be a monstrosity --
difficult for new users to use.

Tyler

On Sat, Jun 6, 2015 at 3:48 PM Mattmann, Chris A (3980) 
chris.a.mattm...@jpl.nasa.gov wrote:

> I think it would be great to have all this in the Tika Config.
>
> The one thing then is to provide an example default config and
> to make it *hugely* clear rather than all the levels of indirection
> that we currently have going on which makes it super hard when
> there is a config error (SPI, swallowing print messages, etc.)






 -Original Message-
 From: Tyler Palsulich tpalsul...@gmail.com
 Reply-To: dev@tika.apache.org dev@tika.apache.org
 Date: Saturday, June 6, 2015 at 3:45 PM
 To: dev@tika.apache.org dev@tika.apache.org
 Subject: Re: Configuring parsers and translators

> Hi Nick,
>
> I've been mulling this over since you sent the first message. But, I'm
> afraid I don't have a good solution or developed ideas.
>
> I agree, it would be very nice to consolidate all configuration for all
> parsers in the server and app.
>
> Is it feasible to put everything into tika-config? Then Parser
> implementations would read the config to pull out their own configuration.
> Or, would it be better to keep some configuration separate? Documentation
> would be an issue if every parser defines its own metadata keys... But, it
> might be an improvement since we don't have free form properties and
> configuration files.
>
> Tyler
 

[jira] [Commented] (TIKA-1645) Extraction of biomedical information using CTAKESParser

2015-06-06 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14575996#comment-14575996
 ] 

Chris A. Mattmann commented on TIKA-1645:
-

I got this working with both tika-app and tika-server. See TIKA-1652 for a 
fix that is needed for this to work. I'm going to go ahead and commit this 
since I fully documented how to install and test it, and it's working well 
for me. It should be fine for 1.9 since it's not enabled by default and you 
have to do quite a bit to get it running. I'd love unit tests at some point, 
but that's not a blocker to getting this great piece of code into 1.9. Thanks 
for the great work [~gostep] and [~selina]!

 Extraction of biomedical information using CTAKESParser
 ---

 Key: TIKA-1645
 URL: https://issues.apache.org/jira/browse/TIKA-1645
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Giuseppe Totaro
Assignee: Chris A. Mattmann
  Labels: patch
 Fix For: 1.10

 Attachments: CTAKESConfig.properties, TIKA-1645.patch, 
 TIKA-1645.v02.patch, tika-config.xml


 As mentioned in [TIKA-1642|https://issues.apache.org/jira/browse/TIKA-1642], 
 [CTAKESContentHandler|https://github.com/giuseppetotaro/CTAKESContentHadler] 
 is preliminary work to integrate [Apache cTAKES|http://ctakes.apache.org/] 
 into Tika, allowing users to extract biomedical information from clinical 
 text.
 Essentially, this work includes a wrapper for CAS serializers that dump the 
 identified annotations into XML-based formats.
 Attached is a new patch that includes the CTAKESParser, a new parser that 
 decorates the AutoDetectParser and relies on a new version of 
 CTAKESContentHandler, based on feedback from 
 [TIKA-1642|https://issues.apache.org/jira/browse/TIKA-1642]. This parser 
 generates the same output as AutoDetectParser and, in addition, metadata 
 containing the clinical annotations identified by cTAKES.
 To run a cTAKES AnalysisEngine through the Tika CTAKESParser, you first need 
 to install the latest stable release of cTAKES (3.2.2), following the 
 instructions in the [User Install 
 Guide|https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+3.2+User+Install+Guide].
 Then, you can launch Tika as follows:
 {noformat}
 CTAKES_HOME=/usr/local/apache-ctakes-3.2.2
 java -cp \
   tika-app-1.10-SNAPSHOT.jar:${CTAKES_HOME}/desc:${CTAKES_HOME}/resources:${CTAKES_HOME}/lib/*:/path/to/CTAKESConfig \
   org.apache.tika.cli.TikaCLI --config=/path/to/tika-config.xml /path/to/input
 {noformat}
 In the example above, {{/path/to/CTAKESConfig}} is the parent directory of 
 the file {{org/apache/tika/parser/ctakes/CTAKESConfig.properties}}, which 
 contains the configuration properties used to build the cTAKES 
 AnalysisEngine; {{tika-config.xml}} is a custom Tika configuration file that 
 lists the mimetypes that CTAKESParser will parse.
 The attachments include an example of both {{CTAKESConfig.properties}} and 
 {{tika-config.xml}} for parsing ISA-Tab files using cTAKES.
 You need [UMLS credentials|https://uts.nlm.nih.gov/home.html] to use 
 the UMLS-based components of cTAKES.
 I would really appreciate your feedback.
 Thanks [~selina], [~chrismattmann] and [~lewismc] for supporting me on this 
 work.
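
The classpath layout in the description works because a properties file is found by its package-style resource name once its parent directory is on the classpath. A minimal sketch of that lookup using only the JDK follows. ClasspathPropertiesSketch and load are hypothetical names, and how CTAKESParser itself reads the file is not shown in this issue.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

final class ClasspathPropertiesSketch {

    /** Resource name taken from the issue description. */
    static final String RESOURCE = "org/apache/tika/parser/ctakes/CTAKESConfig.properties";

    /** Loads the named resource from the classpath; returns empty Properties when it is absent. */
    static Properties load(ClassLoader loader, String resource) {
        Properties props = new Properties();
        try (InputStream in = loader.getResourceAsStream(resource)) {
            if (in != null) {
                props.load(in);
            }
        } catch (IOException e) {
            throw new IllegalStateException("Could not read " + resource, e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = load(ClasspathPropertiesSketch.class.getClassLoader(), RESOURCE);
        if (props.isEmpty()) {
            System.out.println(RESOURCE + " not found on the classpath (or empty)");
        } else {
            System.out.println("Loaded " + props.size() + " properties");
        }
    }
}
```

Running it with the directory that contains the org/ tree on -cp, as in the command above, is what makes the resource visible. Without it the lookup silently yields nothing, which is the kind of hidden failure discussed in the configuration thread.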





[jira] [Assigned] (TIKA-1645) Extraction of biomedical information using CTAKESParser

2015-06-06 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned TIKA-1645:
---

Assignee: Chris A. Mattmann  (was: Giuseppe Totaro)



[jira] [Resolved] (TIKA-1645) Extraction of biomedical information using CTAKESParser

2015-06-06 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved TIKA-1645.
-
   Resolution: Fixed
Fix Version/s: (was: 1.10)
   1.9

Contributed! Thanks [~gostep] and [~selina]!

{noformat}
bash-3.2$ svn commit -m Fix for TIKA-1645  TIKA-1642: Extraction of 
biomedical information using CTAKESParser contributed by Selina Chu, Giuseppe 
Totaro and mattmann.
Sending        CHANGES.txt
Sending        tika-bundle/pom.xml
Sending        tika-parsers/pom.xml
Adding         tika-parsers/src/main/java/org/apache/tika/parser/ctakes
Adding         tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESAnnotationProperty.java
Adding         tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESConfig.java
Adding         tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESContentHandler.java
Adding         tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESParser.java
Adding         tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESSerializer.java
Adding         tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESUtils.java
Sending        tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
Transmitting file data ..
Committed revision 1683968.
{noformat}

Please note, improvements are welcome. I know Giuseppe is working on an 
ExternalParser version of this and some other improvements. Selina is working 
on unit tests.


 Extraction of biomedical information using CTAKESParser
 ---

 Key: TIKA-1645
 URL: https://issues.apache.org/jira/browse/TIKA-1645
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Giuseppe Totaro
Assignee: Chris A. Mattmann
  Labels: patch
 Fix For: 1.9

 Attachments: CTAKESConfig.properties, TIKA-1645.patch, 
 TIKA-1645.v02.patch, tika-config.xml




[jira] [Resolved] (TIKA-1642) Integrate cTAKES into Tika

2015-06-06 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved TIKA-1642.
-
   Resolution: Fixed
Fix Version/s: 1.9
 Assignee: Chris A. Mattmann  (was: Giuseppe Totaro)

- fixed!

{noformat}
bash-3.2$ svn commit -m Fix for TIKA-1645  TIKA-1642: Extraction of 
biomedical information using CTAKESParser contributed by Selina Chu, Giuseppe 
Totaro and mattmann.
Sending        CHANGES.txt
Sending        tika-bundle/pom.xml
Sending        tika-parsers/pom.xml
Adding         tika-parsers/src/main/java/org/apache/tika/parser/ctakes
Adding         tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESAnnotationProperty.java
Adding         tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESConfig.java
Adding         tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESContentHandler.java
Adding         tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESParser.java
Adding         tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESSerializer.java
Adding         tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESUtils.java
Sending        tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
Transmitting file data ..
Committed revision 1683968.
{noformat}


 Integrate cTAKES into Tika
 --

 Key: TIKA-1642
 URL: https://issues.apache.org/jira/browse/TIKA-1642
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Selina Chu
Assignee: Chris A. Mattmann
 Fix For: 1.9


 [~gostep] has written a preliminary version of 
 [CTAKESContentHandler|https://github.com/giuseppetotaro/CTAKESContentHadler] 
 to integrate [Apache cTAKES|http://ctakes.apache.org/] into Tika.
 The CTAKESContentHandler allows the following steps to be performed in Tika:
 * create an AnalysisEngine based on a given XML descriptor;
 * create a CAS (Common Analysis System) appropriate for this AnalysisEngine;
 * populate the CAS with the text extracted by using Tika;
 * run the AnalysisEngine against the plain text added to the CAS;
 * write out the results in the given format (XML, XCAS, XMI, etc.).
 It would be a great improvement if we could parse the output of cTAKES and 
 create a list of metadata that describes the terms found in the annotation 
 index and their corresponding tokens. For instance, using the 
 AggregatePlaintextFastUMLSProcessor analysis engine, we can use the UMLS 
 database to obtain the annotations related to DiseaseDisorderMention, and I 
 would like to be able to produce a list of the words in the input text that 
 are annotated as DiseaseDisorderMention.





[jira] [Commented] (TIKA-1642) Integrate cTAKES into Tika

2015-06-06 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14576051#comment-14576051
 ] 

Hudson commented on TIKA-1642:
--

ABORTED: Integrated in tika-trunk-jdk1.7 #734 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/734/])
Fix for TIKA-1645  TIKA-1642: Extraction of biomedical information using 
CTAKESParser contributed by Selina Chu, Giuseppe Totaro and mattmann. 
(mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1683968)
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-bundle/pom.xml
* /tika/trunk/tika-parsers/pom.xml
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ctakes
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESAnnotationProperty.java
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESConfig.java
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESContentHandler.java
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESParser.java
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESSerializer.java
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESUtils.java
* /tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser




[jira] [Commented] (TIKA-1652) Tika Server should allow config file override from the command line like Tika App

2015-06-06 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14576053#comment-14576053
 ] 

Hudson commented on TIKA-1652:
--

ABORTED: Integrated in tika-trunk-jdk1.7 #734 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/734/])
Fix for TIKA-1652, TIKA-1426: Tika Server should allow config file override 
from the command line like Tika App (mattmann: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1683966)
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaServerCli.java




[jira] [Commented] (TIKA-1426) Let's allow users to specify a tika config file on the commandline for tika-app and tika-server

2015-06-06 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14576052#comment-14576052
 ] 

Hudson commented on TIKA-1426:
--

ABORTED: Integrated in tika-trunk-jdk1.7 #734 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/734/])
Fix for TIKA-1652, TIKA-1426: Tika Server should allow config file override 
from the command line like Tika App (mattmann: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1683966)
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaServerCli.java


 Let's allow users to specify a tika config file on the commandline for 
 tika-app and tika-server
 ---

 Key: TIKA-1426
 URL: https://issues.apache.org/jira/browse/TIKA-1426
 Project: Tika
  Issue Type: Improvement
  Components: cli, server
Reporter: Tim Allison
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 1.9


 It would be handy to be able to specify a tika-config file when using 
 tika-app and tika-server.  I added this capability to tika-app as part of 
 TIKA-1418.  I should have opened a separate issue at the time (mea culpa).  
 This present issue covers both tika-app and tika-server.





[jira] [Commented] (TIKA-1645) Extraction of biomedical information using CTAKESParser

2015-06-06 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14576054#comment-14576054
 ] 

Hudson commented on TIKA-1645:
--

ABORTED: Integrated in tika-trunk-jdk1.7 #734 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/734/])
Fix for TIKA-1645  TIKA-1642: Extraction of biomedical information using 
CTAKESParser contributed by Selina Chu, Giuseppe Totaro and mattmann. 
(mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1683968)
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-bundle/pom.xml
* /tika/trunk/tika-parsers/pom.xml
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ctakes
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESAnnotationProperty.java
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESConfig.java
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESContentHandler.java
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESParser.java
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESSerializer.java
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESUtils.java
* /tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser


 Extraction of biomedical information using CTAKESParser
 ---

 Key: TIKA-1645
 URL: https://issues.apache.org/jira/browse/TIKA-1645
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Giuseppe Totaro
Assignee: Chris A. Mattmann
  Labels: patch
 Fix For: 1.9

 Attachments: CTAKESConfig.properties, TIKA-1645.patch, 
 TIKA-1645.v02.patch, tika-config.xml


 As mentioned in [TIKA-1642|https://issues.apache.org/jira/browse/TIKA-1642], 
 [CTAKESContentHandler|https://github.com/giuseppetotaro/CTAKESContentHadler] 
 is a preliminary work in order to integrate [Apache 
 cTAKES|http://ctakes.apache.org/] into Tika allowing users to extract 
 biomedical information from clinical text.
 Essentially, this work includes a wrapper for the CAS serializers that dump 
 the identified annotations into XML-based formats.
 Attached is a new patch that includes the CTAKESParser, a new parser that 
 decorates the AutoDetectParser and relies on a new version of 
 CTAKESContentHandler, based on feedback from 
 [TIKA-1642|https://issues.apache.org/jira/browse/TIKA-1642]. This parser 
 generates the same output as AutoDetectParser and, in addition, metadata 
 containing the clinical annotations identified by cTAKES.
 To run a cTAKES AnalysisEngine through Tika's CTAKESParser, you first need 
 to install the latest stable release of cTAKES (3.2.2), following the 
 instructions in the [User Install 
 Guide|https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+3.2+User+Install+Guide].
  Then, you can launch Tika as follows:
 {noformat}
 CTAKES_HOME=/usr/local/apache-ctakes-3.2.2
 java -cp tika-app-1.10-SNAPSHOT.jar:${CTAKES_HOME}/desc:${CTAKES_HOME}/resources:${CTAKES_HOME}/lib/*:/path/to/CTAKESConfig \
   org.apache.tika.cli.TikaCLI --config=/path/to/tika-config.xml /path/to/input
 {noformat}
 In the example above, {{/path/to/CTAKESConfig}} is the parent directory of 
 the file {{org/apache/tika/parser/ctakes/CTAKESConfig.properties}}, which 
 contains the configuration properties used to build the cTAKES AnalysisEngine; 
 {{tika-config.xml}} is a custom configuration file for Tika that lists the 
 MIME types that CTAKESParser will parse.
 Examples of both {{CTAKESConfig.properties}} and {{tika-config.xml}}, set up 
 to parse ISA-Tab files using cTAKES, are attached.
 You need [UMLS credentials|https://uts.nlm.nih.gov/home.html] in order to use 
 the UMLS-based components of cTAKES.
 I would really appreciate your feedback.
 Thanks [~selina], [~chrismattmann] and [~lewismc] for supporting me on this 
 work.





Re: Configuring parsers and translators

2015-06-06 Thread Tyler Palsulich
(Devil's advocate hat slightly on.) My one hesitation about putting it all
into tika-config is that the default might get to be a monstrosity --
difficult for new users to use.

Tyler




[jira] [Resolved] (TIKA-1426) Let's allow users to specify a tika config file on the commandline for tika-app and tika-server

2015-06-06 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved TIKA-1426.
-
   Resolution: Fixed
Fix Version/s: (was: 1.10)
   1.9
 Assignee: Chris A. Mattmann

- Fixed:

{noformat}
bash-3.2$ svn commit -m "Fix for TIKA-1652, TIKA-1426: Tika Server should allow config file override from the command line like Tika App" CHANGES.txt tika-server/src/main/java/org/apache/tika/server/TikaServerCli.java
Sending        CHANGES.txt
Sending        tika-server/src/main/java/org/apache/tika/server/TikaServerCli.java
Transmitting file data ..
Committed revision 1683966.
bash-3.2$ 
{noformat}



Re: [VOTE] Release Apache Tika 1.9 Candidate #1

2015-06-06 Thread David Meikle
Hey Chris,

 On 1 Jun 2015, at 06:38, Mattmann, Chris A (3980) 
 chris.a.mattm...@jpl.nasa.gov wrote:
 
 Please vote on releasing this package as Apache Tika 1.9.
 The vote is open for the next 72 hours and passes if a majority of at
 least three +1 Tika PMC votes are cast.
 
 [ ] +1 Release this package as Apache Tika 1.9
 [ ] -1 Do not release this package because…

Thanks for preparing this, lots of great stuff in this one.

+1 from me.

Cheers,
Dave

Re: svn commit: r1683969 - /tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser

2015-06-06 Thread Mattmann, Chris A (3980)
Also the lovely thing here too is that since cTAKESParser is a
decorator for AutoDetectParser there is magical infinite recursion
if it’s enabled via SPI.

TODO: make this a LOT cleaner in 1.10+.
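
For anyone following along, here is a rough sketch of why this recurses. It uses only the JDK; `MiniParser`, `LeafParser`, `MiniComposite` and `MiniDecorator` are made-up stand-ins for the real Tika classes, the shared list stands in for the SPI service file, and the depth cut-off exists only so the demo terminates.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal parser contract for the demo; returns the deepest call depth reached.
interface MiniParser {
    int parse(String text, int depth);
}

// Leaf parser: handles the text itself and delegates to nobody.
class LeafParser implements MiniParser {
    public int parse(String text, int depth) { return depth; }
}

// Stand-in for a composite that delegates to every registered parser.
class MiniComposite implements MiniParser {
    private final List<MiniParser> registry;
    MiniComposite(List<MiniParser> registry) { this.registry = registry; }
    public int parse(String text, int depth) {
        int deepest = depth;
        for (MiniParser parser : registry) {
            deepest = Math.max(deepest, parser.parse(text, depth + 1));
        }
        return deepest;
    }
}

// Stand-in for a decorator that wraps the composite.
class MiniDecorator implements MiniParser {
    static final int LIMIT = 50; // artificial cut-off so the demo terminates
    private final MiniParser wrapped;
    MiniDecorator(MiniParser wrapped) { this.wrapped = wrapped; }
    public int parse(String text, int depth) {
        if (depth >= LIMIT) return depth; // without this the call chain never ends
        return wrapped.parse(text, depth + 1);
    }
}

class DecoratorRecursionDemo {
    // registerDecorator=true models listing the decorator in the service file.
    static int run(boolean registerDecorator) {
        List<MiniParser> registry = new ArrayList<>();
        registry.add(new LeafParser());
        MiniDecorator decorator = new MiniDecorator(new MiniComposite(registry));
        if (registerDecorator) registry.add(decorator);
        return decorator.parse("some text", 0);
    }

    public static void main(String[] args) {
        System.out.println("decorator kept out of registry: depth " + run(false));
        System.out.println("decorator in registry: depth " + run(true) + " (hit the cut-off)");
    }
}
```

With the decorator kept out of the registry the call chain is decorator, composite, leaf and stops. Once the decorator is also registered, the composite hands the text back to the decorator, which calls the composite again, and so on.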





-Original Message-
From: jpluser mattm...@apache.org
Reply-To: dev@tika.apache.org dev@tika.apache.org
Date: Saturday, June 6, 2015 at 5:50 PM
To: comm...@tika.apache.org comm...@tika.apache.org
Subject: svn commit: r1683969 - /tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser

Author: mattmann
Date: Sun Jun  7 00:50:23 2015
New Revision: 1683969

URL: http://svn.apache.org/r1683969
Log:
CTAKESParser: don't enable via SPI since enabled via config.

Modified:
    tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser

Modified: tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
URL: http://svn.apache.org/viewvc/tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser?rev=1683969&r1=1683968&r2=1683969&view=diff
==
--- tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser (original)
+++ tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser Sun Jun  7 00:50:23 2015
@@ -65,4 +65,3 @@ org.apache.tika.parser.isatab.ISArchiveP
 org.apache.tika.parser.geoinfo.GeographicInformationParser
 org.apache.tika.parser.geo.topic.GeoParser
 org.apache.tika.parser.external.CompositeExternalParser
-org.apache.tika.parser.ctakes.CTAKESParser
\ No newline at end of file





Re: Configuring parsers and translators

2015-06-06 Thread Mattmann, Chris A (3980)
I think it would be great to have all this in the Tika Config.

The one thing then is to provide an example default config and
to make it *hugely* clear rather than all the levels of indirection
that we currently have going on which makes it super hard when
there is a config error (SPI, swallowing print messages, etc.)






[jira] [Commented] (TIKA-1645) Extraction of biomedical information using CTAKESParser

2015-06-06 Thread Chris A. Mattmann (JIRA)

[ https://issues.apache.org/jira/browse/TIKA-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14575997#comment-14575997 ]

Chris A. Mattmann commented on TIKA-1645:
-

Documentation: https://wiki.apache.org/tika/cTAKESParser

 Extraction of biomedical information using CTAKESParser
 ---

 Key: TIKA-1645
 URL: https://issues.apache.org/jira/browse/TIKA-1645
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Giuseppe Totaro
Assignee: Chris A. Mattmann
  Labels: patch
 Fix For: 1.10

 Attachments: CTAKESConfig.properties, TIKA-1645.patch, 
 TIKA-1645.v02.patch, tika-config.xml




[jira] [Resolved] (TIKA-1652) Tika Server should allow config file override from the command line like Tika App

2015-06-06 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved TIKA-1652.
-
Resolution: Fixed

- Fixed:

{noformat}
bash-3.2$ svn commit -m "Fix for TIKA-1652, TIKA-1426: Tika Server should allow config file override from the command line like Tika App" CHANGES.txt tika-server/src/main/java/org/apache/tika/server/TikaServerCli.java
Sending        CHANGES.txt
Sending        tika-server/src/main/java/org/apache/tika/server/TikaServerCli.java
Transmitting file data ..
Committed revision 1683966.
bash-3.2$ 
{noformat}

Note that when the Tika config is given on the command line, this first creates 
the actual parser specified in tika-config.xml (or a DefaultParser if none is 
specified). With the environment variable and/or system property approach, by 
contrast, it directly creates the DefaultParser regardless. That can cause 
big-time havoc if, say, you have a parser that decorates AutoDetectParser, as 
cTAKESParser does. In fact, the only way for it to work correctly with SPI and 
all the surrounding config magic is to specify the config on the command line, 
which this fix enables.
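
As an illustration only, the lookup order this implies can be sketched as below, with the command line checked first. `ConfigSource` is a hypothetical helper, not code from TikaServerCli, and the `tika.config` system property and `TIKA_CONFIG` environment variable names are assumptions. The sketch covers only where the config path comes from, not how the parsers are then built.

```java
class ConfigSource {
    // Returns the value of a --config=<path> argument, or null if absent.
    static String fromArgs(String[] args) {
        for (String arg : args) {
            if (arg.startsWith("--config=")) {
                return arg.substring("--config=".length());
            }
        }
        return null;
    }

    // Command line wins; otherwise fall back to the system property, then the env var.
    static String resolve(String[] args, String systemProperty, String envValue) {
        String fromCommandLine = fromArgs(args);
        if (fromCommandLine != null) return fromCommandLine;
        if (systemProperty != null) return systemProperty;
        return envValue;
    }

    public static void main(String[] args) {
        // Property and variable names below are assumptions for this sketch.
        String path = resolve(args, System.getProperty("tika.config"), System.getenv("TIKA_CONFIG"));
        System.out.println(path == null ? "no config given, defaults apply" : "config: " + path);
    }
}
```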

 Tika Server should allow config file override from the command line like Tika 
 App
 -

 Key: TIKA-1652
 URL: https://issues.apache.org/jira/browse/TIKA-1652
 Project: Tika
  Issue Type: Bug
  Components: server
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.9


 Tika-app's TikaCLI allows a command line parameter, --config, to override the 
 Tika config at the command line. For whatever reason, Tika-server doesn't, but 
 it should, since it causes a different control flow for things to get created. 
 I first saw this when testing the CTAKESParser (TIKA-1645) in Tika-server.





[VOTE] Release Apache Tika 1.9 Candidate #2

2015-06-06 Thread Mattmann, Chris A (3980)
Hi Folks,

A second candidate for the Tika 1.9 release is available at:

  https://dist.apache.org/repos/dist/dev/tika/

The release candidate is a zip archive of the sources in:
  http://svn.apache.org/repos/asf/tika/tags/1.9-rc2/

The SHA1 checksum of the archive is
9b78c9e9ce9640b402b7fef8e30f3cdbe384f44c.
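
To check a downloaded archive against the checksum above, something like the following would do. It is a minimal sketch using only the JDK; `Sha1Check` is a made-up helper name, and it reads the whole file into memory, which should be fine for a source archive.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class Sha1Check {
    // Returns the SHA-1 digest of the data as lowercase hex.
    static String sha1Hex(byte[] data) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    // Compares the computed digest with the published one, ignoring case and padding.
    static boolean matches(byte[] data, String expectedHex) {
        return sha1Hex(data).equalsIgnoreCase(expectedHex.trim());
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to the downloaded archive, args[1]: published checksum
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(matches(bytes, args[1]) ? "checksum OK" : "checksum MISMATCH");
    }
}
```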

In addition, a staged maven repository is available here:
https://repository.apache.org/content/repositories/orgapachetika-1011/


Please vote on releasing this package as Apache Tika 1.9.
The vote is open for the next 72 hours and passes if a majority of at
least three +1 Tika PMC votes are cast.

[ ] +1 Release this package as Apache Tika 1.9
[ ] -1 Do not release this package because…

Cheers,
Chris

P.S. Of course here is my +1.






[jira] [Created] (TIKA-1652) Tika Server should allow config file override from the command line like Tika App

2015-06-06 Thread Chris A. Mattmann (JIRA)
Chris A. Mattmann created TIKA-1652:
---

 Summary: Tika Server should allow config file override from the 
command line like Tika App
 Key: TIKA-1652
 URL: https://issues.apache.org/jira/browse/TIKA-1652
 Project: Tika
  Issue Type: Bug
  Components: server
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.9


Tika-app's TikaCLI allows a command line parameter, --config, to override the 
Tika config at the command line. For whatever reason, Tika-server doesn't, but 
it should, since it causes a different control flow for things to get created. 
I first saw this when testing the CTAKESParser (TIKA-1645) in Tika-server.





Re: Configuring parsers and translators

2015-06-06 Thread Nick Burch

Anyone have any thoughts on this?

On Fri, 8 May 2015, Nick Burch wrote:

Hi All

This came up in TIKA-1623, but I thought it might be better brought out to 
the list for discussion


To configure parsers on a per-document basis, such as setting PDF 
spacing tolerances, or telling Tesseract what language it should be 
OCRing for, we have the *Config objects. You create one of these, use 
the setters to configure it for your document, pop it onto the Parse 
context and it's used when processing your document
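
A rough model of that per-document pattern, using only the JDK: `SimpleContext` and `PdfSettings` are hypothetical stand-ins for the parse context and a *Config object, not Tika's own classes, and the default value is invented for the demo.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a per-document settings object.
class PdfSettings {
    private float spacingTolerance = 0.5f; // invented default for the demo
    void setSpacingTolerance(float value) { spacingTolerance = value; }
    float getSpacingTolerance() { return spacingTolerance; }
}

// Simplified model of a parse context: at most one value per class key.
class SimpleContext {
    private final Map<Class<?>, Object> entries = new HashMap<>();

    <T> void set(Class<T> key, T value) { entries.put(key, value); }

    // Returns the stored value, or the supplied default when none was set.
    <T> T get(Class<T> key, T defaultValue) {
        Object value = entries.get(key);
        return value == null ? defaultValue : key.cast(value);
    }
}

class PerDocumentConfigDemo {
    // A parser looks up its own settings type and falls back to defaults.
    static float toleranceFor(SimpleContext context) {
        return context.get(PdfSettings.class, new PdfSettings()).getSpacingTolerance();
    }

    public static void main(String[] args) {
        SimpleContext context = new SimpleContext();
        PdfSettings settings = new PdfSettings();
        settings.setSpacingTolerance(0.3f);
        context.set(PdfSettings.class, settings);
        System.out.println(toleranceFor(context));
    }
}
```

Keying by class is what lets each parser pick out only its own settings without knowing what else is on the context.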


To configure parsers and translators on a per-JVM basis, to apply to all 
documents processed, it's a bit less consistent. At least some look for 
a properties file with a specific name, usually in the tika namespace, 
and grab their settings / keys / etc out of that. At least some expect 
to find a *Config with their program path on it, even though that 
remains constant between documents. None of them support getting their 
settings from the Tika Config
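
The properties-file variant of the per-JVM approach looks roughly like this. It is a sketch with the JDK only; the resource name `example/ocr-tool.properties` and the key `tool.path` are invented for illustration and do not correspond to any real parser's file.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Properties;

class JvmWideSettings {
    // Loads a properties resource from the classpath; a missing resource yields an empty set.
    static Properties load(String resourceName) {
        Properties props = new Properties();
        try (InputStream in = JvmWideSettings.class.getClassLoader().getResourceAsStream(resourceName)) {
            if (in != null) {
                props.load(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Reads one setting, using the fallback when the key is not present.
    static String setting(Properties props, String key, String fallback) {
        return props.getProperty(key, fallback);
    }

    public static void main(String[] args) {
        // Resource name and key are hypothetical.
        Properties props = load("example/ocr-tool.properties");
        System.out.println(setting(props, "tool.path", "(not set)"));
    }
}
```

Since the file is found by name on the classpath and read once, the settings apply to every document in the JVM, which is the inconsistency with the per-document *Config objects described above.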



As part of our evolution of parser preferences, we're moving towards 
people either being able to set their preferences in code, or being able 
to supply a Tika Config xml which sets their parser preferences or 
overrides certain bits of the default. The code option works for people 
who want to declare certain specific things, the Tika Config one gives 
the same functionality but allows a consistent and clean way to set it 
between Tika App, Tika Server and java code.


Another related example is the External Parser support. Because you can 
have multiple External Parser instances in your setup, one per format / 
program, we look for all the 
org/apache/tika/parser/external/tika-external-parsers.xml files on the 
classpath, and create parser instances based on definitions in there
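
Finding every copy of a resource on the classpath can be done with `ClassLoader.getResources`. The sketch below only does the lookup; `ClasspathScan` is a made-up name, and turning each file into a parser instance is left out.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

class ClasspathScan {
    // Returns every classpath location that provides the named resource.
    static List<URL> findAll(String resourceName, ClassLoader loader) {
        try {
            return Collections.list(loader.getResources(resourceName));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String name = "org/apache/tika/parser/external/tika-external-parsers.xml";
        for (URL url : findAll(name, ClasspathScan.class.getClassLoader())) {
            System.out.println("would read external parser definitions from " + url);
        }
    }
}
```

Each jar or directory on the classpath can contribute its own file, so the result is a list rather than a single URL.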



What do we think about setting executable paths and keys/logins for 
parsers like OCR, Strings, Translators etc? Always on ParseContext? 
Properties? Custom xml config? Tika config xml? Other? Combination?


Nick