Re: [DISCUSS] Enable specific ContentHandler for tika-server

2017-10-24 Thread Giuseppe Totaro
Hi folks,

I am developing the proposed solutions within tika-server for enabling
specific ContentHandlers. Basically, I am working to provide the ability to
specify the name of the ContentHandler to be used via either a command-line
option or an HTTP header.
In order to complete my work, I would like to get your feedback about the
following aspects:

   1. To create and use the given ContentHandler, should I modify each
   method within the TikaResource class (as well as the other classes
   within org.apache.tika.server.resource) where the parse method is
   invoked, wrapping the ContentHandler currently in use? Alternatively, I
   could create a new method (and therefore a new REST API) specifically
   focused on creating a ContentHandler from the list provided by the user.
   Of course, I am totally open to other solutions.

   2. As ContentHandlers often provide different types of constructors, we
   would need a mechanism to determine via reflection the constructor and
   the parameters to be used. I think we could get the ContentHandler by
   using the static method Class.forName(String className) [0] with the
   fully-qualified name of the given class, and then using the method
   getConstructor(Class... parameterTypes) [1] to determine the constructor
   to be used and instantiate the ContentHandler.

   3. If you agree with the above, I think that we can allow users to
   provide the parameters according to RFC 822 [3], so that they can give
   the name of each ContentHandler to be used and its parameters as a
   semicolon-separated list of entries:

<header>  = X-Content-Handler: <handler> *[, <handler>]
<handler> = <name> *[; <param>]
<param>   = <key>=<value>

   Consistently, I would enable the same syntax when using the command-line
   option:

   java -jar tika-server-X.jar -contentHandler <handler>*[,<handler>]
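To make point 2 concrete, here is a minimal, hypothetical sketch of the
reflection step (the class name HandlerHeaderParser, the fallback to a
zero-arg constructor, and the parsing of the proposed header syntax are
assumptions for illustration, not existing tika-server code):

```java
import org.xml.sax.ContentHandler;
import java.lang.reflect.Constructor;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: parse an X-Content-Handler value such as
// "org.xml.sax.helpers.DefaultHandler; someKey=someValue" and instantiate
// each named handler reflectively.
public class HandlerHeaderParser {

    public static List<ContentHandler> fromHeader(String headerValue) throws Exception {
        List<ContentHandler> handlers = new ArrayList<>();
        for (String entry : headerValue.split(",")) {   // handlers are comma-separated
            String[] parts = entry.trim().split(";");   // params are semicolon-separated
            String className = parts[0].trim();
            // parts[1..n] would hold key=value parameters; matching them to a
            // constructor signature is the open question in point 2. Here we
            // simply fall back to the public zero-arg constructor.
            Class<?> clazz = Class.forName(className);
            Constructor<?> ctor = clazz.getConstructor();
            handlers.add((ContentHandler) ctor.newInstance());
        }
        return handlers;
    }
}
```

For example, fromHeader("org.xml.sax.helpers.DefaultHandler") would yield a
single DefaultHandler instance; mapping the key=value parameters onto a
specific getConstructor(...) signature remains the open question above.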

I look forward to having your feedback.

Thanks a lot,
Giuseppe

[0]
https://docs.oracle.com/javase/8/docs/api/java/lang/Class.html#forName-java.lang.String-
[1]
https://docs.oracle.com/javase/8/docs/api/java/lang/Class.html#getConstructor-java.lang.Class...-
[3] https://www.w3.org/Protocols/rfc822/

On Fri, Oct 6, 2017 at 3:06 PM, Sergey Beryozkin <sberyoz...@gmail.com>
wrote:

> Konstantin, by the way, if you are interested in having a good discussion
> to do with using the serialized lambdas, then you will be welcome to comment
> on the relevant text in the Tika Concerns Beam thread, though maybe Beam
> knows how to take care of the issues you raised...
>
> Thanks, Sergey
>
> On 06/10/17 18:27, Sergey Beryozkin wrote:
>
>> On 06/10/17 18:08, Konstantin Gribov wrote:
>>
>>> My +1 to this idea.
>>>
>>> IMHO, the second option is more flexible. I also like Nick's suggestion
>>> about using a default package for handlers and interpreting a
>>> dot-separated string as a FQCN. Solr does a similar thing and it's very
>>> convenient to use (but they use the prefix `solr.` for their classes in a
>>> predefined package, and any other name is interpreted as a FQCN).
>>>
>>> I'll add that you could allow the user to pass several comma-separated
>>> handlers to build a content-handler stack if they want to.
>>>
>>> I would disagree with Sergey about serialized lambdas for 2 reasons:
>>> - it's useful only for Java clients;
>>> - it could bring very nasty bugs leading to RCE-class vulnerabilities, so
>>> it's very controversial from a security PoV.
>>>
>> Sure. I was not actually suggesting using them in Tika natively; I only
>> referred to them as the alternative mentioned in the context of the Beam
>> integration work
>>
>> Sergey
>>
>>>
>>> On Thu, Sep 28, 2017 at 11:35 PM Giuseppe Totaro <totarope...@gmail.com>
>>> wrote:
>>>
>>> Hi folks,
>>>>
>>>> if I am not wrong, currently you cannot configure a specific
>>>> ContentHandler
>>>> while using tika-server. I mean that you can configure your own parser
>>>> [0]
>>>> but you cannot control which ContentHandler the parser leverages to
>>>> extract
>>>> text and metadata (e.g., you cannot use PhoneExtractingContentHandler,
>>>> StandardsExtractingContentHandler, etc).
>>>> If it is correct, it would be nice to enable the use of specific
>>>> ContentHandlers within tika-server and I would like to discuss how to
>>>> solve
>>>> this issue generally.
>>>>
>>>> I propose two solutions:
>>>>
>>>> 1. augment the TikaConfig class so that a specific ContentHandler
>>>> can be
>>>> used in tika-config.xml;
>>>> 2. determine the ContentHandler to use for parsing through HTTP
>>>> headers,
>>>> for e

Re: [DISCUSS] Enable specific ContentHandler for tika-server

2017-09-29 Thread Giuseppe Totaro
Hi folks,

first of all, I want to express my gratitude for your feedback and
insightful suggestions.

To sum up, I would like to quickly discuss the following aspects:

   - As you all mentioned, HTTP headers for configuring the ContentHandler
   to be used are better suited for dynamic cases. Specifically, a
   ContentHandler can be given through an ad-hoc header, e.g.
   -H "X-Content-Handler: StandardsExtractingContentHandler", then parsed
   and used at runtime within tika-server.
   - Nick, I believe that providing the ability to determine the
   ContentHandler through a command-line option is a great idea. It could
   also be more convenient for users.

Please let me implement both solutions and provide an example in the coming
days that we can discuss.

Thanks again for your kind availability,
Giuseppe


On Thu, Sep 28, 2017 at 10:08 PM, Nick Burch <apa...@gagravarr.org> wrote:

> On Thu, 28 Sep 2017, Giuseppe Totaro wrote:
>
>> if I am not wrong, currently you cannot configure a specific
>> ContentHandler
>> while using tika-server. I mean that you can configure your own parser [0]
>> but you cannot control which ContentHandler the parser leverages to
>> extract
>> text and metadata (e.g., you cannot use PhoneExtractingContentHandler,
>> StandardsExtractingContentHandler, etc).
>>
>
> I think the long-term plan was to work out a viable plan for layering
> multiple parsers on top of each other, then change some of these to be
> "enhancing parsers" on top. However, that's still on the "TODO" list for
> Tika 2.0, as we've yet to come up with a good way to allow it to
> happen within the SAX / ContentHandler structure
>
>
> I propose two solutions:
>>
>>   1. augment the TikaConfig class so that a specific ContentHandler can be
>>   used in tika-config.xml;
>>
>
> That feels a bit wrong to me, because in almost all Tika use-cases, the
> value from the Config would be ignored.
>
> Trying to explain to a new user which were the cases where it'd be used,
> and which ones it was ignored, seems hard and confusing too...
>
>
>   2. determine the ContentHandler to use for parsing through HTTP headers,
>>   for example:
>>
>
> We do allow setting of parser config via headers, so this would have
> precedent. It would also allow per-request changes
>
> Otherwise, if server-wide is OK (which your config idea would require
> anyway), might it not be better to make it an option when you start the
> server? I see it as being a bit more like picking a port, in terms of
> something specific to how you run that server instance
>
> eg java -jar tika-server.jar --port 1234 --content-handler
> PhoneExtractingContentHandler
> eg java -jar tika-server.jar --port 1234 --content-handler
> com.example.CustomHandler
>
> Nick
>


[DISCUSS] Enable specific ContentHandler for tika-server

2017-09-28 Thread Giuseppe Totaro
Hi folks,

If I am not wrong, currently you cannot configure a specific ContentHandler
while using tika-server. I mean that you can configure your own parser [0],
but you cannot control which ContentHandler the parser leverages to extract
text and metadata (e.g., you cannot use PhoneExtractingContentHandler,
StandardsExtractingContentHandler, etc.).
If this is correct, it would be nice to enable the use of specific
ContentHandlers within tika-server, and I would like to discuss how to solve
this issue generally.

I propose two solutions:

   1. augment the TikaConfig class so that a specific ContentHandler can be
   used in tika-config.xml;
   2. determine the ContentHandler to use for parsing through HTTP headers,
   for example:
   curl -T filename.pdf http://localhost:9998/meta --header
   "X-Content-Handler: PhoneExtractingContentHandler"
   This would also affect the TikaResource.java class.

I look forward to having your feedback. I strongly believe that every user
who wants to use Tika as a service through tika-server and needs to extract
content and metadata such as phone numbers, standard references, etc. would
welcome this feature.

Thanks a lot,
Giuseppe


[jira] [Resolved] (TIKA-2449) Enabling extraction of standard references from text

2017-09-13 Thread Giuseppe Totaro (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giuseppe Totaro resolved TIKA-2449.
---
   Resolution: Fixed
Fix Version/s: 1.17

> Enabling extraction of standard references from text
> 
>
> Key: TIKA-2449
> URL: https://issues.apache.org/jira/browse/TIKA-2449
> Project: Tika
>  Issue Type: Improvement
>  Components: handler
>        Reporter: Giuseppe Totaro
>    Assignee: Giuseppe Totaro
>  Labels: handler
> Fix For: 1.17
>
> Attachments: flowchart_standards_extraction.png, 
> flowchart_standards_extraction_v02.png, SOW-TacCOM.pdf, 
> standards_extraction.patch
>
>
> Apache Tika currently provides many _ContentHandler_ which help to 
> de-obfuscate some information from text. For instance, the 
> {{PhoneExtractingContentHandler}} is used to extract phone numbers while 
> parsing.
> This improvement adds the *{{StandardsExtractingContentHandler}}* to Tika, a 
> new ContentHandler that relies on regular expressions in order to identify 
> and extract standard references from text. 
> Basically, a standard reference is just a reference to a 
> norm/convention/requirement (i.e., a standard) released by a standard 
> organization. This work is mainly focused on identifying and extracting the 
> references to the standards already cited within a given document (e.g., 
> SOW/PWS) so the references can be stored and provided to the user as 
> additional metadata in case the StandardsExtractingContentHandler is used.
> In addition to the patch, the first version of the 
> {{StandardsExtractingContentHandler}} along with an example class to easily 
> execute the handler is available on 
> [GitHub|https://github.com/giuseppetotaro/StandardsExtractingContentHandler]. 
> The following sections describe in more detail how the 
> {{StandardsExtractingContentHandler}} has been developed.
> h1. Background
> From a technical perspective, a standard reference is a string that is 
> usually composed of two parts: 
> # the name of the standard organization; 
> # the alphanumeric identifier of the standard within the organization. 
> Specifically, the first part can include the acronym or the full name of the 
> standard organization or even both, and the second part can include an 
> alphanumeric string, possibly containing one or more separation symbols 
> (e.g., "-", "_", ".") depending on the format adopted by the organization, 
> representing the identifier of the standard within the organization.
> Furthermore, the standard references are usually reported within the 
> "Applicable Documents" or "References" section of a SOW, and they can be 
> cited also within sections that include in the header the word "standard", 
> "requirement", "guideline", or "compliance".
> Consequently, the citation of standard references within a SOW/PWS document 
> can be summarized by the following rules:
> * *RULE #1*: standard references are usually reported within the section 
> named "Applicable Documents" or "References".
> * *RULE #2*: standard references can be cited also within sections including 
> the word "compliance" or another semantically-equivalent word in their name.
> * *RULE #3*: a standard reference is composed of two parts:
> ** Name of the standard organization (acronym, full name, or both).
> ** Alphanumeric identifier of the standard within the organization.
> * *RULE #4*: The name of the standard organization includes the acronym or 
> the full name or both. The name must belong to the set of standard 
> organizations {{S = O U V}}, where {{O}} represents the set of open standard 
> organizations (e.g., ANSI) and {{V}} represents the set of vendor-specific 
> standard organizations (e.g., Motorola).
> * *RULE #5*: A separation symbol (e.g., "-", "_", "." or whitespace) can be 
> used between the name of the standard organization and the alphanumeric 
> identifier.
> * *RULE #6*: The alphanumeric identifier of the standard is composed of 
> alphabetic and numeric characters, possibly split in two or more parts by a 
> separation symbol (e.g., "-", "_", ".").
> On the basis of the above rules, here are some examples of formats used for 
> reporting standard references within a SOW/PWS:
> * {{}}
> * 
> {{()}}
> * 
> {{()}}
> Moreover, some standards are sometimes released by two standard 
> organizations. In this case, the standard reference can be reported as 
> follows:
> * {{/}}
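As a hedged illustration only, the rules above could be approximated with a
single simplified regular expression; the organization list below is a tiny
assumed sample, far simpler than the actual patterns used by the
StandardsExtractingContentHandler:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch of RULE #3-#6, not the real Tika implementation.
public class StandardRefSketch {

    private static final Pattern STANDARD_REF = Pattern.compile(
            "\\b(ANSI|IEEE|ISO|IEC)"        // organization acronym (RULE #4, assumed sample set)
          + "(?:/(ANSI|IEEE|ISO|IEC))?"     // optional second organization (joint standards)
          + "[-_ .]"                        // separation symbol (RULE #5)
          + "(\\d+(?:[-_.]\\w+)*)");        // alphanumeric identifier (RULE #6)

    public static List<String> extract(String text) {
        List<String> refs = new ArrayList<>();
        Matcher m = STANDARD_REF.matcher(text);
        while (m.find()) {
            refs.add(m.group());
        }
        return refs;
    }
}
```

For instance, extract("Work shall comply with ISO 9001 and ANSI/IEEE
829-2008.") would find both references, including the joint-organization
form.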

[jira] [Assigned] (TIKA-2449) Enabling extraction of standard references from text

2017-09-13 Thread Giuseppe Totaro (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giuseppe Totaro reassigned TIKA-2449:
-

Assignee: Giuseppe Totaro


[jira] [Updated] (TIKA-2449) Enabling extraction of standard references from text

2017-09-08 Thread Giuseppe Totaro (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giuseppe Totaro updated TIKA-2449:
--
External issue URL: https://github.com/apache/tika/pull/204  (was: 
https://github.com/giuseppetotaro/StandardsExtractingContentHandler)


[jira] [Updated] (TIKA-2449) Enabling extraction of standard references from text

2017-09-07 Thread Giuseppe Totaro (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giuseppe Totaro updated TIKA-2449:
--
Attachment: flowchart_standards_extraction_v02.png


[jira] [Updated] (TIKA-2449) Enabling extraction of standard references from text

2017-09-07 Thread Giuseppe Totaro (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giuseppe Totaro updated TIKA-2449:
--
Attachment: (was: flowchart_standards_extraction_v02.png)


[jira] [Updated] (TIKA-2449) Enabling extraction of standard references from text

2017-09-07 Thread Giuseppe Totaro (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giuseppe Totaro updated TIKA-2449:
--
Attachment: flowchart_standards_extraction_v02.png

> Enabling extraction of standard references from text
> 
>
> Key: TIKA-2449
> URL: https://issues.apache.org/jira/browse/TIKA-2449
> Project: Tika
>  Issue Type: Improvement
>  Components: handler
>        Reporter: Giuseppe Totaro
>  Labels: handler
> Attachments: flowchart_standards_extraction.png, 
> flowchart_standards_extraction_v02.png, SOW-TacCOM.pdf, 
> standards_extraction.patch

[jira] [Updated] (TIKA-2449) Enabling extraction of standard references from text

2017-09-07 Thread Giuseppe Totaro (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giuseppe Totaro updated TIKA-2449:
--
Attachment: standards_extraction_v02.png

> Enabling extraction of standard references from text
> 
>
> Key: TIKA-2449
> URL: https://issues.apache.org/jira/browse/TIKA-2449
> Project: Tika
>  Issue Type: Improvement
>  Components: handler
>        Reporter: Giuseppe Totaro
>  Labels: handler
> Attachments: flowchart_standards_extraction.png, SOW-TacCOM.pdf, 
> standards_extraction.patch

[jira] [Updated] (TIKA-2449) Enabling extraction of standard references from text

2017-09-07 Thread Giuseppe Totaro (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giuseppe Totaro updated TIKA-2449:
--
Attachment: (was: standards_extraction_v02.png)

> Enabling extraction of standard references from text
> 
>
> Key: TIKA-2449
> URL: https://issues.apache.org/jira/browse/TIKA-2449
> Project: Tika
>  Issue Type: Improvement
>  Components: handler
>        Reporter: Giuseppe Totaro
>  Labels: handler
> Attachments: flowchart_standards_extraction.png, SOW-TacCOM.pdf, 
> standards_extraction.patch

[jira] [Updated] (TIKA-2449) Enabling extraction of standard references from text

2017-08-30 Thread Giuseppe Totaro (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giuseppe Totaro updated TIKA-2449:
--
Description: 
Apache Tika currently provides many _ContentHandler_ implementations that help extract specific information from text. For instance, the {{PhoneExtractingContentHandler}} is used to extract phone numbers while parsing.

This improvement adds the *{{StandardsExtractingContentHandler}}* to Tika, a new ContentHandler that relies on regular expressions in order to identify and extract standard references from text.
Basically, a standard reference is just a reference to a norm/convention/requirement (i.e., a standard) released by a standard organization. This work is mainly focused on identifying and extracting the references to the standards already cited within a given document (e.g., SOW/PWS), so that the references can be stored and provided to the user as additional metadata when the {{StandardsExtractingContentHandler}} is used.

In addition to the patch, the first version of the {{StandardsExtractingContentHandler}}, along with an example class to easily execute the handler, is available on [GitHub|https://github.com/giuseppetotaro/StandardsExtractingContentHandler]. The following sections describe in more detail how the {{StandardsExtractingContentHandler}} has been developed.
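As a rough illustration of the general approach (not the actual Tika implementation), a SAX handler can buffer the character stream and apply a reference pattern once the document ends; the class name and the deliberately simplified regex below are hypothetical, and the real {{StandardsExtractingContentHandler}} stores its hits as Tika metadata instead of a plain list:

```java
import org.xml.sax.helpers.DefaultHandler;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal sketch of a "standards extracting" SAX handler: it buffers the
// character stream and scans it at end of document with a simplified,
// hypothetical reference pattern (acronym + separated alphanumeric parts).
public class SimpleStandardsHandler extends DefaultHandler {
    private static final Pattern REF =
            Pattern.compile("\\b[A-Z]{2,}(?:[-_ ][A-Z0-9]+)+\\b");
    private final StringBuilder buffer = new StringBuilder();
    private final List<String> references = new ArrayList<>();

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length);
    }

    @Override
    public void endDocument() {
        Matcher m = REF.matcher(buffer);
        while (m.find()) {
            references.add(m.group());
        }
    }

    public List<String> getReferences() { return references; }

    public static void main(String[] args) {
        SimpleStandardsHandler handler = new SimpleStandardsHandler();
        char[] text = "complies with MIL-STD 810 and ISO 9001".toCharArray();
        handler.characters(text, 0, text.length);
        handler.endDocument();
        System.out.println(handler.getReferences()); // [MIL-STD 810, ISO 9001]
    }
}
```

Because it extends {{org.xml.sax.helpers.DefaultHandler}}, a handler of this shape can be dropped anywhere a SAX {{ContentHandler}} is expected.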

h1. Background

From a technical perspective, a standard reference is a string that is usually 
composed of two parts: 
# the name of the standard organization; 
# the alphanumeric identifier of the standard within the organization. 
Specifically, the first part can include the acronym or the full name of the 
standard organization or even both, and the second part can include an 
alphanumeric string, possibly containing one or more separation symbols (e.g., 
"-", "_", ".") depending on the format adopted by the organization, 
representing the identifier of the standard within the organization.

Furthermore, the standard references are usually reported within the 
"Applicable Documents" or "References" section of a SOW, and they can be cited 
also within sections that include in the header the word "standard", 
"requirement", "guideline", or "compliance".

Consequently, the citation of standard references within a SOW/PWS document can 
be summarized by the following rules:
* *RULE #1*: standard references are usually reported within the section named 
"Applicable Documents" or "References".
* *RULE #2*: standard references can be cited also within sections including 
the word "compliance" or another semantically-equivalent word in their name.
* *RULE #3*: a standard reference is composed of two parts:
** Name of the standard organization (acronym, full name, or both).
** Alphanumeric identifier of the standard within the organization.
* *RULE #4*: The name of the standard organization includes the acronym or the 
full name or both. The name must belong to the set of standard organizations 
{{S = O U V}}, where {{O}} represents the set of open standard organizations 
(e.g., ANSI) and {{V}} represents the set of vendor-specific standard 
organizations (e.g., Motorola).
* *RULE #5*: A separation symbol (e.g., "-", "_", "." or whitespace) can be 
used between the name of the standard organization and the alphanumeric 
identifier.
* *RULE #6*: The alphanumeric identifier of the standard is composed of 
alphabetic and numeric characters, possibly split in two or more parts by a 
separation symbol (e.g., "-", "_", ".").
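To make RULES #3–#6 concrete, here is a deliberately simplified sketch of how they could be condensed into a single Java regular expression; the pattern and class name are illustrative, not the ones used by the handler, and the pattern overgenerates because it does not check the organization name against the set {{S = O U V}}:

```java
import java.util.regex.Pattern;

public class RuleBasedReferencePattern {
    // Hypothetical condensation of RULES #3-#6: an organization acronym
    // (RULES #3-#4), an optional separation symbol (RULE #5), and an
    // alphanumeric identifier possibly split by "-", "_" or "." (RULE #6).
    static final Pattern REF =
            Pattern.compile("\\b([A-Z][A-Z0-9]+)[-_. ]?([A-Z0-9]+(?:[-_.][A-Z0-9]+)*)\\b");

    public static void main(String[] args) {
        System.out.println(REF.matcher("MIL-STD-810G").find());  // true
        System.out.println(REF.matcher("per iso rules").find()); // false
    }
}
```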

On the basis of the above rules, here are some examples of formats used for 
reporting standard references within a SOW/PWS:
* {{<name of standard organization><separation symbol><alphanumeric identifier>}}
* {{<full name of standard organization> (<acronym>)<separation symbol><alphanumeric identifier>}}
* {{<acronym of standard organization> (<full name>)<separation symbol><alphanumeric identifier>}}

Moreover, some standards are sometimes released jointly by two standard 
organizations. In this case, the standard reference can be reported as follows:
* {{<organization 1>/<organization 2><separation symbol><alphanumeric identifier>}}

h1. Regular Expressions

The {{StandardsExtractingContentHandler}} uses a helper class named 
{{StandardsText}} that relies on Java regular expressions and provides some 
methods to identify headers and standard references, and determine the score of 
the references found within the given text.

Here are the main regular expressions used within the {{StandardsText}} class:
* *REGEX_HEADER*: regular expression to match only uppercase headers.
  {code}
  (\d+\.(\d+\.?)*)\p{Blank}+([A-Z]+(\s[A-Z]+)*){5,}
  {code}
* *REGEX_APPLICABLE_DOCUMENTS*: regular expression to match the header of 
"APPLICABLE DOCUMENTS" and equivalent sections.
  {code}
  
(?i:.*APPLICABLE\sDOCUMENTS|REFERENCE|STANDARD|REQUIREMENT|GUIDELINE|COMPLIANCE.*)
  {code}
* *REGEX_FALLBACK*: regular expression to match a string that is supposed to be 
a standard reference.
  {code}
  
\(?(?[A-Z]\w+)\)?((\s?(?\/)\s?)(\w+\s)*\(?(?[A-Z]\w+)\)?)?(\s(Publication|Standard))?(-|\s)?(?
  {code}
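As a quick sanity check, {{REGEX_HEADER}} as transcribed above can be exercised directly with Java's {{java.util.regex}} (the class name below is only for the demo):

```java
import java.util.regex.Pattern;

public class HeaderRegexDemo {
    // REGEX_HEADER exactly as quoted in the description above
    static final Pattern HEADER =
            Pattern.compile("(\\d+\\.(\\d+\\.?)*)\\p{Blank}+([A-Z]+(\\s[A-Z]+)*){5,}");

    public static void main(String[] args) {
        // A numbered, all-uppercase section header matches...
        System.out.println(HEADER.matcher("3.2.1 APPLICABLE DOCUMENTS").find()); // true
        // ...while a lowercase one does not.
        System.out.println(HEADER.matcher("3.2.1 applicable documents").find()); // false
    }
}
```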

[jira] [Updated] (TIKA-2449) Enabling extraction of standard references from text

2017-08-29 Thread Giuseppe Totaro (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giuseppe Totaro updated TIKA-2449:
--

[jira] [Updated] (TIKA-2449) Enabling extraction of standard references from text

2017-08-29 Thread Giuseppe Totaro (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giuseppe Totaro updated TIKA-2449:
--

[jira] [Updated] (TIKA-2449) Enabling extraction of standard references from text

2017-08-29 Thread Giuseppe Totaro (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giuseppe Totaro updated TIKA-2449:
--
Description: 
Apache Tika currently provides many _ContentHandler_ which help to de-obfuscate 
some information from text. For instance, the {{PhoneExtractingContentHandler}} 
is used to extract phone numbers while parsing.

This improvement adds the *{{StandardsExtractingContentHandler}}* to Tika, a 
new ContentHandler that relies on regular expressions in order to identify and 
extract standard references from text. 
Basically, a standard reference is just a reference to a 
norm/convention/requirement (i.e., a standard) released by a standard 
organization. This work is maily focused on identifying and extracting the 
references to the standards already cited within a given document (e.g., 
SOW/PWS) so the references can be stored and provided to the user as additional 
metadata in case the StandardExtractingContentHandler is used.

In addition to the patch, the first version of the 
{{StandardsExtractingContentHandler}} along with an example class to easily 
execute the handler is available on 
[GitHub|https://github.com/giuseppetotaro/StandardsExtractingContentHandler]. 
The following sections provide more in detail how the 
{{StandardsExtractingHandler}} has been developed.

h1. Background

>From a technical perspective, a standard reference is a string that is usually 
>composed of two parts: 
# the name of the standard organization; 
# the alphanumeric identifier of the standard within the organization. 
Specifically, the first part can include the acronym or the full name of the 
standard organization or even both, and the second part can include an 
alphanumeric string, possibly containing one or more separation symbols (e.g., 
"-", "_", ".") depending on the format adopted by the organization, 
representing the identifier of the standard within the organization.

Furthermore, the standard references are usually reported within the 
"Applicable Documents" or "References" section of a SOW, and they can be cited 
also within sections that include in the header the word "standard", 
"requirement", "guideline", or "compliance".

Consequently, the citation of standard references within a SOW/PWS document can 
be summarized by the following rules:
* *RULE #1*: standard references are usually reported within the section named 
"Applicable Documents" or "References".
* *RULE #2*: standard references can be cited also within sections including 
the word "compliance" or another semantically-equivalent word in their name.
* *RULE #3*: standard references is composed of two parts:
** Name of the standard organization (acronym, full name, or both).
** Alphanumeric identifier of the standard within the organization.
* *RULE #4*: The name of the standard organization includes the acronym or the 
full name or both. The name must belong to the set of standard organizations S 
= O U V, where O represents the set of open standard organizations (e.g., ANSI) 
and V represents the set of vendor-specific standard organizations (e.g., 
Motorola).
* *RULE #5*: A separation symbol (e.g., "-", "_", "." or whitespace) can be 
used between the name of the standard organization and the alphanumeric 
identifier.
* *RULE #6*: The alphanumeric identifier of the standard is composed of 
alphabetic and numeric characters, possibly split in two or more parts by a 
separation symbol (e.g., "-", "_", ".").

On the basis of the above rules, here are some examples of formats used for 
reporting standard references within a SOW/PWS:
* {{}}
* 
{{()}}
* 
{{()}}

Moreover, some standards are sometimes released by two standard organizations. 
In this case, the standard reference can be reported as follows:
* 
{{/}}

h1. Regular Expressions

The {{StandardsExtractingContentHandler}} uses a helper class named 
`StandardsText` that relies on Java regular expressions and provides some 
methods to identify headers and standard references, and determine the score of 
the references found within the given text.

Here are the main regular expressions used within the StandardsText class:
* *REGEX_HEADER*: regular expression to match only uppercase headers.
  {code}
  (\d+\.(\d+\.?)*)\p{Blank}+([A-Z]+(\s[A-Z]+)*){5,}
  {code}
* *REGEX_APPLICABLE_DOCUMENTS*: regular expression to match the the header of 
"APPLICABLE DOCUMENTS" and equivalent sections.
  {code}
  
(?i:.*APPLICABLE\sDOCUMENTS|REFERENCE|STANDARD|REQUIREMENT|GUIDELINE|COMPLIANCE.*)
  {code}
* *REGEX_FALLBACK*: regular expression to match a string that is supposed to be 
a standard reference.
  {code}
  
\(?(?[A-Z]\w+)\)?((\s?(?\/)\s?)(\w+\s)*\(?(?[A-Z]\w+)\)?)?(\s(Publication|Standard))?(-|\s)?(?
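As a quick sanity check, the two header patterns above can be exercised directly from Java. This is a minimal sketch: the patterns are copied verbatim from the list above, and the class and method names are illustrative, not part of Tika.

```java
import java.util.regex.Pattern;

// Minimal demo of the header patterns listed above.
// REGEX_HEADER matches numbered, all-uppercase section headers;
// REGEX_APPLICABLE_DOCUMENTS matches "APPLICABLE DOCUMENTS" and equivalent headers.
class HeaderRegexDemo {
    static final Pattern HEADER =
            Pattern.compile("(\\d+\\.(\\d+\\.?)*)\\p{Blank}+([A-Z]+(\\s[A-Z]+)*){5,}");
    static final Pattern APPLICABLE_DOCUMENTS = Pattern.compile(
            "(?i:.*APPLICABLE\\sDOCUMENTS|REFERENCE|STANDARD|REQUIREMENT|GUIDELINE|COMPLIANCE.*)");

    static boolean isUppercaseHeader(String line) {
        return HEADER.matcher(line).matches();
    }

    static boolean isApplicableDocumentsHeader(String line) {
        return APPLICABLE_DOCUMENTS.matcher(line).matches();
    }
}
```

Note that the {{(?i:...)}} group makes the second pattern case-insensitive, so a lowercase "applicable documents" header is still recognized, while REGEX_HEADER deliberately rejects anything that is not all uppercase.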

[jira] [Updated] (TIKA-2449) Enabling extraction of standard references from text

2017-08-29 Thread Giuseppe Totaro (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giuseppe Totaro updated TIKA-2449:
--
Attachment: standards_extraction.patch
flowchart_standards_extraction.png
SOW-TacCOM.pdf

> Enabling extraction of standard references from text
> 
>
> Key: TIKA-2449
> URL: https://issues.apache.org/jira/browse/TIKA-2449
> Project: Tika
>  Issue Type: Improvement
>  Components: handler
>        Reporter: Giuseppe Totaro
>  Labels: handler
> Attachments: flowchart_standards_extraction.png, SOW-TacCOM.pdf, 
> standards_extraction.patch
>

[jira] [Created] (TIKA-2449) Enabling extraction of standard references from text

2017-08-29 Thread Giuseppe Totaro (JIRA)
Giuseppe Totaro created TIKA-2449:
-

 Summary: Enabling extraction of standard references from text
 Key: TIKA-2449
 URL: https://issues.apache.org/jira/browse/TIKA-2449
 Project: Tika
  Issue Type: Improvement
  Components: handler
Reporter: Giuseppe Totaro


Apache Tika currently provides many _ContentHandler_ implementations that help 
to extract specific information from text. For instance, the 
{{PhoneExtractingContentHandler}} is used to extract phone numbers while 
parsing.

This improvement adds the *{{StandardsExtractingContentHandler}}* to Tika, a 
new ContentHandler that relies on regular expressions in order to identify and 
extract standard references from text. 
Basically, a standard reference is just a reference to a 
norm/convention/requirement (i.e., a standard) released by a standard 
organization. This work is mainly focused on identifying and extracting the 
references to the standards already cited within a given document (e.g., 
SOW/PWS) so that the references can be stored and provided to the user as 
additional metadata in case the {{StandardsExtractingContentHandler}} is used.

In addition to the patch, the first version of the 
{{StandardsExtractingContentHandler}} along with an example class to easily 
execute the handler is available on 
[GitHub|https://github.com/giuseppetotaro/StandardsExtractingContentHandler]. 
The following sections describe in more detail how the 
{{StandardsExtractingContentHandler}} has been developed.

h1. Background

From a technical perspective, a standard reference is a string that is usually 
composed of two parts: 
# the name of the standard organization; 
# the alphanumeric identifier of the standard within the organization. 
Specifically, the first part can include the acronym or the full name of the 
standard organization or even both, and the second part can include an 
alphanumeric string, possibly containing one or more separation symbols (e.g., 
"-", "_", ".") depending on the format adopted by the organization, 
representing the identifier of the standard within the organization.

Furthermore, the standard references are usually reported within the 
"Applicable Documents" or "References" section of a SOW, and they can be cited 
also within sections that include in the header the word "standard", 
"requirement", "guideline", or "compliance".

Consequently, the citation of standard references within a SOW/PWS document can 
be summarized by the following rules:
* *RULE #1*: standard references are usually reported within the section named 
"Applicable Documents" or "References".
* *RULE #2*: standard references can be cited also within sections including 
the word "compliance" or another semantically-equivalent word in their name.
* *RULE #3*: a standard reference is composed of two parts:
** Name of the standard organization (acronym, full name, or both).
** Alphanumeric identifier of the standard within the organization.
* *RULE #4*: The name of the standard organization includes the acronym or the 
full name or both. The name must belong to the set of standard organizations S 
= O U V, where O represents the set of open standard organizations (e.g., ANSI) 
and V represents the set of vendor-specific standard organizations (e.g., 
Motorola).
* *RULE #5*: A separation symbol (e.g., "-", "_", "." or whitespace) can be 
used between the name of the standard organization and the alphanumeric 
identifier.
* *RULE #6*: The alphanumeric identifier of the standard is composed of 
alphabetic and numeric characters, possibly split in two or more parts by a 
separation symbol (e.g., "-", "_", ".").

On the basis of the above rules, here are some examples of formats used for 
reporting standard references within a SOW/PWS:
* {{<ORGANIZATION> <IDENTIFIER>}}
* {{<ORGANIZATION_ACRONYM> (<ORGANIZATION_FULL_NAME>) <IDENTIFIER>}}
* {{<ORGANIZATION_FULL_NAME> (<ORGANIZATION_ACRONYM>) <IDENTIFIER>}}

Moreover, some standards are sometimes released by two standard organizations. 
In this case, the standard reference can be reported as follows:
* {{<ORGANIZATION_1>/<ORGANIZATION_2> <IDENTIFIER>}}
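To make Rules #3 through #6 concrete, the following sketch splits a reference into its two parts with a deliberately simplified pattern. This is illustrative only: the pattern, class, and group names are assumptions for the example, not the handler's actual regular expressions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified illustration of Rules #3-#6: an organization name, a separator
// symbol, and an alphanumeric identifier possibly split by "-", "_" or "."
// (pattern and names are illustrative, not Tika's).
class StandardReferenceDemo {
    private static final Pattern SIMPLE_REFERENCE = Pattern.compile(
            "(?<organization>[A-Z][A-Za-z]+)[\\s\\-_.]+(?<identifier>[A-Za-z0-9][A-Za-z0-9._-]*)");

    // Returns "organization/identifier" on a full match, or null otherwise.
    static String split(String reference) {
        Matcher m = SIMPLE_REFERENCE.matcher(reference);
        return m.matches() ? m.group("organization") + "/" + m.group("identifier") : null;
    }
}
```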

h1. Regular Expressions

The {{StandardsExtractingContentHandler}} uses a helper class named 
{{StandardsText}} that relies on Java regular expressions and provides some 
methods to identify headers and standard references, and determine the score of 
the references found within the given text.

Here are the main regular expressions used within the {{StandardsText}} class:
* *REGEX_HEADER*: regular expression to match only uppercase headers.
  {code}
  (\d+\.(\d+\.?)*)\p{Blank}+([A-Z]+(\s[A-Z]+)*){5,}
  {code}
* *REGEX_APPLICABLE_DOCUMENTS*: regular expression to match the header of 
"APPLICABLE DOCUMENTS" and equivalent sections.
  {code}
  
(?i:.*APPLICABLE\sDOCUMENTS|REFERENCE|STANDARD|REQUIREMENT|GUIDELINE|COMPLIANCE.*)
  {code}
* *REGEX_FALLBACK*: regular expression to match a string that is supposed to be 
a standard reference.

[jira] [Commented] (TIKA-1739) cTAKESParser doesn't work in 1.11

2015-09-23 Thread Giuseppe Totaro (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14904761#comment-14904761
 ] 

Giuseppe Totaro commented on TIKA-1739:
---

Great suggestion [~gagravarr]. Thanks [~chrismattmann] for updating the 
documentation.
Giuseppe

> cTAKESParser doesn't work in 1.11
> -
>
> Key: TIKA-1739
> URL: https://issues.apache.org/jira/browse/TIKA-1739
> Project: Tika
>  Issue Type: Bug
>  Components: parser, server
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 1.11
>
> Attachments: TIKA-1739.patch
>
>
> Tika cTAKESParser integration doesn't work in 1.11. The parser is called, but 
> blank metadata comes back:
> {noformat}
> curl -T test.txt -H "Content-Type: text/plain" 
> http://localhost:/rmeta/text
> [{"Content-Type":"text/plain","X-Parsed-By":["org.apache.tika.parser.CompositeParser","org.apache.tika.parser.ctakes.CTAKESParser","org.apache.tika.parser.EmptyParser"],"X-TIKA:parse_time_millis":"20371","ctakes:schema":"coveredText:start:end:ontologyConceptArr"}
> {noformat}
> [~gagravarr] I wonder if something that happened in TIKA-1653 broke it?
> http://svn.apache.org/viewvc?view=revision&revision=1684199
> [~gostep] can you help me look here?
> I'm working on 
> https://github.com/chrismattmann/shangridocs/tree/convert-wicket which is 
> where I first saw this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1739) cTAKESParser doesn't work in 1.11

2015-09-22 Thread Giuseppe Totaro (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903123#comment-14903123
 ] 

Giuseppe Totaro commented on TIKA-1739:
---

Hi [~chrismattmann], Hi [~gagravarr],
I looked at the last code of {{CTAKESParser.java}} and I did some experiments 
on my laptop.
Basically, the problem is due to the default constructor of 
{{CTAKESParser.java}}:
{code:java}
/**
 * Wraps the default Parser
 */
public CTAKESParser() {
this(TikaConfig.getDefaultConfig());
}
{code}

To use CTAKESParser, we need to create a specific configuration for 
CTAKESParser (unless we aim at using the parser programmatically), as reported 
in [ctakesparser-utils|https://github.com/chrismattmann/ctakesparser-utils] 
repository.
While parsing, Tika uses the default constructor of CTAKESParser, which 
overrides the given configuration at runtime. Therefore, CTAKESParser is only 
"visited" by Tika, which falls back to the EmptyParser instead.

For instance, if we restore the previous default constructor (which does not 
override the given configuration), then cTAKES works properly and we obtain 
the right metadata:
{code:java}
public CTAKESParser() {
super(new AutoDetectParser());
}
{code}

[~chrismattmann] and [~gagravarr], I will be really glad to hear your feedback.
Thanks a lot,
Giuseppe






[jira] [Commented] (TIKA-1691) Apache Tika for enabling metadata interoperability

2015-07-27 Thread Giuseppe Totaro (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14643066#comment-14643066
 ] 

Giuseppe Totaro commented on TIKA-1691:
---

Hi [~gagravarr], Hi [~chrismattmann],

did you have any chance to read my last comment?

Thanks,
Giuseppe

 Apache Tika for enabling metadata interoperability
 --

 Key: TIKA-1691
 URL: https://issues.apache.org/jira/browse/TIKA-1691
 Project: Tika
  Issue Type: New Feature
Reporter: Giuseppe Totaro
Assignee: Giuseppe Totaro
  Labels: mapping, metadata
 Attachments: mapping_example.pdf







[jira] [Commented] (TIKA-1691) Apache Tika for enabling metadata interoperability

2015-07-22 Thread Giuseppe Totaro (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637930#comment-14637930
 ] 

Giuseppe Totaro commented on TIKA-1691:
---

Hello [~gagravarr],

your feedback is very much appreciated. I believe that providing metadata 
mapping on the getter side is a great idea. However, I will try to clarify my 
proposal below by reporting two (high-level) use cases.

As use case, we can consider the following:

We want to index both textual content and metadata from a heterogeneous set of 
digital documents, providing uniform access to the metadata properties 
extracted from files. Therefore, we want to allow users to submit search 
queries by using an end-user specific mediated schema.

We can summarize the use case above as follows:
# collect a huge amount of heterogeneous files (e.g., PDF, DOC, JPG, PPT, TXT, 
etc);
# extract both text and metadata from files by using Tika;
# map all metadata properties to a mediated schema that will be used for 
searching purposes;
# create an inverted index from the extracted contents;
# use the index in order to perform search queries based on metadata values.

Another use case is the following:

We want to compute some similarity metrics based on metadata features. To 
perform similarity, we need to provide the semantic correspondences among 
different metadata schemes.

We can summarize the use case above as follows:
# collect a huge amount of heterogeneous files (e.g., PDF, DOC, JPG, PPT, TXT, 
etc);
# extract both text and metadata from files by using Tika;
# map all metadata properties to a mediated schema that will be used for 
performing similarity among different schemes;
# use the metadata mapping to compute the given similarity metric among 
metadata from different schemes.

Currently, Tika enables consistent metadata across file formats by relying on 
[TikaCoreProperties|http://tika.apache.org/1.9/api/org/apache/tika/metadata/TikaCoreProperties.html],
 which are defined in terms of other standard namespaces. However, this core 
set of metadata could limit the interoperability among many metadata schemes, 
since Tika developers are continually providing support for new filetypes (and 
metadata schemes). 

Furthermore, I have identified two more functionalities for better metadata 
interoperability:
* a fine-grained mapping technique to potentially define metadata mappings for 
each mimetype. This allows, for example, either to exclude the mapping of 
metadata for some types or to provide different mappings of the same schema on 
different types. 
* a metadata mapping technique that subsumes schema mapping (property names) 
and instance transformation (property values).

I am working on providing a default mediated schema (via XML-based 
configuration) based on a core set of utility (Java) methods for metadata 
mapping.

You can find in attachment (_mapping_example_) an extremely simple diagram that 
reports an example of metadata mapping by defining source property, target 
property (that provides essentially schema mapping), mapping expression (that 
describes the semantics of each mapping relationship), and function (that 
provides instance transformation).

By the way, I am also working on a [D3|http://d3js.org/]-based utility that 
visualizes the new metadata mappings provided in Tika starting from the XML 
configuration file (i.e., {{tika-metadata.xml}}). The output is based on the 
[hierarchical edge bundling 
algorithm|https://github.com/mbostock/d3/wiki/Bundle-Layout].

Regarding the possibility to provide mappings on the getter side, I think that 
is a great idea. I believe that we should enable users to select 
programmatically (or via configuration) whether to apply mappings on the setter 
side or not. For instance, providing mappings on the setter side requires 
performing the actual mapping only during extraction, whereas on the getter 
side the mappings would be performed for each {{metadata.get()}}.

Thanks again Nick for your feedback. I hope that you are going to give more 
comments on this work. I would really appreciate it.
I take this opportunity to thank [~chrismattmann] for supporting me on this 
work.

Cheers,
Giuseppe

 Apache Tika for enabling metadata interoperability
 --

 Key: TIKA-1691
 URL: https://issues.apache.org/jira/browse/TIKA-1691
 Project: Tika
  Issue Type: New Feature
Reporter: Giuseppe Totaro
Assignee: Giuseppe Totaro
  Labels: mapping, metadata
 Attachments: mapping_example.pdf



[jira] [Updated] (TIKA-1691) Apache Tika for enabling metadata interoperability

2015-07-22 Thread Giuseppe Totaro (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giuseppe Totaro updated TIKA-1691:
--
Attachment: mapping_example.pdf

 Apache Tika for enabling metadata interoperability
 --

 Key: TIKA-1691
 URL: https://issues.apache.org/jira/browse/TIKA-1691
 Project: Tika
  Issue Type: New Feature
Reporter: Giuseppe Totaro
Assignee: Giuseppe Totaro
  Labels: mapping, metadata
 Attachments: mapping_example.pdf







[jira] [Created] (TIKA-1691) Apache Tika for enabling metadata interoperability

2015-07-21 Thread Giuseppe Totaro (JIRA)
Giuseppe Totaro created TIKA-1691:
-

 Summary: Apache Tika for enabling metadata interoperability
 Key: TIKA-1691
 URL: https://issues.apache.org/jira/browse/TIKA-1691
 Project: Tika
  Issue Type: New Feature
Reporter: Giuseppe Totaro
Assignee: Giuseppe Totaro


If I am not wrong, enabling consistent metadata across file formats is already 
(partially) provided into Tika by relying on {{TikaCoreProperties}} and, within 
the context of Solr, {{ExtractingRequestHandler}} (by defining how to map 
metadata fields in {{solrconfig.xml}}). However, I am working on a new 
component for both schema mapping (to operate on the name of metadata 
properties) and instance transformation (to operate on the value of metadata) 
that consists, essentially, of the following changes:

* A wrapper of {{Metadata}} object ({{MappedMetadata.java}}) that decorates the 
{{set}} method (currently, line number 367 of {{Metadata.java}}) by applying 
the given mapping functions (via configuration) before setting metadata 
properties.
* Basic mapping functions ({{BasicMappingUtils.java}}) that are utility methods 
to map a set of metadata to the target schema.
* A new {{MetadataConfig}} object that, like {{TikaConfig}}, may be configured 
via an XML file (organized as shown in the following snippet) and allows 
performing a fine-grained metadata mapping by using Java reflection.

{code:xml|title=tika-metadata.xml|borderStyle=solid}
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<properties>
  <mappings>
    <mapping type="type/sub-type">
      <relation name="SOURCE_FIELD">
        <target>TARGET_FIELD</target>
        <expression>exclude|include|equivalent|overlap</expression>
        <function name="FUNCTION_NAME">
          <argument>ARGUMENT_VALUE</argument>
        </function>
        <cardinality>
          <source>SOURCE_CARDINALITY</source>
          <target>TARGET_CARDINALITY</target>
          <order>ORDER_NUMBER</order>
          <dependencies>
            <field>FIELD_NAME</field>
          </dependencies>
        </cardinality>
      </relation>
    </mapping>
    ...
    <mapping> <!-- This contains the fallback strategy for unknown metadata -->
      <relation>
        ...
      </relation>
    </mapping>
  </mappings>
</properties>
{code}
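As a minimal Java sketch of the proposed decorator idea (all names, including {{MappedMetadata}} itself and the mapping tables, are hypothetical illustrations of the proposal, not existing Tika API): the decorated {{set}} applies a schema mapping to the property name and an instance transformation to the value before storing it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch of the proposed MappedMetadata decorator: schema
// mapping renames properties to a mediated schema, and an optional instance
// transformation rewrites the value, before the property is stored.
class MappedMetadata {
    private final Map<String, String> store = new HashMap<>();
    private final Map<String, String> schemaMapping;            // source name -> target name
    private final Map<String, UnaryOperator<String>> valueFns;  // target name -> value transformation

    MappedMetadata(Map<String, String> schemaMapping,
                   Map<String, UnaryOperator<String>> valueFns) {
        this.schemaMapping = schemaMapping;
        this.valueFns = valueFns;
    }

    // Decorated set(): map the name, transform the value, then store.
    void set(String name, String value) {
        String target = schemaMapping.getOrDefault(name, name);
        UnaryOperator<String> fn = valueFns.getOrDefault(target, UnaryOperator.identity());
        store.put(target, fn.apply(value));
    }

    String get(String name) {
        return store.get(name);
    }
}
```

Unknown properties fall through unchanged, which corresponds to the fallback {{mapping}} element in the snippet above.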

The theoretical definition of metadata mapping is available in [A survey of 
techniques for achieving metadata 
interoperability|http://www.researchgate.net/profile/Bernhard_Haslhofer/publication/220566013_A_survey_of_techniques_for_achieving_metadata_interoperability/links/02e7e533e76187c0b800.pdf].
This paper also shows some basic examples of metadata mappings.

Currently, I am still working on some core functionalities, but I have already 
performed some experiments by using a small prototype.

By the way, I think that we should modify the {{add}} method in order to use 
{{set}} instead of {{metadata.put}} (currently, line number 316 of 
{{Metadata.java}}). This is a trivial change (I could create a new Jira issue 
about that), but it would make it consistent with the other implementation of 
the {{add}} method and, moreover, the methods of {{Metadata}} could be extended 
more easily.

I would really appreciate your feedback about this proposal. If you believe 
that it is a good idea, I could provide the code in a few days.

Thanks a lot,
Giuseppe





[jira] [Resolved] (TIKA-1654) Reset cTAKES CAS into CTAKESParser

2015-06-21 Thread Giuseppe Totaro (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giuseppe Totaro resolved TIKA-1654.
---
Resolution: Fixed

 Reset cTAKES CAS into CTAKESParser
 --

 Key: TIKA-1654
 URL: https://issues.apache.org/jira/browse/TIKA-1654
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Giuseppe Totaro
Assignee: Giuseppe Totaro
  Labels: patch
 Fix For: 1.10

 Attachments: TIKA-1654.patch, TIKA-1654.v02.patch


 Using [CTAKESParser from Tika 
 Server|https://wiki.apache.org/tika/cTAKESParser], I noticed that an 
 exception occurs when the CTAKESParser is used multiple times:
 {noformat}
 org.apache.uima.cas.CASRuntimeException: Data for Sofa feature 
 setLocalSofaData() has already been set.
 {noformat}
 This is due to the CAS (Common Analysis System) used by CTAKESParser. The 
 CAS, like the AE (AnalysisEngine), is a static field in CTAKESParser, making 
 it a sort of singleton.
 By the way, an Analysis Engine is a cTAKES/UIMA component responsible for 
 analyzing unstructured information, discovering and representing semantic 
 content. An AnalysisEngine operates on an analysis structure (implemented 
 by CAS).
 It is highly recommended to reuse the CAS, but it has to be reset before the 
 next run. The CTAKESUtils class ({{org.apache.tika.parser.ctakes}}) provides 
 the reset method to release all resources held by both AnalysisEngine and CAS 
 and then destroy them. This method prevents the CASRuntimeException error.
 You can find in attachment the patch including two new methods (resetCAS and 
 resetAE) to reset, but not to destroy, the CAS and the AnalysisEngine 
 respectively.
 By using only resetCAS, CTAKESParser can reuse both CAS and AE instead of 
 building them again for each run.
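The reuse-and-reset pattern described above can be sketched without any UIMA dependency. All names below are illustrative stand-ins; the real methods in the patch operate on UIMA's CAS and AnalysisEngine objects.

```java
// Illustrative stand-in for the reset-instead-of-destroy pattern: the
// expensive state (playing the role of the CAS) is created once, reused
// across runs, and must be reset before the next run or it fails, just
// like the CASRuntimeException described above.
class ReusableAnalyzer {
    private final StringBuilder cas = new StringBuilder();  // reusable state

    String analyze(String text) {
        if (cas.length() > 0) {
            // Analogous to "Data for Sofa feature setLocalSofaData() has already been set."
            throw new IllegalStateException("state already set; reset first");
        }
        cas.append(text);
        return cas.toString().toUpperCase();
    }

    // Analogous to the patch's resetCAS(): clear the state, keep the instance.
    void reset() {
        cas.setLength(0);
    }
}
```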





[jira] [Updated] (TIKA-1654) Reset cTAKES CAS into CTAKESParser

2015-06-11 Thread Giuseppe Totaro (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giuseppe Totaro updated TIKA-1654:
--
Fix Version/s: (was: 1.9)
   1.10

 Reset cTAKES CAS into CTAKESParser
 --

 Key: TIKA-1654
 URL: https://issues.apache.org/jira/browse/TIKA-1654
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Giuseppe Totaro
Assignee: Giuseppe Totaro
  Labels: patch
 Fix For: 1.10

 Attachments: TIKA-1654.patch





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TIKA-1654) Reset cTAKES CAS into CTAKESParser

2015-06-10 Thread Giuseppe Totaro (JIRA)
Giuseppe Totaro created TIKA-1654:
-

 Summary: Reset cTAKES CAS into CTAKESParser
 Key: TIKA-1654
 URL: https://issues.apache.org/jira/browse/TIKA-1654
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Giuseppe Totaro
Assignee: Giuseppe Totaro


Using [CTAKESParser from Tika 
Server|https://wiki.apache.org/tika/cTAKESParser], I noticed that an exception 
occurs when the CTAKESParser is used multiple times:

{noformat}
org.apache.uima.cas.CASRuntimeException: Data for Sofa feature 
setLocalSofaData() has already been set.
{noformat}

This is due to the CAS (Common Analysis System) used by CTAKESParser. The CAS, 
like the AE (AnalysisEngine), is a static field in CTAKESParser, making it a 
sort of singleton.

For context, an AnalysisEngine is a cTAKES/UIMA component responsible for 
analyzing unstructured information and discovering and representing its 
semantic content. An AnalysisEngine operates on an analysis structure 
(implemented by the CAS).

Reusing the CAS is highly recommended, but it has to be reset before the next 
run. The CTAKESUtils class ({{org.apache.tika.parser.ctakes}}) provides the 
reset method, which releases all resources held by both the AnalysisEngine and 
the CAS and then destroys them. This method prevents the CASRuntimeException 
error.

The attached patch adds two new methods (resetCAS and resetAE) that reset, but 
do not destroy, the CAS and the AnalysisEngine respectively.
By using only resetCAS, CTAKESParser can reuse both the CAS and the AE instead 
of building them again for each run.





[jira] [Updated] (TIKA-1654) Reset cTAKES CAS into CTAKESParser

2015-06-10 Thread Giuseppe Totaro (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giuseppe Totaro updated TIKA-1654:
--
Fix Version/s: 1.9

 Reset cTAKES CAS into CTAKESParser
 --

 Key: TIKA-1654
 URL: https://issues.apache.org/jira/browse/TIKA-1654
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Giuseppe Totaro
Assignee: Giuseppe Totaro
  Labels: patch
 Fix For: 1.9




[jira] [Updated] (TIKA-1654) Reset cTAKES CAS into CTAKESParser

2015-06-10 Thread Giuseppe Totaro (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giuseppe Totaro updated TIKA-1654:
--
Attachment: TIKA-1654.patch

 Reset cTAKES CAS into CTAKESParser
 --

 Key: TIKA-1654
 URL: https://issues.apache.org/jira/browse/TIKA-1654
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Giuseppe Totaro
Assignee: Giuseppe Totaro
  Labels: patch
 Fix For: 1.9

 Attachments: TIKA-1654.patch




Re: [VOTE] Release Apache Tika 1.9 Candidate #2

2015-06-08 Thread Giuseppe Totaro
Hi Chris,

I have tested Tika 1.9-rc2. In particular, I checked the new work on
CTAKESParser.
Thank you for your great work.

My vote for this RC is +1.

Thanks,
Giuseppe



On Mon, Jun 8, 2015 at 8:58 AM, Konstantin Gribov gros...@gmail.com wrote:

 Hi, Chris.

 The SHA1 hash and GPG signature are valid for all published artifacts. I've
 tested 1.9-rc2 on several text docs (rtf, pdf, doc, docx) and the results
 are quite good.

 I've found a minor regression since 1.7 (it may be related to POI, not Tika
 itself), but it shouldn't prevent releasing 1.9 from rc2. I'll try to
 create a doc that reproduces it and file a ticket in JIRA, because I can't
 share the original doc file on which it can be reproduced. FYI,
 o.a.t.p.microsoft.OfficeParser produces U+200B (zero width space) where
 U+00AD (soft hyphen) should be. The same document saved to odt and docx
 gives different content (one has U+00AD at the same position, one has
 nothing there, like tika-app-1.7 had).
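A quick self-contained way to check which of those two code points a parser actually emitted; the sample string here is an assumption, standing in for extracted text:

```java
public class CharCheck {
    public static void main(String[] args) {
        // Hypothetical extracted text containing U+200B where U+00AD was expected
        String extracted = "co\u200Boperate";
        extracted.codePoints()
                 .filter(cp -> cp == 0x00AD || cp == 0x200B)
                 // Print each suspicious code point with its official Unicode name
                 .forEach(cp -> System.out.printf("U+%04X %s%n", cp, Character.getName(cp)));
    }
}
```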

 [x] +1 Release this package as Apache Tika 1.9
 [ ] -1 Do not release this package because…

 Thank you for preparing this release.

 --
 Best regards,
 Konstantin Gribov

 Sun, 7 June 2015 at 4:47, Mattmann, Chris A (3980) 
 chris.a.mattm...@jpl.nasa.gov:

  Hi Folks,
 
  A second candidate for the Tika 1.9 release is available at:
 
https://dist.apache.org/repos/dist/dev/tika/
 
  The release candidate is a zip archive of the sources in:
http://svn.apache.org/repos/asf/tika/tags/1.9-rc2/
 
  The SHA1 checksum of the archive is
  9b78c9e9ce9640b402b7fef8e30f3cdbe384f44c.
 
  In addition, a staged maven repository is available here:
  https://repository.apache.org/content/repositories/orgapachetika-1011/
 
 
  Please vote on releasing this package as Apache Tika 1.9.
  The vote is open for the next 72 hours and passes if a majority of at
  least three +1 Tika PMC votes are cast.
 
  [ ] +1 Release this package as Apache Tika 1.9
  [ ] -1 Do not release this package because…
 
  Cheers,
  Chris
 
  P.S. Of course here is my +1.
 
 
  ++
  Chris Mattmann, Ph.D.
  Chief Architect
  Instrument Software and Science Data Systems Section (398)
  NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
  Office: 168-519, Mailstop: 168-527
  Email: chris.a.mattm...@nasa.gov
  WWW:  http://sunset.usc.edu/~mattmann/
  ++
  Adjunct Associate Professor, Computer Science Department
  University of Southern California, Los Angeles, CA 90089 USA
  ++
 
 
 



[jira] [Updated] (TIKA-1645) Extraction of biomedical information using CTAKESParser

2015-06-04 Thread Giuseppe Totaro (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giuseppe Totaro updated TIKA-1645:
--
Attachment: TIKA-1645.v02.patch

 Extraction of biomedical information using CTAKESParser
 ---

 Key: TIKA-1645
 URL: https://issues.apache.org/jira/browse/TIKA-1645
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Giuseppe Totaro
Assignee: Giuseppe Totaro
  Labels: patch
 Fix For: 1.10

 Attachments: CTAKESConfig.properties, TIKA-1645.patch, 
 TIKA-1645.v02.patch, tika-config.xml




[jira] [Commented] (TIKA-1645) Extraction of biomedical information using CTAKESParser

2015-06-04 Thread Giuseppe Totaro (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14572252#comment-14572252
 ] 

Giuseppe Totaro commented on TIKA-1645:
---

Hi [~chrismattmann], thanks for your feedback. I really appreciate it.
You can find a new patch in attachment. Basically, the patch includes the Java 
class CTAKESParser, which decorates the AutoDetectParser and leverages the 
cTAKES Java APIs to extract biomedical information from text and, optionally, 
metadata. All the 
[IdentifiedAnnotation|http://ctakes.apache.org/apidocs/trunk/org/apache/ctakes/typesystem/type/textsem/IdentifiedAnnotation.html]s
 extracted by cTAKES are then included in the file metadata using, by default, 
the prefix {{ctakes:}}.

To build Tika with this patch via Maven, I had to modify 
{{tika-parsers/pom.xml}} and {{tika-bundle/pom.xml}}; otherwise, several 
"cannot find symbol" errors would be generated at compile time. In more 
detail, I added the {{ctakes-core}} dependency (scope "provided") to 
{{tika-parsers/pom.xml}} and excluded both the ctakes and uima dependencies in 
{{tika-bundle/pom.xml}} using the following directives in {{Import-Package}}:
{noformat}
!org.apache.ctakes.*
!org.apache.uima.*
{noformat}

By the way, I am going to implement another version of CTAKESParser as an 
external parser.
Thanks again,
Giuseppe
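For reference, a sketch of where those directives would presumably live in {{tika-bundle/pom.xml}} (maven-bundle-plugin); this is an illustrative reconstruction, not the exact patch:

```xml
<plugin>
  <groupId>org.apache.felix</groupId>
  <artifactId>maven-bundle-plugin</artifactId>
  <configuration>
    <instructions>
      <Import-Package>
        <!-- keep ctakes and uima packages out of the OSGi imports -->
        !org.apache.ctakes.*,
        !org.apache.uima.*,
        *
      </Import-Package>
    </instructions>
  </configuration>
</plugin>
```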

 Extraction of biomedical information using CTAKESParser
 ---

 Key: TIKA-1645
 URL: https://issues.apache.org/jira/browse/TIKA-1645
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Giuseppe Totaro
Assignee: Giuseppe Totaro
  Labels: patch
 Fix For: 1.10

 Attachments: CTAKESConfig.properties, TIKA-1645.patch, 
 TIKA-1645.v02.patch, tika-config.xml




[jira] [Created] (TIKA-1645) Extraction of biomedical information using CTAKESParser

2015-06-03 Thread Giuseppe Totaro (JIRA)
Giuseppe Totaro created TIKA-1645:
-

 Summary: Extraction of biomedical information using CTAKESParser
 Key: TIKA-1645
 URL: https://issues.apache.org/jira/browse/TIKA-1645
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Giuseppe Totaro


As mentioned in [TIKA-1642|https://issues.apache.org/jira/browse/TIKA-1642], 
[CTAKESContentHandler|https://github.com/giuseppetotaro/CTAKESContentHadler] is 
preliminary work to integrate [Apache cTAKES|http://ctakes.apache.org/] into 
Tika, allowing users to extract biomedical information from clinical text.
Essentially, this work includes a wrapper for CAS serializers that aims at 
writing the identified annotations out into XML-based formats.

Attached is a new patch that includes CTAKESParser, a new parser that 
decorates the AutoDetectParser and relies on a new version of 
CTAKESContentHandler based on feedback from 
[TIKA-1642|https://issues.apache.org/jira/browse/TIKA-1642]. This parser 
generates the same output as AutoDetectParser and, in addition, metadata 
containing the clinical annotations identified by cTAKES.

To run a cTAKES AnalysisEngine through Tika's CTAKESParser, you first need to 
install the latest stable release of cTAKES (3.2.2), following the 
instructions in the [User Install 
Guide|https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+3.2+User+Install+Guide].
 Then, you can launch Tika as follows:
{noformat}
CTAKES_HOME=/usr/local/apache-ctakes-3.2.2
java -cp 
tika-app-1.10-SNAPSHOT.jar:${CTAKES_HOME}/desc:${CTAKES_HOME}/resources:${CTAKES_HOME}/lib/*:/path/to/CTAKESConfig
 org.apache.tika.cli.TikaCLI --config=/path/to/tika-config.xml /path/to/input
{noformat}
In the example above, {{/path/to/CTAKESConfig}} is the parent directory of the 
file {{org/apache/tika/parser/ctakes/CTAKESConfig.properties}}, which contains 
the configuration properties used to build the cTAKES AnalysisEngine; 
{{tika-config.xml}} is a custom Tika configuration file listing the MIME types 
for which CTAKESParser will perform parsing.
Attached is an example of both {{CTAKESConfig.properties}} and 
{{tika-config.xml}} for parsing ISA-Tab files using cTAKES.

You need [UMLS credentials|https://uts.nlm.nih.gov/home.html] in order to use 
the UMLS-based components of cTAKES.

I would really appreciate your feedback.
Thanks [~selina], [~chrismattmann] and [~lewismc] for supporting me on this 
work.
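A minimal {{tika-config.xml}} of the kind described above might look like the following; the element layout follows Tika's standard configuration format, and the MIME type shown is only an illustrative assumption:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Route the listed MIME types through CTAKESParser -->
    <parser class="org.apache.tika.parser.ctakes.CTAKESParser">
      <mime>text/plain</mime>
    </parser>
  </parsers>
</properties>
```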





[jira] [Assigned] (TIKA-1645) Extraction of biomedical information using CTAKESParser

2015-06-03 Thread Giuseppe Totaro (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giuseppe Totaro reassigned TIKA-1645:
-

Assignee: Giuseppe Totaro

 Extraction of biomedical information using CTAKESParser
 ---

 Key: TIKA-1645
 URL: https://issues.apache.org/jira/browse/TIKA-1645
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Giuseppe Totaro
Assignee: Giuseppe Totaro



[jira] [Updated] (TIKA-1645) Extraction of biomedical information using CTAKESParser

2015-06-03 Thread Giuseppe Totaro (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giuseppe Totaro updated TIKA-1645:
--
Attachment: TIKA-1645.patch

 Extraction of biomedical information using CTAKESParser
 ---

 Key: TIKA-1645
 URL: https://issues.apache.org/jira/browse/TIKA-1645
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Giuseppe Totaro
Assignee: Giuseppe Totaro
  Labels: patch
 Attachments: TIKA-1645.patch




[jira] [Updated] (TIKA-1645) Extraction of biomedical information using CTAKESParser

2015-06-03 Thread Giuseppe Totaro (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giuseppe Totaro updated TIKA-1645:
--
Labels: patch  (was: )

 Extraction of biomedical information using CTAKESParser
 ---

 Key: TIKA-1645
 URL: https://issues.apache.org/jira/browse/TIKA-1645
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Giuseppe Totaro
Assignee: Giuseppe Totaro
  Labels: patch
 Attachments: TIKA-1645.patch




[jira] [Updated] (TIKA-1645) Extraction of biomedical information using CTAKESParser

2015-06-03 Thread Giuseppe Totaro (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giuseppe Totaro updated TIKA-1645:
--
Attachment: CTAKESConfig.properties
tika-config.xml

 Extraction of biomedical information using CTAKESParser
 ---

 Key: TIKA-1645
 URL: https://issues.apache.org/jira/browse/TIKA-1645
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Giuseppe Totaro
Assignee: Giuseppe Totaro
  Labels: patch
 Attachments: CTAKESConfig.properties, TIKA-1645.patch, tika-config.xml




[jira] [Assigned] (TIKA-1642) Integrate cTAKES into Tika

2015-05-28 Thread Giuseppe Totaro (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giuseppe Totaro reassigned TIKA-1642:
-

Assignee: Giuseppe Totaro

 Integrate cTAKES into Tika
 --

 Key: TIKA-1642
 URL: https://issues.apache.org/jira/browse/TIKA-1642
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Selina Chu
Assignee: Giuseppe Totaro

 [~gostep] has written a preliminary version of 
 [CTAKESContentHandler|https://github.com/giuseppetotaro/CTAKESContentHadler] 
 to integrate [Apache cTAKES|http://ctakes.apache.org/] into Tika.
 CTAKESContentHandler allows Tika to perform the following steps:
 * create an AnalysisEngine based on a given XML descriptor;
 * create a CAS (Common Analysis System) appropriate for this AnalysisEngine;
 * populate the CAS with the text extracted by Tika;
 * run the AnalysisEngine against the plain text added to the CAS;
 * write out the results in the given format (XML, XCAS, XMI, etc.).
 It would be a great improvement if we could parse the output of cTAKES and 
 create a list of metadata describing the terms found in the annotation index 
 and their corresponding tokens. For instance, using the 
 AggregatePlaintextFastUMLSProcessor analysis engine, we can utilize the UMLS 
 database to obtain the annotations related to DiseaseDisorderMention, and I 
 would like to be able to produce a list of words from the input text that 
 are annotated as DiseaseDisorderMention.
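The five steps above can be sketched with the standard UIMA API roughly as follows; this assumes the UIMA/cTAKES jars are on the classpath, and the descriptor path and sample text are placeholders:

```java
import org.apache.uima.UIMAFramework;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.cas.CAS;
import org.apache.uima.cas.impl.XmiCasSerializer;
import org.apache.uima.resource.ResourceSpecifier;
import org.apache.uima.util.XMLInputSource;

import java.io.FileOutputStream;

public class CtakesPipelineSketch {
    public static void main(String[] args) throws Exception {
        // 1. create an AnalysisEngine based on a given XML descriptor
        XMLInputSource in =
                new XMLInputSource("/path/to/AggregatePlaintextFastUMLSProcessor.xml");
        ResourceSpecifier spec = UIMAFramework.getXMLParser().parseResourceSpecifier(in);
        AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(spec);

        // 2. create a CAS appropriate for this AnalysisEngine
        CAS cas = ae.newCAS();

        // 3. populate the CAS with the text extracted by Tika
        cas.setDocumentText("Patient denies chest pain.");

        // 4. run the AnalysisEngine against the plain text added to the CAS
        ae.process(cas);

        // 5. write out the results (XMI in this case)
        try (FileOutputStream out = new FileOutputStream("output.xmi")) {
            XmiCasSerializer.serialize(cas, out);
        }
    }
}
```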





[jira] [Commented] (TIKA-1642) Integrate cTAKES into Tika

2015-05-28 Thread Giuseppe Totaro (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14563993#comment-14563993
 ] 

Giuseppe Totaro commented on TIKA-1642:
---

Hi [~selina], I believe that is a great idea. I will update my code on GitHub 
right now and add support for cTAKES metadata as you suggested.
Then, I will post here a new patch for Tika.
Thanks a lot,
Giuseppe






Re: [ANNOUNCE] Welcome Giuseppe Totaro As Tika Committer + PMC Member

2015-04-25 Thread Giuseppe Totaro
Thanks a lot David. I apologize for my delay.
I am very proud to be part of this project as committer and member of the
Tika PMC.

I am working on Information Retrieval at scale under the supervision of
Professor Chris Mattmann at NASA JPL.
I developed new parsers (e.g., ISArchiveParser) and now I am working on
adding support for more data formats in Tika.
I take this opportunity to thank Chris Mattmann and Lewis McGibbney for
kindly supporting me on this work.

I would really like to get your feedback on my work. Feel free to ask me
any questions.

Cheers,
Giuseppe

On Thu, Apr 9, 2015 at 1:27 PM, David Meikle dmei...@apache.org wrote:

 Hello All,

 Please welcome Giuseppe Totaro as he joins us as the latest Tika committer
 and PMC Member.

 He's recently been VOTEd in and now has his account all set up so is ready
 to roll!

 Giuseppe, please feel free to say a bit about yourself as an introduction
 to the group.

 Welcome aboard,
 Dave


[jira] [Updated] (TIKA-1580) ISA-Tab parsers

2015-03-26 Thread Giuseppe Totaro (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giuseppe Totaro updated TIKA-1580:
--
Attachment: TIKA-1580.v03.2.Mattmann.Totaro.03262015.patch

 ISA-Tab parsers
 ---

 Key: TIKA-1580
 URL: https://issues.apache.org/jira/browse/TIKA-1580
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Giuseppe Totaro
Assignee: Chris A. Mattmann
Priority: Minor
  Labels: new-parser
 Fix For: 1.8

 Attachments: TIKA-1580.Mattmann.Totaro.032515.patch.txt, 
 TIKA-1580.patch, TIKA-1580.v02.patch, 
 TIKA-1580.v03.2.Mattmann.Totaro.03262015.patch


 We are going to add parsers for the ISA-Tab data formats.
 ISA-Tab files are related to [ISA Tools|http://www.isa-tools.org/], which 
 help manage an increasingly diverse set of life science, environmental, and 
 biomedical experiments that employ one or a combination of technologies.
 The ISA tools are built upon the _Investigation_, _Study_, and _Assay_ 
 tabular formats. Therefore, the ISA-Tab data format includes three types of 
 file: Investigation files ({{i_*.txt}}), Study files ({{s_*.txt}}), and 
 Assay files ({{a_*.txt}}). These files are organized as a [top-down 
 hierarchy|http://www.isa-tools.org/format/specification/]: an Investigation 
 file includes one or more Study files, and each Study file includes one or 
 more Assay files.
 Essentially, the Investigation file contains high-level information about 
 the related study, so it provides only metadata about the ISA-Tab files.
 More details on the file format specification are [available 
 online|http://isatab.sourceforge.net/docs/ISA-TAB_release-candidate-1_v1.0_24nov08.pdf].
 The attached patch provides a preliminary version of the ISA-Tab parsers 
 (there are three parsers, one for each ISA-Tab filetype):
 * {{ISATabInvestigationParser.java}}: parses Investigation files. It 
 extracts only metadata.
 * {{ISATabStudyParser.java}}: parses Study files.
 * {{ISATabAssayParser.java}}: parses Assay files.
 The most important improvements would be to:
 * Combine these three parsers in order to parse an ISArchive
 * Provide a better mapping of both study and assay data onto XHTML. 
 Currently, {{ISATabStudyParser}} and {{ISATabAssayParser}} provide a naive 
 mapping function relying on [Apache Commons 
 CSV|https://commons.apache.org/proper/commons-csv/].
 Thanks for supporting me on this work [~chrismattmann]. 





[jira] [Updated] (TIKA-1580) ISA-Tab parsers

2015-03-26 Thread Giuseppe Totaro (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giuseppe Totaro updated TIKA-1580:
--
Attachment: TIKA-1580.v03.Mattmann.Totaro.03262015.patch

Hi all, I uploaded a new patch 
({{TIKA-1580.v03.Mattmann.Totaro.03262015.patch}}) including a parser for 
ISA-Tab archives.
In particular, this patch includes a new {{ISArchiveParser}} Java class that 
leverages the {{ISATabUtils}} static methods. {{ISATabUtils}} is a utility 
class that provides methods for parsing investigation, study, and assay files.
{{ISArchiveParser}} runs over study files: it starts from the given study file 
and looks for the related investigation and assay files in the same directory.
MIME type detection is also provided for investigation and assay files.
Thanks [~chrismattmann] for helping me on this stuff.
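The sibling lookup described above (start from a study file and find the investigation and assay files next to it) can be sketched with plain java.nio.file. This is an illustrative sketch under the naming convention used by the test documents (i_*.txt, s_*.txt, a_*.txt), not the actual code in the patch; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch (not the patch's code): given a study file such as
// s_BII-S-1.txt, collect the investigation (i_*.txt) and assay (a_*.txt)
// files that live in the same directory, as ISArchiveParser is described
// to do.
public class IsaSiblings {
    public static List<Path> findRelated(Path studyFile) throws IOException {
        Path dir = studyFile.toAbsolutePath().getParent();
        List<Path> related = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir, "*.txt")) {
            for (Path p : entries) {
                String name = p.getFileName().toString();
                if (name.startsWith("i_") || name.startsWith("a_")) {
                    related.add(p);
                }
            }
        }
        return related;
    }
}
```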

 ISA-Tab parsers
 ---

 Key: TIKA-1580
 URL: https://issues.apache.org/jira/browse/TIKA-1580
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Giuseppe Totaro
Assignee: Chris A. Mattmann
Priority: Minor
  Labels: new-parser
 Fix For: 1.8

 Attachments: TIKA-1580.Mattmann.Totaro.032515.patch.txt, 
 TIKA-1580.patch, TIKA-1580.v02.patch, 
 TIKA-1580.v03.Mattmann.Totaro.03262015.patch







Re: Review Request 32291: ISATab parsers (preliminary version)

2015-03-23 Thread Giuseppe Totaro

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/32291/
---

(Updated March 23, 2015, 5:04 p.m.)


Review request for tika and Chris Mattmann.


Bugs: TIKA-1580
https://issues.apache.org/jira/browse/TIKA-1580


Repository: tika


Description
---

ISATab parsers. This preliminary solution provides three parsers, one for each 
ISA-Tab filetype (Investigation, Study, Assay).


Diffs (updated)
-

  trunk/tika-bundle/pom.xml 1668683 
  trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml 
1668683 
  trunk/tika-parsers/pom.xml 1668683 
  
trunk/tika-parsers/src/main/java/org/apache/tika/parser/isatab/ISATabAssayParser.java
 PRE-CREATION 
  
trunk/tika-parsers/src/main/java/org/apache/tika/parser/isatab/ISATabInvestigationParser.java
 PRE-CREATION 
  
trunk/tika-parsers/src/main/java/org/apache/tika/parser/isatab/ISATabStudyParser.java
 PRE-CREATION 
  
trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
 1668683 
  
trunk/tika-parsers/src/test/java/org/apache/tika/parser/isatab/ISATabAssayParserTest.java
 PRE-CREATION 
  
trunk/tika-parsers/src/test/java/org/apache/tika/parser/isatab/ISATabInvestigationParserTest.java
 PRE-CREATION 
  
trunk/tika-parsers/src/test/java/org/apache/tika/parser/isatab/ISATabStudyParserTest.java
 PRE-CREATION 
  
trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_bii-s-2_metabolite
 profiling_NMR spectroscopy.txt PRE-CREATION 
  
trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_metabolome.txt
 PRE-CREATION 
  
trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_microarray.txt
 PRE-CREATION 
  
trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_proteome.txt
 PRE-CREATION 
  
trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_transcriptome.txt
 PRE-CREATION 
  
trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/i_investigation.txt
 PRE-CREATION 
  
trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/s_BII-S-1.txt
 PRE-CREATION 
  
trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/s_BII-S-2.txt
 PRE-CREATION 

Diff: https://reviews.apache.org/r/32291/diff/


Testing
---

Tested on sample ISA-Tab files downloaded from 
http://www.isa-tools.org/format/examples/.


Thanks,

Giuseppe Totaro



[jira] [Updated] (TIKA-1580) ISA-Tab parsers

2015-03-23 Thread Giuseppe Totaro (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giuseppe Totaro updated TIKA-1580:
--
Attachment: TIKA-1580.v02.patch

 ISA-Tab parsers
 ---

 Key: TIKA-1580
 URL: https://issues.apache.org/jira/browse/TIKA-1580
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Giuseppe Totaro
Assignee: Chris A. Mattmann
Priority: Minor
  Labels: new-parser
 Fix For: 1.8

 Attachments: TIKA-1580.patch, TIKA-1580.v02.patch







[jira] [Commented] (TIKA-1580) ISA-Tab parsers

2015-03-23 Thread Giuseppe Totaro (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376210#comment-14376210
 ] 

Giuseppe Totaro commented on TIKA-1580:
---

Hi [~chrismattmann], I apologize for that. I forgot to include the parsers.
I have just updated the patch on [https://reviews.apache.org/r/32291/]. You 
can also find the patch in the attachment.
Thanks [~tpalsulich] for your review. The new patch should include what you 
suggested.
[~chrismattmann] and [~tpalsulich], I am going to create my own sample files 
using the [ISACreator|http://www.isa-tools.org/software-suite/] tool and will 
then add them to the patch.
Thanks a lot for your feedback.

 ISA-Tab parsers
 ---

 Key: TIKA-1580
 URL: https://issues.apache.org/jira/browse/TIKA-1580
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Giuseppe Totaro
Assignee: Chris A. Mattmann
Priority: Minor
  Labels: new-parser
 Fix For: 1.8

 Attachments: TIKA-1580.patch







[jira] [Comment Edited] (TIKA-1580) ISA-Tab parsers

2015-03-23 Thread Giuseppe Totaro (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376210#comment-14376210
 ] 

Giuseppe Totaro edited comment on TIKA-1580 at 3/23/15 5:10 PM:


Hi [~chrismattmann], I apologize for that. I forgot to include the parsers.
I have just updated the patch (including the parsers) on 
[https://reviews.apache.org/r/32291/]. You can also find the patch in the 
attachment.
Thanks [~tpalsulich] for your review. The new patch should include what you 
suggested.
[~chrismattmann] and [~tpalsulich], I am going to create my own sample files 
using the [ISACreator|http://www.isa-tools.org/software-suite/] tool and will 
then add them to the patch.
Thanks a lot for your feedback.


was (Author: gostep):
Hi [~chrismattmann], I apologize about that. I forgot to include the parsers.
I updated right now the patch in [https://reviews.apache.org/r/32291/]. You can 
find the patch also in attachment.
Thanks [~tpalsulich] for your review. The new patch should include what you 
suggested.
[~chrismattmann] and [~tpalsulich], I am going to create my own sample files 
using [ISACreator|http://www.isa-tools.org/software-suite/] tool and then I 
will add to the patch.
Thanks a lot for your feedback.

 ISA-Tab parsers
 ---

 Key: TIKA-1580
 URL: https://issues.apache.org/jira/browse/TIKA-1580
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Giuseppe Totaro
Assignee: Chris A. Mattmann
Priority: Minor
  Labels: new-parser
 Fix For: 1.8

 Attachments: TIKA-1580.patch, TIKA-1580.v02.patch







[jira] [Commented] (TIKA-1580) ISA-Tab parsers

2015-03-20 Thread Giuseppe Totaro (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14370945#comment-14370945
 ] 

Giuseppe Totaro commented on TIKA-1580:
---

The patch has been uploaded for review on [https://reviews.apache.org/r/32291/].


 ISA-Tab parsers
 ---

 Key: TIKA-1580
 URL: https://issues.apache.org/jira/browse/TIKA-1580
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Giuseppe Totaro
Priority: Minor
 Attachments: TIKA-1580.patch







[jira] [Updated] (TIKA-1580) ISA-Tab parsers

2015-03-19 Thread Giuseppe Totaro (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giuseppe Totaro updated TIKA-1580:
--
Summary: ISA-Tab parsers  (was: ISA-Tab)

 ISA-Tab parsers
 ---

 Key: TIKA-1580
 URL: https://issues.apache.org/jira/browse/TIKA-1580
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Giuseppe Totaro
Priority: Minor






[jira] [Created] (TIKA-1580) ISA-Tab

2015-03-19 Thread Giuseppe Totaro (JIRA)
Giuseppe Totaro created TIKA-1580:
-

 Summary: ISA-Tab
 Key: TIKA-1580
 URL: https://issues.apache.org/jira/browse/TIKA-1580
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Giuseppe Totaro
Priority: Minor







[jira] [Commented] (TIKA-1483) Create a general raw string parser

2015-02-24 Thread Giuseppe Totaro (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14336144#comment-14336144
 ] 

Giuseppe Totaro commented on TIKA-1483:
---

Hi [~chrismattmann], I don't know why, but sometimes the {{patch}} tool does 
not work well if there is no newline at the end of the file.
On my laptop, the patch applies cleanly if you append a newline to the end of 
the file (e.g., {{printf "\n" >> TIKA-1483_v2.patch}}).
I hope this trivial trick may be useful.
Thank you,
Giuseppe
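For reference, the trick can be reproduced from the shell; the file name below is only a placeholder, not the actual patch file.

```shell
# A diff whose last line has no trailing newline can make `patch` misbehave;
# appending a single newline with printf fixes it.
printf 'last hunk line with no trailing newline' > demo.patch  # placeholder file
printf '\n' >> demo.patch                                      # append the missing newline
wc -c < demo.patch  # one byte longer than before the append
```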

 Create a general raw string parser
 --

 Key: TIKA-1483
 URL: https://issues.apache.org/jira/browse/TIKA-1483
 Project: Tika
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.6
Reporter: Luis Filipe Nassif
 Attachments: TIKA-1483.patch, TIKA-1483_v2.patch


 I think it would be very useful to add a general parser able to extract raw 
 strings from files (like the {{strings}} command), which could be used as 
 the fallback parser for all MIME types not having a specific parser 
 implementation, like application/octet-stream. It could also be used as a 
 fallback for corrupt files that throw a TikaException.
 It must be configured with the script/language to be extracted from the 
 files (currently I have implemented one specific to Latin-1).
 It could use heuristics to extract strings encoded with different charsets 
 within the same file, mainly the common ISO-8859-1, UTF-8, and UTF-16.
 What does the community think about that?
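The {{strings}}-style heuristic described above is easy to state precisely: scan the raw bytes and keep every run of at least N printable ASCII characters. A minimal, self-contained sketch (class and method names are illustrative, not Tika's, and this covers only the ASCII case, not the multi-charset heuristics mentioned):

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the "strings"-style heuristic: scan raw bytes and keep
// runs of >= minLen printable ASCII characters.
public class RawStrings {
    public static List<String> extract(byte[] data, int minLen) {
        List<String> out = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (byte b : data) {
            int c = b & 0xff;
            if (c >= 0x20 && c < 0x7f) {          // printable ASCII
                run.append((char) c);
            } else {                              // run broken: flush if long enough
                if (run.length() >= minLen) out.add(run.toString());
                run.setLength(0);
            }
        }
        if (run.length() >= minLen) out.add(run.toString());
        return out;
    }

    public static void main(String[] args) {
        byte[] blob = {0x00, 'h', 'e', 'l', 'l', 'o', 0x01, 0x02, 'h', 'i', 0x00};
        System.out.println(extract(blob, 4)); // prints [hello]
    }
}
```

Lowering minLen trades precision for recall: with minLen 2 the same blob also yields "hi".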





[jira] [Comment Edited] (TIKA-1541) StringsParser: a simple strings-based parser for Tika

2015-02-18 Thread Giuseppe Totaro (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14327027#comment-14327027
 ] 

Giuseppe Totaro edited comment on TIKA-1541 at 2/19/15 5:57 AM:


Hi all,
I added more unit tests, especially for the {{StringsConfig}} class (inspired 
by the work on TesseractOCRParser). You can find the patch in the attachment 
({{TIKA-1541.v02.02182015.patch}}).
Thank you,
Giuseppe


was (Author: gostep):
Hi all,
I added more unit tests, especially for {{StringsConfig}} class (inspired by 
work on TesseractOCRParser). You can find in attachment the patch 
({{TIKA-1541.v02.02182015.patch}).
Thank you,
Giuseppe

 StringsParser: a simple strings-based parser for Tika
 -

 Key: TIKA-1541
 URL: https://issues.apache.org/jira/browse/TIKA-1541
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Giuseppe Totaro
Assignee: Chris A. Mattmann
 Fix For: 1.8

 Attachments: TIKA-1541.TotaroMattmann.020615.patch.txt, 
 TIKA-1541.TotaroMattmann.020615.patch.txt, 
 TIKA-1541.TotaroMattmannBurchNassif.020715.patch, 
 TIKA-1541.TotaroMattmannBurchNassif.020815.patch, 
 TIKA-1541.TotaroMattmannBurchNassif.020915.patch, TIKA-1541.patch, 
 TIKA-1541.v02.02182015.patch, testOCTET_header.dbase3


 I implemented an extremely simple version of {{StringsParser}}, a parser 
 based on the {{strings}} command (or a {{strings}}-alternative command), to 
 use instead of the dummy {{EmptyParser}} for undetected files. It is 
 preliminary work (you can see a lot of TODOs), inspired by the work on 
 {{TesseractOCRParser}}. You can find the patch in the attachment.
 I created a GitHub 
 [repository|https://github.com/giuseppetotaro/StringsParser] for sharing the 
 code. As a first test, you can clone the repo, build the code using the 
 {{build.sh}} script, and then run the parser using the {{run.sh}} script on 
 some [govdocs1|http://digitalcorpora.org/corpora/govdocs] files (grabbed 
 from the 016 subset) detected as {{application/octet-stream}}. The latter 
 script launches a simple {{StringsTest}} class for testing.
 I hope you will find {{StringsParser}} a good solution for extracting ASCII 
 strings from undetected filetypes. As far as I understand, many 
 sophisticated forensics tools work in a similar manner for indexing 
 purposes: they run a sort of {{strings}} command against files that they 
 are not able to detect.
 In addition to running {{strings}} on undetected files, {{StringsParser}} 
 launches the {{file}} command on undetected files and then writes the 
 output to the {{strings:file_output}} property (I noticed that sometimes 
 the {{file}} command is able to detect the media type for documents not 
 detected by Tika).
 Finally, you can find an old discussion about this topic 
 [here|http://lucene.472066.n3.nabble.com/Default-MIME-Type-td645215.html]. 
 Thanks [~chrismattmann].





[jira] [Updated] (TIKA-1541) StringsParser: a simple strings-based parser for Tika

2015-02-18 Thread Giuseppe Totaro (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giuseppe Totaro updated TIKA-1541:
--
Attachment: TIKA-1541.v02.02182015.patch

Hi all,
I added more unit tests, especially for the {{StringsConfig}} class (inspired 
by the work on {{TesseractOCRParser}}). You can find the patch attached 
({{TIKA-1541.v02.02182015.patch}}).
Thank you,
Giuseppe

 StringsParser: a simple strings-based parser for Tika
 -

 Key: TIKA-1541
 URL: https://issues.apache.org/jira/browse/TIKA-1541
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Giuseppe Totaro
Assignee: Chris A. Mattmann
 Fix For: 1.8

 Attachments: TIKA-1541.TotaroMattmann.020615.patch.txt, 
 TIKA-1541.TotaroMattmann.020615.patch.txt, 
 TIKA-1541.TotaroMattmannBurchNassif.020715.patch, 
 TIKA-1541.TotaroMattmannBurchNassif.020815.patch, 
 TIKA-1541.TotaroMattmannBurchNassif.020915.patch, TIKA-1541.patch, 
 TIKA-1541.v02.02182015.patch, testOCTET_header.dbase3







[jira] [Commented] (TIKA-1483) Create a general raw string parser

2015-02-18 Thread Giuseppe Totaro (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326875#comment-14326875
 ] 

Giuseppe Totaro commented on TIKA-1483:
---

Thanks [~lfcnassif].
I agree with you about the configuration object. Generally speaking, I use the 
configuration pattern for objects with three or more parameters.
Thanks a lot,
Giuseppe
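As a rough illustration of that configuration pattern, a {{StringsConfig}}-style object might look like the sketch below. The field names (tool path, minimum string length, timeout) are illustrative assumptions based on this discussion, not the actual class API.

```java
// Illustrative sketch of a configuration object for an external-tool parser;
// these fields are assumptions based on the discussion, not the real API.
public class StringsConfigSketch {
    private String stringsPath = "";   // directory holding the strings binary ("" = use PATH)
    private int minLength = 4;         // minimum printable-run length (strings -n)
    private int timeoutSeconds = 120;  // give up on the external process after this long

    public String getStringsPath() { return stringsPath; }
    public void setStringsPath(String path) { stringsPath = path; }

    public int getMinLength() { return minLength; }
    public void setMinLength(int length) {
        if (length < 1) {
            throw new IllegalArgumentException("minLength must be at least 1");
        }
        minLength = length;
    }

    public int getTimeoutSeconds() { return timeoutSeconds; }
    public void setTimeoutSeconds(int seconds) { timeoutSeconds = seconds; }
}
```

Grouping the three-plus knobs in one object keeps the parser's {{parse}} signature stable while the tool options evolve, which is the point of the pattern.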

 Create a general raw string parser
 --

 Key: TIKA-1483
 URL: https://issues.apache.org/jira/browse/TIKA-1483
 Project: Tika
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.6
Reporter: Luis Filipe Nassif
 Attachments: TIKA-1483.patch, TIKA-1483_v2.patch


 I think it could be very useful to add a general parser able to extract raw 
 strings from files (like the strings command), which could be used as the 
 fallback parser for all mimetypes that do not have a specific parser 
 implementation, such as application/octet-stream. It could also be used as a 
 fallback for corrupt files that throw a TikaException.
 It must be configured with the script/language to be extracted from the files 
 (currently I have implemented one specific to Latin1).
 It could use heuristics to extract strings encoded with different charsets 
 within the same file, mainly the common ISO-8859-1, UTF-8, and UTF-16.
 What does the community think about that?
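A minimal sketch of the Latin1 case mentioned above: scan the raw bytes and keep runs of printable ISO-8859-1 characters of a minimum length, much as the strings command does for a single charset. This is an illustrative heuristic, not the patch's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

public class RawLatin1Strings {

    // Keep runs of at least minLen printable Latin1 (ISO-8859-1) bytes,
    // mimicking what the strings command does for a single charset.
    public static List<String> extract(byte[] data, int minLen) {
        List<String> result = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (byte b : data) {
            int c = b & 0xFF;
            // printable ASCII plus the printable upper half of Latin1
            boolean printable = (c >= 0x20 && c <= 0x7E) || (c >= 0xA0);
            if (printable) {
                run.append((char) c);
            } else {
                if (run.length() >= minLen) {
                    result.add(run.toString());
                }
                run.setLength(0);
            }
        }
        if (run.length() >= minLen) {
            result.add(run.toString());
        }
        return result;
    }
}
```

Handling UTF-16 would require a second pass that tolerates interleaved zero bytes, which is roughly what the multi-charset heuristics above would need to decide between.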





[jira] [Commented] (TIKA-1541) StringsParser: a simple strings-based parser for Tika

2015-02-10 Thread Giuseppe Totaro (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313796#comment-14313796
 ] 

Giuseppe Totaro commented on TIKA-1541:
---

[~chrismattmann], it probably depends on the encoding option of the {{strings}} 
command. Even though {{strings}} on *nix systems generally supports the {{-e}} 
option, I noticed that some non-Windows versions of {{strings}} do not provide 
it. Therefore, the test fails if a version without support for {{-e}} is used.
I modified the {{hasStrings}} method in the {{StringsParser}} class to test 
whether {{strings}} provides that option.
I built Tika+{{StringsParser}} using a {{strings}} version that does not 
support {{-e}}, and all tests pass.
Please let me know if the updated patch works for you.
Thanks a lot,
Giuseppe
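The probing idea can be sketched as follows. This is a hedged, simplified take on what a {{hasStrings}}-style check might do (run {{strings -e s}} against {{/dev/null}} and inspect the exit code); it assumes a *nix environment and is not the actual method from the patch.

```java
import java.io.IOException;

public class StringsCapability {

    // Probe whether the given strings binary accepts the -e (encoding)
    // option by running it against /dev/null and checking the exit code.
    public static boolean supportsEncodingOption(String stringsCmd) {
        try {
            Process p = new ProcessBuilder(stringsCmd, "-e", "s", "/dev/null")
                    .redirectErrorStream(true)
                    .start();
            return p.waitFor() == 0;
        } catch (IOException | InterruptedException e) {
            // command missing or interrupted: assume no -e support
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("-e supported: " + supportsEncodingOption("strings"));
    }
}
```

Probing once at startup and caching the result keeps the parser from paying the subprocess cost on every document.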

 StringsParser: a simple strings-based parser for Tika
 -

 Key: TIKA-1541
 URL: https://issues.apache.org/jira/browse/TIKA-1541
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Giuseppe Totaro
Assignee: Chris A. Mattmann
 Attachments: TIKA-1541.TotaroMattmann.020615.patch.txt, 
 TIKA-1541.TotaroMattmann.020615.patch.txt, 
 TIKA-1541.TotaroMattmannBurchNassif.020715.patch, 
 TIKA-1541.TotaroMattmannBurchNassif.020815.patch, 
 TIKA-1541.TotaroMattmannBurchNassif.020915.patch, TIKA-1541.patch, 
 testOCTET_header.dbase3







[jira] [Created] (TIKA-1541) StringsParser: a simple strings-based parser for Tika

2015-02-05 Thread Giuseppe Totaro (JIRA)
Giuseppe Totaro created TIKA-1541:
-

 Summary: StringsParser: a simple strings-based parser for Tika
 Key: TIKA-1541
 URL: https://issues.apache.org/jira/browse/TIKA-1541
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Giuseppe Totaro


I have put together an extremely simple implementation of {{StringsParser}}, a 
parser based on the {{strings}} command (or a {{strings}}-alternative command), 
to be used instead of the dummy {{EmptyParser}} for undetected files. It is 
preliminary work (you can see a lot of TODOs). It is inspired by the work on 
{{TesseractOCRParser}}.

I created a GitHub [repository|https://github.com/giuseppetotaro/StringsParser] 
for sharing the code. As a first test, you can clone the repo, build the code 
using the {{build.sh}} script, and then run the parser using the {{run.sh}} 
script on some [govdocs1|http://digitalcorpora.org/corpora/govdocs] files 
(grabbed from the 016 subset) detected as {{application/octet-stream}}. The 
latter script launches a simple {{StringsTest}} class for testing.

I hope you will find {{StringsParser}} a good solution for extracting ASCII 
strings from undetected file types. As far as I understand, many sophisticated 
forensics tools work in a similar manner for indexing purposes: they run a sort 
of {{strings}} command against files that they are not able to detect.

In addition to running {{strings}} on undetected files, {{StringsParser}} 
launches the {{file}} command on undetected files and then writes the output to 
the {{strings:file_output}} property (I noticed that the {{file}} command is 
sometimes able to detect the media type for documents not detected by Tika).

Finally, you can find an old discussion about this topic 
[here|http://lucene.472066.n3.nabble.com/Default-MIME-Type-td645215.html]. 
Thanks [~chrismattmann].





[jira] [Updated] (TIKA-1541) StringsParser: a simple strings-based parser for Tika

2015-02-05 Thread Giuseppe Totaro (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giuseppe Totaro updated TIKA-1541:
--
Description: 
I have put together an extremely simple implementation of {{StringsParser}}, a 
parser based on the {{strings}} command (or a {{strings}}-alternative command), 
to be used instead of the dummy {{EmptyParser}} for undetected files. It is 
preliminary work (you can see a lot of TODOs). It is inspired by the work on 
{{TesseractOCRParser}}. You can find the patch attached.
[file:Users/gtotaro/Desktop/TIKA-1541.patch]

I created a GitHub [repository|https://github.com/giuseppetotaro/StringsParser] 
for sharing the code. As a first test, you can clone the repo, build the code 
using the {{build.sh}} script, and then run the parser using the {{run.sh}} 
script on some [govdocs1|http://digitalcorpora.org/corpora/govdocs] files 
(grabbed from the 016 subset) detected as {{application/octet-stream}}. The 
latter script launches a simple {{StringsTest}} class for testing.

I hope you will find {{StringsParser}} a good solution for extracting ASCII 
strings from undetected file types. As far as I understand, many sophisticated 
forensics tools work in a similar manner for indexing purposes: they run a sort 
of {{strings}} command against files that they are not able to detect.

In addition to running {{strings}} on undetected files, {{StringsParser}} 
launches the {{file}} command on undetected files and then writes the output to 
the {{strings:file_output}} property (I noticed that the {{file}} command is 
sometimes able to detect the media type for documents not detected by Tika).

Finally, you can find an old discussion about this topic 
[here|http://lucene.472066.n3.nabble.com/Default-MIME-Type-td645215.html]. 
Thanks [~chrismattmann].



 StringsParser: a simple strings-based parser for Tika
 -

 Key: TIKA-1541
 URL: https://issues.apache.org/jira/browse/TIKA-1541
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Giuseppe Totaro
 Attachments: TIKA-1541.patch



[jira] [Commented] (TIKA-1423) Build a parser to extract data from GRIB formats

2015-01-29 Thread Giuseppe Totaro (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14297612#comment-14297612
 ] 

Giuseppe Totaro commented on TIKA-1423:
---

[~lewismc], your patch matches the improvements perfectly. Thank you.

 Build a parser to extract data from GRIB formats
 

 Key: TIKA-1423
 URL: https://issues.apache.org/jira/browse/TIKA-1423
 Project: Tika
  Issue Type: New Feature
  Components: metadata, mime, parser
Affects Versions: 1.6
Reporter: Vineet Ghatge
Assignee: Vineet Ghatge
Priority: Critical
  Labels: features, newbie
 Fix For: 1.8

 Attachments: GRIBParsertest.java, GribParser.java, 
 NLDAS_FORA0125_H.A20130112.1200.002.grb, TIKA-1423.palsulich.120614.patch, 
 TIKA-1423.patch, TIKA-1423v2.patch, fileName.html, 
 gdas1.forecmwf.2014062612.grib2


 The Arctic dataset contains a MIME format called GRIB - General 
 Regularly-distributed Information in Binary form 
 (http://en.wikipedia.org/wiki/GRIB). GRIB is a well-known, concise data 
 format used in meteorology to store historical and forecast weather data. 
 There are 2 different types of the format - GRIB 0, GRIB 2. The focus will 
 be on GRIB 2, which is the most prevalent. Each GRIB record intended for 
 either transmission or storage contains a single parameter with values 
 located at an array of grid points, or represented as a set of spectral 
 coefficients, for a single level (or layer), encoded as a continuous bit 
 stream. Logical divisions of the record are designated as sections, each of 
 which provides control information and/or data. A GRIB record consists of 
 six sections, two of which are optional: 
  
 (0) Indicator Section 
 (1) Product Definition Section (PDS) 
 (2) Grid Description Section (GDS) - optional 
 (3) Bit Map Section (BMS) - optional 
 (4) Binary Data Section (BDS) 
 (5) End Section '7777' (ASCII Characters)





[jira] [Commented] (TIKA-1423) Build a parser to extract data from GRIB formats

2015-01-25 Thread Giuseppe Totaro (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14291207#comment-14291207
 ] 

Giuseppe Totaro commented on TIKA-1423:
---

Hello [~vinegh], I noticed that your parser instantiates a {{File}} object 
from the {{RESOURCE_NAME_KEY}} string without using the {{InputStream}} object 
passed to the {{parse}} method:
{code:title=gribParser.java|borderStyle=solid}
…
// Get grib2 file name from metadata
File gribFile = new File(metadata.get(Metadata.RESOURCE_NAME_KEY));

try {
    NetcdfFile ncFile = NetcdfDataset.openFile(gribFile.getAbsolutePath(),
…
{code}
This means that any implementation that does not define the 
{{RESOURCE_NAME_KEY}} property in the caller as follows
{code}
metadata.add(Metadata.RESOURCE_NAME_KEY, filename);
{code}
will fail because the {{File}} constructor throws a {{NullPointerException}}.
Instead of adding {{RESOURCE_NAME_KEY}}, we can obtain the file from the 
stream using the {{TikaInputStream}} class, as is done in {{NetCDFParser.java}}:
{code}
// File gribFile = new File(metadata.get(Metadata.RESOURCE_NAME_KEY));
TikaInputStream tis = TikaInputStream.get(stream, new TemporaryResources());

try {
    NetcdfFile ncFile = NetcdfDataset.openFile(tis.getFile().getAbsolutePath(), null);
{code}
I tested it on my MacBook and it works. I also tried the 
[netcdf-tools|http://netcdftools.sourceforge.net/] library for retrieving the 
set of global attributes, but it does not work well and seems outdated.
Thank you for your great work,
Giuseppe

 Build a parser to extract data from GRIB formats
 

 Key: TIKA-1423
 URL: https://issues.apache.org/jira/browse/TIKA-1423
 Project: Tika
  Issue Type: New Feature
  Components: metadata, mime, parser
Affects Versions: 1.6
Reporter: Vineet Ghatge
Assignee: Vineet Ghatge
Priority: Critical
  Labels: features, newbie
 Fix For: 1.8

 Attachments: GRIBParsertest.java, GribParser.java, 
 NLDAS_FORA0125_H.A20130112.1200.002.grb, TIKA-1423.palsulich.120614.patch, 
 TIKA-1423.patch, fileName.html, gdas1.forecmwf.2014062612.grib2







[jira] [Commented] (TIKA-1184) Infinite halt on parsing old files (e.g. mp3, ms-dos drivers, ...)

2014-02-05 Thread Giuseppe Totaro (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13892116#comment-13892116
 ] 

Giuseppe Totaro commented on TIKA-1184:
---

Hello,

I've just run the tika-app-1.4.jar against files extracted from a disk image, 
and Tika hangs on a .sys file (attached).
I tried tika-app-1.6-SNAPSHOT.jar and it worked fine.

!file:///Users/giuseppe/Desktop/ansi.sys!

 Infinite halt on parsing old files (e.g. mp3, ms-dos drivers, ...)
 --

 Key: TIKA-1184
 URL: https://issues.apache.org/jira/browse/TIKA-1184
 Project: Tika
  Issue Type: Bug
  Components: cli, parser
Affects Versions: 1.4
 Environment: SUSE Linux Enterprise Server 11 SP3  (x86_64)
 java version 1.7.0
 Java(TM) SE Runtime Environment (build pxa6470sr4fp2-20130426_01(SR4 FP2))
 IBM J9 VM (build 2.6, JRE 1.7.0 Linux amd64-64 Compressed References 
 20130422_146026 (JIT enabled, AOT enabled)
 J9VM - R26_Java726_SR4_FP2_20130422_1320_B146026
 JIT  - r11.b03_20130131_32403ifx4
 GC   - R26_Java726_SR4_FP2_20130422_1320_B146026_CMPRSS
 J9CL - 20130422_146026)
 JCL - 20130425_01 based on Oracle 7u21-b09
Reporter: Jürgen Enge

 Tika hangs on identifying several types of files. The following example is 
 an mp3 file with corrupt metadata; other file types which have the same 
 problem are, for example, MS-DOS device drivers (*.sys).
 I am not into Java programming, but my guess would be that Tika is trying to 
 seek() within a file and the target position is greater than the file size. 
  java -jar tika-app-1.4.jar -m /u01/fk/xd/2/c/16866bc96e6a316d8cbdbd7ca2ce1e
 [hangs forever without error message]
 ffmpeg gives some warnings about duration errors...
  ffmpeg -i /u01/fk/xd/2/c/16866bc96e6a316d8cbdbd7ca2ce1e
 [mp3 @ 0x633240] max_analyze_duration 500 reached at 5015510
 [mp3 @ 0x633240] Estimating duration from bitrate, this may be inaccurate
 Input #0, mp3, from '/u01/fk/xd/2/c/16866bc96e6a316d8cbdbd7ca2ce1e':
   Metadata:
 artist  : 
 album   : 
   Duration: 00:15:29.10, start: 0.00, bitrate: 192 kb/s
 Stream #0:0: Audio: mp3, 44100 Hz, stereo, s16, 192 kb/s



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Comment Edited] (TIKA-1184) Infinite halt on parsing old files (e.g. mp3, ms-dos drivers, ...)

2014-02-05 Thread Giuseppe Totaro (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13892116#comment-13892116
 ] 

Giuseppe Totaro edited comment on TIKA-1184 at 2/5/14 1:52 PM:
---

Hello,

I've just run the tika-app-1.4.jar against files extracted from a disk image, 
and Tika hangs on a .sys file (attached).
I tried tika-app-1.6-SNAPSHOT.jar and it worked fine.

[file:///Users/giuseppe/Desktop/ansi.sys]


was (Author: gostep):
Hello,

I've just run the tika-app-1.4.jar against files extracted from a disk image, 
and Tika hangs on a .sys file (attached).
I tried tika-app-1.6-SNAPSHOT.jar and it worked fine.

!file:///Users/giuseppe/Desktop/ansi.sys!

 Infinite halt on parsing old files (e.g. mp3, ms-dos drivers, ...)
 --

 Key: TIKA-1184
 URL: https://issues.apache.org/jira/browse/TIKA-1184
 Project: Tika
  Issue Type: Bug
  Components: cli, parser
Affects Versions: 1.4
Reporter: Jürgen Enge






[jira] [Issue Comment Deleted] (TIKA-1184) Infinite halt on parsing old files (e.g. mp3, ms-dos drivers, ...)

2014-02-05 Thread Giuseppe Totaro (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giuseppe Totaro updated TIKA-1184:
--

Comment: was deleted

(was: Hello,

I've just run the tika-app-1.4.jar against files extracted from a disk image, 
and Tika hangs on a .sys file (attached).
I tried tika-app-1.6-SNAPSHOT.jar and it worked fine.

[file:///Users/giuseppe/Desktop/ansi.sys])

 Infinite halt on parsing old files (e.g. mp3, ms-dos drivers, ...)
 --

 Key: TIKA-1184
 URL: https://issues.apache.org/jira/browse/TIKA-1184
 Project: Tika
  Issue Type: Bug
  Components: cli, parser
Affects Versions: 1.4
Reporter: Jürgen Enge






[jira] [Commented] (TIKA-1184) Infinite halt on parsing old files (e.g. mp3, ms-dos drivers, ...)

2014-02-05 Thread Giuseppe Totaro (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13892122#comment-13892122
 ] 

Giuseppe Totaro commented on TIKA-1184:
---

Hello,

I've just run the tika-app-1.4.jar against files extracted from a disk image, 
and Tika hangs on a .sys file (attached).
I tried tika-app-1.6-SNAPSHOT.jar and it worked fine.


 Infinite halt on parsing old files (e.g. mp3, ms-dos drivers, ...)
 --

 Key: TIKA-1184
 URL: https://issues.apache.org/jira/browse/TIKA-1184
 Project: Tika
  Issue Type: Bug
  Components: cli, parser
Affects Versions: 1.4
Reporter: Jürgen Enge
 Attachments: ansi.sys, ansi.sys







[jira] [Updated] (TIKA-1184) Infinite halt on parsing old files (e.g. mp3, ms-dos drivers, ...)

2014-02-05 Thread Giuseppe Totaro (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giuseppe Totaro updated TIKA-1184:
--

Attachment: ansi.sys

 Infinite halt on parsing old files (e.g. mp3, ms-dos drivers, ...)
 --

 Key: TIKA-1184
 URL: https://issues.apache.org/jira/browse/TIKA-1184
 Project: Tika
  Issue Type: Bug
  Components: cli, parser
Affects Versions: 1.4
Reporter: Jürgen Enge
 Attachments: ansi.sys, ansi.sys







[jira] [Updated] (TIKA-1184) Infinite halt on parsing old files (e.g. mp3, ms-dos drivers, ...)

2014-02-05 Thread Giuseppe Totaro (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giuseppe Totaro updated TIKA-1184:
--

Attachment: ansi.sys

 Infinite halt on parsing old files (e.g. mp3, ms-dos drivers, ...)
 --

 Key: TIKA-1184
 URL: https://issues.apache.org/jira/browse/TIKA-1184
 Project: Tika
  Issue Type: Bug
  Components: cli, parser
Affects Versions: 1.4
Reporter: Jürgen Enge
 Attachments: ansi.sys, ansi.sys


 tika hangs on identifying several types of files. the following example is an 
 mp3 file with corrupt metadata. other filetypes which have the same problem 
 are for example MSDOS device drivers (*.sys)
 i am not into java programming, but my guess would be, that tika is trying to 
 seek() within a file and the target position is greater than filesize. 
  java -jar tika-app-1.4.jar -m /u01/fk/xd/2/c/16866bc96e6a316d8cbdbd7ca2ce1e
 [hangs forever without error message]
 ffmpeg gives some warnings about duration errors...
  ffmpeg -i /u01/fk/xd/2/c/16866bc96e6a316d8cbdbd7ca2ce1e
 [mp3 @ 0x633240] max_analyze_duration 500 reached at 5015510
 [mp3 @ 0x633240] Estimating duration from bitrate, this may be inaccurate
 Input #0, mp3, from '/u01/fk/xd/2/c/16866bc96e6a316d8cbdbd7ca2ce1e':
   Metadata:
 artist  : 
 album   : 
   Duration: 00:15:29.10, start: 0.00, bitrate: 192 kb/s
 Stream #0:0: Audio: mp3, 44100 Hz, stereo, s16, 192 kb/s



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Issue Comment Deleted] (TIKA-1184) Infinite halt on parsing old files (e.g. mp3, ms-dos drivers, ...)

2014-02-05 Thread Giuseppe Totaro (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giuseppe Totaro updated TIKA-1184:
--

Comment: was deleted

(was: Hello,

I've just run the tika-app-1.4.jar against files extracted from a disk image, 
and Tika hangs on a .sys file (attached).
I tried tika-app-1.6-SNAPSHOT.jar and it worked fine.
)



[jira] [Updated] (TIKA-1184) Infinite halt on parsing old files (e.g. mp3, ms-dos drivers, ...)

2014-02-05 Thread Giuseppe Totaro (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giuseppe Totaro updated TIKA-1184:
--

Attachment: ansi.sys

Hello,

I've just run the tika-app-1.4.jar against files extracted from a disk image, 
and Tika hangs on a .sys file (attached).
I tried tika-app-1.6-SNAPSHOT.jar and it worked fine.



[jira] [Updated] (TIKA-1184) Infinite halt on parsing old files (e.g. mp3, ms-dos drivers, ...)

2014-02-05 Thread Giuseppe Totaro (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giuseppe Totaro updated TIKA-1184:
--

Attachment: (was: ansi.sys)



[jira] [Updated] (TIKA-1184) Infinite halt on parsing old files (e.g. mp3, ms-dos drivers, ...)

2014-02-05 Thread Giuseppe Totaro (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giuseppe Totaro updated TIKA-1184:
--

Attachment: (was: ansi.sys)



[jira] [Updated] (TIKA-1184) Infinite halt on parsing old files (e.g. mp3, ms-dos drivers, ...)

2014-02-05 Thread Giuseppe Totaro (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giuseppe Totaro updated TIKA-1184:
--

Attachment: (was: ansi.sys)



[jira] [Commented] (TIKA-1184) Infinite halt on parsing old files (e.g. mp3, ms-dos drivers, ...)

2014-02-05 Thread Giuseppe Totaro (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13892124#comment-13892124
 ] 

Giuseppe Totaro commented on TIKA-1184:
---

Hello,

I've just run the tika-app-1.4.jar against files extracted from a disk image, 
and Tika hangs on a .sys file (attached).
I tried tika-app-1.6-SNAPSHOT.jar and it worked fine.



[jira] [Updated] (TIKA-1184) Infinite halt on parsing old files (e.g. mp3, ms-dos drivers, ...)

2014-02-05 Thread Giuseppe Totaro (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giuseppe Totaro updated TIKA-1184:
--

Attachment: ansi.sys



[jira] [Commented] (TIKA-1092) Parsing of old Word file causes a TikaException

2013-04-02 Thread Giuseppe Totaro (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13619639#comment-13619639
 ] 

Giuseppe Totaro commented on TIKA-1092:
---

Thanks Nick. I'll give you feedback as soon as possible.

 Parsing of old Word file causes a TikaException
 ---

 Key: TIKA-1092
 URL: https://issues.apache.org/jira/browse/TIKA-1092
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Giuseppe Totaro
Priority: Minor
  Labels: office, parse, word-exception

 I found an issue with the parse method of 
 org.apache.tika.parser.microsoft.OfficeParser. This parser throws a 
 TikaException when it tries to parse very old Microsoft Word files.
 I think this issue is not a priority, because the files that cause the 
 exception belong to an obsolete format/structure that even new Microsoft 
 Office versions no longer support, but it is important to know that these 
 outdated types can trigger failures.
 I report two links about old types (Microsoft support perspective):
 http://support.microsoft.com/?kbid=922850
 http://support.microsoft.com/kb/922849/it
 For example, the message of TikaException is below:
 Exception in thread main org.apache.tika.exception.TikaException: TIKA-198: 
 Illegal IOException from 
 org.apache.tika.parser.microsoft.OfficeParser@789ab21d
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:248)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
   at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:139)
   at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:415)
   at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:109)
 Caused by: java.io.IOException: Invalid header signature; read 
 0x0410401F002DA5DB, expected 0xE11AB1A1E011CFD0
   at org.apache.poi.poifs.storage.HeaderBlock.init(HeaderBlock.java:140)
   at org.apache.poi.poifs.storage.HeaderBlock.init(HeaderBlock.java:115)
   at 
 org.apache.poi.poifs.filesystem.NPOIFSFileSystem.init(NPOIFSFileSystem.java:198)
   at 
 org.apache.poi.poifs.filesystem.NPOIFSFileSystem.init(NPOIFSFileSystem.java:184)
   at 
 org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:156)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   ... 5 more
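The "Invalid header signature" line above is POIFS rejecting the first 8 bytes of the file, so a cheap pre-check on those magic bytes can route non-OLE2 "old Word" files away from OfficeParser before the exception is thrown. This is a hedged sketch (Ole2Check and looksLikeOle2 are hypothetical names, not Tika code); the constant matches the expected 0xE11AB1A1E011CFD0 value from the stack trace, stored little-endian on disk.

```java
public class Ole2Check {

    // OLE2 / Compound File magic: the little-endian long 0xE11AB1A1E011CFD0,
    // i.e. bytes D0 CF 11 E0 A1 B1 1A E1 at the start of the file.
    private static final byte[] MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Returns true only if the buffer starts with the OLE2 signature that
    // POIFS expects; anything else would hit the IOException shown above.
    static boolean looksLikeOle2(byte[] header) {
        if (header == null || header.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (header[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A caller would read the first 8 bytes of the stream (and push them back, e.g. via a PushbackInputStream) before deciding whether the OLE2-based parser is applicable at all.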

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (TIKA-1092) Parsing of old Word file causes a TikaException

2013-03-14 Thread Giuseppe Totaro (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13602168#comment-13602168
 ] 

Giuseppe Totaro commented on TIKA-1092:
---

Hi Nick,

thanks for your support.
I'll send you the first 384 bytes (saved with a hex editor for Mac) of three old 
.DOC files that cause the TikaException. Unfortunately I can't supply the first 
2 KB, because the confidential text begins at byte 385.
I'll try to send you further information.

Thanks,
Giuseppe



[jira] [Commented] (TIKA-1092) Parsing of old Word file causes a TikaException

2013-03-13 Thread Giuseppe Totaro (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13601195#comment-13601195
 ] 

Giuseppe Totaro commented on TIKA-1092:
---

Hi Nick,

most files were created in 1992 (before the launch of Word 6).

When I try to open these files with my Word version (Office 2007) I receive the 
message:
You are attempting to open a file that was created in an earlier version of 
Microsoft Office. This file type is blocked from opening in this version by 
your registry policy setting.

To open the file I must apply the manual (or fix app) correction to Windows 
registry following the instructions reported in 
http://support.microsoft.com/kb/922849/en-us#fixit4me

After the correction, I'm able to open the file with Word and I see the 
document text correctly. If I try to save the file (over itself), Word asks me 
to select a file type. So I can view the file in Word, but I cannot determine 
the original format version of the document or the application used to create 
it.

I tried to gather more information about these mysterious files, but without 
relevant results. For example, I used the command-line tool "file" under Linux 
and other metadata analyzers (don't worry... Tika remains my favorite parser :)).

Thanks,
Giuseppe 



[jira] [Commented] (TIKA-1092) Parsing of old Word file causes a TikaException

2013-03-12 Thread Giuseppe Totaro (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13600065#comment-13600065
 ] 

Giuseppe Totaro commented on TIKA-1092:
---

Hi Nick,

I agree with your first observation about old Office documents.
I don't think anyone renamed the files. These files were created with an older 
version of Word (I think Microsoft Word 6.0) and were saved with the .doc 
extension.
Unfortunately I can't supply my set of files, because they are classified. I'll 
send you one or more files if I find documents without confidentiality 
restrictions that produce the same exception.

Thanks,
Giuseppe



[jira] [Created] (TIKA-1092) Parsing of old Word file causes a TikaException

2013-03-11 Thread Giuseppe Totaro (JIRA)
Giuseppe Totaro created TIKA-1092:
-

 Summary: Parsing of old Word file causes a TikaException
 Key: TIKA-1092
 URL: https://issues.apache.org/jira/browse/TIKA-1092
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Giuseppe Totaro
Priority: Minor




[jira] [Commented] (TIKA-1081) Error in specification of glob pattern for awk files

2013-02-09 Thread Giuseppe Totaro (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13575242#comment-13575242
 ] 

Giuseppe Totaro commented on TIKA-1081:
---

Thanks Chris :)

 Error in specification of glob pattern for awk files
 

 Key: TIKA-1081
 URL: https://issues.apache.org/jira/browse/TIKA-1081
 Project: Tika
  Issue Type: Bug
  Components: mime
Reporter: Giuseppe Totaro
Assignee: Chris A. Mattmann
Priority: Trivial
  Labels: mime
 Fix For: 1.4


 The tika-mimetypes.xml file contains a spelling error at line 4591:
 {{glob *pattenr*=*.awk/}}
 I'm new to the Apache community, and I hope that my little, perhaps modest, 
 contribution is correct.
 Best regards,
 Giuseppe
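For reference, the corrected entry would presumably read as follows (assuming the standard tika-mimetypes.xml glob syntax, with "pattenr" fixed to "pattern"):

```xml
<glob pattern="*.awk"/>
```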
