Re: [DISCUSS] Enable specific ContentHandler for tika-server
Hi folks,

I am developing the proposed solutions within tika-server for enabling specific ContentHandlers. Basically, I am working to provide the ability to specify the name of the ContentHandler to be used, either on the command line or via an HTTP header. In order to complete my work, I would like to get your feedback on the following aspects:

1. To create and use the given ContentHandler, should I modify each method within the TikaResource class (as well as the other classes within org.apache.tika.server.resource) where the parse method is performed, wrapping the ContentHandler currently used? Alternatively, I could create a new method (and therefore a new REST API) specifically focused on creating a ContentHandler from the list provided by the user. Of course, I am totally open to other solutions.

2. As ContentHandlers often provide different types of constructors, we would need a mechanism to determine via reflection the constructor and the parameters to be used. I think we could load the ContentHandler class by using the static method Class.forName(String className) [0] with the fully-qualified name of the given class, and then use getConstructor(Class... parameterTypes) [1] to determine the constructor to be used and instantiate the ContentHandler.

3. If you agree with the above, I think that we can allow users to provide the parameters according to RFC 822 [3], so that they can give the names of the ContentHandlers to be used as a comma-separated list, with the parameters of each handler as a semicolon-separated list of entries:

   X-Content-Handler: <content-handler> *["," <content-handler>]
   <content-handler> = <handler-name> *[";" <parameter>]

   Consistently, I would enable the same syntax when using the command-line option:

   java -jar tika-server-X.jar -contentHandler <content-handler> *["," <content-handler>]

I look forward to having your feedback.
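The reflection steps in point 2 could be sketched as follows (a hypothetical helper, not Tika code; for simplicity it assumes every handler parameter is passed as a String, whereas real handlers may take other types):

```java
import java.lang.reflect.Constructor;
import java.util.Arrays;
import org.xml.sax.ContentHandler;

// Hypothetical helper: look up a ContentHandler class by its fully-qualified
// name, then pick a matching constructor via reflection.
public class HandlerFactory {

    public static ContentHandler create(String className, String... params)
            throws ReflectiveOperationException {
        Class<?> clazz = Class.forName(className);
        if (!ContentHandler.class.isAssignableFrom(clazz)) {
            throw new IllegalArgumentException(className + " is not a ContentHandler");
        }
        if (params.length == 0) {
            // No parameters given: use the no-arg constructor.
            return (ContentHandler) clazz.getConstructor().newInstance();
        }
        // Otherwise look up a constructor taking one String per parameter
        // (simplifying assumption for this sketch).
        Class<?>[] types = new Class<?>[params.length];
        Arrays.fill(types, String.class);
        Constructor<?> ctor = clazz.getConstructor(types);
        return (ContentHandler) ctor.newInstance((Object[]) params);
    }

    public static void main(String[] args) throws Exception {
        // DefaultHandler is a JDK class that implements ContentHandler.
        ContentHandler h = create("org.xml.sax.helpers.DefaultHandler");
        System.out.println(h.getClass().getName());
    }
}
```

A real implementation would also need to handle handlers whose constructors take non-String types (e.g. an int write limit), which is exactly the ambiguity point 2 asks about.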
Thanks a lot,
Giuseppe

[0] https://docs.oracle.com/javase/8/docs/api/java/lang/Class.html#forName-java.lang.String-
[1] https://docs.oracle.com/javase/8/docs/api/java/lang/Class.html#getConstructor-java.lang.Class...-
[3] https://www.w3.org/Protocols/rfc822/

On Fri, Oct 6, 2017 at 3:06 PM, Sergey Beryozkin <sberyoz...@gmail.com> wrote:
> Konstantin, by the way, if you are interested in having a good discussion
> about using serialized lambdas, then you will be welcome to comment on the
> relevant text in the Tika Concerns Beam thread, though maybe Beam knows
> how to take care of the issues you raised...
>
> Thanks, Sergey
>
> On 06/10/17 18:27, Sergey Beryozkin wrote:
>> On 06/10/17 18:08, Konstantin Gribov wrote:
>>> My +1 to this idea.
>>>
>>> IMHO, the second option is more flexible. I also like Nick's suggestion
>>> about using a default package for handlers and interpreting a
>>> dot-separated string as an FQCN. Solr does a similar thing, and it is
>>> very convenient to use (but they use the prefix `solr.` for their
>>> classes in a predefined package, and any other name is interpreted as
>>> an FQCN).
>>>
>>> I'll add that you could allow the user to pass several comma-separated
>>> handlers, to build a content-handler stack if the user wants to.
>>>
>>> I would disagree with Sergey about serialized lambdas for 2 reasons:
>>> - they are useful only for Java clients;
>>> - they could bring very nasty bugs leading to RCE-class
>>> vulnerabilities, so they are very controversial from a security PoV.
>>
>> Sure. I was not actually suggesting to use them in Tika natively; I only
>> referred to them as the alternative mentioned in the context of the Beam
>> integration work.
>>
>> Sergey
>>
>>> On Thu, Sep 28, 2017 at 11:35 PM Giuseppe Totaro <totarope...@gmail.com>
>>> wrote:
>>>
>>>> Hi folks,
>>>>
>>>> if I am not wrong, currently you cannot configure a specific
>>>> ContentHandler while using tika-server.
>>>> I mean that you can configure your own parser [0], but you cannot
>>>> control which ContentHandler the parser leverages to extract text and
>>>> metadata (e.g., you cannot use PhoneExtractingContentHandler,
>>>> StandardsExtractingContentHandler, etc).
>>>> If that is correct, it would be nice to enable the use of specific
>>>> ContentHandlers within tika-server, and I would like to discuss how to
>>>> solve this issue generally.
>>>>
>>>> I propose two solutions:
>>>>
>>>> 1. augment the TikaConfig class so that a specific ContentHandler can
>>>> be used in tika-config.xml;
>>>> 2. determine the ContentHandler to use for parsing through HTTP
>>>> headers, for example:
Re: [DISCUSS] Enable specific ContentHandler for tika-server
Hi folks,

first of all, I want to express my gratitude for your feedback and insightful suggestions. To sum up, I would like to quickly discuss the following aspects:

- As you all mentioned, HTTP headers for configuring the ContentHandler to be used are better suited for the dynamic cases. Specifically, a ContentHandler can be given through an ad-hoc header, e.g. -H "X-Content-Handler: StandardsExtractingContentHandler", then parsed and used at run time within tika-server.
- Nick, I believe that providing the ability to choose the ContentHandler through a command-line option is a great idea. It could also be more convenient for users.

Please let me implement both solutions and provide an example in the next few days that we can discuss.

Thanks again for your kind availability,
Giuseppe

On Thu, Sep 28, 2017 at 10:08 PM, Nick Burch <apa...@gagravarr.org> wrote:
> On Thu, 28 Sep 2017, Giuseppe Totaro wrote:
>> if I am not wrong, currently you cannot configure a specific
>> ContentHandler while using tika-server. I mean that you can configure
>> your own parser [0], but you cannot control which ContentHandler the
>> parser leverages to extract text and metadata (e.g., you cannot use
>> PhoneExtractingContentHandler, StandardsExtractingContentHandler, etc).
>
> I think the long-term plan was to work out a viable way of layering
> multiple parsers on top of each other, then change some of these to be
> "enhancing parsers" on top. However, that's still on the "TODO" list for
> Tika 2.0, as we have yet to come up with a good way to allow it to happen
> within the SAX / ContentHandler structure.
>
>> I propose two solutions:
>>
>> 1. augment the TikaConfig class so that a specific ContentHandler can be
>> used in tika-config.xml;
>
> That feels a bit wrong to me, because in almost all Tika use-cases, the
> value from the config would be ignored.
> Trying to explain to a new user in which cases it would be used, and in
> which ones it would be ignored, seems hard and confusing too...
>
>> 2. determine the ContentHandler to use for parsing through HTTP headers,
>> for example:
>
> We do allow setting of parser config via headers, so this would have
> precedent. It would also allow per-request changes.
>
> Otherwise, if server-wide is OK (which your config idea would require
> anyway), might it not be better to make it an option when you start the
> server? I see it as being a bit more like picking a port, in terms of
> something specific to how you run that server instance:
>
> eg java -jar tika-server.jar --port 1234 --content-handler PhoneExtractingContentHandler
> eg java -jar tika-server.jar --port 1234 --content-handler com.example.CustomHandler
>
> Nick
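Nick's examples accept both a bare handler name and a fully-qualified one. The Solr-like resolution convention discussed in this thread could look roughly like the sketch below; the default package is an illustrative assumption, not a decision made on the list:

```java
// Sketch of the proposed name resolution: a dot-separated name is taken as a
// fully-qualified class name, a bare name is resolved against a default
// handler package (assumed here to be org.apache.tika.sax for illustration).
public class HandlerNames {

    static final String DEFAULT_PACKAGE = "org.apache.tika.sax";

    public static String resolve(String name) {
        // Names containing a dot are treated as already fully qualified.
        return name.indexOf('.') >= 0 ? name : DEFAULT_PACKAGE + "." + name;
    }

    public static void main(String[] args) {
        System.out.println(resolve("PhoneExtractingContentHandler"));
        System.out.println(resolve("com.example.CustomHandler"));
    }
}
```

This mirrors Solr's behavior mentioned by Konstantin: a short prefixed name maps into a predefined package, anything else is interpreted as an FQCN.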
[DISCUSS] Enable specific ContentHandler for tika-server
Hi folks,

if I am not wrong, currently you cannot configure a specific ContentHandler while using tika-server. I mean that you can configure your own parser [0], but you cannot control which ContentHandler the parser leverages to extract text and metadata (e.g., you cannot use PhoneExtractingContentHandler, StandardsExtractingContentHandler, etc). If that is correct, it would be nice to enable the use of specific ContentHandlers within tika-server, and I would like to discuss how to solve this issue generally.

I propose two solutions:

1. augment the TikaConfig class so that a specific ContentHandler can be used in tika-config.xml;
2. determine the ContentHandler to use for parsing through HTTP headers, for example:

   curl -T filename.pdf http://localhost:9998/meta --header "X-Content-Handler: PhoneExtractingContentHandler"

This would also affect the TikaResource.java class.

I look forward to having your feedback. I strongly believe that every user who wants to use Tika as a service through tika-server and needs to extract content and metadata like phone numbers, standard references, etc. would be very happy.

Thanks a lot,
Giuseppe
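The header-based option could be interpreted along these lines (a minimal sketch, not the actual TikaResource implementation; it only shows how the proposed X-Content-Handler value might be split into handler names):

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of parsing the proposed X-Content-Handler header value:
// a comma-separated list of handler names, in order.
public class ContentHandlerHeader {

    public static final String NAME = "X-Content-Handler";

    // Returns the handler names given in the header value, or an empty list
    // when the header is absent, meaning "use the default handler".
    public static List<String> parse(String headerValue) {
        List<String> names = new ArrayList<>();
        if (headerValue == null) {
            return names;
        }
        for (String part : headerValue.split(",")) {
            String name = part.trim();
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }
}
```

The resulting names would then need to be resolved to classes and instantiated before being handed to the parser for that request.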
[jira] [Resolved] (TIKA-2449) Enabling extraction of standard references from text
[ https://issues.apache.org/jira/browse/TIKA-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Giuseppe Totaro resolved TIKA-2449.
---
    Resolution: Fixed
    Fix Version/s: 1.17

> Enabling extraction of standard references from text
>
> Key: TIKA-2449
> URL: https://issues.apache.org/jira/browse/TIKA-2449
> Project: Tika
> Issue Type: Improvement
> Components: handler
> Reporter: Giuseppe Totaro
> Assignee: Giuseppe Totaro
> Labels: handler
> Fix For: 1.17
>
> Attachments: flowchart_standards_extraction.png,
> flowchart_standards_extraction_v02.png, SOW-TacCOM.pdf,
> standards_extraction.patch
>
> Apache Tika currently provides many _ContentHandlers_ that help to extract
> specific information from text. For instance, the
> {{PhoneExtractingContentHandler}} is used to extract phone numbers while
> parsing.
> This improvement adds the *{{StandardsExtractingContentHandler}}* to Tika,
> a new ContentHandler that relies on regular expressions in order to
> identify and extract standard references from text.
> Basically, a standard reference is just a reference to a
> norm/convention/requirement (i.e., a standard) released by a standard
> organization. This work is mainly focused on identifying and extracting
> the references to the standards already cited within a given document
> (e.g., SOW/PWS), so that the references can be stored and provided to the
> user as additional metadata when the
> {{StandardsExtractingContentHandler}} is used.
> In addition to the patch, the first version of the
> {{StandardsExtractingContentHandler}}, along with an example class to
> easily execute the handler, is available on
> [GitHub|https://github.com/giuseppetotaro/StandardsExtractingContentHandler].
> The following sections describe in more detail how the
> {{StandardsExtractingContentHandler}} has been developed.
>
> h1.
Background
> From a technical perspective, a standard reference is a string that is
> usually composed of two parts:
> # the name of the standard organization;
> # the alphanumeric identifier of the standard within the organization.
> Specifically, the first part can include the acronym or the full name of
> the standard organization (or both), and the second part is an
> alphanumeric string, possibly containing one or more separation symbols
> (e.g., "-", "_", ".") depending on the format adopted by the organization,
> representing the identifier of the standard within the organization.
> Furthermore, standard references are usually reported within the
> "Applicable Documents" or "References" section of a SOW, and they can also
> be cited within sections whose header includes the word "standard",
> "requirement", "guideline", or "compliance".
> Consequently, the citation of standard references within a SOW/PWS
> document can be summarized by the following rules:
> * *RULE #1*: standard references are usually reported within the section
> named "Applicable Documents" or "References".
> * *RULE #2*: standard references can also be cited within sections whose
> name includes the word "compliance" or another semantically equivalent
> word.
> * *RULE #3*: a standard reference is composed of two parts:
> ** the name of the standard organization (acronym, full name, or both);
> ** the alphanumeric identifier of the standard within the organization.
> * *RULE #4*: the name of the standard organization includes the acronym or
> the full name or both. The name must belong to the set of standard
> organizations {{S = O U V}}, where {{O}} represents the set of open
> standard organizations (e.g., ANSI) and {{V}} represents the set of
> vendor-specific standard organizations (e.g., Motorola).
> * *RULE #5*: a separation symbol (e.g., "-", "_", "." or whitespace) can
> be used between the name of the standard organization and the alphanumeric
> identifier.
> * *RULE #6*: the alphanumeric identifier of the standard is composed of
> alphabetic and numeric characters, possibly split into two or more parts
> by a separation symbol (e.g., "-", "_", ".").
> On the basis of the above rules, here are some examples of formats used
> for reporting standard references within a SOW/PWS:
> * {{<organization> <identifier>}}
> * {{<acronym> <identifier> (<full name>)}}
> * {{<full name> <identifier> (<acronym>)}}
> Moreover, some standards are sometimes released by two standard
> organizations. In this case, the standard reference can be reported as
> follows:
> * {{<organization>/<organization> <identifier>}}
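As a toy illustration of how regular expressions can capture RULES #3 to #6, the sketch below matches references built from a small, made-up sample of organization acronyms; it is not the actual pattern used by StandardsExtractingContentHandler:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy demo of the rules above: an organization acronym (optionally two,
// joined by "/" per the dual-organization case), an optional separator, then
// a numeric identifier that may itself contain separators. The organization
// list here is a made-up sample, not Tika's.
public class StandardRefDemo {

    static final Pattern REF = Pattern.compile(
            "\\b(ISO|IEC|ANSI|NIST)(?:/(ISO|IEC|ANSI|NIST))?[-_ .]?(\\d+(?:[-_.]\\d+)*)\\b");

    public static List<String> extract(String text) {
        List<String> refs = new ArrayList<>();
        Matcher m = REF.matcher(text);
        while (m.find()) {
            refs.add(m.group());
        }
        return refs;
    }
}
```

For example, extract("See NIST 800-53 and ISO/IEC 27001 for details.") yields both references; the real handler additionally scores matches against the sections in which they appear (RULES #1 and #2).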
[jira] [Assigned] (TIKA-2449) Enabling extraction of standard references from text
[ https://issues.apache.org/jira/browse/TIKA-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Giuseppe Totaro reassigned TIKA-2449:
---
    Assignee: Giuseppe Totaro
[jira] [Updated] (TIKA-2449) Enabling extraction of standard references from text
[ https://issues.apache.org/jira/browse/TIKA-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Giuseppe Totaro updated TIKA-2449:
---
    External issue URL: https://github.com/apache/tika/pull/204
    (was: https://github.com/giuseppetotaro/StandardsExtractingContentHandler)
[jira] [Updated] (TIKA-2449) Enabling extraction of standard references from text
[ https://issues.apache.org/jira/browse/TIKA-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Giuseppe Totaro updated TIKA-2449:
---
    Attachment: flowchart_standards_extraction_v02.png
[jira] [Updated] (TIKA-2449) Enabling extraction of standard references from text
[ https://issues.apache.org/jira/browse/TIKA-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Giuseppe Totaro updated TIKA-2449:
---
    Attachment: (was: flowchart_standards_extraction_v02.png)
[jira] [Updated] (TIKA-2449) Enabling extraction of standard references from text

[ https://issues.apache.org/jira/browse/TIKA-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Giuseppe Totaro updated TIKA-2449:
----------------------------------
    Description:

>                 Key: TIKA-2449
>                 URL: https://issues.apache.org/jira/browse/TIKA-2449
>             Project: Tika
>          Issue Type: Improvement
>          Components: handler
>            Reporter: Giuseppe Totaro
>              Labels: handler
>         Attachments: flowchart_standards_extraction.png, flowchart_standards_extraction_v02.png, SOW-TacCOM.pdf, standards_extraction.patch

Apache Tika currently provides many _ContentHandlers_ that help to extract specific information from text. For instance, the {{PhoneExtractingContentHandler}} is used to extract phone numbers while parsing.

This improvement adds the *{{StandardsExtractingContentHandler}}* to Tika, a new ContentHandler that relies on regular expressions in order to identify and extract standard references from text. Basically, a standard reference is just a reference to a norm/convention/requirement (i.e., a standard) released by a standard organization. This work is mainly focused on identifying and extracting the references to the standards already cited within a given document (e.g., SOW/PWS), so that the references can be stored and provided to the user as additional metadata when the {{StandardsExtractingContentHandler}} is used.

In addition to the patch, the first version of the {{StandardsExtractingContentHandler}}, along with an example class to easily execute the handler, is available on [GitHub|https://github.com/giuseppetotaro/StandardsExtractingContentHandler].

The following sections describe in more detail how the {{StandardsExtractingContentHandler}} has been developed.

h1. Background

From a technical perspective, a standard reference is a string that is usually composed of two parts:
# the name of the standard organization;
# the alphanumeric identifier of the standard within the organization.

Specifically, the first part can include the acronym or the full name of the standard organization (or even both), and the second part is an alphanumeric string, possibly containing one or more separation symbols (e.g., "-", "_", ".") depending on the format adopted by the organization, that identifies the standard within the organization.

Furthermore, standard references are usually reported within the "Applicable Documents" or "References" section of a SOW, and they can also be cited within sections whose header includes the word "standard", "requirement", "guideline", or "compliance".

Consequently, the citation of standard references within a SOW/PWS document can be summarized by the following rules:
* *RULE #1*: Standard references are usually reported within the section named "Applicable Documents" or "References".
* *RULE #2*: Standard references can also be cited within sections including the word "compliance" or another semantically-equivalent word in their name.
* *RULE #3*: A standard reference is composed of two parts:
** the name of the standard organization (acronym, full name, or both);
** the alphanumeric identifier of the standard within the organization.
* *RULE #4*: The name of the standard organization includes the acronym or the full name or both. The name must belong to the set of standard organizations {{S = O U V}}, where {{O}} represents the set of open standard organizations (e.g., ANSI) and {{V}} represents the set of vendor-specific standard organizations (e.g., Motorola).
* *RULE #5*: A separation symbol (e.g., "-", "_", "." or whitespace) can be used between the name of the standard organization and the alphanumeric identifier.
* *RULE #6*: The alphanumeric identifier of the standard is composed of alphabetic and numeric characters, possibly split into two or more parts by a separation symbol (e.g., "-", "_", ".").

On the basis of the above rules, here are some examples of formats used for reporting standard references within a SOW/PWS:
* {{}}
* {{()}}
* {{()}}

Moreover, some standards are sometimes released by two standard organizations. In this case, the standard reference can be reported as follows:
* {{/}}

h1. Regular Expressions

The {{StandardsExtractingContentHandler}} uses a helper class named {{StandardsText}} that relies on Java regular expressions and provides some methods to identify headers and standard references, and to determine the score of the references found within the given text. Here are the main regular expressions used within the {{StandardsText}} class:
* *REGEX_HEADER*: regular expression to match only uppercase headers.
{code}
(\d+\.(\d+\.?)*)\p{Blank}+([A-Z]+(\s[A-Z]+)*){5,}
{code}
* *REGEX_APPLICABLE_DOCUMENTS*: regular expression to match the header of "APPLICABLE DOCUMENTS" and equivalent sections.
{code}
(?i:.*APPLICABLE\sDOCUMENTS|REFERENCE|STANDARD|REQUIREMENT|GUIDELINE|COMPLIANCE.*)
{code}
* *REGEX_FALLBACK*: regular expression to match a string that is supposed to be a standard reference.
{code}
\(?(?[A-Z]\w+)\)?((\s?(?\/)\s?)(\w+\s)*\(?(?[A-Z]\w+)\)?)?(\s(Publication|Standard))?(-|\s)?(?
{code}
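As a quick sanity check, REGEX_HEADER can be exercised directly with {{java.util.regex}}. This is a minimal sketch, not code from the patch; the sample headings are invented for illustration:

```java
import java.util.regex.Pattern;

public class HeaderRegexDemo {
    // REGEX_HEADER from StandardsText: a numbered section prefix (e.g. "2.1")
    // followed by an all-uppercase title. The {5,} quantifier effectively
    // requires at least five uppercase letters in the title.
    static final Pattern REGEX_HEADER =
            Pattern.compile("(\\d+\\.(\\d+\\.?)*)\\p{Blank}+([A-Z]+(\\s[A-Z]+)*){5,}");

    public static void main(String[] args) {
        String uppercaseHeader = "2.1 APPLICABLE DOCUMENTS"; // sample heading (invented)
        String mixedCaseLine   = "2.1 Applicable documents"; // lowercase title: no match

        System.out.println(REGEX_HEADER.matcher(uppercaseHeader).find()); // true
        System.out.println(REGEX_HEADER.matcher(mixedCaseLine).find());   // false
    }
}
```

Note that {{\p{Blank}}} is the POSIX character class for space and tab, so a header split across a line break would not match this pattern.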
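REGEX_APPLICABLE_DOCUMENTS can likewise be checked in isolation. In the sketch below (header strings invented for illustration), note that as the pattern is written, the leading and trailing {{.*}} bind only to the first and last alternatives due to alternation precedence, so when used with {{find()}} the middle keywords match as case-insensitive substrings of a header line:

```java
import java.util.regex.Pattern;

public class SectionRegexDemo {
    // REGEX_APPLICABLE_DOCUMENTS from StandardsText. The (?i:...) group makes
    // the whole alternation case-insensitive.
    static final Pattern REGEX_APPLICABLE_DOCUMENTS = Pattern.compile(
            "(?i:.*APPLICABLE\\sDOCUMENTS|REFERENCE|STANDARD|REQUIREMENT|GUIDELINE|COMPLIANCE.*)");

    public static void main(String[] args) {
        // Sample header lines (invented for illustration).
        System.out.println(REGEX_APPLICABLE_DOCUMENTS.matcher("2. Applicable Documents").find());  // true
        System.out.println(REGEX_APPLICABLE_DOCUMENTS.matcher("4.2 REFERENCED STANDARDS").find()); // true ("REFERENCE" substring)
        System.out.println(REGEX_APPLICABLE_DOCUMENTS.matcher("1. Introduction").find());          // false
    }
}
```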
[jira] [Updated] (TIKA-2449) Enabling extraction of standard references from text
[ https://issues.apache.org/jira/browse/TIKA-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Giuseppe Totaro updated TIKA-2449: -- Attachment: standards_extraction.patch flowchart_standards_extraction.png SOW-TacCOM.pdf > Enabling extraction of standard references from text > > > Key: TIKA-2449 > URL: https://issues.apache.org/jira/browse/TIKA-2449 > Project: Tika > Issue Type: Improvement > Components: handler > Reporter: Giuseppe Totaro > Labels: handler > Attachments: flowchart_standards_extraction.png, SOW-TacCOM.pdf, > standards_extraction.patch > > > Apache Tika currently provides many _ContentHandler_ which help to > de-obfuscate some information from text. For instance, the > {{PhoneExtractingContentHandler}} is used to extract phone numbers while > parsing. > This improvement adds the *{{StandardsExtractingContentHandler}}* to Tika, a > new ContentHandler that relies on regular expressions in order to identify > and extract standard references from text. > Basically, a standard reference is just a reference to a > norm/convention/requirement (i.e., a standard) released by a standard > organization. This work is maily focused on identifying and extracting the > references to the standards already cited within a given document (e.g., > SOW/PWS) so the references can be stored and provided to the user as > additional metadata in case the StandardExtractingContentHandler is used. > In addition to the patch, the first version of the > {{StandardsExtractingContentHandler}} along with an example class to easily > execute the handler is available on > [GitHub|https://github.com/giuseppetotaro/StandardsExtractingContentHandler]. > The following sections provide more in detail how the > {{StandardsExtractingHandler}} has been developed. > h1. 
Background > From a technical perspective, a standard reference is a string that is > usually composed of two parts: > # the name of the standard organization; > # the alphanumeric identifier of the standard within the organization. > Specifically, the first part can include the acronym or the full name of the > standard organization or even both, and the second part can include an > alphanumeric string, possibly containing one or more separation symbols > (e.g., "-", "_", ".") depending on the format adopted by the organization, > representing the identifier of the standard within the organization. > Furthermore, standard references are usually reported within the > "Applicable Documents" or "References" section of a SOW, and they can also be > cited within sections whose header includes the word "standard", > "requirement", "guideline", or "compliance". > Consequently, the citation of standard references within a SOW/PWS document > can be summarized by the following rules: > * *RULE #1*: standard references are usually reported within the section > named "Applicable Documents" or "References". > * *RULE #2*: standard references can also be cited within sections including > the word "compliance" or another semantically-equivalent word in their name. > * *RULE #3*: a standard reference is composed of two parts: > ** Name of the standard organization (acronym, full name, or both). > ** Alphanumeric identifier of the standard within the organization. > * *RULE #4*: The name of the standard organization includes the acronym or > the full name or both. The name must belong to the set of standard > organizations S = O U V, where O represents the set of open standard > organizations (e.g., ANSI) and V represents the set of vendor-specific > standard organizations (e.g., Motorola). > * *RULE #5*: A separation symbol (e.g., "-", "_", "." or whitespace) can be > used between the name of the standard organization and the alphanumeric > identifier. 
> * *RULE #6*: The alphanumeric identifier of the standard is composed of > alphabetic and numeric characters, possibly split in two or more parts by a > separation symbol (e.g., "-", "_", "."). > On the basis of the above rules, here are some examples of formats used for > reporting standard references within a SOW/PWS: > * {{}} > * > {{()}} > * > {{()}} > Moreover, some standards are sometimes released by two standard > organizations. In this case, the standard reference can be reported as > follows: > * > {{/}} > h1. R
[jira] [Created] (TIKA-2449) Enabling extraction of standard references from text
Giuseppe Totaro created TIKA-2449: - Summary: Enabling extraction of standard references from text Key: TIKA-2449 URL: https://issues.apache.org/jira/browse/TIKA-2449 Project: Tika Issue Type: Improvement Components: handler Reporter: Giuseppe Totaro Apache Tika currently provides many _ContentHandler_ implementations that help to extract specific information from text. For instance, the {{PhoneExtractingContentHandler}} is used to extract phone numbers while parsing. This improvement adds the *{{StandardsExtractingContentHandler}}* to Tika, a new ContentHandler that relies on regular expressions in order to identify and extract standard references from text. Basically, a standard reference is just a reference to a norm/convention/requirement (i.e., a standard) released by a standard organization. This work is mainly focused on identifying and extracting the references to the standards already cited within a given document (e.g., SOW/PWS) so the references can be stored and provided to the user as additional metadata in case the {{StandardsExtractingContentHandler}} is used. In addition to the patch, the first version of the {{StandardsExtractingContentHandler}} along with an example class to easily execute the handler is available on [GitHub|https://github.com/giuseppetotaro/StandardsExtractingContentHandler]. The following sections describe in more detail how the {{StandardsExtractingContentHandler}} has been developed. h1. Background From a technical perspective, a standard reference is a string that is usually composed of two parts: # the name of the standard organization; # the alphanumeric identifier of the standard within the organization. 
Specifically, the first part can include the acronym or the full name of the standard organization or even both, and the second part can include an alphanumeric string, possibly containing one or more separation symbols (e.g., "-", "_", ".") depending on the format adopted by the organization, representing the identifier of the standard within the organization. Furthermore, standard references are usually reported within the "Applicable Documents" or "References" section of a SOW, and they can also be cited within sections whose header includes the word "standard", "requirement", "guideline", or "compliance". Consequently, the citation of standard references within a SOW/PWS document can be summarized by the following rules: * *RULE #1*: standard references are usually reported within the section named "Applicable Documents" or "References". * *RULE #2*: standard references can also be cited within sections including the word "compliance" or another semantically-equivalent word in their name. * *RULE #3*: a standard reference is composed of two parts: ** Name of the standard organization (acronym, full name, or both). ** Alphanumeric identifier of the standard within the organization. * *RULE #4*: The name of the standard organization includes the acronym or the full name or both. The name must belong to the set of standard organizations S = O U V, where O represents the set of open standard organizations (e.g., ANSI) and V represents the set of vendor-specific standard organizations (e.g., Motorola). * *RULE #5*: A separation symbol (e.g., "-", "_", "." or whitespace) can be used between the name of the standard organization and the alphanumeric identifier. * *RULE #6*: The alphanumeric identifier of the standard is composed of alphabetic and numeric characters, possibly split into two or more parts by a separation symbol (e.g., "-", "_", "."). 
On the basis of the above rules, here are some examples of formats used for reporting standard references within a SOW/PWS: * {{}} * {{()}} * {{()}} Moreover, some standards are sometimes released by two standard organizations. In this case, the standard reference can be reported as follows: * {{/}} h1. Regular Expressions The {{StandardsExtractingContentHandler}} uses a helper class named {{StandardsText}} that relies on Java regular expressions and provides some methods to identify headers and standard references, and to determine the score of the references found within the given text. Here are the main regular expressions used within the {{StandardsText}} class: * *REGEX_HEADER*: regular expression to match only uppercase headers. {code} (\d+\.(\d+\.?)*)\p{Blank}+([A-Z]+(\s[A-Z]+)*){5,} {code} * *REGEX_APPLICABLE_DOCUMENTS*: regular expression to match the header of "APPLICABLE DOCUMENTS" and equivalent sections. {code} (?i:.*APPLICABLE\sDOCUMENTS|REFERENCE|STANDARD|REQUIREMENT|GUIDELINE|COMPLIANCE.*) {code} * *REGEX_FALLBACK*: regular expression to ma
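For illustration, the two header patterns quoted above can be exercised directly with {{java.util.regex}}. The constant names below simply mirror those of the {{StandardsText}} helper, and the sample header strings are invented for the demonstration:

```java
import java.util.regex.Pattern;

// Illustrative sketch: how REGEX_HEADER and REGEX_APPLICABLE_DOCUMENTS behave
// on a typical SOW section header. Constant names mirror the StandardsText
// helper described above; the sample strings are invented.
public class StandardsRegexDemo {

    // Matches numbered, all-uppercase section headers such as "2. APPLICABLE DOCUMENTS".
    static final Pattern REGEX_HEADER =
            Pattern.compile("(\\d+\\.(\\d+\\.?)*)\\p{Blank}+([A-Z]+(\\s[A-Z]+)*){5,}");

    // Matches headers of "APPLICABLE DOCUMENTS" and semantically equivalent sections.
    static final Pattern REGEX_APPLICABLE_DOCUMENTS =
            Pattern.compile("(?i:.*APPLICABLE\\sDOCUMENTS|REFERENCE|STANDARD|REQUIREMENT|GUIDELINE|COMPLIANCE.*)");

    public static void main(String[] args) {
        String header = "2. APPLICABLE DOCUMENTS";
        System.out.println(REGEX_HEADER.matcher(header).find());                // true
        System.out.println(REGEX_APPLICABLE_DOCUMENTS.matcher(header).find());  // true
        // A numbered but mixed-case header is rejected by the uppercase rule.
        System.out.println(REGEX_HEADER.matcher("2.1 Scope of work").find());   // false
    }
}
```

Note that the {{{5,}}} quantifier effectively requires a run of at least five uppercase letters after the section number, which is what filters out short or mixed-case headings.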
[jira] [Commented] (TIKA-1739) cTAKESParser doesn't work in 1.11
[ https://issues.apache.org/jira/browse/TIKA-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14904761#comment-14904761 ] Giuseppe Totaro commented on TIKA-1739: --- Great suggestion [~gagravarr]. Thanks [~chrismattmann] for updating the documentation. Giuseppe > cTAKESParser doesn't work in 1.11 > - > > Key: TIKA-1739 > URL: https://issues.apache.org/jira/browse/TIKA-1739 > Project: Tika > Issue Type: Bug > Components: parser, server >Reporter: Chris A. Mattmann >Assignee: Chris A. Mattmann > Fix For: 1.11 > > Attachments: TIKA-1739.patch > > > Tika cTAKESParser integration doesn't work in 1.11. The parser is called, but > blank metadata comes back: > {noformat} > curl -T test.txt -H "Content-Type: text/plain" > http://localhost:/rmeta/text > [{"Content-Type":"text/plain","X-Parsed-By":["org.apache.tika.parser.CompositeParser","org.apache.tika.parser.ctakes.CTAKESParser","org.apache.tika.parser.EmptyParser"],"X-TIKA:parse_time_millis":"20371","ctakes:schema":"coveredText:start:end:ontologyConceptArr"} > {noformat} > [~gagravarr] I wonder if something that happened in TIKA-1653 broke it? > http://svn.apache.org/viewvc?view=revision=1684199 > [~gostep] can you help me look here? > I'm working on > https://github.com/chrismattmann/shangridocs/tree/convert-wicket which is > where I first saw this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1739) cTAKESParser doesn't work in 1.11
[ https://issues.apache.org/jira/browse/TIKA-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903123#comment-14903123 ] Giuseppe Totaro commented on TIKA-1739: --- Hi [~chrismattmann], Hi [~gagravarr], I looked at the latest code of {{CTAKESParser.java}} and I did some experiments on my laptop. Basically, the problem is due to the default constructor of {{CTAKESParser.java}}: {code:java}
/**
 * Wraps the default Parser
 */
public CTAKESParser() {
    this(TikaConfig.getDefaultConfig());
}
{code} To use CTAKESParser, we need to create a specific configuration for CTAKESParser (unless we aim at using the parser programmatically), as reported in the [ctakesparser-utils|https://github.com/chrismattmann/ctakesparser-utils] repository. While parsing, the default constructor of CTAKESParser is used by Tika, overriding the given configuration at runtime. Therefore, CTAKESParser is only "visited" by Tika, which will instead use the EmptyParser as fallback. For instance, if we use the previous default constructor again (which does not override the given configuration), then we can use cTAKES properly and obtain the right metadata: {code:java}
public CTAKESParser() {
    super(new AutoDetectParser());
}
{code} [~chrismattmann] and [~gagravarr], I will be really glad to hear your feedback. Thanks a lot, Giuseppe > cTAKESParser doesn't work in 1.11 > - > > Key: TIKA-1739 > URL: https://issues.apache.org/jira/browse/TIKA-1739 > Project: Tika > Issue Type: Bug > Components: parser, server >Reporter: Chris A. Mattmann >Assignee: Chris A. Mattmann > Fix For: 1.11 > > > Tika cTAKESParser integration doesn't work in 1.11. 
The parser is called, but > blank metadata comes back: > {noformat} > curl -T test.txt -H "Content-Type: text/plain" > http://localhost:/rmeta/text > [{"Content-Type":"text/plain","X-Parsed-By":["org.apache.tika.parser.CompositeParser","org.apache.tika.parser.ctakes.CTAKESParser","org.apache.tika.parser.EmptyParser"],"X-TIKA:parse_time_millis":"20371","ctakes:schema":"coveredText:start:end:ontologyConceptArr"} > {noformat} > [~gagravarr] I wonder if something that happened in TIKA-1653 broke it? > http://svn.apache.org/viewvc?view=revision=1684199 > [~gostep] can you help me look here? > I'm working on > https://github.com/chrismattmann/shangridocs/tree/convert-wicket which is > where I first saw this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
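The constructor issue discussed in this thread can be modeled with a tiny stand-in sketch (plain Java, no Tika or cTAKES dependency): a decorator delegates to whichever parser its constructor wraps, so wrapping the empty fallback yields blank output while wrapping a working parser does not. All class names below are invented for illustration and are not Tika's actual API:

```java
// Minimal stand-in model of the decorator behavior discussed above:
// whatever parser the constructor passes up the chain is the one that runs.
interface Parser {
    String parse(String input);
}

// Stand-in for EmptyParser: the fallback that produced blank metadata.
class EmptyParser implements Parser {
    public String parse(String input) { return ""; }
}

// Stand-in for a correctly configured AutoDetectParser.
class WorkingParser implements Parser {
    public String parse(String input) { return "text:" + input; }
}

// Stand-in for ParserDecorator/CTAKESParser: delegates to the wrapped parser.
class DecoratingParser implements Parser {
    private final Parser wrapped;
    DecoratingParser(Parser wrapped) { this.wrapped = wrapped; }
    public String parse(String input) { return wrapped.parse(input); }
}

public class DecoratorDemo {
    public static void main(String[] args) {
        // Wrapping the empty fallback: blank output, as in the bug report.
        System.out.println(new DecoratingParser(new EmptyParser()).parse("doc"));   // prints an empty line
        // Wrapping a working parser, as in the proposed constructor: real output.
        System.out.println(new DecoratingParser(new WorkingParser()).parse("doc")); // prints "text:doc"
    }
}
```

The sketch only illustrates why the choice made inside the default constructor determines the end-to-end result, which is the crux of the proposed one-line fix.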
[jira] [Commented] (TIKA-1691) Apache Tika for enabling metadata interoperability
[ https://issues.apache.org/jira/browse/TIKA-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14643066#comment-14643066 ] Giuseppe Totaro commented on TIKA-1691: --- Hi [~gagravarr], Hi [~chrismattmann], did you have any chance to read my last comment? Thanks, Giuseppe Apache Tika for enabling metadata interoperability -- Key: TIKA-1691 URL: https://issues.apache.org/jira/browse/TIKA-1691 Project: Tika Issue Type: New Feature Reporter: Giuseppe Totaro Assignee: Giuseppe Totaro Labels: mapping, metadata Attachments: mapping_example.pdf If I am not wrong, enabling consistent metadata across file formats is already (partially) provided in Tika by relying on {{TikaCoreProperties}} and, within the context of Solr, {{ExtractingRequestHandler}} (by defining how to map metadata fields in {{solrconfig.xml}}). However, I am working on a new component for both schema mapping (to operate on the name of metadata properties) and instance transformation (to operate on the value of metadata) that consists, essentially, of the following changes: * A wrapper of the {{Metadata}} object ({{MappedMetadata.java}}) that decorates the {{set}} method (currently, line number 367 of {{Metadata.java}}) by applying the given mapping functions (via configuration) before setting metadata properties. * Basic mapping functions ({{BasicMappingUtils.java}}) that are utility methods to map a set of metadata to the target schema. * A new {{MetadataConfig}} object that, like {{TikaConfig}}, may be configured via an XML file (organized as shown in the following snippet) and allows performing a fine-grained metadata mapping by using Java reflection. {code:xml|title=tika-metadata.xml|borderStyle=solid} <?xml version="1.0" encoding="UTF-8" standalone="no"?>
<properties>
  <mappings>
    <mapping type="type/sub-type">
      <relation name="SOURCE_FIELD">
        <target>TARGET_FIELD</target>
        <expression>exclude|include|equivalent|overlap</expression>
        <function name="FUNCTION_NAME">
          <argument>ARGUMENT_VALUE</argument>
        </function>
        <cardinality>
          <source>SOURCE_CARDINALITY</source>
          <target>TARGET_CARDINALITY</target>
          <order>ORDER_NUMBER</order>
          <dependencies>
            <field>FIELD_NAME</field>
          </dependencies>
        </cardinality>
      </relation>
    </mapping>
    ...
    <mapping>
      <!-- This contains the fallback strategy for unknown metadata -->
      <relation>
        ...
      </relation>
    </mapping>
  </mappings>
</properties>
{code} The theoretical definition of metadata mapping is available in [A survey of techniques for achieving metadata interoperability|http://www.researchgate.net/profile/Bernhard_Haslhofer/publication/220566013_A_survey_of_techniques_for_achieving_metadata_interoperability/links/02e7e533e76187c0b800.pdf]. This paper also shows some basic examples of metadata mappings. Currently, I am still working on some core functionalities, but I have already performed some experiments by using a small prototype. By the way, I think that we should modify the method {{add}} in order to use {{set}} instead of {{metadata.put}} (currently, line number 316 of {{Metadata.java}}). This is a trivial change (I could create a new Jira issue about that), but it would be consistent with the other implementation of the {{add}} method and, moreover, the methods of {{Metadata}} could be extended more easily. I would really appreciate your feedback about this proposal. If you believe that it is a good idea, I could provide the code in a few days. Thanks a lot, Giuseppe -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1691) Apache Tika for enabling metadata interoperability
[ https://issues.apache.org/jira/browse/TIKA-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14637930#comment-14637930 ] Giuseppe Totaro commented on TIKA-1691: --- Hello [~gagravarr], your feedback is very much appreciated. I believe that providing metadata mapping on the getter side is a great idea. However, I will try to clarify my proposal below by reporting two (high-level) use cases. As use case, we can consider the following: We want to index both textual content and metadata from a heterogeneous set of digital documents, providing uniform access to the metadata properties extracted from files. Therefore, we want to allow users to submit search queries by using an end-user specific mediated schema. We can summarize the use case above as follows: # collect a huge amount of heterogeneous files (e.g., PDF, DOC, JPG, PPT, TXT, etc); # extract both text and metadata from files by using Tika; # map all metadata properties to a mediated schema that will be used for searching purposes; # create an inverted index from the extracted contents; # use the index in order to perform search queries based on metadata values. Another use case is the following: We want to compute some similarity metrics based on metadata features. To perform similarity, we need to provide the semantic correspondences among different metadata schemes. We can summarize the use case above as follows: # collect a huge amount of heterogeneous files (e.g., PDF, DOC, JPG, PPT, TXT, etc); # extract both text and metadata from files by using Tika; # map all metadata properties to a mediated schema that will be used for performing similarity among different schemes; # use the metadata mapping to compute the given similarity metric among metadata from different schemes. 
Currently, Tika enables consistent metadata across file formats by relying on [TikaCoreProperties|http://tika.apache.org/1.9/api/org/apache/tika/metadata/TikaCoreProperties.html], which are defined in terms of other standard namespaces. However, this core set of metadata could limit the interoperability among many metadata schemes, since Tika developers are continually providing support for new filetypes (and metadata schemes). Furthermore, I have identified two more functionalities for better metadata interoperability: * a fine-grained mapping technique to potentially define metadata mappings for each mimetype. This allows, for example, either to exclude the mapping of metadata for some types or to provide different mappings of the same schema on different types. * a metadata mapping technique that subsumes schema mapping (property names) and instance transformation (property values). I am working on providing a default mediated schema (via XML-based configuration) based on a core set of utility (Java) methods for metadata mapping. You can find in attachment (_mapping_example_) an extremely simple diagram that reports an example of metadata mapping by defining source property, target property (that provides essentially schema mapping), mapping expression (that describes the semantics of each mapping relationship), and function (that provides instance transformation). By the way, I am also working on a [D3|http://d3js.org/]-based utility that allows visualizing the new metadata mappings provided in Tika starting from the XML configuration file (i.e., {{tika-metadata.xml}}). The output is based on the [hierarchical edge bundling algorithm|https://github.com/mbostock/d3/wiki/Bundle-Layout]. Regarding the possibility to provide mappings on the getter side, I think that is a great idea. I believe that we should enable the users to select programmatically (or via configuration) whether to use mappings on the setter side or not. 
For instance, providing mappings on the setter side requires performing the actual mapping only during extraction, whereas on the getter side the mappings would be performed for each {{metadata.get()}}. Thanks again, Nick, for your feedback. I hope that you are going to give more comments on this work. I would really appreciate it. I take this opportunity to thank [~chrismattmann] for supporting me on this work. Cheers, Giuseppe Apache Tika for enabling metadata interoperability -- Key: TIKA-1691 URL: https://issues.apache.org/jira/browse/TIKA-1691 Project: Tika Issue Type: New Feature Reporter: Giuseppe Totaro Assignee: Giuseppe Totaro Labels: mapping, metadata Attachments: mapping_example.pdf If I am not wrong, enabling consistent metadata across file formats is already (partially) provided in Tika by relying on {{TikaCoreProperties}} and, within the context of Solr, {{ExtractingRequestHandler}} (by defining how to map metadata fields in {{solrconfig.xml}}). However
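As a rough sketch of the setter-side approach discussed in this thread, the decoration of {{set}} could look as follows. This is plain Java with a simplified map-backed stand-in for {{Metadata}}; the property names, mapping table, and transformation function are hypothetical, not Tika's actual API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Sketch of setter-side metadata mapping: the property name is translated to a
// mediated schema and an optional value-transformation (instance transformation)
// is applied before the value is stored. All names here are illustrative only.
public class MappedMetadataSketch {
    private final Map<String, String> properties = new HashMap<>();
    private final Map<String, String> schemaMapping;           // source name -> target name
    private final Map<String, UnaryOperator<String>> valueFns; // target name -> value transformation

    public MappedMetadataSketch(Map<String, String> schemaMapping,
                                Map<String, UnaryOperator<String>> valueFns) {
        this.schemaMapping = schemaMapping;
        this.valueFns = valueFns;
    }

    // Decorated set(): apply schema mapping and instance transformation first.
    public void set(String name, String value) {
        String target = schemaMapping.getOrDefault(name, name);
        UnaryOperator<String> fn = valueFns.getOrDefault(target, UnaryOperator.identity());
        properties.put(target, fn.apply(value));
    }

    public String get(String name) {
        return properties.get(name);
    }

    public static void main(String[] args) {
        Map<String, String> schema = new HashMap<>();
        schema.put("Author", "dc:creator"); // hypothetical mapping to a mediated schema
        Map<String, UnaryOperator<String>> fns = new HashMap<>();
        fns.put("dc:creator", String::trim); // hypothetical instance transformation

        MappedMetadataSketch md = new MappedMetadataSketch(schema, fns);
        md.set("Author", "  Giuseppe Totaro ");
        System.out.println(md.get("dc:creator")); // prints "Giuseppe Totaro"
    }
}
```

This also makes the setter-versus-getter trade-off concrete: here the mapping cost is paid once per {{set}} during extraction, while a getter-side variant would pay it on every {{get}}.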
[jira] [Updated] (TIKA-1691) Apache Tika for enabling metadata interoperability
[ https://issues.apache.org/jira/browse/TIKA-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Giuseppe Totaro updated TIKA-1691: -- Attachment: mapping_example.pdf Apache Tika for enabling metadata interoperability -- Key: TIKA-1691 URL: https://issues.apache.org/jira/browse/TIKA-1691 Project: Tika Issue Type: New Feature Reporter: Giuseppe Totaro Assignee: Giuseppe Totaro Labels: mapping, metadata Attachments: mapping_example.pdf If I am not wrong, enabling consistent metadata across file formats is already (partially) provided in Tika by relying on {{TikaCoreProperties}} and, within the context of Solr, {{ExtractingRequestHandler}} (by defining how to map metadata fields in {{solrconfig.xml}}). However, I am working on a new component for both schema mapping (to operate on the name of metadata properties) and instance transformation (to operate on the value of metadata) that consists, essentially, of the following changes: * A wrapper of the {{Metadata}} object ({{MappedMetadata.java}}) that decorates the {{set}} method (currently, line number 367 of {{Metadata.java}}) by applying the given mapping functions (via configuration) before setting metadata properties. * Basic mapping functions ({{BasicMappingUtils.java}}) that are utility methods to map a set of metadata to the target schema. * A new {{MetadataConfig}} object that, like {{TikaConfig}}, may be configured via an XML file (organized as shown in the following snippet) and allows performing a fine-grained metadata mapping by using Java reflection. {code:xml|title=tika-metadata.xml|borderStyle=solid} <?xml version="1.0" encoding="UTF-8" standalone="no"?>
<properties>
  <mappings>
    <mapping type="type/sub-type">
      <relation name="SOURCE_FIELD">
        <target>TARGET_FIELD</target>
        <expression>exclude|include|equivalent|overlap</expression>
        <function name="FUNCTION_NAME">
          <argument>ARGUMENT_VALUE</argument>
        </function>
        <cardinality>
          <source>SOURCE_CARDINALITY</source>
          <target>TARGET_CARDINALITY</target>
          <order>ORDER_NUMBER</order>
          <dependencies>
            <field>FIELD_NAME</field>
          </dependencies>
        </cardinality>
      </relation>
    </mapping>
    ...
    <mapping>
      <!-- This contains the fallback strategy for unknown metadata -->
      <relation>
        ...
      </relation>
    </mapping>
  </mappings>
</properties>
{code} The theoretical definition of metadata mapping is available in [A survey of techniques for achieving metadata interoperability|http://www.researchgate.net/profile/Bernhard_Haslhofer/publication/220566013_A_survey_of_techniques_for_achieving_metadata_interoperability/links/02e7e533e76187c0b800.pdf]. This paper also shows some basic examples of metadata mappings. Currently, I am still working on some core functionalities, but I have already performed some experiments by using a small prototype. By the way, I think that we should modify the method {{add}} in order to use {{set}} instead of {{metadata.put}} (currently, line number 316 of {{Metadata.java}}). This is a trivial change (I could create a new Jira issue about that), but it would be consistent with the other implementation of the {{add}} method and, moreover, the methods of {{Metadata}} could be extended more easily. I would really appreciate your feedback about this proposal. If you believe that it is a good idea, I could provide the code in a few days. Thanks a lot, Giuseppe -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TIKA-1691) Apache Tika for enabling metadata interoperability
Giuseppe Totaro created TIKA-1691: - Summary: Apache Tika for enabling metadata interoperability Key: TIKA-1691 URL: https://issues.apache.org/jira/browse/TIKA-1691 Project: Tika Issue Type: New Feature Reporter: Giuseppe Totaro Assignee: Giuseppe Totaro If I am not wrong, enabling consistent metadata across file formats is already (partially) provided in Tika by relying on {{TikaCoreProperties}} and, within the context of Solr, {{ExtractingRequestHandler}} (by defining how to map metadata fields in {{solrconfig.xml}}). However, I am working on a new component for both schema mapping (to operate on the name of metadata properties) and instance transformation (to operate on the value of metadata) that consists, essentially, of the following changes: * A wrapper of the {{Metadata}} object ({{MappedMetadata.java}}) that decorates the {{set}} method (currently, line number 367 of {{Metadata.java}}) by applying the given mapping functions (via configuration) before setting metadata properties. * Basic mapping functions ({{BasicMappingUtils.java}}) that are utility methods to map a set of metadata to the target schema. * A new {{MetadataConfig}} object that, like {{TikaConfig}}, may be configured via an XML file (organized as shown in the following snippet) and allows performing a fine-grained metadata mapping by using Java reflection. {code:xml|title=tika-metadata.xml|borderStyle=solid} <?xml version="1.0" encoding="UTF-8" standalone="no"?>
<properties>
  <mappings>
    <mapping type="type/sub-type">
      <relation name="SOURCE_FIELD">
        <target>TARGET_FIELD</target>
        <expression>exclude|include|equivalent|overlap</expression>
        <function name="FUNCTION_NAME">
          <argument>ARGUMENT_VALUE</argument>
        </function>
        <cardinality>
          <source>SOURCE_CARDINALITY</source>
          <target>TARGET_CARDINALITY</target>
          <order>ORDER_NUMBER</order>
          <dependencies>
            <field>FIELD_NAME</field>
          </dependencies>
        </cardinality>
      </relation>
    </mapping>
    ...
    <mapping>
      <!-- This contains the fallback strategy for unknown metadata -->
      <relation>
        ...
      </relation>
    </mapping>
  </mappings>
</properties>
{code} The theoretical definition of metadata mapping is available in [A survey of techniques for achieving metadata interoperability|http://www.researchgate.net/profile/Bernhard_Haslhofer/publication/220566013_A_survey_of_techniques_for_achieving_metadata_interoperability/links/02e7e533e76187c0b800.pdf]. This paper also shows some basic examples of metadata mappings. Currently, I am still working on some core functionalities, but I have already performed some experiments by using a small prototype. By the way, I think that we should modify the method {{add}} in order to use {{set}} instead of {{metadata.put}} (currently, line number 316 of {{Metadata.java}}). This is a trivial change (I could create a new Jira issue about that), but it would be consistent with the other implementation of the {{add}} method and, moreover, the methods of {{Metadata}} could be extended more easily. I would really appreciate your feedback about this proposal. If you believe that it is a good idea, I could provide the code in a few days. Thanks a lot, Giuseppe -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-1654) Reset cTAKES CAS into CTAKESParser
[ https://issues.apache.org/jira/browse/TIKA-1654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Giuseppe Totaro resolved TIKA-1654. --- Resolution: Fixed Reset cTAKES CAS into CTAKESParser -- Key: TIKA-1654 URL: https://issues.apache.org/jira/browse/TIKA-1654 Project: Tika Issue Type: Bug Components: parser Reporter: Giuseppe Totaro Assignee: Giuseppe Totaro Labels: patch Fix For: 1.10 Attachments: TIKA-1654.patch, TIKA-1654.v02.patch Using [CTAKESParser from Tika Server|https://wiki.apache.org/tika/cTAKESParser], I noticed that an exception occurs when the CTAKESParser is used multiple times: {noformat} org.apache.uima.cas.CASRuntimeException: Data for Sofa feature setLocalSofaData() has already been set. {noformat} This is due to the CAS (Common Analysis System) used by CTAKESParser. The CAS, like the AE (AnalysisEngine), is a static field in CTAKESParser, making it a sort of singleton. By the way, an Analysis Engine is a cTAKES/UIMA component responsible for analyzing unstructured information, discovering and representing semantic content. An AnalysisEngine operates on an analysis structure (implemented by the CAS). It is highly recommended to reuse the CAS, but it has to be reset before the next run. The CTAKESUtils class ({{org.apache.tika.parser.ctakes}}) provides the reset method to release all resources held by both the AnalysisEngine and the CAS and then destroy them. This method prevents the CASRuntimeException error. You can find in attachment the patch including two new methods (resetCAS and resetAE) to reset, but not destroy, the CAS and the AnalysisEngine respectively. By using only resetCAS, CTAKESParser can reuse both CAS and AE instead of building them again for each run. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
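The reset-but-not-destroy pattern described above can be modeled in plain Java without a UIMA dependency. The class below is a hypothetical stand-in that mimics the CAS behavior of rejecting a second {{setLocalSofaData}} call until it is reset; it is only an illustration of why {{resetCAS}} lets the same instance be reused across runs:

```java
// Plain-Java model (no UIMA dependency) of the reuse pattern the patch adopts:
// a resource that throws if its per-run data is set twice, so it must be
// reset -- but not destroyed -- between runs, mirroring the
// CASRuntimeException raised on a second setLocalSofaData() call.
// Class and method names are invented for illustration.
public class ReusableAnalysisStructure {
    private String sofaData;

    public void setLocalSofaData(String data) {
        if (sofaData != null) {
            throw new IllegalStateException(
                "Data for Sofa feature setLocalSofaData() has already been set.");
        }
        sofaData = data;
    }

    // resetCAS-style: clear per-run state so the same instance can be reused,
    // avoiding the cost of rebuilding the structure for every document.
    public void reset() {
        sofaData = null;
    }

    public static void main(String[] args) {
        ReusableAnalysisStructure cas = new ReusableAnalysisStructure();
        cas.setLocalSofaData("run 1");
        cas.reset();                 // without this, the next call would throw
        cas.setLocalSofaData("run 2");
        System.out.println("reused without rebuilding");
    }
}
```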
[jira] [Updated] (TIKA-1654) Reset cTAKES CAS into CTAKESParser
[ https://issues.apache.org/jira/browse/TIKA-1654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Giuseppe Totaro updated TIKA-1654: -- Fix Version/s: (was: 1.9) 1.10 Reset cTAKES CAS into CTAKESParser -- Key: TIKA-1654 URL: https://issues.apache.org/jira/browse/TIKA-1654 Project: Tika Issue Type: Bug Components: parser Reporter: Giuseppe Totaro Assignee: Giuseppe Totaro Labels: patch Fix For: 1.10 Attachments: TIKA-1654.patch Using [CTAKESParser from Tika Server|https://wiki.apache.org/tika/cTAKESParser], I noticed that an exception occurs when the CTAKESParser is used multiple times: {noformat} org.apache.uima.cas.CASRuntimeException: Data for Sofa feature setLocalSofaData() has already been set. {noformat} This is due to the CAS (Common Analysis System) used by CTAKESParser. The CAS, as the AE (AnalysisEngine), is a static field into CTAKESParser to make a sort of singleton. By the way, An Analysis Engine is a cTAKES/UIMA component responsible for analyzing unstructured information, discovering and representing semantic content. An AnalysisEngine operates on an analysis structure (implemented by CAS). It is highly recommended to reuse the CAS, but it has to be reset before the next run. The CTAKESUtils class ({{org.apache.tika.parser.ctakes}}) provides the reset method to release all resources held by both AnalysisEngine and CAS and then destroy them. This method prevents the CASRuntimeException error. You can find in attachment the patch including two new methods (resetCAS and resetAE) to reset, but not to destroy, the CAS and the AnalysisEngine respectively. By using only resetCAS, CTAKESParser can reuse both CAS and AE instead of building them again for each run. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TIKA-1654) Reset cTAKES CAS into CTAKESParser
Giuseppe Totaro created TIKA-1654: - Summary: Reset cTAKES CAS into CTAKESParser Key: TIKA-1654 URL: https://issues.apache.org/jira/browse/TIKA-1654 Project: Tika Issue Type: Bug Components: parser Reporter: Giuseppe Totaro Assignee: Giuseppe Totaro Using [CTAKESParser from Tika Server|https://wiki.apache.org/tika/cTAKESParser], I noticed that an exception occurs when the CTAKESParser is used multiple times: {noformat} org.apache.uima.cas.CASRuntimeException: Data for Sofa feature setLocalSofaData() has already been set. {noformat} This is due to the CAS (Common Analysis System) used by CTAKESParser. The CAS, as the AE (AnalysisEngine), is a static field into CTAKESParser to make a sort of singleton. By the way, An Analysis Engine is a cTAKES/UIMA component responsible for analyzing unstructured information, discovering and representing semantic content. An AnalysisEngine operates on an analysis structure (implemented by CAS). It is highly recommended to reuse the CAS, but it has to be reset before the next run. The CTAKESUtils class ({{org.apache.tika.parser.ctakes}}) provides the reset method to release all resources held by both AnalysisEngine and CAS and then destroy them. This method prevents the CASRuntimeException error. You can find in attachment the patch including two new methods (resetCAS and resetAE) to reset, but not to destroy, the CAS and the AnalysisEngine respectively. By using only resetCAS, CTAKESParser can reuse both CAS and AE instead of building them again for each run. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1654) Reset cTAKES CAS into CTAKESParser
[ https://issues.apache.org/jira/browse/TIKA-1654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Giuseppe Totaro updated TIKA-1654: -- Fix Version/s: 1.9 Reset cTAKES CAS into CTAKESParser -- Key: TIKA-1654 URL: https://issues.apache.org/jira/browse/TIKA-1654 Project: Tika Issue Type: Bug Components: parser Reporter: Giuseppe Totaro Assignee: Giuseppe Totaro Labels: patch Fix For: 1.9 Using [CTAKESParser from Tika Server|https://wiki.apache.org/tika/cTAKESParser], I noticed that an exception occurs when the CTAKESParser is used multiple times: {noformat} org.apache.uima.cas.CASRuntimeException: Data for Sofa feature setLocalSofaData() has already been set. {noformat} This is due to the CAS (Common Analysis System) used by CTAKESParser. The CAS, as the AE (AnalysisEngine), is a static field into CTAKESParser to make a sort of singleton. By the way, An Analysis Engine is a cTAKES/UIMA component responsible for analyzing unstructured information, discovering and representing semantic content. An AnalysisEngine operates on an analysis structure (implemented by CAS). It is highly recommended to reuse the CAS, but it has to be reset before the next run. The CTAKESUtils class ({{org.apache.tika.parser.ctakes}}) provides the reset method to release all resources held by both AnalysisEngine and CAS and then destroy them. This method prevents the CASRuntimeException error. You can find in attachment the patch including two new methods (resetCAS and resetAE) to reset, but not to destroy, the CAS and the AnalysisEngine respectively. By using only resetCAS, CTAKESParser can reuse both CAS and AE instead of building them again for each run. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1654) Reset cTAKES CAS into CTAKESParser
[ https://issues.apache.org/jira/browse/TIKA-1654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Giuseppe Totaro updated TIKA-1654: -- Attachment: TIKA-1654.patch
Re: [VOTE] Release Apache Tika 1.9 Candidate #2
Hi Chris, I have tested Tika 1.9-rc2. In particular, I checked the new work on CTAKESParser. Thank you for your great work. My vote for this RC is +1. Thanks, Giuseppe On Mon, Jun 8, 2015 at 8:58 AM, Konstantin Gribov gros...@gmail.com wrote: Hi, Chris. The SHA1 hash and GPG signature are valid for all published artifacts. I've tested 1.9-rc2 on several text docs (rtf, pdf, doc, docx) and the results are quite good. I've found a minor regression since 1.7 (it may be related to POI, not Tika itself), but it shouldn't prevent releasing 1.9 from rc2. I'll try to create a doc that reproduces it and file a JIRA ticket, because I can't share the original doc file on which it can be reproduced. JFYI, o.a.t.p.microsoft.OfficeParser produces U+200B (ZERO WIDTH SPACE) where U+00AD (SOFT HYPHEN) should be. The same document saved as odt and docx gives different content (one has U+00AD at the same position, the other has nothing there, like tika-app-1.7 had). [x] +1 Release this package as Apache Tika 1.9 [ ] -1 Do not release this package because… Thank you for preparing this release. -- Best regards, Konstantin Gribov On Sun, 7 June 2015 at 4:47, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: Hi Folks, A second candidate for the Tika 1.9 release is available at: https://dist.apache.org/repos/dist/dev/tika/ The release candidate is a zip archive of the sources in: http://svn.apache.org/repos/asf/tika/tags/1.9-rc2/ The SHA1 checksum of the archive is 9b78c9e9ce9640b402b7fef8e30f3cdbe384f44c. In addition, a staged maven repository is available here: https://repository.apache.org/content/repositories/orgapachetika-1011/ Please vote on releasing this package as Apache Tika 1.9. The vote is open for the next 72 hours and passes if a majority of at least three +1 Tika PMC votes are cast. [ ] +1 Release this package as Apache Tika 1.9 [ ] -1 Do not release this package because… Cheers, Chris P.S. Of course here is my +1. ++ Chris Mattmann, Ph.D. 
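The code-point mismatch Konstantin describes (U+200B where U+00AD is expected) can be checked mechanically when comparing extracted text across versions. A small illustrative helper follows; {{CodePointCheck}} is hypothetical and not part of Tika.

```java
// Counts occurrences of a given Unicode code point in extracted text;
// useful for spotting U+200B (ZERO WIDTH SPACE) vs U+00AD (SOFT HYPHEN)
// regressions. Hypothetical helper, not part of Tika.
public class CodePointCheck {
    public static long count(String text, int codePoint) {
        return text.codePoints()
                   .filter(cp -> cp == codePoint)
                   .count();
    }
}
```

Running this over the output of two Tika versions on the same document makes the kind of regression described above easy to quantify.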
Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
[jira] [Updated] (TIKA-1645) Extraction of biomedical information using CTAKESParser
[ https://issues.apache.org/jira/browse/TIKA-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Giuseppe Totaro updated TIKA-1645: -- Attachment: TIKA-1645.v02.patch Extraction of biomedical information using CTAKESParser --- Key: TIKA-1645 URL: https://issues.apache.org/jira/browse/TIKA-1645 Project: Tika Issue Type: Improvement Components: parser Reporter: Giuseppe Totaro Assignee: Giuseppe Totaro Labels: patch Fix For: 1.10 Attachments: CTAKESConfig.properties, TIKA-1645.patch, TIKA-1645.v02.patch, tika-config.xml As mentioned in [TIKA-1642|https://issues.apache.org/jira/browse/TIKA-1642], [CTAKESContentHandler|https://github.com/giuseppetotaro/CTAKESContentHadler] is preliminary work to integrate [Apache cTAKES|http://ctakes.apache.org/] into Tika, allowing users to extract biomedical information from clinical text. Essentially, this work includes a wrapper for CAS serializers that aims at dumping the identified annotations into XML-based formats. The new patch in attachment includes CTAKESParser, a new parser that decorates AutoDetectParser and relies on a new version of CTAKESContentHandler, based on feedback from [TIKA-1642|https://issues.apache.org/jira/browse/TIKA-1642]. This parser generates the same output as AutoDetectParser and, in addition, metadata containing the clinical annotations identified by cTAKES. To run a cTAKES AnalysisEngine through Tika's CTAKESParser, you first need to install the latest stable release of cTAKES (3.2.2), following the instructions in the [User Install Guide|https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+3.2+User+Install+Guide]. 
Then, you can launch Tika as follows: {noformat} CTAKES_HOME=/usr/local/apache-ctakes-3.2.2 java -cp tika-app-1.10-SNAPSHOT.jar:${CTAKES_HOME}/desc:${CTAKES_HOME}/resources:${CTAKES_HOME}/lib/*:/path/to/CTAKESConfig org.apache.tika.cli.TikaCLI --config=/path/to/tika-config.xml /path/to/input {noformat} In the example above, {{/path/to/CTAKESConfig}} is the parent directory of the file {{org/apache/tika/parser/ctakes/CTAKESConfig.properties}}, which contains the configuration properties used to build the cTAKES AnalysisEngine; {{tika-config.xml}} is a custom Tika configuration file that lists the mimetypes for which CTAKESParser will perform parsing. Attached is an example of both {{CTAKESConfig.properties}} and {{tika-config.xml}} for parsing ISA-Tab files using cTAKES. You need [UMLS credentials|https://uts.nlm.nih.gov/home.html] in order to use the UMLS-based components of cTAKES. I would really appreciate your feedback. Thanks [~selina], [~chrismattmann] and [~lewismc] for supporting me on this work.
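For illustration, a {{tika-config.xml}} along the lines described above might look as follows. This is only a sketch: it assumes the Tika 1.x XML configuration format, and the mimetype shown is an example; the actual file attached to TIKA-1645 is authoritative.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical sketch of a tika-config.xml registering CTAKESParser
     for a sample mimetype; see the TIKA-1645 attachment for the real one. -->
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.ctakes.CTAKESParser">
      <mime>text/plain</mime>
    </parser>
  </parsers>
</properties>
```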
[jira] [Commented] (TIKA-1645) Extraction of biomedical information using CTAKESParser
[ https://issues.apache.org/jira/browse/TIKA-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14572252#comment-14572252 ] Giuseppe Totaro commented on TIKA-1645: --- Hi [~chrismattmann], thanks for your feedback. I really appreciate it. You can find a new patch in attachment. Basically, the patch includes the Java class CTAKESParser, which decorates AutoDetectParser and leverages the cTAKES Java APIs to extract biomedical information from text and, optionally, metadata. All the [IdentifiedAnnotation|http://ctakes.apache.org/apidocs/trunk/org/apache/ctakes/typesystem/type/textsem/IdentifiedAnnotation.html] instances extracted by cTAKES are then included in the file metadata using, by default, the prefix {{ctakes:}}. To build Tika with this patch via Maven, I had to modify {{tika-parsers/pom.xml}} and {{tika-bundle/pom.xml}}, otherwise several “cannot find symbol” errors would be generated at compile time. In more detail, I added the {{ctakes-core}} dependency (scope “provided”) to {{tika-parsers/pom.xml}} and excluded both the ctakes and uima dependencies in {{tika-bundle/pom.xml}} using the following directives in {{Import-Package}}: {noformat} !org.apache.ctakes.* !org.apache.uima.* {noformat} By the way, I am going to implement another version of CTAKESParser as an external parser. 
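The dependency change described above would look roughly like this in {{tika-parsers/pom.xml}}. The coordinates and version below are assumptions based on the cTAKES 3.2.2 release mentioned in the issue; the attached patch is authoritative.

```xml
<!-- Hypothetical sketch of the ctakes-core dependency added to
     tika-parsers/pom.xml; see the TIKA-1645 patch for the real change. -->
<dependency>
  <groupId>org.apache.ctakes</groupId>
  <artifactId>ctakes-core</artifactId>
  <version>3.2.2</version>
  <scope>provided</scope>
</dependency>
```

With scope {{provided}}, cTAKES must be present on the runtime classpath (as in the launch command above) but is not bundled with Tika.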
Thanks again, Giuseppe
[jira] [Created] (TIKA-1645) Extraction of biomedical information using CTAKESParser
Giuseppe Totaro created TIKA-1645: - Summary: Extraction of biomedical information using CTAKESParser Key: TIKA-1645 URL: https://issues.apache.org/jira/browse/TIKA-1645 Project: Tika Issue Type: Improvement Components: parser Reporter: Giuseppe Totaro
[jira] [Assigned] (TIKA-1645) Extraction of biomedical information using CTAKESParser
[ https://issues.apache.org/jira/browse/TIKA-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Giuseppe Totaro reassigned TIKA-1645: - Assignee: Giuseppe Totaro
[jira] [Updated] (TIKA-1645) Extraction of biomedical information using CTAKESParser
[ https://issues.apache.org/jira/browse/TIKA-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Giuseppe Totaro updated TIKA-1645: -- Attachment: TIKA-1645.patch
[jira] [Updated] (TIKA-1645) Extraction of biomedical information using CTAKESParser
[ https://issues.apache.org/jira/browse/TIKA-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Giuseppe Totaro updated TIKA-1645: -- Labels: patch (was: )
[jira] [Updated] (TIKA-1645) Extraction of biomedical information using CTAKESParser
[ https://issues.apache.org/jira/browse/TIKA-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Giuseppe Totaro updated TIKA-1645: -- Attachment: CTAKESConfig.properties tika-config.xml
[jira] [Assigned] (TIKA-1642) Integrate cTAKES into Tika
[ https://issues.apache.org/jira/browse/TIKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Giuseppe Totaro reassigned TIKA-1642: - Assignee: Giuseppe Totaro Integrate cTAKES into Tika -- Key: TIKA-1642 URL: https://issues.apache.org/jira/browse/TIKA-1642 Project: Tika Issue Type: Improvement Components: parser Reporter: Selina Chu Assignee: Giuseppe Totaro [~gostep] has written a preliminary version of [CTAKESContentHandler|https://github.com/giuseppetotaro/CTAKESContentHadler] to integrate [Apache cTAKES|http://ctakes.apache.org/] into Tika. CTAKESContentHandler allows Tika to perform the following steps: * create an AnalysisEngine based on a given XML descriptor; * create a CAS (Common Analysis System) appropriate for this AnalysisEngine; * populate the CAS with the text extracted by Tika; * run the AnalysisEngine against the plain text added to the CAS; * write out the results in the given format (XML, XCAS, XMI, etc.). It would be a great improvement if we could parse the output of cTAKES and create a list of metadata describing the terms found in the annotation index and their corresponding tokens. For instance, using the AggregatePlaintextFastUMLSProcessor analysis engine, we can utilize the UMLS database to obtain the annotations related to DiseaseDisorderMention, and I would like to be able to produce a list of words in the input text that are annotated as DiseaseDorderMention.
[jira] [Commented] (TIKA-1642) Integrate cTAKES into Tika
[ https://issues.apache.org/jira/browse/TIKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14563993#comment-14563993 ] Giuseppe Totaro commented on TIKA-1642: --- Hi [~selina], I believe that is a great idea. I am going to update my code on GitHub right now and add support for cTAKES metadata as you suggested. Then I will post a new patch for Tika here. Thanks a lot, Giuseppe
Re: [ANNOUNCE] Welcome Giuseppe Totaro As Tika Committer + PMC Member
Thanks a lot David. I apologize for my delay. I am very proud to be part of this project as committer and member of the Tika PMC. I am working on Information Retrieval at scale under the supervision of Professor Chris Mattmann at NASA JPL. I developed new parsers (e.g., ISArchiveParser) and now I am working on adding support for more data formats in Tika. I take this opportunity to thank Chris Mattmann and Lewis McGibbney for kindly supporting me on this work. I would really like to get your feedback on my work. Feel free to ask me any question. Cheers, Giuseppe On Thu, Apr 9, 2015 at 1:27 PM, David Meikle dmei...@apache.org wrote: Hello All, Please welcome Giuseppe Totaro as he joins us as the latest Tika committer and PMC Member. He's recently been VOTEd in and now has his account all set up so is ready to roll! Giuseppe, please feel free to say a bit about yourself as an introduction to the group. Welcome aboard, Dave
[jira] [Updated] (TIKA-1580) ISA-Tab parsers
[ https://issues.apache.org/jira/browse/TIKA-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Giuseppe Totaro updated TIKA-1580: -- Attachment: TIKA-1580.v03.2.Mattmann.Totaro.03262015.patch ISA-Tab parsers --- Key: TIKA-1580 URL: https://issues.apache.org/jira/browse/TIKA-1580 Project: Tika Issue Type: New Feature Components: parser Reporter: Giuseppe Totaro Assignee: Chris A. Mattmann Priority: Minor Labels: new-parser Fix For: 1.8 Attachments: TIKA-1580.Mattmann.Totaro.032515.patch.txt, TIKA-1580.patch, TIKA-1580.v02.patch, TIKA-1580.v03.2.Mattmann.Totaro.03262015.patch We are going to add parsers for the ISA-Tab data formats. ISA-Tab files are related to the [ISA Tools|http://www.isa-tools.org/], which help to manage an increasingly diverse set of life science, environmental, and biomedical experiments that employ one or a combination of technologies. The ISA tools are built upon the _Investigation_, _Study_, and _Assay_ tabular formats. Therefore, the ISA-Tab data format includes three types of file: Investigation files ({{i_.txt}}), Study files ({{s_.txt}}), and Assay files ({{a_.txt}}). These files are organized in a [top-down hierarchy|http://www.isa-tools.org/format/specification/]: an Investigation file includes one or more Study files, and each Study file includes one or more Assay files. Essentially, the Investigation file contains high-level information about the related study, so it provides only metadata about the ISA-Tab files. More details on the file format specification are [available online|http://isatab.sourceforge.net/docs/ISA-TAB_release-candidate-1_v1.0_24nov08.pdf]. The attached patch provides a preliminary version of the ISA-Tab parsers (one parser for each ISA-Tab filetype): * {{ISATabInvestigationParser.java}}: parses Investigation files. It extracts only metadata. * {{ISATabStudyParser.java}}: parses Study files. * {{ISATabAssayParser.java}}: parses Assay files. 
The most important remaining improvements are: * combining these three parsers in order to parse an ISArchive; * providing a better mapping of both study and assay data onto XHTML. Currently, {{ISATabStudyParser}} and {{ISATabAssayParser}} provide a naive mapping function relying on [Apache Commons CSV|https://commons.apache.org/proper/commons-csv/]. Thanks for supporting me on this work [~chrismattmann].
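Since ISA-Tab files are tab-delimited, the row-splitting step can be illustrated with a minimal helper. {{IsaTabRow}} is hypothetical and only shows the idea; the real parsers rely on Apache Commons CSV rather than a naive split.

```java
import java.util.Arrays;
import java.util.List;

// Minimal illustration of reading one tab-delimited ISA-Tab row into fields.
// Hypothetical helper; the actual parsers use Apache Commons CSV.
public class IsaTabRow {
    public static List<String> fields(String line) {
        // limit -1 keeps trailing empty fields, which ISA-Tab rows may contain
        return Arrays.asList(line.split("\t", -1));
    }
}
```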
[jira] [Updated] (TIKA-1580) ISA-Tab parsers
[ https://issues.apache.org/jira/browse/TIKA-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Giuseppe Totaro updated TIKA-1580: -- Attachment: TIKA-1580.v03.Mattmann.Totaro.03262015.patch Hi all, I uploaded a new patch ({{TIKA-1580.v03.Mattmann.Totaro.03262015.patch}}) including a parser for ISATab archive. In particular, this patch includes a new {{ISArchiveParser}} java class that leverages on {{ISATabUtils}} static methods. {{ISATabUtils}} is a utility class that provides methods for parsing investigation, study, and assay files. {{ISArchiveParser}} runs over study files. It starts from the given study file and looks for the related investigation and assay files in the same directory. Mimetype detection is provided also for investigation and assay files. Thanks [~chrismattmann] for helping me on this stuff. ISA-Tab parsers --- Key: TIKA-1580 URL: https://issues.apache.org/jira/browse/TIKA-1580 Project: Tika Issue Type: New Feature Components: parser Reporter: Giuseppe Totaro Assignee: Chris A. Mattmann Priority: Minor Labels: new-parser Fix For: 1.8 Attachments: TIKA-1580.Mattmann.Totaro.032515.patch.txt, TIKA-1580.patch, TIKA-1580.v02.patch, TIKA-1580.v03.Mattmann.Totaro.03262015.patch We are going to add parsers for ISA-Tab data formats. ISA-Tab files are related to [ISA Tools|http://www.isa-tools.org/] which help to manage an increasingly diverse set of life science, environmental and biomedical experiments that employing one or a combination of technologies. The ISA tools are built upon _Investigation_, _Study_, and _Assay_ tabular format. Therefore, ISA-Tab data format includes three types of file: Investigation file ({{a_.txt}}), Study file ({{s_.txt}}), Assay file ({{a_.txt}}). These files are organized as [top-down hierarchy|http://www.isa-tools.org/format/specification/]: An Investigation file includes one or more Study files: each Study files includes one or more Assay files. 
Essentially, the Investigation file contains high-level information about the related study, so it provides only metadata about the ISA-Tab files. More details on the file format specification are [available online|http://isatab.sourceforge.net/docs/ISA-TAB_release-candidate-1_v1.0_24nov08.pdf]. The attached patch provides a preliminary version of the ISA-Tab parsers (there are three parsers, one for each ISA-Tab filetype): * {{ISATabInvestigationParser.java}}: parses Investigation files. It extracts only metadata. * {{ISATabStudyParser.java}}: parses Study files. * {{ISATabAssayParser.java}}: parses Assay files. The most important improvements are: * Combine these three parsers in order to parse an ISArchive * Provide a better mapping of both study and assay data on XHTML. Currently, {{ISATabStudyParser}} and {{ISATabAssayParser}} provide a naive mapping function relying on [Apache Commons CSV|https://commons.apache.org/proper/commons-csv/]. Thanks for supporting me on this work [~chrismattmann].
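The directory lookup performed by {{ISArchiveParser}} (start from a study file, then locate the sibling investigation and assay files in the same directory) can be sketched in plain Java. Class and method names below are illustrative only, not Tika's actual API:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch (not the actual ISArchiveParser code): given an
// ISA-Tab study file (s_*.txt), collect the sibling files in the same
// directory that match an ISA-Tab prefix such as "i_" or "a_".
public class IsaTabSiblings {
    public static List<File> findSiblings(File studyFile, String prefix) {
        List<File> result = new ArrayList<>();
        File dir = studyFile.getAbsoluteFile().getParentFile();
        File[] entries = (dir == null) ? null : dir.listFiles();
        if (entries != null) {
            for (File f : entries) {
                String name = f.getName();
                // keep only regular files with the requested ISA-Tab prefix
                if (f.isFile() && name.startsWith(prefix) && name.endsWith(".txt")) {
                    result.add(f);
                }
            }
        }
        return result;
    }
}
```

Calling {{findSiblings(studyFile, "i_")}} and {{findSiblings(studyFile, "a_")}} would then give the investigation and assay files related to the given study file.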
Re: Review Request 32291: ISATab parsers (preliminary version)
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/32291/ --- (Updated March 23, 2015, 5:04 p.m.) Review request for tika and Chris Mattmann. Bugs: TIKA-1580 https://issues.apache.org/jira/browse/TIKA-1580 Repository: tika Description --- ISATab parsers. This preliminary solution provides three parsers, one for each ISA-Tab filetype (Investigation, Study, Assay). Diffs (updated) - trunk/tika-bundle/pom.xml 1668683 trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml 1668683 trunk/tika-parsers/pom.xml 1668683 trunk/tika-parsers/src/main/java/org/apache/tika/parser/isatab/ISATabAssayParser.java PRE-CREATION trunk/tika-parsers/src/main/java/org/apache/tika/parser/isatab/ISATabInvestigationParser.java PRE-CREATION trunk/tika-parsers/src/main/java/org/apache/tika/parser/isatab/ISATabStudyParser.java PRE-CREATION trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser 1668683 trunk/tika-parsers/src/test/java/org/apache/tika/parser/isatab/ISATabAssayParserTest.java PRE-CREATION trunk/tika-parsers/src/test/java/org/apache/tika/parser/isatab/ISATabInvestigationParserTest.java PRE-CREATION trunk/tika-parsers/src/test/java/org/apache/tika/parser/isatab/ISATabStudyParserTest.java PRE-CREATION trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_bii-s-2_metabolite profiling_NMR spectroscopy.txt PRE-CREATION trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_metabolome.txt PRE-CREATION trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_microarray.txt PRE-CREATION trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_proteome.txt PRE-CREATION trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_transcriptome.txt PRE-CREATION trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/i_investigation.txt PRE-CREATION 
trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/s_BII-S-1.txt PRE-CREATION trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/s_BII-S-2.txt PRE-CREATION Diff: https://reviews.apache.org/r/32291/diff/ Testing --- Tested on sample ISA-Tab files downloaded from http://www.isa-tools.org/format/examples/. Thanks, Giuseppe Totaro
[jira] [Updated] (TIKA-1580) ISA-Tab parsers
[ https://issues.apache.org/jira/browse/TIKA-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Giuseppe Totaro updated TIKA-1580: -- Attachment: TIKA-1580.v02.patch
[jira] [Commented] (TIKA-1580) ISA-Tab parsers
[ https://issues.apache.org/jira/browse/TIKA-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376210#comment-14376210 ] Giuseppe Totaro commented on TIKA-1580: --- Hi [~chrismattmann], I apologize about that. I forgot to include the parsers. I updated right now the patch in [https://reviews.apache.org/r/32291/]. You can find the patch also in attachment. Thanks [~tpalsulich] for your review. The new patch should include what you suggested. [~chrismattmann] and [~tpalsulich], I am going to create my own sample files using [ISACreator|http://www.isa-tools.org/software-suite/] tool and then I will add to the patch. Thanks a lot for your feedback.
[jira] [Comment Edited] (TIKA-1580) ISA-Tab parsers
[ https://issues.apache.org/jira/browse/TIKA-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376210#comment-14376210 ] Giuseppe Totaro edited comment on TIKA-1580 at 3/23/15 5:10 PM: Hi [~chrismattmann], I apologize about that. I forgot to include the parsers. I updated right now the patch (including parsers) in [https://reviews.apache.org/r/32291/]. You can find the patch also in attachment. Thanks [~tpalsulich] for your review. The new patch should include what you suggested. [~chrismattmann] and [~tpalsulich], I am going to create my own sample files using [ISACreator|http://www.isa-tools.org/software-suite/] tool and then I will add to the patch. Thanks a lot for your feedback. was (Author: gostep): Hi [~chrismattmann], I apologize about that. I forgot to include the parsers. I updated right now the patch in [https://reviews.apache.org/r/32291/]. You can find the patch also in attachment. Thanks [~tpalsulich] for your review. The new patch should include what you suggested. [~chrismattmann] and [~tpalsulich], I am going to create my own sample files using [ISACreator|http://www.isa-tools.org/software-suite/] tool and then I will add to the patch. Thanks a lot for your feedback.
[jira] [Commented] (TIKA-1580) ISA-Tab parsers
[ https://issues.apache.org/jira/browse/TIKA-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14370945#comment-14370945 ] Giuseppe Totaro commented on TIKA-1580: --- The patch has been uploaded for review on [https://reviews.apache.org/r/32291/].
[jira] [Updated] (TIKA-1580) ISA-Tab parsers
[ https://issues.apache.org/jira/browse/TIKA-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Giuseppe Totaro updated TIKA-1580: -- Summary: ISA-Tab parsers (was: ISA-Tab)
[jira] [Created] (TIKA-1580) ISA-Tab
Giuseppe Totaro created TIKA-1580: - Summary: ISA-Tab Key: TIKA-1580 URL: https://issues.apache.org/jira/browse/TIKA-1580 Project: Tika Issue Type: New Feature Components: parser Reporter: Giuseppe Totaro Priority: Minor
[jira] [Commented] (TIKA-1483) Create a general raw string parser
[ https://issues.apache.org/jira/browse/TIKA-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14336144#comment-14336144 ] Giuseppe Totaro commented on TIKA-1483: --- Hi [~chrismattmann], I don't know why, but sometimes the {{patch}} tool does not work well if there is no newline at the end of the file. On my laptop, the patch works well if you put a newline at the end of the file (e.g., {{printf '\n' >> TIKA-1483_v2.patch}}). I hope this trivial trick may be useful. Thank you, Giuseppe Create a general raw string parser -- Key: TIKA-1483 URL: https://issues.apache.org/jira/browse/TIKA-1483 Project: Tika Issue Type: New Feature Components: parser Affects Versions: 1.6 Reporter: Luis Filipe Nassif Attachments: TIKA-1483.patch, TIKA-1483_v2.patch I think it can be very useful to add a general parser able to extract raw strings from files (like the {{strings}} command), which can be used as the fallback parser for all mimetypes not having a specific parser implementation, like application/octet-stream. It can also be used as a fallback for corrupt files throwing a TikaException. It must be configured with the script/language to be extracted from the files (currently I implemented one specific for Latin1). It can use heuristics to extract strings encoded with different charsets within the same file, mainly the common ISO-8859-1, UTF-8, and UTF-16. What does the community think about that?
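The trailing-newline trick above (so that patch(1) applies a file cleanly) can also be done programmatically. This is a hedged, self-contained Java sketch, not part of any Tika patch; the idempotent check makes it safe to run repeatedly:

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Sketch of the trick from the comment above: ensure a patch file ends
// with a newline so that the patch tool applies it cleanly.
// Returns true if a newline was appended, false if one was already there.
public class EnsureTrailingNewline {
    public static boolean ensure(String path) throws IOException {
        try (RandomAccessFile f = new RandomAccessFile(path, "rw")) {
            if (f.length() > 0) {
                f.seek(f.length() - 1);
                if (f.read() == '\n') {
                    return false; // already newline-terminated, nothing to do
                }
            }
            f.seek(f.length());
            f.write('\n');
            return true;
        }
    }
}
```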
[jira] [Comment Edited] (TIKA-1541) StringsParser: a simple strings-based parser for Tika
[ https://issues.apache.org/jira/browse/TIKA-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14327027#comment-14327027 ] Giuseppe Totaro edited comment on TIKA-1541 at 2/19/15 5:57 AM: Hi all, I added more unit tests, especially for the {{StringsConfig}} class (inspired by the work on TesseractOCRParser). You can find the patch in attachment ({{TIKA-1541.v02.02182015.patch}}). Thank you, Giuseppe was (Author: gostep): Hi all, I added more unit tests, especially for {{StringsConfig}} class (inspired by work on TesseractOCRParser). You can find in attachment the patch ({{TIKA-1541.v02.02182015.patch}). Thank you, Giuseppe StringsParser: a simple strings-based parser for Tika - Key: TIKA-1541 URL: https://issues.apache.org/jira/browse/TIKA-1541 Project: Tika Issue Type: Improvement Components: parser Reporter: Giuseppe Totaro Assignee: Chris A. Mattmann Fix For: 1.8 Attachments: TIKA-1541.TotaroMattmann.020615.patch.txt, TIKA-1541.TotaroMattmann.020615.patch.txt, TIKA-1541.TotaroMattmannBurchNassif.020715.patch, TIKA-1541.TotaroMattmannBurchNassif.020815.patch, TIKA-1541.TotaroMattmannBurchNassif.020915.patch, TIKA-1541.patch, TIKA-1541.v02.02182015.patch, testOCTET_header.dbase3 I thought of implementing an extremely simple {{StringsParser}}, a parser based on the {{strings}} command (or a {{strings}}-alternative command), instead of using the dummy {{EmptyParser}} for undetected files. It is a preliminary work (you can see a lot of TODOs), inspired by the work on {{TesseractOCRParser}}. You can find the patch in attachment. I created a GitHub [repository|https://github.com/giuseppetotaro/StringsParser] for sharing the code. As a first test, you can clone the repo, build the code using the {{build.sh}} script, and then run the parser using the {{run.sh}} script on some [govdocs1|http://digitalcorpora.org/corpora/govdocs] files (grabbed from the 016 subset) detected as {{application/octet-stream}}.
The latter script launches a simple {{StringsTest}} class for testing. I hope you will find the {{StringsParser}} a good solution for extracting ASCII strings from undetected filetypes. As far as I understood, many sophisticated forensics tools work in a similar manner for indexing purposes: they use a sort of {{strings}} command against files that they are not able to detect. In addition to running {{strings}} on undetected files, the {{StringsParser}} launches the {{file}} command on undetected files and then writes the output in the {{strings:file_output}} property (I noticed that sometimes the {{file}} command is able to detect the media type for documents not detected by Tika). Finally, you can find an old discussion about this topic [here|http://lucene.472066.n3.nabble.com/Default-MIME-Type-td645215.html]. Thanks [~chrismattmann].
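Both {{strings}} and {{file}} are invoked as external commands. A minimal, self-contained sketch of that pattern using {{ProcessBuilder}} follows; this is an illustration of the technique, not the actual StringsParser code, and the class name is made up:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Illustrative sketch of shelling out to an external command and
// capturing its output, the way StringsParser does with `strings`
// and `file`. Stderr is merged into stdout for simplicity.
public class CommandRunner {
    public static String run(String... cmd) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(cmd);
        pb.redirectErrorStream(true);          // merge stderr into stdout
        Process p = pb.start();
        p.getOutputStream().close();           // the child gets no stdin
        String out = new String(p.getInputStream().readAllBytes(),
                StandardCharsets.UTF_8);       // drain before waiting, to avoid deadlock
        p.waitFor();
        return out.trim();
    }
}
```

For example, {{CommandRunner.run("file", "-b", path)}} would return something suitable for a property like {{strings:file_output}}, assuming the {{file}} binary is installed.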
[jira] [Updated] (TIKA-1541) StringsParser: a simple strings-based parser for Tika
[ https://issues.apache.org/jira/browse/TIKA-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Giuseppe Totaro updated TIKA-1541: -- Attachment: TIKA-1541.v02.02182015.patch Hi all, I added more unit tests, especially for {{StringsConfig}} class (inspired by work on TesseractOCRParser). You can find in attachment the patch ({{TIKA-1541.v02.02182015.patch}}). Thank you, Giuseppe
[jira] [Commented] (TIKA-1483) Create a general raw string parser
[ https://issues.apache.org/jira/browse/TIKA-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326875#comment-14326875 ] Giuseppe Totaro commented on TIKA-1483: --- Thanks [~lfcnassif]. I agree with you about the configuration object. Generally speaking, I use the configuration pattern for objects with three or more parameters. Thanks a lot, Giuseppe
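The configuration-object pattern mentioned above (bundling three or more parameters into one object with sensible defaults) can be illustrated with a small Java bean. The field names and default values below are hypothetical and are not {{StringsConfig}}'s actual API:

```java
// Hypothetical illustration of the configuration-object pattern:
// instead of a parse method with many loose parameters, group them
// into one object with defaults that callers can override.
public class StringsConfigSketch {
    private String stringsPath = "/usr/bin/strings"; // hypothetical default
    private int minLength = 4;                       // minimum string length
    private int timeoutSeconds = 120;                // per-file timeout

    public String getStringsPath() { return stringsPath; }
    public void setStringsPath(String p) { stringsPath = p; }
    public int getMinLength() { return minLength; }
    public void setMinLength(int n) { minLength = n; }
    public int getTimeoutSeconds() { return timeoutSeconds; }
    public void setTimeoutSeconds(int s) { timeoutSeconds = s; }
}
```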
[jira] [Commented] (TIKA-1541) StringsParser: a simple strings-based parser for Tika
[ https://issues.apache.org/jira/browse/TIKA-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313796#comment-14313796 ] Giuseppe Totaro commented on TIKA-1541: --- [~chrismattmann], it probably depends on the encoding option of the {{strings}} command. Even though {{strings}} on *nix systems usually supports the {{-e}} option, I noticed that some non-Windows versions of {{strings}} do not provide it. Therefore, the test fails if a version without support for {{-e}} is used. I modified the {{hasStrings}} method in the {{StringsParser}} class in order to test whether {{strings}} provides that option. I tried to build Tika+{{StringsParser}} using a {{strings}} version that does not support {{-e}}, and all tests pass. Please let me know if the updated patch works well for you. Thanks a lot, Giuseppe
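The capability probe described in the comment above (checking whether the local {{strings}} binary accepts {{-e}} before relying on it) can be sketched generically: run the command with the candidate option and treat a zero exit status as "supported". This is an assumption-laden illustration, not the actual {{hasStrings}} implementation:

```java
import java.io.IOException;

// Illustrative probe (not Tika's hasStrings code): run `cmd option`
// and treat exit status 0 as "the option is supported". A missing
// binary or a failed run counts as unsupported.
public class OptionProbe {
    public static boolean supportsOption(String cmd, String option) {
        try {
            ProcessBuilder pb = new ProcessBuilder(cmd, option);
            pb.redirectErrorStream(true);
            Process p = pb.start();
            p.getOutputStream().close();
            p.getInputStream().readAllBytes(); // drain so the child can exit
            return p.waitFor() == 0;
        } catch (IOException | InterruptedException e) {
            return false; // command missing or interrupted: assume unsupported
        }
    }
}
```

A caller could then fall back to invoking {{strings}} without {{-e}} when {{supportsOption("strings", "-e")}} is false.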
[jira] [Created] (TIKA-1541) StringsParser: a simple strings-based parser for Tika
Giuseppe Totaro created TIKA-1541: - Summary: StringsParser: a simple strings-based parser for Tika Key: TIKA-1541 URL: https://issues.apache.org/jira/browse/TIKA-1541 Project: Tika Issue Type: Improvement Components: parser Reporter: Giuseppe Totaro
[jira] [Updated] (TIKA-1541) StringsParser: a simple strings-based parser for Tika
[ https://issues.apache.org/jira/browse/TIKA-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Giuseppe Totaro updated TIKA-1541:
--
Description:
I decided to write an extremely simple implementation of {{StringsParser}}, a parser based on the {{strings}} command (or a {{strings}}-like alternative), to be used instead of the dummy {{EmptyParser}} for undetected files. It is preliminary work (you can see a lot of TODOs). It is inspired by the work on {{TesseractOCRParser}}. You can find the patch in attachment. [file:Users/gtotaro/Desktop/TIKA-1541.patch] I created a GitHub [repository|https://github.com/giuseppetotaro/StringsParser] for sharing the code. As a first test, you can clone the repo, build the code using the {{build.sh}} script, and then run the parser using the {{run.sh}} script on some [govdocs1|http://digitalcorpora.org/corpora/govdocs] files (grabbed from the 016 subset) detected as {{application/octet-stream}}. The latter script launches a simple {{StringsTest}} class for testing. I hope you will find the {{StringsParser}} a good solution for extracting ASCII strings from undetected filetypes. As far as I understand, many sophisticated forensics tools work in a similar manner for indexing purposes: they run a sort of {{strings}} command against files that they are not able to detect. In addition to running {{strings}} on undetected files, the {{StringsParser}} launches the {{file}} command on them and then writes the output to the {{strings:file_output}} property (I noticed that sometimes the {{file}} command is able to detect the media type for documents not detected by Tika). Finally, you can find an old discussion about this topic [here|http://lucene.472066.n3.nabble.com/Default-MIME-Type-td645215.html]. Thanks [~chrismattmann].

was:
I decided to write an extremely simple implementation of {{StringsParser}}, a parser based on the {{strings}} command (or a {{strings}}-like alternative), to be used instead of the dummy {{EmptyParser}} for undetected files. It is preliminary work (you can see a lot of TODOs). It is inspired by the work on {{TesseractOCRParser}}. I created a GitHub [repository|https://github.com/giuseppetotaro/StringsParser] for sharing the code. As a first test, you can clone the repo, build the code using the {{build.sh}} script, and then run the parser using the {{run.sh}} script on some [govdocs1|http://digitalcorpora.org/corpora/govdocs] files (grabbed from the 016 subset) detected as {{application/octet-stream}}. The latter script launches a simple {{StringsTest}} class for testing. I hope you will find the {{StringsParser}} a good solution for extracting ASCII strings from undetected filetypes. As far as I understand, many sophisticated forensics tools work in a similar manner for indexing purposes: they run a sort of {{strings}} command against files that they are not able to detect. In addition to running {{strings}} on undetected files, the {{StringsParser}} launches the {{file}} command on them and then writes the output to the {{strings:file_output}} property (I noticed that sometimes the {{file}} command is able to detect the media type for documents not detected by Tika). Finally, you can find an old discussion about this topic [here|http://lucene.472066.n3.nabble.com/Default-MIME-Type-td645215.html]. Thanks [~chrismattmann].
StringsParser: a simple strings-based parser for Tika
-
Key: TIKA-1541
URL: https://issues.apache.org/jira/browse/TIKA-1541
Project: Tika
Issue Type: Improvement
Components: parser
Reporter: Giuseppe Totaro
Attachments: TIKA-1541.patch
[jira] [Commented] (TIKA-1423) Build a parser to extract data from GRIB formats
[ https://issues.apache.org/jira/browse/TIKA-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297612#comment-14297612 ]

Giuseppe Totaro commented on TIKA-1423:
---
[~lewismc] your patch perfectly matches the improvements. Thank you.

Build a parser to extract data from GRIB formats

Key: TIKA-1423
URL: https://issues.apache.org/jira/browse/TIKA-1423
Project: Tika
Issue Type: New Feature
Components: metadata, mime, parser
Affects Versions: 1.6
Reporter: Vineet Ghatge
Assignee: Vineet Ghatge
Priority: Critical
Labels: features, newbie
Fix For: 1.8
Attachments: GRIBParsertest.java, GribParser.java, NLDAS_FORA0125_H.A20130112.1200.002.grb, TIKA-1423.palsulich.120614.patch, TIKA-1423.patch, TIKA-1423v2.patch, fileName.html, gdas1.forecmwf.2014062612.grib2

Arctic dataset contains a MIME format called GRIB - General Regularly-distributed Information in Binary form (http://en.wikipedia.org/wiki/GRIB). GRIB is a well-known, concise data format used in meteorology to store historical and weather data. There are 2 different types of the format: GRIB 0 and GRIB 2. The focus will be on GRIB 2, which is the most prevalent. Each GRIB record intended for either transmission or storage contains a single parameter with values located at an array of grid points, or represented as a set of spectral coefficients, for a single level (or layer), encoded as a continuous bit stream. Logical divisions of the record are designated as sections, each of which provides control information and/or data. A GRIB record consists of six sections, two of which are optional: (0) Indicator Section (1) Product Definition Section (PDS) (2) Grid Description Section (GDS) optional (3) Bit Map Section (BMS) optional (4) Binary Data Section (BDS) (5) '' (ASCII Characters)
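For detection purposes, the Indicator Section mentioned above begins with the four ASCII octets "GRIB". The sketch below is a minimal magic-byte check illustrating that fact; it is not Tika's actual detection mechanism (Tika uses its mime-magic definitions), and the class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative sketch: recognize a GRIB record by its Indicator Section,
// which starts with the ASCII bytes "GRIB". This is not Tika code; Tika
// performs this kind of detection via its tika-mimetypes magic definitions.
public class GribMagic {
    private static final byte[] MAGIC = "GRIB".getBytes(StandardCharsets.US_ASCII);

    // Returns true if the given header bytes start with the GRIB magic.
    public static boolean isGrib(byte[] header) {
        if (header == null || header.length < MAGIC.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOfRange(header, 0, MAGIC.length), MAGIC);
    }
}
```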
[jira] [Commented] (TIKA-1423) Build a parser to extract data from GRIB formats
[ https://issues.apache.org/jira/browse/TIKA-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291207#comment-14291207 ]

Giuseppe Totaro commented on TIKA-1423:
---
Hello [~vinegh], I noted that your parser instantiates a {{File}} object starting from the {{RESOURCE_NAME_KEY}} string without using the {{InputStream}} object passed to the {{parse}} method:

{code:title=GribParser.java|borderStyle=solid}
…
49 //Get grib2 file name from metadata
50
51 File gribFile = new File(metadata.get(Metadata.RESOURCE_NAME_KEY));
52
53 try {
54     NetcdfFile ncFile = NetcdfDataset.openFile(gribFile.getAbsolutePath(),
…
{code}

This means that any caller that does not set the {{RESOURCE_NAME_KEY}} property, as follows,

{code}
metadata.add(Metadata.RESOURCE_NAME_KEY, filename);
{code}

will fail, because the {{File}} constructor throws a {{NullPointerException}}. Instead of requiring {{RESOURCE_NAME_KEY}}, we can obtain the file from the stream using the {{TikaInputStream}} class, as is done in {{NetCDFParser.java}}:

{code}
51 //File gribFile = new File(metadata.get(Metadata.RESOURCE_NAME_KEY));
53 TikaInputStream tis = TikaInputStream.get(stream, new TemporaryResources());
54
55 try {
57     NetcdfFile ncFile = NetcdfDataset.openFile(tis.getFile().getAbsolutePath(), null);
{code}

I tested it on my MacBook and it works. I also tried the [netcdf-tools|http://netcdftools.sourceforge.net/] library for retrieving the set of global attributes, but it does not work well and seems outdated.
Thank you for your great work, Giuseppe
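The fix above works because {{TikaInputStream.get(stream, new TemporaryResources()).getFile()}} spools the incoming stream to a temporary file, so file-oriented libraries such as {{NetcdfDataset.openFile}} can read it. A plain-JDK sketch of that same spooling idea (the helper name is hypothetical; it is not a Tika API) is:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Illustrative sketch of what TikaInputStream.get(stream, ...).getFile()
// effectively does: copy an arbitrary InputStream to a temporary file so
// that file-based APIs can open it by path. Not a Tika class.
public class SpoolSketch {
    public static Path spoolToTempFile(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("tika-spool-", ".bin");
        tmp.toFile().deleteOnExit(); // best-effort cleanup, like TemporaryResources
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }
}
```

The advantage over relying on {{RESOURCE_NAME_KEY}} is that the parser no longer depends on the caller populating a metadata field: everything it needs comes from the stream itself.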
[jira] [Commented] (TIKA-1184) Infinite halt on parsing old files (e.g. mp3, ms-dos drivers, ...)
[ https://issues.apache.org/jira/browse/TIKA-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13892116#comment-13892116 ]

Giuseppe Totaro commented on TIKA-1184:
---
Hello, I've just run the tika-app-1.4.jar against files extracted from a disk image, and Tika hangs on a .sys file (attached). I tried tika-app-1.6-SNAPSHOT.jar and it worked fine. !file:///Users/giuseppe/Desktop/ansi.sys!

Infinite halt on parsing old files (e.g. mp3, ms-dos drivers, ...)
--
Key: TIKA-1184
URL: https://issues.apache.org/jira/browse/TIKA-1184
Project: Tika
Issue Type: Bug
Components: cli, parser
Affects Versions: 1.4
Environment: SUSE Linux Enterprise Server 11 SP3 (x86_64), java version 1.7.0, Java(TM) SE Runtime Environment (build pxa6470sr4fp2-20130426_01(SR4 FP2)), IBM J9 VM (build 2.6, JRE 1.7.0 Linux amd64-64 Compressed References 20130422_146026 (JIT enabled, AOT enabled) J9VM - R26_Java726_SR4_FP2_20130422_1320_B146026 JIT - r11.b03_20130131_32403ifx4 GC - R26_Java726_SR4_FP2_20130422_1320_B146026_CMPRSS J9CL - 20130422_146026) JCL - 20130425_01 based on Oracle 7u21-b09
Reporter: Jürgen Enge

tika hangs on identifying several types of files. the following example is an mp3 file with corrupt metadata. other filetypes which have the same problem are for example MSDOS device drivers (*.sys). i am not into java programming, but my guess would be that tika is trying to seek() within a file and the target position is greater than the filesize.

java -jar tika-app-1.4.jar -m /u01/fk/xd/2/c/16866bc96e6a316d8cbdbd7ca2ce1e
[hangs forever without error message]

ffmpeg gives some warnings about duration errors...
ffmpeg -i /u01/fk/xd/2/c/16866bc96e6a316d8cbdbd7ca2ce1e
[mp3 @ 0x633240] max_analyze_duration 500 reached at 5015510
[mp3 @ 0x633240] Estimating duration from bitrate, this may be inaccurate
Input #0, mp3, from '/u01/fk/xd/2/c/16866bc96e6a316d8cbdbd7ca2ce1e':
Metadata:
artist :
album :
Duration: 00:15:29.10, start: 0.00, bitrate: 192 kb/s
Stream #0:0: Audio: mp3, 44100 Hz, stereo, s16, 192 kb/s

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
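The reporter's guess is a seek beyond the end of the file. A classic instance of this bug class is a "skip until offset" loop that ignores {{InputStream.skip()}} returning 0 at end-of-stream and spins forever. The sketch below shows a guarded variant that detects EOF and bails out; it is an illustration of the failure mode, not Tika's actual parser code, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;

// Illustrative sketch of the bug class guessed at in this report: a naive
// loop like `while (toSkip > 0) toSkip -= in.skip(toSkip);` hangs forever
// when skip() keeps returning 0 past end-of-stream. This guarded variant
// probes with read() when skip() makes no progress and stops at real EOF.
public class SkipGuard {
    // Skips up to n bytes; returns the number actually skipped, never hanging at EOF.
    public static long skipFully(InputStream in, long n) throws IOException {
        long remaining = n;
        while (remaining > 0) {
            long skipped = in.skip(remaining);
            if (skipped <= 0) {
                // skip() made no progress: read one byte to distinguish EOF
                // from a stream that simply declines to skip
                if (in.read() == -1) {
                    break; // genuine end of stream: stop instead of looping
                }
                skipped = 1; // the probe consumed one byte
            }
            remaining -= skipped;
        }
        return n - remaining;
    }
}
```

With a 10-byte stream, asking to skip 25 bytes returns 10 and terminates, whereas the naive loop would never return, which matches the "hangs forever without error message" symptom above.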
[jira] [Comment Edited] (TIKA-1184) Infinite halt on parsing old files (e.g. mp3, ms-dos drivers, ...)
[ https://issues.apache.org/jira/browse/TIKA-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13892116#comment-13892116 ]

Giuseppe Totaro edited comment on TIKA-1184 at 2/5/14 1:52 PM:
---
Hello, I've just run the tika-app-1.4.jar against files extracted from a disk image, and Tika hangs on a .sys file (attached). I tried tika-app-1.6-SNAPSHOT.jar and it worked fine. [file:///Users/giuseppe/Desktop/ansi.sys]

was (Author: gostep): Hello, I've just run the tika-app-1.4.jar against files extracted from a disk image, and Tika hangs on a .sys file (attached). I tried tika-app-1.6-SNAPSHOT.jar and it worked fine. !file:///Users/giuseppe/Desktop/ansi.sys!
[jira] [Issue Comment Deleted] (TIKA-1184) Infinite halt on parsing old files (e.g. mp3, ms-dos drivers, ...)
[ https://issues.apache.org/jira/browse/TIKA-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Giuseppe Totaro updated TIKA-1184:
--
Comment: was deleted (was: Hello, I've just run the tika-app-1.4.jar against files extracted from a disk image, and Tika hangs on a .sys file (attached). I tried tika-app-1.6-SNAPSHOT.jar and it worked fine. [file:///Users/giuseppe/Desktop/ansi.sys])
[jira] [Commented] (TIKA-1184) Infinite halt on parsing old files (e.g. mp3, ms-dos drivers, ...)
[ https://issues.apache.org/jira/browse/TIKA-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13892122#comment-13892122 ]

Giuseppe Totaro commented on TIKA-1184:
---
Hello, I've just run the tika-app-1.4.jar against files extracted from a disk image, and Tika hangs on a .sys file (attached). I tried tika-app-1.6-SNAPSHOT.jar and it worked fine.
[jira] [Updated] (TIKA-1184) Infinite halt on parsing old files (e.g. mp3, ms-dos drivers, ...)
[ https://issues.apache.org/jira/browse/TIKA-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Giuseppe Totaro updated TIKA-1184:
--
Attachment: ansi.sys
[jira] [Updated] (TIKA-1184) Infinite halt on parsing old files (e.g. mp3, ms-dos drivers, ...)
[ https://issues.apache.org/jira/browse/TIKA-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Giuseppe Totaro updated TIKA-1184:
--
Attachment: ansi.sys
[jira] [Issue Comment Deleted] (TIKA-1184) Infinite halt on parsing old files (e.g. mp3, ms-dos drivers, ...)
[ https://issues.apache.org/jira/browse/TIKA-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Giuseppe Totaro updated TIKA-1184:
--
Comment: was deleted (was: Hello, I've just run the tika-app-1.4.jar against files extracted from a disk image, and Tika hangs on a .sys file (attached). I tried tika-app-1.6-SNAPSHOT.jar and it worked fine. )
[jira] [Updated] (TIKA-1184) Infinite halt on parsing old files (e.g. mp3, ms-dos drivers, ...)
[ https://issues.apache.org/jira/browse/TIKA-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Giuseppe Totaro updated TIKA-1184:
--
Attachment: ansi.sys

Hello, I've just run the tika-app-1.4.jar against files extracted from a disk image, and Tika hangs on a .sys file (attached). I tried tika-app-1.6-SNAPSHOT.jar and it worked fine.
[jira] [Updated] (TIKA-1184) Infinite halt on parsing old files (e.g. mp3, ms-dos drivers, ...)
[ https://issues.apache.org/jira/browse/TIKA-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Giuseppe Totaro updated TIKA-1184:
--
Attachment: (was: ansi.sys)
[jira] [Updated] (TIKA-1184) Infinite halt on parsing old files (e.g. mp3, ms-dos drivers, ...)
[ https://issues.apache.org/jira/browse/TIKA-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Giuseppe Totaro updated TIKA-1184:
--
Attachment: (was: ansi.sys)
[jira] [Updated] (TIKA-1184) Infinite halt on parsing old files (e.g. mp3, ms-dos drivers, ...)
[ https://issues.apache.org/jira/browse/TIKA-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Giuseppe Totaro updated TIKA-1184:
--
Attachment: (was: ansi.sys)
[jira] [Commented] (TIKA-1184) Infinite halt on parsing old files (e.g. mp3, ms-dos drivers, ...)
[ https://issues.apache.org/jira/browse/TIKA-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13892124#comment-13892124 ]

Giuseppe Totaro commented on TIKA-1184:
---------------------------------------
Hello, I've just run tika-app-1.4.jar against files extracted from a disk image, and Tika hangs on a .sys file (attached). I tried tika-app-1.6-SNAPSHOT.jar and it worked fine.

--
This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (TIKA-1184) Infinite halt on parsing old files (e.g. mp3, ms-dos drivers, ...)
[ https://issues.apache.org/jira/browse/TIKA-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Giuseppe Totaro updated TIKA-1184:
----------------------------------
    Attachment: ansi.sys

--
This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (TIKA-1092) Parsing of old Word file causes a TikaException
[ https://issues.apache.org/jira/browse/TIKA-1092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13619639#comment-13619639 ]

Giuseppe Totaro commented on TIKA-1092:
---------------------------------------
Thanks Nick. I'll give you feedback as soon as possible.

Parsing of old Word file causes a TikaException
-----------------------------------------------
                 Key: TIKA-1092
                 URL: https://issues.apache.org/jira/browse/TIKA-1092
             Project: Tika
          Issue Type: Bug
          Components: parser
            Reporter: Giuseppe Totaro
            Priority: Minor
              Labels: office, parse, word-exception

I found an issue with the parse method of org.apache.tika.parser.microsoft.OfficeParser. This parser throws a TikaException when it tries to parse very old Microsoft Word files. I think this issue is not a priority, because the files that cause the exception belong to an obsolete format/structure that even new Microsoft Office versions do not support, but it is important to know that something can go wrong with these outdated types. Two links about the old formats (from a Microsoft support perspective):
http://support.microsoft.com/?kbid=922850
http://support.microsoft.com/kb/922849/it

For example, the TikaException message is:

Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@789ab21d
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:248)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:139)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:415)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:109)
Caused by: java.io.IOException: Invalid header signature; read 0x0410401F002DA5DB, expected 0xE11AB1A1E011CFD0
    at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:140)
    at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:115)
    at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:198)
    at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:184)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:156)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 5 more

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
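The "Invalid header signature" above is POI rejecting a file whose first eight bytes are not the OLE2/CFBF magic: the bytes D0 CF 11 E0 A1 B1 1A E1, which POI reads as the little-endian long 0xE11AB1A1E011CFD0 (the "expected" value in the exception). A hedged sketch of pre-checking that magic before handing a file to OfficeParser, using only the JDK (no Tika/POI dependency):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class Ole2Check {

    // OLE2/CFBF magic: bytes D0 CF 11 E0 A1 B1 1A E1, i.e. the
    // little-endian long 0xE11AB1A1E011CFD0 quoted in the exception.
    static final long OLE2_MAGIC = 0xE11AB1A1E011CFD0L;

    // True if the first 8 bytes of the header match the OLE2 signature.
    static boolean looksLikeOle2(byte[] header) {
        if (header == null || header.length < 8) return false;
        long sig = ByteBuffer.wrap(header, 0, 8)
                             .order(ByteOrder.LITTLE_ENDIAN)
                             .getLong();
        return sig == OLE2_MAGIC;
    }

    public static void main(String[] args) {
        byte[] ole2 = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
                       (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};
        // First bytes of the failing old Word file, reconstructed from the
        // "read 0x0410401F002DA5DB" value in the stack trace.
        byte[] oldWord = {(byte) 0xDB, (byte) 0xA5, 0x2D, 0x00,
                          0x1F, 0x40, 0x10, 0x04};
        System.out.println(looksLikeOle2(ole2));    // true
        System.out.println(looksLikeOle2(oldWord)); // false
    }
}
```

Pre-Word-6 .DOC files use their own binary layouts rather than the OLE2 container, which is why they fail this check and why NPOIFSFileSystem refuses them at construction time.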
[jira] [Commented] (TIKA-1092) Parsing of old Word file causes a TikaException
[ https://issues.apache.org/jira/browse/TIKA-1092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13602168#comment-13602168 ]

Giuseppe Totaro commented on TIKA-1092:
---------------------------------------
Hi Nick, thanks for your support. I'll send you the first 384 bytes (saved with a hex editor for Mac) of three old .DOC files that cause the TikaException. Unfortunately I can't supply the first 2 KB, because the confidential text begins at byte 385. I'll try to send you other information. Thanks, Giuseppe

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (TIKA-1092) Parsing of old Word file causes a TikaException
[ https://issues.apache.org/jira/browse/TIKA-1092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13601195#comment-13601195 ]

Giuseppe Totaro commented on TIKA-1092:
---------------------------------------
Hi Nick, most of the files were created in 1992 (before the launch of Word 6). When I try to open these files with my Word version (Office 2007) I receive the message: "You are attempting to open a file that was created in an earlier version of Microsoft Office. This file type is blocked from opening in this version by your registry policy setting." To open the file I must apply the manual (or Fix it) correction to the Windows registry, following the instructions reported in http://support.microsoft.com/kb/922849/en-us#fixit4me. After the correction I am able to open the file with Word and see the document text correctly. If I try to save the file (over itself), Word asks me to select a type. So I can view the file with Word, but I am not able to determine the original format version of the document or the application used to create it. I tried to find more information about these mysterious files, but without relevant results; for example, I used the command-line tool file under Linux and other metadata analyzers (don't worry... Tika remains my favorite parser :)). Thanks, Giuseppe

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (TIKA-1092) Parsing of old Word file causes a TikaException
[ https://issues.apache.org/jira/browse/TIKA-1092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13600065#comment-13600065 ]

Giuseppe Totaro commented on TIKA-1092:
---------------------------------------
Hi Nick, I agree with your first observation about old Office documents. I don't think anyone has renamed the files: they were created with an older version of Word (I think Microsoft Word 6.0) and saved with the .doc extension. Unfortunately I can't supply my set of files because they are classified. I'll send you one or more files if I find documents without confidentiality restrictions that generate the same exception. Thanks, Giuseppe

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (TIKA-1092) Parsing of old Word file causes a TikaException
Giuseppe Totaro created TIKA-1092:
----------------------------------
             Summary: Parsing of old Word file causes a TikaException
                 Key: TIKA-1092
                 URL: https://issues.apache.org/jira/browse/TIKA-1092
             Project: Tika
          Issue Type: Bug
          Components: parser
            Reporter: Giuseppe Totaro
            Priority: Minor

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (TIKA-1081) Error in specification of glob pattern for awk files
[ https://issues.apache.org/jira/browse/TIKA-1081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13575242#comment-13575242 ]

Giuseppe Totaro commented on TIKA-1081:
---------------------------------------
Thanks Chris :)

Error in specification of glob pattern for awk files
----------------------------------------------------
                 Key: TIKA-1081
                 URL: https://issues.apache.org/jira/browse/TIKA-1081
             Project: Tika
          Issue Type: Bug
          Components: mime
            Reporter: Giuseppe Totaro
            Assignee: Chris A. Mattmann
            Priority: Trivial
              Labels: mime
             Fix For: 1.4

The tika-mimetypes.xml file contains a spelling error at line 4591: {{<glob *pattenr*="*.awk"/>}} (the attribute name should be "pattern"). I'm new to the Apache community, and I hope that my little, maybe useless, contribution is correct. Best regards, Giuseppe

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
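For context, the corrected entry in tika-mimetypes.xml would look roughly like the following sketch. Only the glob line comes from the report; the surrounding mime-type element and the type name text/x-awk are assumptions about the file's layout:

```xml
<!-- tika-mimetypes.xml: "pattenr" corrected to "pattern" -->
<mime-type type="text/x-awk">
  <glob pattern="*.awk"/>
</mime-type>
```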