[
https://issues.apache.org/jira/browse/TIKA-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Giuseppe Totaro updated TIKA-2449:
----------------------------------
Attachment: flowchart_standards_extraction_v02.png
> Enabling extraction of standard references from text
> ----------------------------------------------------
>
> Key: TIKA-2449
> URL: https://issues.apache.org/jira/browse/TIKA-2449
> Project: Tika
> Issue Type: Improvement
> Components: handler
> Reporter: Giuseppe Totaro
> Labels: handler
> Attachments: flowchart_standards_extraction.png,
> flowchart_standards_extraction_v02.png, SOW-TacCOM.pdf,
> standards_extraction.patch
>
>
> Apache Tika currently provides many _ContentHandler_ implementations that
> help to de-obfuscate information from text. For instance, the
> {{PhoneExtractingContentHandler}} is used to extract phone numbers while
> parsing.
> This improvement adds the *{{StandardsExtractingContentHandler}}* to Tika, a
> new ContentHandler that relies on regular expressions in order to identify
> and extract standard references from text.
> Basically, a standard reference is just a reference to a
> norm/convention/requirement (i.e., a standard) released by a standard
> organization. This work is mainly focused on identifying and extracting the
> references to the standards already cited within a given document (e.g.,
> SOW/PWS), so that the references can be stored and provided to the user as
> additional metadata when the {{StandardsExtractingContentHandler}} is used.
> In addition to the patch, the first version of the
> {{StandardsExtractingContentHandler}} along with an example class to easily
> execute the handler is available on
> [GitHub|https://github.com/giuseppetotaro/StandardsExtractingContentHandler].
> The following sections describe in more detail how the
> {{StandardsExtractingContentHandler}} has been developed.
> h1. Background
> From a technical perspective, a standard reference is a string that is
> usually composed of two parts:
> # the name of the standard organization;
> # the alphanumeric identifier of the standard within the organization.
> Specifically, the first part can include the acronym or the full name of the
> standard organization or even both, and the second part can include an
> alphanumeric string, possibly containing one or more separation symbols
> (e.g., "-", "_", ".") depending on the format adopted by the organization,
> representing the identifier of the standard within the organization.
> Furthermore, the standard references are usually reported within the
> "Applicable Documents" or "References" section of a SOW, and they can be
> cited also within sections that include in the header the word "standard",
> "requirement", "guideline", or "compliance".
> Consequently, the citation of standard references within a SOW/PWS document
> can be summarized by the following rules:
> * *RULE #1*: standard references are usually reported within the section
> named "Applicable Documents" or "References".
> * *RULE #2*: standard references can be cited also within sections including
> the word "compliance" or another semantically-equivalent word in their name.
> * *RULE #3*: a standard reference is composed of two parts:
> ** Name of the standard organization (acronym, full name, or both).
> ** Alphanumeric identifier of the standard within the organization.
> * *RULE #4*: The name of the standard organization includes the acronym or
> the full name or both. The name must belong to the set of standard
> organizations {{S = O U V}}, where {{O}} represents the set of open standard
> organizations (e.g., ANSI) and {{V}} represents the set of vendor-specific
> standard organizations (e.g., Motorola).
> * *RULE #5*: A separation symbol (e.g., "-", "_", "." or whitespace) can be
> used between the name of the standard organization and the alphanumeric
> identifier.
> * *RULE #6*: The alphanumeric identifier of the standard is composed of
> alphabetic and numeric characters, possibly split into two or more parts by
> a separation symbol (e.g., "-", "_", ".").
> On the basis of the above rules, here are some examples of formats used for
> reporting standard references within a SOW/PWS:
> * {{<ORGANIZATION_ACRONYM><SEPARATION_SYMBOL><ALPHANUMERIC_IDENTIFIER>}}
> *
> {{<ORGANIZATION_ACRONYM><SEPARATION_SYMBOL>(<ORGANIZATION_FULL_NAME>)<SEPARATION_SYMBOL><ALPHANUMERIC_IDENTIFIER>}}
> *
> {{<ORGANIZATION_FULL_NAME><SEPARATION_SYMBOL>(<ORGANIZATION_ACRONYM>)<SEPARATION_SYMBOL><ALPHANUMERIC_IDENTIFIER>}}
> Moreover, some standards are sometimes released by two standard
> organizations. In this case, the standard reference can be reported as
> follows:
> *
> {{<MAIN_ORGANIZATION_ACRONYM>/<SECOND_ORGANIZATION_ACRONYM><SEPARATION_SYMBOL><ALPHANUMERIC_IDENTIFIER>}}
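> As an illustration of rules #4 to #6, here is a minimal Java sketch that
> builds a pattern for the single-organization and two-organization acronym
> formats. The class name, the sample organization sets, and the identifier
> pattern are assumptions made for this example; they are not taken from the
> patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not part of the patch.
class StandardReferenceFormatSketch {
    // Sample sets only: O = open organizations, V = vendor-specific ones.
    static final List<String> OPEN = Arrays.asList("ANSI", "IEEE", "ISO");
    static final List<String> VENDOR = Arrays.asList("Motorola");

    // Builds <ORG>[/<ORG>]<SEPARATION_SYMBOL><ALPHANUMERIC_IDENTIFIER>.
    static Pattern buildPattern() {
        Set<String> all = new TreeSet<>(OPEN); // S = O U V (rule #4)
        all.addAll(VENDOR);
        List<String> quoted = new ArrayList<>();
        for (String name : all) {
            quoted.add(Pattern.quote(name));
        }
        String org = "(?:" + String.join("|", quoted) + ")";
        // Rule #6: alphanumeric parts, optionally split by "-", "_" or ".".
        String identifier = "[A-Z0-9]+(?:[-_.][A-Z0-9]+)*";
        // Rule #5: one separation symbol or whitespace before the identifier.
        return Pattern.compile(
            "\\b" + org + "(?:/" + org + ")?[-_.\\s]" + identifier);
    }

    // Returns every substring of the text that matches the sketch pattern.
    static List<String> find(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = buildPattern().matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(find(
            "Radios shall comply with ANSI/IEEE-802.11 and Motorola R56, see ISO 9001."));
    }
}
```

> Unlike the fallback expression described in the next section, this sketch
> only matches organizations that appear in the sample sets, so it misses
> references to organizations outside those sets.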
> h1. Regular Expressions
> The {{StandardsExtractingContentHandler}} uses a helper class named
> {{StandardsText}} that relies on Java regular expressions and provides some
> methods to identify headers and standard references, and determine the score
> of the references found within the given text.
> Here are the main regular expressions used within the {{StandardsText}} class:
> * *REGEX_HEADER*: regular expression to match only uppercase headers.
> {code}
> (\d+\.(\d+\.?)*)\p{Blank}+([A-Z]+(\s[A-Z]+)*){5,}
> {code}
> * *REGEX_APPLICABLE_DOCUMENTS*: regular expression to match the header of
> "APPLICABLE DOCUMENTS" and equivalent sections.
> {code}
>
> (?i:.*APPLICABLE\sDOCUMENTS|REFERENCE|STANDARD|REQUIREMENT|GUIDELINE|COMPLIANCE.*)
> {code}
> * *REGEX_FALLBACK*: regular expression to match a string that is supposed to
> be a standard reference.
> {code}
>
> \(?(?<mainOrganization>[A-Z]\w+)\)?((\s?(?<separator>\/)\s?)(\w+\s)*\(?(?<secondOrganization>[A-Z]\w+)\)?)?(\s(Publication|Standard))?(-|\s)?(?<identifier>([0-9]{3,}|([A-Z]+(-|_|\.)?[0-9]{2,}))((-|_|\.)?[A-Z0-9]+)*)
> {code}
> * *REGEX_STANDARD*: regular expression to match the standard organization
> within a string potentially representing a standard reference.
> This regular expression is obtained by using a helper class named
> {{StandardOrganizations}} that provides a list of the most important standard
> organizations reported on
> [Wikipedia|https://en.wikipedia.org/wiki/List_of_technical_standard_organisations].
> Basically, the list is composed of international standard organizations,
> regional standard organizations, and, among the nationally-based standard
> organizations, the American and British ones. Other lists of standard
> organizations are reported on
> [OpenStandards|http://www.openstandards.net/viewOSnet2C.jsp?showModuleName=Organizations]
> and [IBR Standards Portal|https://ibr.ansi.org/Standards/].
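> To make the expressions above more concrete, the following Java sketch
> compiles REGEX_HEADER and REGEX_FALLBACK exactly as quoted in this section
> and reads the named groups of a match. The wrapper class and its method
> names are hypothetical and are not the API of {{StandardsText}}.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the two expressions quoted in this issue.
class StandardsRegexSketch {
    // REGEX_HEADER: numbered, uppercase-only headers.
    static final Pattern HEADER = Pattern.compile(
        "(\\d+\\.(\\d+\\.?)*)\\p{Blank}+([A-Z]+(\\s[A-Z]+)*){5,}");

    // REGEX_FALLBACK: a string that is supposed to be a standard reference.
    static final Pattern FALLBACK = Pattern.compile(
        "\\(?(?<mainOrganization>[A-Z]\\w+)\\)?"
        + "((\\s?(?<separator>\\/)\\s?)(\\w+\\s)*"
        + "\\(?(?<secondOrganization>[A-Z]\\w+)\\)?)?"
        + "(\\s(Publication|Standard))?(-|\\s)?"
        + "(?<identifier>([0-9]{3,}|([A-Z]+(-|_|\\.)?[0-9]{2,}))"
        + "((-|_|\\.)?[A-Z0-9]+)*)");

    // True if the line contains something that looks like an uppercase header.
    static boolean isHeader(String line) {
        return HEADER.matcher(line).find();
    }

    // Returns {mainOrganization, secondOrganization, identifier} for the first
    // match, or null if nothing matches. secondOrganization may be null.
    static String[] parse(String text) {
        Matcher m = FALLBACK.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] {
            m.group("mainOrganization"),
            m.group("secondOrganization"),
            m.group("identifier")
        };
    }

    public static void main(String[] args) {
        System.out.println(isHeader("3.0 APPLICABLE DOCUMENTS"));
        System.out.println(String.join(" | ", parse("ANSI/TIA-603-C")));
    }
}
```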
> h1. How To Use The Standards Extraction Capability
> The identification of standard references performed by the
> {{StandardsExtractingContentHandler}} is based on the following steps (see
> also the attached [flow chart|^flowchart_standards_extraction.png]):
> # searches for headers;
> # searches for patterns that are supposed to be standard references
> (basically, every string mostly composed of uppercase letters followed by
> alphanumeric characters);
> # assigns each potential standard reference an initial score of 0.25;
> # increases by 0.50 the score of references which include the name of a known
> standard organization;
> # increases by 0.25 the score of references which have been found within
> "Applicable Documents" and equivalent sections;
> # returns the standard references along with scores;
> # adds the standard references as additional metadata.
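> The scoring steps above can be sketched in Java as follows. The class and
> method names are hypothetical; only the three score values come from the
> description.

```java
// Hypothetical sketch of the scoring steps; not the code from the patch.
class StandardReferenceScoreSketch {
    // Computes the score of one candidate reference.
    static double score(boolean knownOrganization, boolean inApplicableSection) {
        double score = 0.25;          // every candidate starts at 0.25
        if (knownOrganization) {
            score += 0.50;            // names a known standard organization
        }
        if (inApplicableSection) {
            score += 0.25;            // found under "Applicable Documents" or equivalent
        }
        return score;
    }

    public static void main(String[] args) {
        System.out.println(score(true, true));
        System.out.println(score(false, false));
    }
}
```

> Under these rules, a candidate that names a known organization and appears
> in an "Applicable Documents" section reaches the maximum score of 1.0.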
> The unit test is implemented within the
> *{{StandardsExtractingContentHandlerTest}}* class and extracts the standard
> references from a SOW downloaded from the [FOIA
> Library|https://foiarr.cbp.gov/streamingWord.asp?i=607]. This
> [SOW|^SOW-TacCOM.pdf] is also attached as a PDF.
> The *{{StandardsExtractionExample}}* is a class to demonstrate how to use the
> {{StandardsExtractingContentHandler}} to get a list of the standard
> references from every file in a directory.
> The attached [patch|^standards_extraction.patch] includes all the changes
> needed to add support for standards extraction.