[ 
https://issues.apache.org/jira/browse/TIKA-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giuseppe Totaro updated TIKA-2449:
----------------------------------
    Attachment: standards_extraction_v02.png

> Enabling extraction of standard references from text
> ----------------------------------------------------
>
>                 Key: TIKA-2449
>                 URL: https://issues.apache.org/jira/browse/TIKA-2449
>             Project: Tika
>          Issue Type: Improvement
>          Components: handler
>            Reporter: Giuseppe Totaro
>              Labels: handler
>         Attachments: flowchart_standards_extraction.png, SOW-TacCOM.pdf, 
> standards_extraction.patch
>
>
> Apache Tika currently provides several _ContentHandler_ implementations that 
> help to de-obfuscate information from text. For instance, the 
> {{PhoneExtractingContentHandler}} is used to extract phone numbers while 
> parsing.
> This improvement adds the *{{StandardsExtractingContentHandler}}* to Tika, a 
> new ContentHandler that relies on regular expressions in order to identify 
> and extract standard references from text. 
> Basically, a standard reference is just a reference to a 
> norm/convention/requirement (i.e., a standard) released by a standard 
> organization. This work is mainly focused on identifying and extracting the 
> references to the standards already cited within a given document (e.g., 
> SOW/PWS) so that the references can be stored and provided to the user as 
> additional metadata when the {{StandardsExtractingContentHandler}} is used.
> In addition to the patch, the first version of the 
> {{StandardsExtractingContentHandler}} along with an example class to easily 
> execute the handler is available on 
> [GitHub|https://github.com/giuseppetotaro/StandardsExtractingContentHandler]. 
> The following sections describe in more detail how the 
> {{StandardsExtractingContentHandler}} has been developed.
> h1. Background
> From a technical perspective, a standard reference is a string that is 
> usually composed of two parts: 
> # the name of the standard organization; 
> # the alphanumeric identifier of the standard within the organization. 
> Specifically, the first part can include the acronym or the full name of the 
> standard organization or even both, and the second part can include an 
> alphanumeric string, possibly containing one or more separation symbols 
> (e.g., "-", "_", ".") depending on the format adopted by the organization, 
> representing the identifier of the standard within the organization.
> Furthermore, the standard references are usually reported within the 
> "Applicable Documents" or "References" section of a SOW, and they can be 
> cited also within sections that include in the header the word "standard", 
> "requirement", "guideline", or "compliance".
> Consequently, the citation of standard references within a SOW/PWS document 
> can be summarized by the following rules:
> * *RULE #1*: standard references are usually reported within the section 
> named "Applicable Documents" or "References".
> * *RULE #2*: standard references can be cited also within sections including 
> the word "compliance" or another semantically-equivalent word in their name.
> * *RULE #3*: a standard reference is composed of two parts:
> ** Name of the standard organization (acronym, full name, or both).
> ** Alphanumeric identifier of the standard within the organization.
> * *RULE #4*: The name of the standard organization includes the acronym or 
> the full name or both. The name must belong to the set of standard 
> organizations {{S = O U V}}, where {{O}} represents the set of open standard 
> organizations (e.g., ANSI) and {{V}} represents the set of vendor-specific 
> standard organizations (e.g., Motorola).
> * *RULE #5*: A separation symbol (e.g., "-", "_", "." or whitespace) can be 
> used between the name of the standard organization and the alphanumeric 
> identifier.
> * *RULE #6*: The alphanumeric identifier of the standard is composed of 
> alphabetic and numeric characters, possibly split in two or more parts by a 
> separation symbol (e.g., "-", "_", ".").
> On the basis of the above rules, here are some examples of formats used for 
> reporting standard references within a SOW/PWS:
> * {{<ORGANIZATION_ACRONYM><SEPARATION_SYMBOL><ALPHANUMERIC_IDENTIFIER>}}
> * {{<ORGANIZATION_ACRONYM><SEPARATION_SYMBOL>(<ORGANIZATION_FULL_NAME>)<SEPARATION_SYMBOL><ALPHANUMERIC_IDENTIFIER>}}
> * {{<ORGANIZATION_FULL_NAME><SEPARATION_SYMBOL>(<ORGANIZATION_ACRONYM>)<SEPARATION_SYMBOL><ALPHANUMERIC_IDENTIFIER>}}
> Moreover, some standards are sometimes released by two standard 
> organizations. In this case, the standard reference can be reported as 
> follows:
> * {{<MAIN_ORGANIZATION_ACRONYM>/<SECOND_ORGANIZATION_ACRONYM><SEPARATION_SYMBOL><ALPHANUMERIC_IDENTIFIER>}}
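
To make the formats above concrete, here is a minimal sketch of a pattern for the simplest case (acronym, optional second acronym after "/", one separation symbol, identifier). The class name, method name, and the pattern itself are my own illustration and are not taken from the patch; full organization names (RULE #4) are deliberately left out.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReferenceFormatSketch {
    // Hypothetical pattern, not the one used by the patch:
    //   org  = organization acronym (two or more uppercase letters)
    //   org2 = optional second organization acronym after "/"
    //   then one separation symbol ("-", "_", "." or whitespace, see RULE #5)
    //   id   = alphanumeric identifier, optionally split by "-", "_" or "." (RULE #6)
    static final Pattern SIMPLE = Pattern.compile(
        "(?<org>[A-Z]{2,})(?:/(?<org2>[A-Z]{2,}))?[-_.\\s](?<id>[A-Z0-9]+(?:[-_.][A-Z0-9]+)*)");

    // Returns {org, org2, id}, with org2 == null for a single organization,
    // or null when the whole string does not follow the format.
    static String[] split(String reference) {
        Matcher m = SIMPLE.matcher(reference);
        if (!m.matches()) {
            return null;
        }
        return new String[] {m.group("org"), m.group("org2"), m.group("id")};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split("ANSI/TIA-603-C")));
        System.out.println(Arrays.toString(split("FIPS 140-2")));
    }
}
```

The sketch uses `matches()` so that it validates a single candidate string; scanning free text would need `find()` instead.
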
> h1. Regular Expressions
> The {{StandardsExtractingContentHandler}} uses a helper class named 
> {{StandardsText}} that relies on Java regular expressions and provides some 
> methods to identify headers and standard references, and determine the score 
> of the references found within the given text.
> Here are the main regular expressions used within the {{StandardsText}} class:
> * *REGEX_HEADER*: regular expression to match only uppercase headers.
>   {code}
>   (\d+\.(\d+\.?)*)\p{Blank}+([A-Z]+(\s[A-Z]+)*){5,}
>   {code}
> * *REGEX_APPLICABLE_DOCUMENTS*: regular expression to match the header of 
> "APPLICABLE DOCUMENTS" and equivalent sections.
>   {code}
>   (?i:.*APPLICABLE\sDOCUMENTS|REFERENCE|STANDARD|REQUIREMENT|GUIDELINE|COMPLIANCE.*)
>   {code}
> * *REGEX_FALLBACK*: regular expression to match a string that is supposed to 
> be a standard reference.
>   {code}
>   \(?(?<mainOrganization>[A-Z]\w+)\)?((\s?(?<separator>\/)\s?)(\w+\s)*\(?(?<secondOrganization>[A-Z]\w+)\)?)?(\s(Publication|Standard))?(-|\s)?(?<identifier>([0-9]{3,}|([A-Z]+(-|_|\.)?[0-9]{2,}))((-|_|\.)?[A-Z0-9]+)*)
>   {code}
> * *REGEX_STANDARD*: regular expression to match the standard organization 
> within a string potentially representing a standard reference.
>   This regular expression is obtained by using a helper class named 
> {{StandardOrganizations}} that provides a list of the most important standard 
> organizations reported on 
> [Wikipedia|https://en.wikipedia.org/wiki/List_of_technical_standard_organisations].
> The list comprises international standard organizations, regional standard 
> organizations, and, among nationally-based standard organizations, the 
> American and British ones. Other lists of standard organizations are 
> reported on 
> [OpenStandards|http://www.openstandards.net/viewOSnet2C.jsp?showModuleName=Organizations] 
> and [IBR Standards Portal|https://ibr.ansi.org/Standards/].
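
As a quick illustration of how REGEX_FALLBACK behaves, the sketch below compiles the expression exactly as quoted above (only escaped for a Java string literal) and reads its named groups. The class name, the `extract` method, and the sample sentence are hypothetical; the actual {{StandardsText}} helper in the patch may apply the expression differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FallbackRegexSketch {
    // REGEX_FALLBACK as quoted in the description, split over several lines for readability.
    static final Pattern FALLBACK = Pattern.compile(
        "\\(?(?<mainOrganization>[A-Z]\\w+)\\)?"
        + "((\\s?(?<separator>\\/)\\s?)(\\w+\\s)*\\(?(?<secondOrganization>[A-Z]\\w+)\\)?)?"
        + "(\\s(Publication|Standard))?(-|\\s)?"
        + "(?<identifier>([0-9]{3,}|([A-Z]+(-|_|\\.)?[0-9]{2,}))((-|_|\\.)?[A-Z0-9]+)*)");

    // Returns one {mainOrganization, secondOrganization, identifier} triple per match;
    // secondOrganization is null when only one organization is cited.
    static List<String[]> extract(String text) {
        List<String[]> found = new ArrayList<>();
        Matcher m = FALLBACK.matcher(text);
        while (m.find()) {
            found.add(new String[] {
                m.group("mainOrganization"),
                m.group("secondOrganization"),
                m.group("identifier")});
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "Equipment shall comply with ANSI/TIA-603-C and FIPS 140-2.";
        for (String[] reference : extract(text)) {
            System.out.println(Arrays.toString(reference));
        }
    }
}
```

Because the expression accepts any capitalized token as an organization, it is intentionally permissive; separating real references from false positives is left to the scoring step.
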
> h1. How To Use The Standards Extraction Capability
> The standard references identification performed by using the 
> {{StandardsExtractingContentHandler}} is based on the following steps (see 
> also the [flow chart|^flowchart_standards_extraction.png] in attachment):
> # search for headers;
> # search for patterns that are supposed to be standard references 
> (basically, every string mostly composed of uppercase letters followed by 
> alphanumeric characters);
> # assign each potential standard reference an initial score of 0.25;
> # increase by 0.50 the score of references that include the name of a known 
> standard organization;
> # increase by 0.25 the score of references that have been found within 
> "Applicable Documents" and equivalent sections;
> # return the standard references along with their scores;
> # add the standard references as additional metadata.
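
The scoring in the steps above can be sketched as follows. This is only my reading of the description, not code from the patch: the class and method names are hypothetical, `KNOWN_ORGANIZATIONS` is a tiny stand-in for the list provided by {{StandardOrganizations}}, and I assume the section check is done with `find()` on REGEX_APPLICABLE_DOCUMENTS.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

class ScoringSketch {
    // Stand-in for the organizations list (assumption: the real list is much longer).
    static final Set<String> KNOWN_ORGANIZATIONS =
        new HashSet<>(Arrays.asList("ANSI", "IEEE", "ISO", "TIA"));

    // REGEX_APPLICABLE_DOCUMENTS as quoted in the description.
    static final Pattern APPLICABLE_DOCUMENTS = Pattern.compile(
        "(?i:.*APPLICABLE\\sDOCUMENTS|REFERENCE|STANDARD|REQUIREMENT|GUIDELINE|COMPLIANCE.*)");

    // organization: name found in the candidate reference
    // sectionHeader: header of the section where the candidate was found, or null
    static double score(String organization, String sectionHeader) {
        double score = 0.25; // every candidate starts at 0.25
        if (KNOWN_ORGANIZATIONS.contains(organization)) {
            score += 0.50;   // known standard organization
        }
        if (sectionHeader != null && APPLICABLE_DOCUMENTS.matcher(sectionHeader).find()) {
            score += 0.25;   // found in "Applicable Documents" or an equivalent section
        }
        return score;
    }

    public static void main(String[] args) {
        System.out.println(score("ANSI", "2.0 APPLICABLE DOCUMENTS"));
        System.out.println(score("ACME", "1.0 SCOPE"));
    }
}
```

With these increments a reference scores between 0.25 and 1.0, so a caller can filter on a threshold, e.g. keeping only references that name a known organization.
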
> The unit test is implemented within the 
> *{{StandardsExtractingContentHandlerTest}}* class and extracts the standard 
> references from a SOW downloaded from the [FOIA 
> Library|https://foiarr.cbp.gov/streamingWord.asp?i=607]. This 
> [SOW|^SOW-TacCOM.pdf] is also attached as a PDF.
> The *{{StandardsExtractionExample}}* is a class to demonstrate how to use the 
> {{StandardsExtractingContentHandler}} to get a list of the standard 
> references from every file in a directory.
> The attached [patch|^standards_extraction.patch] includes all the 
> changes needed to add support for standards extraction. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
