Giuseppe Totaro created TIKA-2449:
-------------------------------------

             Summary: Enabling extraction of standard references from text
                 Key: TIKA-2449
                 URL: https://issues.apache.org/jira/browse/TIKA-2449
             Project: Tika
          Issue Type: Improvement
          Components: handler
            Reporter: Giuseppe Totaro


Apache Tika currently provides several _ContentHandler_ implementations that 
help to de-obfuscate and extract information from text. For instance, the 
{{PhoneExtractingContentHandler}} is used to extract phone numbers while parsing.

This improvement adds the *{{StandardsExtractingContentHandler}}* to Tika, a 
new ContentHandler that relies on regular expressions to identify and extract 
standard references from text. 
A standard reference is a reference to a norm, convention, or requirement 
(i.e., a standard) released by a standard organization. This work is mainly 
focused on identifying and extracting the references to standards cited within 
a given document (e.g., a SOW/PWS), so that the references can be stored and 
provided to the user as additional metadata when the 
{{StandardsExtractingContentHandler}} is used.

In addition to the patch, the first version of the 
{{StandardsExtractingContentHandler}}, along with an example class to easily 
execute the handler, is available on 
[GitHub|https://github.com/giuseppetotaro/StandardsExtractingContentHandler]. 
The following sections describe in more detail how the 
{{StandardsExtractingContentHandler}} has been developed.

h1. Background

From a technical perspective, a standard reference is a string that is usually 
composed of two parts: 
# the name of the standard organization; 
# the alphanumeric identifier of the standard within the organization. 
Specifically, the first part can include the acronym or the full name of the 
standard organization or even both, and the second part can include an 
alphanumeric string, possibly containing one or more separation symbols (e.g., 
"-", "_", ".") depending on the format adopted by the organization, 
representing the identifier of the standard within the organization.

Furthermore, standard references are usually reported within the 
"Applicable Documents" or "References" section of a SOW, and they can also be 
cited within sections whose header includes the word "standard", 
"requirement", "guideline", or "compliance".

Consequently, the citation of standard references within a SOW/PWS document can 
be summarized by the following rules:
* *RULE #1*: standard references are usually reported within the section named 
"Applicable Documents" or "References".
* *RULE #2*: standard references can be cited also within sections including 
the word "compliance" or another semantically-equivalent word in their name.
* *RULE #3*: a standard reference is composed of two parts:
** Name of the standard organization (acronym, full name, or both).
** Alphanumeric identifier of the standard within the organization.
* *RULE #4*: The name of the standard organization includes the acronym or the 
full name or both. The name must belong to the set of standard organizations S 
= O U V, where O represents the set of open standard organizations (e.g., ANSI) 
and V represents the set of vendor-specific standard organizations (e.g., 
Motorola).
* *RULE #5*: A separation symbol (e.g., "-", "_", "." or whitespace) can be 
used between the name of the standard organization and the alphanumeric 
identifier.
* *RULE #6*: The alphanumeric identifier of the standard is composed of 
alphabetic and numeric characters, possibly split in two or more parts by a 
separation symbol (e.g., "-", "_", ".").

On the basis of the above rules, here are some examples of formats used for 
reporting standard references within a SOW/PWS:
* {{<ORGANIZATION_ACRONYM><SEPARATION_SYMBOL><ALPHANUMERIC_IDENTIFIER>}}
* {{<ORGANIZATION_ACRONYM><SEPARATION_SYMBOL>(<ORGANIZATION_FULL_NAME>)<SEPARATION_SYMBOL><ALPHANUMERIC_IDENTIFIER>}}
* {{<ORGANIZATION_FULL_NAME><SEPARATION_SYMBOL>(<ORGANIZATION_FULL_NAME>)<SEPARATION_SYMBOL><ALPHANUMERIC_IDENTIFIER>}}

Moreover, some standards are released jointly by two standard organizations. 
In this case, the standard reference can be reported as follows:
* {{<MAIN_ORGANIZATION_ACRONYM>/<SECOND_ORGANIZATION_ACRONYM><SEPARATION_SYMBOL><ALPHANUMERIC_IDENTIFIER>}}
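
To make the templates above concrete, here is a minimal Java sketch that renders references in three of the listed formats. The class {{ReferenceFormats}} and its method names are hypothetical and are not part of the patch; the sample organizations and identifiers are used only for illustration.

```java
// Hypothetical helper (not part of the patch): renders a standard reference
// according to the format templates listed above.
final class ReferenceFormats {

    // <ORGANIZATION_ACRONYM><SEPARATION_SYMBOL><ALPHANUMERIC_IDENTIFIER>
    static String acronymOnly(String acronym, String separator, String identifier) {
        return acronym + separator + identifier;
    }

    // <ORGANIZATION_ACRONYM><SEPARATION_SYMBOL>(<ORGANIZATION_FULL_NAME>)
    //   <SEPARATION_SYMBOL><ALPHANUMERIC_IDENTIFIER>
    static String acronymWithFullName(String acronym, String fullName,
                                      String separator, String identifier) {
        return acronym + separator + "(" + fullName + ")" + separator + identifier;
    }

    // <MAIN_ORGANIZATION_ACRONYM>/<SECOND_ORGANIZATION_ACRONYM>
    //   <SEPARATION_SYMBOL><ALPHANUMERIC_IDENTIFIER>
    static String twoOrganizations(String mainAcronym, String secondAcronym,
                                   String separator, String identifier) {
        return mainAcronym + "/" + secondAcronym + separator + identifier;
    }

    public static void main(String[] args) {
        System.out.println(acronymOnly("ISO", " ", "9001"));
        System.out.println(acronymWithFullName(
            "ISO", "International Organization for Standardization", " ", "9001"));
        System.out.println(twoOrganizations("ANSI", "TIA", "-", "222-G"));
    }
}
```

Each method is a direct transcription of one template, so the separation symbol is passed in explicitly rather than fixed.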

h1. Regular Expressions

The {{StandardsExtractingContentHandler}} uses a helper class named 
{{StandardsText}} that relies on Java regular expressions and provides some 
methods to identify headers and standard references, and determine the score of 
the references found within the given text.

Here are the main regular expressions used within the StandardsText class:
* *REGEX_HEADER*: regular expression to match only uppercase headers.
  {code}
  (\d+\.(\d+\.?)*)\p{Blank}+([A-Z]+(\s[A-Z]+)*){5,}
  {code}
* *REGEX_APPLICABLE_DOCUMENTS*: regular expression to match the header of 
"APPLICABLE DOCUMENTS" and equivalent sections.
  {code}
  (?i:.*APPLICABLE\sDOCUMENTS|REFERENCE|STANDARD|REQUIREMENT|GUIDELINE|COMPLIANCE.*)
  {code}
* *REGEX_FALLBACK*: regular expression to match a string that is supposed to be 
a standard reference.
  {code}
  \(?(?<mainOrganization>[A-Z]\w+)\)?((\s?(?<separator>\/)\s?)(\w+\s)*\(?(?<secondOrganization>[A-Z]\w+)\)?)?(\s(Publication|Standard))?(-|\s)?(?<identifier>([0-9]{3,}|([A-Z]+(-|_|\.)?[0-9]{2,}))((-|_|\.)?[A-Z0-9]+)*)
  {code}
* *REGEX_STANDARD*: regular expression to match the standard organization 
within a string potentially representing a standard reference.
  This regular expression is obtained by using a helper class named 
{{StandardOrganizations}} that provides a list of the most important standard 
organizations reported on 
[Wikipedia|https://en.wikipedia.org/wiki/List_of_technical_standard_organisations]. 
The list is composed of international standard organizations, regional 
standard organizations, and, among nationally-based standard organizations, 
the American and British ones. Other lists of standard organizations are 
reported on 
[OpenStandards|http://www.openstandards.net/viewOSnet2C.jsp?showModuleName=Organizations] 
and [IBR Standards Portal|https://ibr.ansi.org/Standards/].
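
As an illustration of how the expressions above behave, the following Java sketch compiles REGEX_HEADER and REGEX_FALLBACK exactly as quoted in this issue and applies them to sample strings. The class {{StandardsRegexDemo}}, its methods, and the sample text are assumptions made for this example; this is not the code of {{StandardsText}}.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the two pattern strings come from the issue.
final class StandardsRegexDemo {

    // Each repetition of the outer group consumes at least one letter, so
    // {5,} effectively requires five or more uppercase letters in the title.
    static final Pattern REGEX_HEADER = Pattern.compile(
        "(\\d+\\.(\\d+\\.?)*)\\p{Blank}+([A-Z]+(\\s[A-Z]+)*){5,}");

    static final Pattern REGEX_FALLBACK = Pattern.compile(
        "\\(?(?<mainOrganization>[A-Z]\\w+)\\)?"
        + "((\\s?(?<separator>\\/)\\s?)(\\w+\\s)*\\(?(?<secondOrganization>[A-Z]\\w+)\\)?)?"
        + "(\\s(Publication|Standard))?(-|\\s)?"
        + "(?<identifier>([0-9]{3,}|([A-Z]+(-|_|\\.)?[0-9]{2,}))((-|_|\\.)?[A-Z0-9]+)*)");

    // True if the line contains a numbered, uppercase header.
    static boolean isHeader(String line) {
        return REGEX_HEADER.matcher(line).find();
    }

    // Returns every substring that the fallback pattern accepts as a candidate.
    static List<String> findReferences(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = REGEX_FALLBACK.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(isHeader("2.0 APPLICABLE DOCUMENTS")); // true
        System.out.println(isHeader("2.0 Applicable documents")); // false

        Matcher m = REGEX_FALLBACK.matcher("the tower complies with ANSI/TIA-222-G");
        if (m.find()) {
            // The named groups expose the parts of the reference.
            System.out.println(m.group("mainOrganization"));   // ANSI
            System.out.println(m.group("secondOrganization")); // TIA
            System.out.println(m.group("identifier"));         // 222-G
        }
    }
}
```

Because the fallback pattern only requires a capitalized word followed by an identifier-like token, it also accepts strings that are not standard references, which is why the scoring step matters.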

h1. How to use the Standards Extraction

The standard references identification performed by using the 
{{StandardsExtractingContentHandler}} is based on the following steps (see also 
the flow chart in attachment):
# searches for headers;
# searches for patterns that are supposed to be standard references (basically, 
every string mostly composed of uppercase letters followed by an alphanumeric 
identifier);
# assigns each potential standard reference an initial score of 0.25;
# increases by 0.50 the score of references that include the name of a known 
standard organization;
# increases by 0.25 the score of references that have been found within 
"Applicable Documents" and equivalent sections;
# returns the standard references along with their scores;
# adds the standard references as additional metadata.
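
The scoring in steps 3 to 5 can be sketched in Java as follows. This is a simplified illustration under stated assumptions: the class {{ScoreSketch}}, the three-entry organization set, and the exact-match lookup are hypothetical, whereas the handler obtains its organizations from {{StandardOrganizations}}.

```java
import java.util.Set;

// Hypothetical sketch of the scoring steps; not the handler's actual code.
final class ScoreSketch {

    // Illustrative subset only (assumption for this example).
    static final Set<String> KNOWN_ORGANIZATIONS = Set.of("ANSI", "ISO", "IEEE");

    static double score(String organization, boolean inApplicableSection) {
        double score = 0.25;                              // step 3: initial score
        if (KNOWN_ORGANIZATIONS.contains(organization)) {
            score += 0.50;                                // step 4: known organization
        }
        if (inApplicableSection) {
            score += 0.25;                                // step 5: found in a relevant section
        }
        return score;
    }

    public static void main(String[] args) {
        System.out.println(score("ANSI", true));     // 1.0
        System.out.println(score("ANSI", false));    // 0.75
        System.out.println(score("Section", true));  // 0.5
        System.out.println(score("Section", false)); // 0.25
    }
}
```

Under this scheme, a candidate reaches the maximum score of 1.0 only when it names a known organization and appears in an "Applicable Documents" or equivalent section.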

The unit test is implemented within the 
*{{StandardsExtractingContentHandlerTest}}* class and extracts the standard 
references from a SOW downloaded from the [FOIA 
Library|https://foiarr.cbp.gov/streamingWord.asp?i=607]. This SOW is also 
provided as a PDF attachment.

The *{{StandardsExtractionExample}}* is a class to demonstrate how to use the 
{{StandardsExtractingContentHandler}} to get a list of the standard references 
from every file in a directory.

The attached patch includes all the changes needed to add support for 
standards extraction. 


