Hello. I am working on a project where one system (System A) contains seven
text fields (unstructured data for comments).
I have concatenated all of the fields into a single field.
There is a second system (System B) containing two unstructured fields that
capture text comments. I have concatenated these fields into a single field
just as I did for the first system. This system contains highly sensitive,
restricted data.
The problem I'm trying to solve is that no text data from System B (sensitive
narratives, investigative IDs, etc.) should appear in System A.
In essence, I am trying to find the following three items in System A:
1) Direct references to investigations ("Investigation number ABC123")
2) Language that alludes to an investigation (e.g. "Jane Doe is under
investigation")
3) Actual cut-and-paste segments where someone copied text verbatim from
System B into the System A commentary fields.
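
For items 1 and 2, a pattern search may be enough. The sketch below is only an
illustration: the ID format (three letters followed by three digits) is guessed
from the "ABC123" example, the phrase list is a small made-up sample, and the
names ID_PATTERN, PHRASE_PATTERN and flag_comment are hypothetical.

```python
import re

# Assumed ID format: three letters then three digits, as in "ABC123".
# The real System B format may differ; adjust the pattern accordingly.
ID_PATTERN = re.compile(
    r"\binvestigation\s+(?:number|no\.?|id|#)?\s*:?\s*([A-Z]{3}\d{3})\b",
    re.IGNORECASE,
)

# Assumed sample of phrases that allude to an investigation (item 2).
PHRASE_PATTERN = re.compile(
    r"\b(?:under|subject\s+of|pending)\s+(?:an?\s+)?investigation\b",
    re.IGNORECASE,
)


def flag_comment(text):
    """Return the investigation IDs and investigation phrases found in text."""
    ids = [m.group(1).upper() for m in ID_PATTERN.finditer(text)]
    phrases = [m.group(0).lower() for m in PHRASE_PATTERN.finditer(text)]
    return {"ids": ids, "phrases": phrases}


if __name__ == "__main__":
    print(flag_comment("See Investigation number ABC123 for details."))
    print(flag_comment("Jane Doe is under investigation"))
```

A fixed phrase list will miss paraphrases, so it is probably best treated as a
first-pass filter rather than a complete answer to item 2.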
It seems as though I may have to use different text similarity (comparison
between System A and System B text) or search techniques for one or more of the
three items.
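
For item 3 in particular, one possible technique is to compare overlapping
word sequences ("shingles") between the two concatenated fields, since a
verbatim paste shares long runs of identical words. This is a minimal sketch,
not a tuned solution: the shingle length of 5 words is an arbitrary assumption,
and the function names word_shingles and copied_fraction are hypothetical.

```python
import re


def word_shingles(text, n=5):
    """Return the set of n-word sequences in text (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def copied_fraction(a_text, b_text, n=5):
    """Fraction of System A's n-word shingles that also occur in System B's text."""
    a_shingles = word_shingles(a_text, n)
    if not a_shingles:
        # Text shorter than n words has no shingles to compare.
        return 0.0
    shared = a_shingles & word_shingles(b_text, n)
    return len(shared) / len(a_shingles)


if __name__ == "__main__":
    system_b = ("The subject transferred funds to an offshore account "
                "on three occasions in March")
    system_a = ("Spoke with branch. the subject transferred funds to an "
                "offshore account, per notes.")
    print(copied_fraction(system_a, system_b))
```

Any non-zero fraction means at least one run of n consecutive words is shared,
which could then be reviewed by hand.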
I was thinking that Cosine Similarity Computation (CSC) would perhaps be
useful, but I thought I would solicit some advice as I am new to text
analytics with Python.
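
To make the cosine similarity idea concrete, here is a bare-bones sketch using
only the standard library and raw word counts. The function name
cosine_similarity and the simple tokenisation are assumptions for
illustration; a real pipeline would more likely use TF-IDF weights.

```python
import math
import re
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the word-count vectors of two texts."""
    vec_a = Counter(re.findall(r"[a-z0-9]+", text_a.lower()))
    vec_b = Counter(re.findall(r"[a-z0-9]+", text_b.lower()))
    # Dot product over the words of text_a; missing words count as 0 in vec_b.
    dot = sum(count * vec_b[word] for word, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty text has no direction, so treat it as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(cosine_similarity("wire transfer to offshore account",
                            "offshore account wire transfer"))
    print(cosine_similarity("late fee", "wire transfer"))
```

Because this measure compares overall vocabulary and ignores word order, a
single pasted sentence inside a long comment may still score low, so it may
suit ranking candidate pairs better than detecting items 1 or 3 directly.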
Thank you in advance.
Kenneth R Adams
Compliance Technology and Analytics
TAS -Text Analytics as a Service
Wells Fargo & Co. | 401 South Tryon Street, Twenty-sixth Floor | Charlotte, NC
28202
MAC: D1050-262
Cell: 704-408.5157