On Fri, Mar 7, 2014 at 1:45 PM, Mark G <[email protected]> wrote:

> Hello all, I would like to propose the development of a Temporal Extraction
> addon. In the industry I work in, there is a need to support search of
> documents/entities by location and date mentions within the document text.
> I feel pretty good about the GeoEntityLinker addon for providing geocoding,
> but now I need to do date extraction.
>
> The addon I propose would take text and return a real java.util.Date
> with a precision, likely stored in an extended Span object. Initially, I
> would like it to deal with year, seasonal, month, and day level references.
> I don't care so much about day-of-week mentions and such; this is geared
> more towards supporting search and other datetime related analytics.
>
> I have done this before to some degree a while back, and I have done
> research that leads to a few different approaches:
> 1. All regex based extraction, and then a series of rules for cleaning the
>    results.
>    pros: no training, simple configuration, predictable output
>    cons: regexes become confusing as they mature, regexes are not context
>    specific
> 2. Machine learning (like the current opennlp model/NER can do pretty well)
>    pros: based on user data (if trained on it), adaptive etc
>    cons: unpredictable strings as a result, hard to deal with
> 3. A combination of regex extraction and ML, in which the regex results are
>    highly specific and used for sentence annotation for building a model.
>    pros: model based on regex results on user data, adaptive, more recall
>    than option 1, more predictable results than option 2
>    cons: laborious processing (run regex extraction, produce annotations,
>    build a model etc), still have to deal with unpredictable results
>
> My recommendation is option 3. I would like to write a regex based
> extractor that stands alone, but also write an impl for the
> modelbuilder-addon that would use the regex based extractor to create
> annotations for the model building process that occurs in the
> modelbuilder-addon (which automates annotation and model building based on
> user defined "known entities" and sentences). Option three would also
> provide "simple" and "advanced" versions of temporal extraction.
>
> This is a complex process, let us know if you see utility in this, and
> please provide any insights.
>
> Sorry for the long email.
>
> Thanks,
> Mark G
>
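
To make the standalone regex extractor from the proposal a bit more concrete, here is a minimal sketch of what it could look like. Everything named here (RegexTemporalExtractor, TemporalSpan, Precision) is hypothetical and not an existing OpenNLP API. A real addon would presumably extend opennlp.tools.util.Span, as suggested above, rather than use a separate class. The sketch only handles full English month names and years 1900-2099, and leaves seasonal references out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.List;
import java.util.Locale;
import java.util.TimeZone;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical regex-only extractor (option 1); not part of OpenNLP. */
final class RegexTemporalExtractor {

    /** How specific the mention was. Seasonal precision is omitted in this sketch. */
    enum Precision { YEAR, MONTH, DAY }

    /** Hypothetical result type: character offsets plus a Date and its precision. */
    static final class TemporalSpan {
        final int start;
        final int end;
        final Date date;
        final Precision precision;

        TemporalSpan(int start, int end, Date date, Precision precision) {
            this.start = start;
            this.end = end;
            this.date = date;
            this.precision = precision;
        }

        @Override
        public String toString() {
            return "[" + start + ".." + end + ") " + precision + " " + date;
        }
    }

    private static final List<String> MONTHS = Arrays.asList(
        "january", "february", "march", "april", "may", "june",
        "july", "august", "september", "october", "november", "december");

    // Matches "2014", "March 2014", or "March 7, 2014".
    // group(1) = month name, group(2) = day of month, group(3) = year.
    private static final Pattern MENTION = Pattern.compile(
        "\\b(?:(" + String.join("|", MONTHS) + ")\\s+(?:(\\d{1,2}),?\\s+)?)?((?:19|20)\\d{2})\\b",
        Pattern.CASE_INSENSITIVE);

    static List<TemporalSpan> extract(String text) {
        List<TemporalSpan> spans = new ArrayList<>();
        Matcher m = MENTION.matcher(text);
        while (m.find()) {
            String month = m.group(1);
            String day = m.group(2);
            int year = Integer.parseInt(m.group(3));
            Precision precision = day != null ? Precision.DAY
                : month != null ? Precision.MONTH : Precision.YEAR;

            // Missing fields default to the start of the period (Jan 1, or day 1).
            Calendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"), Locale.ROOT);
            cal.clear();
            cal.setLenient(false); // reject impossible dates instead of rolling over
            cal.set(year,
                month == null ? Calendar.JANUARY : MONTHS.indexOf(month.toLowerCase(Locale.ROOT)),
                day == null ? 1 : Integer.parseInt(day));
            try {
                spans.add(new TemporalSpan(m.start(), m.end(), cal.getTime(), precision));
            } catch (IllegalArgumentException invalidDate) {
                // e.g. "February 31, 2014": drop the match; a cleanup rule could go here.
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        for (TemporalSpan s : extract("Floods hit in March 2014 and again on June 3, 2013.")) {
            System.out.println(s);
        }
    }
}
```

Keeping the precision next to the Date matters for search: a MONTH-level hit for "March 2014" can be expanded into a date range at query time, whereas a bare Date would falsely claim day-level accuracy. The same spans could also serve as the automatic annotations for the modelbuilder-addon step in option 3.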
--
Adriano Araújo Santos
*"The mind that opens itself to a new idea will never return to its original size."*
