Revision: 17537
http://sourceforge.net/p/gate/code/17537
Author: adamfunk
Date: 2014-03-04 21:00:13 +0000 (Tue, 04 Mar 2014)
Log Message:
-----------
More details of the new TR stuff.
Modified Paths:
--------------
userguide/trunk/misc-creole.tex
Modified: userguide/trunk/misc-creole.tex
===================================================================
--- userguide/trunk/misc-creole.tex 2014-03-04 19:49:45 UTC (rev 17536)
+++ userguide/trunk/misc-creole.tex 2014-03-04 21:00:13 UTC (rev 17537)
@@ -3252,9 +3252,9 @@
\sect[sec:creole:termraider]{TermRaider term extraction tools}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
TermRaider is a set of term extraction and scoring tools developed in the NeOn
-and ARCOMEM projects. Although the plugin is still experimental, we are now
-including it in GATE as a response to frequent requests from GATE users who have
-read publications related to those projects.
+and ARCOMEM projects. Although some parts of the plugin are still experimental,
+we are now including it in GATE as a response to frequent requests from GATE
+users who have read publications related to those projects.
The easiest way to test TermRaider is to populate a corpus with related
documents, load the sample
@@ -3262,8 +3262,8 @@
and run it. This application will process the documents and create instances of
three termbank language resources with sensible parameters.
-All the language resources in TermRaider are properly serializable and so can be
-stored in GATE datastores.
+All the language resources in TermRaider are serializable and can be stored in
+GATE datastores.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsect{Termbank language resources}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@@ -3274,25 +3274,34 @@
\begin{itemize}
\item \textbf{corpora}: a \texttt{Set<gate.Corpus>} from which the termbank is
generated.
-\item \textbf{inputASName}: the annotation set name in which to find the term
- candidates.
+\item \textbf{inputASName} (\texttt{String}): the annotation set name in which
+ to find the term candidates.
\item \textbf{inputAnnotationTypes} (\texttt{Set<String>}): annotation types
which are treated as term candidates.
-\item \textbf{inputAnnotationFeature}: the feature of each annotation used as
- the term string (if the feature is missing from the annotation, the underlying
- document content will be whitespace-trimmed and used). Note that these values
- are case-sensitive; normally the lemma (\emph{root} feature from the GATE
- Morphological Analyser) is used for consistency.
+\item \textbf{inputAnnotationFeature} (\texttt{String}): the feature of each
+ annotation used as the term string (if the feature is missing from the
+ annotation, the underlying document content will be whitespace-trimmed and
+ used). Note that these values are case-sensitive; normally the lemma
+ (\emph{root} feature from the GATE Morphological Analyser) is used for
+ consistency.
\item \textbf{languageFeature} (\texttt{String}): the feature of each annotation
identifying the language of the term. (Annotations without the feature will
- get a blank language code.)
-\item \textbf{scoreProperty}: a description of the score, used in the CSV
- output and the Termbank Score Copier PR.
+ get an empty string as a language code, which can match language-coded terms
+ more flexibly in some situations.)
+\item \textbf{scoreProperty} (\texttt{String}): a description of the principal
+ output score, used in the termbank's GUI and CSV output and in the Termbank
+ Score Copier PR. (A sensible default is provided for each termbank type.)
\item \textbf{debugMode} (\texttt{Boolean}): this sets the verbosity of the
output while creating the termbank.
\end{itemize}
+Each type of termbank has one or more score types, shown as columns in the
+\emph{Details} tab of the GUI and listed in the \emph{Type} pull-down menu in
+the \emph{Term Cloud} tab. The first score is always the principal one named by
+the \emph{scoreProperty} parameter above.
+
+
The \texttt{Term} class is defined in terms of the term string itself, the
language code, and the annotation type, so it is possible to distinguish
\emph{affect}(\emph{english},\emph{Noun}) from
@@ -3308,7 +3317,7 @@
of tf.idf over other corpora.
A document frequency bank can be constructed from one or more corpora, from one
-or more existing document frequency banks, or from a combination of them, so
+or more existing document frequency banks, or from a combination of both, so
that document frequency counts from different sources can be compiled together.
It therefore has one additional parameter:
%%
@@ -3316,6 +3325,10 @@
\item \textbf{inputBanks} zero or more other instances of
\emph{DocumentFrequencyBank}.
\end{itemize}
+
+
+This type of termbank has only the principal score type.
+%% TODO document the flexible language code matching
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsect{TfIdf Termbank}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@@ -3323,52 +3336,76 @@
of corpora. It has the following additional init parameters.
%%
\begin{itemize}
-\item \textbf{docFreqSource}
+\item \textbf{docFreqSource}: an instance of \emph{DocumentFrequencyBank}, which
+ could be derived from another set of corpora (as described above); if this
+ parameter is \texttt{null} (\verb!<none>! in the GUI), an instance of
+ DocumentFrequencyBank will be constructed from this LR's corpora parameter and
+ used here.
\item \textbf{idfCalculation}: an enum (pull-down menu in the GUI) with the
- following options for inverted document frequency:
+ following options for adjusting inverted document frequency,
+ $g(\mathit{df})$ (all adjusted to return a positive value, to prevent
+ division by zero):
\begin{itemize}
- \item \emph{Logarithmic} $=\log_{2}(n/df)$;
- \item \emph{Scaled} $= 1/df$;
- \item \emph{Natural} $= 1/df$;
+ % TODO: add unscaled Logarithmic as below
+ % change below to LogarithmicScaled
+ \item \emph{Logarithmic} $=\log_{2}(1+n/\mathit{df})$;
+ \item \emph{Scaled} $=(1+n)/(1+\mathit{df})$;
+ \item \emph{Natural} $=1/(1+\mathit{df})$.
\end{itemize}
-\item \textbf{normalization}
+\item \textbf{normalization}: an enum (pull-down) with the following options for
+ normalizing the raw score $s$, where $s=f(\mathit{tf}){\times}g(\mathit{df})$:
\begin{itemize}
- \item \emph{None}
- \item \emph{Hundred}
- \item \emph{Sigmoid}
+ \item \emph{None} $=s$ (this may return numbers in a low range);
+ \item \emph{Hundred} $=100s$ (this makes the sliders easier to use);
+ \item \emph{Sigmoid} $=\frac{200}{1+e^{-s/k}}-100$ (this maps all raw scores
+ monotonically to values in the 0--100 range, so that $0{\rightarrow}0$ and
+ ${\infty}{\rightarrow}100$).
\end{itemize}
\item \textbf{tfCalculation}: an enum (pull-down) with the following options for
- term frequency:
+ adjusting term frequency $f(\mathit{tf})$:
\begin{itemize}
- \item \emph{Natural} $=tf$;
- \item \emph{Sqrt}
- \item \emph{Logarithmic} $=1+\log_{2} tf$.
+ \item \emph{Natural} $=\mathit{tf}$;
+ \item \emph{Sqrt} $=\sqrt{\mathit{tf}}$;
+ \item \emph{Logarithmic} $=1+\log_{2} \mathit{tf}$.
\end{itemize}
\end{itemize}
%%
-For these calcutations, $tf$ is the term frequency (number of occurrences of the
-term in the corpora), $df$ is the document frequency according to the
-DocumentFrequencySource, and $n$ is the total number of documents.
+For the calculations above, $\mathit{tf}$ is the term frequency (number of
+individual occurrences of the term in the current corpora), whereas
+$\mathit{df}$ is the document frequency of the term according to the
+DocumentFrequencySource and $n$ is the total number of documents in the
+DocumentFrequencySource. The raw score
+$s=f(\mathit{tf}){\times}g(\mathit{df})$.
+
+This type of termbank has five score types: the principal one (normalized), the
+raw score ($s$ above, with the principal name plus the suffix ``.raw''),
+\emph{termFrequency}, \emph{localDocFrequency} (number of documents in the
+current corpora containing the term; not used in the tf.idf calculation), and
+\emph{refDocFrequency} ($\mathit{df}$ above; this will be the same as
+\emph{localDocFrequency} if no external \emph{docFreqSource} was specified).
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsect{Annotation Termbank}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-This termbank collects the values of scoring features on all the term candidates
-and selects the minimum or maximum score or averages them, according to the
-\textbf{mergingMode} parameter. It has the following additional init
-parameters.
+This termbank collects the values of scoring features on all the term candidate
+annotations, and for each term determines the minimum, maximum, or mean
+according to the \textbf{mergingMode} parameter. It has the following
+additional parameters.
%%
\begin{itemize}
\item \textbf{inputScoreFeature}: an annotation feature whose value should be a
\texttt{Number} or interpretable as a number.
\item \textbf{mergingMode}: an enum (pull-down menu in the GUI) with the options
\emph{MINIMUM}, \emph{MEAN}, or \emph{MAXIMUM}.
-\item \textbf{normalization}
- \begin{itemize}
- \item \emph{None}
- \item \emph{Hundred}
- \item \emph{Sigmoid}
- \end{itemize}
+\item \textbf{normalization}: the same normalization options as for the TfIdf
+ Termbank above. To produce augmented tf.idf scores (as in the sample
+ application), it is generally better to augment the \texttt{tfIdfScore.raw}
+ values, compile them into an Annotation Termbank, and normalize the results
+ (rather than carrying out augmentation on the normalized tf.idf scores).
\end{itemize}
+
+This type of termbank has four score types: the principal one (normalized), the
+raw score (minimum, maximum, or mean above; with the principal name plus the
+suffix ``.raw''), \emph{termFrequency}, \emph{localDocFrequency} (the last two
+are not used in the calculation).
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsect{Hyponymy Termbank}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
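Reviewer note on the TfIdf Termbank hunk above: the scoring it documents can be checked with a small sketch. This is only an illustration of the formulas as written in the diff, not TermRaider's Java code. The function and dictionary names are hypothetical, the sigmoid constant $k$ is not defined in the diff (it defaults to 1.0 here purely as an assumption), and $\mathit{df}\geq 1$ is assumed because the \emph{Logarithmic} option divides by $\mathit{df}$.

```python
import math

# Hypothetical names; only the formulas come from the documentation diff.

# f(tf): adjustments of the raw term frequency.
TF_CALCULATIONS = {
    "Natural": lambda tf: tf,
    "Sqrt": lambda tf: math.sqrt(tf),
    "Logarithmic": lambda tf: 1 + math.log2(tf),
}

# g(df): adjustments of the document frequency, given n documents in total.
IDF_CALCULATIONS = {
    "Logarithmic": lambda df, n: math.log2(1 + n / df),  # assumes df >= 1
    "Scaled": lambda df, n: (1 + n) / (1 + df),
    "Natural": lambda df, n: 1 / (1 + df),
}


def normalize(s, mode, k=1.0):
    """Normalize a raw score s; k is an assumed constant, not given in the diff."""
    if mode == "None":
        return s
    if mode == "Hundred":
        return 100 * s
    if mode == "Sigmoid":
        # Maps 0 to 0 and approaches 100 as s grows.
        return 200 / (1 + math.exp(-s / k)) - 100
    raise ValueError("unknown normalization: " + mode)


def tfidf_scores(tf, df, n, tf_calc="Natural", idf_calc="Logarithmic",
                 normalization="Sigmoid", k=1.0):
    """Return (raw, normalized), where raw = f(tf) * g(df)."""
    raw = TF_CALCULATIONS[tf_calc](tf) * IDF_CALCULATIONS[idf_calc](df, n)
    return raw, normalize(raw, normalization, k)
```

For example, `tfidf_scores(4, 1, 3, "Logarithmic", "Logarithmic", "None")` multiplies $f(4)=1+\log_2 4=3$ by $g(1)=\log_2(1+3/1)=2$, so the raw and normalized scores are both 6.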
@@ -3378,43 +3415,39 @@
\begin{itemize}
\item \textbf{inputHeadFeatures} (\texttt{List<String>}): annotation features on
term candidates containing the head of the expression.
-\item \textbf{normalization}
- \begin{itemize}
- \item \emph{None}
- \item \emph{Hundred}
- \item \emph{Sigmoid}
- \end{itemize}
+\item \textbf{normalization}: the same normalization options as for the TfIdf
+ Termbank above.
\end{itemize}
%%
Head information is generated by the multiword JAPE grammar included in the
-application. We consider $T_1$ a hyponym of $T_2$ if and only if $T_2$'s head
-feature value ends with $T_1$'s head or string feature value.
+application. This LR treats $T_1$ as a hyponym of $T_2$ if and only if $T_2$'s
+head feature's value ends with $T_1$'s head or string feature's value. (This
+depends on \emph{head-final} construction of compound nouns, as used in English
+and German.)
+
+This type of termbank has five score types: the principal one (normalized), the
+raw score ($s$ above, with the principal name plus the suffix ``.raw''),
+\emph{termFrequency}, \emph{hyponymCount} (number of distinct hyponyms found in
+the current corpora), and \emph{localDocFrequency} (number of documents in the
+current corpora containing the term; not used in other calculations).
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsect{Termbank Score Copier}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-This processing resource copies the scores and optionally frequencies from a
-termbank into features of the term annotations. It has no init parameters and
-two runtime parameters.
+This processing resource copies the scores from a termbank onto features of the
+term annotations. It has no init parameters and two runtime parameters.
%%
\begin{itemize}
\item \textbf{annotationSetName}
-\item \textbf{frequencyFeature} default value \emph{frequency}
-\item \textbf{docFrequencyFeature} default value \emph{docFrequency}
\item \textbf{termbank}
\end{itemize}
%%
-This PR uses the annotation types, string and language code features, and score
-features from the selected termbank. It treats any annotation with a matching
-type and matching string and language feature (where a missing feature matches
-the triple-underscore ``not found'' code) as a match, and copies the score to a
-feature on the annotation specified by the termbank's \emph{scoreProperty}
-parameter.
-
-It also copies the term frequency to the annotation's \emph{frequencyFeature}
-unless that parameter is blank; and copies the document frequency to the
-\emph{docFrequencyFeature} unless that is blank. Note that the default values
-are not blank---you need to clear either or both parameters to prevent these
-annotation features from being filled in.
+This PR uses the annotation types, string and language code features, and scores
+from the selected termbank. It treats any annotation with a matching type and
+matching string and language feature as a match (although a missing language
+feature matches the empty string used as a ``not found'' code), and copies all
+the termbank's scores to features on the annotation with the scores' names.
+(The principal score name is determined by the termbank's \emph{scoreProperty}
+parameter.)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsect{The PMI bank language resource}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
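Reviewer note on the Termbank Score Copier hunk above: the matching rule it describes can be sketched as follows. This is a simplified model and not the GATE API. Annotations are plain dictionaries, a termbank is a dictionary keyed by (term string, language code, annotation type), and the feature names `root` and `lang` are assumed defaults chosen for illustration. The language match is exact here; the more flexible language-code matching mentioned in the diff is still marked TODO there and is not modelled.

```python
# Hypothetical data model: an annotation is a dict with "type", "text" and
# "features"; the termbank maps (string, language, type) to a dict of scores.

def term_key(annotation, string_feature, language_feature):
    """Build the (string, language, type) triple that identifies a term."""
    features = annotation["features"]
    text = features.get(string_feature)
    if text is None:
        # Missing string feature: fall back to the trimmed document content.
        text = annotation["text"].strip()
    # Missing language feature: the empty string acts as the language code.
    language = features.get(language_feature, "")
    return (str(text), language, annotation["type"])


def copy_scores(annotations, termbank_scores,
                string_feature="root", language_feature="lang"):
    """Copy every score of each matching term onto the annotation's features.

    Returns the number of annotations that matched a term.
    """
    matched = 0
    for annotation in annotations:
        key = term_key(annotation, string_feature, language_feature)
        scores = termbank_scores.get(key)
        if scores is None:
            continue
        # Each score is stored under its own name, e.g. "tfIdf" and "tfIdf.raw".
        annotation["features"].update(scores)
        matched += 1
    return matched
```

Keying the termbank on the full triple mirrors the way the guide defines the \texttt{Term} class, so the same string with a different language code or annotation type is treated as a different term.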