Revision: 18448
          http://sourceforge.net/p/gate/code/18448
Author:   ian_roberts
Date:     2014-11-08 17:26:05 +0000 (Sat, 08 Nov 2014)
Log Message:
-----------
Documentation for the latest set of JSON input/output changes.
Modified Paths: -------------- userguide/trunk/social-media.tex Added Paths: ----------- userguide/trunk/save-as-json.png Added: userguide/trunk/save-as-json.png =================================================================== (Binary files differ) Index: userguide/trunk/save-as-json.png =================================================================== --- userguide/trunk/save-as-json.png 2014-11-08 15:19:27 UTC (rev 18447) +++ userguide/trunk/save-as-json.png 2014-11-08 17:26:05 UTC (rev 18448) Property changes on: userguide/trunk/save-as-json.png ___________________________________________________________________ Added: svn:mime-type ## -0,0 +1 ## +image/png \ No newline at end of property Modified: userguide/trunk/social-media.tex =================================================================== --- userguide/trunk/social-media.tex 2014-11-08 15:19:27 UTC (rev 18447) +++ userguide/trunk/social-media.tex 2014-11-08 17:26:05 UTC (rev 18448) @@ -31,14 +31,14 @@ The \verb!Twitter! plugin contains several tools useful for processing tweets. This plugin depends on the \verb!Stanford_CoreNLP! plugin, which must be loaded -first. This includes tools to load documents into GATE from the JSON format -provided by the Twitter APIs, a tokeniser and POS tagger tuned specifically for -Tweets, a tool to split up multi-word hashtags, and an example named entity -recognition application called {\em TwitIE} which demonstrates all these -components working together. +first. This includes tools to load and save documents in GATE using the JSON +format provided by the Twitter APIs, a tokeniser and POS tagger tuned +specifically for Tweets, a tool to split up multi-word hashtags, and an example +named entity recognition application called {\em TwitIE} which demonstrates all +these components working together. 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -\subsect[sec:social:twitter:format]{Twitter JSON format} +\sect[sec:social:twitter:format]{Twitter JSON format} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % Twitter provides APIs to search for Tweets according to various criteria, and @@ -48,19 +48,35 @@ includes the text of the Tweet plus a large amount of supporting metadata. The GATE \verb!Twitter! plugin contains a format analyser for this JSON format which allows you to load a file of one or more JSON Tweets into a GATE -document. Loading the plugin registers the document format with GATE, so that -it will be automatically associated with files whose names end in -``\verb!.json!''; otherwise you need to specify \verb!text/x-json-twitter! for -the document mimeType parameter. This will work both when directly creating a -single new GATE document and when populating a corpus. +document. The format analyser can handle multiple Tweets in one file, +represented as any of: +\begin{itemize} +\item a top-level JSON array \verb![{...},{...}]! +\item a top-level JSON object containing properties ``search\_metadata'' and + ``statuses'', where the ``statuses'' property is an array of Tweets (this is + the format returned by a call to Twitter's ``search'' API) +\item or simply concatenated together, optionally with white space or newline + characters between adjacent objects (this is the format returned by Twitter's + streaming APIs). +\end{itemize} +Loading the plugin registers the +document format with GATE, so that it will be automatically associated with +files whose names end in ``\verb!.json!''; otherwise you need to specify +\verb!text/x-json-twitter! for the document mimeType parameter. This will work +both when directly creating a single new GATE document and when populating a +corpus. -Each tweet object's \verb!text! value is converted into the document content, +Each tweet object's \verb!text! 
value is converted into the document +content\footnote{HTML entity references \texttt{\&amp;}, \texttt{\&lt;} and +\texttt{\&gt;} are decoded into the corresponding characters}, which is covered with a \emph{Tweet} annotations whose features represent (recursively when appropriate, using \emph{Map} and \emph{List}) all the other key-value pairs in the tweet object. \textbf{Note:} these recursive values are difficult to work with in JAPE; the special corpus population tool described next allows important key-sequences to be ``brought up'' to the document content -and the top level of the annotation features. +and the top level of the annotation features. Any entities described by the +standoff markup ``entities'' JSON property will be converted into their +corresponding GATE annotations (see below for details). Multiple tweet objects in the same JSON file are separated by blank lines (which are not covered by \emph{Tweet} annotations). @@ -77,6 +93,9 @@ \item[One document per tweet] If this box is ticked (the default), each tweet will produce a separate document. If not, each {\em input file} will produce one GATE document. +\item[Annotations for ``entities''] If this box is ticked (the default), any + entities described by the standoff markup ``entities'' JSON property will be + converted into their corresponding GATE annotations (see below). \item[Content keys] The values of these JSON keys are converted into strings and concatenated into each tweet's document content. Colon-delimited strings specify nested keys, e.g., ``\texttt{user:name}'' will yield the value of the @@ -92,17 +111,126 @@ configuration. \end{description} %% -Every tweet is covered by a \texttt{Tweet} annotation with features specified by -the ``feature keys'' option. Multiple tweets in the same GATE document are -separated by a blank line (two newlines). +Again, the input can be in any of the three formats discussed above (an array +of Tweets, a search result, or a stream of concatenated objects).
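Reviewer note: the three accepted container layouts (top-level array, search result with ``statuses'', and concatenated objects) can be illustrated outside GATE with a short sketch. This is not the plugin's implementation; the function name `iter_tweets` is hypothetical, and the code only assumes the layouts as described in the text above.

```python
import json


def iter_tweets(raw):
    """Yield tweet objects from a string in any of the three layouts.

    Hypothetical helper for illustration only; it is not part of GATE.
    """
    decoder = json.JSONDecoder()
    pos, end = 0, len(raw)
    while pos < end:
        # Skip optional white space or newlines between adjacent objects
        # (raw_decode does not accept leading white space itself).
        while pos < end and raw[pos].isspace():
            pos += 1
        if pos >= end:
            break
        value, pos = decoder.raw_decode(raw, pos)
        if isinstance(value, list):
            # Layout 1: a top-level JSON array of tweets.
            yield from value
        elif (isinstance(value, dict)
              and "search_metadata" in value
              and "statuses" in value):
            # Layout 2: a search API result wrapping the tweets.
            yield from value["statuses"]
        else:
            # Layout 3: one object in a concatenated stream.
            yield value
```

Because `raw_decode` returns the offset at which each value ends, the same loop handles a single array, a single search result, or any number of streamed objects without needing to know the layout in advance.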
+Every tweet in the resulting GATE documents is covered by a \texttt{Tweet} +annotation with features specified by the ``feature keys'' option. Multiple +tweets in the same GATE document are separated by a blank line (two newlines). Corpus population from Twitter JSON files is also accessible programmatically when this plugin is loaded, using the public static void method \texttt{gate.corpora.twitter.Population.populateCorpus(final Corpus corpus, URL inputUrl, String encoding, List<String> contentKeys, List<String> featureKeys, - int tweetsPerDoc)}. + int tweetsPerDoc, boolean processEntities)}. +\subsect[sec:social:twitter:entities]{Entity annotations in JSON} + +Twitter's JSON format provides a mechanism to represent annotations over the +Tweet text as standoff markup, via a JSON property named ``entities''. The +value of this property is an object with one property for each entity +\emph{type}, whose value is a list of objects representing the individual +annotations. Within each individual entity object, the ``indices'' property +gives start and end character offsets of the annotation within the Tweet text. + +\begin{verbatim} +{ + "text":"@some_user this is a nice #example", + "entities":{ + "user_mentions":[ + { + "indices":[0,10], + "screen_name":"some_user", + ... + } + ], + "hashtags":[ + { + "indices":[26,34], + "text":"example" + } + ] + } +} +\end{verbatim} + +Both the single document format parser and the corpus population tool are able +to convert this structure into GATE annotations. The entity type (e.g. +\verb!user_mentions!) becomes the annotation type, the \verb!indices! property +provides the offsets, and the other properties become features of the generated +annotation. + +By default, the entity annotations are created in the ``Original markups'' +annotation set, as is the usual convention for annotations generated by a +document format. However, if the entity type contains a colon character (e.g. +\verb!"Key:Person":[...]!) 
then the portion before the colon is taken to be an +annotation set name and the portion after the colon is the annotation type (in +this example, a ``Person'' annotation in the ``Key'' annotation set). An +empty annotation set name (i.e. \verb!":Person"!) creates the corresponding +annotations in the default annotation set. This scheme is designed to be +compatible with the GATE JSON export mechanism described in the next section. + %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +\sect[sec:social:twitter:export]{Exporting GATE documents as JSON} + +Loading the \verb!Twitter! plugin also adds a ``GATE JSON'' option to the +``Save as\ldots'' right-click menu on documents and corpora, to export GATE +documents in the Twitter-style JSON format. This tool can save a document or +corpus of documents as a single file where each Tweet in the document or corpus +is represented as a JSON object, and the set of objects are represented either +as a single top-level JSON array (\verb![{...},{...}]!) or simply as one object +per line (as per Twitter's streaming APIs). This exporter can be used for any +GATE document, not just for documents that were initially loaded from Twitter +JSON format, and can be used as a much more compact alternative to GATE XML, or +as an easy-to-parse interchange format to pass GATE-annotated documents to +non-GATE tools. + +The format is the same as Twitter's -- the text becomes a property ``text'' in +the JSON, and annotations are represented as standoff markup in the +``entities'' property, which is an object whose keys are annotation types and +whose corresponding values are arrays of objects representing the annotations. 
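Reviewer note: the mapping from the ``entities'' property to annotations, including the colon convention for annotation set names, can be sketched as follows. This is a minimal illustration under the rules stated in the documentation above, not the plugin's code; `entities_to_annotations` and the tuple layout are assumptions made for the example.

```python
def entities_to_annotations(tweet):
    """Turn a tweet's "entities" property into annotation tuples.

    Each tuple is (set_name, annotation_type, start, end, features).
    Hypothetical helper for illustration only; it is not part of GATE.
    """
    annotations = []
    for key, items in tweet.get("entities", {}).items():
        if ":" in key:
            # "Key:Person" -> set "Key", type "Person".
            # ":Person" -> empty set name, meaning the default set.
            set_name, ann_type = key.split(":", 1)
        else:
            # No colon: the conventional set for format-derived markup.
            set_name, ann_type = "Original markups", key
        for item in items:
            start, end = item["indices"]
            # Every property other than "indices" becomes a feature.
            features = {k: v for k, v in item.items() if k != "indices"}
            annotations.append((set_name, ann_type, start, end, features))
    return annotations
```

Applied to the example tweet from the documentation, the `user_mentions` entry yields an annotation over offsets 0 to 10 and the `hashtags` entry one over offsets 26 to 34, both in the ``Original markups'' set.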
+ +\begin{figure}[htb] + \centering + \includegraphics[width=0.8\textwidth]{save-as-json.png} + \caption{Options dialog for saving a document or corpus as JSON} + \label{fig:social:save-as-json} +\end{figure} + +The available options for the JSON exporter are shown in +figure~\ref{fig:social:save-as-json}. In detail, they are: +\begin{description} +\item[documentAnnotationASName/documentAnnotationType] the annotation set and + type that should be treated as covering each span of text that should be output + as a separate JSON object. By default this is annotations of type ``Tweet'' in + the ``Original markups'' set (i.e. the annotations covering individual Tweets + parsed by the JSON document format parser or corpus population tool). If a + document contains any annotations of the specified type then one JSON object + will be output for each such annotation $X$, with the text and entity + annotations constrained to the span of $X$. In addition, features of $X$ + will become top-level properties of the resulting JSON object. Text that is + not covered by any such annotation will not be saved. If there are no + document annotations found in a particular document (or if the + documentAnnotationType parameter is unset) then the whole of the document + text will be output as a single JSON object. +\item[entitiesAnnotationSetName] the primary annotation set that should be + scanned for entity annotations. +\item[annotationTypes] the entity annotation types to output. +\item[exportAsArray] if true, output the objects as a top-level JSON array. If + false (the default), output the JSON objects directly at the top level, + separated by newlines. +\end{description} + +Annotation types to be saved can be specified in two ways. Plain annotation +type names such as ``Person'' will be taken from the specified +\emph{entitiesAnnotationSetName}, but if a type name contains a colon character +(e.g. 
``Key:Person'') then the portion before the colon is treated as the +annotation set name and the portion after the colon as the annotation type. +The full name including the colon will be used as the type label in the +``entities'' object, so if the resulting JSON were re-loaded into GATE the +annotations would be re-created in the same annotation sets they originally +came from. + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \sect[sec:social:twitter:prs]{Low-level PRs for Tweets} The \verb!Twitter! plugin provides a number of low-level language processing components that are specifically tuned to Twitter data.