JSON is a purely data-oriented format, unlike XML, which can be used in either 
a data-oriented or a document-oriented mode. For document-oriented XML, one can 
usually just extract the text from all the text nodes. For data-oriented 
formats, extracting sensible text requires some knowledge about the structure 
of the format.
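
To illustrate why the structure matters, here is a minimal Python sketch. The 
sample document and the key name "body" are made up for illustration; in real 
data you would have to find out which keys hold running text and which hold 
identifiers, language codes, and other metadata.

```python
import json

def extract_text(node, text_keys):
    """Collect string values stored under the given keys, in document order."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key in text_keys and isinstance(value, str):
                found.append(value)
            else:
                # Not a text field: descend into nested objects and arrays.
                found.extend(extract_text(value, text_keys))
    elif isinstance(node, list):
        for item in node:
            found.extend(extract_text(item, text_keys))
    return found

# Hypothetical input: only "body" holds running text, the rest is metadata.
sample = """
{"id": "42", "lang": "de",
 "posts": [{"author": "x", "body": "Caf\\u00e9 \\u201cquoted\\u201d"},
           {"author": "y", "body": "Zweiter Beitrag"}]}
"""

doc = json.loads(sample)             # \\uXXXX escapes are decoded by the parser
texts = extract_text(doc, {"body"})  # keeps the two post bodies, skips metadata
print("\n".join(texts))
```

Collecting every string value instead would mix values such as "42" and "de" 
into the text, which is usually not what you want for analysis.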

Some languages, like Groovy [1], have built-in support for JSON, which should 
make it easy to write a simple script that extracts the text from whatever 
JSON format you have.
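
Such a script could look roughly like the following. This sketch uses Python 
rather than Groovy, and it assumes a file containing a JSON array of objects 
with a "text" field; the function names, the field name, and the choice to map 
curly quotes to straight ones are all assumptions for illustration, not a 
recommendation for any particular data set.

```python
import json
import unicodedata

# Map typographic quotes to plain ASCII quotes (an assumed cleanup policy).
QUOTES = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # single curly quotes
    "\u201c": '"', "\u201d": '"',   # double curly quotes
})

def clean(text):
    """Compose characters to NFC so diacritics stay intact, then fix quotes."""
    text = unicodedata.normalize("NFC", text)
    return text.translate(QUOTES)

def json_to_text(src_path, dst_path, field="text"):
    """Write one cleaned line of UTF-8 text per record in the JSON array."""
    with open(src_path, encoding="utf-8") as src:
        records = json.load(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        for record in records:
            dst.write(clean(record[field]) + "\n")
```

Calling json_to_text("input.json", "output.txt") with your own file names 
would then write one line of plain text per record.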

Searching for "xslt json" on Google turns up some links to XSLT-like 
tools/ideas for JSON that might also be applicable for you... but I have not 
tried any of them.


-- Richard

[1] http://groovy-lang.org/json.html

> On 17.09.2016, at 01:55, Marcellino, William <bmarc...@rand.org> wrote:
> Howdy Friends,
>   Any suggestions for sources on best practices to get clean UTF-8 text from 
> JSON files?  My goal is to get plain text for analysis, with no funky 
> formatting for curly quotes, diacritic marks kept intact, etc.  
>   Semper Fidelis,
>     -Bill Marcellino

Dr. Richard Eckart de Castilho
Technical Lead
Ubiquitous Knowledge Processing (UKP) Lab
FB 20 / Computer Science Department      
Technische Universität Darmstadt
Hochschulstr. 10, D-64289 Darmstadt, Germany 
phone [+49] (0)6151 16-25299, fax -25295, room S2/02/B117

Web Research at TU Darmstadt (WeRC): www.werc.tu-darmstadt.de
GRK 1994: Adaptive Preparation of Information from Heterogeneous Sources 
(AIPHES): www.aiphes.tu-darmstadt.de 
PhD program: Knowledge Discovery in Scientific Literature (KDSL) 
