JSON is a pure data-oriented format, unlike XML which can be used in either a 
data-oriented or document-oriented mode. For document-oriented XML, one can 
usually just extract the text from all the text nodes. For data-oriented 
formats, extracting sensible text requires some knowledge 
about the structure of the format.  

Some languages like Groovy [1] have built-in support for JSON that should 
facilitate the implementation of a simple script to extract text from whatever 
JSON format you have.
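
To give an idea of how small such a script can be, here is a rough sketch in 
Python rather than Groovy, using only the standard library. It assumes you 
want every string value in the file and do not care about the keys; the names 
iter_strings and extract_text and the sample record are made up for 
illustration. The JSON parser already turns \uXXXX escapes into real 
characters, so curly quotes and diacritics come out intact, and the NFC 
normalization step is an optional extra that merges combining marks into 
single code points.

```python
import json
import unicodedata


def iter_strings(node):
    """Yield every string value in a decoded JSON structure, depth first."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        # Only the values are visited; keys are treated as structure, not text.
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)
    # Numbers, booleans and null carry no text and are skipped.


def extract_text(raw_json):
    """Parse a JSON document and return its string values, one per line."""
    data = json.loads(raw_json)
    return "\n".join(unicodedata.normalize("NFC", s) for s in iter_strings(data))


# Hypothetical sample record, just to show the escapes being decoded.
sample = '{"user": "bill", "post": {"body": "Caf\\u00e9 \\u201cquoted\\u201d", "tags": ["a", "b"]}}'
print(extract_text(sample))
```

In practice you would probably replace the blanket traversal with a lookup of 
the specific fields you know contain running text, which is where the 
knowledge about the structure comes in. When reading from disk, opening the 
file with encoding="utf-8" and writing the result the same way should keep 
everything in UTF-8.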

Searching Google for "xslt json" turns up some links to XSLT-like 
tools/ideas for JSON that might also be applicable for you, but I have not 
tried any of them.

Cheers,

-- Richard

[1] http://groovy-lang.org/json.html

> On 17.09.2016, at 01:55, Marcellino, William <bmarc...@rand.org> wrote:
> 
> Howdy Friends,
> 
>   Any suggestions for sources on best practices to get clean UTF-8 text from 
> JSON files?  My goal is to get plain text for analysis, with no funky 
> formatting for curly quotes, diacritic marks kept intact, etc.  
> 
>   Semper Fidelis,
> 
>     -Bill Marcellino

-- 
------------------------------------------------------------------- 
Dr. Richard Eckart de Castilho
Technical Lead
Ubiquitous Knowledge Processing (UKP) Lab
FB 20 / Computer Science Department      
Technische Universität Darmstadt 
Hochschulstr. 10, D-64289 Darmstadt, Germany 
phone [+49] (0)6151 16-25299, fax -25295, room S2/02/B117
eck...@ukp.informatik.tu-darmstadt.de 
www.ukp.tu-darmstadt.de 

Web Research at TU Darmstadt (WeRC): www.werc.tu-darmstadt.de
GRK 1994: Adaptive Preparation of Information from Heterogeneous Sources 
(AIPHES): www.aiphes.tu-darmstadt.de 
PhD program: Knowledge Discovery in Scientific Literature (KDSL) 
www.kdsl.tu-darmstadt.de 
-------------------------------------------------------------------





_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora@uib.no
http://mailman.uib.no/listinfo/corpora