On Tue, Apr 21, 2015 at 9:09 PM, Ashish Sharma <ashishonline at gmail.com> wrote:
> Hi
>
> Many times I store JSON data in sqlite. It will be useful if sqlite came
> with functions which understand JSON. Presto has a nice set
> https://prestodb.io/docs/current/functions/json.html

I wrote a little tool that more or less allows the user to ingest JSON into an SQLite database, for use on tiny platforms (think a VM running inside a low-power home router). I was considering open-sourcing it. Below is the man page for the tool. If there is enough interest, I'll do the requisite work to remove dependencies on other bits of my infrastructure and publish it somewhere. Let me know.

Eric

json2sqlite(1)              Aterlo Networks, Inc.              json2sqlite(1)

NAME
       json2sqlite - convert JSON to SQLite database rows

SYNOPSIS
       json2sqlite [OPTION]...

DESCRIPTION
       Read arbitrary JSON from a file (or standard input) and emit SQLite
       rows. The form the rows take is dictated by the options. The entire
       run of the program is wrapped in a single SQLite transaction that is
       only committed if no critical errors are encountered.

       This program's original intent was to be run on small, low-power
       devices. It can be given a strict memory cap with the -M flag.

   Input selection
       -i, --input-file=FILE
              Read JSON from FILE. If FILE is '-', then read from standard
              input. If no value is passed, then '-' is assumed. If you want
              to read from a file whose literal name is '-', you can qualify
              the path to the file, e.g. './-'.

       -P, --input-path=JSON_PATH
              Specify the path of the object you wish to import. The
              root-level JSON element has the path '.'. If the root-level
              element is an object, then JSON_PATH may contain a hierarchy
              of key names at which the input array can be found. For
              example, if we have the JSON {"a": [ <data> ]}, then the array
              containing <data> can be imported by saying '--input-path .a'.
              The leading dot may be omitted. (The leading-dot syntax is
              just provided to make addressing the root-level object look a
              little nicer on the command line.) If left unspecified, then
              '.' (i.e. the whole document) is assumed.

       -T, --input-type=[array|object]
              If 'array' is given, then import the JSON array found at the
              JSON_PATH specified by --input-path. When importing an array,
              it is assumed that each element of the array is itself a JSON
              object. These objects may either be a simple name->scalar
              mapping, or they may have a more complex nested form. The
              objects are imported as either one row per object or one row
              per name/value pair, depending on the other arguments to this
              program.

              Otherwise, if 'object' is given, then import the single JSON
              object whose path is the JSON_PATH specified by --input-path.
              The object is imported just as if it were the unique element
              of a containing array and that array had been specified for
              import via --input-type=array.

              If left unspecified, then 'array' is assumed.

   Output selection
       -o, --output-database=DATABASE_FILENAME
              Emit rows to the database file DATABASE_FILENAME. This
              argument is mandatory.

       -t, --output-table=TABLE_NAME
              Emit rows to the given TABLE_NAME. This argument is mandatory.

   Initialization
       -s, --schema-file=SCHEMA_FILE
              Run the SQL statements in SCHEMA_FILE before doing anything
              else.

       -D, --delete-first
              Run 'DELETE FROM <output-table>;' before ingesting any
              records (but still within the process's global transaction,
              so that subsequent failures will not lead to data loss).

   Transforming input to output
       Note that the input selection mechanism chosen above with
       --input-type and --input-path yields a list of objects (where the
       list has length 0, 1, or more). This is called the 'initial result
       set' below. The following options instruct json2sqlite how to
       convert each of those objects to some list of table rows. There are
       two overall structures that each object can have (nested or flat),
       and two overall approaches for converting those structures to SQLite
       rows (either one row per name-value pair or one row per object).
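As a rough illustration of the dotted-path lookup that --input-path performs, here is a minimal Python sketch (json2sqlite itself is not written in Python; `resolve_path` is a hypothetical helper name, and a missing key here corresponds to the tool's exit status 14):

```python
import json

def resolve_path(document, json_path):
    """Walk a dotted JSON path such as '.a.b' (or 'a.b') down a parsed
    document. '.' addresses the root-level element itself."""
    node = document
    for key in json_path.lstrip(".").split("."):
        if key == "":
            continue  # empty component: the path was '.' (the root)
        node = node[key]  # a missing key here mirrors exit status 14
    return node

doc = json.loads('{"demographics": {"persons": [{"name": "Alice"}]}}')
print(resolve_path(doc, ".demographics.persons"))  # [{'name': 'Alice'}]
```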
   Input structure specification
       You must indicate to json2sqlite the structure of the objects in the
       initial result set. If you don't specify one, then --flat is
       assumed.

       -F, --flat
              Each object in the initial result set is presumed to contain
              a simple listing of name-value pairs, where each value is a
              scalar. For example, if the overall JSON is:

                  { "foo": "bar", "zippy": 123 }

              then it is appropriate to pass --flat.

       -N, --nested
              Each object in the initial result set is presumed to map a
              name to either an object or an array of objects. Each of
              these subobjects is itself presumed to be a simple listing of
              name-value pairs. For example, if the overall JSON is:

                  {
                      "foo":   { "mean": 106.5, "stddev": 8.5 },
                      "zippy": { "mean": 108.1, "stddev": 2.5 }
                  }

              and you wish to ingest this whole data set, then --nested is
              appropriate. Here is another example of a nested structure
              that you can import:

                  {
                      "foo": [
                          { "weight": 107, "unit": "grams", "timestamp": 1422812035 },
                          { "weight": 117, "unit": "grams", "timestamp": 1422725653 }
                      ],
                      "bar": [
                          { "weight": 57, "unit": "grams", "timestamp": 1422812035 },
                          { "weight": 83, "unit": "grams", "timestamp": 1422725653 }
                      ]
                  }

              Note that passing --nested essentially expands the initial
              result set to contain a potentially larger number of objects.
              When --nested is passed, the object key becomes available in
              the output structure specification, addressable as
              '_PARENT_KEY_'.

   Output structure specification
       Now you must choose whether to emit one SQLite row per object in the
       result set, or one SQLite row per name/value pair in the objects of
       the result set. If you don't specify one of the two following
       options, then '--row-per=object' is assumed.

       -r, --row-per=object
              Each object in the result set is converted to one SQLite row.
              The keys in the result set object correspond to column names
              in the SQLite output rows, and the values in the result set
              correspond to the values in the SQLite output rows.
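To make the transformation concrete, the --nested expansion and the two --row-per modes can be sketched in Python. This is an illustration only: the helper names are hypothetical, and json2sqlite does not actually work this way internally.

```python
def expand_nested(obj):
    """--nested: each value is a sub-object or an array of sub-objects;
    the key a sub-object was found under is attached as '_PARENT_KEY_'."""
    result_set = []
    for parent_key, value in obj.items():
        for sub in (value if isinstance(value, list) else [value]):
            row = dict(sub)
            row["_PARENT_KEY_"] = parent_key
            result_set.append(row)
    return result_set

def rows_per_object(result_set, column_map):
    """--row-per=object: one SQLite row per result-set object.
    column_map maps JSON key names (or '_PARENT_KEY_') to SQLite column
    names; a missing key yields None, i.e. SQL NULL."""
    for obj in result_set:
        yield {si: obj.get(ji) for ji, si in column_map.items()}

def rows_per_key(result_set, column_map):
    """--row-per=key: one SQLite row per name/value pair. Ji may be the
    sentinels '_KEY_', '_VALUE_', '_PARENT_KEY_', or a plain key name."""
    for obj in result_set:
        for key, value in obj.items():
            if key == "_PARENT_KEY_":
                continue
            yield {si: (key if ji == "_KEY_" else
                        value if ji == "_VALUE_" else
                        obj.get(ji))
                   for ji, si in column_map.items()}

nested = {"foo":   {"mean": 106.5, "stddev": 8.5},
          "zippy": {"mean": 108.1, "stddev": 2.5}}
rs = expand_nested(nested)
print(list(rows_per_object(rs, {"_PARENT_KEY_": "name", "mean": "mean"})))
# [{'name': 'foo', 'mean': 106.5}, {'name': 'zippy', 'mean': 108.1}]
```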
       -r, --row-per=key
              Each name/value pair in each object in the result set is
              converted to one SQLite row. The key and the value both
              become data in the SQLite table, and the SQLite column names
              must be specified separately. The key is addressable in the
              output map specification (see below) as '_KEY_', and the
              value is addressable as '_VALUE_'.

       Finally, you must specify how to translate between the 'column'
       names in the JSON result set and the SQLite column names.

       -m, --column-map='J1 S1 J2 S2 ... Jn Sn'
              This argument is mandatory.

              When --row-per is 'object', map JSON key name Ji in the
              result set to SQLite column Si. If '--nested' was passed,
              then Ji can also have the sentinel value '_PARENT_KEY_', in
              which case column Si is populated with the key under which
              the present object was found in its parent.

              When --row-per is 'key', then Ji may be one of:

                  _PARENT_KEY_
                  _KEY_
                  _VALUE_

              and Si is the column name to which to map Ji. Ji may also be
              the name of a key, in which case the column Si is set to the
              value for that key within the object currently being
              traversed.

       If json2sqlite encounters an object in the result set that does not
       have one of the keys specified in the column map, then it will
       attempt to set the corresponding SQLite column to NULL. (Note well
       that this is identical to the behavior in the case where the object
       contains the key and the key is mapped to the JSON null value.)
       This will fail if the SQLite schema does not allow NULL in that
       column. At the moment, json2sqlite considers this a permanent
       failure in the input data set, rolls back the whole transaction,
       and exits with an unhappy status.

   Miscellaneous
       -M, --memory-cap=BYTES
              Do not consume more than BYTES bytes of heap. If more space
              is required to hold the whole document and/or write to the
              database, then json2sqlite emits an error message and exits
              with a special status described below. The default is 10
              megabytes (10000000). Use 0 to specify no limit.
              At the moment, 75% of the memory cap is apportioned to
              storing the JSON document (the whole of which is read into
              RAM prior to any other processing), and 25% is apportioned to
              the SQLite page cache.

       --trace
              Emit a trace (to stderr) of the SQLite statements that
              json2sqlite executes.

       -h, --help
              Print help and quit.

       -V, --version
              Print version information and quit.

EXIT STATUS
       json2sqlite exits with one of the following statuses:

       0      OK. All records were successfully ingested with no warnings
              or errors.

       1      Unhandled exception. This indicates a bug in json2sqlite.

       10     Input error. The JSON document was malformed and could not be
              parsed.

       11     Input error. The JSON did not follow the format specified by
              the command-line arguments.

       12     Input error. The JSON document could not be completely read
              because the input data source was truncated, the socket was
              closed, or so on.

       13     Input error. The input file could not be opened for reading.

       14     Input error. Could not find the JSON_PATH specified by
              --input-path.

       18     Out of memory. The JSON document could not be completely read
              because json2sqlite ran out of memory.

       25     SQLite error. SQLite complained about something, so the
              transaction could not be completed.

       26     Failed to read the SQLite schema file.

       100    Bad command-line arguments.

EXAMPLES
   Example 1 - a single, flat object
       Suppose you have a JSON document like this:

           {
               "first_name": "Alice",
               "last_name": "Doe",
               "age": 37,
               "height": 180,
               "height_units": "cm"
           }

       and you would like to ingest it as one row into a database with the
       following schema:

           CREATE TABLE person(first_name, last_name, age, height, height_units);

       Then issue the following command:

           cat json | json2sqlite --input-type object \
                                  --output-database my.db \
                                  --output-table person \
                                  --column-map '
                                      first_name   first_name
                                      last_name    last_name
                                      age          age
                                      height       height
                                      height_units height_units
                                  '

   Example 2 - an array of flat objects
       Suppose the document in Example 1 is now an array with many person
       objects in it.
       And this time, the names in your input document are different from
       the names in your SQLite database. Finally, the object is in some
       nested location in a larger document:

           {
               "demographics": {
                   "persons": [
                       {
                           "name": "Alice",
                           "last_name": "Doe",
                           "age_years": 37,
                           "height": 180,
                           "height_units": "cm"
                       },
                       {
                           "name": "Bob",
                           "last_name": "Johnson",
                           "age_years": 24,
                           "height": 172,
                           "height_units": "cm"
                       }
                   ]
               }
           }

       and you would like to ingest this as one row per person into the
       same database:

           CREATE TABLE person(first_name, last_name, age, height, height_units);

       Then issue the following command:

           cat json | json2sqlite --input-type array \
                                  --input-path .demographics.persons \
                                  --output-database my.db \
                                  --output-table person \
                                  --column-map '
                                      name         first_name
                                      last_name    last_name
                                      age_years    age
                                      height       height
                                      height_units height_units
                                  '

   Example 3 - arbitrary name-value pairs in a flat JSON object
       Suppose you have a set of name-value pairs arriving as a flat JSON
       document:

           {
               "screen_resolution_horiz": 640,
               "screen_resolution_vert": 480,
               "cpu_mhz": 1434,
               "vendor": "Dell, Inc.",
               "product_name": "Latitude",
               "product_id": "E6510",
               "price": 500,
               "price_units": "USD"
           }

       and you wish to store this information in a table that is a simple,
       opaque name-value pair set:

           CREATE TABLE nvpairs(name TEXT PRIMARY KEY, value NOT NULL);

       Then issue the following command:

           cat json | json2sqlite --output-database my.db \
                                  --output-table nvpairs \
                                  --input-type object \
                                  --row-per key \
                                  --column-map '
                                      _KEY_   name
                                      _VALUE_ value
                                  '

       This will yield the following rows:

           screen_resolution_horiz,640
           screen_resolution_vert,480
           cpu_mhz,1434
           vendor,"Dell, Inc."
           product_name,Latitude
           product_id,E6510
           price,500
           price_units,USD

   Example 4 - arbitrary name-value pairs with a parent key
       Building on Example 3, consider a similar JSON document, but with
       many computer models represented, each one with its own model ID as
       a primary key:

           {
               "1034023432": {
                   "screen_res_horiz": 640,
                   "screen_res_vert": 480,
                   "cpu_mhz": 1434,
                   "vendor": "Dell, Inc.",
                   "product_name": "Latitude",
                   "product_id": "E6510",
                   "price": 500,
                   "price_units": "USD"
               },
               "4589734534": {
                   "screen_res_horiz": 1024,
                   "screen_res_vert": 768,
                   "cpu_mhz": 5000,
                   "vendor": "Dell, Inc.",
                   "product_name": "Inspiron",
                   "product_id": "5600",
                   "price": 750,
                   "price_units": "USD"
               },
               etc
           }

       and this time your schema maps a (model ID, attribute) pair to its
       value:

           CREATE TABLE nvpairs(
               model_id TEXT NOT NULL,
               name     TEXT NOT NULL,
               value    NOT NULL,
               PRIMARY KEY(model_id, name)
           );

       Here is how you'd ingest that data set:

           cat json | json2sqlite --output-database my.db \
                                  --output-table nvpairs \
                                  --input-type object \
                                  --nested \
                                  --row-per key \
                                  --column-map '
                                      _PARENT_KEY_ model_id
                                      _KEY_        name
                                      _VALUE_      value
                                  '

       This will emit the following records:

           INSERT INTO nvpairs VALUES('1034023432','screen_res_horiz',640);
           INSERT INTO nvpairs VALUES('1034023432','screen_res_vert',480);
           INSERT INTO nvpairs VALUES('1034023432','cpu_mhz',1434);
           INSERT INTO nvpairs VALUES('1034023432','vendor','Dell, Inc.');
           INSERT INTO nvpairs VALUES('1034023432','product_name','Latitude');
           INSERT INTO nvpairs VALUES('1034023432','product_id','E6510');
           INSERT INTO nvpairs VALUES('1034023432','price',500);
           INSERT INTO nvpairs VALUES('1034023432','price_units','USD');
           INSERT INTO nvpairs VALUES('4589734534','screen_res_horiz',1024);
           INSERT INTO nvpairs VALUES('4589734534','screen_res_vert',768);
           INSERT INTO nvpairs VALUES('4589734534','cpu_mhz',5000);
           INSERT INTO nvpairs VALUES('4589734534','vendor','Dell, Inc.');
           INSERT INTO nvpairs VALUES('4589734534','product_name','Inspiron');
           INSERT INTO nvpairs
           VALUES('4589734534','product_id','5600');
           INSERT INTO nvpairs VALUES('4589734534','price',750);
           INSERT INTO nvpairs VALUES('4589734534','price_units','USD');

BUGS
       Please submit bug reports to json2sqlite at aterlo.com.

       The memory cap is probably not enforced particularly well for
       certain kinds of documents. For example, you could probably get
       json2sqlite to consume perhaps 1.5 times its memory cap by providing
       an object with a single enormous key as input. For the moment, we
       are not going to sweat this.

       The design obviously limits json2sqlite's capabilities to certain
       kinds of JSON documents, and the authors welcome clean ways of
       handling a wider variety of document forms.

ACKNOWLEDGEMENTS
       json2sqlite is based on the lightweight JSON parser 'jsmn', for
       which we acknowledge and thank zserge; and of course SQLite, for
       which the world will forever be indebted to D. Richard Hipp.

COPYRIGHT
       json2sqlite is Copyright (c) Aterlo Networks, Inc. All rights
       reserved.

Aterlo Networks, Inc.             1 Feb 2015                  json2sqlite(1)
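For anyone who wants to experiment with the Example 4 mapping before the tool is published, the equivalent transformation can be approximated end to end with Python's standard json and sqlite3 modules. This is a sketch only, not json2sqlite itself, and it uses a trimmed-down version of the Example 4 document:

```python
import json
import sqlite3

doc = json.loads("""
{
  "1034023432": {"cpu_mhz": 1434, "product_name": "Latitude"},
  "4589734534": {"cpu_mhz": 5000, "product_name": "Inspiron"}
}
""")

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE nvpairs(
    model_id TEXT NOT NULL, name TEXT NOT NULL, value NOT NULL,
    PRIMARY KEY(model_id, name))""")

# --nested --row-per key: one row per (model ID, attribute) pair
with con:  # single transaction, committed only if no error occurs
    for model_id, attrs in doc.items():
        for name, value in attrs.items():
            con.execute("INSERT INTO nvpairs VALUES(?,?,?)",
                        (model_id, name, value))

for row in con.execute("SELECT * FROM nvpairs ORDER BY model_id, name"):
    print(row)
```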