Ole Tange writes:
> A couple of years ago I was asked to make GNU Parallel parse JSON
> elements. At that time it would have been a major task. With the --rpl
> and {= perlexpression =} infrastructure I think it is doable.
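Per record, such a replacement string would boil down to something like the sketch below: parse the record as JSON and walk a path to one field. This is my own Python illustration of the idea (the function name and the dotted-path syntax are mine; GNU Parallel provides nothing like this today, and a real {= =} implementation would be a Perl expression operating on $_):

```python
import json

def json_field(record, path):
    """Extract one value from a JSON record by a dotted path.

    Mimics what a hypothetical JSON-aware replacement string would
    have to compute per record. Each path segment is a dict key, or
    a list index when the current node is a list.
    """
    obj = json.loads(record)
    for seg in path.split("."):
        if isinstance(obj, list):
            obj = obj[int(seg)]
        else:
            obj = obj[seg]
    return obj

# Each input line is one record, as GNU Parallel would see it:
record = '{"book": {"author": "Ole", "title": "Parallel"}}'
print(json_field(record, "book.author"))  # Ole
```

Note that this only works once you already hold one complete record in hand, which is exactly the splitting problem discussed below.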
Both JSON and XML produce tree-structured data, while parallel expects
records. Even when you know a path expression that yields record-like
data, the tree parsing it requires can easily be the part that takes
the most time; it isn't easily parallelized in general, and none of
the practical implementations of parallel parsing that I'm aware of
use a shared-nothing model (i.e. they're all threaded).

> And if we are looking at JSON, we might look at XML as well. XML has
> XPath to address elements. So I imagine that you somehow tell where a
> new record starts (maybe using --recend --recstart?) and then use
> XPath to tell which elements to loop over - similar to {1} {2} and
> named columns for tsv:
>
> cat foo.xml | parallel --xml echo \
>   '{//draw:frame[@svg:width="28.2cm"]/draw:image/@xlink:href}'

So where do you envision that path expression being evaluated, given
that you cannot (in general) know where to split the input?

> Does JSON have a similar standardized way of addressing? JSONPath?
> http://goessner.net/articles/JsonPath/

JSONPath itself isn't an IETF standard; the closest standardized
relative is JSON Pointer, which uses a simpler addressing syntax:

http://tools.ietf.org/html/rfc6901
http://tools.ietf.org/html/rfc6902

> cat books.json | parallel --json echo '{$..book[-1:].author}'
>
> I have never worked heavily with either JSON or XML, so I would like
> some input on how you would like to see the syntax. Especially if
> there already is a standardized way (such as XPath) that we can use.
>
> I imagine some of you use GNU Parallel on both JSON files and XML
> files, and you pre-process them before feeding them to GNU Parallel.
> How should GNU Parallel work, so your pre-processing becomes
> unnecessary?

Given the way the parsing has to work, pre-splitting large data files
in either format is the only sane way to do things, at least for
I/O-bound jobs. The splits do not have to be physical if the data is
already in the correct order.
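The non-physical pre-splitting can be sketched as follows: scan once for record start/end markers (analogous to --recstart/--recend) and hand out byte offsets, so each worker seeks into the original file instead of receiving a copy. This is my own illustration of the idea, not how GNU Parallel's splitting is actually implemented:

```python
def record_offsets(data, recstart, recend):
    """Yield (start, end) byte offsets of records delimited by the
    given start/end markers. The splits are logical (offsets into the
    original buffer), not physical copies, so shared-nothing workers
    can each seek to their own slice of the file."""
    pos = 0
    while True:
        s = data.find(recstart, pos)
        if s < 0:
            return
        e = data.find(recend, s + len(recstart))
        if e < 0:
            return  # incomplete trailing record: stop
        e += len(recend)
        yield (s, e)
        pos = e

xml = b"<records><rec>a</rec><rec>b</rec></records>"
chunks = [xml[s:e] for s, e in record_offsets(xml, b"<rec>", b"</rec>")]
# chunks == [b"<rec>a</rec>", b"<rec>b</rec>"]
```

The catch the paragraphs above point at: this only works when the markers unambiguously delimit records, which a path expression evaluated over the whole tree does not guarantee.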
For instance, I get production data in XML as an exchange format, and
I know exactly in which sequence the XML writer picks the data from
the original database, which happens (not incidentally) to be the
order in which I need to process it most of the time. But then we
already receive the data as many small files rather than a single big
one, so that database loading can proceed in parallel.

Regards,
Achim.
--
+<[Q+ Matrix-12 WAVE#46+305 Neuron microQkb Andromeda XTk Blofeld]>+

SD adaptations for KORG EX-800 and Poly-800MkII V0.9:
http://Synth.Stromeko.net/Downloads.html#KorgSDada