Ole Tange writes:
> A couple of years ago I was asked to make GNU Parallel parse JSON
> elements. At that time it would have been a major task. With the --rpl
> and {= perlexpression =} infrastructure I think it is doable.
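Per record, such a replacement string would boil down to something like the sketch below: parse the record as JSON and walk a path to one field. This is my own Python illustration of the idea (the function name and the dotted-path syntax are mine; GNU Parallel provides nothing like this today, and a real {= =} implementation would be a Perl expression operating on $_):

```python
import json

def json_field(record, path):
    """Extract one value from a JSON record by a dotted path.

    Mimics what a hypothetical JSON-aware replacement string would
    have to compute per record. Each path segment is a dict key, or
    a list index when the current node is a list.
    """
    obj = json.loads(record)
    for seg in path.split("."):
        if isinstance(obj, list):
            obj = obj[int(seg)]
        else:
            obj = obj[seg]
    return obj

# Each input line is one record, as GNU Parallel would see it:
record = '{"book": {"author": "Ole", "title": "Parallel"}}'
print(json_field(record, "book.author"))  # Ole
```

Note that this only works once you already hold one complete record in hand, which is exactly the splitting problem discussed below.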
Both JSON and XML produce tree-structured data, while parallel expects
records. Even when you know a path expression that yields record-like
data, the tree parsing it requires can easily be the part that takes
the most time; it isn't easily parallelized in general, and none of
the practical implementations of parallel parsing that I'm aware of
use a shared-nothing model (i.e. they're all threaded).

> And if we are looking at JSON, we might look at XML as well. XML has
> XPath to address elements. So I imagine that you somehow tell where a
> new record starts (maybe using --recend --recstart?) and then use
> XPath to tell which elements to loop over - similar to {1} {2} and
> named columns for tsv:
>
> cat foo.xml | parallel --xml echo \
>   '{//draw:frame[@svg:width="28.2cm"]/draw:image/@xlink:href}'

So where do you envision that path expression being evaluated, given
that you cannot (in general) know where to split the input?

> Does JSON have a similar standardized way of addressing? JSONPath?
> http://goessner.net/articles/JsonPath/

JSONPath itself isn't an IETF standard; the closest standardized
relative is JSON Pointer, which uses a simpler addressing syntax:

http://tools.ietf.org/html/rfc6901
http://tools.ietf.org/html/rfc6902

> cat books.json | parallel --json echo '{$..book[-1:].author}'
>
> I have never worked heavily with either JSON or XML, so I would like
> some input on how you would like to see the syntax. Especially if
> there already is a standardized way (such as XPath) that we can use.
>
> I imagine some of you use GNU Parallel on both JSON files and XML
> files, and you pre-process them before feeding them to GNU Parallel.
> How should GNU Parallel work, so your pre-processing becomes
> unnecessary?

Given the way the parsing has to work, pre-splitting large data files
in either format is the only sane way to do things, at least for
I/O-bound jobs. The splits do not have to be physical if the data is
already in the correct order.
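The non-physical pre-splitting can be sketched as follows: scan once for record start/end markers (analogous to --recstart/--recend) and hand out byte offsets, so each worker seeks into the original file instead of receiving a copy. This is my own illustration of the idea, not how GNU Parallel's splitting is actually implemented:

```python
def record_offsets(data, recstart, recend):
    """Yield (start, end) byte offsets of records delimited by the
    given start/end markers. The splits are logical (offsets into the
    original buffer), not physical copies, so shared-nothing workers
    can each seek to their own slice of the file."""
    pos = 0
    while True:
        s = data.find(recstart, pos)
        if s < 0:
            return
        e = data.find(recend, s + len(recstart))
        if e < 0:
            return  # incomplete trailing record: stop
        e += len(recend)
        yield (s, e)
        pos = e

xml = b"<records><rec>a</rec><rec>b</rec></records>"
chunks = [xml[s:e] for s, e in record_offsets(xml, b"<rec>", b"</rec>")]
# chunks == [b"<rec>a</rec>", b"<rec>b</rec>"]
```

The catch the paragraphs above point at: this only works when the markers unambiguously delimit records, which a path expression evaluated over the whole tree does not guarantee.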
For instance, I get production data in XML as an exchange format, and
I know exactly in which sequence the XML writer picks the data from
the original database, which happens (not incidentally) to be the
order in which I need to process it most of the time. But then we
already receive the data as many small files rather than a single big
one, so that database loading can proceed in parallel.

Regards,
Achim.
--
+<[Q+ Matrix-12 WAVE#46+305 Neuron microQkb Andromeda XTk Blofeld]>+

SD adaptations for KORG EX-800 and Poly-800MkII V0.9:
http://Synth.Stromeko.net/Downloads.html#KorgSDada