On Wed, Nov 19, 2014 at 10:35 AM, Joshua M. Clulow <[email protected]> wrote:

> On 19 November 2014 09:13, Francois Billard <[email protected]>
> wrote:
> > we print the standardized column names in the 'zfs_do_list' function:
> > static char default_fields[] =
> >     "name,used,available,referenced,mountpoint";
> > The names of the properties must never change, or the code that uses
> > them will break.
>
> I agree, and this is what I was attempting to convey: that they be the
> standard, lowercase names as provided to "-o".  Sorry for the
> confusion.
>
> > Your suggestions about parseable values and human-readable values are
> > already reflected (the natural zfs way):
> >
> > with human readable values :
> >
> >> zfs list -J -o used | python -m json.tool
> > {
> >     "cmd": "zfs list -J -o used",
> >     "stdout": [
> >         {
> >             "used": "55K"
> >         },
> >         {
> >             "used": "56,5K"
> >         }
> >     ]
> > }
> >
> > and with byte values (-p option):
> >
> >> zfs list -pJ -o used | python -m json.tool
> > {
> >     "cmd": "zfs list -pJ -o used",
> >     "stdout": [
> >         {
> >             "used": "56320"
> >         },
> >         {
> >             "used": "57856"
> >         }
> >     ]
> > }
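[As a sketch of the consumer side: the following parses the `-p` output shown above. The field names and the "cmd"/"stdout" wrapper are taken from the example in this message, not from a committed schema; with `-p` the values are plain byte counts, so integer conversion works directly, whereas the human-readable "56,5K" form would need locale-aware parsing.]

```python
import json

# Sample output copied from the "zfs list -pJ -o used" example above.
payload = """\
{
    "cmd": "zfs list -pJ -o used",
    "stdout": [
        {"used": "56320"},
        {"used": "57856"}
    ]
}
"""

doc = json.loads(payload)

# With -p each "used" value is an exact byte count as a decimal string,
# so int() suffices; no unit suffixes or locale separators to handle.
sizes = [int(row["used"]) for row in doc["stdout"]]
print(sizes)  # -> [56320, 57856]
```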
>
> So, I actually think that "-J" should _imply_ (i.e. force) "-p".  It
> does not make sense to provide non-parseable values in a
> machine-readable format, especially if we are aiming for a strict,
> well-documented schema for the resultant output that we commit to
> supporting over time.
>
> > Concerning the streaming approach (a JSON object on each line): if you
> > do that, you will not have JSON output, but a block of text containing
> > several JSON objects, and you will have to parse it with regexps to
> > load each JSON object: very complicated.
>
> No, this is absolutely not true.  The format I'm referring to is often
> described as LDJSON or "Line Delimited JSON"[1], a kind of JSON
> streaming format[2].  Critically, no newline characters (the byte
> 0x0A) appear anywhere within a JSON record -- only _between_ records.
> This makes it trivial to read and parse in basically any modern
> environment:
>
>   - In C, use getline(3C) to read lines from a FILE * and then pass each
>     one into a JSON parsing library
>
>   - In node.js, use the "lstream" module to read one line at a time and
>     JSON.parse()
>
>   - In shell, use a sed(1)-like utility that understands line-delimited
>     JSON, like json[3] or jq[4]; these make it trivial to manipulate
>     each JSON object into some filtered or transformed version as part
>     of a shell pipeline
>
>   - Other environments such as Python, Ruby and Java all have similar
>     library routines to read one line at a time from a file or other
>     input source; each line is then run through the JSON parser to
>     produce an object describing the current filesystem or other record
>
> [1] http://en.wikipedia.org/wiki/Line_Delimited_JSON
> [2] http://en.wikipedia.org/wiki/JSON_Streaming
> [3] https://github.com/trentm/json
> [4] http://stedolan.github.io/jq
>
> > A well-formed JSON document must have a root element (such as a list
> > or dict), which is easily loaded by the code that will use the JSON
> > output on the server side (Python, Java, ...)
>
> In contrast, each _line_ in an LDJSON stream is a well-formed JSON
> object containing just the data pertaining to the current record.
> This enables the consumer to work on one record at a time, if that is
> what they require, or to collate incoming records into whatever
> application-specific data structure makes sense to them.  Of the
> utmost importance, it requires neither zfs(1M) nor the application
> consuming the stream to produce (and subsequently parse) all of the
> data at one time.
>

I'm not sure I agree that it's of "utmost importance", but this does seem
like it could be a nice performance enhancement over the existing interface.

--matt


>
> This is akin to the difference between scandir(3C) and readdir(3C).
> The former will load the entire directory into memory, sort it, then
> return it in one result to the user.  That's fine for small
> directories, but for larger directories with millions of files it can
> take a very long time, and consume a considerable amount of memory and
> cycles in doing so.  Using an interface like scandir(3C) has the
> unfortunate result that processes with memory constraints (e.g. Java
> with a fixed VM heap cap, or Node.js with its ~1.5GB heap limitation)
> are unable to process directories beyond a certain size at all.  In
> contrast, a streaming interface like readdir(3C) allows the program to
> read a few directory entries, do some processing, and then throw that
> storage away.
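[The same buffered-vs-streaming contrast exists in Python's standard library, which makes it easy to demonstrate: `os.listdir` materialises the whole directory as a list up front (roughly analogous to scandir(3C)), while `os.scandir` yields entries one at a time (roughly analogous to readdir(3C)), so each entry can be processed and dropped. This is only an analogy to the C interfaces, not their implementation.]

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    for name in ("a", "b", "c"):
        open(os.path.join(d, name), "w").close()

    # Buffered: the entire directory listing is built in memory at once.
    everything = os.listdir(d)

    # Streaming: os.scandir() is an iterator; entries are yielded one at
    # a time and can be discarded after processing, bounding memory use.
    streamed = sorted(entry.name for entry in os.scandir(d))

print(sorted(everything) == streamed)  # -> True
```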
>
> By using LDJSON for the output here, we are allowing for more flexible
> usage of the tooling -- especially on large systems with thousands or
> tens of thousands of filesystems, volumes or snapshots.  I speak from
> painful experience processing large JSON datasets, from on the order of
> 50MB up to a couple of gigabytes, often in programming
> environments that simply cannot parse and store the entire object tree
> in memory.
>
>
> Cheers.
>
> --
> Joshua M. Clulow
> UNIX Admin/Developer
> http://blog.sysmgr.org
>
_______________________________________________
developer mailing list
[email protected]
http://lists.open-zfs.org/mailman/listinfo/developer
