Re: CLI Performance usage...

Sloane, Brandon Fri, 08 Nov 2019 11:23:42 -0800

> By stream mode, do you mean using the Java API and not the CLI implementation? 
>  Meaning using Input/Output streams instead of the command line interface?


The Daffodil CLI has a "--stream" flag you can use with parse and unparse.

When used in parse mode, Daffodil will first parse the input as normal. When it 
reaches the end of the schema, if there is still more data in the input, 
Daffodil will start a new parse from the point in the input-stream where the 
previous part left off. The infosets will be output as they are computed with a 
NUL character separating them.

For example, you could do:

> cat infile1 infile2 | daffodil parse --stream -s schema.dfdl.xsd

which will output 2 infosets. Depending on what program is feeding the pipe, 
you do not need to have all your inputs available when you do this (Daffodil 
will block on stdin if it reaches the end before the writer closed the pipe).
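
On the consuming side, the NUL-separated output can be split back into individual infosets as it arrives. The following is a minimal sketch, not part of Daffodil: the function name `split_infosets` and the chunk size are my own choices, and it assumes the consumer reads Daffodil's stdout as a binary stream. It accepts output with or without a NUL after the last infoset.

```python
import io
from typing import Iterator


def split_infosets(stream: io.BufferedIOBase, chunk_size: int = 65536) -> Iterator[bytes]:
    """Yield each NUL-separated infoset from `stream` as soon as it is complete."""
    pending = b""
    while True:
        # read1 returns whatever is available (up to chunk_size) instead of
        # waiting for a full chunk, so infosets are yielded as they arrive.
        chunk = stream.read1(chunk_size)
        if not chunk:
            break  # writer closed the pipe
        pending += chunk
        # Everything before a NUL byte is a complete infoset; the last piece
        # may be a partial infoset and is kept for the next iteration.
        *complete, pending = pending.split(b"\x00")
        for infoset in complete:
            if infoset.strip():
                yield infoset
    # Emit a final infoset that was not followed by a NUL, if any.
    if pending.strip():
        yield pending


if __name__ == "__main__":
    # Stand-in for the bytes Daffodil would write to stdout in stream mode.
    sample = io.BytesIO(b"<a>1</a>\x00<a>2</a>\x00")
    for infoset in split_infosets(sample):
        print(infoset.decode("utf-8"))
```

In a real pipeline you would pass `sys.stdin.buffer` (or the `stdout` of a subprocess) instead of the `BytesIO` stand-in.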

For a format such as CSV, this can be a bit tricky to do, as you generally 
detect the end of document only by the fact that it is at the end of input. If 
you want to take this approach, you would probably need to create a wrapper 
format where you, for example, prefix the length of the document. Then you 
would update your schema to first parse the length, then treat the entire CSV 
file as a fixed length format. You can make the length field a hidden group so 
that consumers of the infoset do not need to be updated.
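
To illustrate the wrapper idea, here is a minimal sketch of the producer side. The framing itself is an assumption on my part: an 8-digit, zero-padded ASCII decimal byte count (`PREFIX_DIGITS`) in front of each document. Any length encoding would do, as long as the DFDL schema parses the same field. The `unframe` helper is only there to show that the document boundaries become recoverable; in practice the schema would play that role.

```python
import io
from typing import BinaryIO, Iterator

PREFIX_DIGITS = 8  # hypothetical width of the ASCII decimal length field


def frame(document: bytes) -> bytes:
    """Prefix `document` with its byte length as zero-padded ASCII digits."""
    if len(document) >= 10 ** PREFIX_DIGITS:
        raise ValueError("document too large for the length field")
    return f"{len(document):0{PREFIX_DIGITS}d}".encode("ascii") + document


def unframe(stream: BinaryIO) -> Iterator[bytes]:
    """Yield each framed document, mirroring what the schema would parse."""
    while True:
        header = stream.read(PREFIX_DIGITS)
        if not header:
            return  # clean end of input
        if len(header) != PREFIX_DIGITS:
            raise ValueError("truncated length field")
        length = int(header.decode("ascii"))
        body = stream.read(length)
        if len(body) != length:
            raise ValueError("truncated document")
        yield body


if __name__ == "__main__":
    docs = [b"a,b\n1,2\n", b"c,d\n3,4\n5,6\n"]
    wire = b"".join(frame(d) for d in docs)
    # Both documents come back intact even though they were concatenated.
    print(list(unframe(io.BytesIO(wire))) == docs)
```

The framed bytes are what you would write into the pipe feeding `daffodil parse --stream`.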

> The "parse time" time value, is that measured as the amount of time it takes 
> to compile the parser, parse the data according to the schema, and output the 
> data to the console or file?

Steve probably knows this better than I do, and he thinks it is just the time 
to parse the data. I would caution that, because of how Daffodil is designed, 
it is possible that some of the work for compilation is actually deferred until 
parse time. Pre-compiling the parser forces Daffodil to fully compile it before 
starting the parse, which may be why we have seen pre-compiled parsers score 
better.
________________________________
From: Rose, Rob P <[email protected]>
Sent: Friday, November 8, 2019 12:41 PM
To: [email protected] <[email protected]>
Cc: Hanna, Maria <[email protected]>
Subject: RE: CLI Performance usage...

Brandon,

        Thank you so much for the useful information!  It is a huge help!

        I have a follow up question:
        You mention "I would suggest either using daffodil in stream mode, or 
using it as a library as part of a long-lived process".
                          By stream mode, do you mean using the Java API and not 
the CLI implementation?  Meaning using Input/Output streams instead of the command 
line interface?


        Second question:
                The "parse time" time value, is that measured as the amount of 
time it takes to compile the parser, parse the data according to the schema, 
and output the data to the console or file?

Thanks so much again!
Rob


-----Original Message-----
From: Sloane, Brandon <[email protected]>
Sent: Friday, November 8, 2019 11:35 AM
To: [email protected]
Cc: Hanna, Maria <[email protected]>
Subject: Re: CLI Performance usage...

I am not familiar with how daffodil's performance stats are reported 
(particularly how the average rate is faster than the max rate).

However, the biggest bottleneck for Daffodil performance is schema 
compilation. If performance is a concern, I would recommend pre-compiling your 
parser using the `daffodil save-parser` command. You can then use the 
pre-compiled parser using the '-P' flag instead of '-s'. Note that Daffodil 
does not have a stable format for pre-compiled parsers, so the Daffodil version 
used to save the parser would need to match the version used to run it.
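
As a rough sketch of that workflow, the wrapper below saves the parser once and reuses it on later runs. The helper names and file names are hypothetical, and the argument order (output file as the last argument to `save-parser`, input file as the last argument to `parse`) is my assumption about the CLI; check `daffodil save-parser --help` and `daffodil parse --help` on your version.

```python
import subprocess
from pathlib import Path
from typing import List


def save_parser_cmd(schema: str, saved: str) -> List[str]:
    """Command that compiles `schema` and writes the parser to `saved`."""
    return ["daffodil", "save-parser", "-s", schema, saved]


def parse_cmd(saved: str, infile: str) -> List[str]:
    """Command that parses `infile` with the pre-compiled parser (-P, not -s)."""
    return ["daffodil", "parse", "-P", saved, infile]


def parse_with_saved_parser(schema: str, saved: str, infile: str) -> bytes:
    """Save the parser if it does not exist yet, then parse and return stdout."""
    # The saved format is not stable across Daffodil versions, so delete
    # `saved` whenever Daffodil is upgraded to force a fresh save.
    if not Path(saved).exists():
        subprocess.run(save_parser_cmd(schema, saved), check=True)
    result = subprocess.run(parse_cmd(saved, infile), check=True,
                            stdout=subprocess.PIPE)
    return result.stdout


if __name__ == "__main__":
    # Only prints the commands; call parse_with_saved_parser to run them.
    print(" ".join(save_parser_cmd("demo/csv.dfdl.xsd", "csv.bin")))
    print(" ".join(parse_cmd("csv.bin", "demo/test_file.csv")))
```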

A similar issue (which wouldn't be captured by daffodil performance) is startup 
time. Since Daffodil runs on the JVM, just starting it takes a substantial 
amount of time (`time daffodil --help` is about 800ms on my development 
system). On your actual system, I would suggest either using daffodil in stream 
mode, or using it as a library as part of a long-lived process. If you do 
either of these, then pre-compiling would help reduce your startup time, but 
would not offer any additional benefits to throughput.
________________________________
From: Rose, Rob P <[email protected]>
Sent: Friday, November 8, 2019 10:45 AM
To: [email protected] <[email protected]>
Cc: Hanna, Maria <[email protected]>
Subject: CLI Performance usage...


All,



                I am trying to port the Apache daffodil libraries onto a cross 
domain guard that runs in a very small form factor.



                We have cross compiled OpenJDK 12 for the aarch64 (ARM 
processor) and loaded into memory.

                I have built the source using sbt (sbt daffodil-cli/stage) and 
loaded the necessary jars into memory on the board.



                Here are some of the specifics of the hardware platform running 
on this guard:

*         2 GB DDR RAM

o   Memory Management Unit (MMU) Page Tables used in this system are one-to-one 
mapping.

*         ARM Cortex A53 4 Core Processor



Here are some of the specifics for the software components

*         SELinux

*         Busybox



Here are some of the performance numbers we are seeing from the performance 
testing:



                NOTE:  These tests were run using the attached csv file and the 
attached schema





# ./daffodil performance -s demo/csv.dfdl.xsd -N 100 -t 5 demo/test_file.csv

total parse time (sec): 2.443824

*         What does the total parse time value mean ?

*         How is it calculated ?

*         Is this poor performance?

min rate (files/sec): 1.535568

*         What is the min rate (files/sec)?  What does this mean?

max rate (files/sec): 29.460340

*         What is the max rate (files/sec)?  What does this mean?

avg rate (files/sec): 40.919485

*         What is the avg rate (files/sec)?  What does this mean?



*         Do you have any suggestions how to improve parse/unparse speed on an 
ARM processor?



*         Any suggestions are greatly appreciated!







# ./daffodil performance -s demo/csv.dfdl.xsd -N 200 -t 5 demo/test_file.csv

total parse time (sec): 3.175893

min rate (files/sec): 1.520884

max rate (files/sec): 107.223428

avg rate (files/sec): 62.974409



# ./daffodil performance -s demo/csv.dfdl.xsd -N 300 -t 5 demo/test_file.csv

total parse time (sec): 3.656587

min rate (files/sec): 1.551273

max rate (files/sec): 180.155186

avg rate (files/sec): 82.043712





# ./daffodil performance -s demo/csv.dfdl.xsd -N 1000 -t 5 demo/test_file.csv

total parse time (sec): 5.602554

min rate (files/sec): 1.459977

max rate (files/sec): 301.144046

avg rate (files/sec): 178.490026







Sincerely,



Rob Rose

Sr. Principal Software Engineer

General Dynamics Mission Systems

Office: 508-880-1866

Cell:      508-341-5216



This message and/or attachments may include information subject to GD Corporate 
Policies 07-103 and 07-105 and is intended to be accessed only by authorized 
recipients.  Use, storage and transmission are governed by General Dynamics and 
its policies. Contractual restrictions apply to third parties.  Recipients 
should refer to the policies or contract to determine proper handling.  
Unauthorized review, use, disclosure or distribution is prohibited.  If you are 
not an intended recipient, please contact the sender and destroy all copies of 
the original message.
