Re: [Kicad-developers] [rfc] actual sexpression parsing

David Godfrey Sat, 19 Dec 2015 19:52:39 -0800

Hi,

I don't often post here, but I will add my ~30 years of experience with binary file formats.
THEY ARE EVIL EVIL THINGS !!

They tend to be extremely fragile, any change and something is likely to break.
They require parsers that suck to maintain
In most cases general purpose parsers *CAN NOT* be used, requiring a custom parser.
it is *VERY* difficult to keep compatibility with an older version of the format.
Essentially you need to keep multiple copies of the parser and have some form of VersionID that can be used to select between parsers.
To all intents it is *IMPOSSIBLE* to handle forward compatibility.
ie: open a file with version 1.1 that was written with version 1.1a
A permanent and accurate copy of documentation is required to be maintained for *EVERY* version of the parser ever written
Parser VersionID magic must uniquely identify the parser version and be the first entry in the file.
NOTE: other locations are possible, but cause significant problems with extracting the version
The more involved the protocol the worse all of these issues get.
Documentation is hard to write, and even harder to read and understand sufficiently to code against.
Debugging read/write issues is an absolute nightmare.
You either have to provide a reference to test against by either
- hand encode or decode a file
- have an independent programmer write a "clean room" implementation of the parser
  This requires the previously mentioned documentation to be correct for this version
  It can introduce additional bugs (you now have 2 parsers to debug)
Many potential developers of addons or core code may
- be scared off by the complexity
- accidentally introduce bugs or incompatibilities that don't show up until a user looses data
Inadvertently force full or partial Vendor Lock In.
If a protocol (file format) is
- too hard to implement
- too hard to keep up with changes
- too prone to breakage (fragile)
- too obscure (poorly documented or the documentation is not easily available)
Then other vendors won't put in the effort to support the protocol (file format)

I could go on with a lot more points but you get the picture.

Binary file formats made a lot of sense back in the days of

low speed serial communications
very small storage devices
expensive data links (anyone remember gprs @ 10c per kilo byte?)
- Expensive GPRS lead to WAP which was (roughly) a tokenised binary HTML/XML
  It was not widely used due to some of the above problems, and it also scared off content developers
memory constrained systems like
- Old PC's (486 and earlier)
- old microcontrollers were short on memory and clock speed
- some modern microcontrollers can be affected here, but only the lower end ones
slow radio links (bluetooth and wifi are not slow)

They do make sense when used internally to a compression algorithm or an encryption scheme.
They don't make sense in (almost) all other cases today.
(No flames here please, I am not trolling, just making a very generalized statement)

For a format like .png binary is not so bad, the format is unlikely to change much if ever (even so there are at least 6 different png formats out there, some of them ascii and some binary), but it is a simple format to describe and implement. Any program that works with png's needs to implement support for all versions of png, or clearly explain to the user what is and is not supported with appropriate error messages. Any png programs that don't support all versions tend to fall into disuse and die a slow death.

Text based formats on the other hand, if well designed, are (to a fair extent)

human readable
self documenting (although good documentation is *ALWAYS* recommended
general purpose parsers can be used allowing our code to focus on using the information, not extracting it
tolerant of extra nodes (features)
tolerant of missing nodes
often a newer version of a parser can parse older versions of a file without problem.
It is trivial for a newer version to cater for specific version based variations in a protocol.
If the protocol is well designed new versions will only add nodes to the previous version, never alter the way data is stored.
- This allows a newer parser to parse any historical version file.
- In almost all cases an older version parser can parse a newer file, but
  - any nodes that are unknown will just be skipped.
  - Skipped nodes may or may not be a problem for the way a use sees the result,
    but there should be a lockout preventing the file from being saved.
    If the lockout can be overridden, then we should force a filename change
  - Ideally any unknown nodes should be retained in the tree and written back when the file is saved.
    This can get complicated especially if associated nodes have been modified
Easily debugged as the human readable nature allows direct inspection to validate what has been read/written

I have been writing software for PC's and microcontrollers since the early 1980's much of the time dealing with communications between different pieces of hardware and invariably the binary protocols take up 90% of the debugging time and way too much of the documentation and coding time.

I'll climb down from my soapbox now, but I hope I have assisted people's understanding of Wayne's refusal to use binary formats.

Regards
David G

On 19/12/15 05:46, Wayne Stambaugh wrote:

Mark,

I haven't had time to look at your commit so I'm not going to comment on
that until I do.  I don't know when I'll have time as I have a lot of
stuff to do and I'm traveling over the holidays so my review time will
be limited.  In the mean time I will comment on some of the things I've
read in this thread.


Binary file formats are not going happen while I'm project leader. I
have 30 years of less than positive experience with them and I'm not
about to start using them now.

Be careful with the C++ << operators. They may not do the correct thing
with floating point numbers. There may also be some overhead that the
current design does not have.  I will not accept a performance hit on
file parsing or any significant increase in formatted file size.

On 12/17/2015 10:32 PM, Mark Roszko wrote:

So awhile back, Wayne said to use sexpr for something I wanted to do.
Then I looked at the sexpr parsing and said NOPE.


Why NOPE? Because the current parsing regime is basically Visual Basic
6 parser written in a modern language with micro-optimizations meant
for someone running a Windows 3.1 computer with 5.25" storage drives.


This patch is a propsed sexpr parser that parses sepxr like its sexpr
and not a parenthesis format. Because if you are parsing with things
like NeedLeft() and NeedRight(), you are parsing it wrong.

Especially when you see something like this:

            case T_descr:
                if( sawDesc )
                    in->Duplicate( tok );
                sawDesc = true;
                in->NeedSYMBOLorNUMBER();
                row.SetDescr( in->FromUTF8() );
                break;


but then you realize that fp_lib_table has quoted strings for MOST
descr entries. And you question why the method is called
NeedSYMBOLorNumber when its a string with quotes. Then you go deeper
and realize the parser may as well be a csv reader/writer. You either
parse like sexpressions or you stop calling it sexpressions.



New proposed system:
1. Reads and generates an in memory tree structure of data as it
should be, i.e. lists, strings, numbers, etc.

I'm ok with the memory tree structure as long as it doesn't add any
significant parsing or conversion to internal object overhead.

2. Helpers to pull out each item as need be

OK

3. Backwards compatible

Backwards compatibility is not optional.

4. Doesn't do silly keywords micro optimization at compile time. You
do a string comparison to convert the value to integer anyway, using
if/else is no different each time. Kicad isn't parsing gigabyte sized
files nor hundreds of files, this optimization really isn't worth the
overhead in maintenance.

This change will probably kill your performance as the file format gets
new features which it will.  I could be wrong but there is not going to
be much faster token look up than integers.  Before you decide this
optimization isn't worth it, you need to get an 8 layer+ high density
board file from someone and do some testing.

5. Generate saved files from in memory tree structures, this will
avoid all possible formatting irregularities and differences because
someone handwrote unrolling all the data members.

I need to see how your doing this because keeping both the memory tree
and the board objects in memory doesn't make a lot of sense to me.

6. Avoid things like " ${KIGITHUB}/Air_Coils_SML_NEOSID.pretty" being
defined as a symbol instead of a string in the future.
7. Explicit definitions of symbols and strings. Strings are always
quoted. Period. No silly auto-detection logic.


So my first goal is to have 1:1 parity with the existing stuff for
kicad files for both reading and writing.


Benchmarks:
Old fp-lib-table read: 1ms
New fp-lib-table read: 2ms

Here's the actual commit for fp-lib-table:
https://github.com/marekr/kicad-sexpr/commit/9367f469be69962d14671411eddd6fd759ace1f2

Not expecting anyone to compile it or anything, more input that
anything. Yes its messy as its an initial proof of concept.



Yes, string parsing and writing isn't escaping properly, TBD (easy, just lazy).
There would probably be a smaller kicad_sexpr wrapper to implement
common sexpr pattern helpers such as the "list key-value pair" that's
used.



Also played around with .kicad-pcb but its not committed:
Old read: ~350ms
New read: ~230ms


I await the abuse.

_______________________________________________
Mailing list: https://launchpad.net/~kicad-developers
Post to     : kicad-developers@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kicad-developers
More help   : https://help.launchpad.net/ListHelp


_______________________________________________
Mailing list: https://launchpad.net/~kicad-developers
Post to     : kicad-developers@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kicad-developers
More help   : https://help.launchpad.net/ListHelp

_______________________________________________
Mailing list: https://launchpad.net/~kicad-developers
Post to     : kicad-developers@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kicad-developers
More help   : https://help.launchpad.net/ListHelp

Re: [Kicad-developers] [rfc] actual sexpression parsing

Reply via email to