The new GEDCOM parser

Ron Savage Sun, 04 Nov 2012 23:09:25 -0800

Hi

The new GEDCOM parser

    This document is a collection of ideas which have been percolating in
    my mind for a long time.


    Comments welcome.

Ideas
  Module name
    Genealogy::Gedcom::Parser.

    A place-holder, Genealogy::Gedcom
    <http://metacpan.org/release/Genealogy-Gedcom>, is already on CPAN.

    Note: This module was written before the new, major tools now
    available were released. See Tools below.


  ETA
    There is no ETA for the parser.

    However, certain Perl-based tools are now available which will make
    coding a simple task. See Tools below.

    See also 'Famous Last Words' :-).

  UTF-8

    The code will accept input files in utf-8, and generate files
    containing utf-8 characters.
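
    A minimal sketch of utf-8 input and output using core Perl I/O
    layers. The temporary file and the sample NAME line are made up for
    illustration; nothing here is taken from the planned parser.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample line containing two non-ASCII characters.
my $name = "1 NAME Zo\x{EB} /M\x{FC}ller/";

my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Write the line, encoding characters to utf-8 bytes on the way out.
open my $out, '>:encoding(UTF-8)', $path or die "Cannot write $path: $!";
print {$out} "$name\n";
close $out or die "Cannot close $path: $!";

# Read it back, decoding utf-8 bytes to characters on the way in.
open my $in, '<:encoding(UTF-8)', $path or die "Cannot read $path: $!";
chomp(my $line = <$in>);
close $in;

binmode STDOUT, ':encoding(UTF-8)';
print "$line\n";
```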


  Apache and mod_perl

    These will not be required. I only mention these because references
    to them appear in the Gedcom.pm distro.


  Logging

    The code will have a built-in logger, so debugging, e.g., can be
    turned on with a parameter to new().


    This logger will use Log::Handler. See Tools below.
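
    A rough sketch of how such a logger might be set up with
    Log::Handler. The variable $maxlevel stands in for whatever
    parameter new() ends up accepting; that name is an assumption, not
    part of the plan above.

```perl
use strict;
use warnings;
use Log::Handler;

# Assumption: this value would arrive as a parameter to new().
my $maxlevel = 'debug';

my $logger = Log::Handler->new;

# Send messages between 'error' and $maxlevel to the screen.
$logger->add(
    screen => {
        maxlevel       => $maxlevel,
        minlevel       => 'error',
        message_layout => '%m',
    },
);

$logger->debug('Parsing started');
$logger->error('Something went wrong');
```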

  Sub-classing
    Sub-classing the main module will be trivial, and samples will be
    provided.

    Sub-classing will be done with Hash::FieldHash. See Tools and the
    FAQ below.
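
    A minimal sketch of a base class and a sub-class built with
    Hash::FieldHash. The package names, the accessors 'input_file' and
    'maxlevel', and the file name are all hypothetical; the real module
    may look quite different.

```perl
use strict;
use warnings;

package My::Parser;    # hypothetical base class

use Hash::FieldHash qw(:all);

# Each fieldhash declaration also creates an accessor of the given name.
fieldhash my %input_file => 'input_file';
fieldhash my %maxlevel   => 'maxlevel';

sub new {
    my ($class, %arg) = @_;
    my $self = bless {}, $class;

    $self->maxlevel('notice');                 # default
    $self->$_($arg{$_}) for sort keys %arg;    # dies on an unknown parameter

    return $self;
}

package My::Parser::Verbose;    # hypothetical sub-class

our @ISA = ('My::Parser');

# Change one default and leave everything else to the parent.
sub new {
    my ($class, %arg) = @_;
    return $class->SUPER::new(maxlevel => 'debug', %arg);
}

package main;

my $parser = My::Parser::Verbose->new(input_file => 'sample.ged');
print $parser->input_file, ' ', $parser->maxlevel, "\n";
```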


  Grammars and grammar generators
    Like Gedcom.pm, the code will read a GEDCOM grammar in BNF from a file.
    I'll run this phase before shipping the module, so you don't have to.
    See Tools below, specifically MarpaX::Simple::Rules.

    Basically, this means the startling complexity of the code in
    Gedcom.pm is a thing of the past.


  Operating the parser
    Using Marpa, callbacks are triggered when input is recognized.

    So, when lines like these are encountered:

            1 @<XREF:FAM>@ FAM
            2 RIN <AUTOMATED_RECORD_ID>

    Marpa will automatically call the callback attached to each tag.

    Callbacks will probably have names like 'do_fam' and 'do_rin', i.e.
    of the format 'do_$tag'.

    The parameters passed to the callback include the non-tag text on
    the line.

    Default callbacks for all tags will be provided, each one doing its
    part in parsing the parameters to the tag, and storing the result.
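
    The 'do_$tag' naming scheme can be sketched in plain Perl. Here a
    regular expression and can() stand in for Marpa, which would
    trigger the callbacks itself; the class name, the 'do_default'
    fallback, the argument order and the sample lines are assumptions
    made for illustration.

```perl
use strict;
use warnings;

package My::Gedcom::Callbacks;    # hypothetical class name

sub new { return bless {seen => []}, shift }

# One callback per tag; each receives the non-tag parts of its line.
sub do_fam {
    my ($self, $level, $xref, $text) = @_;
    push @{ $self->{seen} }, "family record $xref";
    return;
}

sub do_rin {
    my ($self, $level, $xref, $text) = @_;
    push @{ $self->{seen} }, "record id $text";
    return;
}

# Fallback for tags without a dedicated callback.
sub do_default {
    my ($self, $level, $xref, $text, $tag) = @_;
    push @{ $self->{seen} }, "unhandled tag $tag";
    return;
}

# Stand-in for the parser: split the line, then call 'do_$tag'.
sub dispatch {
    my ($self, $line) = @_;
    my ($level, $xref, $tag, $text) =
        $line =~ /^\s*(\d+)\s+(?:(\@[^\@\s]+\@)\s+)?(\w+)(?:\s+(.*))?$/
        or die "Unrecognised line: $line\n";
    my $method = $self->can('do_' . lc $tag) || $self->can('do_default');
    return $self->$method($level, $xref, $text, $tag);
}

package main;

my $callbacks = My::Gedcom::Callbacks->new;
$callbacks->dispatch($_) for '0 @F1@ FAM', '1 RIN 42', '1 CHAN';
print "$_\n" for @{ $callbacks->{seen} };
```

    Looking the method up with can() means a sub-class can override a
    single tag's handling by defining just that one 'do_$tag' method.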


    The result will probably be stored in a tree. See Tools below,
    specifically Tree::DAG_Node.
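
    A minimal sketch of storing parsed lines in a Tree::DAG_Node tree,
    using each line's level number to find its parent. The sample data,
    the synthetic 'root' node and the attribute names are assumptions;
    the eventual storage layout is not decided above.

```perl
use strict;
use warnings;
use Tree::DAG_Node;

# Made-up lines in the usual "level [xref] tag [value]" layout.
my @lines = (
    '0 @F1@ FAM',
    '1 HUSB @I1@',
    '1 MARR',
    '2 DATE 1 JAN 1900',
    '0 TRLR',
);

my $root = Tree::DAG_Node->new({name => 'root'});
my @path = ($root);    # $path[$n] is the current parent for a level-$n line

for my $line (@lines) {
    my ($level, $xref, $tag, $value) =
        $line =~ /^(\d+)\s+(?:(\@[^\@\s]+\@)\s+)?(\w+)(?:\s+(.*))?$/
        or die "Unrecognised line: $line\n";
    die "Line skips a level: $line\n" if $level > $#path;

    my $node = Tree::DAG_Node->new({
        name       => $tag,
        attributes => {level => $level, xref => $xref, value => $value},
    });

    $#path = $level;    # drop nodes deeper than the new parent
    $path[$level]->add_daughter($node);
    push @path, $node;
}

# Print the tree, indenting each node by its depth.
$root->walk_down({
    callback => sub {
        my ($node, $options) = @_;
        my $value = $node->attributes->{value};
        print '  ' x $options->{_depth}, $node->name,
            defined $value ? " $value" : '', "\n";
        return 1;    # keep descending
    },
    _depth => 0,
});
```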

  Database support
    A DBD::SQLite database is possible.

  Tools
    o Hash::FieldHash
        Simplifies class-building.

        As for the obvious question, why not use Moose, see the FAQ below.

    o Log::Handler
        Simplifies logging.

    o Marpa::R2
        This is the modern way to do parsing.

        Home page <http://jeffreykegler.github.com/Marpa-web-site/>.

        Jeffrey's blog about Marpa
        <http://jeffreykegler.github.com/Ocean-of-Awareness-blog/>.

        My recent article about lexing and parsing with Marpa
        <http://www.perl.com/pub/2012/10/an-overview-of-lexing-and-parsing.html>.

    o MarpaX::Simple::Rules
        This module reads a grammar in BNF and generates a Marpa grammar.

        Hence it will read a BNF version of the GEDCOM spec and output
        the matching Marpa grammar.


    o Tree::DAG_Node
        The most sophisticated tree-handling code on CPAN. I've recently
        become co-maintainer of this module.

FAQ
  Why did you choose Hash::FieldHash over Moose?
    My policy is to use the light-weight Hash::FieldHash for stand-alone
    modules and Moose for applications.

  Why did you choose to store the data in a tree?

    A GEDCOM file's structure can be viewed as a tree, so my initial
    plan is to store the data likewise.



--
Ron Savage
http://savage.net.au/
Ph: 0421 920 622
