Hi Ron,

I totally agree with Jeremy that the parser is the key. Obviously you need to supply it with test callbacks for verification as you build it. But once you have that, it should be straightforward for people (or you) to hook it into, say, Graph, or load it into a database, or whatever.

I think having a reader and a writer that conform to a grammar is the place to start.

-Steve

On 11/6/2012 4:49 AM, Ron Savage wrote:
Hi Steve

On 06/11/12 16:11, Stephen Woodbridge wrote:
Hi Ron,

I work with graphs for doing vehicle routing so have some familiarity
with them.

Good.

I think this is the family of graph tool to look at using:

http://search.cpan.org/~jhi/Graph/

Yes, that's the one I said earlier /I/ had trouble with. See a previous msg.

Perhaps I will need to adopt it despite my recent experience!

and I think these work with:

Graph::Writer
Graph::Reader

Yes they do. The author of Graph recommends them.

And it is likely that this can trivially be integrated with graph
rendering tools like GraphViz and/or one of the other tools using this:

http://search.cpan.org/~neilb/Graph-ReadWrite-2.03/lib/Graph/Writer/Dot.pm
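
For instance, dumping a Graph object as a Dot file for GraphViz should look
roughly like this (an untested sketch; the vertex names are invented, and I'm
assuming the write_graph($graph, $file) interface the Graph::Writer docs
describe):

use strict;
use warnings;

use Graph;
use Graph::Writer::Dot;

# Individuals and families as vertices, relationships as edges.
my $g = Graph->new;   # Directed by default.

$g->add_edge('@I1@', '@F1@');   # Individual 1 belongs to family 1.
$g->add_edge('@I2@', '@F1@');
$g->add_edge('@F1@', '@I3@');   # Family 1 has child individual 3.

# All Graph::Writer subclasses provide write_graph($graph, $file).
Graph::Writer::Dot->new->write_graph($g, 'family.dot');

# Then e.g.: dot -Tsvg family.dot > family.svg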


I also found a couple of modules that might be interesting to play with:

Graph::Similarity
Graph::Matching

Clearly work has been done to write add-ons for Graph.

Perhaps these can be applied to the task of matching and merging overlapping
GEDCOM files.

Interesting idea...

And this module can be used to save and restore Graph structures in
relational databases.

I did not check that deeply into their capabilities.

Ok, this is starting to look very interesting. I guess it all starts
with being able to move data to/from Graph structures and GEDCOM files.

Ron, you got this coded yet? ;)

No. I seem to be spending way too much time answering emails. Hahahaha.

Perhaps the familiarity you mention above will provide you with the
wherewithal to beat me to it :-)). I look forward to your coming upload
to CPAN....

-Steve

On 11/5/2012 10:21 PM, Ron Savage wrote:
Hi John

On 06/11/12 12:32, John Washburn wrote:
I agree with Mr. Woodbridge. A directed graph is a better model.

I agree. Conceptually it must be a graph.

Sure. The question is: Is there a module on CPAN which will do the job?

I was a bit careless with the terminology. One issue to do with this
which I have not previously stated is that there is a Graph module on
CPAN: https://metacpan.org/release/Graph

I had a little play with it, and the results were incomprehensible.

Another issue is that the docs are a bit terse, and are (reasonably)
aimed at experts in the field. The author does refer to 'my fiendish
code'.

I'm not an expert on graphs, and find the docs too difficult to follow
to make this module a possible contender.

So I still have the problem of which CPAN module, if any, to adopt. I
have no intention of rewriting Graph, so I picked Tree::DAG_Node as a
first choice, without implying it's the best or most appropriate.

Solution: Don't know.

More below...

Consider adoption where the adoptee knows their biological parents. This
individual has four "parents": two adoptive and two biological; or more
precisely, two mothers and two fathers.

This does not fit well into a tree (with 1 up) but is no problem for a
directed graph between these 5 individuals. Some are connected by an edge
named "birth" and some by an edge called "adoption". A directed graph
handles the extension of this situation, where the adoptive parents have a
biological child as well, and/or the biological parents have other children
not put up for adoption, or adopted by another family. A strict tree
structure is at best messy for this situation.

A graph traversal, though, can report that this person is my brother by
adoption, not blood, and that this other person is my sister by blood only,
since we were adopted by different families.
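
To make that concrete, here is a rough sketch of how Graph's edge attributes
could carry the birth-versus-adoption distinction and drive such a report
(the names and the 'relation' attribute key are invented; untested):

use strict;
use warnings;

use Graph;

my $g = Graph->new;   # Directed by default.

# Two biological parents and two adoptive parents of the same child.
for my $parent ('BioMother', 'BioFather')
{
    $g->add_edge($parent, 'Child');
    $g->set_edge_attribute($parent, 'Child', 'relation', 'birth');
}

for my $parent ('AdoptMother', 'AdoptFather')
{
    $g->add_edge($parent, 'Child');
    $g->set_edge_attribute($parent, 'Child', 'relation', 'adoption');
}

# A report can then ask how each parent is connected to the child.
for my $parent ($g->predecessors('Child'))
{
    printf "%s is linked to Child via %s\n",
        $parent,
        $g->get_edge_attribute($parent, 'Child', 'relation');
}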

The problem you will encounter is taking this nuanced graph structure and
"squashing" it down into a GEDCOM tree, which has a design biased toward
bloodlines over the other human connections which people cherish. The
directed graph lets you model the human connections and ignore this bias
until it is time to export the data or create the report.

I could even see the edge having a truth value in the closed interval
[0..1]. For example: I am 70% sure this Percilla Chase is my ancestor, and
30% sure that this Prissy Chase, born in the same year one town over, is my
ancestor. The connection to my ancestor then has two sets of birth edges:
one for the 70% connection and one for the 30% connection. The only-one-up
nature of a tree makes such uncertain/tentative connections difficult to
model.

Weighting factors on edges are definitely nice-to-have, and I would do
nothing to preempt their implementation.
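
For what it's worth, Graph appears to support this already via weighted
edges, so the 70%/30% idea above could be sketched like this (purely
illustrative and untested):

use strict;
use warnings;

use Graph;

my $g = Graph->new;

# Two candidate ancestors, with a confidence weight on each edge.
$g->add_weighted_edge('Percilla Chase', 'Me', 0.7);
$g->add_weighted_edge('Prissy Chase',   'Me', 0.3);

for my $candidate ($g->predecessors('Me'))
{
    printf "%s: confidence %.0f%%\n",
        $candidate,
        100 * $g->get_edge_weight($candidate, 'Me');
}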

-----Original Message-----
From: Stephen Woodbridge [mailto:wood...@swoodbridge.com]
Sent: Monday, November 05, 2012 6:23 PM
To: perl-gedcom@perl.org
Subject: Re: The new GEDCOM parser

Ron,

I think this is a graph, not a tree, or at best an interconnected forest of
trees. Given a focus node, like an individual or a family, you can look at
the trees up or down from that node.

In graph theory you have nodes and edges, and you can use Dijkstra's
shortest path to find the shortest route through the graph between the
start and end nodes. It does this by converting the graph into a tree, but
doing so does not maintain all the nodes, because many of them are on
parallel paths. You can do the same thing with a GEDCOM file, where the
nodes are individuals and families, and the relationships are the edges.
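
As a sketch of that idea (the vertex names are invented, and I'm assuming
Graph's SP_Dijkstra() returns the vertices along a shortest path, as its
docs suggest):

use strict;
use warnings;

use Graph;

my $g = Graph->new(undirected => 1);

# Individuals linked through the families (@F...@) they belong to.
$g->add_edge('@I1@', '@F1@');
$g->add_edge('@I2@', '@F1@');
$g->add_edge('@F1@', '@I3@');   # @I3@ is a child of @F1@.
$g->add_edge('@I3@', '@F2@');
$g->add_edge('@F2@', '@I4@');

# Shortest chain of relationships between two individuals.
my @path = $g->SP_Dijkstra('@I1@', '@I4@');

print join(' -> ', @path), "\n";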

Nodes and edges for sure. Tree versus graph: probably better to think of it
as a graph. Some things you need to be able to model are:

IVF as you mentioned.
divorce/re-marriage/adoption/... again as you mentioned.
marriage and offspring of relations.

If it is a graph, you can use graph theory to process it, and there are
well-defined algorithms for traversal and manipulation of graphs.

-Steve

On 11/5/2012 4:36 PM, Ron Savage wrote:
Hi Steve

On 06/11/12 00:54, Stephen Woodbridge wrote:
Hi Ron,

This all sounds great. I have a question on your choice of using a
tree structure: can you explain that more? Are you thinking of the
file being the root, then having leaves like indi, fams, famc, etc.,
and then each of these has its respective data hanging off those
objects? Or are you thinking the tree would represent the family
relationships? I don't see how the latter will work.

I got the idea from the nested structure in the GEDCOM doc itself. Any
nesting in the doc could be represented by children of a node in a
tree.

But frankly, I have not thought it through. I really mentioned it to
help clarify my thoughts, and I strongly suspect actual coding will
modify my plan.

But in a bit more detail: An individual can be seen as a node in a
tree, in which case:

o They have a list of (2) grandparents (IVF aside!)

o They have a list of (N) spouses

o They have a list of (N) children

o (As a child) They have a list of (N) care-givers

o (As a patient) They have a list of (N) donors of organs or whatever

And fundamentally, in a tree, every node has N links (normally 1 up, N
sideways and N down), and those links have metadata which would be
used to represent the type of link.

The most obvious problem with a tree is the cross-links due to
divorce/re-marriage/adoption/...

Still, whatever the data structure chosen, /something/ has to be
chosen just to hold the data in memory.
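
For the record, here is a minimal, untested Tree::DAG_Node sketch of what
holding the data in memory might look like (the node names and attributes
are invented; this is not the planned API):

use strict;
use warnings;

use Tree::DAG_Node;

# One node per GEDCOM record/line, nesting following the level numbers.
my $root = Tree::DAG_Node->new;
$root->name('GEDCOM file');

my $indi = $root->new_daughter;
$indi->name('INDI @I1@');
$indi->attributes({tag => 'INDI', xref => '@I1@'});

my $name = $indi->new_daughter;
$name->name('NAME Percilla /Chase/');
$name->attributes({tag => 'NAME', value => 'Percilla /Chase/'});

# Dump the tree, indented by depth.
$root->walk_down
({
    callback => sub
    {
        my($node, $opt) = @_;
        print '  ' x $$opt{_depth}, $node->name, "\n";
        return 1;   # Keep walking.
    },
    _depth => 0,
});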

-Steve

On 11/5/2012 2:04 AM, Ron Savage wrote:
Hi

The new GEDCOM parser
This document is a collection of ideas which have been percolating
in my mind for a long time.

Comments welcome.

Ideas
Module name
Genealogy::Gedcom::Parser.

A place-holder, Genealogy::Gedcom
<http://metacpan.org/release/Genealogy-Gedcom>, is already on CPAN.

Note: This module was written before the new, major tools now
available were released. See Tools below.

ETA
There is no ETA for the parser.

However, certain Perl-based tools are now available which will make
coding a simple task. See Tools below.

See also 'Famous Last Words' :-).

UTF-8
The code will accept input files in utf-8, and generate files
containing utf-8 characters.

Apache and mod_perl
These will not be required. I only mention these because references
to them appear in the Gedcom.pm distro.

Logging
The code will have a built-in logger, so debugging, e.g., can be
turned on with a parameter to new().

This logger will use Log::Handler. See Tools below.
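
Presumably something along these lines (a sketch only; the levels chosen are
just plausible defaults):

use strict;
use warnings;

use Log::Handler;

my $logger = Log::Handler->new;

# Send every level, up to and including debug, to the screen.
$logger->add
(
    screen =>
    {
        maxlevel => 'debug',
        minlevel => 'emergency',
    }
);

$logger->info('Parsing started');
$logger->debug('Processing tag: FAM');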

Sub-classing
Sub-classing the main module will be trivial, and samples will be
provided.

Sub-classing will be done with Hash::FieldHash. See Tools and the
FAQ below.
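
The pattern will presumably be something like this (an untested sketch; the
attribute names are invented):

package Genealogy::Gedcom::Parser;

use strict;
use warnings;

use Hash::FieldHash ':all';

# Each fieldhash declaration also generates an accessor of that name.
fieldhash my %input_file => 'input_file';
fieldhash my %logger     => 'logger';

sub _init
{
    my($self, $arg) = @_;
    $$arg{input_file} ||= '';
    $$arg{logger}     ||= undef;

    return from_hash($self, $arg);
}

sub new
{
    my($class, %arg) = @_;
    my($self)        = bless {}, $class;

    return $self->_init(\%arg);
}

# A sub-class then just overrides _init() (or individual callbacks)
# and calls $self->SUPER::_init($arg).

1;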

Grammars and grammar generators
Like Gedcom.pm, the code will read a GEDCOM grammar in BNF from a
file.
I'll run this phase before shipping the module, so you don't have
to.
See Tools below, specifically MarpaX::Simple::Rules.

Basically, this means the startling complexity of the code in
Gedcom.pm is a thing of the past.

Operating the parser
Using Marpa, callbacks are triggered when input is recognized.

So, when lines like these are encountered:

1 @<XREF:FAM>@ FAM
2 RIN<AUTOMATED_RECORD_ID>

Marpa will automatically call the callback attached to each tag.

Callbacks will probably have names like 'do_fam' and 'do_rin', i.e.
of the format 'do_$tag'.

The parameters passed to the callback include the non-tag text on
the line.

Default callbacks for all tags will be provided, each one doing its
part in parsing the parameters to the tag, and storing the result.

The result will probably be stored in a tree. See Tools below,
specifically Tree::DAG_Node.
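
To give a feel for the mechanism, here is a cut-down, hypothetical sketch
using Marpa::R2's scanless interface. It parses bare '<level> <tag>' lines
rather than the real GEDCOM grammar, and the rule, action and package names
are all invented:

use strict;
use warnings;

use Marpa::R2;

# Hypothetical mini-grammar: just "<level> <tag>" lines, not real GEDCOM.
my $dsl = <<'END_OF_DSL';
:default ::= action => ::first
lexeme default = latm => 1

lines ::= line+
line  ::= level tag  action => do_line

level ~ [0-9]+
tag   ~ [A-Z]+

:discard   ~ whitespace
whitespace ~ [\s]+
END_OF_DSL

my $grammar = Marpa::R2::Scanless::G->new({source => \$dsl});
my $recce   = Marpa::R2::Scanless::R->new
({
    grammar           => $grammar,
    semantics_package => 'main',
});

my $input = "0 HEAD\n1 SOUR\n0 TRLR\n";

$recce->read(\$input);

# The actions fire when the parse is evaluated.
$recce->value;

# Called once per recognized line; a real parser would dispatch
# to do_head(), do_sour(), etc., based on the tag.
sub do_line
{
    my(undef, $level, $tag) = @_;

    print "Level $level, tag $tag => would call do_\L$tag\E()\n";

    return $tag;
}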

Database support
A DBD::SQLite database is possible.
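
If that happens, the obvious route is DBI, along these lines (the table
layout is purely illustrative):

use strict;
use warnings;

use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=gedcom.sqlite', '', '', {RaiseError => 1});

$dbh->do('create table if not exists individuals (xref text primary key, name text)');

my $sth = $dbh->prepare('insert into individuals (xref, name) values (?, ?)');

$sth->execute('@I1@', 'Percilla /Chase/');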

Tools
o Hash::FieldHash
Simplifies class-building.

As for the obvious question (why not use Moose?), see the FAQ below.

o Log::Handler
Simplifies logging.

o Marpa::R2
This is the modern way to do parsing.

Home page <http://jeffreykegler.github.com/Marpa-web-site/>.

Jeffrey's blog about Marpa
<http://jeffreykegler.github.com/Ocean-of-Awareness-blog/>.

My recent article about lexing and parsing with Marpa
<http://www.perl.com/pub/2012/10/an-overview-of-lexing-and-parsing.html>.

o MarpaX::Simple::Rules
This module reads a grammar in BNF and generates a Marpa grammar.

Hence it will read a BNF version of the GEDCOM spec and output the
matching Marpa grammar.

o Tree::DAG_Node
The most sophisticated tree-handling code on CPAN. I've recently
become co-maintainer of this module.

FAQ
Why did you choose Hash::FieldHash over Moose?
My policy is to use the light-weight Hash::FieldHash for stand-alone
modules and Moose for applications.

Why did you choose to store the data in a tree?
A GEDCOM file's structure can be viewed as a tree, so my initial
plan is to store the data likewise.







