Re: The new GEDCOM parser

Chris Clonch Tue, 06 Nov 2012 11:28:11 -0800

Hi everybody!

With a graph theory connection, you could easily implement RandyWilson's ideas [1] on merging GEDCOMs. I've attempted to give thoughtto how this could be implemented with Graph.pm (as seen in O'Reilly'sAlgorithms with Perl) & Gedcom.pm but I couldn't wrap my brain aroundthe ins and outs of the data as a graph.

Like Jeremy stated, if a solid parser is written, then it would likelybe trivial (for others :) to extend using graph, tree, whatever foradvanced manipulation of the data. And the reverse could be true,additional lexer's could be wrote to slurp in XML, FOAF, etc that comesdown the road...


Got me excited again!

-Chris


1. http://synapse.cs.byu.edu/~randy/gen/Remerge.html


On 2012-11-06 00:11, Stephen Woodbridge wrote:

Hi Ron,
I work with graphs for doing vehicle routing so have some familiaritywith them.
I think this is the family of graph tool to look at using:

http://search.cpan.org/~jhi/Graph/

and I think these work with:

Graph::Writer
Graph::Reader

And it is likely that this can trivially be integrated with graph
rendering tools like GraphViz and/or one of the other tools using
this:


http://search.cpan.org/~neilb/Graph-ReadWrite-2.03/lib/Graph/Writer/Dot.pm
I also found a couple of modules that might be interesting to playwith:
Graph::Similarity
Graph::Matching

If these can be applied to the task matching and merging overlaping
gedcom files.

And this module can be used to save and restore Graph structures in
relational databases.

Ok, this is starting to look very interesting. I guess it all starts
with being able to move data to/from Graph structures and GEDCOM
files.

Ron, you got this coded yet? ;)

-Steve

On 11/5/2012 10:21 PM, Ron Savage wrote:
Hi John

On 06/11/12 12:32, John Washburn wrote:
I agree with Mr. Woodbridge.  A directed graph is a better model.
I aggree. Conceptually it must be graph.
Sure. The question is: Is there a module on CPAN which will do thejob?
I was a bit careless with the terminology. One issue to do with this
which I have not previously stated is that there is a Graph moduleon
CPAN: https://metacpan.org/release/Graph

I had a little play with it, and the results were incomprehensible.

Another issue is that the docs are a bit terse, and are (reasonably)
aimed at experts in the field. The author does refer to 'my fiendishcode'.
I'm not an expert on graphs, and find the docs too difficult tofollow
to make this module a possible contender.
So I still have the problem as to which CPAN module if any I adopt.I
have no intention to rewrite Graph, so I picked Tree::DAG_Node as a
first choice, not implying it's the best or most appropriate.

Solution: Don't know.

More below...
Consider adoption where the adoptee knows their biological parents.Thisindividual has four "parents" two adoptive and two biological; ormore
precisely two mothers and two fathers.
This does not fit well into a Tree (with 1 up) but is no problemfor
the a
directed graph between these 5 individuals. Some are connected byan edgenamed "birth" and some by an edge called "adoption." A directedgraph
handles the extension of  this situation where the adoptive parents
have a
biological child as well and/or the biological parent has other
children not
put up for adoption or adopted by another family.  A strict tree
structure
is at best messy for this situation.
A graph traversal though can report that this person is my brotherbyadoption not blood and that this other person is my sister by bloodonly
since we were adopted by different families.

The problem you will encounter is to take this nuanced, graph
structure and
"squashing" it down into a GEDCOM tree which has a design biasedtoward
bloodlines over other human connections which people cherish.  The
directed
graph lets model the human connections and ignore this bias untilit
is time
to export the data or create the report.
I could even see the edge having a truth value in the closedinterval:[0..1]. For example: I am 70% sure this Percilla Chase is myancestor
and
30% that this Prissy Chase born in the same year one town over ismy
ancestor.  The edge connecting my ancestor has two sets of birth
edges; one
for the 70% connection and one set for the 30% connection. Theonly
one up
nature of a tree makes such uncertain/tentative connectionsdifficult to
model.
Weighting factors on edges are definitely nice-to-have, and I woulddo
nothing to preempt their implementation.
-----Original Message-----
From: Stephen Woodbridge [mailto:[email protected]]
Sent: Monday, November 05, 2012 6:23 PM
To: [email protected]
Subject: Re: The new GEDCOM parser

Ron,
I think this is a graph not a tree or at best an interconnectforest oftrees. Given a focus node like an individual or a family you canview
look
at the trees up or down from that node.
In graph theory you have nodes and edges, and you can useDijkstra'sshortest path to find the shortest route through the graph betweenthe
start
and end node. It does this by converting the graph into a tree, but
doing so
does not maintain all the node because many of them are parallel
paths. You
can do the same thing with a gedcom file where nodes areindividuals and
families and the relationships are the edges.
Nodes and edges for sure. Tree versus graph, probably better tothink
of it
as a graph. Somethings you need to be able to model are:

IVF as you mentioned.
divorce/re-marriage/adoption/... again as you mentioned.
marriage and offspring of relations.
If it is a graph, you can use graph theory to process it and theseare
well
defined algorithms for traversal and manipulation of graphs.

-Steve

On 11/5/2012 4:36 PM, Ron Savage wrote:
Hi Steve

On 06/11/12 00:54, Stephen Woodbridge wrote:
Hi Ron,
This all sounds great. I have a question on your choice of usingatree structure, can you explain that more? Are you thinking ofthefile being the root, then having leaves like: indi, fams, famc,etcand then each of these have their respective data hanging offthose
objects? Or are you thinking the tree would represent the family
relationships? I don't see how the later will work.
I got the idea from the nested structure in the GEDCOM doc itself.Anynesting in the doc could be represented by children of a node in atree.
But frankly, I have not thought it through. I really mentioned ittohelp clarify my thoughts, and I strongly suspect actual codingwill
modify my plan.

But in a bit more detail: An individual can be seen as a node in a
tree, in which case:

o They have a list of (2) grandparents (IVF aside!)

o They have a list of (N) spouses

o They have a list of (N) children

o (As a child) They have a list of (N) care-givers
o (As a patient) They have a list of (N) donors or organs orwhatever
And fundamentally, in a tree, every node has N links (normally 1up, N
sideways and N down), and those links have metadata which would be
used to represent the type of link.

The most obvious problem with a tree is the cross-links due to
divorce/re-marriage/adoption/...

Still, whatever the data structure chosen, /something/ has to be
chosen just to hold the data in memory.
-Steve

On 11/5/2012 2:04 AM, Ron Savage wrote:
Hi

The new GEDCOM parser
This document is a collection of ideas which have beenpercolating
in my mind for a long time.

Comments welcome.

Ideas
Module name
Genealogy::Gedcom::Parser.

A place-holder, Genealogy::Gedcom
<http://metacpan.org/release/Genealogy-Gedcom>, is already onCPAN.
Note: This module was written before the new, major tools now
available were released. See Tools below.

ETA
There is no ETA for the parser.
However, certain Perl-based tools are now available which willmake
coding a simple task. See Tools below.

See also 'Famous Last Words' :-).

UTF-8
The code will accept input files in utf-8, and generate files
containing utf-8 characters.

Apache and mod_perl
These will not be required. I only mention these becausereferences
to them appear in the Gedcom.pm distro.

Logging
The code will have a built-in logger, so debugging, e.g., can be
turned on with a parameter to new().

This logger will use Log::Handler. See Tools below.

Sub-classing
Sub-classing the main module will be trivial, and samples willbe
provided.
Sub-classing will be done with Hash::FieldHash. See Tools andthe
FAQ below.

Grammars and grammar generators
Like Gedcom.pm, the code will read a GEDCOM grammar in BNF froma
file.
I'll run this phase before shipping the module, so you don'thave to.
See Tools below, specifically Marpa::Rules::Simple.

Bascially, this means the startling complexity of the code in
Gedcom.pm is a thing of the past.

Operating the parser
Using Marpa, callbacks are triggered when input is recognized.

So, when lines like these are encountered:

1 @<XREF:FAM>@ FAM
2 RIN<AUTOMATED_RECORD_ID>

Marpa will automatically call the callback attached to each tag.
Callbacks will probably have names like 'do_fam' and 'do_rin',i.e.
of the format 'do_$tag'.
The parameters passed to the callback include the non-tag texton
the line.
Default callbacks for all tags will be provided, each one doingitspart in parsing the parameters to the tag, and storing theresult.
The result will probably be stored in a tree. See Tools below,
specifically Tree::DAG_Node.

Database support
A DBD::SQLite database is possible.

Tools
o Hash::FieldHash
Simplifies class-building.
As for the obvious question, why not use Moose, see the FAQbelow.
o Log::Handling
Simplifies logging.

o Marpa::R2
This is the modern way to do parsing.

Home page<http://jeffreykegler.github.com/Marpa-web-site/>.

Jeffrey's blog about Marpa
<http://jeffreykegler.github.com/Ocean-of-Awareness-blog/>.

My recent article about lexing and parsing with Marpa
<http://www.perl.com/pub/2012/10/an-overview-of-lexing-and-parsing.html>.
o MarpaX::Simple::Rules
This module reads a grammar in BNF and generates a Marpagrammar.
Hence it will read a BNF version of the GEDCOM spec and outputthe
matching Marpa grammar.

o Tree::DAG_Node
The most sophisticated tree-handling code on CPAN. I've recently
become co-maintainer of this module.

FAQ
Why did you choose Hash::FieldHash over Moose?
My policy is to use the light-weight Hash::FieldHash forstand-alone
modules and Moose for applications.

Why did you choose to store the data in a tree?
A GEDCOM file's structure can be viewed as a tree, so my initial
plan is to store the data likewise.
-----
No virus found in this message.
Checked by AVG - www.avg.com
Version: 2013.0.2742 / Virus Database: 2617/5876 - Release Date:11/05/12
Scanned and tagged as non-SPAM with DSPAM 3.9.0 by Your ISP.com



Scanned and tagged as non-SPAM with DSPAM 3.9.0 by Your ISP.com

Re: The new GEDCOM parser

Reply via email to