RE: The new GEDCOM parser

John Washburn Mon, 05 Nov 2012 17:32:21 -0800

I agree with Mr. Woodbridge.  A directed graph is a better model.

Consider adoption where the adoptee knows their biological parents.  This
individual has four "parents" two adoptive and two biological; or more
precisely two mothers and two fathers.


This does not fit well into a Tree (with 1 up) but is no problem for the a
directed graph between these 5 individuals. Some are connected by an edge
named "birth" and some by an edge called "adoption."  A directed graph
handles the extension of  this situation where the adoptive parents have a
biological child as well and/or the biological parent has other children not
put up for adoption or adopted by another family.  A strict tree structure
is at best messy for this situation.

A graph traversal though can report that this person is my brother by
adoption not blood and that this other person is my sister by blood only
since we were adopted by different families.

The problem you will encounter is to take this nuanced, graph structure and
"squashing" it down into a GEDCOM tree which has a design biased toward
bloodlines over other human connections which people cherish.  The directed
graph lets model the human connections and ignore this bias until it is time
to export the data or create the report.  

I could even see the edge having a truth value in the closed interval:
[0..1].  For example: I am 70% sure this Percilla Chase is my ancestor and
30% that this Prissy Chase born in the same year one town over is my
ancestor.  The edge connecting my ancestor has two sets of birth edges; one
for the 70% connection and one set for the 30% connection.  The only one up
nature of a tree makes such uncertain/tentative connections difficult to
model.


-----Original Message-----
From: Stephen Woodbridge [mailto:wood...@swoodbridge.com] 
Sent: Monday, November 05, 2012 6:23 PM
To: perl-gedcom@perl.org
Subject: Re: The new GEDCOM parser

Ron,

I think this is a graph not a tree or at best an interconnect forest of
trees. Given a focus node like an individual or a family you can view look
at the trees up or down from that node.

In graph theory you have nodes and edges, and you can use Dijkstra's
shortest path to find the shortest route through the graph between the start
and end node. It does this by converting the graph into a tree, but doing so
does not maintain all the node because many of them are parallel paths. You
can do the same thing with a gedcom file where nodes are individuals and
families and the relationships are the edges.

Nodes and edges for sure. Tree versus graph, probably better to think of it
as a graph. Somethings you need to be able to model are:

IVF as you mentioned.
divorce/re-marriage/adoption/... again as you mentioned.
marriage and offspring of relations.

If it is a graph, you can use graph theory to process it and these are well
defined algorithms for traversal and manipulation of graphs.

-Steve

On 11/5/2012 4:36 PM, Ron Savage wrote:
> Hi Steve
>
> On 06/11/12 00:54, Stephen Woodbridge wrote:
>> Hi Ron,
>>
>> This all sounds great. I have a question on your choice of using a 
>> tree structure, can you explain that more? Are you thinking of the 
>> file being the root, then having leaves like: indi, fams, famc, etc 
>> and then each of these have their respective data hanging off those 
>> objects? Or are you thinking the tree would represent the family 
>> relationships? I don't see how the later will work.
>
> I got the idea from the nested structure in the GEDCOM doc itself. Any 
> nesting in the doc could be represented by children of a node in a tree.
>
> But frankly, I have not thought it through. I really mentioned it to 
> help clarify my thoughts, and I strongly suspect actual coding will 
> modify my plan.
>
> But in a bit more detail: An individual can be seen as a node in a 
> tree, in which case:
>
> o They have a list of (2) grandparents (IVF aside!)
>
> o They have a list of (N) spouses
>
> o They have a list of (N) children
>
> o (As a child) They have a list of (N) care-givers
>
> o (As a patient) They have a list of (N) donors or organs or whatever
>
> And fundamentally, in a tree, every node has N links (normally 1 up, N 
> sideways and N down), and those links have metadata which would be 
> used to represent the type of link.
>
> The most obvious problem with a tree is the cross-links due to 
> divorce/re-marriage/adoption/...
>
> Still, whatever the data structure chosen, /something/ has to be 
> chosen just to hold the data in memory.
>
>> -Steve
>>
>> On 11/5/2012 2:04 AM, Ron Savage wrote:
>>> Hi
>>>
>>> The new GEDCOM parser
>>> This document is a collection of ideas which have been percolating 
>>> in my mind for a long time.
>>>
>>> Comments welcome.
>>>
>>> Ideas
>>> Module name
>>> Genealogy::Gedcom::Parser.
>>>
>>> A place-holder, Genealogy::Gedcom
>>> <http://metacpan.org/release/Genealogy-Gedcom>, is already on CPAN.
>>>
>>> Note: This module was written before the new, major tools now 
>>> available were released. See Tools below.
>>>
>>> ETA
>>> There is no ETA for the parser.
>>>
>>> However, certain Perl-based tools are now available which will make 
>>> coding a simple task. See Tools below.
>>>
>>> See also 'Famous Last Words' :-).
>>>
>>> UTF-8
>>> The code will accept input files in utf-8, and generate files 
>>> containing utf-8 characters.
>>>
>>> Apache and mod_perl
>>> These will not be required. I only mention these because references 
>>> to them appear in the Gedcom.pm distro.
>>>
>>> Logging
>>> The code will have a built-in logger, so debugging, e.g., can be 
>>> turned on with a parameter to new().
>>>
>>> This logger will use Log::Handler. See Tools below.
>>>
>>> Sub-classing
>>> Sub-classing the main module will be trivial, and samples will be 
>>> provided.
>>>
>>> Sub-classing will be done with Hash::FieldHash. See Tools and the 
>>> FAQ below.
>>>
>>> Grammars and grammar generators
>>> Like Gedcom.pm, the code will read a GEDCOM grammar in BNF from a 
>>> file.
>>> I'll run this phase before shipping the module, so you don't have to.
>>> See Tools below, specifically Marpa::Rules::Simple.
>>>
>>> Bascially, this means the startling complexity of the code in 
>>> Gedcom.pm is a thing of the past.
>>>
>>> Operating the parser
>>> Using Marpa, callbacks are triggered when input is recognized.
>>>
>>> So, when lines like these are encountered:
>>>
>>> 1 @<XREF:FAM>@ FAM
>>> 2 RIN <AUTOMATED_RECORD_ID>
>>>
>>> Marpa will automatically call the callback attached to each tag.
>>>
>>> Callbacks will probably have names like 'do_fam' and 'do_rin', i.e.
>>> of the format 'do_$tag'.
>>>
>>> The parameters passed to the callback include the non-tag text on 
>>> the line.
>>>
>>> Default callbacks for all tags will be provided, each one doing its 
>>> part in parsing the parameters to the tag, and storing the result.
>>>
>>> The result will probably be stored in a tree. See Tools below, 
>>> specifically Tree::DAG_Node.
>>>
>>> Database support
>>> A DBD::SQLite database is possible.
>>>
>>> Tools
>>> o Hash::FieldHash
>>> Simplifies class-building.
>>>
>>> As for the obvious question, why not use Moose, see the FAQ below.
>>>
>>> o Log::Handling
>>> Simplifies logging.
>>>
>>> o Marpa::R2
>>> This is the modern way to do parsing.
>>>
>>> Home page <http://jeffreykegler.github.com/Marpa-web-site/>.
>>>
>>> Jeffrey's blog about Marpa
>>> <http://jeffreykegler.github.com/Ocean-of-Awareness-blog/>.
>>>
>>> My recent article about lexing and parsing with Marpa
>>>
>>>
<http://www.perl.com/pub/2012/10/an-overview-of-lexing-and-parsing.html>.
>>>
>>>
>>> o MarpaX::Simple::Rules
>>> This module reads a grammar in BNF and generates a Marpa grammar.
>>>
>>> Hence it will read a BNF version of the GEDCOM spec and output the 
>>> matching Marpa grammar.
>>>
>>> o Tree::DAG_Node
>>> The most sophisticated tree-handling code on CPAN. I've recently 
>>> become co-maintainer of this module.
>>>
>>> FAQ
>>> Why did you choose Hash::FieldHash over Moose?
>>> My policy is to use the light-weight Hash::FieldHash for stand-alone 
>>> modules and Moose for applications.
>>>
>>> Why did you choose to store the data in a tree?
>>> A GEDCOM file's structure can be viewed as a tree, so my initial 
>>> plan is to store the data likewise.
>>>
>>>
>>
>>
>>
>
>

-----
No virus found in this message.
Checked by AVG - www.avg.com
Version: 2013.0.2742 / Virus Database: 2617/5876 - Release Date: 11/05/12

RE: The new GEDCOM parser

Reply via email to