Hi everybody!

With a graph theory connection, you could easily implement Randy Wilson's ideas [1] on merging GEDCOMs. I've attempted to give thought to how this could be implemented with Graph.pm (as seen in O'Reilly's Algorithms with Perl) & Gedcom.pm but I couldn't wrap my brain around the ins and outs of the data as a graph.

Like Jeremy stated, if a solid parser is written, then it would likely be trivial (for others :) to extend using graph, tree, whatever for advanced manipulation of the data. And the reverse could be true, additional lexer's could be wrote to slurp in XML, FOAF, etc that comes down the road...

Got me excited again!

-Chris


1. http://synapse.cs.byu.edu/~randy/gen/Remerge.html


On 2012-11-06 00:11, Stephen Woodbridge wrote:
Hi Ron,

I work with graphs for doing vehicle routing so have some familiarity with them.

I think this is the family of graph tool to look at using:

http://search.cpan.org/~jhi/Graph/

and I think these work with:

Graph::Writer
Graph::Reader

And it is likely that this can trivially be integrated with graph
rendering tools like GraphViz and/or one of the other tools using
this:


http://search.cpan.org/~neilb/Graph-ReadWrite-2.03/lib/Graph/Writer/Dot.pm

I also found a couple of modules that might be interesting to play with:

Graph::Similarity
Graph::Matching

If these can be applied to the task matching and merging overlaping
gedcom files.

And this module can be used to save and restore Graph structures in
relational databases.

Ok, this is starting to look very interesting. I guess it all starts
with being able to move data to/from Graph structures and GEDCOM
files.

Ron, you got this coded yet? ;)

-Steve

On 11/5/2012 10:21 PM, Ron Savage wrote:
Hi John

On 06/11/12 12:32, John Washburn wrote:
I agree with Mr. Woodbridge.  A directed graph is a better model.

I aggree. Conceptually it must be graph.

Sure. The question is: Is there a module on CPAN which will do the job?

I was a bit careless with the terminology. One issue to do with this
which I have not previously stated is that there is a Graph module on
CPAN: https://metacpan.org/release/Graph

I had a little play with it, and the results were incomprehensible.

Another issue is that the docs are a bit terse, and are (reasonably)
aimed at experts in the field. The author does refer to 'my fiendish code'.

I'm not an expert on graphs, and find the docs too difficult to follow
to make this module a possible contender.

So I still have the problem as to which CPAN module if any I adopt. I
have no intention to rewrite Graph, so I picked Tree::DAG_Node as a
first choice, not implying it's the best or most appropriate.

Solution: Don't know.

More below...

Consider adoption where the adoptee knows their biological parents. This individual has four "parents" two adoptive and two biological; or more
precisely two mothers and two fathers.

This does not fit well into a Tree (with 1 up) but is no problem for
the a
directed graph between these 5 individuals. Some are connected by an edge named "birth" and some by an edge called "adoption." A directed graph
handles the extension of  this situation where the adoptive parents
have a
biological child as well and/or the biological parent has other
children not
put up for adoption or adopted by another family.  A strict tree
structure
is at best messy for this situation.

A graph traversal though can report that this person is my brother by adoption not blood and that this other person is my sister by blood only
since we were adopted by different families.

The problem you will encounter is to take this nuanced, graph
structure and
"squashing" it down into a GEDCOM tree which has a design biased toward
bloodlines over other human connections which people cherish.  The
directed
graph lets model the human connections and ignore this bias until it
is time
to export the data or create the report.

I could even see the edge having a truth value in the closed interval: [0..1]. For example: I am 70% sure this Percilla Chase is my ancestor
and
30% that this Prissy Chase born in the same year one town over is my
ancestor.  The edge connecting my ancestor has two sets of birth
edges; one
for the 70% connection and one set for the 30% connection. The only
one up
nature of a tree makes such uncertain/tentative connections difficult to
model.

Weighting factors on edges are definitely nice-to-have, and I would do
nothing to preempt their implementation.

-----Original Message-----
From: Stephen Woodbridge [mailto:wood...@swoodbridge.com]
Sent: Monday, November 05, 2012 6:23 PM
To: perl-gedcom@perl.org
Subject: Re: The new GEDCOM parser

Ron,

I think this is a graph not a tree or at best an interconnect forest of trees. Given a focus node like an individual or a family you can view
look
at the trees up or down from that node.

In graph theory you have nodes and edges, and you can use Dijkstra's shortest path to find the shortest route through the graph between the
start
and end node. It does this by converting the graph into a tree, but
doing so
does not maintain all the node because many of them are parallel
paths. You
can do the same thing with a gedcom file where nodes are individuals and
families and the relationships are the edges.

Nodes and edges for sure. Tree versus graph, probably better to think
of it
as a graph. Somethings you need to be able to model are:

IVF as you mentioned.
divorce/re-marriage/adoption/... again as you mentioned.
marriage and offspring of relations.

If it is a graph, you can use graph theory to process it and these are
well
defined algorithms for traversal and manipulation of graphs.

-Steve

On 11/5/2012 4:36 PM, Ron Savage wrote:
Hi Steve

On 06/11/12 00:54, Stephen Woodbridge wrote:
Hi Ron,

This all sounds great. I have a question on your choice of using a tree structure, can you explain that more? Are you thinking of the file being the root, then having leaves like: indi, fams, famc, etc and then each of these have their respective data hanging off those
objects? Or are you thinking the tree would represent the family
relationships? I don't see how the later will work.

I got the idea from the nested structure in the GEDCOM doc itself. Any nesting in the doc could be represented by children of a node in a tree.

But frankly, I have not thought it through. I really mentioned it to help clarify my thoughts, and I strongly suspect actual coding will
modify my plan.

But in a bit more detail: An individual can be seen as a node in a
tree, in which case:

o They have a list of (2) grandparents (IVF aside!)

o They have a list of (N) spouses

o They have a list of (N) children

o (As a child) They have a list of (N) care-givers

o (As a patient) They have a list of (N) donors or organs or whatever

And fundamentally, in a tree, every node has N links (normally 1 up, N
sideways and N down), and those links have metadata which would be
used to represent the type of link.

The most obvious problem with a tree is the cross-links due to
divorce/re-marriage/adoption/...

Still, whatever the data structure chosen, /something/ has to be
chosen just to hold the data in memory.

-Steve

On 11/5/2012 2:04 AM, Ron Savage wrote:
Hi

The new GEDCOM parser
This document is a collection of ideas which have been percolating
in my mind for a long time.

Comments welcome.

Ideas
Module name
Genealogy::Gedcom::Parser.

A place-holder, Genealogy::Gedcom
<http://metacpan.org/release/Genealogy-Gedcom>, is already on CPAN.

Note: This module was written before the new, major tools now
available were released. See Tools below.

ETA
There is no ETA for the parser.

However, certain Perl-based tools are now available which will make
coding a simple task. See Tools below.

See also 'Famous Last Words' :-).

UTF-8
The code will accept input files in utf-8, and generate files
containing utf-8 characters.

Apache and mod_perl
These will not be required. I only mention these because references
to them appear in the Gedcom.pm distro.

Logging
The code will have a built-in logger, so debugging, e.g., can be
turned on with a parameter to new().

This logger will use Log::Handler. See Tools below.

Sub-classing
Sub-classing the main module will be trivial, and samples will be
provided.

Sub-classing will be done with Hash::FieldHash. See Tools and the
FAQ below.

Grammars and grammar generators
Like Gedcom.pm, the code will read a GEDCOM grammar in BNF from a
file.
I'll run this phase before shipping the module, so you don't have to.
See Tools below, specifically Marpa::Rules::Simple.

Bascially, this means the startling complexity of the code in
Gedcom.pm is a thing of the past.

Operating the parser
Using Marpa, callbacks are triggered when input is recognized.

So, when lines like these are encountered:

1 @<XREF:FAM>@ FAM
2 RIN<AUTOMATED_RECORD_ID>

Marpa will automatically call the callback attached to each tag.

Callbacks will probably have names like 'do_fam' and 'do_rin', i.e.
of the format 'do_$tag'.

The parameters passed to the callback include the non-tag text on
the line.

Default callbacks for all tags will be provided, each one doing its part in parsing the parameters to the tag, and storing the result.

The result will probably be stored in a tree. See Tools below,
specifically Tree::DAG_Node.

Database support
A DBD::SQLite database is possible.

Tools
o Hash::FieldHash
Simplifies class-building.

As for the obvious question, why not use Moose, see the FAQ below.

o Log::Handling
Simplifies logging.

o Marpa::R2
This is the modern way to do parsing.

Home page<http://jeffreykegler.github.com/Marpa-web-site/>.

Jeffrey's blog about Marpa
<http://jeffreykegler.github.com/Ocean-of-Awareness-blog/>.

My recent article about lexing and parsing with Marpa



<http://www.perl.com/pub/2012/10/an-overview-of-lexing-and-parsing.html>.


o MarpaX::Simple::Rules
This module reads a grammar in BNF and generates a Marpa grammar.

Hence it will read a BNF version of the GEDCOM spec and output the
matching Marpa grammar.

o Tree::DAG_Node
The most sophisticated tree-handling code on CPAN. I've recently
become co-maintainer of this module.

FAQ
Why did you choose Hash::FieldHash over Moose?
My policy is to use the light-weight Hash::FieldHash for stand-alone
modules and Moose for applications.

Why did you choose to store the data in a tree?
A GEDCOM file's structure can be viewed as a tree, so my initial
plan is to store the data likewise.








-----
No virus found in this message.
Checked by AVG - www.avg.com
Version: 2013.0.2742 / Virus Database: 2617/5876 - Release Date: 11/05/12







Scanned and tagged as non-SPAM with DSPAM 3.9.0 by Your ISP.com


Scanned and tagged as non-SPAM with DSPAM 3.9.0 by Your ISP.com

Reply via email to