There are several sides to the
genealogical research problem. There is the UI side where the actual
entry of research data needs to be improved. There is also the artificial
intelligence side where we should be able to apply AI techniques to help
researchers. Then there is the collaboration side where there is too much
duplicated research being done.
At the last FHT conference the keynote
speaker from Google talked about how they are working on what he called “AI
in the middle”. The concept is that users are still going to
have to (and want to) do some manual filtering of search results, but the “AI
in the middle” can aid in that. I think that we need to be thinking
the same way about AI with regard to genealogy. The researcher is still
going to have to (and want to) manually analyze their data, but we can provide “AI
in the middle” to help them with their research.
IMHO, web services will be a more effective place
for our efforts than the semantic web. I used to
work for a major library which had a whole department dedicated to analyzing raw
and textual data and categorizing it using Library of Congress ontologies.
It took professionals thousands of hours to do 100 journals. I don’t
think average genealogists will have the patience. So we would have to
look at automated ways to do it. But it doesn’t make sense to have
an automated system convert structured data into unstructured data and then
apply hidden markup to try to add structure back to it that will only be of
use to another automated system. Instead we should work to connect the
two automated systems using web services so that they share the structured
data.
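To make the point concrete, here is a minimal sketch of two hypothetical services exchanging structured data directly as JSON, with no unstructured intermediate step. All the endpoint behavior, record shape, and field names are my own illustration, not any real genealogy API:

```python
import json

# Hypothetical: "Service A" (one genealogy app) exports a person record as
# structured JSON; "Service B" (another app) consumes that same structure
# directly. No structured -> text -> re-marked-up round trip is needed.

def service_a_export(person_id):
    """Stand-in for a web-service endpoint returning structured data."""
    records = {
        "p1": {"id": "p1", "name": "John Smith",
               "birth": {"date": "1823", "place": "County Cork, Ireland"}},
    }
    return json.dumps(records[person_id])  # the payload on the wire

def service_b_import(payload):
    """The consuming service parses the structure back, losslessly."""
    return json.loads(payload)

record = service_b_import(service_a_export("p1"))
print(record["birth"]["place"])
```

The round trip preserves every field exactly, which is the whole argument: when both ends are automated systems, sharing the structured form directly loses nothing.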
I am currently researching Bayesian
data mining techniques that analyze existing genealogy data, then use
what they learn to evaluate records and guide researchers to the sources
most likely to contain information about a particular person, helping them
fill in the holes.
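As a toy sketch of that idea (my own illustration, not the actual module), Bayes' rule can rank source types for a person by combining a prior over sources with how often each source has yielded hits for people with the same attributes. All the counts and source names below are invented training data:

```python
# hits[source][attribute] = past successful lookups in that source for
# people with that attribute (place of origin, era). Invented numbers.
hits = {
    "census": {"US": 40, "1850s": 30},
    "parish_register": {"Ireland": 35, "1850s": 25},
    "ship_manifest": {"Ireland": 20, "US": 15},
}
prior = {"census": 0.5, "parish_register": 0.3, "ship_manifest": 0.2}

def rank_sources(attributes):
    """Score each source by prior * per-attribute likelihoods (naive Bayes)."""
    scores = {}
    for source, counts in hits.items():
        total = sum(counts.values())
        score = prior[source]
        for a in attributes:
            # add-one smoothing so an unseen attribute doesn't zero a source
            score *= (counts.get(a, 0) + 1) / (total + len(attributes))
        scores[source] = score
    return sorted(scores, key=scores.get, reverse=True)

print(rank_sources(["Ireland", "1850s"]))
```

For an Irish ancestor in the 1850s the parish register ranks first despite the census's higher prior, which is exactly the kind of "look here next" guidance a researcher stuck at a dead end could use.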
For those interested, I have been working
on a Research Assistant module as part of the PhpGedView project which is
designed to meet these and other specific problems genealogical researchers
face.
The specific goals of the research
assistant are:
- Collaborative research among
family members. Allow families to use an online system to better coordinate
and focus their research so that there is less duplication.
- Persistent log of all research
activity linked to people and sources. The data only tells part of
the story. There is still a wealth of information about where other
researchers looked and did NOT find anything that also needs to be
recorded and logged. It is extremely frustrating for researchers to
go to the library and spend hours looking and not finding anything just to
learn later that another family member (or even they themselves years
earlier) had already looked in that source.
- Source-centric data
entry. Because the data and the research log are integrated, as you enter
your results from genealogical research it is immediately applied to your
genealogical data. Data entry is also source-centric, so that
you may enter multiple facts about multiple people all related to the
results of that research. This helps to solve the Person vs. Source centric
problems we are all familiar with. We want to view and work with the
data from a person perspective, but we want to enter and manipulate it
from a source perspective. Consider the wealth of information in a census
for example. With most programs you have to manually navigate to
each person and enter all of the factual data you can glean from that
census and you have to remember to properly cite and source all of that
data back to the same census. With the research assistant you can
enter all of that once and it is automatically dispersed everywhere it
needs to be. And it is properly sourced and cited.
- Artificial intelligence to
analyze research results. As you enter data, the program will
automatically compare that data with the people it relates to and suggest
facts and genealogical data that can be inferred from the results of the
research. Then with a simple click of a single button, the user can
choose to apply that inferred data or ignore the AI.
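The source-centric fan-out described above can be sketched in a few lines. The researcher enters one source (say, a census page) and the facts found in it, and each fact is dispersed to the right person's record with the citation attached automatically. Class and field names here are illustrative, not PhpGedView's actual data model:

```python
from dataclasses import dataclass, field

@dataclass
class Person:
    name: str
    facts: list = field(default_factory=list)  # (fact, citation) pairs

def apply_source(source_citation, findings, people):
    """Fan out findings from one source: each fact lands on the right
    person's record, already carrying the shared citation."""
    for person_name, fact in findings:
        people[person_name].facts.append((fact, source_citation))

# One entry session for one census page, covering multiple people at once.
people = {"Mary": Person("Mary"), "Thomas": Person("Thomas")}
apply_source(
    "1850 US Census, Ohio, p. 12",
    [("Mary", "age 34"), ("Mary", "born Ireland"), ("Thomas", "age 8")],
    people,
)
print(people["Mary"].facts)
```

The citation is written once but appears on every fact, which is the contrast with person-centric programs where you re-enter and re-cite the same census on each individual.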
Anyway, I think that there is definitely
work to be done in the genealogy AI arena that can help researchers.
--John
From:
[EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Jay Askren
Sent: Friday, November 03, 2006
7:15 AM
To: LDS Open Source Software
Subject: Re: [Ldsoss] Automating
Genealogy Research
I would have to disagree about this not being a computer science
problem. Computers can't solve the problem for sure, but they can take us
farther than we are now. Our company focuses on Artificial Intelligence
research and my background is in Computational Linguistics, in other
words, Natural Language Processing.
First, there are several tools that I know about that are at least
heading in the right direction.
http://www.ancestry.com -
Ancestry does have a little data mining group which is working on extracting
information from the web automatically. I applied for a job there a
couple of years ago, but they were looking for more of a web interface person
which isn't my strongest skill. An example of where data mining/text
processing is used is their obituary search. They have a crawler
which extracts information from obituaries so it can be added to their search
engine. It certainly isn't perfect, but it does a decent job. I
believe they also have a group which tries to combine information from various
databases and build family trees with it.
http://www.werelate.org/ -
We Relate searches the web for genealogy information. I think this is
becoming a great resource and will only get better.
These are all search engines, which isn't quite what you were
asking for, but I think it's as close as we can get for now. It may
even be the best we will ever do, but if we can make better search engines,
that will make genealogy much easier. It would really be great if we could
digitize genealogy books and make those searchable.
One paradigm which has been around for a while but hasn't taken off yet
is the semantic web. In theory it could enable more of what you are
talking about. The idea is to codify information so it can be understood
by a machine. So, in the context of genealogy, each person would have a
unique identifier called a URI. Then we can make assertions about
different people. For instance we can assert that http://www.familysearch.org/person1
is the same as http://www.ancestry.com/person2 and
that http://www.familysearch.org/person1 is
the child of http://www.ancestry.com/person3.
The semantic web technologies can also do inference, so a software agent could
infer that person 2 is also the child of person 3. Here's
some more reading on it:
In theory if all genealogy databases were coded up using the semantic web
languages, computers could combine them to make family trees with some help
from humans. In practice I don't know if this will ever happen. I
don't know if the semantic web will ever take off. It's still very much
a research topic and has been for quite some time. I haven't really seen
a real application come out of the research that couldn't be done just as
easily with plain xml.
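The sameAs inference described above can be shown with plain triples, no full RDF stack required. The URIs are the hypothetical ones from the example; merging identifiers linked by "sameAs" lets a "childOf" assertion about person1 carry over to person2:

```python
FS = "http://www.familysearch.org/"
AN = "http://www.ancestry.com/"

# (subject, predicate, object) assertions from two different databases.
triples = {
    (FS + "person1", "sameAs", AN + "person2"),
    (FS + "person1", "childOf", AN + "person3"),
}

def same_as_group(uri, triples):
    """All identifiers reachable from `uri` via sameAs links (symmetric)."""
    group, frontier = {uri}, [uri]
    while frontier:
        u = frontier.pop()
        for s, p, o in triples:
            if p != "sameAs":
                continue
            if s == u and o not in group:
                group.add(o)
                frontier.append(o)
            elif o == u and s not in group:
                group.add(s)
                frontier.append(s)
    return group

def infer_parents(uri, triples):
    """childOf facts asserted about any sameAs-equivalent identifier."""
    group = same_as_group(uri, triples)
    return {o for s, p, o in triples if p == "childOf" and s in group}

print(infer_parents(AN + "person2", triples))
```

Nothing ever directly asserted that person2 is a child of person3; the fact falls out of the sameAs equivalence, which is exactly the kind of cross-database inference the semantic web promises.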
Now along those same lines, it sounds like the church's new genealogy
web site uses the same principles in that each person has a unique identifier
and that at least some limited inference can be done as far as asserting that
people are the same people. I'm very interested in being able to use the
application, and can't wait until it's finished. I haven't heard anything
about it for a while.
A huge problem with having computers do the research, in addition to the
Campbell's Soup problem, is the problem of ambiguity. If I search for John
Smith in Family Search, it will come back with a lot of names, and it's quite
difficult to figure out which John Smiths are the same person. Humans
would have to mark up which John Smiths are the same, which it sounds like is
the focus of the new church genealogy website.
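A scoring heuristic can at least pre-sort the candidates for that human judgment. Here is a toy illustration of the John Smith problem; the weights, threshold, and record fields are all invented:

```python
from difflib import SequenceMatcher

def match_score(a, b):
    """Crude similarity of two records on name and birth year.
    Weights (0.6 / 0.4) are arbitrary for illustration."""
    name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    year_sim = 1.0 if abs(a["birth_year"] - b["birth_year"]) <= 2 else 0.0
    return 0.6 * name_sim + 0.4 * year_sim

r1 = {"name": "John Smith", "birth_year": 1820}
r2 = {"name": "Jon Smith",  "birth_year": 1821}
r3 = {"name": "John Smith", "birth_year": 1861}

print(match_score(r1, r2))  # spelling variant, plausible same person
print(match_score(r1, r3))  # identical name, wrong generation
```

A real system would use many more fields (places, spouses, parents) and still hand the borderline cases to a human, but even a crude score like this separates a spelling variant of the same man from a namesake born forty years later.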
On 11/2/06, Paul
Penrod <[EMAIL PROTECTED]>
wrote:
What you're describing is the Campbell's Soup problem, which was part of
the AI research and deployment
back in the 1980's when Lisp and Business Intelligence systems were in
vogue.
However, before we can dive into the "done" part of your request, you
need to narrow it down to something
more specific. The implied assumption is that databases and information
heaps are similar in nature with respect to
data, arrangement, relationships, etc. They are not. Geopolitically,
there are in excess of 190 countries, kingdoms,
provinces, protectorates, etc. There are over 100 languages spoken,
plus there are great dissimilarities in
record keeping in terms of important information.
If you want to discuss a more concrete solution to "flailing about
looking for diamonds" (research), then let's
narrow the discussion to something smaller in scope with a known
terminus. From your surname, I would
take an educated guess that many of your records start or lie within the
US/UK/Ireland/Scotland venue, and
branch out to other points within Europe due to intermarriage within
lesser and greater royal lines (typical
for many people).
The Church has already placed a great deal of effort in this
data set, as have many other organizations;
partly due to US immigrant heritage at the time, and partly due to the
adoption of English record keeping,
laws, and practices. We used to be a collection of English colonies, so
that is a natural process.
Data mining in and of itself in this environment will yield a plethora
of false positives, unless you know more
specifically what you are looking for, AND you know your HISTORY in the
area and time you are researching.
For example, it was common during the middle ages through the Industrial
Revolution for women who had lost
husbands to marry a relative (sometimes a brother of the deceased). This
could be for economic reasons,
family reasons, politics, survival or any other reason that made sense
to them at the time. On the genealogy
charts, you will see the same names and sometimes information show for
multiple marriages. This is not a
mistake, but people who do not educate themselves and trust in the
computer only will see it as an error
in the reporting. Data mining does not help here. This is not a computer
science problem. It sits in the history
and genealogy domain and information management is merely the tool to
help us see things more clearly
as long as we understand the CONTEXT of the data within those domains as
presented. Noodling out an
algorithm to apply these kinds of tenuous possible data relationships is
noble, but not needed, given we
have been blessed with sufficient intelligence to work out the
relationships in our head, along with the
gift of the Holy Ghost (for those who will bother to use it).
Applications like PAF, and tools like GEDCOM and its derivatives, are
valuable in that they help to organize
existing data for ANALYSIS. They do not produce the end result.
So, let's talk about a more narrow, concrete scope for your problem.
Steven H. McCown wrote:
> Has anyone ever noticed that this list tends to concentrate on hashing and
> re-hashing which OSS tools are best? Then, the discussion moves
to whether
> client-server, webapps, or standalone apps are best. Next, we
always jump
> on to (my favorite) legal issues. Goto line 1 and repeat...
>
> I'd like to take a sideline from that and discuss problem solving issues
--
> just for a minute.
>
> I did some research for my family and came to a dead end. At
that point, I
> sat in several libraries and read book after book. Eventually,
place names
> and dates started to sound familiar. I started reading
genealogies for
> unrelated people that lived in the same place/time as my
family. Finally, I
> found families that had intermarried and surprisingly had clues for my own
> family. I've since been able to tie into some very old family
lines.
>
> That will sound very familiar to most researchers as that is the way
> genealogy is often done.
>
> With all that we know about computers, algorithms, searching, data mining,
> etc., is there anything that we can do to affect the research
process? To
> me, as a researcher, whether PAF is AJAX, C++, Python, is mainly a
> distraction. The only real requirement is that gen apps be
available to
> everyone -- whether on the net or not.
>
> So, the discussion that I'd like to hear is not an Info Tech discussion,
but
> a hardcore Computer Science one.
>
> Given the research paradigm that I described above, have you done anything
> that might allow researchers to data mine across databases and make
> inferences or suggestions to where to look when we get stumped?
>
> Thanks,
>
> Steve
>
> _______________________________________________
> Ldsoss mailing list
> [email protected]
> http://lists.ldsoss.org/mailman/listinfo/ldsoss
>
>
>