There are several sides to the
genealogical research problem. There is the UI side where the actual
entry of research data needs to be improved. There is also the artificial
intelligence side where we should be able to apply AI techniques to help
researchers. Then there is the collaboration side where there is too much
duplicated research being done.
At the last FHT conference the keynote
speaker from Google talked about how they are working on what he called “AI
in the middle”. The concept is that users are still going to
have to (and want to) do some manual filtering of search results, but the “AI
in the middle” can aid in that. I think that we need to be thinking
the same way about AI with regard to genealogy. The researcher is still
going to have to (and want to) manually analyze their data, but we can provide “AI
in the middle” to help them with their research.
IMHO, web services will be a more effective place
for our efforts than the semantic web. I used to
work for a major library which had a whole department dedicated to analyzing raw
and textual data and categorizing it using Library of Congress ontologies.
It took professionals thousands of hours to do 100 journals. I don’t
think average genealogists will have the patience. So we would have to
look at automated ways to do it. But it doesn’t make sense to have
an automated system convert structured data into unstructured data and then
apply hidden markup to try to add structure back to it that will only be of
use to another automated system. Instead we should work to connect the
two automated systems using web services so that they share the structured
data.
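To make the point concrete, here is a minimal sketch of two hypothetical services exchanging structured data directly as JSON, with no unstructured intermediate step. All the endpoint behavior, record shape, and field names are my own illustration, not any real genealogy API:

```python
import json

# Hypothetical: "Service A" (one genealogy app) exports a person record as
# structured JSON; "Service B" (another app) consumes that same structure
# directly. No structured -> text -> re-marked-up round trip is needed.

def service_a_export(person_id):
    """Stand-in for a web-service endpoint returning structured data."""
    records = {
        "p1": {"id": "p1", "name": "John Smith",
               "birth": {"date": "1823", "place": "County Cork, Ireland"}},
    }
    return json.dumps(records[person_id])  # the payload on the wire

def service_b_import(payload):
    """The consuming service parses the structure back, losslessly."""
    return json.loads(payload)

record = service_b_import(service_a_export("p1"))
print(record["birth"]["place"])
```

The round trip preserves every field exactly, which is the whole argument: when both ends are automated systems, sharing the structured form directly loses nothing.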
I am currently researching Bayesian
data mining techniques that analyze existing genealogy data, then use
what they learn to evaluate records and guide researchers to the sources
most likely to contain information about a particular person, helping them
fill in the holes.
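As a toy sketch of that idea (my own illustration, not the actual module), Bayes' rule can rank source types for a person by combining a prior over sources with how often each source has yielded hits for people with the same attributes. All the counts and source names below are invented training data:

```python
# hits[source][attribute] = past successful lookups in that source for
# people with that attribute (place of origin, era). Invented numbers.
hits = {
    "census": {"US": 40, "1850s": 30},
    "parish_register": {"Ireland": 35, "1850s": 25},
    "ship_manifest": {"Ireland": 20, "US": 15},
}
prior = {"census": 0.5, "parish_register": 0.3, "ship_manifest": 0.2}

def rank_sources(attributes):
    """Score each source by prior * per-attribute likelihoods (naive Bayes)."""
    scores = {}
    for source, counts in hits.items():
        total = sum(counts.values())
        score = prior[source]
        for a in attributes:
            # add-one smoothing so an unseen attribute doesn't zero a source
            score *= (counts.get(a, 0) + 1) / (total + len(attributes))
        scores[source] = score
    return sorted(scores, key=scores.get, reverse=True)

print(rank_sources(["Ireland", "1850s"]))
```

For an Irish ancestor in the 1850s the parish register ranks first despite the census's higher prior, which is exactly the kind of "look here next" guidance a researcher stuck at a dead end could use.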
For those interested, I have been working
on a Research Assistant module as part of the PhpGedView project which is
designed to meet these and other specific problems genealogical researchers
face.
The specific goals of the research
assistant are:
- Collaborative research among
family members. Allow families to use an online system to better coordinate
and focus their research so that there is less duplication.
- Persistent log of all research
activity linked to people and sources. The data only tells part of
the story. There is still a wealth of information about where other
researchers looked and did NOT find anything that also needs to be
recorded and logged. It is extremely frustrating for researchers to
go to the library and spend hours looking and not finding anything just to
learn later that another family member (or even they themselves years
earlier) had already looked in that source.
- Source-centric data
entry. Because the data and the research log are integrated, as you enter
your results from genealogical research it is immediately applied to your
genealogical data. Data entry is also source-centric, so that
you may enter multiple facts about multiple people all related to the
results of that research. This helps to solve the Person vs. Source centric
problems we are all familiar with. We want to view and work with the
data from a person perspective, but we want to enter and manipulate it
from a source perspective. Consider the wealth of information in a census
for example. With most programs you have to manually navigate to
each person and enter all of the factual data you can glean from that
census and you have to remember to properly cite and source all of that
data back to the same census. With the research assistant you can
enter all of that once and it is automatically dispersed everywhere it
needs to be. And it is properly sourced and cited.
- Artificial intelligence to
analyze research results. As you enter data, the program will
automatically compare that data with the people it relates to and suggest
facts and genealogical data that can be inferred from the results of the
research. Then with a simple click of a single button, the user can
choose to apply that inferred data or ignore the AI.
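The source-centric fan-out described above can be sketched in a few lines. The researcher enters one source (say, a census page) and the facts found in it, and each fact is dispersed to the right person's record with the citation attached automatically. Class and field names here are illustrative, not PhpGedView's actual data model:

```python
from dataclasses import dataclass, field

@dataclass
class Person:
    name: str
    facts: list = field(default_factory=list)  # (fact, citation) pairs

def apply_source(source_citation, findings, people):
    """Fan out findings from one source: each fact lands on the right
    person's record, already carrying the shared citation."""
    for person_name, fact in findings:
        people[person_name].facts.append((fact, source_citation))

# One entry session for one census page, covering multiple people at once.
people = {"Mary": Person("Mary"), "Thomas": Person("Thomas")}
apply_source(
    "1850 US Census, Ohio, p. 12",
    [("Mary", "age 34"), ("Mary", "born Ireland"), ("Thomas", "age 8")],
    people,
)
print(people["Mary"].facts)
```

The citation is written once but appears on every fact, which is the contrast with person-centric programs where you re-enter and re-cite the same census on each individual.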
Anyway, I think that there is definitely
work to be done in the genealogy AI arena that can help researchers.
--John
From:
[EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Jay Askren
Sent: Friday, November 03, 2006
7:15 AM
To: LDS Open Source Software
Subject: Re: [Ldsoss] Automating
Genealogy Research
I would have to disagree about this not being a computer science
problem. Computers can't solve the problem for sure, but they can take us
farther than we are now. Our company focuses on Artificial Intelligence
research and my background is in Computational Linguistics, in other
words, Natural Language Processing.
First, there are several tools that I know about that are at least
heading in the right direction.
http://www.ancestry.com -
Ancestry does have a little data mining group which is working on extracting
information from the web automatically. I applied for a job there a
couple of years ago, but they were looking for more of a web interface person
which isn't my strongest skill. An example of where data mining/text
processing is used is their obituary search. They have a crawler
which extracts information from obituaries so it can be added to their search
engine. It certainly isn't perfect, but it does a decent job. I
believe they also have a group which tries to combine information from various
databases and build family trees with it.
http://www.werelate.org/ -
We Relate searches the web for genealogy information. I think this is
becoming a great resource and will only get better.
These are all search engines, which isn't quite what you were
asking for, but I think it's as close as we can get for now. It may
even be the best we will ever do, but if we can make better search engines,
that will make genealogy much easier. It would really be great if we could
digitize genealogy books and make those searchable.
One paradigm which has been around for a while but hasn't taken off yet
is the semantic web. In theory it could enable more of what you are
talking about. The idea is to codify information so it can be understood
by a machine. So, in the context of genealogy, each person would have a
unique identifier called a URI. Then we can make assertions about
different people. For instance we can assert that http://www.familysearch.org/person1
is the same as http://www.ancestry.com/person2 and
that http://www.familysearch.org/person1 is
the child of http://www.ancestry.com/person3.
The semantic web technologies can also do inference, so a software agent could
infer that person 2 is also the child of person 3. Here's
some more reading on it:
In theory if all genealogy databases were coded up using the semantic web
languages, computers could combine them to make family trees with some help
from humans. In practice I don't know if this will ever happen. I
don't know if the semantic web will ever take off. It's still very much
a research topic and has been for quite some time. I haven't really seen
a real application come out of the research that couldn't be done just as
easily with plain xml.
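The sameAs inference described above can be shown with plain triples, no full RDF stack required. The URIs are the hypothetical ones from the example; merging identifiers linked by "sameAs" lets a "childOf" assertion about person1 carry over to person2:

```python
FS = "http://www.familysearch.org/"
AN = "http://www.ancestry.com/"

# (subject, predicate, object) assertions from two different databases.
triples = {
    (FS + "person1", "sameAs", AN + "person2"),
    (FS + "person1", "childOf", AN + "person3"),
}

def same_as_group(uri, triples):
    """All identifiers reachable from `uri` via sameAs links (symmetric)."""
    group, frontier = {uri}, [uri]
    while frontier:
        u = frontier.pop()
        for s, p, o in triples:
            if p != "sameAs":
                continue
            if s == u and o not in group:
                group.add(o)
                frontier.append(o)
            elif o == u and s not in group:
                group.add(s)
                frontier.append(s)
    return group

def infer_parents(uri, triples):
    """childOf facts asserted about any sameAs-equivalent identifier."""
    group = same_as_group(uri, triples)
    return {o for s, p, o in triples if p == "childOf" and s in group}

print(infer_parents(AN + "person2", triples))
```

Nothing ever directly asserted that person2 is a child of person3; the fact falls out of the sameAs equivalence, which is exactly the kind of cross-database inference the semantic web promises.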
Now along those same lines, it sounds like the church's new genealogy
web site uses the same principles in that each person has a unique identifier
and that at least some limited inference can be done as far as asserting that
people are the same people. I'm very interested in being able to use the
application, and can't wait until it's finished. I haven't heard anything
about it for a while.
A huge problem with having computers do the research, in addition to the
Campbell's Soup problem, is the problem of ambiguity. If I search for John
Smith in Family Search, it will come back with a lot of names, and it's quite
difficult to figure out which John Smiths are the same person. Humans
would have to mark up which John Smiths are the same, which it sounds like is
the focus of the new church genealogy website.
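A scoring heuristic can at least pre-sort the candidates for that human judgment. Here is a toy illustration of the John Smith problem; the weights, threshold, and record fields are all invented:

```python
from difflib import SequenceMatcher

def match_score(a, b):
    """Crude similarity of two records on name and birth year.
    Weights (0.6 / 0.4) are arbitrary for illustration."""
    name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    year_sim = 1.0 if abs(a["birth_year"] - b["birth_year"]) <= 2 else 0.0
    return 0.6 * name_sim + 0.4 * year_sim

r1 = {"name": "John Smith", "birth_year": 1820}
r2 = {"name": "Jon Smith",  "birth_year": 1821}
r3 = {"name": "John Smith", "birth_year": 1861}

print(match_score(r1, r2))  # spelling variant, plausible same person
print(match_score(r1, r3))  # identical name, wrong generation
```

A real system would use many more fields (places, spouses, parents) and still hand the borderline cases to a human, but even a crude score like this separates a spelling variant of the same man from a namesake born forty years later.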
On 11/2/06, Paul
Penrod <[EMAIL PROTECTED]>
wrote:
What you're describing is the Campbell's Soup problem, which was part of
the AI research and deployment
back in the 1980's when Lisp and Business Intelligence systems were in
vogue.
However, before we can dive into the "done" part of your request, you
need to narrow it down to something
more specific. The implied assumption is that databases and information
heaps are similar in nature with respect to
data, arrangement, relationships, etc. They are not. Geopolitically,
there are in excess of 190 countries, kingdoms,
provinces, protectorates, etc. There are over 100 languages spoken,
plus there are great dissimilarities in
record keeping in terms of important information.
If you want to discuss a more concrete solution to "flailing about
looking for diamonds" (research), then let's
narrow the discussion to something smaller in scope with a known
terminus. From your surname, I would
take an educated guess that many of your records start or lie within the
US/UK/Ireland/Scotland venue, and
branch out to other points within Europe due to intermarriage within
lesser and greater royal lines (typical
for many people).
The Church has already placed a great deal of effort in this
data set, as have many other organizations;
partly due to US immigrant heritage at the time, and partly due to the
adoption of English record keeping,
laws, and practices. We used to be a collection of English colonies, so
that is a natural process.
Data mining in and of itself in this environment will yield a plethora
of false positives, unless you know more
specifically what you are looking for, AND you know your HISTORY in the
area and time you are researching.
For example, it was common during the middle ages through the Industrial
Revolution for women who had lost
husbands to marry a relative (sometimes a brother of the deceased). This
could be for economic reasons,
family reasons, politics, survival or any other reason that made sense
to them at the time. On the genealogy
charts, you will see the same names and sometimes information show for
multiple marriages. This is not a
mistake, but people who do not educate themselves and trust in the
computer only will see it as an error
in the reporting. Data mining does not help here. This is not a computer
science problem. It sits in the history
and genealogy domain and information management is merely the tool to
help us see things more clearly
as long as we understand the CONTEXT of the data within those domains as
presented. Noodling out an
algorithm to apply these kinds of tenuous possible data relationships is
noble, but not needed, given we
have been blessed with sufficient intelligence to work out the
relationships in our head, along with the
gift of the Holy Ghost (for those who will bother to use it).
Applications like PAF, and tools like GEDCOM and its derivatives, are
valuable in that they help to organize
existing data for ANALYSIS. They do not produce the end result.
So, let's talk about a more narrow, concrete scope for your problem.
Steven H. McCown wrote:
> Has anyone ever noticed that this list tends to concentrate on hashing and
> re-hashing which OSS tools are best? Then, the discussion moves
to whether
> client-server, webapps, or standalone apps are best. Next, we
always jump
> on to (my favorite) legal issues. Goto line 1 and repeat...
>
> I'd like to take a sideline from that and discuss problem solving issues
--
> just for a minute.
>
> I did some research for my family and came to a dead end. At
that point, I
> sat in several libraries and read book after book. Eventually,
place names
> and dates started to sound familiar. I started reading
genealogies for
> unrelated people that lived in the same place/time as my
family. Finally, I
> found families that had intermarried and surprisingly had clues for my own
> family. I've since been able to tie into some very old family
lines.
>
> That will sound very familiar to most researchers as that is the way
> genealogy is often done.
>
> With all that we know about computers, algorithms, searching, data mining,
> etc., is there anything that we can do to affect the research
process? To
> me, as a researcher, whether PAF is AJAX, C++, Python, is mainly a
> distraction. The only real requirement is that gen apps be
available to
> everyone -- whether on the net or not.
>
> So, the discussion that I'd like to hear is not an Info Tech discussion,
but
> a hardcore Computer Science one.
>
> Given the research paradigm that I described above, have you done anything
> that might allow researchers to data mine across databases and make
> inferences or suggestions to where to look when we get stumped?
>
> Thanks,
>
> Steve
>
> _______________________________________________
> Ldsoss mailing list
> [email protected]
> http://lists.ldsoss.org/mailman/listinfo/ldsoss
>
>
>