RE: [gdmxml] more thoughts on entering a source
Hans,

This has been a fruitful discussion, I think. If I could offer a few thoughts from an LWG perspective (even though Velke and Anderson know a great deal more than I do about it):

* The GDM was never meant to be a database design. I know that you've said that many times, but it bears repeating. In this case it's useful to repeat because you are concerned about redundant storage, and the LWG was not thinking about storage. At the same time, they were thinking about the relationships between entities, and perhaps this is one that can be decomposed.

If we oversimplify (because that helps me understand), let's instantiate some of these classes. Repository - Library. Source - Book. In theory, if I associate a book with a library I am describing their collection. I could associate a lot of sources with a repository, including call numbers and their condition, without being involved in a genealogical search. I'm not certain, but I think that this association might best be referred to as a CATALOG, which is a well-established model for that association.

I think that the LWG may have thought that all linking of sources to repositories would take place as the result of a research activity, hence the association of activity to this association of sources and repositories. On reflection, it seems reasonable to have two separate associations - one of SOURCE to REPOSITORY (called CATALOG?), and another of ACTIVITY to SOURCE-REPOSITORY (or CATALOG).

I don't think that the LWG ever imagined that the Allen County Public Library might publish an electronic catalog that was compatible with a GDM-compatible client. Hey, it was 1996. Now it doesn't seem so far-fetched that a GDM-compatible client could contain links to online catalogs - assuming that they aren't being revised in ways that break the links.

Does that complicate the issue sufficiently?
Beau

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Hans Fugal
Sent: Wednesday, July 10, 2002 11:08 PM
To: [EMAIL PROTECTED]
Subject: Re: [gdmxml] more thoughts on entering a source

I spent a while wrestling this out with my brother Jacob today. There are situations where one would need to know more than just repository-id and source-id. For instance, if a particular repository had more than one copy of a source and you wanted to indicate which one you had searched, repository-id and source-id are not sufficient - you would also need to know the call-number. But the call-number itself is not unique, so it can't be used as the primary key in repository-source. Using activity-id as the third key doesn't seem to work though, because of the extreme redundancy I pointed out.

I think repository-source needs an id field as a primary key; then search can reference that repository-source-id instead of having repository-id and source-id, and we take activity-id out of repository-source.

Jacob also helped me see the light on these associative tables (like repository-source and source-group-source). While I understood their importance in a database context, I was tempted to collapse them a bit in xml context. While that's possible to do while still keeping data integrity, it is better to keep them separate.

As always, I welcome your feedback...

<hans/>

* Stan Mitchell [Tue, 9 Jul 2002 at 23:12 -0700]
<quote>
Yes, it does seem that your suggestion reduces redundancy without sacrificing search capability.

Hans Fugal wrote:

But then you have to store call-numbers possibly many times. For example, a professional researcher would doubtless perform many searches in any particular US Census. For that Census the repository, source, call number and description would all be the same for every repository-source record. The only unique information in each record would be the activity-id.

Yet if we take out the activity-id from repository-source we get rid of that redundancy. AFAICS there is no loss of querying power when we do so - search has all three keys, so if you want to know which searches you did on a particular call-number, you only have to query the search table with the repository-id and source-id. Or am I still missing something?
</quote>

-- 
Everybody is talking about the weather but nobody does anything about it.
        -- Mark Twain
___
gdmxml mailing list
[EMAIL PROTECTED]
http://fugal.net/cgi-bin/mailman/listinfo/gdmxml
Re: [gdmxml] more thoughts on entering a source
Goodness, I must be getting bounce-happy. Sorry about that. Good thing I didn't expose any secret passwords or anything...

Hans :)

* Hans Fugal [Fri, 12 Jul 2002 at 10:56 -0600]
<quote>
Hi Beau,

I will write more later - I have to get out the door in a minute. Was this intended to be off-list? May I bounce it to the list?

Hans :)
</quote>
Re: [gdmxml] more thoughts on entering a source
Actually, it clarifies it in my mind. I think what you described is exactly what I was thinking. I like the name catalog as well. I think that for a database implementation (which the GDM itself is not) and also for this xml implementation (although not strictly necessary) I will go with that model. In any case, it can be recomposed into the original GDM so I'll go that route.

<hans/>
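To make the "catalog" decomposition concrete, here is a minimal Python sketch of the two separate associations and of how they might be recomposed into the GDM's original three-way records. Every class name, field name, id, and call number below is a hypothetical illustration; none of it comes from the GDM specification or from the gdmxml schema.

```python
from dataclasses import dataclass


# All names and sample values here are illustrative assumptions.
@dataclass(frozen=True)
class CatalogEntry:
    """One SOURCE held by one REPOSITORY (the proposed CATALOG association)."""
    catalog_id: int
    repository_id: str
    source_id: str
    call_number: str
    description: str


@dataclass(frozen=True)
class ActivityCatalogLink:
    """One research ACTIVITY that consulted one catalog entry."""
    activity_id: int
    catalog_id: int


def recompose(entries, links):
    """Rebuild three-way (repository, source, activity) records from the two associations."""
    by_id = {entry.catalog_id: entry for entry in entries}
    records = []
    for link in links:
        entry = by_id[link.catalog_id]
        records.append(
            (entry.repository_id, entry.source_id, link.activity_id,
             entry.call_number, entry.description)
        )
    return records


# Made-up sample data: one catalog entry consulted by two searches.
entries = [CatalogEntry(1, "library-1", "census-1", "CN-1", "microfilm, good")]
links = [ActivityCatalogLink(101, 1), ActivityCatalogLink(102, 1)]
records = recompose(entries, links)
```

The call number and description are stored once in the catalog entry, yet `recompose` still yields one full record per activity, which is the sense in which the decomposed model can be turned back into the original one.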
Re: [gdmxml] more thoughts on entering a source
I spent a while wrestling this out with my brother Jacob today. There are situations where one would need to know more than just repository-id and source-id. For instance, if a particular repository had more than one copy of a source and you wanted to indicate which one you had searched, repository-id and source-id are not sufficient - you would also need to know the call-number. But the call-number itself is not unique, so it can't be used as the primary key in repository-source. Using activity-id as the third key doesn't seem to work though, because of the extreme redundancy I pointed out.

I think repository-source needs an id field as a primary key; then search can reference that repository-source-id instead of having repository-id and source-id, and we take activity-id out of repository-source.

Jacob also helped me see the light on these associative tables (like repository-source and source-group-source). While I understood their importance in a database context, I was tempted to collapse them a bit in xml context. While that's possible to do while still keeping data integrity, it is better to keep them separate.

As always, I welcome your feedback...

<hans/>

* Stan Mitchell [Tue, 9 Jul 2002 at 23:12 -0700]
<quote>
Yes, it does seem that your suggestion reduces redundancy without sacrificing search capability.

Hans Fugal wrote:

But then you have to store call-numbers possibly many times. For example, a professional researcher would doubtless perform many searches in any particular US Census. For that Census the repository, source, call number and description would all be the same for every repository-source record. The only unique information in each record would be the activity-id.

Yet if we take out the activity-id from repository-source we get rid of that redundancy. AFAICS there is no loss of querying power when we do so - search has all three keys, so if you want to know which searches you did on a particular call-number, you only have to query the search table with the repository-id and source-id. Or am I still missing something?
</quote>
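The surrogate-key proposal above can be sketched as a pair of tables. This is only one possible reading, using Python's built-in sqlite3 module; the table and column names are assumptions modeled on the terms used in this thread, and the call numbers and descriptions are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical schema: repository_source gets its own id and no activity_id;
# search points at a single repository_source row instead of carrying
# repository_id and source_id itself.
conn.executescript("""
CREATE TABLE repository_source (
    repository_source_id INTEGER PRIMARY KEY,
    repository_id        TEXT,
    source_id            TEXT,
    call_number          TEXT,
    description          TEXT
);
CREATE TABLE search (
    activity_id          INTEGER PRIMARY KEY,
    repository_source_id INTEGER REFERENCES repository_source(repository_source_id)
);
""")

# Two copies of the same source in the same repository (made-up call numbers).
conn.executemany(
    "INSERT INTO repository_source VALUES (?, ?, ?, ?, ?)",
    [
        (1, "fhl", "film0049002", "copy-A", "good condition"),
        (2, "fhl", "film0049002", "copy-B", "scratched"),
    ],
)
# Three searches: two in the first copy, one in the second.
conn.executemany("INSERT INTO search VALUES (?, ?)", [(101, 1), (102, 1), (103, 2)])
conn.commit()

# Which copy did each search of this repository/source pair use?
rows = conn.execute(
    """
    SELECT s.activity_id, rs.call_number
    FROM search AS s
    JOIN repository_source AS rs USING (repository_source_id)
    WHERE rs.repository_id = ? AND rs.source_id = ?
    ORDER BY s.activity_id
    """,
    ("fhl", "film0049002"),
).fetchall()

copy_count = conn.execute("SELECT COUNT(*) FROM repository_source").fetchone()[0]
```

Each copy is stored once however many searches touch it, and a search still identifies exactly which copy was consulted, which a bare repository-id plus source-id pair could not do.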
Re: [gdmxml] more thoughts on entering a source
hmm, that's an interesting perspective. This made me look closer at repository-source, and I am a little muddy now...

It looks like repository-source ties 0 or 1 repositories to 0 or 1 sources to 0 or 1 activities (searches). It seems to me that this opens the door for data redundancy - there could be (and I think would be) many searches in one repository/source combination. But with repository-source as it is we have to duplicate not only the repository and source ids, but also the call-number and description.

I don't see why search has repository-id and source-id and repository-source has activity-id. Why not take activity-id out of repository-source, leaving it only to link repositories and sources, and take out the repository-id and source-id from search? What am I missing - why did the Lexicon Group do it this way?

<hans/>

* Stan Mitchell [Tue, 9 Jul 2002 at 10:35 -0700]
<quote>
A few thoughts on repository-source ...

IMHO from an OO point-of-view, repository-source seems to be a separate class. It represents the association between no or one instance of source and no or one instance of repository, with the constraint that there be at least one source or one repository. When a search succeeds, a source and a repository are tied together, and information such as the call-number and a description of the condition of the particular source is stored in the repository-source instance.

Another way of looking at it is as the link between the Administration and Evidence submodels, but with perhaps a closer link to Admin. Maybe repository-source could be a child of search.

Stan Mitchell

Hans Fugal wrote:

One repository exists in one place, so it seems natural to make repository a child element of place. I've also made place-part a child of place for the same reason. The GDM calls for a sequence number on each place-part of a place, and an ordering scheme of the place-parts of a place. With XML order matters (unless we say it doesn't) so I see no need for a sequence number; it is implied.

On those many-to-many relationships: repository-source isn't as clean cut in my mind as source-group-source was, and now I'm not as clear about that either. For one thing, the naming becomes hairy. Naturally we don't want to make source a child element of repository, because a source could exist in more than one repository; the other way around is even more ludicrous. So, we need to reference the sources in the repository or reference the repositories in the sources. So I think perhaps:

    <source id="film0049002">
      <citation-part citation-part-type="film">0049002</citation-part>
      <repository-source idref="fhl"/>
    </source>

That name, repository-source, makes perfect sense in database context, but I think it's confusing in this context, where it is a child element of the source element. Perhaps repository-ref. Maybe we can even allow a repository-source element from either a source element or a repository element - that may be harder to deal with in implementation though, and there is no way to avoid the possibility of duplicates.

So my question for anyone who has an opinion is which is better: to put it in one of the elements (i.e. a source element has a repository-ref child element), or to have a separate (non-child) repository-source element?

<hans/>
</quote>
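For anyone weighing the repository-ref option, the following Python sketch shows how a client might resolve such references and recover place-part sequence numbers from document order alone. It uses the standard xml.etree.ElementTree module. The `gdm` root element, the place id, and the place-part text are made up for the example, and `repository-ref` is only the name floated in this message, not a settled part of any schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: repository as a child of place, and a source that
# points at the repository by idref rather than containing it.
DOC = """
<gdm>
  <place id="place1">
    <place-part>Example City</place-part>
    <place-part>Example State</place-part>
    <repository id="fhl"/>
  </place>
  <source id="film0049002">
    <repository-ref idref="fhl"/>
  </source>
</gdm>
"""

root = ET.fromstring(DOC)

# Index every repository by id so that references can be resolved.
repositories = {repo.get("id"): repo for repo in root.iter("repository")}

# For each source, list the repositories it references.
holdings = {
    source.get("id"): [ref.get("idref") for ref in source.findall("repository-ref")]
    for source in root.iter("source")
}

# Any idref with no matching repository would be a data-integrity problem.
dangling = [
    idref
    for idrefs in holdings.values()
    for idref in idrefs
    if idref not in repositories
]

# Sequence numbers are implied by document order, so none need to be stored.
place = root.find("place")
place_parts = [
    (number, part.text)
    for number, part in enumerate(place.findall("place-part"), start=1)
]
```

Putting the reference on only one side (here, the source) means each link is written once, which sidesteps the duplicate problem that arises when either element may carry it.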
Re: [gdmxml] more thoughts on entering a source
Thanks for the explanations, Stan! Things are beginning to clear up. See below...

* Stan Mitchell [Tue, 9 Jul 2002 at 15:03 -0700]
> Yes, you're right, it is a three-way association with 0-1 instances of
> activity. From a database perspective, repository-source is an associative
> table - no primary key, only foreign keys. So it would be useful for
> performing queries on various combinations of the foreign keys.
>
> I think the three ids (activity, repository, source) serve to identify a
> specific search. In Activity/Search, activity-id is a primary key. Each
> record defines one of three possible kinds of searches:
>
> 1- source without a known repository (search for a repository?)
> 2- repository without a particular source in mind (search for sources?)
> 3- a source known to exist in a specific repository (normal search)

Ahh, I see why we need the three keys in the search table.

> On the other hand, repository-source is indexed by the three ids. If a
> repository has several copies of a source, then a given search in one of
> those copies would require a separate repository-source record to store
> the call number, etc. If you were interested in which copies
> (call-numbers) of the source you had looked at in a repository, a query of
> just those repository-source records which match repository-id and
> source-id would give that info.

But then you have to store call-numbers possibly many times. For example, a professional researcher would doubtless perform many searches in any particular US Census. For that Census the repository, source, call number and description would all be the same for every repository-source record. The only unique information in each record would be the activity-id.

Yet if we take out the activity-id from repository-source we get rid of that redundancy. AFAICS there is no loss of querying power when we do so - search has all three keys, so if you want to know which searches you did on a particular call-number, you only have to query the search table with the repository-id and source-id. Or am I still missing something?

> BTW, I'm a software engineer in the San Francisco bay area. Genealogy is a
> side interest of mine. I have studied the gdm spec and have an idea to
> represent it using UML. My background is more in C++, OOP, and systems
> programming. XML is still new to me.

I've often thought a UML representation of the GDM would be useful. Let me know if you come up with one!

<hans/>
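The redundancy argument in this message can be checked with a few lines of Python. This is a toy comparison, not the GDM's actual record layout: the dictionary keys mirror the ids named in the thread, the values are invented, and it assumes a single copy of the source per repository, so it does not cover the several-copies case Stan describes.

```python
# Four searches in the same census at the same repository (made-up ids).
activity_ids = [101, 102, 103, 104]

# Three-key shape: one repository-source record per search, so the call
# number and description are repeated for every activity.
three_key_records = [
    {
        "activity_id": activity_id,
        "repository_id": "repo-1",
        "source_id": "census-1",
        "call_number": "CN-1",
        "description": "microfilm",
    }
    for activity_id in activity_ids
]

# Shape with activity-id removed: one repository-source record in total,
# while the search table keeps all three keys.
two_key_records = [
    {
        "repository_id": "repo-1",
        "source_id": "census-1",
        "call_number": "CN-1",
        "description": "microfilm",
    }
]
search_table = [
    {"activity_id": activity_id, "repository_id": "repo-1", "source_id": "census-1"}
    for activity_id in activity_ids
]


def searches_for(repository_id, source_id):
    """Return the activity ids of searches made in one repository/source pair."""
    return [
        row["activity_id"]
        for row in search_table
        if row["repository_id"] == repository_id and row["source_id"] == source_id
    ]


copies_before = sum(1 for r in three_key_records if r["call_number"] == "CN-1")
copies_after = sum(1 for r in two_key_records if r["call_number"] == "CN-1")
found = searches_for("repo-1", "census-1")
```

Under these assumptions the call number is stored four times in the first shape and once in the second, and `searches_for` still finds all four searches from the search table alone.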