Re: [CODE4LIB] MARC field lengths
Thanks, Bill. What you say about "assumptions" is a good part of what is motivating me to try to instigate a discussion. As you know, both FRBR and RDA were developed by the cataloging community with no input from technologists. There are sweeping statements about FRBR being "more efficient" than the MARC model, but without, as far as I can find, any real analysis. There was a study done at OCLC on the ratio of Works to Manifestations (and that shows in their stats today), but the OCLC catalog is not representative of the catalog of a single library. What I'm hoping to do is to surface some of the assumptions so that we can talk about them. I'll make a stab at an analysis, but I'm really interested in the conversation that could follow what I have to say.

kc

On 10/16/13 5:43 PM, Bill Dueber wrote:

My guess is that traversing the WEM structure for display of a single record (e.g., in a librarian's ILS client or whatnot) will not be a problem at all, because the volume is so low. In terms of the OPAC interface itself, well, there are lots and lots of ways to denormalize the data (meaning "copy over and inline data whose canonical values are in their own tables somewhere") for search and display purposes. Heck, lots of us do this on a smaller and less complicated scale already, as we dump data into Solr for our public catalogs. This adds complexity to the system (determining what to denormalize, determining when some underlying value has changed and knowing what other elements need updating), but it's the sort of complexity that's been well studied and doesn't worry me too much. I'm much, *much* more "nerd" than "librarian," and if there's one thing I wish I could get across to people who swing the other way, it's that getting the data model right is so very much harder than figuring out how to process it.
Make sure the individual elements are machine-intelligible, and there are hordes of smart people (both within and outside of the library world) who will figure out how to store and retrieve it efficiently(-enough). And, for the love of god, have someone around who can at least speak authoritatively about what sorts of things fall into the "hard" and "easy-peasy" categories in terms of the technology, instead of making assumptions.
Re: [CODE4LIB] MARC field lengths
My guess is that traversing the WEM structure for display of a single record (e.g., in a librarian's ILS client or whatnot) will not be a problem at all, because the volume is so low. In terms of the OPAC interface itself, well, there are lots and lots of ways to denormalize the data (meaning "copy over and inline data whose canonical values are in their own tables somewhere") for search and display purposes. Heck, lots of us do this on a smaller and less complicated scale already, as we dump data into Solr for our public catalogs. This adds complexity to the system (determining what to denormalize, determining when some underlying value has changed and knowing what other elements need updating), but it's the sort of complexity that's been well studied and doesn't worry me too much.

I'm much, *much* more "nerd" than "librarian," and if there's one thing I wish I could get across to people who swing the other way, it's that getting the data model right is so very much harder than figuring out how to process it. Make sure the individual elements are machine-intelligible, and there are hordes of smart people (both within and outside of the library world) who will figure out how to store and retrieve it efficiently(-enough). And, for the love of god, have someone around who can at least speak authoritatively about what sorts of things fall into the "hard" and "easy-peasy" categories in terms of the technology, instead of making assumptions.

On Wed, Oct 16, 2013 at 6:23 PM, Karen Coyle wrote:

> Yes, that's my take as well, but I think it's worth quantifying if
> possible. There is the usual trade-off between time and space -- and I'd be
> interested in hearing whether anyone here thinks that there is any concern
> about traversing the WEM structure for each search and display. Does it
> matter if every display of author in a Manifestation has to connect M-E-W?
> Or is that a concern, like space, that is no longer relevant?
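Bill's denormalization approach can be sketched in a few lines. This is a hypothetical, minimal illustration, not any real system's code: the Work/Expression/Manifestation "tables" are plain dicts, and the field names are invented for the example. The point is that the canonical values live in their own tables, and we copy them inline into one flat document per manifestation, the way one would when indexing to Solr.

```python
# Hypothetical sketch of denormalizing WEM data for search/display.
# Canonical records live in their own "tables" (dicts here); we inline
# their values into one flat document per manifestation, so no M-E-W
# join is needed at query time. All names and fields are illustrative.

works = {"w1": {"title": "Moby Dick", "author": "Melville, Herman"}}
expressions = {"e1": {"work": "w1", "language": "eng"}}
manifestations = {"m1": {"expression": "e1", "isbn": "9780142437247"}}

def denormalize(man_id):
    """Flatten Manifestation -> Expression -> Work into one search doc."""
    man = manifestations[man_id]
    expr = expressions[man["expression"]]
    work = works[expr["work"]]
    return {
        "id": man_id,
        "isbn": man["isbn"],
        "language": expr["language"],
        # copied over from the Work, whose canonical value stays in `works`
        "title": work["title"],
        "author": work["author"],
    }

doc = denormalize("m1")
```

The complexity Bill mentions lives outside this function: deciding which fields to inline, and re-running `denormalize` for every affected manifestation when a canonical Work value changes.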
Re: [CODE4LIB] MARC field lengths
On 10/16/13 4:22 PM, Kyle Banerjee wrote:

> In some ways, FRBR strikes me as the catalogers' answer to the miserable seven-layer OSI model, which often confuses rather than clarifies -- largely because it doesn't reflect reality very well.

Agreed. I am having trouble seeing FRBR as being beneficial, much less necessary. However, there is a widespread assumption that FRBR's WEMI will be implemented as a four-level, linked set of hierarchical entities, rather than that FRBR is a conceptual model (which is what the FRBR documentation says). If there are reasons to present users with works, expressions and manifestations, nothing in that requires a physical model that looks like some kind of relational database design. Yet that seems to be what many people assume. So I'd like to expose that myth, or at least provide a way to discuss it.

kc
--
Karen Coyle
kco...@kcoyle.net http://kcoyle.net
m: 1-510-435-8234
skype: kcoylenet
Re: [CODE4LIB] MARC field lengths
Depends on how many requests the service has to accommodate. Up to a point, it's no big deal. After a certain point, servicing lots of calls gets expensive and bang for the buck is brought into question.

My bigger concern would be getting data encoded/structured consistently. Even though FRBR has been around for a long time, people spend a lot of time scratching their heads about really basic stuff (e.g. what level something belongs on) when dealing with real-world use cases. And it's hard to automate tasks when the people aren't sure what the machine needs to do. In some ways, FRBR strikes me as the catalogers' answer to the miserable seven-layer OSI model, which often confuses rather than clarifies -- largely because it doesn't reflect reality very well.

kyle

On Wed, Oct 16, 2013 at 3:23 PM, Karen Coyle wrote:

> Yes, that's my take as well, but I think it's worth quantifying if
> possible. There is the usual trade-off between time and space -- and I'd be
> interested in hearing whether anyone here thinks that there is any concern
> about traversing the WEM structure for each search and display. Does it
> matter if every display of author in a Manifestation has to connect M-E-W?
> Or is that a concern, like space, that is no longer relevant?
>
> kc
Re: [CODE4LIB] MARC field lengths
Yes, that's my take as well, but I think it's worth quantifying if possible. There is the usual trade-off between time and space -- and I'd be interested in hearing whether anyone here thinks that there is any concern about traversing the WEM structure for each search and display. Does it matter if every display of author in a Manifestation has to connect M-E-W? Or is that a concern, like space, that is no longer relevant?

kc

On 10/16/13 12:57 PM, Bill Dueber wrote:

> If anyone out there is really making a case for FRBR based on whether or
> not it saves a few characters in a database, well, she should give up the
> library business and go make money off her time machine. Maybe -- *maybe* --
> 15 years ago. But I have to say, I'm sitting on 10m records right now, and
> would happily figure out how to deal with double or triple the space
> requirements for added utility. Space is always a consideration, but it's
> slipped down into about 15th place on my Giant List of Things to Worry About.
--
Karen Coyle
kco...@kcoyle.net http://kcoyle.net
m: 1-510-435-8234
skype: kcoylenet
Re: [CODE4LIB] MARC field lengths
If anyone out there is really making a case for FRBR based on whether or not it saves a few characters in a database, well, she should give up the library business and go make money off her time machine. Maybe -- *maybe* -- 15 years ago. But I have to say, I'm sitting on 10m records right now, and would happily figure out how to deal with double or triple the space requirements for added utility. Space is always a consideration, but it's slipped down into about 15th place on my Giant List of Things to Worry About.

On Wed, Oct 16, 2013 at 3:49 PM, Karen Coyle wrote:

> On 10/16/13 12:33 PM, Kyle Banerjee wrote:
>
>> BTW, I don't think 240 is a good substitute as the content is very
>> different than in the regular title. That's where you'll find music, laws,
>> selections, translations and it's totally littered with subfields. The 70.1
>> figure from the stripped 245 is probably closer to the mark
>
> Yes, you are right, especially for the particular purpose I am looking at.
> Thanks.
>
>> IMO, what you stand to gain in functionality, maintenance, and analysis is
>> much more interesting than potential space gains/losses.
>
> Yes, obviously. But there exists an apology for FRBR that says that it
> will save cataloger time and will be more efficient in a database. I think
> it's worth taking a look at those assumptions. If there is a way to measure
> functionality, maintenance, etc. then we should measure it, for sure.
>
> kc
--
Bill Dueber
Library Systems Programmer
University of Michigan Library
Re: [CODE4LIB] MARC field lengths
For the HathiTrust catalog's 6,046,746 bibs, looking at only the lengths of subfields $a and $b in the 245, I get an average length of 62.0.

On Wed, Oct 16, 2013 at 3:24 PM, Kyle Banerjee wrote:

> 245 not including $c, indicators, or delimiters, |h (which occurs before
> |b), |n, |p, with trailing slash preceding |c stripped for about 9 million
> records for Orbis Cascade collections is 70.1
>
> kyle

--
Bill Dueber
Library Systems Programmer
University of Michigan Library
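The averaging behind figures like the 62.0 above is straightforward; a sketch is below. Records are modeled here as simple `{tag: {subfield: value}}` dicts purely for illustration; real code would iterate a file of binary MARC or MARCXML with a MARC-parsing library, and the sample titles are made up.

```python
# Illustrative sketch: average combined length of 245 $a and $b across
# a set of records. Records are modeled as {tag: {subfield: value}}
# dicts here; in practice you'd parse actual MARC records instead.

records = [
    {"245": {"a": "Moby Dick :", "b": "or, The whale /", "c": "Herman Melville."}},
    {"245": {"a": "Walden", "c": "Henry David Thoreau."}},
]

def avg_245_ab(recs):
    """Mean of len($a) + len($b) over records that have a 245."""
    lengths = []
    for rec in recs:
        f245 = rec.get("245")
        if not f245:
            continue
        # only $a and $b count; $c (statement of responsibility) is excluded
        lengths.append(len(f245.get("a", "")) + len(f245.get("b", "")))
    return sum(lengths) / len(lengths) if lengths else 0.0

mean = avg_245_ab(records)
```

Note that choices like whether to count delimiters, indicators, or ISBD punctuation (as discussed elsewhere in the thread) shift the result, which is one reason the reported figures differ between collections.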
Re: [CODE4LIB] MARC field lengths
On 10/16/13 12:33 PM, Kyle Banerjee wrote:

> BTW, I don't think 240 is a good substitute as the content is very different than in the regular title. That's where you'll find music, laws, selections, translations and it's totally littered with subfields. The 70.1 figure from the stripped 245 is probably closer to the mark

Yes, you are right, especially for the particular purpose I am looking at. Thanks.

> IMO, what you stand to gain in functionality, maintenance, and analysis is much more interesting than potential space gains/losses.

Yes, obviously. But there exists an apology for FRBR that says that it will save cataloger time and will be more efficient in a database. I think it's worth taking a look at those assumptions. If there is a way to measure functionality, maintenance, etc. then we should measure it, for sure.

kc
--
Karen Coyle
kco...@kcoyle.net http://kcoyle.net
m: 1-510-435-8234
skype: kcoylenet
Re: [CODE4LIB] MARC field lengths
BTW, I don't think 240 is a good substitute as the content is very different than in the regular title. That's where you'll find music, laws, selections, translations, and it's totally littered with subfields. The 70.1 figure from the stripped 245 is probably closer to the mark.

IMO, what you stand to gain in functionality, maintenance, and analysis is much more interesting than potential space gains/losses.

kyle

On Wed, Oct 16, 2013 at 12:00 PM, Karen Coyle wrote:

> Thanks, Roy (and others!)
>
> It looks like the 245 is including the $c - dang! I should have been more
> specific. I'm mainly interested in the title, which is $a $b -- I'm looking
> at the gains and losses of bytes should one implement FRBR. As a hedge,
> could I ask what've you got for the 240? that may be closer to reality.
>
> kc
Re: [CODE4LIB] MARC field lengths
Are you familiar with the OAI-PMH protocol? We have almost 2 million records available over this protocol:

http://search.ugent.be/meercat/x/oai?verb=ListRecords&metadataPrefix=marcxml

From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Karen Coyle [li...@kcoyle.net]
Sent: Wednesday, October 16, 2013 7:06 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] MARC field lengths

Anybody have data for the average length of specific MARC fields in some reasonably representative database? I mainly need 100, 245, 6xx.

Thanks,
kc

--
Karen Coyle
kco...@kcoyle.net http://kcoyle.net
m: 1-510-435-8234
skype: kcoylenet
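For anyone wanting to measure field lengths from an OAI-PMH source like the one above, the parsing step might look like this sketch. The XML here is a tiny inline stand-in for one record of a MARCXML ListRecords response (it is not real data from that endpoint); a real harvester would fetch the URL, follow resumptionTokens, and iterate over many records.

```python
# Sketch: extract 245 $a/$b from a MARCXML record, as served inside an
# OAI-PMH ListRecords response. The XML below is a made-up inline
# stand-in; only the MARCXML namespace and structure are real.
import xml.etree.ElementTree as ET

MARCXML = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Moby Dick :</subfield>
    <subfield code="b">or, The whale /</subfield>
    <subfield code="c">Herman Melville.</subfield>
  </datafield>
</record>"""

NS = {"m": "http://www.loc.gov/MARC21/slim"}

def title_ab(xml_text):
    """Join the 245 $a and $b subfields of one MARCXML record."""
    root = ET.fromstring(xml_text)
    subfields = root.findall('.//m:datafield[@tag="245"]/m:subfield', NS)
    return " ".join(sf.text for sf in subfields if sf.get("code") in ("a", "b"))

title = title_ab(MARCXML)
```

Running `len(title_ab(...))` over each harvested record and averaging gives the per-field statistics discussed in this thread.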
Re: [CODE4LIB] MARC field lengths
The 245 -- not including $c, indicators, delimiters, |h (which occurs before |b), |n, or |p, and with the trailing slash preceding |c stripped -- averages 70.1 for about 9 million records in the Orbis Cascade collections.

kyle

On Wed, Oct 16, 2013 at 12:00 PM, Karen Coyle wrote:

> Thanks, Roy (and others!)
>
> It looks like the 245 is including the $c - dang! I should have been more
> specific. I'm mainly interested in the title, which is $a $b -- I'm looking
> at the gains and losses of bytes should one implement FRBR. As a hedge,
> could I ask what've you got for the 240? that may be closer to reality.
>
> kc
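Kyle's stripping rules can be sketched as a small function over an ordered list of (code, value) subfield pairs. This is a guess at an equivalent implementation, not Kyle's actual code; in particular the whitespace handling around the stripped slash is an assumption.

```python
# Illustrative sketch of the 245 stripping described above: keep only
# $a and $b, dropping $c, $h, $n, $p (plus indicators and delimiters,
# which never enter this representation), and remove the trailing "/"
# that precedes $c. Whitespace handling is a guess, not Kyle's code.

def stripped_245(subfields):
    """subfields: ordered list of (code, value) pairs from one 245 field."""
    kept = [value for code, value in subfields if code in ("a", "b")]
    text = " ".join(kept)
    # drop the ISBD trailing slash that introduces the coming $c
    if text.endswith("/"):
        text = text[:-1].rstrip()
    return text

title = stripped_245([
    ("a", "Moby Dick :"),
    ("h", "[electronic resource] :"),
    ("b", "or, The whale /"),
    ("c", "Herman Melville."),
])
```

Taking `len()` of the result per record and averaging would reproduce a figure comparable to the 70.1 reported above.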
Re: [CODE4LIB] MARC field lengths
Thanks, Roy (and others!)

It looks like the 245 is including the $c - dang! I should have been more specific. I'm mainly interested in the title, which is $a $b -- I'm looking at the gains and losses of bytes should one implement FRBR. As a hedge, could I ask what you've got for the 240? That may be closer to reality.

kc

On 10/16/13 10:57 AM, Roy Tennant wrote:

I don't even have to fire it up. That's a statistic that we generate quarterly (albeit via Hadoop). Here you go:

100 - 30.3
245 - 103.1
600 - 41
610 - 48.8
611 - 61.4
630 - 40.8
648 - 23.8
650 - 35.1
651 - 39.6
653 - 33.3
654 - 38.1
655 - 22.5
656 - 30.6
657 - 27.4
658 - 30.7
662 - 41.7

Roy

On Wed, Oct 16, 2013 at 10:38 AM, Sean Hannan wrote:

That sounds like a request for Roy to fire up the ole OCLC Hadoop.

-Sean

On 10/16/13 1:06 PM, "Karen Coyle" wrote:

Anybody have data for the average length of specific MARC fields in some reasonably representative database? I mainly need 100, 245, 6xx.

Thanks,
kc

--
Karen Coyle
kco...@kcoyle.net http://kcoyle.net
m: 1-510-435-8234
skype: kcoylenet

--
Karen Coyle
kco...@kcoyle.net http://kcoyle.net
m: 1-510-435-8234
skype: kcoylenet
Re: [CODE4LIB] MARC field lengths
Argh. Must learn to write at a third grade level. I wanted to say that I like breaking up 6XX as Roy has done, because 6XX fields vary in purpose and tag frequency varies considerably.

On Wed, Oct 16, 2013 at 11:08 AM, Kyle Banerjee wrote:

> This squares with what I'm seeing. Data for all holdings of the Orbis Cascade Alliance is:
>
> 100: 30.1
> 245: 114.1
> 6XX: 36.1
>
> My values include indicators (2 characters) as well as any delimiters but not the tag number itself. I breaking up 6XX up as Roy has as 6XX's are far from created equal and frequency of occurrence varies radically with tag.
>
> I'm going to guess our 245 values are longer because we're an academic consortium and holdings are biased towards academic titles which tend to be longer.
Re: [CODE4LIB] MARC field lengths
This squares with what I'm seeing. Data for all holdings of the Orbis Cascade Alliance is:

100: 30.1
245: 114.1
6XX: 36.1

My values include indicators (2 characters) as well as any delimiters but not the tag number itself. I breaking up 6XX up as Roy has as 6XX's are far from created equal and frequency of occurrence varies radically with tag.

I'm going to guess our 245 values are longer because we're an academic consortium and holdings are biased towards academic titles which tend to be longer.

kyle

On Wed, Oct 16, 2013 at 10:57 AM, Roy Tennant wrote:

> I don't even have to fire it up. That's a statistic that we generate quarterly (albeit via Hadoop). Here you go:
>
> 100 - 30.3
> 245 - 103.1
> 600 - 41
> 610 - 48.8
> 611 - 61.4
> 630 - 40.8
> 648 - 23.8
> 650 - 35.1
> 651 - 39.6
> 653 - 33.3
> 654 - 38.1
> 655 - 22.5
> 656 - 30.6
> 657 - 27.4
> 658 - 30.7
> 662 - 41.7
>
> Roy
>
> On Wed, Oct 16, 2013 at 10:38 AM, Sean Hannan wrote:
>
>> That sounds like a request for Roy to fire up the ole OCLC Hadoop.
>>
>> -Sean
>>
>> On 10/16/13 1:06 PM, "Karen Coyle" wrote:
>>
>>> Anybody have data for the average length of specific MARC fields in some reasonably representative database? I mainly need 100, 245, 6xx.
>>>
>>> Thanks,
>>> kc
>>>
>>> --
>>> Karen Coyle
>>> kco...@kcoyle.net http://kcoyle.net
>>> m: 1-510-435-8234
>>> skype: kcoylenet
Re: [CODE4LIB] MARC field lengths
I don't even have to fire it up. That's a statistic that we generate quarterly (albeit via Hadoop). Here you go:

100 - 30.3
245 - 103.1
600 - 41
610 - 48.8
611 - 61.4
630 - 40.8
648 - 23.8
650 - 35.1
651 - 39.6
653 - 33.3
654 - 38.1
655 - 22.5
656 - 30.6
657 - 27.4
658 - 30.7
662 - 41.7

Roy

On Wed, Oct 16, 2013 at 10:38 AM, Sean Hannan wrote:

> That sounds like a request for Roy to fire up the ole OCLC Hadoop.
>
> -Sean
>
> On 10/16/13 1:06 PM, "Karen Coyle" wrote:
>
>> Anybody have data for the average length of specific MARC fields in some reasonably representative database? I mainly need 100, 245, 6xx.
>>
>> Thanks,
>> kc
>>
>> --
>> Karen Coyle
>> kco...@kcoyle.net http://kcoyle.net
>> m: 1-510-435-8234
>> skype: kcoylenet
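For a single library's file, per-tag averages like these don't need Hadoop. A rough sketch, assuming the records have already been reduced to (tag, field length) pairs, which is an illustrative representation rather than any specific tool's output:

```python
from collections import defaultdict

def average_lengths(fields):
    """fields: iterable of (tag, length) pairs.
    Returns {tag: mean length rounded to one decimal}."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for tag, length in fields:
        totals[tag] += length
        counts[tag] += 1
    return {tag: round(totals[tag] / counts[tag], 1) for tag in totals}

# Hypothetical sample data, not real catalog figures.
sample = [("100", 28), ("100", 32), ("245", 103), ("650", 35), ("650", 36)]
print(average_lengths(sample))  # {'100': 30.0, '245': 103.0, '650': 35.5}
```

The same two-accumulator shape (sum and count per key) is exactly what a Hadoop map-reduce job would compute, just in one process.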
Re: [CODE4LIB] MARC field lengths
I'm running it against the HathiTrust catalog right now. It'll just take a while, given that I don't have access to Roy's Hadoop cluster :-)

On Wed, Oct 16, 2013 at 1:38 PM, Sean Hannan wrote:

> That sounds like a request for Roy to fire up the ole OCLC Hadoop.
>
> -Sean
>
> On 10/16/13 1:06 PM, "Karen Coyle" wrote:
>
>> Anybody have data for the average length of specific MARC fields in some reasonably representative database? I mainly need 100, 245, 6xx.
>>
>> Thanks,
>> kc
>>
>> --
>> Karen Coyle
>> kco...@kcoyle.net http://kcoyle.net
>> m: 1-510-435-8234
>> skype: kcoylenet

--
Bill Dueber
Library Systems Programmer
University of Michigan Library
Re: [CODE4LIB] MARC field lengths
That sounds like a request for Roy to fire up the ole OCLC Hadoop.

-Sean

On 10/16/13 1:06 PM, "Karen Coyle" wrote:

> Anybody have data for the average length of specific MARC fields in some reasonably representative database? I mainly need 100, 245, 6xx.
>
> Thanks,
> kc
>
> --
> Karen Coyle
> kco...@kcoyle.net http://kcoyle.net
> m: 1-510-435-8234
> skype: kcoylenet
[CODE4LIB] MARC field lengths
Anybody have data for the average length of specific MARC fields in some reasonably representative database? I mainly need 100, 245, 6xx.

Thanks,
kc

--
Karen Coyle
kco...@kcoyle.net http://kcoyle.net
m: 1-510-435-8234
skype: kcoylenet