Re: [CODE4LIB] Q.: MARC8 vs. MARC/Unicode and pymarc and misencoded III records
On Mar 8, 2012, at 1:46 PM, Terray, James wrote: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 9: ordinal not in range(128)

Hello everyone, I just ran into this the other day when trying to write to a file. I searched the documentation and found this: fp = codecs.open('dc.csv', mode='w', encoding='utf-8')

This opens a file that is utf-8 aware, and it let me write the file. It doesn't answer your question about the encoding, but it will let you save the record.

-- Brian Kennison Western Connecticut State University
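For readers who want to try the codecs.open suggestion, here is a minimal sketch. The file name dc.csv comes from the message above; the temporary directory and the sample row are assumptions made only so the snippet runs on its own.

```python
import codecs
import os
import tempfile

# Illustrative row only; the record number is borrowed from this thread.
row = u"ocm10685946,M\u00fcnchen\n"

# Write into a throwaway directory so the example leaves nothing behind.
path = os.path.join(tempfile.mkdtemp(), "dc.csv")

# codecs.open returns a file object that encodes text as UTF-8 on write,
# so non-ASCII characters no longer pass through the 'ascii' codec.
with codecs.open(path, mode="w", encoding="utf-8") as fp:
    fp.write(row)

# Read the bytes back to see what actually landed on disk.
with open(path, "rb") as fp:
    raw = fp.read()
```

As Brian notes, this only solves the write step. It does nothing about records whose bytes are not what the leader says they are.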
Re: [CODE4LIB] Q.: MARC8 vs. MARC/Unicode and pymarc and misencoded III records
I'm out of my depth here, but I'm curious how this all works. Is it true that, in MARC8 records, there is supposed to be an 066 field included that defines non-Latin character sets? I'm making this conclusion from some things I read on the LOC website. ANSEL is mentioned as one of the instances where this might be necessary.

http://www.loc.gov/marc/specifications/speccharucs.html#field066
http://www.loc.gov/marc/specifications/speccharconversion.html#escape
http://www.loc.gov/marc/bibliographic/bd066.html

On Thu, Mar 8, 2012 at 1:02 PM, Godmar Back god...@gmail.com wrote:

Hi, a few days ago, I showed pymarc to a group of technical librarians to demonstrate how easily certain tasks can be scripted/automated. Unfortunately, it blew up at me when I tried to write a record:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 9: ordinal not in range(128)

Investigation revealed this culprit:

=LDR 00916nam a2200241I 4500
=001 ocm10685946
=005 19880203211447.0
=007 cr\bn||abp
=007 cr\bn||cda
=008 840503s1939gw00010\ger\d
=040 \\$aMBB$cMBB$dCRL
=049 \\$aCRLL
=100 10$aEsser, Hermann,$d1900-
=245 14$aDie jE8udischer Weltpest ;$bjudendE1ammerung auf dem Erdball,$cvon Hermann Esser.
=260 0\$aME8unchen,$bZentralverlag der N S D A P., F. Eher ahchf.,$c1939.
=300 \\$a243 [1] p.$c23 cm.
=533 \\$aAlso available as electronic reproduction.$bChicago :$cCenter for Research Libraries,$d[2009]
=650 \0$aJewish question.
=700 12$aBierbrauer, Johann Jacob,$d1705-1760?
=710 2\$aCenter for Research Libraries (U.S.)
=856 41$uhttp://dds.crl.edu/CRLdelivery.asp?tid=10538$zOnline version
=907 \\$a.b28931622$b08-30-10$c08-30-10
=998 \\$awww$b08-30-10$cm$dz$e-$fger$ggw $h4$i0

The leader[9] field is set to 'a', so the record should contain UTF8-encoded Unicode [1], but E8 75 in the 245$a appears to be ANSEL where 'E8' denotes the Umlaut preceding the lowercase 'u' (0x75). [2] To me, this record looks misencoded... am I correct here?
There are thousands of such records in the data set I'm dealing with, which was obtained using the 'Data Exchange' feature of III's Millennium system. My question is how others, especially pymarc users dealing with III records, deal with this issue or whatever other experiences/hints/practices/kludges exist in this area. Thanks. - Godmar [1] http://www.loc.gov/marc/bibliographic/bdleader.html [2] http://lcweb2.loc.gov/diglib/codetables/45.html
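One way to check Godmar's diagnosis is to test whether the field bytes are valid UTF-8 at all when leader position 9 claims they are. The sketch below is an illustration under stated assumptions, not pymarc code: the function names are hypothetical, and the ANSEL table deliberately contains only the single combining byte discussed above (0xE8, the umlaut), so it is nowhere near a full MARC-8 converter.

```python
import unicodedata

# Deliberately tiny table: only the ANSEL combining umlaut from the message.
# A real MARC-8 converter needs the complete code tables.
ANSEL_COMBINING = {0xE8: "\u0308"}


def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def claims_utf8_but_is_not(leader, data):
    """Flag data whose leader[9] says Unicode but whose bytes are not UTF-8."""
    return leader[9] == "a" and not is_valid_utf8(data)


def decode_ansel_subset(data):
    """Decode ASCII plus the combining bytes listed in ANSEL_COMBINING.

    ANSEL puts the combining mark BEFORE its base letter, Unicode puts it
    AFTER, so marks are held back until the next base character arrives.
    """
    out = []
    pending = []
    for byte in data:
        if byte in ANSEL_COMBINING:
            pending.append(ANSEL_COMBINING[byte])
        elif byte < 0x80:
            out.append(chr(byte))
            out.extend(pending)
            pending = []
        else:
            raise ValueError("byte 0x%02X is outside this sketch's table" % byte)
    if pending:
        raise ValueError("combining mark without a base character")
    # NFC folds 'u' + U+0308 into the single code point for u-umlaut.
    return unicodedata.normalize("NFC", "".join(out))


leader = "00916nam a2200241I 4500"
place = b"M\xe8unchen,"  # bytes as they appear in the 260$a above

suspicious = claims_utf8_but_is_not(leader, place)
repaired = decode_ansel_subset(place)
```

The UTF-8 check works here because 0xE8 would have to start a three-byte sequence, and the 'u' that follows is not a valid continuation byte.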
Re: [CODE4LIB] Q.: MARC8 vs. MARC/Unicode and pymarc and misencoded III records
On Fri, Mar 9, 2012 at 7:23 AM, Godmar Back god...@gmail.com wrote: Mark, while I would be able to contribute code to pymarc, I probably won't (unless my collaborators' needs in respect to pymarc become urgent.)

Such is our conundrum. Most of my uses of pymarc only involve reading records, not writing them.

That's something occasional contributors cannot do, it requires work by the core team, in discussion with frequent users.

If you've looked at some of the past issues, you may have seen that we've had some healthy discussions. Not all are resolved, clearly. Speaking as an individual and not for the pymarc team, I agree that we need this discussion.

(I would have liked to take this discussion to a pymarc-users list, but didn't find any.)

Per the README [0]: The pymarc developers encourage you to join the pymarc Google Group [1] if you need help. Also, please feel free to use issue tracking [2] on Github to submit feature requests or bug reports. If you've got an itch to scratch, please scratch it, and send merge requests on Github [3].

[0] https://github.com/edsu/pymarc/blob/master/README.md
[1] http://groups.google.com/group/pymarc
[2] https://github.com/edsu/pymarc/issues
[3] https://github.com/edsu/pymarc

Mark
Re: [CODE4LIB] Q.: MARC8 vs. MARC/Unicode and pymarc and misencoded III records
The internal discussion then becomes, I have a need, and I've written something that satisfies it. I think it could also be useful to others, but I'm not going to have time to make major changes or implement features others need. Should I open source this or keep it to myself? Does freeing my code come with an implicit requirement to maintain and support it? Should it? I'd vote open source just about every time. If someone sees the need and has the time to do a functional/requirements analysis and develop a core team around pymarc, more power to them. The code that's already there will give them a head start. Or they can start from scratch. Until then, it will remain a fork-patch-and-pull, community-supported project. On Fri, Mar 9, 2012 at 4:23 AM, Godmar Back god...@gmail.com wrote: On Thu, Mar 8, 2012 at 3:53 PM, Mark A. Matienzo m...@matienzo.org wrote: On Thu, Mar 8, 2012 at 3:32 PM, Godmar Back god...@gmail.com wrote: One side comment here; while smart handling/automatic detection of encodings would be a nice feature to have, it would help if pymarc could operate in an 'agnostic', or 'raw' mode where it would simply preserve the encoding that's there after a record has been read when writing the record. [ Right now, pymarc does not have such a mode - if leader[9] == 'a', the data is unconditionally utf8 encoded on output as per mbklein's patch. ] Please feel free to write a patch and submit a pull request if you're able to contribute code to do this. Mark, while I would be able to contribute code to pymarc, I probably won't (unless my collaborators' needs in respect to pymarc become urgent.) I've been contributing to open source for over 15 years, my first major contribution having been the ext2fs filesystem code in the FreeBSD kernel ( http://www.freebsd.org/doc/en_US.ISO8859-1/books/handbook/filesystems-linux.html ) and I'm a bit confused by how the spirit in the community has changed. 
The phrase "patches welcome" used to be reserved for when there was a feature request somebody wanted, but you (the owner/maintainer of the software) didn't have the time or considered the problem not important. Back then, it used to be that all suggestions were welcome. For instance, if a user pointed out a typo, you'd fix it. Similarly, if a user or fellow developer pointed out a potential design flaw, you'd understand that you don't ask for patches, but that you go back to the drawing board and think about your software's design. In pymarc's case, what's needed is not more code (it already has a moderately confusing set of almost a dozen switches for reading/writing), but a requirement analysis where you think about use cases you want to support. For instance, whether you want to support reading/writing real world records in batches (without touching them) even if they have flaws or not. And/Or whether you insist on interpreting a record's data in terms of encoding, always. That's something occasional contributors cannot do, it requires work by the core team, in discussion with frequent users. (I would have liked to take this discussion to a pymarc-users list, but didn't find any.) - Godmar
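The "raw" batch use case Godmar describes can be sketched without interpreting any encoding at all. This is not a pymarc feature; the function names are hypothetical. It relies on one thing only: a MARC transmission record starts with its own total length as five ASCII digits (leader positions 0-4), so records can be split and copied as opaque bytes.

```python
import io

LENGTH_DIGITS = 5  # leader positions 0-4 hold the total record length


def iter_raw_records(stream):
    """Yield each record from a binary stream as untouched bytes."""
    while True:
        head = stream.read(LENGTH_DIGITS)
        if not head:
            return  # clean end of file
        if len(head) < LENGTH_DIGITS or not head.isdigit():
            raise ValueError("not a record length: %r" % head)
        rest = int(head) - LENGTH_DIGITS
        body = stream.read(rest)
        if len(body) != rest:
            raise ValueError("truncated record")
        yield head + body


def copy_records(src, dst):
    """Copy records byte for byte and return how many were copied."""
    count = 0
    for record in iter_raw_records(src):
        dst.write(record)
        count += 1
    return count


def fake_record(payload):
    """Build a stand-in 'record' (length prefix + payload) for the demo."""
    total = LENGTH_DIGITS + len(payload)
    return str(total).zfill(LENGTH_DIGITS).encode("ascii") + payload


# The payloads are made up; one carries the stray 0xE8 byte from this thread.
batch = fake_record(b"nam a M\xe8unchen\x1d") + fake_record(b"nam a plain\x1d")
out = io.BytesIO()
copied = copy_records(io.BytesIO(batch), out)
```

Because nothing is decoded, a flawed record comes out exactly as it went in. That is the whole point of such a mode, and also its limit: anything that edits field content still has to decide what the bytes mean.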
Re: [CODE4LIB] Q.: MARC8 vs. MARC/Unicode and pymarc and misencoded III records
On Fri, Mar 9, 2012 at 10:37 AM, Michael B. Klein mbkl...@gmail.com wrote: The internal discussion then becomes, I have a need, and I've written something that satisfies it. I think it could also be useful to others, but I'm not going to have time to make major changes or implement features others need. Should I open source this or keep it to myself? Does freeing my code come with an implicit requirement to maintain and support it? Should it?

It used to be that way, at least it was this way when I grew up in open source (in the 90s, before Eric Raymond invented the term). And it makes sense, for successful projects that have at least a moderate number of users. Just dumping your code on github helps very few people.

I'd vote open source just about every time. If someone sees the need and has the time to do a functional/requirements analysis and develop a core team around pymarc, more power to them. The code that's already there will give them a head start. Or they can start from scratch. Until then, it will remain a fork-patch-and-pull, community-supported project.

It's not just an agreement on design goals the core team must reach, it's also the issue of maintaining a record (in email discussions/posts and in the developers' minds) of what issues arose, what legacy decisions were made, where backwards compatibility is required. That's something maintainers do, it enables them to reason about future design decisions. People who feel a sense of ownership and mental investment. Sure, I could throw in a flag 'dont_utf8_encode' to make the code work for my case. But it wouldn't improve the software. (In pymarc's case, I'd also recommend a discussion about data structures. For instance, what should the type of the elements of the subfield array be that's passed to a Field constructor? 8-bit string or unicode objects? The thread you link to shows ambiguity here.)
Staying with fork-patch-and-pull may help individual people meet their individual needs, but can prevent wide-spread adoption - and creates frustration for users who may lack the expertise to track down encoding errors or who are even unable to understand where the code they're using lives on their machine. Once a piece of software has reached the stage where it's distributed as a package (which pymarc, I believe, is), the distributors have taken on a piece of responsibility. Related, being unwilling to fix even documentation typos unless someone clones the repository and delivers a pull request (on a silver platter?) seems unusual to me, but - perhaps I'm just too old and culturally out of tune with today's open source movement. (I'm not being ironic here, maybe there has been a shift and I should just get with it.) - Godmar
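Godmar's question about the type of subfield elements (8-bit strings or unicode objects) is the usual root of the "'ascii' codec can't decode" error: in Python 2, combining a byte string with a unicode object triggers an implicit ASCII decode. One common answer is to decode once at the boundary and keep text everywhere inside. The sketch below shows that policy in Python 3. It is not pymarc's API: the helper names are hypothetical, and the flat list of alternating codes and values is assumed only for illustration.

```python
def to_text(value, encoding="utf-8"):
    """Return value as text, decoding bytes with an explicit encoding."""
    if isinstance(value, bytes):
        return value.decode(encoding)  # fails loudly on bytes that do not fit
    if isinstance(value, str):
        return value
    raise TypeError("expected bytes or str, got %s" % type(value).__name__)


def normalize_subfields(subfields, encoding="utf-8"):
    """Convert every element of a flat [code, value, ...] list to text."""
    return [to_text(item, encoding) for item in subfields]


# Mixed input: one text code, one UTF-8 encoded value, one text value.
mixed = ["a", b"M\xc3\xbcnchen,", "b"]
clean = normalize_subfields(mixed)

# The misencoded bytes from this thread are rejected instead of being
# silently mixed in, which surfaces the problem where the data enters.
try:
    normalize_subfields(["a", b"M\xe8unchen,"])
    rejected = False
except UnicodeDecodeError:
    rejected = True
```

Whether a library should reject such input, repair it, or pass it through untouched is exactly the requirements question raised above. The code only makes one of those choices explicit.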
Re: [CODE4LIB] Q.: MARC8 vs. MARC/Unicode and pymarc and misencoded III records
It used to be that way, at least it was this way when I grew up in open source (in the 90s, before Eric Raymond invented the term). And it makes sense, for successful projects that have at least a moderate number of users. Just dumping your code on github helps very few people.

You realize this isn't Apache, right? It seems a small project, mostly maintained by folks as they get time. There are no SCRUM meetings or hallway meetings, no foundation, no checklist. Surely you can't generalize from two interactions as reflective of the culture of open source. It seems to have been a small piece of code shared so others wouldn't have to do it over again, and it's grown with time. The primary thrust seems to be for library developers, not catalogers or folks learning python code. The typo you brought up was patched by one of the team members within an hour or two from what I can tell. (Assuming you meant issue #22 https://github.com/edsu/pymarc/issues/22). From what I can tell someone patched it in less than an hour. In general though github is the sourceforge of years past, but even better. It seems entirely reasonable to ask for a patch to me. Perhaps it could have been handled more delicately by both sides. Perhaps you weren't treated as nicely as you'd like. There's probably some truth to that. But at the same time, Ed did include a wink at the end after requesting the patch. Had you perhaps cut him some slack instead of immediately responding incredulously, you'd find it was fixed when he got time. Or not. He has his own priorities, as do other folks who contributed to the code. If you're unhappy with the dump on github approach, then don't use the software. No one ran around forcing folks to do it. It's one of those lightweight github approaches, just another approach to open source software. In all the years I've also been involved with open source every project has had its own unique culture. 
There's responsibility on the user before using software to figure out what it is. If it doesn't meet their expectation, I see little reason that the developer should feel compelled to change unless they're getting paid for the work. Obviously some people have found the dump on github approach useful if they've contributed patches. Can't we all just shake hands virtually or something? Jon Gorman
Re: [CODE4LIB] Q.: MARC8 vs. MARC/Unicode and pymarc and misencoded III records
On Fri, Mar 9, 2012 at 11:48 AM, Jon Gorman jonathan.gor...@gmail.com wrote: Can't we all just shake hands virtually or something?

Here's my hand ||*( [1]. I overreacted, for which I'm sorry. (Also, I didn't see the entire github conversation until I just now visited the website; the github email notification seems selective and only sent Ed's replies (?) to my inbox.)

- Godmar

[1] http://www.kadifeli.com/fedon/smiley.htm
Re: [CODE4LIB] Sharing code
NOOB to list and am appreciative of this discussion. My boss is encouraging me to share code and pointed me to code4lib. The majority of my code is recycled / repurposed from others, so I've had reservations about sharing, mainly because of what's taken from others. At the least, I'm mindful about leaving acknowledgements intact. Is there a good resource on how to start sharing code and on ethical considerations? Thanks for letting me chime in and best regards,

-Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Godmar Back Sent: Friday, March 09, 2012 11:12 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] Q.: MARC8 vs. MARC/Unicode and pymarc and misencoded III records

On Fri, Mar 9, 2012 at 11:48 AM, Jon Gorman jonathan.gor...@gmail.com wrote: Can't we all just shake hands virtually or something?

Here's my hand ||*( [1]. I overreacted, for which I'm sorry. (Also, I didn't see the entire github conversation until I just now visited the website, the github email notification seems selective and only sent me Ed's replies (?) in my emailbox.) - Godmar [1] http://www.kadifeli.com/fedon/smiley.htm
[CODE4LIB] Job: Metadata and Taxonomy Librarian at Library of Parliament
**Metadata and Taxonomy Librarian**

**Information and Document Resource Service**

Indeterminate Position
Classification: LS-3 ($69,866 - $83,554) (bilingual imperative: CBC/CBC)
Closing Date: 2012/03/26

**The ideal candidate possesses the following:**

* Knowledge of MARC coding, RCAA2, LCSH, CSH, RVM, LC classification and the new RDA standards
* Knowledge of non-MARC metadata schemas such as Dublin Core, MODS and METS
* Knowledge of information technologies, especially emerging technologies applicable to system interoperability
* Knowledge of standards, codes and protocols used in standardized description and metadata
* General knowledge of the Library of Parliament's products, services and publications
* Excellent analytical skills to design tools adapted to client needs
* Excellent oral and written communication skills in both official languages
* Ability to soundly manage time and workload according to individual and team priorities
* Flexibility, resourcefulness and sound judgement
* Team spirit, initiative and good interpersonal skills

**To be considered, candidates must have:**

* Preference will be given to candidates with a Master's degree in Library Sciences or in Library and Information Sciences from a recognized university; a combination of education and extensive experience related directly to the position may also be considered
* Experience with metadata schemas and the development of vocabularies
* Experience in the standardized description of resources and the use of controlled vocabularies
* Experience in project management, follow up and quality control
* Experience writing and producing manuals for taxonomy users and overseeing the application of guidelines
* Experience working with an integrated library system
* Experience holding training sessions and providing technical advice is an asset

**Candidates retained in this selection process will be required to obtain:**

* A successful second-language evaluation (bilingual imperative: CBC/CBC)
* A successful pre-employment screening

**Additional information:**

* This selection process is open to employees of the Senate, the House of Commons, the Library of Parliament, the Office of the Senate Ethics Officer, the Office of the Conflict of Interest and Ethics Commissioner, the public service and the public.
* A written exam may be administered
* Qualified candidates from this selection process may be considered for temporary or indeterminate positions requiring similar competencies at the Library of Parliament
* Satisfactory references are an essential condition of employment
* Education and experience requirements will be used as part of the initial selection process
* Proof of education will be required
* We are committed to employment equity

To apply, please send your C.V. and cover letter clearly indicating how you meet each of the requirements of the position by March 26, 2012. Please quote Competition 11-I-44.

By email: lop...@parl.gc.ca
By fax: 613-995-9582
By mail: 50 O'Connor Street Library of Parliament Human Resources Division Ottawa, ON K1A 0A9

Please address questions to Human Resources at 613-996-2424 or lop...@parl.gc.ca. We thank all those who apply. Only those selected for further consideration will be contacted.

Brought to you by code4lib jobs: http://jobs.code4lib.org/job/833/
[CODE4LIB] Job: Software Developer (Java) at University of Maryland, College Park
**Note: this is a reposting of [http://jobs.code4lib.org/job/801/](http://jobs.code4lib.org/job/801/) because the close date has been extended to March 23.**

An opportunity exists for one or more experienced software developers to work within the team environment of the University of Maryland (UM) Libraries in College Park, the largest university library system in the region and in close proximity to the nation's capital. Visit the UM Libraries web-site at http://www.lib.umd.edu. Note: this announcement will be used to fill TWO vacancies.

Responsibilities

The UM Libraries' Information Technology Division supports the library automation needs of the University System of Maryland and Affiliated Institutions (USMAI). Working within a team environment, the successful candidate(s) will provide broad programming support to the UM Libraries for the design, development, and delivery of Java-based software applications, large-scale digital collections, and web interfaces. The successful candidate(s) will:

* Design and develop tools for managing production workflows, large-scale ingestion, inventory control and preservation of digital collections;
* Select and utilize appropriate software languages, frameworks and platforms for new and existing library projects;
* Provide object-oriented programming for various library initiatives;
* Provide web interface development support for digital collection management systems;
* Research and develop applications to interface with bibliographic systems, acquisition systems, reference and circulation systems;
* Utilize project management tools such as JIRA to record and monitor progress; and
* Lead technical development on some projects.

Qualifications

Required:

* Bachelor's degree in a field related to information sciences, computer sciences and engineering, or information management
* Minimum of three (3) years of programming experience using the Java language
* Experience creating web applications using JSP
* Experience using JDBC to interact with a relational database such as PostgreSQL or MySQL
* Experience using version control software such as Subversion or Git
* Excellent interpersonal skills; Excellent written and verbal communications skills

APPLICATIONS: Electronic applications required. Please apply online at https://jobs.umd.edu/applicants/Central?quickFind=56411. No relocation assistance will be provided. The University of Maryland Libraries will not sponsor individuals for employment. You must be legally able to work in the United States. An application consists of a cover letter which includes the source of advertisement, a resume and names/e-mail addresses of three references. Applications will be reviewed as they are received and accepted until March 9, 2012.

The University of Maryland, College Park, actively subscribes to a policy of equal employment opportunity, and will not discriminate against any employee or applicant because of race, age, sex, color, sexual orientation, physical or mental disability, religion, ancestry or national origin, marital status, genetic information, or political affiliation. Minorities and women are encouraged to apply.

Brought to you by code4lib jobs: http://jobs.code4lib.org/job/834/