Re: [CODE4LIB] De-dup MARC Ebook records

2013-08-25 Thread Ere Maijala

Mike,

Our RecordManager (https://github.com/KDK-Alli/RecordManager), used in 
conjunction with VuFind, does deduplication with our own algorithm, which 
might be of some interest to you. RecordManager works standalone, so no 
VuFind installation is needed. Some parts are still in active 
development, but the deduplication has been working pretty well so far. 
A short description of the algorithm is available at 
https://github.com/KDK-Alli/RecordManager/wiki/Deduplication, and the 
actual PHP code is in 
https://github.com/KDK-Alli/RecordManager/blob/master/classes/RecordManager.php 
starting at the dedupRecord function.
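
To give a rough idea of the general approach, here is an illustrative
Perl sketch of the normalized-key technique (made-up field names and
records; this is not the actual RecordManager algorithm, which is in the
PHP code linked above):

use strict;
use warnings;

# Build a crude match key from a normalized title, author and year; records
# whose keys collide become candidates for deduplication.
sub dedup_key {
    my ($title, $author, $year) = @_;
    for ($title, $author) {
        $_ = lc($_ // '');
        s/[^\p{L}\p{N} ]//g;   # drop punctuation
        s/\s+/ /g;             # collapse runs of whitespace
        s/^\s+|\s+$//g;
    }
    return join '|', $title, $author, $year // '';
}

# Hypothetical records, just for the example.
my @records = (
    { title => 'The Example Book!', author => 'Doe, Jane', year => 2010 },
    { title => 'THE EXAMPLE BOOK',  author => 'Doe Jane',  year => 2010 },
);

my %candidates;
for my $rec (@records) {
    my $key = dedup_key($rec->{title}, $rec->{author}, $rec->{year});
    push @{ $candidates{$key} }, $rec;   # both records end up under the same key
}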


--Ere

On 22.8.2013 18.07, Michael Beccaria wrote:

Steve,
I don't think it's so much finding a control field (the closest match I 
can use, however, is ISBN or eISBN, which has its issues) but also normalizing 
the data in the fields so that matches are produced. It will no doubt take some 
time to figure out.
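
For example, ISBN normalization could look something like this
illustrative Perl sketch (hypothetical helper, not taken from any
particular tool): strip hyphens and spaces and convert ISBN-10 to
ISBN-13 so that both forms of the same ISBN produce an identical match
key.

use strict;
use warnings;

sub normalize_isbn {
    my ($isbn) = @_;
    $isbn = uc($isbn // '');
    $isbn =~ s/[^0-9X]//g;
    if (length($isbn) == 10) {
        my $core = '978' . substr($isbn, 0, 9);        # drop the ISBN-10 check digit
        my @d    = split //, $core;
        my $sum  = 0;
        $sum += $d[$_] * ($_ % 2 ? 3 : 1) for 0 .. 11; # ISBN-13 weights 1,3,1,3,...
        $isbn = $core . ((10 - $sum % 10) % 10);       # append the new check digit
    }
    return length($isbn) == 13 ? $isbn : undef;
}

# Both print 9780306406157:
print normalize_isbn('0-306-40615-2'), "\n";
print normalize_isbn('978-0-306-40615-7'), "\n";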

Mike Beccaria
Systems Librarian
Head of Digital Initiative
Paul Smith's College
518.327.6376
mbecca...@paulsmiths.edu
Become a friend of Paul Smith's Library on Facebook today!


-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of 
McDonald, Stephen
Sent: Friday, August 16, 2013 8:16 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] De-dup MARC Ebook records

Michael Beccaria said:

Thanks for the replies. To clarify, I am working with 2 (or more, in
the future) MARC records outside of the ILS. I've tried using MarcEdit,
but my usage did vary... not much overlap with the control fields that
were available to me. I have a feeling they are a bit varied. I'm also
messing around with marcXimiL a little, but I'm having trouble getting
it to output any records at all. I was also looking at the XC
aggregation module, but I was having trouble getting that to work
properly as well, and the listserv was unresponsive. It seemed like
good software, but it required me to set up an OAI harvest source to
allow it to ingest the records, and that... well... enough is enough. I
think I will probably need to write something, and at least that way I
know what it will be doing rather than plowing through software that
has little to no support. Please feel free to let me know of a particular
strategy you think might work best in this regard...


If you couldn't get adequate deduping from the control fields available in 
MarcEdit, what control fields do you think you need to dedup on? You 
can actually specify any arbitrary field and subfield for deduping in MarcEdit.

Steve McDonald
steve.mcdon...@tufts.edu




--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: [CODE4LIB] Simple Web-based Dublin Core search engine?

2011-03-23 Thread Ere Maijala

Oops, should have read the other replies first. :)

--Ere

On 23.3.2011 17:05, Ere Maijala wrote:

Hi Edward,

I haven't actually used it for searching apart from a quick test, but
PKP Harvester2 (pkp.sfu.ca/harvester/) might fit your needs. It's LAMP,
open source and workable. We're building an OAI-PMH aggregator on top of
it.

--Ere

On 16.3.2011 17:00, Edward M. Corrado wrote:

Hi,

I [will soon] have a small set (< 1000 records) of Dublin Core
metadata published in OAI_DC format that I want to be searchable via a
Web browser. Normally we would use Ex Libris's Primo for this, but
this particular set of data may have some confidential information, and
our repository only has minimal built-in search functions. While we
still may go with Primo for these records, I am looking at other
possibilities. The requirements as I see them are:

1) Can ingest records in OAI_DC format
2) Allow remote end-users who are familiar with the collection to search
these ingested records via a Web browser
3) Search should be keyword anywhere or by individual fields, although it
does not need to have every whizzbang feature out there. In other
words, basic search features are fine.
4) Should support the ability to link to the display copy in our
repository (probably goes without saying)
5) Should be simple to install and maintain (thus, at least in my
mind, eliminating something like Blacklight)
6) Preferably a LAMP application, although a Windows server-based
solution is a possibility as well
7) Preferably open source, or at least no- or low-cost

I haven't been able to find anything searching the Web, but it seems
like something people may have done before. Before I re-invent the
wheel or shoe-horn something together, does anyone have any
suggestions?

Edward







--
Ere Maijala
Kansalliskirjasto


Re: [CODE4LIB] Simple Web-based Dublin Core search engine?

2011-03-23 Thread Ere Maijala

Hi Edward,

I haven't actually used it for searching apart from a quick test, but 
PKP Harvester2 (pkp.sfu.ca/harvester/) might fit your needs. It's LAMP, 
open source and workable. We're building an OAI-PMH aggregator on top of it.


--Ere

On 16.3.2011 17:00, Edward M. Corrado wrote:

Hi,

I [will soon] have a small set (< 1000 records) of Dublin Core
metadata published in OAI_DC format that I want to be searchable via a
Web browser. Normally we would use Ex Libris's Primo for this, but
this particular set of data may have some confidential information, and
our repository only has minimal built-in search functions. While we
still may go with Primo for these records, I am looking at other
possibilities. The requirements as I see them are:

1) Can ingest records in OAI_DC format
2) Allow remote end-users who are familiar with the collection to search
these ingested records via a Web browser
3) Search should be keyword anywhere or by individual fields, although it
does not need to have every whizzbang feature out there. In other
words, basic search features are fine.
4) Should support the ability to link to the display copy in our
repository (probably goes without saying)
5) Should be simple to install and maintain (thus, at least in my
mind, eliminating something like Blacklight)
6) Preferably a LAMP application, although a Windows server-based
solution is a possibility as well
7) Preferably open source, or at least no- or low-cost

I haven't been able to find anything searching the Web, but it seems
like something people may have done before. Before I re-invent the
wheel or shoe-horn something together, does anyone have any
suggestions?

Edward




--
Ere Maijala (Mr.)
The National Library of Finland


Re: [CODE4LIB] unwanted (bogus) characters in marc

2010-10-08 Thread Ere Maijala

On 7.10.2010 15:17, Thomas Krichel wrote:

   Ere Maijala writes


# Fix non-UTF-8 characters with two highest bits set (we assume they
are actually ISO-8859-1)


   What about

use Encode::Guess qw/latin-1/;
$decoded = decode('Guess', $dodgy_input);

   $decoded then should be a utf-8 string with utf8 flag on.
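
Put together, a minimal self-contained version of that suggestion might
look like this (the input string is a made-up example containing one
stray ISO-8859-1 byte):

use strict;
use warnings;
use Encode qw(decode);
use Encode::Guess qw(latin-1);   # suspect encodings to try besides ASCII and UTF-8

my $dodgy_input = "Mus\xE9e du Louvre";   # 0xE9 is not a valid UTF-8 sequence here
my $decoded     = decode('Guess', $dodgy_input);

# $decoded is now a character string with the utf8 flag on.
binmode STDOUT, ':utf8';
print $decoded, "\n";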


Would that work for predominantly proper UTF-8 input with some 
mistakes thrown in?


--Ere


Re: [CODE4LIB] unwanted (bogus) characters in marc

2010-10-07 Thread Ere Maijala

In Perl, something like this might do the trick:

# Fix non-UTF-8 characters with the two highest bits set (we assume they are 
# actually ISO-8859-1)
# Rule: there can't be a single byte with the high bits set followed by 
# a byte in range 00-7F or C0-FF

$str =~ s/([\xC0-\xFF])(?=[\x00-\x7f\xC0-\xFF])/chr(0xC0 + (ord($1) >> 6)) . chr(0x80 + (ord($1) & 0x3F))/seg;


No wrapping there to keep it single-line. :)
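
As a worked example, a stray ISO-8859-1 e-acute (byte 0xE9) followed by an
ASCII byte gets rewritten to the correct UTF-8 pair 0xC3 0xA9:

my $str = "Mus\xE9e du Louvre";   # 0xE9 = ISO-8859-1 e-acute
$str =~ s/([\xC0-\xFF])(?=[\x00-\x7f\xC0-\xFF])/chr(0xC0 + (ord($1) >> 6)) . chr(0x80 + (ord($1) & 0x3F))/seg;
# $str is now "Mus\xC3\xA9e du Louvre", i.e. valid UTF-8:
# 0xE9 >> 6 = 3, so 0xC0 + 3 = 0xC3; 0xE9 & 0x3F = 0x29, so 0x80 + 0x29 = 0xA9.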

--Ere

On 7.10.2010 14:56, Cowles, Esme wrote:

Eric-

I don't know the original source of those MARC files, but I've worked
with files from an III system where diacritics had to be entered as
character code escapes like Muse{226}e du Louvre (where 226 is the
ANSEL code for a combining acute accent).  So if somebody made a typo
and entered something like Muse{22}6e du Louvre instead, you'd get
some bogus invalid character.  I was working with MARCXML files in
Java, so I wrote a FilterReader class that removed any characters
that were invalid in UTF-8 XML.  I assume you could do something
similar in Perl (probably with a fancy one-line regex).

-Esme

--
Esme Cowles  escow...@ucsd.edu

We've all heard that a million monkeys banging on a million
typewriters will eventually reproduce the works of Shakespeare. Now,
thanks to the Internet, we know this is not true. -- Robert
Wilensky

On Oct 7, 2010, at 6:51 AM, Eric Lease Morgan wrote:


How do I trap for unwanted (bogus) characters in MARC records?

I have a set of Internet Archive identifiers, and have written the
following Perl loop to get the MARC records associated with each
one:

# process each identifier
my $ua = LWP::UserAgent->new( agent => AGENT );
while ( <DATA> ) {

    # get the identifier
    chop;
    my $identifier = $_;
    print $identifier, "\n";

    # get its corresponding MARC record
    my $response = $ua->get( ROOT . "$identifier/$identifier" . "_meta.mrc" );
    if ( ! $response->is_success ) {
        warn $response->status_line;
        next;
    }

    # save it
    open MARC, "> $identifier.mrc" or die "Can't open $identifier.mrc: $!\n";
    binmode MARC, ':utf8';
    print MARC $response->content;
    close MARC;

}

I then use the venerable marcdump to see the fruits of my labors:
marcdump *.mrc. Unfortunately, marcdump returns the following error
against (at least) one of my files:

bienfaitsducatho00pina.mrc
utf8 "\xC3" does not map to Unicode at 
/System/Library/Perl/5.10.0/darwin-thread-multi-2level/Encode.pm line 162.

What is going on here? Am I saving my files incorrectly? Is the
original MARC data inherently incorrect? Is there some way I can
fix the MARC record in question?

-- Eric Lease Morgan





--
Ere Maijala
Kansalliskirjasto


Re: [CODE4LIB] SRU indexes for Aleph

2010-06-11 Thread Ere Maijala

Here you are, enjoy!

Note that there is one peculiarity in this response. The Bath profile 
actually doesn't define an isbn index, but it exists in the default 
explain response of YAZ Proxy and is replicated here.


--Ere

Quoting Ziso, Ya'aqov z...@rowan.edu:


Ere, thanks. Can you share the response you get for your explain query then?
./Ya’aqov




On 6/10/10 11:04 AM, Ere Maijala ere.maij...@helsinki.fi wrote:


No, we have a perfectly working setup, explain included.

--Ere

Quoting Ziso, Ya'aqov z...@rowan.edu:


Ere,
So far Corey (NYU, given their no-implementation-yet) reported he
gets a 404 for an explain query.
Are you configured differently and ALSO get a 404?
Ya'aqov

From: Code for Libraries [code4...@listserv.nd.edu] On Behalf Of Ere
Maijala [ere.maij...@helsinki.fi]
Sent: Thursday, June 10, 2010 10:10 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] SRU indexes for Aleph

Quoting LeVan,Ralph le...@oclc.org:

Something's not right with this picture.  The YAZ Proxy IS the SRU
server, so it should be delivering up the Explain record.  If it has
a configuration file that defines the mapping from CQL indexes to
z39.50 indexes, then it has all the information it needs to populate
the indexInfo part of the Explain record.

So, why no Explain record?


The explain record is a static XML fragment, part of the
configuration XML. It is separate from the PQF mapping file. In the
default Aleph config it's completely missing, although the YAZ Proxy
distribution does have it.

--Ere

--
Ere Maijala (Mr.)
Kansalliskirjasto / The National Library of Finland











--
Ere Maijala (Mr.)
Kansalliskirjasto / The National Library of Finland
<?xml version="1.0"?>
<zs:explainResponse xmlns:zs="http://www.loc.gov/zing/srw/"><zs:version>1.1</zs:version><zs:record><zs:recordSchema>http://explain.z3950.org/dtd/2.0/</zs:recordSchema><zs:recordPacking>xml</zs:recordPacking><zs:recordData><explain xmlns="http://explain.z3950.org/dtd/2.0/">
  <serverInfo>
    <host>linda.linneanet.fi</host>
    <port>210</port>
    <database>fin01</database>
  </serverInfo>

  <databaseInfo>
    <title>LINDA</title>
    <description lang="en" primary="true">
      SRU/Z39.50 Gateway to Union Catalog LINDA
    </description>
  </databaseInfo>

  <indexInfo>
    <set identifier="info:srw/cql-context-set/1/cql-v1.1" name="cql"/>
    <set identifier="info:srw/cql-context-set/1/dc-v1.1" name="dc"/>
    <set identifier="http://zing.z3950.org/cql/bath/2.0/" name="bath"/>

    <index id="12">
      <title>id</title>
      <map><name set="rec">id</name></map>
    </index>
    <index id="4">
      <title>title</title>
      <map><name set="dc">title</name></map>
    </index>
    <index id="21">
      <title>subject</title>
      <map><name set="dc">subject</name></map>
    </index>
    <index id="30">
      <title>date</title>
      <map><name set="dc">date</name></map>
    </index>
    <index id="62">
      <title>description</title>
      <map><name set="dc">description</name></map>
    </index>
    <index id="1003">
      <title>creator</title>
      <map><name set="dc">creator</name></map>
      <map><name set="dc">author</name></map>
    </index>
    <index id="1007">
      <title>identifier</title>
      <map><name set="dc">identifier</name></map>
    </index>
    <index id="1018">
      <title>publisher</title>
      <map><name set="dc">publisher</name></map>
    </index>
    <index id="1020">
      <title>editor</title>
      <map><name set="dc">editor</name></map>
    </index>

    <index id="7">
      <title>isbn</title>
      <map><name set="bath">isbn</name></map>
    </index>
    <index id="8">
      <title>issn</title>
      <map><name set="bath">issn</name></map>
    </index>
    <index id="1002">
      <title>name</title>
      <map><name set="bath">name</name></map>
    </index>
  </indexInfo>

  <schemaInfo>
    <schema identifier="info:srw/schema/1/marcxml-v1.1" sort="false" name="marcxml">
      <title>MARCXML</title>
    </schema>

    <schema identifier="info:srw/schema/1/dc-v1.1" sort="false" name="dc">
      <title>Dublin Core</title>
    </schema>

    <schema identifier="http://www.loc.gov/mods" sort="false" name="mods2">
      <title>MODS v2</title>
    </schema>

    <schema identifier="info:srw/schema/1/mods-v3.0" sort="false" name="mods">
      <title>MODS v3</title>
    </schema>
  </schemaInfo>

  <configInfo>
    <default type="numberOfRecords">0</default>
  </configInfo>
</explain></zs:recordData></zs:record></zs:explainResponse>

Re: [CODE4LIB] Zotero, unapi, and formats?

2010-04-06 Thread Ere Maijala
This was the only information I found when I developed unAPI support for 
our MetaLib installation: 
http://forums.zotero.org/discussion/1229/unapi-support/. Based on my 
experimentation and looking at the code, if my memory serves:


1. At least the formats mentioned in the forum post. I believe it uses 
the docs attribute to distinguish formats, since type can be e.g. 
application/xml for multiple formats (see the example formats response 
after this list).


2. Weird, but I don't remember how. I ended up providing only MARCXML, 
DC and RIS, because it chose MODS over MARCXML if it was available and 
did something that sucked. Things may have changed; this was in 2008.


3. Didn't test this one; we only provide a single record at a time.

4. It chose COinS over unAPI at least at the time, and I found that to 
be a bit problematic.


5. Dunno.
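
For reference, the kind of unAPI formats response I mean above looks
roughly like this (an illustrative example only; the format names and
docs URLs vary by implementation):

<?xml version="1.0" encoding="UTF-8"?>
<formats id="example-record-id">
  <format name="marcxml" type="application/xml" docs="http://www.loc.gov/standards/marcxml/"/>
  <format name="oai_dc" type="application/xml" docs="http://www.openarchives.org/OAI/2.0/oai_dc.xsd"/>
  <format name="ris" type="text/plain"/>
</formats>

Note that the two application/xml formats can only be told apart by their
name and docs attributes.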

--Ere

On 6.4.2010 16:48, Jonathan Rochkind wrote:

Anyone know if there's any developer documentation for Zotero on its
use of unAPI? Alternately, does anyone know where I can find the answers to
these questions, or know the answers themselves?

1. What formats will Zotero use via unAPI? What MIME content-types does
it use to recognize those formats (sometimes a format has several in
use, or no official content-type)?

2. What is Zotero's order of preference when multiple formats via unAPI
are available?

3. Will Zotero get confused if different documents on the page have
different formats available? This can be described with unAPI, but it
seems atypical, so not sure if it will confuse Zotero.

4. If both unAPI and COinS are on a given page -- will Zotero use both
(resulting in possible double-import for citations exposed both ways).
Or only one? Or depends on how you set up the HTML?

5. Somewhere that now I can't find I saw a mention of a Zotero RDF
format that Zotero would consume via unAPI. Is there any documentation
of this format/vocabulary, how can I find out how to write it?



Re: [CODE4LIB] List of MARC flavors

2010-03-24 Thread Ere Maijala

Also worth taking a look at is the Z39.50 OID list:
http://www.loc.gov/z3950/agency/defns/oids.html

--Ere

On 23.3.2010 20:50, Houghton,Andrew wrote:

Does anyone know where there might be a list of the various flavors of MARC?

I currently have:

marc21
usmarc  US MARC Replaced by marc21
rusmarc Russian MARC
canmarc Canadian MARC   Replaced by marc21
ukmarc  UK MARC Replaced by marc21
cmarc   Chinese MARC
unimarc Uni-MARC




--
Ere Maijala (Mr.)
IT Research Specialist
The National Library of Finland


Re: [CODE4LIB] Q: XML2JSON converter [MARC-JSON]

2010-03-16 Thread Ere Maijala

On 03/15/2010 06:22 PM, Houghton,Andrew wrote:

Secondly, Bill's specification looses semantics from ISO 2709, as I
previously pointed out.  His specification clumps control and data
fields into one property named fields. According to ISO 2709, control
and data fields have different semantics.  You could have a control
field tagged as 001 and a data field tagged as 001 which have
different semantics.  MARC-21 has imposed certain rules for


I won't comment on Bill's proposal, but I'll just say that I don't think 
you can have a control field and a data field with the same code in a 
single MARC format. Well, technically it's possible, but in practice 
everything I've seen relies on the rules of the MARC format at hand. You 
could actually say that ISO 2709 works more like Bill's JSON, and 
MARCXML is the different one, as in ISO 2709 the directory doesn't 
separate control and data fields.
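
A quick way to see this is to dump the directory of a raw record: each
entry carries only a tag, a length and a starting position, and the
control/data distinction comes purely from the tag convention. A rough
Perl sketch (it assumes the MARC 21 entry map of 4-digit length and
5-digit starting position, and reads one record from STDIN):

use strict;
use warnings;

my $record = do { local $/ = "\x1D"; <STDIN> };     # one ISO 2709 record
my $base   = substr($record, 12, 5);                # base address of data (leader positions 12-16)
my $dir    = substr($record, 24, $base - 24 - 1);   # directory, minus its field terminator

while ($dir =~ /\G(\d{3})(\d{4})(\d{5})/g) {
    my ($tag, $len, $start) = ($1, $2, $3);
    # Nothing in the entry itself says "control" or "data"; in MARC 21 it is
    # simply the convention that tags below 010 are control fields.
    my $kind = $tag lt '010' ? 'control' : 'data';
    printf "%s %-7s len=%d start=%d\n", $tag, $kind, $len, $start;
}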


--Ere

--
Ere Maijala (Mr.)
The National Library of Finland


Re: [CODE4LIB] HTML mark-up in MARC records

2009-06-25 Thread Ere Maijala

Jonathan Rochkind wrote:

Ere Maijala wrote:


That shouldn't be a problem as any sane OAI-PMH provider, unAPI or ATOM
serializer would escape the contents. Things that resemble HTML tags
could be present in MARC records without any HTML-in-MARC too.
  
Sure, and then, if you have html tags in your marc, that system doing 
the re-use is going to present content to users with escaped HTML in it, 
which isn't desirable either!


How the content is stored in the transport format is separate from how 
it is used. Whatever the re-using system does is not related to how the 
data was transferred to it. If it extracts the stuff from the XML, it 
will of course unescape the content, but what happens after that is up 
to the system and unrelated to the transport mechanism. So here is an 
example of the whole process:


MARC with embedded HTML
->
OAI-PMH provider escapes the MARC in some XML format
->
OAI-PMH harvester (the re-using system) unescapes the data from the XML format
->
Something is done with the data

It's the same as if the source system stores the data internally in 
MARCXML. The content must be escaped so that it can be stored in MARCXML 
and doesn't mess up the markup, but when the system uses the data e.g. for 
display, it's first retrieved from XML and unescaped, and massaged to 
the desired display format only after that. If you use DOM to do the XML 
manipulation, all this will happen automatically. You just write and 
read strings and DOM manipulation takes care of escaping and unescaping.
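
A quick Perl illustration of that DOM point (hypothetical element name,
but the escaping behaviour is the interesting part):

use strict;
use warnings;
use XML::LibXML;

my $doc   = XML::LibXML::Document->new('1.0', 'UTF-8');
my $field = $doc->createElement('subfield');
$doc->setDocumentElement($field);
$field->appendText('A note with <b>embedded</b> HTML');

print $doc->toString();            # the serialized XML contains &lt;b&gt;embedded&lt;/b&gt;
print $field->textContent, "\n";   # reading it back returns the original, unescaped string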


You could substitute XML with e.g. Base64 encoding if it makes thinking 
about this stuff easier. For instance email clients often send binary 
files in Base64, but it doesn't mean the file is ruined, as the 
receiving email client can decode it back to the original binary.
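
To make the analogy concrete, the same round trip in Perl with a few
arbitrary bytes:

use strict;
use warnings;
use MIME::Base64 qw(encode_base64 decode_base64);

my $binary  = "\x00\xFF\xC3\xA9";         # arbitrary binary data
my $encoded = encode_base64($binary);     # text-safe representation for transport
print "round trip OK\n" if decode_base64($encoded) eq $binary;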


--Ere


Re: [CODE4LIB] COinS in OL?

2008-12-02 Thread Ere Maijala

Gabriel Farrell wrote:

COinS would be great, but unAPI would be useful also. In the case of
Zotero, for example, more information can be passed along with unAPI
than with COinS.


I agree. They are not mutually exclusive and can be used for quite 
different purposes. In my experience COinS is great for getting the data 
for OpenURL linking, but unAPI is much better for saving records and 
gives the client the power to decide what information it needs.
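
For comparison, a COinS span only carries a URL-encoded OpenURL
ContextObject in the title attribute of an empty span, along these lines
(an illustrative Perl sketch with made-up citation data):

use strict;
use warnings;
use URI::Escape qw(uri_escape_utf8);

my @kev = (
    ctx_ver      => 'Z39.88-2004',
    rft_val_fmt  => 'info:ofi/fmt:kev:mtx:book',
    'rft.btitle' => 'An Example Book',
    'rft.au'     => 'Doe, Jane',
    'rft.date'   => '2008',
);

my @pairs;
while (my ($k, $v) = splice(@kev, 0, 2)) {
    push @pairs, uri_escape_utf8($k) . '=' . uri_escape_utf8($v);
}

# Join with &amp; because the ContextObject ends up inside an HTML attribute.
my $title = join '&amp;', @pairs;
print qq{<span class="Z3988" title="$title"></span>\n};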


--Ere

--
Ere Maijala (Mr.)
IT Research Specialist
The National Library of Finland


Re: [CODE4LIB] refworks developer documentation?

2008-06-06 Thread Ere Maijala

Jonathan,

Official documentation: http://www.refworks.com/DirectExport.htm

Sample code from our MetaLib implementation at
http://wiki.helsinki.fi/display/Nelli/Direct+Export+to+RefWorks+from+MetaLib
and Voyager at
http://www.nationallibrary.fi/libraries/linnea/pwebrecon2.html.

--Ere

Jonathan Rochkind wrote:

Does anyone know where, if anywhere, I find documentation on the ways to
send references to RefWorks for importing?

Not having any luck on their website. I know I've seen it before
though.  I remember there were a variety of formats and methods you
could send things to RefWorks for an import. Must be documentation
somewhere?  I bet some code4libber has done this before.

Jonathan

--
Jonathan Rochkind
Digital Services Software Engineer
The Sheridan Libraries
Johns Hopkins University
410.516.8886
rochkind (at) jhu.edu




--
Ere Maijala (Mr.)
IT Research Specialist
The National Library of Finland
P.O.Box 26 (Teollisuuskatu 23)
FI-00014 University of Helsinki
FINLAND

ere.maijala(@)helsinki.fi
Tel. +358 9 191 44260