Re: [CODE4LIB] Reconciling corporate names?

2014-09-29 Thread Simon Brown

You could always web scrape, or download the LCNAF and search it locally,
with a script that looks something like this:

# Build the query URL for web scraping
query = paste("http://id.loc.gov/search/?q=",
              URLencode("corporate name here", reserved = TRUE),
              "&q=cs%3Ahttp%3A%2F%2Fid.loc.gov%2Fauthorities%2Fnames",
              sep = "")

# Make the call
result = readLines(query)

# Find the lines containing "Corporate Name"
lines = grep("Corporate Name", result)

# Alternatively, use approximate string matching on the downloaded LCNAF data
query <- agrep("corporate name here", LCNAF_data_here)

# Parse for whatever info you want
...

My native programming language is R, so I hope functions like paste,
readLines, grep, and URLencode are generic enough that other languages
have something similar.  This can all be wrapped up in a for loop:
for(i in 1:4){...}
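
For readers who don't use R, the loop with approximate matching could be
sketched in Python roughly as follows. This is only an illustration:
difflib.get_close_matches stands in for agrep (it uses a different similarity
measure), and the function name, the 0.85 cutoff, and the idea of holding the
labels in a plain list are all assumptions made for the example.

```python
import difflib


def match_names(names, lcnaf_labels, cutoff=0.85):
    """Return {name: best approximate match, or None} for each input name."""
    # Compare case-insensitively, but report the label in its original form.
    by_lower = {label.lower(): label for label in lcnaf_labels}
    candidates = list(by_lower)
    results = {}
    for name in names:
        close = difflib.get_close_matches(name.lower(), candidates,
                                          n=1, cutoff=cutoff)
        results[name] = by_lower[close[0]] if close else None
    return results
```

Calling match_names(local_names, labels) returns a dictionary keyed by the
local names. Each lookup scans every label, so this suits a filtered subset of
headings better than the full LCNAF.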

Web scraping the results one name at a time would be SLOW, and obviously
using an API is the way to go, but it didn't look like the OCLC LCNAF API
handled corporate names.  However, it sounds like someone in the previous
message found a workaround.  Best of luck! -Simon






On Mon, Sep 29, 2014 at 8:45 AM, Matt Carruthers mcarr...@umich.edu wrote:

 Hi Patrick,

 Over the last few weeks I've been doing something very similar.  I was able
 to figure out a process that works using OpenRefine.  It works by searching
 the VIAF API first, limiting results to anything that is a corporate name
 and has an LC source authority.  OpenRefine then extracts the LCCN and puts
 that through the LCNAF API that OCLC has to get the name.  I had to use
 VIAF for the initial name search because for some reason the LCNAF API
 doesn't really handle corporate names as search terms very well, but works
 with the LCCN just fine (there is the possibility that I'm just doing
 something wrong, and if that's the case, anyone on the list can feel free
 to correct me).  In the end, you get the LC name authority that corresponds
 to your search term and a link to the authority on the LC Authorities
 website.

 Anyway, the process is fairly simple to run (just prepare an Excel
 spreadsheet and paste JSON commands into OpenRefine).  The only caveat
 is that I don't think it will run all 40,000 of your names at once.  I've
 been using it to run 300-400 names at a time.  That said, I'd be happy to
 share what I did if you'd like to try it out.  I have instructions
 written up in a Word doc, and the JSON script is in a text file, so just
 email me off list and I can send them to you.
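
 Splitting a long list into runs of that size can be scripted rather than done
 by hand. A minimal Python sketch, assuming the names are already in a list;
 the function name is hypothetical and the default of 400 simply reflects the
 300-400 figure above:

```python
def batches(names, size=400):
    """Yield successive lists of at most `size` names."""
    for start in range(0, len(names), size):
        yield names[start:start + size]
```

 Each batch could then be written to its own spreadsheet and run separately.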

 Matt

 Matt Carruthers
 Metadata Projects Librarian
 University of Michigan
 734-615-5047
 mcarr...@umich.edu





-- 
Simon Brown
simoncbr...@gmail.com
simoncharlesbrown (Skype)
831.440.7466 (Phone)

*Following our will and wind we may just go where no one's been -- MJK*

Re: [CODE4LIB] Reconciling corporate names?

2014-09-29 Thread Trail, Nate

The id.loc.gov site has a good known-label service, described under "Known
label retrieval" here:
http://id.loc.gov/techcenter/searching.html

Use curl and content negotiation to avoid screen scraping. For example, for LC
name authorities:

curl -L -H "Accept: application/rdf+xml" "http://id.loc.gov/authorities/names/label/Library%20of%20Congress"
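
The same request can be made from a script. The Python sketch below mirrors
the curl call using only the standard library. It assumes, as the -L flag
suggests, that a matching label is answered with a redirect to the authority
record, and it further assumes that an unmatched label produces an HTTP 404;
both should be checked against the documentation linked above. The function
names are made up for the example.

```python
import urllib.error
import urllib.parse
import urllib.request

BASE = "http://id.loc.gov/authorities/names/label/"


def build_label_url(name):
    """Percent-encode a heading and append it to the known-label endpoint."""
    return BASE + urllib.parse.quote(name, safe="")


def lookup_label(name, timeout=30):
    """Return the URL the label resolves to, or None if it is not found."""
    request = urllib.request.Request(
        build_label_url(name),
        headers={"Accept": "application/rdf+xml"},
    )
    try:
        # urllib follows redirects by default, much like curl -L.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.geturl()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # assumed "no such label" response
            return None
        raise
```

Calling lookup_label("Library of Congress") performs one HTTP request per
name, so a pause between calls would be polite when checking thousands of
headings.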

Nate

==
Nate Trail
LS/TECH/NDMSO
Library of Congress
n...@loc.gov



Re: [CODE4LIB] Reconciling corporate names?

2014-09-29 Thread Jonathan Rochkind

For yet another data set and API that may or may not meet your needs,
consider VIAF, the Virtual International Authority File, operated by OCLC.

VIAF's data set includes the LC NAF as well as other national authority
files. I haven't done much work with it, so I'm not sure whether the API can
limit matches to the LC NAF, but I know it has one.


http://oclc.org/developer/develop/web-services/viaf.en.html


Re: [CODE4LIB] Reconciling corporate names?

2014-09-29 Thread Kyle Banerjee

IMO, an API isn't the best tool for this job. My inclination would be to just
download the LCNAF data, normalize both the source and comparison data, and
then compare via hash lookup.

That will be easier to write, and you'll be able to do thousands of
comparisons per second.
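
One way to sketch the normalize-and-hash idea in Python is shown below. The
normalization rules (strip accents, fold case, drop punctuation, collapse
whitespace) are an assumption chosen for illustration and are not the NACO
normalization rules; the function names are hypothetical too.

```python
import re
import unicodedata


def normalize(name):
    """Reduce a heading to a comparison key: no accents, case, or punctuation."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    lowered = no_marks.casefold()
    # Replace anything that is not a letter or digit with a space, then collapse.
    return " ".join(re.sub(r"[^\w]|_", " ", lowered).split())


def reconcile(local_names, lcnaf_labels):
    """Map each local name to the authorized label sharing its key, or None."""
    # A dict is a hash table, so each lookup is constant time on average.
    index = {normalize(label): label for label in lcnaf_labels}
    return {name: index.get(normalize(name)) for name in local_names}
```

Building the index is a single pass over the downloaded labels; after that,
checking 40,000 names is just 40,000 dictionary lookups.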

kyle


Re: [CODE4LIB] Reconciling corporate names?

2014-09-29 Thread Jean Roth

What is the link to the downloadable LCNAF data?  --  Jean


Re: [CODE4LIB] Reconciling corporate names?

2014-09-29 Thread Kyle Banerjee

After a quick search, http://id.loc.gov/download/ looks like the place to
go. I haven't downloaded it myself, but the file sizes make it look like
the right stuff.

kyle


Re: [CODE4LIB] Reconciling corporate names?

2014-09-29 Thread Jean Roth

Thank you!  It looks like the files are available as RDF/XML, Turtle, or
N-Triples.

Any examples or suggestions for reading any of these formats?

The MARC Countries file is small, 31-79 KB.  I assume a script that can
read a small file like that would at least be a start for the LCNAF.


Thanks,

Jean


Re: [CODE4LIB] Reconciling corporate names?

2014-09-29 Thread Kyle Banerjee

The best way to handle them depends on what you want to do. You need to
actually download the NAF files rather than the countries file or other small
files, because different kinds of data are organized differently. Just don't
try to read multigigabyte files in a text editor :)

If you start with one of the giant XML files, the first thing you'll
probably want to do is extract just the elements that are interesting to
you. A short string parsing or SAX routine in your language of choice
should let you get the information in a format you like.
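
A SAX routine along these lines might look like the following Python sketch.
It assumes the headings sit in madsrdf:authoritativeLabel elements, which
should be checked against the actual download; the class and function names
are invented for the example.

```python
import xml.sax


class LabelHandler(xml.sax.ContentHandler):
    """Collect the text of every element with the given qualified name."""

    def __init__(self, element_name):
        super().__init__()
        self.element_name = element_name
        self.labels = []
        self._buffer = None  # list of text chunks while inside a match

    def startElement(self, name, attrs):
        if name == self.element_name:
            self._buffer = []

    def characters(self, content):
        # SAX may deliver one text node in several pieces, so accumulate.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == self.element_name and self._buffer is not None:
            self.labels.append("".join(self._buffer))
            self._buffer = None


def extract_labels(xml_bytes, element_name="madsrdf:authoritativeLabel"):
    """Parse RDF/XML held in memory and return the matching element texts."""
    handler = LabelHandler(element_name)
    xml.sax.parseString(xml_bytes, handler)
    return handler.labels
```

For a multigigabyte file, passing the file path to xml.sax.parse(path, handler)
instead of using parseString streams the document rather than loading it into
memory. The handler as written still keeps every label in a list, so for very
large inputs it may be better to write each label out as it is found.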

If you download the linked data files and you're interested in actual
headings (as opposed to traversing relationships), grep and sed in
combination with the join utility are handy for extracting the elements you
want and flattening the relationships into something more convenient to
work with. But there are plenty of other tools that you could also use.
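
As one of those other tools, the grep-style extraction can also be sketched in
Python for the N-Triples download, where each line holds one statement. The
predicate URI below is an assumption and should be checked against the file
itself; the function name is hypothetical, and escape sequences inside the
literals are left undecoded.

```python
import re

# Assumed predicate; check the downloaded file for the one actually used.
LABEL_PREDICATE = "http://www.loc.gov/mads/rdf/v1#authoritativeLabel"

# Matches: <subject> <predicate> "literal" (language tag or datatype ignored)
TRIPLE = re.compile(r'^<([^>]+)>\s+<([^>]+)>\s+"((?:[^"\\]|\\.)*)"')


def iter_labels(lines, predicate=LABEL_PREDICATE):
    """Yield (subject URI, label) for each line that uses the given predicate."""
    for line in lines:
        match = TRIPLE.match(line)
        if match and match.group(2) == predicate:
            yield match.group(1), match.group(3)
```

Because the function accepts any iterable of lines, an open file handle can be
passed directly and the file is read one line at a time.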

If you don't already have a convenient environment to work in, I'm a fan
of VirtualBox. You can drag and drop things into and out of your regular
desktop or even access it directly. That way you can view and manipulate files
with the Linux utilities without a bunch of clunky file transfer operations
involving another machine. Very handy when you have to deal with
multigigabyte files.

kyle


[CODE4LIB] Reconciling corporate names?

2014-09-26 Thread Galligan, Patrick

I'm looking to reconcile about 40,000 corporate names against LCNAF to see 
whether they are authorized strings or not, but I'm drawing a blank about how 
to get it done.

I've used http://freeyourmetadata.org/ for reconciling subject headings before,
but I can't get it to work for LCNAF. Has anyone had any experience with a
project like this? I'd love to hear some ideas for automatically dealing with a
large data set that we did not create and whose names were constructed in ways
we don't know.

Thanks!

-Patrick Galligan

Re: [CODE4LIB] Reconciling corporate names?

2014-09-26 Thread Ethan Gruber

I would check with the developers of SNAC (
http://socialarchive.iath.virginia.edu/), as they've spent a lot of time
developing named entity recognition scripts for personal and corporate
names. They might have something you can reuse.

Ethan


Re: [CODE4LIB] Reconciling corporate names?

2014-09-26 Thread Karen Hanson

I found the WorldCat Identities API useful for an institution name
disambiguation project that I worked on a few years ago, though my goal wasn't
to confirm whether names mapped to LCNAF.  The API response includes an LCCN,
and you can set it to fuzzy or exact matching, but you would need to write a
script to pass each term in and process the results:

http://oclc.org/developer/develop/web-services/worldcat-identities.en.html

I also can't speak to whether all LC Name Authorities are represented, so there 
may be a chance of some false negatives.  

OCLC has another API, but not sure if it covers corporate names:
https://platform.worldcat.org/api-explorer/LCNAF

I suspect there are others on the list that know more about the inner workings 
of these APIs if this might be an option for you... :)

Karen

