Re: [CODE4LIB] pdf2txt
Eric,

You might want to consider using http://www.documentcloud.org to host your users' documents. That would also take care of the privacy/authentication concerns. I know of a project in the journalism domain (http://overview.ap.org/) which does that. As far as I remember, they provide an API and do some named entity recognition as well.

Regards,
Arash

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Eric Lease Morgan
Sent: 11 October 2013 18:58
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] pdf2txt

On Oct 11, 2013, at 1:49 PM, Matthew Sherman matt.r.sher...@gmail.com wrote:

For a limited period of time I am making publicly available a Web-based program called PDF2TXT -- http://bit.ly/1bJRyh8

Very slick, good work. I can see where this tool can be very helpful. It does have some issues with some characters, but this is rather common with most systems. Again, thank you for the support.

Yes, there are some escaping issues to be resolved. Release early. Release often. I need help with the graphic design in general. Here's an enhancement I thought of:

1. allow readers to authenticate
2. allow readers to upload documents
3. documents get saved in readers' cache
4. allow interface to list documents in the cache
5. provide text mining services against reader-selected documents
6. go to Step #1

It would also be cool if I could figure out how to finish the installation of Tesseract to enable OCRing. [1]

[1] OCRing - http://serials.infomotions.com/code4lib/archive/2013/201303/1554.html

--
Eric Morgan
[CODE4LIB] WorldCat API - myTags
Hi all,

When viewing a work's metadata on the WorldCat.org website, the tag section of the page gives you the option to add new tags after logging in with your (free) account. I was wondering if there is a WorldCat API to do this from within my Java code.

Thanks,
Arash
[CODE4LIB] MARC field for FAST
Given a collection of scientific documents annotated with FAST subject headings, I was wondering what MARC field should be used to represent FAST?

DDC  (MARC-082)
LCC  (MARC-050)
LCSH (MARC-650)
FAST ?

Thanks,
Arash
Re: [CODE4LIB] articles using ddc
Thanks Karen,

Rene also mentioned BASE (thanks). They only go as far as the third level of the DDC, and in all the cases I checked the DDC classes had been assigned automatically.

Meanwhile, I have found out that some university libraries have assigned subject metadata to the technical reports and articles archived in their institutional repositories, e.g., see:
http://www.worldcat.org/title/supporting-oo-design-heuristics/oclc/191481189&referer=brief_results

None of them have all of the DDC, LCC, and LCSH metadata assigned, but it is a good start anyway. Some collections, such as the 1400 research papers from the Carnegie Mellon University School of Computer Science, have proper LCSHs assigned but have all been given the same DDC and LCC numbers (i.e., LCC: QA76.C37, DDC: 510.7808), which I suppose is understandable considering the amount of work required to classify them properly by hand.

Thanks,
Arash

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Karen Coyle
Sent: 12 July 2012 18:12
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] articles using ddc

Someone asked a while back about a source of journal articles that had been indexed using DDC. I have found such a source here:
http://www.base-search.net/Browse/Home

No idea if it meets your needs, but it reminded me.

kc

--
Karen Coyle
kco...@kcoyle.net http://kcoyle.net
ph: 1-510-540-7596
m: 1-510-435-8234
skype: kcoylenet
[CODE4LIB] Test dataset for evaluation of automatic classification of research documents according to FAST and DDC
Hi all,

I am developing a software system that analyzes the content of research documents (e.g., research papers, articles, etc.) archived in scientific repositories (e.g., http://citeseerx.ist.psu.edu/, http://arxiv.org/) and automatically classifies them according to FAST and DDC. In order to evaluate the performance of the system objectively, I need a collection of research documents which have been manually classified according to the DDC and assigned FAST subject headings. I was wondering if anyone is aware of such a dataset being available online.

Regards,
Arash
Re: [CODE4LIB] OCLC Classify API - sfa vs. nsfa
controlNumber: 47151174 DDC - afa:FIC nsfa:null
controlNumber: 30576709 DDC - afa:510.7808 nsfa:510.78
controlNumber: 36240850 DDC - afa:510.7808 nsfa:510.78
controlNumber: 36240846 DDC - afa:510.7808 nsfa:510.78
controlNumber: 25415527 DDC - afa:510.7808 nsfa:510.78
controlNumber: 32043473 DDC - afa:510.7808 nsfa:510.78
controlNumber: 7559271 DDC - afa:748.2917 nsfa:748.291
controlNumber: 38735328 DDC - afa:E nsfa:null
controlNumber: 122704504 DDC - afa:516.158 nsfa:516.15
controlNumber: 47198847 DDC - afa:E nsfa:null

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Steve Meyer
Sent: 21 June 2012 13:46
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] OCLC Classify API - sfa vs. nsfa

For the Classify service at OCLC, when it is LCC we use a regular expression: ^[a-zA-Z]{1,3}[1-9].*$. For DDC we filter out the truncation symbols, spaces, quotes, etc.

-Steve

On Wed, Jun 20, 2012 at 8:54 AM, Arash.Joorabchi arash.joorab...@ul.ie wrote:

Hi all,

I am using the OCLC Classify API. As shown in the sample response snippet below, the two attributes sfa and nsfa can hold different values. According to http://oclc.org/developer/documentation/classify/response-details:

sfa - classification number from the subfield $a of 082/092 or 050/090, or 060/096
nsfa - normalized classification number from the subfield $a of 082/092 or 050/090, or 060/096

However, I would like to know how this normalization is done.
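Steve's description can be sketched roughly as follows. The regular expression is the one quoted in his message; the DDC clean-up is only a guess at what "filter out the truncation symbols, spaces, quotes, etc." might mean (here: deleting slashes, prime marks, double quotes and whitespace), and the class and method names are made up for illustration. The sketch does not try to reproduce the shortening of 510.7808 to 510.78, which his description does not cover.

```java
import java.util.regex.Pattern;

class ClassNumberFilter {
    // Regular expression quoted in Steve Meyer's message for LCC numbers.
    static final Pattern LCC = Pattern.compile("^[a-zA-Z]{1,3}[1-9].*$");

    // True if the value has the shape of an LCC number (1-3 letters, then a non-zero digit).
    static boolean looksLikeLcc(String value) {
        return value != null && LCC.matcher(value.trim()).matches();
    }

    // ASSUMPTION: "filtering" a DDC number means deleting slashes, prime marks,
    // double quotes and whitespace. The actual OCLC rules are not given in the thread.
    static String stripDdcMarks(String value) {
        return value.replaceAll("[/'\"\\s]", "");
    }

    public static void main(String[] args) {
        System.out.println(looksLikeLcc("QA76.C37"));
        System.out.println(looksLikeLcc("FIC"));
        System.out.println(stripDdcMarks("510/.78 08"));
    }
}
```

Values such as FIC or E, which appear in the output listing, fail the LCC pattern because no digit follows the letters.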
[CODE4LIB] OCLC Classify API - sfa vs. nsfa
Hi all,

I am using the OCLC Classify API. As shown in the sample response snippet below, the two attributes sfa and nsfa can hold different values. According to http://oclc.org/developer/documentation/classify/response-details:

sfa - classification number from the subfield $a of 082/092 or 050/090, or 060/096
nsfa - normalized classification number from the subfield $a of 082/092 or 050/090, or 060/096

However, I would like to know how this normalization is done.

Thanks,
Arash

<recommendations>
  <graph>http://chart.apis.google.com/chart?cht=p&amp;chs=350x200&amp;chd=t:100.0&amp;chtt=All+Editions&amp;chdl=Classified (100.00%)</graph>
  <fast>
    <graph>http://chart.apis.google.com/chart?cht=p&amp;chs=475x175&amp;chd=t:100.0,16.68,16.68,16.68&amp;chl=Functional programming (Computer science)|Lambda calculus|Modality (Logic)|Type theory|</graph>
    <headings>
      <heading heldby="6" ident="fst00936086">Functional programming (Computer science)</heading>
      <heading heldby="1" ident="fst00991011">Lambda calculus</heading>
      <heading heldby="1" ident="fst01024350">Modality (Logic)</heading>
      <heading heldby="1" ident="fst01159972">Type theory</heading>
    </headings>
  </fast>
  <ddc>
    <mostPopular holdings="6" nsfa="510.78" sfa="510.7808"/>
    <mostRecent holdings="6" sfa="510.7808"/>
    <graph>http://chart.apis.google.com/chart?cht=p&amp;chs=350x200&amp;chd=t:100.0&amp;chtt=DDC&amp;chdl=510.7808</graph>
  </ddc>
</recommendations>
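For anyone wanting to compare the two attributes programmatically, a minimal sketch using the JDK's built-in DOM parser is given below. It assumes the response is laid out as in the snippet above (a ddc element containing a mostPopular element with sfa and nsfa attributes); the class and method names are invented for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ClassifyDdc {
    // Returns {sfa, nsfa} from the first <ddc>/<mostPopular> element, or null if there is none.
    // A missing attribute comes back as an empty string (standard DOM behaviour).
    static String[] mostPopularDdc(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList ddc = doc.getElementsByTagName("ddc");
        if (ddc.getLength() == 0) {
            return null;
        }
        NodeList popular = ((Element) ddc.item(0)).getElementsByTagName("mostPopular");
        if (popular.getLength() == 0) {
            return null;
        }
        Element e = (Element) popular.item(0);
        return new String[] { e.getAttribute("sfa"), e.getAttribute("nsfa") };
    }

    public static void main(String[] args) throws Exception {
        String xml = "<recommendations><ddc>"
                + "<mostPopular holdings=\"6\" nsfa=\"510.78\" sfa=\"510.7808\"/>"
                + "</ddc></recommendations>";
        String[] pair = mostPopularDdc(xml);
        System.out.println("sfa=" + pair[0] + " nsfa=" + pair[1]);
    }
}
```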
Re: [CODE4LIB] WorldCat SRU queries - elimination of records without a DDC no from the result set
Thank you Roy and Simon for the info. As for your second point, I suppose one advantage of using the WorldCat API at this experimental stage is that the returned bib records are already FRBR-ized.

Ross - Thanks for the link to the Open Library data dump. The WorldCat collection is two orders of magnitude larger than Open Library, which makes a significant difference considering the skewness and sparsity of bib records classified according to library taxonomies, e.g., DDC, LCC (for more info, see: http://cdm15003.contentdm.oclc.org/cdm/singleitem/collection/p267701coll27/id/277/rec/28)

Thanks,
Arash

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Simon Spero
Sent: 22 May 2012 19:47
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] WorldCat SRU queries - elimination of records without a DDC no from the result set

Arash - you might not want to use a straight dump of WorldCat catalog records - at least not without the associated holdings information.* There are a lot of quasi-duplicate records that are sufficiently broken that the WorldCat de-duplication algorithm refuses to merge them. These records will usually only be used by a handful of institutions; the better records will tend to have more associated holdings. The holdings count should be used to weight the strength of association between class numbers and features.

Also, since classification/categorization is something that is usually considered to be a property of works, rather than manifestations, one might get better results by using Work sets for training.

I would suggest, er, contacting Thom Hickey.

Simon

* Well, not precisely holdings - you just need the number of distinct institutions with at least one copy. I call them 'hasings'.

On Sat, May 19, 2012 at 8:42 PM, Roy Tennant roytenn...@gmail.com wrote:

Arash,

Yes, we have made WorldCat available to researchers under a special license agreement. I suggest contacting Thom Hickey hic...@oclc.org about such an arrangement.
Thanks,
Roy
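Simon's point about weighting by holdings could be sketched along the following lines. This is only an illustration under stated assumptions: each training record is reduced to a set of features (e.g. keyphrases), one class number, and the number of distinct holding institutions, and the weight is simply the sum of those counts. The linear weighting, the class name and the method names are all assumptions, not something specified in the thread.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

class HoldingsWeights {
    // feature -> (class number -> accumulated weight)
    private final Map<String, Map<String, Double>> weights = new HashMap<>();

    // Adds one record's evidence: the feature co-occurs with the class number,
    // weighted by the number of distinct institutions holding the record.
    void add(String feature, String classNumber, int holdingInstitutions) {
        weights.computeIfAbsent(feature, k -> new HashMap<>())
               .merge(classNumber, (double) holdingInstitutions, Double::sum);
    }

    // Accumulated weight for a (feature, class number) pair; 0.0 if never seen.
    double weight(String feature, String classNumber) {
        Map<String, Double> byClass =
                weights.getOrDefault(feature, Collections.<String, Double>emptyMap());
        return byClass.getOrDefault(classNumber, 0.0);
    }

    public static void main(String[] args) {
        HoldingsWeights w = new HoldingsWeights();
        w.add("lambda calculus", "510.78", 6); // widely held record
        w.add("lambda calculus", "004", 1);    // record held by a single institution
        System.out.println(w.weight("lambda calculus", "510.78"));
        System.out.println(w.weight("lambda calculus", "004"));
    }
}
```

With this scheme a quasi-duplicate record held by one institution contributes far less than a well-maintained record held by many, which is the effect Simon describes.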
Re: [CODE4LIB] WorldCat SRU queries - elimination of records without a DDC no from the result set
Dear Karen,

I am conducting a research experiment on automatic text classification, and I am trying to retrieve the top matching bib records (which include DDC fields) for a set of keyphrases extracted from a given document. So, I suppose this is a rather exceptional use case. In fact, the right approach for this experiment would be to process a full dump of the WorldCat database directly rather than sending a limited number of queries via the API.

I read here: http://dltj.org/article/worldcat-lld-may-become-available-under-odc-by/ that WorldCat might become available as open linked data in future, which would solve my problem and help similar text mining projects. However, I wonder if it is currently available to researchers under a research/non-commercial use license agreement.

Regards,
Arash

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Karen Coombs
Sent: 17 May 2012 08:37
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] WorldCat SRU queries - elimination of records without a DDC no from the result set

I forwarded this thread to the Product Manager for the WorldCat Search API. She responded that unfortunately this query is not possible using the API at this time. FYI, the SRU interface to the WorldCat Search API doesn't currently support any scan type searches either.

Is there a particular use case you're trying to support? Knowing that would help us document this as a possible enhancement.

Karen

Karen Coombs
Senior Product Analyst
Web Services
OCLC
coom...@oclc.org

On Wed, May 16, 2012 at 9:49 PM, Arash.Joorabchi arash.joorab...@ul.ie wrote:

Hi Andy,

I am a SRU newbie myself, so I don't know how this could be achieved using scan operations and could not find much info on the SRU website (http://www.loc.gov/standards/sru/).
[CODE4LIB] WorldCat SRU queries - elimination of records without a DDC no from the result set
Hi all,

I am sending SRU queries to WorldCat in the following form:

    String host = "http://worldcat.org/webservices/catalog/search/";
    String query = "sru?query=srw.kw=\"" + keyword + "\""
        + " AND srw.ln exact \"eng\""
        + " AND srw.mt all \"bks\""
        + " AND srw.nt=\"" + keyword + "\""
        + "&servicelevel=full"
        + "&maximumRecords=100"
        + "&sortKeys=relevance,,0"
        + "&wskey=[wskey]";

It is working fine. However, I'd like to limit the results to those records that have a DDC number assigned to them, and I don't know the right way to specify this limit in the query.

    NOT srw.dd=""
    NOT srw.dd=null

Neither of the above works.

Thanks,
Arash

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Chad Benjamin Nelson
Sent: 15 May 2012 21:54
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] Atlanta Digital Libraries meetup - May 23rd

The first / next Atlanta Digital Libraries meetup is coming up soon:

Wednesday, May 23rd
7pm
Manuel's Tavern (http://www.manuelstavern.com/location.php)
602 N Highland Avenue Northeast
Atlanta, GA 30307
North Avenue Room

We have two scheduled talks, and are still looking for others interested in presenting. It's informal, so even if it is just a short topic you want to get some feedback on, we'd love to hear it.

So, come along if you are interested and in the area.

Chad

Chad Nelson
Web Services Programmer
University Library
Georgia State University
e: cnelso...@gsu.edu
t: 404 413 2771
My Calendar: http://bit.ly/qybPLJ
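Returning to the SRU query at the top of this message: one way to assemble such a URL while keeping the quotes and spaces in the CQL safe is to percent-encode the query value. The sketch below reuses the host, indexes and parameters from the message; the class name, the method name and the decision to pass the key as a parameter are assumptions made for illustration, and the keyword is assumed to contain no double quotes.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class SruUrl {
    static final String HOST = "http://worldcat.org/webservices/catalog/search/";

    // Builds the request URL; requires Java 10+ for URLEncoder.encode(String, Charset).
    static String build(String keyword, String wskey) {
        // CQL query as in the original message (keyword assumed to contain no double quotes).
        String cql = "srw.kw=\"" + keyword + "\""
                + " AND srw.ln exact \"eng\""
                + " AND srw.mt all \"bks\""
                + " AND srw.nt=\"" + keyword + "\"";
        return HOST + "sru?query=" + URLEncoder.encode(cql, StandardCharsets.UTF_8)
                + "&servicelevel=full"
                + "&maximumRecords=100"
                + "&sortKeys=" + URLEncoder.encode("relevance,,0", StandardCharsets.UTF_8)
                + "&wskey=" + URLEncoder.encode(wskey, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(build("lambda calculus", "YOUR_WSKEY"));
    }
}
```

URLEncoder produces form-style encoding, so spaces become "+" and quotes become "%22", which is acceptable in a query string.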
Re: [CODE4LIB] WorldCat SRU queries - elimination of records without a DDC no from the result set
Hi Mark,

srw.dd=* does not work either:

Identifier: info:srw/diagnostic/1/27
Meaning:
Details: srw.dd
Message: The index [srw.dd] did not include a searchable value

I suppose the only option left is to retrieve everything and filter the results on the client side.

Thanks for your quick reply.
Arash

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Mike Taylor
Sent: 16 May 2012 10:43
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] WorldCat SRU queries - elimination of records without a DDC no from the result set

There is no standard way in CQL to express "field X is not empty". Depending on implementations, NOT srw.dd="" might work (but evidently doesn't in this case). Another possibility is srw.dd=*, but again that may or may not work, and might be appallingly inefficient if it does.

NOT srw.dd=null will definitely not work: null is not a special word in CQL.

-- Mike.

On 16 May 2012 10:32, Arash.Joorabchi arash.joorab...@ul.ie wrote:

Hi all,

I am sending SRU queries to WorldCat in the following form:

    String host = "http://worldcat.org/webservices/catalog/search/";
    String query = "sru?query=srw.kw=\"" + keyword + "\""
        + " AND srw.ln exact \"eng\""
        + " AND srw.mt all \"bks\""
        + " AND srw.nt=\"" + keyword + "\""
        + "&servicelevel=full"
        + "&maximumRecords=100"
        + "&sortKeys=relevance,,0"
        + "&wskey=[wskey]";

It is working fine. However, I'd like to limit the results to those records that have a DDC number assigned to them, and I don't know the right way to specify this limit in the query.

    NOT srw.dd=""
    NOT srw.dd=null

Neither of the above works.

Thanks,
Arash
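The client-side filtering mentioned above might look something like the sketch below. It assumes the SRU response carries MARCXML records (namespace http://www.loc.gov/MARC21/slim) and treats a record as "having a DDC number" when it contains a datafield with tag 082; whether 092 should also count is left open. The class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DdcFilter {
    // Namespace URI of MARCXML; used to tell MARC <record> elements from SRU wrapper <record> elements.
    static final String MARC_NS = "http://www.loc.gov/MARC21/slim";

    // Returns the MARCXML records that contain at least one 082 datafield.
    static List<Element> recordsWithDdc(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList records = doc.getElementsByTagNameNS(MARC_NS, "record");
        List<Element> kept = new ArrayList<>();
        for (int i = 0; i < records.getLength(); i++) {
            Element record = (Element) records.item(i);
            NodeList fields = record.getElementsByTagNameNS(MARC_NS, "datafield");
            for (int j = 0; j < fields.getLength(); j++) {
                if ("082".equals(((Element) fields.item(j)).getAttribute("tag"))) {
                    kept.add(record);
                    break; // one 082 field is enough
                }
            }
        }
        return kept;
    }
}
```

The obvious cost is that records without a DDC number still count against maximumRecords, so more requests may be needed to collect a given number of usable records.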
Re: [CODE4LIB] WorldCat SRU queries - elimination of records without a DDC no from the result set
Hi Andy,

I am a SRU newbie myself, so I don't know how this could be achieved using scan operations and could not find much info on the SRU website (http://www.loc.gov/standards/sru/).

As for the wildcards, according to this guide:
http://www.oclc.org/support/documentation/worldcat/searching/refcard/searchworldcatquickreference.pdf
the symbols should be preceded by at least 3 characters, and therefore clauses like:

... AND srw.dd=*
... AND srw.dd=?.*
... AND srw/dd=###.*
... AND srw/dd=?3.*

do not work and result in the following error:

Diagnostics
Identifier: info:srw/diagnostic/1/9
Meaning:
Details:
Message: Not enough chars in truncated term:Truncated words too short(9)

Thanks,
Arash

From: Houghton,Andrew [mailto:hough...@oclc.org]
Sent: 16 May 2012 11:58
To: Arash.Joorabchi
Subject: Re: [CODE4LIB] WorldCat SRU queries - elimination of records without a DDC no from the result set

I'm not an SRU guru, but is it possible to do a scan and look for a postings of zero?

Andy.

On May 16, 2012, at 6:39, Arash.Joorabchi arash.joorab...@ul.ie wrote:

Hi Mark,

srw.dd=* does not work either:

Identifier: info:srw/diagnostic/1/27
Meaning:
Details: srw.dd
Message: The index [srw.dd] did not include a searchable value

I suppose the only option left is to retrieve everything and filter the results on the client side.

Thanks for your quick reply.
Arash

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Mike Taylor
Sent: 16 May 2012 10:43
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] WorldCat SRU queries - elimination of records without a DDC no from the result set

There is no standard way in CQL to express "field X is not empty". Depending on implementations, NOT srw.dd="" might work (but evidently doesn't in this case). Another possibility is srw.dd=*, but again that may or may not work, and might be appallingly inefficient if it does.

NOT srw.dd=null will definitely not work: null is not a special word in CQL.

-- Mike.
Re: [CODE4LIB] linked data endpoints [wikipedia-miner]
It also has a built-in ML-based disambiguator, reportedly achieving a high F1-measure of 97.1 [1]

[1] http://www.cs.waikato.ac.nz/~dnk2/publications/CIKM08-LearningToLinkWithWikipedia.pdf

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Eric Lease Morgan
Sent: 17 May 2011 16:25
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] linked data endpoints [wikipedia-miner]

On May 16, 2011, at 9:13 AM, Arash.Joorabchi wrote:

If you think wikipedia articles could be used as good endpoints for your purposes then have a look at this opensource tool http://wikipedia-miner.sourceforge.net/

Wikipedia-miner is a pretty cool tool; it is a good example of various text mining techniques. It even supports a Web services interface. Thank you for bringing it to our attention.

--
Eric Morgan
University of Notre Dame
Re: [CODE4LIB] linked data endpoints
Hi Eric,

If you think Wikipedia articles could be used as good endpoints for your purposes, then have a look at this open-source tool: http://wikipedia-miner.sourceforge.net/

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Eric Lease Morgan
Sent: 16 May 2011 13:34
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] linked data endpoints

What are some of the ways to best insert Linked Data endpoints into an XML file?

I have been playing lately with named-entity recognition/extraction technology. [1] Feed a text file, such as a novel, into the recognition program. Get back a rudimentary XML file where things like names, places, and organizations are marked with simple tags. I can then extract all the place names from a text, tabulate them, display a word-cloud, allow the reader to select items, guess latitude and longitude of the place, and finally plot them on a map. [2] This process works pretty well, but Google Maps only allows me to plot a limited number of items at a time. Consequently, I am thinking about preprocessing my data by looping through the XML file and adding latitude and longitude attributes to the place name elements.

I then got to thinking about names and organizations. It would be nice to supplement these entities with canonical Linked Data endpoints. My application could then read the endpoints, extract the links associated with them, and display some sort of graphic illustrating relationships. Finally, I could allow the reader to select a relationship for further investigation.

Given a name -- say, Plato or Thoreau -- how would one go about identifying good endpoints? What sort of query would I send to what sort of database? What might I get back? Assuming my goal is to enrich the text, what sort of link(s) should I insert into my XML?

[1] NER - http://bit.ly/e0SnA6
[2] geo-location for WebKit mobile - http://bit.ly/msIu16

--
Eric Morgan
University of Notre Dame
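The preprocessing step Eric describes (looping through the XML and adding latitude and longitude attributes to the place name elements) could be sketched as below. Everything specific here is an assumption for illustration: the element name place, the attribute names lat and lon, and the in-memory gazetteer map standing in for whatever geocoding lookup is actually used.

```java
import java.io.StringReader;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class PlaceEnricher {
    // Adds lat/lon attributes to every <place> element whose text is found in the gazetteer.
    // gazetteer maps a place name to {latitude, longitude}; unknown places are left untouched.
    static Document addCoordinates(String xml, Map<String, double[]> gazetteer) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList places = doc.getElementsByTagName("place"); // hypothetical tag name
        for (int i = 0; i < places.getLength(); i++) {
            Element place = (Element) places.item(i);
            double[] latLon = gazetteer.get(place.getTextContent().trim());
            if (latLon != null) {
                place.setAttribute("lat", String.valueOf(latLon[0]));
                place.setAttribute("lon", String.valueOf(latLon[1]));
            }
        }
        return doc;
    }
}
```

The same loop could set a further attribute holding a Linked Data URI for person and organization elements once a source of endpoints has been chosen.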
Re: [CODE4LIB] A call for your OPAC (or other system) statistics! (Browse interfaces)
The stats reported in this paper might help: http://homes.ukoln.ac.uk/~kg249/publ/RenardusFinal.pdf

-----Original Message-----
From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of Bill Dueber
Sent: 03 May 2010 19:09
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] A call for your OPAC (or other system) statistics! (Browse interfaces)

I got email from a person today saying, and I quote,

I must say that [the lack of a browse interface] come as a shock (*which interface cannot browse??*) [Emphasis mine]

Here, a browse interface is one where you can get a giant list of all the titles/authors/subjects whatever -- a view on the data devoid of any searching.

Will those of you out there with browse interfaces in your system take a couple minutes to send along a guesstimate of what percentage of patron sessions involve their use?

[Note that for right now, I'm excluding type-ahead search boxes although there's an obvious and, in my mind, strong argument to be made that they're substantially similar for many types of data]

We don't have a browse interface on our (VuFind) OPAC right now. But in the interest of paying it forward, I can tell you that Mirlyn, our OPAC, has numbers like this:

Pct of Mirlyn sessions, Feb/March/April 2010, which included at least one basic search and also:

Go to full record view                      46%   (we put a lot of info in search results)
Select/favorite an item                     15%
Add a facet                                 13%
Export record(s) to email/refworks/RIS/etc. 3.4%
Send to phone (sms)                         0.21%
Click on faq/help/AskUs in footer           0.17% (324 total)

Based on 187,784 sessions, 2010.02.01 to 2010.04.31

So...anyone out there able to tell me anything about browse interfaces?

--
Bill Dueber
Library Systems Programmer
University of Michigan Library