Re: [ol-discuss] Is there an up-to-date MARC database of lendable ebooks?

Roger Loran Bailey Tue, 17 Feb 2015 20:58:01 -0800

--- Begin Message ---
Concerning that problem with OCR quality, let me make a suggestion.
I am a volunteer at Bookshare. There are two main tasks that volunteers perform. One is to scan books and to submit the scanned results to a checkout list where other volunteers do the second main task: proofreading them. The person who submits the scan is supposed to do a proofreading of his or her work too, so the person who does the final proofreading picks up anything the first person has missed.
Bookshare classifies these volunteer submissions into three grades of quality: fair, good and excellent. Several years ago they stopped accepting fair quality books, so it is really just two grades now. The so-called fair quality books should have been called awful because they were virtually unreadable.
Of the books I have downloaded from Open Library I have yet to come across one that would be classified as fair by Bookshare. The vast majority of them would be classified as good. That means that they have problems with numerous scanning errors, but they are quite readable.
Now, I understand that there is some kind of relationship between Open Library and Bookshare, but I don't know what that relationship is. If I am making a suggestion that shows that I don't know what I am talking about then I will fully admit that, but I have an idea of how both Bookshare and Open Library might take advantage of the relationship for mutual benefit. There are, of course, a lot of books that both Open Library and Bookshare hold in common, but there are also a lot of books available in protected Daisy format in Open Library that Bookshare does not have. Would it be possible for Open Library to dump all of those good quality books onto Bookshare's volunteer checkout list and let Bookshare volunteers proofread them and correct all of the scanning errors?
Then, after the proofreading, a corrected copy could be returned to Open Library and another copy could enter the Bookshare collection. That way Bookshare benefits by acquiring any number of obscure books that it would most likely not acquire otherwise and Open Library gets books that are of a much higher quality than it has now. This sounds reasonable to me, but if I am missing something that would not make it feasible then let me know.
On 2/17/2015 11:03 PM, Tom Morris wrote:
On Tue, Feb 17, 2015 at 8:48 PM, Jon Leech <oddh...@sonic.net> wrote:
    I'm trying to obtain a list of lendable ebooks from Open Library (in
    order to dispose of corresponding parts of my personal library). My
    first try landed me at the "Lending library MARC records" section of
    https://openlibrary.org/data#bulk_download

    Following the link there to
    https://archive.org/details/marc_lendable_books
    shows that the MARC records download was last updated in June 2011,
    and cursory inspection of what's in there suggests that it may not
    include even some lendable ebooks that were added to OL prior to
    June 2011.

    Is there a current source of MARC data corresponding to lendable
    ebooks in OL, or a straightforward way to derive one from other
    resources? I'd prefer not to have to learn how to write a web app
    simply in order to get at the catalog. Using a local database with
    the Perl MARC::Record package is my preference, if it's possible.
    If not, I'd appreciate suggestions on how to access this data with
    the least amount of pain. I'm a complete neophyte at the OL
    ecosystem, alas.
As far as I know, the MARC dump was a one-time thing. For a task like this, no web app programming should be required since the best source of info is probably the full OpenLibrary dump. The nice thing about working off the dump is that it allows you to both be independent of the vagaries of a MARC dump update schedule and tailor the meaning of "lendable ebook" to something which suits your constituency (English only? Languages commonly spoken in North America only? OCR gibberish excluded? etc.)
I don't do much Perl, but the dump is basically just a big hunk of JSON that you can filter in whatever manner you wish and then convert to MARC.
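As a rough illustration of the filtering step described above, here is a minimal Python sketch. It rests on several assumptions that are not stated in this thread: that each dump row is tab-separated with five columns (type, key, revision, last-modified timestamp, JSON), that edition records carry an `ocaid` field when a scan exists on Internet Archive, and that languages appear as `{"key": "/languages/eng"}` entries. The sample rows and the function names are made up for the example, and having an `ocaid` does not by itself establish that a book is lendable. Conversion to MARC is left out.

```python
import json


def iter_editions(lines):
    """Yield (key, record) for edition rows of a tab-separated dump (assumed layout)."""
    for line in lines:
        parts = line.rstrip("\n").split("\t", 4)
        if len(parts) != 5:
            continue  # skip rows that do not have the assumed five columns
        rec_type, key, _revision, _modified, raw = parts
        if rec_type != "/type/edition":
            continue
        yield key, json.loads(raw)


def language_codes(record):
    """Return the set of language codes, e.g. {'eng'}, found in a record."""
    return {
        lang["key"].rsplit("/", 1)[-1]
        for lang in record.get("languages", [])
        if isinstance(lang, dict) and "key" in lang
    }


def candidate_ebooks(lines, languages=("eng",)):
    """Yield editions that have a scan identifier and match one of the languages.

    An empty `languages` tuple disables the language filter.
    """
    wanted = set(languages)
    for key, rec in iter_editions(lines):
        if not rec.get("ocaid"):
            continue  # no scan identifier, so not an ebook candidate here
        if wanted and not (language_codes(rec) & wanted):
            continue
        yield {"key": key, "title": rec.get("title", ""), "ocaid": rec["ocaid"]}


def _row(rec_type, key, record):
    """Build one fake dump row for the demonstration below."""
    return "\t".join([rec_type, key, "1", "2015-02-17T00:00:00", json.dumps(record)])


# Hypothetical sample data, not taken from a real dump.
SAMPLE = [
    _row("/type/edition", "/books/OL1M",
         {"title": "Example A", "ocaid": "examplea00test",
          "languages": [{"key": "/languages/eng"}]}),
    _row("/type/edition", "/books/OL2M",
         {"title": "Example B", "languages": [{"key": "/languages/eng"}]}),
    _row("/type/edition", "/books/OL3M",
         {"title": "Example C", "ocaid": "examplec00test",
          "languages": [{"key": "/languages/fre"}]}),
    _row("/type/work", "/works/OL1W", {"title": "Example A"}),
]

if __name__ == "__main__":
    for book in candidate_ebooks(SAMPLE):
        print(book["key"], book["ocaid"], book["title"])
```

For a real dump, the same generator could be fed from an open file object instead of `SAMPLE`, so the whole file never has to be held in memory.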
I did a project in Python a while ago which was focused on classics in the English language available as eBooks from Internet Archive. The client, Santa Clara County Library, made all the code available as open source and you can find it here: https://github.com/tfmorris/openlibrary-utils
One of the things we attempted to do as part of this project was to get a handle on OCR quality (because much of it on IA, frankly, sucks) as well as find the best quality edition/scan of the many editions/scans on IA. A single work may exist in multiple editions and even a single edition often, perhaps surprisingly, has multiple scans from different sources with widely varying scan qualities, subsequent OCR qualities, etc.
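To make the idea of comparing scans by OCR quality concrete, here is a deliberately crude Python sketch. It is not the method used in openlibrary-utils; the scoring rule (the share of whitespace-separated tokens that look like plain words) and the names `ocr_score` and `best_scan` are assumptions chosen only for illustration.

```python
import re

# A token counts as "word-like" if it is letters only, optionally joined by
# apostrophes or hyphens, with optional trailing punctuation.
WORDLIKE = re.compile(r"^[A-Za-z]+(?:['-][A-Za-z]+)*[.,;:!?)]*$")


def ocr_score(text):
    """Return the fraction of tokens that look like ordinary words (0.0 to 1.0)."""
    tokens = text.split()
    if not tokens:
        return 0.0
    good = sum(1 for token in tokens if WORDLIKE.match(token))
    return good / len(tokens)


def best_scan(scans):
    """Given {identifier: ocr_text}, return the identifier with the highest score."""
    return max(scans, key=lambda ident: ocr_score(scans[ident]))


if __name__ == "__main__":
    # Hypothetical OCR output from two scans of the same page.
    scans = {
        "scan_a": "Th3 qu1ck br0wn f0x jumps ov3r the l@zy d0g.",
        "scan_b": "The quick brown fox jumps over the lazy dog.",
    }
    print(best_scan(scans))
```

A heuristic like this only works for languages written in unaccented Latin letters and will penalise text that legitimately contains many numbers, so a real tool would need something more careful, such as a dictionary lookup.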
Ideally an OL eBook MARC extractor would be something that could be set up as a tool chain that anyone could run with their favorite parameters for languages, subjects, quality thresholds, etc. I suspect there's enough interest there, but because Internet Archive actively ignores OpenLibrary, there's really no place for a community to form to explore common interests, etc.
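One way such a parameterised tool chain could take its settings is sketched below in Python. Everything here is hypothetical: the option names (`--language`, `--subject`, `--min-quality`), the `Criteria` class, and the shape of the `book` dictionaries (including a precomputed `quality` score) are assumptions for the sake of the example, not part of any existing Open Library tool.

```python
import argparse
from dataclasses import dataclass


@dataclass(frozen=True)
class Criteria:
    """User-chosen selection parameters; empty sets mean 'no restriction'."""
    languages: frozenset = frozenset()
    subjects: frozenset = frozenset()
    min_quality: float = 0.0


def parse_criteria(argv):
    """Turn command-line arguments into a Criteria object."""
    parser = argparse.ArgumentParser(description="Select candidate ebooks (sketch).")
    parser.add_argument("--language", action="append", default=[],
                        help="language code to accept; may be repeated")
    parser.add_argument("--subject", action="append", default=[],
                        help="subject to accept; may be repeated")
    parser.add_argument("--min-quality", type=float, default=0.0,
                        help="lowest acceptable quality score")
    args = parser.parse_args(argv)
    return Criteria(
        languages=frozenset(args.language),
        subjects=frozenset(s.lower() for s in args.subject),
        min_quality=args.min_quality,
    )


def matches(book, criteria):
    """Return True if a book dict satisfies every restriction in criteria."""
    if criteria.languages and not (set(book.get("languages", ())) & criteria.languages):
        return False
    if criteria.subjects:
        subjects = {s.lower() for s in book.get("subjects", ())}
        if not (subjects & criteria.subjects):
            return False
    return book.get("quality", 0.0) >= criteria.min_quality


if __name__ == "__main__":
    criteria = parse_criteria(["--language", "eng", "--min-quality", "0.8"])
    book = {"languages": ["eng"], "subjects": ["Fiction"], "quality": 0.9}
    print(matches(book, criteria))
```

Keeping the criteria in one small immutable object means the same filter can be driven from the command line, a config file, or another program without changing the matching logic.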
Good luck with your project!  Let me know if I can help.

Tom



_______________________________________________
Ol-discuss mailing list - Ol-discuss@archive.org
http://mail.archive.org/cgi-bin/mailman/listinfo/ol-discuss
Archives: http://www.mail-archive.com/ol-discuss@archive.org/
To unsubscribe from this mailing list, send email to 
ol-discuss-unsubscr...@archive.org
--- End Message ---
