Spam detection software, running on the system "mail.archive.org",
has identified this incoming email as possible spam.  The original
message has been attached to this so you can view it or label
similar future email.  If you have any questions, see
the administrator of that system for details.

Content analysis details:   (5.1 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 0.0 URIBL_BLOCKED          ADMINISTRATOR NOTICE: The query to URIBL was blocked.
                            See http://wiki.apache.org/spamassassin/DnsBlocklists#dnsbl-block
                            for more information.
                            [URIs: openlibrary.org]
 3.4 AXB_X_AOL_SEZ_S        AOL said this is S*
 0.5 RCVD_IN_DNSWL_NONE     RBL: Sender listed at http://www.dnswl.org/, no trust
                            [64.12.237.10 listed in list.dnswl.org]
 0.0 FREEMAIL_FROM          Sender email is commonly abused enduser mail provider
                            (rogerbailey81[at]aol.com)
 0.7 HTML_MESSAGE           BODY: HTML included in message
-0.0 BAYES_40               BODY: Bayes spam probability is 20 to 40%
                            [score: 0.2835]
 0.5 T_DKIM_INVALID         DKIM-Signature header exists but is not valid

The original message was not completely plain text, and may be unsafe to
open with some email clients; in particular, it may contain a virus,
or confirm that your address can receive spam.  If you wish to view
it, it may be safer to save it to a file and open it with an editor.

--- Begin Message ---
Concerning that problem with OCR quality, let me make a suggestion.
I am a volunteer at Bookshare. Volunteers perform two main tasks. One is to scan books and submit the scanned results to a checkout list; the other is to proofread the books on that list. The person who submits a scan is supposed to proofread his or her own work too, so the person who does the final proofreading picks up anything the first person has missed.

Bookshare classifies these volunteer submissions into three grades of quality: fair, good and excellent. Several years ago they stopped accepting fair quality books, so it is really just two grades now. The so-called fair quality books should have been called awful because they were virtually unreadable. Of the books I have downloaded from Open Library, I have yet to come across one that would be classified as fair by Bookshare. The vast majority of them would be classified as good. That means that they have numerous scanning errors, but they are quite readable.

Now, I understand that there is some kind of relationship between Open Library and Bookshare, but I don't know what that relationship is. If I am making a suggestion that shows that I don't know what I am talking about then I will fully admit that, but I have an idea of how both Bookshare and Open Library might take advantage of the relationship for mutual benefit. There are, of course, a lot of books that Open Library and Bookshare hold in common, but there are also a lot of books available in protected Daisy format in Open Library that Bookshare does not have. Would it be possible for Open Library to dump all of those good quality books onto Bookshare's volunteer checkout list and let Bookshare volunteers proofread them and correct all of the scanning errors? Then, after the proofreading, a corrected copy could be returned to Open Library and another copy could enter the Bookshare collection.

That way Bookshare benefits by acquiring any number of obscure books that it would most likely not acquire otherwise, and Open Library gets books that are of a much higher quality than it has now. This sounds reasonable to me, but if I am missing something that would make it infeasible then let me know.
On 2/17/2015 11:03 PM, Tom Morris wrote:
On Tue, Feb 17, 2015 at 8:48 PM, Jon Leech <oddh...@sonic.net> wrote:

      I'm trying to obtain a list of lendable ebooks from Open Library (in
    order to dispose of corresponding parts of my personal library). My
    first try landed me at the "Lending library MARC records" section of
    https://openlibrary.org/data#bulk_download

    Following the link there to
    https://archive.org/details/marc_lendable_books
    shows that the MARC records download was last updated in June 2011, and
    cursory inspection of what's in there suggests that it may not include
    even some lendable ebooks that were added to OL prior to June 2011.

        Is there a current source of MARC data corresponding to lendable
    ebooks in OL, or a straightforward way to derive one from other
    resources? I'd prefer not to have to learn how to write a web app simply
    in order to get at the catalog. Using a local database with the Perl
    MARC::Record package is my preference, if it's possible. If not, I'd
    appreciate suggestions on how to access this data with the least amount
    of pain. I'm a complete neophyte at the OL ecosystem, alas.


As far as I know, the MARC dump was a one-time thing. For a task like this, no web app programming should be required, since the best source of info is probably the full OpenLibrary dump. The nice thing about working off the dump is that it allows you both to be independent of the vagaries of a MARC dump update schedule and to tailor the meaning of "lendable ebook" to something which suits your constituency (English only? Languages commonly spoken in North America only? OCR gibberish excluded? etc.).

I don't do much Perl, but the dump is basically just a big hunk of JSON that you can filter in whatever manner you wish and then convert to MARC.
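
To make the filtering step concrete, here is a minimal Python sketch. It is an illustration under assumptions, not a tested recipe: it assumes each dump row is tab-separated with the JSON record in the fifth column, that edition rows have the type /type/edition, and that a scanned edition carries an "ocaid" field. The function names are made up for this example, and "has a scan and is in English" is only a stand-in for whatever definition of "lendable" suits you. Conversion to MARC is left out.

```python
import json
from typing import Iterable, Iterator, List


def iter_editions(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the JSON record from each edition row of a dump."""
    for line in lines:
        # Assumed row layout: type, key, revision, last_modified, JSON.
        parts = line.rstrip("\n").split("\t", 4)
        if len(parts) == 5 and parts[0] == "/type/edition":
            yield json.loads(parts[4])


def is_candidate(edition: dict, language: str = "/languages/eng") -> bool:
    """Keep editions that have a scan identifier and the wanted language."""
    # Assumption: "ocaid" marks an edition with a scan on Internet Archive.
    has_scan = bool(edition.get("ocaid"))
    langs = {entry.get("key") for entry in edition.get("languages", [])}
    return has_scan and language in langs


def filter_dump(lines: Iterable[str],
                language: str = "/languages/eng") -> List[dict]:
    """Return every edition record in the dump that passes is_candidate."""
    return [e for e in iter_editions(lines) if is_candidate(e, language)]
```

To run it against a real dump you would pass an open text file handle (for a gzipped dump, one opened with gzip.open in "rt" mode) as the lines argument, and swap is_candidate for your own test.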

I did a project in Python a while ago which was focused on classics in the English language available as eBooks from Internet Archive. The client, Santa Clara County Library, made all the code available as open source and you can find it here: https://github.com/tfmorris/openlibrary-utils

One of the things we attempted to do as part of this project was to get a handle on OCR quality (because much of it on IA, frankly, sucks) as well as find the best quality edition/scan of the many editions/scans on IA. A single work may exist in multiple editions, and even a single edition often, perhaps surprisingly, has multiple scans from different sources with widely varying scan qualities, subsequent OCR qualities, etc.
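
As a rough illustration of picking the best scan per work, here is a small Python sketch. It is not the method used in openlibrary-utils. The scoring rule (the share of alphabetic tokens found in a word list) is a simple heuristic chosen for this example, and the function names and the (work, scan, text) input shape are hypothetical.

```python
import re
from typing import Dict, Iterable, Set, Tuple


def ocr_score(text: str, vocabulary: Set[str]) -> float:
    """Fraction of alphabetic tokens that appear in the vocabulary."""
    words = re.findall(r"[A-Za-z]+", text.lower())
    if not words:
        return 0.0
    return sum(word in vocabulary for word in words) / len(words)


def best_scan_per_work(
    scans: Iterable[Tuple[str, str, str]], vocabulary: Set[str]
) -> Dict[str, Tuple[str, float]]:
    """Map each work key to its (scan_id, score) with the highest score.

    scans holds (work_key, scan_id, ocr_text) triples; this shape is an
    assumption made for the example.
    """
    best: Dict[str, Tuple[str, float]] = {}
    for work, scan_id, text in scans:
        score = ocr_score(text, vocabulary)
        # Keep the first scan seen on ties, replace only on a strictly
        # better score.
        if work not in best or score > best[work][1]:
            best[work] = (scan_id, score)
    return best
```

A quality threshold on the score could then drop works whose best scan is still mostly gibberish.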

Ideally an OL eBook MARC extractor would be something that could be set up as a tool chain that anyone could run with their favorite parameters for languages, subjects, quality thresholds, etc. I suspect there's enough interest there, but because Internet Archive actively ignores OpenLibrary, there's really no place for a community to form to explore common interests, etc.

Good luck with your project!  Let me know if I can help.

Tom



_______________________________________________
Ol-discuss mailing list - Ol-discuss@archive.org
http://mail.archive.org/cgi-bin/mailman/listinfo/ol-discuss
Archives: http://www.mail-archive.com/ol-discuss@archive.org/
To unsubscribe from this mailing list, send email to 
ol-discuss-unsubscr...@archive.org


--- End Message ---