--- Begin Message ---
Concerning that problem with OCR quality, let me make a suggestion.
I am a volunteer at Bookshare. Volunteers perform two main tasks. One is
to scan books and submit the scanned results to a checkout list; the
other is to proofread the books on that list. The person who submits the
scan is supposed to proofread his or her own work too, so the person who
does the final proofreading picks up anything the first person has
missed.
Bookshare classifies these volunteer submissions into three grades of
quality: fair, good and excellent. Several years ago they stopped
accepting fair quality books, so it is really just two grades now. The
so-called fair quality books should have been called awful because they
were virtually unreadable.
Of the books I have downloaded from Open Library I have yet to come
across one that would be classified as fair by Bookshare. The vast
majority of them would be classified as good. That means that they have
problems with numerous scanning errors, but they are quite readable.
Now, I understand that there is some kind of relationship between Open
Library and Bookshare, but I don't know what that relationship is. If I
am making a suggestion that shows that I don't know what I am talking
about then I will fully admit that, but I have an idea of how both
Bookshare and Open Library might take advantage of the relationship for
mutual benefit. There are, of course, many books that both Open
Library and Bookshare hold in common, but there are also many books
available in protected Daisy format in Open Library that Bookshare does
not have. Would it be possible for Open Library to dump all of those
good quality books onto Bookshare's volunteer checkout list and let
Bookshare volunteers proofread them and correct all of the scanning
errors? Then, after the proofreading, a corrected copy could be returned
to Open Library and another copy could enter the Bookshare collection.
That way Bookshare benefits by acquiring any number of obscure books
that it would most likely not acquire otherwise and Open Library gets
books that are of a much higher quality than it has now. This sounds
reasonable to me, but if I am missing something that would make it
infeasible then let me know.
On 2/17/2015 11:03 PM, Tom Morris wrote:
On Tue, Feb 17, 2015 at 8:48 PM, Jon Leech <oddh...@sonic.net> wrote:
I'm trying to obtain a list of lendable ebooks from Open Library (in
order to dispose of corresponding parts of my personal library). My
first try landed me at the "Lending library MARC records" section of
https://openlibrary.org/data#bulk_download
Following the link there to
https://archive.org/details/marc_lendable_books
shows that the MARC records download was last updated in June 2011, and
cursory inspection of what's in there suggests that it may not include
even some lendable ebooks that were added to OL prior to June 2011.
Is there a current source of MARC data corresponding to lendable
ebooks in OL, or a straightforward way to derive one from other
resources? I'd prefer not to have to learn how to write a web app simply
in order to get at the catalog. Using a local database with the Perl
MARC::Record package is my preference, if it's possible. If not, I'd
appreciate suggestions on how to access this data with the least amount
of pain. I'm a complete neophyte at the OL ecosystem, alas.
As far as I know, the MARC dump was a one-time thing. For a task like
this, no web app programming should be required since the best source
of info is probably the full OpenLibrary dump. The nice thing about
working off the dump is that it allows you to both be independent of
the vagaries of a MARC dump update schedule and tailor the meaning of
"lendable ebook" to something which suits your constituency (English
only? Languages commonly spoken in North America only? OCR gibberish
excluded? etc.).
I don't do much Perl, but the dump is basically just a big hunk of
JSON that you can filter in whatever manner you wish and then convert
to MARC.
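As a rough illustration of that filtering step, here is a minimal Python
sketch. It assumes the dump is a tab-separated text file with five
columns (record type, key, revision, last-modified timestamp, JSON) and
that scanned editions carry an "ocaid" field naming the Internet Archive
item. The function name, the language filter, and the sample records are
made up for the example. The sketch does not determine whether a book is
actually lendable, and it stops short of the MARC conversion.

```python
import json


def iter_editions(lines, language_key="/languages/eng"):
    """Yield (key, ocaid, title) for edition records that have a scan.

    Assumes each line has five tab-separated columns:
    type, key, revision, last_modified, JSON.
    """
    for line in lines:
        parts = line.rstrip("\n").split("\t", 4)
        if len(parts) != 5:
            continue  # skip malformed lines
        record_type, key, _revision, _modified, raw = parts
        if record_type != "/type/edition":
            continue  # ignore works, authors, etc.
        record = json.loads(raw)
        ocaid = record.get("ocaid")
        if not ocaid:
            continue  # no associated scan
        languages = {lang.get("key") for lang in record.get("languages", [])}
        if language_key and language_key not in languages:
            continue  # not in the requested language
        yield key, ocaid, record.get("title", "")


# Hypothetical sample records, invented for illustration only.
sample = [
    "/type/edition\t/books/OL1M\t3\t2010-04-14T00:00:00\t" + json.dumps(
        {"title": "An Example Title", "ocaid": "exampleocaid00test",
         "languages": [{"key": "/languages/eng"}]}),
    "/type/edition\t/books/OL2M\t1\t2010-04-14T00:00:00\t" + json.dumps(
        {"title": "No Scan Here", "languages": [{"key": "/languages/eng"}]}),
    "/type/edition\t/books/OL3M\t2\t2010-04-14T00:00:00\t" + json.dumps(
        {"title": "Un exemple", "ocaid": "exempleocaid00test",
         "languages": [{"key": "/languages/fre"}]}),
    "/type/work\t/works/OL1W\t1\t2010-04-14T00:00:00\t" + json.dumps(
        {"title": "An Example Title"}),
    "not a valid dump line",
]

results = list(iter_editions(sample))
for key, ocaid, title in results:
    print(key, ocaid, title)
```

With a real dump you would pass an open file handle instead of the
sample list, and the filter conditions are where you would tailor what
"lendable ebook" means for your purposes.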
I did a project in Python a while ago which was focused on classics in
the English language available as eBooks from Internet Archive. The
client, Santa Clara County Library, made all the code available as
open source and you can find it here:
https://github.com/tfmorris/openlibrary-utils
One of the things we attempted to do as part of this project was to get
a handle on OCR quality (because much of it on IA, frankly, sucks) as
well as find the best quality edition/scan of the many editions/scans on
IA. A single
work may exist in multiple editions and even a single edition often,
perhaps surprisingly, has multiple scans from different sources with
widely varying scan qualities, subsequent OCR qualities, etc.
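To make the "pick the best scan" idea concrete, here is a minimal Python
sketch of one crude heuristic: score each scan's OCR text by the
fraction of its word tokens found in a known vocabulary, then keep the
highest-scoring scan. This is an assumption made for illustration and
not necessarily what openlibrary-utils does; the function names, the
tiny vocabulary, and the sample texts are all hypothetical.

```python
import re


def ocr_score(text, vocabulary):
    """Return the fraction of alphabetic tokens found in the vocabulary."""
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in vocabulary for token in tokens) / len(tokens)


def best_scan(scans, vocabulary):
    """Return the identifier of the scan whose text scores highest."""
    return max(scans, key=lambda ocaid: ocr_score(scans[ocaid], vocabulary))


# Hypothetical data: two scans of the same sentence, one badly OCRed.
vocabulary = {"the", "quick", "brown", "fox", "jumps"}
scans = {
    "scan_a": "The quick brown fox jumps",
    "scan_b": "Tlie qnick hrown fox jnmps",
}

score_a = ocr_score(scans["scan_a"], vocabulary)
score_b = ocr_score(scans["scan_b"], vocabulary)
print(score_a, score_b, best_scan(scans, vocabulary))
```

A real vocabulary would come from a proper word list for the language in
question, and the same score could double as a quality threshold for
excluding unreadable scans.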
Ideally an OL eBook MARC extractor would be something that could be
set up as a tool chain that anyone could run with their favorite
parameters for languages, subjects, quality thresholds, etc. I
suspect there's enough interest there, but because Internet Archive
actively ignores OpenLibrary, there's really no place for a community
to form to explore common interests, etc.
Good luck with your project! Let me know if I can help.
Tom
_______________________________________________
Ol-discuss mailing list - Ol-discuss@archive.org
http://mail.archive.org/cgi-bin/mailman/listinfo/ol-discuss
Archives: http://www.mail-archive.com/ol-discuss@archive.org/
To unsubscribe from this mailing list, send email to
ol-discuss-unsubscr...@archive.org
--- End Message ---