Re: [CODE4LIB] Durability of PDFs

2009-06-15 Thread Ben O'Steen
Jonathan,

Likewise that paragraph reads with the same accuracy with the
following alterations

s/UTF-8|Unicode/PDF/
s/encoding/version/

I think the key thing is that garbage in == garbage out, but I feel
happier with garbage that was meant to have been unicode at some
point, compared to a pdf that was made by a Word->PDF printer driver
that craps out on large files, but does so silently.

My experiences with PDF versions:

PDF 1.3 and earlier is evil, 1.4 not too bad aside from colour issues
and its ham-fisted way of attempting to shove CMYK info into itself,
1.6 has issues for some people that I am still trying to isolate, 1.7
is rare as hen's teeth and PDF/A as a spec seems to be okay, but I've
only seen a few of those in the wild and only from OpenOffice too. It
would be interesting to see how OOo's idea of PDF/A stacks up against
Adobe's.

And there is PDF/X(-3?) or similar which I've only ever seen on an
options panel, before being swiftly ignored.
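
For anyone wanting to sort a pile of files by the versions discussed above, here is a minimal sketch in Python. It only reads the "%PDF-x.y" comment at the start of the file; the function names are mine, and it assumes the header is trustworthy, which it may not be (from 1.4 onwards a /Version entry in the document catalog can override it, and this sketch does not look for that).

```python
import re

# The header comment that opens a PDF file, e.g. b"%PDF-1.4".
_HEADER = re.compile(rb"%PDF-(\d+)\.(\d+)")


def pdf_header_version(data: bytes):
    """Return (major, minor) from the PDF header, or None if none is found.

    Only the first 1024 bytes are searched, because some files in the wild
    carry a little junk before the header.
    """
    match = _HEADER.search(data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def pdf_file_version(path):
    """Convenience wrapper: read the start of a file and report its header version."""
    with open(path, "rb") as handle:
        return pdf_header_version(handle.read(1024))
```

Something like `pdf_file_version("thesis.pdf")` (a hypothetical filename) would then give a tuple such as `(1, 4)` to group files by.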

And on a final note, there have been PDF files that are useless to me,
I can't wheedle out anything from them, and that are only 10 years
old. However, I have resurrected a tex-based thesis from an earlier
period without difficulty, and created a PDF/A from the source.

Bottom line is that it's best to preserve the source materials as well
as the final disseminations - you can't always guarantee a viewer will
work as expected. The trend is that newer PDF versions are better, but
be very very wary of hidden DRM. If memory serves, an eBook publisher
lost 1/4(?) of their stock, due to losing the mechanism to unlock.
Let's not have that happen to repositories...

Ben

2009/6/15 Jonathan Rochkind :
> Fair enough.  Asking someone to give you a UTF-8 (or other Unicode encoding)
> plain text file though -- you better try to heuristically check the encoding
> before ingesting it, and plan on a lot of failures. Typical users using
> typical consumer software (which tends to be somewhat unpredictable with
> character encodings) can't be trusted to give you a UTF-8 encoding just
> because you specify it, or  to have any idea what this means or how to do
> it.
> And checking to see if the 'true' encoding of a plain text file is what
> it's advertised as in an automated fashion is heuristic at best, and not
> going to be perfect.
> And you're still going to have trouble with complicated mathematical
> formulas, molecular diagrams, other diagrams, etc.
>
> Jonathan
>
> Doran, Michael D wrote:
>>>
>>> As far as electronic formats go, I think PDF is as good as anything --
>>> except maybe plain ASCII text, which is not
>>> nearly as useable (and doesn't allow diagrams,
>>> mathematical equations, non-English letters, etc).
>>>
>>
>> There is no requirement that plain text be limited to the ASCII character
>> set repertoire.  Although once they were almost synonymous, that is no
>> longer the case [1].  Plain text can encompass anything and everything in
>> the Unicode character set.  That includes non-Roman scripts, mathematical
>> symbols, yada, yada, yada.
>>
>> -- Michael
>>
>> [1] http://en.wikipedia.org/wiki/Plain_text
>>
>> # Michael Doran, Systems Librarian
>> # University of Texas at Arlington
>> # 817-272-5326 office
>> # 817-688-1926 mobile
>> # do...@uta.edu
>> # http://rocky.uta.edu/doran/
>>
>>
>>>
>>> -Original Message-
>>> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
>>> Jonathan Rochkind
>>> Sent: Monday, June 15, 2009 9:13 AM
>>> To: CODE4LIB@LISTSERV.ND.EDU
>>> Subject: Re: [CODE4LIB] Durability of PDFs
>>>
>>> The bet is that PDFs are so popular that _someone_ (the archival
>>> community if no-one else, but probably someone else) will ensure that they
>>> continue to be readable somehow.
>>>
>>> These are real non-trivial issues in electronic archiving though, issues
>>> that the archival community.  It is generally a safe assumption that good
>>> electronic archiving over the decades-and-more term requires some regular
>>> attention by an electronic archivist to make sure that files remain
>>> readable, and are converted to new formats when necessary. As well as
>>> attention to avoiding actual bit-level corruption of files. You can't
>>> necessarily just dump files on a HD and ignore them and expect they'll be
>>> readable in 100 years, that much is true -- and true pretty much regardless
>>> of particular electronic format you choose.
>>>
>>> As far as electronic formats go, I think PDF is as good as anything --
>>> except maybe plain ASCII text, which is not nearly as useable (and doesn't
>>> allow diagrams, mathematical equations, non-English letters, etc). I don't
>>> know why your colleague has decided that "30-40 years" is the horizon
>>> after which PDF specifically will become "unreadable", this seems like just
>>> a wild guess to me, but it would be interesting to see if he has any
>>> particular evidence to back up this claim.
>>> So there are real issues with electronic archiving, but unless they lead
>>>

Re: [CODE4LIB] FW: [CODE4LIB] Newbie asking for some suggestions with javascript

2009-06-15 Thread Godmar Back
On Mon, Jun 15, 2009 at 4:09 PM, Roy Tennant  wrote:

> It is worth following up on Xiaoming's statement of a limit of 100 uses per
> day of the xISSN service with the information that exceptions to this limit
> are certainly granted. Annette probably knows that just such an exception
> was granted to her LibX project, and LibX remains the single largest user of
> this service.
> Roy


Yes, Roy is correct.

We are very grateful for OCLC's generous support and would like to
acknowledge that publicly.

FWIW, I suggested the inclusion of ticTOCs RSS feed data in the survey OCLC
sent out two weeks ago, and less than a week later, OCLC rolls out the
improved service. Excellent!

[ As an aside, in LibX, we are changing the way we use the service;
previously, we were looking up all ISSNs on any page a user visits; we are
now retrieving the metadata if the user actually hovers over the link. Not
that OCLC complained - but CrossRef did when they noticed > 100,000 hits per
day against their service for DOI metadata lookups. In fairness to CrossRef,
they are working on beefing up their servers as well. ]
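
The change described in the aside (look up metadata only when it is asked for, and never twice for the same identifier) can be sketched roughly as follows. This is an illustration in Python, not LibX's actual code; `make_lazy_lookup` and the `fetch_metadata` callable are hypothetical names standing in for whatever really calls the remote service.

```python
from functools import lru_cache


def make_lazy_lookup(fetch_metadata, max_entries=1024):
    """Wrap a remote lookup so it runs on demand and caches its results.

    fetch_metadata is a hypothetical callable taking an identifier (an ISSN
    or DOI string) and returning its metadata. Nothing is fetched until
    lookup() is called, for example from a hover handler.
    """

    @lru_cache(maxsize=max_entries)
    def lookup(identifier):
        return fetch_metadata(identifier)

    return lookup
```

With this shape, a page showing fifty ISSNs costs nothing until a user hovers over one, and repeated hovers over the same link hit the cache rather than the service.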

 - Godmar & Annette for Team LibX.


Re: [CODE4LIB] FW: [CODE4LIB] Newbie asking for some suggestions with javascript

2009-06-15 Thread Jonathan Rochkind
Does the xISSN documentation say that exceptions by non-OCLC members can 
be asked for, and instruct on where to make the request?  If you want to 
keep from discouraging use accidentally by people who don't know they 
can get an exception, it needs to say that on the same page that talks 
about the 100/day limit, not just on the code4lib listserv.


Roy Tennant wrote:

> It is worth following up on Xiaoming's statement of a limit of 100 uses per
> day of the xISSN service with the information that exceptions to this limit
> are certainly granted. Annette probably knows that just such an exception
> was granted to her LibX project, and LibX remains the single largest user of
> this service.
> Roy



Re: [CODE4LIB] Durability of PDFs

2009-06-15 Thread Jonathan Rochkind
Fair enough.  Asking someone to give you a UTF-8 (or other Unicode 
encoding) plain text file though -- you better try to heuristically 
check the encoding before ingesting it, and plan on a lot of failures. 
Typical users using typical consumer software (which tends to be 
somewhat unpredictable with character encodings) can't be trusted to 
give you a UTF-8 encoding just because you specify it, or  to have any 
idea what this means or how to do it. 

And checking to see if the 'true' encoding of a plain text file is 
what it's advertised as in an automated fashion is heuristic at best, 
and not going to be perfect. 
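
As a rough illustration of what such a check could look like at ingest, here is a minimal Python sketch. The function names and the category labels are my own invention. It only tests whether the bytes are well-formed UTF-8, so a pass means "probably UTF-8", not proof, and pure-ASCII input tells you nothing about what the sender's software would have done with accented characters.

```python
UTF8_BOM = b"\xef\xbb\xbf"


def looks_like_utf8(data: bytes) -> bool:
    """True if the bytes decode cleanly as UTF-8 (necessary, not sufficient)."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def classify_text(data: bytes) -> str:
    """Sort a submitted 'plain text' file into a rough bucket for triage."""
    if not looks_like_utf8(data):
        return "not utf-8"        # reject, or hand to a human
    if data.startswith(UTF8_BOM):
        return "utf-8 with BOM"
    if data.isascii():
        return "ascii"            # valid UTF-8, but says little about the source
    return "utf-8"
```

Legacy single-byte encodings such as Latin-1 usually fail the strict decode as soon as an accented letter appears, which is what makes the test useful despite being a heuristic.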

And you're still going to have trouble with complicated mathematical 
formulas, molecular diagrams, other diagrams, etc.


Jonathan


[CODE4LIB] FW: [CODE4LIB] Newbie asking for some suggestions with javascript

2009-06-15 Thread Roy Tennant
It is worth following up on Xiaoming's statement of a limit of 100 uses per
day of the xISSN service with the information that exceptions to this limit
are certainly granted. Annette probably knows that just such an exception
was granted to her LibX project, and LibX remains the single largest user of
this service. 
Roy

On 6/13/09 4:02 PM, "Xiaoming Liu"  wrote:

> Annette's comment is correct. XISSN service allows 100 uses per day for
> non-OCLC usage. I don't think xISSN's price proposal is ever approved, so we
> don't have a price list for commercial usage.
> 
> XISSN's access control is sort of complex, for more details please check
> http://xissn.worldcat.org/xissnadmin/doc/subscribe.htm , hopefully we can
> clean it up in the future.
> 
> xiaoming
> 
> 
> On 6/13/09 3:52 PM, "Hamparian,Don"  wrote:
> 
>> You can also purchase. I thought it was 500 usages a day. Xiaoming?
>> 
>>> -Original Message-
>>> From: Tennant,Roy
>>> Sent: Thursday, June 11, 2009 5:02 PM
>>> To: Hamparian,Don
>>> Subject: FW: [CODE4LIB] Newbie asking for some suggestions with
>>> javascript
>>> 
>>> I think I have to say yes to this, although it isn't going to make us
>>> look
>>> great.
>>> Roy
>>> 
>>> 
>>> -- Forwarded Message
>>> From: Annette Bailey 
>>> Reply-To: Code for Libraries 
>>> Date: Thu, 11 Jun 2009 16:55:57 -0400
>>> To: 
>>> Subject: Re: [CODE4LIB] Newbie asking for some suggestions with
>>> javascript
>>> 
>>> Roy,
>>> 
>>> Just to clarify, you have to be an OCLC cataloging member to use this
>>> beyond 100 uses per day, correct?
>>> 
>>> Thanks,
>>> Annette
>>> 
>>> On Thu, Jun 11, 2009 at 4:48 PM, Roy Tennant wrote:
>>>> This data (the Tic-Tocs RSS URLs) is also available via xISSN. For
>>>> example:
>>>>
>>>> 9203?method=getMetadata&format=xml&fl=*>
>>>>
>>>> Look for the "rssurl" attribute. For information on xISSN see:
>>>>
>>>> Roy
>>>>
>>>> On 6/11/09 12:36 PM, "Derik Badman" wrote:
>>>>
>>>>> On Thu, Jun 11, 2009 at 2:03 PM, Jon Gorman wrote:
>>>>>
>>>>>> I guess the first question is if it is really necessary to use a text
>>>>>> file?  I'm not entirely clear on this process, but perhaps the text
>>>>>> file could be imported into a database.
>>>>>
>>>>> At this point the text file is a stop-gap api that ticTOCs is offering
>>>>> (supposedly working an actual api), so this will probably be a temporary
>>>>> situation. I could put all the data into mysql, though then I'd have to
>>>>> figure out how to check the text file for changes and then update the
>>>>> database accordingly.
>>>>>
>>>>>> Then of course perhaps there's some way to add this to the Serials
>>>>>> Solution database directly?  Then you don't need another javascript at
>>>>>> all?
>>>>>
>>>>> I'm so disillusioned with them, that I didn't even consider that...
>>>>>
>>>>>> cron + wget/curl would be a good first step it would seem.  You might
>>>>>> want some sort of script that monitors changes or the like.  (Maybe
>>>>>> send you an email if there's no updates in x days or something like
>>>>>> that).
>>>>>
>>>>> Thanks, I'll look into that.
>>>>>
>>>> --
>>> 
>>> 
>>> -- End of Forwarded Message
>> 


-- End of Forwarded Message
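
The "cron + wget/curl plus a script that monitors changes" idea quoted above could be sketched along these lines in Python. This is an assumption-laden outline, not anything the posters wrote: the function names are made up, the URL of the ticTOCs text file is left to the caller, and storing the previous digest and sending the email are left out.

```python
import hashlib
import urllib.request
from datetime import datetime, timedelta


def fetch(url, timeout=30):
    """Download the file; a cron job would call this on a schedule."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def has_changed(new_bytes, previous_digest):
    """Compare against the digest saved from the last run.

    Returns (changed, digest) so the caller can store the new digest.
    """
    digest = hashlib.sha256(new_bytes).hexdigest()
    return digest != previous_digest, digest


def is_stale(last_change, now, max_age_days=7):
    """True if nothing has changed for longer than max_age_days (time to email someone)."""
    return now - last_change > timedelta(days=max_age_days)
```

Only when `has_changed` reports a difference would the database reload run, which avoids re-importing an identical file every night.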


Re: [CODE4LIB] Durability of PDFs

2009-06-15 Thread Doran, Michael D
> As far as electronic formats go, I think PDF is as good as 
> anything -- except maybe plain ASCII text, which is not
> nearly as useable (and doesn't allow diagrams,
> mathematical equations, non-English letters, etc).

There is no requirement that plain text be limited to the ASCII character set 
repertoire.  Although once they were almost synonymous, that is no longer the 
case [1].  Plain text can encompass anything and everything in the Unicode 
character set.  That includes non-Roman scripts, mathematical symbols, yada, 
yada, yada.

-- Michael

[1] http://en.wikipedia.org/wiki/Plain_text

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# do...@uta.edu
# http://rocky.uta.edu/doran/
  

> -Original Message-
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On 
> Behalf Of Jonathan Rochkind
> Sent: Monday, June 15, 2009 9:13 AM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Durability of PDFs
> 
> The bet is that PDFs are so popular that _someone_ (the archival 
> community if no-one else, but probably someone else) will ensure that 
> they continue to be readable somehow.
> 
> These are real non-trivial issues in electronic archiving though, issues
> that the archival community.  It is generally a safe assumption that
> good electronic archiving over the decades-and-more term requires some
> regular attention by an electronic archivist to make sure that files
> remain readable, and are converted to new formats when necessary. As
> well as attention to avoiding actual bit-level corruption of files. You
> can't necessarily just dump files on a HD and ignore them and expect
> they'll be readable in 100 years, that much is true -- and true pretty
> much regardless of particular electronic format you choose.
>
> As far as electronic formats go, I think PDF is as good as anything --
> except maybe plain ASCII text, which is not nearly as useable (and
> doesn't allow diagrams, mathematical equations, non-English letters,
> etc). I don't know why your colleague has decided that "30-40 years"
> is the horizon after which PDF specifically will become "unreadable",
> this seems like just a wild guess to me, but it would be interesting to
> see if he has any particular evidence to back up this claim.
>
> So there are real issues with electronic archiving, but unless they lead
> you to refuse to accept electronic submissions at all, you're just going
> to have to deal with them, it's not really an issue of PDF specifically,
> but it is true that "just dump files on a HD and forget about them and
> assume they'll be readable in 100 years" is not a particularly safe
> electronic archiving strategy.
> 
> Jonathan
> 
> Mike Taylor wrote:
> > Dear CODE4LIB colleagues,
> >
> > In one of my alternative incarnations, I am a zoological taxonomist.
> > One of the big issues for taxonomy right now is whether to accept as
> > nomenclaturally valid papers that are published only in electronic
> > form, i.e. not printed on paper by a publisher.
> >
> > In a discussion of this matter, a colleague has claimed:
> >
> >   
> >> [PDF files will not become unreadable] in the next 30-40 years.
> >> Possibly not in the 20 years that will follow. After that, when only
> >> 30-year and older documents are in the PDF format, the danger will
> >> increase that this information will not be readable any more. It is
> >> generally considered as quite unlikely that PDF will be readable in
> >> 100 years.
> >
> > I would appreciate any comments that anyone on this list has on the
> > likelihood that PDF will be unreadable in 100 years.
> >
> > Many thanks,
> >
> >  _/|_    ___
> > /o ) \/  Mike Taylor    http://www.miketaylor.org.uk
> > )_v__/\  "Can't someone act COMPLETELY OUT OF CHARACTER without arousing
> >  suspicion?" -- Bob the Angry Flower, www.angryflower.com
> >
> 


[CODE4LIB] Preliminary report on user research for eXtensible Catalog

2009-06-15 Thread Jennifer Bowen
(Posted on behalf of Nancy Fried Foster, nfos...@library.rochester.edu) 

The eXtensible Catalog project at the University of Rochester's River Campus
Libraries is pleased to release the first report on the user research that
we conducted in support of XC software development. We thank the Andrew W.
Mellon Foundation and our user research partners - Cornell, Ohio State, Yale
and the University of Rochester - for their generous support of this project.

Use this URL - http://hdl.handle.net/1802/6873 - for a report that
summarizes the objectives, methods, and major software design findings from
the data collected in the user research portion of the eXtensible Catalog
(XC) project. A full analysis and interpretation of the data is not included
in the present report and will be provided at the conclusion of the project.
This report includes edited results from the brainstorming sessions and a
list of the features that emerged from the analysis of those results. (See
the eXtensible Catalog website at www.eXtensibleCatalog.org for more
information about the overall project.)


Re: [CODE4LIB] Newbie asking for some suggestions with javascript

2009-06-15 Thread Derik Badman
Thanks for the suggestions and links, everyone.

I'll check them out and see what will work for me.

-- 
Derik A. Badman
Digital Services Librarian
Reference Librarian for Education and Social Work
Temple University Libraries
Paley Library 209
Philadelphia, PA
Phone: 215-204-5250
Email: dbad...@temple.edu
AIM: derikbad

"Research makes times march forward, it makes time march backward, and it
also makes time stand still." -Greil Marcus


Re: [CODE4LIB] Durability of PDFs

2009-06-15 Thread Jonathan Rochkind
The bet is that PDFs are so popular that _someone_ (the archival 
community if no-one else, but probably someone else) will ensure that 
they continue to be readable somehow.


These are real non-trivial issues in electronic archiving though, issues 
that the archival community.  It is generally a safe assumption that 
good electronic archiving over the decades-and-more term requires some 
regular attention by an electronic archivist to make sure that files 
remain readable, and are converted to new formats when necessary. As 
well as attention to avoiding actual bit-level corruption of files. You 
can't necessarily just dump files on a HD and ignore them and expect 
they'll be readable in 100 years, that much is true -- and true pretty 
much regardless of particular electronic format you choose.
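
To make the "bit-level corruption" point concrete: a common way to notice it is to record a checksum for each file at ingest and recompute it periodically. The sketch below is one minimal way to do that in Python; the manifest layout (a dict of path to hex digest) and the function names are assumptions for illustration, not an established tool.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(manifest):
    """Return the paths whose current digest no longer matches the manifest.

    manifest maps file path -> hex digest recorded when the file was ingested.
    """
    return sorted(
        path for path, expected in manifest.items()
        if sha256_of(path) != expected
    )
```

A mismatch only tells you that the bits changed, so it is useful in practice only alongside a second copy to restore from.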


As far as electronic formats go, I think PDF is as good as anything -- 
except maybe plain ASCII text, which is not nearly as useable (and 
doesn't allow diagrams, mathematical equations, non-English letters, 
etc). I don't know why your colleague has decided that "30-40 years" 
is the horizon after which PDF specifically will become "unreadable", 
this seems like just a wild guess to me, but it would be interesting to 
see if he has any particular evidence to back up this claim. 

So there are real issues with electronic archiving, but unless they lead 
you to refuse to accept electronic submissions at all, you're just going 
to have to deal with them, it's not really an issue of PDF specifically, 
but it is true that "just dump files on a HD and forget about them and 
assume they'll be readable in 100 years" is not a particularly safe 
electronic archiving strategy.


Jonathan

Mike Taylor wrote:

Dear CODE4LIB colleagues,

In one of my alternative incarnations, I am a zoological taxonomist.
One of the big issues for taxonomy right now is whether to accept as
nomenclaturally valid papers that are published only in electronic
form, i.e. not printed on paper by a publisher.

In a discussion of this matter, a colleague has claimed:

  

[PDF files will not become unreadable] in the next 30-40 years.
Possibly not in the 20 years that will follow. After that, when only
30-year and older documents are in the PDF format, the danger will
increase that this information will not be readable any more. It is
generally considered as quite unlikely that PDF will be readable in
100 years.



I would appreciate any comments that anyone on this list has on the
likelihood that PDF will be unreadable in 100 years.

Many thanks,

 _/|____
/o ) \/  Mike Taylor    http://www.miketaylor.org.uk
)_v__/\  "Can't someone act COMPLETELY OUT OF CHARACTER without arousing
 suspicion?" -- Bob the Angry Flower, www.angryflower.com

  


Re: [CODE4LIB] Durability of PDFs

2009-06-15 Thread Toke Eskildsen
On Mon, 2009-06-15 at 12:37 +0200, Mike Taylor wrote:
> I would appreciate any comments that anyone on this list has on the
> likelihood that PDF will be unreadable in 100 years.

The problem with projections such as these is that we have very little
empirical evidence to build on. The classic fallacy is to look at any old
format that is now hard to read and use that as a guide. The problem
with this form of extrapolation is that Old Format X wasn't very
widespread in the general population, simply because the average person
did not have a computer then.

Fast-forward to 2009: everyone and their dog produces a huge amount of
PDFs and JPEGs, to name two contenders for "they will be readable for
the next 1000 years".

When there's an ungodly amount of information in a given format,
information not limited to a specific scientific field or area of
expertise, we, as a society, want to be able to read it. Forever. 

Your friend's argument that the only PDFs in the year 2070 will be 30
years old might be true in its premise, but I don't agree with his
conclusion. Why wouldn't we be interested in information from the
beginning of the millennium when we reach 2070?


That is, of course, with the premise that our society does not collapse
so that we can't maintain our current technology level. In that case
we're talking bit loss as well, which is a whole other discussion.

- Toke Eskildsen (_not_ a specialist in archival)


Re: [CODE4LIB] Durability of PDFs

2009-06-15 Thread Thomas Dowling
On 06/15/2009 07:45 AM, K.G. Schneider wrote:

> 
> Setting aside the paper/electronic argument, in terms of canonical files for
> documents intended for long-term preservation, PDF seems a very weak choice.
> Whether or not the actual files will "last" 100 years (I assume that we mean
> that they won't degrade to the point of nonreadability), using a proprietary
> binary format that doesn't readily convert to other formats seems a poor
> choice.


The PDF 1.7 spec was published last year as ISO 32000-1:2008.  PDF/A is also
ISO 19005-1:2005.  All just FWIW, since ISO's history with Microsoft OOXML
raises some questions about how "open" you really have to be to get an ISO 
number.


> Why not have the documents be sourced in one of the XML-based formats such
> as DocBook or DITA (well-documented, open, text-based, single-source
> publication formats)? Then you can have your PDF and preserve it too.
> (Donning tinfoil hat) You could even produce a handful of paper-based
> documents and hide them in caves around the world. 


I expect the conversation usually starts (and often ends) with, "Hmm...'File,
Save as PDF'.  Bingo!"


-- 
Thomas Dowling
tdowl...@ohiolink.edu


Re: [CODE4LIB] Durability of PDFs

2009-06-15 Thread Mike Taylor
K.G. Schneider writes:
 > > > [PDF files will not become unreadable] in the next 30-40 years.
 > > > Possibly not in the 20 years that will follow. After that, when
 > > > only 30-year and older documents are in the PDF format, the
 > > > danger will increase that this information will not be readable
 > > > any more. It is generally considered as quite unlikely that PDF
 > > > will be readable in 100 years.
 > 
 > Setting aside the paper/electronic argument, in terms of canonical
 > files for documents intended for long-term preservation, PDF seems
 > a very weak choice.  Whether or not the actual files will "last"
 > 100 years (I assume that we mean that they won't degrade to the
 > point of nonreadability), using a proprietary binary format that
 > doesn't readily convert to other formats seems a poor choice.

Just as a point of information, PDF is not proprietary: as of July
2008, it is an ISO standard.

 > Why not have the documents be sourced in one of the XML-based
 > formats such as DocBook or DITA (well-documented, open, text-based,
 > single-source publication formats)? Then you can have your PDF and
 > preserve it too.

The problem is how we get there from here.  Right now, the world is
full of journals that routinely make available PDFs of the articles
published in them.  Far, far fewer make available any XML-based format
-- in fact, I would imagine that only a tiny minority ever have the
documents in an XML-based format: most will go from MS-Word
submissions to PDF publications.

So the question becomes: if we want to rely on digital preservation
so that we don't need to print, do we first need to persuade the
world's journals to change their publishing practices?  If so, that
doesn't seem like a very realistic goal.

 _/|____
/o ) \/  Mike Taylor    http://www.miketaylor.org.uk
)_v__/\  "When I can't fondle the hand I'm fond of, I fondle the hand at
 hand" -- E. Y. Harburg, "Finian's Rainbow"


Re: [CODE4LIB] Durability of PDFs

2009-06-15 Thread K.G. Schneider
> In one of my alternative incarnations, I am a zoological taxonomist.
> One of the big issues for taxonomy right now is whether to accept as
> nomenclaturally valid papers that are published only in electronic
> form, i.e. not printed on paper by a publisher.
> 
> In a discussion of this matter, a colleague has claimed

> 
> > [PDF files will not become unreadable] in the next 30-40 years.
> > Possibly not in the 20 years that will follow. After that, when only
> > 30-year and older documents are in the PDF format, the danger will
> > increase that this information will not be readable any more. It is
> > generally considered as quite unlikely that PDF will be readable in
> > 100 years.

Setting aside the paper/electronic argument, in terms of canonical files for
documents intended for long-term preservation, PDF seems a very weak choice.
Whether or not the actual files will "last" 100 years (I assume that we mean
that they won't degrade to the point of nonreadability), using a proprietary
binary format that doesn't readily convert to other formats seems a poor
choice. 

Why not have the documents be sourced in one of the XML-based formats such
as DocBook or DITA (well-documented, open, text-based, single-source
publication formats)? Then you can have your PDF and preserve it too.
(Donning tinfoil hat) You could even produce a handful of paper-based
documents and hide them in caves around the world. 

Karen G. Schneider


Re: [CODE4LIB] Durability of PDFs

2009-06-15 Thread Benjamin O'Steen
There are items/options that can be used within a given PDF that will
drastically affect how likely it is that the PDF will still be readable.

* Inclusion of 3D applets or any Adobe Acrobat-specific features
I have seen PDFs with 3D chemical applets embedded somehow into the PDF
using a plugin. The longevity of this add-on depends entirely on how long
the company/team that made the plugin wants to support it.

* DRM of any kind - password-protected, print-disabled, etc.
These features make it hard, verging on legally impossible, to
migrate the PDF to a newer format or read it on newer versions of
Acrobat or other PDF viewers. (Legally impossible, as circumventing this
would incur the wrath of the DMCA.)

* Any other odd features.

There is a profile that doesn't allow you to add any of the above; it
is usually referred to as PDF/A (A for Archival).

The easiest way to create these at the moment is to use OpenOffice 3:
choose "Export as PDF" and tick the PDF/A option.

As for not becoming unreadable... well, this all depends on the age (and
so the version) of the PDF, and on your current viewing software. I have
already had situations where older PDFs cannot be viewed correctly in
newer readers (the majority of these were older 'print-ready' PDFs
with colour information held within).
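
As a rough illustration of the points above, the sketch below sniffs a
PDF's raw bytes for three things: the header version, an /Encrypt entry
(a sign of password or permission restrictions), and a pdfaid:part
declaration in the XMP metadata (how a file claims to be PDF/A). The
function name sniff_pdf is made up for this example. Plain byte searches
are only heuristics: they can be fooled by content that happens to
contain these strings, the header version can be overridden elsewhere in
the file, and a claim of PDF/A is not proof of conformance.

```python
import re


def sniff_pdf(data: bytes) -> dict:
    """Heuristic look at raw PDF bytes. This is not a real PDF parser."""
    # The "%PDF-x.y" header is expected near the start of the file.
    header = re.search(rb"%PDF-(\d\.\d)", data[:1024])
    return {
        "version": header.group(1).decode("ascii") if header else None,
        # An /Encrypt key signals password or permission restrictions.
        # The \b stops this from matching the /EncryptMetadata key.
        "encrypted": re.search(rb"/Encrypt\b", data) is not None,
        # PDF/A files declare themselves in XMP metadata via pdfaid:part.
        "claims_pdfa": b"pdfaid:part" in data,
    }
```

Calling sniff_pdf(open("thesis.pdf", "rb").read()) on a file (the name
thesis.pdf is only a placeholder) returns a small dict that can be used
to flag items for a closer look before ingest.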

And this doesn't include the various issues that can arise from fonts
not being included or present on the client's system, 'print' fonts that
compress letters in interesting ways ("fi" -> single character, but a
non-Unicode one), images that do not display, incomplete PDFs due to bad
exports that silently fail, etc.
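
For the ligature problem specifically, here is a small sketch of what
can and cannot be repaired after text extraction, assuming the extractor
hands back ordinary Python strings. If the ligature came out as a real
Unicode code point (e.g. U+FB01), NFKC normalization folds it back to
"fi". If the font mapped it to a private-use or replacement character,
the best one can do is flag it. Both function names are made up for this
example, and NFKC also rewrites other compatibility characters
(superscripts, full-width forms), so it suits search indexing better
than faithful transcription.

```python
import unicodedata


def repair_ligatures(text: str) -> str:
    """Fold Unicode compatibility characters (e.g. U+FB01) to plain letters."""
    return unicodedata.normalize("NFKC", text)


def unmapped_glyphs(text: str) -> list:
    """Return characters that suggest a glyph had no usable Unicode mapping."""
    return [
        ch for ch in text
        # "Co" is the private-use category; U+FFFD is the replacement character.
        if unicodedata.category(ch) == "Co" or ch == "\ufffd"
    ]
```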

My advice is to keep the source files alongside, especially if they are
in (La)TeX or HTML. Text is always parsable.

Ben

On Mon, 2009-06-15 at 11:37 +0100, Mike Taylor wrote:
> Dear CODE4LIB colleagues,
> 
> In one of my alternative incarnations, I am a zoological taxonomist.
> One of the big issues for taxonomy right now is whether to accept as
> nomenclaturally valid papers that are published only in electronic
> form, i.e. not printed on paper by a publisher.
> 
> In a discussion of this matter, a colleague has claimed:
> 
> > [PDF files will not become unreadable] in the next 30-40 years.
> > Possibly not in the 20 years that will follow. After that, when only
> > 30-year and older documents are in the PDF format, the danger will
> > increase that this information will not be readable any more. It is
> > generally considered as quite unlikely that PDF will be readable in
> > 100 years.
> 
> I would appreciate any comments that anyone on this list has on the
> likelihood that PDF will be unreadable in 100 years.
> 
> Many thanks,
> 
>  _/|_  ___
> /o ) \/  Mike Taylor    http://www.miketaylor.org.uk
> )_v__/\  "Can't someone act COMPLETELY OUT OF CHARACTER without arousing
>  suspicion?" -- Bob the Angry Flower, www.angryflower.com


[CODE4LIB] Durability of PDFs

2009-06-15 Thread Mike Taylor
Dear CODE4LIB colleagues,

In one of my alternative incarnations, I am a zoological taxonomist.
One of the big issues for taxonomy right now is whether to accept as
nomenclaturally valid papers that are published only in electronic
form, i.e. not printed on paper by a publisher.

In a discussion of this matter, a colleague has claimed:

> [PDF files will not become unreadable] in the next 30-40 years.
> Possibly not in the 20 years that will follow. After that, when only
> 30-year and older documents are in the PDF format, the danger will
> increase that this information will not be readable any more. It is
> generally considered as quite unlikely that PDF will be readable in
> 100 years.

I would appreciate any comments that anyone on this list has on the
likelihood that PDF will be unreadable in 100 years.

Many thanks,

 _/|____
/o ) \/  Mike Taylor    http://www.miketaylor.org.uk
)_v__/\  "Can't someone act COMPLETELY OUT OF CHARACTER without arousing
 suspicion?" -- Bob the Angry Flower, www.angryflower.com