Alas, 95% of "science" seems to be converting data from one file format
to another. I hate all file formats. There are FAR FAR too many of
them! For example, the suffix PDF can also mean "Powder Diffraction
File", which, believe it or not, IS a very common "machine-readable"
scientific data format. Do you have a program that can read it? I
thought so.
Every time I encounter a new file format (which seems to happen every
time I download a new computer program), I have to then go and figure
out how to convert it into text and write an awk program for parsing it
into something I can use. Strangely, this used to bother me a lot more
than it does now. Perhaps this is because I have resigned myself to the
fact that there is nothing I can do to stop the process of file format
proliferation.
That said, as file formats go, I don't think Adobe PDF is so bad. It
has the advantage of being widespread enough that it probably won't go
away for at least a few more decades, and it is a general way to
represent anything that can go onto a printed page. Yes, it is a
type-setting file format (every letter or word having a 2D coordinate,
plus a "font"), and yes, it is a pain to parse! I once burned up an
entire week trying to extract author, title, journal, etc. from a pile
of 300 "sdarticle.pdf" files. It is NOT easy!
It was after this highly painful experience that I realized PDF files
are not documents. They are annotated 2D images. I think the "right
way" to think about PDF files is to consider them equivalent to a "hard
copy". For the younger readers: a "hard copy" is like a PDF file, but
after you have killed a tree to print it out. Believe it or not, once
upon a time there were no PDF files, and "hard copy" was the ONLY format
for long-term archival storage. Your university probably still has a
large amount of "hard copy" journals. They are usually located in that
great big building called "The Library", that you may or may not have
been to.
Fortunately, the technology for converting "hard copy" (or a PDF) into
something useful is maturing rapidly. Not so long ago I was faced with
trying to get a large table of numbers out of the International Tables
for Crystallography (of which I only have a "hard copy") and into a
computer program for doing absorption corrections. After spending an
hour or so typing in 5-digit numbers, I remembered that several years
ago I had bought an $80 HP print/scan/fax machine that can produce a
"searchable" PDF. I scanned in the table (after masking off the
caption, etc. with blank sheets of paper), which produced a PDF file. I
then easily selected the numbers in Acrobat, and pasted them into a text
file. One awk script later, and I was done!
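The "one awk script" step amounts to something like the following. This is only a minimal sketch, written in Python rather than awk, and it assumes the pasted text has one table row per line with whitespace-separated values. The function name parse_pasted_table and the sample numbers are made up for illustration; they are not taken from the International Tables.

```python
def parse_pasted_table(text):
    """Turn pasted, whitespace-separated text into rows of floats.

    Tokens that are not numbers (stray OCR junk, column labels) are dropped,
    and lines that contain no numbers at all are skipped.
    """
    rows = []
    for line in text.splitlines():
        values = []
        for token in line.split():
            try:
                values.append(float(token))
            except ValueError:
                continue  # not a number; ignore it
        if values:
            rows.append(values)
    return rows


# Made-up sample, standing in for text pasted out of a scanned PDF.
sample = """
  1.5418   0.12345   10.250
  0.7107   0.01234    1.125
"""
for row in parse_pasted_table(sample):
    print(row)
```

OCR does misread digits now and then, so it is worth checking that every row has the expected number of columns before trusting the result.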
I suppose what James Stroud would really like, however, is a way to
select the area of the PDF digitally and then "right click" for a
"convert to ..." option. I don't think such a feature is available just
yet. Google, however, has done a great deal of work on this, and they
seem to be doing pdf-to-html conversion automatically now if you gmail
yourself a PDF file.
The best "scriptable" programs for getting data out of PDFs I have found
so far are the Linux command-line converters: pdftotext (from
"poppler"), and pdf2ps and ps2ascii (from Ghostscript). Usually some
combination of them will give you text with the formatting close to
what you want.
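As a rough illustration of scripting this, the sketch below wraps pdftotext in Python. It assumes pdftotext is installed and on the PATH, and it uses the -layout option (keep the physical column layout) together with -f and -l (first and last page). The file name supplement.pdf and the two function names are hypothetical.

```python
import subprocess


def pdftotext_command(pdf_path, first_page=None, last_page=None):
    """Build a pdftotext command line that writes the text to stdout."""
    cmd = ["pdftotext", "-layout"]  # -layout keeps table columns lined up
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    cmd += [pdf_path, "-"]  # "-" as the output file name means stdout
    return cmd


def pdf_page_text(pdf_path, first_page=None, last_page=None):
    """Run pdftotext and return the extracted text (needs pdftotext on PATH)."""
    result = subprocess.run(
        pdftotext_command(pdf_path, first_page, last_page),
        check=True,           # raise if pdftotext exits with an error
        capture_output=True,
        text=True,
    )
    return result.stdout


# Hypothetical file name; nothing is run here, the command is only printed.
print(" ".join(pdftotext_command("supplement.pdf", first_page=3, last_page=3)))
```

Restricting the page range to the page holding the table keeps the output small enough to eyeball before parsing it further.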
Or, if all else fails, print it, cut out the table you want with
scissors, and then scan it back in with OCR turned on. You might even
be able to "outsource" this by giving an undergrad the gift of their
first trip to the "Library". One day, when they become a scientist,
they might think more carefully about how they format their
supplementary documents.
-James Holton
MAD Scientist
On 11/17/2010 7:41 AM, John R Helliwell wrote:
Dear Colleagues,
In trying to perhaps see some level of virtue in the PNAS approach, one
can imagine that not all deposited data can be well characterised in a
way that is easy for computers to parse automatically.
In such circumstances, a deposited PDF may be better than nothing at
all. As yet, not all journal publishing platforms can or will serve a
variety of different file formats, which is probably in part why PDFs
might be used, since they are easy to generate.
That said, I agree with previous postings today that journals should
encourage authors to supply data in well-characterised machine-readable
formats, i.e. to the extent that this is feasible.
For small molecule crystal structures within IUCr Journal articles, and
associated crystal structure data sets, this is straightforward, since
variants of the IUCr's CIF standard cover diffraction images, structure
factors and refined coordinates and ADPs. For protein crystal
structures, as this CCP4bb well knows, articles are accompanied by RCSB
deposition of coordinates and structure factors.
Nevertheless, it would be good to see research scientists increasing
pressure on journals to deposit and disseminate supplementary data in
machine-readable formats, since that would in the long run greatly
increase the value of the deposited material.
An open-access paper I recently published with a colleague from the
IUCr office discusses the importance of fully integrating experimental
data with the finished research analysis, to complete the scientific
record. See:
Helliwell, J. R. & McMahon, B. (2010) The record of experimental
science: archiving data with literature. Information Services and Use 30,
31-37; DOI: 10.3233/ISU-2010-0609.
Many of the things we discuss in that article are equally relevant to
supplementary information as discussed in this thread.
Yours sincerely,
John
Professor John R Helliwell DSc
On Wed, Nov 17, 2010 at 6:39 AM, James Stroud wrote:
I was reading the PNAS author guidelines and I came across this gem:
Datasets: Supply Excel (.xls), RTF, or PDF files. This file type will be
published in raw format and will not be edited or composed.
Did I read those last two file formats correctly? I have actually come
across a dataset in supplementary information that was several dozen pages
of PDF. It was effectively impossible to extract the data from this
document. (I can dig it up if pressed, probably.) I had no idea that the
authors may have been encouraged to submit their data like that.
Does a premier scientific journal actually request data to be in PDF
format?
I can think of dozens of other formats that would be more fitting. They are
summarized here:
http://en.wikipedia.org/wiki/Comparison_of_data_serialization_formats
What is the scholarly equivalent to a torch and pitchfork march and how can
we organize such a march to encourage journals to require proper
serialization formats for datasets in supplementary info?
James
P.S. I am aware that it is better to submit data to a dedicated repository,
but let's consider those cases where research produces data for which there
is not yet a dedicated repository.