Re: [ccp4bb] [RANT] Publication Data Formats

James Holton Wed, 17 Nov 2010 11:31:47 -0800

Alas, 95% of "science" seems to be converting data from one file format to another. I hate all file formats. There are FAR FAR too many of them! For example, the suffix PDF can also mean "Powder Diffraction File", which, believe it or not, IS a very common "machine-readable" scientific data format. Do you have a program that can read it? I thought so.

Every time I encounter a new file format (which seems to happen every time I download a new computer program), I have to then go and figure out how to convert it into text and write an awk program for parsing it into something I can use. Strangely, this used to bother me a lot more than it does now. Perhaps this is because I have resigned myself to the fact that there is nothing I can do to stop the process of file format proliferation.

That said, as file formats go, I don't think Adobe PDF is so bad. It has the advantage of being widespread enough that it probably won't go away for at least a few more decades, and it is a general way to represent anything that can go onto a printed page. Yes, it is a type-setting file format (every letter or word having a 2D coordinate, plus a "font"), and yes, PDF files are a pain to parse! I once burned up an entire week trying to extract author, title, journal, etc. from a pile of 300 "sdarticle.pdf" files. It is NOT easy!

It was after this highly painful experience that I realized PDF files are not documents. They are annotated 2D images. I think the "right way" to think about PDF files is to consider them equivalent to a "hard copy". For the younger readers: a "hard copy" is like a PDF file, but after you have killed a tree to print it out. Believe it or not, once upon a time there were no PDF files, and "hard copy" was the ONLY format for long-term archival storage. Your university probably still has a large amount of "hard copy" journals. They are usually located in that great big building called "The Library", that you may or may not have been to.

Fortunately, the technology for converting "hard copy" (or a PDF) into something useful is maturing rapidly. Not so long ago I was faced with trying to get a large table of numbers out of the International Tables for Crystallography (of which I only have a "hard copy") and into a computer program for doing absorption corrections. After spending an hour or so typing in 5-digit numbers, I remembered that several years ago I had bought an $80 HP print/scan/fax machine that can produce a "searchable" PDF. I scanned in the table (after masking off the caption, etc. with blank sheets of paper), which produced a PDF file. I then easily selected the numbers in Acrobat, and pasted them into a text file. One awk script later, and I was done!
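
The parsing step described above can be sketched in Python instead of awk. This is only an illustration, not the script used in the message: the function name parse_pasted_table and the sample values are hypothetical, and the sketch assumes the pasted text has one table row per line with whitespace between the numbers.

```python
def parse_pasted_table(text):
    """Return one list of floats for each purely numeric line of text."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # blank line
        try:
            rows.append([float(field) for field in fields])
        except ValueError:
            # At least one field is not a number (caption, header,
            # OCR junk), so the whole line is skipped.
            continue
    return rows


# Made-up sample standing in for text pasted out of a "searchable" PDF.
sample = """
0.10  1.2345  2.3456
Table continued
0.20  1.3456  2.4567
"""

for row in parse_pasted_table(sample):
    print(row)
```

Skipping a whole line when any field fails to parse is a deliberate choice here: a row with a misread digit is easier to spot as missing than as a silently wrong number.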

I suppose what James Stroud would really like, however, is a way to select the area of the PDF digitally and then "right click" for a "convert to ..." option. I don't think such a feature is available just yet. Google, however, has done a great deal of work on this, and they seem to be doing pdf-to-html conversion automatically now if you gmail yourself a PDF file.

The best "scriptable" programs for getting data out of PDFs I have found so far are the Linux command-line converters: pdftotext (from "poppler"), plus pdf2ps and ps2ascii (which ship with Ghostscript). Usually some combination of them will give you text with the formatting close to what you want.
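
As a rough illustration of the scripting involved, pdftotext can be driven from Python. This is a minimal sketch under stated assumptions: pdftotext must be installed and on the PATH, and the helper names pdftotext_command and pdf_to_text are hypothetical. The -layout option asks pdftotext to preserve the physical page layout (helpful for tables), -f and -l select the first and last page, and "-" as the output file sends the text to standard output.

```python
import subprocess


def pdftotext_command(pdf_path, first_page=None, last_page=None):
    """Build the argument list for a layout-preserving pdftotext run."""
    cmd = ["pdftotext", "-layout"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    # "-" as the output file name makes pdftotext write to stdout.
    return cmd + [pdf_path, "-"]


def pdf_to_text(pdf_path, first_page=None, last_page=None):
    """Run pdftotext and return the extracted text as a string."""
    result = subprocess.run(
        pdftotext_command(pdf_path, first_page, last_page),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout
```

A call such as pdf_to_text("supplement.pdf", first_page=3, last_page=3), with a made-up file name, would return the text of page 3 only, ready for whatever parsing comes next.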

Or, if all else fails, print it, cut out the table you want with scissors, and then scan it back in with OCR turned on. You might even be able to "outsource" this by giving an undergrad the gift of their first trip to the "Library". One day, when they become a scientist, they might think more carefully about how they format their supplementary documents.


-James Holton
MAD Scientist

On 11/17/2010 7:41 AM, John R Helliwell wrote:

Dear Colleagues,
In trying to perhaps see some level of virtue in the PNAS approach one
can imagine that not all deposited data can be well characterised in a
way that is easy for computers to parse automatically. In such
circumstances, a deposited PDF may be better than nothing at all. As
yet, not all journal publishing platforms can or will serve a variety
of different file formats, which is probably in part why PDFs might be
used, since they are easy to generate.

That said, I agree with previous postings today that Journals should
encourage authors to supply data in well-characterised
machine-readable formats, i.e. to the extent that this is feasible.

For small molecule crystal structures within IUCr Journal articles,
and associated crystal structure data sets, this is straightforward,
since variants of the IUCr's CIF standard cover diffraction images,
structure factors and refined coordinates and ADPs. For protein
crystal structures, as this CCP4bb well knows, articles are
accompanied by RCSB deposition of coordinates and structure factors.


Nevertheless, it would be good to see research scientists increasing
pressure on journals to deposit and disseminate supplementary data in
machine-readable formats, since that would in the long run greatly
increase the value of the deposited material.

An open-access paper I recently published with a colleague from the
IUCr office discusses the importance of fully integrating experimental
data with the finished research analysis, to complete the scientific
record.  See:
     Helliwell, J. R. & McMahon, B. (2010) The record of experimental
     science: archiving data with literature. Information Services and Use 30,
     31-37; DOI: 10.3233/ISU-2010-0609.

Many of the things we discuss in that article are equally relevant to
supplementary information as discussed in this thread.

Yours sincerely,
John
Professor John R Helliwell DSc



On Wed, Nov 17, 2010 at 6:39 AM, James Stroud <[email protected]> wrote:

I was reading the PNAS author guidelines and I came across this gem:

Datasets: Supply Excel (.xls), RTF, or PDF files. This file type will be
published in raw format and will not be edited or composed.

Did I read those last two file formats correctly? I have actually come
across a dataset in supplementary information that was several dozen pages
of PDF. It was effectively impossible to extract the data from this
document. (I can dig it up if pressed, probably.) I had no idea that the
authors may have been encouraged to submit their data like that.
Does a premier scientific journal actually request data to be in PDF
format?
I can think of dozens of other formats that would be more fitting. They are
summarized here:

http://en.wikipedia.org/wiki/Comparison_of_data_serialization_formats

What is the scholarly equivalent to a torch and pitchfork march and how can
we organize such a march to encourage journals to require proper
serialization formats for datasets in supplementary info?
James
P.S. I am aware that it is better to submit data to a dedicated repository,
but let's consider those cases where research produces data for which there
is not yet a dedicated repository.
