Re: scientific publishing process (was Re: Cost and access)

Norman Gray Sat, 04 Oct 2014 05:50:52 -0700

Bernadette, hello.

On 2014 Oct 4, at 00:36, Bernadette Hyland <[email protected]> wrote:


... a really useful message which pulls several of these threads together.  The 
following is a rather fragmentary response.

As a reference point, I tend to think "publication" = "LaTeX -> PDF".  To 
pre-dispel a misconception, here, I'm not being a cheerleader for PDF below, 
but a fair fraction of the antagonism directed towards PDF in this thread is, I 
think, misplaced -- PDF is not the problem.

> We'd do ourselves a huge favor if we showed (STM) publishing executives why 
> this Linked Data stuff matters anyway.

They know.  A surprisingly large fraction of the Article Processing Charge we 
pay to them goes on extracting, managing and sharing metadata.  That includes 
DOIs, Crossref feeds, ScienceDirect, and so on and so on, and so (it seems) 
on.  It also includes conversion to XML: if you submit a LaTeX file to a big 
publisher, the first thing they'll do is convert it to XML+MathML (using 
workflows based on for example LaTeXML or TeX4ht) and preserve that; several of 
them then re-generate LaTeX for final production.

To a large extent, I suspect publishers now regard metadata management as their 
Job -- in the sense of their contribution to the scholarly endeavour -- and 
they could do without the dead trees.  If you can offer them a way of making 
metadata _insertion_ easier, which is cost effective, can be scaled up, and 
which a _broad_ range of authors will accept (the hard bit), they'll rip your 
arm off.

> 1) PDF works well for (STM) publishers who require fixed page display;

Yes, and for authors.  Given an alternative between an HTML version of a paper 
and a PDF version, I will _always_ choose the PDF, because it's zero-hassle, 
more reliably faithful to the author's original, more readable, and I can read 
it in the bath.

> 2) PDF doesn't take advantage of the advances we've made in machine 
> readability;

If by this you mean RDF, then yes, the naive ways of generating PDFs are not 
RDF-aware.  So we shouldn't be naive...

XMP is an ISO standard (as PDF is, and like it originating from Adobe) and is a 
type of RDF (well, an irritatingly 90% profile of RDF, but let that pass).  
Though it's not trivial, it's not hard to generate an XMP packet and get it 
into a PDF, and once there, the metadata job is mostly done.
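
To make "generate an XMP packet" concrete, here is a minimal sketch in Python, 
using only the standard library.  It builds a packet carrying a Dublin Core 
title and creator list.  The function name build_xmp_packet is made up for 
illustration, and the step of actually embedding the packet in a PDF (which 
needs a PDF library or a LaTeX package) is deliberately left out.

```python
import xml.etree.ElementTree as ET

# Namespaces used by XMP for basic Dublin Core metadata.
NS = {
    "x": "adobe:ns:meta/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def q(prefix, local):
    """Return the qualified name ElementTree expects: {uri}local."""
    return "{%s}%s" % (NS[prefix], local)


def build_xmp_packet(title, authors):
    """Hypothetical helper: return an XMP packet as a string."""
    xmpmeta = ET.Element(q("x", "xmpmeta"))
    rdf = ET.SubElement(xmpmeta, q("rdf", "RDF"))
    desc = ET.SubElement(rdf, q("rdf", "Description"), {q("rdf", "about"): ""})

    # dc:title is a language alternative (rdf:Alt) in XMP.
    alt = ET.SubElement(ET.SubElement(desc, q("dc", "title")), q("rdf", "Alt"))
    ET.SubElement(alt, q("rdf", "li"), {XML_LANG: "x-default"}).text = title

    # dc:creator is an ordered list (rdf:Seq) of names.
    seq = ET.SubElement(ET.SubElement(desc, q("dc", "creator")), q("rdf", "Seq"))
    for name in authors:
        ET.SubElement(seq, q("rdf", "li")).text = name

    body = ET.tostring(xmpmeta, encoding="unicode")
    # The xpacket processing instructions wrap the RDF so that tools can
    # locate the packet inside an otherwise binary file.
    return ('<?xpacket begin="\ufeff" id="W5M0MpCehiHzreSzNTczkc9d"?>\n'
            + body
            + '\n<?xpacket end="w"?>')
```

Calling build_xmp_packet("Lecture notes", ["A. Author"]) returns a string 
that could then be handed to whatever tool attaches the metadata stream to 
the PDF.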

> 3) In fact, PDFs suck on eBook readers which are all about flexible page 
> layout; and

Sure, but they're not intended for e-book readers, so of course they're poor at 
that.

> 4) We already have the necessary Web Standards to address the problem, so no 
> need to recreate the wheel.

If, again, you mean RDF, then I agree completely.

> --> Produce a Web-based tool that allows researchers to share their 
> [privately | publicly ] funded knowledge and produces a variety of outputs: 
> LaTeX, PDF and carries with it a machine readable representation.

Well, not web-based: I'd want something I can run on my own machine.

> Do people agree with the following SOLUTION approach?
> 
> The international standards to solve this exist. Standards from W3C and the 
> International Digital Publishing Forum (IDPF).[2]  Use (X)HTML for 
> generalized document creation/rendering. Use CSS for styling. Use MathML for 
> formulas. Use JS for action. Use RDF to model the metadata within HTML.  

PDF and XMP are both ISO standards, too.  LaTeX isn't a Standard standard, but 
it's pretty damn stable.

MathML one would _not_ want to type.  The only ways of generating MathML that 
I'm even slightly familiar with start with TeX syntax.  There are presumably 
GUI-based ones, too *shudder*.

> I propose a 'walk before we run' approach but do better than basic metadata 
> (i.e., title, author name, institution, abstract).  Link to other scholarly 
> communities/projects such as Vivo.[3]  

I generate Atom feeds for my PDF lecture notes.  The feed content is extracted 
from the XMP and from the /Author, /Title, etc, metadata within the PDF.  That 
metadata gets there automatically from the \author{...}, \title{...} metadata 
which is necessarily within the LaTeX source.  The pipeline isn't production 
quality, but it's done.  That much isn't challenging.
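
The last step of such a pipeline (XMP in, Atom entry out) might look roughly 
like the sketch below.  This is not the pipeline described above, whose code 
isn't shown; it assumes the XMP has already been pulled out of the PDF as 
text by some other tool, and the function name xmp_to_atom_entry and its url 
and updated parameters are invented for illustration.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}
# Serialise Atom elements in the default namespace (no prefix).
ET.register_namespace("", ATOM)


def atom(local):
    """Return the qualified Atom element name: {uri}local."""
    return "{%s}%s" % (ATOM, local)


def xmp_to_atom_entry(xmp_xml, url, updated):
    """Hypothetical helper: turn XMP text into an Atom <entry> string.

    xmp_xml -- the x:xmpmeta XML, assumed already extracted from the PDF
    url     -- where the PDF lives; used as both the entry id and its link
    updated -- an RFC 3339 timestamp, e.g. "2014-10-04T12:50:52Z"
    """
    root = ET.fromstring(xmp_xml)
    title = root.findtext(".//dc:title/rdf:Alt/rdf:li", default="", namespaces=NS)
    creators = [li.text for li in root.findall(".//dc:creator/rdf:Seq/rdf:li", NS)]

    entry = ET.Element(atom("entry"))
    ET.SubElement(entry, atom("title")).text = title
    for name in creators:
        author = ET.SubElement(entry, atom("author"))
        ET.SubElement(author, atom("name")).text = name
    ET.SubElement(entry, atom("id")).text = url
    ET.SubElement(entry, atom("link"), {"href": url})
    ET.SubElement(entry, atom("updated")).text = updated
    return ET.tostring(entry, encoding="unicode")
```

A real feed would wrap a number of such entries in an Atom feed element with 
its own title, id and updated timestamp.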

> We've got to show the 1,200 lb gorillas (STM publishers) why they want to 
> come over to our part of the forest ... it isn't enough to stay with PDF to 
> facilitate typesetting in 2015!  The Web has moved on & so must the 
> publishers.  

While we're up our tree arguing, that din you can hear in the next clearing is 
the publishers spending their APCs on large-scale metadata extraction, and 
tearing out their hair at authors' apparent inability to follow simple 
instructions on how to make that easier.

(And just by the way: yes, publishers are in it for the money, ... monopoly 
rents..., yadda yadda, ... but I've never actually _caught_ one eating babies).

> Anything we do must be better than LaTeX in terms of ease-of-use.

Really?  What, exactly?

Word (and analogues)?  Sure, you can get metadata from WP files, but it takes a 
lot of heuristic effort, and requires authors to be pretty disciplined about 
using styles.

GUI XML editors?  I was talking to someone a couple of weeks ago who'd just 
completed a whole PhD detailing exactly how rubbish XML editors are in 
practical usability terms.

nxml-mode in Emacs?  Probably the best option for writing pointy-brackets, but 
still a bit painful for authoring extensive text.  And you can't write MathML.

> Publishers will make more money because their customers which include 
> researchers & universities, will be able to discover, access and re-use data 
> liberated from the 20th Century PDF.

That's why the publishers currently care about metadata.

----

PDFs are surprisingly flexible and open containers for transporting around 
Stuff (I haven't tried it, but I have little doubt you could bundle HTML, CSS 
and all the RDF you wanted into a PDF, should you somehow manage to devise a 
use-case for that).  The hard-ish bit is using that metadata in a visibly 
useful way -- tools tend not to rely on it, because it tends not to be there; 
and it tends not to be there because users don't demand it; and users don't 
demand it because tools don't display it.  The seriously hard bit is getting 
the metadata from the authors (who, to a first approximation, _really_, 
*really* don't care) into the PDF.

All the best,

Norman


-- 
Norman Gray  :  http://nxg.me.uk
SUPA School of Physics and Astronomy, University of Glasgow, UK
