Generally, I'd rather have semantically tagged, reflowable, CSS-enabled XHTML documents, EPUB-like. However, PDFs serve a useful
purpose too, and in some restricted cases it's hard to see how a particular goal can be achieved differently. An interesting
option is to do something like a "Hybrid PDF": store the original editable document and/or alternate forms (XHTML+CSS+semantic
markup) in the PDF, automatically and reliably sensing those alternates at any point. LibreOffice includes this feature now:
http://blogs.computerworlduk.com/simon-says/2012/03/the-magic-of-editable-pdfs/index.htm
It's possible, at least to a large extent, to associate particular segments of data with particular rendered elements. OCR
programs make use of this to place resulting text in the same position as the graphic version of the text in a scanned page.
This could allow copy and paste of semantically tagged data from a PDF just like an RDFa web page.
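A minimal sketch of that idea, under stated assumptions: each rendered region on a page carries a bounding box plus an RDFa-style property name, and "copying" a selection rectangle returns the tagged text of every region it overlaps. The names Box, TaggedRegion, and copy_selection, the property names, and the sample page contents are all hypothetical illustrations, not part of any PDF library or of the PDF format itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in page coordinates (hypothetical units: points)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def intersects(self, other: "Box") -> bool:
        # Two rectangles overlap when they overlap on both axes.
        return (self.x0 < other.x1 and other.x0 < self.x1
                and self.y0 < other.y1 and other.y0 < self.y1)


@dataclass(frozen=True)
class TaggedRegion:
    """A rendered element together with the semantic data it represents."""
    box: Box
    text: str
    prop: str  # RDFa-style property name, chosen here only for illustration


def copy_selection(regions, selection):
    """Return (property, text) pairs for every region the selection touches."""
    return [(r.prop, r.text) for r in regions if r.box.intersects(selection)]


# Made-up sample page: a title, an author line, and a body block.
PAGE = [
    TaggedRegion(Box(72, 700, 540, 730), "Hybrid PDFs in practice", "dc:title"),
    TaggedRegion(Box(72, 670, 300, 690), "A. Author", "dc:creator"),
    TaggedRegion(Box(72, 100, 540, 650), "Body text of the article", "schema:articleBody"),
]

if __name__ == "__main__":
    # Select the top of the page: the title and author lines, not the body.
    for prop, text in copy_selection(PAGE, Box(60, 660, 550, 740)):
        print(f"{prop}: {text}")
```

A real implementation would have to read the region-to-data mapping out of the PDF itself; the sketch only shows the lookup once such a mapping exists.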
sdw
On 5/7/13 1:32 AM, RebholzSchuhmann wrote:
Hi,
I have seen similar discussions before.
I guess, we look at two different use cases:
(1) PDF: layout oriented, but could (and hopefully will) carry a lot more semantic information. The key achievement is and
will be optimal layout; on the other hand, the overhead for processing / exploitation / reuse goes up for everybody
who is NOT PDF-savvy.
(2) the other open formats (HTML, XML, RDF): allow easy-to-go exploitation, processing, and enrichment, and stand for the
spirit of the open web and reuse of data.
Listening to publishers, certainly layout matters. I am not only talking about the big five or ten who would have the
resources to go a different direction, I am talking about the 1,000 smaller publishers who have to serve their community. They
would struggle more to comply with the other "standards" and still deliver an appealing product.
I guess some clever thinking and collaborative work is required to bring both
together.
Hope this helps.
-drs-
On 07/05/2013 09:17, Steve Pettifer wrote:
I assume most authors don't actually format their documents by selecting a font
size for every single heading and so on.
This is a tempting assumption to make, especially if you come from computer
science / maths / physics and related disciplines (as I do). But my experience
in the life sciences is that authors do 'paint' their manuscripts by hand,
painstakingly selecting the font and format for every bit of their document.
Even using the 'semantic' features of wordprocessors (such as 'Heading 1') is
something that's not commonplace. So before we get too carried away with
expecting people to write HTML / LaTeX or even markup, we'll need to take into
account the working practices of the vast majority of academics outside of the
more 'semantically aware' bits of science.
They work in a format that utilizes semantically meaningful information about
the work: to identify a title, headings, math blocks, illustrations, plots, etc.
No, they really don't. I wish they did. But, outside of a certain area of
science, they don't.
Steve
--
D. Rebholz-Schuhmann
--
Stephen D. Williams