Generally, I'd rather have semantically tagged, reflowable, CSS-enabled XHTML documents, EPUB-like. However, PDFs serve a useful purpose too, and in some restricted cases it's hard to see how a particular goal can be achieved differently. An interesting option is to do something like a "Hybrid PDF": store the original editable document and/or alternate forms (XHTML+CSS+semantic markup) in the PDF, automatically and reliably sensing those alternates at any point. LibreOffice includes this feature now:

http://blogs.computerworlduk.com/simon-says/2012/03/the-magic-of-editable-pdfs/index.htm

It's possible, at least to a large extent, to associate particular segments of data to particular rendered elements. OCR programs make use of this to place resulting text in the same position as the graphic version of the text in a scanned page. This could allow copy and paste of semantically tagged data from a PDF just like an RDFa web page.
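
As a rough illustration of the idea above, here is a minimal sketch in Python of mapping rendered regions of a page to semantically tagged data, so that a copy operation over a rectangle returns the tags along with the text. Everything in it is an assumption made for illustration: the Box and Segment types, the copy_selection function, the coordinates, and the RDFa-style "property" values are hypothetical and do not correspond to any real PDF or OCR library API.

```python
from dataclasses import dataclass, field


@dataclass
class Box:
    """Axis-aligned rectangle in page coordinates (hypothetical units, y grows downward)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def intersects(self, other: "Box") -> bool:
        # Two rectangles overlap when they overlap on both axes.
        return (self.x0 < other.x1 and other.x0 < self.x1
                and self.y0 < other.y1 and other.y0 < self.y1)


@dataclass
class Segment:
    """A piece of rendered text plus the semantic data associated with it."""
    box: Box
    text: str
    properties: dict = field(default_factory=dict)


def copy_selection(segments, selection):
    """Return the text and semantic properties of every segment the selection touches."""
    hits = [s for s in segments if s.box.intersects(selection)]
    # Approximate reading order: top to bottom, then left to right.
    hits.sort(key=lambda s: (s.box.y0, s.box.x0))
    return [{"text": s.text, **s.properties} for s in hits]


# Hypothetical page content, as an OCR or layout step might produce it.
page = [
    Segment(Box(72, 72, 300, 90), "An Example Title", {"property": "dc:title"}),
    Segment(Box(72, 100, 200, 112), "A. Author", {"property": "dc:creator"}),
    Segment(Box(72, 400, 500, 412), "Body text far down the page"),
]

# Selecting the top of the page picks up the title and author, with their tags.
selected = copy_selection(page, Box(60, 60, 320, 120))
```

The point of the sketch is only that the position of a rendered element is enough of a key to recover the data behind it; a real implementation would read the positions and tags from the document itself.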

sdw

On 5/7/13 1:32 AM, RebholzSchuhmann wrote:
Hi,

I have seen similar discussions before.

I guess, we look at two different use cases:
(1) PDF: layout-oriented, but could (and hopefully will) carry a lot more semantic information. Its key achievement is, and will remain, optimal layout; on the other hand, the overhead for processing, exploitation, and reuse goes up for everybody who is NOT PDF-savvy.
(2) The other open formats (HTML, XML, PDF): these allow easy exploitation, processing, and enrichment, and stand for the spirit of the open web and reuse of data.

Listening to publishers, certainly layout matters. I am not only talking about the big five or ten who would have the resources to go a different direction, I am talking about the 1,000 smaller publishers who have to serve their community. They would struggle more to comply with the other "standards" and still deliver an appealing product.

I guess some clever thinking and collaborative work is required to bring both
together.

Hope this helps.

    -drs-

On 07/05/2013 09:17, Steve Pettifer wrote:
> I assume most authors don't actually format their documents by selecting a font
> size for every single heading and so on.
This is a tempting assumption to make, especially if you come from computer 
science / maths / physics and related disciplines (as I do). But my experience 
in the life sciences is that authors do 'paint' their manuscripts by hand, 
painstakingly selecting the font and format for every bit of their document. 
Even using the 'semantic' features of word processors (such as 'Heading 1') is 
something that's not commonplace. So before we get too carried away with 
expecting people to write HTML / LaTeX or even markup, we'll need to take into 
account the working practices of the vast majority of academics outside of the 
more 'semantically aware' bits of science.

> They work in a format that utilizes semantically meaningful information about
> the work: to identify a title, headings, math blocks, illustrations, plots, etc.
No, they really don't. I wish they did. But, outside of a certain area of 
science, they don't.

Steve


--
D. Rebholz-Schuhmann


--
Stephen D. Williams LinkedIn: http://sdw.st/in
V:650-450-UNIX (8649) V:866.SDW.UNIX V:703.371.9362 F:703.995.0407
AIM:sdw Skype:StephenDWilliams Yahoo:sdwlignet Resume: http://sdw.st/gres
Personal: http://sdw.st facebook.com/sdwlig twitter.com/scienteer
