Neat. This could be extended to put a full table of contents into the
metadata, and in lots of other ways. The other nice thing about it is that it
would be possible to push the same data through a LaTeX-to-HTML toolchain for
those who want HTML output.
peter
On 10/06/2014 03:18 PM, Norman Gray wrote:
Greetings.
On 2014 Oct 6, at 19:19, Alexander Garcia Castro <[email protected]> wrote:
Querying PDFs is NOT simple; it requires a lot of work and usually
produces lots of errors. Just querying metadata is not enough. As I said
before, I understand the PDF as something that gives me a uniform layout.
That is OK and necessary, but not sufficient within the context
of the web of data and scientific publications. I would like to have the
content readily available for mining purposes. If I pay for the publication,
I should get access to it in every format in which it is available. The
content should be presented in a way that makes sense within the web
of data. If that is the full content of the paper represented in RDF or XML,
fine. Also, I would like to have well-annotated content; this is simple and
something that could quite easily be part of existing publication
workflows. It may also be part of the guidelines for authors, for instance
asking them to identify and annotate rhetorical structures.
The following might add something to this conversation.
It illustrates getting the metadata from a LaTeX file, putting it into an XMP
packet in a PDF, and getting it out of the PDF as RDF. Pace Peter's mention of
/Author, /Title, etc, this just focuses on the XMP packet.
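As a rough illustration of the last step only (getting the RDF back out), here is a minimal Python sketch. It is not necessarily how the example in this message was produced. It assumes the XMP packet is stored uncompressed in the PDF, so that a plain byte scan for the <?xpacket ...?> wrapper finds it, and it assumes the title is recorded as dc:title. The function names extract_xmp_rdf and dc_titles are made up for this sketch.

```python
import re
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

# An XMP packet sits between <?xpacket begin=...?> and <?xpacket end=...?>.
# Assumption: the packet is not compressed, so it is visible in the raw bytes.
_PACKET = re.compile(
    rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=.*?\?>", re.DOTALL
)


def extract_xmp_rdf(pdf_bytes):
    """Return the rdf:RDF element of the first XMP packet found, or None."""
    match = _PACKET.search(pdf_bytes)
    if match is None:
        return None
    # Strip the whitespace padding that usually surrounds the packet body.
    root = ET.fromstring(match.group(1).strip())
    rdf_tag = f"{{{RDF_NS}}}RDF"
    if root.tag == rdf_tag:
        return root
    # Usually the RDF is wrapped in an x:xmpmeta element.
    return root.find(rdf_tag)


def dc_titles(rdf):
    """Return the text of every rdf:li nested inside a dc:title property."""
    return [
        li.text
        for title in rdf.iter(f"{{{DC_NS}}}title")
        for li in title.iter(f"{{{RDF_NS}}}li")
    ]
```

With the bytes of a PDF in hand (for example from open(path, "rb").read()), extract_xmp_rdf returns an element tree that can be serialised with ET.tostring and handed to any RDF/XML parser. dc_titles is only there to show that simple properties can be read without one.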
This has the document metadata, the abstract, and an illustrative bit of
argumentation. Adding details about the document structure, and (RDF) pointers
to any figures, would be feasible, as would, I suspect, incorporating CSV files
directly into the PDF. Incorporating \begin{tabular} tables would be rather
tricky, but not impossible. I can't help feeling that the XHTML+RDFa
equivalent would be longer, and would need more documentation to tell the
author where to put the RDFa magic.
It's not very fancy, and still has rough edges, but it only took me 100
minutes, from a standing start.
Generating and querying this PDF seems pretty simple to me.
----
[...]