Sure. So extract the text from the PDF and query that. It also would be nice to have access to the LaTeX sources.

What HTML publishing *might* offer beyond the above is an easier way to embed extra, queryable information in papers. Is this just metadata that could be injected into PDFs just as easily? If authors are given this capability, will a significant number of them use it? Or is it better to keep that information in a separate document and not use HTML for publishing at all?
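
To make the metadata route discussed in this thread concrete (extract the PDF metadata and turn it into RDF), here is a minimal Python sketch. It assumes the Info dictionary has already been pulled out of the PDF by some other tool and is available as a plain dict. The document URI, the function names, and the mapping from PDF keys to Dublin Core terms are illustrative assumptions, not something proposed in the thread.

```python
# Hypothetical sketch: turn already-extracted PDF metadata into N-Triples.
# Extracting the metadata from the PDF itself is out of scope here.

DCTERMS = "http://purl.org/dc/terms/"

# Assumed mapping from PDF Info dictionary keys to Dublin Core terms.
KEY_TO_PROPERTY = {
    "Title": DCTERMS + "title",
    "Author": DCTERMS + "creator",
    "Subject": DCTERMS + "subject",
}


def escape_literal(value):
    """Escape a string for use inside a double-quoted N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def metadata_to_ntriples(doc_uri, info):
    """Return one N-Triples line per mapped, non-empty metadata entry."""
    triples = []
    for key, value in info.items():
        prop = KEY_TO_PROPERTY.get(key)
        if prop is None or not value:
            continue  # skip keys outside the assumed mapping and empty values
        triples.append(
            '<%s> <%s> "%s" .' % (doc_uri, prop, escape_literal(value))
        )
    return triples


if __name__ == "__main__":
    # Made-up example values, for illustration only.
    example_info = {"Title": "An Example Paper", "Author": "A. N. Author"}
    for line in metadata_to_ntriples("http://example.org/paper.pdf", example_info):
        print(line)
```

The resulting lines could be loaded into a triple store and queried with SPARQL. Note that this covers only the document metadata, not the full text of the paper.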

peter




On 10/06/2014 10:42 AM, Alexander Garcia Castro wrote:
"It's not hard to query PDFs with SPARQL.  All you have to do is extract the
metadata from the document and turn it into RDF, if needed.  Lots of programs
extract and display this metadata already."

In the age of the web of data, why should I restrict my search to metadata?
I want the full content: open access or not, once I have the document I
should be able to mine its content. I don't want to limit my search to
simple metadata.

On Mon, Oct 6, 2014 at 9:48 AM, Peter F. Patel-Schneider
<[email protected]> wrote:

    It's not hard to query PDFs with SPARQL.  All you have to do is extract
    the metadata from the document and turn it into RDF, if needed.  Lots of
    programs extract and display this metadata already.

    No, I don't think that viewing this issue from the reviewer perspective is
    too narrow.  Reviewers form a vital part of the scientific publishing
    process. Anything that makes their jobs harder or the results that they
    produce worse is going to have to have very large benefits over the
    current setup.  In any case, I haven't been looking at the reviewer
    perspective only, even in the message quoted below.

    peter

    PS:  This is *not* to say that I think that the reviewing process is
    anywhere near ideal.  On the contrary, I think that the reviewing process
    has many problems, particularly as it is performed in CS conferences.



    On 10/06/2014 09:19 AM, Martynas Jusevičius wrote:

        Dear Peter,

        please show me how to query PDFs with SPARQL. Then I'll believe there
        are no benefits of XHTML+RDFa over PDF.

        Addressing the issue from the reviewer perspective only is too narrow,
        don't you think?


        Martynas

        On Mon, Oct 6, 2014 at 6:08 PM, Peter F. Patel-Schneider
        <[email protected]> wrote:



            On 10/06/2014 08:38 AM, Phillip Lord wrote:


                "Peter F. Patel-Schneider" <[email protected]> writes:


                    I would be totally astonished if using htlatex as the main
                    way to produce conference papers were as simple as this.

                    I just tried htlatex on my ISWC paper, and the result was,
                    to put it mildly, horrible.  (One of my AAAI papers was
                    about the same, the other one caused an undefined control
                    sequence and only produced one page of output.)  Several
                    parts of the paper were rendered in fixed-width fonts.
                    There was no attempt to limit line length.  Footnotes were
                    in separate files.




                The footnote thing is pretty strange, I have to agree. Although
                "footnotes" are a fairly alien concept with respect to the web.
                Probably hover overs would be a reasonable presentation for
                this.


                    Many non-scalable images were included, even for simple
                    math.



                It does MathML I think, which is then rendered client side. Or
                you could drop math-mode straight through and render it client
                side with MathJax.



            Well, somehow PNG files are being produced for some math, which is a
            failure.  I don't know what the right way to do this would be; I
            just know that the version of htlatex for Fedora 20 fails to
            reasonably handle the math in this paper.

                    My carefully designed layout for examples was modified in
                    ways that made the examples harder to understand.



                Perhaps this is a key difference between us. I don't care
                about the layout, and want someone to do it for me; it's one
                of the reasons I use latex as well.



            There are many cases where line breaks and indentation are
            important for understanding.  Getting this sort of presentation
            right in latex is a pain for starters, but when it has been done,
            having the htlatex toolchain mess it up is a failure.

                    That said, the result was better than I expected.  If
                    someone upgrades htlatex to work well I'm quite willing
                    to use it, but I expect that a lot of work is going to
                    be needed.



                Which gets us back to the chicken and egg situation. I would
                probably do this; but, at the moment, ESWC and ISWC won't let
                me submit it. So, I'll end up with the PDF output anyway.



            Well, I'm with ESWC and ISWC here.  The review process should be
            designed to make reviewing easy for reviewers.  Until viewing HTML
            output is as trouble-free as viewing PDF output, PDF should be the
            required format.

                This is why it is important that web conferences allow HTML,
                which is where the argument started. If you want something
                that prints just right, PDF is the thing for you. If you want
                to read your papers in the bath, likewise, PDF is the thing
                for you. And that's fine by me (so long as you don't mind me
                reading your papers in the bath!). But it needs to not be the
                only option.



            Why?  What are the benefits of HTML reviewing, right now?  What
            are the benefits of HTML publishing, right now?  If there were
            HTML-based tools that worked well for preparing, reviewing, and
            reading scientific papers, then maybe conferences would use them.
            However, conference organizers and reviewers have limited time,
            and are thus going for the simplest solution that works well.

            If some group thinks that a good HTML-based solution is possible,
            then let them produce this solution.  If the group can get
            pre-approval of some conference, then more power to them.
            However, I'm not going to vote for any pre-approval of some future
            solution when the current situation is satisficing.

                Phil



            peter






--
Alexander Garcia
http://www.alexandergarcia.name/
http://www.usefilm.com/photographer/75943.html
http://www.linkedin.com/in/alexgarciac

