I have a simple problem: how do I extract meaningful information from a PDF? For instance, citation data. I would be happy if I could extract citation data with 70% accuracy. So far we have tried a lot of tools and got very poor results. I would also like to know how I could get at the content of the PDF, "jailbreak" the PDF so that I can make effective use of the content.
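To make it concrete, here is a rough sketch of the kind of extraction I mean, in Python with pypdf (the library and the file name are only illustrative choices on my part, not one of the specific tools we benchmarked):

    import re
    from pypdf import PdfReader  # illustrative library choice, not a recommendation

    reader = PdfReader("paper.pdf")  # placeholder file name

    # Document-level metadata, if the producer bothered to embed any.
    info = reader.metadata
    print("Title: ", info.title if info else None)
    print("Author:", info.author if info else None)

    # Flatten the page text and look for citation-like identifiers.
    text = "\n".join(page.extract_text() or "" for page in reader.pages)

    # Crude DOI pattern; this is exactly where the accuracy problem lives,
    # since extracted text loses columns, hyphenation and reading order.
    dois = re.findall(r'10\.\d{4,9}/[^\s"<>]+', text)
    print("Candidate DOIs:", sorted(set(dois)))

Something like this gets the raw text out, but the structure around the citations is what we keep losing.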
I don't have anything against the PDF. I would be happy just to have an open PDF, something that sets the content free.

On Thu, May 2, 2013 at 10:41 PM, Norman Gray <[email protected]> wrote:
>
> Sarven and all, hello.
>
> On 2013 May 2, at 18:38, Sarven Capadisli <[email protected]> wrote:
>
>>> _What_ sucks on the web? Certainly not PDF.
>>
>> HTML/Web, PDF/Desktop?
>
> PDF/Web, HTML/Desktop? I'm not sure what you're trying to say here.
>
>>> Thus HTML can do some unimportant things better than PDF,
>>
>> Web pages. It will never take off.
>
> No no, the web is massively successful. HTML is a really clever hypertext format which is successful because it lets a number of things go wrong (it doesn't guarantee link integrity, links are all one-way, there's minimal text metadata, and so on). These deficiencies are seriously smart things to use to create a global hypertext. Web pages have taken off in a big way.
>
> It does not follow that HTML-based hypertext solves all text problems. In particular, there is nothing in the above set of clever properties which makes HTML obviously ideal for communicating long-form textual arguments.
>
> And what is this 'desktop' of which you speak? PDF is for making posters, presentations, on-screen documents, and on-tablet documents -- lots of very distinct layout problems there. In the last case, you can even transfer the things to paper and read them in the bath, if you want.
>
>> but what it
>>> can't do, which _is_ important, is make things readable. The visual appearance -- that is, the typesetting -- of rendered HTML is almost universally bad, from the point of view of reading extended pieces. I haven't (I admit) yet experimented with reading extended text on a tablet, but I'd be surprised if that made a major difference.
>>
>> I think you are conflating the job of HTML with CSS. Also, I think you are conflating readability with legibility as far as the typesetting goes. Again, that's something CSS handles provided that suitable fonts are in use.
>
> CSS can help make HTML pages more readable. Myself, I usually put quite a lot of effort into the CSS which accompanies web pages I write. But it takes a lot of effort to produce good CSS, and the case you're aiming to optimise is the case of a normal-length web-page (under 1000 words, say), with relatively small investments on the part of the reader.
>
> Distributing PDF, you have easy and precise control over fonts, layout, and overall design (or rather, you in principle have access to a style which is carefully designed). This makes it easy to produce something which is easy to read for thousands of words.
>
> But this is to some extent irrelevant, because I think we're now talking about a non-problem:
>
>>> Also, HTML is not the same as linked data; there's no 'dog food' here for us to eat.
>>
>> That's quite a generalization there? So, I would argue that "HTML" is more about eating dogfood in the Linked Data mailing list than parading on PDF. We are trying to build things one step at a time; HTML today, a URI that it can sit on tomorrow. Additional machine-friendly stuff the day after.
>
> What, seriously, is the connection between HTML and linked-data? If there is a deep connection, then HTML articles represent the linked-data community's dog-food, and it should be eaten.
>
> But there is no such deep connection.
>
> Certainly, HTML is one of the representations which a LD system will offer, because a data provider needs to produce a readily and flexibly rendered human-readable representation of the item data being named/offered. That's a completely different thing from an article.
>
> In another message in this thread, Alexander Garcia Castro says:
>
>> I am right now struggling with a task as simple as getting citation data from PDFs. I dont want to say that the PDF is all bad but... come on, it had a place in the time when desktop was king. now we need to make effective use of content, the reality is simply that content is locked up in PDFs.
>
> Sure: there are weaknesses in the way that article metadata is currently incorporated in PDFs. DOIs, ORCIDs, arXiv identifiers, all of the 'Beyond PDF' experiments, and so on are all attempts to join the various dots here, and they are rapidly getting better.
>
> Until we really get AI that can read the paper for us, there's nothing 'locked up in PDFs' that's more than (I exaggerate only slightly) a regular expression away.
>
> All the best,
>
> Norman
>
> --
> Norman Gray : http://nxg.me.uk
> SUPA School of Physics and Astronomy, University of Glasgow, UK

--
Alexander Garcia
http://www.alexandergarcia.name/
http://www.usefilm.com/photographer/75943.html
http://www.linkedin.com/in/alexgarciac
