I have a simple problem: how do I extract meaningful information from
a PDF, for instance citation data? I would be happy if I could extract
citation data with 70% accuracy; so far we have tried a lot of tools
and got very poor results. I would also like to know how I could get
at the content of a PDF, "jailbreak" the PDF so that I can make
effective use of the content.

I don't have anything against PDF; I would be happy just to have an
open PDF, something that sets the content free.
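
For concreteness, this sketch is roughly what I mean by "getting the
content out" (pdfminer.six is just one example library, and
"paper.pdf" is a placeholder filename):

    # Dump the raw text of a PDF, then look for the reference section.
    # Assumes pdfminer.six is installed (pip install pdfminer.six).
    import re
    from pdfminer.high_level import extract_text

    text = extract_text("paper.pdf")  # placeholder filename

    # Crude heuristic: take everything after a "References" heading.
    match = re.search(r"\bReferences\b(.*)", text, re.DOTALL | re.IGNORECASE)
    if match:
        print(match.group(1)[:500])  # first 500 characters of references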

On Thu, May 2, 2013 at 10:41 PM, Norman Gray <[email protected]> wrote:
>
> Sarven and all, hello.
>
> On 2013 May 2, at 18:38, Sarven Capadisli <[email protected]> wrote:
>
>>> _What_ sucks on the web?  Certainly not PDF.
>>
>> HTML/Web, PDF/Desktop?
>
> PDF/Web, HTML/Desktop?  I'm not sure what you're trying to say here.
>
>>> Thus HTML can do some unimportant things better than PDF,
>>
>> Web pages. It will never take off.
>
> No no, the web is massively successful.  HTML is a really clever hypertext 
> format which is successful because it lets a number of things go wrong (it 
> doesn't guarantee link integrity, links are all one-way, there's minimal text 
> metadata, and so on).  These deficiencies are seriously smart things to use 
> to create a global hypertext.  Web pages have taken off in a big way.
>
> It does not follow that HTML-based hypertext solves all text problems.  In 
> particular, there is nothing in the above set of clever properties which 
> makes HTML obviously ideal for communicating long-form textual arguments.
>
> And what is this 'desktop' of which you speak?  PDF is for making posters, 
> presentations, on-screen documents, and on-tablet documents -- lots of very 
> distinct layout problems there.  In the last case, you can even transfer the 
> things to paper and read them in the bath, if you want.
>
>
>>> but what it
>>> can't do, which _is_ important, is make things readable.  The visual
>>> appearance -- that is, the typesetting -- of rendered HTML is almost
>>> universally bad, from the point of view of reading extended pieces.
>>> I haven't (I admit) yet experimented with reading extended text on a
>>> tablet, but I'd be surprised if that made a major difference.
>>
>> I think you are conflating the job of HTML with CSS. Also, I think you are 
>> conflating readability with legibility as far as the typesetting goes. 
>> Again, that's something CSS handles provided that suitable fonts are in use.
>
> CSS can help make HTML pages more readable.  Myself, I usually put quite a 
> lot of effort into the CSS which accompanies web pages I write.  But it takes 
> a lot of effort to produce good CSS, and the case you're aiming to optimise 
> is the case of a normal-length web-page (under 1000 words, say), with 
> relatively small investments on the part of the reader.
>
> Distributing PDF, you have easy and precise control over fonts, layout, and 
> overall design (or rather, you in principle have access to a style which is 
> carefully designed).  This makes it easy to produce something which is easy 
> to read for thousands of words.
>
> But this is to some extent irrelevant, because I think we're now talking 
> about a non-problem:
>
>>> Also, HTML is not the same as linked data; there's no 'dog food' here
>>> for us to eat.
>>
>> That's quite a generalization there? So, I would argue that "HTML" is more 
>> about eating dogfood in the Linked Data mailing list than parading on PDF. 
>> We are trying to build things one step at a time; HTML today, a URI that it 
>> can sit on tomorrow. Additional machine-friendly stuff the day after.
>
> What, seriously, is the connection between HTML and linked-data?  If there is 
> a deep connection, then HTML articles represent the linked-data community's 
> dog-food, and it should be eaten.
>
> But there is no such deep connection.
>
> Certainly, HTML is one of the representations which a LD system will offer, 
> because a data provider needs to produce a readily and flexibly rendered 
> human-readable representation of the item data being named/offered.   That's 
> a completely different thing from an article.
>
> In another message in this thread, Alexander Garcia Castro says:
>
>> I am right now struggling with a task as simple as getting citation data
>> from PDFs. I don't want to say that the PDF is all bad but... come on,
>> it had a place in the time when the desktop was king. Now we need to make
>> effective use of content; the reality is simply that content is locked
>> up in PDFs.
>
> Sure: there are weaknesses in the way that article metadata is currently 
> incorporated in PDFs.  DOIs, ORCIDs, arXiv identifiers, all of the 'Beyond 
> PDF' experiments, and so on are all attempts to join the various dots here, 
> and they are rapidly getting better.
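
The DOI dots are indeed increasingly easy to join: the doi.org
resolver supports content negotiation, so a single request returns
structured citation metadata. A sketch, assuming the Python requests
library; the DOI shown is only a placeholder:

    import requests

    doi = "10.1000/xyz123"  # placeholder; substitute a real DOI
    resp = requests.get(
        "https://doi.org/" + doi,
        headers={"Accept": "application/vnd.citationstyles.csl+json"},
    )
    resp.raise_for_status()
    meta = resp.json()  # CSL-JSON: title, author, issued, DOI, ...
    print(meta["title"])
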
>
> Until we really get AI that can read the paper for us, there's nothing 
> 'locked up in PDFs' that's more than (I exaggerate only slightly) a regular 
> expression away.
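
For what it's worth, that claim is literal for DOIs at least: once the
text is out of the PDF, a regular expression does find them. The
pattern below is the one CrossRef recommends for modern DOIs (some
legacy forms will slip through):

    import re

    # CrossRef's recommended pattern for modern DOIs.
    DOI_PATTERN = re.compile(r"10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")

    sample = "See Smith et al., doi:10.1234/example.5678 for details."
    print(DOI_PATTERN.findall(sample))  # -> ['10.1234/example.5678']
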
>
> All the best,
>
> Norman
>
>
> --
> Norman Gray  :  http://nxg.me.uk
> SUPA School of Physics and Astronomy, University of Glasgow, UK
>



-- 
Alexander Garcia
http://www.alexandergarcia.name/
http://www.usefilm.com/photographer/75943.html
http://www.linkedin.com/in/alexgarciac
