On Mon, May 12, 2014 at 2:24 PM, Venkata Pingali <[email protected]> wrote:

> I have been working on PDF extraction. I find that PDF
> combines 'what' (text itself) with 'how' (transformations,
> presentation). The table that we see is often just a collection
> of lines and rectangles put together in an ad hoc fashion.
> It could be due to PDF generator libraries themselves. It feels
> like the 'C' of this space. IMO we are missing the frameworks
> and higher levels of abstraction and/or representations. They
> may be available in the Adobe ecosystem somewhere but it
> is not obvious to an outsider like me as to what they are.
>

Hey Venkat, have you made any progress on this?

Adobe formats are notorious for being hard to work with. In addition, the
original objective of PDF was display, not preserving a retrievable data
hierarchy. So I have little confidence that a single solution will just work
for all cases.

Perhaps the way forward is to build a document parser that takes in a
layout description written in a domain-specific language and uses it to
make sense of the PDF.
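
To make the idea concrete, here is a rough sketch in Python. It assumes
some other tool has already pulled words and their page coordinates out of
the PDF; the Word class, the LAYOUT table and apply_layout are all made-up
names for illustration, and a plain dict stands in for the layout language.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge of the word, in page units (assumed)
    y: float  # top edge of the word, in page units (assumed)


# Hypothetical layout description: field name -> (x0, y0, x1, y1) box.
# A real DSL would be richer; a dict is enough to show the idea.
LAYOUT = {
    "name": (0, 0, 200, 20),
    "amount": (200, 0, 300, 20),
}


def apply_layout(words, layout):
    """Group words by the layout box that contains their top-left corner."""
    fields = {name: [] for name in layout}
    # Read top to bottom, then left to right, so joined text is in order.
    for w in sorted(words, key=lambda w: (w.y, w.x)):
        for name, (x0, y0, x1, y1) in layout.items():
            if x0 <= w.x < x1 and y0 <= w.y < y1:
                fields[name].append(w.text)
                break  # a word belongs to at most one field
    return {name: " ".join(parts) for name, parts in fields.items()}


# Words as an extractor might return them, in arbitrary order.
words = [Word("Rao", 60, 5), Word("Asha", 10, 5), Word("1200", 210, 5)]
record = apply_layout(words, LAYOUT)
```

The point is that the layout knowledge lives in the description, not in the
parser, so each new kind of document needs a new description rather than
new code.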

