On Mon, May 12, 2014 at 2:24 PM, Venkata Pingali <[email protected]> wrote:
> I have been working on PDF extraction. I find that PDF
> combines 'what' (text itself) with 'how' (transformations,
> presentation). The table that we see is often just a collection
> of lines and rectangles put together in an ad hoc fashion.
> It could be due to PDF generator libraries themselves. It feels
> like the 'C' of this space. IMO we are missing the frameworks
> and higher levels of abstraction and/or representations. They
> may be available in the Adobe ecosystem somewhere but it
> is not obvious to an outsider like me as to what they are.
>

Hey Venkat, have you made any progress on this?

Adobe formats are notorious for being hard to work with. In addition, the original objective of PDF was display, not maintaining a retrievable data hierarchy. So I have little confidence that a single solution will just work for all cases.

Perhaps the way forward is to build a document parser that takes in a layout description in a domain-specific language and tries to make sense of the PDF.
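
The layout-description idea can be sketched roughly as follows. This is only an illustration in Python, not a tested design: it assumes some PDF library has already produced positioned words as (text, x, y) tuples, with y increasing down the page, and the "DSL" is just a plain dict. The names LAYOUT and extract_table, the column names, and the coordinate ranges are all made up for the example.

```python
from collections import defaultdict

# Hypothetical layout description: column names mapped to (x_min, x_max)
# ranges in page units, plus a tolerance for grouping words into rows.
LAYOUT = {
    "row_tolerance": 3.0,
    "columns": {
        "state": (0, 100),
        "year": (100, 150),
        "value": (150, 250),
    },
}


def extract_table(words, layout):
    """Group positioned words into rows, then assign each to a column.

    `words` is a list of (text, x, y) tuples, assumed to come from
    whatever PDF library performs the low-level text extraction.
    """
    tol = layout["row_tolerance"]
    rows = []  # each entry: (y of first word in the row, {column: [(x, text)]})
    for text, x, y in sorted(words, key=lambda w: (w[2], w[1])):
        column = next(
            (name for name, (lo, hi) in layout["columns"].items() if lo <= x < hi),
            None,
        )
        if column is None:
            continue  # word falls outside every declared column
        # Start a new row when the word is too far below the current row.
        if not rows or abs(y - rows[-1][0]) > tol:
            rows.append((y, defaultdict(list)))
        rows[-1][1][column].append((x, text))
    # Within each cell, order words left to right before joining them.
    return [
        {
            name: " ".join(text for _, text in sorted(cells.get(name, [])))
            for name in layout["columns"]
        }
        for _, cells in rows
    ]
```

Calling extract_table(words, LAYOUT) returns one dict per detected row, keyed by the column names in the layout. A different report would then need only a different layout description rather than new parsing code, though real PDFs would likely need more than fixed x-ranges (merged cells, wrapped text, rotated pages).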
