As it turns out, I am working on a DSL for data extraction for a client, which is pretty much what you described, with some nuances. The client is open-source friendly, and I will ask them to open-source the tooling.
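To make the idea concrete, here is a minimal sketch of a layout description driving table extraction. It is not the client tooling. The names (Word, Column, TableLayout, extract_table), the coordinates, and the sample values are all made up for illustration. It assumes a separate PDF library has already produced words with page coordinates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """A word with its position on the page (hypothetical input format)."""
    text: str
    x: float  # left edge
    y: float  # top edge


@dataclass(frozen=True)
class Column:
    """A named column, described by the horizontal range it occupies."""
    name: str
    x_min: float
    x_max: float


@dataclass(frozen=True)
class TableLayout:
    """Declarative description of where a table sits and how it is split."""
    y_min: float
    y_max: float
    columns: tuple
    row_tolerance: float = 2.0  # words within this y distance share a row


def extract_table(layout, words):
    """Group words into rows by y, then assign each word to a column by x."""
    inside = sorted(
        (w for w in words if layout.y_min <= w.y <= layout.y_max),
        key=lambda w: w.y,
    )
    lines = []
    for w in inside:
        # Compare against the first word of the current line.
        if lines and abs(w.y - lines[-1][0].y) <= layout.row_tolerance:
            lines[-1].append(w)
        else:
            lines.append([w])

    rows = []
    for line in lines:
        cells = {c.name: [] for c in layout.columns}
        for w in sorted(line, key=lambda w: w.x):
            for c in layout.columns:
                if c.x_min <= w.x < c.x_max:
                    cells[c.name].append(w.text)
                    break  # words outside every column are dropped
        rows.append({name: " ".join(parts) for name, parts in cells.items()})
    return rows


# Made-up sample: a header line, two table rows, and a page footer.
words = [
    Word("Name", 50, 90.0), Word("Count", 200, 90.0),
    Word("Alpha", 50, 100.0), Word("Beta", 85, 100.4), Word("120", 200, 100.2),
    Word("Gamma", 50, 115.0), Word("45", 200, 115.1),
    Word("Page", 50, 700.0), Word("1", 80, 700.0),
]
layout = TableLayout(
    y_min=95,
    y_max=600,
    columns=(Column("name", 40, 180), Column("count", 180, 300)),
)
rows = extract_table(layout, words)
```

The point of keeping the layout as plain data is that one parser can be reused across documents, and only the description changes per report format.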
On Sat, May 31, 2014 at 8:44 PM, Sriram Karra <[email protected]> wrote:
>
> On Mon, May 12, 2014 at 2:24 PM, Venkata Pingali <[email protected]> wrote:
>
>> I have been working on PDF extraction. I find that PDF
>> combines 'what' (text itself) with 'how' (transformations,
>> presentation). The table that we see is often just a collection
>> of lines and rectangles put together in an ad hoc fashion.
>> It could be due to PDF generator libraries themselves. It feels
>> like the 'C' of this space. IMO we are missing the frameworks
>> and higher levels of abstraction and/or representations. They
>> may be available in the Adobe ecosystem somewhere but it
>> is not obvious to an outsider like me as to what they are.
>>
>
> Hey Venkat, have you made any progress on this?
>
> Adobe formats are notorious for being hard to work with. In addition, the
> original objective of PDF was display, not maintaining a retrievable data
> hierarchy. So I have little confidence a single solution will just work for
> all cases.
>
> Perhaps the way forward is to build a document parser that takes in a
> layout description in a domain-specific language and tries to make sense of
> the PDF.
>
> --
> Datameet is a community of Data Science enthusiasts in India. Know more
> about us by visiting http://datameet.org
>
> ---
> You received this message because you are subscribed to the Google Groups
> "datameet" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
