As it turns out, I am working on a DSL for data extraction for
a client - pretty much what you said with some nuances. The
client is open-source friendly and I will ask them to
open-source the tooling.
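
To give a rough idea of what a layout-description approach could
look like, here is a minimal sketch. It is for illustration only
and is not the client tooling. The names Fragment, LAYOUT and
extract_rows are hypothetical, and it assumes some PDF library has
already produced text fragments with page coordinates (with y
growing upwards, as in PDF user space). The layout description is
just a mapping from column names to x ranges.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A piece of text plus its page position (hypothetical input format)."""
    x: float
    y: float
    text: str


# Hypothetical layout description: column name -> (x_min, x_max) in page units.
LAYOUT = {
    "district": (0, 100),
    "year": (100, 150),
    "population": (150, 250),
}


def extract_rows(fragments, layout, y_tolerance=2.0):
    """Group fragments into rows by y position, then assign each to a column."""
    rows = []  # list of (row_y, {column_name: text})
    # Assumes y grows upwards, so sorting by -y reads the page top to bottom.
    for frag in sorted(fragments, key=lambda f: (-f.y, f.x)):
        # Start a new row unless this fragment sits close to the current row.
        if not rows or abs(rows[-1][0] - frag.y) > y_tolerance:
            rows.append((frag.y, {}))
        cells = rows[-1][1]
        for name, (x_min, x_max) in layout.items():
            if x_min <= frag.x < x_max:
                # Join fragments that land in the same cell.
                cells[name] = (cells.get(name, "") + " " + frag.text).strip()
                break
    return [cells for _, cells in rows]


# Made-up sample data, standing in for what a PDF text extractor might emit.
sample = [
    Fragment(10, 700.0, "A"),
    Fragment(110, 700.5, "2011"),
    Fragment(160, 700.0, "100"),
    Fragment(10, 680.0, "B"),
    Fragment(110, 680.0, "2011"),
    Fragment(160, 679.5, "200"),
]
table = extract_rows(sample, LAYOUT)
```

A real DSL would need more than fixed x ranges (headers, merged
cells, multi-page tables), but the idea is the same: the layout
description carries the structure that the PDF itself does not.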


On Sat, May 31, 2014 at 8:44 PM, Sriram Karra <[email protected]> wrote:

>
> On Mon, May 12, 2014 at 2:24 PM, Venkata Pingali <[email protected]>
> wrote:
>
>> I have been working on PDF extraction. I find that PDF
>> combines 'what' (text itself) with 'how' (transformations,
>> presentation). The table that we see is often just a collection
>> of lines and rectangles put together in an ad hoc fashion.
>> It could be due to PDF generator libraries themselves. It feels
>> like the 'C' of this space. IMO we are missing the frameworks
>> and higher levels of abstraction and/or representations. They
>> may be available in the Adobe ecosystem somewhere but it
>> is not obvious to an outsider like me as to what they are.
>>
>
> Hey Venkat, have you made any progress on this?
>
> Adobe formats are notorious for being hard to work with. In addition, the
> original objective of PDF was display, not maintaining a retrievable data
> hierarchy. So I have little confidence a single solution will just work for
> all cases.
>
> Perhaps the way forward is to build a document parser that takes in a
> layout description in a domain specific language and tries to make sense of
> the PDF.
>
> --
> Datameet is a community of Data Science enthusiasts in India. Know more
> about us by visiting http://datameet.org
>
> ---
> You received this message because you are subscribed to the Google Groups
> "datameet" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
>
