Hi,

I have tried various ways to reverse-engineer non-protected PDFs as
part of my work.

Two methods work (other than dedicated conversion software):

1. Open the file in Adobe Illustrator. If the text is stored as text
(rather than as outlines or an image), it will be imported as text.
Beyond this, one can use VBA with Illustrator to manipulate the text.

2. I have used an ActiveX component to extract text, along with its
coordinates, directly from a PDF using VBA in Excel. That was a long time ago.
I will look for it again and post the code here.
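
Until I find that code, here is a minimal sketch, in Python rather than the
original VBA, of what becomes possible once text has been extracted with its
coordinates. It assumes each text fragment is already available as an
(x, y, text) tuple. The function name group_into_rows, the y_tolerance
parameter, and the sample fragments are all hypothetical, made up for
illustration; they do not come from any particular PDF library.

```python
from typing import List, Tuple

Fragment = Tuple[float, float, str]  # (x, y, text); assumed output of some extractor


def group_into_rows(fragments: List[Fragment], y_tolerance: float = 2.0) -> List[List[str]]:
    """Group fragments with nearly equal y into rows, each ordered left to right."""
    rows: List[List[Fragment]] = []
    # PDF y-coordinates normally grow upwards, so descending y reads top-down.
    for frag in sorted(fragments, key=lambda f: -f[1]):
        # Compare against the first fragment of the current row.
        if rows and abs(rows[-1][0][1] - frag[1]) <= y_tolerance:
            rows[-1].append(frag)
        else:
            rows.append([frag])
    # Within each row, order by x and keep only the text.
    return [[text for _, _, text in sorted(row, key=lambda f: f[0])] for row in rows]


# Hypothetical fragments, in the arbitrary order an extractor might return them.
sample = [
    (200.0, 700.0, "Qty"),
    (72.0, 700.5, "Item"),
    (72.0, 685.0, "Pens"),
    (200.0, 684.6, "12"),
]
table = group_into_rows(sample)
```

The tolerance is there because fragments on the same visual line often differ
slightly in their y value; a suitable value depends on the font size used in
the document.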

On Saturday, May 31, 2014 8:44:32 PM UTC+5:30, Karra wrote:
>
>
> On Mon, May 12, 2014 at 2:24 PM, Venkata Pingali <[email protected]> wrote:
>
>> I have been working on PDF extraction. I find that PDF
>> combines 'what' (the text itself) with 'how' (transformations,
>> presentation). The table that we see is often just a collection
>> of lines and rectangles put together in an ad hoc fashion.
>> It could be due to the PDF generator libraries themselves. It feels
>> like the 'C' of this space. IMO we are missing the frameworks
>> and higher levels of abstraction and/or representations. They
>> may be available in the Adobe ecosystem somewhere, but it
>> is not obvious to an outsider like me what they are.
>>
>
> Hey Venkat, have you made any progress on this?
>
> Adobe formats are notorious for being hard to work with. In addition, the
> original objective of PDF was display, not maintaining a retrievable data
> hierarchy. So I have little confidence that a single solution will just work
> for all cases.
>
> Perhaps the way forward is to build a document parser that takes in a
> layout description in a domain-specific language and tries to make sense of
> the PDF.
>
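
The layout-description idea quoted above could look something like the
following minimal Python sketch. Everything in it is an assumption made for
illustration: the toy one-line-per-column format ("name: x_min-x_max"), the
names parse_layout and assign_columns, and the sample fragments. It also
assumes the text has already been extracted as (x, y, text) tuples.

```python
from typing import Dict, List, Tuple

Fragment = Tuple[float, float, str]  # (x, y, text); assumed extractor output

# Hypothetical layout description: one "column_name: x_min-x_max" per line.
LAYOUT = """
item: 50-150
qty: 150-300
"""


def parse_layout(description: str) -> Dict[str, Tuple[float, float]]:
    """Parse the toy layout description into {column: (x_min, x_max)}."""
    columns: Dict[str, Tuple[float, float]] = {}
    for line in description.strip().splitlines():
        name, span = line.split(":")
        x_min, x_max = span.split("-")
        columns[name.strip()] = (float(x_min), float(x_max))
    return columns


def assign_columns(
    fragments: List[Fragment], columns: Dict[str, Tuple[float, float]]
) -> Dict[str, List[str]]:
    """Put each fragment in the column whose x-range contains it; drop the rest."""
    result: Dict[str, List[str]] = {name: [] for name in columns}
    # PDF y-coordinates normally grow upwards, so descending y reads top-down.
    for x, _, text in sorted(fragments, key=lambda f: -f[1]):
        for name, (x_min, x_max) in columns.items():
            if x_min <= x < x_max:
                result[name].append(text)
                break
    return result


# Hypothetical fragments; the last one falls outside every declared column.
sample = [
    (200.0, 700.0, "Qty"),
    (72.0, 700.5, "Item"),
    (72.0, 685.0, "Pens"),
    (200.0, 684.6, "12"),
    (400.0, 684.6, "page 3"),
]
extracted = assign_columns(sample, parse_layout(LAYOUT))
```

The point of keeping the layout in a separate description is that a new
report format needs only a new description, not new parsing code.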

-- 
Datameet is a community of Data Science enthusiasts in India. Know more about 
us by visiting http://datameet.org