----------------------------------------
> From: lrose...@adobe.com
> To: itext-questions@lists.sourceforge.net
> Date: Mon, 10 May 2010 18:09:15 -0700
> Subject: Re: [iText-questions] how to detect remote links in a PDF ?
>
> There is no such thing as "canonical" PDF - anything that complies with the
> PDF specification is valid. That allows for various uses of compression,
> ASCII encoding, etc.
>
> There are certainly tools out there that will uncompress/defilter all the
> elements in the PDF so that it is "plain text" and can be searched using
> text-only tools - though certainly that wouldn't help you for modifications
> (for obvious reasons).
Well, not really. If there are rules for the PDF standard then you could in
fact create some alternative representation. It could be super big, verbose,
complicated, etc., but it may be a useful intermediate form for various types
of work such as debugging or ad hoc editing where you don't want to waste time
writing custom code to do something simple. "XXX Intermediate Form" is a very
common file format :) I guess you could imagine expanding it to some XML
format where you have decompressed the text and done something with the
images, fonts, and formatting information, no idea what. Essentially your
claim is that PDF is so bizarre, unique, superlative, and singular that
nothing can possibly equal it :) I just downloaded some schematic capture
programs and those create "documents" that are inherently graphical
(schematics), but the essential features can be easily extracted as concise
text netlists.
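
To make the "intermediate form" idea concrete, here is a rough sketch of what
I mean by pdf_to_standard_form. It is Python, not a real tool, and the names
inflate_streams and STREAM_RE are made up for illustration. It assumes the
streams are Flate (zlib) compressed and the file is not encrypted, and it finds
stream bodies by pattern matching instead of honoring /Length the way a real
parser would. The output is only good for viewing or grep. It is no longer a
valid PDF, since the lengths and offsets are all wrong afterwards.

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords.
# Assumption: the body never contains the bytes "endstream" itself.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(pdf_bytes):
    """Return pdf_bytes with each zlib-inflatable stream body expanded."""

    def expand(match):
        try:
            # decompressobj tolerates the end-of-line bytes left after the data.
            inflated = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            # Not Flate data (other filter, encrypted, ...): leave untouched.
            return match.group(0)
        return b"stream\n" + inflated + b"\nendstream"

    return STREAM_RE.sub(expand, pdf_bytes)
```

Reading stdin and writing the result to stdout would give the filter from my
earlier example, and object streams should then show up as plain text for grep.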
>
> That's why libraries such as iText exist - to provide you with higher level
> APIs (where possible). They are what one would use to create automated test
> tools, validators, etc. And many such tools already do exist - so it's
> definitely doable (and has been done).
If you took that attitude you couldn't even hide behind "but PDF is a
standard", since then the argument is "well, I have API xyz and we can do
anything with it, if you use my ABC format." I guess having a list would help.
Is there a PDF developer download somewhere with tools like this? This reminds
me of when I first got here and you explained that logical structure was
available, but every time it comes up in a concrete rather than hypothetical
case everyone says, "Sure, you could preserve structure but it is too
complicated to be practical." In the present case, you say the tools exist,
but when someone shows up with an error from Acrobat no one can point to a
tool to check the PDF.
>
> And let us not forget the expression - just because you only have a hammer,
> doesn't mean everything is a nail!
That's fine if you have a list of tools somewhere, but I keep seeing the same
hammer being used, usually Acrobat Reader with the informative diagnostic
"your pdf is damaged." Again, I'm not saying this is a fault with ADBE or PDF,
but it would be nice to refer people to some list of tools that give a better
diagnostic. In many cases of course all you really care about is the text and
the hammer gets almost everything done. When you need the graphics that is a
different situation. So OK, I've only got one Swiss army knife LOL.
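
For the original question, the text hammer might look something like the
sketch below. It is Python, run over PDF bytes whose objects are not
compressed, and find_links, URI_RE and GOTOR_RE are names I made up. As far as
I know a link to a web address is a /URI action and a link to another PDF
(what Chunk.setRemoteGoto produces) is a /GoToR action with the file name
under /F. The assumptions are that /F is a plain parenthesized string, not
a file specification dictionary or a hex string, and that /S comes before /F
in the dictionary. Anything else will be missed.

```python
import re

# /URI (...) inside a URI action; allows backslash-escaped characters.
URI_RE = re.compile(rb"/URI\s*\(((?:\\.|[^\\()])*)\)")
# /S /GoToR ... /F (file) within one dictionary; assumes /S precedes /F.
GOTOR_RE = re.compile(rb"/S\s*/GoToR\b[^>]*?/F\s*\(((?:\\.|[^\\()])*)\)")


def find_links(text):
    """Return (web_uris, remote_pdf_files) found in uncompressed PDF bytes."""
    uris = [m.decode("latin-1") for m in URI_RE.findall(text)]
    files = [m.decode("latin-1") for m in GOTOR_RE.findall(text)]
    return uris, files
```

It tells you nothing about which page or rectangle a link sits on, but for
just listing targets it may be enough.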
>
> Leonard
>
> -----Original Message-----
> From: Mike Marchywka [mailto:marchy...@hotmail.com]
> Sent: Monday, May 10, 2010 6:02 PM
> To: itext-questions@lists.sourceforge.net
> Subject: Re: [iText-questions] how to detect remote links in a PDF ?
>
> ----------------------------------------
>> From: lrose...@adobe.com
>> To: itext-questions@lists.sourceforge.net
>> Date: Mon, 10 May 2010 06:44:13 -0700
>> Subject: Re: [iText-questions] how to detect remote links in a PDF ?
>>
>> Prior to PDF 1.5, you could have done a grep (or equivalent) since only
>> stream objects were compressed. However, as of PDF 1.5, we now have "object
>> streams", where groups of objects are placed into a stream and then
>> compressed - which means that grep will no longer work.
>>
>> Adobe Acrobat 9 will ALWAYS (unless restricted by a specific ISO standard,
>> such as PDF/A) use object stream compression to keep file sizes down. I've
>> been trying to recommend that other products do the same.
>
>
> Is there some utility like in pdftk to convert a PDF with arbitrary stuff in
> it to some "standard" or canonical format that lets it be used with other
> tools, so you don't have to write custom code for every little trivial
> variation of a thing you wish to accomplish? For example,
>
> cat xxx.pdf | pdf_to_standard_form | grep http
>
>
> Obviously applicability would go beyond the immediate question, but it would
> also let people writing iText code have some way to check their results more
> easily than "it opened in proprietary Adobe product X but in black box Y it
> greyed out 3 menu options and wouldn't let me save it unless blah blah blah"?
>
> There is nothing wrong with a human-readable end product, but given the
> complexity of these things it would be nice to use computers to automate
> certain things, like checking for links or other attributes. Without the
> ability to use automated tools, everything comes down to a long menu chain
> and terse messages from products not designed for debugging.
>
>>
>> So while there certainly exist lots of PDFs that you could grep, the
>> numbers are reducing daily...
>>
>> Leonard
>>
>> -----Original Message-----
>> From: Mike Marchywka [mailto:marchy...@hotmail.com]
>> Sent: Monday, May 10, 2010 3:51 AM
>> To: itext-questions@lists.sourceforge.net
>> Subject: Re: [iText-questions] how to detect remote links in a PDF ?
>>
>> ----------------------------------------
>>> Date: Sun, 9 May 2010 23:08:51 +0200
>>> From: papa...@googlemail.com
>>> To: itext-questions@lists.sourceforge.net
>>> Subject: [iText-questions] how to detect remote links in a PDF ?
>>>
>>> Colleagues,
>>>
>>> For an application, one needs to detect the hyperlinks (i.e. done with
>>> Chunk.setRemoteGoto) in a PDF which point to another PDF, can someone
>>> point me to a solution ?
>>
>> Question for Leonard or others who have read the spec: if you literally ONLY
>> want to list the links, not parse the document or determine any context,
>> are they likely to be hidden, or can you just use text tools to find strings
>> that start with or contain "http"? For example,
>>
>>
>> cat *.pdf ../Desktop/*.pdf | sed -e 's/[^a-zA-Z0-9/:.?]/\n/g' | grep http
>> cat *.pdf ../Desktop/*.pdf | strings | grep http
>>
>> These seem to work in that they find things with http, but I'm not sure what
>> would be missing. Many of these seem to be surrounded by XML or prefixed with
>> "/A", but I'm not sure what other contexts may exist.
>>
>> Thanks.
>>>
>>> Thank you very much in advance,
>>> Pieter Vankeerberghen
>>>
>>> ------------------------------------------------------------------------------
>>>
>>> _______________________________________________
>>> iText-questions mailing list
>>> iText-questions@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/itext-questions
>>>
>>> Buy the iText book: http://www.itextpdf.com/book/
>>> Check the site with examples before you ask questions:
>>> http://www.1t3xt.info/examples/
>>> You can also search the keywords list: http://1t3xt.info/tutorials/keywords/