[jira] [Comment Edited] (PDFBOX-5682) Long/permanent hang in PDFBox 3.x

2023-09-12 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764228#comment-17764228
 ] 

Tim Allison edited comment on PDFBOX-5682 at 9/12/23 2:41 PM:
--

This is the part from that document that is, erm, eye-opening:

{noformat}
4.2 AF entry not in the catalog
4.2.1 General
Most existing applications that take advantage of Associated Files use the AF entry in the
document catalog as the place to make the association. However, the concept of
Associated Files goes well beyond association only with the file as a whole, and also
allows for defining relations between embedded files and certain pages, annotations,
form fields, graphics objects, structure elements in the tagging structure, DParts or any
other PDF object.
{noformat}

And, yes, the document goes on to say that PDF writers should do the 
traditional thing, but...



> Long/permanent hang in PDFBox 3.x
> -
>
> Key: PDFBOX-5682
> URL: https://issues.apache.org/jira/browse/PDFBOX-5682
> Project: PDFBox
>  Issue Type: Bug
>Reporter: Tim Allison
>Priority: Minor
>
> I found two files in the regression tests where we're now getting timeouts at 
> 3 minutes where we weren't before.  Unfortunately, PDFBox's export:text works 
> on both, so it is probably another structural feature, perhaps a problem in 
> Tika?
> This file halts after printing out the header for Table 19 on page 46: 
> https://corpora.tika.apache.org/base/docs/govdocs1/078/078656.pdf
> Pure PDFBox's export:text complains multiple times: "Page skipped due to an 
> invalid or missing type null", but it does finish quickly.
> This file halts after extracting {{"854,793,592"}}: 
> https://corpora.tika.apache.org/base/docs/commoncrawl3_refetched/G7/G7BO7PNCCREVF2BCY5YSYOPYDLMBYASY
> Pure PDFBox's export:text processes this without problem.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Comment Edited] (PDFBOX-5682) Long/permanent hang in PDFBox 3.x

2023-09-12 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764225#comment-17764225
 ] 

Tim Allison edited comment on PDFBOX-5682 at 9/12/23 2:41 PM:
--

Thank you, [~lehmi].  In Tika, we initially copied PDFBox's 
ExtractEmbeddedFiles example, but we found that PDF writers can stuff attached 
files/file specs/associated files on pretty much anything 
(https://www.pdfa.org/wp-content/uploads/2018/10/PDF20_AN002-AF.pdf).

From what we can tell with publicly available corpora, it is rare to have an 
attachment not in the name tree and not in an annotation on a page, but after 
making the change in TIKA-4012, we did find a few new attachments.

This may be a "won't fix" in 3.x. 

Perhaps we allow users to turn off the "scan every object for an embedded file" 
behavior on the Tika side?





[jira] [Comment Edited] (PDFBOX-5682) Long/permanent hang in PDFBox 3.x

2023-09-12 Thread Jira


[ 
https://issues.apache.org/jira/browse/PDFBOX-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764199#comment-17764199
 ] 

Andreas Lehmkühler edited comment on PDFBOX-5682 at 9/12/23 1:56 PM:
-

{quote}It looks like that causes a full parse of the file?{quote}
"getObjectsByType" searches all indirect objects for those of type FILESPEC, so 
every indirect object has to be loaded on demand, which is more or less the 
whole file. In 2.0.x all objects are already loaded, so calling 
"getObjectsByType" is less expensive there than in 3.0.x.

IMHO there are two possible solutions:
* maybe there is some room for improvement in how all objects are loaded
* don't scan all objects when looking for specific object types such as files. 
The example "org.apache.pdfbox.examples.pdmodel.ExtractEmbeddedFiles" shows how 
to get all files using PD-level objects. In 3.0.x this should be the preferred 
way to go, as it doesn't scan all indirect objects
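
To make the cost difference concrete, here is a rough toy model, not PDFBox code. Every name in it (LazyLoadSketch, LazyDoc, Obj, scanByType, walkNameTree) is made up for illustration, and the "document" is just a map of objects that are counted as parsed on first access. It only sketches why a by-type scan forces every object to load while a walk from the catalog touches a handful.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only: none of these types are PDFBox classes.
class LazyLoadSketch {

    // Stand-in for an indirect object: a type plus named references to other objects.
    static final class Obj {
        final String type;
        final Map<String, Integer> refs = new LinkedHashMap<>();
        Obj(String type) { this.type = type; }
    }

    // Stand-in for a document that "parses" an object only on first access.
    static final class LazyDoc {
        private final Map<Integer, Obj> unparsed = new TreeMap<>();
        private final Map<Integer, Obj> parsed = new TreeMap<>();

        void put(int id, Obj obj) { unparsed.put(id, obj); }

        // Loads the object on demand and remembers that it has been parsed.
        Obj get(int id) {
            Obj obj = parsed.get(id);
            if (obj == null) {
                obj = unparsed.get(id);
                parsed.put(id, obj);
            }
            return obj;
        }

        int parsedCount() { return parsed.size(); }

        // Type-based lookup: every object must be loaded to learn its type.
        List<Integer> scanByType(String type) {
            List<Integer> hits = new ArrayList<>();
            for (int id : unparsed.keySet()) {
                if (type.equals(get(id).type)) { hits.add(id); }
            }
            return hits;
        }

        // Structure-based lookup: follow catalog -> Names -> entries only.
        List<Integer> walkNameTree(int catalogId) {
            List<Integer> hits = new ArrayList<>();
            Integer namesId = get(catalogId).refs.get("Names");
            if (namesId == null) { return hits; }
            for (int id : get(namesId).refs.values()) {
                if ("Filespec".equals(get(id).type)) { hits.add(id); }
            }
            return hits;
        }
    }

    // Builds a catalog (1), a name tree (2), a page tree (3), one file spec (4)
    // and pageCount page objects (5 and up).
    static LazyDoc buildDoc(int pageCount) {
        LazyDoc doc = new LazyDoc();
        Obj catalog = new Obj("Catalog");
        catalog.refs.put("Names", 2);
        catalog.refs.put("Pages", 3);
        Obj names = new Obj("Names");
        names.refs.put("attachment.txt", 4);
        Obj pages = new Obj("Pages");
        doc.put(1, catalog);
        doc.put(2, names);
        doc.put(3, pages);
        doc.put(4, new Obj("Filespec"));
        for (int i = 0; i < pageCount; i++) {
            int id = 5 + i;
            pages.refs.put("kid" + i, id);
            doc.put(id, new Obj("Page"));
        }
        return doc;
    }

    public static void main(String[] args) {
        LazyDoc walked = buildDoc(1000);
        System.out.println("walk: found " + walked.walkNameTree(1)
                + ", parsed " + walked.parsedCount());
        LazyDoc scanned = buildDoc(1000);
        System.out.println("scan: found " + scanned.scanByType("Filespec")
                + ", parsed " + scanned.parsedCount());
    }
}
```

In this toy, both lookups find the same file spec, but the walk parses 3 objects and the scan parses all 1004. The flip side is that the walk only sees what hangs off the name tree, so a file spec attached anywhere else would be missed by it.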

