[jira] [Commented] (PDFBOX-2272) Can't extract vertical text correctly
[ https://issues.apache.org/jira/browse/PDFBOX-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14625878#comment-14625878 ]
Tilman Hausherr commented on PDFBOX-2272:
Please submit this as a diff against the repository, so that we can easily see what is different.
Can't extract vertical text correctly
Key: PDFBOX-2272
URL: https://issues.apache.org/jira/browse/PDFBOX-2272
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 1.8.6, 2.0.0
Reporter: Biligsaikhan Batjargal
Attachments: PDFTextStripper.java, test.pdf, test.txt
- -1.8.6 can't extract the Unicode due to failing to map the UCS2 CMap for 90ms-RKSJ-V.-
- 2.0 extracts the text but can't handle the vertical layout
Also see the file from PDFBOX-2294, which contains both horizontal and vertical text.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
Re: Performance of the trunkversion
Manfred Pock pock.manf...@gmail.com wrote on 14 July 2015 at 12:15:
> Yes, the input is an InputStream. I can try it directly from a file. But in general we get the PDF from a document management system as a stream. Does it make sense for me to save the PDF to a file first?
If possible, yes. As I already said, we need random access to the PDF, and InputStream doesn't support seek operations, so we have to copy the whole stream to a file or to memory.
> Why is there such a big performance difference between the version from May and the current version if we use it with useScratchFiles = true?
I'm not sure, but the reason seems to be the altered scratch file handling. I have to double-check that.
BR Andreas
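The point about InputStream lacking seek can be illustrated with plain JDK classes. The following is a minimal sketch, not PDFBox code: the class name `StreamSpooler` and its methods are hypothetical, and it only shows copying a stream to a temporary file so that arbitrary offsets can be read. The resulting file could then be handed to the file-based load method mentioned in the thread.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper: spools a stream to disk so it can be read with random access. */
final class StreamSpooler {

    /** Copies the whole stream into a new temporary file and returns its path. */
    static Path spoolToTempFile(InputStream in) {
        try {
            Path tmp = Files.createTempFile("spooled-", ".pdf");
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads one byte at an arbitrary offset (returns -1 past the end of the file). */
    static int byteAt(Path file, long offset) {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(offset);
            return raf.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "%PDF-1.4 example bytes".getBytes(StandardCharsets.US_ASCII);
        Path tmp = spoolToTempFile(new ByteArrayInputStream(data));
        try {
            System.out.println((char) byteAt(tmp, 1)); // prints P
            System.out.println((char) byteAt(tmp, 0)); // prints % (seeking backwards works)
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

The trade-off is the one described in the reply: the copy costs one extra pass over the data, but afterwards seeks are cheap and memory usage stays low.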
Re: Performance of the trunkversion
Hi,
as I see it (I had only a quick look at the implementation), the ScratchFileBuffer implementation is not optimal for fast random access. Single-byte writes are not buffered but written directly to the file (a lot of I/O operations), and seek operations have to traverse the linked page list, reading some bytes of each page (again a lot of seek and read I/O operations).
To speed things up, it is crucial to minimize the number of I/O operations that go directly to the random access file. This requires buffering writes, keeping the last read page in memory for sequential reads, and having an in-memory cache of page metadata (offset, link to previous/next page).
Best, Timo
-- Timo Boehme OntoChem IT Solutions GmbH Blücherstraße 24 06120 Halle (Saale) Germany phone: +49 345 478 047 4 | fax: +49 345 478 047 1 email: ulf.la...@ontochem.com | web: www.ontochem.com HRB 21962 Amtsgericht Stendal | USt-IdNr.: DE815563824 managing director : Lutz Weber
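Two of the measures named above, buffering writes and keeping the last read page in memory, can be sketched as follows. This is an illustration under stated assumptions, not the PDFBox ScratchFileBuffer: the class `PagedBuffer`, the 4096-byte page size, and the `fileOps` counter are all hypothetical and exist only to make the reduction in file accesses visible.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch: page-sized write buffering plus a one-page read cache. */
final class PagedBuffer implements AutoCloseable {
    static final int PAGE_SIZE = 4096;             // assumed page size

    private final Path path;
    private final RandomAccessFile file;
    private final byte[] writePage = new byte[PAGE_SIZE];
    private int pending = 0;                       // buffered bytes not yet on disk
    private long flushed = 0;                      // bytes already written to the file
    private final byte[] readPage = new byte[PAGE_SIZE];
    private long cachedPage = -1;                  // index of the page held in readPage
    int fileOps = 0;                               // reads and writes that hit the file

    private PagedBuffer(Path path) throws IOException {
        this.path = path;
        this.file = new RandomAccessFile(path.toFile(), "rw");
    }

    static PagedBuffer onTempFile() {
        try {
            return new PagedBuffer(Files.createTempFile("paged-", ".tmp"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Appends one byte; the file is touched only once a full page has accumulated. */
    void write(int b) {
        writePage[pending++] = (byte) b;
        if (pending == PAGE_SIZE) {
            flush();
        }
    }

    /** Writes any buffered bytes to the file in a single operation. */
    void flush() {
        if (pending == 0) {
            return;
        }
        try {
            file.seek(flushed);
            file.write(writePage, 0, pending);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        fileOps++;
        flushed += pending;
        pending = 0;
        cachedPage = -1;                           // the cached page may now be stale
    }

    /** Returns the byte at pos or -1 past the end; reads within one page cost one file read. */
    int read(long pos) {
        flush();
        if (pos < 0 || pos >= flushed) {
            return -1;
        }
        long page = pos / PAGE_SIZE;
        if (page != cachedPage) {
            long start = page * PAGE_SIZE;
            int len = (int) Math.min(PAGE_SIZE, flushed - start);
            try {
                file.seek(start);
                file.readFully(readPage, 0, len);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            fileOps++;
            cachedPage = page;
        }
        return readPage[(int) (pos % PAGE_SIZE)] & 0xFF;
    }

    @Override
    public void close() {
        try {
            flush();
            file.close();
            Files.deleteIfExists(path);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        try (PagedBuffer buf = onTempFile()) {
            for (int i = 0; i < 10_000; i++) {
                buf.write(i);
            }
            buf.flush();
            System.out.println("file writes: " + buf.fileOps); // 3 rather than 10000
        }
    }
}
```

The sketch omits the page metadata cache; it only covers sequential appends and reads from one contiguous file.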
Re: Performance of the trunkversion
Hi,
Manfred Pock pock.manf...@gmail.com wrote on 14 July 2015 at 11:39:
> OK, we load the PDF with useScratchFiles = true; if we load it with false the performance is better, but still a little slower than the old version.
What do you use as input, a stream or a real file? If the latter, you should use the load method with the file parameter. PDFBox needs random access to the PDF, and if a stream is provided, PDFBox copies the data to a file (lower memory usage, slower performance) or to memory (higher memory usage, better performance).
BR Andreas
> But now it needs more memory. I cannot load some PDFs with the current version with the same Java memory configuration.
[jira] [Commented] (PDFBOX-2860) NonSeq parser slower than Seq parser
[ https://issues.apache.org/jira/browse/PDFBOX-2860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626187#comment-14626187 ]
simon steiner commented on PDFBOX-2860:
It sounds like we should use the NonSeq parser if we can pass a file, and the Seq parser for an InputStream.
NonSeq parser slower than Seq parser
Key: PDFBOX-2860
URL: https://issues.apache.org/jira/browse/PDFBOX-2860
Project: PDFBox
Issue Type: Bug
Components: Parsing
Affects Versions: 2.0.0
Reporter: simon steiner
PDF from PDFBOX-797
for (int i = 0; i < 1000; i++) { PDDocument.load(new FileInputStream("4218.pdf")).close(); }
Nonseq: real 0m23.691s
Seq: real 0m9.705s
Performance of the trunkversion
Hi, we use the PDFBox trunk version to render PDFs; currently we use the version from 12 May 2015. Today I updated to the current version and tested it. It now seems to need much more time to render PDFs, depending on the size of the PDF. For example, you can try this one: http://cloud.directupload.net/15bu It takes five times longer than with the version from May 2015. Regards, Manfred
Re: Performance of the trunkversion
Yes, the input is an InputStream. I can try it directly from a file. But in general we get the PDF from a document management system as a stream. Does it make sense for me to save the PDF to a file first? Why is there such a big performance difference between the version from May and the current version if we use it with useScratchFiles = true? Regards, Manfred
Understanding PDFBox
Hi All, I wish to contribute to Apache PDFBox, but first I am trying to understand the codebase. I am finding it very tough to understand, as I cannot find any flow to follow. Is there any documentation from which I can get some high-level insight into PDFBox? -- Best Regards, Subham Tripathi
Re: Performance of the trunkversion
Hi,
instead of having a linked page list in ScratchFileBuffer, I would propose keeping a list of pages with the page numbers (integers) in memory (takes 1k for 1MB data). This would ease page handling, seeking would not need I/O operations, and caching of pages would be a lot easier. I may find some time later to come up with such a replacement.
Best, Timo
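The proposal of an in-memory page list can be sketched in a few lines. This is a hypothetical illustration, not the eventual PDFBox implementation: the class `PageIndex` and the 4096-byte page size are assumptions. With that page size, 1 MB of data spans 256 pages, and at 4 bytes per page number the index takes 1024 bytes, which matches the "1k for 1MB" estimate.

```java
import java.util.Arrays;

/** Hypothetical sketch: logical-to-physical page mapping held entirely in memory. */
final class PageIndex {
    static final int PAGE_SIZE = 4096;  // assumed page size

    private int[] pages = new int[16];  // physical page number for each logical page
    private int count = 0;

    /** Records that the next logical page is stored in the given physical page of the file. */
    void addPage(int physicalPageNumber) {
        if (count == pages.length) {
            pages = Arrays.copyOf(pages, count * 2);
        }
        pages[count++] = physicalPageNumber;
    }

    /** Translates a logical position to a file offset with an array lookup and no I/O. */
    long fileOffset(long logicalPos) {
        if (logicalPos < 0 || logicalPos / PAGE_SIZE >= count) {
            throw new IndexOutOfBoundsException("position " + logicalPos);
        }
        int slot = (int) (logicalPos / PAGE_SIZE);
        return (long) pages[slot] * PAGE_SIZE + logicalPos % PAGE_SIZE;
    }

    /** Memory needed for the page numbers covering dataBytes of content. */
    static long indexBytes(long dataBytes) {
        long pageCount = (dataBytes + PAGE_SIZE - 1) / PAGE_SIZE;
        return pageCount * Integer.BYTES;
    }

    public static void main(String[] args) {
        PageIndex index = new PageIndex();
        index.addPage(7);
        index.addPage(2);
        System.out.println(index.fileOffset(4096 + 5));  // prints 8197 (page 2, byte 5)
        System.out.println(indexBytes(1L << 20));        // prints 1024
    }
}
```

Compared with a linked list stored in the pages themselves, a seek here costs one division and one array access instead of a chain of file reads.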
Re: Understanding PDFBox
Hi Subham,
I'm a GSoC student at PDFBox this year, improving PDFDebugger (issue https://issues.apache.org/jira/browse/PDFBOX-2530). Before applying for the project, I had to become familiar with the code base. I was a bit puzzled at first, but now I have a basic understanding of the code base, though I'm not coding for the main module of PDFBox. Here is what I've done so far to get comfortable with PDFBox:
- Read the PDF specification, or at least get a head start: https://www.adobe.com/devnet/pdf/pdf_reference.html
- Read the documentation: https://pdfbox.apache.org/docs/2.0.0-SNAPSHOT/javadocs/
- Play with the example code: https://svn.apache.org/viewvc/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/
Anyway, there are other things to do before you can contribute, which I think the committers can describe more specifically.
Regards, Khyrul Bashar
[jira] [Commented] (PDFBOX-2877) Wrong text placement for autosize fields compared to Adobe generated
[ https://issues.apache.org/jira/browse/PDFBOX-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626372#comment-14626372 ]
Maruan Sahyoun commented on PDFBOX-2877:
I made some progress, and the results using an Arial font are much better. The results using the Courier font are still far from the Adobe-generated results, so it looks like relying on the available font metrics such as BBox, ascender, descender ... is not enough. I will continue to investigate and post my findings. For the moment I will generate some more test material to ensure that we have better coverage of different fonts.
Wrong text placement for autosize fields compared to Adobe generated
Key: PDFBOX-2877
URL: https://issues.apache.org/jira/browse/PDFBOX-2877
Project: PDFBox
Issue Type: Sub-task
Components: AcroForm
Affects Versions: 1.8.9, 2.0.0
Reporter: Maruan Sahyoun
Assignee: Maruan Sahyoun
Labels: Appearance
Fix For: 2.0.0
Attachments: AutosizeTests-filled-20150713.pdf, AutosizeTests-filled-20150713.png, AutosizeTests.pdf
When a field uses autosizing, the generated appearance is wrong as
- the text is placed lower than expected
- the font size is too large compared to the appearance generated with Adobe tools
[jira] [Commented] (PDFBOX-2852) Improve code quality (2)
[ https://issues.apache.org/jira/browse/PDFBOX-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626537#comment-14626537 ]
ASF subversion and git services commented on PDFBOX-2852:
Commit 1691003 from [~tilman] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1691003 ]
PDFBOX-2852: remove unused imports
Improve code quality (2)
Key: PDFBOX-2852
URL: https://issues.apache.org/jira/browse/PDFBOX-2852
Project: PDFBox
Issue Type: Task
Affects Versions: 2.0.0
Reporter: Tilman Hausherr
This is a longterm issue for the task to improve code quality, by using the [SonarQube report|https://analysis.apache.org/dashboard/index/org.apache.pdfbox:pdfbox-reactor], hints in different IDEs, the FindBugs tool and other code quality tools. This is a follow-up of PDFBOX-2576, which was getting too long.
[jira] [Commented] (PDFBOX-2852) Improve code quality (2)
[ https://issues.apache.org/jira/browse/PDFBOX-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626541#comment-14626541 ]
ASF subversion and git services commented on PDFBOX-2852:
Commit 1691006 from [~tilman] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1691006 ]
PDFBOX-2852: use interface
Re: Understanding PDFBox
On 14.07.2015 at 12:58, Subham Tripathi wrote:
> Hi All, I wish to contribute to Apache PDFBox, but first I am trying to understand the codebase. I am finding it very tough to understand, as I cannot find any flow to follow. Is there any documentation from which I can get some high-level insight into PDFBox?
Look at the examples... and start from there. Then look at an unsolved issue :-)
If this is about getting coding practice, google for BATIK-1109 https://issues.apache.org/jira/browse/BATIK-1109 and BATIK-1110 https://issues.apache.org/jira/browse/BATIK-1110. One of the bugs can probably be fixed with a few lines (although some debugging is needed to see how signed/unsigned values are handled there); the other one involves using code from PDFBox, but in the way BATIK uses it. Both bugs have been fixed in PDFBox, but not in BATIK (from which PDFBox took some code).
Tilman
[jira] [Commented] (PDFBOX-2852) Improve code quality (2)
[ https://issues.apache.org/jira/browse/PDFBOX-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626552#comment-14626552 ]
ASF subversion and git services commented on PDFBOX-2852:
Commit 1691007 from [~tilman] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1691007 ]
PDFBOX-2852: use interface
[jira] [Commented] (PDFBOX-2852) Improve code quality (2)
[ https://issues.apache.org/jira/browse/PDFBOX-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626576#comment-14626576 ]
ASF subversion and git services commented on PDFBOX-2852:
Commit 1691024 from [~tilman] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1691024 ]
PDFBOX-2852: use interface
[jira] [Commented] (PDFBOX-2852) Improve code quality (2)
[ https://issues.apache.org/jira/browse/PDFBOX-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626589#comment-14626589 ]
ASF subversion and git services commented on PDFBOX-2852:
Commit 1691031 from [~tilman] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1691031 ]
PDFBOX-2852: use interface
[jira] [Commented] (PDFBOX-2842) Overhaul font substitution
[ https://issues.apache.org/jira/browse/PDFBOX-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626531#comment-14626531 ]
Tilman Hausherr commented on PDFBOX-2842:
{code}LOG.warn("New fonts found, font cache will be re-built");{code}
Shouldn't all these be info instead of warn?
Overhaul font substitution
Key: PDFBOX-2842
URL: https://issues.apache.org/jira/browse/PDFBOX-2842
Project: PDFBox
Issue Type: Improvement
Components: FontBox, PDModel
Affects Versions: 2.0.0
Reporter: John Hewson
Assignee: John Hewson
Priority: Blocker
Fix For: 2.0.0
Attachments: 029423-p1.pdf, 166292-fi-ligature.pdf
The improved font substitution mechanisms in 2.0 are not quite sufficient to handle all PDFs. Specifically, CJK substitution and substitution of TTF in place of CFF fonts is not possible with the current design. The CJK problems can be seen in PDFBOX-2509 and PDFBOX-2563, which does not solve the problem. Additional font API weaknesses can be found in PDFBOX-2578 and PDFBOX-2366. This meta-issue aims to address all of those sub-issues. The current problems are:
- FontBox does not provide a generic font type, so we have to handle TrueTypeFont, CFFFont, and Type1Font separately. This hinders cross-format substitution.
- ExternalFonts has no knowledge of the CIDSystemInfo, which is necessary for CJK substitution.
- FontProvider contains too much public logic which should be internal to PDFBox, e.g. substitution logic. This makes it brittle and means we won't be able to add additional logic after 2.0 is released, e.g. CJK substitution.
- Too much confusion about the role of ExternalFonts, particularly with regard to mapping of built-in fonts and the definition of substitute vs. fallback font.
- ExternalFonts is a black box: the user cannot tell whether the font returned is an exact match or a last-resort fallback.
- Confusing font substitution API; users preferred having a flat file format.
- PDSimpleFont#getEncoding() can return null for TTFs which use built-in encodings. This has caused a lot of bugs - there must be a better way.
- We still have some confusing names, for example a CustomEncoding is known as a built-in encoding in the spec.
- There is no fallback CFF font; we resort to AdobeBlank instead, which has no rendering.
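One way to address the "black box" item in the list above is to return the font together with a flag saying how it was found. The following is only a sketch of that idea; `FontMatch` and `FontLookup` are hypothetical names invented for illustration and are not taken from the issue or from the PDFBox API.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical result type: the font plus whether it is only a last-resort fallback. */
final class FontMatch<T> {
    final T font;
    final boolean fallback;

    FontMatch(T font, boolean fallback) {
        this.font = font;
        this.fallback = fallback;
    }
}

/** Hypothetical lookup that reports exact matches and fallbacks differently. */
final class FontLookup<T> {
    private final Map<String, T> byName = new HashMap<>();
    private final T lastResort;

    FontLookup(T lastResort) {
        this.lastResort = lastResort;
    }

    void register(String postScriptName, T font) {
        byName.put(postScriptName, font);
    }

    /** Returns an exact match if one is registered, otherwise the flagged last-resort font. */
    FontMatch<T> find(String postScriptName) {
        T exact = byName.get(postScriptName);
        if (exact != null) {
            return new FontMatch<>(exact, false);
        }
        return new FontMatch<>(lastResort, true);
    }

    public static void main(String[] args) {
        FontLookup<String> fonts = new FontLookup<>("LastResortFont");
        fonts.register("Arial", "ArialFont");
        System.out.println(fonts.find("Arial").fallback);    // prints false
        System.out.println(fonts.find("Unknown").fallback);  // prints true
    }
}
```

Making the type generic in the font class is one possible answer to the first item as well, since callers can ask for a specific font format or for any font.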
[jira] [Commented] (PDFBOX-2852) Improve code quality (2)
[ https://issues.apache.org/jira/browse/PDFBOX-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626538#comment-14626538 ]
ASF subversion and git services commented on PDFBOX-2852:
Commit 1691004 from [~tilman] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1691004 ]
PDFBOX-2852: use interface
[jira] [Commented] (PDFBOX-2852) Improve code quality (2)
[ https://issues.apache.org/jira/browse/PDFBOX-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626570#comment-14626570 ]
ASF subversion and git services commented on PDFBOX-2852:
Commit 1691019 from [~tilman] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1691019 ]
PDFBOX-2852: use interface
[jira] [Commented] (PDFBOX-2852) Improve code quality (2)
[ https://issues.apache.org/jira/browse/PDFBOX-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626587#comment-14626587 ]
ASF subversion and git services commented on PDFBOX-2852:
Commit 1691030 from [~tilman] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1691030 ]
PDFBOX-2852: use interface
[jira] [Commented] (PDFBOX-2852) Improve code quality (2)
[ https://issues.apache.org/jira/browse/PDFBOX-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626573#comment-14626573 ]
ASF subversion and git services commented on PDFBOX-2852:
Commit 1691020 from [~tilman] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1691020 ]
PDFBOX-2852: use interface
[jira] [Commented] (PDFBOX-2852) Improve code quality (2)
[ https://issues.apache.org/jira/browse/PDFBOX-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626578#comment-14626578 ]
ASF subversion and git services commented on PDFBOX-2852:
Commit 1691027 from [~tilman] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1691027 ]
PDFBOX-2852: use interface
[jira] [Commented] (PDFBOX-2852) Improve code quality (2)
[ https://issues.apache.org/jira/browse/PDFBOX-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626585#comment-14626585 ]
ASF subversion and git services commented on PDFBOX-2852:
Commit 1691028 from [~tilman] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1691028 ]
PDFBOX-2852: use interface
[jira] [Commented] (PDFBOX-2530) Improve PDFDebugger
[ https://issues.apache.org/jira/browse/PDFBOX-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626593#comment-14626593 ] ASF subversion and git services commented on PDFBOX-2530: - Commit 1691032 from [~tilman] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1691032 ] PDFBOX-2852, PDFBOX-2530: reduce code complexity; null parameter for setText() is legit Improve PDFDebugger --- Key: PDFBOX-2530 URL: https://issues.apache.org/jira/browse/PDFBOX-2530 Project: PDFBox Issue Type: Improvement Components: Utilities Affects Versions: 1.8.8, 2.0.0 Reporter: Tilman Hausherr Assignee: khyrul bashar Labels: gsoc2015 Attachments: Avoiding_NPE_for_null_Field_Type.diff, BracketsColorChooser.png, DeviceNCS.diff, FlagBitsPane-26-06-2015.diff, Flag_bits_showing_feature-redesigned.diff, Flag_bits_showing_feature.diff, K4SystemFontsNotEmbeded218.pdf, PDFDebugger_StatusBar.png, PDFDebugger_StatusBar_01.png, Parent_dictionary_type_checking_for__f__and__flags.diff, Stream_Showing_Feature.diff, indexedcs.diff, openSelectedPath.diff, parent_node_redirect.diff, parent_node_redirect_expand_disabled.diff, removed_redundant_codes.patch, separationCS.diff, sonarqube_warning_resolve.diff, tree.diff, treestatus.diff, treestatuspane.diff (This is an idea for the [Google Summer of Code 2015|https://www.google-melange.com/]) Our command line utility PDFDebugger (part of the command line pdfbox-app get it [here|https://pdfbox.apache.org/downloads.html], read description [here|https://pdfbox.apache.org/commandline/], see the source code [here|https://svn.apache.org/viewvc/pdfbox/trunk/tools/src/main/java/org/apache/pdfbox/tools/PDFDebugger.java?view=markupsortby=date]) needs some improvements: - hex view - view of non printable characters - ✓ saving streams - binary copy paste - ✓ Create a status line that shows where we are in the tree. 
(Like in the Windows REGEDIT) - ✓ Copy the current tree string into the clipboard (useful in discussions about details of a PDF) - ✓ (Optional, not sure if easy) Jump to specific place in the tree by entering tree string - ✓ ability to search in streams (very useful for content streams and meta data) - ✓ show images that are streams - ✓ show PDIndexed color lookup table, show the index value, the base and RGB color value sets when the mouse moves - ✓ show PDSeparation color - ✓ show PDDeviceN colors - optional, idea should be developed a bit: show meaningful explanation on some attributes, e.g. appearance stream when hovering over /AP - show font encodings and characters - ✓ display flag bits (e.g. Annotation flags) in a way that is easy to understand. There are probably others, I assume that the main work needs to be done only once - edit attributes (should be possible to enter values as decimal, hex or binary) - edit streams, while keeping or changing the compression filter - save altered PDF - color mark of certain PDF operators, especially Q...q and text operators (BT...ET). Ideally, it should help the user understand the bracketing of these operators, i.e. understand where a sequence starts and where it ends. (See operator summary in the PDF Spec) Other important operators I can think of are the matrix, font and color operators. A cool advanced thing would be to show the current color or the font in a popup when hovering above such an operator. To see a product with a similar purpose that is better than PDFDebugger, watch [this video|https://www.youtube.com/watch?v=g-QcU9B4qMc]. I'm not asking to implement a clone of that product (I don't use it, all I know is that video), but we at PDFBox really need something that makes PDF debugging easier. As an example of how the current PDFDebugger prevented me from finding a bug quickly, see PDFBOX-2401 and search for PDFDebugger. 
Prerequisites: - java programming, especially the GUI components - the ability to understand existing source code Using external software components is possible (must have Apache License or a compatible one), but should be decided on a case-by-case basis, we don't want to get too big. Development strategy: go from the easy to the difficult. The wished features are already sorted this way (mostly). Get introduced: [download the source code with svn|https://pdfbox.apache.org/downloads.html#scm] and build it with maven. Run PDFDebugger and view some PDFs to see the components of a PDF. Start with the file of PDFBOX-2401. Read up something about the structure of PDF on the web or from the [PDF Specification|https://www.adobe.com/devnet/pdf/pdf_reference.html]. Mentor: Tilman Hausherr (European timezone, languages:
[jira] [Commented] (PDFBOX-2852) Improve code quality (2)
[ https://issues.apache.org/jira/browse/PDFBOX-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626592#comment-14626592 ] ASF subversion and git services commented on PDFBOX-2852: Commit 1691032 from [~tilman] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1691032 ] PDFBOX-2852, PDFBOX-2530: reduce code complexity; null parameter for setText() is legit
Re: first stack trace report from pdfbox 2.0.0 trunk
Hi Tim,

Currently there is at least one known regression, mentioned in PDFBOX-2842. It applies to 029423 but also to other files.

Tilman

On 10.07.2015 at 13:57, Allison, Timothy B. wrote:

All,

I just posted the first stacktrace report from my initial partial batch run against govdocs1 here: https://issues.apache.org/jira/secure/attachment/12744700/pdfbox_reports_2_0_0_20150709.zip

Caveats/Notes
The run yesterday did not include the fixes that were made in PDFBOX-2370 or PDFBOX-2862.
I stopped the batch run early. This only covered ~50k pdfs.
I forgot to turn on accesspermission checking. Some of the pdfs in here would normally have been skipped.
I haven't reviewed any of the exceptions. They may be caused by code on the Tika side.

I'll plan to re-run with the latest trunk on Tuesday. I need to turn back to the actual eval code for a bit. :)

Cheers,
Tim
[jira] [Commented] (PDFBOX-2530) Improve PDFDebugger
[ https://issues.apache.org/jira/browse/PDFBOX-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626647#comment-14626647 ] Tilman Hausherr commented on PDFBOX-2530: There is a new bug (class cast exception) when clicking on a page content stream when in page mode. Although the bug is new, I assume that the root cause (a MapEntry with a MapEntry) is older.
[jira] [Commented] (PDFBOX-2860) NonSeq parser slower than Seq parser
[ https://issues.apache.org/jira/browse/PDFBOX-2860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626658#comment-14626658 ] Tilman Hausherr commented on PDFBOX-2860: No, you can't, because the sequential parser no longer exists in 2.0. It was not correct.

NonSeq parser slower than Seq parser
Key: PDFBOX-2860 URL: https://issues.apache.org/jira/browse/PDFBOX-2860 Project: PDFBox Issue Type: Bug Components: Parsing Affects Versions: 2.0.0 Reporter: simon steiner

PDF from PDFBOX-797
for (int i=0; i<1000; i++) { PDDocument.load(new FileInputStream("4218.pdf")).close(); }
Nonseq: real 0m23.691s
Seq: real 0m9.705s
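The loop in the PDFBOX-2860 report measures whole-process time with the shell's `time`. A minimal sketch of an in-process timing harness is given here, in case it helps separate JVM start-up from parsing time. The class name `LoadBenchmark` and the method `timeMillis` are made up for illustration and are not part of PDFBox; the workload in `main` is a stand-in counter, not a PDF load.

```java
import java.util.concurrent.Callable;

// Hypothetical helper, not part of PDFBox: times a repeated task in-process.
class LoadBenchmark {

    /** Runs the task the given number of times and returns the elapsed wall-clock milliseconds. */
    static <T> long timeMillis(int iterations, Callable<T> task) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            try {
                task.call();
            } catch (Exception e) {
                // Abort the run on the first failure so a broken file does not skew the timing.
                throw new IllegalStateException("iteration " + i + " failed", e);
            }
        }
        return (System.nanoTime() - start) / 1_000_000L;
    }

    public static void main(String[] args) {
        // Stand-in workload. With PDFBox on the classpath, the lambda body could
        // instead load and close the document as in the report above.
        final int[] calls = {0};
        long elapsed = timeMillis(1000, () -> {
            calls[0]++;
            return null;
        });
        System.out.println(calls[0] + " iterations in " + elapsed + " ms");
    }
}
```

Timing inside the JVM leaves out start-up and class loading, so the numbers are not directly comparable to the `real` values quoted in the report.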
[jira] [Comment Edited] (PDFBOX-2530) Improve PDFDebugger
[ https://issues.apache.org/jira/browse/PDFBOX-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626647#comment-14626647 ] Tilman Hausherr edited comment on PDFBOX-2530 at 7/14/15 4:57 PM: There is a new bug (class cast exception) when clicking on a page content stream when in show pages mode. Although the bug is new, I assume that the root cause (a MapEntry with a MapEntry) is older. was (Author: tilman): There is a new bug (class cast exception) when clicking on a page content stream when in page mode. Although the bug is new, I assume that the root cause (a MapEntry with a MapEntry) is older.
Re: Understanding PDFBox
The book “Developing with PDF” provides a short and gentle introduction to the PDF format. We have a brief architectural summary of PDFBox at: http://pdfbox.apache.org/1.8/architecture.html But in general, to make sense of PDFBox, you'll need to understand the PDF spec.

John

On 14 Jul 2015, at 03:58, Subham Tripathi subham@gmail.com wrote:

Hi All, I wish to contribute to Apache PDFBox, but first I am trying to understand the codebase. I am finding it very hard to follow, as I cannot find any flow through the code. Is there any documentation from which I can get a high-level overview of PDFBox?

-- Best Regards, Subham Tripathi
[jira] [Commented] (PDFBOX-2842) Overhaul font substitution
[ https://issues.apache.org/jira/browse/PDFBOX-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626822#comment-14626822 ] John Hewson commented on PDFBOX-2842: I'm not sure. My thinking was that I really want users to see these messages and it's common for users to not log at the INFO level. Re-building the font cache is slow (up to 10sec) and possibly unexpected, so it seemed like a legitimate warning, explaining unusually slow behaviour.

Overhaul font substitution
Key: PDFBOX-2842 URL: https://issues.apache.org/jira/browse/PDFBOX-2842 Project: PDFBox Issue Type: Improvement Components: FontBox, PDModel Affects Versions: 2.0.0 Reporter: John Hewson Assignee: John Hewson Priority: Blocker Fix For: 2.0.0 Attachments: 029423-p1.pdf, 166292-fi-ligature.pdf

The improved font substitution mechanisms in 2.0 are not quite sufficient to handle all PDFs. Specifically, CJK substitution and substitution of TTF in place of CFF fonts is not possible with the current design. The CJK problems can be seen in PDFBOX-2509 and PDFBOX-2563, which does not solve the problem. Additional font API weaknesses can be found in PDFBOX-2578 and PDFBOX-2366. This meta-issue aims to address all of those sub-issues.

The current problems are:
- FontBox does not provide a generic font type, so we have to handle TrueTypeFont, CFFFont, and Type1Font separately. This hinders cross-format substitution.
- ExternalFonts has no knowledge of the CIDSystemInfo which is necessary for CJK substitution.
- FontProvider contains too much public logic which should be internal to PDFBox, e.g. substitution logic. This makes it brittle and means we won't be able to add additional logic after 2.0 is released, e.g. CJK substitution.
- Too much confusion about the role of ExternalFonts, particularly with regards to mapping of built-in fonts and the definition of substitute vs. fallback font.
- ExternalFonts is a black box: the user cannot tell whether the font returned is an exact match, or a last-resort fallback.
- Confusing font substitution API, users preferred having a flat file format.
- PDSimpleFont#getEncoding() can return null for TTFs which use built-in encodings. This has caused a lot of bugs, so there must be a better way.
- We still have some confusing names, for example a CustomEncoding is known as a built-in encoding in the spec.
- There is no fallback CFF font, we resort to AdobeBlank instead, which has no rendering.
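One of the problems listed above is that the caller cannot tell an exact match from a last-resort fallback. A minimal sketch of one way a lookup result could carry that information follows. Everything in it is hypothetical and for illustration only: `FontMatch`, `FontLookup`, the string-based font names and the chosen fallback name are assumptions, not the API proposed or implemented in PDFBOX-2842.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical result type: the chosen font plus whether it is only a fallback. */
final class FontMatch {
    final String fontName;
    final boolean fallback;

    FontMatch(String fontName, boolean fallback) {
        this.fontName = fontName;
        this.fallback = fallback;
    }
}

/** Hypothetical lookup that stands in for a font provider; fonts are plain names here. */
final class FontLookup {
    private final Map<String, String> installed = new HashMap<>();
    private final String lastResort;

    FontLookup(String lastResort) {
        this.lastResort = lastResort;
    }

    void register(String postScriptName, String fontFile) {
        installed.put(postScriptName, fontFile);
    }

    /** Returns an exact match when one is registered, otherwise the flagged last-resort font. */
    FontMatch find(String postScriptName) {
        String exact = installed.get(postScriptName);
        if (exact != null) {
            return new FontMatch(exact, false);
        }
        return new FontMatch(lastResort, true);
    }

    public static void main(String[] args) {
        FontLookup lookup = new FontLookup("LastResort.ttf");
        lookup.register("Helvetica", "Helvetica.ttf");
        FontMatch hit = lookup.find("Helvetica");
        FontMatch miss = lookup.find("UnknownFont");
        System.out.println(hit.fontName + " fallback=" + hit.fallback);
        System.out.println(miss.fontName + " fallback=" + miss.fallback);
    }
}
```

Returning the flag with the font lets a caller decide for itself whether to log a warning or reject the substitution, instead of the provider deciding silently.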
[jira] [Commented] (PDFBOX-2530) Improve PDFDebugger
[ https://issues.apache.org/jira/browse/PDFBOX-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626958#comment-14626958 ] John Hewson commented on PDFBOX-2530: - JComboBox doesn't have a type parameter in Java 1.6. Improve PDFDebugger --- Key: PDFBOX-2530 URL: https://issues.apache.org/jira/browse/PDFBOX-2530 Project: PDFBox Issue Type: Improvement Components: Utilities Affects Versions: 1.8.8, 2.0.0 Reporter: Tilman Hausherr Assignee: khyrul bashar Labels: gsoc2015 Attachments: Avoiding_NPE_for_null_Field_Type.diff, BracketsColorChooser.png, Class_cast_exception_in_page_mode_avoided.diff, DeviceNCS.diff, FlagBitsPane-26-06-2015.diff, Flag_bits_showing_feature-redesigned.diff, Flag_bits_showing_feature.diff, K4SystemFontsNotEmbeded218.pdf, PDFDebugger_StatusBar.png, PDFDebugger_StatusBar_01.png, Parent_dictionary_type_checking_for__f__and__flags.diff, Sonarqube_warning_resolved.diff, Stream_Showing_Feature.diff, indexedcs.diff, openSelectedPath.diff, parent_node_redirect.diff, parent_node_redirect_expand_disabled.diff, removed_redundant_codes.patch, separationCS.diff, sonarqube_warning_resolve.diff, tree.diff, treestatus.diff, treestatuspane.diff (This is an idea for the [Google Summer of Code 2015|https://www.google-melange.com/]) Our command line utility PDFDebugger (part of the command line pdfbox-app get it [here|https://pdfbox.apache.org/downloads.html], read description [here|https://pdfbox.apache.org/commandline/], see the source code [here|https://svn.apache.org/viewvc/pdfbox/trunk/tools/src/main/java/org/apache/pdfbox/tools/PDFDebugger.java?view=markupsortby=date]) needs some improvements: - hex view - view of non printable characters - ✓ saving streams - binary copy paste - ✓ Create a status line that shows where we are in the tree. 
(Like in the Windows REGEDIT) - ✓ Copy the current tree string into the clipboard (useful in discussions about details of a PDF) - ✓ (Optional, not sure if easy) Jump to specific place in the tree by entering tree string - ✓ ability to search in streams (very useful for content streams and meta data) - ✓ show images that are streams - ✓ show PDIndexed color lookup table, show the index value, the base and RGB color value sets when the mouse moves - ✓ show PDSeparation color - ✓ show PDDeviceN colors - optional, idea should be developed a bit: show meaningful explanation on some attributes, e.g. appearance stream when hovering over /AP - show font encodings and characters - ✓ display flag bits (e.g. Annotation flags) in a way that is easy to understand. There are probably others, I assume that the main work needs to be done only once - edit attributes (should be possible to enter values as decimal, hex or binary) - edit streams, while keeping or changing the compression filter - save altered PDF - color mark of certain PDF operators, especially Q...q and text operators (BT...ET). Ideally, it should help the user understand the bracketing of these operators, i.e. understand where a sequence starts and where it ends. (See operator summary in the PDF Spec) Other important operators I can think of are the matrix, font and color operators. A cool advanced thing would be to show the current color or the font in a popup when hovering above such an operator. To see a product with a similar purpose that is better than PDFDebugger, watch [this video|https://www.youtube.com/watch?v=g-QcU9B4qMc]. I'm not asking to implement a clone of that product (I don't use it, all I know is that video), but we at PDFBox really need something that makes PDF debugging easier. As an example of how the current PDFDebugger prevented me from finding a bug quickly, see PDFBOX-2401 and search for PDFDebugger. 
Prerequisites: - java programming, especially the GUI components - the ability to understand existing source code Using external software components is possible (must have Apache License or a compatible one), but should be decided on a case-by-case basis, we don't want to get too big. Development strategy: go from the easy to the difficult. The wished features are already sorted this way (mostly). Get introduced: [download the source code with svn|https://pdfbox.apache.org/downloads.html#scm] and build it with maven. Run PDFDebugger and view some PDFs to see the components of a PDF. Start with the file of PDFBOX-2401. Read up something about the structure of PDF on the web or from the [PDF Specification|https://www.adobe.com/devnet/pdf/pdf_reference.html]. Mentor: Tilman Hausherr (European timezone, languages: german, english, french). To see the GSoC2014 project I mentored, go to PDFBOX-1915.
[jira] [Comment Edited] (PDFBOX-2272) Can't extract vertical text correctly
[ https://issues.apache.org/jira/browse/PDFBOX-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626972#comment-14626972 ] John Hewson edited comment on PDFBOX-2272 at 7/14/15 8:09 PM: -- No, diff and SVN patch are different formats. They're similar in theory but not compatible. Both can be applied with patch though. was (Author: jahewson): No, diff and SVN patch are different formats. They're similar in theory but not compatible. Can't extract vertical text correctly - Key: PDFBOX-2272 URL: https://issues.apache.org/jira/browse/PDFBOX-2272 Project: PDFBox Issue Type: Bug Components: Text extraction Affects Versions: 1.8.6, 2.0.0 Reporter: Biligsaikhan Batjargal Attachments: PDFTextStripper.java, test.pdf, test.txt, vertical.diff - -1.8.6 can't extract the Unicode due to failing to map the UCS2 CMap for 90ms-RKSJ-V.- - 2.0 extracts the text but can't handle the vertical layout Also see the file from PDFBOX-2294 which contains both horizontal and vertical text. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Comment Edited] (PDFBOX-2272) Can't extract vertical text correctly
[ https://issues.apache.org/jira/browse/PDFBOX-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626972#comment-14626972 ] John Hewson edited comment on PDFBOX-2272 at 7/14/15 8:09 PM: No, diff and SVN patch are different formats. They're similar in theory but not compatible. Both can be applied with patch though. For example, IntelliJ can't apply your diff. was (Author: jahewson): No, diff and SVN patch are different formats. They're similar in theory but not compatible. Both can be applied with patch though.
[jira] [Resolved] (PDFBOX-2881) Radial and Axial shading steps are calculated incorrectly
[ https://issues.apache.org/jira/browse/PDFBOX-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson resolved PDFBOX-2881.
Resolution: Fixed

Radial and Axial shading steps are calculated incorrectly
Key: PDFBOX-2881 URL: https://issues.apache.org/jira/browse/PDFBOX-2881 Project: PDFBox Issue Type: Bug Components: Rendering Affects Versions: 2.0.0 Reporter: John Hewson Assignee: John Hewson Fix For: 2.0.0

I found a shading bug while writing some code to dump all shadings in a PDF. I don't know if this affects PDF rendering within PageDrawer or not. RadialShadingContext and AxialShadingContext use the following code in their constructors to calculate the number of steps (pixels) in the shading and build a lookup table for each step:

{code}
// transform the distance to actual pixel space
// use transform, because xform.getScaleX() does not return correct scaling on 90° rotated matrix
Point2D point = new Point2D.Double(longestDistance, longestDistance);
matrix.transform(point);
xform.transform(point, point);
factor = (int) Math.max(Math.abs(point.getX()), Math.abs(point.getY()));
colorTable = calcColorTable();
{code}

The variable factor is the number of steps and matrix is the parent stream's matrix + the pattern matrix, so this code is taking the current scale and assuming that that is equal to the number of pixels. This works when a pattern is painted onto a 0...1 scaled surface, but otherwise it produces incorrect results. There's no way to calculate the number of pixels in the device from its scale, or its matrix. Paint#createContext() provides the device bounds Rectangle, which is what we should be using. Indeed, this is handled correctly in the other shading contexts.
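The description above suggests deriving the number of steps from the device bounds handed to Paint#createContext() rather than from the matrix scale. A minimal sketch of that idea follows. It is not the code committed for PDFBOX-2881: the class `ShadingSteps`, both method names, the use of the bounds diagonal as the step count and the two-colour linear interpolation are all assumptions made for illustration.

```java
import java.awt.Color;
import java.awt.Rectangle;

// Hypothetical helper, not the PDFBox implementation.
class ShadingSteps {

    /** Step count taken from the device bounds: the diagonal in pixels, never less than 1. */
    static int stepCount(Rectangle deviceBounds) {
        double diagonal = Math.hypot(deviceBounds.getWidth(), deviceBounds.getHeight());
        return Math.max(1, (int) Math.ceil(diagonal));
    }

    /** Builds a packed-RGB lookup table with steps + 1 entries, interpolated linearly. */
    static int[] colorTable(Color start, Color end, int steps) {
        int[] table = new int[steps + 1];
        for (int i = 0; i <= steps; i++) {
            float t = (float) i / steps;
            int r = Math.round(start.getRed() + t * (end.getRed() - start.getRed()));
            int g = Math.round(start.getGreen() + t * (end.getGreen() - start.getGreen()));
            int b = Math.round(start.getBlue() + t * (end.getBlue() - start.getBlue()));
            table[i] = (r << 16) | (g << 8) | b;
        }
        return table;
    }

    public static void main(String[] args) {
        // Stand-in for the bounds that Paint#createContext() would supply.
        Rectangle deviceBounds = new Rectangle(0, 0, 300, 400);
        int steps = stepCount(deviceBounds);
        int[] table = colorTable(Color.BLACK, Color.WHITE, steps);
        System.out.println(steps + " steps, " + table.length + " table entries");
    }
}
```

The diagonal is used here because no gradient axis inside the bounds can span more pixels than that, so the table has at least one entry per pixel along the axis. The longer side of the rectangle would be another plausible choice.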
[jira] [Commented] (PDFBOX-2881) Radial and Axial shading steps are calculated incorrectly
[ https://issues.apache.org/jira/browse/PDFBOX-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14627111#comment-14627111 ] ASF subversion and git services commented on PDFBOX-2881: Commit 1691093 from [~jahewson] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1691093 ] PDFBOX-2881: Calculate the number of steps using the device bounds
RE: first stack trace report from pdfbox 2.0.0 trunk
Interesting, yes: 781/781172.pdf, 490/490376.pdf and 029/029423.pdf.

Are you running your own regression testing against govdocs1? Is it duplicated effort for me to do anything with 2.0.0? Or, is your point that I should wait until PDFBOX-2842 is completed?

Thank you!

Best,

Tim

-----Original Message-----
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Tuesday, July 14, 2015 12:47 PM
To: dev@pdfbox.apache.org
Subject: Re: first stack trace report from pdfbox 2.0.0 trunk

Hi Tim,

Currently there is at least one known regression, mentioned in PDFBOX-2842, it applies to 029423 but also to other files.

Tilman

On 10.07.2015 at 13:57, Allison, Timothy B. wrote:

All,

I just posted the first stacktrace report from my initial partial batch run against govdocs1 here: https://issues.apache.org/jira/secure/attachment/12744700/pdfbox_reports_2_0_0_20150709.zip

Caveats/Notes
- The run yesterday did not include the fixes that were made in PDFBOX-2370 or PDFBOX-2862.
- I stopped the batch run early. This only covered ~50k pdfs.
- I forgot to turn on accesspermission checking. Some of the pdfs in here would normally have been skipped.
- I haven't reviewed any of the exceptions. They may be caused by code on the Tika side.

I'll plan to re-run with the latest trunk on Tuesday. I need to turn back to the actual eval code for a bit. :)

Cheers,

Tim
[jira] [Updated] (PDFBOX-2272) Can't extract vertical text correctly
[ https://issues.apache.org/jira/browse/PDFBOX-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-2272:

Attachment: (was: vertical.diff)

Can't extract vertical text correctly

Key: PDFBOX-2272
URL: https://issues.apache.org/jira/browse/PDFBOX-2272
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 1.8.6, 2.0.0
Reporter: Biligsaikhan Batjargal
Attachments: PDFTextStripper.java, test.pdf, test.txt, vertical.patch

- -1.8.6 can't extract the Unicode due to failing to map the UCS2 CMap for 90ms-RKSJ-V.-
- 2.0 extracts the text but can't handle the vertical layout

Also see the file from PDFBOX-2294 which contains both horizontal and vertical text.
[jira] [Commented] (PDFBOX-2530) Improve PDFDebugger
[ https://issues.apache.org/jira/browse/PDFBOX-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626939#comment-14626939 ]

Tilman Hausherr commented on PDFBOX-2530:

Why this?

{code}
-JComboBox filters = new JComboBox<String>(availableFilters);
+JComboBox filters = new JComboBox(availableFilters);
{code}

Improve PDFDebugger

Key: PDFBOX-2530
URL: https://issues.apache.org/jira/browse/PDFBOX-2530
Project: PDFBox
Issue Type: Improvement
Components: Utilities
Affects Versions: 1.8.8, 2.0.0
Reporter: Tilman Hausherr
Assignee: khyrul bashar
Labels: gsoc2015
Attachments: Avoiding_NPE_for_null_Field_Type.diff, BracketsColorChooser.png, Class_cast_exception_in_page_mode_avoided.diff, DeviceNCS.diff, FlagBitsPane-26-06-2015.diff, Flag_bits_showing_feature-redesigned.diff, Flag_bits_showing_feature.diff, K4SystemFontsNotEmbeded218.pdf, PDFDebugger_StatusBar.png, PDFDebugger_StatusBar_01.png, Parent_dictionary_type_checking_for__f__and__flags.diff, Sonarqube_warning_resolved.diff, Stream_Showing_Feature.diff, indexedcs.diff, openSelectedPath.diff, parent_node_redirect.diff, parent_node_redirect_expand_disabled.diff, removed_redundant_codes.patch, separationCS.diff, sonarqube_warning_resolve.diff, tree.diff, treestatus.diff, treestatuspane.diff

(This is an idea for the [Google Summer of Code 2015|https://www.google-melange.com/])

Our command line utility PDFDebugger (part of the command line pdfbox-app get it [here|https://pdfbox.apache.org/downloads.html], read description [here|https://pdfbox.apache.org/commandline/], see the source code [here|https://svn.apache.org/viewvc/pdfbox/trunk/tools/src/main/java/org/apache/pdfbox/tools/PDFDebugger.java?view=markupsortby=date]) needs some improvements:
- hex view
- view of non printable characters
- ✓ saving streams
- binary copy & paste
- ✓ Create a status line that shows where we are in the tree. (Like in the Windows REGEDIT)
- ✓ Copy the current tree string into the clipboard (useful in discussions about details of a PDF)
- ✓ (Optional, not sure if easy) Jump to specific place in the tree by entering tree string
- ✓ ability to search in streams (very useful for content streams and meta data)
- ✓ show images that are streams
- ✓ show PDIndexed color lookup table, show the index value, the base and RGB color value sets when the mouse moves
- ✓ show PDSeparation color
- ✓ show PDDeviceN colors
- optional, idea should be developed a bit: show meaningful explanation on some attributes, e.g. appearance stream when hovering over /AP
- show font encodings and characters
- ✓ display flag bits (e.g. Annotation flags) in a way that is easy to understand. There are probably others, I assume that the main work needs to be done only once
- edit attributes (should be possible to enter values as decimal, hex or binary)
- edit streams, while keeping or changing the compression filter
- save altered PDF
- color mark of certain PDF operators, especially q...Q and text operators (BT...ET). Ideally, it should help the user understand the bracketing of these operators, i.e. understand where a sequence starts and where it ends. (See operator summary in the PDF Spec) Other important operators I can think of are the matrix, font and color operators. A cool advanced thing would be to show the current color or the font in a popup when hovering above such an operator.

To see a product with a similar purpose that is better than PDFDebugger, watch [this video|https://www.youtube.com/watch?v=g-QcU9B4qMc]. I'm not asking to implement a clone of that product (I don't use it, all I know is that video), but we at PDFBox really need something that makes PDF debugging easier. As an example of how the current PDFDebugger prevented me from finding a bug quickly, see PDFBOX-2401 and search for PDFDebugger.

Prerequisites:
- java programming, especially the GUI components
- the ability to understand existing source code

Using external software components is possible (must have Apache License or a compatible one), but should be decided on a case-by-case basis, we don't want to get too big.

Development strategy: go from the easy to the difficult. The wished features are already sorted this way (mostly).

Get introduced: [download the source code with svn|https://pdfbox.apache.org/downloads.html#scm] and build it with maven. Run PDFDebugger and view some PDFs to see the components of a PDF. Start with the file of PDFBOX-2401. Read up something about the structure of PDF on the web or from the [PDF Specification|https://www.adobe.com/devnet/pdf/pdf_reference.html].

Mentor: Tilman Hausherr (European
[jira] [Commented] (PDFBOX-2530) Improve PDFDebugger
[ https://issues.apache.org/jira/browse/PDFBOX-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626954#comment-14626954 ]

ASF subversion and git services commented on PDFBOX-2530:

Commit 1691068 from [~jahewson] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1691068 ]

PDFBOX-2530: Show filters in menu labels

Improve PDFDebugger

Key: PDFBOX-2530
URL: https://issues.apache.org/jira/browse/PDFBOX-2530
[jira] [Commented] (PDFBOX-2272) Can't extract vertical text correctly
[ https://issues.apache.org/jira/browse/PDFBOX-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626998#comment-14626998 ]

John Hewson commented on PDFBOX-2272:

The patch looks too large to me. Why has handleTextPosition been created? It seems unnecessary.

Can't extract vertical text correctly

Key: PDFBOX-2272
URL: https://issues.apache.org/jira/browse/PDFBOX-2272
Attachments: PDFTextStripper.java, test.pdf, test.txt, vertical.patch
[jira] [Commented] (PDFBOX-2530) Improve PDFDebugger
[ https://issues.apache.org/jira/browse/PDFBOX-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626928#comment-14626928 ]

Tilman Hausherr commented on PDFBOX-2530:

How about checking if there is a mask, and offer both? I.e. with the mask as default, and optionally image without mask.

Re printStackTrace - I agree, as long as the exception appears at some time.

Improve PDFDebugger

Key: PDFBOX-2530
URL: https://issues.apache.org/jira/browse/PDFBOX-2530
[jira] [Commented] (PDFBOX-2272) Can't extract vertical text correctly
[ https://issues.apache.org/jira/browse/PDFBOX-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626889#comment-14626889 ]

John Hewson commented on PDFBOX-2272:

That's not an SVN patch...

Can't extract vertical text correctly

Key: PDFBOX-2272
URL: https://issues.apache.org/jira/browse/PDFBOX-2272
Attachments: PDFTextStripper.java, test.pdf, test.txt, vertical.diff
[jira] [Commented] (PDFBOX-2272) Can't extract vertical text correctly
[ https://issues.apache.org/jira/browse/PDFBOX-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626972#comment-14626972 ]

John Hewson commented on PDFBOX-2272:

No, diff and SVN patch are different formats. They're similar in theory but not compatible.

Can't extract vertical text correctly

Key: PDFBOX-2272
URL: https://issues.apache.org/jira/browse/PDFBOX-2272
Attachments: PDFTextStripper.java, test.pdf, test.txt, vertical.diff
[jira] [Commented] (PDFBOX-2272) Can't extract vertical text correctly
[ https://issues.apache.org/jira/browse/PDFBOX-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626997#comment-14626997 ]

John Hewson commented on PDFBOX-2272:

That's ok, I already applied it using the command line.

Can't extract vertical text correctly

Key: PDFBOX-2272
URL: https://issues.apache.org/jira/browse/PDFBOX-2272
Attachments: PDFTextStripper.java, test.pdf, test.txt, vertical.patch
[jira] [Reopened] (PDFBOX-2842) Overhaul font substitution
[ https://issues.apache.org/jira/browse/PDFBOX-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr reopened PDFBOX-2842:

Reopening, the regression from July 2 on 029423-p1.pdf has been missed :-(

Overhaul font substitution

Key: PDFBOX-2842
URL: https://issues.apache.org/jira/browse/PDFBOX-2842
Project: PDFBox
Issue Type: Improvement
Components: FontBox, PDModel
Affects Versions: 2.0.0
Reporter: John Hewson
Assignee: John Hewson
Priority: Blocker
Fix For: 2.0.0
Attachments: 029423-p1.pdf, 166292-fi-ligature.pdf

The improved font substitution mechanisms in 2.0 are not quite sufficient to handle all PDFs. Specifically, CJK substitution and substitution of TTF in place of CFF fonts is not possible with the current design. The CJK problems can be seen in PDFBOX-2509 and PDFBOX-2563, which does not solve the problem. Additional font API weaknesses can be found in PDFBOX-2578 and PDFBOX-2366. This meta-issue aims to address all of those sub-issues.

The current problems are:
- FontBox does not provide a generic font type, so we have to handle TrueTypeFont, CFFFont, and Type1Font separately. This hinders cross-format substitution.
- ExternalFonts has no knowledge of the CIDSystemInfo which is necessary for CJK substitution
- FontProvider contains too much public logic which should be internal to PDFBox, e.g. substitution logic, this makes it brittle and means we won't be able to add additional logic after 2.0 is released, e.g. CJK substitution.
- Too much confusion about the role of ExternalFonts, particularly with regards to mapping of built-in fonts and the definition of substitute vs. fallback font.
- ExternalFonts is a black box: the user cannot tell whether the font returned is an exact match, or a last-resort fallback.
- Confusing font substitution API, users preferred having a flat file format
- PDSimpleFont#getEncoding() can return null for TTFs which use built-in encodings. This has caused a lot of bugs - there must be a better way.
- We still have some confusing names, for example a CustomEncoding is known as a built-in encoding in the spec.
- There is no fallback CFF font, we resort to AdobeBlank instead, which has no rendering.
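The "black box" complaint in the list above can be illustrated with a small sketch. Everything in it is hypothetical: FontLookupSketch, MappedFont and the plain string "fonts" are invented for illustration and are not the PDFBox API or the design that was eventually chosen. It only shows one way a lookup result could tell the caller whether it is an exact match or a last-resort fallback.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in PDFBox.
final class FontLookupSketch {

    // Result type that carries the chosen font together with a flag
    // saying whether it is merely a fallback.
    static final class MappedFont {
        final String fontFile;
        final boolean fallback;

        MappedFont(String fontFile, boolean fallback) {
            this.fontFile = fontFile;
            this.fallback = fallback;
        }
    }

    private final Map<String, String> installed = new LinkedHashMap<>();
    private final String lastResortFile;

    FontLookupSketch(String lastResortFile) {
        this.lastResortFile = lastResortFile;
    }

    // Registers an available font under its PostScript name.
    void register(String postScriptName, String fontFile) {
        installed.put(postScriptName, fontFile);
    }

    // Returns an exact match when one is registered, otherwise the
    // last-resort font, and marks which of the two cases applied.
    MappedFont find(String postScriptName) {
        String exact = installed.get(postScriptName);
        if (exact != null) {
            return new MappedFont(exact, false);
        }
        return new MappedFont(lastResortFile, true);
    }
}
```

With a result type like this, a caller can log a warning or choose different behaviour when the fallback flag is set, instead of silently rendering with the wrong font.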
[jira] [Created] (PDFBOX-2881) Radial and Axial shading steps are calculated incorrectly
John Hewson created PDFBOX-2881:

Summary: Radial and Axial shading steps are calculated incorrectly
Key: PDFBOX-2881
URL: https://issues.apache.org/jira/browse/PDFBOX-2881
Project: PDFBox
Issue Type: Bug
Components: Rendering
Affects Versions: 2.0.0
Reporter: John Hewson
Assignee: John Hewson
Fix For: 2.0.0

I found a shading bug while writing some code to dump all shadings in a PDF. I don't know if this affects PDF rendering within PageDrawer or not.

RadialShadingContext and AxialShadingContext use the following code in their constructors to calculate the number of steps (pixels) in the shading and build a lookup table for each step:

{code}
// transform the distance to actual pixel space
// use transform, because xform.getScaleX() does not return correct scaling on 90° rotated matrix
Point2D point = new Point2D.Double(longestDistance, longestDistance);
matrix.transform(point);
xform.transform(point, point);
factor = (int) Math.max(Math.abs(point.getX()), Math.abs(point.getY()));
colorTable = calcColorTable();
{code}

The matrix is the parent stream's matrix + the pattern matrix, so this code is taking the current scale and assuming that that is equal to the number of pixels. This works when a pattern is painted onto a 0...1 scaled surface, but otherwise it produces incorrect results.

There's no way to calculate the number of pixels in the device from its scale, or its matrix. Paint#createContext() provides the device bounds Rectangle, which is what we should be using. Indeed, this is handled correctly in the other shading contexts.
[jira] [Updated] (PDFBOX-2881) Radial and Axial shading steps are calculated incorrectly
[ https://issues.apache.org/jira/browse/PDFBOX-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson updated PDFBOX-2881:

Description:

I found a shading bug while writing some code to dump all shadings in a PDF. I don't know if this affects PDF rendering within PageDrawer or not.

RadialShadingContext and AxialShadingContext use the following code in their constructors to calculate the number of steps (pixels) in the shading and build a lookup table for each step:

{code}
// transform the distance to actual pixel space
// use transform, because xform.getScaleX() does not return correct scaling on 90° rotated matrix
Point2D point = new Point2D.Double(longestDistance, longestDistance);
matrix.transform(point);
xform.transform(point, point);
factor = (int) Math.max(Math.abs(point.getX()), Math.abs(point.getY()));
colorTable = calcColorTable();
{code}

The variable factor is the number of steps and matrix is the parent stream's matrix + the pattern matrix, so this code is taking the current scale and assuming that that is equal to the number of pixels. This works when a pattern is painted onto a 0...1 scaled surface, but otherwise it produces incorrect results.

There's no way to calculate the number of pixels in the device from its scale, or its matrix. Paint#createContext() provides the device bounds Rectangle, which is what we should be using. Indeed, this is handled correctly in the other shading contexts.

was:

I found a shading bug while writing some code to dump all shadings in a PDF. I don't know if this affects PDF rendering within PageDrawer or not.

RadialShadingContext and AxialShadingContext use the following code in their constructors to calculate the number of steps (pixels) in the shading and build a lookup table for each step:

{code}
// transform the distance to actual pixel space
// use transform, because xform.getScaleX() does not return correct scaling on 90° rotated matrix
Point2D point = new Point2D.Double(longestDistance, longestDistance);
matrix.transform(point);
xform.transform(point, point);
factor = (int) Math.max(Math.abs(point.getX()), Math.abs(point.getY()));
colorTable = calcColorTable();
{code}

The matrix is the parent stream's matrix + the pattern matrix, so this code is taking the current scale and assuming that that is equal to the number of pixels. This works when a pattern is painted onto a 0...1 scaled surface, but otherwise it produces incorrect results.

There's no way to calculate the number of pixels in the device from its scale, or its matrix. Paint#createContext() provides the device bounds Rectangle, which is what we should be using. Indeed, this is handled correctly in the other shading contexts.

Radial and Axial shading steps are calculated incorrectly

Key: PDFBOX-2881
URL: https://issues.apache.org/jira/browse/PDFBOX-2881
Project: PDFBox
Issue Type: Bug
Components: Rendering
Affects Versions: 2.0.0
Reporter: John Hewson
Assignee: John Hewson
Fix For: 2.0.0
[jira] [Commented] (PDFBOX-2530) Improve PDFDebugger
[ https://issues.apache.org/jira/browse/PDFBOX-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626885#comment-14626885 ]

John Hewson commented on PDFBOX-2530:
-------------------------------------

Two minor issues:
- PDImageXObject#getImage() returns the image with the mask applied, which means we can't view the raw image. Calling getOpaqueImage() instead would solve this.
- Never use e.printStackTrace() :) Use throw new RuntimeException(e) instead. That way exceptions won't get lost. It's actually better still to throw early and catch late and let the caller handle the IOException.

Improve PDFDebugger
-------------------

  Key: PDFBOX-2530
  URL: https://issues.apache.org/jira/browse/PDFBOX-2530
  Project: PDFBox
  Issue Type: Improvement
  Components: Utilities
  Affects Versions: 1.8.8, 2.0.0
  Reporter: Tilman Hausherr
  Assignee: khyrul bashar
  Labels: gsoc2015
  Attachments: Avoiding_NPE_for_null_Field_Type.diff, BracketsColorChooser.png, Class_cast_exception_in_page_mode_avoided.diff, DeviceNCS.diff, FlagBitsPane-26-06-2015.diff, Flag_bits_showing_feature-redesigned.diff, Flag_bits_showing_feature.diff, K4SystemFontsNotEmbeded218.pdf, PDFDebugger_StatusBar.png, PDFDebugger_StatusBar_01.png, Parent_dictionary_type_checking_for__f__and__flags.diff, Sonarqube_warning_resolved.diff, Stream_Showing_Feature.diff, indexedcs.diff, openSelectedPath.diff, parent_node_redirect.diff, parent_node_redirect_expand_disabled.diff, removed_redundant_codes.patch, separationCS.diff, sonarqube_warning_resolve.diff, tree.diff, treestatus.diff, treestatuspane.diff

(This is an idea for the [Google Summer of Code 2015|https://www.google-melange.com/])

Our command line utility PDFDebugger (part of the command line pdfbox-app, get it [here|https://pdfbox.apache.org/downloads.html], read description [here|https://pdfbox.apache.org/commandline/], see the source code [here|https://svn.apache.org/viewvc/pdfbox/trunk/tools/src/main/java/org/apache/pdfbox/tools/PDFDebugger.java?view=markup&sortby=date]) needs some improvements:
- hex view
- view of non printable characters
- ✓ saving streams
- binary copy paste
- ✓ Create a status line that shows where we are in the tree. (Like in the Windows REGEDIT)
- ✓ Copy the current tree string into the clipboard (useful in discussions about details of a PDF)
- ✓ (Optional, not sure if easy) Jump to specific place in the tree by entering tree string
- ✓ ability to search in streams (very useful for content streams and meta data)
- ✓ show images that are streams
- ✓ show PDIndexed color lookup table, show the index value, the base and RGB color value sets when the mouse moves
- ✓ show PDSeparation color
- ✓ show PDDeviceN colors
- optional, idea should be developed a bit: show meaningful explanation on some attributes, e.g. appearance stream when hovering over /AP
- show font encodings and characters
- ✓ display flag bits (e.g. Annotation flags) in a way that is easy to understand. There are probably others, I assume that the main work needs to be done only once
- edit attributes (should be possible to enter values as decimal, hex or binary)
- edit streams, while keeping or changing the compression filter
- save altered PDF
- color mark of certain PDF operators, especially q...Q and text operators (BT...ET). Ideally, it should help the user understand the bracketing of these operators, i.e. understand where a sequence starts and where it ends. (See operator summary in the PDF Spec) Other important operators I can think of are the matrix, font and color operators. A cool advanced thing would be to show the current color or the font in a popup when hovering above such an operator.

To see a product with a similar purpose that is better than PDFDebugger, watch [this video|https://www.youtube.com/watch?v=g-QcU9B4qMc]. I'm not asking to implement a clone of that product (I don't use it, all I know is that video), but we at PDFBox really need something that makes PDF debugging easier. As an example of how the current PDFDebugger prevented me from finding a bug quickly, see PDFBOX-2401 and search for PDFDebugger.

Prerequisites:
- java programming, especially the GUI components
- the ability to understand existing source code

Using external software components is possible (must have Apache License or a compatible one), but should be decided on a case-by-case basis, we don't want to get too big.

Development strategy: go from the easy to the difficult. The wished features are already sorted this way (mostly).

Get introduced: [download the source code with svn|https://pdfbox.apache.org/downloads.html#scm] and build it with maven. Run PDFDebugger and view some PDFs to see the components of a PDF. Start with the file of PDFBOX-2401. Read up something about the structure of PDF on the web or from the [PDF Specification|https://www.adobe.com/devnet/pdf/pdf_reference.html].

Mentor: Tilman Hausherr (European timezone, languages: german, english, french).

To see the GSoC2014 project
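The second point of John Hewson's comment (rethrow instead of calling e.printStackTrace(), or better, let the caller handle the IOException) can be sketched roughly as follows. This is only an illustration and not PDFDebugger code: the class ExceptionHandlingSketch and its two methods are made up for the example, and a plain InputStream stands in for whatever is actually being read.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch, not taken from PDFDebugger.
class ExceptionHandlingSketch
{
    // Preferred: declare the IOException and let the caller decide what to do
    // ("throw early, catch late").
    static int readFirstByte(InputStream in) throws IOException
    {
        int b = in.read();
        if (b == -1)
        {
            throw new IOException("stream is empty");
        }
        return b;
    }

    // Where a checked exception cannot be declared (e.g. inside a listener),
    // wrap and rethrow instead of e.printStackTrace(), so the failure is not lost.
    static int readFirstByteUnchecked(InputStream in)
    {
        try
        {
            return readFirstByte(in);
        }
        catch (IOException e)
        {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args)
    {
        System.out.println(readFirstByteUnchecked(new ByteArrayInputStream(new byte[] { 37 })));
        try
        {
            readFirstByteUnchecked(new ByteArrayInputStream(new byte[0]));
        }
        catch (RuntimeException e)
        {
            // caught "late"; the original IOException is preserved as the cause
            System.out.println("caught: " + e.getCause().getMessage());
        }
    }
}
```

With printStackTrace() the second call would have printed a trace and carried on with a bogus value; with the rethrow, the caller sees the failure and still has the original IOException as the cause.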
[jira] [Closed] (PDFBOX-2839) Missing TextPosition(s) in PDFTextStripper
[ https://issues.apache.org/jira/browse/PDFBOX-2839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christopher Clark closed PDFBOX-2839.
-------------------------------------
    Resolution: Not A Problem

Having looked at this further I see that a 1:1 correspondence between characters and text positions was not intended.

Missing TextPosition(s) in PDFTextStripper
------------------------------------------

  Key: PDFBOX-2839
  URL: https://issues.apache.org/jira/browse/PDFBOX-2839
  Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
  Affects Versions: 2.0.0
  Reporter: Christopher Clark

The protected method `writeString` in `PDFTextStripper` can receive more characters than TextPositions. I tracked the problem down to the `normalizeAdd` method where, for multi-character unicode words, multiple characters can be added to a line while only a single TextPosition object is added to the corresponding list of TextPositions.

This pdf: https://www.aclweb.org/anthology/W/W13/W13-4011.pdf contains such a character.
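To illustrate why one glyph (and hence one TextPosition) can end up as several characters in the output: a single ligature code point such as U+FB01 expands to two characters under Unicode compatibility normalization. The sketch below uses only java.text.Normalizer from the JDK. It is not the code of `normalizeAdd`, and the class and method names are made up for the example.

```java
import java.text.Normalizer;

// Hypothetical sketch: shows the 1-to-many expansion, not PDFTextStripper's logic.
class LigatureExpansionSketch
{
    // Compatibility normalization (NFKC) replaces ligatures by their component letters.
    static String expand(String s)
    {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args)
    {
        String ligature = "\uFB01"; // LATIN SMALL LIGATURE FI, a single char
        String expanded = expand(ligature);
        // prints "1 -> 2 (fi)"
        System.out.println(ligature.length() + " -> " + expanded.length() + " (" + expanded + ")");
    }
}
```

So a line of text built from normalized strings can legitimately be longer than the list of positions it was built from, which matches the "Not A Problem" resolution.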
Re: first stack trace report from pdfbox 2.0.0 trunk
On 14.07.2015 at 21:37, Allison, Timothy B. wrote:
> Interesting, yes: 781/781172.pdf, 490/490376.pdf and 029/029423.pdf. Are you running your own regression testing against govdocs1?

Yes, from time to time for the last few months.

> Is it duplicated effort for me to do anything with 2.0.0?

Partly yes. The only difference is that I didn't do any text extraction.

> Or, is your point that I should wait until PDFBOX-2842 is completed?

Yes :-)

Tilman

> Thank you!
>
> Best,
> Tim
>
> -----Original Message-----
> From: Tilman Hausherr [mailto:thaush...@t-online.de]
> Sent: Tuesday, July 14, 2015 12:47 PM
> To: dev@pdfbox.apache.org
> Subject: Re: first stack trace report from pdfbox 2.0.0 trunk
>
> Hi Tim,
>
> Currently there is at least one known regression, mentioned in PDFBOX-2842, it applies to 029423 but also to other files.
>
> Tilman
>
> On 10.07.2015 at 13:57, Allison, Timothy B. wrote:
>> All,
>> I just posted the first stacktrace report from my initial partial batch run against govdocs1 here:
>> https://issues.apache.org/jira/secure/attachment/12744700/pdfbox_reports_2_0_0_20150709.zip
>>
>> Caveats/Notes
>> The run yesterday did not include the fixes that were made in PDFBOX-2370 or PDFBOX-2862. I stopped the batch run early. This only covered ~50k pdfs. I forgot to turn on accesspermission checking. Some of the pdfs in here would normally have been skipped. I haven't reviewed any of the exceptions. They may be caused by code on the Tika side.
>>
>> I'll plan to re-run with the latest trunk on Tuesday. I need to turn back to the actual eval code for a bit. :)
>>
>> Cheers,
>> Tim
[jira] [Comment Edited] (PDFBOX-2272) Can't extract vertical text correctly
[ https://issues.apache.org/jira/browse/PDFBOX-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626998#comment-14626998 ]

John Hewson edited comment on PDFBOX-2272 at 7/14/15 8:30 PM:
--------------------------------------------------------------

The patch looks too large to me, [~AndreasMeier] why has handleTextPosition been created? It seems unnecessary?

was (Author: jahewson):
The patch looks too large to me, why has handleTextPosition been created? It seems unnecessary?
Re: first stack trace report from pdfbox 2.0.0 trunk
On 14 Jul 2015, at 13:20, Tilman Hausherr thaush...@t-online.de wrote:
> On 14.07.2015 at 21:37, Allison, Timothy B. wrote:
>> Or, is your point that I should wait until PDFBOX-2842 is completed?
>
> Yes :-)

Good news, PDFBOX-2842 is now complete.

-- John
[jira] [Resolved] (PDFBOX-2842) Overhaul font substitution
[ https://issues.apache.org/jira/browse/PDFBOX-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson resolved PDFBOX-2842.
---------------------------------
    Resolution: Fixed

I'm going to leave this item for a rainy day:
- ExternalFonts is a black box: the user cannot tell whether the font returned is an exact match, or a last-resort fallback.

Overhaul font substitution
--------------------------

  Key: PDFBOX-2842
  URL: https://issues.apache.org/jira/browse/PDFBOX-2842
  Project: PDFBox
  Issue Type: Improvement
  Components: FontBox, PDModel
  Affects Versions: 2.0.0
  Reporter: John Hewson
  Assignee: John Hewson
  Priority: Blocker
  Fix For: 2.0.0
  Attachments: 029423-p1.pdf, 166292-fi-ligature.pdf

The improved font substitution mechanisms in 2.0 are not quite sufficient to handle all PDFs. Specifically, CJK substitution and substitution of TTF in place of CFF fonts are not possible with the current design. The CJK problems can be seen in PDFBOX-2509 and PDFBOX-2563, which does not solve the problem. Additional font API weaknesses can be found in PDFBOX-2578 and PDFBOX-2366. This meta-issue aims to address all of those sub-issues.

The current problems are:
- FontBox does not provide a generic font type, so we have to handle TrueTypeFont, CFFFont, and Type1Font separately. This hinders cross-format substitution.
- ExternalFonts has no knowledge of the CIDSystemInfo which is necessary for CJK substitution
- FontProvider contains too much public logic which should be internal to PDFBox, e.g. substitution logic, this makes it brittle and means we won't be able to add additional logic after 2.0 is released, e.g. CJK substitution.
- Too much confusion about the role of ExternalFonts, particularly with regard to mapping of built-in fonts and the definition of substitute vs. fallback font.
- ExternalFonts is a black box: the user cannot tell whether the font returned is an exact match, or a last-resort fallback.
- Confusing font substitution API, users preferred having a flat file format
- PDSimpleFont#getEncoding() can return null for TTFs which use built-in encodings. This has caused a lot of bugs - there must be a better way.
- We still have some confusing names, for example a CustomEncoding is known as a built-in encoding in the spec.
- There is no fallback CFF font, we resort to AdobeBlank instead, which has no rendering.
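The "black box" item in the list above (the caller cannot tell an exact match from a last-resort fallback) could be addressed by returning the font together with a flag. The following is a minimal sketch under loud assumptions: fonts are represented by plain strings, and FontLookupSketch, Result, register and find are hypothetical names chosen for illustration. None of this is PDFBox or FontBox API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; the names and the String-based "fonts" are assumptions.
class FontLookupSketch
{
    // A lookup result that carries the font and says whether it is only a fallback.
    static final class Result<T>
    {
        private final T font;
        private final boolean fallback;

        Result(T font, boolean fallback)
        {
            this.font = font;
            this.fallback = fallback;
        }

        T getFont()
        {
            return font;
        }

        boolean isFallback()
        {
            return fallback;
        }
    }

    private final Map<String, String> installed = new HashMap<>();
    private final String lastResort;

    FontLookupSketch(String lastResort)
    {
        this.lastResort = lastResort;
    }

    void register(String postScriptName, String font)
    {
        installed.put(postScriptName, font);
    }

    // Never returns null: either an exact match, or the last-resort font flagged as fallback.
    Result<String> find(String postScriptName)
    {
        String exact = installed.get(postScriptName);
        if (exact != null)
        {
            return new Result<>(exact, false);
        }
        return new Result<>(lastResort, true);
    }

    public static void main(String[] args)
    {
        FontLookupSketch fonts = new FontLookupSketch("LastResort.ttf");
        fonts.register("ExampleSans", "ExampleSans.ttf");
        System.out.println(fonts.find("ExampleSans").isFallback());
        System.out.println(fonts.find("NotInstalled").isFallback());
    }
}
```

The point of the design is that the caller, not the lookup, decides what to do with a fallback, for example logging a warning or refusing to render.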
[jira] [Commented] (PDFBOX-2530) Improve PDFDebugger
[ https://issues.apache.org/jira/browse/PDFBOX-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627010#comment-14627010 ]

ASF subversion and git services commented on PDFBOX-2530:
---------------------------------------------------------

Commit 1691077 from [~jahewson] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1691077 ]

PDFBOX-2530: Comma separation for filter labels
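What "comma separation for filter labels" amounts to can be sketched as below. This is not the code of commit 1691077, which is not shown in this thread: FilterLabelSketch and label are made-up names, the filter names are just two standard PDF filters used as sample input, and String.join assumes Java 8 or later.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not the committed PDFDebugger change.
class FilterLabelSketch
{
    // Joins the filter names of a stream into one readable label.
    static String label(List<String> filterNames)
    {
        return String.join(", ", filterNames);
    }

    public static void main(String[] args)
    {
        // prints "ASCII85Decode, FlateDecode"
        System.out.println(label(Arrays.asList("ASCII85Decode", "FlateDecode")));
    }
}
```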
[jira] [Commented] (PDFBOX-2530) Improve PDFDebugger
[ https://issues.apache.org/jira/browse/PDFBOX-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627011#comment-14627011 ]

John Hewson commented on PDFBOX-2530:
-------------------------------------

Yep, I put the code to do that on the wrong line. Will fix.
[jira] [Commented] (PDFBOX-2272) Can't extract vertical text correctly
[ https://issues.apache.org/jira/browse/PDFBOX-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626904#comment-14626904 ]

Tilman Hausherr commented on PDFBOX-2272:
-----------------------------------------

The only difference is that the relative path is missing. I can't do that because I have changes elsewhere.
[jira] [Commented] (PDFBOX-2530) Improve PDFDebugger
[ https://issues.apache.org/jira/browse/PDFBOX-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626964#comment-14626964 ]

John Hewson commented on PDFBOX-2530:
-------------------------------------

Yes, adding Image + Mask as the default item in the drop down menu would work nicely.
[jira] [Updated] (PDFBOX-2272) Can't extract vertical text correctly
[ https://issues.apache.org/jira/browse/PDFBOX-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-2272:
------------------------------------
    Attachment: vertical.patch

ok, here's the same as a .patch (hopefully).
[jira] [Updated] (PDFBOX-2530) Improve PDFDebugger
[ https://issues.apache.org/jira/browse/PDFBOX-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-2530:
------------------------------------
    Attachment: filters-screenshot.png

[~jahewson] please add a "," or whatever... this is what people see with the file of PDFBOX-2215-027073.pdf.
[jira] [Comment Edited] (PDFBOX-2842) Overhaul font substitution
[ https://issues.apache.org/jira/browse/PDFBOX-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627002#comment-14627002 ]

John Hewson edited comment on PDFBOX-2842 at 7/14/15 8:35 PM:
--------------------------------------------------------------

I'm going to leave this item for a rainy day:
- Confusing font substitution API, users preferred having a flat file format.

was (Author: jahewson):
I'm going to leave this item for a rainy day:
- ExternalFonts is a black box: the user cannot tell whether the font returned is an exact match, or a last-resort fallback.
Re: first stack trace report from pdfbox 2.0.0 trunk
Am 14.07.2015 um 22:35 schrieb John Hewson: On 14 Jul 2015, at 13:20, Tilman Hausherr thaush...@t-online.de wrote: Am 14.07.2015 um 21:37 schrieb Allison, Timothy B.: Interesting, yes: 781/781172.pdf, 490/490376.pdf and 029/029423.pdf. Are you running your own regression testing against govdocs1? Yes, from time to time for the last few months. Is it duplicated effort for me to do anything with 2.0.0? Partly yes. The only difference is that I didn't do any text extraction. Or, is your point that should I wait until PDFBOX-2842 is completed? Yes :-) Good news, PDFBOX-2842 is now complete. No, the 029423 file is still throwing an exception :-( Tilman — John Tilman Thank you! Best, Tim -Original Message- From: Tilman Hausherr [mailto:thaush...@t-online.de] Sent: Tuesday, July 14, 2015 12:47 PM To: dev@pdfbox.apache.org Subject: Re: first stack trace report from pdfbox 2.0.0 trunk Hi Tim, Currently there is at least one known regression, mentioned in PDFBOX-2842, it applies to 029423 but also to other files. Tilman Am 10.07.2015 um 13:57 schrieb Allison, Timothy B.: All, I just posted the first stacktrace report from my initial partial batch run of against govdocs1 here: https://issues.apache.org/jira/secure/attachment/12744700/pdfbox_reports_2_0_0_20150709.zip Caveats/Notes The run yesterday did not include the fixes that were made in PDFBOX-2370 or PDFBOX-2862. I stopped the batch run early. This only covered ~50k pdfs. I forgot to turn on accesspermission checking. Some of the pdfs in here would normally have been skipped. I haven't reviewed any of the exceptions. They may be caused by code on the Tika side. I'll plan to re-run with the latest trunk on Tuesday. I need to turn back to the actual eval code for a bit. :) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Commented] (PDFBOX-2842) Overhaul font substitution
[ https://issues.apache.org/jira/browse/PDFBOX-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14627346#comment-14627346 ]
ASF subversion and git services commented on PDFBOX-2842:
Commit 1691119 from [~jahewson] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1691119 ]
PDFBOX-2842: Non-symbolic TTFs use StandardEncoding as their built-in
Re: first stack trace report from pdfbox 2.0.0 trunk
On 14 Jul 2015, at 13:49, Tilman Hausherr thaush...@t-online.de wrote: Am 14.07.2015 um 22:35 schrieb John Hewson: On 14 Jul 2015, at 13:20, Tilman Hausherr thaush...@t-online.de wrote: Am 14.07.2015 um 21:37 schrieb Allison, Timothy B.: Interesting, yes: 781/781172.pdf, 490/490376.pdf and 029/029423.pdf. Are you running your own regression testing against govdocs1? Yes, from time to time for the last few months. Is it duplicated effort for me to do anything with 2.0.0? Partly yes. The only difference is that I didn't do any text extraction. Or, is your point that should I wait until PDFBOX-2842 is completed? Yes :-) Good news, PDFBOX-2842 is now complete. No, the 029423 file is still throwing an exception :-( Ok, I’ve just fixed this, hopefully it works. — John Tilman — John Tilman Thank you! Best, Tim -Original Message- From: Tilman Hausherr [mailto:thaush...@t-online.de] Sent: Tuesday, July 14, 2015 12:47 PM To: dev@pdfbox.apache.org Subject: Re: first stack trace report from pdfbox 2.0.0 trunk Hi Tim, Currently there is at least one known regression, mentioned in PDFBOX-2842, it applies to 029423 but also to other files. Tilman Am 10.07.2015 um 13:57 schrieb Allison, Timothy B.: All, I just posted the first stacktrace report from my initial partial batch run of against govdocs1 here: https://issues.apache.org/jira/secure/attachment/12744700/pdfbox_reports_2_0_0_20150709.zip Caveats/Notes The run yesterday did not include the fixes that were made in PDFBOX-2370 or PDFBOX-2862. I stopped the batch run early. This only covered ~50k pdfs. I forgot to turn on accesspermission checking. Some of the pdfs in here would normally have been skipped. I haven't reviewed any of the exceptions. They may be caused by code on the Tika side. I'll plan to re-run with the latest trunk on Tuesday. I need to turn back to the actual eval code for a bit. 
:)
[jira] [Resolved] (PDFBOX-2842) Overhaul font substitution
[ https://issues.apache.org/jira/browse/PDFBOX-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
John Hewson resolved PDFBOX-2842.
Resolution: Fixed
[jira] [Created] (PDFBOX-2882) Improve performance when using scratch file
Timo Boehme created PDFBOX-2882:
Summary: Improve performance when using scratch file
Key: PDFBOX-2882
URL: https://issues.apache.org/jira/browse/PDFBOX-2882
Project: PDFBox
Issue Type: Improvement
Components: Parsing
Affects Versions: 2.0.0
Reporter: Timo Boehme
Priority: Minor

The current scratch file implementation uses many direct I/O calls, which makes parsing considerably slower than with the in-memory scratch buffer.
[jira] [Commented] (PDFBOX-2842) Overhaul font substitution
[ https://issues.apache.org/jira/browse/PDFBOX-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14627288#comment-14627288 ]
John Hewson commented on PDFBOX-2842:
Thanks, yes I missed that one.
[jira] [Commented] (PDFBOX-2842) Overhaul font substitution
[ https://issues.apache.org/jira/browse/PDFBOX-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14627287#comment-14627287 ]
ASF subversion and git services commented on PDFBOX-2842:
Commit 1691110 from [~jahewson] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1691110 ]
PDFBOX-2842: Re-build stale font cache
[jira] [Updated] (PDFBOX-2882) Improve performance when using scratch file
[ https://issues.apache.org/jira/browse/PDFBOX-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Timo Boehme updated PDFBOX-2882:
Attachment: ScratchFileBuffer.java, ScratchFile.java

Drop-in replacement for classes in the org.apache.pdfbox.io package. It keeps the single scratch file approach and paging, but uses a direct index instead of linking between pages. Additionally, pages can be re-used once buffers are closed. File I/O is only necessary to read or write whole pages. In a small test loading the PDF reference file, this implementation reduced the time needed by a factor of 2.
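For readers who want a feel for the approach described in this update, here is a minimal sketch of a page-based scratch file with a direct index and page re-use. It is not the attached ScratchFile.java: the class name PagedScratchSketch, the 4096-byte page size and all method names are assumptions made up for illustration, and only standard java.io and java.util calls are used.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Arrays;
import java.util.BitSet;

/** Hypothetical sketch of direct-index paging; not the PDFBox implementation. */
public class PagedScratchSketch implements AutoCloseable {
    static final int PAGE_SIZE = 4096; // assumed page size

    private final File file;
    private final RandomAccessFile raf;
    private final BitSet freePages = new BitSet(); // indexes of released pages
    private int pageCount = 0;                     // pages handed out so far

    public PagedScratchSketch() throws IOException {
        file = File.createTempFile("scratch-sketch", ".tmp");
        raf = new RandomAccessFile(file, "rw");
    }

    /** Returns a page index, preferring a previously released page. */
    public int allocatePage() {
        int idx = freePages.nextSetBit(0);
        if (idx >= 0) {
            freePages.clear(idx);
            return idx;
        }
        return pageCount++;
    }

    /** Marks a page as re-usable, e.g. when the buffer owning it is closed. */
    public void freePage(int idx) {
        freePages.set(idx);
    }

    /** Writes one whole page; the file offset follows directly from the index. */
    public void writePage(int idx, byte[] page) throws IOException {
        if (page.length != PAGE_SIZE) {
            throw new IllegalArgumentException("page must be " + PAGE_SIZE + " bytes");
        }
        raf.seek((long) idx * PAGE_SIZE);
        raf.write(page);
    }

    /** Reads one whole page that was written earlier. */
    public byte[] readPage(int idx) throws IOException {
        byte[] page = new byte[PAGE_SIZE];
        raf.seek((long) idx * PAGE_SIZE);
        raf.readFully(page);
        return page;
    }

    @Override
    public void close() throws IOException {
        raf.close();
        file.delete();
    }

    /** Round trip: write two pages, read one back, release it, check it is re-used. */
    public static boolean demo() {
        try (PagedScratchSketch scratch = new PagedScratchSketch()) {
            int a = scratch.allocatePage();
            int b = scratch.allocatePage();
            byte[] data = new byte[PAGE_SIZE];
            Arrays.fill(data, (byte) 42);
            scratch.writePage(a, data);
            scratch.writePage(b, new byte[PAGE_SIZE]);
            boolean sameBytes = Arrays.equals(data, scratch.readPage(a));
            scratch.freePage(a);
            return sameBytes && scratch.allocatePage() == a;
        } catch (IOException e) {
            return false;
        }
    }
}
```

Because every read and write moves exactly one page at an offset computed from its index, no chain of page links has to be followed on disk, which is presumably where the reduction in I/O calls comes from.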
[jira] [Commented] (PDFBOX-2881) Radial and Axial shading steps are calculated incorrectly
[ https://issues.apache.org/jira/browse/PDFBOX-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14627568#comment-14627568 ]
Tilman Hausherr commented on PDFBOX-2881:
Can you name a file that was rendering incorrectly?

Radial and Axial shading steps are calculated incorrectly
Key: PDFBOX-2881
URL: https://issues.apache.org/jira/browse/PDFBOX-2881
Project: PDFBox
Issue Type: Bug
Components: Rendering
Affects Versions: 2.0.0
Reporter: John Hewson
Assignee: John Hewson
Fix For: 2.0.0

I found a shading bug while writing some code to dump all shadings in a PDF. I don't know if this affects PDF rendering within PageDrawer or not. RadialShadingContext and AxialShadingContext use the following code in their constructors to calculate the number of steps (pixels) in the shading and build a lookup table for each step:
{code}
// transform the distance to actual pixel space
// use transform, because xform.getScaleX() does not return correct scaling on 90° rotated matrix
Point2D point = new Point2D.Double(longestDistance, longestDistance);
matrix.transform(point);
xform.transform(point, point);
factor = (int) Math.max(Math.abs(point.getX()), Math.abs(point.getY()));
colorTable = calcColorTable();
{code}
The variable factor is the number of steps and matrix is the parent stream's matrix + the pattern matrix, so this code takes the current scale and assumes that it is equal to the number of pixels. This works when a pattern is painted onto a 0...1 scaled surface, but otherwise it produces incorrect results. There's no way to calculate the number of pixels in the device from its scale, or its matrix. Paint#createContext() provides the device bounds Rectangle, which is what we should be using. Indeed, this is handled correctly in the other shading contexts.
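As a side note on the code comment quoted in this issue, the following small sketch shows why getScaleX() is unreliable on a matrix rotated by 90°, and how transforming a point recovers the scale instead. The class and method names are made up for illustration, and it uses deltaTransform (which ignores translation) rather than the transform calls in the quoted code, so treat it as an assumption-laden demonstration rather than the PDFBox logic. It does not address the actual bug reported here: a scale is still not a device pixel count.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

/** Hypothetical demonstration; not part of PDFBox. */
public class RotatedScaleSketch {

    /** Builds a transform that rotates by 90 degrees and scales uniformly. */
    static AffineTransform rotatedScale(double scale) {
        AffineTransform at = AffineTransform.getQuadrantRotateInstance(1);
        at.scale(scale, scale);
        return at;
    }

    /** Largest absolute coordinate of (d, d) after applying the linear part of the transform. */
    static int stepsFor(AffineTransform at, double d) {
        Point2D p = at.deltaTransform(new Point2D.Double(d, d), null);
        return (int) Math.max(Math.abs(p.getX()), Math.abs(p.getY()));
    }

    public static void main(String[] args) {
        AffineTransform at = rotatedScale(2.0);
        // The scale sits in the shear entries of a 90° rotated matrix,
        // so the m00 entry returned by getScaleX() is zero.
        System.out.println("getScaleX: " + at.getScaleX());
        // Transforming a point still reflects the factor of 2.
        System.out.println("steps for distance 10: " + stepsFor(at, 10.0));
    }
}
```

The point-transform trick therefore yields a usable scale regardless of rotation, but as the issue explains, the number of device pixels should come from the device bounds passed to Paint#createContext().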
[jira] [Updated] (PDFBOX-2845) Error parsing PDF
[ https://issues.apache.org/jira/browse/PDFBOX-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Christopher Clark updated PDFBOX-2845:
Fix Version/s: 2.0.0

Error parsing PDF
Key: PDFBOX-2845
URL: https://issues.apache.org/jira/browse/PDFBOX-2845
Project: PDFBox
Issue Type: Bug
Components: Parsing
Affects Versions: 2.0.0
Reporter: Christopher Clark
Fix For: 2.0.0

I get the following error when parsing this pdf: http://jmlr.csail.mit.edu/proceedings/papers/v28/ranganath13.pdf
java.io.IOException: Object must be defined and must not be compressed object: 554:0
Stack trace:
Exception in thread "main" java.io.IOException: Object must be defined and must not be compressed object: 554:0
    at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:682)
    at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:646)
    at org.apache.pdfbox.pdfparser.COSParser.getLength(COSParser.java:847)
    at org.apache.pdfbox.pdfparser.COSParser.parseCOSStream(COSParser.java:906)
    at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:732)
    at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:693)
    at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:646)
    at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:607)
    at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:198)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:225)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:848)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:793)
    at org.apache.pdfbox.tools.ExtractText.startExtraction(ExtractText.java:192)
    at org.apache.pdfbox.tools.ExtractText.main(ExtractText.java:81)
    at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:55)
Note this problem does not occur in 1.8.9
[jira] [Assigned] (PDFBOX-2882) Improve performance when using scratch file
[ https://issues.apache.org/jira/browse/PDFBOX-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Timo Boehme reassigned PDFBOX-2882:
Assignee: Timo Boehme
[jira] [Commented] (PDFBOX-2272) Can't extract vertical text correctly
[ https://issues.apache.org/jira/browse/PDFBOX-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626831#comment-14626831 ]
John Hewson commented on PDFBOX-2272:
https://ariejan.net/2007/07/03/how-to-create-and-apply-a-patch-with-subversion/
[jira] [Commented] (PDFBOX-2871) Performance issue when filling the first PDTextField of an AcroForm
[ https://issues.apache.org/jira/browse/PDFBOX-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626837#comment-14626837 ]
John Hewson commented on PDFBOX-2871:
I'm not really worried about speeding up font parsing, we have the on-disk cache now, so it's an only once ever event. Scanning for files on the local system is already fast. What's still relatively slow about the new cache is that it uses Java's serialization - a custom serialisation format could be much faster. I'm also not sure about the speed of the Preferences API - benchmarking needed.

Performance issue when filling the first PDTextField of an AcroForm
Key: PDFBOX-2871
URL: https://issues.apache.org/jira/browse/PDFBOX-2871
Project: PDFBox
Issue Type: Bug
Components: AcroForm
Affects Versions: 2.0.0
Reporter: Maruan Sahyoun
Assignee: John Hewson
Priority: Critical
Labels: Appearance
Fix For: 2.0.0
Attachments: PDTextField.pdf, ProfilingOutput.png

When filling the first PDTextField in a form the performance is slow. All other PDTextFields in the form are handled quickly. This code
{code}
// first field: includes the one-time setup cost
PDTextField field = (PDTextField) doc.getDocumentCatalog().getAcroForm().getField("Textfield01");
long start = System.nanoTime();
field.setValue("ABCD");
long end = System.nanoTime();
double difference = (end - start)/1e6; // elapsed time in milliseconds
System.out.println(difference);

// second field: same operation, measured the same way
field = (PDTextField) doc.getDocumentCatalog().getAcroForm().getField("Textfield02");
start = System.nanoTime();
field.setValue("ABCD");
end = System.nanoTime();
difference = (end - start)/1e6;
System.out.println(difference);
{code}
produces the following output
{noformat}
9713.38
3.904
{noformat}
[jira] [Comment Edited] (PDFBOX-2871) Performance issue when filling the first PDTextField of an AcroForm
[ https://issues.apache.org/jira/browse/PDFBOX-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626837#comment-14626837 ]
John Hewson edited comment on PDFBOX-2871 at 7/14/15 6:41 PM:
I'm not really worried about speeding up font parsing, we have the on-disk cache now, so it's an only once ever event. Scanning for files on the local system is already fast. What's still relatively slow about the new cache is that it uses Java's serialization - a custom serialisation format could be much faster. I'm also not sure about the speed of the Preferences API - benchmarking needed.
was (Author: jahewson):
I'm not really worried about speeding up font parsing, we have the on-disk cache now, so it's a only once ever event. Scanning for files on the local system is already fast. What's still relatively slow about the new cache is that it uses Java's serialization - a custom serialisation format could be much faster. I'm also not sure about the speed of the Preferences API - benchmarking needed.
[jira] [Comment Edited] (PDFBOX-2272) Can't extract vertical text correctly
[ https://issues.apache.org/jira/browse/PDFBOX-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626841#comment-14626841 ]
Tilman Hausherr edited comment on PDFBOX-2272 at 7/14/15 6:43 PM:
Here's the change as a patch, just to show that this isn't some bureaucratic trick. Hopefully somebody will understand it... I've never worked deeply on that part of PDFBox, except two bug fixes (one of them from you).
was (Author: tilman):
Here's the change as a patch, just to show that this isn't some bureaucratic trick. Hopefully somebody will understand it... I've never worked deeply on that part of PDFBox, except two bug fixes (one of the from you).
[jira] [Updated] (PDFBOX-2272) Can't extract vertical text correctly
[ https://issues.apache.org/jira/browse/PDFBOX-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tilman Hausherr updated PDFBOX-2272:
Attachment: vertical.diff

Here's the change as a patch, just to show that this isn't some bureaucratic trick. Hopefully somebody will understand it... I've never worked deeply on that part of PDFBox, except two bug fixes (one of them from you).
[jira] [Commented] (PDFBOX-2530) Improve PDFDebugger
[ https://issues.apache.org/jira/browse/PDFBOX-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626869#comment-14626869 ] ASF subversion and git services commented on PDFBOX-2530: - Commit 1691060 from [~jahewson] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1691060 ] PDFBOX-2530: UI tweak for image view Improve PDFDebugger --- Key: PDFBOX-2530 URL: https://issues.apache.org/jira/browse/PDFBOX-2530 Project: PDFBox Issue Type: Improvement Components: Utilities Affects Versions: 1.8.8, 2.0.0 Reporter: Tilman Hausherr Assignee: khyrul bashar Labels: gsoc2015 Attachments: Avoiding_NPE_for_null_Field_Type.diff, BracketsColorChooser.png, Class_cast_exception_in_page_mode_avoided.diff, DeviceNCS.diff, FlagBitsPane-26-06-2015.diff, Flag_bits_showing_feature-redesigned.diff, Flag_bits_showing_feature.diff, K4SystemFontsNotEmbeded218.pdf, PDFDebugger_StatusBar.png, PDFDebugger_StatusBar_01.png, Parent_dictionary_type_checking_for__f__and__flags.diff, Sonarqube_warning_resolved.diff, Stream_Showing_Feature.diff, indexedcs.diff, openSelectedPath.diff, parent_node_redirect.diff, parent_node_redirect_expand_disabled.diff, removed_redundant_codes.patch, separationCS.diff, sonarqube_warning_resolve.diff, tree.diff, treestatus.diff, treestatuspane.diff (This is an idea for the [Google Summer of Code 2015|https://www.google-melange.com/]) Our command line utility PDFDebugger (part of the command line pdfbox-app get it [here|https://pdfbox.apache.org/downloads.html], read description [here|https://pdfbox.apache.org/commandline/], see the source code [here|https://svn.apache.org/viewvc/pdfbox/trunk/tools/src/main/java/org/apache/pdfbox/tools/PDFDebugger.java?view=markupsortby=date]) needs some improvements: - hex view - view of non printable characters - ✓ saving streams - binary copy paste - ✓ Create a status line that shows where we are in the tree. 
(Like in the Windows REGEDIT)
- ✓ Copy the current tree string into the clipboard (useful in discussions about details of a PDF)
- ✓ (Optional, not sure if easy) Jump to specific place in the tree by entering tree string
- ✓ ability to search in streams (very useful for content streams and meta data)
- ✓ show images that are streams
- ✓ show PDIndexed color lookup table, show the index value, the base and RGB color value sets when the mouse moves
- ✓ show PDSeparation color
- ✓ show PDDeviceN colors
- optional, idea should be developed a bit: show meaningful explanation on some attributes, e.g. appearance stream when hovering over /AP
- show font encodings and characters
- ✓ display flag bits (e.g. Annotation flags) in a way that is easy to understand. There are probably others, I assume that the main work needs to be done only once
- edit attributes (should be possible to enter values as decimal, hex or binary)
- edit streams, while keeping or changing the compression filter
- save altered PDF
- color mark of certain PDF operators, especially q...Q and text operators (BT...ET). Ideally, it should help the user understand the bracketing of these operators, i.e. understand where a sequence starts and where it ends. (See operator summary in the PDF Spec) Other important operators I can think of are the matrix, font and color operators. A cool advanced thing would be to show the current color or the font in a popup when hovering above such an operator.

To see a product with a similar purpose that is better than PDFDebugger, watch [this video|https://www.youtube.com/watch?v=g-QcU9B4qMc]. I'm not asking to implement a clone of that product (I don't use it, all I know is that video), but we at PDFBox really need something that makes PDF debugging easier. As an example of how the current PDFDebugger prevented me from finding a bug quickly, see PDFBOX-2401 and search for PDFDebugger.
Prerequisites: - java programming, especially the GUI components - the ability to understand existing source code Using external software components is possible (must have Apache License or a compatible one), but should be decided on a case-by-case basis, we don't want to get too big. Development strategy: go from the easy to the difficult. The wished features are already sorted this way (mostly). Get introduced: [download the source code with svn|https://pdfbox.apache.org/downloads.html#scm] and build it with maven. Run PDFDebugger and view some PDFs to see the components of a PDF. Start with the file of PDFBOX-2401. Read up something about the structure of PDF on the web or from the [PDF Specification|https://www.adobe.com/devnet/pdf/pdf_reference.html]. Mentor: Tilman Hausherr
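To make the "display flag bits (e.g. Annotation flags)" item more concrete, here is a minimal sketch of how an integer /F value could be turned into readable names. It assumes the flag names and bit positions given for annotation flags in the PDF specification (bit position 1 is the lowest bit). The class name AnnotationFlagNames and its decode method are hypothetical, made up for this illustration; they are not part of PDFBox or of the PDFDebugger patches attached to this issue.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: turns the integer value of an
// annotation's /F entry into the names of the flags that are set.
final class AnnotationFlagNames
{
    // Flag names in bit order; index 0 corresponds to bit position 1
    // (the lowest bit) as listed in the PDF specification.
    private static final String[] NAMES = {
        "Invisible", "Hidden", "Print", "NoZoom", "NoRotate",
        "NoView", "ReadOnly", "Locked", "ToggleNoView", "LockedContents"
    };

    private AnnotationFlagNames()
    {
    }

    // Returns the names of all set flags, lowest bit first.
    static List<String> decode(int flags)
    {
        List<String> set = new ArrayList<>();
        for (int i = 0; i < NAMES.length; i++)
        {
            if ((flags & (1 << i)) != 0)
            {
                set.add(NAMES[i]);
            }
        }
        return set;
    }

    public static void main(String[] args)
    {
        int flags = 0b1000100; // 68 = bit 3 (Print) + bit 7 (ReadOnly)
        System.out.println(flags + " -> " + decode(flags));
        // prints "68 -> [Print, ReadOnly]"
    }
}
```

A table-driven decoder like this would match the remark that "the main work needs to be done only once": other flag sets (field flags, font descriptor flags) would only need a different name table.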
[jira] [Commented] (PDFBOX-2530) Improve PDFDebugger
[ https://issues.apache.org/jira/browse/PDFBOX-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626872#comment-14626872 ] John Hewson commented on PDFBOX-2530: - I was testing the new image view on some black-white gradient images but it was hard to tell what was the image and what was the background, so I made some minor tweaks.
[jira] [Commented] (PDFBOX-2530) Improve PDFDebugger
[ https://issues.apache.org/jira/browse/PDFBOX-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626751#comment-14626751 ] Tilman Hausherr commented on PDFBOX-2530: - Commit msg has been fixed separately with correct attribution to you.
[jira] [Commented] (PDFBOX-2530) Improve PDFDebugger
[ https://issues.apache.org/jira/browse/PDFBOX-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626753#comment-14626753 ] khyrul bashar commented on PDFBOX-2530: --- I've uploaded a patch.
[jira] [Commented] (PDFBOX-2530) Improve PDFDebugger
[ https://issues.apache.org/jira/browse/PDFBOX-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626754#comment-14626754 ] khyrul bashar commented on PDFBOX-2530: --- I've uploaded a patch.
[jira] [Issue Comment Deleted] (PDFBOX-2530) Improve PDFDebugger
[ https://issues.apache.org/jira/browse/PDFBOX-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] khyrul bashar updated PDFBOX-2530: -- Comment: was deleted (was: I've uploaded a patch. )
[jira] [Commented] (PDFBOX-2530) Improve PDFDebugger
[ https://issues.apache.org/jira/browse/PDFBOX-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626755#comment-14626755 ] khyrul bashar commented on PDFBOX-2530: --- Thanks :)
[jira] [Comment Edited] (PDFBOX-2530) Improve PDFDebugger
[ https://issues.apache.org/jira/browse/PDFBOX-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626647#comment-14626647 ] Tilman Hausherr edited comment on PDFBOX-2530 at 7/14/15 5:57 PM: -- There is a new bug (class cast exception) when clicking on a page content stream when in show pages mode. Although the bug is new, I assume that the root cause -(a MapEntry with a MapEntry)- is older. was (Author: tilman): There is a new bug (class cast exception) when clicking on a page content stream when in show pages mode. Although the bug is new, I assume that the root cause (a MapEntry with a MapEntry) is older.
[jira] [Updated] (PDFBOX-2530) Improve PDFDebugger
[ https://issues.apache.org/jira/browse/PDFBOX-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

khyrul bashar updated PDFBOX-2530:
----------------------------------
Attachment: Sonarqube_warning_resolved.diff

Improve PDFDebugger
-------------------

Key: PDFBOX-2530
URL: https://issues.apache.org/jira/browse/PDFBOX-2530
Project: PDFBox
Issue Type: Improvement
Components: Utilities
Affects Versions: 1.8.8, 2.0.0
Reporter: Tilman Hausherr
Assignee: khyrul bashar
Labels: gsoc2015
Attachments: Avoiding_NPE_for_null_Field_Type.diff, BracketsColorChooser.png, DeviceNCS.diff, FlagBitsPane-26-06-2015.diff, Flag_bits_showing_feature-redesigned.diff, Flag_bits_showing_feature.diff, K4SystemFontsNotEmbeded218.pdf, PDFDebugger_StatusBar.png, PDFDebugger_StatusBar_01.png, Parent_dictionary_type_checking_for__f__and__flags.diff, Sonarqube_warning_resolved.diff, Stream_Showing_Feature.diff, indexedcs.diff, openSelectedPath.diff, parent_node_redirect.diff, parent_node_redirect_expand_disabled.diff, removed_redundant_codes.patch, separationCS.diff, sonarqube_warning_resolve.diff, tree.diff, treestatus.diff, treestatuspane.diff

(This is an idea for the [Google Summer of Code 2015|https://www.google-melange.com/])

Our command line utility PDFDebugger (part of the command line pdfbox-app; get it [here|https://pdfbox.apache.org/downloads.html], read the description [here|https://pdfbox.apache.org/commandline/], see the source code [here|https://svn.apache.org/viewvc/pdfbox/trunk/tools/src/main/java/org/apache/pdfbox/tools/PDFDebugger.java?view=markup&sortby=date]) needs some improvements:
- hex view
- view of non-printable characters
- ✓ saving streams
- binary copy & paste
- ✓ Create a status line that shows where we are in the tree. (Like in the Windows REGEDIT)
- ✓ Copy the current tree string into the clipboard (useful in discussions about details of a PDF)
- ✓ (Optional, not sure if easy) Jump to a specific place in the tree by entering a tree string
- ✓ ability to search in streams (very useful for content streams and metadata)
- ✓ show images that are streams
- ✓ show PDIndexed color lookup table, show the index value, the base and RGB color value sets when the mouse moves
- ✓ show PDSeparation color
- ✓ show PDDeviceN colors
- optional, idea should be developed a bit: show a meaningful explanation of some attributes, e.g. appearance stream when hovering over /AP
- show font encodings and characters
- ✓ display flag bits (e.g. Annotation flags) in a way that is easy to understand. There are probably others, I assume that the main work needs to be done only once
- edit attributes (should be possible to enter values as decimal, hex or binary)
- edit streams, while keeping or changing the compression filter
- save altered PDF
- color mark of certain PDF operators, especially q...Q and text operators (BT...ET). Ideally, it should help the user understand the bracketing of these operators, i.e. understand where a sequence starts and where it ends. (See operator summary in the PDF Spec) Other important operators I can think of are the matrix, font and color operators. A cool advanced thing would be to show the current color or the font in a popup when hovering above such an operator.

To see a product with a similar purpose that is better than PDFDebugger, watch [this video|https://www.youtube.com/watch?v=g-QcU9B4qMc]. I'm not asking to implement a clone of that product (I don't use it, all I know is that video), but we at PDFBox really need something that makes PDF debugging easier. As an example of how the current PDFDebugger prevented me from finding a bug quickly, see PDFBOX-2401 and search for PDFDebugger.

Prerequisites:
- Java programming, especially the GUI components
- the ability to understand existing source code

Using external software components is possible (must have Apache License or a compatible one), but should be decided on a case-by-case basis, we don't want to get too big.

Development strategy: go from the easy to the difficult. The wished features are already sorted this way (mostly).

Get introduced: [download the source code with svn|https://pdfbox.apache.org/downloads.html#scm] and build it with maven. Run PDFDebugger and view some PDFs to see the components of a PDF. Start with the file of PDFBOX-2401. Read up something about the structure of PDF on the web or from the [PDF Specification|https://www.adobe.com/devnet/pdf/pdf_reference.html].

Mentor: Tilman Hausherr (European timezone, languages: German, English, French). To see the GSoC2014 project I mentored, go to PDFBOX-1915.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
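
The "display flag bits (e.g. Annotation flags)" item in the issue description could be approached roughly as follows. This is only a minimal sketch and not the code from the attached diffs: the class name AnnotationFlagDecoder and the method decode are hypothetical, and the flag names are taken from the annotation flags table of the PDF specification (bit position 1 is the low-order bit).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: turns the integer value of an
// annotation's /F entry into the names of the flags that are set.
final class AnnotationFlagDecoder {
    // Index 0 corresponds to bit position 1 (the low-order bit).
    private static final String[] NAMES = {
        "Invisible", "Hidden", "Print", "NoZoom", "NoRotate",
        "NoView", "ReadOnly", "Locked", "ToggleNoView", "LockedContents"
    };

    static List<String> decode(int flags) {
        List<String> set = new ArrayList<>();
        for (int i = 0; i < NAMES.length; i++) {
            // Test bit i of the flag word and record its name if it is set.
            if ((flags & (1 << i)) != 0) {
                set.add(NAMES[i]);
            }
        }
        return set;
    }

    public static void main(String[] args) {
        System.out.println(decode(4));   // only bit position 3 is set
        System.out.println(decode(68));  // bit positions 3 and 7 are set
    }
}
```

A table-driven decoder like this would fit the remark that "the main work needs to be done only once": other flag words (for example field flags) would only need a different name table.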
[jira] [Updated] (PDFBOX-2530) Improve PDFDebugger
[ https://issues.apache.org/jira/browse/PDFBOX-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

khyrul bashar updated PDFBOX-2530:
----------------------------------
Attachment: Class_cast_exception_in_page_mode_avoided.diff
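
For the "color mark of certain PDF operators" item in the issue description, the bracketing of operators can be tracked with a stack. In a content stream, q saves the graphics state and Q restores it, and BT/ET delimit a text object. The sketch below is an illustration under assumptions, not PDFDebugger code: the class OperatorBracketChecker and its method depths are hypothetical, and the input is assumed to be a list of operator names with operands already stripped.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical helper, not part of PDFBox: computes a nesting depth per
// operator so that a viewer could indent or color matching brackets.
final class OperatorBracketChecker {
    // Returns the closing operator for an opening one, or null otherwise.
    private static String closerFor(String op) {
        switch (op) {
            case "q":  return "Q";
            case "BT": return "ET";
            default:   return null;
        }
    }

    static int[] depths(List<String> operators) {
        Deque<String> expected = new ArrayDeque<>(); // closers still awaited
        int[] depth = new int[operators.size()];
        for (int i = 0; i < operators.size(); i++) {
            String op = operators.get(i);
            if (op.equals("Q") || op.equals("ET")) {
                // A closer must match the most recently opened bracket.
                if (expected.isEmpty() || !expected.pop().equals(op)) {
                    throw new IllegalStateException("Unbalanced " + op + " at index " + i);
                }
                depth[i] = expected.size(); // same depth as its opener
            } else {
                depth[i] = expected.size();
                String closer = closerFor(op);
                if (closer != null) {
                    expected.push(closer);
                }
            }
        }
        if (!expected.isEmpty()) {
            throw new IllegalStateException("Missing " + expected.peek());
        }
        return depth;
    }

    public static void main(String[] args) {
        List<String> ops = Arrays.asList("q", "BT", "Tj", "ET", "Q");
        System.out.println(Arrays.toString(depths(ops)));
    }
}
```

Throwing on the first mismatch keeps the sketch short; an interactive viewer would more likely mark the offending operator and continue.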
[jira] [Commented] (PDFBOX-2530) Improve PDFDebugger
[ https://issues.apache.org/jira/browse/PDFBOX-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626774#comment-14626774 ]

ASF subversion and git services commented on PDFBOX-2530:
---------------------------------------------------------

Commit 1691044 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1691044 ]

PDFBOX-2530: fix ClassCastException in page content streams when in page display mode, as done by Khyrul Bashar in GSoC2015