[jira] [Commented] (PDFBOX-2272) Can't extract vertical text correctly

2015-07-14 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625878#comment-14625878
 ] 

Tilman Hausherr commented on PDFBOX-2272:
-

Please submit this as a diff against the repository, so that we can easily 
see what is different.

 Can't extract vertical text correctly
 -

 Key: PDFBOX-2272
 URL: https://issues.apache.org/jira/browse/PDFBOX-2272
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.8.6, 2.0.0
Reporter: Biligsaikhan Batjargal
 Attachments: PDFTextStripper.java, test.pdf, test.txt


 - -1.8.6 can't extract the Unicode due to failing to map the UCS2 CMap for 
 90ms-RKSJ-V.-
 - 2.0 extracts the text but can't handle the vertical layout
 Also see the file from PDFBOX-2294 which contains both horizontal and 
 vertical text.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



Re: Performance of the trunkversion

2015-07-14 Thread Andreas Lehmkühler

 Manfred Pock pock.manf...@gmail.com wrote on 14 July 2015 at 12:15:
 
 Yes, the input is an InputStream. I can try loading it directly from a file.
 
 But in general we get the PDF from a document management system as a stream.
 Does it make sense to save the PDF to a file first?
If possible, yes. As I already said, we need random access to the PDF, and
InputStream doesn't support seek operations, so we have to copy the whole
stream to a file or to memory.
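
A minimal sketch of the "save the stream to a file first" approach discussed above, using only the JDK. The class and method names (StreamSpooler, spoolToTempFile, readTail) are hypothetical and not part of PDFBox; readTail is only there to show the kind of seek that a plain InputStream cannot do.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper for illustration; not part of PDFBox. */
final class StreamSpooler {

    /** Copies the whole stream to a temporary file so it can be accessed randomly. */
    static Path spoolToTempFile(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("incoming-", ".pdf");
        // createTempFile already created the file, so the copy must replace it.
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    /** Reads the last n bytes by seeking, which a plain InputStream cannot do. */
    static byte[] readTail(Path file, int n) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            long start = Math.max(0, raf.length() - n);
            byte[] tail = new byte[(int) (raf.length() - start)];
            raf.seek(start);
            raf.readFully(tail);
            return tail;
        }
    }
}
```

The resulting file could then be handed to the file-based load method mentioned in this thread (in the 2.0 API that should be PDDocument.load(File), which is an assumption here rather than something shown in the thread) and deleted once the document is closed.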

 Why is there such a big performance difference between the version from 
 May and the current version when we use it with useScratchFiles = true?
I'm not sure, but the reason seems to be the altered scratch file handling. I
have to double-check that.

BR
Andreas

 Regards, Manfred
 



Re: Performance of the trunkversion

2015-07-14 Thread Timo Boehme

Hi,

as I see it (I had only a quick look at the implementation), the 
ScratchFileBuffer implementation is not optimal for fast random access. 
Single-byte writes are not buffered but written directly to the file 
(a lot of I/O operations), and seek operations have to traverse the linked 
page list, reading some bytes of each page (again a lot of seek and read 
I/O operations).
To speed things up, it is crucial to minimize the number of 
I/O operations going directly to the random access file. We therefore 
need to buffer writes, keep the last read page in memory for sequential 
reads, and have an in-memory cache of page metadata (offset, link to 
previous/next page).
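
The write-buffering point can be sketched roughly as follows. This is not the ScratchFileBuffer code; BufferedScratchWriter is a hypothetical class that only shows how collecting single-byte writes in memory reduces the number of calls that reach the random access file.

```java
import java.io.Closeable;
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

/** Hypothetical sketch, not part of PDFBox: buffers single-byte writes in memory. */
final class BufferedScratchWriter implements Closeable {
    private final RandomAccessFile file;
    private final byte[] buffer;
    private int used;        // bytes currently held in the buffer
    private int flushCount;  // number of write calls that actually reached the file

    BufferedScratchWriter(File target, int bufferSize) throws IOException {
        this.file = new RandomAccessFile(target, "rw");
        this.buffer = new byte[bufferSize];
    }

    /** Stores one byte in memory; touches the file only when the buffer is full. */
    void write(int b) throws IOException {
        if (used == buffer.length) {
            flush();
        }
        buffer[used++] = (byte) b;
    }

    /** Writes the buffered bytes to the file in a single call. */
    void flush() throws IOException {
        if (used > 0) {
            file.write(buffer, 0, used);
            used = 0;
            flushCount++;
        }
    }

    int flushCount() {
        return flushCount;
    }

    @Override
    public void close() throws IOException {
        flush();
        file.close();
    }
}
```

With an assumed 4096-byte buffer, 10000 single-byte writes reach the file in three calls (two full buffers plus the remainder on close) instead of 10000.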



Best,
Timo






--
Timo Boehme
OntoChem IT Solutions GmbH
Blücherstraße 24
06120 Halle (Saale)
Germany

phone: +49 345 478 047 4  | fax: +49 345 478 047 1
email: ulf.la...@ontochem.com | web: www.ontochem.com
HRB 21962 Amtsgericht Stendal | USt-IdNr.: DE815563824
managing director : Lutz Weber





Re: Performance of the trunkversion

2015-07-14 Thread Andreas Lehmkühler
Hi,

 Manfred Pock pock.manf...@gmail.com wrote on 14 July 2015 at 11:39:
 
 OK, we load the PDFs with useScratchFiles = true. If we load them with 
 false, the performance is better, but still a little slower than the old one.
What do you use as input, a stream or a real file? If the latter, you should use
the load method with the file parameter.

PDFBox needs random access to the PDF, and if a stream is provided, PDFBox copies
the data to a file (lower memory usage, slower performance) or to memory
(higher memory usage, better performance).

BR
Andreas


 But now it needs more memory. I cannot load some PDFs with the current 
 version using the same Java memory configuration.
 



[jira] [Commented] (PDFBOX-2860) NonSeq parser slower than Seq parser

2015-07-14 Thread simon steiner (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626187#comment-14626187
 ] 

simon steiner commented on PDFBOX-2860:
---

It sounds like we should use the NonSeq parser if we can pass a file, and 
otherwise use the Seq parser for an InputStream.

 NonSeq parser slower than Seq parser
 

 Key: PDFBOX-2860
 URL: https://issues.apache.org/jira/browse/PDFBOX-2860
 Project: PDFBox
  Issue Type: Bug
  Components: Parsing
Affects Versions: 2.0.0
Reporter: simon steiner

 PDF from PDFBOX-797
 for (int i = 0; i < 1000; i++) {
     PDDocument.load(new FileInputStream("4218.pdf")).close();
 }
 Nonseq:
 real  0m23.691s
 Seq:
 real  0m9.705s






Performance of the trunkversion

2015-07-14 Thread Manfred Pock

Hi,

we use the PDFBox trunk version to render PDFs; currently we use the 
version from 12 May 2015.


Today I updated to the current version and tested it. It 
seems to need much more time to render PDFs now, depending 
on the size of the PDF.


For example, you can try this one:

http://cloud.directupload.net/15bu

It takes five times as long as the version from May 2015.

Regards, Manfred


Re: Performance of the trunkversion

2015-07-14 Thread Manfred Pock

Yes, the input is an InputStream. I can try loading it directly from a file.

But in general we get the PDF from a document management system as a stream.
Does it make sense to save the PDF to a file first?

Why is there such a big performance difference between the version from 
May and the current version when we use it with useScratchFiles = true?


Regards, Manfred




Understanding PDFBox

2015-07-14 Thread Subham Tripathi
Hi All,
I wish to contribute to Apache PDFBox, but first I have been trying to
understand the code base. I am finding it very hard to understand because
I cannot find any flow to follow.
Is there any documentation from which I can get some high-level insight into
PDFBox?

-- 
Best Regards,
Subham Tripathi


Re: Performance of the trunkversion

2015-07-14 Thread Timo Boehme

Hi,

instead of having a linked page list in ScratchFileBuffer, I would 
propose keeping a list of pages with the page numbers (integers) in 
memory (this takes 1 KB for 1 MB of data). This would ease page handling, seeking 
would not need I/O operations, and caching of pages would be a lot easier.
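
A rough sketch of such an in-memory page list, to make the proposal concrete. PageIndex is a hypothetical class, not the eventual PDFBox replacement, and the 4096-byte page size is an assumption; it is chosen because 4-byte page numbers then reproduce the "1 KB for 1 MB" figure.

```java
import java.util.Arrays;

/** Hypothetical sketch, not the PDFBox implementation. */
final class PageIndex {
    static final int PAGE_SIZE = 4096; // assumed page size in bytes

    // pages[i] is the scratch-file page number that holds the i-th logical page
    private int[] pages = new int[16];
    private int count;

    /** Appends the scratch-file page number of the next logical page. */
    void addPage(int scratchPageNumber) {
        if (count == pages.length) {
            pages = Arrays.copyOf(pages, count * 2);
        }
        pages[count++] = scratchPageNumber;
    }

    /** Maps a logical stream position to a file offset without any I/O. */
    long fileOffset(long logicalPosition) {
        int logicalPage = (int) (logicalPosition / PAGE_SIZE);
        if (logicalPosition < 0 || logicalPage >= count) {
            throw new IndexOutOfBoundsException("position " + logicalPosition);
        }
        return (long) pages[logicalPage] * PAGE_SIZE + logicalPosition % PAGE_SIZE;
    }

    /** Memory needed for the page numbers alone, at 4 bytes per page. */
    static long indexBytesFor(long dataBytes) {
        long pageCount = (dataBytes + PAGE_SIZE - 1) / PAGE_SIZE;
        return pageCount * Integer.BYTES;
    }
}
```

Under these assumptions, 1 MB of data spans 256 pages, so the index needs 256 x 4 = 1024 bytes, and a seek is a single array lookup instead of a walk along linked pages in the file.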

I may find some time later to come up with such a replacement.

Best,
Timo





Re: Understanding PDFBox

2015-07-14 Thread khyrul Bashar
Hi Subham,
I'm a GSoC student at PDFBox this year, and I'm improving the PDFDebugger
of PDFBox (issue https://issues.apache.org/jira/browse/PDFBOX-2530). Before
applying for the project, I had to become familiar with the code base. I was
a bit puzzled at first, but now I have a basic understanding of
the code base, though I'm not coding for the main module of PDFBox. Here
is what I've done so far to get comfortable with PDFBox.

Read the PDF specification, or at least get a head start on it.
https://www.adobe.com/devnet/pdf/pdf_reference.html
Read the documentation.
https://pdfbox.apache.org/docs/2.0.0-SNAPSHOT/javadocs/
Play with the example code.
https://svn.apache.org/viewvc/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/

Anyway, there are other things to do before you can contribute, which I think
the committers can describe more specifically.

Regards
Khyrul Bashar



[jira] [Commented] (PDFBOX-2877) Wrong text placement for autosize fields compared to Adobe generated

2015-07-14 Thread Maruan Sahyoun (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626372#comment-14626372
 ] 

Maruan Sahyoun commented on PDFBOX-2877:


I made some progress, and the results using an Arial font are much better. The 
results using the Courier font are still far from the Adobe-generated 
results, so it looks like relying on the available font metrics, such as 
BBox, ascender, descender ..., is not enough. I will continue to investigate and 
post my findings. For the moment I will generate some more test material 
to ensure that we have better coverage of different fonts.

 Wrong text placement for autosize fields compared to Adobe generated
 

 Key: PDFBOX-2877
 URL: https://issues.apache.org/jira/browse/PDFBOX-2877
 Project: PDFBox
  Issue Type: Sub-task
  Components: AcroForm
Affects Versions: 1.8.9, 2.0.0
Reporter: Maruan Sahyoun
Assignee: Maruan Sahyoun
  Labels: Appearance
 Fix For: 2.0.0

 Attachments: AutosizeTests-filled-20150713.pdf, 
 AutosizeTests-filled-20150713.png, AutosizeTests.pdf


 When a field uses autosizing the generated appearance is wrong as
 - the text is placed lower than expected
 - the font size is too large
 compared to the appearance generated with Adobe tools






[jira] [Commented] (PDFBOX-2852) Improve code quality (2)

2015-07-14 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626537#comment-14626537
 ] 

ASF subversion and git services commented on PDFBOX-2852:
-

Commit 1691003 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1691003 ]

PDFBOX-2852: remove unused imports

 Improve code quality (2)
 

 Key: PDFBOX-2852
 URL: https://issues.apache.org/jira/browse/PDFBOX-2852
 Project: PDFBox
  Issue Type: Task
Affects Versions: 2.0.0
Reporter: Tilman Hausherr

 This is a longterm issue for the task to improve code quality, by using the 
 [SonarQube 
 report|https://analysis.apache.org/dashboard/index/org.apache.pdfbox:pdfbox-reactor],
  hints in different IDEs, the FindBugs tool and other code quality tools.
 This is a follow-up of PDFBOX-2576, which was getting too long.






[jira] [Commented] (PDFBOX-2852) Improve code quality (2)

2015-07-14 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626541#comment-14626541
 ] 

ASF subversion and git services commented on PDFBOX-2852:
-

Commit 1691006 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1691006 ]

PDFBOX-2852: use interface




Re: Understanding PDFBox

2015-07-14 Thread Tilman Hausherr

On 14.07.2015 at 12:58, Subham Tripathi wrote:

Hi All,
I wish to contribute to Apache PDFBox, but first I have been trying to
understand the code base. I am finding it very hard to understand because
I cannot find any flow to follow.
Is there any documentation from which I can get some high-level insight into
PDFBox?




Look at the examples... and start from there. Then look at an unsolved 
issue :-)


If this is about getting coding practice, google for BATIK-1109 
https://issues.apache.org/jira/browse/BATIK-1109 and BATIK-1110 
https://issues.apache.org/jira/browse/BATIK-1110. One of the bugs can 
probably be fixed with a few lines (although some debugging is needed to see 
how signed / unsigned values are handled there); the other one involves 
using code from PDFBox, but in the way BATIK uses it. Both bugs have been 
fixed in PDFBox, but not in BATIK (from which PDFBox took some code).


Tilman




[jira] [Commented] (PDFBOX-2852) Improve code quality (2)

2015-07-14 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626552#comment-14626552
 ] 

ASF subversion and git services commented on PDFBOX-2852:
-

Commit 1691007 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1691007 ]

PDFBOX-2852: use interface




[jira] [Commented] (PDFBOX-2852) Improve code quality (2)

2015-07-14 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626576#comment-14626576
 ] 

ASF subversion and git services commented on PDFBOX-2852:
-

Commit 1691024 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1691024 ]

PDFBOX-2852: use interface




[jira] [Commented] (PDFBOX-2852) Improve code quality (2)

2015-07-14 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626589#comment-14626589
 ] 

ASF subversion and git services commented on PDFBOX-2852:
-

Commit 1691031 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1691031 ]

PDFBOX-2852: use interface




[jira] [Commented] (PDFBOX-2842) Overhaul font substitution

2015-07-14 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626531#comment-14626531
 ] 

Tilman Hausherr commented on PDFBOX-2842:
-

{code}LOG.warn("New fonts found, font cache will be re-built");{code}
Shouldn't all of these be info instead of warn?


 Overhaul font substitution
 --

 Key: PDFBOX-2842
 URL: https://issues.apache.org/jira/browse/PDFBOX-2842
 Project: PDFBox
  Issue Type: Improvement
  Components: FontBox, PDModel
Affects Versions: 2.0.0
Reporter: John Hewson
Assignee: John Hewson
Priority: Blocker
 Fix For: 2.0.0

 Attachments: 029423-p1.pdf, 166292-fi-ligature.pdf


 The improved font substitution mechanisms in 2.0 are not quite sufficient to 
 handle all PDFs. Specifically, CJK substitution and substitution of TTF in 
 place of CFF fonts is not possible with the current design.
 The CJK problems can be seen in PDFBOX-2509 and PDFBOX-2563, which does not 
 solve the problem. Additional font API weaknesses can be found in PDFBOX-2578 
 and PDFBOX-2366. This meta-issue aims to address all of those sub-issues.
 The current problems are:
 - FontBox does not provide a generic font type, so we have to handle 
 TrueTypeFont, CFFFont, and Type1Font separately. This hinders cross-format 
 substitution.
 - ExternalFonts has no knowledge of the CIDSystemInfo which is necessary for 
 CJK substitution
 - FontProvider contains too much public logic which should be internal to 
 PDFBox, e.g. substitution logic, this makes it brittle and means we won't be 
 able to add additional logic after 2.0 is released, e.g. CJK substitution.
 - Too much confusion about the role of ExternalFonts, particularly with 
 regards to mapping of built-in fonts and the definition of substitute vs. 
 fallback font.
 - ExternalFonts is a black box: the user cannot tell whether the font 
 returned is an exact match, or a last-resort fallback.
 - Confusing font substitution API, users preferred having a flat file format
 - PDSimpleFont#getEncoding() can return null for TTFs which use built-in 
 encodings. This has caused a lot of bugs - there must be a better way.
 - We still have some confusing names, for example a CustomEncoding is known 
 as a built-in encoding in the spec.
 - There is no fallback CFF font, we resort to AdobeBlank instead, which has 
 no rendering.






[jira] [Commented] (PDFBOX-2852) Improve code quality (2)

2015-07-14 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626538#comment-14626538
 ] 

ASF subversion and git services commented on PDFBOX-2852:
-

Commit 1691004 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1691004 ]

PDFBOX-2852: use interface




[jira] [Commented] (PDFBOX-2852) Improve code quality (2)

2015-07-14 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626570#comment-14626570
 ] 

ASF subversion and git services commented on PDFBOX-2852:
-

Commit 1691019 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1691019 ]

PDFBOX-2852: use interface




[jira] [Commented] (PDFBOX-2852) Improve code quality (2)

2015-07-14 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626587#comment-14626587
 ] 

ASF subversion and git services commented on PDFBOX-2852:
-

Commit 1691030 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1691030 ]

PDFBOX-2852: use interface




[jira] [Commented] (PDFBOX-2852) Improve code quality (2)

2015-07-14 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626573#comment-14626573
 ] 

ASF subversion and git services commented on PDFBOX-2852:
-

Commit 1691020 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1691020 ]

PDFBOX-2852: use interface




[jira] [Commented] (PDFBOX-2852) Improve code quality (2)

2015-07-14 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626578#comment-14626578
 ] 

ASF subversion and git services commented on PDFBOX-2852:
-

Commit 1691027 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1691027 ]

PDFBOX-2852: use interface




[jira] [Commented] (PDFBOX-2852) Improve code quality (2)

2015-07-14 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626585#comment-14626585
 ] 

ASF subversion and git services commented on PDFBOX-2852:
-

Commit 1691028 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1691028 ]

PDFBOX-2852: use interface




[jira] [Commented] (PDFBOX-2530) Improve PDFDebugger

2015-07-14 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626593#comment-14626593
 ] 

ASF subversion and git services commented on PDFBOX-2530:
-

Commit 1691032 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1691032 ]

PDFBOX-2852, PDFBOX-2530: reduce code complexity; null parameter for setText() 
is legit

 Improve PDFDebugger
 ---

 Key: PDFBOX-2530
 URL: https://issues.apache.org/jira/browse/PDFBOX-2530
 Project: PDFBox
  Issue Type: Improvement
  Components: Utilities
Affects Versions: 1.8.8, 2.0.0
Reporter: Tilman Hausherr
Assignee: khyrul bashar
  Labels: gsoc2015
 Attachments: Avoiding_NPE_for_null_Field_Type.diff, 
 BracketsColorChooser.png, DeviceNCS.diff, FlagBitsPane-26-06-2015.diff, 
 Flag_bits_showing_feature-redesigned.diff, Flag_bits_showing_feature.diff, 
 K4SystemFontsNotEmbeded218.pdf, PDFDebugger_StatusBar.png, 
 PDFDebugger_StatusBar_01.png, 
 Parent_dictionary_type_checking_for__f__and__flags.diff, 
 Stream_Showing_Feature.diff, indexedcs.diff, openSelectedPath.diff, 
 parent_node_redirect.diff, parent_node_redirect_expand_disabled.diff, 
 removed_redundant_codes.patch, separationCS.diff, 
 sonarqube_warning_resolve.diff, tree.diff, treestatus.diff, 
 treestatuspane.diff


 (This is an idea for the [Google Summer of Code 
 2015|https://www.google-melange.com/])
 Our command line utility PDFDebugger (part of the command line pdfbox-app get 
 it [here|https://pdfbox.apache.org/downloads.html], read description 
 [here|https://pdfbox.apache.org/commandline/], see the source code 
 [here|https://svn.apache.org/viewvc/pdfbox/trunk/tools/src/main/java/org/apache/pdfbox/tools/PDFDebugger.java?view=markup&sortby=date])
  needs some improvements:
- hex view
- view of non printable characters
- ✓ saving streams
- binary copy & paste
- ✓ Create a status line that shows where we are in the tree. (Like in the 
 Windows REGEDIT)
- ✓ Copy the current tree string into the clipboard (useful in discussions 
 about details of a PDF)
- ✓ (Optional, not sure if easy) Jump to specific place in the tree by 
 entering tree string
- ✓ ability to search in streams (very useful for content streams and meta 
 data)
- ✓ show images that are streams
- ✓ show PDIndexed color lookup table, show the index value, the base and 
 RGB color value sets when the mouse moves
- ✓ show PDSeparation color
- ✓ show PDDeviceN colors
- optional, idea should be developed a bit: show meaningful explanation on 
 some attributes, e.g. appearance stream when hovering over /AP
- show font encodings and characters
- ✓ display flag bits (e.g. Annotation flags) in a way that is easy to 
 understand. There are probably others, I assume that the main work needs to 
 be done only once
- edit attributes (should be possible to enter values as decimal, hex or 
 binary)
- edit streams, while keeping or changing the compression filter
- save altered PDF 
- color mark of certain PDF operators, especially Q...q and text operators 
 (BT...ET). Ideally, it should help the user understand the bracketing of 
 these operators, i.e. understand where a sequence starts and where it ends. 
 (See operator summary in the PDF Spec) Other important operators I can 
 think of are the matrix, font and color operators. A cool advanced thing 
 would be to show the current color or the font in a popup when hovering above 
 such an operator.
 To see a product with a similar purpose that is better than PDFDebugger, 
 watch [this video|https://www.youtube.com/watch?v=g-QcU9B4qMc].
 I'm not asking to implement a clone of that product (I don't use it, all I 
 know is that video), but we at PDFBox really need something that makes PDF 
 debugging easier. As an example of how the current PDFDebugger prevented me 
 from finding a bug quickly, see PDFBOX-2401 and search for PDFDebugger.
 Prerequisites:
 - java programming, especially the GUI components
 - the ability to understand existing source code
 Using external software components is possible (must have Apache License or a 
 compatible one), but should be decided on a case-by-case basis, we don't want 
 to get too big.
 Development strategy: go from the easy to the difficult. The wished features 
 are already sorted this way (mostly).
 Get introduced: [download the source code with 
 svn|https://pdfbox.apache.org/downloads.html#scm] and build it with maven. 
 Run PDFDebugger and view some PDFs to see the components of a PDF. Start with 
 the file of PDFBOX-2401. Read up something about the structure of PDF on the 
 web or from the [PDF 
 Specification|https://www.adobe.com/devnet/pdf/pdf_reference.html].
 Mentor: Tilman Hausherr (European timezone, languages: 

[jira] [Commented] (PDFBOX-2852) Improve code quality (2)

2015-07-14 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626592#comment-14626592
 ] 

ASF subversion and git services commented on PDFBOX-2852:
-

Commit 1691032 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1691032 ]

PDFBOX-2852, PDFBOX-2530: reduce code complexity; null parameter for setText() 
is legit




Re: first stack trace report from pdfbox 2.0.0 trunk

2015-07-14 Thread Tilman Hausherr

Hi Tim,

Currently there is at least one known regression, mentioned in 
PDFBOX-2842, it applies to 029423 but also to other files.


Tilman

On 10.07.2015 at 13:57, Allison, Timothy B. wrote:

All,
   I just posted the first stacktrace report from my initial partial batch run 
against govdocs1 here: 
https://issues.apache.org/jira/secure/attachment/12744700/pdfbox_reports_2_0_0_20150709.zip

Caveats/Notes

The run yesterday did not include the fixes that were made in PDFBOX-2370 or 
PDFBOX-2862.

I stopped the batch run early. This only covered ~50k pdfs.

I forgot to turn on accesspermission checking. Some of the pdfs in here would 
normally have been skipped.

I haven't reviewed any of the exceptions. They may be caused by code on the 
Tika side.

I'll plan to re-run with the latest trunk on Tuesday.  I need to turn back to 
the actual eval code for a bit. :)


Cheers,

   Tim









[jira] [Commented] (PDFBOX-2530) Improve PDFDebugger

2015-07-14 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626647#comment-14626647
 ] 

Tilman Hausherr commented on PDFBOX-2530:
-

There is a new bug (class cast exception) when clicking on a page content 
stream when in page mode. Although the bug is new, I assume that the root cause 
(a MapEntry with a MapEntry) is older.


[jira] [Commented] (PDFBOX-2860) NonSeq parser slower than Seq parser

2015-07-14 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626658#comment-14626658
 ] 

Tilman Hausherr commented on PDFBOX-2860:
-

No, you can't, because the sequential parser no longer exists in 2.0. It was not 
correct.

 NonSeq parser slower than Seq parser
 

 Key: PDFBOX-2860
 URL: https://issues.apache.org/jira/browse/PDFBOX-2860
 Project: PDFBox
  Issue Type: Bug
  Components: Parsing
Affects Versions: 2.0.0
Reporter: simon steiner

 PDF from PDFBOX-797
 for (int i = 0; i < 1000; i++) {
     PDDocument.load(new FileInputStream("4218.pdf")).close();
 }
 Nonseq:
 real  0m23.691s
 Seq:
 real  0m9.705s
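
The loop above measures total process time, which also includes JVM start-up and JIT warm-up. A minimal sketch of a harness that times only the repeated work is given here; the class name `LoadBenchmark` and the method `timeRepeated` are made up for illustration, and the PDFBox call appears only in a comment so that the sketch runs without PDFBox on the classpath.

```java
public class LoadBenchmark {

    /** Runs the task the given number of times and returns the elapsed nanoseconds. */
    static long timeRepeated(Runnable task, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            task.run();
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        // Placeholder workload (assumption). A real measurement would instead call
        // something like PDDocument.load(new File("4218.pdf")).close(), wrapping
        // the checked IOException, because Runnable.run() cannot throw it.
        Runnable placeholder = new Runnable() {
            @Override
            public void run() {
                "4218.pdf".hashCode();
            }
        };

        timeRepeated(placeholder, 100);               // warm-up, result discarded
        long nanos = timeRepeated(placeholder, 1000); // measured run
        System.out.println("1000 iterations: " + (nanos / 1000000.0) + " ms");
    }
}
```

Timing inside the JVM after a warm-up pass separates parser cost from start-up cost, which makes a comparison between two parser implementations easier to interpret.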






[jira] [Comment Edited] (PDFBOX-2530) Improve PDFDebugger

2015-07-14 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626647#comment-14626647
 ] 

Tilman Hausherr edited comment on PDFBOX-2530 at 7/14/15 4:57 PM:
--

There is a new bug (class cast exception) when clicking on a page content 
stream when in show pages mode. Although the bug is new, I assume that the 
root cause (a MapEntry with a MapEntry) is older.


was (Author: tilman):
There is a new bug (class cast exception) when clicking on a page content 
stream when in page mode. Although the bug is new, I assume that the root cause 
(a MapEntry with a MapEntry) is older.


Re: Understanding PDFBox

2015-07-14 Thread John Hewson
The book “Developing with PDF” provides a short and gentle introduction to the
PDF format.

We have a brief architectural summary of PDFBox at:

http://pdfbox.apache.org/1.8/architecture.html 

But in general, to make sense of PDFBox, you’ll need to understand the PDF spec.

— John

 On 14 Jul 2015, at 03:58, Subham Tripathi subham@gmail.com wrote:
 
 Hi All,
 I wish to contribute to Apache PDFBox, but before that I was trying to
 understand the codebase. I am finding it very tough to understand the
 codebase, as I cannot find any flow to follow.
 Is there any documentation from which I can draw some high-level insight
 into PDFBox?
 
 -- 
 Best Regards,
 Subham Tripathi



[jira] [Commented] (PDFBOX-2842) Overhaul font substitution

2015-07-14 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626822#comment-14626822
 ] 

John Hewson commented on PDFBOX-2842:
-

I'm not sure. My thinking was that I really want users to see these messages 
and it's common for users to not log at the INFO level. Re-building the font 
cache is slow (up to 10sec) and possibly unexpected, so it seemed like a 
legitimate warning - explaining unusually slow behaviour.

 Overhaul font substitution
 --

 Key: PDFBOX-2842
 URL: https://issues.apache.org/jira/browse/PDFBOX-2842
 Project: PDFBox
  Issue Type: Improvement
  Components: FontBox, PDModel
Affects Versions: 2.0.0
Reporter: John Hewson
Assignee: John Hewson
Priority: Blocker
 Fix For: 2.0.0

 Attachments: 029423-p1.pdf, 166292-fi-ligature.pdf


 The improved font substitution mechanisms in 2.0 are not quite sufficient to 
 handle all PDFs. Specifically, CJK substitution and substitution of TTF in 
 place of CFF fonts is not possible with the current design.
 The CJK problems can be seen in PDFBOX-2509 and PDFBOX-2563, which does not 
 solve the problem. Additional font API weaknesses can be found in PDFBOX-2578 
 and PDFBOX-2366. This meta-issue aims to address all of those sub-issues.
 The current problems are:
 - FontBox does not provide a generic font type, so we have to handle 
 TrueTypeFont, CFFFont, and Type1Font separately. This hinders cross-format 
 substitution.
 - ExternalFonts has no knowledge of the CIDSystemInfo which is necessary for 
 CJK substitution
 - FontProvider contains too much public logic which should be internal to 
 PDFBox, e.g. substitution logic, this makes it brittle and means we won't be 
 able to add additional logic after 2.0 is released, e.g. CJK substitution.
 - Too much confusion about the role of ExternalFonts, particularly with 
 regards to mapping of built-in fonts and the definition of substitute vs. 
 fallback font.
 - ExternalFonts is a black box: the user cannot tell whether the font 
 returned is an exact match, or a last-resort fallback.
 - Confusing font substitution API, users preferred having a flat file format
 - PDSimpleFont#getEncoding() can return null for TTFs which use built-in 
 encodings. This has caused a lot of bugs - there must be a better way.
 - We still have some confusing names, for example a CustomEncoding is known 
 as a built-in encoding in the spec.
 - There is no fallback CFF font, we resort to AdobeBlank instead, which has 
 no rendering.
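
The "black box" point in the list above can be illustrated with a small sketch of a lookup result that tells the caller whether it is an exact match or a last-resort fallback. All names here (`FontLookupSketch`, `FontMatch`, `lookup`) are hypothetical and are not part of the PDFBox or FontBox API; fonts are represented by plain strings to keep the sketch self-contained.

```java
import java.util.HashMap;
import java.util.Map;

public class FontLookupSketch {

    /** Hypothetical result type: the font plus a flag telling how it was found. */
    static final class FontMatch<T> {
        private final T font;
        private final boolean fallback;

        FontMatch(T font, boolean fallback) {
            this.font = font;
            this.fallback = fallback;
        }

        T getFont() {
            return font;
        }

        boolean isFallback() {
            return fallback;
        }
    }

    /** Returns an exact match if one is installed, otherwise the last-resort font. */
    static FontMatch<String> lookup(Map<String, String> installed, String name, String lastResort) {
        String exact = installed.get(name);
        if (exact != null) {
            return new FontMatch<String>(exact, false);
        }
        return new FontMatch<String>(lastResort, true);
    }

    public static void main(String[] args) {
        Map<String, String> installed = new HashMap<String, String>();
        installed.put("Helvetica", "Helvetica.ttf");

        System.out.println(lookup(installed, "Helvetica", "Fallback.ttf").isFallback());
        System.out.println(lookup(installed, "MS-Mincho", "Fallback.ttf").isFallback());
    }
}
```

With such a result type the caller can decide for itself whether a fallback is acceptable, for example by logging a warning only in that case.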






[jira] [Commented] (PDFBOX-2530) Improve PDFDebugger

2015-07-14 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626958#comment-14626958
 ] 

John Hewson commented on PDFBOX-2530:
-

JComboBox doesn't have a type parameter in Java 1.6.
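
A minimal sketch of what this means for code that has to compile on Java 1.6: the Swing combo box classes were only made generic in Java 7, so the raw types have to be used and the selected item cast by hand. The class and method names below are made up for illustration and are not taken from PDFDebugger.

```java
import javax.swing.ComboBoxModel;
import javax.swing.DefaultComboBoxModel;
import javax.swing.JComboBox;

public class RawComboBoxExample {

    /** Builds a raw model; on Java 7+ this compiles with a warning, on Java 6 it is the only form. */
    @SuppressWarnings({"rawtypes", "unchecked"})
    static DefaultComboBoxModel createModel(Object[] items) {
        return new DefaultComboBoxModel(items);
    }

    /** Builds a raw combo box instead of JComboBox<String>. */
    @SuppressWarnings({"rawtypes", "unchecked"})
    static JComboBox createComboBox(Object[] items) {
        return new JComboBox(createModel(items));
    }

    /** Without a type parameter, getSelectedItem() returns Object and needs a cast. */
    @SuppressWarnings("rawtypes")
    static String selectedLabel(ComboBoxModel model) {
        return (String) model.getSelectedItem();
    }

    public static void main(String[] args) {
        DefaultComboBoxModel model = createModel(new String[] {"Print", "Hidden"});
        System.out.println(selectedLabel(model));
    }
}
```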

 Improve PDFDebugger
 ---

 Key: PDFBOX-2530
 URL: https://issues.apache.org/jira/browse/PDFBOX-2530
 Project: PDFBox
  Issue Type: Improvement
  Components: Utilities
Affects Versions: 1.8.8, 2.0.0
Reporter: Tilman Hausherr
Assignee: khyrul bashar
  Labels: gsoc2015
 Attachments: Avoiding_NPE_for_null_Field_Type.diff, 
 BracketsColorChooser.png, Class_cast_exception_in_page_mode_avoided.diff, 
 DeviceNCS.diff, FlagBitsPane-26-06-2015.diff, 
 Flag_bits_showing_feature-redesigned.diff, Flag_bits_showing_feature.diff, 
 K4SystemFontsNotEmbeded218.pdf, PDFDebugger_StatusBar.png, 
 PDFDebugger_StatusBar_01.png, 
 Parent_dictionary_type_checking_for__f__and__flags.diff, 
 Sonarqube_warning_resolved.diff, Stream_Showing_Feature.diff, indexedcs.diff, 
 openSelectedPath.diff, parent_node_redirect.diff, 
 parent_node_redirect_expand_disabled.diff, removed_redundant_codes.patch, 
 separationCS.diff, sonarqube_warning_resolve.diff, tree.diff, 
 treestatus.diff, treestatuspane.diff


 (This is an idea for the [Google Summer of Code 
 2015|https://www.google-melange.com/])
 Our command line utility PDFDebugger (part of the command line pdfbox-app get 
 it [here|https://pdfbox.apache.org/downloads.html], read description 
 [here|https://pdfbox.apache.org/commandline/], see the source code 
 [here|https://svn.apache.org/viewvc/pdfbox/trunk/tools/src/main/java/org/apache/pdfbox/tools/PDFDebugger.java?view=markup&sortby=date])
  needs some improvements:
- hex view
- view of non printable characters
- ✓ saving streams
- binary copy & paste
- ✓ Create a status line that shows where we are in the tree. (Like in the 
 Windows REGEDIT)
- ✓ Copy the current tree string into the clipboard (useful in discussions 
 about details of a PDF)
- ✓ (Optional, not sure if easy) Jump to specific place in the tree by 
 entering tree string
- ✓ ability to search in streams (very useful for content streams and meta 
 data)
- ✓ show images that are streams
- ✓ show PDIndexed color lookup table, show the index value, the base and 
 RGB color value sets when the mouse moves
- ✓ show PDSeparation color
- ✓ show PDDeviceN colors
- optional, idea should be developed a bit: show meaningful explanation on 
 some attributes, e.g. appearance stream when hovering over /AP
- show font encodings and characters
- ✓ display flag bits (e.g. Annotation flags) in a way that is easy to 
 understand. There are probably others, I assume that the main work needs to 
 be done only once
- edit attributes (should be possible to enter values as decimal, hex or 
 binary)
- edit streams, while keeping or changing the compression filter
- save altered PDF 
- color mark of certain PDF operators, especially Q...q and text operators 
 (BT...ET). Ideally, it should help the user understand the bracketing of 
 these operators, i.e. understand where a sequence starts and where it ends. 
 (See operator summary in the PDF Spec) Other important operators I can 
 think of are the matrix, font and color operators. A cool advanced thing 
 would be to show the current color or the font in a popup when hovering above 
 such an operator.
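
As an illustration of the flag bits item in the list above, here is a minimal sketch that turns an annotation /F value into readable names. The bit names follow the annotation flags table of the PDF specification (bit position 1 is the lowest-order bit); the class `AnnotationFlags` and the method `decode` are made up for this sketch and are not the PDFDebugger implementation.

```java
import java.util.ArrayList;
import java.util.List;

public class AnnotationFlags {

    // Names for bit positions 1..10 of the annotation /F entry, lowest-order bit first.
    private static final String[] NAMES = {
        "Invisible", "Hidden", "Print", "NoZoom", "NoRotate",
        "NoView", "ReadOnly", "Locked", "ToggleNoView", "LockedContents"
    };

    /** Returns the names of all flags that are set in the given /F value. */
    static List<String> decode(int flags) {
        List<String> set = new ArrayList<String>();
        for (int bit = 0; bit < NAMES.length; bit++) {
            if ((flags & (1 << bit)) != 0) {
                set.add(NAMES[bit]);
            }
        }
        return set;
    }

    public static void main(String[] args) {
        System.out.println(decode(4));   // prints [Print]
        System.out.println(decode(68));  // 68 = 4 + 64, prints [Print, ReadOnly]
    }
}
```

Because the decoding is driven by a table of names, the same loop could serve other flag fields by swapping the table, which fits the remark that the main work needs to be done only once.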
 To see a product with a similar purpose that is better than PDFDebugger, 
 watch [this video|https://www.youtube.com/watch?v=g-QcU9B4qMc].
 I'm not asking to implement a clone of that product (I don't use it, all I 
 know is that video), but we at PDFBox really need something that makes PDF 
 debugging easier. As an example of how the current PDFDebugger prevented me 
 from finding a bug quickly, see PDFBOX-2401 and search for PDFDebugger.
 Prerequisites:
 - java programming, especially the GUI components
 - the ability to understand existing source code
 Using external software components is possible (must have Apache License or a 
 compatible one), but should be decided on a case-by-case basis, we don't want 
 to get too big.
 Development strategy: go from the easy to the difficult. The wished features 
 are already sorted this way (mostly).
 Get introduced: [download the source code with 
 svn|https://pdfbox.apache.org/downloads.html#scm] and build it with maven. 
 Run PDFDebugger and view some PDFs to see the components of a PDF. Start with 
 the file of PDFBOX-2401. Read up something about the structure of PDF on the 
 web or from the [PDF 
 Specification|https://www.adobe.com/devnet/pdf/pdf_reference.html].
 Mentor: Tilman Hausherr (European timezone, languages: german, english, 
 french). To see the GSoC2014 project I mentored, go to PDFBOX-1915.




[jira] [Comment Edited] (PDFBOX-2272) Can't extract vertical text correctly

2015-07-14 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626972#comment-14626972
 ] 

John Hewson edited comment on PDFBOX-2272 at 7/14/15 8:09 PM:
--

No, diff and SVN patch are different formats. They're similar in theory but not 
compatible. Both can be applied with patch though.


was (Author: jahewson):
No, diff and SVN patch are different formats. They're similar in theory but not 
compatible.

 Can't extract vertical text correctly
 -

 Key: PDFBOX-2272
 URL: https://issues.apache.org/jira/browse/PDFBOX-2272
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.8.6, 2.0.0
Reporter: Biligsaikhan Batjargal
 Attachments: PDFTextStripper.java, test.pdf, test.txt, vertical.diff


 - -1.8.6 can't extract the Unicode due to failing to map the UCS2 CMap for 
 90ms-RKSJ-V.-
 - 2.0 extracts the text but can't handle the vertical layout
 Also see the file from PDFBOX-2294 which contains both horizontal and 
 vertical text.






[jira] [Comment Edited] (PDFBOX-2272) Can't extract vertical text correctly

2015-07-14 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626972#comment-14626972
 ] 

John Hewson edited comment on PDFBOX-2272 at 7/14/15 8:09 PM:
--

No, diff and SVN patch are different formats. They're similar in theory but not 
compatible. Both can be applied with patch though. For example, IntelliJ 
can't apply your diff.


was (Author: jahewson):
No, diff and SVN patch are different formats. They're similar in theory but not 
compatible. Both can be applied with patch though.




[jira] [Resolved] (PDFBOX-2881) Radial and Axial shading steps are calculated incorrectly

2015-07-14 Thread John Hewson (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Hewson resolved PDFBOX-2881.
-
Resolution: Fixed

 Radial and Axial shading steps are calculated incorrectly
 -

 Key: PDFBOX-2881
 URL: https://issues.apache.org/jira/browse/PDFBOX-2881
 Project: PDFBox
  Issue Type: Bug
  Components: Rendering
Affects Versions: 2.0.0
Reporter: John Hewson
Assignee: John Hewson
 Fix For: 2.0.0


 I found a shading bug while writing some code to dump all shadings in a PDF. 
 I don't know if this affects PDF rendering within PageDrawer or not.
 RadialShadingContext and AxialShadingContext use the following code in their 
 constructors to calculate the number of steps (pixels) in the shading and 
 build a lookup table for each step:
 {code}
 // transform the distance to actual pixel space
 // use transform, because xform.getScaleX() does not return correct scaling
 // on 90° rotated matrix
 Point2D point = new Point2D.Double(longestDistance, longestDistance);
 matrix.transform(point);
 xform.transform(point, point);
 factor = (int) Math.max(Math.abs(point.getX()), Math.abs(point.getY()));
 colorTable = calcColorTable();
 {code}
 The variable factor is the number of steps and matrix is the parent 
 stream's matrix + the pattern matrix, so this code is taking the current 
 scale and assuming that that is equal to the number of pixels. This works 
 when a pattern is painted onto a 0...1 scaled surface, but otherwise it 
 produces incorrect results.
 There's no way to calculate the number of pixels in the device from its 
 scale, or its matrix. Paint#createContext() provides the device bounds 
 Rectangle, which is what we should be using. Indeed, this is handled 
 correctly in the other shading contexts.
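
As a rough illustration of the direction described above (taking the step count from the device bounds rather than from the transform's scale), a minimal sketch could look like the following. The class and method names are hypothetical, and using the diagonal of the bounds is an assumption made for this example; it is not taken from the actual commit.

```java
import java.awt.Rectangle;

// Hypothetical helper, not part of PDFBox: derives the size of a shading
// lookup table from the device bounds handed to Paint#createContext().
final class ShadingSteps {
    private ShadingSteps() {
    }

    // Assumption: the longest line that fits into the device bounds is the
    // diagonal, so one step per pixel along the diagonal is always enough.
    static int stepsFromDeviceBounds(Rectangle deviceBounds) {
        double w = deviceBounds.getWidth();
        double h = deviceBounds.getHeight();
        double diagonal = Math.sqrt(w * w + h * h);
        // never return less than one step, even for empty bounds
        return Math.max(1, (int) Math.ceil(diagonal));
    }

    public static void main(String[] args) {
        System.out.println(stepsFromDeviceBounds(new Rectangle(0, 0, 300, 400)));
    }
}
```

The point of the sketch is only that the result depends on the pixel size of the painted area and not on the current transformation matrix.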



[jira] [Commented] (PDFBOX-2881) Radial and Axial shading steps are calculated incorrectly

2015-07-14 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14627111#comment-14627111
 ] 

ASF subversion and git services commented on PDFBOX-2881:
-

Commit 1691093 from [~jahewson] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1691093 ]

PDFBOX-2881: Calculate the number of steps using the device bounds




RE: first stack trace report from pdfbox 2.0.0 trunk

2015-07-14 Thread Allison, Timothy B.
Interesting, yes: 781/781172.pdf, 490/490376.pdf and 029/029423.pdf.  Are you 
running your own regression testing against govdocs1?  Is it duplicated effort 
for me to do anything with 2.0.0?  Or is your point that I should wait until 
PDFBOX-2842 is completed?

Thank you!

Best,

  Tim
-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de] 
Sent: Tuesday, July 14, 2015 12:47 PM
To: dev@pdfbox.apache.org
Subject: Re: first stack trace report from pdfbox 2.0.0 trunk

Hi Tim,

Currently there is at least one known regression, mentioned in 
PDFBOX-2842, it applies to 029423 but also to other files.

Tilman

Am 10.07.2015 um 13:57 schrieb Allison, Timothy B.:
 All,
I just posted the first stacktrace report from my initial partial batch 
 run against govdocs1 here: 
 https://issues.apache.org/jira/secure/attachment/12744700/pdfbox_reports_2_0_0_20150709.zip

 Caveats/Notes

 The run yesterday did not include the fixes that were made in PDFBOX-2370 or 
 PDFBOX-2862.

 I stopped the batch run early. This only covered ~50k pdfs.

 I forgot to turn on accesspermission checking. Some of the pdfs in here would 
 normally have been skipped.

 I haven't reviewed any of the exceptions. They may be caused by code on the 
 Tika side.

 I'll plan to re-run with the latest trunk on Tuesday.  I need to turn back to 
 the actual eval code for a bit. :)


 Cheers,

Tim





[jira] [Updated] (PDFBOX-2272) Can't extract vertical text correctly

2015-07-14 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-2272:

Attachment: (was: vertical.diff)

 Can't extract vertical text correctly
 -

 Key: PDFBOX-2272
 URL: https://issues.apache.org/jira/browse/PDFBOX-2272
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.8.6, 2.0.0
Reporter: Biligsaikhan Batjargal
 Attachments: PDFTextStripper.java, test.pdf, test.txt, vertical.patch


 - -1.8.6 can't extract the Unicode due to failing to map the UCS2 CMap for 
 90ms-RKSJ-V.-
 - 2.0 extracts the text but can't handle the vertical layout
 Also see the file from PDFBOX-2294 which contains both horizontal and 
 vertical text.



[jira] [Commented] (PDFBOX-2530) Improve PDFDebugger

2015-07-14 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626939#comment-14626939
 ] 

Tilman Hausherr commented on PDFBOX-2530:
-

Why this?
{code}
-JComboBox filters = new JComboBox<String>(availableFilters);
+JComboBox filters = new JComboBox(availableFilters);
{code}
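
For readers following along: the diff above appears to drop the `<String>` type argument from the constructor call (the angle brackets seem to have been lost in this archive). The sketch below only shows what the two forms mean to the compiler. One possible motivation, which is an assumption and not stated in this thread, is source compatibility with Java 6, where `JComboBox` is not yet a generic class. The class and method names are made up for the example.

```java
import javax.swing.JComboBox;

// Hypothetical demo class, not PDFDebugger code.
final class ComboBoxDeclarations {
    private ComboBoxDeclarations() {
    }

    // Raw form: compiles on Java 6 and later, but items come back as Object.
    @SuppressWarnings({"rawtypes", "unchecked"})
    static Object firstItemRaw(String[] items) {
        JComboBox filters = new JComboBox(items);
        return filters.getItemAt(0);
    }

    // Generic form: needs Java 7 or later, items come back as String.
    static String firstItemTyped(String[] items) {
        JComboBox<String> filters = new JComboBox<String>(items);
        return filters.getItemAt(0);
    }

    public static void main(String[] args) {
        String[] availableFilters = {"FlateDecode", "ASCIIHexDecode"};
        System.out.println(firstItemRaw(availableFilters));
        System.out.println(firstItemTyped(availableFilters));
    }
}
```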

 Improve PDFDebugger
 ---

 Key: PDFBOX-2530
 URL: https://issues.apache.org/jira/browse/PDFBOX-2530
 Project: PDFBox
  Issue Type: Improvement
  Components: Utilities
Affects Versions: 1.8.8, 2.0.0
Reporter: Tilman Hausherr
Assignee: khyrul bashar
  Labels: gsoc2015
 Attachments: Avoiding_NPE_for_null_Field_Type.diff, 
 BracketsColorChooser.png, Class_cast_exception_in_page_mode_avoided.diff, 
 DeviceNCS.diff, FlagBitsPane-26-06-2015.diff, 
 Flag_bits_showing_feature-redesigned.diff, Flag_bits_showing_feature.diff, 
 K4SystemFontsNotEmbeded218.pdf, PDFDebugger_StatusBar.png, 
 PDFDebugger_StatusBar_01.png, 
 Parent_dictionary_type_checking_for__f__and__flags.diff, 
 Sonarqube_warning_resolved.diff, Stream_Showing_Feature.diff, indexedcs.diff, 
 openSelectedPath.diff, parent_node_redirect.diff, 
 parent_node_redirect_expand_disabled.diff, removed_redundant_codes.patch, 
 separationCS.diff, sonarqube_warning_resolve.diff, tree.diff, 
 treestatus.diff, treestatuspane.diff


 (This is an idea for the [Google Summer of Code 
 2015|https://www.google-melange.com/])
 Our command line utility PDFDebugger (part of the command line pdfbox-app; get 
 it [here|https://pdfbox.apache.org/downloads.html], read the description 
 [here|https://pdfbox.apache.org/commandline/], see the source code 
 [here|https://svn.apache.org/viewvc/pdfbox/trunk/tools/src/main/java/org/apache/pdfbox/tools/PDFDebugger.java?view=markup&sortby=date])
  needs some improvements:
- hex view
- view of non printable characters
- ✓ saving streams
- binary copy & paste
- ✓ Create a status line that shows where we are in the tree. (Like in the 
 Windows REGEDIT)
- ✓ Copy the current tree string into the clipboard (useful in discussions 
 about details of a PDF)
- ✓ (Optional, not sure if easy) Jump to specific place in the tree by 
 entering tree string
- ✓ ability to search in streams (very useful for content streams and meta 
 data)
- ✓ show images that are streams
- ✓ show PDIndexed color lookup table, show the index value, the base and 
 RGB color value sets when the mouse moves
- ✓ show PDSeparation color
- ✓ show PDDeviceN colors
- optional, idea should be developed a bit: show meaningful explanation on 
 some attributes, e.g. appearance stream when hovering over /AP
- show font encodings and characters
- ✓ display flag bits (e.g. Annotation flags) in a way that is easy to 
 understand. There are probably others, I assume that the main work needs to 
 be done only once
- edit attributes (should be possible to enter values as decimal, hex or 
 binary)
- edit streams, while keeping or changing the compression filter
- save altered PDF 
- color mark of certain PDF operators, especially q...Q and text operators 
 (BT...ET). Ideally, it should help the user understand the bracketing of 
 these operators, i.e. understand where a sequence starts and where it ends. 
 (See operator summary in the PDF Spec) Other important operators I can 
 think of are the matrix, font and color operators. A cool advanced thing 
 would be to show the current color or the font in a popup when hovering above 
 such an operator.
 To see a product with a similar purpose that is better than PDFDebugger, 
 watch [this video|https://www.youtube.com/watch?v=g-QcU9B4qMc].
 I'm not asking to implement a clone of that product (I don't use it, all I 
 know is that video), but we at PDFBox really need something that makes PDF 
 debugging easier. As an example of how the current PDFDebugger prevented me 
 from finding a bug quickly, see PDFBOX-2401 and search for PDFDebugger.
 Prerequisites:
 - java programming, especially the GUI components
 - the ability to understand existing source code
 Using external software components is possible (must have Apache License or a 
 compatible one), but should be decided on a case-by-case basis, we don't want 
 to get too big.
 Development strategy: go from the easy to the difficult. The wished features 
 are already sorted this way (mostly).
 Get introduced: [download the source code with 
 svn|https://pdfbox.apache.org/downloads.html#scm] and build it with maven. 
 Run PDFDebugger and view some PDFs to see the components of a PDF. Start with 
 the file of PDFBOX-2401. Read up something about the structure of PDF on the 
 web or from the [PDF 
 Specification|https://www.adobe.com/devnet/pdf/pdf_reference.html].
 Mentor: Tilman Hausherr (European 
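
To make the "display flag bits" item in the list above a little more concrete, here is a minimal sketch of decoding an annotation /F value into readable names. The bit names follow the annotation flags table of the PDF specification (bit position 1 is the lowest-order bit); the class and method names are hypothetical and this is not the PDFDebugger implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not PDFDebugger code.
final class AnnotationFlagNames {
    // Index 0 corresponds to bit position 1 in the PDF specification.
    private static final String[] NAMES = {
        "Invisible", "Hidden", "Print", "NoZoom", "NoRotate",
        "NoView", "ReadOnly", "Locked", "ToggleNoView", "LockedContents"
    };

    private AnnotationFlagNames() {
    }

    // Returns the names of all flags that are set in the given /F value.
    static List<String> decode(int flags) {
        List<String> set = new ArrayList<String>();
        for (int bit = 0; bit < NAMES.length; bit++) {
            if ((flags & (1 << bit)) != 0) {
                set.add(NAMES[bit]);
            }
        }
        return set;
    }

    public static void main(String[] args) {
        // 4 is the common value for annotations that should be printed
        System.out.println(decode(4));
    }
}
```

A table-driven decoder like this would only need a different name array for other flag fields, which fits the remark that the main work needs to be done only once.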

[jira] [Commented] (PDFBOX-2530) Improve PDFDebugger

2015-07-14 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626954#comment-14626954
 ] 

ASF subversion and git services commented on PDFBOX-2530:
-

Commit 1691068 from [~jahewson] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1691068 ]

PDFBOX-2530: Show filters in menu labels


[jira] [Commented] (PDFBOX-2272) Can't extract vertical text correctly

2015-07-14 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626998#comment-14626998
 ] 

John Hewson commented on PDFBOX-2272:
-

The patch looks too large to me. Why has handleTextPosition been created? It 
seems unnecessary.




[jira] [Commented] (PDFBOX-2530) Improve PDFDebugger

2015-07-14 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626928#comment-14626928
 ] 

Tilman Hausherr commented on PDFBOX-2530:
-

How about checking if there is a mask, and offer both? I.e. with the mask as 
default, and optionally image without mask.

Re printStackTrace - I agree, as long as the exception appears at some time.


[jira] [Commented] (PDFBOX-2272) Can't extract vertical text correctly

2015-07-14 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626889#comment-14626889
 ] 

John Hewson commented on PDFBOX-2272:
-

That's not an SVN patch...




[jira] [Commented] (PDFBOX-2272) Can't extract vertical text correctly

2015-07-14 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626972#comment-14626972
 ] 

John Hewson commented on PDFBOX-2272:
-

No, diff and SVN patch are different formats. They're similar in theory but not 
compatible.




[jira] [Commented] (PDFBOX-2272) Can't extract vertical text correctly

2015-07-14 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626997#comment-14626997
 ] 

John Hewson commented on PDFBOX-2272:
-

That's ok, I already applied it using the command line.




[jira] [Reopened] (PDFBOX-2842) Overhaul font substitution

2015-07-14 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr reopened PDFBOX-2842:
-

reopening, the regression from July 2 on 029423-p1.pdf has been missed :-(

 Overhaul font substitution
 --

 Key: PDFBOX-2842
 URL: https://issues.apache.org/jira/browse/PDFBOX-2842
 Project: PDFBox
  Issue Type: Improvement
  Components: FontBox, PDModel
Affects Versions: 2.0.0
Reporter: John Hewson
Assignee: John Hewson
Priority: Blocker
 Fix For: 2.0.0

 Attachments: 029423-p1.pdf, 166292-fi-ligature.pdf


 The improved font substitution mechanisms in 2.0 are not quite sufficient to 
 handle all PDFs. Specifically, CJK substitution and substitution of TTF in 
 place of CFF fonts is not possible with the current design.
 The CJK problems can be seen in PDFBOX-2509 and PDFBOX-2563, which does not 
 solve the problem. Additional font API weaknesses can be found in PDFBOX-2578 
 and PDFBOX-2366. This meta-issue aims to address all of those sub-issues.
 The current problems are:
 - FontBox does not provide a generic font type, so we have to handle 
 TrueTypeFont, CFFFont, and Type1Font separately. This hinders cross-format 
 substitution.
 - ExternalFonts has no knowledge of the CIDSystemInfo which is necessary for 
 CJK substitution
 - FontProvider contains too much public logic which should be internal to 
 PDFBox, e.g. substitution logic, this makes it brittle and means we won't be 
 able to add additional logic after 2.0 is released, e.g. CJK substitution.
 - Too much confusion about the role of ExternalFonts, particularly with 
 regards to mapping of built-in fonts and the definition of substitute vs. 
 fallback font.
 - ExternalFonts is a black box: the user cannot tell whether the font 
 returned is an exact match, or a last-resort fallback.
 - Confusing font substitution API, users preferred having a flat file format
 - PDSimpleFont#getEncoding() can return null for TTFs which use built-in 
 encodings. This has caused a lot of bugs - there must be a better way.
 - We still have some confusing names, for example a CustomEncoding is known 
 as a built-in encoding in the spec.
 - There is no fallback CFF font, we resort to AdobeBlank instead, which has 
 no rendering.
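
To illustrate the "black box" point in the list above, one conceivable shape for a lookup result that tells the caller whether it received an exact match or a last-resort fallback is sketched here. Everything in it, including the class name `FontLookupResult` and its factory methods, is hypothetical and only meant to make the idea tangible; it is not a proposal for the final API.

```java
// Hypothetical result type, not an existing PDFBox class.
final class FontLookupResult<T> {
    private final T font;
    private final boolean fallback;

    private FontLookupResult(T font, boolean fallback) {
        this.font = font;
        this.fallback = fallback;
    }

    // The requested font was found.
    static <T> FontLookupResult<T> exactMatch(T font) {
        return new FontLookupResult<T>(font, false);
    }

    // Nothing suitable was found; the caller gets a last-resort font.
    static <T> FontLookupResult<T> lastResort(T font) {
        return new FontLookupResult<T>(font, true);
    }

    T getFont() {
        return font;
    }

    boolean isFallback() {
        return fallback;
    }

    public static void main(String[] args) {
        FontLookupResult<String> result = FontLookupResult.lastResort("SomeFallbackFont");
        System.out.println(result.getFont() + " fallback=" + result.isFallback());
    }
}
```

With a wrapper like this, a caller could for instance log a warning only when `isFallback()` is true.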



[jira] [Created] (PDFBOX-2881) Radial and Axial shading steps are calculated incorrectly

2015-07-14 Thread John Hewson (JIRA)
John Hewson created PDFBOX-2881:
---

 Summary: Radial and Axial shading steps are calculated incorrectly
 Key: PDFBOX-2881
 URL: https://issues.apache.org/jira/browse/PDFBOX-2881
 Project: PDFBox
  Issue Type: Bug
  Components: Rendering
Affects Versions: 2.0.0
Reporter: John Hewson
Assignee: John Hewson
 Fix For: 2.0.0


I found a shading bug while writing some code to dump all shadings in a PDF. I 
don't know if this affects PDF rendering within PageDrawer or not.

RadialShadingContext and AxialShadingContext use the following code in their 
constructors to calculate the number of steps (pixels) in the shading and build 
a lookup table for each step:

{code}
// transform the distance to actual pixel space
// use transform, because xform.getScaleX() does not return correct scaling
// on 90° rotated matrix
Point2D point = new Point2D.Double(longestDistance, longestDistance);
matrix.transform(point);
xform.transform(point, point);
factor = (int) Math.max(Math.abs(point.getX()), Math.abs(point.getY()));
colorTable = calcColorTable();
{code}

The matrix is the parent stream's matrix + the pattern matrix, so this code 
is taking the current scale and assuming that that is equal to the number of 
pixels. This works when a pattern is painted onto a 0...1 scaled surface, but 
otherwise it produces incorrect results.

There's no way to calculate the number of pixels in the device from its scale, 
or its matrix. Paint#createContext() provides the device bounds Rectangle, 
which is what we should be using. Indeed, this is handled correctly in the 
other shading contexts.
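
As background to the comment in the quoted code, the following small sketch shows why `getScaleX()` is not usable on a transform that is rotated by 90°, and what transforming a point yields instead. It only illustrates that code comment, not the bug itself; the class and method names are made up for the example.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

// Hypothetical demo class, not PDFBox code.
final class RotatedScaleDemo {
    private RotatedScaleDemo() {
    }

    // A transform that rotates by 90 degrees and then scales by 2.
    static AffineTransform rotatedAndScaled() {
        AffineTransform xform = AffineTransform.getQuadrantRotateInstance(1);
        xform.scale(2, 2);
        return xform;
    }

    // Same approach as the quoted code: transform a point and take the
    // larger absolute coordinate.
    static double transformedDistance(AffineTransform xform, double distance) {
        Point2D point = new Point2D.Double(distance, distance);
        xform.transform(point, point);
        return Math.max(Math.abs(point.getX()), Math.abs(point.getY()));
    }

    public static void main(String[] args) {
        AffineTransform xform = rotatedAndScaled();
        // the scale lives in the shear entries of the matrix, so this is zero
        System.out.println("getScaleX: " + xform.getScaleX());
        System.out.println("via point: " + transformedDistance(xform, 10));
    }
}
```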



[jira] [Updated] (PDFBOX-2881) Radial and Axial shading steps are calculated incorrectly

2015-07-14 Thread John Hewson (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Hewson updated PDFBOX-2881:

Description: 
I found a shading bug while writing some code to dump all shadings in a PDF. I 
don't know if this affects PDF rendering within PageDrawer or not.

RadialShadingContext and AxialShadingContext use the following code in their 
constructors to calculate the number of steps (pixels) in the shading and build 
a lookup table for each step:

{code}
// transform the distance to actual pixel space
// use transform, because xform.getScaleX() does not return correct scaling
// on 90° rotated matrix
Point2D point = new Point2D.Double(longestDistance, longestDistance);
matrix.transform(point);
xform.transform(point, point);
factor = (int) Math.max(Math.abs(point.getX()), Math.abs(point.getY()));
colorTable = calcColorTable();
{code}

The variable factor is the number of steps and matrix is the parent 
stream's matrix + the pattern matrix, so this code is taking the current scale 
and assuming that that is equal to the number of pixels. This works when a 
pattern is painted onto a 0...1 scaled surface, but otherwise it produces 
incorrect results.

There's no way to calculate the number of pixels in the device from its scale, 
or its matrix. Paint#createContext() provides the device bounds Rectangle, 
which is what we should be using. Indeed, this is handled correctly in the 
other shading contexts.

  was:
I found a shading bug while writing some code to dump all shadings in a PDF. I 
don't know if this affects PDF rendering within PageDrawer or not.

RadialShadingContext and AxialShadingContext use the following code in their 
constructors to calculate the number of steps (pixels) in the shading and build 
a lookup table for each step:

{code}
// transform the distance to actual pixel space
// use transform, because xform.getScaleX() does not return correct scaling on
// 90° rotated matrix
Point2D point = new Point2D.Double(longestDistance, longestDistance);
matrix.transform(point);
xform.transform(point, point);
factor = (int) Math.max(Math.abs(point.getX()), Math.abs(point.getY()));
colorTable = calcColorTable();
{code}

The matrix is the parent stream's matrix + the pattern matrix, so this code 
is taking the current scale and assuming that that is equal to the number of 
pixels. This works when a pattern is painted onto a 0...1 scaled surface, but 
otherwise it produces incorrect results.

There's no way to calculate the number of pixels in the device from its scale, 
or its matrix. Paint#createContext() provides the device bounds Rectangle, 
which is what we should be using. Indeed, this is handled correctly in the 
other shading contexts.


 Radial and Axial shading steps are calculated incorrectly
 -

 Key: PDFBOX-2881
 URL: https://issues.apache.org/jira/browse/PDFBOX-2881
 Project: PDFBox
  Issue Type: Bug
  Components: Rendering
Affects Versions: 2.0.0
Reporter: John Hewson
Assignee: John Hewson
 Fix For: 2.0.0


 I found a shading bug while writing some code to dump all shadings in a PDF. 
 I don't know if this affects PDF rendering within PageDrawer or not.
 RadialShadingContext and AxialShadingContext use the following code in their 
 constructors to calculate the number of steps (pixels) in the shading and 
 build a lookup table for each step:
 {code}
 // transform the distance to actual pixel space
 // use transform, because xform.getScaleX() does not return correct scaling 
 // on 90° rotated matrix
 Point2D point = new Point2D.Double(longestDistance, longestDistance);
 matrix.transform(point);
 xform.transform(point, point);
 factor = (int) Math.max(Math.abs(point.getX()), Math.abs(point.getY()));
 colorTable = calcColorTable();
 {code}
 The variable factor is the number of steps and matrix is the parent 
 stream's matrix + the pattern matrix, so this code is taking the current 
 scale and assuming that that is equal to the number of pixels. This works 
 when a pattern is painted onto a 0...1 scaled surface, but otherwise it 
 produces incorrect results.
 There's no way to calculate the number of pixels in the device from its 
 scale, or its matrix. Paint#createContext() provides the device bounds 
 Rectangle, which is what we should be using. Indeed, this is handled 
 correctly in the other shading contexts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Commented] (PDFBOX-2530) Improve PDFDebugger

2015-07-14 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626885#comment-14626885
 ] 

John Hewson commented on PDFBOX-2530:
-

Two minor issues:

- PDImageXObject#getImage() returns the image with the mask applied, which 
means we can't view the raw image. Calling getOpaqueImage() instead would solve 
this.
- Never use e.printStackTrace() :) Use throw new RuntimeException(e) instead. 
That way exceptions won't get lost. It's actually better still to throw early 
and catch late and let the caller handle the IOException.
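
A small sketch of the two options mentioned above, using only the JDK. The class and method names are hypothetical and do not come from PDFDebugger; the example assumes a helper that reads from a stream and may fail with an IOException.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

class ExceptionHandlingSketch
{
    // Preferred: declare the IOException and let the caller decide how to
    // handle it (throw early, catch late).
    static int readFirstByte(InputStream in) throws IOException
    {
        int b = in.read();
        if (b == -1)
        {
            throw new IOException("stream is empty");
        }
        return b;
    }

    // Where the signature cannot declare IOException (for example inside a
    // listener callback), wrap and rethrow instead of calling
    // e.printStackTrace(), so the failure is not silently lost.
    static int readFirstByteUnchecked(InputStream in)
    {
        try
        {
            return readFirstByte(in);
        }
        catch (IOException e)
        {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args)
    {
        System.out.println(readFirstByteUnchecked(new ByteArrayInputStream(new byte[] { 42 })));
        try
        {
            readFirstByteUnchecked(new ByteArrayInputStream(new byte[0]));
        }
        catch (RuntimeException e)
        {
            // The original IOException is still available as the cause.
            System.out.println("caught: " + e.getCause().getMessage());
        }
    }
}
```

The wrapped exception keeps the original IOException as its cause, so a caller higher up can still report or unwrap it.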

 Improve PDFDebugger
 ---

 Key: PDFBOX-2530
 URL: https://issues.apache.org/jira/browse/PDFBOX-2530
 Project: PDFBox
  Issue Type: Improvement
  Components: Utilities
Affects Versions: 1.8.8, 2.0.0
Reporter: Tilman Hausherr
Assignee: khyrul bashar
  Labels: gsoc2015
 Attachments: Avoiding_NPE_for_null_Field_Type.diff, 
 BracketsColorChooser.png, Class_cast_exception_in_page_mode_avoided.diff, 
 DeviceNCS.diff, FlagBitsPane-26-06-2015.diff, 
 Flag_bits_showing_feature-redesigned.diff, Flag_bits_showing_feature.diff, 
 K4SystemFontsNotEmbeded218.pdf, PDFDebugger_StatusBar.png, 
 PDFDebugger_StatusBar_01.png, 
 Parent_dictionary_type_checking_for__f__and__flags.diff, 
 Sonarqube_warning_resolved.diff, Stream_Showing_Feature.diff, indexedcs.diff, 
 openSelectedPath.diff, parent_node_redirect.diff, 
 parent_node_redirect_expand_disabled.diff, removed_redundant_codes.patch, 
 separationCS.diff, sonarqube_warning_resolve.diff, tree.diff, 
 treestatus.diff, treestatuspane.diff


 (This is an idea for the [Google Summer of Code 
 2015|https://www.google-melange.com/])
 Our command line utility PDFDebugger (part of the command line pdfbox-app; get 
 it [here|https://pdfbox.apache.org/downloads.html], read the description 
 [here|https://pdfbox.apache.org/commandline/], see the source code 
 [here|https://svn.apache.org/viewvc/pdfbox/trunk/tools/src/main/java/org/apache/pdfbox/tools/PDFDebugger.java?view=markup&sortby=date])
  needs some improvements:
- hex view
- view of non printable characters
- ✓ saving streams
- binary copy & paste
- ✓ Create a status line that shows where we are in the tree. (Like in the 
 Windows REGEDIT)
- ✓ Copy the current tree string into the clipboard (useful in discussions 
 about details of a PDF)
- ✓ (Optional, not sure if easy) Jump to specific place in the tree by 
 entering tree string
- ✓ ability to search in streams (very useful for content streams and meta 
 data)
- ✓ show images that are streams
- ✓ show PDIndexed color lookup table, show the index value, the base and 
 RGB color value sets when the mouse moves
- ✓ show PDSeparation color
- ✓ show PDDeviceN colors
- optional, idea should be developed a bit: show meaningful explanation on 
 some attributes, e.g. appearance stream when hovering over /AP
- show font encodings and characters
- ✓ display flag bits (e.g. Annotation flags) in a way that is easy to 
 understand. There are probably others, I assume that the main work needs to 
 be done only once
- edit attributes (should be possible to enter values as decimal, hex or 
 binary)
- edit streams, while keeping or changing the compression filter
- save altered PDF 
- color mark of certain PDF operators, especially q...Q and text operators 
 (BT...ET). Ideally, it should help the user understand the bracketing of 
 these operators, i.e. understand where a sequence starts and where it ends. 
 (See operator summary in the PDF Spec) Other important operators I can 
 think of are the matrix, font and color operators. A cool advanced thing 
 would be to show the current color or the font in a popup when hovering above 
 such an operator.
 To see a product with a similar purpose that is better than PDFDebugger, 
 watch [this video|https://www.youtube.com/watch?v=g-QcU9B4qMc].
 I'm not asking to implement a clone of that product (I don't use it, all I 
 know is that video), but we at PDFBox really need something that makes PDF 
 debugging easier. As an example of how the current PDFDebugger prevented me 
 from finding a bug quickly, see PDFBOX-2401 and search for PDFDebugger.
 Prerequisites:
 - java programming, especially the GUI components
 - the ability to understand existing source code
 Using external software components is possible (must have Apache License or a 
 compatible one), but should be decided on a case-by-case basis, we don't want 
 to get too big.
 Development strategy: go from the easy to the difficult. The wished features 
 are already sorted this way (mostly).
 Get introduced: [download the source code with 
 svn|https://pdfbox.apache.org/downloads.html#scm] and build it with maven. 
 Run PDFDebugger and view some PDFs to see the 

[jira] [Closed] (PDFBOX-2839) Missing TextPosition(s) in PDFTextStripper

2015-07-14 Thread Christopher Clark (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christopher Clark closed PDFBOX-2839.
-
Resolution: Not A Problem

Having looked at this further, I see that a 1:1 correspondence between 
characters and text positions was not intended.

 Missing TextPosition(s) in PDFTextStripper
 --

 Key: PDFBOX-2839
 URL: https://issues.apache.org/jira/browse/PDFBOX-2839
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 2.0.0
Reporter: Christopher Clark

 The protected method `writeString` in `PDFTextStripper` can receive more 
 characters than TextPositions. I tracked the problem down to the 
 `normalizeAdd` method where, for multi-character Unicode words, multiple 
 characters can be added to a line while only a single TextPosition object is 
 added to the corresponding list of TextPositions.
 This pdf: https://www.aclweb.org/anthology/W/W13/W13-4011.pdf contains such a 
 character.
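
The effect can be illustrated outside PDFBox with java.text.Normalizer. This is only a sketch of why one positioned glyph can yield several characters; it does not reproduce the `normalizeAdd` method, and the class name LigatureExpansion is hypothetical.

```java
import java.text.Normalizer;

class LigatureExpansion
{
    // Expands compatibility characters such as ligatures into their
    // component letters.
    static String expand(String text)
    {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args)
    {
        // U+FB01 LATIN SMALL LIGATURE FI: one char, drawn as one glyph
        String glyph = "\uFB01";
        String expanded = expand(glyph);

        System.out.println(glyph.length());    // 1
        System.out.println(expanded);          // fi
        System.out.println(expanded.length()); // 2
    }
}
```

One glyph, and so one position, goes in and two characters come out, which is the kind of mismatch between string length and the number of TextPositions described in the report.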



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



Re: first stack trace report from pdfbox 2.0.0 trunk

2015-07-14 Thread Tilman Hausherr

Am 14.07.2015 um 21:37 schrieb Allison, Timothy B.:

Interesting, yes: 781/781172.pdf, 490/490376.pdf and 029/029423.pdf.  Are you 
running your own regression testing against govdocs1?


Yes, from time to time for the last few months.


Is it duplicated effort for me to do anything with 2.0.0?

Partly yes. The only difference is that I didn't do any text extraction.


Or, is your point that I should wait until PDFBOX-2842 is completed?


Yes :-)

Tilman



Thank you!

Best,

   Tim
-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Tuesday, July 14, 2015 12:47 PM
To: dev@pdfbox.apache.org
Subject: Re: first stack trace report from pdfbox 2.0.0 trunk

Hi Tim,

Currently there is at least one known regression, mentioned in
PDFBOX-2842, it applies to 029423 but also to other files.

Tilman

Am 10.07.2015 um 13:57 schrieb Allison, Timothy B.:

All,
I just posted the first stack trace report from my initial partial batch run 
against govdocs1 here: 
https://issues.apache.org/jira/secure/attachment/12744700/pdfbox_reports_2_0_0_20150709.zip

Caveats/Notes

The run yesterday did not include the fixes that were made in PDFBOX-2370 or 
PDFBOX-2862.

I stopped the batch run early. This only covered ~50k pdfs.

I forgot to turn on accesspermission checking. Some of the pdfs in here would 
normally have been skipped.

I haven't reviewed any of the exceptions. They may be caused by code on the 
Tika side.

I'll plan to re-run with the latest trunk on Tuesday.  I need to turn back to 
the actual eval code for a bit. :)


Cheers,

Tim





-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org





[jira] [Comment Edited] (PDFBOX-2272) Can't extract vertical text correctly

2015-07-14 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626998#comment-14626998
 ] 

John Hewson edited comment on PDFBOX-2272 at 7/14/15 8:30 PM:
--

The patch looks too large to me. [~AndreasMeier], why has handleTextPosition 
been created? It seems unnecessary.


was (Author: jahewson):
The patch looks too large to me. Why has handleTextPosition been created? It 
seems unnecessary.

 Can't extract vertical text correctly
 -

 Key: PDFBOX-2272
 URL: https://issues.apache.org/jira/browse/PDFBOX-2272
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.8.6, 2.0.0
Reporter: Biligsaikhan Batjargal
 Attachments: PDFTextStripper.java, test.pdf, test.txt, vertical.patch


 - -1.8.6 can't extract the Unicode due to failing to map the UCS2 CMap for 
 90ms-RKSJ-V.-
 - 2.0 extracts the text but can't handle the vertical layout
 Also see the file from PDFBOX-2294 which contains both horizontal and 
 vertical text.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



Re: first stack trace report from pdfbox 2.0.0 trunk

2015-07-14 Thread John Hewson

 On 14 Jul 2015, at 13:20, Tilman Hausherr thaush...@t-online.de wrote:
 
 Am 14.07.2015 um 21:37 schrieb Allison, Timothy B.:
 Interesting, yes: 781/781172.pdf, 490/490376.pdf and 029/029423.pdf.  Are 
 you running your own regression testing against govdocs1?
 
 Yes, from time to time for the last few months.
 
 Is it duplicated effort for me to do anything with 2.0.0?
 Partly yes. The only difference is that I didn't do any text extraction.
 
 Or, is your point that I should wait until PDFBOX-2842 is completed?
 
 Yes :-)

Good news, PDFBOX-2842 is now complete.

— John

 
 Tilman
 
 
 Thank you!
 
 Best,
 
   Tim
 -Original Message-
 From: Tilman Hausherr [mailto:thaush...@t-online.de]
 Sent: Tuesday, July 14, 2015 12:47 PM
 To: dev@pdfbox.apache.org
 Subject: Re: first stack trace report from pdfbox 2.0.0 trunk
 
 Hi Tim,
 
 Currently there is at least one known regression, mentioned in
 PDFBOX-2842, it applies to 029423 but also to other files.
 
 Tilman
 
 Am 10.07.2015 um 13:57 schrieb Allison, Timothy B.:
 All,
I just posted the first stack trace report from my initial partial batch 
 run against govdocs1 here: 
 https://issues.apache.org/jira/secure/attachment/12744700/pdfbox_reports_2_0_0_20150709.zip
 
 Caveats/Notes
 
 The run yesterday did not include the fixes that were made in PDFBOX-2370 
 or PDFBOX-2862.
 
 I stopped the batch run early. This only covered ~50k pdfs.
 
 I forgot to turn on accesspermission checking. Some of the pdfs in here 
 would normally have been skipped.
 
 I haven't reviewed any of the exceptions. They may be caused by code on the 
 Tika side.
 
 I'll plan to re-run with the latest trunk on Tuesday.  I need to turn back 
 to the actual eval code for a bit. :)
 
 
 Cheers,
 
Tim
 
 
 
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
 For additional commands, e-mail: dev-h...@pdfbox.apache.org
 
 


[jira] [Resolved] (PDFBOX-2842) Overhaul font substitution

2015-07-14 Thread John Hewson (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Hewson resolved PDFBOX-2842.
-
Resolution: Fixed

I'm going to leave this item for a rainy day:

- ExternalFonts is a black box: the user cannot tell whether the font returned 
is an exact match, or a last-resort fallback.

 Overhaul font substitution
 --

 Key: PDFBOX-2842
 URL: https://issues.apache.org/jira/browse/PDFBOX-2842
 Project: PDFBox
  Issue Type: Improvement
  Components: FontBox, PDModel
Affects Versions: 2.0.0
Reporter: John Hewson
Assignee: John Hewson
Priority: Blocker
 Fix For: 2.0.0

 Attachments: 029423-p1.pdf, 166292-fi-ligature.pdf


 The improved font substitution mechanisms in 2.0 are not quite sufficient to 
 handle all PDFs. Specifically, CJK substitution and substitution of TTF in 
 place of CFF fonts is not possible with the current design.
 The CJK problems can be seen in PDFBOX-2509 and PDFBOX-2563, which does not 
 solve the problem. Additional font API weaknesses can be found in PDFBOX-2578 
 and PDFBOX-2366. This meta-issue aims to address all of those sub-issues.
 The current problems are:
 - FontBox does not provide a generic font type, so we have to handle 
 TrueTypeFont, CFFFont, and Type1Font separately. This hinders cross-format 
 substitution.
 - ExternalFonts has no knowledge of the CIDSystemInfo which is necessary for 
 CJK substitution
 - FontProvider contains too much public logic which should be internal to 
 PDFBox, e.g. substitution logic, this makes it brittle and means we won't be 
 able to add additional logic after 2.0 is released, e.g. CJK substitution.
 - Too much confusion about the role of ExternalFonts, particularly with 
 regards to mapping of built-in fonts and the definition of substitute vs. 
 fallback font.
 - ExternalFonts is a black box: the user cannot tell whether the font 
 returned is an exact match, or a last-resort fallback.
 - Confusing font substitution API, users preferred having a flat file format
 - PDSimpleFont#getEncoding() can return null for TTFs which use built-in 
 encodings. This has caused a lot of bugs - there must be a better way.
 - We still have some confusing names, for example a CustomEncoding is known 
 as a built-in encoding in the spec.
 - There is no fallback CFF font, we resort to AdobeBlank instead, which has 
 no rendering.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Commented] (PDFBOX-2530) Improve PDFDebugger

2015-07-14 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627010#comment-14627010
 ] 

ASF subversion and git services commented on PDFBOX-2530:
-

Commit 1691077 from [~jahewson] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1691077 ]

PDFBOX-2530: Comma separation for filter labels

 Improve PDFDebugger
 ---

 Key: PDFBOX-2530
 URL: https://issues.apache.org/jira/browse/PDFBOX-2530
 Project: PDFBox
  Issue Type: Improvement
  Components: Utilities
Affects Versions: 1.8.8, 2.0.0
Reporter: Tilman Hausherr
Assignee: khyrul bashar
  Labels: gsoc2015
 Attachments: Avoiding_NPE_for_null_Field_Type.diff, 
 BracketsColorChooser.png, Class_cast_exception_in_page_mode_avoided.diff, 
 DeviceNCS.diff, FlagBitsPane-26-06-2015.diff, 
 Flag_bits_showing_feature-redesigned.diff, Flag_bits_showing_feature.diff, 
 K4SystemFontsNotEmbeded218.pdf, PDFDebugger_StatusBar.png, 
 PDFDebugger_StatusBar_01.png, 
 Parent_dictionary_type_checking_for__f__and__flags.diff, 
 Sonarqube_warning_resolved.diff, Stream_Showing_Feature.diff, 
 filters-screenshot.png, indexedcs.diff, openSelectedPath.diff, 
 parent_node_redirect.diff, parent_node_redirect_expand_disabled.diff, 
 removed_redundant_codes.patch, separationCS.diff, 
 sonarqube_warning_resolve.diff, tree.diff, treestatus.diff, 
 treestatuspane.diff



[jira] [Commented] (PDFBOX-2530) Improve PDFDebugger

2015-07-14 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627011#comment-14627011
 ] 

John Hewson commented on PDFBOX-2530:
-

Yep, I put the code to do that on the wrong line. Will fix.

 Improve PDFDebugger
 ---

 Key: PDFBOX-2530
 URL: https://issues.apache.org/jira/browse/PDFBOX-2530
 Project: PDFBox
  Issue Type: Improvement
  Components: Utilities
Affects Versions: 1.8.8, 2.0.0
Reporter: Tilman Hausherr
Assignee: khyrul bashar
  Labels: gsoc2015
 Attachments: Avoiding_NPE_for_null_Field_Type.diff, 
 BracketsColorChooser.png, Class_cast_exception_in_page_mode_avoided.diff, 
 DeviceNCS.diff, FlagBitsPane-26-06-2015.diff, 
 Flag_bits_showing_feature-redesigned.diff, Flag_bits_showing_feature.diff, 
 K4SystemFontsNotEmbeded218.pdf, PDFDebugger_StatusBar.png, 
 PDFDebugger_StatusBar_01.png, 
 Parent_dictionary_type_checking_for__f__and__flags.diff, 
 Sonarqube_warning_resolved.diff, Stream_Showing_Feature.diff, 
 filters-screenshot.png, indexedcs.diff, openSelectedPath.diff, 
 parent_node_redirect.diff, parent_node_redirect_expand_disabled.diff, 
 removed_redundant_codes.patch, separationCS.diff, 
 sonarqube_warning_resolve.diff, tree.diff, treestatus.diff, 
 treestatuspane.diff


 (This is an idea for the [Google Summer of Code 
 2015|https://www.google-melange.com/])
 Our command line utility PDFDebugger (part of the command line pdfbox-app; get 
 it [here|https://pdfbox.apache.org/downloads.html], read the description 
 [here|https://pdfbox.apache.org/commandline/], see the source code 
 [here|https://svn.apache.org/viewvc/pdfbox/trunk/tools/src/main/java/org/apache/pdfbox/tools/PDFDebugger.java?view=markup&sortby=date])
  needs some improvements:
- hex view
- view of non printable characters
- ✓ saving streams
- binary copy & paste
- ✓ Create a status line that shows where we are in the tree. (Like in the 
 Windows REGEDIT)
- ✓ Copy the current tree string into the clipboard (useful in discussions 
 about details of a PDF)
- ✓ (Optional, not sure if easy) Jump to specific place in the tree by 
 entering tree string
- ✓ ability to search in streams (very useful for content streams and meta 
 data)
- ✓ show images that are streams
- ✓ show PDIndexed color lookup table, show the index value, the base and 
 RGB color value sets when the mouse moves
- ✓ show PDSeparation color
- ✓ show PDDeviceN colors
- optional, idea should be developed a bit: show meaningful explanation on 
 some attributes, e.g. appearance stream when hovering over /AP
- show font encodings and characters
- ✓ display flag bits (e.g. Annotation flags) in a way that is easy to 
 understand. There are probably others, I assume that the main work needs to 
 be done only once
- edit attributes (should be possible to enter values as decimal, hex or 
 binary)
- edit streams, while keeping or changing the compression filter
- save altered PDF 
- color mark of certain PDF operators, especially q...Q and text operators 
 (BT...ET). Ideally, it should help the user understand the bracketing of 
 these operators, i.e. understand where a sequence starts and where it ends. 
 (See operator summary in the PDF Spec) Other important operators I can 
 think of are the matrix, font and color operators. A cool advanced thing 
 would be to show the current color or the font in a popup when hovering above 
 such an operator.
 To see a product with a similar purpose that is better than PDFDebugger, 
 watch [this video|https://www.youtube.com/watch?v=g-QcU9B4qMc].
 I'm not asking to implement a clone of that product (I don't use it, all I 
 know is that video), but we at PDFBox really need something that makes PDF 
 debugging easier. As an example of how the current PDFDebugger prevented me 
 from finding a bug quickly, see PDFBOX-2401 and search for PDFDebugger.
 Prerequisites:
 - java programming, especially the GUI components
 - the ability to understand existing source code
 Using external software components is possible (must have Apache License or a 
 compatible one), but should be decided on a case-by-case basis, we don't want 
 to get too big.
 Development strategy: go from the easy to the difficult. The wished features 
 are already sorted this way (mostly).
 Get introduced: [download the source code with 
 svn|https://pdfbox.apache.org/downloads.html#scm] and build it with maven. 
 Run PDFDebugger and view some PDFs to see the components of a PDF. Start with 
 the file of PDFBOX-2401. Read up something about the structure of PDF on the 
 web or from the [PDF 
 Specification|https://www.adobe.com/devnet/pdf/pdf_reference.html].
 Mentor: Tilman Hausherr (European timezone, languages: german, english, 
 french). To see the GSoC2014 project 

[jira] [Commented] (PDFBOX-2272) Can't extract vertical text correctly

2015-07-14 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626904#comment-14626904
 ] 

Tilman Hausherr commented on PDFBOX-2272:
-

The only difference is that the relative path is missing. I can't do that 
because I have changes elsewhere.

 Can't extract vertical text correctly
 -

 Key: PDFBOX-2272
 URL: https://issues.apache.org/jira/browse/PDFBOX-2272
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.8.6, 2.0.0
Reporter: Biligsaikhan Batjargal
 Attachments: PDFTextStripper.java, test.pdf, test.txt, vertical.diff


 - -1.8.6 can't extract the Unicode due to failing to map the UCS2 CMap for 
 90ms-RKSJ-V.-
 - 2.0 extracts the text but can't handle the vertical layout
 Also see the file from PDFBOX-2294 which contains both horizontal and 
 vertical text.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Commented] (PDFBOX-2530) Improve PDFDebugger

2015-07-14 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626964#comment-14626964
 ] 

John Hewson commented on PDFBOX-2530:
-

Yes, adding Image + Mask as the default item in the drop down menu would work 
nicely.

 Improve PDFDebugger
 ---

 Key: PDFBOX-2530
 URL: https://issues.apache.org/jira/browse/PDFBOX-2530
 Project: PDFBox
  Issue Type: Improvement
  Components: Utilities
Affects Versions: 1.8.8, 2.0.0
Reporter: Tilman Hausherr
Assignee: khyrul bashar
  Labels: gsoc2015
 Attachments: Avoiding_NPE_for_null_Field_Type.diff, 
 BracketsColorChooser.png, Class_cast_exception_in_page_mode_avoided.diff, 
 DeviceNCS.diff, FlagBitsPane-26-06-2015.diff, 
 Flag_bits_showing_feature-redesigned.diff, Flag_bits_showing_feature.diff, 
 K4SystemFontsNotEmbeded218.pdf, PDFDebugger_StatusBar.png, 
 PDFDebugger_StatusBar_01.png, 
 Parent_dictionary_type_checking_for__f__and__flags.diff, 
 Sonarqube_warning_resolved.diff, Stream_Showing_Feature.diff, indexedcs.diff, 
 openSelectedPath.diff, parent_node_redirect.diff, 
 parent_node_redirect_expand_disabled.diff, removed_redundant_codes.patch, 
 separationCS.diff, sonarqube_warning_resolve.diff, tree.diff, 
 treestatus.diff, treestatuspane.diff


 (This is an idea for the [Google Summer of Code 
 2015|https://www.google-melange.com/])
 Our command line utility PDFDebugger (part of the command line pdfbox-app get 
 it [here|https://pdfbox.apache.org/downloads.html], read description 
 [here|https://pdfbox.apache.org/commandline/], see the source code 
 [here|https://svn.apache.org/viewvc/pdfbox/trunk/tools/src/main/java/org/apache/pdfbox/tools/PDFDebugger.java?view=markupsortby=date])
  needs some improvements:
- hex view
- view of non printable characters
- ✓ saving streams
- binary copy  paste
- ✓ Create a status line that shows where we are in the tree. (Like in the 
 Windows REGEDIT)
- ✓ Copy the current tree string into the clipboard (useful in discussions 
 about details of a PDF)
- ✓ (Optional, not sure if easy) Jump to specific place in the tree by 
 entering tree string
- ✓ ability to search in streams (very useful for content streams and meta 
 data)
- ✓ show images that are streams
- ✓ show PDIndexed color lookup table, show the index value, the base and 
 RGB color value sets when the mouse moves
- ✓ show PDSeparation color
- ✓ show PDDeviceN colors
- optional, idea should be developed a bit: show meaningful explanation on 
 some attributes, e.g. appearance stream when hovering over /AP
- show font encodings and characters
- ✓ display flag bits (e.g. Annotation flags) in a way that is easy to 
 understand. There are probably others, I assume that the main work needs to 
 be done only once
- edit attributes (should be possible to enter values as decimal, hex or 
 binary)
- edit streams, while keeping or changing the compression filter
- save altered PDF 
- color mark of certain PDF operators, especially Q...q and text operators 
 (BT...ET). Ideally, it should help the user understand the bracketing of 
 these operators, i.e. understand where a sequence starts and where it ends. 
 (See operator summary in the PDF Spec) Other important operators I can 
 think of are the matrix, font and color operators. A cool advanced thing 
 would be to show the current color or the font in a popup when hovering above 
 such an operator.
 To see a product with a similar purpose that is better than PDFDebugger, 
 watch [this video|https://www.youtube.com/watch?v=g-QcU9B4qMc].
 I'm not asking to implement a clone of that product (I don't use it, all I 
 know is that video), but we at PDFBox really need something that makes PDF 
 debugging easier. As an example of how the current PDFDebugger prevented me 
 from finding a bug quickly, see PDFBOX-2401 and search for PDFDebugger.
 Prerequisites:
 - Java programming, especially the GUI components
 - the ability to understand existing source code
 Using external software components is possible (must have Apache License or a 
 compatible one), but should be decided on a case-by-case basis, we don't want 
 to get too big.
 Development strategy: go from the easy to the difficult. The requested features 
 are already sorted this way (mostly).
 Get introduced: [download the source code with 
 svn|https://pdfbox.apache.org/downloads.html#scm] and build it with maven. 
 Run PDFDebugger and view some PDFs to see the components of a PDF. Start with 
 the file of PDFBOX-2401. Read up something about the structure of PDF on the 
 web or from the [PDF 
 Specification|https://www.adobe.com/devnet/pdf/pdf_reference.html].
 Mentor: Tilman Hausherr (European timezone, languages: German, English, 
 French). To see the GSoC2014 project 

[jira] [Updated] (PDFBOX-2272) Can't extract vertical text correctly

2015-07-14 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-2272:

Attachment: vertical.patch

ok, here's the same as a .patch (hopefully).

 Can't extract vertical text correctly
 -

 Key: PDFBOX-2272
 URL: https://issues.apache.org/jira/browse/PDFBOX-2272
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.8.6, 2.0.0
Reporter: Biligsaikhan Batjargal
 Attachments: PDFTextStripper.java, test.pdf, test.txt, vertical.patch


 - -1.8.6 can't extract the Unicode due to failing to map the UCS2 CMap for 
 90ms-RKSJ-V.-
 - 2.0 extracts the text but can't handle the vertical layout
 Also see the file from PDFBOX-2294 which contains both horizontal and 
 vertical text.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Updated] (PDFBOX-2530) Improve PDFDebugger

2015-07-14 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-2530:

Attachment: filters-screenshot.png

[~jahewson] please add a ,  or whatever... this is what people see with the 
file of PDFBOX-2215-027073.pdf.

 Improve PDFDebugger
 ---

 Key: PDFBOX-2530
 URL: https://issues.apache.org/jira/browse/PDFBOX-2530
 Project: PDFBox
  Issue Type: Improvement
  Components: Utilities
Affects Versions: 1.8.8, 2.0.0
Reporter: Tilman Hausherr
Assignee: khyrul bashar
  Labels: gsoc2015
 Attachments: Avoiding_NPE_for_null_Field_Type.diff, 
 BracketsColorChooser.png, Class_cast_exception_in_page_mode_avoided.diff, 
 DeviceNCS.diff, FlagBitsPane-26-06-2015.diff, 
 Flag_bits_showing_feature-redesigned.diff, Flag_bits_showing_feature.diff, 
 K4SystemFontsNotEmbeded218.pdf, PDFDebugger_StatusBar.png, 
 PDFDebugger_StatusBar_01.png, 
 Parent_dictionary_type_checking_for__f__and__flags.diff, 
 Sonarqube_warning_resolved.diff, Stream_Showing_Feature.diff, 
 filters-screenshot.png, indexedcs.diff, openSelectedPath.diff, 
 parent_node_redirect.diff, parent_node_redirect_expand_disabled.diff, 
 removed_redundant_codes.patch, separationCS.diff, 
 sonarqube_warning_resolve.diff, tree.diff, treestatus.diff, 
 treestatuspane.diff



[jira] [Comment Edited] (PDFBOX-2842) Overhaul font substitution

2015-07-14 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627002#comment-14627002
 ] 

John Hewson edited comment on PDFBOX-2842 at 7/14/15 8:35 PM:
--

I'm going to leave this item for a rainy day:

- Confusing font substitution API, users preferred having a flat file format.


was (Author: jahewson):
I'm going to leave this item for a rainy day:

- ExternalFonts is a black box: the user cannot tell whether the font returned 
is an exact match, or a last-resort fallback.

 Overhaul font substitution
 --

 Key: PDFBOX-2842
 URL: https://issues.apache.org/jira/browse/PDFBOX-2842
 Project: PDFBox
  Issue Type: Improvement
  Components: FontBox, PDModel
Affects Versions: 2.0.0
Reporter: John Hewson
Assignee: John Hewson
Priority: Blocker
 Fix For: 2.0.0

 Attachments: 029423-p1.pdf, 166292-fi-ligature.pdf


 The improved font substitution mechanisms in 2.0 are not quite sufficient to 
 handle all PDFs. Specifically, CJK substitution and substitution of TTF in 
 place of CFF fonts is not possible with the current design.
 The CJK problems can be seen in PDFBOX-2509 and PDFBOX-2563, which does not 
 solve the problem. Additional font API weaknesses can be found in PDFBOX-2578 
 and PDFBOX-2366. This meta-issue aims to address all of those sub-issues.
 The current problems are:
 - FontBox does not provide a generic font type, so we have to handle 
 TrueTypeFont, CFFFont, and Type1Font separately. This hinders cross-format 
 substitution.
 - ExternalFonts has no knowledge of the CIDSystemInfo which is necessary for 
 CJK substitution
 - FontProvider contains too much public logic which should be internal to 
 PDFBox, e.g. substitution logic, this makes it brittle and means we won't be 
 able to add additional logic after 2.0 is released, e.g. CJK substitution.
 - Too much confusion about the role of ExternalFonts, particularly with 
 regards to mapping of built-in fonts and the definition of substitute vs. 
 fallback font.
 - ExternalFonts is a black box: the user cannot tell whether the font 
 returned is an exact match, or a last-resort fallback.
 - Confusing font substitution API, users preferred having a flat file format
 - PDSimpleFont#getEncoding() can return null for TTFs which use built-in 
 encodings. This has caused a lot of bugs - there must be a better way.
 - We still have some confusing names, for example a CustomEncoding is known 
 as a built-in encoding in the spec.
 - There is no fallback CFF font, we resort to AdobeBlank instead, which has 
 no rendering.
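 To illustrate the "black box" point in the list above: a lookup could return a small result object that carries the font together with a flag saying whether it is an exact match or a last-resort fallback. This is only a sketch under stated assumptions. All class and method names here are hypothetical, fonts are represented by plain strings, and it is not the API that was committed for this issue.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a font lookup that reports how the font was found. */
final class FontLookupSketch {

    /** Lookup result: the font plus whether it is only a last-resort fallback. */
    static final class Result {
        private final String font;
        private final boolean fallback;

        Result(String font, boolean fallback) {
            this.font = font;
            this.fallback = fallback;
        }

        String getFont() {
            return font;
        }

        boolean isFallback() {
            return fallback;
        }
    }

    private final Map<String, String> fontsByName = new HashMap<>();
    private final String lastResortFont;

    FontLookupSketch(String lastResortFont) {
        this.lastResortFont = lastResortFont;
    }

    void register(String postScriptName, String font) {
        fontsByName.put(postScriptName, font);
    }

    /** Never returns null: either an exact match or the last-resort font, flagged as such. */
    Result find(String postScriptName) {
        String exact = fontsByName.get(postScriptName);
        if (exact != null) {
            return new Result(exact, false);
        }
        return new Result(lastResortFont, true);
    }

    public static void main(String[] args) {
        FontLookupSketch fonts = new FontLookupSketch("LastResort.ttf");
        fonts.register("Helvetica", "Helvetica.ttf");
        Result hit = fonts.find("Helvetica");
        Result miss = fonts.find("MS-Mincho");
        System.out.println(hit.getFont() + " fallback=" + hit.isFallback());
        System.out.println(miss.getFont() + " fallback=" + miss.isFallback());
    }
}
```

 With a result like this, a caller can decide for itself whether a fallback is acceptable (e.g. for rendering) or not (e.g. for text extraction).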






Re: first stack trace report from pdfbox 2.0.0 trunk

2015-07-14 Thread Tilman Hausherr

On 14.07.2015 at 22:35, John Hewson wrote:

On 14 Jul 2015, at 13:20, Tilman Hausherr thaush...@t-online.de wrote:

On 14.07.2015 at 21:37, Allison, Timothy B. wrote:

Interesting, yes: 781/781172.pdf, 490/490376.pdf and 029/029423.pdf.  Are you 
running your own regression testing against govdocs1?

Yes, from time to time for the last few months.


Is it duplicated effort for me to do anything with 2.0.0?

Partly yes. The only difference is that I didn't do any text extraction.


Or is your point that I should wait until PDFBOX-2842 is completed?

Yes :-)

Good news, PDFBOX-2842 is now complete.


No, the 029423 file is still throwing an exception :-(

Tilman




— John


Tilman


Thank you!

Best,

   Tim
-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Tuesday, July 14, 2015 12:47 PM
To: dev@pdfbox.apache.org
Subject: Re: first stack trace report from pdfbox 2.0.0 trunk

Hi Tim,

Currently there is at least one known regression, mentioned in
PDFBOX-2842, it applies to 029423 but also to other files.

Tilman

On 10.07.2015 at 13:57, Allison, Timothy B. wrote:

All,
I just posted the first stacktrace report from my initial partial batch run 
against govdocs1 here: 
https://issues.apache.org/jira/secure/attachment/12744700/pdfbox_reports_2_0_0_20150709.zip

Caveats/Notes

The run yesterday did not include the fixes that were made in PDFBOX-2370 or 
PDFBOX-2862.

I stopped the batch run early. This only covered ~50k pdfs.

I forgot to turn on accesspermission checking. Some of the pdfs in here would 
normally have been skipped.

I haven't reviewed any of the exceptions. They may be caused by code on the 
Tika side.

I'll plan to re-run with the latest trunk on Tuesday.  I need to turn back to 
the actual eval code for a bit. :)









[jira] [Commented] (PDFBOX-2842) Overhaul font substitution

2015-07-14 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627346#comment-14627346
 ] 

ASF subversion and git services commented on PDFBOX-2842:
-

Commit 1691119 from [~jahewson] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1691119 ]

PDFBOX-2842: Non-symbolic TTFs use StandardEncoding as their built-in

 Overhaul font substitution
 --

 Key: PDFBOX-2842
 URL: https://issues.apache.org/jira/browse/PDFBOX-2842
 Project: PDFBox
  Issue Type: Improvement
  Components: FontBox, PDModel
Affects Versions: 2.0.0
Reporter: John Hewson
Assignee: John Hewson
Priority: Blocker
 Fix For: 2.0.0

 Attachments: 029423-p1.pdf, 166292-fi-ligature.pdf





Re: first stack trace report from pdfbox 2.0.0 trunk

2015-07-14 Thread John Hewson

 On 14 Jul 2015, at 13:49, Tilman Hausherr thaush...@t-online.de wrote:
 
 On 14.07.2015 at 22:35, John Hewson wrote:
 On 14 Jul 2015, at 13:20, Tilman Hausherr thaush...@t-online.de wrote:
 
 On 14.07.2015 at 21:37, Allison, Timothy B. wrote:
 Interesting, yes: 781/781172.pdf, 490/490376.pdf and 029/029423.pdf.  Are 
 you running your own regression testing against govdocs1?
 Yes, from time to time for the last few months.
 
 Is it duplicated effort for me to do anything with 2.0.0?
 Partly yes. The only difference is that I didn't do any text extraction.
 
 Or is your point that I should wait until PDFBOX-2842 is completed?
 Yes :-)
 Good news, PDFBOX-2842 is now complete.
 
 No, the 029423 file is still throwing an exception :-(
 

Ok, I’ve just fixed this, hopefully it works.

— John




[jira] [Resolved] (PDFBOX-2842) Overhaul font substitution

2015-07-14 Thread John Hewson (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Hewson resolved PDFBOX-2842.
-
Resolution: Fixed

 Overhaul font substitution
 --

 Key: PDFBOX-2842
 URL: https://issues.apache.org/jira/browse/PDFBOX-2842
 Project: PDFBox
  Issue Type: Improvement
  Components: FontBox, PDModel
Affects Versions: 2.0.0
Reporter: John Hewson
Assignee: John Hewson
Priority: Blocker
 Fix For: 2.0.0

 Attachments: 029423-p1.pdf, 166292-fi-ligature.pdf





[jira] [Created] (PDFBOX-2882) Improve performance when using scratch file

2015-07-14 Thread Timo Boehme (JIRA)
Timo Boehme created PDFBOX-2882:
---

 Summary: Improve performance when using scratch file
 Key: PDFBOX-2882
 URL: https://issues.apache.org/jira/browse/PDFBOX-2882
 Project: PDFBox
  Issue Type: Improvement
  Components: Parsing
Affects Versions: 2.0.0
Reporter: Timo Boehme
Priority: Minor


The current scratch file implementation makes many direct I/O calls, which 
slows down parsing considerably compared with the in-memory scratch buffer.






[jira] [Commented] (PDFBOX-2842) Overhaul font substitution

2015-07-14 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627288#comment-14627288
 ] 

John Hewson commented on PDFBOX-2842:
-

Thanks, yes I missed that one.

 Overhaul font substitution
 --

 Key: PDFBOX-2842
 URL: https://issues.apache.org/jira/browse/PDFBOX-2842
 Project: PDFBox
  Issue Type: Improvement
  Components: FontBox, PDModel
Affects Versions: 2.0.0
Reporter: John Hewson
Assignee: John Hewson
Priority: Blocker
 Fix For: 2.0.0

 Attachments: 029423-p1.pdf, 166292-fi-ligature.pdf





[jira] [Commented] (PDFBOX-2842) Overhaul font substitution

2015-07-14 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627287#comment-14627287
 ] 

ASF subversion and git services commented on PDFBOX-2842:
-

Commit 1691110 from [~jahewson] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1691110 ]

PDFBOX-2842: Re-build stale font cache

 Overhaul font substitution
 --

 Key: PDFBOX-2842
 URL: https://issues.apache.org/jira/browse/PDFBOX-2842
 Project: PDFBox
  Issue Type: Improvement
  Components: FontBox, PDModel
Affects Versions: 2.0.0
Reporter: John Hewson
Assignee: John Hewson
Priority: Blocker
 Fix For: 2.0.0

 Attachments: 029423-p1.pdf, 166292-fi-ligature.pdf





[jira] [Updated] (PDFBOX-2882) Improve performance when using scratch file

2015-07-14 Thread Timo Boehme (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timo Boehme updated PDFBOX-2882:

Attachment: ScratchFileBuffer.java
ScratchFile.java

Drop-in replacement for the classes in the org.apache.pdfbox.io package. It keeps the 
single scratch file approach and the paging, but uses a direct page index instead of 
links between pages. Additionally, pages can be re-used once buffers are closed. 
File I/O is only needed to read or write whole pages.
In a small test loading the PDF reference file, this implementation reduced the 
time needed by a factor of 2.
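
The idea described above (one scratch file, fixed-size pages addressed by index, freed pages handed out again, I/O only in whole pages) can be sketched roughly as follows. This is not the code of the attached classes: the class name, the page size and the method names are assumptions made purely for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch of a page-indexed scratch file with page re-use. */
final class PagedScratchSketch implements AutoCloseable {
    static final int PAGE_SIZE = 4096; // assumed page size

    private final File file;
    private final RandomAccessFile raf;
    private final Deque<Integer> freePages = new ArrayDeque<>();
    private int pageCount;

    PagedScratchSketch() {
        try {
            file = File.createTempFile("scratch", ".tmp");
            raf = new RandomAccessFile(file, "rw");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Hands out a released page if there is one, otherwise a new page index. */
    int allocatePage() {
        Integer reused = freePages.poll();
        return reused != null ? reused : pageCount++;
    }

    /** Marks a page as free, e.g. when the buffer that owned it is closed. */
    void releasePage(int index) {
        freePages.push(index);
    }

    /** Writes one whole page; the file offset follows directly from the index. */
    void writePage(int index, byte[] page) {
        if (page.length != PAGE_SIZE) {
            throw new IllegalArgumentException("page must be " + PAGE_SIZE + " bytes");
        }
        try {
            raf.seek((long) index * PAGE_SIZE);
            raf.write(page);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads one whole page; the page must have been written before. */
    byte[] readPage(int index) {
        byte[] page = new byte[PAGE_SIZE];
        try {
            raf.seek((long) index * PAGE_SIZE);
            raf.readFully(page);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return page;
    }

    int getPageCount() {
        return pageCount;
    }

    @Override
    public void close() {
        try {
            raf.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        file.delete();
    }

    public static void main(String[] args) {
        try (PagedScratchSketch scratch = new PagedScratchSketch()) {
            int first = scratch.allocatePage();
            byte[] data = new byte[PAGE_SIZE];
            data[0] = 42;
            scratch.writePage(first, data);
            System.out.println(scratch.readPage(first)[0]);
            scratch.releasePage(first);
            System.out.println(scratch.allocatePage() == first);
        }
    }
}
```

A buffer built on top of this would keep its own list of page indices and an in-memory copy of the current page, so that single-byte reads and writes never touch the file directly.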

 Improve performance when using scratch file
 ---

 Key: PDFBOX-2882
 URL: https://issues.apache.org/jira/browse/PDFBOX-2882
 Project: PDFBox
  Issue Type: Improvement
  Components: Parsing
Affects Versions: 2.0.0
Reporter: Timo Boehme
Priority: Minor
 Attachments: ScratchFile.java, ScratchFileBuffer.java





[jira] [Commented] (PDFBOX-2881) Radial and Axial shading steps are calculated incorrectly

2015-07-14 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627568#comment-14627568
 ] 

Tilman Hausherr commented on PDFBOX-2881:
-

Can you name a file that was rendering incorrectly? 

 Radial and Axial shading steps are calculated incorrectly
 -

 Key: PDFBOX-2881
 URL: https://issues.apache.org/jira/browse/PDFBOX-2881
 Project: PDFBox
  Issue Type: Bug
  Components: Rendering
Affects Versions: 2.0.0
Reporter: John Hewson
Assignee: John Hewson
 Fix For: 2.0.0


 I found a shading bug while writing some code to dump all shadings in a PDF. 
 I don't know if this affects PDF rendering within PageDrawer or not.
 RadialShadingContext and AxialShadingContext use the following code in their 
 constructors to calculate the number of steps (pixels) in the shading and 
 build a lookup table for each step:
 {code}
 // transform the distance to actual pixel space
 // use transform, because xform.getScaleX() does not return correct scaling
 // on 90° rotated matrix
 Point2D point = new Point2D.Double(longestDistance, longestDistance);
 matrix.transform(point);
 xform.transform(point, point);
 factor = (int) Math.max(Math.abs(point.getX()), Math.abs(point.getY()));
 colorTable = calcColorTable();
 {code}
 The variable factor is the number of steps and matrix is the parent 
 stream's matrix + the pattern matrix, so this code is taking the current 
 scale and assuming that that is equal to the number of pixels. This works 
 when a pattern is painted onto a 0...1 scaled surface, but otherwise it 
 produces incorrect results.
 There's no way to calculate the number of pixels in the device from its 
 scale, or its matrix. Paint#createContext() provides the device bounds 
 Rectangle, which is what we should be using. Indeed, this is handled 
 correctly in the other shading contexts.
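 As a rough illustration of the suggestion above, the step count could be derived from the device bounds Rectangle that Paint#createContext() receives, rather than from the scale of a matrix. The helper below is a hypothetical sketch, not the committed fix: it takes the longer side of the device bounds as the number of steps, mirroring the max(|x|, |y|) choice in the quoted code, and that choice is an assumption.

```java
import java.awt.Rectangle;

/** Hypothetical helper: number of shading steps taken from the device bounds in pixels. */
final class ShadingSteps {

    /**
     * Uses the longer side of the painted device area as the step count,
     * so the lookup table has one entry per pixel along that side.
     * Always returns at least 1 so the table is never empty.
     */
    static int fromDeviceBounds(Rectangle deviceBounds) {
        return Math.max(1, Math.max(deviceBounds.width, deviceBounds.height));
    }

    public static void main(String[] args) {
        // a 300 x 400 pixel area yields 400 steps, regardless of any scaling in the matrices
        System.out.println(fromDeviceBounds(new Rectangle(0, 0, 300, 400)));
    }
}
```

 In a shading context the returned value would take the place of factor before the color table is built.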






[jira] [Updated] (PDFBOX-2845) Error parsing PDF

2015-07-14 Thread Christopher Clark (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christopher Clark updated PDFBOX-2845:
--
Fix Version/s: 2.0.0

 Error parsing PDF
 -

 Key: PDFBOX-2845
 URL: https://issues.apache.org/jira/browse/PDFBOX-2845
 Project: PDFBox
  Issue Type: Bug
  Components: Parsing
Affects Versions: 2.0.0
Reporter: Christopher Clark
 Fix For: 2.0.0


 I get the following error when parsing this pdf:  
 http://jmlr.csail.mit.edu/proceedings/papers/v28/ranganath13.pdf
 java.io.IOException: Object must be defined and must not be compressed object: 554:0
 Stack trace:
 Exception in thread "main" java.io.IOException: Object must be defined and must not be compressed object: 554:0
 at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:682)
 at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:646)
 at org.apache.pdfbox.pdfparser.COSParser.getLength(COSParser.java:847)
 at org.apache.pdfbox.pdfparser.COSParser.parseCOSStream(COSParser.java:906)
 at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:732)
 at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:693)
 at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:646)
 at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:607)
 at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:198)
 at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:225)
 at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:848)
 at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:793)
 at org.apache.pdfbox.tools.ExtractText.startExtraction(ExtractText.java:192)
 at org.apache.pdfbox.tools.ExtractText.main(ExtractText.java:81)
 at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:55)
 Note this problem does not occur in 1.8.9






[jira] [Assigned] (PDFBOX-2882) Improve performance when using scratch file

2015-07-14 Thread Timo Boehme (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timo Boehme reassigned PDFBOX-2882:
---

Assignee: Timo Boehme

 Improve performance when using scratch file
 ---

 Key: PDFBOX-2882
 URL: https://issues.apache.org/jira/browse/PDFBOX-2882
 Project: PDFBox
  Issue Type: Improvement
  Components: Parsing
Affects Versions: 2.0.0
Reporter: Timo Boehme
Assignee: Timo Boehme
Priority: Minor
 Attachments: ScratchFile.java, ScratchFileBuffer.java


 The current scratch file implementation uses many direct I/O calls which 
 slows down parsing compared with in-memory scratch buffer considerably.






[jira] [Commented] (PDFBOX-2272) Can't extract vertical text correctly

2015-07-14 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626831#comment-14626831
 ] 

John Hewson commented on PDFBOX-2272:
-

https://ariejan.net/2007/07/03/how-to-create-and-apply-a-patch-with-subversion/

 Can't extract vertical text correctly
 -

 Key: PDFBOX-2272
 URL: https://issues.apache.org/jira/browse/PDFBOX-2272
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.8.6, 2.0.0
Reporter: Biligsaikhan Batjargal
 Attachments: PDFTextStripper.java, test.pdf, test.txt


 - -1.8.6 can't extract the Unicode due to failing to map the UCS2 CMap for 
 90ms-RKSJ-V.-
 - 2.0 extracts the text but can't handle the vertical layout
 Also see the file from PDFBOX-2294 which contains both horizontal and 
 vertical text.






[jira] [Commented] (PDFBOX-2871) Performance issue when filling the first PDTextField of an AcroForm

2015-07-14 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626837#comment-14626837
 ] 

John Hewson commented on PDFBOX-2871:
-

I'm not really worried about speeding up font parsing: we have the on-disk 
cache now, so it's an only-once-ever event. Scanning for files on the local 
system is already fast. What's still relatively slow about the new cache is 
that it uses Java's serialization; a custom serialization format could be much 
faster. I'm also not sure about the speed of the Preferences API; benchmarking 
is needed.
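
As a rough sketch of what a custom serialization format could look like, the following writes cache entries with DataOutputStream instead of ObjectOutputStream. The class FontCacheCodec, the Entry fields and the version number are hypothetical and do not reflect PDFBox's actual font cache; whether this is faster in practice would still need benchmarking.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical hand-rolled binary format for a list of font cache entries. */
public class FontCacheCodec {
    private static final int VERSION = 1; // assumed format version

    /** Illustrative cache entry; the fields are assumptions, not PDFBox's. */
    public static final class Entry {
        public final String postScriptName;
        public final String path;
        public final int weight;

        public Entry(String postScriptName, String path, int weight) {
            this.postScriptName = postScriptName;
            this.path = path;
            this.weight = weight;
        }
    }

    public static byte[] encode(List<Entry> entries) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(VERSION);
            out.writeInt(entries.size());
            for (Entry e : entries) {
                // Fixed field order, no class metadata or reflection involved.
                out.writeUTF(e.postScriptName);
                out.writeUTF(e.path);
                out.writeInt(e.weight);
            }
        }
        return bytes.toByteArray();
    }

    public static List<Entry> decode(byte[] data) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int version = in.readInt();
            if (version != VERSION) {
                throw new IOException("Unsupported cache version: " + version);
            }
            int count = in.readInt();
            List<Entry> entries = new ArrayList<>(count);
            for (int i = 0; i < count; i++) {
                entries.add(new Entry(in.readUTF(), in.readUTF(), in.readInt()));
            }
            return entries;
        }
    }
}
```

The explicit version field lets a reader reject a cache written by an incompatible format and rebuild it, which Java serialization otherwise handles through serialVersionUID.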

 Performance issue when filling the first PDTextField of an AcroForm
 ---

 Key: PDFBOX-2871
 URL: https://issues.apache.org/jira/browse/PDFBOX-2871
 Project: PDFBox
  Issue Type: Bug
  Components: AcroForm
Affects Versions: 2.0.0
Reporter: Maruan Sahyoun
Assignee: John Hewson
Priority: Critical
  Labels: Appearance
 Fix For: 2.0.0

 Attachments: PDTextField.pdf, ProfilingOutput.png


 When filling the first PDTextField in a form the performance is slow. All 
 other PDTextFields in the form are handled quickly.
 This code
 {code}
 // First field: includes the one-time setup cost
 PDTextField field = (PDTextField) 
 doc.getDocumentCatalog().getAcroForm().getField("Textfield01");
 long start = System.nanoTime();
 field.setValue("ABCD");
 long end = System.nanoTime();
 double difference = (end - start)/1e6; // elapsed time in milliseconds
 System.out.println(difference);
 // Second field: same operation, measured the same way
 field = (PDTextField) 
 doc.getDocumentCatalog().getAcroForm().getField("Textfield02");
 start = System.nanoTime();
 field.setValue("ABCD");
 end = System.nanoTime();
 difference = (end - start)/1e6;
 System.out.println(difference);
 {code}
 produces the following output
 {noformat}
 9713.38
 3.904
 {noformat}






[jira] [Comment Edited] (PDFBOX-2871) Performance issue when filling the first PDTextField of an AcroForm

2015-07-14 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626837#comment-14626837
 ] 

John Hewson edited comment on PDFBOX-2871 at 7/14/15 6:41 PM:
--

I'm not really worried about speeding up font parsing: we have the on-disk 
cache now, so it's an only-once-ever event. Scanning for files on the local 
system is already fast. What's still relatively slow about the new cache is 
that it uses Java's serialization; a custom serialization format could be much 
faster. I'm also not sure about the speed of the Preferences API; benchmarking 
is needed.


was (Author: jahewson):
I'm not really worried about speeding up font parsing, we have the on-disk 
cache now, so it's a only once ever event. Scanning for files on the local 
system is already fast. What's still relatively slow about the new cache is 
that it uses Java's serialization - a custom serialisation format could be much 
faster. I'm also not sure about the speed of the Preferences API - benchmarking 
needed.







[jira] [Comment Edited] (PDFBOX-2272) Can't extract vertical text correctly

2015-07-14 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626841#comment-14626841
 ] 

Tilman Hausherr edited comment on PDFBOX-2272 at 7/14/15 6:43 PM:
--

Here's the change as a patch, just to show that this isn't some bureaucratic 
trick. Hopefully somebody will understand it... I've never worked deeply on 
that part of PDFBox, except two bug fixes (one of them from you).


was (Author: tilman):
Here's the change as a patch, just to show that this isn't some bureaucratic 
trick. Hopefully somebody will understand it... I've never worked deeply on 
that part of PDFBox, except two bug fixes (one of the from you).

 Can't extract vertical text correctly
 -

 Key: PDFBOX-2272
 URL: https://issues.apache.org/jira/browse/PDFBOX-2272
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.8.6, 2.0.0
Reporter: Biligsaikhan Batjargal
 Attachments: PDFTextStripper.java, test.pdf, test.txt, vertical.diff


 - -1.8.6 can't extract the Unicode due to failing to map the UCS2 CMap for 
 90ms-RKSJ-V.-
 - 2.0 extracts the text but can't handle the vertical layout
 Also see the file from PDFBOX-2294 which contains both horizontal and 
 vertical text.






[jira] [Updated] (PDFBOX-2272) Can't extract vertical text correctly

2015-07-14 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-2272:

Attachment: vertical.diff

Here's the change as a patch, just to show that this isn't some bureaucratic 
trick. Hopefully somebody will understand it... I've never worked deeply on 
that part of PDFBox, except two bug fixes (one of them from you).







[jira] [Commented] (PDFBOX-2530) Improve PDFDebugger

2015-07-14 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626869#comment-14626869
 ] 

ASF subversion and git services commented on PDFBOX-2530:
-

Commit 1691060 from [~jahewson] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1691060 ]

PDFBOX-2530: UI tweak for image view

 Improve PDFDebugger
 ---

 Key: PDFBOX-2530
 URL: https://issues.apache.org/jira/browse/PDFBOX-2530
 Project: PDFBox
  Issue Type: Improvement
  Components: Utilities
Affects Versions: 1.8.8, 2.0.0
Reporter: Tilman Hausherr
Assignee: khyrul bashar
  Labels: gsoc2015
 Attachments: Avoiding_NPE_for_null_Field_Type.diff, 
 BracketsColorChooser.png, Class_cast_exception_in_page_mode_avoided.diff, 
 DeviceNCS.diff, FlagBitsPane-26-06-2015.diff, 
 Flag_bits_showing_feature-redesigned.diff, Flag_bits_showing_feature.diff, 
 K4SystemFontsNotEmbeded218.pdf, PDFDebugger_StatusBar.png, 
 PDFDebugger_StatusBar_01.png, 
 Parent_dictionary_type_checking_for__f__and__flags.diff, 
 Sonarqube_warning_resolved.diff, Stream_Showing_Feature.diff, indexedcs.diff, 
 openSelectedPath.diff, parent_node_redirect.diff, 
 parent_node_redirect_expand_disabled.diff, removed_redundant_codes.patch, 
 separationCS.diff, sonarqube_warning_resolve.diff, tree.diff, 
 treestatus.diff, treestatuspane.diff


 (This is an idea for the [Google Summer of Code 
 2015|https://www.google-melange.com/])
 Our command line utility PDFDebugger (part of the command line pdfbox-app; get 
 it [here|https://pdfbox.apache.org/downloads.html], read the description 
 [here|https://pdfbox.apache.org/commandline/], see the source code 
 [here|https://svn.apache.org/viewvc/pdfbox/trunk/tools/src/main/java/org/apache/pdfbox/tools/PDFDebugger.java?view=markup&sortby=date])
 needs some improvements:
- hex view
- view of non printable characters
- ✓ saving streams
- binary copy & paste
- ✓ Create a status line that shows where we are in the tree. (Like in the 
 Windows REGEDIT)
- ✓ Copy the current tree string into the clipboard (useful in discussions 
 about details of a PDF)
- ✓ (Optional, not sure if easy) Jump to specific place in the tree by 
 entering tree string
- ✓ ability to search in streams (very useful for content streams and meta 
 data)
- ✓ show images that are streams
- ✓ show PDIndexed color lookup table, show the index value, the base and 
 RGB color value sets when the mouse moves
- ✓ show PDSeparation color
- ✓ show PDDeviceN colors
- optional, idea should be developed a bit: show meaningful explanation on 
 some attributes, e.g. appearance stream when hovering over /AP
- show font encodings and characters
- ✓ display flag bits (e.g. Annotation flags) in a way that is easy to 
 understand. There are probably others, I assume that the main work needs to 
 be done only once
- edit attributes (should be possible to enter values as decimal, hex or 
 binary)
- edit streams, while keeping or changing the compression filter
- save altered PDF 
- color mark of certain PDF operators, especially q...Q and text operators 
 (BT...ET). Ideally, it should help the user understand the bracketing of 
 these operators, i.e. understand where a sequence starts and where it ends. 
 (See the operator summary in the PDF spec.) Other important operators I can 
 think of are the matrix, font and color operators. A cool advanced thing 
 would be to show the current color or the font in a popup when hovering above 
 such an operator.
 To see a product with a similar purpose that is better than PDFDebugger, 
 watch [this video|https://www.youtube.com/watch?v=g-QcU9B4qMc].
 I'm not asking to implement a clone of that product (I don't use it, all I 
 know is that video), but we at PDFBox really need something that makes PDF 
 debugging easier. As an example of how the current PDFDebugger prevented me 
 from finding a bug quickly, see PDFBOX-2401 and search for PDFDebugger.
 Prerequisites:
 - Java programming, especially the GUI components
 - the ability to understand existing source code
 Using external software components is possible (must have Apache License or a 
 compatible one), but should be decided on a case-by-case basis, we don't want 
 to get too big.
 Development strategy: go from the easy to the difficult. The wished features 
 are already sorted this way (mostly).
 Getting started: [download the source code with 
 svn|https://pdfbox.apache.org/downloads.html#scm] and build it with Maven. 
 Run PDFDebugger and view some PDFs to see the components of a PDF. Start with 
 the file of PDFBOX-2401. Read up something about the structure of PDF on the 
 web or from the [PDF 
 Specification|https://www.adobe.com/devnet/pdf/pdf_reference.html].
 Mentor: Tilman Hausherr 
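
The wish list above asks for flag bits (e.g. annotation flags) to be shown in a way that is easy to understand. A minimal sketch of such a decoder follows; the class AnnotationFlags and its method names are hypothetical and not part of PDFDebugger. The bit names are those the PDF specification assigns to bit positions 1 to 10 of an annotation's /F entry.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that turns an annotation /F value into readable flag names. */
public class AnnotationFlags {
    // Names for bit positions 1..10 (lowest bit first) of the annotation /F entry.
    private static final String[] NAMES = {
        "Invisible", "Hidden", "Print", "NoZoom", "NoRotate",
        "NoView", "ReadOnly", "Locked", "ToggleNoView", "LockedContents"
    };

    /** Returns the names of all flags set in the given value, lowest bit first. */
    public static List<String> describe(int flags) {
        List<String> set = new ArrayList<>();
        for (int bit = 0; bit < NAMES.length; bit++) {
            if ((flags & (1 << bit)) != 0) {
                set.add(NAMES[bit]);
            }
        }
        return set;
    }

    /** Returns the ten known bits as a zero-padded binary string, highest bit first. */
    public static String toBinary(int flags) {
        String bits = Integer.toBinaryString(flags & 0x3FF);
        StringBuilder sb = new StringBuilder();
        for (int i = bits.length(); i < NAMES.length; i++) {
            sb.append('0');
        }
        return sb.append(bits).toString();
    }
}
```

For example, describe(68) yields the Print and ReadOnly flags. Other flag fields (such as field or font descriptor flags) could reuse the same loop with a different name table, which matches the remark that the main work probably only needs to be done once.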

[jira] [Commented] (PDFBOX-2530) Improve PDFDebugger

2015-07-14 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626872#comment-14626872
 ] 

John Hewson commented on PDFBOX-2530:
-

I was testing the new image view on some black-to-white gradient images, but it 
was hard to tell the image from the background, so I made some minor tweaks.


[jira] [Commented] (PDFBOX-2530) Improve PDFDebugger

2015-07-14 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626751#comment-14626751
 ] 

Tilman Hausherr commented on PDFBOX-2530:
-

The commit message has been fixed separately, with correct attribution to you.


[jira] [Commented] (PDFBOX-2530) Improve PDFDebugger

2015-07-14 Thread khyrul bashar (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626753#comment-14626753
 ] 

khyrul bashar commented on PDFBOX-2530:
---

I've uploaded a patch. 

 Improve PDFDebugger
 ---

 Key: PDFBOX-2530
 URL: https://issues.apache.org/jira/browse/PDFBOX-2530
 Project: PDFBox
  Issue Type: Improvement
  Components: Utilities
Affects Versions: 1.8.8, 2.0.0
Reporter: Tilman Hausherr
Assignee: khyrul bashar
  Labels: gsoc2015
 Attachments: Avoiding_NPE_for_null_Field_Type.diff, 
 BracketsColorChooser.png, Class_cast_exception_in_page_mode_avoided.diff, 
 DeviceNCS.diff, FlagBitsPane-26-06-2015.diff, 
 Flag_bits_showing_feature-redesigned.diff, Flag_bits_showing_feature.diff, 
 K4SystemFontsNotEmbeded218.pdf, PDFDebugger_StatusBar.png, 
 PDFDebugger_StatusBar_01.png, 
 Parent_dictionary_type_checking_for__f__and__flags.diff, 
 Sonarqube_warning_resolved.diff, Stream_Showing_Feature.diff, indexedcs.diff, 
 openSelectedPath.diff, parent_node_redirect.diff, 
 parent_node_redirect_expand_disabled.diff, removed_redundant_codes.patch, 
 separationCS.diff, sonarqube_warning_resolve.diff, tree.diff, 
 treestatus.diff, treestatuspane.diff


 (This is an idea for the [Google Summer of Code 
 2015|https://www.google-melange.com/])
 Our command line utility PDFDebugger (part of the command line pdfbox-app; get 
 it [here|https://pdfbox.apache.org/downloads.html], read the description 
 [here|https://pdfbox.apache.org/commandline/], see the source code 
 [here|https://svn.apache.org/viewvc/pdfbox/trunk/tools/src/main/java/org/apache/pdfbox/tools/PDFDebugger.java?view=markup&sortby=date])
 needs some improvements:
- hex view
- view of non printable characters
- ✓ saving streams
- binary copy & paste
- ✓ Create a status line that shows where we are in the tree. (Like in the 
 Windows REGEDIT)
- ✓ Copy the current tree string into the clipboard (useful in discussions 
 about details of a PDF)
- ✓ (Optional, not sure if easy) Jump to specific place in the tree by 
 entering tree string
- ✓ ability to search in streams (very useful for content streams and meta 
 data)
- ✓ show images that are streams
- ✓ show PDIndexed color lookup table, show the index value, the base and 
 RGB color value sets when the mouse moves
- ✓ show PDSeparation color
- ✓ show PDDeviceN colors
- optional, idea should be developed a bit: show meaningful explanation on 
 some attributes, e.g. appearance stream when hovering over /AP
- show font encodings and characters
- ✓ display flag bits (e.g. Annotation flags) in a way that is easy to 
 understand. There are probably others, I assume that the main work needs to 
 be done only once
- edit attributes (should be possible to enter values as decimal, hex or 
 binary)
- edit streams, while keeping or changing the compression filter
- save altered PDF 
- color mark of certain PDF operators, especially q...Q and text operators 
 (BT...ET). Ideally, it should help the user understand the bracketing of 
 these operators, i.e. understand where a sequence starts and where it ends. 
 (See the operator summary in the PDF spec.) Other important operators I can 
 think of are the matrix, font and color operators. A cool advanced thing 
 would be to show the current color or the font in a popup when hovering above 
 such an operator.
 To see a product with a similar purpose that is better than PDFDebugger, 
 watch [this video|https://www.youtube.com/watch?v=g-QcU9B4qMc].
 I'm not asking to implement a clone of that product (I don't use it, all I 
 know is that video), but we at PDFBox really need something that makes PDF 
 debugging easier. As an example of how the current PDFDebugger prevented me 
 from finding a bug quickly, see PDFBOX-2401 and search for PDFDebugger.
 Prerequisites:
 - Java programming, especially the GUI components
 - the ability to understand existing source code
 Using external software components is possible (must have Apache License or a 
 compatible one), but should be decided on a case-by-case basis, we don't want 
 to get too big.
 Development strategy: go from the easy to the difficult. The wished features 
 are already sorted this way (mostly).
 Getting started: [download the source code with 
 svn|https://pdfbox.apache.org/downloads.html#scm] and build it with Maven. 
 Run PDFDebugger and view some PDFs to see the components of a PDF. Start with 
 the file of PDFBOX-2401. Read up something about the structure of PDF on the 
 web or from the [PDF 
 Specification|https://www.adobe.com/devnet/pdf/pdf_reference.html].
 Mentor: Tilman Hausherr (European timezone, languages: german, english, 
 french). To see the GSoC2014 project I mentored, go to PDFBOX-1915.




[jira] [Commented] (PDFBOX-2530) Improve PDFDebugger

2015-07-14 Thread khyrul bashar (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626754#comment-14626754
 ] 

khyrul bashar commented on PDFBOX-2530:
---

I've uploaded a patch. 

 Improve PDFDebugger
 ---

 Key: PDFBOX-2530
 URL: https://issues.apache.org/jira/browse/PDFBOX-2530
 Project: PDFBox
  Issue Type: Improvement
  Components: Utilities
Affects Versions: 1.8.8, 2.0.0
Reporter: Tilman Hausherr
Assignee: khyrul bashar
  Labels: gsoc2015
 Attachments: Avoiding_NPE_for_null_Field_Type.diff, 
 BracketsColorChooser.png, Class_cast_exception_in_page_mode_avoided.diff, 
 DeviceNCS.diff, FlagBitsPane-26-06-2015.diff, 
 Flag_bits_showing_feature-redesigned.diff, Flag_bits_showing_feature.diff, 
 K4SystemFontsNotEmbeded218.pdf, PDFDebugger_StatusBar.png, 
 PDFDebugger_StatusBar_01.png, 
 Parent_dictionary_type_checking_for__f__and__flags.diff, 
 Sonarqube_warning_resolved.diff, Stream_Showing_Feature.diff, indexedcs.diff, 
 openSelectedPath.diff, parent_node_redirect.diff, 
 parent_node_redirect_expand_disabled.diff, removed_redundant_codes.patch, 
 separationCS.diff, sonarqube_warning_resolve.diff, tree.diff, 
 treestatus.diff, treestatuspane.diff


 (This is an idea for the [Google Summer of Code 
 2015|https://www.google-melange.com/])
 Our command line utility PDFDebugger (part of the command line pdfbox-app get 
 it [here|https://pdfbox.apache.org/downloads.html], read description 
 [here|https://pdfbox.apache.org/commandline/], see the source code 
 [here|https://svn.apache.org/viewvc/pdfbox/trunk/tools/src/main/java/org/apache/pdfbox/tools/PDFDebugger.java?view=markupsortby=date])
  needs some improvements:
- hex view
- view of non printable characters
- ✓ saving streams
- binary copy  paste
- ✓ Create a status line that shows where we are in the tree. (Like in the 
 Windows REGEDIT)
- ✓ Copy the current tree string into the clipboard (useful in discussions 
 about details of a PDF)
- ✓ (Optional, not sure if easy) Jump to specific place in the tree by 
 entering tree string
- ✓ ability to search in streams (very useful for content streams and meta 
 data)
- ✓ show images that are streams
- ✓ show PDIndexed color lookup table, show the index value, the base and 
 RGB color value sets when the mouse moves
- ✓ show PDSeparation color
- ✓ show PDDeviceN colors
- optional, idea should be developed a bit: show meaningful explanation on 
 some attributes, e.g. appearance stream when hovering over /AP
- show font encodings and characters
- ✓ display flag bits (e.g. Annotation flags) in a way that is easy to 
 understand. There are probably others, I assume that the main work needs to 
 be done only once
- edit attributes (should be possible to enter values as decimal, hex or 
 binary)
- edit streams, while keeping or changing the compression filter
- save altered PDF 
- color mark of certain PDF operators, especially Q...q and text operators 
 (BT...ET). Ideally, it should help the user understand the bracketing of 
 these operators, i.e. understand where a sequence starts and where it ends. 
 (See operator summary in the PDF Spec) Other important operators I can 
 think of are the matrix, font and color operators. A cool advanced thing 
 would be to show the current color or the font in a popup when hovering above 
 such an operator.
 To see a product with a similar purpose that is better than PDFDebugger, 
 watch [this video|https://www.youtube.com/watch?v=g-QcU9B4qMc].
 I'm not asking to implement a clone of that product (I don't use it, all I 
 know is that video), but we at PDFBox really need something that makes PDF 
 debugging easier. As an example of how the current PDFDebugger prevented me 
 from finding a bug quickly, see PDFBOX-2401 and search for PDFDebugger.
 Prerequisites:
 - java programming, especially the GUI components
 - the ability to understand existing source code
 Using external software components is possible (must have Apache License or a 
 compatible one), but should be decided on a case-by-case basis, we don't want 
 to get too big.
 Development strategy: go from the easy to the difficult. The wished features 
 are already sorted this way (mostly).
 Get introduced: [download the source code with 
 svn|https://pdfbox.apache.org/downloads.html#scm] and build it with Maven. 
 Run PDFDebugger and view some PDFs to see the components of a PDF. Start with 
 the file of PDFBOX-2401. Read up on the structure of PDF on the 
 web or from the [PDF 
 Specification|https://www.adobe.com/devnet/pdf/pdf_reference.html].
 Mentor: Tilman Hausherr (European timezone, languages: German, English, 
 French). To see the GSoC2014 project I mentored, go to PDFBOX-1915.



--
This message was sent 

[jira] [Issue Comment Deleted] (PDFBOX-2530) Improve PDFDebugger

2015-07-14 Thread khyrul bashar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

khyrul bashar updated PDFBOX-2530:
--
Comment: was deleted

(was: I've uploaded a patch. )

 Improve PDFDebugger
 ---

 Key: PDFBOX-2530
 URL: https://issues.apache.org/jira/browse/PDFBOX-2530
 Project: PDFBox
  Issue Type: Improvement
  Components: Utilities
Affects Versions: 1.8.8, 2.0.0
Reporter: Tilman Hausherr
Assignee: khyrul bashar
  Labels: gsoc2015
 Attachments: Avoiding_NPE_for_null_Field_Type.diff, 
 BracketsColorChooser.png, Class_cast_exception_in_page_mode_avoided.diff, 
 DeviceNCS.diff, FlagBitsPane-26-06-2015.diff, 
 Flag_bits_showing_feature-redesigned.diff, Flag_bits_showing_feature.diff, 
 K4SystemFontsNotEmbeded218.pdf, PDFDebugger_StatusBar.png, 
 PDFDebugger_StatusBar_01.png, 
 Parent_dictionary_type_checking_for__f__and__flags.diff, 
 Sonarqube_warning_resolved.diff, Stream_Showing_Feature.diff, indexedcs.diff, 
 openSelectedPath.diff, parent_node_redirect.diff, 
 parent_node_redirect_expand_disabled.diff, removed_redundant_codes.patch, 
 separationCS.diff, sonarqube_warning_resolve.diff, tree.diff, 
 treestatus.diff, treestatuspane.diff


 (This is an idea for the [Google Summer of Code 
 2015|https://www.google-melange.com/])
 Our command line utility PDFDebugger (part of the command line pdfbox-app; get 
 it [here|https://pdfbox.apache.org/downloads.html], read the description 
 [here|https://pdfbox.apache.org/commandline/], see the source code 
 [here|https://svn.apache.org/viewvc/pdfbox/trunk/tools/src/main/java/org/apache/pdfbox/tools/PDFDebugger.java?view=markup&sortby=date])
  needs some improvements:
- hex view
- view of non-printable characters
- ✓ saving streams
- binary copy & paste
- ✓ Create a status line that shows where we are in the tree. (Like in the 
 Windows REGEDIT)
- ✓ Copy the current tree string into the clipboard (useful in discussions 
 about details of a PDF)
- ✓ (Optional, not sure if easy) Jump to specific place in the tree by 
 entering tree string
- ✓ ability to search in streams (very useful for content streams and 
 metadata)
- ✓ show images that are streams
- ✓ show PDIndexed color lookup table, show the index value, the base and 
 RGB color value sets when the mouse moves
- ✓ show PDSeparation color
- ✓ show PDDeviceN colors
- optional, idea should be developed a bit: show meaningful explanation on 
 some attributes, e.g. appearance stream when hovering over /AP
- show font encodings and characters
- ✓ display flag bits (e.g. Annotation flags) in a way that is easy to 
 understand. There are probably others, I assume that the main work needs to 
 be done only once
- edit attributes (should be possible to enter values as decimal, hex or 
 binary)
- edit streams, while keeping or changing the compression filter
- save altered PDF 
- color marking of certain PDF operators, especially q...Q and text operators 
 (BT...ET). Ideally, it should help the user understand the bracketing of 
 these operators, i.e. understand where a sequence starts and where it ends. 
 (See operator summary in the PDF Spec) Other important operators I can 
 think of are the matrix, font and color operators. A cool advanced thing 
 would be to show the current color or the font in a popup when hovering above 
 such an operator.
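
 The bracketing idea in the last item can be illustrated with a minimal sketch. It assumes the content stream has already been split into operator names, and it only tracks the two pairs named above: q (save graphics state) with Q (restore), and BT with ET. The class and method names are hypothetical and are not part of PDFDebugger or PDFBox; a viewer could use the returned depth to indent or color each operator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not PDFDebugger code: computes a display depth per operator.
class OperatorBracketSketch {

    /** Returns, for each operator, the nesting depth at which it would be shown. */
    static List<Integer> depths(List<String> operators) {
        List<Integer> result = new ArrayList<Integer>();
        int depth = 0;
        for (String op : operators) {
            boolean opens = op.equals("q") || op.equals("BT");
            boolean closes = op.equals("Q") || op.equals("ET");
            // A closing operator is shown at the same depth as its opener.
            if (closes && depth > 0) {
                depth--;
            }
            result.add(depth);
            // Everything after an opener is nested one level deeper.
            if (opens) {
                depth++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> ops = Arrays.asList("q", "cm", "BT", "Tf", "Tj", "ET", "Q");
        System.out.println(depths(ops)); // prints [0, 1, 1, 2, 2, 1, 0]
    }
}
```

 Unbalanced closing operators are clamped at depth 0 here; a real tool would probably want to flag them instead.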
 To see a product with a similar purpose that is better than PDFDebugger, 
 watch [this video|https://www.youtube.com/watch?v=g-QcU9B4qMc].
 I'm not asking to implement a clone of that product (I don't use it, all I 
 know is that video), but we at PDFBox really need something that makes PDF 
 debugging easier. As an example of how the current PDFDebugger prevented me 
 from finding a bug quickly, see PDFBOX-2401 and search for PDFDebugger.
 Prerequisites:
 - Java programming, especially the GUI components
 - the ability to understand existing source code
 Using external software components is possible (they must have the Apache License or a 
 compatible one), but this should be decided on a case-by-case basis; we don't want 
 to get too big.
 Development strategy: go from the easy to the difficult. The desired features 
 are already sorted this way (mostly).
 Get introduced: [download the source code with 
 svn|https://pdfbox.apache.org/downloads.html#scm] and build it with Maven. 
 Run PDFDebugger and view some PDFs to see the components of a PDF. Start with 
 the file of PDFBOX-2401. Read up on the structure of PDF on the 
 web or from the [PDF 
 Specification|https://www.adobe.com/devnet/pdf/pdf_reference.html].
 Mentor: Tilman Hausherr (European timezone, languages: German, English, 
 French). To see the GSoC2014 project I mentored, go to PDFBOX-1915.
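
 As an illustration of the "display flag bits" item in the feature list, here is a minimal sketch that turns an annotation /F value into readable names. The bit order follows the annotation flags table of the PDF specification (bit position 1 is the lowest-order bit). The class and method names are hypothetical and do not reflect how PDFDebugger implements this feature.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not PDFDebugger code: decodes annotation flag bits.
class AnnotationFlagSketch {

    // Flag names in bit order, starting at bit position 1 (value 1).
    private static final String[] NAMES = {
        "Invisible", "Hidden", "Print", "NoZoom", "NoRotate",
        "NoView", "ReadOnly", "Locked", "ToggleNoView", "LockedContents"
    };

    /** Returns the names of all flags that are set in the given /F value. */
    static List<String> decode(int flags) {
        List<String> set = new ArrayList<String>();
        for (int i = 0; i < NAMES.length; i++) {
            if ((flags & (1 << i)) != 0) {
                set.add(NAMES[i]);
            }
        }
        return set;
    }

    public static void main(String[] args) {
        System.out.println(decode(4));  // prints [Print]
        System.out.println(decode(68)); // prints [Print, ReadOnly]
    }
}
```

 Other flag fields (for example form field flags) would only need a different name table, which fits the remark that the main work needs to be done only once.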




[jira] [Commented] (PDFBOX-2530) Improve PDFDebugger

2015-07-14 Thread khyrul bashar (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626755#comment-14626755
 ] 

khyrul bashar commented on PDFBOX-2530:
---

Thanks :)


[jira] [Comment Edited] (PDFBOX-2530) Improve PDFDebugger

2015-07-14 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626647#comment-14626647
 ] 

Tilman Hausherr edited comment on PDFBOX-2530 at 7/14/15 5:57 PM:
--

There is a new bug (class cast exception) when clicking on a page content 
stream when in show pages mode. Although the bug is new, I assume that the 
root cause -(a MapEntry with a MapEntry)- is older.


was (Author: tilman):
There is a new bug (class cast exception) when clicking on a page content 
stream when in show pages mode. Although the bug is new, I assume that the 
root cause (a MapEntry with a MapEntry) is older.


[jira] [Updated] (PDFBOX-2530) Improve PDFDebugger

2015-07-14 Thread khyrul bashar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

khyrul bashar updated PDFBOX-2530:
--
Attachment: Sonarqube_warning_resolved.diff

 Improve PDFDebugger
 ---

 Key: PDFBOX-2530
 URL: https://issues.apache.org/jira/browse/PDFBOX-2530
 Project: PDFBox
  Issue Type: Improvement
  Components: Utilities
Affects Versions: 1.8.8, 2.0.0
Reporter: Tilman Hausherr
Assignee: khyrul bashar
  Labels: gsoc2015
 Attachments: Avoiding_NPE_for_null_Field_Type.diff, 
 BracketsColorChooser.png, DeviceNCS.diff, FlagBitsPane-26-06-2015.diff, 
 Flag_bits_showing_feature-redesigned.diff, Flag_bits_showing_feature.diff, 
 K4SystemFontsNotEmbeded218.pdf, PDFDebugger_StatusBar.png, 
 PDFDebugger_StatusBar_01.png, 
 Parent_dictionary_type_checking_for__f__and__flags.diff, 
 Sonarqube_warning_resolved.diff, Stream_Showing_Feature.diff, indexedcs.diff, 
 openSelectedPath.diff, parent_node_redirect.diff, 
 parent_node_redirect_expand_disabled.diff, removed_redundant_codes.patch, 
 separationCS.diff, sonarqube_warning_resolve.diff, tree.diff, 
 treestatus.diff, treestatuspane.diff




[jira] [Updated] (PDFBOX-2530) Improve PDFDebugger

2015-07-14 Thread khyrul bashar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

khyrul bashar updated PDFBOX-2530:
--
Attachment: Class_cast_exception_in_page_mode_avoided.diff


[jira] [Commented] (PDFBOX-2530) Improve PDFDebugger

2015-07-14 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626774#comment-14626774
 ] 

ASF subversion and git services commented on PDFBOX-2530:
-

Commit 1691044 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1691044 ]

PDFBOX-2530: fix ClassCastException in page content streams when in page 
display mode, as done by Khyrul Bashar in GSoC2015

 Improve PDFDebugger
 ---

 Key: PDFBOX-2530
 URL: https://issues.apache.org/jira/browse/PDFBOX-2530
 Project: PDFBox
  Issue Type: Improvement
  Components: Utilities
Affects Versions: 1.8.8, 2.0.0
Reporter: Tilman Hausherr
Assignee: khyrul bashar
  Labels: gsoc2015
 Attachments: Avoiding_NPE_for_null_Field_Type.diff, 
 BracketsColorChooser.png, Class_cast_exception_in_page_mode_avoided.diff, 
 DeviceNCS.diff, FlagBitsPane-26-06-2015.diff, 
 Flag_bits_showing_feature-redesigned.diff, Flag_bits_showing_feature.diff, 
 K4SystemFontsNotEmbeded218.pdf, PDFDebugger_StatusBar.png, 
 PDFDebugger_StatusBar_01.png, 
 Parent_dictionary_type_checking_for__f__and__flags.diff, 
 Sonarqube_warning_resolved.diff, Stream_Showing_Feature.diff, indexedcs.diff, 
 openSelectedPath.diff, parent_node_redirect.diff, 
 parent_node_redirect_expand_disabled.diff, removed_redundant_codes.patch, 
 separationCS.diff, sonarqube_warning_resolve.diff, tree.diff, 
 treestatus.diff, treestatuspane.diff


 (This is an idea for the [Google Summer of Code 
 2015|https://www.google-melange.com/])
 Our command line utility PDFDebugger (part of the command line pdfbox-app; get 
 it [here|https://pdfbox.apache.org/downloads.html], read the description 
 [here|https://pdfbox.apache.org/commandline/], see the source code 
 [here|https://svn.apache.org/viewvc/pdfbox/trunk/tools/src/main/java/org/apache/pdfbox/tools/PDFDebugger.java?view=markup&sortby=date])
  needs some improvements:
- hex view
- view of non-printable characters
- ✓ saving streams
- binary copy & paste
- ✓ Create a status line that shows where we are in the tree. (Like in the 
 Windows REGEDIT)
- ✓ Copy the current tree string into the clipboard (useful in discussions 
 about details of a PDF)
- ✓ (Optional, not sure if easy) Jump to specific place in the tree by 
 entering tree string
- ✓ ability to search in streams (very useful for content streams and 
 metadata)
- ✓ show images that are streams
- ✓ show PDIndexed color lookup table, show the index value, the base and 
 RGB color value sets when the mouse moves
- ✓ show PDSeparation color
- ✓ show PDDeviceN colors
- optional, idea should be developed a bit: show meaningful explanation on 
 some attributes, e.g. appearance stream when hovering over /AP
- show font encodings and characters
- ✓ display flag bits (e.g. Annotation flags) in a way that is easy to 
 understand. There are probably others, I assume that the main work needs to 
 be done only once
- edit attributes (should be possible to enter values as decimal, hex or 
 binary)
- edit streams, while keeping or changing the compression filter
- save altered PDF 
- color marking of certain PDF operators, especially q...Q and text operators 
 (BT...ET). Ideally, it should help the user understand the bracketing of 
 these operators, i.e. understand where a sequence starts and where it ends. 
 (See operator summary in the PDF Spec) Other important operators I can 
 think of are the matrix, font and color operators. A cool advanced thing 
 would be to show the current color or the font in a popup when hovering above 
 such an operator.
 To see a product with a similar purpose that is better than PDFDebugger, 
 watch [this video|https://www.youtube.com/watch?v=g-QcU9B4qMc].
 I'm not asking to implement a clone of that product (I don't use it, all I 
 know is that video), but we at PDFBox really need something that makes PDF 
 debugging easier. As an example of how the current PDFDebugger prevented me 
 from finding a bug quickly, see PDFBOX-2401 and search for PDFDebugger.
 Prerequisites:
 - Java programming, especially the GUI components
 - the ability to understand existing source code
 Using external software components is possible (they must have the Apache License or a 
 compatible one), but this should be decided on a case-by-case basis; we don't want 
 to get too big.
 Development strategy: go from the easy to the difficult. The desired features 
 are already sorted this way (mostly).
 Get introduced: [download the source code with 
 svn|https://pdfbox.apache.org/downloads.html#scm] and build it with Maven. 
 Run PDFDebugger and view some PDFs to see the components of a PDF. Start with 
 the file of PDFBOX-2401. Read up on the structure of PDF on the 
 web or from the [PDF