date:20150714

[jira] [Commented] (PDFBOX-2272) Can't extract vertical text correctly

2015-07-14 Thread Tilman Hausherr (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14625878#comment-14625878
 ] 

Tilman Hausherr commented on PDFBOX-2272:
-

Please submit this as a diff against the repository, so that we can see easily 
what is different.

 Can't extract vertical text correctly
 -

 Key: PDFBOX-2272
 URL: https://issues.apache.org/jira/browse/PDFBOX-2272
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.8.6, 2.0.0
Reporter: Biligsaikhan Batjargal
 Attachments: PDFTextStripper.java, test.pdf, test.txt


 - -1.8.6 can't extract the Unicode due to failing to map the UCS2 CMap for 
 90ms-RKSJ-V.-
 - 2.0 extracts the text but can't handle the vertical layout
 Also see the file from PDFBOX-2294 which contains both horizontal and 
 vertical text.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org

Re: Performance of the trunkversion

2015-07-14 Thread Andreas Lehmkühler


 Manfred Pock pock.manf...@gmail.com wrote on 14 July 2015 at 12:15:
 
 
 Yes, the input is an InputStream. I can try loading it directly from a file.
 
 But in general we get the PDF from a document management system as a stream.
 Does it make sense to save the PDF to a file first?
If possible, yes. As I already said, we need random access to the PDF, and an
InputStream doesn't support seek operations, so we have to copy the whole
stream to a file or to memory.
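
As a rough illustration of the point above, the following sketch spools a stream to a temporary file once and then reads from an arbitrary offset. It uses only the JDK; `StreamSpooler` and its method names are hypothetical and not part of PDFBox. The resulting file is what would then be handed to the load method that takes a file.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper (not a PDFBox class): makes stream content seekable. */
final class StreamSpooler {
    private StreamSpooler() {
    }

    /** Copies the whole stream to a new temporary file and returns its path. */
    static Path spoolToTempFile(InputStream in) {
        try {
            Path tmp = Files.createTempFile("spooled-", ".pdf");
            // The temp file already exists, so it has to be replaced explicitly.
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads one byte at an arbitrary offset, which a plain InputStream cannot do. */
    static int byteAt(Path file, long offset) {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(offset);
            return raf.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The copy costs one sequential pass over the data; after that, every seek is served by the file system instead of by re-reading the stream.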

 Why is there such a big performance difference between the version from
 May and the current version when we use useScratchFiles = true?
I'm not sure, but the reason seems to be the altered scratch file handling. I
have to double-check that.

BR
Andreas

 Regards, Manfred
 

Re: Performance of the trunkversion

2015-07-14 Thread Timo Boehme


Hi,

as I see it (I had only a quick look at the implementation), the
ScratchFileBuffer implementation is not optimal for fast random access.
Single-byte writes are not buffered but written directly to the file
(a lot of I/O operations), and seek operations have to traverse the linked
page list, reading some bytes of each page (again a lot of seek and read
I/O operations).
To speed things up, it is crucial to minimize the number of I/O operations
that go directly to the random access file. Therefore we need to buffer
writes, keep the last read page in memory for sequential reads, and have an
in-memory cache of page metadata (offset, link to previous/next page).
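
To make the first of these points concrete, here is a minimal sketch of write buffering on top of a random access file. It is not the ScratchFileBuffer code; `BufferedPageWriter`, its methods and the buffer size are hypothetical, and the page cache and metadata cache are left out.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;

/** Hypothetical sketch: collects single-byte writes and flushes them in blocks. */
final class BufferedPageWriter implements AutoCloseable {
    private final RandomAccessFile file;
    private final byte[] buffer;
    private int used;            // bytes currently waiting in the buffer
    private int physicalWrites;  // write calls that actually reached the file

    BufferedPageWriter(RandomAccessFile file, int bufferSize) {
        this.file = file;
        this.buffer = new byte[bufferSize];
    }

    /** Convenience factory for experiments: writes to a fresh temporary file. */
    static BufferedPageWriter onTempFile(int bufferSize) {
        try {
            File tmp = File.createTempFile("scratch-", ".bin");
            tmp.deleteOnExit();
            return new BufferedPageWriter(new RandomAccessFile(tmp, "rw"), bufferSize);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Stores one byte in memory; the file is touched only when the buffer is full. */
    void write(int b) {
        buffer[used++] = (byte) b;
        if (used == buffer.length) {
            flush();
        }
    }

    /** Writes all pending bytes to the file with a single I/O call. */
    void flush() {
        if (used == 0) {
            return;
        }
        try {
            file.write(buffer, 0, used);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        physicalWrites++;
        used = 0;
    }

    int physicalWrites() {
        return physicalWrites;
    }

    long fileLength() {
        try {
            return file.length();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void close() {
        flush();
        try {
            file.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With a 4096-byte buffer, 10000 single-byte writes reach the file in three calls instead of 10000.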



Best,
Timo





--
Timo Boehme
OntoChem IT Solutions GmbH
Blücherstraße 24
06120 Halle (Saale)
Germany

phone: +49 345 478 047 4  | fax: +49 345 478 047 1
email: ulf.la...@ontochem.com | web: www.ontochem.com
HRB 21962 Amtsgericht Stendal | USt-IdNr.: DE815563824
managing director : Lutz Weber



Re: Performance of the trunkversion

2015-07-14 Thread Andreas Lehmkühler

Hi,

 Manfred Pock pock.manf...@gmail.com wrote on 14 July 2015 at 11:39:
 
 
 OK, we load the PDF with useScratchFiles = true. If we load it with
 false, the performance is better, but still a little slower than the old version.
What do you use as input, a stream or a real file? If the latter, you should use
the load method with the file parameter.

PDFBox needs random access to the PDF, and if a stream is provided, PDFBox copies
the data to a file (lower memory usage, slower performance) or to memory
(higher memory usage, better performance).

BR
Andreas


 But now it needs more memory. I cannot load some PDFs with the current
 version with the same Java memory configuration.
 
 On 14.07.2015 at 11:26, Manfred Pock wrote:
  Hi,
 
  we use the PDFBox trunk version to render PDFs; currently we use the
  version from 12 May 2015.
 
  Today I updated to the current version and tested it.
  It seems that it now needs much more time to render PDFs, depending
  on the size of the PDF.
 
  For example, you can try this one:
 
  http://cloud.directupload.net/15bu
 
  It takes five times longer than the version from May 2015.
 
  Regards, Manfred
 


[jira] [Commented] (PDFBOX-2860) NonSeq parser slower than Seq parser

2015-07-14 Thread simon steiner (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-2860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626187#comment-14626187
 ] 

simon steiner commented on PDFBOX-2860:
---

It sounds like we should use the non-sequential parser when we can pass a file,
and the sequential parser when we only have an InputStream.

 NonSeq parser slower than Seq parser
 

 Key: PDFBOX-2860
 URL: https://issues.apache.org/jira/browse/PDFBOX-2860
 Project: PDFBox
  Issue Type: Bug
  Components: Parsing
Affects Versions: 2.0.0
Reporter: simon steiner

 PDF from PDFBOX-797
 for (int i = 0; i < 1000; i++) {
     PDDocument.load(new FileInputStream(
             "4218.pdf")).close();
 }
 Nonseq:
 real  0m23.691s
 Seq:
 real  0m9.705s




Performance of the trunkversion

2015-07-14 Thread Manfred Pock


Hi,

we use the PDFBox trunk version to render PDFs; currently we use the
version from 12 May 2015.


Today I updated to the current version and tested it. It seems that it
now needs much more time to render PDFs, depending on the size of the
PDF.


For example, you can try this one:

http://cloud.directupload.net/15bu

It takes five times longer than the version from May 2015.

Regards, Manfred

Re: Performance of the trunkversion

2015-07-14 Thread Manfred Pock


Yes, the input is an InputStream. I can try loading it directly from a file.

But in general we get the PDF from a document management system as a stream.
Does it make sense to save the PDF to a file first?

Why is there such a big performance difference between the version from
May and the current version when we use useScratchFiles = true?


Regards, Manfred


Understanding PDFBox

2015-07-14 Thread Subham Tripathi

Hi All,
I wish to contribute to Apache PDFBox, but before that I am trying to
understand the codebase. I am finding it very hard to understand, as I
cannot find any flow to follow.
Is there any documentation from which I can get some high-level insight into
PDFBox?

-- 
Best Regards,
Subham Tripathi

Re: Performance of the trunkversion

2015-07-14 Thread Timo Boehme


Hi,

instead of having a linked page list in ScratchFileBuffer, I would
propose keeping a list of pages with their page numbers (integers) in
memory (this takes 1k for 1MB of data). This would ease page handling, seeking
would not need any I/O operations, and caching of pages would be a lot easier.
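
A minimal sketch of such a page list could look like the following. The class name `PageTable`, its methods and the 4 KB page size are assumptions made for illustration, not existing PDFBox code. With 4 KB pages and 4-byte integers, 1 MB of data needs 256 entries, which is the 1k mentioned above.

```java
import java.util.Arrays;

/** Hypothetical in-memory page list: maps logical pages to scratch-file pages. */
final class PageTable {
    static final int PAGE_SIZE = 4096; // assumed page size

    private int[] pageNumbers = new int[16]; // scratch-file page number per logical page
    private int pageCount;

    /** Registers the scratch-file page that holds the next logical page. */
    void addPage(int scratchFilePageNumber) {
        if (pageCount == pageNumbers.length) {
            pageNumbers = Arrays.copyOf(pageNumbers, pageCount * 2);
        }
        pageNumbers[pageCount++] = scratchFilePageNumber;
    }

    /** Translates a logical position into a scratch-file offset without any I/O. */
    long fileOffset(long position) {
        long logicalPage = position / PAGE_SIZE;
        if (position < 0 || logicalPage >= pageCount) {
            throw new IndexOutOfBoundsException("position " + position);
        }
        return (long) pageNumbers[(int) logicalPage] * PAGE_SIZE + position % PAGE_SIZE;
    }

    /** Memory used by the page numbers of the registered pages (4 bytes each). */
    long indexBytes() {
        return (long) pageCount * Integer.BYTES;
    }
}
```

A seek then becomes one division and one array lookup, where the linked list needs a read per page.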

I may find some time later to come up with such a replacement.

Best,
Timo









--
Timo Boehme
OntoChem IT Solutions GmbH
Blücherstraße 24
06120 Halle (Saale)
Germany

phone: +49 345 478 047 4  | fax: +49 345 478 047 1
email: ulf.la...@ontochem.com | web: www.ontochem.com
HRB 21962 Amtsgericht Stendal | USt-IdNr.: DE815563824
managing director : Lutz Weber



Re: Understanding PDFBox

2015-07-14 Thread khyrul Bashar

Hi Subham,
I'm a GSoC student at PDFBox this year, and I'm improving the PDFDebugger
(issue https://issues.apache.org/jira/browse/PDFBOX-2530). Before
applying for the project, I had to become familiar with the codebase. I was
a bit puzzled at first, but now I have a basic understanding of
it, though I'm not coding for the main PDFBox module. Here is
what I've done so far to get comfortable with PDFBox.

Read the PDF specification, at least enough to get a head start.
https://www.adobe.com/devnet/pdf/pdf_reference.html
Read the documentation.
https://pdfbox.apache.org/docs/2.0.0-SNAPSHOT/javadocs/
Play with the example code.
https://svn.apache.org/viewvc/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/

Anyway, there are other things to do before you can contribute, which I think
the committers can describe more specifically.

Regards
Khyrul Bashar


[jira] [Commented] (PDFBOX-2877) Wrong text placement for autosize fields compared to Adobe generated

2015-07-14 Thread Maruan Sahyoun (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626372#comment-14626372
 ] 

Maruan Sahyoun commented on PDFBOX-2877:


I made some progress and the results using an Arial font are much better. The
results using the Courier font are still far from the Adobe-generated
results, so it looks like relying on the available font metrics such as
BBox, ascender, descender ... is not enough. I will continue to investigate and
post my findings. For the moment I will generate some more test material
to ensure that we have better coverage for different fonts.

 Wrong text placement for autosize fields compared to Adobe generated
 

 Key: PDFBOX-2877
 URL: https://issues.apache.org/jira/browse/PDFBOX-2877
 Project: PDFBox
  Issue Type: Sub-task
  Components: AcroForm
Affects Versions: 1.8.9, 2.0.0
Reporter: Maruan Sahyoun
Assignee: Maruan Sahyoun
  Labels: Appearance
 Fix For: 2.0.0

 Attachments: AutosizeTests-filled-20150713.pdf, 
 AutosizeTests-filled-20150713.png, AutosizeTests.pdf


 When a field uses autosizing the generated appearance is wrong as
 - the text is placed lower than expected
 - the font size is too large
 compared to the appearance generated with Adobe tools




[jira] [Commented] (PDFBOX-2852) Improve code quality (2)

2015-07-14 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626537#comment-14626537
 ] 

ASF subversion and git services commented on PDFBOX-2852:
-

Commit 1691003 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1691003 ]

PDFBOX-2852: remove unused imports

 Improve code quality (2)
 

 Key: PDFBOX-2852
 URL: https://issues.apache.org/jira/browse/PDFBOX-2852
 Project: PDFBox
  Issue Type: Task
Affects Versions: 2.0.0
Reporter: Tilman Hausherr

 This is a longterm issue for the task to improve code quality, by using the 
 [SonarQube 
 report|https://analysis.apache.org/dashboard/index/org.apache.pdfbox:pdfbox-reactor],
  hints in different IDEs, the FindBugs tool and other code quality tools.
 This is a follow-up of PDFBOX-2576, which was getting too long.




[jira] [Commented] (PDFBOX-2852) Improve code quality (2)

2015-07-14 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626541#comment-14626541
 ] 

ASF subversion and git services commented on PDFBOX-2852:
-

Commit 1691006 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1691006 ]

PDFBOX-2852: use interface


Re: Understanding PDFBox

2015-07-14 Thread Tilman Hausherr


On 14.07.2015 at 12:58, Subham Tripathi wrote:

Hi All,
I wish to contribute to Apache PDFBox, but before that I am trying to
understand the codebase. I am finding it very hard to understand, as I
cannot find any flow to follow.
Is there any documentation from which I can get some high-level insight into
PDFBox?




Look at the examples... and start from there. Then look at an unsolved 
issue :-)


If this is about getting coding practice, google for BATIK-1109
https://issues.apache.org/jira/browse/BATIK-1109 and BATIK-1110
https://issues.apache.org/jira/browse/BATIK-1110. One of the bugs can
probably be fixed with a few lines (although some debugging is needed to see
how signed / unsigned values are handled there); the other one involves
using code from PDFBox, but in the way BATIK uses it. Both bugs have been
fixed in PDFBox, but not in BATIK (from which PDFBox took some code).


Tilman

[jira] [Commented] (PDFBOX-2852) Improve code quality (2)

2015-07-14 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626552#comment-14626552
 ] 

ASF subversion and git services commented on PDFBOX-2852:
-

Commit 1691007 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1691007 ]

PDFBOX-2852: use interface


[jira] [Commented] (PDFBOX-2852) Improve code quality (2)

2015-07-14 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626576#comment-14626576
 ] 

ASF subversion and git services commented on PDFBOX-2852:
-

Commit 1691024 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1691024 ]

PDFBOX-2852: use interface


[jira] [Commented] (PDFBOX-2852) Improve code quality (2)

2015-07-14 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626589#comment-14626589
 ] 

ASF subversion and git services commented on PDFBOX-2852:
-

Commit 1691031 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1691031 ]

PDFBOX-2852: use interface


[jira] [Commented] (PDFBOX-2842) Overhaul font substitution

2015-07-14 Thread Tilman Hausherr (JIRA)

[
https://issues.apache.org/jira/browse/PDFBOX-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626531#comment-14626531
]

Tilman Hausherr commented on PDFBOX-2842:
-

{code}LOG.warn("New fonts found, font cache will be re-built");{code}
Shouldn't all these be info instead of warn?

Overhaul font substitution
--

Key: PDFBOX-2842
URL: https://issues.apache.org/jira/browse/PDFBOX-2842
Project: PDFBox
Issue Type: Improvement
Components: FontBox, PDModel
Affects Versions: 2.0.0
Reporter: John Hewson
Assignee: John Hewson
Priority: Blocker
Fix For: 2.0.0

Attachments: 029423-p1.pdf, 166292-fi-ligature.pdf

The improved font substitution mechanisms in 2.0 are not quite sufficient to
handle all PDFs. Specifically, CJK substitution and substitution of TTF in
place of CFF fonts is not possible with the current design.
The CJK problems can be seen in PDFBOX-2509 and PDFBOX-2563, which does not
solve the problem. Additional font API weaknesses can be found in PDFBOX-2578
and PDFBOX-2366. This meta-issue aims to address all of those sub-issues.
The current problems are:
- FontBox does not provide a generic font type, so we have to handle
TrueTypeFont, CFFFont, and Type1Font separately. This hinders cross-format
substitution.
- ExternalFonts has no knowledge of the CIDSystemInfo which is necessary for
CJK substitution
- FontProvider contains too much public logic which should be internal to
PDFBox, e.g. substitution logic, this makes it brittle and means we won't be
able to add additional logic after 2.0 is released, e.g. CJK substitution.
- Too much confusion about the role of ExternalFonts, particularly with
regards to mapping of built-in fonts and the definition of substitute vs.
fallback font.
- ExternalFonts is a black box: the user cannot tell whether the font
returned is an exact match, or a last-resort fallback.
- Confusing font substitution API, users preferred having a flat file format
- PDSimpleFont#getEncoding() can return null for TTFs which use built-in
encodings. This has caused a lot of bugs - there must be a better way.
- We still have some confusing names, for example a CustomEncoding is known
as a built-in encoding in the spec.
- There is no fallback CFF font, we resort to AdobeBlank instead, which has
no rendering.


[jira] [Commented] (PDFBOX-2852) Improve code quality (2)

2015-07-14 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626538#comment-14626538
 ] 

ASF subversion and git services commented on PDFBOX-2852:
-

Commit 1691004 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1691004 ]

PDFBOX-2852: use interface


[jira] [Commented] (PDFBOX-2852) Improve code quality (2)

2015-07-14 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626570#comment-14626570
 ] 

ASF subversion and git services commented on PDFBOX-2852:
-

Commit 1691019 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1691019 ]

PDFBOX-2852: use interface


[jira] [Commented] (PDFBOX-2852) Improve code quality (2)

2015-07-14 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626587#comment-14626587
 ] 

ASF subversion and git services commented on PDFBOX-2852:
-

Commit 1691030 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1691030 ]

PDFBOX-2852: use interface


[jira] [Commented] (PDFBOX-2852) Improve code quality (2)

2015-07-14 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626573#comment-14626573
 ] 

ASF subversion and git services commented on PDFBOX-2852:
-

Commit 1691020 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1691020 ]

PDFBOX-2852: use interface


[jira] [Commented] (PDFBOX-2852) Improve code quality (2)

2015-07-14 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626578#comment-14626578
 ] 

ASF subversion and git services commented on PDFBOX-2852:
-

Commit 1691027 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1691027 ]

PDFBOX-2852: use interface


[jira] [Commented] (PDFBOX-2852) Improve code quality (2)

2015-07-14 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626585#comment-14626585
 ] 

ASF subversion and git services commented on PDFBOX-2852:
-

Commit 1691028 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1691028 ]

PDFBOX-2852: use interface


[jira] [Commented] (PDFBOX-2530) Improve PDFDebugger

2015-07-14 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626593#comment-14626593
 ] 

ASF subversion and git services commented on PDFBOX-2530:
-

Commit 1691032 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1691032 ]

PDFBOX-2852, PDFBOX-2530: reduce code complexity; null parameter for setText() 
is legit

 Improve PDFDebugger
 ---

 Key: PDFBOX-2530
 URL: https://issues.apache.org/jira/browse/PDFBOX-2530
 Project: PDFBox
  Issue Type: Improvement
  Components: Utilities
Affects Versions: 1.8.8, 2.0.0
Reporter: Tilman Hausherr
Assignee: khyrul bashar
  Labels: gsoc2015
 Attachments: Avoiding_NPE_for_null_Field_Type.diff, 
 BracketsColorChooser.png, DeviceNCS.diff, FlagBitsPane-26-06-2015.diff, 
 Flag_bits_showing_feature-redesigned.diff, Flag_bits_showing_feature.diff, 
 K4SystemFontsNotEmbeded218.pdf, PDFDebugger_StatusBar.png, 
 PDFDebugger_StatusBar_01.png, 
 Parent_dictionary_type_checking_for__f__and__flags.diff, 
 Stream_Showing_Feature.diff, indexedcs.diff, openSelectedPath.diff, 
 parent_node_redirect.diff, parent_node_redirect_expand_disabled.diff, 
 removed_redundant_codes.patch, separationCS.diff, 
 sonarqube_warning_resolve.diff, tree.diff, treestatus.diff, 
 treestatuspane.diff


 (This is an idea for the [Google Summer of Code 
 2015|https://www.google-melange.com/])
 Our command line utility PDFDebugger (part of the command line pdfbox-app; get 
 it [here|https://pdfbox.apache.org/downloads.html], read the description 
 [here|https://pdfbox.apache.org/commandline/], see the source code 
 [here|https://svn.apache.org/viewvc/pdfbox/trunk/tools/src/main/java/org/apache/pdfbox/tools/PDFDebugger.java?view=markup&sortby=date])
  needs some improvements:
- hex view
- view of non-printable characters
- ✓ saving streams
- binary copy & paste
- ✓ Create a status line that shows where we are in the tree. (Like in the 
 Windows REGEDIT)
- ✓ Copy the current tree string into the clipboard (useful in discussions 
 about details of a PDF)
- ✓ (Optional, not sure if easy) Jump to specific place in the tree by 
 entering tree string
- ✓ ability to search in streams (very useful for content streams and meta 
 data)
- ✓ show images that are streams
- ✓ show PDIndexed color lookup table, show the index value, the base and 
 RGB color value sets when the mouse moves
- ✓ show PDSeparation color
- ✓ show PDDeviceN colors
- optional, idea should be developed a bit: show meaningful explanation on 
 some attributes, e.g. appearance stream when hovering over /AP
- show font encodings and characters
- ✓ display flag bits (e.g. Annotation flags) in a way that is easy to 
 understand. There are probably others, I assume that the main work needs to 
 be done only once
- edit attributes (should be possible to enter values as decimal, hex or 
 binary)
- edit streams, while keeping or changing the compression filter
- save altered PDF 
- color marking of certain PDF operators, especially q...Q and text operators 
 (BT...ET). Ideally, it should help the user understand the bracketing of 
 these operators, i.e. understand where a sequence starts and where it ends. 
 (See operator summary in the PDF Spec) Other important operators I can 
 think of are the matrix, font and color operators. A cool advanced thing 
 would be to show the current color or the font in a popup when hovering over 
 such an operator.
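
 The "display flag bits" item in the list above could be approached roughly as follows. This is only a minimal sketch, not the PDFDebugger implementation: the class name AnnotationFlagDecoder and the method decode are hypothetical, and the bit names are taken from the annotation flags table of the PDF specification (bit position 1 is the lowest-order bit).

```java
import java.util.ArrayList;
import java.util.List;

public class AnnotationFlagDecoder {
    // Names of annotation flag bits 1..10 as listed in the PDF specification.
    // Index 0 corresponds to bit position 1 (the lowest-order bit).
    private static final String[] FLAG_NAMES = {
        "Invisible", "Hidden", "Print", "NoZoom", "NoRotate",
        "NoView", "ReadOnly", "Locked", "ToggleNoView", "LockedContents"
    };

    /** Returns the names of all known flag bits that are set in the given /F value. */
    public static List<String> decode(int flags) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < FLAG_NAMES.length; i++) {
            if ((flags & (1 << i)) != 0) {
                names.add(FLAG_NAMES[i]);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // 4 has only bit 3 set; 68 = 64 + 4 has bits 3 and 7 set.
        System.out.println(decode(4));
        System.out.println(decode(68));
    }
}
```

 Other flag fields (for example field flags) would need their own name tables, but the decoding loop itself could probably be shared, which matches the remark that the main work needs to be done only once.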
 To see a product with a similar purpose that is better than PDFDebugger, 
 watch [this video|https://www.youtube.com/watch?v=g-QcU9B4qMc].
 I'm not asking to implement a clone of that product (I don't use it, all I 
 know is that video), but we at PDFBox really need something that makes PDF 
 debugging easier. As an example of how the current PDFDebugger prevented me 
 from finding a bug quickly, see PDFBOX-2401 and search for PDFDebugger.
 Prerequisites:
 - Java programming, especially the GUI components
 - the ability to understand existing source code
 Using external software components is possible (must have Apache License or a 
 compatible one), but should be decided on a case-by-case basis, we don't want 
 to get too big.
 Development strategy: go from the easy to the difficult. The desired features 
 are already sorted this way (mostly).
 Get introduced: [download the source code with 
 svn|https://pdfbox.apache.org/downloads.html#scm] and build it with maven. 
 Run PDFDebugger and view some PDFs to see the components of a PDF. Start with 
 the file of PDFBOX-2401. Read up on the structure of PDF on the 
 web or from the [PDF 
 Specification|https://www.adobe.com/devnet/pdf/pdf_reference.html].
 Mentor: Tilman Hausherr (European timezone, languages:

[jira] [Commented] (PDFBOX-2852) Improve code quality (2)

2015-07-14 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626592#comment-14626592
 ] 

ASF subversion and git services commented on PDFBOX-2852:
-

Commit 1691032 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1691032 ]

PDFBOX-2852, PDFBOX-2530: reduce code complexity; null parameter for setText() 
is legit


Re: first stack trace report from pdfbox 2.0.0 trunk

2015-07-14 Thread Tilman Hausherr


Hi Tim,

Currently there is at least one known regression, mentioned in 
PDFBOX-2842, it applies to 029423 but also to other files.


Tilman

Am 10.07.2015 um 13:57 schrieb Allison, Timothy B.:

All,
   I just posted the first stacktrace report from my initial partial batch run 
against govdocs1 here: 
https://issues.apache.org/jira/secure/attachment/12744700/pdfbox_reports_2_0_0_20150709.zip

Caveats/Notes

The run yesterday did not include the fixes that were made in PDFBOX-2370 or 
PDFBOX-2862.

I stopped the batch run early. This only covered ~50k pdfs.

I forgot to turn on accesspermission checking. Some of the pdfs in here would 
normally have been skipped.

I haven't reviewed any of the exceptions. They may be caused by code on the 
Tika side.

I'll plan to re-run with the latest trunk on Tuesday.  I need to turn back to 
the actual eval code for a bit. :)


Cheers,

   Tim







[jira] [Commented] (PDFBOX-2530) Improve PDFDebugger

2015-07-14 Thread Tilman Hausherr (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626647#comment-14626647
 ] 

Tilman Hausherr commented on PDFBOX-2530:
-

There is a new bug (class cast exception) when clicking on a page content 
stream when in page mode. Although the bug is new, I assume that the root cause 
(a MapEntry with a MapEntry) is older.
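
To illustrate the kind of failure described above, here is a small self-contained sketch. It assumes the problem is a wrapper object whose value is itself another wrapper; the class Entry below is a hypothetical stand-in and not the actual MapEntry class from PDFDebugger, and unwrap is an illustrative helper, not a proposed fix.

```java
public class NestedEntryUnwrap {
    /** Hypothetical stand-in for a tree-node wrapper holding a key and a value. */
    public static final class Entry {
        private final String key;
        private final Object value;

        public Entry(String key, Object value) {
            this.key = key;
            this.value = value;
        }

        public String getKey() {
            return key;
        }

        public Object getValue() {
            return value;
        }
    }

    /** Follows nested Entry wrappers until a non-Entry payload (or null) is reached. */
    public static Object unwrap(Object node) {
        Object current = node;
        while (current instanceof Entry) {
            current = ((Entry) current).getValue();
        }
        return current;
    }

    public static void main(String[] args) {
        Object nested = new Entry("Contents", new Entry("Contents", "stream-data"));
        // A blind cast such as (String) ((Entry) nested).getValue() would throw a
        // ClassCastException here, because the inner value is another Entry.
        System.out.println(unwrap(nested));
    }
}
```

Unwrapping at the point of use only hides the symptom; if the double wrapping is indeed the root cause, avoiding it when the tree is built would be the cleaner route.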


[jira] [Commented] (PDFBOX-2860) NonSeq parser slower than Seq parser

2015-07-14 Thread Tilman Hausherr (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-2860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626658#comment-14626658
 ] 

Tilman Hausherr commented on PDFBOX-2860:
-

No, you can't, because the sequential parser no longer exists in 2.0. It was not 
correct.

 NonSeq parser slower than Seq parser
 

 Key: PDFBOX-2860
 URL: https://issues.apache.org/jira/browse/PDFBOX-2860
 Project: PDFBox
  Issue Type: Bug
  Components: Parsing
Affects Versions: 2.0.0
Reporter: simon steiner

 PDF from PDFBOX-797
 for (int i=0; i<1000; i++) {
 PDDocument.load(new FileInputStream(
 "4218.pdf")).close();
 }
 Nonseq:
 real  0m23.691s
 Seq:
 real  0m9.705s
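
 The "real" figures above look like output of the shell's time command, so they would include JVM startup as well as the 1000 loads. A minimal sketch of measuring only the loop from inside the JVM is shown below. RepeatTimer and timeMillis are hypothetical names, and the workload in main is a placeholder: in the scenario above it would be the load-and-close call from the loop.

```java
import java.util.concurrent.Callable;

public class RepeatTimer {
    /** Runs the task the given number of times and returns the elapsed wall-clock milliseconds. */
    public static long timeMillis(int iterations, Callable<?> task) throws Exception {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            task.call();
        }
        return (System.nanoTime() - start) / 1_000_000L;
    }

    public static void main(String[] args) throws Exception {
        // Placeholder workload standing in for the PDF load-and-close call.
        int[] counter = {0};
        long elapsed = timeMillis(1000, () -> counter[0]++);
        System.out.println(counter[0] + " iterations in " + elapsed + " ms");
    }
}
```

 Running a few untimed iterations first would also reduce the influence of JIT warm-up on the comparison.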




[jira] [Comment Edited] (PDFBOX-2530) Improve PDFDebugger

2015-07-14 Thread Tilman Hausherr (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626647#comment-14626647
 ] 

Tilman Hausherr edited comment on PDFBOX-2530 at 7/14/15 4:57 PM:
--

There is a new bug (class cast exception) when clicking on a page content 
stream when in show pages mode. Although the bug is new, I assume that the 
root cause (a MapEntry with a MapEntry) is older.


was (Author: tilman):
There is a new bug (class cast exception) when clicking on a page content 
stream when in page mode. Although the bug is new, I assume that the root cause 
(a MapEntry with a MapEntry) is older.


Re: Understanding PDFBox

2015-07-14 Thread John Hewson

The book “Developing with PDF” provides a short and gentle introduction to the
PDF format.

We have a brief architectural summary of PDFBox at:

http://pdfbox.apache.org/1.8/architecture.html 

But in general, to make sense of PDFBox, you’ll need to understand the PDF spec.

— John

 On 14 Jul 2015, at 03:58, Subham Tripathi subham@gmail.com wrote:
 
 Hi All,
 I wish to contribute to Apache PDFBox, but first I am trying to
 understand the codebase. I am finding it very hard to understand because
 I cannot find any flow to follow.
 Is there any documentation from which I can get some high-level insight into
 PDFBox?
 
 -- 
 Best Regards,
 Subham Tripathi
