from:"Maruan Sahyoun"

Re: Problem With MergeUtility

2014-03-13 Thread Maruan Sahyoun

Hi,

Not a direct answer to your question, but could you try PDDocument.loadNonSeq
instead?

BR
Maruan Sahyoun

 On 13.03.2014 at 16:16, Alin Mazilu impet...@gmail.com wrote:
 
 Hello guys,
 
 
 Has anyone had any problem with this? Any idea why it happens? What would
 be a good value for pushBackSize so this does not happen? Thanks!
 
 
 Partial stack trace:
 
 org.apache.pdfbox.exceptions.WrappedIOException: Could not push back 72940
 bytes in order to reparse stream. Try increasing push back buffer using
 system property org.apache.pdfbox.baseParser.pushBackSize
     at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:546)
     at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:566)
     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:187)
     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1071)
     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1038)
     at org.apache.pdfbox.util.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:186)
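
The exception message names the system property to raise. As a minimal sketch, the property could be set programmatically before any PDFBox parser class is used. The helper class name and the 1 MiB value are assumptions for illustration only (the value is merely larger than the 72940 bytes in the trace, not a recommended setting). The property is probably read once when the parser class is initialised, so it may have to be set very early, or passed as a -D option on the command line.

```java
final class PushBackSizeConfig {
    // Property name copied from the exception message in the stack trace.
    static final String PROPERTY = "org.apache.pdfbox.baseParser.pushBackSize";

    /**
     * Raises the configured push-back size to at least the given number of
     * bytes and returns the value that is in effect afterwards.
     */
    static int ensureAtLeast(int bytes) {
        int current = Integer.getInteger(PROPERTY, 0);
        if (bytes > current) {
            System.setProperty(PROPERTY, Integer.toString(bytes));
            return bytes;
        }
        return current;
    }

    public static void main(String[] args) {
        // 1 MiB is an illustrative assumption, not a value from the thread.
        int size = ensureAtLeast(1024 * 1024);
        System.out.println(PROPERTY + "=" + size);
    }
}
```

This only works around the symptom; the replies in this thread suggest the non-sequential parser (PDDocument.loadNonSeq) as the better route.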

Re: Problem With MergeUtility

2014-03-13 Thread Maruan Sahyoun

This issue is logged as PDFBOX-1964, with a potential patch attached.


BR 
Maruan Sahyoun

On 13.03.2014 at 17:52, Timo Boehme timo.boe...@ontochem.com wrote:

 Hi,
 
 As far as I remember, PDFMergerUtility is one of the last utilities that
 does not currently support loadNonSeq.
 
 As a workaround, get the source of PDFMergerUtility and change PDDocument.load
 to PDDocument.loadNonSeq (you may provide null as the buffer parameter).
 
 
 Best,
 Timo
 
 
 On 13.03.2014 at 16:46, Alin Mazilu wrote:
 Where? Here's the code that causes that:
 
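 PDFMergerUtility util = new PDFMergerUtility();
 
 for (File file : set) {
     try {
         if (file.exists()) {
             // only queues the file; it is parsed later, in mergeDocuments()
             util.addSource(file);
         }
     } catch (Exception e) {
         // log e
     }
 }
 util.setDestinationFileName(...);
 
 // the exception in the stack trace is thrown from this call
 util.mergeDocuments();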
 
 
 -- 
 
 Timo Boehme
 OntoChem GmbH
 H.-Damerow-Str. 4
 06120 Halle/Saale
 T: +49 345 4780474
 F: +49 345 4780471
 timo.boe...@ontochem.com
 
 _
 
 OntoChem GmbH
 Geschäftsführer: Dr. Lutz Weber
 Sitz: Halle / Saale
 Registergericht: Stendal
 Registernummer: HRB 215461
 _

Re: Removing processStream and processSubStream

2014-03-19 Thread Maruan Sahyoun

Hi,

In general I think that this is a valid change. As I understand rendering in
PDF, Form, Text, Image and Pattern each maintain their own matrix to map to
user space, which is then transformed by the CTM to device space, so handling
them specifically is fine and in line with the spec. I’d suggest that we make
sure that the different ‘spaces’ are defined properly within the code and refer
to the PDF spec, so that the code is easier to read, if this is not already the
case. With so many changes it’s a good opportunity to enhance the documentation
within the source code. Some of the old code has very little documentation.

I wouldn’t remove processStream and processSubStream, though, but deprecate
them and remove them in the next major release, so as to keep the changes to a
minimum. There are a number of very important changes in 2.0. The easier we
make it for people to use that version without too many changes to their own
code, the better.

For 2.0, removing what was deprecated in 1.x is fine. Removing things that are
not deprecated should be avoided if possible.

For the rendering, what might have been missed is taking the UserUnit entry in
the page dictionary into account, which might change the default user space.
It was introduced in PDF 1.6. This is a good opportunity to read that entry and
make sure that we handle it appropriately.

BR
Maruan Sahyoun

On 18.03.2014 at 20:46, John Hewson j...@jahewson.com wrote:

 Hi All
 
 I’m still working on getting Tiling Patterns to render correctly, and need to
 make some changes to core PDFBox functionality in order to proceed. My problem
 is that tiling patterns are defined in their parent stream’s initial coordinate
 space, rather than the coordinate space defined by the CTM. However, in PDFBox
 there is no way to access the parent stream, so I can’t find out what its
 initial matrix is. The manner in which the initial coordinate space is
 determined is different for pages, forms, and patterns.
 
 What this means is that the parent stream’s initial coordinate space needs to
 be passed to processStream and processSubStream in PDFStreamEngine. This will
 necessarily be a breaking change, and it will affect all downstream subclasses
 of PDFStreamEngine.
 
 Because this has to be a breaking change, I propose that we go all the way and
 make the new API bulletproof, 1) so that we won’t have to introduce breaking
 changes in the future if we encounter similar issues, 2) so that the caller of
 the method can’t pass the wrong data in the parameters. We would remove the two
 generic methods:
 
 public void processStream(PDResources resources, COSStream cosStream,
         PDRectangle drawingSize, int rotation)
 public void processSubStream(PDResources resources, COSStream cosStream)
 
 and replace them with four specific methods:
 
 public void processPage(PDPage page)
 public void processForm(PDFormXObject form)
 public void processTilingPattern(PDTilingPattern pattern)
 public void processType3Font(PDType3Font font)
 
 This would mean that the various “process” methods have access to their parent
 stream, and can read any of its public fields in the future without introducing
 breaking changes by altering the method’s parameters.
 
 What do you think?
 
 -- John
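
To make the shape of the proposal above concrete, here is a small sketch of typed entry points. All types in it (StreamOwner, PageStub, FormStub, EngineSketch) are hypothetical stand-ins written for this illustration and are not the PDFBox classes; only the method names processPage and processForm mirror the proposal. The point it tries to show is that a typed parameter lets the engine ask the object itself for its initial matrix instead of relying on the caller to pass it.

```java
import java.awt.geom.AffineTransform;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for anything that owns a content stream. */
interface StreamOwner {
    String name();
    AffineTransform initialMatrix();
}

/** Stand-in page: its initial matrix is the identity in this sketch. */
final class PageStub implements StreamOwner {
    public String name() { return "page"; }
    public AffineTransform initialMatrix() { return new AffineTransform(); }
}

/** Stand-in form: carries its own matrix. */
final class FormStub implements StreamOwner {
    private final AffineTransform matrix;
    FormStub(AffineTransform matrix) { this.matrix = new AffineTransform(matrix); }
    public String name() { return "form"; }
    public AffineTransform initialMatrix() { return new AffineTransform(matrix); }
}

final class EngineSketch {
    final List<String> log = new ArrayList<>();

    // Typed entry points: the caller cannot hand over a mismatched matrix.
    void processPage(PageStub page) { process(page); }
    void processForm(FormStub form) { process(form); }

    private void process(StreamOwner owner) {
        // The engine reads what it needs from the owner itself.
        log.add(owner.name() + " scaleX=" + owner.initialMatrix().getScaleX());
    }
}
```

Adding a new getter to a stand-in type later would not change any method signature, which is the compatibility argument made in the mail.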

Re: Removing processStream and processSubStream

2014-03-19 Thread Maruan Sahyoun

as an added note - initially you suggested

public void processTilingPattern(PDTilingPattern pattern) 

but as Patterns in general have their own matrix I think it applies to all
patterns, that’s why I wrote “… Form, Text, Image and Pattern maintain …”

BR
Maruan

On 19.03.2014 at 18:31, Maruan Sahyoun sahy...@fileaffairs.de wrote:

 John,
 
 On 19.03.2014 at 18:15, John Hewson j...@jahewson.com wrote:
 
 Maruan
 
 From how I understand the rendering in PDF Form, Text, Image and Pattern 
 maintain their own matrix to map to user space which is then transformed by 
 the CTM to device space so handling them specifically is fine and inline 
 with the spec.
 
 No, that’s not right, what I said was:
 
 My problem is that tiling patterns are defined in their parent stream’s 
 initial coordinate space, rather than the
 coordinate space defined by the CTM.
 
 So patterns should *not* be using the CTM, which is what I’m trying to 
 achieve.
 
 
 I think you misunderstood what I wrote - patterns have their own matrix - so 
 I think we are on the same page here. IMHO according to the spec CTM 
 transforms from user space to device space. So it’s pattern space - user 
 space - device space.
 
 
 I’d suggest that we make sure that the different ‚spaces‘ are defined 
 properly within the code and refer to the PDF spec so that the code is 
 easier to read if this is not already the case. With so many changes it’s a 
 good opportunity to enhance the documentation within the source code. Some 
 of the old code enjoys very little documentation.
 
 
 I disagree, in general I don’t think that references to the PDF spec are a 
 good form of documentation (there are some exceptions). References to the 
 spec are meaningless to the reader unless they take the time to look them up 
 in a 700 page PDF document. I would argue that by just linking back to the 
 spec, we have *failed* to document PDFBox, not succeeded.
 
 References to the PDF spec have another major flaw: they go out-of-date. For 
 example a Pattern Colour Space will always be called “Pattern Colour Space” 
 in future versions of the PDF spec but it may not be described in paragraph 
 8.6.6.2 or on page 156. The existing code contains many references to the 
 PDF 1.6 and 1.7 specs as well as the ISO PDF32000 spec, which means that I 
 need three 700 page PDF files open at all times in order to look up PDFBox 
 references. With the new version of the PDF spec due this year, this 
 situation is going to get worse.
 
 
 Didn’t mean to only reference to the spec but to use the same terms as 
 described by the spec. Adding references to the spec is an add-on not a 
 replacement.
 
 I agree that some of the existing code needs more documentation, and I often 
 add documentation to old files which I’m working on. However, my approach is 
 to just paste in a sentence or two from the PDF spec (fair use). That way 
 the reader does not ever need to look at the PDF spec. Because we use the 
 same terminology in PDFBox as in the spec, if someone really wants to look 
 something up, it’s as simple as Ctrl+F, no reference needed, and it’s 
 guaranteed not to go out-of-date.
 
 I wouldn’t remove processStream and processSubStream but deprecate them and 
 remove them in the next major release though as to keep the changes to a 
 minimum.
 
 This isn’t possible, as I said it “will necessarily be a breaking change”.
 This is because in 2.0 PDFStreamEngine needs to know the parent of each 
 stream, but processStream and processSubStream do not provide this 
 information. That’s why I’m discussing this on the mailing list.
 
 I don’t understand why this shouldn’t be possible. It’s more effort,
 agreed, but beneficial.
 
 
 For the rendering what might have been missed is taking the UserUnit entry 
 in the page dictionary into account which might change the default user 
 space. This was introduced in PDF 1.6. A good opportunity to read that 
 entry and make sure that we handle it appropriately.
 
 Yes, I have this as a “todo” in my working copy, however, if we put the 
 UserUnit in the matrix then we should also put the page Rotation into the 
 matrix, but that’s a significant change.
 
 -- John

Re: Removing processStream and processSubStream

2014-03-19 Thread Maruan Sahyoun

John

On 19.03.2014 at 19:10, John Hewson j...@jahewson.com wrote:

 Maruan,
 
 From how I understand the rendering in PDF Form, Text, Image and Pattern 
 maintain their own matrix to map to user space which is then transformed 
 by the CTM to device space so handling them specifically is fine and 
 inline with the spec.
 
 No, that’s not right, what I said was:
 
 My problem is that tiling patterns are defined in their parent stream’s 
 initial coordinate space, rather than the
 coordinate space defined by the CTM.
 
 So patterns should *not* be using the CTM, which is what I’m trying to 
 achieve.
 
 
 I think you misunderstood what I wrote - patterns have their own matrix - so 
 I think we are on the same page here. IMHO according to the spec CTM 
 transforms from user space to device space. So it’s pattern space - user 
 space - device space.
 
 Nope, as I said, that’s what PDFBox currently does and it’s wrong. As you say 
 the CTM transforms from user space to device space, but it’s not the only way 
 to do so, and it is not used by patterns.

As the processing is defined in the spec this is a good reference so no need to 
discuss that further. Of course different people might come to different 
conclusions by reading and interpreting the spec. 
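
The thread does not settle which base matrix applies to a pattern, but the numerical difference between the two readings can be sketched. The example below uses java.awt.geom.AffineTransform; the class name, the identity initial matrix, the scale-by-2 CTM and the translate-by-10 pattern matrix are all made-up values for illustration, not taken from PDFBox or from any real file.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

final class PatternSpaceSketch {
    /**
     * Maps a point from pattern space to device space by applying the
     * pattern matrix first and the given base matrix second.
     */
    static Point2D toDevice(AffineTransform base, AffineTransform patternMatrix, Point2D p) {
        AffineTransform combined = new AffineTransform(base);
        combined.concatenate(patternMatrix); // patternMatrix is applied before base
        return combined.transform(p, null);
    }

    public static void main(String[] args) {
        // Parent stream's initial matrix (identity, as an assumption).
        AffineTransform initial = new AffineTransform();
        // CTM after a hypothetical scale by 2 inside the content stream.
        AffineTransform ctm = AffineTransform.getScaleInstance(2, 2);
        // Hypothetical pattern matrix: shift the pattern cell by 10 units in x.
        AffineTransform pattern = AffineTransform.getTranslateInstance(10, 0);
        Point2D origin = new Point2D.Double(0, 0);

        System.out.println("via initial matrix: x=" + toDevice(initial, pattern, origin).getX());
        System.out.println("via CTM:            x=" + toDevice(ctm, pattern, origin).getX());
    }
}
```

With these values the pattern origin lands at x = 10 when the parent's initial matrix is the base and at x = 20 when the CTM is the base, so the choice visibly moves where the tiles are painted.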

 
 Didn’t mean to only reference to the spec but to use the same terms as 
 described by the spec. Adding references to the spec is an add-on not a 
 replacement.
 
 I don’t see what value this adds, given that the references will just go 
 out-of-date when the next spec is released. We already use the same 
 terminology as the PDF spec, so Ctrl+F can be used for quick look-ups that 
 won’t go out-of-date.

You are not required to add the information.

 
 This isn’t possible, as I said it “will necessarily be a breaking change”.
 This is because in 2.0 PDFStreamEngine needs to know the parent of each 
 stream, but processStream and processSubStream do not provide this 
 information. That’s why I’m discussing this on the mailing list.
 
 I don’t understand why this shouldn’t be possible. It’s more effort,
 agreed, but beneficial.
 
 
 What’s not to understand? PDFStreamEngine *needs* to know the parent of each 
 stream, and the old methods don’t provide this, passing a null parent will 
 not work because we need that information later in order to correctly process 
 the stream. If we allowed a null parent to be passed, the result would be 
 silently broken rendering - there’s no value in providing a 
 backwards-compatible API if it can only produce broken results.

We won’t reach the same conclusion here (nor, I think, on the other topics
above).

 
 -- John
 

Re: Apache PDFBox April 2014 board report due

2014-04-01 Thread Maruan Sahyoun

Hi Andreas,

+1 with the additions from John and Tilman

BR
Maruan

On 30.03.2014 at 16:29, Andreas Lehmkuehler andr...@lehmi.de wrote:

 Hi,
 
 find attached a quick draft of the board report we're expected to submit this
 month.
 
 @John, @Tilman
 Please add something about the GSoC status.
 
 
 Any further comments, objections or additions?
 
 
 draft
 
 The Apache PDFBox library is an open source Java tool for working with PDF
 documents.
 
 
 General Comments
 
 
 There are no issues that require Board attention.
 
 
 Community
 
 There is a steady stream of contributions and bug reports from the community.
 
 John Hewson and Tilman Hausherr were added as committers and PMC members to 
 our ranks in February 2014.
 
 Eric Leleu stepped back and went emeritus per his own request in March 2014.
 
 452 (429 last report) subscribers on the user@ list
 157 (164 last report) subscribers on the dev@ list
 
 Releases
 
 
 Version 1.8.4 was released on the 31st of January 2014.
 
 1.8.4 is an incremental bugfix release based on PDFBox 1.8.x.
 
 GSoC
 
 
 TODO
 
 Development:
 
 
 Most likely the next bugfix version 1.8.5 will be released in the second 
 quarter.
 
 The work on our next major release is an ongoing effort. The main topics are:
 
 - switch to Java 1.6
 - modularization
 - replace/enhance the parser
 - refactor the underlying COS model
 - code cleanup
 - enhance rendering
 
 /draft
 
 BR
 Andreas Lehmkühler

Re: New PDFBox bugfix release 1.8.5

2014-04-18 Thread Maruan Sahyoun

Hi,

I'm currently on a trip so won't be able to fix it today.

BR

Maruan

 On 18.04.2014 at 16:58, Tilman Hausherr thaush...@t-online.de wrote:
 
 Now only Maruan's issue is open. I'm currently fixing more javadoc stuff for 
 1.8 and 2.0 and will comment when done. This will be finished in 15 min.
 
 After that, two possibilities IMO:
 - if you can also work on it tomorrow, just wait for Maruan
 - if you can only work on it today, then set the issue to resolved after I'm 
 done and Maruan can open a new issue.
 
 Tilman
 
 On 18.04.2014 at 16:13, Andreas Lehmkuehler wrote:
 Hi,
 
 On 18.04.2014 at 15:52, Tilman Hausherr wrote:
 On 18.04.2014 at 15:36, Andreas Lehmkuehler wrote:
 Hi,
 
 it's time to cut a new bugfix release as there are a lot of fixes
 
 Yes!
 
 WDYT?
 Is there anything we should wait for? Any fix only available in the trunk
 which should be merged into the branch as well? What about the 4 open issues
 [1] marked with fix for 1.8.5?
 
 PDFBOX-1946 https://issues.apache.org/jira/browse/PDFBOX-1946: person didn't
 answer = set to resolve
 +1
 
 PDFBOX-1977 https://issues.apache.org/jira/browse/PDFBOX-1977: LZW bug has
 been resolved. However the test is still not perfect. Don't really know what
 to do, I don't have the time to create a perfect test, i.e. one that would
 1. include the case that failed, 2. have both types of tests, deterministic
 and non-deterministic. A possible solution would be to change the title to the
 bug only, then create a new issue re: the test for 2.0 only. WDYT?
 Sounds reasonable. Will you do that?
 
 PDFBOX-2026 https://issues.apache.org/jira/browse/PDFBOX-2026: IMO the bug has
 been fixed. However the user didn't answer. I will set to resolve.
 +1
 
 PDFBOX-1897 https://issues.apache.org/jira/browse/PDFBOX-1897: I'll let Maruan
 resolve that one
 
 Tilman
 
 
 BR
 Andreas Lehmkühler
 
 [1] http://s.apache.org/VwQ
 
 Thanks for the fast reply
 
 BR
 Andreas Lehmkühler

Re: New PDFBox bugfix release 1.8.5

2014-04-22 Thread Maruan Sahyoun

Hi there,

there is an issue with my local copy of pdfbox atm - need a little more time to 
resolve the issue.

BR
Maruan Sahyoun


Re: New PDFBox bugfix release 1.8.5

2014-04-25 Thread Maruan Sahyoun

Hi Andreas,

will commit them later today.

BR
Maruan Sahyoun

On 24.04.2014 at 11:52, Andreas Lehmkühler andr...@lehmi.de wrote:

 Hi,
 
 I'm planning to cut the release at the beginning of the next week.
 
 Any objections?
 
 @Maruan
 What about your pending javadoc changes? Do you need more time or help? As we
 are not in a hurry, it wouldn't be a problem to postpone the release process
 for another week or two.
 
 BR
 Andreas Lehmkühler
 
 On 18 April 2014 at 15:36, Andreas Lehmkuehler andr...@lehmi.de wrote:
 
 
 Hi,
 
 it's time to cut a new bugfix release as there are a lot of fixes
 in our queue. Additionally I already announced a possible new release in the
 second quarter and people are already asking for it. ;-)
 
 WDYT?
 Is there anything we should wait for? Any fix only available in the trunk
 which should be merged into the branch as well? What about the 4 open issues
 [1] marked with fix for 1.8.5?
 
 BR
 Andreas Lehmkühler
 
 [1] http://s.apache.org/VwQ

Re: xmpbox vs. jempbox - which is the one moving forward

2014-04-25 Thread Maruan Sahyoun

Hi

On 25.04.2014 at 12:38, Andreas Lehmkühler andr...@lehmi.de wrote:

 Hi,
 
 
 On 9 April 2014 at 15:10, Maruan Sahyoun sahy...@fileaffairs.de wrote:
 
 
 Hi,
 
 did we make a decision on whether xmpbox or jempbox is the one to use for XMP
 metadata moving forward? There is a discussion in PDFBOX-1187 about cutting
 the dependency on jempbox, and preflight uses xmpbox.
 
 Thanks for bringing this up again.
 
 How about the following scenario:
 
 We could alter PDMetadata as follows:
 
 - remove the import/exportXMPMetadata methods
 - provide new methods get/setMetadatastream to provide an Input/Outputstream
   to be used with your favourite XMPMetadata implementation

+1 for being independent.

E.g. Adobe has a Java XMP lib under BSD license 
http://www.adobe.com/devnet/xmp/library/eula-xmp-library-java.html 

 
 Pros:
 
 - this would remove a dependency from pdfbox that is not needed in many cases
 - users can choose which library to use for handling XMP metadata; even any
   third-party lib could be used
 
 Cons:
 
 - we still have to maintain 2 XMP-libs

I think we should remove one of the XMP metadata libs, which we can do
independently of the above decision.


 
 WDYT?
 
 
 BR
 Andreas Lehmkühler
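
As a rough illustration of the get/setMetadatastream idea discussed above, the sketch below models a holder that only hands out raw streams and leaves XMP parsing to whatever library the caller prefers. MetadataHolderSketch and its method names are hypothetical stand-ins invented for this example; they are not the PDMetadata API, and the in-memory byte array merely replaces the real PDF stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for a metadata object that exposes streams only. */
final class MetadataHolderSketch {
    private byte[] xmp = new byte[0];

    /** Read side: feed this stream to the XMP parser of your choice. */
    InputStream getMetadataStream() {
        return new ByteArrayInputStream(xmp);
    }

    /** Write side: the bytes are stored when the returned stream is closed. */
    OutputStream setMetadataStream() {
        return new ByteArrayOutputStream() {
            @Override
            public void close() throws IOException {
                super.close();
                xmp = toByteArray();
            }
        };
    }

    /** Writes a packet through the write side and reads it back through the read side. */
    static String roundTrip(String packet) {
        MetadataHolderSketch holder = new MetadataHolderSketch();
        try (OutputStream out = holder.setMetadataStream()) {
            out.write(packet.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        try (InputStream in = holder.getMetadataStream()) {
            ByteArrayOutputStream copy = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                copy.write(buffer, 0, n);
            }
            return new String(copy.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the holder never interprets the bytes, it needs no compile-time dependency on any XMP library, which is the independence argued for in the thread.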

Re: New PDFBox bugfix release 1.8.5

2014-04-25 Thread Maruan Sahyoun

Hi Andreas,

I’ve committed the changes. Fingers crossed that I did that correctly this time.

BR
Maruan


Re: New PDFBox bugfix release 1.8.5

2014-04-26 Thread Maruan Sahyoun

Yes, already monitored it :-) 

thanks for the patience.

BR
Maruan

 On 26.04.2014 at 10:29, Andreas Lehmkuehler andr...@lehmi.de wrote:
 
 Hi Maruan,
 
 [1] everything works. Thanks!
 
 Looks like we are done here and I'm going to cut the release on Monday or 
 Tuesday evening (UTC+2)
 
 BR
 Andreas Lehmkühler
 
 [1] https://builds.apache.org/job/PDFBox%201.8.x/122/
 
 

Re: [VOTE] Release Apache PDFBox 1.8.5

2014-04-29 Thread Maruan Sahyoun

+1 - thanks for preparing the release.

I’ll update the docs on the website as soon as the release is out.

BR
Maruan Sahyoun

On 28.04.2014 at 19:57, Andreas Lehmkuehler andr...@lehmi.de wrote:

 Hi,
 
 a candidate for the PDFBox 1.8.5 release is available at:
 
http://people.apache.org/~lehmi/pdfbox/1.8.5/
 
 The release candidate is a zip archive of the sources in:
 
http://svn.apache.org/repos/asf/pdfbox/tags/1.8.5/
 
 The SHA1 checksum of the archive is fc01acc1e2575ff1f40e44e949a862fcae076029.
 
 Please vote on releasing this package as Apache PDFBox 1.8.5.
 The vote is open for the next 72 hours and passes if a majority of at
 least three +1 PDFBox PMC votes are cast.
 
[ ] +1 Release this package as Apache PDFBox 1.8.5
[ ] -1 Do not release this package because...
 
 
 Here is my +1
 
 BR
 Andreas Lehmkühler
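
For anyone checking the release candidate before voting, the SHA1 value quoted above can be compared against a locally computed digest. The following is a minimal sketch using only the JDK; the class name and the command-line arguments (path to the downloaded archive, expected checksum) are assumptions for illustration, and the whole file is read into memory for simplicity.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Sha1Check {
    /** Returns the SHA-1 digest of the given bytes as lower-case hex. */
    static String sha1Hex(byte[] data) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 is not available", e);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: Sha1Check <archive> <expected-sha1>");
            return;
        }
        // args[0]: downloaded archive, args[1]: checksum from the vote mail
        String actual = sha1Hex(Files.readAllBytes(Paths.get(args[0])));
        System.out.println(actual.equalsIgnoreCase(args[1])
                ? "checksum OK"
                : "MISMATCH: " + actual);
    }
}
```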

1.8.5 and Website

2014-05-02 Thread Maruan Sahyoun

Hi,

I’ve updated the PDFBox API docs to reflect 1.8.5 on the website.

BR
Maruan

On 02.05.2014 at 09:27, Andreas Lehmkühler andr...@lehmi.de wrote:

 Hi,
 
 due to the newest PDFBox 1.8.5 release I've closed all 1.8.5 related issues
 in a bulk operation. I've disabled the email notification to avoid an email
 flood.
 I've also added the brand-new version 1.8.6 for our next bugfix release ...
 
 I'll update the download page once the mirrors have copied the version from
 our repository.
 
 BR
 Andreas Lehmkühler

Re: [VOTE] Release Apache PDFBox 1.8.5

2014-05-04 Thread Maruan Sahyoun

same for me

BR - Maruan

On 04.05.2014 at 12:37, Andreas Lehmkuehler andr...@lehmi.de wrote:

 Hi,
 
 On 28.04.2014 at 21:20, John Hewson wrote:
 +1
 
 Is it just me, or did others on the list get this mail with a delay of 6 days
 too? According to the mail header the issue was on the sender's side.
 
 As we got enough votes for the release and John didn't veto releasing 1.8.5,
 everything is fine.
 
 
 -- John
 
 BR
 Andreas Lehmkühler
 
 

Enhancements to PDFBox

2014-05-29 Thread Maruan Sahyoun

Hi,

for a current project I need to work on enhancing PDFBox for

# splitting files (e.g. remove no longer needed resources)
# merging files (e.g. avoid duplicating resources)
# page handling (adding/removing individual pages with resource handling)
# enhancements to forms handling (pre fill XFA forms - partially done, 
enhancing AP generation)

Is someone else working on something similar?

BR

Maruan

Re: Enhancements to PDFBox

2014-05-29 Thread Maruan Sahyoun

Hi,

On 29.05.2014 at 13:57, Andreas Lehmkuehler andr...@lehmi.de wrote:

 On 29.05.2014 at 09:39, Maruan Sahyoun wrote:
 Hi,
 
 for a current project I need to work on enhancing PDFBox for
 
 # splitting files (e.g. remove no longer needed resources)
 I had a quick look some time ago, hoping that it would be easy to just remove
 unneeded stuff, but it isn't (maybe I didn't get it yet). In most cases
 resources are deleted in combination with the page they belong to. The bigger
 issue is annotations referring to pages. Those pages, including their
 resources, aren't removed when the pages are removed because of the reference
 in the annotation directory.
 # merging files (e.g. avoid duplicating resources)
 That only makes sense if the PDFs to be merged use similar resources.
 
 # page handling (adding/removing individual pages with resource handling)
 This should be a by-product of #1 and #2.
 
 # enhancements to forms handling (pre fill XFA forms - partially done,
 enhancing AP generation)
 This seems to be an important feature, not only for you. So it would be nice
 if someone could improve that.
 

I already have filling an XFA form ready with some limitations (PDXFA’s COS has 
to be an array, dataset entry must be present … ). Could put it in if someone 
is interested in the current stage but planned to remove some limitations 
first. I’m not totally sure if that should be part of PDXFA or a Filler tool as 
this will introduce some dependency on XML handling. 
Preferences?

 Is someone else working on something similar?
 My current todo list is already quite long, and maybe #1 and #2 are on it, but
 I'm afraid at a lower position. But I'm happy to help if someone wants to
 implement some of those features.

I will be working on #1 and #2 (at least to a degree which is needed for the 
project). If we could get some ideas together and you could help me - based on 
your past experience and knowledge of the code base - to get this started this 
would be great. 

 
 
 BR
 
 Maruan
 
 BR
 Andreas Lehmkühler

Re: Enhancements to PDFBox

2014-05-29 Thread Maruan Sahyoun


On 29.05.2014 at 14:31, Andreas Lehmkuehler andr...@lehmi.de wrote:

 On 29.05.2014 14:20, Maruan Sahyoun wrote:
 Hi,
 
 On 29.05.2014 at 13:57, Andreas Lehmkuehler andr...@lehmi.de wrote:
 
 On 29.05.2014 09:39, Maruan Sahyoun wrote:
 Hi,
 
 for a current project I need to work on enhancing PDFBox for
 
 # splitting files (e.g. remove no longer needed resources)
 I had a quick look some time ago hoping that it would be easy to just 
 remove unneeded stuff but it isn't (maybe I didn't get it yet). In most 
 cases resources are deleted in combination with the page they belong to. 
 The bigger issue is annotations referring to pages. Those pages including 
 their resources aren't removed when the pages are removed because of the 
 reference in the annotation directory.
 # merging files (e.g. avoid duplicating resources)
 That just makes sense if the pdfs to be merged use similar resources.
 
 # page handling (adding/removing individual pages with resource handling)
 This should be a side product of #1 and #2
 
 # enhancements to forms handling (pre fill XFA forms - partially done, 
 enhancing AP generation)
 This seems to be an important feature not only for you. So it would be nice 
 if someone could improve that.
 
 
 I already have filling an XFA form ready with some limitations (PDXFA’s COS 
 has to be an array, dataset entry must be present … ). Could put it in if 
 someone is interested in the current stage but planned to remove some 
 limitations first. I’m not totally sure if that should be part of PDXFA or a 
 Filler tool as this will introduce some dependency on XML handling.
 Preferences?
 Hmm, maybe it would be a good idea to put that stuff in a separate module, so 
 that it could be added/discarded on demand.

OK - will do.

 
 Is someone else working on something similar?
 My recent todo list is already quite long and maybe #1 and #2 are on it, but 
 I'm afraid on a lower position. But I'm happy to help if someone wants to 
 implement some of those features.
 
 I will be working on #1 and #2 (at least to a degree which is needed for the 
 project). If we could get some ideas together and you could help me - based 
 on your past experience and knowledge of the code base - to get this started 
 this would be great.
 Yes, of course.
 
 BR
 
 Maruan
 
 
 BR
 Andreas Lehmkühler

Re: Enhancements to PDFBox

2014-05-29 Thread Maruan Sahyoun

Hi Simon,

thanks for the pointer - very useful.

BR
Maruan

On 29.05.2014 at 12:06, Simon Steiner simonsteiner1...@gmail.com wrote:

 Hi,
 
 I worked on merging fonts in pdfs in fop using pdfbox
 https://issues.apache.org/jira/browse/FOP-2302
 
 Thanks
 
 -Original Message-
 From: Maruan Sahyoun [mailto:sahy...@fileaffairs.de] 
 Sent: 29 May 2014 08:40
 To: dev@pdfbox.apache.org
 Subject: Enhancements to PDFBox
 
 Hi,
 
 for a current project I need to work on enhancing PDFBox for
 
 # splitting files (e.g. remove no longer needed resources)
 # merging files (e.g. avoid duplicating resources)
 # page handling (adding/removing individual pages with resource handling)
 # enhancements to forms handling (pre fill XFA forms - partially done,
 enhancing AP generation)
 
 Is someone else working on something similar?
 
 BR
 
 Maruan

Re: Enhancements to PDFBox

2014-05-30 Thread Maruan Sahyoun


On 29.05.2014 at 18:51, John Hewson j...@jahewson.com wrote:

 # splitting files (e.g. remove no longer needed resources)
 
 Each page has its own Resources dictionary, so it shouldn't be too difficult. 
 One thing to watch out for is the page tree which allows pages to 
 inherit resources from each other, this is handled as PDPageNode but it's 
 kind of messy.

thanks for the hint. Splitting and merging are somewhat similar as splitting is 
typically done by creating a new document and importing the needed pages into 
the newly created document. Using the current code this might lead to duplicate 
resources. 
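
To make the inheritance point concrete, here is a minimal sketch in plain Java. It does not use PDFBox classes: `TreeNode` and `effectiveResources` are made-up names, and resources are modelled as a simple map. It assumes the rule that a page without its own resources takes the dictionary of its nearest ancestor that has one, which is what would need to be resolved before a page is moved into a new document.

```java
import java.util.Collections;
import java.util.Map;

/** Hypothetical stand-in for a page tree node; not a PDFBox class. */
final class TreeNode {
    final TreeNode parent;                 // null for the root of the page tree
    final Map<String, String> resources;   // null means "inherit from an ancestor"

    TreeNode(TreeNode parent, Map<String, String> resources) {
        this.parent = parent;
        this.resources = resources;
    }
}

final class InheritedResources {
    /**
     * Walks up the parent chain and returns the resources of the nearest
     * node that defines its own. Returns an empty map if no node does.
     */
    static Map<String, String> effectiveResources(TreeNode node) {
        for (TreeNode n = node; n != null; n = n.parent) {
            if (n.resources != null) {
                return n.resources;
            }
        }
        return Collections.emptyMap();
    }
}
```

When splitting, a page copied without this lookup would lose the fonts and images it only had through its parent.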

 
 # merging files (e.g. avoid duplicating resources)
 
 Sounds like the files are pretty similar, is this actually an overlay? Or are 
 you wanting to insert entire pages?

it’s merging individual files together inserting entire pages. Although the 
files are created individually they share some common elements like company 
logos or fonts. 
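
One possible way to avoid storing such shared elements twice is to key resources by a digest of their content. The following is only a sketch of that idea under stated assumptions: `ResourcePool` and `intern` are hypothetical names, resources are treated as raw byte arrays, and nothing here is existing PDFBox API.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical pool that hands out one id per distinct resource content. */
final class ResourcePool {
    private final Map<String, Integer> idByDigest = new HashMap<String, Integer>();
    private int nextId = 1;

    /** Returns the id of an identical resource seen before, or registers a new one. */
    int intern(byte[] content) {
        String digest = sha256Hex(content);
        Integer existing = idByDigest.get(digest);
        if (existing != null) {
            return existing;
        }
        int id = nextId++;
        idByDigest.put(digest, id);
        return id;
    }

    /** Number of distinct resources registered so far. */
    int size() {
        return idByDigest.size();
    }

    private static String sha256Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is one of the digests every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }
}
```

A real implementation would also have to compare the resource dictionaries, not only the stream bytes, since two streams with equal bytes can still differ in their filters or parameters.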

 
 I imagine you probably want to implement both these features at the COS level 
 rather than the PD level, as it's pretty low-level processing.
 

It will involve a lot of COS processing. I haven’t decided yet if it will sit 
on top of COS or PD. Typically we do encourage people to use PD so I tend to 
start from there and dig down internally as needed. WDYT?


 -- John
 
 On 29 May 2014, at 00:39, Maruan Sahyoun sahy...@fileaffairs.de wrote:
 
 Hi,
 
 for a current project I need to work on enhancing PDFBox for
 
 # splitting files (e.g. remove no longer needed resources)
 # merging files (e.g. avoid duplicating resources)
 # page handling (adding/removing individual pages with resource handling)
 # enhancements to forms handling (pre fill XFA forms - partially done, 
 enhancing AP generation)
 
 Is someone else working on something similar?
 
 BR
 
 Maruan

Re: Idea: stable 2.0 versions

2014-06-01 Thread Maruan Sahyoun

Hi

On 01.06.2014 at 18:51, Tilman Hausherr thaush...@t-online.de wrote:

 On 01.06.2014 15:46, Maruan Sahyoun wrote:
 
 There is one important thing we have to do before releasing 2.0, an 
 upgrade guide including updated docs.
 could handle that. Would need some input about major changes as a starting 
 point as I didn’t follow all breaking changes.
 
 
 Here are the ones I know about:
 
 old = new
 
 PDXObjectForm = PDFormXObject
 PDXObjectImage = PDImageXObject
 PDPage.convertToImage() = PDFRenderer(PDDocument).renderImage()
 PDXObjectImage.getRGBImage() = PDImageXObject.getImage()
 
  = PDFPrinter(PDDocument, ).print(PDDocument,PrinterJob, …)

AFAIK this was PDDocument.print()

Re: Idea: stable 2.0 versions

2014-06-02 Thread Maruan Sahyoun

Hi,

Maruan Sahyoun

On 02.06.2014 at 08:59, John Hewson j...@jahewson.com wrote:

 On 1 Jun 2014, at 06:03, Andreas Lehmkuehler andr...@lehmi.de wrote:
 
 Hi,
 
 On 30.05.2014 23:13, John Hewson wrote:
 I think the risk of creating the impression that 2.0 is stable is too high. 
 The real problem
 is that 2.0 has been too long in development, there were frustrated users 
 asking a year
 ago about when it would be released.
 The biggest issue is, that we can't name a version stable without an 
 official release.
 
 Seems like there could be some release candidates at some point soon... not 
 quite yet.
 
 
 Perhaps it’s time to push for a release of 2.0 and aim for a more frequent 
 release cycle
 after that, to avoid repeating the situation where the stable and trunk 
 versions are
 years apart?
 +1, it's time to go for release, not tomorrow or next week, but we should 
 start to do some planning.
 
 What is holding back 2.0? What features are we *really* holding out on? Can 
 we put
 together a roadmap - our users often ask for one...
 I already had a starting discussion with Maruan two weeks ago at a f2f 
 meeting.
 
 I'd like to add those changes which include API changes so that we don't have 
 to wait until the next major release, at least those changes which are not 
 that big, such as
 
 - solving the jempbox/xmpbox issue
 - update bouncy castle
 - split the pdfbox module in at least 2 modules (core and rendering)
 
 Splitting the rendering code into a module isn't really a feature... is there 
 a higher-level goal? If so, is it achievable for a 2.0 release in the near 
 future?

There are requests for PDFBox on Android where most of awt is not available.

 
 
 There are some changes/improvements/bugfixes I'd like to solve as well:
 
 - PDFBOX-922: unicode support
 - PDFBOX-62: almost done
 - improve the parser concerning broken XRef-tables
 - complete the recent font-improvements
 
 Yes, finally removing AWT fonts will be a huge improvement.
 
 There are some other more or less easy to solve candidates
 
 - enhance type safety
 - remove dependencies
 - 
 
 There are some other things on our ideas list which should be postponed
 
 - enhanced parser (could maybe be done without big refactorings, so that we 
 don't have to wait until the next major release)
 - refactoring of COS-level object
 - 
 
 There is one important thing we have to do before releasing 2.0, an upgrade 
 guide including updated docs.
 
 We should contact press@ in preparation of the release to phrase a press 
 release.
 
 
 IMHO, it could be realistic to do a release in the summer, maybe in August.
 
 -- John
 
 BR
 Andreas Lehmkühler
 
 On 30 May 2014, at 14:01, Tilman Hausherr thaush...@t-online.de wrote:
 
 I suggest that we come up with a concept of designating stable versions 
 (or tested versions) for the trunk and put them on the homepage. A 
 stable version is one with no or only minor regressions, and/or a version 
 that committers have found to be good. This would be for users of the 
 2.0 version who don't want to read every discussion, and also as a hint 
 for unhappy 1.8 users.
 
 I suspect that other open source projects do also have rules to designate 
 stable versions, but I didn't look at them.
 
 Proposed rules:
 - any committer can designate any version that is older than 24 hours as 
 stable
 - any committer can veto any version as unstable
 - any version that has only positive votes is mentioned on
 https://pdfbox.apache.org/downloads.html#scm
 - there should be up to three versions there
 
 Tilman

Re: Idea: stable 2.0 versions

2014-06-02 Thread Maruan Sahyoun

Hi

On 02.06.2014 at 17:59, John Hewson j...@jahewson.com wrote:

 On 2 Jun 2014, at 00:24, Maruan Sahyoun sahy...@fileaffairs.de wrote:
 
 
 Hi,
 
 Maruan Sahyoun
 
 On 02.06.2014 at 08:59, John Hewson j...@jahewson.com wrote:
 
 On 1 Jun 2014, at 06:03, Andreas Lehmkuehler andr...@lehmi.de wrote:
 
 Hi,
 
 On 30.05.2014 23:13, John Hewson wrote:
 I think the risk of creating the impression that 2.0 is stable is too 
 high. The real problem
 is that 2.0 has been too long in development, there were frustrated users 
 asking a year
 ago about when it would be released.
 The biggest issue is, that we can't name a version stable without an 
 official release.
 
 Seems like there could be some release candidates at some point soon... 
 not quite yet.
 
 
 Perhaps it’s time to push for a release of 2.0 and aim for a more 
 frequent release cycle
 after that, to avoid repeating the situation where the stable and trunk 
 versions are
 years apart?
 +1, it's time to go for release, not tomorrow or next week, but we should 
 start to do some planning.
 
 What is holding back 2.0? What features are we *really* holding out on? 
 Can we put
 together a roadmap - our users often ask for one...
 I already had a starting discussion with Maruan two weeks ago at a f2f 
 meeting.
 
 I'd like to add those changes which include API changes so that we don't have 
 to wait until the next major release, at least those changes which are not 
 that big, such as
 
 - solving the jempbox/xmpbox issue
 - update bouncy castle
 - split the pdfbox module in at least 2 modules (core and rendering)
 
 Splitting the rendering code into a module isn't really a feature... is 
 there a higher-level goal? If so, is it achievable for a 2.0 release in the 
 near future?
 
 There are requests for PDFBox on Android where most of awt is not available.
 
 So the ultimate goal is to have an Android release for 2.0, who's going to do 
 this? AWT is very deeply integrated into PD (e.g. colour spaces, images) and 
 also FontBox (paths). I think a workable plan for removing it is much harder 
 than it looks.

I don’t think, and didn’t mean to say, that an Android release should be done 
for 2.0. I only wanted to explain why rendering might go into its own module, 
as per Andreas’ input.

 
 
 
 
 There are some changes/improvements/bugfixes I'd like to solve as well:
 
 - PDFBOX-922: unicode support
 - PDFBOX-62: almost done
 - improve the parser concerning broken XRef-tables
 
 I'm thinking of taking a look at XRefs.
 
 - complete the recent font-improvements
 
 Yes, finally removing AWT fonts will be a huge improvement.
 
 There are some other more or less easy to solve candidates
 
 - enhance type safety
 - remove dependencies
 - 
 
 There are some other things on our ideas list which should be postponed
 
 - enhanced parser (could maybe be done without big refactorings, so that we 
 don't have to wait until the next major release)
 
 Yeah, let's just make sure the public API is nice and tight, then we can 
 refactor the internals at will later.
 
 - refactoring of COS-level object
 - 
 
 There is one important thing we have to do before releasing 2.0, an 
 upgrade guide including updated docs.
 
 We should contact press@ in preparation of the release to phrase a press 
 release.
 
 
 IMHO, it could be realistic to do a release in the summer, maybe in August.
 
 -- John
 
 BR
 Andreas Lehmkühler
 
 On 30 May 2014, at 14:01, Tilman Hausherr thaush...@t-online.de wrote:
 
 I suggest that we come up with a concept of designating stable 
 versions (or tested versions) for the trunk and put them on the 
 homepage. A stable version is one with no or only minor regressions, 
 and/or a version that committers have found to be good. This would be 
 for users of the 2.0 version who don't want to read every discussion, 
 and also as a hint for unhappy 1.8 users.
 
 I suspect that other open source projects do also have rules to 
 designate stable versions, but I didn't look at them.
 
 Proposed rules:
 - any committer can designate any version that is older than 24 hours as 
 stable
 - any committer can veto any version as unstable
 - any version that has only positive votes is mentioned on
 https://pdfbox.apache.org/downloads.html#scm
 - there should be up to three versions there
 
 Tilman

Re: Changing font tag for BaseFont

2014-06-05 Thread Maruan Sahyoun

Hi,

why do you need to change that tag? IKOTCH+ is used as a prefix to the font 
name because your font is subsetted, i.e. not all glyphs of the font have been 
written into the PDF file. This is in line with the specification.
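
For reading the tag (as opposed to changing it), a small sketch in plain Java may help. It assumes the BaseFont name has already been obtained as a string and relies on the rule from the PDF specification that a subset tag is six uppercase letters followed by a plus sign. `SubsetTag` and its methods are illustrative names, not PDFBox API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for splitting a subset tag from a BaseFont name. */
final class SubsetTag {
    // Six uppercase letters, a plus sign, then the original font name.
    private static final Pattern SUBSET = Pattern.compile("^([A-Z]{6})\\+(.+)$");

    /** True if the name carries a subset tag such as "IKOTCH+". */
    static boolean isSubset(String baseFont) {
        return SUBSET.matcher(baseFont).matches();
    }

    /** Returns the six-letter tag, or null if the name has none. */
    static String tagOf(String baseFont) {
        Matcher m = SUBSET.matcher(baseFont);
        return m.matches() ? m.group(1) : null;
    }

    /** Returns the font name without the tag; unchanged if there is no tag. */
    static String nameWithoutTag(String baseFont) {
        Matcher m = SUBSET.matcher(baseFont);
        return m.matches() ? m.group(2) : baseFont;
    }
}
```

The tag itself carries no meaning beyond marking the font as a subset, so different documents embedding the same font will usually show different tags.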

As usage questions are discussed on the users mailing list may I ask you to use 
that in the future?

BR

Maruan Sahyoun

On 05.06.2014 at 09:12, Robert Strauch robert.stra...@gmx.de wrote:

 Hello,
 
 I have a PDF which embeds a TrueType font called UnicodeDoc. Within the PDF I 
 can see the following:
 
 /BaseFont /IKOTCH+UnicodeDoc
 
 Is it possible using PDFBox to change the tag value IKOTCH and if so how? I 
 know that this value may be different for other documents. However I just 
 need access to this tag but I cannot find the appropriate way.
 
 Sincerely,
 Robert

Re: PDFBox 1.8.6 release

2014-06-11 Thread Maruan Sahyoun

Hi,

do you think that https://issues.apache.org/jira/browse/PDFBOX-1512 
(TextPositionComparator is not compatible with Java 7) should be handled for 
this release? Although I haven’t received feedback on it, I could move forward 
with implementing it so that it reflects how other PDF readers handle 
positions. But I wouldn’t be able to start working on it before the week after 
next.

BR

Maruan

On 11.06.2014 at 18:02, Tilman Hausherr thaush...@t-online.de wrote:

 Sure... Could you make a decision on
 PDFBOX-239 https://issues.apache.org/jira/browse/PDFBOX-239? And are there 
 any other issues that are to be fixed for 1.8.6?
 
 Tilman
 
 
 
 
 
 On 11.06.2014 08:04, Andreas Lehmkuehler wrote:
 On 28.05.2014 15:10, Andreas Lehmkühler wrote:
 Hi,
 
 there are already a number of solved issues mostly due
 to the hard work of Tilman and I'm thinking about a new
 bugfix release. How about a new one in 2 or 3 weeks
 from now?
 
 WDYT?
 How about next week, let's say Wednesday the 18th?
 
 BR
 Andreas Lehmkühler
 
 BR
 Andreas Lehmkühler

PDFBox and XMP - retire jempbox

2014-06-20 Thread Maruan Sahyoun

Hi,

we currently have two libraries handling XMP metadata: jempbox and xmpbox.

Part of PDFBOX-1187/PDFBOX-2197 was to remove the direct dependency on jempbox, 
as XMP metadata can now be generated by any library and added as a stream. 
This will be available for PDFBox 2.0.0.

I would like to propose to now retire jempbox as xmpbox

# is closer to the spec (naming conventions)
# is used for PDF/A validation, where we cannot remove the dependency on XMP 
handling because checking metadata is necessary for PDF/A compliance. 

In case there is functionality in jempbox that is missing in xmpbox that could 
be added at a later stage upon request.

WDYT? 

BR
Maruan

Release Apache PDFBox 1.8.6 - API docs

2014-06-20 Thread Maruan Sahyoun

the apidocs for 1.8.6 are available at 
http://pdfbox.staging.apache.org/docs/1.8.6/javadocs/

upon release they will be put into production.

BR

Maruan Sahyoun

On 19.06.2014 at 14:28, Andreas Lehmkuehler andr...@lehmi.de wrote:

 Hi,
 
 a candidate for the PDFBox 1.8.6 release is available at:
 
http://people.apache.org/~lehmi/pdfbox/1.8.6/
 
 The release candidate is a zip archive of the sources in:
 
http://svn.apache.org/repos/asf/pdfbox/tags/1.8.6/
 
 The SHA1 checksum of the archive is 543c49ebe34a443654a0c3c264f36acc07983cc6.
 
 Please vote on releasing this package as Apache PDFBox 1.8.6.
 The vote is open for the next 72 hours and passes if a majority of at
 least three +1 PDFBox PMC votes are cast.
 
[ ] +1 Release this package as Apache PDFBox 1.8.6
[ ] -1 Do not release this package because...
 
 
 Here is my +1
 
 BR
 Andreas Lehmkühler

Re: TIKA-1300

2014-06-27 Thread Maruan Sahyoun

thanks for the pointer - very useful information.

BR
Maruan

On 27.06.2014 at 08:18, Tilman Hausherr thaush...@t-online.de wrote:

 Please look at TIKA-1300 https://issues.apache.org/jira/browse/TIKA-1300, 
 it is about the PDFBox sequential parser vs. the non-sequential parser

Re: Apache PDFBox July 2014 board report due

2014-07-01 Thread Maruan Sahyoun

+1 - thx for taking care of this.

Maruan


On 28.06.2014 at 12:15, Andreas Lehmkuehler andr...@lehmi.de wrote:

 Hi,
 
 find attached a quick draft of the board report we're expected to submit this
 month.
 
 @John, @Tilman
 Please add something about the GSoC status.
 
 
 Any further comments, objections or additions?
 
 
 draft
 
 The Apache PDFBox library is an open source Java tool for working with PDF
 documents.
 
 
 General Comments
 
 
 There are no issues that require Board attention.
 
 Community
 
 There is a steady stream of contributions and bug reports from the community.
 
 451 (452 last report) subscribers on the user@ list
 153 (157 last report) subscribers on the dev@ list
 
 Maruan gave a presentation about PDFBox at the PDF Days Europe 2014 in 
 Cologne.
 We got some positive feedback and a couple of people showed some interest in our
 project/community.
 
 Releases
 
 
 Version 1.8.5 was released on 2nd of May 2014
 Version 1.8.6 was released on 22nd of June 2014
 
 Both are incremental bugfix releases based on PDFBox 1.8.x.
 
 GSoC
 
 
 TODO John and Tilman
 
 Development:
 
 
 The work on our next major release is an ongoing effort. The main topics are:
 
 - switch to java 1.6
 - modularization
 - replace/enhance the parser
 - code cleanup
 - enhance rendering
 
 We are targeting the late summer as a rough release date for the next major 
 release.
 
 /draft
 
 BR
 Andreas Lehmkühler

Re: Regression Testing

2014-07-04 Thread Maruan Sahyoun

Hi John,

thanks for bringing this up. This is a very important topic which was also 
discussed at the PDFDays in Germany.

 # Tests #
In addition to rendering we shall be covering metadata and text extraction as 
well as PDF/A validation. 

# Testfiles # 
Recently there were a number of test sets made available which we can use. 
http://digitalcorpora.org/corpora/files , 
https://github.com/openplanets/format-corpus/tree/master/pdfCabinetOfHorrors …
For PDF/A validation there is the Isartor test suite 
http://www.pdfa.org/2011/08/download-isartor-test-suite/. Some restrictions 
apply there.
In addition we can put additional files into our own repository as you 
suggested.
So there is no shortage of test files. 

TIKA-1300/TIKA-1302 has a discussion around the same topic together with some 
development for an infrastructure (VM, Jenkins …). IMHO we should join forces 
with them.

BR

Maruan


On 04.07.2014 at 02:16, John Hewson j...@jahewson.com wrote:

 Hi All
 
 I’ve been thinking about regression testing recently and how we can improve
 our tests for rendering. There are currently two problems:
 
 1) Different JDKs produce slightly different renderings (see PDFBOX-1843).
    (I suspect that AWT fonts are a big part of this, so the problem might get
    a lot better soon once we render all fonts ourselves).
 
 2) Most PDF test files we have are not under an Apache-friendly license, so
    we can’t put the test files into the trunk SVN.
 
 It seems that some of you have your own collections of test PDF files which
 you are running regression tests on: that’s great but it would be much better
 if we had a central repository of test files and sample renderings.
 
 I’d like to suggest the following solutions to the above issues:
 
 1) We should choose a “blessed” JDK which will be used to perform the
    renderings; this should be whatever is a convenient and sensible default
    for committers. (My preference would be for Oracle’s JDK 7 because JDK 6
    is deprecated and has known rendering bugs). We should make sure that
    Jenkins runs tests using the “blessed” JDK.
 
    The regression test can then check to see if it is running on the
    “blessed” JDK and if not then the tests can be skipped and we can warn
    the user.
 
 2) We should create a new “regression” branch in SVN which contains only PDF
    files for testing and PNG images which contain known-good renderings
    created using the “blessed” JDK. This branch would not be part of the
    source of PDFBox but will still allow us to version control the test PDFs
    (it also simplifies the workflow for adding new test PDFs and new
    known-good renderings: simply do an “svn add”).
 
    As far as copyright and licensing is concerned we can put any PDF files
    which are available publicly on the web into this branch without too much
    worry.
 
 What does everybody think?
 
 -- John

Re: Regression Testing

2014-07-05 Thread Maruan Sahyoun


 Hi Tilman
 
 Thanks for your thoughts, I think that your concerns are already covered by 
 my original proposal, I’ll try to explain why and how:
 
 Of course I agree with the need for regression tests, however it isn't easy: 
 besides the problems of the different JDKs (I use JDK7 Windows 64 bit), 
 there is the problem that some enhancements create slight changes in 
 rendering that are not errors, i.e. both the before and the after files 
 look OK by themselves. This has happened when we changed the text rendering 
 recently, and has happened again when the clipping was improved. The cause 
 are probably slight changes in color or in boundaries.
 
 If a rendering has changed then the regression test should fail. When a 
 failure occurs the developer needs to manually inspect the differences (we 
 could generate a visual diff which highlights what changed to make this 
 easier) and if ok then they can replace the known-good PNG with the ones just 
 rendered. Indeed this will be the basic workflow for working with regression 
 tests.
 

I think this is the only way to handle that situation. The same applies for 
text extraction etc. - If an improvement changes the results the ‚base‘ needs 
to be reset by adding the new image, text etc as the validation source.
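
The comparison and the visual diff mentioned in the quoted text could look roughly like the sketch below. It uses only `java.awt.image.BufferedImage` from the JDK; `RenderingDiff`, `countDifferingPixels` and `highlight` are hypothetical names, and the exact pixel-by-pixel comparison is an assumption, since a real test might want a tolerance.

```java
import java.awt.image.BufferedImage;

/** Hypothetical helper for comparing a rendering against a known-good image. */
final class RenderingDiff {
    private static final int RED = 0xFFFF0000;

    /** Counts pixels whose ARGB value differs; returns -1 if the sizes differ. */
    static int countDifferingPixels(BufferedImage expected, BufferedImage actual) {
        if (expected.getWidth() != actual.getWidth()
                || expected.getHeight() != actual.getHeight()) {
            return -1;
        }
        int differing = 0;
        for (int y = 0; y < expected.getHeight(); y++) {
            for (int x = 0; x < expected.getWidth(); x++) {
                if (expected.getRGB(x, y) != actual.getRGB(x, y)) {
                    differing++;
                }
            }
        }
        return differing;
    }

    /**
     * Builds a copy of the expected image with every changed pixel painted red.
     * Both images must have the same dimensions.
     */
    static BufferedImage highlight(BufferedImage expected, BufferedImage actual) {
        BufferedImage out = new BufferedImage(
                expected.getWidth(), expected.getHeight(), BufferedImage.TYPE_INT_ARGB);
        for (int y = 0; y < expected.getHeight(); y++) {
            for (int x = 0; x < expected.getWidth(); x++) {
                int e = expected.getRGB(x, y);
                out.setRGB(x, y, e == actual.getRGB(x, y) ? e : RED);
            }
        }
        return out;
    }
}
```

A non-zero count would fail the test; the highlighted image is what a developer would inspect before deciding whether to replace the known-good PNG.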

A basic testbed could also run against other JDKs - e.g. without validating 
against the known-good files - so we pick up potential issues early. Should be 
easy with Jenkins and treated as a hint.  
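
Deciding whether a run should validate against the known-good files could be as simple as checking the `java.version` system property. This is a sketch only: `BlessedJdk` is a made-up name, and the choice of 1.7 is taken from the preference voiced in this thread rather than from any agreed decision. It does not check the vendor, which a real check for Oracle's JDK would probably also need.

```java
/** Hypothetical check for the JDK that produced the known-good renderings. */
final class BlessedJdk {
    /** Assumed version prefix; JDK 7 reports java.version as 1.7.x. */
    static final String BLESSED_PREFIX = "1.7";

    /** True if the given java.version string belongs to the blessed JDK line. */
    static boolean isBlessed(String javaVersion) {
        return javaVersion != null && javaVersion.startsWith(BLESSED_PREFIX + ".");
    }

    /** True if the running JVM should compare renderings against known-good files. */
    static boolean shouldCompareAgainstKnownGood() {
        return isBlessed(System.getProperty("java.version"));
    }

    public static void main(String[] args) {
        if (shouldCompareAgainstKnownGood()) {
            System.out.println("Blessed JDK: comparing against known-good files.");
        } else {
            System.out.println("Other JDK: rendering only, comparison skipped.");
        }
    }
}
```

On other JDKs the tests would still render every file, so crashes and exceptions show up, but differences in the output would only be reported as a hint.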


 Copyrights is a problem: I'm testing mostly with JIRA attachments that I've 
 downloaded over the years. While uploading such files to JIRA might count as 
 fair use, I doubt that this would still be true if they are included in a 
 distribution. Instead, they should be stored somewhere on Apache servers 
 where only committers and build software (Travis, Jenkins, ...) can 
 access them. The public PDFs that Maruan mentions possibly don't have all 
 the problem cases that we solved before. However I have started working with 
 these files and there are at least 5 recent issues that deal with them.
 
 The PDFs won’t be in a distribution. They will just happen to be stored in an 
 SVN repo but not our source code repo, in the same way that the website is 
 stored in the “cmssite” branch of SVN or indeed, are on JIRA. The law doesn’t 
 distinguish between JIRA and SVN, both are publicly available via HTTP, so 
 using SVN will simply be a continuation of what we’re already doing with JIRA.
 
 The crucial factor is that we’re only storing publicly available PDFs,  
 because we have the right to do so, just like Google’s cache, and like we 
 currently do with JIRA.
 
 Additionally, the PDFs need to be version controlled otherwise we won’t be 
 able to reliably recreate previous builds, so storing the files on a web 
 server won’t be practical. Also committers will frequently be updating the 
 renderings as bugs are fixed and we’ll need to version-control the rendered 
 PNG files for the same reason. Finally, having committers-only files doesn’t 
 fit well with the Apache goal of open development and would be unnecessary 
 anyway given that all the PDFs are to be taken from public sources only.
 
 In summary, I’m proposing that we just keep doing what we’re currently doing 
 with JIRA but we move it into its own SVN repo along with some pre-rendered 
 PNGs.

In addition if we put in workarounds to handle nonconforming PDFs there should 
be a unit test added to make sure that we don’t break that e.g. when rewriting 
the parser. 

 
 Re preflight: the default mode should be to have the Isartor tests on. 
 Individuals could still disable them locally, but the central build software 
 should always use them.
 
 Yes - does anybody know why this isn’t the default?
 

No.

+1 for enabling it by default


 -- John

PDFBox and documentation

2014-07-05 Thread Maruan Sahyoun

Hi,

I have the infrastructure for enhancing our documentation nearly sorted (needed 
to learn a little more about the possibilities of the Apache CMS). Now, what do 
you think the expectation would be for documenting how to use PDFBox for 
different use cases: code snippets or runnable examples?

BR
Maruan

Re: PDFBox and documentation

2014-07-05 Thread Maruan Sahyoun

that should be doable with some newer additions to the Apache CMS which allow 
pulling from svn and/or git. Will try something on that basis. If it works we 
can enhance the example package.

BR
Maruan

On 05.07.2014 at 18:45, John Hewson j...@jahewson.com wrote:

 I'm for runnable examples in trunk on SVN, otherwise we'll end up with code 
 that doesn't actually run. Some snippets from these examples could be put on 
 the website but they should always link back to the example file in SVN 
 viewvc - there's nothing more frustrating for a new user than incomplete 
 examples, or having to copy and paste snippets together to recreate an 
 example file.
 
 Looking at the examples we have currently on SVN the coding conventions used 
 are starting to look a bit dated, certainly far behind more recently written 
 code.
 
 -- John
 
 On 5 Jul 2014, at 04:46, Maruan Sahyoun sahy...@fileaffairs.de wrote:
 
 Hi,
 
 I have the infrastructure for enhancing our documentation nearly sorted 
 (needed to learn a little more about the possibilities of the Apache CMS). 
 Now, what do you think the expectation would be for documenting how to use 
 PDFBox for different use cases: code snippets or runnable examples?
 
 BR
 Maruan

Re: Paid PDFBox support

2014-07-07 Thread Maruan Sahyoun

the issue is because part1.pdf in PDFBOX-1533 references the same 2 pages 3 
times within the document catalog (/Kids [3 0 R, 3 0 R, 3 0 R]). Could you 
attach a sample pdf to PDFBOX-1533 to verify that your issue has the same cause 
or verify it for yourself?

We are using PDFBox for merging documents ourselves successfully. Obviously 
this file would need some special treatment. 

BR
Maruan

On 07.07.2014 at 11:31, Aleksander Blomskøld aleks...@gmail.com wrote:

 Hi,
 
 We're using PDFBox for PDF validation and PDF merging in a backend
 invoicing system. It's working pretty well for most of the time, but right
 now we're having some unhappy customers because of
 https://issues.apache.org/jira/browse/PDFBOX-1533.
 
 As it's important for us to have this fixed pretty soon, we're wondering if
 anyone of you would be willing to fix this issue for pay. If so, please
 contact me so we can work out the details.
 
 
 Regards,
 
 Aleksander Blomskøld

Re: Paid PDFBox support

2014-07-08 Thread Maruan Sahyoun

of course it’s possible to put in a workaround, be it in PDFBox itself or in 
the merging application. Even better might be to check why this information, 
which is misleading at the least, was created in the first place. Do you think 
you could influence that?

BR
Maruan

On 08.07.2014 at 11:01, Aleksander Blomskøld aleks...@gmail.com wrote:

 Yes, it's the same issue. The files attached actually come from the
 company I'm working for.
 
 
 On Mon, Jul 7, 2014 at 11:05 PM, Maruan Sahyoun sahy...@fileaffairs.de
 wrote:
 
 the issue is because part1.pdf in PDFBOX-1533 references the same 2 pages
 3 times within the document catalog (/Kids [3 0 R, 3 0 R, 3 0 R]). Could
 you attach a sample pdf to PDFBOX-1533 to verify that your issue has the
 same cause or verify it for yourself?
 
 We are using PDFBox for merging documents ourselves successfully.
 Obviously this file would need some special treatment.
 
 BR
 Maruan
 
 On 07.07.2014 at 11:31, Aleksander Blomskøld aleks...@gmail.com wrote:
 
 Hi,
 
 We're using PDFBox for PDF validation and PDF merging in a backend
 invoicing system. It's working pretty well for most of the time, but
 right
 now we're having some unhappy customers because of
 https://issues.apache.org/jira/browse/PDFBOX-1533.
 
 As it's important for us to have this fixed pretty soon, we're wondering
 if
 anyone of you would be willing to fix this issue for pay. If so, please
 contact me so we can work out the details.
 
 
 Regards,
 
 Aleksander Blomskøld

Re: Paid PDFBox support

2014-07-08 Thread Maruan Sahyoun

what we could do is put the workaround into PDFBox and log a message. OTOH 
you might have more control over handling such situations if you deal with them 
yourself by putting in a check and a workaround. See my comment at PDFBOX-1533. 
WDYT?
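
To illustrate what such a check or workaround might do, here is a sketch in plain Java that walks a page tree and skips any object it has already visited, logging a message when it does. It is a model only: `PageNode` and `PageTreeWalker` are hypothetical names, a node without kids is treated as a page, and this is not the patch described at PDFBOX-1533.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

/** Hypothetical page tree node; a node without kids is treated as a page. */
final class PageNode {
    final String name;
    final List<PageNode> kids = new ArrayList<PageNode>();

    PageNode(String name) {
        this.name = name;
    }

    boolean isPage() {
        return kids.isEmpty();
    }
}

final class PageTreeWalker {
    /** Collects pages in document order, visiting each referenced object once. */
    static List<PageNode> uniquePages(PageNode root) {
        List<PageNode> pages = new ArrayList<PageNode>();
        Set<PageNode> seen =
                Collections.newSetFromMap(new IdentityHashMap<PageNode, Boolean>());
        collect(root, seen, pages);
        return pages;
    }

    private static void collect(PageNode node, Set<PageNode> seen, List<PageNode> pages) {
        if (!seen.add(node)) {
            // Same object referenced again, e.g. /Kids [3 0 R, 3 0 R, 3 0 R].
            System.err.println("Ignoring repeated page tree reference: " + node.name);
            return;
        }
        if (node.isPage()) {
            pages.add(node);
            return;
        }
        for (PageNode kid : node.kids) {
            collect(kid, seen, pages);
        }
    }
}
```

Comparing the size of this list with the number of pages counted without de-duplication would also give the merging application a way to detect affected files up front.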

BR
Maruan

On 08.07.2014 at 15:02, Aleksander Blomskøld aleks...@gmail.com wrote:

 Our biggest problem now is that we haven't been able to detect when the
 issue occurs before our customer does. I guess a possible (but not
 optimal) workaround for us would be to check the PDF files if they have
 this issue (getAllPages.size() is not the same as getNumPages()), and then
 raise an exception so we can contact the senders manually.
 
 
 Aleksander
 
 On Tue, Jul 8, 2014 at 11:05 AM, Maruan Sahyoun sahy...@fileaffairs.de
 wrote:
 
 of course it’s possible to put in a workaround, be it in PDFBox itself or in
 the merging application. Even better might be to check why this information,
 which is misleading at the least, was created in the first place. Do you
 think you could influence that?
 
 BR
 Maruan
 
 On 08.07.2014 at 11:01, Aleksander Blomskøld aleks...@gmail.com wrote:
 
 Yes, it's the same issue. The files attached actually come from the
 company I'm working for.
 
 
 On Mon, Jul 7, 2014 at 11:05 PM, Maruan Sahyoun sahy...@fileaffairs.de
 wrote:
 
 the issue is because part1.pdf in PDFBOX-1533 references the same 2 pages
 3 times within the document catalog (/Kids [3 0 R, 3 0 R, 3 0 R]). Could
 you attach a sample pdf to PDFBOX-1533 to verify that your issue has the
 same cause or verify it for yourself?
 
 We are using PDFBox for merging documents ourselves successfully.
 Obviously this file would need some special treatment.
 
 BR
 Maruan
 
 Am 07.07.2014 um 11:31 schrieb Aleksander Blomskøld aleks...@gmail.com
 :
 
 Hi,
 
 We're using PDFBox for PDF validation and PDF merging in a backend
 invoicing system. It's working pretty well most of the time, but right
 now we're having some unhappy customers because of
 https://issues.apache.org/jira/browse/PDFBOX-1533.
 
 As it's important for us to have this fixed pretty soon, we're wondering if
 any of you would be willing to fix this issue for pay. If so, please
 contact me so we can work out the details.
 
 
 Regards,
 
 Aleksander Blomskøld

Re: Paid PDFBox support

2014-07-08 Thread Maruan Sahyoun

yes - in PDFBOX-1533 I added a description for a workaround I plan to put in. 
WDYT?

BR
Maruan

Am 08.07.2014 um 19:49 schrieb John Hewson j...@jahewson.com:

 In Adobe Acrobat this file has only two pages, so as noted the root of the 
 page tree is invalid:
 
 /Kids [3 0 R, 3 0 R, 3 0 R]
 
 Acrobat is ignoring these extra pages, so the fix for PDFBox should be to 
 ignore repeated objects in the page tree.
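
A minimal sketch of the "ignore repeated objects" idea, assuming object references are represented as plain strings such as "3 0 R". KidsDeduplicator is a hypothetical helper for illustration only; it is not the change that went into PDFBox.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

/** Hypothetical helper; references are modelled as strings like "3 0 R". */
class KidsDeduplicator {
    /**
     * Returns the references in first-seen order with repeats dropped.
     * LinkedHashSet keeps insertion order, so the page order is preserved.
     */
    static List<String> dropRepeatedKids(List<String> kids) {
        return new ArrayList<>(new LinkedHashSet<>(kids));
    }
}
```

Applied to /Kids [3 0 R, 3 0 R, 3 0 R] this leaves a single reference to 3 0 R, which matches the two pages Acrobat shows for the file.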
 
 -- John
 
 On 7 Jul 2014, at 14:05, Maruan Sahyoun sahy...@fileaffairs.de wrote:
 
 the issue is because part1.pdf in PDFBOX-1533 references the same 2 pages 3 
 times within the document catalog (/Kids [3 0 R, 3 0 R, 3 0 R]). Could you 
 attach a sample pdf to PDFBOX-1533 to verify that your issue has the same 
 cause or verify it for yourself?
 
 We are using PDFBox for merging documents ourselves successfully. Obviously 
 this file would need some special treatment. 
 
 BR
 Maruan
 
 Am 07.07.2014 um 11:31 schrieb Aleksander Blomskøld aleks...@gmail.com:
 
 Hi,
 
 We're using PDFBox for PDF validation and PDF merging in a backend
 invoicing system. It's working pretty well for most of the time, but right
 now we're having some unhappy customers because of
 https://issues.apache.org/jira/browse/PDFBOX-1533.
 
 As it's important for us to have this fixed pretty soon, we're wondering if
 anyone of you would be willing to fix this issue for pay. If so, please
 contact me so we can work out the details.
 
 
 Regards,
 
 Aleksander Blomskøld

Re: Paid PDFBox support

2014-07-08 Thread Maruan Sahyoun

thx

Maruan

Am 08.07.2014 um 20:33 schrieb John Hewson j...@jahewson.com:

 Looks good. I modified getAllKids() so that it returns the same output as 
 your workaround, rather than applying the workaround to the output. It’s now 
 in the 1.8.7 and 2.0 trunks.
 
 -- John
 
 On 8 Jul 2014, at 10:53, Maruan Sahyoun sahy...@fileaffairs.de wrote:
 
 yes - in PDFBOX-1533 I added a description for a workaround I plan to put 
 in. WDYT?
 
 BR
 Maruan
 
 Am 08.07.2014 um 19:49 schrieb John Hewson j...@jahewson.com:
 
 In Adobe Acrobat this file has only two pages, so as noted the root of the 
 page tree is invalid:
 
 /Kids [3 0 R, 3 0 R, 3 0 R]
 
 Acrobat is ignoring these extra pages, so the fix for PDFBox should be to 
 ignore repeated objects in the page tree.
 
 -- John
 
 On 7 Jul 2014, at 14:05, Maruan Sahyoun sahy...@fileaffairs.de wrote:
 
 the issue is because part1.pdf in PDFBOX-1533 references the same 2 pages 
 3 times within the document catalog (/Kids [3 0 R, 3 0 R, 3 0 R]). Could 
 you attach a sample pdf to PDFBOX-1533 to verify that your issue has the 
 same cause or verify it for yourself?
 
 We are using PDFBox for merging documents ourselves successfully. 
 Obviously this file would need some special treatment. 
 
 BR
 Maruan
 
 Am 07.07.2014 um 11:31 schrieb Aleksander Blomskøld aleks...@gmail.com:
 
 Hi,
 
 We're using PDFBox for PDF validation and PDF merging in a backend
 invoicing system. It's working pretty well for most of the time, but right
 now we're having some unhappy customers because of
 https://issues.apache.org/jira/browse/PDFBOX-1533.
 
 As it's important for us to have this fixed pretty soon, we're wondering 
 if
 anyone of you would be willing to fix this issue for pay. If so, please
 contact me so we can work out the details.
 
 
 Regards,
 
 Aleksander Blomskøld

Re: Subversion integration with JIRA

2014-07-23 Thread Maruan Sahyoun

+1 

Maruan

Am 22.07.2014 um 19:53 schrieb Andreas Lehmkuehler andr...@lehmi.de:

 Hi,
 
 our infra guys provide an integration of subversion with JIRA tickets. All 
 subversion commits will be automatically added as a comment to the 
 corresponding JIRA ticket as long as the ticket number is used within the svn 
 commit comment.
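
To illustrate what "the ticket number is used within the svn commit comment" means in practice, here is a small sketch that picks ticket keys out of a commit message. It is not how the infra integration is implemented; TicketKeys and the assumed key shape (upper-case project key, hyphen, digits) are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; not part of the infra tooling. */
class TicketKeys {
    // Assumed key shape: upper-case project key, hyphen, digits (e.g. PDFBOX-1533).
    private static final Pattern KEY = Pattern.compile("\\b[A-Z][A-Z0-9]*-\\d+\\b");

    /** Returns each distinct ticket key found in the message, in order of appearance. */
    static List<String> find(String commitMessage) {
        List<String> keys = new ArrayList<>();
        Matcher matcher = KEY.matcher(commitMessage);
        while (matcher.find()) {
            String key = matcher.group();
            if (!keys.contains(key)) {
                keys.add(key);
            }
        }
        return keys;
    }
}
```

A commit comment that mentions no ticket key yields an empty list, which is the case where no JIRA comment would be expected.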
 
 See http://www.apache.org/dev/svngit2jira.html for any further details.
 
 Should we ask infra to enable that feature for PDFBox?
 
 WDYT?
 
 
 BR
 Andreas Lehmkühler

Re: Subversion integration with JIRA

2014-07-23 Thread Maruan Sahyoun

according to the sample provided in http://www.apache.org/dev/svngit2jira.html 
the commit will be shown in the comments.

Maruan

Am 23.07.2014 um 08:33 schrieb Thomas Chojecki i...@rayman2200.de:

 Am 2014-07-23 07:57, schrieb Tilman Hausherr:
 Let's try it. TIKA has something similar, see e.g. here:
 https://issues.apache.org/jira/browse/TIKA-1325
 Tilman
 
 Looks like they misuse Hudson to do something that JIRA already 
 supports in a similar way. I think the solution from infra is the better one. 
 That way the code changes will be shown only in the source code section of a ticket. 
 :-)
 
 The ability to link source code to an issue is IMO a must-have.
 
 +1
 
 Am 22.07.2014 19:53, schrieb Andreas Lehmkuehler:
 Hi,
 our infra guys provide an integration of subversion with JIRA tickets. All 
 subversion commits will be automatically added as comment  to the 
 corresponding JIRA ticket as long as the ticket number is used within the 
 svn commit comment.
 See http://www.apache.org/dev/svngit2jira.html for any further details.
 Should we ask infra to enable that feature for PDFBox?
 WDYT?
 BR
 Andreas Lehmkühler

Re: Custom TextStripper / PDGraphicsState Not Reading Color

2014-07-30 Thread Maruan Sahyoun

+1 for removing the .properties file if the new mechanism is easier to 
understand and handle. So far the discussion doesn’t provide proof of that, or 
any information about it.

What would a replacement look like?

OTOH if it’s a documentation issue we could also add some more information to 
the javadocs to explain the dependencies. 

We could add register/unregister methods to allow adding and removing custom 
operator handlers, or provide a service discovery mechanism. This way we would 
keep the existing flexibility.
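
A rough sketch of what such register/unregister methods could look like. All names here (OperatorHandler, OperatorRegistry) are hypothetical and chosen for illustration; they are not PDFBox API, and the real operator classes have a different shape.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical handler interface; PDFBox's own operator classes differ. */
interface OperatorHandler {
    String handle(String operand);
}

/** Hypothetical registry that maps operator names to handlers in code. */
class OperatorRegistry {
    private final Map<String, OperatorHandler> handlers = new HashMap<>();

    /** Adds or replaces the handler for an operator. */
    void register(String operator, OperatorHandler handler) {
        handlers.put(operator, handler);
    }

    /** Removes the handler for an operator, if there is one. */
    void unregister(String operator) {
        handlers.remove(operator);
    }

    boolean isRegistered(String operator) {
        return handlers.containsKey(operator);
    }

    /** Returns the handler's result, or null when no handler is registered. */
    String process(String operator, String operand) {
        OperatorHandler handler = handlers.get(operator);
        return handler == null ? null : handler.handle(operand);
    }
}
```

With this shape the default mapping would live in code, for example in a constructor or an overridable method, while callers could still swap individual handlers at runtime without a .properties file.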

BR
Maruan

Am 29.07.2014 um 21:48 schrieb John Hewson j...@jahewson.com:

 Right but we need to address the confusion and complexity that has been 
 caused by .properties files which made PDFBOX-2246 so tricky to figure out.
 
 Let's remove this wart!
 
 -- John
 
 On 29 Jul 2014, at 10:44, Tilman Hausherr thaush...@t-online.de wrote:
 
 Hi,
 
 At this time, the problem I see and wanted to solve (PDFBOX-2246) exists 
 regardless whether we use a properties file or initialize directly in the 
 code.
 
 Tilman
 
 
 Am 29.07.2014 19:41, schrieb John Hewson:
 On 29 Jul 2014, at 03:44, Andreas Lehmkühler andr...@lehmi.de wrote:
 
 Hi,
 
 it's not a black and white issue (comments inline)
 
 John Hewson j...@jahewson.com hat am 29. Juli 2014 um 07:44 geschrieben:
 
 
 Yes, really I should have said subclasses of PDFStreamEngine -  that's 
 where
 the .properties file originates. I'd propose replacing the properties
 mechanism with a simple method containing the mapping which can be 
 overridden
 in subclasses. Ultimately, users expect to be able to subclass the 
 behaviour
 of a class by just subclassing the class.
 PDFStreamEngine doesn't configure any operator set itself. The subclasses 
 are
 supposed to configure their own set of operators depending on the 
 particular
 usecase. E.g. to extend the text extraction one has to subclass 
 PDFTextStripper
 and so on.
 It’s PDFStreamEngine which implements the .property mechanism though, via 
 the
 PDFStreamEngine(Properties properties) constructor.
 
 E.g. to extend the text extraction one has to subclass PDFTextStripper and 
 so on.
 That’s true, but it’s only half the story, don’t forget that the 
 .properties files need
 to be copied and pasted elsewhere and modified along with overriding which 
 .property
 file is passed in the constructor if you want to truly override the class’ 
 behaviour.
 
 We've seen a number of incidents of confusion on the mailing list due to 
 the
 current design.
 IMHO, most of the confusion is based on the lack of knowledge of the pdf 
 spec.
 One can't understand how pdfbox works under the hood by simply looking at 
 the
 code. One has to understand the pdf spec as well, at least the base 
 concepts.
 I’m specifically talking about confusion surrounding how to override 
 operators, and
 .properties files, this has come up before. This entire thread has been 
 caused by
 PDFBox’s design and *not* the PDF spec.
 
 I'd say that to the modern Java developer having non-code runtime binding 
 has
 become an anti-pattern, resulting in brittle code which can't easily be
 navigated in an IDE and which resists automated analysis and exhibits 
 runtime
 failures despite compiling ok. This is one of those cases where the 
 collective
 wisdom has just evolved over the years.
 It depends on the given usecase. All solutions have advantages and
 disadvantages. E.g. if someone wants to configure the PDFTextStripper 
 without
 recompiling the code, it is quite handy to keep the configuration in a text
 file.
 Has anybody *ever* wanted to change the operators which PDFTextStripper is
 processing without recompiling the code? These are internal implementation
 details that shouldn’t be exposed in the first place - it’s not a 
 “configuration” at
 all, especially as 99% of possible changes would just break PDFTextStripper.
 
 In this case I'm neither pro nor con a text-based config, but I tend to agree
 with John that the different configurations should live in some method within
 the subclasses of PDFStreamEngine.
 As above, this isn’t “configuration” at all, it lacks even a basic use 
 case. I don’t
 see any pros which aren’t fabricated for the sake of argument, but the cons 
 are
 causing us significant problems right here, right now.
 
 BR
 Andreas Lehmkühler
 
 -- John
 
 On 28 Jul 2014, at 13:42, Tilman Hausherr thaush...@t-online.de wrote:
 
 I disagree - one doesn't *have* to pass a property file to 
 PDFTextStripper
 and PageDrawer. The properties file for PDFTextStripper is optional. The
 property parameter was already there before it became an apache project.
 
 
 Tilman
 
 
 
 Am 28.07.2014 22:08, schrieb John Hewson:
 We need to get rid of these .properties files, they’re causing endless
 confusion, not to mention that they hide runtime dependencies in text
 files.
 
 We should make it so that overriding a TextStripper, PageDrawer, etc.
 doesn’t require external .properties files, currently Preflight works in
 this manner and it’s much clearer.

Re: PDFBox 1.8.7 release?

2014-08-07 Thread Maruan Sahyoun

+1

Maruan Sahyoun

Am 07.08.2014 um 12:35 schrieb Andreas Lehmkühler andr...@lehmi.de:

 Hi,
 
 there are already a number of solved issues and I guess it's
 time for a new bugfix release.
 
 I'm working on PDFBOX-2250 and I'd like to finish that
 first but how about a new release in 2 or 3 weeks from now?
 
 WDYT?
 
 BR
 Andreas Lehmkühler

Re: PDFBox 1.8.7 release?

2014-08-11 Thread Maruan Sahyoun


Am 11.08.2014 um 18:59 schrieb Andreas Lehmkuehler andr...@lehmi.de:

 Hi,
 
 
 Am 11.08.2014 18:35, schrieb John Hewson:
 Andreas,
 
 What I had been thinking was that now that 2.0 is getting closer we 
 might want to do less with 1.8, but I agree with you that we don’t need any 
 fixed rules, staying flexible is better. It sounds like we might want to 
 think about some guidelines for 1.8 after 2.0 is released to avoid a 
 “Windows XP” situation, but we’re not at that point yet.
 Yes, good point. Hopefully 1.8.7 will be the last 1.8 release before 2.0 :-)
 

As 2.0 breaks the current API (which is intended) I suspect that there will be 
bugfixes for 1.8 needed for some time.

 Cheers
 
 -- John
 
 BR
 Andreas Lehmkühler
 
 
 On 11 Aug 2014, at 03:57, Andreas Lehmkühler andr...@lehmi.de wrote:
 
 Hi,
 
 John Hewson j...@jahewson.com hat am 7. August 2014 um 18:48 geschrieben:
 
 
 Perhaps we should stop adding new features to 1.8, and only fix the most
 problematic bugs?
 We never were that strict about the contents of a bugfix release in the 
 past.
 We always added some improvements or new features. Most of them were small
 and/or hadn't a huge impact on the code/functionality. Some were added
 because people were eagerly waiting for them. There aren't any rules what
 to add or not and IMHO we don't need any.
 
 
 BR
 Andreas Lehmkühler
 
 
 -- John
 
 On 7 Aug 2014, at 09:11, Tilman Hausherr thaush...@t-online.de wrote:
 
 +1
 
 but after I've ported the GSoC2014-improved shading package to 1.8
 
 Tilman
 
 Am 07.08.2014 12:35, schrieb Andreas Lehmkühler:
 Hi,
 
 there is already a number of solved issues and I guess it's
 time for a new bugfix release.
 
 I'm working on PDFBOX-2250 and I'd like to finish that
 first but how about a new release in 2 or 3 weeks from now?
 
 WDYT?
 
 BR
 Andreas Lehmkühler

Re: PDFBox 1.8.7 release?

2014-08-14 Thread Maruan Sahyoun

I’d like to include PDFBOX-2249 - should be ready by then. 

Am 14.08.2014 um 09:08 schrieb Andreas Lehmkühler andr...@lehmi.de:

 
 
 Andreas Lehmkühler andr...@lehmi.de hat am 7. August 2014 um 12:35
 geschrieben:
 
 
 Hi,
 
 there is already a number of solved issues and I guess it's
 time for a new bugfix release.
 
 I'm working on PDFBOX-2250 and I'd like to finish that
 first but how about a new release in 2 or 3 weeks from now?
 
 WDYT?
 
 As there weren't any objections I'm targeting the first week of September to
 cut the release.
  
 BR
 Andreas Lehmkühler

AcroForm fields and appearance stream generation

2014-08-28 Thread Maruan Sahyoun

Hi,

there are cases where a form field doesn’t contain an appearance stream, e.g. when 
the form was filled in and the NeedAppearances flag in the form’s dictionary has 
been set. In such cases an appearance stream needs to be generated for rendering.

Am I right that the following should hold for PDFBox?
# we should respect the NeedAppearances flag when setting a field’s value so that 
we don’t generate an appearance stream in that case
# we shouldn’t generate an appearance stream during the parsing stage if none 
exists
# we should generate an appearance stream if none exists when rendering the PDF
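
The three proposed rules can be written down as a small decision function. This is a sketch of the proposal as worded above, not of actual PDFBox behaviour. AppearancePolicy and Stage are hypothetical names, and generating a stream on a value change when NeedAppearances is not set is an assumption that the rules only imply.

```java
/** Hypothetical policy class; models the proposal, not PDFBox itself. */
class AppearancePolicy {
    enum Stage { PARSING, SETTING_VALUE, RENDERING }

    /**
     * Decides whether an appearance stream should be generated.
     *
     * @param stage           what is currently being done with the field
     * @param needAppearances value of the NeedAppearances flag
     * @param hasAppearance   whether the field already has an appearance stream
     */
    static boolean shouldGenerate(Stage stage, boolean needAppearances, boolean hasAppearance) {
        switch (stage) {
            case PARSING:
                // Rule 2: never generate while parsing.
                return false;
            case SETTING_VALUE:
                // Rule 1: respect NeedAppearances (assumption: generate otherwise).
                return !needAppearances;
            case RENDERING:
                // Rule 3: generate only if nothing exists yet.
                return !hasAppearance;
            default:
                throw new IllegalArgumentException("Unknown stage: " + stage);
        }
    }
}
```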

BR
Maruan

Re: PDFBox 1.8.7 release?

2014-09-11 Thread Maruan Sahyoun

Hi Andreas,

what are your current plans for cutting the new release? Depending on that I could 
do https://issues.apache.org/jira/browse/PDFBOX-91 [Comb Fields] as a quick fix 
this weekend to the 1.8 branch.

BR
Maruan

Am 14.08.2014 um 09:08 schrieb Andreas Lehmkühler andr...@lehmi.de:

 
 
 Andreas Lehmkühler andr...@lehmi.de hat am 7. August 2014 um 12:35
 geschrieben:
 
 
 Hi,
 
 there is already a number of solved issues and I guess it's
 time for a new bugfix release.
 
 I'm working on PDFBOX-2250 and I'd like to finish that
 first but how about a new release in 2 or 3 weeks from now?
 
 WDYT?
 
 As there weren't any objections I'm targeting the first week of September to
 cut the release.
  
 BR
 Andreas Lehmkühler

Re: [VOTE] Release Apache PDFBox 1.8.7

2014-09-15 Thread Maruan Sahyoun

+1 - thanks for taking care of the release process.

Maruan

Am 15.09.2014 um 20:49 schrieb Andreas Lehmkuehler andr...@lehmi.de:

 Hi,
 
 a candidate for the PDFBox 1.8.7 release is available at:
 
http://people.apache.org/~lehmi/pdfbox/1.8.7/
 
 The release candidate is a zip archive of the sources in:
 
http://svn.apache.org/repos/asf/pdfbox/tags/1.8.7/
 
 The SHA1 checksum of the archive is ba7f83a1db9e697bcd0d3613571e1b397968daf6.
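
For anyone checking the candidate before voting, the checksum comparison could be sketched as below, using only the JDK. Sha1Check is a hypothetical helper name; the archive bytes would have to be read from wherever the downloaded file was saved.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper for comparing a SHA-1 checksum; uses only the JDK. */
class Sha1Check {
    /** Returns the SHA-1 digest of the data as lower-case hex. */
    static String sha1Hex(byte[] data) {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("SHA-1");
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a standard algorithm name every Java platform must support.
            throw new IllegalStateException("SHA-1 not available", e);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest(data)) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    /** Compares the computed digest with the published one, ignoring case and outer whitespace. */
    static boolean matches(byte[] data, String expectedHex) {
        return sha1Hex(data).equalsIgnoreCase(expectedHex.trim());
    }
}
```

The bytes can be obtained with java.nio.file.Files.readAllBytes for an archive of this size; the expected value is the checksum quoted in the vote mail.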
 
 Please vote on releasing this package as Apache PDFBox 1.8.7.
 The vote is open for the next 72 hours and passes if a majority of at
 least three +1 PDFBox PMC votes are cast.
 
[ ] +1 Release this package as Apache PDFBox 1.8.7
[ ] -1 Do not release this package because...
 
 Here is my +1
 
 BR
 Andreas Lehmkühler

[DISCUSS] move documentation and examples to git

2014-09-16 Thread Maruan Sahyoun

Hi there,

in order to make it easier for people to contribute to the documentation and 
examples I thought about the potential benefits of moving these to a git based 
repository instead of svn. The main idea behind that is to allow people to 
contribute via github opening another channel of communication and making it 
easier to contribute. 

Proposed names are pdfbox-docs and pdfbox-examples. Take a look at 
https://github.com/apache/cordova-docs for an example of that.

I haven’t thought about all potential implications and changes necessary yet 
but wanted to get a first feedback about support for that idea before putting 
more effort into that.

WDYT?

Maruan

Re: [DISCUSS] move documentation and examples to git

2014-09-16 Thread Maruan Sahyoun

what about having extra repos for pdfbox-docs and pdfbox-examples?

Maruan

Am 16.09.2014 um 11:43 schrieb Andreas Lehmkühler andr...@lehmi.de:

 Hi,
 
 Maruan Sahyoun sahy...@fileaffairs.de hat am 16. September 2014 um 10:21
 geschrieben:
 
 
 Hi there,
 
 in order to make it easier for people to contribute to the documentation and
 examples I thought about the potential benefits of moving these to a git 
 based
 repository instead of svn. The main idea behind that is to allow people to
 contribute via github opening another channel of communication and making it
 easier to contribute.
 
 Proposed names are pdfbox-docs and pdfbox-examples. Take a look at
 https://github.com/apache/cordova-docs for an example of that.
 
 I haven’t thought about all potential implications and changes necessary yet
 but wanted to get a first feedback about support for that idea before putting
 more effort into that.
 
 WDYT?
 Good idea, but I'm not sure if a split repo configuration (svn/git) is
 supported by infra. So maybe this is only possible if we migrate the whole
 project to git.
 
 Maruan
 
 BR
 Andreas Lehmkühler

Re: [DISCUSS] move documentation and examples to git

2014-09-16 Thread Maruan Sahyoun

OK - I see what you mean, I got your question wrong. We can check with infra, but 
I don’t see a reason why pdfbox-docs and pdfbox-examples can't live in new 
git-based repos while pdfbox itself stays in the old svn one. They 
would behave just like 'different' projects.

So if it’s possible shall we do it?

Moving the whole project to git is a different story. I’d see the same benefit 
applying to pdfbox but the impact is larger. So moving the docs and examples 
might also be a good test case.

Maruan


Am 16.09.2014 um 11:55 schrieb Andreas Lehmkühler andr...@lehmi.de:

 Maruan Sahyoun sahy...@fileaffairs.de hat am 16. September 2014 um 11:46
 geschrieben:
 
 
 what about having extra repos for pdfbox-docs and pdfbox-examples?
 Hmm, I'm a little bit puzzled. Your original proposal was already about extra
 git repos for docs and examples, wasn't it?
 
 Andreas
 
 
 Maruan
 
 Am 16.09.2014 um 11:43 schrieb Andreas Lehmkühler andr...@lehmi.de:
 
 Hi,
 
 Maruan Sahyoun sahy...@fileaffairs.de hat am 16. September 2014 um 10:21
 geschrieben:
 
 
 Hi there,
 
 in order to make it easier for people to contribute to the documentation
 and
 examples I thought about the potential benefits of moving these to a git
 based
 repository instead of svn. The main idea behind that is to allow people to
 contribute via github opening another channel of communication and making
 it
 easier to contribute.
 
 Proposed names are pdfbox-docs and pdfbox-examples. Take a look at
 https://github.com/apache/cordova-docs for an example of that.
 
 I haven’t thought about all potential implications and changes necessary
 yet
 but wanted to get a first feedback about support for that idea before
 putting
 more effort into that.
 
 WDYT?
 Good idea, but I'm not sure if a splitted repo configuration (svn/git) is
 supported by infra. So maybe this is only possible if we migrate the whole
 project to git.
 
 Maruan
 
 BR
 Andreas Lehmkühler

Re: [DISCUSS] move documentation and examples to git

2014-09-17 Thread Maruan Sahyoun

Dear Santosh,

you can unregister using the link below.

https://pdfbox.apache.org/mailinglists.html

With kind regards
Maruan

 Am 17.09.2014 um 03:00 schrieb Santosh Arakeri santosh.arak...@gmail.com:
 
 Pl dont send me mail.
 On 16 Sep 2014 13:52, Maruan Sahyoun sahy...@fileaffairs.de wrote:
 
 Hi there,
 
 in order to make it easier for people to contribute to the documentation
 and examples I thought about the potential benefits of moving these to a
 git based repository instead of svn. The main idea behind that is to allow
 people to contribute via github opening another channel of communication
 and making it easier to contribute.
 
 Proposed names are pdfbox-docs and pdfbox-examples. Take a look at
 https://github.com/apache/cordova-docs for an example of that.
 
 I haven’t thought about all potential implications and changes necessary
 yet but wanted to get a first feedback about support for that idea before
 putting more effort into that.
 
 WDYT?
 
 Maruan

Re: [DISCUSS] move documentation and examples to git

2014-09-17 Thread Maruan Sahyoun

is that because of the examples, the docs or both?

BR

Maruan

Am 17.09.2014 um 18:46 schrieb Tilman Hausherr thaush...@t-online.de:

 It is an "I don't like it, but I can live with it", but I think it might be a 
 pain. A soft -1.
 
 Tilman
 
 Am 17.09.2014 um 08:40 schrieb Andreas Lehmkühler:
 Hi,
 
 Tilman Hausherr thaush...@t-online.de hat am 16. September 2014 um 18:03
 geschrieben:
 
 
 -1, I don't like the idea to have different repository types.
 Hmmm, is this just an "I don't like it, but I can live with it" or is it a
 clear veto?
 
 In case of a veto, how about starting by moving parts of the docs to a new
 git repo? IMO sooner or later the project will move from svn to git, and that
 would be a good opportunity to get used to the general usage of git and, of
 course, to the special processes used here at the ASF, so that we are not
 thrown in at the deep end after the migration.
 
 Tilman
 BR
 Andreas
 
 Am 16.09.2014 um 10:21 schrieb Maruan Sahyoun:
 Hi there,
 
 in order to make it easier for people to contribute to the documentation 
 and
 examples I thought about the potential benefits of moving these to a git
 based repository instead of svn. The main idea behind that is to allow
 people to contribute via github opening another channel of communication 
 and
 making it easier to contribute.
 
 Proposed names are pdfbox-docs and pdfbox-examples. Take a look at
 https://github.com/apache/cordova-docs for an example of that.
 
 I haven’t thought about all potential implications and changes necessary 
 yet
 but wanted to get a first feedback about support for that idea before
 putting more effort into that.
 
 WDYT?
 
 Maruan

Re: [DISCUSS] move documentation and examples to git

2014-09-17 Thread Maruan Sahyoun

The docs are part of the website.

Currently I mean the cookbook, how to build the project, the architecture, etc.

Maruan

Am 17.09.2014 um 19:26 schrieb Tilman Hausherr thaush...@t-online.de:

 Hi Maruan,
 
 The examples only.
 
 With the docs I assume you mean the website. I've never touched it 
 (although I might in the future), it isn't part of the project, so I don't 
 mind.
 
 Tilman
 
 Am 17.09.2014 um 19:01 schrieb Maruan Sahyoun:
 is that because of the examples, the docs or both?
 
 BR
 
 Maruan
 
 Am 17.09.2014 um 18:46 schrieb Tilman Hausherr thaush...@t-online.de:
 
 It is a I don't like it, but I can live with it but I think it might be a 
 pain. A soft -1.
 
 Tilman
 
 Am 17.09.2014 um 08:40 schrieb Andreas Lehmkühler:
 Hi,
 
 Tilman Hausherr thaush...@t-online.de hat am 16. September 2014 um 18:03
 geschrieben:
 
 
 -1, I don't like the idea to have different repository types.
 Hmmm, is this just a I don't like it, but I can live with it or is it a 
 clear
 veto?
 
 In a case of a veto, how about starting with moving parts of the docs to a 
 new
 git repo? IMO sooner or later the project will move from svn to git and 
 that
 would be a good opertunity to get used to the general usage of git and of 
 course
 to the special processes used here at the ASF so that we are not thrown in 
 at
 the deep end after the migration.
 
 Tilman
 BR
 Andreas
 
 Am 16.09.2014 um 10:21 schrieb Maruan Sahyoun:
 Hi there,
 
 in order to make it easier for people to contribute to the documentation 
 and
 examples I thought about the potential benefits of moving these to a git
 based repository instead of svn. The main idea behind that is to allow
 people to contribute via github opening another channel of communication 
 and
 making it easier to contribute.
 
 Proposed names are pdfbox-docs and pdfbox-examples. Take a look at
 https://github.com/apache/cordova-docs for an example of that.
 
 I haven’t thought about all potential implications and changes necessary 
 yet
 but wanted to get a first feedback about support for that idea before
 putting more effort into that.
 
 WDYT?
 
 Maruan

Re: [DISCUSS] move documentation and examples to git

2014-09-17 Thread Maruan Sahyoun



Maruan Sahyoun

 Am 18.09.2014 um 02:03 schrieb John Hewson j...@jahewson.com:
 
 I agree with Tilman on this point, the examples need to stay in the trunk 
 where they can be built along with it.
 It’s very common to modify an example to take into account API changes. 
 They’re also currently distributed along with the main PDFBox source bundle, 
 which is a good thing.
 
 I’d be surprised if anybody outside of the project wanted to contribute to 
 the documentation, almost nobody seems to like writing it. Perhaps we could 
 do this as a trial - see if it really increases contributions or not? It 
 would be great if it did.
 

OK, so let's try it with the docs. 

To mention it for completeness - the build process for the web site and the 
documentation contained within will still be done by the Apache CMS. 

 It’s worth adding that I’m (reluctantly) against moving PDFBox trunk over to 
 GitHub because GitHub Issues is not powerful enough for our needs (e.g. no 
 file attachments), which is really a shame.
 

Issue tracking would still be done using JIRA, same as for most other Apache 
projects.

 -- John
 
 On 17 Sep 2014, at 10:26, Tilman Hausherr thaush...@t-online.de wrote:
 
 Hi Maruan,
 
 The examples only.
 
 With the docs I assume you mean the website. I've never touched it 
 (although I might in the future), it isn't part of the project, so I don't 
 mind.
 
 Tilman
 
 Am 17.09.2014 um 19:01 schrieb Maruan Sahyoun:
 is that because of the examples, the docs or both?
 
 BR
 
 Maruan
 
 Am 17.09.2014 um 18:46 schrieb Tilman Hausherr thaush...@t-online.de:
 
 It is a I don't like it, but I can live with it but I think it might be a 
 pain. A soft -1.
 
 Tilman
 
 Am 17.09.2014 um 08:40 schrieb Andreas Lehmkühler:
 Hi,
 
 Tilman Hausherr thaush...@t-online.de hat am 16. September 2014 um 
 18:03
 geschrieben:
 
 
 -1, I don't like the idea to have different repository types.
 Hmmm, is this just a I don't like it, but I can live with it or is it a 
 clear
 veto?
 
 In a case of a veto, how about starting with moving parts of the docs to 
 a new
 git repo? IMO sooner or later the project will move from svn to git and 
 that
 would be a good opertunity to get used to the general usage of git and of 
 course
 to the special processes used here at the ASF so that we are not thrown 
 in at
 the deep end after the migration.
 
 Tilman
 BR
 Andreas
 
 Am 16.09.2014 um 10:21 schrieb Maruan Sahyoun:
 Hi there,
 
 in order to make it easier for people to contribute to the 
 documentation and
 examples I thought about the potential benefits of moving these to a git
 based repository instead of svn. The main idea behind that is to allow
 people to contribute via github opening another channel of 
 communication and
 making it easier to contribute.
 
 Proposed names are pdfbox-docs and pdfbox-examples. Take a look at
 https://github.com/apache/cordova-docs for an example of that.
 
 I haven’t thought about all potential implications and changes 
 necessary yet
 but wanted to get a first feedback about support for that idea before
 putting more effort into that.
 
 WDYT?
 
 Maruan

Re: [DISCUSS] move documentation and examples to git

2014-09-20 Thread Maruan Sahyoun

I’d think that if projects such as Apache Camel, Apache Jackrabbit, Apache TomEE 
and Apache Cordova, to mention a few, can handle it, we should be smart enough to 
handle it too. And for these projects I can’t see an issues tab on GitHub, only 
pull requests.

BR
Maruan

Am 20.09.2014 um 04:22 schrieb John Hewson j...@jahewson.com:

 Issue tracking would still be done using Jira. Same as for most other Apache 
 projects
 
 The problem with that approach is that GitHub’s pull requests can only be 
 managed via GitHub’s issues interface, so we’re forced to use it. There’s no 
 way to prevent GitHub users from opening and discussing issues in pull 
 requests rather than on JIRA.
 
 -- John
 
 On 17 Sep 2014, at 21:58, Maruan Sahyoun sahy...@fileaffairs.de wrote:
 
 
 
 Maruan Sahyoun
 
 Am 18.09.2014 um 02:03 schrieb John Hewson j...@jahewson.com:
 
 I agree with Tilman on this point, the examples need to stay in the trunk 
 where they can be built along with it.
 It’s very common to modify an example to take into account API changes. 
 They’re also currently distributed along with the main PDFBox source 
 bundle, which is a good thing.
 
 I’d be surprised if anybody outside of the project wanted to contribute to 
 the documentation, almost nobody seems to like writing it. Perhaps we could 
 do this as a trial - see if it really increases contributions or not? It 
 would be great if it did.
 
 
 OK so lets try with the docs. 
 
 To mention it for completness - the build process for the web site and the 
 documentation contained within will still be done by the Apache CMS. 
 
 It’s worth adding that I’m (reluctantly) against moving PDFBox trunk over 
 to GitHub because GitHub Issues is not powerful enough for our needs (e.g. 
 no file attachments), which is really a shame.
 
 
 Issue tracking would still be done using Jira. Same as for most other Apache 
 projects
 
 -- John
 
 On 17 Sep 2014, at 10:26, Tilman Hausherr thaush...@t-online.de wrote:
 
 Hi Maruan,
 
 The examples only.
 
 With the docs I assume you mean the website. I've never touched it 
 (although I might in the future), it isn't part of the project, so I don't 
 mind.
 
 Tilman
 
 Am 17.09.2014 um 19:01 schrieb Maruan Sahyoun:
 is that because of the examples, the docs or both?
 
 BR
 
 Maruan
 
 Am 17.09.2014 um 18:46 schrieb Tilman Hausherr thaush...@t-online.de:
 
 It is a I don't like it, but I can live with it but I think it might be 
 a pain. A soft -1.
 
 Tilman
 
 Am 17.09.2014 um 08:40 schrieb Andreas Lehmkühler:
 Hi,
 
 Tilman Hausherr thaush...@t-online.de hat am 16. September 2014 um 
 18:03
 geschrieben:
 
 
 -1, I don't like the idea to have different repository types.
 Hmmm, is this just a I don't like it, but I can live with it or is it 
 a clear
 veto?
 
 In a case of a veto, how about starting with moving parts of the docs 
 to a new
 git repo? IMO sooner or later the project will move from svn to git and 
 that
 would be a good opertunity to get used to the general usage of git and 
 of course
 to the special processes used here at the ASF so that we are not thrown 
 in at
 the deep end after the migration.
 
 Tilman
 BR
 Andreas
 
 On 16.09.2014 at 10:21, Maruan Sahyoun wrote:
 Hi there,
 
 in order to make it easier for people to contribute to the 
 documentation and
 examples I thought about the potential benefits of moving these to a 
 git
 based repository instead of svn. The main idea behind that is to allow
 people to contribute via github opening another channel of 
 communication and
 making it easier to contribute.
 
 Proposed names are pdfbox-docs and pdfbox-examples. Take a look at
 https://github.com/apache/cordova-docs for an example of that.
 
 I haven’t thought about all potential implications and changes 
 necessary yet
 but wanted to get a first feedback about support for that idea before
 putting more effort into that.
 
 WDYT?
 
 Maruan

Re: [DISCUSS] move documentation and examples to git

2014-09-21 Thread Maruan Sahyoun

Apache Camel, for example, uses JIRA for issue tracking. They do not use GitHub's 
issue management, and they do accept pull requests.

And from the infra blog 
https://blogs.apache.org/infra/entry/improved_integration_between_apache_and

Any Pull Request that gets opened, closed, reopened or commented on now gets 
recorded on the project's mailing list
If a project has a JIRA instance, any PRs or comments on PRs that include a 
JIRA ticket ID will trigger an update on that specific ticket

I don’t get your point.

BR

Maruan

On 21.09.2014 at 21:42, John Hewson j...@jahewson.com wrote:

 I’d think if projects such as Apache Camel, Apache Jackrabbit, Apache Tomee, 
 Apache Cordova to mention some can handle it we should be smart enough to 
 handle it too.
 
 None of those projects make use of file attachments for issues the way that 
 we do.
 
 I can’t see the issues tab for these projects but pull requests.
 
 Is exactly my point - we’re forced to use GitHub issues for pull requests, 
 which is a problem because then we don’t get to manage these via JIRA. 
 Looking at these projects all of them have had pull requests which do not 
 contain any references to JIRA issues but have been merged in, so it seems 
 certain that we would lose JIRA as a central point of information.
 
 -- John
 
 On 20 Sep 2014, at 04:24, Maruan Sahyoun sahy...@fileaffairs.de wrote:
 
 I’d think if projects such as Apache Camel, Apache Jackrabbit, Apache Tomee, 
 Apache Cordova to mention some can handle it we should be smart enough to 
 handle it too. And I can’t see the issues tab for these projects but pull 
 requests.
 
 BR
 Maruan
 
 On 20.09.2014 at 04:22, John Hewson j...@jahewson.com wrote:
 
 Issue tracking would still be done using Jira. Same as for most other 
 Apache projects
 
 The problem with that approach is that GitHub’s pull requests can only be 
 managed via GitHub’s issues interface, so we’re forced to use it. There’s 
 no way to prevent GitHub users from opening and discussing issues in pull 
 requests rather than on JIRA.
 
 -- John
 
 On 17 Sep 2014, at 21:58, Maruan Sahyoun sahy...@fileaffairs.de wrote:
 
 
 
 Maruan Sahyoun
 
 On 18.09.2014 at 02:03, John Hewson j...@jahewson.com wrote:
 
 I agree with Tilman on this point, the examples need to stay in the trunk 
 where they can be built along with it.
 It’s very common to modify an example to take into account API changes. 
 They’re also currently distributed along with the main PDFBox source 
 bundle, which is a good thing.
 
 I’d be surprised if anybody outside of the project wanted to contribute 
 to the documentation, almost nobody seems to like writing it. Perhaps we 
 could do this as a trial - see if it really increases contributions or 
 not? It would be great if it did.
 
 
 OK, so let's try with the docs.
 
 For completeness: the build process for the web site and the 
 documentation it contains will still be handled by the Apache CMS.
 
 It’s worth adding that I’m (reluctantly) against moving PDFBox trunk over 
 to GitHub because GitHub Issues is not powerful enough for our needs 
 (e.g. no file attachments), which is really a shame.
 
 
 Issue tracking would still be done using Jira. Same as for most other 
 Apache projects
 
 -- John
 
 On 17 Sep 2014, at 10:26, Tilman Hausherr thaush...@t-online.de wrote:
 
 Hi Maruan,
 
 The examples only.
 
 With the docs I assume you mean the website. I've never touched it 
 (although I might in the future), it isn't part of the project, so I 
 don't mind.
 
 Tilman
 
 On 17.09.2014 at 19:01, Maruan Sahyoun wrote:
 is that because of the examples, the docs or both?
 
 BR
 
 Maruan
 
 On 17.09.2014 at 18:46, Tilman Hausherr thaush...@t-online.de wrote:
 
 It is an "I don't like it, but I can live with it", but I think it might 
 be a pain. A soft -1.
 
 Tilman
 
 On 17.09.2014 at 08:40, Andreas Lehmkühler wrote:
 Hi,
 
 On 16 September 2014 at 18:03, Tilman Hausherr thaush...@t-online.de wrote:
 
 
 -1, I don't like the idea to have different repository types.
 Hmmm, is this just an "I don't like it, but I can live with it", or is 
 it a clear veto?
 
 In case of a veto, how about starting by moving parts of the docs to a new
 git repo? IMO sooner or later the project will move from svn to git, and that
 would be a good opportunity to get used to git in general and, of course,
 to the special processes used here at the ASF, so that we are not thrown in at
 the deep end after the migration.
 
 Tilman
 BR
 Andreas
 
 On 16.09.2014 at 10:21, Maruan Sahyoun wrote:
 Hi there,
 
 in order to make it easier for people to contribute to the 
 documentation and
 examples I thought about the potential benefits of moving these to 
 a git
 based repository instead of svn. The main idea behind that is to 
 allow
 people to contribute via github opening another channel of 
 communication and
 making it easier to contribute.
 
 Proposed names are pdfbox-docs and pdfbox-examples

Re: Reopen PDFBOX-483?

2010-03-08 Thread Maruan Sahyoun

Hi Andreas,

I can run a test on our Windows test server (Windows 2003, 32 bit) and let you 
know the results around lunch time (German time), if that helps.

Maruan Sahyoun

On 09.03.2010 at 08:11, Andreas Lehmkuehler wrote:

 Hi,
 
 steve poling wrote:
 Andreas Lehmkuehler wrote:
 If you go to PDFBOX-490 
 https://issues.apache.org/jira/browse/PDFBOX-490, you'll find attached 
 file filled.pdf that manifests this error, but I've been seeing this with 
 a lot of different PDFs: display looks good, print looks bad. I can 
 attach another file to PDFBOX-483 
 https://issues.apache.org/jira/browse/PDFBOX-483 if you'd like.
 I've tried that pdf and it works like a charm except for some misplaced 
 characters. I'm using ubuntu linux, java 1.6.0_15 32bit and a HP Laserjet 
 2550N.
 I've made another test on my MacBook (MacOSX 10.6., jdk 1.6.0_17 64bit, 
 same printer) and it works well too.
 I'd like to know if anyone has repeated the experiment on any Windows-based 
 platform, since Ubuntu and OSX are both Unix-like. If someone else can 
 reproduce the failure on Windows, I'll start trusting my sanity again.
 I've been a software developer for many years, and sometimes it leads to
 insanity, but we all have to do our best not to end up in the programmers'
 nuthouse ;-))
 
 I'll see if I can find some time to run that test on my rarely used windows 
 box.
 
 BR
 Andreas Lehmkühler

Re: pdfbox develpment

2010-03-09 Thread Maruan Sahyoun

Hi ,

I started with the documentation of some tools and opened an issue in JIRA for 
that (PDFBOX-653). Please let me know if that workflow is OK for you or if I 
should use a different approach. 

Kind regards
 
Maruan Sahyoun

On 09.03.2010 at 09:37, Andreas Lehmkühler wrote:

 Hi,
 
 Subject: Re: pdfbox develpment
 Sent: Tue, 09 Mar 2010
 From: Maruan Sahyoun sahy...@fileaffairs.de
 
 Hi,
 
 we were looking to start fixing some of the open issues but can instead
 develop some small tutorials for common tasks like text extraction, forms
 handling and highlighting.
 
 WDYT
 Sounds good to me. Some of the command line utilities are already described 
 at [1] and
 some other documentation can be found at [2], so that will be a good point to 
 start.
 IMHO, the following command line tools should be described anyway:
 
 - PDFSplit, PDFMerger, Overlay
 - PDFReader
 - PDFDebugger
 
 These can be found here [3]. Probably we should describe some/all of the 
 examples
 which can be found here [4]. The sources for the documentation itself can be 
 found here [5]
 
 BR
 Andreas Lehmkühler
 
 [1] http://pdfbox.apache.org/commandlineutilities/index.html
 [2] http://pdfbox.apache.org/userguide/index.html
 [3] http://svn.apache.org/viewvc/pdfbox/trunk/src/main/java/org/apache/pdfbox/
 [4] 
 http://svn.apache.org/viewvc/pdfbox/trunk/src/main/java/org/apache/pdfbox/examples/
 [5] http://svn.apache.org/viewvc/pdfbox/trunk/src/site/
 
 Kind regards
 
 Maruan Sahyoun
 
 On 09.03.2010 at 07:58, Andreas Lehmkuehler wrote:
 
 Hi,
 
 Michael Müller wrote:
 Daniel,
 Yes, I found some activity on the lists. But the project site lists
 neither developers nor committers. Just missing documentation? ;-)
 Great to hear this project is alive.
 I have big problems using it, due to missing or vague docs.
 E.g. setTextMatrix:
 public void setTextMatrix(double a, double b, double c, double d, double
 e, double f)
 What are a, b, c, d, e and f? I figured out that e and f are coordinates. It
 would be much better to name these x and y or to improve the documentation.
 These values correspond to the naming used in the pdf reference for a
 matrix.
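
To make that mapping concrete: the PDF reference writes a matrix as
[a b c d e f], and a point (x, y) is mapped to
(a*x + c*y + e, b*x + d*y + f). So a and d scale, b and c rotate or skew,
and e and f translate, which is why e and f look like coordinates. The
sketch below illustrates this with java.awt.geom.AffineTransform, whose
six-argument constructor takes the values in the same order. The class and
method names (TextMatrixSketch, apply) are made up for this illustration and
are not part of PDFBox.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

public class TextMatrixSketch {

    /** Applies the PDF-style matrix [a b c d e f] to the point (x, y). */
    static Point2D apply(double a, double b, double c, double d,
                         double e, double f, double x, double y) {
        // AffineTransform(m00, m10, m01, m11, m02, m12) computes
        // x' = m00*x + m01*y + m02 and y' = m10*x + m11*y + m12,
        // which is the same as the PDF convention for a..f.
        AffineTransform matrix = new AffineTransform(a, b, c, d, e, f);
        return matrix.transform(new Point2D.Double(x, y), null);
    }

    public static void main(String[] args) {
        // No scaling or rotation, only a translation: the text origin
        // moves to e = 100, f = 700.
        System.out.println(apply(1, 0, 0, 1, 100, 700, 0, 0));

        // 90 degree counter-clockwise rotation (a = cos, b = sin,
        // c = -sin, d = cos): the x axis ends up pointing along y.
        System.out.println(apply(0, 1, -1, 0, 0, 0, 1, 0));
    }
}
```

For plain horizontal text, a = d = 1 and b = c = 0, so only e and f matter.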
 
 Maybe enhancing the documentation is an entry point for me to support the
 project? Or does any documentation exist besides the published Javadocs?
 Be our guest, a good and complete documentation is always useful,
 especially
 for beginners.
 
 BR
 Andreas Lehmkühler
 
 
 
 --- end of original message

Re: Reopen PDFBOX-483?

2010-03-09 Thread Maruan Sahyoun

Hi ,

please find enclosed the text extracted from the printed PDF. Extraction was 
done using Adobe Acrobat 8.

X0X0X0 X0X0X05 
X0X0X0 X0X0X05 
X0X0X0 X0X0X05 
X0X0X05 MM/DD/ X0X2 
X0X2 
X0X0X0X X0X0X0X 
X0X0X05 X0X0X05 
X0X0X05 X0X0X05 
X0X0X0X X0X0X05 
X0X0X05 

X0X0 
X0X0 
X05 X0X0X05 
MM/DD/ X0X0X05 X0X0 


Maruan Sahyoun



Managing Director: Maruan Sahyoun
Commercial Register: AG Düsseldorf, HRB 53837
VAT ID: DE248275827

On 09.03.2010 at 13:45, Maruan Sahyoun wrote:

 Hi Andreas,
 
 yes, the results are similar BUT most of the text and some of the lines are 
 missing. Converting to Image output using PDFToImage provides a different and 
 much better result where all text and lines are included and only some 
 misplacement occurs. Is there a way to submit the attachment so you can see 
 for yourself?
 
 Maruan Sahyoun
 
 On 09.03.2010 at 13:38, Andreas Lehmkühler wrote:
 
 Hi,
 
 Subject: Re: Reopen PDFBOX-483?
 Sent: Tue, 09 Mar 2010
 From: Maruan Sahyoun sahy...@fileaffairs.de
 
 Hi,
 
 please find enclosed the result of the printing test conducted on 
 
 Windows 2003 Server SP2 32 bit, Java 1.5 using a fresh build from trunk. The
 test was done using the Adobe PDF printer driver as well as Apple and HP
 Postscript printers with similar results.
 Thanks for testing. Your attachments didn't make it due to some restrictions 
 of the mailing list.
 Probably it would be sufficient to describe the results. Let me guess, they 
 are all similar. All
 contain text, some characters are misplaced and a wrong font is used.
 
 BR
 Andreas Lehmkühler

Re: Reopen PDFBOX-483?

2010-03-10 Thread Maruan Sahyoun

Hi,

I did some initial debugging and it seems that the content of the form fields 
(date part) is being printed, but the form template itself, which is held in 
Pages:Kids:Resources:XObject, is not printed. Unfortunately, as I'm still 
learning the PDFBox code, I can't provide more help at this point.

Kind regards

Maruan



On 09.03.2010 at 21:01, Andreas Lehmkuehler wrote:

 Hi,
 
 steve poling wrote:
 Andreas Lehmkuehler wrote:
 If you go to PDFBOX-490 
 https://issues.apache.org/jira/browse/PDFBOX-490, you'll find attached 
 file filled.pdf that manifests this error, but I've been seeing this with 
 a lot of different PDFs: display looks good, print looks bad. I can 
 attach another file to PDFBOX-483 
 https://issues.apache.org/jira/browse/PDFBOX-483 if you'd like.
 I've tried that pdf and it works like a charm except for some misplaced 
 characters. I'm using ubuntu linux, java 1.6.0_15 32bit and a HP Laserjet 
 2550N.
 I've made another test on my MacBook (MacOSX 10.6., jdk 1.6.0_17 64bit, 
 same printer) and it works well too.
 I'd like to know if anyone has repeated the experiment on any Windows-based 
 platform, since Ubuntu and OSX are both Unix-like. If someone else can 
 reproduce the failure on Windows, I'll start trusting my sanity again.
 Good news Steve you're obviously not insane. ;-) Maruan confirmed your issue 
 on
 W2K and I've tested it on my WinXP with jdk 1.6.0_13 with the same result. The
 print looks bad. I have no explanation yet, except that it seems to be windows
 only. For now I don't have a clue where to look. Perhaps I will have an idea 
 in
 a few days ...
 
 BR
 Andreas Lehmkühler

PDFBox documentation (PDFBOX-661)

2010-03-12 Thread Maruan Sahyoun

Hi,

I've added a patch under PDFBOX-661 implementing some of the changes to the 
documentation for your review. Please let me know if the changes are in line 
with your thoughts. If they are, I'll move forward and complete the task.

Kind regards
  
Maruan Sahyoun

Re: Reopen PDFBOX-483?

2010-03-23 Thread Maruan Sahyoun

Hi Andreas,

that's good news. Congrats that you found the issue.

Maruan Sahyoun

On 23.03.2010 at 20:00, Andreas Lehmkuehler wrote:

 Hi,
 
 
 Maruan Sahyoun wrote:
 Hi,
 FYI - using PDFReader the PDF is displayed OK but when printed the same 
 results are produced as with PrintPDF. The printed output contains the 
 variable data only (and some lines), Boilerplate text is not printed.  
 That was a hard nut to crack, but I guess it's done. With PDFBOX-632 resolved,
 it works for me on Windows!
 
 BR
 Andreas Lehmkühler
 
 Maruan Sahyoun
 On 09.03.2010 at 13:58, Andreas Lehmkühler wrote:
 Hi,
 
 Subject: Re: Reopen PDFBOX-483?
 Sent: Tue, 09 Mar 2010
 From: Maruan Sahyoun sahy...@fileaffairs.de
 
 Hi ,
 
 please find enclosed the text extracted from the printed PDF. Extraction 
 was
 done using Adobe Acrobat 8.
 
 X0X0X0 X0X0X05
 X0X0X0 X0X0X05
 X0X0X0 X0X0X05 
 X0X0X05 MM/DD/ X0X2 X0X2 
 X0X0X0X X0X0X0X
 X0X0X05 X0X0X05
 X0X0X05 X0X0X05
 X0X0X0X X0X0X05 
 X0X0X05 
 X0X0 X0X0 X05 
 X0X0X05 MM/DD/ 
 X0X0X05 X0X0 
 Hmm, that's odd. I'll run my own tests later when I'm at home. In the end this 
 seems to be a Windows-only issue. I'll also file an issue on JIRA.
 
 Thanks for the tests!
 
 BR
 Andreas Lehmkühler
 
 Maruan Sahyoun
 
 
 
 Managing Director: Maruan Sahyoun
 Commercial Register: AG Düsseldorf, HRB 53837
 VAT ID: DE248275827
 
 On 09.03.2010 at 13:45, Maruan Sahyoun wrote:
 
 Hi Andreas,
 
 yes, the results are similar BUT most of the text and some of the lines
 are missing. Converting to Image output using PDFToImage provides a
 different and much better result where all text and lines are included and
 only some misplacement occurs. Is there a way to submit the attachment so
 you can see for yourself?
 Maruan Sahyoun
 
 On 09.03.2010 at 13:38, Andreas Lehmkühler wrote:
 
 Hi,
 
 Subject: Re: Reopen PDFBOX-483?
 Sent: Tue, 09 Mar 2010
 From: Maruan Sahyoun sahy...@fileaffairs.de
 
 Hi,
 
 please find enclosed the result of the printing test conducted on 
 Windows 2003 Server SP2 32 bit, Java 1.5 using a fresh build from trunk.
 The
 test was done using the Adobe PDF printer driver as well as Apple and
 HP
 Postscript printers with similar results.
 Thanks for testing. Your attachments didn't make it due to some
 restrictions of the mailing list.
 Probably it would be sufficient to describe the results. Let me guess,
 they are all similar. All
 contain text, some characters are misplaced and a wrong font is used.
 
 BR
 Andreas Lehmkühler
 
 
 --- end of original message

PageDrawer renders page twice

2010-03-31 Thread Maruan Sahyoun

Hi,

during my debugging of PrintPDF I saw that text is printed twice, i.e. all 
strings are printed by writeFont from the top of the page to the end and then 
again. Is that by design, or should I start looking into why it is happening? 
Initial debugging showed that the processing already starts repeating in 
PageDrawer.processTextPosition().

Kind regards

Maruan Sahyoun

Re: Commons Logging in PDFBox -- how do I turn it on for my application?

2010-04-01 Thread Maruan Sahyoun

Hi 

e.g. at the command line you can start your application with java 
-Djava.util.logging.config.file=logging.properties
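
If passing a system property is not convenient, the same configuration can
also be loaded from inside the application. The following is only a sketch:
it assumes that Commons Logging ends up delegating to java.util.logging
(which it normally does when Log4j is not on the classpath), and the class
name LoggingSetup and the logger name org.apache.pdfbox are chosen for
illustration. The property keys are the same ones you would put into
logging.properties.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.logging.Level;
import java.util.logging.LogManager;
import java.util.logging.Logger;

public class LoggingSetup {

    /** Loads a java.util.logging configuration from a string rather than a file. */
    static void configure(String properties) {
        try {
            LogManager.getLogManager().readConfiguration(
                    new ByteArrayInputStream(properties.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** True if the named logger would emit FINE (debug-level) messages. */
    static boolean fineEnabled(String loggerName) {
        return Logger.getLogger(loggerName).isLoggable(Level.FINE);
    }

    public static void main(String[] args) {
        // Same content as a logging.properties file: console output,
        // INFO by default, FINE for everything below org.apache.pdfbox.
        configure("handlers=java.util.logging.ConsoleHandler\n"
                + ".level=INFO\n"
                + "java.util.logging.ConsoleHandler.level=FINE\n"
                + "org.apache.pdfbox.level=FINE\n");
        Logger.getLogger("org.apache.pdfbox").fine("logging is switched on");
    }
}
```

Call configure(...) once at startup, before the first PDFBox class is used.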

Hope that helps

Kind regards

Maruan Sahyoun

On 01.04.2010 at 20:37, Daniel Wilson wrote:

 In the tests run by JUnit, we have logging turned on ... with this:
  <junit ...>
    <sysproperty key="java.util.logging.config.file"
                 value="src/test/resources/logging.properties"/>
    ...
  </junit>
 
 in the ANT script.
 
 But, how would I tell my (real, not JUnit) application to turn logging on?
 
 After that, I need to figure it out in the .Net version ... but hopefully
 that will be the same.
 
 Thanks!
 
 Daniel

Text rendering modes in PDFBox (PDFBOX-678)

2010-04-06 Thread Maruan Sahyoun

would someone like to comment on PDFBOX-678 or shall I simply move forward and 
start implementing it as proposed?

Maruan Sahyoun

Re: [jira] Issue Comment Edited: (PDFBOX-686) Invalid text rendering while printing a PDF

2010-04-08 Thread Maruan Sahyoun

I already made a patch for that for another reported bug, but it's not  
available in trunk. The issue is with PDFStreamEngine. I'll attach the  
patch to that issue later today.


Maruan

On 08.04.2010 at 13:56, Bertrand GILLIS (JIRA)  
j...@apache.org wrote:




   [ https://issues.apache.org/jira/browse/PDFBOX-686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854913#action_12854913 
 ]


Bertrand GILLIS edited comment on PDFBOX-686 at 4/8/10 11:55 AM:
-

A printscreen image with the text rendering issue.

 was (Author: bgillis):
   A printscreen image whith the text rendering issue.


Invalid text rendering while printing a PDF
---

   Key: PDFBOX-686
   URL: https://issues.apache.org/jira/browse/PDFBOX-686
   Project: PDFBox
Issue Type: Bug
  Affects Versions: 1.0.0, 1.1.0
   Environment: Windows XP SP3 32 bit
Sun JDK 1.6.0_19
  Reporter: Bertrand GILLIS
   Fix For: 1.2.0

   Attachments: sample.jpg, sample.pdf, sample.xps


The space between the last character and the previous character at  
the end of a line of text is expanded or shrunk by 2px depending  
on the printer selected.

Steps to reproduce:
- create a pdf with 1 page
- add a phrase that wraps over at least 2 lines
- print the pdf page through org.apache.pdfbox.PrintPDF


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Re: [jira] Commented: (PDFBOX-688) Refactoring rendering-related classes/methods for extensibility

2010-04-09 Thread Maruan Sahyoun

Hi Daniel,

as I'm currently looking at implementing support for some more text rendering
modes in PageDrawer (PDFBOX-678) I would like to understand if that might
affect the .NET Version. Although I don't have a completed version this is a
list of the potential operations I will be using.

* generating a Shape based on TextLayout.getOutline()
* filling, drawing and clipping using that Shape
* possibly AlphaComposite
* possibly GlyphVector
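
As a rough illustration of the first two points, here is a stand-alone
Java2D sketch, not PDFBox code. The names OutlineTextSketch, outline and
paint are made up for this example. The mode numbers follow the PDF text
rendering modes (0 fill, 1 stroke, 2 fill then stroke, 3 invisible, 4 to 7
the same plus adding the glyphs to the clipping path); real PDF rendering
uses separate fill and stroke colours and applies the clip at the end of the
text object, which this sketch leaves out.

```java
import java.awt.Color;
import java.awt.Font;
import java.awt.Graphics2D;
import java.awt.Shape;
import java.awt.font.FontRenderContext;
import java.awt.font.TextLayout;
import java.awt.geom.AffineTransform;
import java.awt.image.BufferedImage;

public class OutlineTextSketch {

    /** Returns the glyph outlines of the text, with the baseline origin at (x, y). */
    static Shape outline(String text, Font font, FontRenderContext frc, float x, float y) {
        TextLayout layout = new TextLayout(text, font, frc);
        return layout.getOutline(AffineTransform.getTranslateInstance(x, y));
    }

    /** Paints the glyph outlines according to a PDF text rendering mode (0..7). */
    static void paint(Graphics2D g, Shape glyphs, int mode) {
        boolean fill = mode == 0 || mode == 2 || mode == 4 || mode == 6;
        boolean stroke = mode == 1 || mode == 2 || mode == 5 || mode == 6;
        if (fill) {
            g.fill(glyphs);
        }
        if (stroke) {
            g.draw(glyphs);
        }
        if (mode >= 4) {
            g.clip(glyphs); // intersect the current clip with the glyph shapes
        }
    }

    public static void main(String[] args) {
        BufferedImage image = new BufferedImage(300, 100, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = image.createGraphics();
        Font font = new Font(Font.SANS_SERIF, Font.PLAIN, 36);
        Shape glyphs = outline("PDFBox", font, g.getFontRenderContext(), 20f, 60f);
        g.setColor(Color.WHITE);
        paint(g, glyphs, 2); // fill, then stroke
        g.dispose();
    }
}
```

Because the text becomes a Shape, a downstream PDF printer no longer sees it
as selectable text, which is the trade-off against drawString.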

Are there things I should avoid?

Kind regards

Maruan Sahyoun

On 09.04.2010 at 18:18, Daniel Wilson (JIRA) wrote:

[
https://issues.apache.org/jira/browse/PDFBOX-688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855454#action_12855454
]

Daniel Wilson commented on PDFBOX-688:
--

928957 -- Make page and pageSize available (again) for
libraries/applications that inherit.
931616 -- Stroke line width/style modifications.
931633 -- Invoke / drawImage
932179 -- Don't fail to BLACK quite so quickly ... do some more intelligent
guessing.
Necessary when implementing in .Net as there are still some
key things IKVM is missing.

Refactoring rendering-related classes/methods for extensibility
---

Key: PDFBOX-688
URL: https://issues.apache.org/jira/browse/PDFBOX-688
Project: PDFBox
Issue Type: Improvement
Reporter: Daniel Wilson
Assignee: Daniel Wilson
Priority: Minor

Some of the classes/methods in the rendering area assume they have access to
a Graphics2D object.
This assumption breaks when using the .Net version of PDFBox. Some
judicious refactoring permits PageDrawer to be extended in .Net and key
methods to be overridden.
I am continuing this refactoring for better rendering support in .Net.
Andreas recently asked that code committed to SVN also be tied to a Jira
issue -- a good idea really -- so I'm putting this in as an issue.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Re: [jira] Commented: (PDFBOX-688) Refactoring rendering-related classes/methods for extensibility

2010-04-12 Thread Maruan Sahyoun

Hi Daniel,

I think it's a good first step. Having thought about it a little more, the 
question for me is whether it makes sense to refactor even further, as 
the font handling still contains platform-specific code now that we have 
getawtFont. What if the PD classes deal with PDF only, and application- or 
platform-specific code ends up in PageDrawer and in some new classes (let's call 
them platform classes for the moment)? PageDrawer would then ask the platform 
classes for the font, and they in turn would use the PDxxx classes to get the 
information they need. Currently PDxxx is a mixture of PDF-specific classes and 
routines and non-PDF-specific things like returning a Java font. Factoring the 
non-PDF code out would give a cleaner separation and also make it possible to 
add implementations for specific applications. For example, if we wanted to 
render PDF to HTML, we could implement an HTML rendition without touching the 
core PDF code.
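
A minimal sketch of what such a split could look like. All names here
(LayeringSketch, PdfFontInfo, FontMapper, AwtFontMapper, CssFontMapper) are
hypothetical and only illustrate the idea; none of them exist in PDFBox.

```java
import java.awt.Font;

public class LayeringSketch {

    /** Core layer: what the PDF says about a font, with no AWT or HTML types. */
    static final class PdfFontInfo {
        final String baseFontName;
        final float size;

        PdfFontInfo(String baseFontName, float size) {
            this.baseFontName = baseFontName;
            this.size = size;
        }
    }

    /** Platform layer: turns the PDF-level description into a target-specific font. */
    interface FontMapper<T> {
        T map(PdfFontInfo info);
    }

    /** Java2D target, e.g. for a PageDrawer-style renderer. */
    static final class AwtFontMapper implements FontMapper<Font> {
        @Override
        public Font map(PdfFontInfo info) {
            return new Font(info.baseFontName, Font.PLAIN, 1).deriveFont(info.size);
        }
    }

    /** HTML target: the same core data, expressed as a CSS declaration. */
    static final class CssFontMapper implements FontMapper<String> {
        @Override
        public String map(PdfFontInfo info) {
            return "font: " + info.size + "pt '" + info.baseFontName + "'";
        }
    }

    public static void main(String[] args) {
        PdfFontInfo info = new PdfFontInfo("Helvetica", 12f);
        System.out.println(new AwtFontMapper().map(info));
        System.out.println(new CssFontMapper().map(info));
    }
}
```

The core class knows nothing about AWT, so a .NET or HTML renderer would only
need its own mapper.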

Whether that's a good idea really depends on where PDFBox should go. Looking 
at the issues in JIRA and the users mailing list, I see a number of 
different types of applications where people are trying to use PDFBox, from 
text extraction to printing to a Reader-type application. In addition, 
some core functionality is still missing. With layers like 
core, platform and application, I think development would benefit.

As I'm new to PDFBox (and as I wouldn't say that a) I have seen all code and b) 
understood how all the stuff works together) this reflects my current 
understanding.

And I'm also not a Java expert (although I do most of my development in Java).

Last comment ;-) I would call the method getAwtFont and not getawtFont.

Kind regards 

Maruan Sahyoun

On 12.04.2010 at 15:20, Daniel Wilson wrote:

 What I have done, I consider a step in the right direction, but you may have
 some better code.  I do not develop much in Java, so sometimes I do things
 in ways that are not that elegant.
 
 I skipped the Type3 fonts in what I did.  I saw what was going on and decided
 I had no idea how to handle it!
 
 As for drawString vs TextLayout.getOutline, I really don't know.
 
 Sorry I didn't read your comments in 678 well enough.  I would have
 collaborated on the getAwtFont work!
 
 Daniel
 
 On Sun, Apr 11, 2010 at 3:04 AM, Maruan Sahyoun sahy...@fileaffairs.de wrote:
 
 Hi Daniel,
 
 I think we are currently trying to do the same as I also started
 implementing a getAwtFont Method ;-) as outlined in my comments for
 PDFBOX-678 in order to get all drawing for the different text modes done in
 PageDrawer itself (I think we share the same general idea that there should
 be a clearer separation of concerns). I already have that working for
 TrueType fonts (just copied the code in writeText into the new method) and
 the non clipping text modes. The only difficulty I see is handling e.g.
 Type3 fonts as they can not be so easily converted to a font. Maybe we share
 ideas how to deal with these and then make a decision who implements what in
 order to avoid duplication of efforts. I'm happy to just rely on your
 getAwtFont implementation as you might be further down the road.
 
 One question when drawing text in PageDrawer is how text handling should be
 done in general. E.g. using drawString is faster and produces text objects
 which can be selected for example when you print to a PDF printer. But
 outlines etc. are not possible that way. There I can either use
 TextLayout.getOutline() to draw the outline (and combine that with
 drawString to get selectable text) or selectable text as a result of
 PageDrawer is not important at all. This will then also affect possible
 applications in PDFReader which currently is display only - but what is the
 idea with that further down the road.
 
 Maybe there we should also share some thoughts as you will have a much
 better idea about the longer term plan for PDFBox as I'm new to that
 project.
 
 Kind regards
 
 Maruan Sahyoun
 
 On 11.04.2010 at 04:32, Daniel Wilson wrote:
 
 Thanks, Maruan.
 
 The big thing to avoid is direct access to a graphics object in an object
 other than PageDrawer.  I inherit from PageDrawer and override many of
 the
 methods, and I believe anyone else who wishes to use PDFBox for rendering
 in
 .Net would need to do the same.
 
 A big hint that direct access to a graphics object is coming is a line of
 code like
 Graphics2D graphics = ((PageDrawer)context).getGraphics();
 
 If that line tries to execute in .Net ... it will return a NULL ... and
 then
 you get NullPointerExceptions.
 
 Better to keep the graphics code in PageDrawer.
 
 The refactoring of some of the Font stuff I'm about to commit doesn't
 completely do this ... but it does provide a getawtFont routine that can
 be
 called from .Net, permitting the actual graphics stuff down in
 PDSimpleFont
 to be avoided.
 
 Daniel
 
 On Fri, Apr 9, 2010 at 2:44 PM, Maruan

Re: [jira] Commented: (PDFBOX-688) Refactoring rendering-related classes/methods for extensibility

2010-04-12 Thread Maruan Sahyoun

just wanted to share my thoughts ;-) 

Maruan Sahyoun


On 12.04.2010 at 18:03, Daniel Wilson wrote:

 No objection to the capitalization change.  As I just submitted this last
 night, I am probably the only one w/ anything depending on that name.
 
 I think your view of the separation of platform classes from PDF classes
 makes a lot of sense.
 
 My use of PDFBox is fairly narrow (as is that of many users), so I would
 like to hear from Andreas or Jukka before committing to anything too major,
 though.
 
 Daniel
 
 On Mon, Apr 12, 2010 at 10:56 AM, Maruan Sahyoun 
 sahy...@fileaffairs.de wrote:
 
 Hi Daniel,
 
 I think it's a good first step. Having thought about that a little more the
 question to me becomes if it makes sense to even refactor that a bit more as
 the font handling still contains some platform specific things as now we
 have getawtFont. What if the PD classes are PDF related and application or
 platform specific stuff ends in PageDrawer and some new classes let's call
 them platform classes for the moment. This way PageDrawer would use the
 platform classes to get the font which then would use the PDxxx classes to
 get the information necessary  to generate the right information.
 Currently PDxxx is a mixture of PDF specific classes and routines and non
 PDF specific things like returning a java font. Factoring non PDF related
 stuff out would provide a better separation and also the chance to provide
 additional implementations for specific applications e.g. let's say we would
 like to render PDF to HTML we could implement  a HTML rendition without
 touching the core PDF stuff.
 
 If that's a good idea really depends on where PDFBox should go. Currently
 looking at the issues in JIRA and the users mailing list I see a number of
 different types of applications where people are trying to use PDFBox - from
 text extraction to printing to a Reader type application.  In addition to
 that there is also some core functionality missing. If there are layers
 like core, platform and application I think development would benefit.
 
 As I'm new to PDFBox (and as I wouldn't say that a) I have seen all code
 and b) understood how all the stuff works together) this reflects my current
 understanding.
 
 And I'm also not a Java expert (although I do most of my development in
 Java).
 
 Last comment ;-) I would call the method getAwtFont and not getawtFont.
 
 Kind regards
 
 Maruan Sahyoun
 
 On 12.04.2010 at 15:20, Daniel Wilson wrote:
 
 What I have done, I consider a step in the right direction, but you may
 have
 some better code.  I do not develop much in Java, so sometimes I do
 things
 in ways that are not that elegant.
 
 I skipped the Type3 fonts in what I did.  I saw what was going on and
 decided
 I had no idea how to handle it!
 
 As for drawString vs TextLayout.getOutline, I really don't know.
 
 Sorry I didn't read your comments in 678 well enough.  I would have
 collaborated on the getAwtFont work!
 
 Daniel
 
 On Sun, Apr 11, 2010 at 3:04 AM, Maruan Sahyoun sahy...@fileaffairs.de
 wrote:
 
 Hi Daniel,
 
 I think we are currently trying to do the same as I also started
 implementing a getAwtFont Method ;-) as outlined in my comments for
 PDFBOX-678 in order to get all drawing for the different text modes done
 in
 PageDrawer itself (I think we share the same general idea that there
 should
 be a clearer separation of concerns). I already have that working for
 TrueType fonts (just copied the code in writeText into the new method)
 and
 the non clipping text modes. The only difficulty I see is handling e.g.
 Type3 fonts as they can not be so easily converted to a font. Maybe we
 share
 ideas how to deal with these and then make a decision who implements
 what in
 order to avoid duplication of efforts. I'm happy to just rely on your
 getAwtFont implementation as you might be further down the road.
 
 One question when drawing text in PageDrawer is how text handling should
 be
 done in general. E.g. using drawString is faster and produces text
 objects
 which can be selected for example when you print to a PDF printer. But
 outlines etc. are not possible that way. There I can either use
 TextLayout.getOutline() to draw the outline (and combine that with
 drawString to get selectable text) or selectable text as a result of
 PageDrawer is not important at all. This will then also affect possible
 applications in PDFReader which currently is display only - but what is
 the
 idea with that further down the road.
 
 Maybe there we should also share some thoughts as you will have a much
 better idea about the longer term plan for PDFBox as I'm new to that
 project.
 
 Kind regards
 
 Maruan Sahyoun
 
 On 11.04.2010 at 04:32, Daniel Wilson wrote:
 
 Thanks, Maruan.
 
 The big thing to avoid is direct access to a graphics object in an
 object
 other than PageDrawer.  I inherit from PageDrawer and override many of
 the
 methods, and I believe anyone else who wishes to use PDFBox for
 rendering

Re: PDFBox Project for GSoC 2012

2012-03-20 Thread Maruan Sahyoun

Hi,

suggestions:

1) Mapping PDF features to version and standards e.g. layers in PDF are a PDF 
1.5 feature, AES encryption for PDF 1.6. 
2) PDF Writer to support versions and standards
3) OpenType support
4) Performance improvements (e.g. with some applications we developed iText is 
approx. 30% faster merging PDFs than PDFBox)
5) Widgets for PDF generation (Text, Tables …), as there seems to be some demand 
for using PDFBox to generate PDFs from scratch, although I think that one 
could use e.g. FOP for that purpose.
6) Documentation. 

With kind regards

Maruan Sahyoun


On 19.03.2012 at 07:45, Andreas Lehmkuehler wrote:

 Hi,
 
 Am 18.03.2012 03:16, schrieb Tharaka Nayanajith Wijebandara:
 Hi,
 
 
 Thanks mehdi.
 
 
 I have two ideas for a GSoC task, but need all of your help to select
 suitable one.
 
 
- One project is HTML to PDF and vise versa converter. This feature can
be found in JIRA also (https://issues.apache.org/jira/browse/PDFBOX-6,
https://issues.apache.org/jira/browse/PDFBOX-9)
 Good idea, but complicated, as some of the feature you would need aren't yet 
 implemented.
 
- Other one is enhancing features of PDF reader and zooming features,
page display features, bookmark navigator, page thumbnail viewer can be
very much useful. Since I have previous experience in awt, swing and
java2d, it will be easy for me.
 I like this idea. It would be a nice feature.
 
 There might be several other tasks which are important than this. So all of
 you are welcome, to reply with good ideas.
 Yes there are a lot things to do, probably someone else might come up with a 
 wish?
 
 On Sat, Mar 17, 2012 at 5:01 PM, mehdi houshmandmed1...@gmail.com  wrote:
 
 Hi Tharaka,
 
 Have you had any more thoughts on a project you'd like to undertake?
 Have you applied and been through all the admin needed to be accepted
 into GSoC 2012? Let me know if you need any help.
 
 Mehdi
 
 On 9 March 2012 06:25, Andreas Lehmkuehlerandr...@lehmi.de  wrote:
 Hi,
 
 Am 07.03.2012 07:40, schrieb mehdi houshmand:
 
 Hi Andreas,
 
 Sorry, maybe I wasn't clear, I am an ASF committer... Just not to
 PDFBox.. . I do have domain expertise being a full-time FOP developer
 and having dealt with PDFs and fonts quite a bit. Should I pop an
 email to dev-community to see if it's ok? It seems like such a waste
 to have an interested applicant but no mentor...
 
 I'm not an GSoC expert but that sounds good to me. You may double check
 with
 the dev-community, but IMHO it's not necessary.
 I'm glad that you volunteer to help us, thanks in advance. I'll try to
 help
 as much as I can.
 
 SNIP
 
 BR
 Andreas Lehmkühler

Re: PDFBox Project for GSoC 2012

2012-03-20 Thread Maruan Sahyoun

Hi,

 Hi,
 
 Am 18.03.2012 03:16, schrieb Tharaka Nayanajith Wijebandara:
 Hi,
 
 
 Thanks mehdi.
 
 
 I have two ideas for a GSoC task, but need all of your help to select
 suitable one.
 
 
- One project is HTML to PDF and vise versa converter. This feature can
be found in JIRA also (https://issues.apache.org/jira/browse/PDFBOX-6,
https://issues.apache.org/jira/browse/PDFBOX-9)
 Good idea, but complicated, as some of the feature you would need aren't yet 
 implemented.

I think PDF to HTML is a very good idea, even if it will be very limited 
because, as Andreas pointed out, there are some features missing. Maybe these 
can be documented and eventually be implemented.

 
- Other one is enhancing features of PDF reader and zooming features,
page display features, bookmark navigator, page thumbnail viewer can be
very much useful. Since I have previous experience in awt, swing and
java2d, it will be easy for me.
 I like this idea. It would be a nice feature.

Although I think that the current PDF Reader can be enhanced in many ways, 
there are already so many readers out there, as well as PDF support within web 
browsers, that in my personal opinion enhancing PDFBox core capabilities would 
be more beneficial.

With kind regards

Maruan Sahyoun   

 
 There might be several other tasks which are important than this. So all of
 you are welcome, to reply with good ideas.
 Yes there are a lot things to do, probably someone else might come up with a 
 wish?
 
 On Sat, Mar 17, 2012 at 5:01 PM, mehdi houshmandmed1...@gmail.com  wrote:
 
 Hi Tharaka,
 
 Have you had any more thoughts on a project you'd like to undertake?
 Have you applied and been through all the admin needed to be accepted
 into GSoC 2012? Let me know if you need any help.
 
 Mehdi
 
 On 9 March 2012 06:25, Andreas Lehmkuehlerandr...@lehmi.de  wrote:
 Hi,
 
 Am 07.03.2012 07:40, schrieb mehdi houshmand:
 
 Hi Andreas,
 
 Sorry, maybe I wasn't clear, I am an ASF committer... Just not to
 PDFBox.. . I do have domain expertise being a full-time FOP developer
 and having dealt with PDFs and fonts quite a bit. Should I pop an
 email to dev-community to see if it's ok? It seems like such a waste
 to have an interested applicant but no mentor...
 
 I'm not an GSoC expert but that sounds good to me. You may double check
 with
 the dev-community, but IMHO it's not necessary.
 I'm glad that you volunteer to help us, thanks in advance. I'll try to
 help
 as much as I can.
 
 SNIP
 
 BR
 Andreas Lehmkühler

Re: PDFBox Project for GSoC 2012

2012-03-20 Thread Maruan Sahyoun


 snip/
 
 
 Although I think that the current PDF Reader can be enhanced in many ways
 there are already so many Readers out there as well as PDF support within
 web browsers my personal opinion is that enhancing PDFBox core capabilities
 would be more beneficial.
 
 With kind regards
 
 Maruan Sahyoun
 
 
 Check out Jeremias' suggestions of the viewer, it's less of a viewer and
 more of a front-end for a lot of the tools PDFBox has to offer, a PDFBox
 GUI so to speak rather than a PDF viewer.

I'd still look into enhancing PDFBox core, as this will benefit most users. 
Looking at the bugs and issues, most come from core capabilities.

Re: Next release(s)?

2012-04-09 Thread Maruan Sahyoun

I think that going for option 1 is the best approach.

The new NonSequentialParser (PDFBOX-1199) is a huge step forward, reusing the 
'old' codebase and overcoming the main issues resulting from the fact that the 
old parser was sequential and not in line with how PDFs are built. 

Working on the ConformingParser, I've outlined my approach in PDFBOX-1000. As I 
don't want to simply take existing code without revisiting it and making sure 
that conformance is met, I agree with Timo's point that this might affect a 
couple of internal classes. So this is a longer-term goal. With regard to the 
ConformingParser, it would be good to get some more feedback about the current 
approach: once we move forward with ConformingParser - SimpleParser - PDFLexer, 
revisiting that design decision will create a lot of effort.

So doing a 1.7.x release using the current trunk will provide a lot 
of benefits and leave time for redoing a new parser 'from scratch'.

BR
Maruan

On 09.04.2012 at 13:30, Timo Boehme wrote:

 Hi,
 
 I do also prefer option 1. For the conforming parser to be cleanly integrated 
 I assume we will have to adjust a couple of internal classes thus we really 
 should have one (or more) releases before this major release with the 'old' 
 code base.
 
 With the new intermediate 'conforming' parser (PDFBOX-1199) I think we should 
 do a 1.7.x release. While creating a branch to separate next major release 
 would be a cleaner solution I'm afraid that maintaining two branches is 
 currently not doable with the available resources.
 
 
 Best regards,
 Timo
 
 
 Am 08.04.2012 21:26, schrieb Andreas Lehmkuehler:
 when preparing the next board report I was wondering what to write about
 our plans for the next release.
 
 I guess it's obvious that sooner or later we will go for a 2.x release.
 The major release may include the following
 
 - merge/replace Jempbox/Xmpbox
 - remove deprecated stuff
 - move to java6 as minimum requirement
 - switch to the (completed?) conforming parser as default
 - 
 
 IMO we have different options how to do that:
 
 1.
 
 Release a 1.7.x version based on the current trunk. Start with the major
 release using the current trunk.
 
 pros:
 
 - new feature release after 9 months
 - 1.7.x release without much effort
 - enough time for the major release
 - ...
 
 cons:
 
 - 2 XMP libs
 - unstable conforming parser
 - ...
 
 2.
 
 Choose a couple of improvements/fixes from the trunk and apply them to
 the 1.6 branch and release a 1.6.x bugfix or a 1.7.0 feature release.
 Start with the major release using the current trunk.
 
 pros:
 
 - new feature/bugfix release only with chosen features/fixes
 - enough time for the major release
 - no unstable conforming parser, as it wouldn't be part of the release
 - ...
 
 cons:
 
 - 2 XMP libs (if we would do a 1.7.0 release including preflight)
 - a lot of discussion on what will be part of the release and what won't be
 - a lot of work to create the release compaired to alternative 1
 - ...
 
 3.
 
 Drop all 1.6.x/1.7.0 plans and start with the major release using the
 current trunk.
 
 pros:
 
 - we wouldn't have to spend time on a 1.6.x/1.7.0 release
 - ...
 
 cons:
 
 - too much time without release
 - too less time to work on the new major release, because of con 1
 - ...
 
 I prefer option 1, what do you think?
 
 BR
 Andreas Lehmkühler
 
 
 -- 
 
 Timo Boehme
 OntoChem GmbH
 H.-Damerow-Str. 4
 06120 Halle/Saale
 T: +49 345 4780474
 F: +49 345 4780471
 timo.boe...@ontochem.com
 
 _
 
 OntoChem GmbH
 Geschäftsführer: Dr. Lutz Weber
 Sitz: Halle / Saale
 Registergericht: Stendal
 Registernummer: HRB 215461
 _

Re: 1.7 release?

2012-05-06 Thread Maruan Sahyoun


Before integrating the current work at PDFBOX-1000 I would prefer to 

- make sure the lexer is using the new IO classes
- move some parts to the (new) SimpleParser, as e.g. some keywords are already 
handled in the lexer, which is more than the lexer should do imo

regards

Maruan

On 06.05.2012 at 16:46, Andreas Lehmkuehler andr...@lehmi.de wrote:

 Hi,
 
 Am 04.05.2012 15:46, schrieb Timo Boehme:
 Am 03.05.2012 21:04, schrieb Michael McCandless:
 Any guestimates for a 1.7.0 release?
 
 It's been a long time (9 months) since 1.6.0... and I count ~203
 commits since 1.6.0.
 
 There was already some discussion about it (see Re: Next release(s)? dating
 from 2012-04-10) and it is clear that a new version (probably 1.7.0) should 
 be
 released soon.
 IMHO there are some things which should be done before, integrate Maruans 
 latest patch (PDFBOX-1000), improve the TTF-Parser (PDFBOX-490) 
 
 However I think we will wait until the project lead is back online.
 I guess you are adressing me as PMC Chair. I'm afraid there is a
 misunderstanding I'd like to clarify.
 
 There is no concept of leadership within the ASF. An apache project is led by 
 the PMC [1]. The PMC Chair [2] is just the speaker of the project and acts as 
 interface to the board of the foundation. All PMC members [3] including the 
 chair are equal and each of them has one vote.
 
 Kind regards,
 Timo
 
 BR
 Andreas Lehmkühler
 
 [1] http://www.apache.org/foundation/how-it-works.html#pmc
 [2] http://www.apache.org/foundation/how-it-works.html#pmc-chair
 [3] http://www.apache.org/foundation/how-it-works.html#pmc-members

Re: 1.7 release?

2012-05-14 Thread Maruan Sahyoun

the new parser is - unfortunately - still in its early state and not yet 
helpful in any way. I wanted to complete the SimpleParser, which takes the 
tokens from the PDFLexer and creates the COS level objects, this week. All this 
is still in preparation for the ConformingParser.

WRT 1.7 I agree with Timo that the enhancements made so far do justify a new 
release, esp. as the new NonSequentialParser Timo created has already proven to 
solve a number of issues raised. Maybe this could be the default for the time 
being?

regards
 
Maruan

On 14.05.2012 at 09:54, Timo Boehme wrote:

 Hi,
 
 Am 13.05.2012 10:24, schrieb Andreas Lehmkuehler:
 Am 07.05.2012 10:50, schrieb Timo Boehme:
 ...
 In my opinion there are already a number of improvements in current trunk
 compared to 1.6 and there is no reason to not release another 1.8 before
 PDFBOX-1000 is really ready. As I see it we should bump the version to
 2.0 if PDFBOX-1000 finally lands.
 I just thought about a kind of beta version of the new parser, so that
 one can test ist without building its own version.
 
 As I see it we are currently not there. However this is a point Maruan is the 
 only one who knows about current state.
 
 ...
 Nevertheless I'd like to have your opinion on a release and expertise
 doing it :-)
 The release process uses the maven release plugin and therefore it is
 quite easy to perform. If you are interested in acting as release
 manager you have to provide a key which will be used to sign the
 release. This key should be signed by at least one member of The Apache
 Web of Trust, see [1] and [2].
 
 Thanks for the pointers. Since I'm currently a bit short of time I really 
 appreciate that you volunteer as RM.
 
 I'll volunteer as RM for the next release. What do you think about
 cutting the release in one week from now on 22th? As I won't be
 available in the first 2 weeks of june the next reasonable target date
 could be june 26th, if we need some more time to include more stuff.
 
 22nd is perfect for me.
 
 
 Best regards,
 
 Timo
 

Re: Apache PDFBox July 2012 board report due

2012-07-19 Thread Maruan Sahyoun

Hi,

maybe we can join forces here, as I'm currently working on an Xref class which 
parses xref tables and xref streams. One method should also do the mentioned 
scanning.
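
To make the scanning idea from the quoted mail below a bit more concrete, here is a minimal sketch of how object start offsets could be collected from the raw bytes. It is only an illustration and not the Xref class itself: the class name ObjectScanSketch and the method scan are made up, and a real implementation would have to cope with false positives inside binary stream data.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ObjectScanSketch {

    // "<num> <gen> obj", starting at the beginning of the data or after whitespace.
    private static final Pattern OBJ_START =
            Pattern.compile("(?<!\\S)(\\d+)\\s+(\\d+)\\s+obj\\b");

    /** Maps "num gen" to the byte offset where that object definition starts. */
    public static Map<String, Integer> scan(byte[] pdf) {
        // ISO-8859-1 maps every byte to exactly one char, so match offsets are byte offsets.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        Map<String, Integer> offsets = new LinkedHashMap<>();
        Matcher m = OBJ_START.matcher(text);
        while (m.find()) {
            // A later definition of the same object replaces an earlier one.
            offsets.put(m.group(1) + " " + m.group(2), m.start());
        }
        return offsets;
    }

    public static void main(String[] args) {
        byte[] data = "%PDF-1.4\n1 0 obj\n<< >>\nendobj\n2 0 obj\n<< >>\nendobj\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        for (Map.Entry<String, Integer> e : scan(data).entrySet()) {
            System.out.println(e.getKey() + " obj starts at byte " + e.getValue());
        }
    }
}
```

The offsets collected this way could then be used to rebuild the xref entries when the regular table cannot be trusted.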

Kind regards

Maruan Sahyoun

On 19.07.2012 at 09:42, Andreas Lehmkühler andr...@lehmi.de wrote:

 
 On 16 July 2012 at 18:02, Timo Boehme timo.boe...@ontochem.com wrote:
 
 Hi,
 
 Am 16.07.2012 17:48, schrieb Andreas Lehmkuehler:
 Am 10.07.2012 09:16, schrieb Timo Boehme:
 ...
 looks good to me. Some mention about the preflight module which will be
 integrated in the next major release?
 Thanks for your comment. I added some information about preflight/xmpbox
 as you maybe already have seen.
 
 Yes, thank you very much for all the time spending on administrative
 tasks/improvements on PDFBOX.
 
 For the next time I plan to improve on the broken document robustness of
 the parser by doing a first scan over the document (in case of parsing
 failure), collecting object start/end points and using them to repair
 xref table.
 
 
 Seems to be necessary, at least for some PDFs. :-(
 
 
 Another task I would like to do is reducing the amount of memory needed
 by using the existing file as input stream resource instead of copying
 an object stream first to a temporary buffer (in cases where an input
 file exists).
 Maybe for this we should change from assuming to have an input stream to
 assuming we have an input file and if we have an input stream a
 temporary file is created on the fly - WDYT?
 
 
 I guess internally we have to use something abstract and as everything is a
 stream
 the might be a good choice. AFAIU the current implementation, one reason for 
 the
 usage of a temporary buffer is the fact that the data is modified
 (decompressing,
 decrypting) and we must not alter the input data. It is perhaps a better idea 
 to
 somehow split the inputstream and the unfilteredinputstream, e.g. read from 
 the
 inputstream every time an object is dereferenced and store the (decompressed)
 data in the corresponding object.
 
 
 
 Kind regards,
 Timo
 
 
 BR
 Andreas Lehmkühler

ConformingParser (PDFBOX-1000)

2012-07-19 Thread Maruan Sahyoun

Hi there,

resuming work on PDFBOX-1000, I came across the question of how to maintain 
some state within the base components PDFLexer and SimpleParser (which has yet 
to come). 

E.g. in order to differentiate a number from an indirect object I potentially 
have to read three tokens ({num} {gen} obj) to check if {num} is an individual 
number or the start of an indirect object. There are two ways to recover if 
I've read too many tokens and the number was in fact an individual object:

a) depend on file position e.g. filePointer and seek
b) maintain some internal state

I currently tend to go for b) as this would remove the dependency on 
filePointer() and seek() or similar methods, but that means that if the parsing 
has to start from a new point within the file, object etc., there needs to be 
some reset() call to reset the state. Also the caller, e.g. ConformingParser, 
has to make sure that there is some way to reposition the cursor. On the other 
hand, not being dependent on a specific position would enable the PDFLexer and 
SimpleParser to be extended to work on byte[] and similar. 
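
To illustrate option b), here is a minimal sketch of a token source that keeps its look-ahead in an internal buffer instead of seeking in the file. It is not the PDFBOX-1000 code: the names LookaheadSketch, BufferedTokens and readItem are hypothetical, and tokens are plain strings to keep the example short.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

public class LookaheadSketch {

    /** Hypothetical token source with an internal pushback buffer (option b). */
    public static final class BufferedTokens {
        private final Iterator<String> source;
        private final Deque<String> pending = new ArrayDeque<>();

        public BufferedTokens(List<String> tokens) {
            this.source = tokens.iterator();
        }

        /** Returns the next token, or null at end of input. */
        public String next() {
            if (!pending.isEmpty()) {
                return pending.pop();
            }
            return source.hasNext() ? source.next() : null;
        }

        /** Puts a token back so that the next call to next() returns it. */
        public void unread(String token) {
            if (token != null) {
                pending.push(token);
            }
        }

        /** Forgets buffered tokens; to be called when the caller repositions the input. */
        public void reset() {
            pending.clear();
        }
    }

    static boolean isInteger(String token) {
        return token != null && token.matches("\\d+");
    }

    /** Reads one item: "obj(num,gen)" for an indirect object start, else a plain token. */
    public static String readItem(BufferedTokens in) {
        String first = in.next();
        if (!isInteger(first)) {
            return first;
        }
        String second = in.next();
        String third = in.next();
        if (isInteger(second) && "obj".equals(third)) {
            return "obj(" + first + "," + second + ")";
        }
        // Not an indirect object: give the look-ahead back, last read token first.
        in.unread(third);
        in.unread(second);
        return first;
    }

    public static void main(String[] args) {
        BufferedTokens in = new BufferedTokens(Arrays.asList("12", "0", "obj", "42", "7", "R"));
        for (String item = readItem(in); item != null; item = readItem(in)) {
            System.out.println(item);
        }
    }
}
```

The reset() method is the price of this design: whoever repositions the underlying input has to call it, otherwise stale look-ahead tokens would be returned.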

WDYT

Kind regards

Maruan Sahyoun

Re: Help me split PDF by sections

2012-11-14 Thread Maruan Sahyoun

Hi - what do you mean by sections and subsections? Are these bookmarks in the PDF?

With kind regards

Maruan

Re: Help me split PDF by sections

2012-11-14 Thread Maruan Sahyoun

it's possible to split a PDF using PDFBox. The question is how to retrieve the 
information about where to split the PDF. That was the reason for me asking how 
your sections are stored, e.g. as bookmarks, links, text … . If it's a bookmark, 
you need to get the information in a different way than for sections that are 
normal text, where a human can see the division but retrieving it with a 
program might be difficult. There are APIs for retrieving bookmarks and text 
though.

Kind regards


Maruan Sahyoun



On 14.11.2012 at 11:10, Tzali Maimon tzali.mai...@eqsquest.com wrote:

 PDFs are sometimes divided into sections or subjects.
 for example:
 
 Title 1:
  Sub-title:
 some text...
 
  sub title:
 some title
 
 sub-sub-title:
 
 
 I want to split my PDF not by pages but by the this tree of titles. I dont
 know if PDF forces each subject to be a bookmark.
 
 
 On Wed, Nov 14, 2012 at 12:05 PM, Maruan Sahyoun 
 sahy...@fileaffairs.dewrote:
 
 Hi - what do you mean with sections and subsections. Are these bookmarks
 in PDF?
 
 With kind regards
 
 Maruan

Re: Help me split PDF by sections

2012-11-14 Thread Maruan Sahyoun

for working with bookmarks you can look at 
http://pdfbox.apache.org/userguide/bookmarks.html
for how to split a PDF you could use/review  org.apache.pdfbox.util.Splitter 
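
As a rough sketch of the step in between: once each bookmark has been resolved to the page it points to, the page ranges to split at can be computed without any PDF library. The code below assumes that resolution has already been done and that the start pages are zero-based and sorted; the class name BookmarkRanges and the method toRanges are made up for illustration and are not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

public class BookmarkRanges {

    /**
     * Turns sorted, zero-based bookmark start pages into inclusive [first, last]
     * page ranges. Pages before the first bookmark form their own leading range.
     */
    public static List<int[]> toRanges(int[] startPages, int pageCount) {
        List<int[]> ranges = new ArrayList<>();
        if (pageCount <= 0) {
            return ranges;
        }
        int previousStart = 0;
        for (int start : startPages) {
            // Skip bookmarks that share a page with the previous one or point outside the document.
            if (start <= previousStart || start >= pageCount) {
                continue;
            }
            ranges.add(new int[] {previousStart, start - 1});
            previousStart = start;
        }
        ranges.add(new int[] {previousStart, pageCount - 1});
        return ranges;
    }

    public static void main(String[] args) {
        // Bookmarks on pages 0, 3 and 7 of a 10 page document.
        for (int[] range : toRanges(new int[] {0, 3, 7}, 10)) {
            System.out.println("pages " + range[0] + " to " + range[1]);
        }
    }
}
```

Each resulting range can then be handed to whatever does the actual splitting, one output document per range.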

Kind regards

Maruan Sahyoun

On 14.11.2012 at 11:30, Tzali Maimon tzali.mai...@eqsquest.com wrote:

 Thanks for the attention.
 I assume I would like to take a look on both API and decide. It sounds
 though that splitting according to bookmarks is easy so Can you tell me how
 to pull that off plz?
 
 
 On Wed, Nov 14, 2012 at 12:22 PM, Maruan Sahyoun 
 sahy...@fileaffairs.dewrote:
 
 it's possible to split a PDF using PDFBOX. The question is how to retrieve
 the information where to split the PDF. That was the reason for me asking
 how your sections are stored e.g. bookmarks, links, text … . If it's a
 bookmark you need to get the information in a different way than sections
 being normal text where a human can see the division but retrieving that
 with a program might be difficult. There are APIS for retrieving bookmarks
 and text though.
 
 Kind regards
 
 
 Maruan Sahyoun
 
 
 
 Am 14.11.2012 um 11:10 schrieb Tzali Maimon tzali.mai...@eqsquest.com:
 
 PDFs are sometimes divided into sections or subjects.
 for example:
 
 Title 1:
 Sub-title:
some text...
 
 sub title:
some title
 
sub-sub-title:
 
 
 I want to split my PDF not by pages but by the this tree of titles. I
 dont
 know if PDF forces each subject to be a bookmark.
 
 
 On Wed, Nov 14, 2012 at 12:05 PM, Maruan Sahyoun sahy...@fileaffairs.de
 wrote:
 
 Hi - what do you mean with sections and subsections. Are these bookmarks
 in PDF?
 
 With kind regards
 
 Maruan

Re: Problem with PDFont.getStringWidth()

2012-12-05 Thread Maruan Sahyoun

Hi Gerrit,

which version of PDFBox are you using? Could you post a small code snippet to 
reproduce the issue? With a quick test I created, the numbers print fine, 
i.e. they have the same width.

Here is my code:

PDDocument doc = new PDDocument();

PDPage page = new PDPage();
doc.addPage( page );
PDFont font = PDType1Font.HELVETICA;

PDPageContentStream contentStream = new PDPageContentStream(doc, page);
contentStream.beginText();
contentStream.setFont( font, 12 );
contentStream.moveTextPositionByAmount( 100, 700 );
// each line prints a string followed by the width reported by getStringWidth
// (the first two string literals were lost in the archive; "1" and "3" are assumed here)
contentStream.drawString( "1 " + font.getStringWidth("1"));
contentStream.moveTextPositionByAmount( 0, 20 );
contentStream.drawString( "3 " + font.getStringWidth("3"));
contentStream.moveTextPositionByAmount( 0, 20 );
contentStream.drawString( "12 " + font.getStringWidth("12"));
contentStream.moveTextPositionByAmount( 0, 20 );
contentStream.drawString( "32 " + font.getStringWidth("32"));
contentStream.endText();
contentStream.close();
doc.save( "test.pdf" );
doc.close();
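
For the centering itself, a small sketch of the arithmetic may help. getStringWidth reports the width in 1/1000 units of text space, so it has to be scaled by the font size before it can be compared with a rectangle measured in points. The class and method names below are made up, and the raw width of 1112 is an assumption based on the standard Helvetica metrics, where as far as I know every digit has the same advance width of 556 units (which would also explain why "12" and "32" report the same width).

```java
public class CenterText {

    /** Converts a width given in 1/1000 text-space units to points at the given font size. */
    public static float widthInPoints(float widthInThousandths, float fontSize) {
        return widthInThousandths / 1000f * fontSize;
    }

    /** X coordinate at which text of the given width has to start to be centred in a box. */
    public static float centredX(float boxX, float boxWidth, float textWidthInPoints) {
        return boxX + (boxWidth - textWidthInPoints) / 2f;
    }

    public static void main(String[] args) {
        float rawWidth = 1112f;   // assumed value: two digits at 556 units each
        float fontSize = 12f;
        float textWidth = widthInPoints(rawWidth, fontSize);
        float x = centredX(100f, 50f, textWidth);
        System.out.println("text width in points: " + textWidth);
        System.out.println("start x for a 50pt wide box at x=100: " + x);
    }
}
```

The value returned by centredX would then be used as the x argument when positioning the text.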


Kind regards

Maruan Sahyoun


On 05.12.2012 at 14:49, Gerrit Lober gerritlo...@gmx.de wrote:

 Dear all,
 
 I have the following problem with PDFBox. I try to paint a rectangle with 
 PDPageContentStream and then put a figure in the center of this rectangle.
 
 Therefore I try to calculate the width of the text. The method getStringWidth 
 always returns the same width for all figures. That means that I get the 
 same result for "12" and "32". Because the "1" is narrower than the "3", 
 something is not correct and my "12" is a bit too far to the right.
 
 What is the reason for this?
 
 I get the Font with the following code:
 private PDFont getFont() throws IOException {
  return PDType1Font.HELVETICA;
 }
 
 Thanks!

Font handling in PDFBox and XMLGraphics

2013-01-08 Thread Maruan Sahyoun

Hi …,

just wanted to make you aware of a recent discussion going on in FOP about 
maybe using FontBox for font handling. Maybe it's possible to join forces, as 
both projects need to enhance font handling.

http://markmail.org/thread/hkclkqaxlfh5wwvu

Kind regards

Maruan Sahyoun

Reading a Stream reported as EmbeddedFile

2013-01-11 Thread Maruan Sahyoun

Hi,

I have a handling question regarding PDFBox. I'm trying to read an object which 
is defined as COSDictionary{(COSName{Filter}:COSArray{[COSName{FlateDecode}]}) 
(COSName{Length}:COSInt{477}) (COSName{Type}:COSName{EmbeddedFile}) }

How can I get the content of that object?

Kind regards

Maruan Sahyoun

Re: Reading a Stream reported as EmbeddedFile

2013-01-11 Thread Maruan Sahyoun

Hi Andreas,

maybe I should have been clearer in my question. What I'm trying to do is 
read the XFA part of a form,

where the XFA is part of an array

COSString{xdp:xdp}
COSObject{61, 0}
COSString{config}
COSObject{4, 0}
COSString{template}
COSObject{5, 0}
COSString{datasets}
COSObject{62, 0}
COSString{localeSet}
COSObject{7, 0}
COSString{xmpmeta}
COSObject{8, 0}
COSString{xfdf}
COSObject{9, 0}
COSString{form}
COSObject{63, 0}
COSString{/xdp:xdp}
COSObject{64, 0}

Now the array acts as a list of key/value pairs where the odd entry is the key 
(e.g. xdp:xdp) and the even entry is the content of this subsection of the XFA. 
In my sample the content of datasets is contained in 62,0. Now this is a stream 
with the following dictionary  
COSDictionary{(COSName{Filter}:COSArray{[COSName{FlateDecode}]}) 
(COSName{Length}:COSInt{477}) (COSName{Type}:COSName{EmbeddedFile}) }

And this is what I'm trying to read. 

The other possible implementation for an XFA form is that the content is not 
split into individual parts contained in an array but the whole XFA is 
contained in a single stream.

The plan is to provide a patch to extract the XFA and, in another stage, to 
replace the XFA with new content so people using PDFBox can extract data from 
XFA forms and prepopulate XFA forms using PDFBox. 
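
To make the structure described above a bit more tangible, here is a library-independent sketch: it pairs up an alternating name/content list and inflates content stored with the FlateDecode filter, which is zlib/deflate compression. It deliberately does not use the PDFBox COS classes; the names XfaPacketsSketch, toPackets, inflate and deflate are made up, and the sample XML is invented for the demo.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

public class XfaPacketsSketch {

    /** Pairs up an alternating [name, data, name, data, ...] list, keeping the order. */
    public static Map<String, byte[]> toPackets(List<Object> entries) {
        Map<String, byte[]> packets = new LinkedHashMap<>();
        for (int i = 0; i + 1 < entries.size(); i += 2) {
            packets.put((String) entries.get(i), (byte[]) entries.get(i + 1));
        }
        return packets;
    }

    /** Decodes zlib compressed data, i.e. what a FlateDecode stream contains. */
    public static byte[] inflate(byte[] compressed) {
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(compressed));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Demo helper only: compresses text the way a Flate encoded stream would hold it. */
    public static byte[] deflate(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(out)) {
            dos.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        List<Object> entries = new ArrayList<>();
        entries.add("template");
        entries.add(deflate("<template/>"));
        entries.add("datasets");
        entries.add(deflate("<xfa:datasets/>"));

        for (Map.Entry<String, byte[]> packet : toPackets(entries).entrySet()) {
            String xml = new String(inflate(packet.getValue()), StandardCharsets.UTF_8);
            System.out.println(packet.getKey() + " -> " + xml);
        }
    }
}
```

In the single-stream variant of an XFA form there is nothing to pair up, and only the decoding step applies.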


Maruan Sahyoun

On 11.01.2013 at 13:06, Andreas Lehmkühler andr...@lehmi.de wrote:

 Hi,
 
 
 On 11 January 2013 at 11:52, Maruan Sahyoun sahy...@fileaffairs.de wrote:
 Hi,
 
 I have a handling question regarding PDFBox. Im trying to read an object 
 which
 is defined as 
 COSDictionary{(COSName{Filter}:COSArray{[COSName{FlateDecode}]})
 (COSName{Length}:COSInt{477}) (COSName{Type}:COSName{EmbeddedFile}) }
 
 How can I get the content of that object?
 Have a look at the ExtractEmbeddedFiles example [1]
 
 
 Kind regards
 
 Maruan Sahyoun
 
 BR
 Andreas Lehmkühler
 
 [1]
 http://svn.apache.org/viewvc/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/pdmodel/ExtractEmbeddedFiles.java?view=log

Re: Reading a Stream reported as EmbeddedFile

2013-01-11 Thread Maruan Sahyoun

Hi Andreas,

I found it - thanks for your help.

Kind regards

Maruan Sahyoun


PDFBOX-1492 Grant ASF license

2013-01-16 Thread Maruan Sahyoun

Hi,

as part of PDFBOX-1492 I added a small patch to extract the XFA from a PDF 
form. Unfortunately I can't find the button/checkbox … to grant the ASF 
license. Would someone know how to do it?

With kind regards


Maruan Sahyoun

Re: [jira] [Commented] (PDFBOX-1498) Index Out Of Bounds Exception while reading large PDF Document

2013-01-23 Thread Maruan Sahyoun

Hi Manoj,

the size alone is not the cause of the issue. In a recent project we were 
handling PDFs larger than the one you are talking about.

1. Can you test with the non-sequential parser, i.e. PDDocument.loadNonSeq(…), 
and confirm whether this causes the same issue?
2. Can you upload a sample PDF which enables us to reproduce the issue? Without 
that it will be very difficult to say why this is happening.
3. Of course you can try with larger heap settings until it works, but I don't 
think this is a good approach.

In addition to that, it would be good if you could describe what you want to 
achieve with the PDF. Maybe there are ways of doing so without parsing the 
complete file.

With kind regards

Maruan Sahyoun


On 23.01.2013 at 10:18, Manoj Patel (JIRA) j...@apache.org wrote:

 
[ 
 https://issues.apache.org/jira/browse/PDFBOX-1498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13560504#comment-13560504
  ] 
 
 Manoj Patel commented on PDFBOX-1498:
 -
 
 Sorry but i cannot share document with anyone. I have created new document 
 which is around 700mb. Now when i try  same program it is giving below Java 
 heap space exception, even i have set -Xmx1024 parameter for that
 
 Exception in thread main org.apache.pdfbox.exceptions.WrappedIOException
   at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:243)
   at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1071)
   at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1038)
   at imageData.ReadLargeFile.main(ReadLargeFile.java:13)
 Caused by: java.lang.OutOfMemoryError: Java heap space
   at java.io.BufferedOutputStream.init(BufferedOutputStream.java:59)
   at 
 org.apache.pdfbox.cos.COSStream.createFilteredStream(COSStream.java:415)
   at 
 org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:452)
   at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:566)
   at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:187)
   ... 3 more
 
 Is there any way to read it?
 
 Index Out Of Bounds Exception while reading large PDF Document 
 ---
 
Key: PDFBOX-1498
URL: https://issues.apache.org/jira/browse/PDFBOX-1498
Project: PDFBox
 Issue Type: Bug
   Reporter: Manoj Patel
   Assignee: Andreas Lehmkühler
 
 I am getting java.lang.IndexOutOfBoundsException while reading large PDF 
 document (800 mb). 
 Below is the full stack
 Exception in thread main org.apache.pdfbox.exceptions.WrappedIOException
  at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:243)
  at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1071)
  at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1038)
  at imageData.AddFooter.main(AddFooter.java:26)
 Caused by: java.lang.IndexOutOfBoundsException: Index: 3377, Size: 3377
  at java.util.ArrayList.RangeCheck(ArrayList.java:547)
  at java.util.ArrayList.get(ArrayList.java:322)
  at 
 org.apache.pdfbox.io.RandomAccessBuffer.seek(RandomAccessBuffer.java:84)
  at 
 org.apache.pdfbox.io.RandomAccessFileOutputStream.write(RandomAccessFileOutputStream.java:106)
  at 
 java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
  at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
  at java.io.FilterOutputStream.close(FilterOutputStream.java:140)
  at 
 org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:606)
  at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:566)
  at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:187)
  ... 3 more
 

Handling page imports

2013-03-08 Thread Maruan Sahyoun

Hi,

currently there are several areas in PDFBox where pages are imported from PDFs 
and reused to form new content, e.g. Overlay, OverlayPDF, PDFMerger, PDFSplit. 
Some of these have their own ways to handle the actual import, some reuse 
utility classes. For overlay purposes we need an imported page as an XObject; 
for splitting that's not necessary.

As I do not have a complete overview of the lib: would it make sense to come 
up with something like a PageManager to handle these tasks, e.g. 
PageManager.importPage(PDPage page), PageManager.importPage(PDDocument 
pdDocument, int pageNumber) …, or is that not needed? Is a call to PDPage 
page.getContents() reliable to get the content stream, or does it have to be 
done by iterating and copying the individual parts, as has been done in 
OverlayPDF? Could that be enhanced? Shall we always handle page imports as 
XObjects?

Thanks for your feedback on these open questions.

Maruan Sahyoun

Re: Handling page imports

2013-03-08 Thread Maruan Sahyoun

Hi Glen,

thanks for your feedback. I was thinking along the lines of generalizing how to 
deal with page imports, so the PageManager I was talking about is more low level 
than yours, which is more of a LayoutManager. If you look at Overlay.java, 
OverlayPDF.java …, all handle it slightly differently (as I did in some of our 
projects). It might also be possible to add functions to change the page order 
…. A higher level API like yours could then rely on the low level API. There 
might be some overlap though. BTW I quickly looked at your contribution. You 
put a lot of effort into what was a completely missing part!

With kind regards - Maruan

On 08.03.2013 at 14:09, Glen Peterson g...@organicdesign.org wrote:

 The concept of a page-manager is a useful one, and it makes sense to
 me to group the functionality you suggest with the stuff I called a
 page manager (handles reusing images, line-breaking, and
 page-breaking).  A new level of abstraction (a page manager) is
 necessary in order to cache some things before writing them to the
 underlying stream (cache lines as the line-breaking is being
 calculated, cache pages as the page-breaking is being calculated).
 Here is the PageManager code I submitted last week.  It doesn't import
 pages from other PDFs, but if people decide to incorporate this code
 into PDFBox, then I think your functionality would belong on this same
 PageManager:
 https://issues.apache.org/jira/browse/PDFBOX-1527
 
 --
 Glen K. Peterson
 (828) 393-0081

Re: New PDFBox committer

2013-03-18 Thread Maruan Sahyoun

Hi Thomas,

congratulations. I'm looking forward to working with you.

Maruan Sahyoun

On 18.03.2013 at 18:21, Andreas Lehmkuehler andr...@lehmi.de wrote:

 Hi,
 
 I'm happy to announce that the PDFBox PMC has decided to offer committership
 in Apache PDFBox to Thomas Chojecki. He has accepted the offer and should
 have his committer account ready by now.
 
 BR
 Andreas Lehmkühler

Re: What's wrong with this font ?

2013-03-20 Thread Maruan Sahyoun

Hi,

using the latest version of pdfbox (1.7.1), this is what I got:

MLIPHOAP6 AE0TE
03D4  DR   DVGWEWNER5L  STLERC
60CO   L4PU7L

Please give it a try.

Maruan Sahyoun


On 20.03.2013 at 11:45, Sébastien Dailly sebastien.dai...@elettermail.eu wrote:

 Hello,
 
 I've got a problem while reading the attached document. (It has been 
 deflated and anonymised, text has been removed, and characters shuffled.)
 
 The text extraction works fine with some PDF readers (I tried with Acrobat and 
 Evince), but the text read by pdfbox is not the expected one, as if pdfbox 
 were using a wrong font description for reading the text: instead of
 
 
 60CO L4PU7L
  03D4 DR DVGWEWNER5L STLERC
 MLIPHOAP6 AE0TE
 
 I've got
 
 UvIKGMuK6RuN0TN
 0 E4RREDRRRElPéNéOND5vRRrTvNDp
 60pMRRRv4KS7v
 
 
 I'm using pdfbox 1.6.0 for that.
 
 Is the document invalid? What can I do to read the document correctly?
 
 Thanks !
 
 -- 
 Sébastien Dailly
 +33 1 56 29 78 67
 ELETTERMAIL
 document.pdf

Re: [jira] [Commented] (PDFBOX-1176) Watermark

2013-03-20 Thread Maruan Sahyoun

can we move the discussion to the us...@pdfbox.apache.org mailing list?

Maruan Sahyoun

On 20.03.2013 at 17:01, MH (JIRA) j...@apache.org wrote:

 
   [ https://issues.apache.org/jira/browse/PDFBOX-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13607765#comment-13607765 ]
 
 MH commented on PDFBOX-1176:
 
 
 setNonStrokingColor() ... how intuitive! 
 
 So, the visual output is like a watermark - but it's a transparent text on 
 each page. Better than nothing. I wonder if the same can be done by drawing 
 the text to an underlay?
 
 Watermark
 ---------
 
                 Key: PDFBOX-1176
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1176
             Project: PDFBox
          Issue Type: Wish
            Reporter: Rubesh MX
              Labels: Watermark
   Original Estimate: 24h
  Remaining Estimate: 24h
 
 I am checking whether watermarks can be added to a PDF doc and removed again 
 in the same way; so far I could not find any option to do that with PDFBox. 
 It would be better if we had an option to add and remove watermarks in a PDF.
 

Re: Various bugs in TTFSubFont, maybe blocking 1.8.0

2013-03-22 Thread Maruan Sahyoun

+1 for releasing as is and fixing afterwards

Maruan Sahyoun

On 22.03.2013 at 15:00, Andreas Lehmkuehler andr...@lehmi.de wrote:

 Hi,
 
 On 22.03.2013 14:40, Wolfgang Glas wrote:
 On 22.03.13 14:29, Andreas Lehmkuehler wrote:
 Hi,
 
 On 22.03.2013 13:56, Wolfgang Glas wrote:
 Hi all,
 
Andreas has kindly integrated my TTFSubFont class into pdfbox-1.8.0.
 I just started the integration. We are far away from the end.
 Thanks again for sharing the code with us.
 
 However, we are experiencing crashes of Minolta printers, which stumble
 across some subtle problems in the extracted TTF fonts generated by this
 class. Most notably, the font checksum is calculated to an invalid value,
 among other more subtle issues.
 Is this because I integrated some of your stuff or because of how I
 integrated
 your stuff?
 
 It has nothing to do with the way you are integrating the code. These
 are all TTF-related misunderstandings I introduced myself.
 Sounds like my own experience, this font stuff is hard to understand if it
 comes to the details. :-(
 
 The problem is that Minolta printers are less tolerant of TTF problems
 than all the PDF viewers I've tested so far.
 OK, I see
 
 Do you think it is a blocker for all or just for you? Maybe we should
 release
 1.8.0 as is and do another bugfix release in a couple of weeks?
 
 It is a blocker for all users of TTFSubFont, because their users will cheat
 them on breaking their Minolta printers like I've been cheated by my
 customer.
 I have a fix in my original code, which is built atop pdfbox-1.7.1. I
 have not switched over to 1.8.0 so far, but surely I'd love to see a
 pdfbox-1.8.0 with a working TTFSubFont class.
 I tend to release 1.8.0 anyway as people are waiting for the new version and
 now that it is almost done I'm afraid some of them will become impatient if
 we stop it at this point. Furthermore I don't want to put some additional
 pressure on you to deliver a fix for that issue. Once your fix is available I
 can cut a new bugfix release. Maybe some other changes will be available too.
 WDYT?
 
   Wolfgang
 
 BR
 Andreas Lehmkühler
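
For readers wondering what the "font checksum" in this thread refers to: the TrueType/OpenType format defines a per-table checksum (the sum of the table's big-endian 32-bit words, with the last word zero-padded) and a whole-font value, head.checkSumAdjustment, stored as 0xB1B0AFBA minus the sum over the entire font with that field set to 0. The sketch below only illustrates that rule. It is neither the TTFSubFont code nor Wolfgang's fix, and the class and method names are made up for the example.

```java
final class TtfTableChecksum {

    /** Sums the bytes as big-endian unsigned 32-bit words, zero-padding the last word. */
    static long tableChecksum(byte[] table) {
        long sum = 0;
        for (int i = 0; i < table.length; i += 4) {
            long word = 0;
            for (int j = 0; j < 4; j++) {
                // bytes past the end of the table count as zero padding
                int b = (i + j < table.length) ? (table[i + j] & 0xFF) : 0;
                word = (word << 8) | b;
            }
            sum = (sum + word) & 0xFFFFFFFFL; // arithmetic is modulo 2^32
        }
        return sum;
    }

    /**
     * Value to store in head.checkSumAdjustment, given the checksum of the
     * whole font file computed while that field was set to 0.
     */
    static long checkSumAdjustment(long wholeFontSum) {
        return (0xB1B0AFBAL - wholeFontSum) & 0xFFFFFFFFL;
    }

    public static void main(String[] args) {
        byte[] table = {0, 0, 0, 1, 0, 0, 0, 2};
        System.out.println(tableChecksum(table)); // prints 3
        System.out.println(Long.toHexString(checkSumAdjustment(0)));
    }
}
```

Two easy mistakes with this rule are summing with signed bytes and forgetting the zero padding of a table whose length is not a multiple of four. Whether either of those was the actual TTFSubFont bug is not stated in the thread.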

Overhaul PDFBox site

2013-03-26 Thread Maruan Sahyoun

Hi there,

what do you think about giving the PDFBox website an overhaul similar to 

http://cloudstack.apache.org/
http://ode.apache.org/index.html
http://cordova.apache.org

with a more prominent user guide such as http://ode.apache.org/userguide/
and a cleaner architecture description (together with main classes) for 
developers

to support a faster intro into pdfbox

Kind regards

Maruan Sahyoun

Re: Overhaul PDFBox site

2013-03-26 Thread Maruan Sahyoun

well - the navigation is similar; on ode it is just hidden behind drop-downs, 
compared to cloudstack. Both are using the same CSS framework [1] and the two 
navigation styles can even be combined - that should give us enough freedom 
(and is an implementation detail). Both seem to be using the Apache CMS [2].

Maruan Sahyoun

[1] http://twitter.github.com/bootstrap/
[2] https://svn.apache.org/repos/infra/websites/cms/webgui/content/export.json

On 26.03.2013 at 15:22, Timo Boehme timo.boe...@ontochem.com wrote:

 Hi,
 
 an update to the website with a cleaner grouping of content etc. would help 
 to attract people. While 'ode' and 'cordova' are visually nice I would like 
 to keep more navigation possibilities at the start page like in 'cloudstack'.
 
 
 Best regards,
 Timo
 
 
 -- 
 
 Timo Boehme
 OntoChem GmbH
 H.-Damerow-Str. 4
 06120 Halle/Saale
 T: +49 345 4780474
 F: +49 345 4780471
 timo.boe...@ontochem.com
 
 _
 
 OntoChem GmbH
 Managing Director: Dr. Lutz Weber
 Registered office: Halle / Saale
 Register court: Stendal
 Registration number: HRB 215461
 _

Re: Overhaul PDFBox site

2013-03-26 Thread Maruan Sahyoun

would be happy to handle that

Maruan Sahyoun

On 26.03.2013 at 22:35, Andreas Lehmkuehler andr...@lehmi.de wrote:

 Hi,
 
 On 26.03.2013 17:00, Maruan Sahyoun wrote:
 well - the navigation is similar; on ode it is just hidden behind drop-downs,
 compared to cloudstack. Both are using the same CSS framework [1] and the two
 navigation styles can even be combined - that should give us enough freedom
 (and is an implementation detail). Both seem to be using the Apache CMS [2].
 I guess we all know that we have to overhaul the content itself. :-)
 
 But first of all we have to decide how to manage the content. We have to use
 either svnpubsub or the Apache CMS [1]; the latter is recommended. IMHO
 we should use the CMS [2] as it would be more flexible and it is easier to
 maintain the content.
 
 As a good starting point I've changed the maven skin of our site to the
 bootstrap-like fluendo skin [3]. Maybe it is a good idea to freshen up the
 layout a little bit in preparation for a possible transition to the CMS.
 
 WDYT? And the more interesting question: any volunteer to handle the
 transition?
 
 BR
 Andreas Lehmkühler
 
 [1] http://www.apache.org/dev/project-site.html
 [2] http://www.apache.org/dev/cmsref.html
 [3] http://people.apache.org/~lehmi/pdfbox_fluendo/index.html
 

Re: Overhaul PDFBox site

2013-03-27 Thread Maruan Sahyoun

thx for the offer to help. I think it's needed :-)

I have already read about the Apache CMS and svnpubsub and think that the CMS 
is the way to go, although it's initially a little more effort. One of the 
major benefits of the CMS is that non-technical users can use it (the web UI) 
and it's easier for non-committers to contribute [1].

As soon as I get the go-ahead I'll open a ticket on Jira to track the status of 
the move. The initial step will be to familiarize myself with the tools to 
build the site as described in [2]. I propose that the migration reuse the 
current content and most of the current navigation, and that we optimize at a 
later stage, while already making a clearer distinction between users of and 
developers for pdfbox.

Maruan Sahyoun

[1] http://www.apache.org/dev/cmsref.html#non-committer
[2] http://www.apache.org/dev/cmsref.html


On 27.03.2013 at 00:10, Andreas Lehmkuehler andr...@lehmi.de wrote:

 Hi,
 
 On 26.03.2013 23:04, Maruan Sahyoun wrote:
 would be happy to handle that
 Cool! I'll try to help whenever possible.
 
 OK, I guess we don't need a formal vote on moving our site to the CMS, but
 let's wait a couple of days so that everybody has a chance to object.
 
 @Maruan
 Once we have lazy consensus we/you can start with the preparations. Please try
 to find out how we should start/proceed. I hope you'll find all you need
 using the pointers I gave in my earlier post
 
 
 BR
 Andreas Lehmkühler

Re: [PDFBox 2.0] Ideas

2013-03-29 Thread Maruan Sahyoun

Hi,

On 29.03.2013 at 12:27, Andreas Lehmkuehler andr...@lehmi.de wrote:

 Hi,
 
 On 28.03.2013 21:04, Guillaume Bailleul wrote:
 On Tue, Mar 26, 2013 at 2:21 PM, Maruan Sahyoun sahy...@fileaffairs.de wrote:
 Hi there,
 
 here is a rough summary of some ideas I have for a potential pdfbox 2.0 
 release. Maybe we could capture these as part of a wiki or jira ticket so 
 we can add and agree on some of these if we want to. As soon as we have 
 agreement we could have individual tickets for them.
 
 WDYT?
 
 
 # rearchitect PDF parsing into lexing, incremental (non-caching) parser and 
 caching parser
 o the lexer would be the low-level component delivering tokens to the 
 parser. A sample implementation exists as part of PDFBOX-1000. The benefit 
 would be a clean low-level handling of tokens. Although I proposed the 
 lexer I'm not totally happy with the current implementation. That's 
 something for another mail/ticket ...
 o the incremental (non-caching) parser would allow for page by page 
 processing, moving forward only, to support text extraction, merging, 
 splitting … - the benefit would be lower memory consumption as well as 
 potentially faster processing
 o the caching parser would support applications such as PDFDebugger or 
 PDFReader
 
 # handling of pdf versions
 the current implementation is a mix of PDF 1.4 and some ad hoc additions 
 without a clear distinction of what is and is not supported. We could add 
 some support for explicitly handling versions in pdfbox, e.g. by marking 
 certain methods and properties with their pdf version support level. This 
 could in addition be a good basis for PDF/A and other compliance checks.
 
 # handle large pdf files
 in addition to the pdf parsing issues, pdfbox does not always handle large 
 pdf files well, as some of the references are implemented as int instead of 
 long
 
 # split pdfbox into modules to support use cases such as text extraction 
 and merge with the minimum amount of classes needed. More app-like tools 
 such as the PDFDebugger or PDFReader could be additional modules.
 
 With kind regards
 
 
 Maruan Sahyoun
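
To illustrate the "lexer delivers tokens to the parser" split proposed in the list above, here is a deliberately tiny pull-style tokenizer. It is a sketch under stated assumptions: it is not the lexer attached to PDFBOX-1000, the class and token names are invented for the example, and it ignores literal strings, hex strings, comments and stream data, all of which a real PDF lexer would have to handle.

```java
import java.util.ArrayList;
import java.util.List;

final class TinyPdfLexer {

    enum Type { NAME, NUMBER, KEYWORD, DICT_OPEN, DICT_CLOSE, ARRAY_OPEN, ARRAY_CLOSE }

    static final class Token {
        final Type type;
        final String text;
        Token(Type type, String text) { this.type = type; this.text = text; }
        @Override public String toString() { return type + "(" + text + ")"; }
    }

    private final String in;
    private int pos;

    TinyPdfLexer(String in) { this.in = in; }

    /** Returns the next token, or null at end of input. The parser pulls one token at a time. */
    Token next() {
        while (pos < in.length() && Character.isWhitespace(in.charAt(pos))) pos++;
        if (pos >= in.length()) return null;
        char c = in.charAt(pos);
        if (in.startsWith("<<", pos)) { pos += 2; return new Token(Type.DICT_OPEN, "<<"); }
        if (in.startsWith(">>", pos)) { pos += 2; return new Token(Type.DICT_CLOSE, ">>"); }
        if (c == '[') { pos++; return new Token(Type.ARRAY_OPEN, "["); }
        if (c == ']') { pos++; return new Token(Type.ARRAY_CLOSE, "]"); }
        // Everything else runs up to the next whitespace or delimiter character.
        int start = pos++;
        while (pos < in.length() && !isDelimiter(in.charAt(pos))) pos++;
        String text = in.substring(start, pos);
        if (c == '/') return new Token(Type.NAME, text);
        if (c == '+' || c == '-' || c == '.' || Character.isDigit(c)) return new Token(Type.NUMBER, text);
        return new Token(Type.KEYWORD, text);
    }

    private static boolean isDelimiter(char ch) {
        return Character.isWhitespace(ch) || "/[]<>".indexOf(ch) >= 0;
    }

    /** Convenience for the demo: drains the lexer into a list of printable tokens. */
    static List<String> tokenize(String input) {
        TinyPdfLexer lexer = new TinyPdfLexer(input);
        List<String> out = new ArrayList<>();
        for (Token t = lexer.next(); t != null; t = lexer.next()) out.add(t.toString());
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("1 0 obj << /Type /Page /Kids [3 0 R] >> endobj"));
    }
}
```

Because next() only ever moves forward and keeps no history, the same lexer could sit under both a forward-only incremental parser and a caching parser. That is the separation the proposal is after.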
 
 
 Hi Maruan,
 
 I think some wiki pages would be good. This discussion has already
 started, but only as mails on the list or maybe jira tickets that got
 lost in the flow.
 
 There is an apache wiki [1], but I found nothing on PDFBox; a good
 occasion to start.
 Once we have migrated our site to the Apache CMS we'll have some sort of
 wiki, so IMHO we don't have to ask for another one.
 
 I do not have many more ideas. In my opinion, having different
 modules for PDF parsers, PDF makers and PDF viewers is an important
 one.
 This is one of my favourites, too. Let's see what'll come up. At least we 
 don't only need people who are interested in some features but also in 
 implementing them ;-)

We might be able to split into modules based on the current code and 
rearchitect the individual parts later. E.g. the command line tools could 
easily be separated, as well as PDFDebugger and PDFReader. One thing to 
consider is how we handle releases afterwards. Will we always release all 
modules as part of a release (like Apache Camel does) or do releases 
separately (as Apache Sling does)?

I'm happy to help with implementation/rearrangement as soon as the transition 
to the CMS is done.

 
 
 [1] http://wiki.apache.org/general/
 
 Guillaume Bailleul
 
 BR
 Andreas Lehmkühler
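
A note on the "int instead of long" item in the ideas list quoted above: any byte offset at or beyond 2 GiB no longer fits into a Java int, and a plain cast wraps silently to a negative number. The snippet below is only a self-contained demonstration of that language behaviour. The class and method names are made up and do not refer to any actual PDFBox field.

```java
final class OffsetWidthDemo {

    /** Narrowing cast: silently wraps once the offset reaches 2 GiB. */
    static int unsafeNarrow(long offset) {
        return (int) offset;
    }

    /** Checked conversion: throws ArithmeticException instead of wrapping. */
    static int checkedNarrow(long offset) {
        return Math.toIntExact(offset);
    }

    public static void main(String[] args) {
        long offset = 3L * 1024 * 1024 * 1024; // an object located 3 GiB into the file
        System.out.println(unsafeNarrow(offset)); // prints -1073741824
        try {
            checkedNarrow(offset);
        } catch (ArithmeticException e) {
            System.out.println("offset does not fit into an int");
        }
    }
}
```

Keeping offsets as long end to end avoids the problem. Where an int is unavoidable, a checked conversion at least fails loudly instead of seeking to a wrong position.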

Re: [PDFBox 2.0] Ideas

2013-03-29 Thread Maruan Sahyoun

Hi,

Maruan Sahyoun

On 29.03.2013 at 13:18, Andreas Lehmkuehler andr...@lehmi.de wrote:

SNIP

 One thing to consider is how we handle releases afterwards. Will we always 
 release all modules as part of a release (like Apache Camel does) or do 
 releases separately (as Apache Sling does).
 That's a good point, but it'll depend on the details. AFAIK Sling is OSGi 
 based so that all components should be independent, which makes it easier to 
 release them separately.
 

Correct, Sling is OSGi based. But Apache Camel also has a core component on 
which others are based, and they had a similar discussion. I don't think it's a 
technical question: if we go for modules, APIs should stay stable within minor 
releases, so e.g. PDFReader could count on PDFParser. But as a start, why not 
release all modules together and revisit that question later?

 I'm happy to help with implementation/rearrangement as soon as the 
 transition to the CMS is done
 Cool!
 
 
 BR
 Andreas Lehmkühler
 

BR
Maruan Sahyoun

Re: [PDFBox 2.0] Ideas

2013-03-31 Thread Maruan Sahyoun

+1 for releasing together

Maruan Sahyoun

On 31.03.2013 at 19:54, Guillaume Bailleul gbm.baill...@gmail.com wrote:

 Hi all,
 
 I agree with Timo, pdfbox is not (yet) a big project so releasing per
 module will cost too much.
 We can have module definitions and numbering that permit separate
 releases in the future even if we do not do them for the moment.
 
 Guillaume
 On 29 March 2013 15:35, timo.boe...@ontochem.com timo.boe...@ontochem.com
 wrote:
 
 Hi,
 
 I think that doing a release is quite a bit of work and having multiple
 modules with separate releases each requires extra time. As long as there
 are no module specific maintainers with responsibilities for releases we
 should do releases with the complete module set. This also prevents
 problems with incompatibilities between the modules.
 
 BR
 Timo
 
