Hi Tilman,

Sounds good. Should I wait for PDFBOX-2441?

-----Original Message-----
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Tuesday, October 21, 2014 1:42 PM
To: dev@pdfbox.apache.org
Subject: Re: 2.0

Hi Tim,

2.0 doesn't seem likely to be released soon... What might be useful again is a comparison between seq v non-seq: Andreas recently resolved an issue (PDFBOX-2250) that improves the nonSeq parser a lot. Although this isn't fully done, a follow-up issue, PDFBOX-2441 <https://issues.apache.org/jira/browse/PDFBOX-2441>, has been opened which will improve a few more complex files.

Tilman

On 21.10.2014 at 13:00, Allison, Timothy B. wrote:
> Been too busy over in Tika-land...just noticing this now.
>
> Let me know which comparisons you'd like to run (2.0 v 1.8.x or seq v
> non-seq). I won't have time to integrate 2.0 into our Tika PDFParser any
> time soon (Jeremy Anderson on TIKA-1285 has already started this), but I
> could easily write a lightweight wrapper around PDFBox's TextStripper +
> metadata inside of the tika-batch/tika-eval framework.
>
> Cheers,
>
> Tim
> ________________________________________
> From: Andreas Lehmkühler [andr...@lehmi.de]
> Sent: Wednesday, October 15, 2014 6:20 AM
> To: dev@pdfbox.apache.org
> Subject: Re: 2.0
>
> Hi,
>
>> Maruan Sahyoun <sahy...@fileaffairs.de> wrote on 15 October 2014 at 09:32:
>>
>> What about keeping both for the 2.0 release and phasing the old one out for 3,
>> but making the NonSequential parser the default?
>> Would also give us some time to work with Tim (TIKA) on the test suite.
> I agree, that's the only thing we can manage in a timely manner.
>
>> Maybe we could simplify the variations of PDDocument.load to something like
>>
>> PDDocument.load(input, raf, enforce, useLegacyParser) or
>> PDDocument.load(input, raf, enforce, withSignatureSupport)
>>
>> and introduce PDDocument.load(input) to use the NonSequential parser.
>>
>> WDYT?
> Good idea, I've already created PDFBOX-2430 for this.
>
>> Maruan
>
> BR
> Andreas Lehmkühler
>> On 15.10.2014 at 09:18, Timo Boehme <timo.boe...@ontochem.com> wrote:
>>
>>> Hi,
>>>
>>> The difference between the parsers stems from the fact that the old parser
>>> can cope with a completely broken xref table because it uses the objects as
>>> it finds them on its sequential way. What we need (as I proposed before) is
>>> a repair mechanism scanning the file for object start/end to be used for
>>> re-creating the xref table.
>>> I will see if I can find some time to do this.
>>>
>>> The only other stopper is, as Andreas has pointed out, the signing. I'm not
>>> familiar with this and don't know what needs to be done here.
>>>
>>>
>>> Best,
>>> Timo
>>>
>>>
>>> On 14.10.2014 at 21:18, Tilman Hausherr wrote:
>>>> Here are some:
>>>>
>>>> 055/055794.pdf
>>>> 082/082463.pdf
>>>> 108/108362.pdf
>>>> 113/113223.pdf
>>>> 115/115458.pdf
>>>> 115/115463.pdf
>>>> 122/122393.pdf
>>>> 129/129416.pdf
>>>> 133/133423.pdf
>>>> 148/148020.pdf
>>>> 152/152012.pdf
>>>> 161/161466.pdf
>>>>
>>>> to be found here:
>>>> http://digitalcorpora.org/corp/nps/files/govdocs1/zipfiles/
>>>>
>>>> Tilman
>>>>
>>>> On 14.10.2014 at 21:06, John Hewson wrote:
>>>>> Unless somebody provides us with a list of those files, then I think
>>>>> this is an unreasonable request. As long as we continue to leave the
>>>>> old parser in PDFBox, we won't get the bug reports which we need to
>>>>> fix the new parser, and the situation will never resolve itself.
>>>>> Falling back to the old parser is just as bad - we won't get bug reports.
>>>>>
>>>>> -- John
>>>>>
>>>>> On 14 Oct 2014, at 07:39, Tilman Hausherr <thaush...@t-online.de> wrote:
>>>>>
>>>>>> I prefer that the "old" parser not be removed, because there are many
>>>>>> files that can only be parsed by the old parser. This came out in a
>>>>>> large scale test with TIKA.
>>>>>>
>>>>>> The best idea (in my current opinion) is to use the nonSeq parser
>>>>>> first, and the old parser if there is an exception.
>>>>>>
>>>>>> Tilman
>>>>>>
>>>>>> On 14.10.2014 at 09:45, Timo Boehme wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 14.10.2014 at 07:22, John Hewson wrote:
>>>>>>>> Hi,
>>>>>>>>>> John Hewson <j...@jahewson.com> wrote on 10 October 2014 at 20:05:
>>>>>>>>>>
>>>>>>>>>> - Parsing (Andreas?)
>>>>>>>>> I guess we won't get a completely new parser in 2.0, but I'll try to
>>>>>>>>> improve the XRef and the COSStream stuff
>>>>>>>> It would be great if we could get rid of the old parser and switch
>>>>>>>> to the non-sequential parser, WDYT?
>>>>>>> I would also propose to completely remove the old parser. That way
>>>>>>> we are more flexible in parsing streams etc. since parts of the
>>>>>>> non-sequential parser are a compromise to work side-by-side with the
>>>>>>> old parser.
>>>>>>> Possibly there are a small number of functions for which the old
>>>>>>> parser is still needed - e.g. signing?
>>>>>>>
>>>>>>>
>>>>>>> Best,
>>>>>>> Timo
>>>>>>>
>>>>>>>
>>>
>>> --
>>>
>>> Timo Boehme
>>> OntoChem GmbH
>>> H.-Damerow-Str. 4
>>> 06120 Halle/Saale
>>> T: +49 345 4780474
>>> F: +49 345 4780471
>>> timo.boe...@ontochem.com
>>>
>>> _____________________________________________________________________
>>>
>>> OntoChem GmbH
>>> Managing Director: Dr. Lutz Weber
>>> Registered office: Halle / Saale
>>> Register court: Stendal
>>> Registration number: HRB 215461
>>> _____________________________________________________________________
>>>
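
For reference, the fallback strategy Tilman suggests in the thread above (try the nonSeq parser first, use the old parser if there is an exception) could look roughly like the sketch below. This is only an illustration under stated assumptions: `Parser`, `Outcome`, `parseWithFallback`, `strictParse` and `lenientParse` are hypothetical names invented here and are not part of the PDFBox API. The sketch keeps the primary parser's exception so that, in the spirit of John's objection, a fallback does not silently swallow the failure that should become a bug report.

```java
import java.nio.charset.StandardCharsets;

class FallbackParseSketch {

    /** Hypothetical stand-in for a parser; this is not a PDFBox type. */
    interface Parser<T> {
        T parse(byte[] input) throws Exception;
    }

    /** Parse result plus a record of whether the fallback parser was needed. */
    static final class Outcome<T> {
        final T value;
        final boolean usedFallback;
        final Exception primaryFailure; // null when the primary parser succeeded

        Outcome(T value, boolean usedFallback, Exception primaryFailure) {
            this.value = value;
            this.usedFallback = usedFallback;
            this.primaryFailure = primaryFailure;
        }
    }

    /** Tries the primary parser; on any exception, retries with the fallback. */
    static <T> Outcome<T> parseWithFallback(byte[] input, Parser<T> primary, Parser<T> fallback)
            throws Exception {
        try {
            return new Outcome<>(primary.parse(input), false, null);
        } catch (Exception primaryFailure) {
            try {
                // Keep the primary failure so it can still be logged or reported.
                return new Outcome<>(fallback.parse(input), true, primaryFailure);
            } catch (Exception fallbackFailure) {
                // Both parsers failed: surface both causes to the caller.
                fallbackFailure.addSuppressed(primaryFailure);
                throw fallbackFailure;
            }
        }
    }

    /** Toy "strict" parser (assumption): rejects input without a %PDF header. */
    static String strictParse(byte[] input) {
        String text = new String(input, StandardCharsets.ISO_8859_1);
        if (!text.startsWith("%PDF")) {
            throw new IllegalStateException("missing %PDF header");
        }
        return "strict";
    }

    /** Toy "lenient" parser (assumption): accepts anything that is not empty. */
    static String lenientParse(byte[] input) {
        if (input.length == 0) {
            throw new IllegalStateException("empty input");
        }
        return "lenient";
    }

    public static void main(String[] args) throws Exception {
        byte[] damaged = "garbage before header".getBytes(StandardCharsets.ISO_8859_1);
        Outcome<String> outcome = parseWithFallback(
                damaged, FallbackParseSketch::strictParse, FallbackParseSketch::lenientParse);
        System.out.println("parsed by: " + outcome.value);
        if (outcome.usedFallback) {
            System.out.println("primary parser failed with: " + outcome.primaryFailure.getMessage());
        }
    }
}
```

In a real integration the two `Parser` arguments would wrap whatever load calls the project settles on; the point of the sketch is only the control flow and the retained `primaryFailure`.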