Tim, first of all, many thanks for the offer. I’d add that a comparison between 1.8 and 2.0 would be useful too, to detect differences, whether they come from enhancements or regressions.
BR
Maruan

On 21.10.2014 at 19:42, Tilman Hausherr <thaush...@t-online.de> wrote:

> Hi Tim,
>
> 2.0 doesn't seem likely to be released soon... what might be useful again is a
> comparison between seq v non-seq. Andreas recently resolved an issue
> (PDFBOX-2250) that improves the nonSeq parser a lot. Although this isn't
> fully done, a follow-up issue PDFBOX-2441
> <https://issues.apache.org/jira/browse/PDFBOX-2441> has been opened which
> will improve a few more complex files.
>
> Tilman
>
>
>
> On 21.10.2014 at 13:00, Allison, Timothy B. wrote:
>> Been too busy over in Tika-land...just noticing this now.
>>
>> Let me know which comparisons you'd like to run (2.0 v 1.8.x or seq v
>> non-seq). I won't have time to integrate 2.0 into our Tika PDFParser any
>> time soon (Jeremy Anderson on TIKA-1285 has already started this), but I
>> could easily write a lightweight wrapper around PDFBox's TextStripper +
>> metadata inside of the tika-batch/tika-eval framework.
>>
>> Cheers,
>>
>> Tim
>> ________________________________________
>> From: Andreas Lehmkühler [andr...@lehmi.de]
>> Sent: Wednesday, October 15, 2014 6:20 AM
>> To: dev@pdfbox.apache.org
>> Subject: Re: 2.0
>>
>> Hi,
>>
>>
>>> Maruan Sahyoun <sahy...@fileaffairs.de> wrote on 15 October 2014 at 09:32:
>>>
>>>
>>> What about keeping both for the 2.0 release and phasing the old one out for 3,
>>> but making the NonSequential the default parser?
>>> Would also give us some time to work with Tim (TIKA) on the test suite.
>> I agree, that's the only thing we can manage in a timely manner.
>>
>>
>>> Maybe we could simplify the variations of PDDocument.load to something like
>>>
>>> PDDocument.load(input, raf, enforce, useLegacyParser) or
>>> PDDocument.load(input, raf, enforce, withSignatureSupport) …
>>>
>>> and introduce PDDocument.load(input) to use the NonSequential
>>>
>>>
>>> WDYT?
>> Good idea, I've already created PDFBOX-2430 for this.
>>
>>> Maruan
>>
>> BR
>> Andreas Lehmkühler
>>> On 15.10.2014 at 09:18, Timo Boehme <timo.boe...@ontochem.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> the difference between the parsers stems from the fact that the old parser
>>>> can cope with a completely broken xref table because it uses the objects as
>>>> it finds them on its sequential way. What we need (as I proposed before) is
>>>> a repair mechanism scanning the file for object start/end to be used for
>>>> re-creating the xref table.
>>>> I will see if I can find some time to do this.
>>>>
>>>> The only other blocker is, as Andreas has pointed out, the signing. I'm not
>>>> familiar with this and don't know what needs to be done here.
>>>>
>>>>
>>>> Best,
>>>> Timo
>>>>
>>>>
>>>> On 14.10.2014 at 21:18, Tilman Hausherr wrote:
>>>>> Here are some:
>>>>>
>>>>> 055/055794.pdf
>>>>> 082/082463.pdf
>>>>> 108/108362.pdf
>>>>> 113/113223.pdf
>>>>> 115/115458.pdf
>>>>> 115/115463.pdf
>>>>> 122/122393.pdf
>>>>> 129/129416.pdf
>>>>> 133/133423.pdf
>>>>> 148/148020.pdf
>>>>> 152/152012.pdf
>>>>> 161/161466.pdf
>>>>>
>>>>> to be found here:
>>>>> http://digitalcorpora.org/corp/nps/files/govdocs1/zipfiles/
>>>>>
>>>>> Tilman
>>>>>
>>>>> On 14.10.2014 at 21:06, John Hewson wrote:
>>>>>> Unless somebody provides us with a list of those files, I think
>>>>>> this is an unreasonable request. As long as we continue to leave the
>>>>>> old parser in PDFBox, we won’t get the bug reports which we need to
>>>>>> fix the new parser, and the situation will never resolve itself.
>>>>>> Falling back to the old parser is just as bad - we won’t get bug reports.
>>>>>>
>>>>>> -- John
>>>>>>
>>>>>> On 14 Oct 2014, at 07:39, Tilman Hausherr <thaush...@t-online.de> wrote:
>>>>>>
>>>>>>> I prefer that the "old" parser not be removed, because there are many
>>>>>>> files that can only be parsed by the old parser. This came out in a
>>>>>>> large scale test with TIKA.
>>>>>>>
>>>>>>> The best idea (in my current opinion) is to use the nonSeq parser
>>>>>>> first, and the old parser if there is an exception.
>>>>>>>
>>>>>>> Tilman
>>>>>>>
>>>>>>> On 14.10.2014 at 09:45, Timo Boehme wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On 14.10.2014 at 07:22, John Hewson wrote:
>>>>>>>>> Hi,
>>>>>>>>>>> John Hewson <j...@jahewson.com> wrote on 10 October 2014 at 20:05:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> - Parsing (Andreas?)
>>>>>>>>>> I guess we won't get a complete new parser in 2.0, but I'll try to
>>>>>>>>>> improve the XRef and the COSStream stuff
>>>>>>>>> It would be great if we could get rid of the old parser and switch
>>>>>>>>> to the non-sequential parser, WDYT?
>>>>>>>> I would also propose to completely remove the old parser. That way
>>>>>>>> we are more flexible in parsing streams etc. since parts of the
>>>>>>>> non-sequential parser are a compromise to work side-by-side with the
>>>>>>>> old parser.
>>>>>>>> Possibly there are a small number of functions for which the old
>>>>>>>> parser is still needed - e.g. signing?
>>>>>>>>
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Timo
>>>>>>>>
>>>>>>>>
>>>>
>>>> --
>>>>
>>>> Timo Boehme
>>>> OntoChem GmbH
>>>> H.-Damerow-Str. 4
>>>> 06120 Halle/Saale
>>>> T: +49 345 4780474
>>>> F: +49 345 4780471
>>>> timo.boe...@ontochem.com
>>>>
>>>> _____________________________________________________________________
>>>>
>>>> OntoChem GmbH
>>>> Managing Director: Dr. Lutz Weber
>>>> Registered office: Halle / Saale
>>>> Register court: Stendal
>>>> Registration number: HRB 215461
>>>> _____________________________________________________________________
>>>>
>
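
For reference, the repair mechanism Timo describes in the thread (scanning the file for object starts so the xref table can be re-created) could look roughly like the sketch below. This is only an illustration under stated assumptions, not PDFBox code: the class XrefScanSketch and the method scanObjectOffsets are hypothetical names, and the scan is a plain regex search for "<number> <generation> obj" markers. A real repair would also have to skip matches that happen to occur inside stream data and then locate the trailer/root object.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of PDFBox): finds object offsets by brute-force scanning. */
final class XrefScanSketch {

    // Matches "<objNum> <genNum> obj" at the start of the data or right after whitespace.
    private static final Pattern OBJ_START =
            Pattern.compile("(?<![^\\s])(\\d+)\\s+(\\d+)\\s+obj\\b");

    private XrefScanSketch() {
    }

    /** Maps "objNum genNum" to the byte offset at which that object definition starts. */
    static Map<String, Long> scanObjectOffsets(byte[] pdf) {
        // ISO-8859-1 maps each byte to exactly one char, so a char index equals a byte offset.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        Map<String, Long> offsets = new LinkedHashMap<>();
        Matcher m = OBJ_START.matcher(text);
        while (m.find()) {
            // Assumption: a later definition of the same object replaces an earlier one,
            // as it would when a file has been extended by incremental updates.
            offsets.put(m.group(1) + " " + m.group(2), (long) m.start(1));
        }
        return offsets;
    }
}
```

The resulting map is essentially the content of an xref table, so a parser that fails on the stored table could retry with these offsets instead of falling back to a fully sequential parse.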