Been too busy over in Tika-land...just noticing this now. Let me know which comparisons you'd like to run (2.0 v 1.8.x or seq v non-seq). I won't have time to integrate 2.0 into our Tika PDFParser any time soon (Jeremy Anderson on TIKA-1285 has already started this), but I could easily write a lightweight wrapper around PDFBox's TextStripper + metadata inside of the tika-batch/tika-eval framework.
Cheers,
Tim
________________________________________
From: Andreas Lehmkühler [andr...@lehmi.de]
Sent: Wednesday, October 15, 2014 6:20 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0

Hi,

> On 15 October 2014 at 09:32, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:
>
> What about keeping both for the 2.0 release and phasing the old one out for 3,
> but making the NonSequential the default parser?
> That would also give us some time to work with Tim (TIKA) on the test suite.

I agree, that's the only thing we can manage in a timely manner.

> Maybe we could simplify the variations of PDDocument.load to something like
>
> PDDocument.load(input, raf, enforce, useLegacyParser) or
> PDDocument.load(input, raf, enforce, withSignatureSupport) …
>
> and introduce PDDocument.load(input) to use the NonSequential
>
> WDYT?

Good idea, I've already created PDFBOX-2430 for this.

> Maruan

BR
Andreas Lehmkühler

> > On 15.10.2014 at 09:18, Timo Boehme <timo.boe...@ontochem.com> wrote:
> >
> > Hi,
> >
> > the difference between the parsers stems from the fact that the old parser
> > can cope with a completely broken xref table because it uses the objects as
> > it finds them on its sequential way. What we need (as I proposed before) is
> > a repair mechanism scanning the file for object start/end to be used for
> > re-creating the xref table.
> > I will see if I can find some time to do this.
> >
> > The only other stopper is, as Andreas has pointed out, the signing. I'm not
> > familiar with this and don't know what needs to be done here.
> >
> > Best,
> > Timo
> >
> > On 14.10.2014 at 21:18, Tilman Hausherr wrote:
> >> Here are some:
> >>
> >> 055/055794.pdf
> >> 082/082463.pdf
> >> 108/108362.pdf
> >> 113/113223.pdf
> >> 115/115458.pdf
> >> 115/115463.pdf
> >> 122/122393.pdf
> >> 129/129416.pdf
> >> 133/133423.pdf
> >> 148/148020.pdf
> >> 152/152012.pdf
> >> 161/161466.pdf
> >>
> >> to be found here:
> >> http://digitalcorpora.org/corp/nps/files/govdocs1/zipfiles/
> >>
> >> Tilman
> >>
> >> On 14.10.2014 at 21:06, John Hewson wrote:
> >>> Unless somebody provides us with a list of those files, then I think
> >>> this is an unreasonable request. As long as we continue to leave the
> >>> old parser in PDFBox, we won’t get the bug reports which we need to
> >>> fix the new parser, and the situation will never resolve itself.
> >>> Falling back to the old parser is just as bad - we won’t get bug reports.
> >>>
> >>> -- John
> >>>
> >>> On 14 Oct 2014, at 07:39, Tilman Hausherr <thaush...@t-online.de> wrote:
> >>>
> >>>> I prefer that the "old" parser not be removed, because there are many
> >>>> files that can only be parsed by the old parser. This came out in a
> >>>> large scale test with TIKA.
> >>>>
> >>>> The best idea (in my current opinion) is to use the nonSeq parser
> >>>> first, and the old parser if there is an exception.
> >>>>
> >>>> Tilman
> >>>>
> >>>> On 14.10.2014 at 09:45, Timo Boehme wrote:
> >>>>> Hi,
> >>>>>
> >>>>> On 14.10.2014 at 07:22, John Hewson wrote:
> >>>>>> Hi,
> >>>>>>>> On 10 October 2014 at 20:05, John Hewson <j...@jahewson.com> wrote:
> >>>>>>>>
> >>>>>>>> - Parsing (Andreas?)
> >>>>>>> I guess we won't get a complete new parser in 2.0, but I'll try to
> >>>>>>> improve the XRef
> >>>>>>> and the COSStream stuff
> >>>>>> It would be great if we could get rid of the old parser and switch
> >>>>>> to the non-sequential
> >>>>>> parser, WDYT?
> >>>>> I would also propose to completely remove the old parser.
> >>>>> That way we are more flexible in parsing streams etc., since parts of the
> >>>>> non-sequential parser are a compromise to work side-by-side with the
> >>>>> old parser.
> >>>>> Possibly there are a small number of functions for which the old
> >>>>> parser is still needed - e.g. signing?
> >>>>>
> >>>>> Best,
> >>>>> Timo
> >>>>>
> >>>
> >>
> >
> > --
> >
> > Timo Boehme
> > OntoChem GmbH
> > H.-Damerow-Str. 4
> > 06120 Halle/Saale
> > T: +49 345 4780474
> > F: +49 345 4780471
> > timo.boe...@ontochem.com
> >
> > _____________________________________________________________________
> >
> > OntoChem GmbH
> > Managing Director: Dr. Lutz Weber
> > Registered office: Halle / Saale
> > Registration court: Stendal
> > Registration number: HRB 215461
> > _____________________________________________________________________
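
For readers following the thread, the repair idea Timo describes (scanning the file for object starts so the xref table can be re-created) could look roughly like the sketch below. This is only an illustration under stated assumptions, not PDFBox code: the class name XrefScanSketch and the method scanObjectOffsets are hypothetical, the scan looks only for "<number> <generation> obj" headers, and it does not try to skip binary stream data, where false matches are possible.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: brute-force scan for object headers.
final class XrefScanSketch {

    // "<objNum> <genNum> obj" at the start of the input or right after whitespace.
    private static final Pattern OBJ_HEADER =
            Pattern.compile("(?<![^\\s])(\\d+)\\s+(\\d+)\\s+obj\\b");

    private XrefScanSketch() {
    }

    // Returns a map from "objNum genNum" to the byte offset of the object header.
    // If the same key occurs more than once, the last occurrence wins, which
    // mirrors how later incremental updates override earlier object versions.
    static Map<String, Long> scanObjectOffsets(byte[] pdf) {
        // ISO-8859-1 maps every byte to exactly one char, so a char index in
        // the decoded string equals the byte offset in the file.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        Map<String, Long> offsets = new LinkedHashMap<>();
        Matcher m = OBJ_HEADER.matcher(text);
        while (m.find()) {
            offsets.put(m.group(1) + " " + m.group(2), (long) m.start(1));
        }
        return offsets;
    }
}
```

A real repair step would additionally have to locate the trailer dictionary and deal with object streams and encrypted files; the sketch covers only the offset-collection part mentioned in the thread.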