RE: 2.0

Allison, Timothy B. Tue, 21 Oct 2014 04:01:29 -0700

Been too busy over in Tika-land...just noticing this now.

Let me know which comparisons you'd like to run (2.0 v 1.8.x or seq v non-seq).
I won't have time to integrate 2.0 into our Tika PDFParser any time soon
(Jeremy Anderson on TIKA-1285 has already started this), but I could easily 
write a lightweight wrapper around PDFBox's TextStripper + metadata inside of 
the tika-batch/tika-eval framework.


Cheers,

      Tim
________________________________________
From: Andreas Lehmkühler [[email protected]]
Sent: Wednesday, October 15, 2014 6:20 AM
To: [email protected]
Subject: Re: 2.0

Hi,


> On 15 October 2014 at 09:32, Maruan Sahyoun <[email protected]> wrote:
>
>
> What about keeping both for the 2.0 release and phasing the old one out for 3,
> but making the NonSequential parser the default?
> Would also give us some time to work with Tim (TIKA) on the test suite.
I agree, that's the only thing we can manage in a timely manner.


> Maybe we could simplify the variations of PDDocument.load to something like
>
> PDDocument.load(input, raf, enforce, useLegacyParser) or
> PDDocument.load(input, raf, enforce, withSignatureSupport) …
>
> and introduce PDDocument.load(input) to use the NonSequential parser
>
>
> WDYT?
Good idea, I've already created PDFBOX-2430 for this.

>
> Maruan


BR
Andreas Lehmkühler
>
> On 15.10.2014 at 09:18, Timo Boehme <[email protected]> wrote:
>
> > Hi,
> >
> > the difference between the parsers stems from the fact that the old parser
> > can cope with a completely broken xref table because it uses the objects as
> > it finds them on its sequential way. What we need (as I proposed before) is
> > a repair mechanism scanning the file for object start/end to be used for
> > re-creating the xref table.
> > I will see if I can find some time to do this.
> >
> > The only other stopper is, as Andreas has pointed out, the signing. I'm not
> > familiar with this and don't know what needs to be done here.
> >
> >
> > Best,
> > Timo
> >
> >
> > On 14.10.2014 at 21:18, Tilman Hausherr wrote:
> >> Here are some:
> >>
> >> 055/055794.pdf
> >> 082/082463.pdf
> >> 108/108362.pdf
> >> 113/113223.pdf
> >> 115/115458.pdf
> >> 115/115463.pdf
> >> 122/122393.pdf
> >> 129/129416.pdf
> >> 133/133423.pdf
> >> 148/148020.pdf
> >> 152/152012.pdf
> >> 161/161466.pdf
> >>
> >> to be found here:
> >> http://digitalcorpora.org/corp/nps/files/govdocs1/zipfiles/
> >>
> >> Tilman
> >>
> >> On 14.10.2014 at 21:06, John Hewson wrote:
> >>> Unless somebody provides us with a list of those files, then I think
> >>> this is an unreasonable request. As long as we continue to leave the
> >>> old parser in PDFBox, we won’t get the bug reports which we need to
> >>> fix the new parser, and the situation will never resolve itself.
> >>> Falling back to the old parser is just as bad - we won’t get bug reports.
> >>>
> >>> -- John
> >>>
> >>> On 14 Oct 2014, at 07:39, Tilman Hausherr <[email protected]> wrote:
> >>>
> >>>> I prefer that the "old" parser not be removed, because there are many
> >>>> files that can only be parsed by the old parser. This came out in a
> >>>> large scale test with TIKA.
> >>>>
> >>>> The best idea (in my current opinion) is to use the nonSeq parser
> >>>> first, and the old parser if there is an exception.
> >>>>
> >>>> Tilman
> >>>>
> >>>> On 14.10.2014 at 09:45, Timo Boehme wrote:
> >>>>> Hi,
> >>>>>
> >>>>> On 14.10.2014 at 07:22, John Hewson wrote:
> >>>>>> Hi,
> >>>>>>>> On 10 October 2014 at 20:05, John Hewson <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>        - Parsing (Andreas?)
> >>>>>>> I guess we won't get a completely new parser in 2.0, but I'll try to
> >>>>>>> improve the XRef
> >>>>>>> and the COSStream stuff
> >>>>>> It would be great if we could get rid of the old parser and switch
> >>>>>> to the non-sequential
> >>>>>> parser, WDYT?
> >>>>> I would also propose to completely remove the old parser. That way
> >>>>> we are more flexible in parsing streams etc. since parts of the
> >>>>> non-sequential parser are a compromise to work side-by-side with the
> >>>>> old parser.
> >>>>> Possibly there are a small number of functions for which the old
> >>>>> parser is still needed - e.g. signing?
> >>>>>
> >>>>>
> >>>>> Best,
> >>>>> Timo
> >>>>>
> >>>>>
> >>>
> >>
> >
> >
> > --
> >
> > Timo Boehme
> > OntoChem GmbH
> > H.-Damerow-Str. 4
> > 06120 Halle/Saale
> > T: +49 345 4780474
> > F: +49 345 4780471
> > [email protected]
> >
> > _____________________________________________________________________
> >
> > OntoChem GmbH
> > Managing Director: Dr. Lutz Weber
> > Registered office: Halle / Saale
> > Register court: Stendal
> > Registration number: HRB 215461
> > _____________________________________________________________________
> >
>
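
As a rough illustration of the repair mechanism Timo proposes above (scanning the file for object starts so the xref table can be re-created), here is a minimal Java sketch. XrefScanSketch and scanObjectOffsets are hypothetical names, not PDFBox API. The sketch assumes the whole file fits in memory and that every object header of the form "N G obj" begins at the start of a line. It does not handle object streams or obj-like text inside stream data, which a real repair would have to deal with.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of PDFBox.
final class XrefScanSketch {

    // Matches "<object number> <generation> obj" at the start of a line.
    private static final Pattern OBJ_START =
            Pattern.compile("(?m)^(\\d+)\\s+(\\d+)\\s+obj\\b");

    // Returns a map from "<number> <generation>" to the byte offset of the header.
    static Map<String, Integer> scanObjectOffsets(byte[] pdf) {
        // ISO-8859-1 maps every byte to exactly one char,
        // so character positions in the string equal byte offsets in the file.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        Map<String, Integer> offsets = new LinkedHashMap<>();
        Matcher m = OBJ_START.matcher(text);
        while (m.find()) {
            // A later definition replaces an earlier one, since incremental
            // updates append newer versions of an object to the end of the file.
            offsets.put(m.group(1) + " " + m.group(2), m.start());
        }
        return offsets;
    }
}
```

The resulting map holds the same information an xref table entry would (object number, generation, offset), so a parser could use it in place of a broken table.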

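Tilman's suggestion above (try the non-sequential parser first, and use the old parser only if it throws) could be sketched as follows. ParserFallbackSketch and parseWithFallback are hypothetical names, and the two parsers are passed in as plain Callable instances, so nothing here is actual PDFBox API.

```java
import java.util.concurrent.Callable;

// Hypothetical helper for illustration only; not part of PDFBox.
final class ParserFallbackSketch {

    // Runs the primary parser; if it throws, runs the fallback parser instead.
    static <T> T parseWithFallback(Callable<T> primary, Callable<T> fallback)
            throws Exception {
        try {
            return primary.call();
        } catch (Exception primaryFailure) {
            try {
                return fallback.call();
            } catch (Exception fallbackFailure) {
                // Keep the first failure attached so it is not lost.
                fallbackFailure.addSuppressed(primaryFailure);
                throw fallbackFailure;
            }
        }
    }
}
```

Note that when the fallback succeeds, this sketch silently discards the primary failure, which is the concern John raises about missing bug reports. Logging primaryFailure before calling the fallback would be one way to keep those failures visible.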