Re: 2.0

Maruan Sahyoun Wed, 15 Oct 2014 00:33:47 -0700

What about keeping both for the 2.0 release and phasing the old one out for 3.0, 
while making the NonSequential parser the default?
That would also give us some time to work with Tim (TIKA) on the test suite.


Maybe we could simplify the variations of PDDocument.load to something like 

PDDocument.load(input, raf, enforce, useLegacyParser) or
PDDocument.load(input, raf, enforce, withSignatureSupport) …

and introduce PDDocument.load(input), which would use the NonSequential parser.

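To make the idea concrete, here is a minimal sketch of how the overloads could delegate to each other. Everything in it is an assumption for illustration: LoadOverloadSketch and ParserKind are hypothetical names, the scratch-file parameter is typed as Object because the real type of raf is not spelled out here, and the method only reports which parser would be chosen instead of parsing anything.

```java
import java.io.InputStream;

// Hypothetical stand-in for the proposal; this is NOT the PDFBox PDDocument class.
final class LoadOverloadSketch {

    // Which parser a call would end up using.
    enum ParserKind { NON_SEQUENTIAL, LEGACY }

    // Convenience overload: defaults to the non-sequential parser.
    static ParserKind load(InputStream input) {
        return load(input, null, false, false);
    }

    // Full variant: the caller has to opt in to the legacy parser explicitly.
    // Parameter names follow the proposal; their types are assumptions.
    static ParserKind load(InputStream input, Object raf, boolean enforce,
                           boolean useLegacyParser) {
        // A real implementation would parse here; the sketch only reports the choice.
        return useLegacyParser ? ParserKind.LEGACY : ParserKind.NON_SEQUENTIAL;
    }
}
```

The point of the shape is that the short form never reaches the old parser, so the old one can be phased out later without touching callers of load(input).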

WDYT?

Maruan

On 15.10.2014 at 09:18, Timo Boehme <[email protected]> wrote:

> Hi,
> 
> The difference between the parsers stems from the fact that the old parser 
> can cope with a completely broken xref table because it uses the objects as 
> it finds them while reading the file sequentially. What we need (as I 
> proposed before) is a repair mechanism that scans the file for object 
> start/end markers and uses them to re-create the xref table.
> I will see if I can find some time to do this.
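
A rough sketch of the scanning idea described above, under stated assumptions: XrefScanSketch and scanObjectOffsets are hypothetical names, not PDFBox code, and the whole file is assumed to fit in memory. It only looks for object start markers of the form "N G obj" and records their byte offsets; it does not check for the matching endobj and can be fooled by binary stream data that happens to contain such a marker.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, NOT part of PDFBox: rebuilds an object-number -> byte-offset
// map by scanning raw PDF bytes for "<objNr> <genNr> obj" markers.
final class XrefScanSketch {

    // Object number and generation number, then the "obj" keyword, at a token boundary.
    private static final Pattern OBJ_START =
            Pattern.compile("(?<![0-9A-Za-z])(\\d{1,10})\\s+(\\d{1,5})\\s+obj\\b");

    static Map<Long, Long> scanObjectOffsets(byte[] pdf) {
        // ISO-8859-1 maps every byte to exactly one char, so match indices are byte offsets.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        Map<Long, Long> offsets = new LinkedHashMap<>();
        Matcher m = OBJ_START.matcher(text);
        while (m.find()) {
            long objNr = Long.parseLong(m.group(1));
            // A later definition overwrites an earlier one, as with incremental updates.
            offsets.put(objNr, (long) m.start());
        }
        return offsets;
    }
}
```

The resulting map could stand in for a broken xref table; a usable repair step would additionally have to locate the trailer and skip over stream contents.
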
> 
> The only other blocker is, as Andreas has pointed out, the signing. I'm not 
> familiar with this and don't know what needs to be done here.
> 
> 
> Best,
> Timo
> 
> 
> On 14.10.2014 at 21:18, Tilman Hausherr wrote:
>> Here are some:
>> 
>> 055/055794.pdf
>> 082/082463.pdf
>> 108/108362.pdf
>> 113/113223.pdf
>> 115/115458.pdf
>> 115/115463.pdf
>> 122/122393.pdf
>> 129/129416.pdf
>> 133/133423.pdf
>> 148/148020.pdf
>> 152/152012.pdf
>> 161/161466.pdf
>> 
>> to be found here:
>> http://digitalcorpora.org/corp/nps/files/govdocs1/zipfiles/
>> 
>> Tilman
>> 
>> On 14.10.2014 at 21:06, John Hewson wrote:
>>> Unless somebody provides us with a list of those files, then I think
>>> this is an unreasonable request. As long as we continue to leave the
>>> old parser in PDFBox, we won’t get the bug reports which we need to
>>> fix the new parser, and the situation will never resolve itself.
>>> Falling back to the old parser is just as bad - we won’t get bug reports.
>>> 
>>> -- John
>>> 
>>> On 14 Oct 2014, at 07:39, Tilman Hausherr <[email protected]> wrote:
>>> 
>>>> I prefer that the "old" parser not be removed, because there are many
>>>> files that can only be parsed by the old parser. This came out in a
>>>> large scale test with TIKA.
>>>> 
>>>> The best idea (in my current opinion) is to use the nonSeq parser
>>>> first, and the old parser if there is an exception.
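
The fallback idea above could be sketched roughly as follows. This is an illustration only: Parser is a hypothetical stand-in interface, not a PDFBox type, and parseWithFallback is an invented name. The primary parser is tried first, and the fallback runs only if the primary throws an IOException.

```java
import java.io.IOException;

// Hypothetical illustration; Parser is a stand-in interface, NOT a PDFBox type.
final class ParserFallbackSketch {

    interface Parser<T> {
        T parse(byte[] input) throws IOException;
    }

    // Tries the primary parser; on an IOException, retries with the fallback parser.
    static <T> T parseWithFallback(Parser<T> primary, Parser<T> fallback, byte[] input)
            throws IOException {
        try {
            return primary.parse(input);
        } catch (IOException primaryFailure) {
            try {
                return fallback.parse(input);
            } catch (IOException fallbackFailure) {
                // Keep the first failure attached so it is not lost when both parsers fail.
                fallbackFailure.addSuppressed(primaryFailure);
                throw fallbackFailure;
            }
        }
    }
}
```

A silent fallback like this hides failures of the primary parser from users, so logging primaryFailure before retrying would be one way to keep such failures reportable.
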
>>>> 
>>>> Tilman
>>>> 
>>>> On 14.10.2014 at 09:45, Timo Boehme wrote:
>>>>> Hi,
>>>>> 
>>>>> On 14.10.2014 at 07:22, John Hewson wrote:
>>>>>> Hi,
>>>>>>>> On 10 October 2014 at 20:05, John Hewson <[email protected]>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>>        - Parsing (Andreas?)
>>>>>>> I guess we won't get a completely new parser in 2.0, but I will try to
>>>>>>> improve the XRef
>>>>>>> and the COSStream stuff
>>>>>> It would be great if we could get rid of the old parser and switch
>>>>>> to the non-sequential
>>>>>> parser, WDYT?
>>>>> I would also propose to completely remove the old parser. That way
>>>>> we are more flexible in parsing streams etc. since parts of the
>>>>> non-sequential parser are a compromise to work side-by-side with the
>>>>> old parser.
>>>>> Possibly there are a small number of functions for which the old
>>>>> parser is still needed - e.g. signing?
>>>>> 
>>>>> 
>>>>> Best,
>>>>> Timo
>>>>> 
>>>>> 
>>> 
>> 
> 
> 
> -- 
> 
> Timo Boehme
> OntoChem GmbH
> H.-Damerow-Str. 4
> 06120 Halle/Saale
> T: +49 345 4780474
> F: +49 345 4780471
> [email protected]
> 
> _____________________________________________________________________
> 
> OntoChem GmbH
> Managing Director: Dr. Lutz Weber
> Registered office: Halle / Saale
> Register court: Stendal
> Registration number: HRB 215461
> _____________________________________________________________________
>
