RE: 2.0

Allison, Timothy B. Tue, 21 Oct 2014 13:20:07 -0700

Hi Tilman,
  Sounds good.  Should I wait for PDFBOX-2441?

-----Original Message-----
From: Tilman Hausherr [mailto:thaush...@t-online.de] 
Sent: Tuesday, October 21, 2014 1:42 PM
To: dev@pdfbox.apache.org
Subject: Re: 2.0


Hi Tim,

2.0 doesn't look like it will be released soon. What might be useful again 
is a comparison between seq and non-seq: Andreas recently resolved an issue 
(PDFBOX-2250) that improves the nonSeq parser a lot. That work isn't 
fully done, though; a follow-up issue, PDFBOX-2441 
<https://issues.apache.org/jira/browse/PDFBOX-2441>, has been opened 
and should improve a few more complex files.

Tilman



On 21.10.2014 at 13:00, Allison, Timothy B. wrote:
> Been too busy over in Tika-land...just noticing this now.
>
> Let me know which comparisons you'd like to run (2.0 v 1.8.x or seq v 
> non-seq).  I won't have time to integrate 2.0 into our Tika PDFParser any 
> time soon (Jeremy Anderson on TIKA-1285 has already started this), but I 
> could easily write a lightweight wrapper around PDFBox's TextStripper + 
> metadata inside of the tika-batch/tika-eval framework.
>
> Cheers,
>
>        Tim
> ________________________________________
> From: Andreas Lehmkühler [andr...@lehmi.de]
> Sent: Wednesday, October 15, 2014 6:20 AM
> To: dev@pdfbox.apache.org
> Subject: Re: 2.0
>
> Hi,
>
>
>> On 15 October 2014 at 09:32, Maruan Sahyoun <sahy...@fileaffairs.de>
>> wrote:
>>
>>
>> What about keeping both for the 2.0 release and phasing the old one out for 3,
>> but making the NonSequential parser the default?
>> That would also give us some time to work with Tim (TIKA) on the test suite.
> I agree, that's the only thing we can manage in a timely manner.
>
>
>> Maybe we could simplify the variations of PDDocument.load to something like
>>
>> PDDocument.load(input, raf, enforce, useLegacyParser) or
>> PDDocument.load(input, raf, enforce, withSignatureSupport) .
>>
>> and introduce PDDocument.load(input) to use the NonSequential parser
>>
>>
>> WDYT?
> Good idea, I've already created PDFBOX-2430 for this.
>
>> Maruan
>
> BR
> Andreas Lehmkühler
>> On 15.10.2014 at 09:18, Timo Boehme <timo.boe...@ontochem.com> wrote:
>>
>>> Hi,
>>>
>>> the difference between the parsers stems from the fact that the old parser
>>> can cope with a completely broken xref table, because it uses the objects as
>>> it finds them while reading the file sequentially. What we need (as I proposed
>>> before) is a repair mechanism that scans the file for object start/end markers
>>> and uses them to re-create the xref table.
>>> I will see if I can find some time to do this.
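
A minimal sketch of the kind of scan described above, assuming the whole file fits in memory and that object headers appear as plain "<number> <generation> obj" tokens. The class and method names are hypothetical and are not PDFBox APIs. A real repair pass would also have to reject false matches inside stream data and deal with compressed object streams; this sketch does neither.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of PDFBox.
final class XrefScanSketch {

    // Matches "<objNum> <genNum> obj" when it does not start in the
    // middle of another number or keyword.
    private static final Pattern OBJ_HEADER =
            Pattern.compile("(?<![0-9A-Za-z])(\\d+)\\s+(\\d+)\\s+obj\\b");

    /** Maps "objNum genNum" to the byte offset of that object's header. */
    static Map<String, Long> scanObjectOffsets(byte[] pdf) {
        // ISO-8859-1 maps every byte to exactly one char, so match
        // positions in the string are byte offsets in the file.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        Map<String, Long> offsets = new LinkedHashMap<>();
        Matcher m = OBJ_HEADER.matcher(text);
        while (m.find()) {
            // A later definition of the same object replaces an earlier
            // one, which is how incremental updates are meant to behave.
            offsets.put(m.group(1) + " " + m.group(2), (long) m.start());
        }
        return offsets;
    }
}
```

The resulting map holds the information a rebuilt xref table needs (object number, generation, offset); locating the trailer and the document root is left out of this sketch.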
>>>
>>> The only other blocker is, as Andreas has pointed out, the signing. I'm not
>>> familiar with this and don't know what needs to be done here.
>>>
>>>
>>> Best,
>>> Timo
>>>
>>>
>>> On 14.10.2014 at 21:18, Tilman Hausherr wrote:
>>>> Here are some:
>>>>
>>>> 055/055794.pdf
>>>> 082/082463.pdf
>>>> 108/108362.pdf
>>>> 113/113223.pdf
>>>> 115/115458.pdf
>>>> 115/115463.pdf
>>>> 122/122393.pdf
>>>> 129/129416.pdf
>>>> 133/133423.pdf
>>>> 148/148020.pdf
>>>> 152/152012.pdf
>>>> 161/161466.pdf
>>>>
>>>> to be found here:
>>>> http://digitalcorpora.org/corp/nps/files/govdocs1/zipfiles/
>>>>
>>>> Tilman
>>>>
>>>> On 14.10.2014 at 21:06, John Hewson wrote:
>>>>> Unless somebody provides us with a list of those files, then I think
>>>>> this is an unreasonable request. As long as we continue to leave the
>>>>> old parser in PDFBox, we won't get the bug reports which we need to
>>>>> fix the new parser, and the situation will never resolve itself.
>>>>> Falling back to the old parser is just as bad - we won't get bug reports.
>>>>>
>>>>> -- John
>>>>>
>>>>> On 14 Oct 2014, at 07:39, Tilman Hausherr <thaush...@t-online.de> wrote:
>>>>>
>>>>>> I prefer that the "old" parser not be removed, because there are many
>>>>>> files that can only be parsed by the old parser. This came out in a
>>>>>> large scale test with TIKA.
>>>>>>
>>>>>> The best idea (in my current opinion) is to use the nonSeq parser
>>>>>> first, and the old parser if there is an exception.
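
The try-first-then-fall-back idea above could be sketched roughly as follows. This is only an illustration: the `Parser` interface and `parseWithFallback` are hypothetical names, not PDFBox APIs, and the sketch assumes that a parse failure surfaces as an `IOException`.

```java
import java.io.IOException;

// Hypothetical names for illustration only; these are not PDFBox APIs.
final class ParserFallbackSketch {

    /** Stand-in for any parse routine that may fail on a damaged file. */
    interface Parser<T> {
        T parse(byte[] input) throws IOException;
    }

    /** Tries the primary parser and uses the fallback only if it throws. */
    static <T> T parseWithFallback(byte[] input, Parser<T> primary, Parser<T> fallback)
            throws IOException {
        try {
            return primary.parse(input);
        } catch (IOException primaryFailure) {
            try {
                return fallback.parse(input);
            } catch (IOException fallbackFailure) {
                // Keep the first failure attached so it is not lost.
                fallbackFailure.addSuppressed(primaryFailure);
                throw fallbackFailure;
            }
        }
    }
}
```

Note that when the fallback succeeds, the primary parser's failure is silently dropped here; a caller that wants those failures reported would need to log them in the first catch block.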
>>>>>>
>>>>>> Tilman
>>>>>>
>>>>>> On 14.10.2014 at 09:45, Timo Boehme wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 14.10.2014 at 07:22, John Hewson wrote:
>>>>>>>> Hi,
>>>>>>>>>> On 10 October 2014 at 20:05, John Hewson <j...@jahewson.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>          - Parsing (Andreas?)
>>>>>>>>> I guess we won't get a completely new parser in 2.0, but I'm
>>>>>>>>> trying to improve the XRef
>>>>>>>>> and the COSStream stuff
>>>>>>>> It would be great if we could get rid of the old parser and switch
>>>>>>>> to the non-sequential
>>>>>>>> parser, WDYT?
>>>>>>> I would also propose to completely remove the old parser. That way
>>>>>>> we are more flexible in parsing streams etc. since parts of the
>>>>>>> non-sequential parser are a compromise to work side-by-side with the
>>>>>>> old parser.
>>>>>>> Possibly there are a small number of functions for which the old
>>>>>>> parser is still needed - e.g. signing?
>>>>>>>
>>>>>>>
>>>>>>> Best,
>>>>>>> Timo
>>>>>>>
>>>>>>>
>>>
>>> --
>>>
>>> Timo Boehme
>>> OntoChem GmbH
>>> H.-Damerow-Str. 4
>>> 06120 Halle/Saale
>>> T: +49 345 4780474
>>> F: +49 345 4780471
>>> timo.boe...@ontochem.com
>>>
>>>
