RE: 2.0

Allison, Timothy B. Tue, 21 Oct 2014 13:46:07 -0700

Maruan,
  Sounds good.  I'll add writing the wrapper to my todo list... it would 
probably be good for me to start moving to 2.0 anyway. :)


-----Original Message-----
From: Maruan Sahyoun [mailto:[email protected]] 
Sent: Tuesday, October 21, 2014 1:50 PM
To: [email protected]
Subject: Re: 2.0

Tim, 

First, many thanks for the offer. I'd add that a comparison between 1.8 and 
2.0 would also be useful for detecting differences, whether they stem from 
enhancements or regressions.

BR
Maruan


On 21.10.2014 at 19:42, Tilman Hausherr <[email protected]> wrote:

> Hi Tim,
> 
> 2.0 doesn't seem likely to be released soon... What might be useful again 
> is a comparison between seq and non-seq: Andreas recently resolved an issue 
> (PDFBOX-2250) that improves the nonSeq parser a lot. That work isn't fully 
> done, though; a follow-up issue, PDFBOX-2441 
> <https://issues.apache.org/jira/browse/PDFBOX-2441>, has been opened, which 
> should improve the handling of a few more complex files.
> 
> Tilman
> 
> 
> 
> On 21.10.2014 at 13:00, Allison, Timothy B. wrote:
>> Been too busy over in Tika-land...just noticing this now.
>> 
>> Let me know which comparisons you'd like to run (2.0 v 1.8.x or seq v 
>> non-seq).  I won't have time to integrate 2.0 into our Tika PDFParser any 
>> time soon (Jeremy Anderson on TIKA-1285 has already started this), but I 
>> could easily write a lightweight wrapper around PDFBox's TextStripper + 
>> metadata inside of the tika-batch/tika-eval framework.
>> 
>> Cheers,
>> 
>>       Tim
>> ________________________________________
>> From: Andreas Lehmkühler [[email protected]]
>> Sent: Wednesday, October 15, 2014 6:20 AM
>> To: [email protected]
>> Subject: Re: 2.0
>> 
>> Hi,
>> 
>> 
>>> Maruan Sahyoun <[email protected]> wrote on 15 October 2014 at 09:32:
>>> 
>>> 
>>> What about keeping both for the 2.0 release and phasing the old one out
>>> for 3.0, but making the NonSequential parser the default?
>>> That would also give us some time to work with Tim (TIKA) on the test suite.
>> I agree, that's the only thing we can manage in a timely manner.
>> 
>> 
>>> Maybe we could simplify the variations of PDDocument.load to something like
>>> 
>>> PDDocument.load(input, raf, enforce, useLegacyParser) or
>>> PDDocument.load(input, raf, enforce, withSignatureSupport) .
>>> 
>>> and introduce PDDocument.load(input) to use the NonSequential parser.
>>> 
>>> 
>>> WDYT?
>> Good idea, I've already created PDFBOX-2430 for this.
>> 
>>> Maruan
>> 
>> BR
>> Andreas Lehmkühler
>>> On 15.10.2014 at 09:18, Timo Boehme <[email protected]> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> The difference between the parsers stems from the fact that the old parser
>>>> can cope with a completely broken xref table, because it uses the objects
>>>> as it encounters them while reading the file sequentially. What we need
>>>> (as I proposed before) is a repair mechanism that scans the file for
>>>> object start/end markers and uses them to re-create the xref table.
>>>> I will see if I can find some time to do this.
>>>> 
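
A minimal sketch of the kind of scan described above, assuming a brute-force
pass that looks for "N G obj" markers and records their byte offsets. The
names XrefScanSketch and scanObjectOffsets are hypothetical and are not
PDFBox API; a real repair step would also have to deal with object streams,
false matches inside stream data, and the trailer.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of PDFBox.
class XrefScanSketch {
    // Matches "<objNum> <genNum> obj" that does not start in the middle of a number.
    private static final Pattern OBJ_START =
            Pattern.compile("(?<![0-9])(\\d+)\\s+(\\d+)\\s+obj\\b");

    /** Maps "objNum genNum" to the byte offset at which that object starts. */
    static Map<String, Integer> scanObjectOffsets(byte[] pdf) {
        // ISO-8859-1 maps each byte to exactly one char, so a char index equals a byte offset.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        Map<String, Integer> offsets = new LinkedHashMap<>();
        Matcher m = OBJ_START.matcher(text);
        while (m.find()) {
            // A later definition replaces an earlier one, mirroring incremental updates.
            offsets.put(m.group(1) + " " + m.group(2), m.start());
        }
        return offsets;
    }

    public static void main(String[] args) {
        byte[] sample = ("%PDF-1.4\n"
                + "1 0 obj\n<< /Type /Catalog >>\nendobj\n"
                + "2 0 obj\n<< /Type /Pages >>\nendobj\n")
                .getBytes(StandardCharsets.ISO_8859_1);
        scanObjectOffsets(sample)
                .forEach((id, offset) -> System.out.println(id + " -> " + offset));
    }
}
```

The resulting map could stand in for the xref table when the one stored in
the file is unusable.
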
>>>> The only other blocker is, as Andreas has pointed out, signing. I'm not
>>>> familiar with it and don't know what needs to be done here.
>>>> 
>>>> 
>>>> Best,
>>>> Timo
>>>> 
>>>> 
>>>> On 14.10.2014 at 21:18, Tilman Hausherr wrote:
>>>>> Here are some:
>>>>> 
>>>>> 055/055794.pdf
>>>>> 082/082463.pdf
>>>>> 108/108362.pdf
>>>>> 113/113223.pdf
>>>>> 115/115458.pdf
>>>>> 115/115463.pdf
>>>>> 122/122393.pdf
>>>>> 129/129416.pdf
>>>>> 133/133423.pdf
>>>>> 148/148020.pdf
>>>>> 152/152012.pdf
>>>>> 161/161466.pdf
>>>>> 
>>>>> to be found here:
>>>>> http://digitalcorpora.org/corp/nps/files/govdocs1/zipfiles/
>>>>> 
>>>>> Tilman
>>>>> 
>>>>> On 14.10.2014 at 21:06, John Hewson wrote:
>>>>>> Unless somebody provides us with a list of those files, then I think
>>>>>> this is an unreasonable request. As long as we continue to leave the
>>>>>> old parser in PDFBox, we won't get the bug reports which we need to
>>>>>> fix the new parser, and the situation will never resolve itself.
>>>>>> Falling back to the old parser is just as bad - we won't get bug reports.
>>>>>> 
>>>>>> -- John
>>>>>> 
>>>>>> On 14 Oct 2014, at 07:39, Tilman Hausherr <[email protected]> wrote:
>>>>>> 
>>>>>>> I prefer that the "old" parser not be removed, because there are many
>>>>>>> files that can only be parsed by the old parser. This came out in a
>>>>>>> large scale test with TIKA.
>>>>>>> 
>>>>>>> The best idea (in my current opinion) is to use the nonSeq parser
>>>>>>> first, and the old parser if there is an exception.
>>>>>>> 
>>>>>>> Tilman
>>>>>>> 
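
The "try the nonSeq parser first, then the old parser" idea can be sketched
roughly as follows. This is only an illustration under stated assumptions:
the Parser interface and parseWithFallback method are hypothetical stand-ins
and not PDFBox classes. The sketch logs the primary failure, since a silent
fallback would hide the files that the new parser cannot yet handle.

```java
import java.io.IOException;

// Hypothetical illustration only; not part of PDFBox.
class ParserFallbackSketch {
    /** Stand-in for a parser implementation (assumed interface). */
    interface Parser<T> {
        T parse(byte[] input) throws IOException;
    }

    /** Tries the primary parser; on IOException, logs it and tries the fallback. */
    static <T> T parseWithFallback(byte[] input, Parser<T> primary, Parser<T> fallback)
            throws IOException {
        try {
            return primary.parse(input);
        } catch (IOException primaryFailure) {
            // Report the failure so that it can still end up in a bug report.
            System.err.println("Primary parser failed: " + primaryFailure.getMessage());
            try {
                return fallback.parse(input);
            } catch (IOException fallbackFailure) {
                // Keep both causes visible to the caller.
                fallbackFailure.addSuppressed(primaryFailure);
                throw fallbackFailure;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Parser<String> strict = in -> { throw new IOException("broken xref table"); };
        Parser<String> lenient = in -> "parsed " + in.length + " bytes";
        System.out.println(parseWithFallback(new byte[3], strict, lenient));
    }
}
```
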
>>>>>>> On 14.10.2014 at 09:45, Timo Boehme wrote:
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> On 14.10.2014 at 07:22, John Hewson wrote:
>>>>>>>>> Hi,
>>>>>>>>>>> John Hewson <[email protected]> wrote on 10 October 2014 at 20:05:
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>         - Parsing (Andreas?)
>>>>>>>>>> I guess we won't get a completely new parser in 2.0, but I'll try to
>>>>>>>>>> improve the XRef
>>>>>>>>>> and COSStream handling.
>>>>>>>>> It would be great if we could get rid of the old parser and switch
>>>>>>>>> to the non-sequential
>>>>>>>>> parser, WDYT?
>>>>>>>> I would also propose to completely remove the old parser. That way
>>>>>>>> we are more flexible in parsing streams etc. since parts of the
>>>>>>>> non-sequential parser are a compromise to work side-by-side with the
>>>>>>>> old parser.
>>>>>>>> Possibly there are a small number of functions for which the old
>>>>>>>> parser is still needed - e.g. signing?
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Best,
>>>>>>>> Timo
>>>>>>>> 
>>>>>>>> 
>>>> 
>>>> --
>>>> 
>>>> Timo Boehme
>>>> OntoChem GmbH
>>>> H.-Damerow-Str. 4
>>>> 06120 Halle/Saale
>>>> T: +49 345 4780474
>>>> F: +49 345 4780471
>>>> [email protected]
>>>> 
>>>> _____________________________________________________________________
>>>> 
>>>> OntoChem GmbH
>>>> Managing Director: Dr. Lutz Weber
>>>> Registered office: Halle / Saale
>>>> Register court: Stendal
>>>> Registration number: HRB 215461
>>>> _____________________________________________________________________
>>>> 
>
