+1 to including the modified docs.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: <Allison>, "Timothy B." <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Monday, March 30, 2015 at 6:51 AM
To: "[email protected]" <[email protected]>
Subject: RE: including refactored docs from govdocs1 in test suite

>I think this is an open question within Tika.  Some parsers prefer one
>thing over another.  And there are different levels of corruption.
>
>In the two cases where govdocs1 docs might be useful in tests, the
>hyperlinks in .doc files do not appear to be "standard", but MSWord
>opens them without a problem.  In cases where an application can open and
>correctly process the content, I think we ought to try to extract content
>without throwing exceptions.
>
>-----Original Message-----
>From: Tyler Palsulich [mailto:[email protected]]
>Sent: Monday, March 30, 2015 9:39 AM
>To: [email protected]
>Subject: RE: including refactored docs from govdocs1 in test suite
>
>Ah. I see.
>
>In general, what is the goal with handling corrupted files? Extract as much
>as possible and fail gracefully?
>
>Tyler
>
>On Mar 30, 2015 9:32 AM, "Allison, Timothy B." <[email protected]> wrote:
>>
>> Unfortunately, no.  MSOffice fixes the document when I do that.
>>
>> -----Original Message-----
>> From: Tyler Palsulich [mailto:[email protected]]
>> Sent: Monday, March 30, 2015 9:24 AM
>> To: [email protected]
>> Subject: Re: including refactored docs from govdocs1 in test suite
>>
>> Can you copy the hyperlink into a new doc and change the URL? I have no
>> idea about including the modified version.
>>
>> Tyler
>> On Mar 30, 2015 9:18 AM, "Allison, Timothy B." <[email protected]> wrote:
>>
>> > All,
>> >
>> >   As part of TIKA-1512, I found that I can delete all of the contents,
>> > including the metadata, except for one hyperlink in two documents from
>> > govdocs1 and still get the proper behavior -- fail before fix, work
>> > after fix.
>> >
>> >   These documents are in the public domain.
>> >
>> >   Is it ok to include these modified documents in our test suite or
>> > should I avoid inclusion?
>> >
>> >   Happy to avoid inclusion for the sake of a quick release of 1.8 and
>> > then we have time to discuss/determine way ahead... unless the answer is
>> > obvious.
>> >
>> >          Best,
>> >
>> >                      Tim
>> >
>> > -----Original Message-----
>> > From: Allison, Timothy B. [mailto:[email protected]]
>> > Sent: Monday, March 30, 2015 7:03 AM
>> > To: [email protected]
>> > Subject: RE: [DISCUSS] Tika 1.8 or 1.7.1
>> >
>> > Unless there are objections, I'd like these to be resolved before 1.8:
>> >
>> > TIKA-1584 -- I'll fix
>> > TIKA-1575 -- Resolved by Konstantin Gribov (thank you!)
>> > TIKA-1512 -- I'll put in a temporary fix so that we don't get IOOBEs,
>> > but I'll leave this open and do some more digging to see if we need to
>> > open a ticket at the POI level
>> > TIKA-1511 -- I'll remove "provided" for xerial
>> >
>> > TIKA-1549 -- We should thank Toke Eskildsen in CHANGES.txt, no?
>> >
>> > I'll have these fixes completed by noon EDT.  Should I run against
>> > govdocs1 before or after the RC?
>> >
>> > My last build of Tika app (a few days ago) ballooned to ~43MB, and
>> > that's before I add ~3MB for xerial.  Tika server is now ~48MB.  As of
>> > my last build, we are still including ~4MB of pdfs (README.NLDAS1.pdf
>> > and README.NLDAS2.pdf) from the GRIB(?) parser in the tika-app and
>> > tika-server jars.
>> >
>> > Best,
>> >
>> >               Tim
>> >
>> >
>> >
>> > -----Original Message-----
>> > From: Tyler Palsulich [mailto:[email protected]]
>> > Sent: Sunday, March 29, 2015 9:13 AM
>> > To: [email protected]
>> > Subject: Re: [DISCUSS] Tika 1.8 or 1.7.1
>> >
>> > Once TIKA-1584 and TIKA-1575 are resolved, I'll work up an RC (unless
>> > something else pops up).
>> >
>> > Thank you everyone.
>> >
>> > Tyler
>> > On Mar 29, 2015 4:43 AM, "Hong-Thai Nguyen" <[email protected]> wrote:
>> >
>> > > +1 for 1.8
>> > >
>> > > Hong-Thai
>> > >
>> > > > On 28 Mar 2015, at 16:01, Tyler Palsulich <[email protected]> wrote:
>> > > >
>> > > > Hi Folks,
>> > > >
>> > > > Now that TIKA-1581 (JHighlight licensing issues) is resolved, we
>> > > > need to release a new version of Tika. I'll volunteer to be the
>> > > > release manager again.
>> > > >
>> > > > Should we release this as 1.8 or 1.7.1?
>> > > >
>> > > > Does anyone have any last minute issues they'd like to finish and
>> > > > see in Tika 1.X? I'd like to get the example working with CORS
>> > > > (TIKA-1585 and TIKA-1586). Any others?
>> > > >
>> > > > Have a good weekend,
>> > > > Tyler
>> > >
>> >
