+1 to including the modified docs.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: <Allison>, "Timothy B." <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Monday, March 30, 2015 at 6:51 AM
To: "[email protected]" <[email protected]>
Subject: RE: including refactored docs from govdocs1 in test suite

>I think this is an open question within Tika. Some parsers prefer one
>thing over another. And there are different levels of corruption.
>
>In the two cases where govdocs1 docs might be useful in tests, the
>hyperlinks in .doc files do not appear to be "standard", but MSWord
>opens them without a problem. In cases where an application can open and
>correctly process the content, I think we ought to try to extract content
>without throwing exceptions.
>
>-----Original Message-----
>From: Tyler Palsulich [mailto:[email protected]]
>Sent: Monday, March 30, 2015 9:39 AM
>To: [email protected]
>Subject: RE: including refactored docs from govdocs1 in test suite
>
>Ah. I see.
>
>In general, what is the goal with handling corrupted files? Extract as
>much as possible and fail gracefully?
>
>Tyler
>
>On Mar 30, 2015 9:32 AM, "Allison, Timothy B." <[email protected]> wrote:
>>
>> Unfortunately, no. MSOffice fixes the document when I do that.
>>
>> -----Original Message-----
>> From: Tyler Palsulich [mailto:[email protected]]
>> Sent: Monday, March 30, 2015 9:24 AM
>> To: [email protected]
>> Subject: Re: including refactored docs from govdocs1 in test suite
>>
>> Can you copy the hyperlink into a new doc and change the URL? I have no
>> idea about including the modified version.
>>
>> Tyler
>> On Mar 30, 2015 9:18 AM, "Allison, Timothy B." <[email protected]> wrote:
>>
>> > All,
>> >
>> > As part of TIKA-1512, I found that I can delete all of the contents,
>> > including the metadata, except for one hyperlink in two documents from
>> > govdocs1 and still get the proper behavior -- fail before fix, work after
>> > fix.
>> >
>> > These documents are in the public domain.
>> >
>> > Is it ok to include these modified documents in our test suite or should
>> > I avoid inclusion?
>> >
>> > Happy to avoid inclusion for the sake of a quick release of 1.8 and then
>> > we have time to discuss/determine way ahead... unless the answer is obvious.
>> >
>> > Best,
>> >
>> > Tim
>> >
>> > -----Original Message-----
>> > From: Allison, Timothy B. [mailto:[email protected]]
>> > Sent: Monday, March 30, 2015 7:03 AM
>> > To: [email protected]
>> > Subject: RE: [DISCUSS] Tika 1.8 or 1.7.1
>> >
>> > Unless there are objections, I'd like these to be resolved before 1.8:
>> >
>> > TIKA-1584 -- I'll fix
>> > TIKA-1575 -- Resolved by Konstantin Gribov (thank you!)
>> > TIKA-1512 -- I'll put in a temporary fix so that we don't get IOOBEs, but
>> > I'll leave this open and do some more digging to see if we need to open a
>> > ticket at the POI level
>> > TIKA-1511 -- I'll remove "provided" for xerial
>> >
>> > TIKA-1549 -- We should thank Toke Eskildsen in CHANGES.txt, no?
>> >
>> > I'll have these fixes completed by noon EDT. Should I run against
>> > govdocs1 before or after the RC?
>> >
>> > My last build of Tika app (a few days ago) ballooned to ~43MB, and that's
>> > before I add ~3MB for xerial. Tika server is now ~48MB. As of my last
>> > build, we are still including ~4MB of pdfs (README.NLDAS1.pdf and
>> > README.NLDAS2.pdf) from the GRIB(?) parser in the tika-app and tika-server
>> > jars.
>> >
>> > Best,
>> >
>> > Tim
>> >
>> > -----Original Message-----
>> > From: Tyler Palsulich [mailto:[email protected]]
>> > Sent: Sunday, March 29, 2015 9:13 AM
>> > To: [email protected]
>> > Subject: Re: [DISCUSS] Tika 1.8 or 1.7.1
>> >
>> > Once TIKA-1584 and TIKA-1575 are resolved, I'll work up an RC (unless
>> > something else pops up).
>> >
>> > Thank you everyone.
>> >
>> > Tyler
>> > On Mar 29, 2015 4:43 AM, "Hong-Thai Nguyen" <[email protected]> wrote:
>> >
>> > > +1 for 1.8
>> > >
>> > > Hong-Thai
>> > >
>> > > > On 28 Mar 2015, at 16:01, Tyler Palsulich <[email protected]> wrote:
>> > > >
>> > > > Hi Folks,
>> > > >
>> > > > Now that TIKA-1581 (JHighlight licensing issues) is resolved, we need to
>> > > > release a new version of Tika. I'll volunteer to be the release manager
>> > > > again.
>> > > >
>> > > > Should we release this as 1.8 or 1.7.1?
>> > > >
>> > > > Does anyone have any last minute issues they'd like to finish and see in
>> > > > Tika 1.X? I'd like to get the example working with CORS (TIKA-1585 and
>> > > > TIKA-1586). Any others?
>> > > >
>> > > > Have a good weekend,
>> > > > Tyler
>> > >
>> >
