> From: Allison, Timothy B.
> Sent: April 9, 2015 9:02:44am PDT
> To: [email protected]
> Subject: RE: [VOTE] Release Apache Tika 1.8 Candidate #1
>
> I just finished the against govdocs1 with 1.7 vs. 1.8-rc1, and all looks good
> with one major change... on first glance.
>
> Because of my "fix" on TIKA-1519 and the law of unintended consequences,
> files that start like so:
>
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
> <html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
>
> Have different Content-Type(s) between the
>
> In Tika 1.7, they used to have a Content-Type of: text/html;
> charset=iso-8859-1
>
> In Tika 1.8-rc1, they now have a Content-Type of: application/xhtml+xml
>
> This is a major change.
>
> Do we want this?
>
> Or do we want to revert to the old behavior but add some kind of filter to
> prevent crazy Content-Type information like the following from overwriting
> what the detector detected:
> <meta http-equiv="Content-Type" content="application/pdf" />
> or
> <meta http-equiv="Content-Type" content="anythingIFeelLikeInserting" />

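For reference, the filter idea in the quoted message (revert to the old behavior, but keep implausible meta values from overwriting what the detector found) could be sketched roughly as below. This is only an illustration in plain Java, not Tika code: the class name MetaContentTypeFilter, the resolve method, and the allowlist of HTML-family types are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Tika: decides whether a Content-Type
// declared in <meta http-equiv="Content-Type"> may replace the detected type.
final class MetaContentTypeFilter {

    // Assumed allowlist: only HTML-family types may override the detector.
    private static final Set<String> HTML_TYPES =
            new HashSet<>(Arrays.asList("text/html", "application/xhtml+xml"));

    // Strips parameters such as "; charset=..." and normalizes case.
    static String baseType(String contentType) {
        if (contentType == null) {
            return "";
        }
        int semi = contentType.indexOf(';');
        String base = semi >= 0 ? contentType.substring(0, semi) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // Returns the declared value if its base type is on the allowlist,
    // otherwise falls back to what the detector reported.
    static String resolve(String detected, String declared) {
        return HTML_TYPES.contains(baseType(declared)) ? declared.trim() : detected;
    }

    public static void main(String[] args) {
        String detected = "application/xhtml+xml";
        // The three meta values from the quoted message.
        System.out.println(resolve(detected, "text/html; charset=iso-8859-1"));
        System.out.println(resolve(detected, "application/pdf"));
        System.out.println(resolve(detected, "anythingIFeelLikeInserting"));
    }
}
```

With this policy the XHTML files from the govdocs1 comparison would report text/html; charset=iso-8859-1 again, as in 1.7, while the application/pdf and anythingIFeelLikeInserting values would be ignored in favor of the detected type. Whether the allowlist should be fixed or derived from the parser's supported types is left open here.
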
I'd have to vote for option #2, as currently there are a number of places in
my various crawl projects (which use Tika for mime type detection) that use
text/html to mean HTML pages.

Though we've mostly transitioned to using HtmlParser.getSupportedTypes() to
generate this list dynamically based on what Tika reports.

-- Ken

> -----Original Message-----
> From: David Meikle [mailto:[email protected]]
> Sent: Wednesday, April 08, 2015 8:06 PM
> To: [email protected]
> Subject: Re: [VOTE] Release Apache Tika 1.8 Candidate #1
>
> Hey Tyler,
>
>> On 7 Apr 2015, at 19:54, Tyler Palsulich <[email protected]> wrote:
>>
>> [ ] +1 Release this package as Apache Tika 1.8
>> [ ] -1 Do not release this package because...
>
> Whilst my testing with the release is good so far on Mac and Linux with
> Windows to go, and I am inclined to +1, it would be good if you were able to
> get your code signing key signed by someone nearby to avoid the warning below?
>
> amadeaus-air:release david$ gpg --verify tika-1.8-src.zip.asc
> gpg: Signature made Tue 7 Apr 19:45:15 2015 EDT using RSA key ID D4F10117
> gpg: Good signature from "Tyler Palsulich <[email protected]>"
> gpg: WARNING: This key is not certified with a trusted signature!
> gpg: There is no indication that the signature belongs to the owner.
> Primary key fingerprint: 1D32 9CC2 D69C 821B FBE4 183E 8810 BB19 D4F1 0117
>
> Not sure if Chris, Lewis et al are near you and can do this quickly?
>
> Cheers,
> Dave

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
