> From: Allison, Timothy B.
> Sent: April 9, 2015 9:02:44am PDT
> To: [email protected]
> Subject: RE: [VOTE] Release Apache Tika 1.8 Candidate #1
> 
> I just finished the run against govdocs1 with 1.7 vs. 1.8-rc1, and all looks good 
> with one major change... on first glance.
> 
> Because of my "fix" on TIKA-1519 and the law of unintended consequences, 
> files that start like so:
> 
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" 
> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
> <html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
> 
> Have different Content-Type(s) between the two versions.
> 
> In Tika 1.7, they used to have a Content-Type of: text/html; 
> charset=iso-8859-1 
> 
> In Tika 1.8-rc1, they now have a Content-Type of: application/xhtml+xml
> 
> This is a major change.
> 
> Do we want this? 
> 
> Or do we want to revert to the old behavior but add some kind of filter to 
> prevent crazy Content-Type information like the following from overwriting 
> what the detector detected:
> <meta http-equiv="Content-Type" content="application/pdf" />
> or
> <meta http-equiv="Content-Type" content="anythingIFeelLikeInserting" />

I'd have to vote for option #2, as currently there are a number of places in my 
various crawl projects (which use Tika for mime type detection) that treat 
text/html as meaning HTML pages.

Though we've mostly transitioned to using HtmlParser.getSupportedTypes() to 
generate this list dynamically based on what Tika reports.
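
To make option #2 concrete, here is a minimal sketch of the kind of filter 
being discussed: a Content-Type declared in a <meta> tag may only replace the 
detected type when both are HTML-family types. This is not Tika code. The class 
name, the resolve() method and the hard-coded allow-list are all made up for 
illustration; a real version would presumably build the list from whatever 
HtmlParser.getSupportedTypes() reports.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Tika: decides whether a <meta>-declared
// Content-Type is allowed to override what the detector reported.
class MetaContentTypeFilter {

    // Assumed allow-list of HTML-family base types (illustrative only).
    static final Set<String> HTML_TYPES =
            Set.of("text/html", "application/xhtml+xml");

    // Strips parameters such as "; charset=..." and normalizes case.
    static String baseType(String contentType) {
        if (contentType == null) {
            return "";
        }
        int semi = contentType.indexOf(';');
        String base = semi >= 0 ? contentType.substring(0, semi) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // Returns the declared type only if both the detected and the declared
    // types are HTML-family; otherwise keeps the detector's answer.
    static String resolve(String detected, String declared) {
        if (HTML_TYPES.contains(baseType(detected))
                && HTML_TYPES.contains(baseType(declared))) {
            return declared.trim();
        }
        return detected;
    }

    public static void main(String[] args) {
        String detected = "application/xhtml+xml";
        // prints: text/html; charset=iso-8859-1
        System.out.println(resolve(detected, "text/html; charset=iso-8859-1"));
        // prints: application/xhtml+xml
        System.out.println(resolve(detected, "application/pdf"));
        // prints: application/xhtml+xml
        System.out.println(resolve(detected, "anythingIFeelLikeInserting"));
    }
}
```

With a filter along these lines, the XHTML files above would keep reporting 
text/html; charset=iso-8859-1 as in 1.7, while a bogus application/pdf 
declaration would be ignored in favor of the detected type.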

-- Ken

 
> -----Original Message-----
> From: David Meikle [mailto:[email protected]] 
> Sent: Wednesday, April 08, 2015 8:06 PM
> To: [email protected]
> Subject: Re: [VOTE] Release Apache Tika 1.8 Candidate #1
> 
> Hey Tyler,
> 
>> On 7 Apr 2015, at 19:54, Tyler Palsulich <[email protected]> wrote:
>> 
>> [ ] +1 Release this package as Apache Tika 1.8
>> [ ] -1 Do not release this package because...
> 
> Whilst my testing with the release is good so far on Mac and Linux with 
> Windows to go, and I am inclined to +1, it would be good if you were able to 
> get your code signing key signed by someone nearby to avoid the warning below?
> 
> amadeaus-air:release david$ gpg --verify tika-1.8-src.zip.asc 
> gpg: Signature made Tue  7 Apr 19:45:15 2015 EDT using RSA key ID D4F10117
> gpg: Good signature from "Tyler Palsulich <[email protected]>"
> gpg: WARNING: This key is not certified with a trusted signature!
> gpg:          There is no indication that the signature belongs to the owner.
> Primary key fingerprint: 1D32 9CC2 D69C 821B FBE4  183E 8810 BB19 D4F1 0117
> 
> Not sure if Chris, Lewis et al are near you and could do this quickly?
> 
> Cheers,
> Dave

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr




