[jira] [Commented] (NUTCH-2606) MIME detection is wrong for plain-text documents send as Content-Type "application/msword"

2018-11-19 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692320#comment-16692320
 ] 

Hudson commented on NUTCH-2606:
---

FAILURE: Integrated in Jenkins build Nutch-trunk #3590 (See 
[https://builds.apache.org/job/Nutch-trunk/3590/])
NUTCH-2606 MIME detection is wrong for plain-text documents send as (snagel: 
[https://github.com/apache/nutch/commit/5f53fd4807f62d002d24f6cfe4b3fae5c0e62741])
* (edit) src/test/org/apache/nutch/util/TestMimeUtil.java
* (edit) src/java/org/apache/nutch/util/MimeUtil.java


> MIME detection is wrong for plain-text documents send as Content-Type 
> "application/msword"
> --
>
> Key: NUTCH-2606
> URL: https://issues.apache.org/jira/browse/NUTCH-2606
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.14
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.16
>
>
> Plain-text documents sent with Content-Type "application/msword" are parsed 
> as Word documents. The MIME detection should be fixed so that these 
> documents are correctly identified as plain text. See NUTCH-2603 and 
> https://www.atnf.csiro.au/computing/software/gipsy/doc/update.doc



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2606) MIME detection is wrong for plain-text documents send as Content-Type "application/msword"

2018-11-19 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692254#comment-16692254
 ] 

ASF GitHub Bot commented on NUTCH-2606:
---

sebastian-nagel closed pull request #392: NUTCH-2606 MIME detection is wrong 
for plain-text documents send as Content-Type "application/msword"
URL: https://github.com/apache/nutch/pull/392
 
 
   

This pull request was merged from a forked repository. Because GitHub hides
the original diff after the merge, it is reproduced below for the sake of
provenance:

diff --git a/src/java/org/apache/nutch/util/MimeUtil.java b/src/java/org/apache/nutch/util/MimeUtil.java
index d380427ae..443341ecd 100644
--- a/src/java/org/apache/nutch/util/MimeUtil.java
+++ b/src/java/org/apache/nutch/util/MimeUtil.java
@@ -200,8 +200,7 @@ public String autoResolveContentType(String typeName, String url, byte[] data) {
     }
 
     if (magicType != null && !magicType.equals(MimeTypes.OCTET_STREAM)
-        && !magicType.equals(MimeTypes.PLAIN_TEXT) && retType != null
-        && !retType.equals(magicType)) {
+        && retType != null && !retType.equals(magicType)) {
 
       // If magic enabled and the current mime type differs from that of the
       // one returned from the magic, take the magic mimeType
diff --git a/src/test/org/apache/nutch/util/TestMimeUtil.java b/src/test/org/apache/nutch/util/TestMimeUtil.java
index d0b45dbac..72a42b457 100644
--- a/src/test/org/apache/nutch/util/TestMimeUtil.java
+++ b/src/test/org/apache/nutch/util/TestMimeUtil.java
@@ -68,7 +68,16 @@
           "\nhttp://www.w3.org/1999/xhtml\;>"
           + "\n\n"
           + ""
-          + "\nHello, World!" } };
+          + "\nHello, World!" },
+      { /*
+         * test detection of plain-text documents with erroneous Content-Type
+         * sent in HTTP header (NUTCH-2606)
+         */
+          "text/plain", // correct MIME type
+          "test.doc", // erroneously indicates MS-Word document
+          "application/msword", // erroneous Content-Type
+          "This is a plain text document",
+          "requires-mime-magic" } };
 
   public static String[][] binaryFiles = { {
       "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
@@ -99,6 +108,9 @@ public void testWithMimeMagic() {
   /** use only HTTP Content-Type (if given) and URL pattern */
   public void testWithoutMimeMagic() {
     for (String[] testPage : textBasedFormats) {
+      if (testPage.length > 4 && "requires-mime-magic".equals(testPage[4])) {
+        continue;
+      }
       String mimeType = getMimeType(urlPrefix + testPage[1],
           testPage[3].getBytes(defaultCharset), testPage[2], false);
       assertEquals("", testPage[0], mimeType);
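
The test change above relies on an optional fifth array element to mark rows that only make sense when MIME magic is enabled. A minimal, self-contained sketch of that pattern follows. The class name, the helper methods, and the first data row are hypothetical and chosen for illustration; only the "requires-mime-magic" marker and the second row are taken from the diff.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of table-driven test rows with an optional flag column. */
final class FlaggedRowsSketch {

  static final String MAGIC_FLAG = "requires-mime-magic";

  // Columns: expected type, file name, HTTP Content-Type, content, optional flag.
  static final String[][] ROWS = {
      // illustrative row, not from the Nutch test
      { "text/html", "test.html", "text/html", "<html><body>Hello</body></html>" },
      // row added for NUTCH-2606
      { "text/plain", "test.doc", "application/msword",
          "This is a plain text document", MAGIC_FLAG } };

  /** True if the row carries the optional flag in its fifth column. */
  static boolean requiresMimeMagic(String[] row) {
    return row.length > 4 && MAGIC_FLAG.equals(row[4]);
  }

  /** File names of the rows a test without MIME magic would still check. */
  static List<String> namesRunnableWithoutMagic(String[][] rows) {
    List<String> names = new ArrayList<>();
    for (String[] row : rows) {
      if (requiresMimeMagic(row)) {
        continue; // skipped, as in testWithoutMimeMagic()
      }
      names.add(row[1]);
    }
    return names;
  }

  public static void main(String[] args) {
    System.out.println(namesRunnableWithoutMagic(ROWS)); // prints [test.html]
  }
}
```

Checking the array length before reading index 4 keeps the existing four-element rows valid without touching them.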


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (NUTCH-2606) MIME detection is wrong for plain-text documents send as Content-Type "application/msword"

2018-11-14 Thread Sebastian Nagel (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686245#comment-16686245
 ] 

Sebastian Nagel commented on NUTCH-2606:


Any objections? Otherwise I'll commit this fix within the next few days.



[jira] [Commented] (NUTCH-2606) MIME detection is wrong for plain-text documents send as Content-Type "application/msword"

2018-10-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16648903#comment-16648903
 ] 

ASF GitHub Bot commented on NUTCH-2606:
---

sebastian-nagel opened a new pull request #392: NUTCH-2606 MIME detection is 
wrong for plain-text documents send as Content-Type "application/msword"
URL: https://github.com/apache/nutch/pull/392
 
 
   - allow text/plain (from MIME magic) to overwrite type derived from HTTP 
Content-Type or file extension




[jira] [Commented] (NUTCH-2606) MIME detection is wrong for plain-text documents send as Content-Type "application/msword"

2018-10-13 Thread Sebastian Nagel (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16648893#comment-16648893
 ] 

Sebastian Nagel commented on NUTCH-2606:


The debugger revealed a simple cause: a detected type {{text/plain}} is 
ignored and hence cannot override a wrong type sent by the server, see 
[MimeUtil, line 
203|https://github.com/apache/nutch/blob/2c69694/src/java/org/apache/nutch/util/MimeUtil.java#L203].
 This logic was added with NUTCH-767 because "the type for empty content 
is auto-detected as "text/plain" and this value overrides the hint from the 
Content-Type header". That is no longer the case: Tika does not detect empty 
documents as {{text/plain}} unless they are named with the extension 
{{.txt}}.
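
To make the effect of that condition concrete, here is a minimal sketch that models only the override check, before and after the proposed change. It is a simplification under stated assumptions: the class and method names are hypothetical, the types are plain strings, and the rest of the resolution logic in MimeUtil is left out.

```java
/** Hypothetical, simplified model of the override check; not the Nutch class. */
final class MimeOverrideSketch {

  static final String OCTET_STREAM = "application/octet-stream";
  static final String PLAIN_TEXT = "text/plain";

  /** Old condition: a magic result of text/plain never overrides retType. */
  static String resolveBefore(String retType, String magicType) {
    if (magicType != null && !magicType.equals(OCTET_STREAM)
        && !magicType.equals(PLAIN_TEXT) && retType != null
        && !retType.equals(magicType)) {
      return magicType;
    }
    return retType;
  }

  /** New condition: text/plain from magic may override retType as well. */
  static String resolveAfter(String retType, String magicType) {
    if (magicType != null && !magicType.equals(OCTET_STREAM)
        && retType != null && !retType.equals(magicType)) {
      return magicType;
    }
    return retType;
  }

  public static void main(String[] args) {
    // Server claims Word, magic detection says plain text.
    System.out.println(resolveBefore("application/msword", "text/plain"));
    System.out.println(resolveAfter("application/msword", "text/plain"));
  }
}
```

With a server-supplied type of application/msword and a magic result of text/plain, the old condition keeps application/msword while the new one returns text/plain. A magic result of application/octet-stream is still ignored in both variants.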



[jira] [Commented] (NUTCH-2606) MIME detection is wrong for plain-text documents send as Content-Type "application/msword"

2018-06-20 Thread Sebastian Nagel (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16518338#comment-16518338
 ] 

Sebastian Nagel commented on NUTCH-2606:


Need to check what's going wrong. Actually, Tika alone (here a 1.19 snapshot) 
detects the content type correctly and consequently chooses the right parser:
{noformat}
% tika https://www.atnf.csiro.au/computing/software/gipsy/doc/update.doc   
http://www.w3.org/1999/xhtml;>










Script:   update

Purpose:  Updates the GIPSY libraries and applications.

Category: MANAGEMENT

File: update.csh

Author:   K.G. Begeman

Use:  This shell script is executed by the atqueue or by crontab
  script. When this script has completed the update, a message
  is send to the GIPSY manager (see $gip_loc/manager), or if
  there is no GIPSY manager found for this machine, to the
  initiating user (usually gipsy).
  The complete logfile is stored in $gip_loc/update.`hostname`.
  Old executables older than 1 week are removed from $gip_exe.

Updates:  Jul 23, 1991: KGB, Document created.
  Mar 15, 1993: KGB, Logfile store in $gip_loc.
  Jul 29, 1993: KGB, Check for existing log files.
  Nov 10, 1998: JPT, Version number upgraded to 101.



{noformat}




[jira] [Commented] (NUTCH-2606) MIME detection is wrong for plain-text documents send as Content-Type "application/msword"

2018-06-20 Thread Markus Jelsma (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16517997#comment-16517997
 ] 

Markus Jelsma commented on NUTCH-2606:
--

Ah, this is interesting. Nutch indeed believes it is a Word document, and my 
browser agrees and opens a word processor. Only the {{file}} command-line tool 
correctly identifies it as plain text.
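
Content-based tools can tell the two cases apart because legacy binary Word files are OLE2 compound files and begin with a fixed eight-byte signature, whereas the document in question starts with ordinary text. The sketch below illustrates the idea only. It is not how {{file}} or Tika is implemented; the class name, the method names, and the crude text heuristic are assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical content sniffing sketch; not the logic used by file or Tika. */
final class MagicSniffSketch {

  // Signature at the start of OLE2 compound files (legacy binary .doc).
  private static final byte[] OLE2_MAGIC = {
      (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
      (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1 };

  /** True if the data starts with the OLE2 compound file signature. */
  static boolean hasOle2Signature(byte[] data) {
    if (data.length < OLE2_MAGIC.length) {
      return false;
    }
    for (int i = 0; i < OLE2_MAGIC.length; i++) {
      if (data[i] != OLE2_MAGIC[i]) {
        return false;
      }
    }
    return true;
  }

  /**
   * Crude heuristic (an assumption of this sketch): non-empty data whose
   * first 512 bytes contain no control bytes other than tab, LF, CR, FF.
   */
  static boolean looksLikePlainText(byte[] data) {
    int limit = Math.min(data.length, 512);
    if (limit == 0) {
      return false; // empty content is not treated as text
    }
    for (int i = 0; i < limit; i++) {
      int b = data[i] & 0xFF;
      boolean allowedControl = b == '\t' || b == '\n' || b == '\r' || b == '\f';
      if (b < 0x20 && !allowedControl) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    byte[] text = "Script:   update\n".getBytes(StandardCharsets.US_ASCII);
    System.out.println(hasOle2Signature(text));
    System.out.println(looksLikePlainText(text));
  }
}
```

A file named update.doc that fails the signature check and passes the text check is a plausible candidate for text/plain regardless of its extension or the Content-Type header.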
