[tika] branch main updated (9488d076e -> 02f0d0441)

2023-06-08 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a change to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git


from 9488d076e bump OpenSearch version to latest
 add 500900d67 TIKA-4060 Test AAC files, based on testWAV.wav, one without 
ID3, one with dummy ID3 values
 add ae85b9e4e AAC magic, based on PRONOM patterns found by Gregory Lepore
 add 8f838c512 AAC detection tests, ID3 one currently failing...
 add 04021e427 Hex values in a match regex need escaping to be treated as 
hex
 add 02f0d0441 Merge branch 'main' into TIKA-4060

No new revisions were added by this update.

Summary of changes:
 .../resources/org/apache/tika/mime/tika-mimetypes.xml|  10 ++
 .../test/java/org/apache/tika/mime/TestMimeTypes.java|   9 +
 .../src/test/resources/test-documents/testAAC.aac| Bin 0 -> 779 bytes
 .../src/test/resources/test-documents/testAACid3.aac | Bin 0 -> 2176 bytes
 4 files changed, 19 insertions(+)
 create mode 100644 
tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testAAC.aac
 create mode 100644 
tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testAACid3.aac



[tika] 01/01: Merge branch 'main' into TIKA-4060

2023-06-08 Thread nick

nick pushed a commit to branch TIKA-4060
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 02f0d0441b8c380faebcf8bb14d6f91b0252f058
Merge: 04021e427 9488d076e
Author: Nick Burch 
AuthorDate: Thu Jun 8 22:12:00 2023 +0100

Merge branch 'main' into TIKA-4060

 CHANGES.txt|  2 +
 .../tika/exception/WriteLimitReachedException.java |  6 +-
 .../apache/tika/metadata/TikaCoreProperties.java   |  1 +
 .../org/apache/tika/parser/AutoDetectParser.java   |  4 ++
 .../java/org/apache/tika/pipes/PipesClient.java| 63 --
 .../java/org/apache/tika/pipes/PipesResult.java| 28 +---
 .../java/org/apache/tika/pipes/PipesServer.java| 73 -
 .../org/apache/tika/pipes/async/AsyncConfig.java   | 10 +++
 .../apache/tika/pipes/async/AsyncProcessor.java| 14 +++-
 .../apache/tika/sax/ContentHandlerDecorator.java   | 38 ++-
 .../org/apache/tika/pipes/PipesServerTest.java | 76 ++
 .../tika/pipes/async/AsyncProcessorTest.java   | 74 +
 .../tika/pipes/async/MockDigesterFactory.java  | 56 
 .../org/apache/tika/pipes/async/MockReporter.java  |  6 +-
 .../resources/org/apache/tika/pipes/TIKA-3941.xml  | 30 +
 .../opensearch/tests/TikaPipesOpenSearchTest.java  |  2 +-
 tika-parent/pom.xml|  2 +-
 .../reporters/jdbc/TestJDBCPipesReporter.java  |  2 +-
 .../server/core/resource/UnpackerResource.java | 52 ++-
 .../tika/server/standard/UnpackerResourceTest.java |  9 +++
 20 files changed, 494 insertions(+), 54 deletions(-)



[tika] branch TIKA-4060 updated (04021e427 -> 02f0d0441)

2023-06-08 Thread nick

nick pushed a change to branch TIKA-4060
in repository https://gitbox.apache.org/repos/asf/tika.git


from 04021e427 Hex values in a match regex need escaping to be treated as 
hex
 add d72077833 Bump aws.version from 1.12.483 to 1.12.484
 add d149fc71b Merge pull request #1180 from 
apache/dependabot/maven/aws.version-1.12.484
 add 2d9daef85 TIKA-4039 (#1181)
 add ceed7be8b TIKA-4062 (#1179)
 add 6cea7717c TIKA-3941 -- allow reporting of intermediate results from 
the pipes processor (#1167)
 add 9488d076e bump OpenSearch version to latest
 new 02f0d0441 Merge branch 'main' into TIKA-4060

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 CHANGES.txt|  2 +
 .../tika/exception/WriteLimitReachedException.java |  6 +-
 .../apache/tika/metadata/TikaCoreProperties.java   |  1 +
 .../org/apache/tika/parser/AutoDetectParser.java   |  4 ++
 .../java/org/apache/tika/pipes/PipesClient.java| 63 --
 .../java/org/apache/tika/pipes/PipesResult.java| 28 +---
 .../java/org/apache/tika/pipes/PipesServer.java| 73 -
 .../org/apache/tika/pipes/async/AsyncConfig.java   | 10 +++
 .../apache/tika/pipes/async/AsyncProcessor.java| 14 +++-
 .../apache/tika/sax/ContentHandlerDecorator.java   | 38 ++-
 .../org/apache/tika/pipes/PipesServerTest.java | 76 ++
 .../tika/pipes/async/AsyncProcessorTest.java   | 74 +
 .../tika/pipes/async/MockDigesterFactory.java  | 38 +--
 .../org/apache/tika/pipes/async/MockReporter.java  |  6 +-
 .../TIKA-3941.xml} | 18 +++--
 .../opensearch/tests/TikaPipesOpenSearchTest.java  |  2 +-
 tika-parent/pom.xml|  2 +-
 .../reporters/jdbc/TestJDBCPipesReporter.java  |  2 +-
 .../server/core/resource/UnpackerResource.java | 52 ++-
 .../tika/server/standard/UnpackerResourceTest.java |  9 +++
 20 files changed, 434 insertions(+), 84 deletions(-)
 create mode 100644 
tika-core/src/test/java/org/apache/tika/pipes/PipesServerTest.java
 copy 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-digest-commons/src/main/java/org/apache/tika/parser/digestutils/CommonsDigesterFactory.java
 => 
tika-core/src/test/java/org/apache/tika/pipes/async/MockDigesterFactory.java 
(61%)
 copy 
tika-core/src/test/resources/org/apache/tika/{config/fetchers-noname-config.xml 
=> pipes/TIKA-3941.xml} (76%)



[tika] branch TIKA-4060 updated: Hex values in a match regex need escaping to be treated as hex

2023-06-08 Thread nick

nick pushed a commit to branch TIKA-4060
in repository https://gitbox.apache.org/repos/asf/tika.git


The following commit(s) were added to refs/heads/TIKA-4060 by this push:
 new 04021e427 Hex values in a match regex need escaping to be treated as 
hex
04021e427 is described below

commit 04021e4276606bb2ca8837444651da049f21c222
Author: Nick Burch 
AuthorDate: Thu Jun 8 21:55:49 2023 +0100

Hex values in a match regex need escaping to be treated as hex
---
 tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git 
a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml 
b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
index 461ad6128..5c8cbbcb1 100644
--- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
+++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
@@ -5627,12 +5627,12 @@
 
 
   
-  
+  
 
 
   
   
- 
+ 
   
 
   
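The commit subject above says that hex values in a regex match must be escaped to be read as hex. The sketch below shows the same distinction in plain `java.util.regex`, which is only an assumption about where the behaviour comes from; it does not reproduce how Tika itself loads or evaluates `tika-mimetypes.xml`. The class name, method names and byte values are hypothetical, chosen for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Illustration only: shows java.util.regex semantics, not Tika's magic matching code.
final class HexRegexSketch {

    /** Decodes bytes as ISO-8859-1 so every byte maps to exactly one char (0x00-0xFF). */
    static String asLatin1(byte[] data) {
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    /** Returns true if the regex matches at the very start of the data. */
    static boolean startsWith(String regex, byte[] data) {
        return Pattern.compile(regex, Pattern.DOTALL)
                .matcher(asLatin1(data))
                .lookingAt();
    }
}
```

With this helper, the regex `\xFF[\xF0-\xF9]` (written `"\\xFF[\\xF0-\\xF9]"` in Java source) matches the two raw bytes `FF F1`, whereas the unescaped text `xFF` only matches the three literal characters `x`, `F`, `F`.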



[tika] 03/03: AAC detection tests, ID3 one currently failing...

2023-06-07 Thread nick

nick pushed a commit to branch TIKA-4060
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 8f838c512c6880ba21d2d6df36f592614710aba8
Author: Nick Burch 
AuthorDate: Wed Jun 7 23:58:11 2023 +0100

AAC detection tests, ID3 one currently failing...
---
 .../src/test/java/org/apache/tika/mime/TestMimeTypes.java| 9 +
 1 file changed, 9 insertions(+)

diff --git 
a/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/mime/TestMimeTypes.java
 
b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/mime/TestMimeTypes.java
index f534060a1..73945f355 100644
--- 
a/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/mime/TestMimeTypes.java
+++ 
b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/mime/TestMimeTypes.java
@@ -1345,6 +1345,15 @@ public class TestMimeTypes {
 assertTypeByData("application/onix-message+xml", 
"testONIXMessageShort.xml");
 }
 
+@Test
+public void testAACDetection() throws Exception {
+assertType("audio/x-aac", "testAAC.aac");
+assertType("audio/x-aac", "testAACid3.aac");
+assertTypeByData("audio/x-aac", "testAAC.aac");
+assertTypeByData("audio/x-aac", "testAACid3.aac");
+assertTypeByName("audio/x-aac", "x.aac");
+}
+
 private void assertText(byte[] prefix) throws IOException {
 assertMagic("text/plain", prefix);
 }
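The test above checks two files: one bare AAC stream and one that starts with an ID3 tag. As a rough sketch of why the tagged file is harder to detect by data, the code below skips a leading ID3v2 tag and then looks for an ADTS frame header. It is not the magic added to `tika-mimetypes.xml` in this branch, and it assumes the AAC data is ADTS-framed; `AacSniffSketch` and its methods are hypothetical names.

```java
// Illustration only: a hand-written sniffer, not Tika's detection logic.
final class AacSniffSketch {

    /** Returns the length of a leading ID3v2 tag, or 0 if the data does not start with one. */
    static int id3v2Length(byte[] d) {
        if (d.length < 10 || d[0] != 'I' || d[1] != 'D' || d[2] != '3') {
            return 0;
        }
        // The tag size is a 28-bit "syncsafe" integer: four bytes, 7 bits each
        int size = ((d[6] & 0x7F) << 21) | ((d[7] & 0x7F) << 14)
                | ((d[8] & 0x7F) << 7) | (d[9] & 0x7F);
        // Flag bit 0x10 signals a 10-byte footer after the tag body
        int footer = (d[5] & 0x10) != 0 ? 10 : 0;
        return 10 + size + footer;
    }

    /** Returns true if an ADTS sync word follows any ID3v2 tag at the start of the data. */
    static boolean looksLikeAdts(byte[] d) {
        int off = id3v2Length(d);
        return d.length >= off + 2
                && (d[off] & 0xFF) == 0xFF          // first 8 bits of the 12-bit sync word
                && (d[off + 1] & 0xF6) == 0xF0;     // last 4 sync bits set, layer bits zero
    }
}
```

A fixed-offset magic pattern can only see the `ID3` bytes at offset 0 in the tagged file, which is one plausible reason a data-only check needs extra handling for it.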



[tika] branch TIKA-4060 created (now 8f838c512)

2023-06-07 Thread nick

nick pushed a change to branch TIKA-4060
in repository https://gitbox.apache.org/repos/asf/tika.git


  at 8f838c512 AAC detection tests, ID3 one currently failing...

This branch includes the following new commits:

 new 500900d67 TIKA-4060 Test AAC files, based on testWAV.wav, one without 
ID3, one with dummy ID3 values
 new ae85b9e4e AAC magic, based on PRONOM patterns found by Gregory Lepore
 new 8f838c512 AAC detection tests, ID3 one currently failing...

The 3 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.




[tika] 02/03: AAC magic, based on PRONOM patterns found by Gregory Lepore

2023-06-07 Thread nick

nick pushed a commit to branch TIKA-4060
in repository https://gitbox.apache.org/repos/asf/tika.git

commit ae85b9e4e4fb897ec901779fa7301c9316fb9a79
Author: Nick Burch 
AuthorDate: Wed Jun 7 23:57:46 2023 +0100

AAC magic, based on PRONOM patterns found by Gregory Lepore
---
 .../src/main/resources/org/apache/tika/mime/tika-mimetypes.xml | 10 ++
 1 file changed, 10 insertions(+)

diff --git 
a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml 
b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
index 39c1c5891..461ad6128 100644
--- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
+++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
@@ -5625,6 +5625,16 @@
   
   
 
+
+  
+  
+
+
+  
+  
+ 
+  
+
   
 
   



[tika] 01/03: TIKA-4060 Test AAC files, based on testWAV.wav, one without ID3, one with dummy ID3 values

2023-06-07 Thread nick

nick pushed a commit to branch TIKA-4060
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 500900d67ede02e87440caa9f67501d3fe59b770
Author: Nick Burch 
AuthorDate: Wed Jun 7 23:56:55 2023 +0100

TIKA-4060 Test AAC files, based on testWAV.wav, one without ID3, one with 
dummy ID3 values
---
 .../src/test/resources/test-documents/testAAC.aac| Bin 0 -> 779 bytes
 .../src/test/resources/test-documents/testAACid3.aac | Bin 0 -> 2176 bytes
 2 files changed, 0 insertions(+), 0 deletions(-)

diff --git 
a/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testAAC.aac
 
b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testAAC.aac
new file mode 100644
index 000000000..514887020
Binary files /dev/null and 
b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testAAC.aac
 differ
diff --git 
a/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testAACid3.aac
 
b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testAACid3.aac
new file mode 100644
index 000000000..82bad4f2c
Binary files /dev/null and 
b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testAACid3.aac
 differ



[tika] branch main updated (0d7a42f34 -> fc887690a)

2022-07-05 Thread nick

nick pushed a change to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git


from 0d7a42f34 TIKA-3795: update protobuf
 new 9d928bbf9 TIKA-3810 VTT with UTF-8 BOM
 new ec4cb612d WebVTT is text based, so check for both line endings on the 
BOM cases like we do for no-BOM
 new fc887690a Merge branch 'main' of https://github.com/apache/tika into 
main

The 3 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 .../org/apache/tika/mime/tika-mimetypes.xml|  6 
 .../java/org/apache/tika/mime/TestMimeTypes.java   |  4 +++
 .../resources/test-documents/testWebVTT_utf8.vtt   | 42 ++
 3 files changed, 52 insertions(+)
 create mode 100644 
tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testWebVTT_utf8.vtt



[tika] 03/03: Merge branch 'main' of https://github.com/apache/tika into main

2022-07-05 Thread nick

nick pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git

commit fc887690a91a4b689a40a0be11d68dcdeb45a66f
Merge: ec4cb612d 0d7a42f34
Author: Nick Burch 
AuthorDate: Tue Jul 5 11:32:57 2022 +0100

Merge branch 'main' of https://github.com/apache/tika into main

 tika-parent/pom.xml | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)



[tika] 01/03: TIKA-3810 VTT with UTF-8 BOM

2022-07-05 Thread nick

nick pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 9d928bbf9e93131d5021d4e5afddb4ba18df6531
Author: Nick Burch 
AuthorDate: Tue Jul 5 11:21:17 2022 +0100

TIKA-3810 VTT with UTF-8 BOM
---
 .../org/apache/tika/mime/tika-mimetypes.xml|  3 ++
 .../java/org/apache/tika/mime/TestMimeTypes.java   |  4 +++
 .../resources/test-documents/testWebVTT_utf8.vtt   | 42 ++
 3 files changed, 49 insertions(+)

diff --git 
a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml 
b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
index 2329c0a3b..7b4ac0d7d 100644
--- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
+++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
@@ -7008,6 +7008,9 @@
   
  
   
+  
+ 
+  
   
   
   
diff --git 
a/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/mime/TestMimeTypes.java
 
b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/mime/TestMimeTypes.java
index 2a2936bae..ea2ecbeff 100644
--- 
a/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/mime/TestMimeTypes.java
+++ 
b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/mime/TestMimeTypes.java
@@ -1140,6 +1140,10 @@ public class TestMimeTypes {
 // With a custom text header
 assertType("text/vtt", "testWebVTT_header.vtt");
 assertTypeByData("text/vtt", "testWebVTT_header.vtt");
+
+// With a UTF-8 BOM before the header
+assertType("text/vtt", "testWebVTT_utf8.vtt");
+assertTypeByData("text/vtt", "testWebVTT_utf8.vtt");
 }
 
 @Test
diff --git 
a/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testWebVTT_utf8.vtt
 
b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testWebVTT_utf8.vtt
new file mode 100644
index 000000000..722a923fc
--- /dev/null
+++ 
b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testWebVTT_utf8.vtt
@@ -0,0 +1,42 @@
+WEBVTT
+
+1
+00:00:00.350 --> 00:00:02.010
+Well, the feedback indicates
+
+2
+00:00:02.010 --> 00:00:03.880
+that many new hires aren't sure
+
+3
+00:00:03.880 --> 00:00:05.560
+where to find information related
+
+4
+00:00:05.560 --> 00:00:09.390
+to HR, benefits and other onboarding processes
+
+5
+00:00:09.390 --> 00:00:11.050
+or who to ask.
+
+6
+00:00:11.050 --> 00:00:13.850
+Also, they're not always sure where they belong
+
+7
+00:00:13.850 --> 00:00:15.740
+in the structure of the company.
+
+8
+00:00:15.740 --> 00:00:18.470
+Because the company is growing and changing,
+
+9
+00:00:18.470 --> 00:00:20.890
+even tenured employees are getting confused
+
+10
+00:00:20.890 --> 00:00:23.663
+about who does what and who reports to whom.
+
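The new test file starts with a UTF-8 byte order mark before the `WEBVTT` header. As a minimal sketch of that check, assuming the WebVTT file signature rules (optional BOM, the string `WEBVTT`, then end of file, a space, a tab, or a line terminator), the code below tests a byte buffer directly. It is not the magic definition from `tika-mimetypes.xml`; `WebVttSniffSketch` is a hypothetical name.

```java
// Illustration only: a hand-written signature check, not Tika's magic matching.
final class WebVttSniffSketch {

    private static final byte[] MAGIC = {'W', 'E', 'B', 'V', 'T', 'T'};

    /** Returns true if the data starts with an optional UTF-8 BOM and a WEBVTT header. */
    static boolean looksLikeWebVtt(byte[] d) {
        int off = 0;
        // Skip an optional UTF-8 byte order mark (EF BB BF)
        if (d.length >= 3 && (d[0] & 0xFF) == 0xEF
                && (d[1] & 0xFF) == 0xBB && (d[2] & 0xFF) == 0xBF) {
            off = 3;
        }
        if (d.length < off + MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (d[off + i] != MAGIC[i]) {
                return false;
            }
        }
        int next = off + MAGIC.length;
        if (next == d.length) {
            return true; // header only, nothing follows
        }
        // Accept either line ending, or a space/tab that introduces header text
        byte b = d[next];
        return b == '\n' || b == '\r' || b == ' ' || b == '\t';
    }
}
```

Checking for both `\n` and `\r` after the header covers files saved with Unix or Windows line endings, with or without the BOM.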



[tika] 02/03: WebVTT is text based, so check for both line endings on the BOM cases like we do for no-BOM

2022-07-05 Thread nick

nick pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git

commit ec4cb612d1cda09907c88f2c5a06cc3cb7a839ef
Author: Nick Burch 
AuthorDate: Tue Jul 5 11:22:59 2022 +0100

WebVTT is text based, so check for both line endings on the BOM cases like 
we do for no-BOM
---
 tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml | 3 +++
 1 file changed, 3 insertions(+)

diff --git 
a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml 
b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
index 7b4ac0d7d..9b24ae3f4 100644
--- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
+++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
@@ -7004,11 +7004,14 @@
   
   
  
+ 
   
   
+ 
  
   
   
+ 
  
   
   



[tika] 01/02: Crypto test files - Encrypted version of testRSAKEY.pem, and a PKCS12 wrapped version

2022-06-05 Thread nick

nick pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 8ef8d636f87dd571a8dc844d1d7ac503522b13ed
Author: Nick Burch 
AuthorDate: Sun Jun 5 15:34:54 2022 +0100

Crypto test files - Encrypted version of testRSAKEY.pem, and a PKCS12 
wrapped version
---
 .../resources/test-documents/testRSAKEYandCERT.p12| Bin 0 -> 1717 bytes
 .../test/resources/test-documents/testRSAKEYenc.der   | Bin 0 -> 610 bytes
 .../test/resources/test-documents/testRSAKEYenc.pem   |  18 ++
 3 files changed, 18 insertions(+)

diff --git 
a/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testRSAKEYandCERT.p12
 
b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testRSAKEYandCERT.p12
new file mode 100644
index 000000000..1c536e8fb
Binary files /dev/null and 
b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testRSAKEYandCERT.p12
 differ
diff --git 
a/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testRSAKEYenc.der
 
b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testRSAKEYenc.der
new file mode 100644
index 000000000..22c4f86d4
Binary files /dev/null and 
b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testRSAKEYenc.der
 differ
diff --git 
a/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testRSAKEYenc.pem
 
b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testRSAKEYenc.pem
new file mode 100644
index 000000000..5d8f9057e
--- /dev/null
+++ 
b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testRSAKEYenc.pem
@@ -0,0 +1,18 @@
+-----BEGIN RSA PRIVATE KEY-----
+Proc-Type: 4,ENCRYPTED
+DEK-Info: DES-EDE3-CBC,9CF7E869357B3BD6
+
+37j1uZrDJ6VIz2VPWJJObW0TtXvU1zCpNo0w/xO6y6Lanq7OutBtQ5+SBSXFRjxa
+v2YsZ9OAw0PkeMRnw2CUN5bn2L5gvSgtzIV14slNDf20FAjemHpPAxFqNg6yFdZx
+O7xbkKK9KgbFAc4lnBEVPEzfLmUZEI0d/vTnYzciyInUcfq2rkCrAmDq0e642whR
+276/ayCNSEXMDpE7N3d7CT43Df5Fk7YsPYvvvyVInV56MSoESmMA093PeMiHcXwG
+VKQCvJpdzxookQpwIYBqGnjahIOhWCRvm9ji17GN+tjaU3kqUzCoKldxSFS/9mAz
+tiF4dDJk2BVF5yMQ7jplnVOW0dYv7wT5yPlbv6vOWeIL1igrM6YjK8hbC87s6kDD
+DwnpPKYBP3MX8lvXJb/cMdeSDuWgjT9jhlDklmq00FzHJBwI+1neTfzSmsI++qi5
+sah3TCXmv/3uuZrTXwq73pjyi4W0VxdH0FPgyspeayn6dP2j3WrRgWQVdwLSEIqi
+wl2kCIyxsHr5LUqwVJn0zSGXQ+Zs/gkoFrz7sriGe9yecLGHzv+8UVCksVWWadDg
+RuknU0EUwuK2nkGg+mazfcxHf6RzqOMwT3oZDvNysE2vwDyC1ExxpFlZPSRP0Uwv
+rCe2aVMzEj2zPLLkPR2dyUyTHgyfWuI/0MYO1Dg/0LvSg1cwyN5cAyHlN6D9aP4B
+SjU+WgSq5VJGBCZcPnYLyz8n4z9cRPsaO6/11p99HytmNh2EOwROv/VlCh5VKW1q
+SSjRrI754tUvyf+pbTtEOI2yvUbmIcql/wskwE/BbPWtQxEPb/v3QQ==
+-----END RSA PRIVATE KEY-----
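The key file above differs from an unencrypted PEM key by its `Proc-Type` and `DEK-Info` header lines. The sketch below shows one way to spot that from the text alone; it only covers this traditional header style (not the separate `ENCRYPTED PRIVATE KEY` block type), and it is not how Tika assigns the `application/x-x509-key` type. `PemHeaderSketch` is a hypothetical name.

```java
// Illustration only: inspects PEM header lines, does not parse or decrypt the key.
final class PemHeaderSketch {

    /** Returns true if the first PEM block carries a "Proc-Type: 4,ENCRYPTED" style header. */
    static boolean isEncryptedPem(String pem) {
        boolean inBlock = false;
        for (String line : pem.split("\\r?\\n")) {
            if (line.startsWith("-----BEGIN ")) {
                inBlock = true;
                continue;
            }
            if (!inBlock) {
                continue;
            }
            // A blank line ends the headers; an END line ends the block
            if (line.isEmpty() || line.startsWith("-----END ")) {
                return false;
            }
            if (line.startsWith("Proc-Type:") && line.contains("ENCRYPTED")) {
                return true;
            }
        }
        return false;
    }
}
```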



[tika] branch main updated (5e3dab7ae -> 6bf9ee120)

2022-06-05 Thread nick

nick pushed a change to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git


from 5e3dab7ae TIKA-3751: update aws
 new 8ef8d636f Crypto test files - Encrypted version of testRSAKEY.pem, and 
a PKCS12 wrapped version
 new 6bf9ee120 Tests for encrypted RSA keys in PEM and DER, plus a disabled 
PKCS12 test pending TIKA-3784

The 2 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 .../test/java/org/apache/tika/mime/TestMimeTypes.java |   7 +++
 .../resources/test-documents/testRSAKEYandCERT.p12| Bin 0 -> 1717 bytes
 .../test/resources/test-documents/testRSAKEYenc.der}  | Bin
 .../test/resources/test-documents/testRSAKEYenc.pem   |  18 ++
 4 files changed, 25 insertions(+)
 create mode 100644 
tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testRSAKEYandCERT.p12
 copy 
tika-parsers/tika-parsers-standard/{tika-parsers-standard-modules/tika-parser-crypto-module/src/test/resources/test-documents/testRSAKEY.der
 => 
tika-parsers-standard-package/src/test/resources/test-documents/testRSAKEYenc.der}
 (100%)
 create mode 100644 
tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testRSAKEYenc.pem



[tika] 02/02: Tests for encrypted RSA keys in PEM and DER, plus a disabled PKCS12 test pending TIKA-3784

2022-06-05 Thread nick

nick pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 6bf9ee120c2845ccdf61207322dcea2373388e75
Author: Nick Burch 
AuthorDate: Sun Jun 5 15:48:36 2022 +0100

Tests for encrypted RSA keys in PEM and DER, plus a disabled PKCS12 test 
pending TIKA-3784
---
 .../src/test/java/org/apache/tika/mime/TestMimeTypes.java  | 7 +++
 1 file changed, 7 insertions(+)

diff --git 
a/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/mime/TestMimeTypes.java
 
b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/mime/TestMimeTypes.java
index a90d27272..2a2936bae 100644
--- 
a/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/mime/TestMimeTypes.java
+++ 
b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/mime/TestMimeTypes.java
@@ -1163,6 +1163,7 @@ public class TestMimeTypes {
 
 @Test
 public void testCertificatesKeys() throws Exception {
+// Certificates can be identified by name alone, or with data
 assertType("application/x-x509-cert; format=pem", "testCERT.pem");
 assertType("application/x-x509-cert; format=der", "testCERT.der");
 assertTypeByData("application/x-x509-cert; format=pem", 
"testCERT.pem");
@@ -1174,9 +1175,15 @@ public class TestMimeTypes {
 assertTypeByData("application/x-x509-key; format=der", 
"testRSAKEY.der");
 assertTypeByData("application/x-x509-key; format=pem", 
"testDSAKEY.pem");
 assertTypeByData("application/x-x509-key; format=der", 
"testDSAKEY.der");
+assertTypeByData("application/x-x509-key; format=pem", 
"testRSAKEYenc.pem"); // pass=tika
+assertTypeByData("application/x-x509-key; format=der", 
"testRSAKEYenc.der"); // pass=tika
 // Parameters only have PEM form, always need data
 assertTypeByData("application/x-x509-dsa-parameters", 
"testDSAPARAMS.pem");
 assertTypeByData("application/x-x509-ec-parameters", 
"testECPARAMS.pem");
+// PKCS12 wrappers of Certs+Keys cannot currently be identified
+// Once solved, see TIKA-3784, ought to work for name or data
+//assertType("application/x-pkcs12", "testRSAKEYandCERT.p12");
+//assertTypeByData("application/x-pkcs12", "testRSAKEYandCERT.p12"); 
// pass=tika
 }
 
 @Test



[tika] branch main updated: PDP-11 style "Middle Endian" 32 bit read util, as used in the DGN file format

2022-04-28 Thread nick

nick pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git


The following commit(s) were added to refs/heads/main by this push:
 new f33d8930e PDP-11 style "Middle Endian" 32 bit read util, as used in 
the DGN file format
f33d8930e is described below

commit f33d8930e660e61fb04f9232cd7fb6a96cdacdf3
Author: Nick Burch 
AuthorDate: Thu Apr 28 11:27:36 2022 +0100

PDP-11 style "Middle Endian" 32 bit read util, as used in the DGN file 
format
---
 .../src/main/java/org/apache/tika/io/EndianUtils.java | 19 +++
 .../test/java/org/apache/tika/io/EndianUtilsTest.java | 18 ++
 2 files changed, 37 insertions(+)

diff --git a/tika-core/src/main/java/org/apache/tika/io/EndianUtils.java 
b/tika-core/src/main/java/org/apache/tika/io/EndianUtils.java
index c09eadceb..242dd8c74 100644
--- a/tika-core/src/main/java/org/apache/tika/io/EndianUtils.java
+++ b/tika-core/src/main/java/org/apache/tika/io/EndianUtils.java
@@ -152,6 +152,25 @@ public class EndianUtils {
 return (ch1 << 24) + (ch2 << 16) + (ch3 << 8) + (ch4);
 }
 
+/**
+ * Get a PDP-11 style Middle Endian int value from an InputStream
+ *
+ * @param stream the InputStream from which the int is to be read
+ * @return the int (32-bit) value
+ * @throws IOException will be propagated back to the caller
+ * @throws BufferUnderrunException if the stream cannot provide enough 
bytes
+ */
+public static int readIntME(InputStream stream) throws IOException, 
BufferUnderrunException {
+int ch1 = stream.read();
+int ch2 = stream.read();
+int ch3 = stream.read();
+int ch4 = stream.read();
+if ((ch1 | ch2 | ch3 | ch4) < 0) {
+throw new BufferUnderrunException();
+}
+return (ch2 << 24) + (ch1 << 16) + (ch4 << 8) + (ch3);
+}
+
 /**
  * Get a LE long value from an InputStream
  *
diff --git a/tika-core/src/test/java/org/apache/tika/io/EndianUtilsTest.java 
b/tika-core/src/test/java/org/apache/tika/io/EndianUtilsTest.java
index 8ead23218..906870e73 100644
--- a/tika-core/src/test/java/org/apache/tika/io/EndianUtilsTest.java
+++ b/tika-core/src/test/java/org/apache/tika/io/EndianUtilsTest.java
@@ -72,4 +72,22 @@ public class EndianUtilsTest {
 //swallow
 }
 }
+
+@Test
+public void testReadIntME() throws Exception {
+// Example from https://yamm.finance/wiki/Endianness.html#mwAiw 
+byte[] data = new byte[]{(byte) 0x0b, (byte) 0x0a, (byte) 0x0d, (byte) 
0x0c};
+assertEquals(0x0a0b0c0d, EndianUtils.readIntME(new 
ByteArrayInputStream(data)));
+
+data = new byte[]{(byte) 0xFE, (byte) 0xFF, (byte) 0xFC, (byte) 0xFD};
+assertEquals(0xfffefdfc, EndianUtils.readIntME(new 
ByteArrayInputStream(data)));
+
+data = new byte[]{(byte) 0xFF, (byte) 0xFF, (byte) 0xFF};
+try {
+EndianUtils.readIntME(new ByteArrayInputStream(data));
+fail("Should have thrown exception");
+} catch (EndianUtils.BufferUnderrunException e) {
+//swallow
+}
+}
 }
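The byte order used by `readIntME` above can be sketched without a stream: each 16-bit half is stored little endian, with the most significant half first. The version below works on a byte array and is a hypothetical stand-alone illustration, not part of `EndianUtils`.

```java
// Illustration only: array-based variant of the PDP-11 "middle endian" decode.
final class MiddleEndianSketch {

    /** Decodes four bytes at the given offset as a PDP-11 style middle-endian int. */
    static int readIntME(byte[] b, int off) {
        if (off < 0 || b.length - off < 4) {
            throw new IllegalArgumentException("need 4 bytes at offset " + off);
        }
        int b1 = b[off] & 0xFF;
        int b2 = b[off + 1] & 0xFF;
        int b3 = b[off + 2] & 0xFF;
        int b4 = b[off + 3] & 0xFF;
        // High word is (b2, b1), low word is (b4, b3)
        return (b2 << 24) | (b1 << 16) | (b4 << 8) | b3;
    }
}
```

For the byte sequence `0B 0A 0D 0C` this yields `0x0A0B0C0D`, the same value the unit test above expects from the stream-based method.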



[tika] 01/03: TIKA-3694 Additional details in HTML on mime type, and per-type json

2022-03-07 Thread nick

nick pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 7768c87b467bb6cc9d01f6b92c45131af3d44fef
Author: Nick Burch 
AuthorDate: Mon Mar 7 22:49:22 2022 +0000

TIKA-3694 Additional details in HTML on mime type, and per-type json
---
 .../tika/server/core/resource/TikaMimeTypes.java   | 59 --
 1 file changed, 56 insertions(+), 3 deletions(-)

diff --git 
a/tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/resource/TikaMimeTypes.java
 
b/tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/resource/TikaMimeTypes.java
index 7b1887f..784660c 100644
--- 
a/tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/resource/TikaMimeTypes.java
+++ 
b/tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/resource/TikaMimeTypes.java
@@ -24,13 +24,19 @@ import java.util.Map;
 import java.util.SortedMap;
 import java.util.TreeMap;
 import javax.ws.rs.GET;
+import javax.ws.rs.NotFoundException;
 import javax.ws.rs.Path;
+import javax.ws.rs.PathParam;
 import javax.ws.rs.Produces;
 
 import com.fasterxml.jackson.databind.ObjectMapper;
 
+import org.apache.tika.config.TikaConfig;
 import org.apache.tika.mime.MediaType;
 import org.apache.tika.mime.MediaTypeRegistry;
+import org.apache.tika.mime.MimeType;
+import org.apache.tika.mime.MimeTypes;
+import org.apache.tika.mime.MimeTypeException;
 import org.apache.tika.parser.CompositeParser;
 import org.apache.tika.parser.Parser;
 import org.apache.tika.server.core.HTMLHelper;
@@ -38,6 +44,7 @@ import org.apache.tika.server.core.HTMLHelper;
 /**
  * Provides details of all the mimetypes known to Apache Tika,
  * similar to --list-supported-types with the Tika CLI.
+ * Can also provide full details on a single known type.
  */
 @Path("/mime-types")
 public class TikaMimeTypes {
@@ -91,6 +98,17 @@ public class TikaMimeTypes {
 h.append("Super Type: ")
 .append(type.supertype).append("\n");
 }
+if (type.mime != null) {
+   if (!type.mime.getDescription().isEmpty()) {
+  h.append("Description: 
").append(type.mime.getDescription()).append("\n");
+   }
+   if (!type.mime.getAcronym().isEmpty()) {
+  h.append("Acronym: 
").append(type.mime.getAcronym()).append("\n");
+   }
+   if (!type.mime.getExtension().isEmpty()) {
+  h.append("Default Extension: 
").append(type.mime.getExtension()).append("\n");
+   }
+}
 
 if (type.parser != null) {
 h.append("Parser: 
").append(type.parser).append("\n");
@@ -124,6 +142,27 @@ public class TikaMimeTypes {
 }
 
 @GET
+@Path("/{type}/{subtype}")
+@Produces(javax.ws.rs.core.MediaType.APPLICATION_JSON)
+public String getMimeTypeDetailsJSON(@PathParam("type") String typePart,
+ @PathParam("subtype") String subtype) 
throws IOException {
+MediaTypeDetails type = getMediaType(typePart, subtype);
+Map details = new HashMap<>();
+
+details.put("type", type.type.toString());
+details.put("alias", copyToStringArray(type.aliases));
+if (type.supertype != null) {
+   details.put("supertype", type.supertype.toString());
+}
+if (type.parser != null) {
+   details.put("parser", type.parser);
+}
+// TODO Additional details from Mime
+
+return new 
ObjectMapper().writerWithDefaultPrettyPrinter().writeValueAsString(details);
+}
+
+@GET
 @Produces("text/plain")
 public String getMimeTypesPlain() {
 StringBuffer text = new StringBuffer();
@@ -147,10 +186,19 @@ public class TikaMimeTypes {
 return text.toString();
 }
 
+protected MediaTypeDetails getMediaType(String type, String subtype) 
throws NotFoundException {
+   MediaType mt = MediaType.parse(type+"/"+subtype);
+   for (MediaTypeDetails mtd : getMediaTypes()) {
+  if (mtd.type.equals(mt)) return mtd;
+   }
+   throw new NotFoundException("No Media Type registered in Tika for " + 
mt);
+}
 protected List getMediaTypes() {
-MediaTypeRegistry registry = 
TikaResource.getConfig().getMediaTypeRegistry();
-Map parsers =
-((CompositeParser) 
TikaResource.getConfig().getParser()).getParsers();
+TikaConfig config = TikaResource.getConfig();
+MimeTypes mimeTypes = config.getMimeRepository();
+MediaTypeRegistry registry = config.getMediaTypeRegistry();
+Map parsers = 
((Comp
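The diff above adds a JSON resource at `/mime-types/{type}/{subtype}`. A client-side sketch of calling it follows, using the JDK's `java.net.http` client. The base URL (for example a local server on port 9998) is an assumption, as are the class and method names; only the path layout and the JSON content type come from the diff.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Illustration only: a hypothetical client for the per-type details resource.
final class MimeTypeDetailsClientSketch {

    /** Builds the details URI for a media type such as "text/vtt". */
    static URI detailsUri(String baseUrl, String mediaType) {
        int slash = mediaType.indexOf('/');
        if (slash <= 0 || slash != mediaType.lastIndexOf('/')
                || slash == mediaType.length() - 1) {
            throw new IllegalArgumentException("expected type/subtype, got " + mediaType);
        }
        String base = baseUrl.endsWith("/")
                ? baseUrl.substring(0, baseUrl.length() - 1) : baseUrl;
        return URI.create(base + "/mime-types/" + mediaType);
    }

    /** Fetches the JSON details; asks for JSON explicitly via the Accept header. */
    static String fetchDetails(String baseUrl, String mediaType)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(detailsUri(baseUrl, mediaType))
                .header("Accept", "application/json")
                .GET()
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected status " + response.statusCode());
        }
        return response.body();
    }
}
```

A media type that is not registered is expected to produce a non-200 status, since the server-side lookup throws `NotFoundException`; the sketch surfaces that as an `IOException`.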

[tika] branch main updated (eda4427 -> d583973)

2022-03-07 Thread nick

nick pushed a change to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git.


from eda4427  Merge branch 'TIKA-3689' into main
 new 7768c87  TIKA-3694 Additional details in HTML on mime type, and 
per-type json
 new c54dd20  TIKA-3694 Per-Type HTML page, and more info in the JSON
 new d583973  TIKA-3694 Unit test for type-specific page

The 3 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 .../tika/server/core/resource/TikaMimeTypes.java   | 135 -
 .../apache/tika/server/core/TikaMimeTypesTest.java |  20 ++-
 2 files changed, 147 insertions(+), 8 deletions(-)


[tika] 03/03: TIKA-3694 Unit test for type-specific page

2022-03-07 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git

commit d583973f829aa2b48b8a08cb8c46927a3446ca7a
Author: Nick Burch 
AuthorDate: Mon Mar 7 23:30:15 2022 +0000

TIKA-3694 Unit test for type-specific page
---
 .../tika/server/core/resource/TikaMimeTypes.java   |  4 +++-
 .../org/apache/tika/server/core/TikaMimeTypesTest.java | 18 --
 2 files changed, 19 insertions(+), 3 deletions(-)

diff --git 
a/tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/resource/TikaMimeTypes.java
 
b/tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/resource/TikaMimeTypes.java
index cfb42b4..1dc0462 100644
--- 
a/tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/resource/TikaMimeTypes.java
+++ 
b/tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/resource/TikaMimeTypes.java
@@ -277,7 +277,9 @@ public class TikaMimeTypes {
 
 try {
details.mime = mimeTypes.getRegisteredMimeType(type.toString());
-} catch (MimeTypeException e) {}
+} catch (MimeTypeException e) {
+   // Ignore if invalid
+}
 
 MediaType supertype = registry.getSupertype(type);
 if (supertype != null && 
!MediaType.OCTET_STREAM.equals(supertype)) {
diff --git 
a/tika-server/tika-server-core/src/test/java/org/apache/tika/server/core/TikaMimeTypesTest.java
 
b/tika-server/tika-server-core/src/test/java/org/apache/tika/server/core/TikaMimeTypesTest.java
index f373515..dc8a0c1 100644
--- 
a/tika-server/tika-server-core/src/test/java/org/apache/tika/server/core/TikaMimeTypesTest.java
+++ 
b/tika-server/tika-server-core/src/test/java/org/apache/tika/server/core/TikaMimeTypesTest.java
@@ -70,7 +70,8 @@ public class TikaMimeTypesTest extends CXFTestBase {
 assertContains("application/xml", text);
 assertContains("video/x-ogm", text);
 
-assertContains("text/plain", text);
+assertContains("", text);
+assertContains("/text/plain\">", text);
 assertContains("name=\"text/plain", text);
 
 assertContains("Super Type: video/ogg", text);
@@ -80,5 +81,18 @@ public class TikaMimeTypesTest extends CXFTestBase {
 assertContains("Extension: .ogg", text);
 }
 
-// TODO Type Specific
+@Test
+public void testGetHTMLDetails() throws Exception {
+   Response response =
+ WebClient.create(endPoint + MIMETYPES_PATH + "/application/cbor")
+  .type("text/html").accept("text/html").get();
+
+   String text = getStringFromInputStream((InputStream) 
response.getEntity());
+   assertNotFound("text/plain", text);
+   assertContains("application/cbor", text);
+
+   assertContains("Acronym: CBOR", text);
+   assertContains("Link: http://tools.ietf.org/html/rfc7049;, text);
+   assertContains("Extension: .cbor", text);
+}
 }


[tika] branch branch_1x updated: TIKA-3373 Add the *.yml extension for YAML, which is commonly used, along with aliases for popular alternate mimetypes for it

2021-04-27 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch branch_1x
in repository https://gitbox.apache.org/repos/asf/tika.git


The following commit(s) were added to refs/heads/branch_1x by this push:
 new f7d5119  TIKA-3373 Add the *.yml extension for YAML, which is commonly 
used, along with aliases for popular alternate mimetypes for it
f7d5119 is described below

commit f7d5119f496578bfff8bebc470d9fe8f9fdc3860
Author: Nick Burch 
AuthorDate: Tue Apr 27 13:05:34 2021 +0100

TIKA-3373 Add the *.yml extension for YAML, which is commonly used, along 
with aliases for popular alternate mimetypes for it
---
 .../src/main/resources/org/apache/tika/mime/tika-mimetypes.xml | 7 +++
 1 file changed, 7 insertions(+)

diff --git 
a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml 
b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
index 0ea388b..87e50b7 100644
--- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
+++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
@@ -7166,7 +7166,14 @@
 
   
 <_comment>YAML source code
+
+
+
+
+
+
 
+
 
   
 


[tika] branch main updated: TIKA-3373 Add the *.yml extension for YAML, which is commonly used, along with aliases for popular alternate mimetypes for it

2021-04-27 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git


The following commit(s) were added to refs/heads/main by this push:
 new 60c0aae  TIKA-3373 Add the *.yml extension for YAML, which is commonly 
used, along with aliases for popular alternate mimetypes for it
60c0aae is described below

commit 60c0aaebf724f078811937c45bdca83a797901d8
Author: Nick Burch 
AuthorDate: Tue Apr 27 13:05:34 2021 +0100

TIKA-3373 Add the *.yml extension for YAML, which is commonly used, along 
with aliases for popular alternate mimetypes for it
---
 .../src/main/resources/org/apache/tika/mime/tika-mimetypes.xml | 7 +++
 1 file changed, 7 insertions(+)

diff --git 
a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml 
b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
index ef0ea14..6c2ea14 100644
--- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
+++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
@@ -7404,7 +7404,14 @@
 
   
 <_comment>YAML source code
+
+
+
+
+
+
 
+
 
   
 


[tika] branch branch_1x updated: Changelog update

2021-03-14 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch branch_1x
in repository https://gitbox.apache.org/repos/asf/tika.git


The following commit(s) were added to refs/heads/branch_1x by this push:
 new a1ec3fd  Changelog update
a1ec3fd is described below

commit a1ec3fd2c00605864b5c4543d4943abb151c7ef0
Author: Nick Burch 
AuthorDate: Sun Mar 14 20:55:28 2021 +0000

Changelog update
---
 CHANGES.txt | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/CHANGES.txt b/CHANGES.txt
index 1b2bd7f..6d56a7e 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -9,7 +9,10 @@ Release 1.26 - 03/09/2021
* Fix parsing of emails attached to other emails in PST files (TIKA-3004).
 
* MP3 parser should output the xmpDM:duration metadata as seconds not 
- milliseconds, consistent with the other Audio and Video parsers 
(TIKA-3318)
+ milliseconds, consistent with the other Audio and Video parsers 
(TIKA-3318).
+
+   * MP4 parser check if any of the Compatible Brands match when identifying 
+ the subtype (TIKA-3310).
 
 Release 1.25 - 11/25/2020
 



[tika] branch branch_1x updated: Backport to 1.x - TIKA-3310 Check if MP4 file's compatible brands match any of the expected values, from Peter Kronenberg

2021-03-14 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch branch_1x
in repository https://gitbox.apache.org/repos/asf/tika.git


The following commit(s) were added to refs/heads/branch_1x by this push:
 new b0242ee  Backport to 1.x - TIKA-3310 Check if MP4 file's compatible 
brands match any of the expected values, from Peter Kronenberg
b0242ee is described below

commit b0242ee617857fe85db2ba5ce186f6c9965b67bd
Author: Nick Burch 
AuthorDate: Sun Mar 14 20:53:38 2021 +0000

Backport to 1.x - TIKA-3310 Check if MP4 file's compatible brands match any 
of the expected values, from Peter Kronenberg
---
 .../java/org/apache/tika/parser/mp4/MP4Parser.java | 26 --
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git 
a/tika-parsers/src/main/java/org/apache/tika/parser/mp4/MP4Parser.java 
b/tika-parsers/src/main/java/org/apache/tika/parser/mp4/MP4Parser.java
index 933c53c..f06e556 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/mp4/MP4Parser.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/mp4/MP4Parser.java
@@ -70,6 +70,7 @@ import java.util.HashMap;
 import java.util.List;
 import java.util.Locale;
 import java.util.Map;
+import java.util.Optional;
 import java.util.Set;
 
 /**
@@ -132,14 +133,25 @@ public class MP4Parser extends AbstractParser {
 // Grab the file type box
 FileTypeBox fileType = getOrNull(isoFile, FileTypeBox.class);
 if (fileType != null) {
-// Identify the type
-MediaType type = MediaType.application("mp4");
-for (Map.Entry> e : 
typesMap.entrySet()) {
-if (e.getValue().contains(fileType.getMajorBrand())) {
-type = e.getKey();
-break;
-}
+// Identify the type based on the major brand
+Optional typeHolder = typesMap.entrySet()
+.stream()
+.filter(e -> 
e.getValue().contains(fileType.getMajorBrand()))
+.findFirst()
+.map(Map.Entry::getKey);
+
+if (!typeHolder.isPresent()) {
+// If no match for major brand, see if any of the 
compatible brands match
+typeHolder = typesMap.entrySet()
+.stream()
+.filter(e -> e.getValue()
+.stream()
+
.anyMatch(fileType.getCompatibleBrands()::contains))
+.findFirst()
+.map(Map.Entry::getKey);
 }
+
+MediaType type = 
typeHolder.orElse(MediaType.application("mp4"));
 metadata.set(Metadata.CONTENT_TYPE, type.toString());
 
 if (type.getType().equals("audio")) {
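
The two-stage lookup in the diff above (major brand first, compatible brands only as a fallback, then a default of application/mp4) can be sketched on its own. This is a minimal illustration, not the parser's code: the class name BrandLookup, the method resolve and the contents of the brand table are assumptions made for the example, and plain strings stand in for Tika's MediaType.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class BrandLookup {
    // Hypothetical brand table for illustration only; the real typesMap
    // lives in MP4Parser and is not shown in this email.
    static final Map<String, List<String>> TYPES = new LinkedHashMap<>();
    static {
        TYPES.put("audio/mp4", List.of("M4A ", "M4B "));
        TYPES.put("video/quicktime", List.of("qt  "));
        TYPES.put("video/mp4", List.of("mp41", "mp42"));
    }

    static String resolve(String majorBrand, List<String> compatibleBrands) {
        // Stage 1: first entry whose brand list contains the major brand
        Optional<String> type = TYPES.entrySet().stream()
                .filter(e -> e.getValue().contains(majorBrand))
                .findFirst()
                .map(Map.Entry::getKey);

        if (!type.isPresent()) {
            // Stage 2: first entry sharing any brand with the compatible brands
            type = TYPES.entrySet().stream()
                    .filter(e -> e.getValue().stream()
                            .anyMatch(compatibleBrands::contains))
                    .findFirst()
                    .map(Map.Entry::getKey);
        }

        // Fall back to the generic container type when nothing matches
        return type.orElse("application/mp4");
    }

    public static void main(String[] args) {
        // Major brand "isom" is not in the table, so the compatible brands decide
        System.out.println(resolve("isom", List.of("isom", "mp42")));
    }
}
```

Because the major brand is checked in its own pass, a file whose major brand is known is never reclassified by its compatible brands; the fallback only applies when the first pass finds nothing.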



[tika] branch main updated (356cf44 -> 4bd931d)

2021-03-14 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a change to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git.


from 356cf44  TIKA-3318 Document the units of xmpDM:duration as seconds by 
default
 new d80dc36  TIKA-3310 Check if MP4 file's compatible brands match any of 
the expected values
 new 187fd47  TIKA-3310 Check major brand before checking compatible brands
 new 4551f7d  Separate search for major brand and compatible brands
 new 4bd931d  Merge pull request #410 from peterkronenberg/main

The 4 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 .../java/org/apache/tika/parser/mp4/MP4Parser.java | 35 +++---
 1 file changed, 24 insertions(+), 11 deletions(-)



[tika] branch branch_1x updated: Changelog update

2021-03-14 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch branch_1x
in repository https://gitbox.apache.org/repos/asf/tika.git


The following commit(s) were added to refs/heads/branch_1x by this push:
 new a4c9257  Changelog update
a4c9257 is described below

commit a4c92579d2a012e0296f057b70dd9fb2d0842445
Author: Nick Burch 
AuthorDate: Sun Mar 14 20:22:59 2021 +0000

Changelog update
---
 CHANGES.txt | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/CHANGES.txt b/CHANGES.txt
index 57ca53c..1b2bd7f 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -8,6 +8,9 @@ Release 1.26 - 03/09/2021

* Fix parsing of emails attached to other emails in PST files (TIKA-3004).
 
+   * MP3 parser should output the xmpDM:duration metadata as seconds not 
+ milliseconds, consistent with the other Audio and Video parsers 
(TIKA-3318)
+
 Release 1.25 - 11/25/2020
 
* Fix inconsistent license in xmpcore (TIKA-3204).



[tika] 02/02: TIKA-3318 MP3 parser should output the xmpDM:duration metadata as seconds not milliseconds

2021-03-14 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch branch_1x
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 21b3cf8b5a209ab6cf0176d8bc55e640fdc8c351
Author: Nick Burch 
AuthorDate: Sun Mar 14 20:20:14 2021 +0000

TIKA-3318 MP3 parser should output the xmpDM:duration metadata as seconds 
not milliseconds
---
 .../src/main/java/org/apache/tika/parser/mp3/Mp3Parser.java | 13 -
 .../test/java/org/apache/tika/parser/mp3/Mp3ParserTest.java |  6 +++---
 2 files changed, 11 insertions(+), 8 deletions(-)

diff --git 
a/tika-parsers/src/main/java/org/apache/tika/parser/mp3/Mp3Parser.java 
b/tika-parsers/src/main/java/org/apache/tika/parser/mp3/Mp3Parser.java
index 52dad7c..c14b300 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/mp3/Mp3Parser.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/mp3/Mp3Parser.java
@@ -69,10 +69,10 @@ public class Mp3Parser extends AbstractParser {
 // Create handlers for the various kinds of ID3 tags
 ID3TagsAndAudio audioAndTags = getAllTagHandlers(stream, handler);
 
-//process as much metadata as possible before
-//writing to xhtml
+// Before we start on the XHTML output, process and store
+//  as much metadata as possible
 if (audioAndTags.duration > 0) {
-metadata.set(XMPDM.DURATION, audioAndTags.duration);
+   metadata.set(XMPDM.DURATION, audioAndTags.durationSeconds());
 }
 
 if (audioAndTags.audio != null) {
@@ -151,7 +151,7 @@ public class Mp3Parser extends AbstractParser {
 xhtml.element("p", tag.getYear());
 xhtml.element("p", tag.getGenre());
 }
-xhtml.element("p", String.valueOf(audioAndTags.duration));
+xhtml.element("p", String.valueOf(audioAndTags.durationSeconds()));
 for (String comment : comments) {
 xhtml.element("p", comment);
 }
@@ -250,7 +250,10 @@ public class Mp3Parser extends AbstractParser {
 private ID3Tags[] tags;
 private AudioFrame audio;
 private LyricsHandler lyrics;
-private float duration;
+private float duration; // Milliseconds
+private float durationSeconds() {
+   return duration / 1000;
+}
 }
 
 }
diff --git 
a/tika-parsers/src/test/java/org/apache/tika/parser/mp3/Mp3ParserTest.java 
b/tika-parsers/src/test/java/org/apache/tika/parser/mp3/Mp3ParserTest.java
index e670809..01fa4f7 100644
--- a/tika-parsers/src/test/java/org/apache/tika/parser/mp3/Mp3ParserTest.java
+++ b/tika-parsers/src/test/java/org/apache/tika/parser/mp3/Mp3ParserTest.java
@@ -39,7 +39,7 @@ public class Mp3ParserTest extends TikaTest {
  */
 private static void checkDuration(Metadata metadata, int expected) {
 assertEquals("Wrong duration", expected,
-Math.round(Float.valueOf(metadata.get(XMPDM.DURATION)) / 
1000));
+Math.round(Float.valueOf(metadata.get(XMPDM.DURATION))));
 }
 
 /**
@@ -126,7 +126,7 @@ public class Mp3ParserTest extends TikaTest {
 String content = getXML("testMP3id3v1.mp3").xml;
 assertContains("

[tika] branch branch_1x updated (02ed830 -> 21b3cf8)

2021-03-14 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a change to branch branch_1x
in repository https://gitbox.apache.org/repos/asf/tika.git.


from 02ed830  TIKA-3244: update pax-url-aether
 new 8081e6d  TIKA-3318 Document the units of xmpDM:duration as seconds by 
default
 new 21b3cf8  TIKA-3318 MP3 parser should output the xmpDM:duration 
metadata as seconds not milliseconds

The 2 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 tika-core/src/main/java/org/apache/tika/metadata/XMPDM.java |  1 +
 .../src/main/java/org/apache/tika/parser/mp3/Mp3Parser.java | 13 -
 .../test/java/org/apache/tika/parser/mp3/Mp3ParserTest.java |  6 +++---
 3 files changed, 12 insertions(+), 8 deletions(-)



[tika] 01/02: TIKA-3318 Document the units of xmpDM:duration as seconds by default

2021-03-14 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch branch_1x
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 8081e6da8d34ef9675638699eb2ec6d6145c89d4
Author: Nick Burch 
AuthorDate: Sun Mar 14 19:24:43 2021 +0000

TIKA-3318 Document the units of xmpDM:duration as seconds by default
---
 tika-core/src/main/java/org/apache/tika/metadata/XMPDM.java | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tika-core/src/main/java/org/apache/tika/metadata/XMPDM.java 
b/tika-core/src/main/java/org/apache/tika/metadata/XMPDM.java
index ce78145..60a3d1e 100644
--- a/tika-core/src/main/java/org/apache/tika/metadata/XMPDM.java
+++ b/tika-core/src/main/java/org/apache/tika/metadata/XMPDM.java
@@ -173,6 +173,7 @@ public interface XMPDM {
 
 /**
  * "The duration of the media file."
+ * Value is in Seconds, unless xmpDM:scale is also set.
  */
 Property DURATION = Property.externalReal("xmpDM:duration");
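
The unit convention documented above (xmpDM:duration in seconds, while the MP3 parser tracks milliseconds internally) comes down to one division before the value is stored. The following is a small sketch under assumed names: DurationUnits, toSeconds and roundedSeconds are hypothetical helpers for this example, not Tika API. The rounding step mirrors what the updated checkDuration test does with the stored value.

```java
public class DurationUnits {
    // Convert an internally tracked millisecond duration to seconds,
    // the unit expected for xmpDM:duration when no scale is set.
    static float toSeconds(float durationMillis) {
        return durationMillis / 1000;
    }

    // Parse a stored metadata value (already in seconds) and round it,
    // in the same way the test compares against a whole number of seconds.
    static int roundedSeconds(String metadataValue) {
        return Math.round(Float.parseFloat(metadataValue));
    }

    public static void main(String[] args) {
        float millis = 2455.51f;                         // illustrative value
        String stored = String.valueOf(toSeconds(millis));
        System.out.println(stored + " rounds to " + roundedSeconds(stored));
    }
}
```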
 



[tika] branch main updated: TIKA-3318 Document the units of xmpDM:duration as seconds by default

2021-03-14 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git


The following commit(s) were added to refs/heads/main by this push:
 new 356cf44  TIKA-3318 Document the units of xmpDM:duration as seconds by 
default
356cf44 is described below

commit 356cf44e6c426ad4411bb2c8a945597dbac4543c
Author: Nick Burch 
AuthorDate: Sun Mar 14 19:24:43 2021 +0000

TIKA-3318 Document the units of xmpDM:duration as seconds by default
---
 tika-core/src/main/java/org/apache/tika/metadata/XMPDM.java | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tika-core/src/main/java/org/apache/tika/metadata/XMPDM.java 
b/tika-core/src/main/java/org/apache/tika/metadata/XMPDM.java
index ce78145..60a3d1e 100644
--- a/tika-core/src/main/java/org/apache/tika/metadata/XMPDM.java
+++ b/tika-core/src/main/java/org/apache/tika/metadata/XMPDM.java
@@ -173,6 +173,7 @@ public interface XMPDM {
 
 /**
  * "The duration of the media file."
+ * Value is in Seconds, unless xmpDM:scale is also set.
  */
 Property DURATION = Property.externalReal("xmpDM:duration");
 



[tika] branch main updated: TIKA-3318 MP3 parser should output the xmpDM:duration metadata as seconds not milliseconds

2021-03-14 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git


The following commit(s) were added to refs/heads/main by this push:
 new 31da853  TIKA-3318 MP3 parser should output the xmpDM:duration 
metadata as seconds not milliseconds
31da853 is described below

commit 31da853a5779806b1b83f4709e90ac2e3ac2688e
Author: Nick Burch 
AuthorDate: Sun Mar 14 19:07:02 2021 +0000

TIKA-3318 MP3 parser should output the xmpDM:duration metadata as seconds 
not milliseconds
---
 .../main/java/org/apache/tika/parser/mp3/Mp3Parser.java| 14 --
 .../java/org/apache/tika/parser/mp3/Mp3ParserTest.java |  8 
 2 files changed, 12 insertions(+), 10 deletions(-)

diff --git 
a/tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-audiovideo-module/src/main/java/org/apache/tika/parser/mp3/Mp3Parser.java
 
b/tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-audiovideo-module/src/main/java/org/apache/tika/parser/mp3/Mp3Parser.java
index 7a02473..11a7d4b 100644
--- 
a/tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-audiovideo-module/src/main/java/org/apache/tika/parser/mp3/Mp3Parser.java
+++ 
b/tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-audiovideo-module/src/main/java/org/apache/tika/parser/mp3/Mp3Parser.java
@@ -70,10 +70,10 @@ public class Mp3Parser extends AbstractParser {
 // Create handlers for the various kinds of ID3 tags
 ID3TagsAndAudio audioAndTags = getAllTagHandlers(stream, handler);
 
-//process as much metadata as possible before
-//writing to xhtml
+// Before we start on the XHTML output, process and store
+//  as much metadata as possible
 if (audioAndTags.duration > 0) {
-metadata.set(XMPDM.DURATION, audioAndTags.duration);
+   metadata.set(XMPDM.DURATION, audioAndTags.durationSeconds());
 }
 
 if (audioAndTags.audio != null) {
@@ -152,7 +152,7 @@ public class Mp3Parser extends AbstractParser {
 xhtml.element("p", tag.getYear());
 xhtml.element("p", tag.getGenre());
 }
-xhtml.element("p", String.valueOf(audioAndTags.duration));
+xhtml.element("p", String.valueOf(audioAndTags.durationSeconds()));
 for (String comment : comments) {
 xhtml.element("p", comment);
 }
@@ -261,7 +261,9 @@ public class Mp3Parser extends AbstractParser {
 private ID3Tags[] tags;
 private AudioFrame audio;
 private LyricsHandler lyrics;
-private float duration;
+private float duration; // Milliseconds
+private float durationSeconds() {
+   return duration / 1000;
+}
 }
-
 }
diff --git 
a/tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-audiovideo-module/src/test/java/org/apache/tika/parser/mp3/Mp3ParserTest.java
 
b/tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-audiovideo-module/src/test/java/org/apache/tika/parser/mp3/Mp3ParserTest.java
index f952c84..ed0b16c 100644
--- 
a/tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-audiovideo-module/src/test/java/org/apache/tika/parser/mp3/Mp3ParserTest.java
+++ 
b/tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-audiovideo-module/src/test/java/org/apache/tika/parser/mp3/Mp3ParserTest.java
@@ -38,8 +38,8 @@ public class Mp3ParserTest extends TikaTest {
  * @param expected the expected duration, rounded as seconds
  */
 private static void checkDuration(Metadata metadata, int expected) {
-assertEquals("Wrong duration", expected,
-Math.round(Float.valueOf(metadata.get(XMPDM.DURATION)) / 
1000));
+assertEquals("Wrong duration", expected, 
+Math.round(Float.valueOf(metadata.get(XMPDM.DURATION))));
 }
 
 /**
@@ -124,7 +124,7 @@ public class Mp3ParserTest extends TikaTest {
 String content = getXML("testMP3id3v1.mp3").xml;
 assertContains("

[tika] 04/05: Split the Certificate and Key mimetypes into DER and PEM subtypes, add test EC files. TIKA-3205

2020-09-30 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch branch_1x
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 51829d630360060d2fff84e8dc2b1346834ecfda
Author: Nick Burch 
AuthorDate: Tue Sep 29 16:48:40 2020 +0100

Split the Certificate and Key mimetypes into DER and PEM subtypes, add test 
EC files. TIKA-3205
---
 .../org/apache/tika/mime/tika-mimetypes.xml|  41 +
 .../java/org/apache/tika/mime/TestMimeTypes.java   |  19 +++---
 .../test/resources/test-documents/testDSAKEY.der   | Bin 0 -> 834 bytes
 .../test/resources/test-documents/testDSAKEY.pem   |  15 
 .../resources/test-documents/testDSAPARAMS.pem |  14 +++
 .../test/resources/test-documents/testECKEY.der| Bin 0 -> 167 bytes
 .../test/resources/test-documents/testECKEY.pem|   6 +++
 .../test/resources/test-documents/testECPARAMS.pem |   3 ++
 8 files changed, 84 insertions(+), 14 deletions(-)

diff --git 
a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml 
b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
index 96301aa..792448b 100644
--- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
+++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
@@ -4538,13 +4538,19 @@
   
 
 
-
-
 
+  
+  
+
+
 
-  
   
-  
+
+  
+  
+
+
+
   
  
@@ -4559,9 +4565,12 @@
   
 
   
+
   
+  
+  
+
 
-  
   
   
   
@@ -4569,16 +4578,32 @@
   
   
   
-  
-  
 
   
+  
+
+
+
+
+  
+  
+
+  
+
   
 
-  
+  
   
 
   
+  
+
+  
+  
+
+  
 
   
 
diff --git a/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java 
b/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java
index 80c60a2..c765dae 100644
--- a/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java
+++ b/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java
@@ -1144,13 +1144,20 @@ public class TestMimeTypes {
 
 @Test
 public void testCertificatesKeys() throws Exception {
-assertType("application/x-x509-cert", "testCERT.pem");
-assertType("application/x-x509-cert", "testCERT.der");
-assertTypeByData("application/x-x509-cert", "testCERT.pem");
-assertTypeByData("application/x-x509-cert", "testCERT.der");
+assertType("application/x-x509-cert; format=pem", "testCERT.pem");
+assertType("application/x-x509-cert; format=der", "testCERT.der");
+assertTypeByData("application/x-x509-cert; format=pem", 
"testCERT.pem");
+assertTypeByData("application/x-x509-cert; format=der", 
"testCERT.der");
 // Keys need the data to identify, name isn't enough
-assertTypeByData("application/x-x509-key", "testRSAKEY.pem");
-assertTypeByData("application/x-x509-key", "testRSAKEY.der");
+assertTypeByData("application/x-x509-key; format=pem", 
"testECKEY.pem");
+assertTypeByData("application/x-x509-key; format=der", 
"testECKEY.der");
+assertTypeByData("application/x-x509-key; format=pem", 
"testRSAKEY.pem");
+assertTypeByData("application/x-x509-key; format=der", 
"testRSAKEY.der");
+assertTypeByData("application/x-x509-key; format=pem", 
"testDSAKEY.pem");
+assertTypeByData("application/x-x509-key; format=der", 
"testDSAKEY.der");
+// Parameters only have PEM form, always need data
+assertTypeByData("application/x-x509-dsa-parameters", 
"testDSAPARAMS.pem");
+assertTypeByData("application/x-x509-ec-parameters", 
"testECPARAMS.pem");
 }
 
 @Test
diff --git a/tika-parsers/src/test/resources/test-documents/testDSAKEY.der 
b/tika-parsers/src/test/resources/test-documents/testDSAKEY.der
new file mode 100644
index 0000000..9ed2eb9
Binary files /dev/null and 
b/tika-parsers/src/test/resources/test-documents/testDSAKEY.der differ
diff --git a/tika-parsers/src/test/resources/test-documents/testDSAKEY.pem 
b/tika-parsers/src/test/resources/test-documents/testDSAKEY.pem
new file mode 100644
index 0000000..2b8781a
--- /dev/null
+++ b/tika-parsers/src/test/resources/test-documents/testDSAKEY.pem
@@ -0,0 +1,15 @@
+-----BEGIN PRIVATE KEY-----
+MIICTQIBADCCAi0GByqGSM44BAEwggIgAoIBAQDRXU0Be5k0MI3skB6K0PhyptBh
+WSJ1l+NVSOX7wpXC37upcH7a0ZCfU9RyWqcX9dQFw+TWjlH2ANll/FO4osXkkJVY
+oylJ+p0599v6WRPBS/yQpKuvfqEm5HA78J8ILhnyCCw8hqdlrADBOMGf7tGF5Agw
+hEZJdtHjYRzPWzY0eogptg3wQPd/

[tika] branch branch_1x updated (9736af8 -> 1fce089)

2020-09-30 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a change to branch branch_1x
in repository https://gitbox.apache.org/repos/asf/tika.git.


from 9736af8  Fix TIKA-3196 (#364)
 new 5c2c4a2  Add test certificate and key for TIKA-3205
 new 28ec71d  TIKA-3205 Add magic for X509 PEM certificate, and tweak 
default type
 new e952877  Add some more DER magic for certificates, and add tests 
TIKA-3205
 new 51829d6  Split the Certificate and Key mimetypes into DER and PEM 
subtypes, add test EC files. TIKA-3205
 new 1fce089  Make the DER private key mostly-match a bit more specific

The 5 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 .../org/apache/tika/mime/tika-mimetypes.xml|  72 -
 .../java/org/apache/tika/TikaDetectionTest.java|   5 +-
 .../java/org/apache/tika/mime/TestMimeTypes.java   |  18 ++
 .../src/test/resources/test-documents/testCERT.der | Bin 0 -> 702 bytes
 .../src/test/resources/test-documents/testCERT.pem |  17 +
 .../test/resources/test-documents/testDSAKEY.der   | Bin 0 -> 834 bytes
 .../test/resources/test-documents/testDSAKEY.pem   |  15 +
 .../resources/test-documents/testDSAPARAMS.pem |  14 
 .../test/resources/test-documents/testECKEY.der| Bin 0 -> 167 bytes
 .../test/resources/test-documents/testECKEY.pem|   6 ++
 .../test/resources/test-documents/testECPARAMS.pem |   3 +
 .../test/resources/test-documents/testRSAKEY.der   | Bin 0 -> 610 bytes
 .../test/resources/test-documents/testRSAKEY.pem   |  16 +
 13 files changed, 162 insertions(+), 4 deletions(-)
 create mode 100644 tika-parsers/src/test/resources/test-documents/testCERT.der
 create mode 100644 tika-parsers/src/test/resources/test-documents/testCERT.pem
 create mode 100644 
tika-parsers/src/test/resources/test-documents/testDSAKEY.der
 create mode 100644 
tika-parsers/src/test/resources/test-documents/testDSAKEY.pem
 create mode 100644 
tika-parsers/src/test/resources/test-documents/testDSAPARAMS.pem
 create mode 100644 tika-parsers/src/test/resources/test-documents/testECKEY.der
 create mode 100644 tika-parsers/src/test/resources/test-documents/testECKEY.pem
 create mode 100644 
tika-parsers/src/test/resources/test-documents/testECPARAMS.pem
 create mode 100644 
tika-parsers/src/test/resources/test-documents/testRSAKEY.der
 create mode 100644 
tika-parsers/src/test/resources/test-documents/testRSAKEY.pem



[tika] 02/05: TIKA-3205 Add magic for X509 PEM certificate, and tweak default type

2020-09-30 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch branch_1x
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 28ec71d2d3afa52e84fa16ee5df289dd696980ed
Author: Nick Burch 
AuthorDate: Tue Sep 29 15:49:14 2020 +0100

TIKA-3205 Add magic for X509 PEM certificate, and tweak default type
---
 .../org/apache/tika/mime/tika-mimetypes.xml| 26 +-
 .../java/org/apache/tika/TikaDetectionTest.java|  5 +++--
 2 files changed, 28 insertions(+), 3 deletions(-)

diff --git 
a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml 
b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
index b4981a5..a0d172d 100644
--- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
+++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
@@ -4534,10 +4534,34 @@
 
 
   
-  
+
+  
+
+
 
+
 
+
+  
+  
+  
+  
+ 
+ 
+  
+
   
+  
+
+  
+  
+  
+  
+
+  
+
   
 
   
diff --git a/tika-core/src/test/java/org/apache/tika/TikaDetectionTest.java 
b/tika-core/src/test/java/org/apache/tika/TikaDetectionTest.java
index 8f14a2b..1908489 100644
--- a/tika-core/src/test/java/org/apache/tika/TikaDetectionTest.java
+++ b/tika-core/src/test/java/org/apache/tika/TikaDetectionTest.java
@@ -624,8 +624,9 @@ public class TikaDetectionTest {
 assertEquals("application/x-texinfo", tika.detect("x.texi"));
 assertEquals("application/x-ustar", tika.detect("x.ustar"));
 assertEquals("application/x-wais-source", tika.detect("x.src"));
-assertEquals("application/x-x509-ca-cert", tika.detect("x.der"));
-assertEquals("application/x-x509-ca-cert", tika.detect("x.crt"));
+// Differ from httpd - use a common parent for CA and User certs
+//assertEquals("application/x-x509-ca-cert", tika.detect("x.der"));
+//assertEquals("application/x-x509-ca-cert", tika.detect("x.crt"));
 assertEquals("application/x-xfig", tika.detect("x.fig"));
 assertEquals("application/x-xpinstall", tika.detect("x.xpi"));
 assertEquals("application/xenc+xml", tika.detect("x.xenc"));



[tika] 03/05: Add some more DER magic for certificates, and add tests TIKA-3205

2020-09-30 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch branch_1x
in repository https://gitbox.apache.org/repos/asf/tika.git

commit e95287761da40c72f45390d1b892d8cdef33c216
Author: Nick Burch 
AuthorDate: Tue Sep 29 16:23:08 2020 +0100

Add some more DER magic for certificates, and add tests TIKA-3205
---
 .../org/apache/tika/mime/tika-mimetypes.xml| 28 ++
 .../java/org/apache/tika/mime/TestMimeTypes.java   | 11 +
 2 files changed, 34 insertions(+), 5 deletions(-)

diff --git a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
index a0d172d..96301aa 100644
--- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
+++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
@@ -4546,10 +4546,16 @@
   
   
   
- 
- 
+  mask="0xFFF8" offset="0">
+ 
+ 
+ 
+ 
+ 
+ 
   
 
   
@@ -4557,8 +4563,20 @@
 
   
   
+  
+  
+  
+  
+  
+  
   
-  
+  
+
+  
+  
+
+  
+  
 
   
 
diff --git a/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java b/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java
index d05d080..80c60a2 100644
--- a/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java
+++ b/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java
@@ -1143,6 +1143,17 @@ public class TestMimeTypes {
 }
 
 @Test
+public void testCertificatesKeys() throws Exception {
+assertType("application/x-x509-cert", "testCERT.pem");
+assertType("application/x-x509-cert", "testCERT.der");
+assertTypeByData("application/x-x509-cert", "testCERT.pem");
+assertTypeByData("application/x-x509-cert", "testCERT.der");
+// Keys need the data to identify, name isn't enough
+assertTypeByData("application/x-x509-key", "testRSAKEY.pem");
+assertTypeByData("application/x-x509-key", "testRSAKEY.der");
+}
+
+@Test
 public void testVandICalendars() throws Exception {
 assertType("text/calendar", "testICalendar.ics");
 assertType("text/x-vcalendar", "testVCalendar.vcs");
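
The mimetypes hunk in this message keeps a `mask="0xFFF8" offset="0"` attribute on one of the certificate matches, but the match value itself did not survive in the archived text. As a rough sketch of what a masked match at an offset means in general, the comparison is `(data & mask) == (value & mask)`, byte by byte. This is plain Java, not Tika's matcher; the class name and the value `0x3080` are made-up placeholders.

```java
/** Illustrative only: a masked byte-pattern match, not Tika's magic matcher. */
final class MaskedMatchSketch {

    /**
     * Returns true if data, starting at offset, equals value on every bit
     * that is set in mask. value and mask must have the same length.
     */
    static boolean matches(byte[] data, int offset, byte[] value, byte[] mask) {
        if (value.length != mask.length) {
            throw new IllegalArgumentException("value and mask must be the same length");
        }
        if (offset < 0 || data.length - offset < value.length) {
            return false; // not enough bytes to compare
        }
        for (int i = 0; i < value.length; i++) {
            // Only bits selected by the mask take part in the comparison.
            if ((data[offset + i] & mask[i]) != (value[i] & mask[i])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] mask = {(byte) 0xFF, (byte) 0xF8};   // mask taken from the hunk
        byte[] value = {(byte) 0x30, (byte) 0x80};  // hypothetical value, not from the commit
        byte[] header = {(byte) 0x30, (byte) 0x82, (byte) 0x02, (byte) 0xBA};
        System.out.println(matches(header, 0, value, mask));
    }
}
```

With this mask the low three bits of the second byte are ignored, so under the assumed value anything from `0x80` through `0x87` in that position would match.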



[tika] 05/05: Make the DER private key mostly-match a bit more specific

2020-09-30 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch branch_1x
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 1fce08921cea11bc79c708d4f72b9e4bf70b8c2c
Author: Nick Burch 
AuthorDate: Tue Sep 29 16:51:19 2020 +0100

Make the DER private key mostly-match a bit more specific
---
 .../main/resources/org/apache/tika/mime/tika-mimetypes.xml| 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
index 792448b..d281751 100644
--- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
+++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
@@ -4583,12 +4583,13 @@
   
 
 
-
+
+
 
-  
-  
+  
+  
 
   
 



[tika] 01/05: Add test certificate and key for TIKA-3205

2020-09-30 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch branch_1x
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 5c2c4a2fb91cc160eaf007b71efcd854402e1624
Author: Nick Burch 
AuthorDate: Tue Sep 29 15:26:48 2020 +0100

Add test certificate and key for TIKA-3205
---
 .../src/test/resources/test-documents/testCERT.der | Bin 0 -> 702 bytes
 .../src/test/resources/test-documents/testCERT.pem |  17 +
 .../src/test/resources/test-documents/testRSAKEY.der   | Bin 0 -> 610 bytes
 .../src/test/resources/test-documents/testRSAKEY.pem   |  16 
 4 files changed, 33 insertions(+)

diff --git a/tika-parsers/src/test/resources/test-documents/testCERT.der b/tika-parsers/src/test/resources/test-documents/testCERT.der
new file mode 100644
index 0000000..935f1f6
Binary files /dev/null and b/tika-parsers/src/test/resources/test-documents/testCERT.der differ
diff --git a/tika-parsers/src/test/resources/test-documents/testCERT.pem b/tika-parsers/src/test/resources/test-documents/testCERT.pem
new file mode 100644
index 0000000..dbfd849
--- /dev/null
+++ b/tika-parsers/src/test/resources/test-documents/testCERT.pem
@@ -0,0 +1,17 @@
+-----BEGIN CERTIFICATE-----
+MIICujCCAiOgAwIBAgIUKOX/l1c68ya6jnfeRJ8uP9kvVx8wDQYJKoZIhvcNAQEL
+BQAwbzELMAkGA1UEBhMCWloxFDASBgNVBAgMC0FwYWNoZSBUaWthMQ8wDQYDVQQH
+DAZBcGFjaGUxDTALBgNVBAoMBFRpa2ExFDASBgNVBAsMC0FwYWNoZSBUaWthMRQw
+EgYDVQQDDAtBcGFjaGUgVGlrYTAeFw0yMDA5MjkxNDIzNDRaFw0zMDA5MjcxNDIz
+NDRaMG8xCzAJBgNVBAYTAlpaMRQwEgYDVQQIDAtBcGFjaGUgVGlrYTEPMA0GA1UE
+BwwGQXBhY2hlMQ0wCwYDVQQKDARUaWthMRQwEgYDVQQLDAtBcGFjaGUgVGlrYTEU
+MBIGA1UEAwwLQXBhY2hlIFRpa2EwgZ8wDQYJKoZIhvcNAQEBBQADgY0AMIGJAoGB
+AMeVjMm2uyhe7HkNFFBU6nnI9niJnn+hv+3TTDw23/GH2/d1T3JSpuSFstHvwS0t
+umrSb6Py7igD4fz3+aeZU1gDrby4f9KemnLBlAU63VuPyCDyWj2XqbsZDGdnbIy2
+FnBBy2LBn5p0gDwoDpnmhHPHIJZo9OMGH/5hWQUt6+rtAgMBAAGjUzBRMB0GA1Ud
+DgQWBBTM8K2WIAuPiv0VgrRoMn2fAGua1jAfBgNVHSMEGDAWgBTM8K2WIAuPiv0V
+grRoMn2fAGua1jAPBgNVHRMBAf8EBTADAQH/MA0GCSqGSIb3DQEBCwUAA4GBALqE
++ja5Hx78Dpym/HxP50TfadwmEes+JXYptykWnuOWgLlqLuGAqJctLOKoR73r7d9d
+zJBtdr3A5uTg9vWNMSA2lPdBr/NplNaI8bso+8dRWdkiMut+j7xqTFl8MVMriRSR
+a2cA9BsUlpHjJdVjcFweAtdlINZDACoZubCTM7ng
+-----END CERTIFICATE-----
diff --git a/tika-parsers/src/test/resources/test-documents/testRSAKEY.der b/tika-parsers/src/test/resources/test-documents/testRSAKEY.der
new file mode 100644
index 0000000..22c4f86
Binary files /dev/null and b/tika-parsers/src/test/resources/test-documents/testRSAKEY.der differ
diff --git a/tika-parsers/src/test/resources/test-documents/testRSAKEY.pem b/tika-parsers/src/test/resources/test-documents/testRSAKEY.pem
new file mode 100644
index 0000000..0971b76
--- /dev/null
+++ b/tika-parsers/src/test/resources/test-documents/testRSAKEY.pem
@@ -0,0 +1,16 @@
+-----BEGIN PRIVATE KEY-----
+MIICeAIBADANBgkqhkiG9w0BAQEFAASCAmIwggJeAgEAAoGBAMeVjMm2uyhe7HkN
+FFBU6nnI9niJnn+hv+3TTDw23/GH2/d1T3JSpuSFstHvwS0tumrSb6Py7igD4fz3
++aeZU1gDrby4f9KemnLBlAU63VuPyCDyWj2XqbsZDGdnbIy2FnBBy2LBn5p0gDwo
+DpnmhHPHIJZo9OMGH/5hWQUt6+rtAgMBAAECgYBdb1TGxiYeQzoffZEJ/ob61qsU
+SRELnVS16RqigeobL8g5tBCqa6k4CKNrhvt/xA2mnrenID6AOzkb7ZdR8ATEtojF
+JjLZ8zmXACU3WetoRUvh2uxlFpxFeK0yQlaEWcvE4Z9MQe3V8pBvMQUNEZxN4bHT
+1eMla9O65TR49uxaPQJBAO/Spm9ln02CjnxCHiGmRQ77gUNz39AtrKRLQBv/uEB2
+fhHAvFoSPGXaIgd73GgZEnM/a+faLrMu9NvemMd5aYMCQQDVDAsjaa72+5ZS87zE
+xLDrFT1cKM8U1G0ikdGl6rejDnSoiwfZ8DXpSBOOiSkf/PX9zDXDPQl9nHLjmDn9
+wN7PAkEAxsPTF66lGoujZk8yQ/dXczR2DR7Dl/nTBZQsvUfzQNI0aKhSM2C72Dqz
+S3qX0Vs+VHBzEYVegTngzT4vZ9wz2wJBAMNXCZdsvUokIA7rALgCCJ1jmiE4Ibdd
+lrtNrEZO0hWlmX04DPjc8PF2bsgQJy73R61vYhQjkOIlYoof93wdLa0CQQDTLHSB
+8e8f81Jq+zbLReAQ6ch+fEulaMPlPY0OqgExBxdbwXnlPENw09+EiQkKSSo8qhY1
+guri/IWyq3LYm8nE
+-----END PRIVATE KEY-----
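
Both PEM bodies in this message start with `MII`, which is what Base64 produces for the bytes `0x30 0x82`: an ASN.1 SEQUENCE tag followed by a two-byte long-form length. A minimal sketch using only `java.util.Base64` that decodes the first eight characters of the testCERT.pem body and reads the DER length from them; the class and method names are illustrative and not part of Tika.

```java
import java.util.Base64;

/** Illustrative only: reads the outer DER SEQUENCE header from a PEM body prefix. */
final class DerHeaderSketch {

    /**
     * Decodes a Base64 prefix (its length must be a multiple of 4) and returns
     * the total encoded size of the outer SEQUENCE, or -1 if the prefix is not
     * a SEQUENCE with a two-byte long-form length.
     */
    static int totalDerLength(String base64Prefix) {
        byte[] b = Base64.getDecoder().decode(base64Prefix);
        if (b.length < 4 || (b[0] & 0xFF) != 0x30 || (b[1] & 0xFF) != 0x82) {
            return -1;
        }
        int contentLength = ((b[2] & 0xFF) << 8) | (b[3] & 0xFF);
        return 4 + contentLength; // tag byte + 0x82 + two length bytes + content
    }

    public static void main(String[] args) {
        // First eight characters of the testCERT.pem body in this message.
        System.out.println(totalDerLength("MIICujCC")); // prints 702
    }
}
```

The result agrees with the 702-byte size the diffstat reports for testCERT.der, assuming both files carry the same certificate.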



[tika] branch main updated: Move new test files to 2.x folder, doh!

2020-09-30 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git


The following commit(s) were added to refs/heads/main by this push:
 new 0844dce  Move new test files to 2.x folder, doh!
0844dce is described below

commit 0844dce7d19b90de288c456e78018ca8729895a7
Author: Nick Burch 
AuthorDate: Wed Sep 30 16:54:44 2020 +0100

Move new test files to 2.x folder, doh!
---
 .../src/test/resources/test-documents/testCERT.der  | Bin
 .../src/test/resources/test-documents/testCERT.pem  |   0
 .../src/test/resources/test-documents/testDSAKEY.der| Bin
 .../src/test/resources/test-documents/testDSAKEY.pem|   0
 .../src/test/resources/test-documents/testDSAPARAMS.pem |   0
 .../src/test/resources/test-documents/testECKEY.der | Bin
 .../src/test/resources/test-documents/testECKEY.pem |   0
 .../src/test/resources/test-documents/testECPARAMS.pem  |   0
 .../src/test/resources/test-documents/testRSAKEY.der| Bin
 .../src/test/resources/test-documents/testRSAKEY.pem|   0
 10 files changed, 0 insertions(+), 0 deletions(-)

diff --git a/tika-parsers/src/test/resources/test-documents/testCERT.der b/tika-parser-modules/tika-parser-integration-tests/src/test/resources/test-documents/testCERT.der
similarity index 100%
rename from tika-parsers/src/test/resources/test-documents/testCERT.der
rename to tika-parser-modules/tika-parser-integration-tests/src/test/resources/test-documents/testCERT.der
diff --git a/tika-parsers/src/test/resources/test-documents/testCERT.pem b/tika-parser-modules/tika-parser-integration-tests/src/test/resources/test-documents/testCERT.pem
similarity index 100%
rename from tika-parsers/src/test/resources/test-documents/testCERT.pem
rename to tika-parser-modules/tika-parser-integration-tests/src/test/resources/test-documents/testCERT.pem
diff --git a/tika-parsers/src/test/resources/test-documents/testDSAKEY.der b/tika-parser-modules/tika-parser-integration-tests/src/test/resources/test-documents/testDSAKEY.der
similarity index 100%
rename from tika-parsers/src/test/resources/test-documents/testDSAKEY.der
rename to tika-parser-modules/tika-parser-integration-tests/src/test/resources/test-documents/testDSAKEY.der
diff --git a/tika-parsers/src/test/resources/test-documents/testDSAKEY.pem b/tika-parser-modules/tika-parser-integration-tests/src/test/resources/test-documents/testDSAKEY.pem
similarity index 100%
rename from tika-parsers/src/test/resources/test-documents/testDSAKEY.pem
rename to tika-parser-modules/tika-parser-integration-tests/src/test/resources/test-documents/testDSAKEY.pem
diff --git a/tika-parsers/src/test/resources/test-documents/testDSAPARAMS.pem b/tika-parser-modules/tika-parser-integration-tests/src/test/resources/test-documents/testDSAPARAMS.pem
similarity index 100%
rename from tika-parsers/src/test/resources/test-documents/testDSAPARAMS.pem
rename to tika-parser-modules/tika-parser-integration-tests/src/test/resources/test-documents/testDSAPARAMS.pem
diff --git a/tika-parsers/src/test/resources/test-documents/testECKEY.der b/tika-parser-modules/tika-parser-integration-tests/src/test/resources/test-documents/testECKEY.der
similarity index 100%
rename from tika-parsers/src/test/resources/test-documents/testECKEY.der
rename to tika-parser-modules/tika-parser-integration-tests/src/test/resources/test-documents/testECKEY.der
diff --git a/tika-parsers/src/test/resources/test-documents/testECKEY.pem b/tika-parser-modules/tika-parser-integration-tests/src/test/resources/test-documents/testECKEY.pem
similarity index 100%
rename from tika-parsers/src/test/resources/test-documents/testECKEY.pem
rename to tika-parser-modules/tika-parser-integration-tests/src/test/resources/test-documents/testECKEY.pem
diff --git a/tika-parsers/src/test/resources/test-documents/testECPARAMS.pem b/tika-parser-modules/tika-parser-integration-tests/src/test/resources/test-documents/testECPARAMS.pem
similarity index 100%
rename from tika-parsers/src/test/resources/test-documents/testECPARAMS.pem
rename to tika-parser-modules/tika-parser-integration-tests/src/test/resources/test-documents/testECPARAMS.pem
diff --git a/tika-parsers/src/test/resources/test-documents/testRSAKEY.der b/tika-parser-modules/tika-parser-integration-tests/src/test/resources/test-documents/testRSAKEY.der
similarity index 100%
rename from tika-parsers/src/test/resources/test-documents/testRSAKEY.der
rename to tika-parser-modules/tika-parser-integration-tests/src/test/resources/test-documents/testRSAKEY.der
diff --git a/tika-parsers/src/test/resources/test-documents/testRSAKEY.pem b/tika-parser-modules/tika-parser-integration-tests/src/test/resources/test-documents/testRSAKEY.pem
similarity index 100%
rename from tika-parsers/src/test/resources/test-documents/testRSAKEY.pem
rename to 
tika-parser-modules/tika-parser-integration-tests/src/test/resources

[tika] 02/05: TIKA-3205 Add magic for X509 PEM certificate, and tweak default type

2020-09-29 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git

commit ecd1d62ad9e4d2ddd53abf204539e5d765e6c624
Author: Nick Burch 
AuthorDate: Tue Sep 29 15:49:14 2020 +0100

TIKA-3205 Add magic for X509 PEM certificate, and tweak default type
---
 .../org/apache/tika/mime/tika-mimetypes.xml| 26 +-
 .../java/org/apache/tika/TikaDetectionTest.java|  5 +++--
 2 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
index 396fb9a..bdbeee5 100644
--- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
+++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
@@ -4678,10 +4678,34 @@
 
 
   
-  
+
+  
+
+
 
+
 
+
+  
+  
+  
+  
+ 
+ 
+  
+
   
+  
+
+  
+  
+  
+  
+
+  
+
   
 
   
diff --git a/tika-core/src/test/java/org/apache/tika/TikaDetectionTest.java b/tika-core/src/test/java/org/apache/tika/TikaDetectionTest.java
index eb3bb19..2364daa 100644
--- a/tika-core/src/test/java/org/apache/tika/TikaDetectionTest.java
+++ b/tika-core/src/test/java/org/apache/tika/TikaDetectionTest.java
@@ -624,8 +624,9 @@ public class TikaDetectionTest {
 assertEquals("application/x-texinfo", tika.detect("x.texi"));
 assertEquals("application/x-ustar", tika.detect("x.ustar"));
 assertEquals("application/x-wais-source", tika.detect("x.src"));
-assertEquals("application/x-x509-ca-cert", tika.detect("x.der"));
-assertEquals("application/x-x509-ca-cert", tika.detect("x.crt"));
+// Differ from httpd - use a common parent for CA and User certs
+//assertEquals("application/x-x509-ca-cert", tika.detect("x.der"));
+//assertEquals("application/x-x509-ca-cert", tika.detect("x.crt"));
 assertEquals("application/x-xfig", tika.detect("x.fig"));
 assertEquals("application/x-xpinstall", tika.detect("x.xpi"));
 assertEquals("application/xenc+xml", tika.detect("x.xenc"));
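
The commit in this message adds content-based detection for PEM certificates. PEM files are text wrapped in `-----BEGIN <label>-----` and `-----END <label>-----` lines, so the first line alone is enough to tell a certificate from a key. A minimal sketch of that idea in plain Java follows; the class and method names are hypothetical, and this is not how Tika's XML-driven magic is implemented.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;

/** Illustrative only: extracts the label from a leading PEM BEGIN line. */
final class PemLabelSketch {
    private static final String PREFIX = "-----BEGIN ";
    private static final String SUFFIX = "-----";

    /** Returns the PEM label (for example "CERTIFICATE") if data starts with a BEGIN line. */
    static Optional<String> label(byte[] data) {
        String text = new String(data, StandardCharsets.US_ASCII);
        if (!text.startsWith(PREFIX)) {
            return Optional.empty();
        }
        int end = text.indexOf(SUFFIX, PREFIX.length());
        if (end < 0) {
            return Optional.empty();
        }
        String label = text.substring(PREFIX.length(), end);
        // A valid label is non-empty and sits on the first line.
        if (label.isEmpty() || label.indexOf('\n') >= 0 || label.indexOf('\r') >= 0) {
            return Optional.empty();
        }
        return Optional.of(label);
    }

    public static void main(String[] args) {
        byte[] pem = "-----BEGIN CERTIFICATE-----\nMIIC".getBytes(StandardCharsets.US_ASCII);
        System.out.println(label(pem).orElse("not PEM"));
    }
}
```

For the test files in this series the label would be `CERTIFICATE` or `PRIVATE KEY`; mapping a label to a media type such as `application/x-x509-cert` is left out of the sketch.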



[tika] 04/05: Split the Certificate and Key mimetypes into DER and PEM subtypes, add test EC files. TIKA-3205

2020-09-29 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git

commit fa1b2ef87157f51797d0dcaed36ebc990e538910
Author: Nick Burch 
AuthorDate: Tue Sep 29 16:48:40 2020 +0100

Split the Certificate and Key mimetypes into DER and PEM subtypes, add test EC files. TIKA-3205
---
 .../org/apache/tika/mime/tika-mimetypes.xml|  41 +
 .../java/org/apache/tika/mime/TestMimeTypes.java   |  19 +++---
 .../test/resources/test-documents/testDSAKEY.der   | Bin 0 -> 834 bytes
 .../test/resources/test-documents/testDSAKEY.pem   |  15 
 .../resources/test-documents/testDSAPARAMS.pem |  14 +++
 .../test/resources/test-documents/testECKEY.der| Bin 0 -> 167 bytes
 .../test/resources/test-documents/testECKEY.pem|   6 +++
 .../test/resources/test-documents/testECPARAMS.pem |   3 ++
 8 files changed, 84 insertions(+), 14 deletions(-)

diff --git a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
index a995563..2c4a5e5 100644
--- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
+++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
@@ -4682,13 +4682,19 @@
   
 
 
-
-
 
+  
+  
+
+
 
-  
   
-  
+
+  
+  
+
+
+
   
  
@@ -4703,9 +4709,12 @@
   
 
   
+
   
+  
+  
+
 
-  
   
   
   
@@ -4713,16 +4722,32 @@
   
   
   
-  
-  
 
   
+  
+
+
+
+
+  
+  
+
+  
+
   
 
-  
+  
   
 
   
+  
+
+  
+  
+
+  
 
   
 
diff --git a/tika-parser-modules/tika-parser-integration-tests/src/test/java/org/apache/tika/mime/TestMimeTypes.java b/tika-parser-modules/tika-parser-integration-tests/src/test/java/org/apache/tika/mime/TestMimeTypes.java
index dc3f303..2960b56 100644
--- a/tika-parser-modules/tika-parser-integration-tests/src/test/java/org/apache/tika/mime/TestMimeTypes.java
+++ b/tika-parser-modules/tika-parser-integration-tests/src/test/java/org/apache/tika/mime/TestMimeTypes.java
@@ -1142,13 +1142,20 @@ public class TestMimeTypes {
 
 @Test
 public void testCertificatesKeys() throws Exception {
-assertType("application/x-x509-cert", "testCERT.pem");
-assertType("application/x-x509-cert", "testCERT.der");
-assertTypeByData("application/x-x509-cert", "testCERT.pem");
-assertTypeByData("application/x-x509-cert", "testCERT.der");
+assertType("application/x-x509-cert; format=pem", "testCERT.pem");
+assertType("application/x-x509-cert; format=der", "testCERT.der");
+assertTypeByData("application/x-x509-cert; format=pem", "testCERT.pem");
+assertTypeByData("application/x-x509-cert; format=der", "testCERT.der");
 // Keys need the data to identify, name isn't enough
-assertTypeByData("application/x-x509-key", "testRSAKEY.pem");
-assertTypeByData("application/x-x509-key", "testRSAKEY.der");
+assertTypeByData("application/x-x509-key; format=pem", "testECKEY.pem");
+assertTypeByData("application/x-x509-key; format=der", "testECKEY.der");
+assertTypeByData("application/x-x509-key; format=pem", "testRSAKEY.pem");
+assertTypeByData("application/x-x509-key; format=der", "testRSAKEY.der");
+assertTypeByData("application/x-x509-key; format=pem", "testDSAKEY.pem");
+assertTypeByData("application/x-x509-key; format=der", "testDSAKEY.der");
+// Parameters only have PEM form, always need data
+assertTypeByData("application/x-x509-dsa-parameters", "testDSAPARAMS.pem");
+assertTypeByData("application/x-x509-ec-parameters", "testECPARAMS.pem");
 }
 
 @Test
diff --git a/tika-parsers/src/test/resources/test-documents/testDSAKEY.der b/tika-parsers/src/test/resources/test-documents/testDSAKEY.der
new file mode 100644
index 0000000..9ed2eb9
Binary files /dev/null and b/tika-parsers/src/test/resources/test-documents/testDSAKEY.der differ
diff --git a/tika-parsers/src/test/resources/test-documents/testDSAKEY.pem b/tika-parsers/src/test/resources/test-documents/testDSAKEY.pem
new file mode 100644
index 0000000..2b8781a
--- /dev/null
+++ b/tika-parsers/src/test/resources/test-documents/testDSAKEY.pem
@@ -0,0 +1,15 @@
+-----BEGIN PRIVATE KEY-----
+MIICTQIBADCCAi0GByqGSM44BAEwggIgAoIBAQDRXU0Be5k0MI3skB6K0PhyptBh
+WSJ1l+NV

[tika] 01/05: Add test certificate and key for TIKA-3205

2020-09-29 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git

commit ad0d98b9a155e483b815eb01e36ebd02a101695a
Author: Nick Burch 
AuthorDate: Tue Sep 29 15:26:48 2020 +0100

Add test certificate and key for TIKA-3205
---
 .../src/test/resources/test-documents/testCERT.der | Bin 0 -> 702 bytes
 .../src/test/resources/test-documents/testCERT.pem |  17 +
 .../src/test/resources/test-documents/testRSAKEY.der   | Bin 0 -> 610 bytes
 .../src/test/resources/test-documents/testRSAKEY.pem   |  16 
 4 files changed, 33 insertions(+)

diff --git a/tika-parsers/src/test/resources/test-documents/testCERT.der b/tika-parsers/src/test/resources/test-documents/testCERT.der
new file mode 100644
index 0000000..935f1f6
Binary files /dev/null and b/tika-parsers/src/test/resources/test-documents/testCERT.der differ
diff --git a/tika-parsers/src/test/resources/test-documents/testCERT.pem b/tika-parsers/src/test/resources/test-documents/testCERT.pem
new file mode 100644
index 0000000..dbfd849
--- /dev/null
+++ b/tika-parsers/src/test/resources/test-documents/testCERT.pem
@@ -0,0 +1,17 @@
+-----BEGIN CERTIFICATE-----
+MIICujCCAiOgAwIBAgIUKOX/l1c68ya6jnfeRJ8uP9kvVx8wDQYJKoZIhvcNAQEL
+BQAwbzELMAkGA1UEBhMCWloxFDASBgNVBAgMC0FwYWNoZSBUaWthMQ8wDQYDVQQH
+DAZBcGFjaGUxDTALBgNVBAoMBFRpa2ExFDASBgNVBAsMC0FwYWNoZSBUaWthMRQw
+EgYDVQQDDAtBcGFjaGUgVGlrYTAeFw0yMDA5MjkxNDIzNDRaFw0zMDA5MjcxNDIz
+NDRaMG8xCzAJBgNVBAYTAlpaMRQwEgYDVQQIDAtBcGFjaGUgVGlrYTEPMA0GA1UE
+BwwGQXBhY2hlMQ0wCwYDVQQKDARUaWthMRQwEgYDVQQLDAtBcGFjaGUgVGlrYTEU
+MBIGA1UEAwwLQXBhY2hlIFRpa2EwgZ8wDQYJKoZIhvcNAQEBBQADgY0AMIGJAoGB
+AMeVjMm2uyhe7HkNFFBU6nnI9niJnn+hv+3TTDw23/GH2/d1T3JSpuSFstHvwS0t
+umrSb6Py7igD4fz3+aeZU1gDrby4f9KemnLBlAU63VuPyCDyWj2XqbsZDGdnbIy2
+FnBBy2LBn5p0gDwoDpnmhHPHIJZo9OMGH/5hWQUt6+rtAgMBAAGjUzBRMB0GA1Ud
+DgQWBBTM8K2WIAuPiv0VgrRoMn2fAGua1jAfBgNVHSMEGDAWgBTM8K2WIAuPiv0V
+grRoMn2fAGua1jAPBgNVHRMBAf8EBTADAQH/MA0GCSqGSIb3DQEBCwUAA4GBALqE
++ja5Hx78Dpym/HxP50TfadwmEes+JXYptykWnuOWgLlqLuGAqJctLOKoR73r7d9d
+zJBtdr3A5uTg9vWNMSA2lPdBr/NplNaI8bso+8dRWdkiMut+j7xqTFl8MVMriRSR
+a2cA9BsUlpHjJdVjcFweAtdlINZDACoZubCTM7ng
+-----END CERTIFICATE-----
diff --git a/tika-parsers/src/test/resources/test-documents/testRSAKEY.der b/tika-parsers/src/test/resources/test-documents/testRSAKEY.der
new file mode 100644
index 0000000..22c4f86
Binary files /dev/null and b/tika-parsers/src/test/resources/test-documents/testRSAKEY.der differ
diff --git a/tika-parsers/src/test/resources/test-documents/testRSAKEY.pem b/tika-parsers/src/test/resources/test-documents/testRSAKEY.pem
new file mode 100644
index 0000000..0971b76
--- /dev/null
+++ b/tika-parsers/src/test/resources/test-documents/testRSAKEY.pem
@@ -0,0 +1,16 @@
+-----BEGIN PRIVATE KEY-----
+MIICeAIBADANBgkqhkiG9w0BAQEFAASCAmIwggJeAgEAAoGBAMeVjMm2uyhe7HkN
+FFBU6nnI9niJnn+hv+3TTDw23/GH2/d1T3JSpuSFstHvwS0tumrSb6Py7igD4fz3
++aeZU1gDrby4f9KemnLBlAU63VuPyCDyWj2XqbsZDGdnbIy2FnBBy2LBn5p0gDwo
+DpnmhHPHIJZo9OMGH/5hWQUt6+rtAgMBAAECgYBdb1TGxiYeQzoffZEJ/ob61qsU
+SRELnVS16RqigeobL8g5tBCqa6k4CKNrhvt/xA2mnrenID6AOzkb7ZdR8ATEtojF
+JjLZ8zmXACU3WetoRUvh2uxlFpxFeK0yQlaEWcvE4Z9MQe3V8pBvMQUNEZxN4bHT
+1eMla9O65TR49uxaPQJBAO/Spm9ln02CjnxCHiGmRQ77gUNz39AtrKRLQBv/uEB2
+fhHAvFoSPGXaIgd73GgZEnM/a+faLrMu9NvemMd5aYMCQQDVDAsjaa72+5ZS87zE
+xLDrFT1cKM8U1G0ikdGl6rejDnSoiwfZ8DXpSBOOiSkf/PX9zDXDPQl9nHLjmDn9
+wN7PAkEAxsPTF66lGoujZk8yQ/dXczR2DR7Dl/nTBZQsvUfzQNI0aKhSM2C72Dqz
+S3qX0Vs+VHBzEYVegTngzT4vZ9wz2wJBAMNXCZdsvUokIA7rALgCCJ1jmiE4Ibdd
+lrtNrEZO0hWlmX04DPjc8PF2bsgQJy73R61vYhQjkOIlYoof93wdLa0CQQDTLHSB
+8e8f81Jq+zbLReAQ6ch+fEulaMPlPY0OqgExBxdbwXnlPENw09+EiQkKSSo8qhY1
+guri/IWyq3LYm8nE
+-----END PRIVATE KEY-----



[tika] 05/05: Make the DER private key mostly-match a bit more specific

2020-09-29 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 75c2ff5686a70c0fb15c4b52534c1be09669af1b
Author: Nick Burch 
AuthorDate: Tue Sep 29 16:51:19 2020 +0100

Make the DER private key mostly-match a bit more specific
---
 .../main/resources/org/apache/tika/mime/tika-mimetypes.xml| 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
index 2c4a5e5..404e462 100644
--- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
+++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
@@ -4727,12 +4727,13 @@
   
 
 
-
+
+
 
-  
-  
+  
+  
 
   
 



[tika] 03/05: Add some more DER magic for certificates, and add tests TIKA-3205

2020-09-29 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git

commit b0ae63a1c59ef60ac6b134cadf2053f2e73152d4
Author: Nick Burch 
AuthorDate: Tue Sep 29 16:23:08 2020 +0100

Add some more DER magic for certificates, and add tests TIKA-3205
---
 .../org/apache/tika/mime/tika-mimetypes.xml| 28 ++
 .../java/org/apache/tika/mime/TestMimeTypes.java   | 11 +
 2 files changed, 34 insertions(+), 5 deletions(-)

diff --git a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
index bdbeee5..a995563 100644
--- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
+++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
@@ -4690,10 +4690,16 @@
   
   
   
- 
- 
+  mask="0xFFF8" offset="0">
+ 
+ 
+ 
+ 
+ 
+ 
   
 
   
@@ -4701,8 +4707,20 @@
 
   
   
+  
+  
+  
+  
+  
+  
   
-  
+  
+
+  
+  
+
+  
+  
 
   
 
diff --git a/tika-parser-modules/tika-parser-integration-tests/src/test/java/org/apache/tika/mime/TestMimeTypes.java b/tika-parser-modules/tika-parser-integration-tests/src/test/java/org/apache/tika/mime/TestMimeTypes.java
index c507986..dc3f303 100644
--- a/tika-parser-modules/tika-parser-integration-tests/src/test/java/org/apache/tika/mime/TestMimeTypes.java
+++ b/tika-parser-modules/tika-parser-integration-tests/src/test/java/org/apache/tika/mime/TestMimeTypes.java
@@ -1141,6 +1141,17 @@ public class TestMimeTypes {
 }
 
 @Test
+public void testCertificatesKeys() throws Exception {
+assertType("application/x-x509-cert", "testCERT.pem");
+assertType("application/x-x509-cert", "testCERT.der");
+assertTypeByData("application/x-x509-cert", "testCERT.pem");
+assertTypeByData("application/x-x509-cert", "testCERT.der");
+// Keys need the data to identify, name isn't enough
+assertTypeByData("application/x-x509-key", "testRSAKEY.pem");
+assertTypeByData("application/x-x509-key", "testRSAKEY.der");
+}
+
+@Test
 public void testVandICalendars() throws Exception {
 assertType("text/calendar", "testICalendar.ics");
 assertType("text/x-vcalendar", "testVCalendar.vcs");



[tika] branch main updated (6591b32 -> 75c2ff5)

2020-09-29 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a change to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git.


from 6591b32  TIKA-3196 -- ensure that entryCnt is thread-safe across parses; add integration test; clean up existing unused imports.
 new ad0d98b  Add test certificate and key for TIKA-3205
 new ecd1d62  TIKA-3205 Add magic for X509 PEM certificate, and tweak default type
 new b0ae63a  Add some more DER magic for certificates, and add tests TIKA-3205
 new fa1b2ef  Split the Certificate and Key mimetypes into DER and PEM subtypes, add test EC files. TIKA-3205
 new 75c2ff5  Make the DER private key mostly-match a bit more specific

The 5 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 .../org/apache/tika/mime/tika-mimetypes.xml|  72 -
 .../java/org/apache/tika/TikaDetectionTest.java|   5 +-
 .../java/org/apache/tika/mime/TestMimeTypes.java   |  18 ++
 .../src/test/resources/test-documents/testCERT.der | Bin 0 -> 702 bytes
 .../src/test/resources/test-documents/testCERT.pem |  17 +
 .../test/resources/test-documents/testDSAKEY.der   | Bin 0 -> 834 bytes
 .../test/resources/test-documents/testDSAKEY.pem   |  15 +
 .../resources/test-documents/testDSAPARAMS.pem |  14 
 .../test/resources/test-documents/testECKEY.der| Bin 0 -> 167 bytes
 .../test/resources/test-documents/testECKEY.pem|   6 ++
 .../test/resources/test-documents/testECPARAMS.pem |   3 +
 .../test/resources/test-documents/testRSAKEY.der   | Bin 0 -> 610 bytes
 .../test/resources/test-documents/testRSAKEY.pem   |  16 +
 13 files changed, 162 insertions(+), 4 deletions(-)
 create mode 100644 tika-parsers/src/test/resources/test-documents/testCERT.der
 create mode 100644 tika-parsers/src/test/resources/test-documents/testCERT.pem
 create mode 100644 tika-parsers/src/test/resources/test-documents/testDSAKEY.der
 create mode 100644 tika-parsers/src/test/resources/test-documents/testDSAKEY.pem
 create mode 100644 tika-parsers/src/test/resources/test-documents/testDSAPARAMS.pem
 create mode 100644 tika-parsers/src/test/resources/test-documents/testECKEY.der
 create mode 100644 tika-parsers/src/test/resources/test-documents/testECKEY.pem
 create mode 100644 tika-parsers/src/test/resources/test-documents/testECPARAMS.pem
 create mode 100644 tika-parsers/src/test/resources/test-documents/testRSAKEY.der
 create mode 100644 tika-parsers/src/test/resources/test-documents/testRSAKEY.pem



[tika] 04/05: Split the Certificate and Key mimetypes into DER and PEM subtypes, add test EC files. TIKA-3205

2020-09-29 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit c6b30c578e98373496f895cd7caa8317f4212d51
Author: Nick Burch 
AuthorDate: Tue Sep 29 16:48:40 2020 +0100

Split the Certificate and Key mimetypes into DER and PEM subtypes, add test EC files. TIKA-3205
---
 .../org/apache/tika/mime/tika-mimetypes.xml|  41 +
 .../java/org/apache/tika/mime/TestMimeTypes.java   |  19 +++---
 .../test/resources/test-documents/testDSAKEY.der   | Bin 0 -> 834 bytes
 .../test/resources/test-documents/testDSAKEY.pem   |  15 
 .../resources/test-documents/testDSAPARAMS.pem |  14 +++
 .../test/resources/test-documents/testECKEY.der| Bin 0 -> 167 bytes
 .../test/resources/test-documents/testECKEY.pem|   6 +++
 .../test/resources/test-documents/testECPARAMS.pem |   3 ++
 8 files changed, 84 insertions(+), 14 deletions(-)

diff --git a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
index 630b429..abcc5d5 100644
--- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
+++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
@@ -4631,13 +4631,19 @@
   
 
 
-
-
 
+  
+  
+
+
 
-  
   
-  
+
+  
+  
+
+
+
   
  
@@ -4652,9 +4658,12 @@
   
 
   
+
   
+  
+  
+
 
-  
   
   
   
@@ -4662,16 +4671,32 @@
   
   
   
-  
-  
 
   
+  
+
+
+
+
+  
+  
+
+  
+
   
 
-  
+  
   
 
   
+  
+
+  
+  
+
+  
 
   
 
diff --git a/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java b/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java
index a80dc8e..de45faf 100644
--- a/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java
+++ b/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java
@@ -1137,13 +1137,20 @@ public class TestMimeTypes {
 
 @Test
 public void testCertificatesKeys() throws Exception {
-assertType("application/x-x509-cert", "testCERT.pem");
-assertType("application/x-x509-cert", "testCERT.der");
-assertTypeByData("application/x-x509-cert", "testCERT.pem");
-assertTypeByData("application/x-x509-cert", "testCERT.der");
+assertType("application/x-x509-cert; format=pem", "testCERT.pem");
+assertType("application/x-x509-cert; format=der", "testCERT.der");
+assertTypeByData("application/x-x509-cert; format=pem", "testCERT.pem");
+assertTypeByData("application/x-x509-cert; format=der", "testCERT.der");
 // Keys need the data to identify, name isn't enough
-assertTypeByData("application/x-x509-key", "testRSAKEY.pem");
-assertTypeByData("application/x-x509-key", "testRSAKEY.der");
+assertTypeByData("application/x-x509-key; format=pem", "testECKEY.pem");
+assertTypeByData("application/x-x509-key; format=der", "testECKEY.der");
+assertTypeByData("application/x-x509-key; format=pem", "testRSAKEY.pem");
+assertTypeByData("application/x-x509-key; format=der", "testRSAKEY.der");
+assertTypeByData("application/x-x509-key; format=pem", "testDSAKEY.pem");
+assertTypeByData("application/x-x509-key; format=der", "testDSAKEY.der");
+// Parameters only have PEM form, always need data
+assertTypeByData("application/x-x509-dsa-parameters", "testDSAPARAMS.pem");
+assertTypeByData("application/x-x509-ec-parameters", "testECPARAMS.pem");
 }
 
 @Test
diff --git a/tika-parsers/src/test/resources/test-documents/testDSAKEY.der b/tika-parsers/src/test/resources/test-documents/testDSAKEY.der
new file mode 100644
index 0000000..9ed2eb9
Binary files /dev/null and b/tika-parsers/src/test/resources/test-documents/testDSAKEY.der differ
diff --git a/tika-parsers/src/test/resources/test-documents/testDSAKEY.pem b/tika-parsers/src/test/resources/test-documents/testDSAKEY.pem
new file mode 100644
index 0000000..2b8781a
--- /dev/null
+++ b/tika-parsers/src/test/resources/test-documents/testDSAKEY.pem
@@ -0,0 +1,15 @@
+-----BEGIN PRIVATE KEY-----
+MIICTQIBADCCAi0GByqGSM44BAEwggIgAoIBAQDRXU0Be5k0MI3skB6K0PhyptBh
+WSJ1l+NVSOX7wpXC37upcH7a0ZCfU9RyWqcX9dQFw+TWjlH2ANll/FO4osXkkJVY
+oylJ+p0599v6WRPBS/yQpKuvfqEm5HA78J8ILhnyCCw8hqdlrADBOMGf7tGF5Agw
+hEZJdtHjYRzPWzY0eogptg3wQPd/

[tika] 03/05: Add some more DER magic for certificates, and add tests TIKA-3205

2020-09-29 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit eaa712f89d5de9ad06647fa29d10ac1baa47a4c0
Author: Nick Burch 
AuthorDate: Tue Sep 29 16:23:08 2020 +0100

Add some more DER magic for certificates, and add tests TIKA-3205
---
 .../org/apache/tika/mime/tika-mimetypes.xml| 28 ++
 .../java/org/apache/tika/mime/TestMimeTypes.java   | 11 +
 2 files changed, 34 insertions(+), 5 deletions(-)

diff --git 
a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml 
b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
index 3cdea61..630b429 100644
--- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
+++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
@@ -4639,10 +4639,16 @@
   
   
   
- 
- 
+  mask="0xFFF8" offset="0">
+ 
+ 
+ 
+ 
+ 
+ 
   
 
   
@@ -4650,8 +4656,20 @@
 
   
   
+  
+  
+  
+  
+  
+  
   
-  
+  
+
+  
+  
+
+  
+  
 
   
 
diff --git a/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java 
b/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java
index 83f67eb..a80dc8e 100644
--- a/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java
+++ b/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java
@@ -1136,6 +1136,17 @@ public class TestMimeTypes {
 }
 
 @Test
+public void testCertificatesKeys() throws Exception {
+assertType("application/x-x509-cert", "testCERT.pem");
+assertType("application/x-x509-cert", "testCERT.der");
+assertTypeByData("application/x-x509-cert", "testCERT.pem");
+assertTypeByData("application/x-x509-cert", "testCERT.der");
+// Keys need the data to identify, name isn't enough
+assertTypeByData("application/x-x509-key", "testRSAKEY.pem");
+assertTypeByData("application/x-x509-key", "testRSAKEY.der");
+}
+
+@Test
 public void testVandICalendars() throws Exception {
 assertType("text/calendar", "testICalendar.ics");
 assertType("text/x-vcalendar", "testVCalendar.vcs");
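
Note on the mask="0xFFF8" attribute in the tika-mimetypes.xml hunk above: a
masked magic match compares only the bits selected by the mask. The following
is a minimal sketch in Java, assuming the mask is AND-ed with both the file
bytes and the pattern before comparison. The class name, method name and the
pattern value 0x3080 are hypothetical and chosen for illustration only; this
is not Tika's implementation, and the real pattern values are not visible in
the diff.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of Tika: masked comparison of leading bytes. */
final class MaskedMagicSketch {

    /**
     * Returns true when data, starting at offset, equals pattern on every bit
     * selected by mask. pattern and mask must have the same length.
     */
    static boolean matches(byte[] data, int offset, byte[] pattern, byte[] mask) {
        if (pattern.length != mask.length) {
            throw new IllegalArgumentException("pattern and mask lengths differ");
        }
        if (offset < 0 || data.length - offset < pattern.length) {
            return false; // not enough bytes to compare
        }
        for (int i = 0; i < pattern.length; i++) {
            // Compare only the bits that the mask keeps.
            if ((data[offset + i] & mask[i]) != (pattern[i] & mask[i])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] mask = {(byte) 0xFF, (byte) 0xF8};
        byte[] pattern = {0x30, (byte) 0x80}; // illustrative value only
        byte[] derLike = {0x30, (byte) 0x82, 0x02, (byte) 0xBA};
        byte[] pemLike = "-----BEGIN".getBytes(StandardCharsets.US_ASCII);
        System.out.println(matches(derLike, 0, pattern, mask));
        System.out.println(matches(pemLike, 0, pattern, mask));
    }
}
```

With a mask of 0xFFF8 the low three bits of the second byte are ignored, so
one pattern covers eight consecutive second-byte values.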



[tika] branch master updated (62fe4ad -> 6183452)

2020-09-29 Thread nick

nick pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git.


from 62fe4ad  TIKA-3104 -- add detection and parsing for xml based plist 
files
 new 5fdb70a  Add test certificate and key for TIKA-3205
 new c3fff83  TIKA-3205 Add magic for X509 PEM certificate, and tweak 
default type
 new eaa712f  Add some more DER magic for certificates, and add tests 
TIKA-3205
 new c6b30c5  Split the Certificate and Key mimetypes into DER and PEM 
subtypes, add test EC files. TIKA-3205
 new 6183452  Make the DER private key mostly-match a bit more specific

The 5 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 .../org/apache/tika/mime/tika-mimetypes.xml|  72 -
 .../java/org/apache/tika/TikaDetectionTest.java|   5 +-
 .../java/org/apache/tika/mime/TestMimeTypes.java   |  18 ++
 .../src/test/resources/test-documents/testCERT.der | Bin 0 -> 702 bytes
 .../src/test/resources/test-documents/testCERT.pem |  17 +
 .../test/resources/test-documents/testDSAKEY.der   | Bin 0 -> 834 bytes
 .../test/resources/test-documents/testDSAKEY.pem   |  15 +
 .../resources/test-documents/testDSAPARAMS.pem |  14 
 .../test/resources/test-documents/testECKEY.der| Bin 0 -> 167 bytes
 .../test/resources/test-documents/testECKEY.pem|   6 ++
 .../test/resources/test-documents/testECPARAMS.pem |   3 +
 .../test/resources/test-documents/testRSAKEY.der   | Bin 0 -> 610 bytes
 .../test/resources/test-documents/testRSAKEY.pem   |  16 +
 13 files changed, 162 insertions(+), 4 deletions(-)
 create mode 100644 tika-parsers/src/test/resources/test-documents/testCERT.der
 create mode 100644 tika-parsers/src/test/resources/test-documents/testCERT.pem
 create mode 100644 
tika-parsers/src/test/resources/test-documents/testDSAKEY.der
 create mode 100644 
tika-parsers/src/test/resources/test-documents/testDSAKEY.pem
 create mode 100644 
tika-parsers/src/test/resources/test-documents/testDSAPARAMS.pem
 create mode 100644 tika-parsers/src/test/resources/test-documents/testECKEY.der
 create mode 100644 tika-parsers/src/test/resources/test-documents/testECKEY.pem
 create mode 100644 
tika-parsers/src/test/resources/test-documents/testECPARAMS.pem
 create mode 100644 
tika-parsers/src/test/resources/test-documents/testRSAKEY.der
 create mode 100644 
tika-parsers/src/test/resources/test-documents/testRSAKEY.pem



[tika] 01/05: Add test certificate and key for TIKA-3205

2020-09-29 Thread nick

nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 5fdb70ae4770301d6b101e9007a1058e15abac94
Author: Nick Burch 
AuthorDate: Tue Sep 29 15:26:48 2020 +0100

Add test certificate and key for TIKA-3205
---
 .../src/test/resources/test-documents/testCERT.der | Bin 0 -> 702 bytes
 .../src/test/resources/test-documents/testCERT.pem |  17 +
 .../src/test/resources/test-documents/testRSAKEY.der   | Bin 0 -> 610 bytes
 .../src/test/resources/test-documents/testRSAKEY.pem   |  16 
 4 files changed, 33 insertions(+)

diff --git a/tika-parsers/src/test/resources/test-documents/testCERT.der 
b/tika-parsers/src/test/resources/test-documents/testCERT.der
new file mode 100644
index 0000000..935f1f6
Binary files /dev/null and 
b/tika-parsers/src/test/resources/test-documents/testCERT.der differ
diff --git a/tika-parsers/src/test/resources/test-documents/testCERT.pem 
b/tika-parsers/src/test/resources/test-documents/testCERT.pem
new file mode 100644
index 0000000..dbfd849
--- /dev/null
+++ b/tika-parsers/src/test/resources/test-documents/testCERT.pem
@@ -0,0 +1,17 @@
+-----BEGIN CERTIFICATE-----
+MIICujCCAiOgAwIBAgIUKOX/l1c68ya6jnfeRJ8uP9kvVx8wDQYJKoZIhvcNAQEL
+BQAwbzELMAkGA1UEBhMCWloxFDASBgNVBAgMC0FwYWNoZSBUaWthMQ8wDQYDVQQH
+DAZBcGFjaGUxDTALBgNVBAoMBFRpa2ExFDASBgNVBAsMC0FwYWNoZSBUaWthMRQw
+EgYDVQQDDAtBcGFjaGUgVGlrYTAeFw0yMDA5MjkxNDIzNDRaFw0zMDA5MjcxNDIz
+NDRaMG8xCzAJBgNVBAYTAlpaMRQwEgYDVQQIDAtBcGFjaGUgVGlrYTEPMA0GA1UE
+BwwGQXBhY2hlMQ0wCwYDVQQKDARUaWthMRQwEgYDVQQLDAtBcGFjaGUgVGlrYTEU
+MBIGA1UEAwwLQXBhY2hlIFRpa2EwgZ8wDQYJKoZIhvcNAQEBBQADgY0AMIGJAoGB
+AMeVjMm2uyhe7HkNFFBU6nnI9niJnn+hv+3TTDw23/GH2/d1T3JSpuSFstHvwS0t
+umrSb6Py7igD4fz3+aeZU1gDrby4f9KemnLBlAU63VuPyCDyWj2XqbsZDGdnbIy2
+FnBBy2LBn5p0gDwoDpnmhHPHIJZo9OMGH/5hWQUt6+rtAgMBAAGjUzBRMB0GA1Ud
+DgQWBBTM8K2WIAuPiv0VgrRoMn2fAGua1jAfBgNVHSMEGDAWgBTM8K2WIAuPiv0V
+grRoMn2fAGua1jAPBgNVHRMBAf8EBTADAQH/MA0GCSqGSIb3DQEBCwUAA4GBALqE
++ja5Hx78Dpym/HxP50TfadwmEes+JXYptykWnuOWgLlqLuGAqJctLOKoR73r7d9d
+zJBtdr3A5uTg9vWNMSA2lPdBr/NplNaI8bso+8dRWdkiMut+j7xqTFl8MVMriRSR
+a2cA9BsUlpHjJdVjcFweAtdlINZDACoZubCTM7ng
+-----END CERTIFICATE-----
diff --git a/tika-parsers/src/test/resources/test-documents/testRSAKEY.der 
b/tika-parsers/src/test/resources/test-documents/testRSAKEY.der
new file mode 100644
index 0000000..22c4f86
Binary files /dev/null and 
b/tika-parsers/src/test/resources/test-documents/testRSAKEY.der differ
diff --git a/tika-parsers/src/test/resources/test-documents/testRSAKEY.pem 
b/tika-parsers/src/test/resources/test-documents/testRSAKEY.pem
new file mode 100644
index 0000000..0971b76
--- /dev/null
+++ b/tika-parsers/src/test/resources/test-documents/testRSAKEY.pem
@@ -0,0 +1,16 @@
+-----BEGIN PRIVATE KEY-----
+MIICeAIBADANBgkqhkiG9w0BAQEFAASCAmIwggJeAgEAAoGBAMeVjMm2uyhe7HkN
+FFBU6nnI9niJnn+hv+3TTDw23/GH2/d1T3JSpuSFstHvwS0tumrSb6Py7igD4fz3
++aeZU1gDrby4f9KemnLBlAU63VuPyCDyWj2XqbsZDGdnbIy2FnBBy2LBn5p0gDwo
+DpnmhHPHIJZo9OMGH/5hWQUt6+rtAgMBAAECgYBdb1TGxiYeQzoffZEJ/ob61qsU
+SRELnVS16RqigeobL8g5tBCqa6k4CKNrhvt/xA2mnrenID6AOzkb7ZdR8ATEtojF
+JjLZ8zmXACU3WetoRUvh2uxlFpxFeK0yQlaEWcvE4Z9MQe3V8pBvMQUNEZxN4bHT
+1eMla9O65TR49uxaPQJBAO/Spm9ln02CjnxCHiGmRQ77gUNz39AtrKRLQBv/uEB2
+fhHAvFoSPGXaIgd73GgZEnM/a+faLrMu9NvemMd5aYMCQQDVDAsjaa72+5ZS87zE
+xLDrFT1cKM8U1G0ikdGl6rejDnSoiwfZ8DXpSBOOiSkf/PX9zDXDPQl9nHLjmDn9
+wN7PAkEAxsPTF66lGoujZk8yQ/dXczR2DR7Dl/nTBZQsvUfzQNI0aKhSM2C72Dqz
+S3qX0Vs+VHBzEYVegTngzT4vZ9wz2wJBAMNXCZdsvUokIA7rALgCCJ1jmiE4Ibdd
+lrtNrEZO0hWlmX04DPjc8PF2bsgQJy73R61vYhQjkOIlYoof93wdLa0CQQDTLHSB
+8e8f81Jq+zbLReAQ6ch+fEulaMPlPY0OqgExBxdbwXnlPENw09+EiQkKSSo8qhY1
+guri/IWyq3LYm8nE
+-----END PRIVATE KEY-----



[tika] 05/05: Make the DER private key mostly-match a bit more specific

2020-09-29 Thread nick

nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 618345263ee41108e1a225dbcdbb8db16b2aae28
Author: Nick Burch 
AuthorDate: Tue Sep 29 16:51:19 2020 +0100

Make the DER private key mostly-match a bit more specific
---
 .../main/resources/org/apache/tika/mime/tika-mimetypes.xml| 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git 
a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml 
b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
index abcc5d5..92cbb21 100644
--- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
+++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
@@ -4676,12 +4676,13 @@
   
 
 
-
+
+
 
-  
-  
+  
+  
 
   
 



[tika] 02/05: TIKA-3205 Add magic for X509 PEM certificate, and tweak default type

2020-09-29 Thread nick

nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit c3fff83c7e955ff5de0e4cb9098b06a15ee2cf7e
Author: Nick Burch 
AuthorDate: Tue Sep 29 15:49:14 2020 +0100

TIKA-3205 Add magic for X509 PEM certificate, and tweak default type
---
 .../org/apache/tika/mime/tika-mimetypes.xml| 26 +-
 .../java/org/apache/tika/TikaDetectionTest.java|  5 +++--
 2 files changed, 28 insertions(+), 3 deletions(-)

diff --git 
a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml 
b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
index 5dbcf99..3cdea61 100644
--- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
+++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
@@ -4627,10 +4627,34 @@
 
 
   
-  
+
+  
+
+
 
+
 
+
+  
+  
+  
+  
+ 
+ 
+  
+
   
+  
+
+  
+  
+  
+  
+
+  
+
   
 
   
diff --git a/tika-core/src/test/java/org/apache/tika/TikaDetectionTest.java 
b/tika-core/src/test/java/org/apache/tika/TikaDetectionTest.java
index eb3bb19..2364daa 100644
--- a/tika-core/src/test/java/org/apache/tika/TikaDetectionTest.java
+++ b/tika-core/src/test/java/org/apache/tika/TikaDetectionTest.java
@@ -624,8 +624,9 @@ public class TikaDetectionTest {
 assertEquals("application/x-texinfo", tika.detect("x.texi"));
 assertEquals("application/x-ustar", tika.detect("x.ustar"));
 assertEquals("application/x-wais-source", tika.detect("x.src"));
-assertEquals("application/x-x509-ca-cert", tika.detect("x.der"));
-assertEquals("application/x-x509-ca-cert", tika.detect("x.crt"));
+// Differ from httpd - use a common parent for CA and User certs
+//assertEquals("application/x-x509-ca-cert", tika.detect("x.der"));
+//assertEquals("application/x-x509-ca-cert", tika.detect("x.crt"));
 assertEquals("application/x-xfig", tika.detect("x.fig"));
 assertEquals("application/x-xpinstall", tika.detect("x.xpi"));
 assertEquals("application/xenc+xml", tika.detect("x.xenc"));



[tika] branch master updated: Tweak whitespace to be consistent

2020-05-28 Thread nick

nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git


The following commit(s) were added to refs/heads/master by this push:
 new f233f3b  Tweak whitespace to be consistent
f233f3b is described below

commit f233f3bacb5ec62c948f46d51c2a1ab54744073f
Author: Nick Burch 
AuthorDate: Thu May 28 07:15:16 2020 +0100

Tweak whitespace to be consistent
---
 tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git 
a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml 
b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
index feaef21..4ea9252 100644
--- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
+++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
@@ -3869,8 +3869,8 @@
 
   
 <_comment>Apple Xcode Memgraph
- 
- 
+
+
   
 
   



[tika] branch master updated (0bf11ae -> 1140091)

2020-05-28 Thread nick

nick pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git.


from 0bf11ae  TIKA-2961 Make the CAF mime magic more specific to avoid 
false positives, by checking for a version number after the "caff" header text
 new e9d62d2  Make the bplist magic more specific where possible, keep 
version catch-all as now otherwise
 new 1140091  Add glob for Xcode Memgraph files, which are bplist-based

The 2 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 .../resources/org/apache/tika/mime/tika-mimetypes.xml  | 18 ++
 1 file changed, 18 insertions(+)



[tika] 01/02: Make the bplist magic more specific where possible, keep version catch-all as now otherwise

2020-05-28 Thread nick

nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit e9d62d24c19250053aee07a59c9e4de5197f2f42
Author: Nick Burch 
AuthorDate: Thu May 28 07:05:30 2020 +0100

Make the bplist magic more specific where possible, keep version catch-all 
as now otherwise
---
 .../main/resources/org/apache/tika/mime/tika-mimetypes.xml| 11 +++
 1 file changed, 11 insertions(+)

diff --git 
a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml 
b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
index 7210066..aad1c39 100644
--- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
+++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
@@ -3295,6 +3295,17 @@
   
 
   
+
+
+  
+  
+  
+  
+  
+  
+  
+  
+
 

[tika] 02/02: Add glob for Xcode Memgraph files, which are bplist-based

2020-05-28 Thread nick

nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 114009165410c91b57b91fc4eaddb089a8559451
Author: Nick Burch 
AuthorDate: Thu May 28 07:06:14 2020 +0100

Add glob for Xcode Memgraph files, which are bplist-based
---
 .../src/main/resources/org/apache/tika/mime/tika-mimetypes.xml | 7 +++
 1 file changed, 7 insertions(+)

diff --git 
a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml 
b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
index aad1c39..feaef21 100644
--- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
+++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
@@ -3315,6 +3315,7 @@
   
 
   
+
   
 <_comment>GNU tar Compressed File Archive (GNU Tape Archive)
 
@@ -3866,6 +3867,12 @@
 
 
 
+  
+<_comment>Apple Xcode Memgraph
+ 
+ 
+  
+
   
 MOBI
 <_comment>Mobipocket Ebook



[tika] branch master updated: TIKA-2961 Make the CAF mime magic more specific to avoid false positives, by checking for a version number after the "caff" header text

2020-05-17 Thread nick

nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git


The following commit(s) were added to refs/heads/master by this push:
 new 0bf11ae  TIKA-2961 Make the CAF mime magic more specific to avoid 
false positives, by checking for a version number after the "caff" header text
0bf11ae is described below

commit 0bf11aec86079b8f1ae2f1ea680910ba79665c4f
Author: Nick Burch 
AuthorDate: Mon May 18 05:06:27 2020 +0100

TIKA-2961 Make the CAF mime magic more specific to avoid false positives, 
by checking for a version number after the "caff" header text
---
 .../src/main/resources/org/apache/tika/mime/tika-mimetypes.xml  | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git 
a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml 
b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
index 551e55e..7210066 100644
--- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
+++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
@@ -5139,7 +5139,11 @@
  <_comment>Core Audio Format
  <_comment>com.apple.coreaudio-format
  
-
+
+
+
+
+
  
  
   
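
Note on the commit above: the idea of checking for a version number after the
"caff" header text can be sketched in Java as follows. This is a minimal
sketch, assuming the Core Audio Format header is the four ASCII bytes "caff"
followed by a big-endian 16-bit file version. The accepted version value of 1
and all names are assumptions for illustration; the actual pattern added to
tika-mimetypes.xml is not shown in the diff.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of Tika: stricter check for a CAF header. */
final class CafHeaderSketch {

    /** Assumed file version; an illustration, not taken from the commit. */
    static final int ASSUMED_VERSION = 1;

    static boolean looksLikeCaf(byte[] head) {
        if (head.length < 6) {
            return false; // need the 4-byte tag plus the 2-byte version
        }
        String tag = new String(head, 0, 4, StandardCharsets.US_ASCII);
        if (!tag.equals("caff")) {
            return false;
        }
        // Big-endian unsigned 16-bit value immediately after the tag.
        int version = ((head[4] & 0xFF) << 8) | (head[5] & 0xFF);
        return version == ASSUMED_VERSION;
    }
}
```

A plain-text file that happens to begin with "caffeine" passes a tag-only
check but fails here, because the two bytes after the tag are letters rather
than a small version number.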



[tika] branch master updated: TIKA-3023 Make the SGI Movie mime magic more specific to avoid false positives on text files starting with MOVI

2020-02-06 Thread nick

nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git


The following commit(s) were added to refs/heads/master by this push:
 new 0d259bc  TIKA-3023 Make the SGI Movie mime magic more specific to 
avoid false positives on text files starting with MOVI
0d259bc is described below

commit 0d259bc8b6beccaa9bac2e85212b57a48f171e83
Author: Nick Burch 
AuthorDate: Thu Feb 6 11:42:30 2020 +

TIKA-3023 Make the SGI Movie mime magic more specific to avoid false 
positives on text files starting with MOVI
---
 .../src/main/resources/org/apache/tika/mime/tika-mimetypes.xml  | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git 
a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml 
b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
index 174dad0..3211cfb 100644
--- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
+++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
@@ -7438,7 +7438,11 @@
 
   
 
-  
+  
+  
+  
+  
+  
 
 
   



[tika] branch master updated: TIKA-3034 Mathematica files don't have a unique magic, but try to detect based on the file starting with a Mathematica-style comment as all we can do. Also add the newer

2020-02-04 Thread nick

nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git


The following commit(s) were added to refs/heads/master by this push:
 new f5571fa  TIKA-3034 Mathematica files don't have a unique magic, but 
try to detect based on the file starting with a Mathematica-style comment as 
all we can do. Also add the newer Wolfram Language mimetype, which extends 
mathematica, with a unix  detection
f5571fa is described below

commit f5571fa99ef6f178a16bd1bd3a3cded83c7b0013
Author: Nick Burch 
AuthorDate: Tue Feb 4 10:31:31 2020 +

TIKA-3034 Mathematica files don't have a unique magic, but try to detect 
based on the file starting with a Mathematica-style comment as all we can do. 
Also add the newer Wolfram Language mimetype, which extends mathematica, with a 
unix  detection
---
 .../resources/org/apache/tika/mime/tika-mimetypes.xml  | 18 ++
 1 file changed, 18 insertions(+)

diff --git 
a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml 
b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
index 34e8d98..174dad0 100644
--- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
+++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
@@ -409,11 +409,29 @@
   
 
   
+
   
+<_comment>Wolfram Mathematica
 
 
 
+
+
+
+  
+  
+
+
   
+  
+<_comment>Wolfram Language
+
+
+  
+
+
+  
+
   
 
   
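
Note on the commit above: detection based on the file starting with a
Mathematica-style comment can be sketched in Java as follows. This is a
minimal sketch, assuming the comment opener is "(*" and that leading
whitespace may be skipped; whether the real magic allows leading whitespace is
not visible in the diff. The class and method names are hypothetical.

```java
/** Hypothetical helper, not part of Tika: checks for a leading "(*" comment. */
final class MathematicaCommentSketch {

    static boolean startsWithComment(String head) {
        int i = 0;
        // Skipping leading whitespace is an assumption of this sketch.
        while (i < head.length() && Character.isWhitespace(head.charAt(i))) {
            i++;
        }
        return head.startsWith("(*", i);
    }
}
```

As the commit message says, this is a weak signal: any text that opens with
"(*" matches, so a check like this only makes sense as a low-priority
fallback.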



[tika] 04/05: HEIF detection unit test. When tooling improves, should ideally create another HEIF test file with another codec too

2019-11-18 Thread nick

nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 433a8c1625d302bf1a9d81f2ad1223df7bf83d31
Author: Nick Burch 
AuthorDate: Mon Nov 18 14:57:09 2019 +

HEIF detection unit test. When tooling improves, should ideally create 
another HEIF test file with another codec too
---
 .../src/test/java/org/apache/tika/mime/TestMimeTypes.java| 12 
 1 file changed, 12 insertions(+)

diff --git a/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java 
b/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java
index bdf7da1..d45d116 100644
--- a/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java
+++ b/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java
@@ -408,6 +408,18 @@ public class TestMimeTypes {
 }
 
 @Test
+public void testHeifDetection() throws Exception {
+// HEIF image using the HEVC Codec == HEIC
+//  created using https://compare.rokka.io/_compare on testJPEG_GEO.jpg
+assertType("image/heic", "testHEIF.heic");
+assertTypeByData("image/heic", "testHEIF.heic");
+assertTypeByName("image/heic", "testHEIF.heic");
+
+// TODO Create a HEIF using another codec, to test .heif data
+assertTypeByName("image/heif", "testHEIF.heif");
+}
+
+@Test
 public void testJpegDetection() throws Exception {
 assertType("image/jpeg", "testJPEG.jpg");
 assertTypeByData("image/jpeg", "testJPEG.jpg");
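
Note on the HEIF test above: HEIF files are ISO base media files, so detection
by data comes down to reading the brand in the leading "ftyp" box. The
following is a minimal sketch in Java, assuming the box type sits at bytes 4
to 7 and the major brand at bytes 8 to 11. The brand-to-type table and the two
"-sequence" type names are assumptions based on general descriptions of the
format, not taken from tika-mimetypes.xml, and all names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of Tika: maps an ftyp major brand to a type. */
final class HeifBrandSketch {

    /** Returns the 4-character major brand, or null if there is no ftyp box. */
    static String majorBrand(byte[] head) {
        if (head.length < 12) {
            return null;
        }
        String boxType = new String(head, 4, 4, StandardCharsets.US_ASCII);
        if (!boxType.equals("ftyp")) {
            return null;
        }
        return new String(head, 8, 4, StandardCharsets.US_ASCII);
    }

    /** Brand table below is an assumption for illustration. */
    static String guessType(byte[] head) {
        String brand = majorBrand(head);
        if (brand == null) {
            return "application/octet-stream";
        }
        switch (brand) {
            case "heic":
            case "heix":
                return "image/heic";
            case "hevc":
            case "hevx":
                return "image/heic-sequence";
            case "mif1":
                return "image/heif";
            case "msf1":
                return "image/heif-sequence";
            default:
                return "application/octet-stream";
        }
    }
}
```

This also shows why the TODO in the test matters: a .heif file encoded with a
codec other than HEVC would carry a different brand, and only such a file can
exercise the image/heif data path.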



[tika] 02/05: Test file uses the HEVC codec, so switch to the more specific extension

2019-11-18 Thread nick

nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit efd071aa595d76d094f549f25db856229baace5d
Author: Nick Burch 
AuthorDate: Mon Nov 18 14:54:42 2019 +

Test file uses the HEVC codec, so switch to the more specific extension
---
 .../test-documents/{testHEIF.heif => testHEIF.heic} | Bin
 1 file changed, 0 insertions(+), 0 deletions(-)

diff --git a/tika-parsers/src/test/resources/test-documents/testHEIF.heif 
b/tika-parsers/src/test/resources/test-documents/testHEIF.heic
similarity index 100%
rename from tika-parsers/src/test/resources/test-documents/testHEIF.heif
rename to tika-parsers/src/test/resources/test-documents/testHEIF.heic



[tika] branch master updated (f6a5749 -> 1bb1895)

2019-11-18 Thread nick

nick pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git.


from f6a5749  TIKA-2982 -- don't require 'DataSpaces' in ooxml-encrypted 
detection
 new 8cfacfe  Test HEIF file, generated with 
https://compare.rokka.io/_compare on testJPEG_GEO.jpg
 new efd071a  Test file uses the HEVC codec, so switch to the more specific 
extension
 new 0758598  Add mimetypes for the HEIF (High Efficiency Image File) 
format family - TIKA-2942
 new 433a8c1  HEIF detection unit test. When tooling improves, should 
ideally create another HEIF test file with another codec too
 new 1bb1895  Changelog update

The 5 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 CHANGES.txt|   3 +-
 .../org/apache/tika/mime/tika-mimetypes.xml|  40 +
 .../java/org/apache/tika/mime/TestMimeTypes.java   |  12 +++
 .../test/resources/test-documents/testHEIF.heic| Bin 0 -> 13706 bytes
 4 files changed, 54 insertions(+), 1 deletion(-)
 create mode 100644 tika-parsers/src/test/resources/test-documents/testHEIF.heic



[tika] 05/05: Changelog update

2019-11-18 Thread nick

nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 1bb1895a30b722a9780122a6447598dd29e75ca7
Author: Nick Burch 
AuthorDate: Mon Nov 18 15:00:33 2019 +

Changelog update
---
 CHANGES.txt | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/CHANGES.txt b/CHANGES.txt
index 3b66d3b..17b401c 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -16,7 +16,8 @@ Release 1.23
 
* Add parser for XLIFF v1.2 files (TIKA-2975).
 
-   * Add mime type detection support for WebAssembly (TIKA-2894).
+   * Add mime type detection support for WebAssembly (TIKA-2894) and
+ HEIF / HEIC images (TIKA-2942).
 
* Add an XLZ Parser (TIKA-2976).
 



[tika] 03/05: Add mimetypes for the HEIF (High Efficiency Image File) format family - TIKA-2942

2019-11-18 Thread nick

nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 0758598bece92f97418f88d0c443e8d9cff7a7ee
Author: Nick Burch 
AuthorDate: Mon Nov 18 14:55:45 2019 +

Add mimetypes for the HEIF (High Efficiency Image File) format family - 
TIKA-2942
---
 .../org/apache/tika/mime/tika-mimetypes.xml| 40 ++
 1 file changed, 40 insertions(+)

diff --git 
a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml 
b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
index c5ad55d..6e967b6 100644
--- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
+++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
@@ -5314,6 +5314,46 @@
 
   
 
+  
+<_comment>HEIF - High Efficiency Image File
+HEIF
+
https://en.wikipedia.org/wiki/High_Efficiency_Image_File_Format
+
+  
+
+
+
+  
+  
+<_comment>HEIF Sequence - High Efficiency Image Sequence
+
+  
+
+
+  
+
+  
+
+<_comment>HEIF Image using HEVC Codec
+HEIC
+
+  
+  
+
+
+
+  
+  
+
+<_comment>HEIF Sequence using HEVC Codec
+HEVC
+
+  
+  
+
+
+  
+
   
 <_comment>Apple Icon Image Format
 



svn commit: r1869088 - /tika/site/src/site/resources/doap.rdf

2019-10-28 Thread nick
Author: nick
Date: Mon Oct 28 21:35:45 2019
New Revision: 1869088

URL: http://svn.apache.org/viewvc?rev=1869088&view=rev
Log:
Correct the RDF link for the projects category, and add us to Content too

Modified:
tika/site/src/site/resources/doap.rdf

Modified: tika/site/src/site/resources/doap.rdf
URL: 
http://svn.apache.org/viewvc/tika/site/src/site/resources/doap.rdf?rev=1869088&r1=1869087&r2=1869088&view=diff
==
--- tika/site/src/site/resources/doap.rdf (original)
+++ tika/site/src/site/resources/doap.rdf Mon Oct 28 21:35:45 2019
@@ -38,7 +38,8 @@
   
 
 Java
-https://projects.apache.org/projects.html?category#library; />
+http://projects.apache.org/category/content; />
+http://projects.apache.org/category/library; />
 
   
 Apache Tika 1.22




svn commit: r1867123 - in /tika/site: pom.xml publish/.htaccess publish/source-repository.html

2019-09-18 Thread nick
Author: nick
Date: Wed Sep 18 13:56:36 2019
New Revision: 1867123

URL: http://svn.apache.org/viewvc?rev=1867123&view=rev
Log:
TIKA-2947 Update the source control details in the site pom, so the 
auto-generated source repo file is correct

Added:
tika/site/publish/source-repository.html
Removed:
tika/site/publish/.htaccess
Modified:
tika/site/pom.xml

Modified: tika/site/pom.xml
URL: 
http://svn.apache.org/viewvc/tika/site/pom.xml?rev=1867123&r1=1867122&r2=1867123&view=diff
==
--- tika/site/pom.xml (original)
+++ tika/site/pom.xml Wed Sep 18 13:56:36 2019
@@ -39,12 +39,12 @@
 
   
 
-  scm:svn:http://svn.apache.org/repos/asf/tika/trunk
+  scm:git:https://github.com/apache/tika/
 
 
-  scm:svn:https://svn.apache.org/repos/asf/tika/trunk
+  scm:git:https://gitbox.apache.org/repos/asf/tika.git
 
-http://svn.apache.org/repos/asf/tika/trunk
+https://github.com/apache/tika/
   
 
   

Added: tika/site/publish/source-repository.html
URL: 
http://svn.apache.org/viewvc/tika/site/publish/source-repository.html?rev=1867123&view=auto
==
--- tika/site/publish/source-repository.html (added)
+++ tika/site/publish/source-repository.html Wed Sep 18 13:56:36 2019
@@ -0,0 +1,457 @@
+http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd;>
+
+
+
+
+
+
+
+
+
+http://www.w3.org/1999/xhtml;>
+  
+
+Apache Tika  Source Repository
+
+  @import url("./css/site.css");
+
+
+
+  function selectProvider(form) {
+provider = form.elements['searchProvider'].value;
+if (provider == "any") {
+  if (Math.random() > 0.5) {
+provider = "lucid";
+  } else {
+provider = "sl";
+  }
+}
+if (provider == "lucid") {
+  form.action = "http://find.searchhub.org/p:tika";
+} else if (provider == "sl") {
+  form.action = "http://search-lucene.com/tika";
+}
+days = 90;
+date = new Date();
+date.setTime(date.getTime() + (days * 24 * 60 * 60 * 1000));
+expires = "; expires=" + date.toGMTString();
+document.cookie = "searchProvider=" + provider + expires + "; path=/";
+  }
+  function initProvider() {
+if (document.cookie.length>0) {
+  cStart=document.cookie.indexOf("searchProvider=");
+  if (cStart!=-1) {
+cStart=cStart + "searchProvider=".length;
+cEnd=document.cookie.indexOf(";", cStart);
+if (cEnd==-1) {
+  cEnd=document.cookie.length;
+}
+provider = unescape(document.cookie.substring(cStart,cEnd));
+document.forms['searchform'].elements['searchProvider'].value = 
provider;
+  }
+}
+document.forms['searchform'].elements['q'].focus();
+  }
+
+  
+  
+
+  
+https://tika.apache.org; id="bannerLeft" title="Apache Tika"
+  >https://tika.apache.org/tika.png; alt="Apache Tika"
+width="292" height="100"/>
+https://www.apache.org/; id="bannerRight"
+   title="The Apache Software Foundation"
+  >https://tika.apache.org/asf-logo.gif; alt="The Apache 
Software Foundation"
+width="387" height="100"/>
+  
+  
+
+Overview
+This project uses GIT (http://git-scm.com/) to manage its source code. Instructions on 
GIT use can be found at http://git-scm.com/documentation.
+
+Web Access
+The following is a link to the online source repository.
+
+https://github.com/apache/tika/
+
+Anonymous access
+The source can be checked out anonymously from GIT with this command (See 
http://git-scm.com/docs/git-clone):
+
+$ git clone https://github.com/apache/tika/
+
+Developer access
+Only project developers can access the GIT tree via this method (See http://git-scm.com/docs/git-clone).
+
+$ git clone https://gitbox.apache.org/repos/asf/tika.git
+
+Access from behind a 
firewall
+Refer to the documentation of the SCM used for more information about 
access behind a firewall.
+  
+  
+
+Apache Tika
+
+  
+
+Introduction
+  
+  
+
+Download
+  

svn commit: r1867122 [2/2] - in /tika/site/publish: 0.10/ 0.5/ 0.6/ 0.7/ 0.8/ 0.9/ 1.0/ 1.1/ 1.10/ 1.11/ 1.12/ 1.13/ 1.14/ 1.15/ 1.16/ 1.17/ 1.18/ 1.19.1/ 1.19/ 1.2/ 1.20/ 1.21/ 1.22/ 1.3/ 1.4/ 1.5/ 1

2019-09-18 Thread nick
Modified: tika/site/publish/1.18/gettingstarted.html
URL: 
http://svn.apache.org/viewvc/tika/site/publish/1.18/gettingstarted.html?rev=1867122&r1=1867121&r2=1867122&view=diff
==
--- tika/site/publish/1.18/gettingstarted.html (original)
+++ tika/site/publish/1.18/gettingstarted.html Wed Sep 18 13:50:40 2019
@@ -89,11 +89,10 @@
 This document describes how to build Apache Tika from sources and how to 
start using Tika in an application.
 
 Getting and building the 
sources
-To build Tika from sources you first need to either download a source release or checkout the latest sources from version 
control.
+To build Tika from sources you first need to either download a source release or checkout the latest sources from 
version control.
 Once you have the sources, you can build them using the Maven 2 build system (http://maven.apache.org/). 
Executing the following command in the base directory will build the sources 
and install the resulting artifacts in your local Maven repository.
 
-mvn install
-
+mvn install
 See the Maven documentation for more information about the available build 
options.
 Note that you need Java 7 or higher to build Tika.
 
@@ -120,36 +119,31 @@
     <groupId>org.apache.tika</groupId>
     <artifactId>tika-core</artifactId>
     <version>1.18</version>
-  </dependency>
-
+  </dependency>
 If you want to use Tika to parse documents (instead of simply detecting 
document types, etc.), you'll want to depend on  tika-parsers  
instead: 
 
   <dependency>
     <groupId>org.apache.tika</groupId>
     <artifactId>tika-parsers</artifactId>
     <version>1.18</version>
-  </dependency>
-
+  </dependency>
 Note that adding this dependency will introduce a number of transitive 
dependencies to your project, including one on tika-core. You need to make sure 
that these dependencies won't conflict with your existing project dependencies. 
You can use the following command in the tika-parsers directory to get a full 
listing of all the dependencies.
 
-$ mvn dependency:tree | grep :compile
-
+$ mvn dependency:tree | grep :compile
 
 Using Tika in a 
Gradle-built project
 To add a dependency on Apache Tika to your Gradle built project, including 
the full set of parsers, you should depend on the  tika-parsers  
artifact:
 
 dependencies {
 runtime 'org.apache.tika:tika-parsers:1.18'
-}
-
+}
 
 Using Tika in an Ant 
project
 If you are using Apache Ivy (http://ant.apache.org/ivy/) as your dependency manager 
tool with Ant, then to include Tika with the full set of parsers, you should 
depend on the  tika-parsers  artifact like this:
 
 <dependencies>
   <dependency org="org.apache.tika" name="tika-parsers" rev="1.18"/>
-</dependencies>
-
+</dependencies>
 Otherwise, probably the easiest way to use Tika is to include the full  
tika-app  jar on your classpath. For just core functionality, you can add 
the  tika-core  jar, but be aware that the full set of parsers have a 
large number of dependencies which must be included which is very fiddly to do 
by hand with Ant! To include Tika in your Ant project, you should do something 
like:
 
 <classpath>
@@ -160,8 +154,7 @@
   <!-- or: Tika with all Parsers-->
   <pathelement location="path/to/tika-app-${tika.version}.jar"/>
 
-</classpath>
-
+</classpath>
 
 Using Tika as a command 
line utility
 The Tika application jar (tika-app-*.jar) can be used as a command line 
utility for extracting text content and metadata from all sorts of files. This 
runnable jar contains all the dependencies it needs, so you don't need to worry 
about classpath settings to run it.
@@ -277,15 +270,13 @@ Batch Options:
 
 To modify child process jvm args, prepend J as in:
 -JXmx4g or -JDlog4j.configuration=file:log4j.xml.
-
 
 You can also use the jar as a component in a Unix pipeline or as an 
external tool in many scripting languages.
 
 # Check if an Internet resource contains a specific keyword
 curl http://.../document.doc \
   | java -jar tika-app.jar --text \
-  | grep -q keyword
-
+  | grep -q keyword
 
 Wrappers
 Several wrappers are available to use Tika in another programming language, 
such as Julia (https://github.com/aviks/Taro.jl) or Python (https://github.com/chrismattmann/tika-python).

Modified: tika/site/publish/1.19.1/gettingstarted.html
URL: 
http://svn.apache.org/viewvc/tika/site/publish/1.19.1/gettingstarted.html?rev=1867122&r1=1867121&r2=1867122&view=diff
==
--- tika/site/publish/1.19.1/gettingstarted.html (original)
+++ tika/site/publish/1.19.1/gettingstarted.html Wed Sep 18 13:50:40 2019
@@ -89,11 +89,10 @@
 This document describes how to build Apache Tika from sources and how to 
start using Tika in an application.
 
 Getting and building the 
sources
-To build Tika from sources you first need to either download a source release or checkout the latest sources from version 
control.
+To build Tika from sources you first need to either download a source 

svn commit: r1867122 [1/2] - in /tika/site/publish: 0.10/ 0.5/ 0.6/ 0.7/ 0.8/ 0.9/ 1.0/ 1.1/ 1.10/ 1.11/ 1.12/ 1.13/ 1.14/ 1.15/ 1.16/ 1.17/ 1.18/ 1.19.1/ 1.19/ 1.2/ 1.20/ 1.21/ 1.22/ 1.3/ 1.4/ 1.5/ 1

2019-09-18 Thread nick
Author: nick
Date: Wed Sep 18 13:50:40 2019
New Revision: 1867122

URL: http://svn.apache.org/viewvc?rev=1867122&view=rev
Log:
TIKA-2947 Update source code link

Modified:
tika/site/publish/0.10/gettingstarted.html
tika/site/publish/0.5/gettingstarted.html
tika/site/publish/0.6/gettingstarted.html
tika/site/publish/0.7/gettingstarted.html
tika/site/publish/0.8/gettingstarted.html
tika/site/publish/0.9/gettingstarted.html
tika/site/publish/1.0/gettingstarted.html
tika/site/publish/1.1/gettingstarted.html
tika/site/publish/1.10/gettingstarted.html
tika/site/publish/1.11/gettingstarted.html
tika/site/publish/1.12/gettingstarted.html
tika/site/publish/1.13/gettingstarted.html
tika/site/publish/1.14/gettingstarted.html
tika/site/publish/1.15/gettingstarted.html
tika/site/publish/1.16/gettingstarted.html
tika/site/publish/1.17/gettingstarted.html
tika/site/publish/1.18/gettingstarted.html
tika/site/publish/1.19.1/gettingstarted.html
tika/site/publish/1.19/gettingstarted.html
tika/site/publish/1.2/gettingstarted.html
tika/site/publish/1.20/gettingstarted.html
tika/site/publish/1.21/gettingstarted.html
tika/site/publish/1.22/gettingstarted.html
tika/site/publish/1.3/gettingstarted.html
tika/site/publish/1.4/gettingstarted.html
tika/site/publish/1.5/gettingstarted.html
tika/site/publish/1.6/gettingstarted.html
tika/site/publish/1.7/gettingstarted.html
tika/site/publish/1.8/gettingstarted.html
tika/site/publish/1.9/gettingstarted.html

Modified: tika/site/publish/0.10/gettingstarted.html
URL: 
http://svn.apache.org/viewvc/tika/site/publish/0.10/gettingstarted.html?rev=1867122&r1=1867121&r2=1867122&view=diff
==
--- tika/site/publish/0.10/gettingstarted.html (original)
+++ tika/site/publish/0.10/gettingstarted.html Wed Sep 18 13:50:40 2019
@@ -89,11 +89,10 @@
 This document describes how to build Apache Tika from sources and how to 
start using Tika in an application.
 
 Getting and building the 
sources
-To build Tika from sources you first need to either download a source release or checkout the latest sources from version 
control.
+To build Tika from sources you first need to either download a source release or checkout the latest sources from 
version control.
 Once you have the sources, you can build them using the Maven 2 (http://maven.apache.org/) build system. 
Executing the following command in the base directory will build the sources 
and install the resulting artifacts in your local Maven repository.
 
-mvn install
-
+mvn install
 See the Maven documentation for more information about the available build 
options.
 Note that you need Java 5 or higher to build Tika.
 
@@ -116,16 +115,14 @@
     <groupId>org.apache.tika</groupId>
     <artifactId>tika-core</artifactId>
     <version>0.10</version>
-  </dependency>
-
+  </dependency>
 If you want to use Tika to parse documents (instead of simply detecting 
document types, etc.), you'll want to depend on tika-parsers instead: 
 
   <dependency>
     <groupId>org.apache.tika</groupId>
     <artifactId>tika-parsers</artifactId>
     <version>0.10</version>
-  </dependency>
-
+  </dependency>
 Note that adding this dependency will introduce a number of transitive 
dependencies to your project, including one on tika-core. You need to make sure 
that these dependencies won't conflict with your existing project dependencies. 
The listing below shows all the compile-scope dependencies of tika-parsers in 
the Tika 0.10 release.
 
 org.apache.tika:tika-parsers:bundle:0.10
@@ -154,8 +151,7 @@
 +- com.drewnoakes:metadata-extractor:jar:2.4.0-beta-1:compile
 +- de.l3s.boilerpipe:boilerpipe:jar:1.1.0:compile
 +- rome:rome:jar:0.9:compile
-|  \- jdom:jdom:jar:1.0:compile
-
+|  \- jdom:jdom:jar:1.0:compile
 
 Using Tika in an Ant 
project
 Unless you use a dependency manager tool like Apache Ivy (http://ant.apache.org/ivy/), to use Tika in your 
application you can include the Tika jar files and the dependencies 
individually.
@@ -187,8 +183,7 @@
   <pathelement location="path/to/boilerpipe-1.1.0.jar"/>
   <pathelement location="path/to/rome-0.9.jar"/>
   <pathelement location="path/to/jdom-1.0.jar"/>
-</classpath>
-
+</classpath>
 An easy way to gather all these libraries is to run mvn 
dependency:copy-dependencies in the tika-parsers source directory. This 
will copy all Tika dependencies to the target/dependencies 
directory.
 Alternatively you can simply drop the entire tika-app jar to your classpath 
to get all of the above dependencies in a single archive.
 
@@ -253,15 +248,13 @@ Description:
 
 Use the -server (or -s) option to start the
 Apache Tika server. The server will listen to the
-ports you specify as one or more arguments.
-
+ports you specify as one or more arguments.
 You can also use the jar as a component in a Unix pipeline or as an 
external tool in many scripting languages.
 
 # Check if an Internet re

svn commit: r1867120 - /tika/site/publish/.htaccess

2019-09-18 Thread nick
Author: nick
Date: Wed Sep 18 13:43:41 2019
New Revision: 1867120

URL: http://svn.apache.org/viewvc?rev=1867120&view=rev
Log:
Remove the old source code page, redirect to the new one

Modified:
tika/site/publish/.htaccess

Modified: tika/site/publish/.htaccess
URL: 
http://svn.apache.org/viewvc/tika/site/publish/.htaccess?rev=1867120&r1=1867119&r2=1867120&view=diff
==
--- tika/site/publish/.htaccess (original)
+++ tika/site/publish/.htaccess Wed Sep 18 13:43:41 2019
@@ -2,4 +2,4 @@
 # See http://httpd.apache.org/docs/current/mod/mod_alias.html#redirect
 
 # Redirect old source code page to the new one
-Redirect source-repository.html contribute.html
+Redirect "/source-repository.html" "/contribute.html"
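
Editorial note: the mod_alias documentation linked in the file describes the first argument of Redirect as a URL-path beginning with a slash, which is what this revision changes. The Java sketch below is a deliberately simplified model of prefix redirection, not httpd's actual matching code; the class and method names are hypothetical. It only illustrates why a rule written without the leading slash cannot match a request path, since request paths always start with "/".

```java
import java.util.Optional;

public class RedirectSketch {
    /**
     * Simplified model of a prefix redirect rule (an assumption for
     * illustration, not mod_alias itself): a rule whose URL-path lacks
     * a leading slash never applies; otherwise any request path that
     * starts with the URL-path is sent to the target, with the
     * remaining path appended.
     */
    public static Optional<String> redirect(String urlPath, String target, String requestPath) {
        if (!urlPath.startsWith("/")) {
            return Optional.empty();
        }
        if (requestPath.startsWith(urlPath)) {
            return Optional.of(target + requestPath.substring(urlPath.length()));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Rule as written before this revision: no leading slash, so no match.
        System.out.println(redirect("source-repository.html", "contribute.html",
                "/source-repository.html").isPresent());
        // Rule as written in this revision.
        System.out.println(redirect("/source-repository.html", "/contribute.html",
                "/source-repository.html").orElse("(no redirect)"));
    }
}
```

How httpd itself reports a rule without the leading slash (ignored, or rejected as a configuration error) is not shown in these messages and is not modelled here.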




svn commit: r1867119 - /tika/site/publish/.htaccess

2019-09-18 Thread nick
Author: nick
Date: Wed Sep 18 13:42:35 2019
New Revision: 1867119

URL: http://svn.apache.org/viewvc?rev=1867119&view=rev
Log:
Remove the old source code page, redirect to the new one

Modified:
tika/site/publish/.htaccess

Modified: tika/site/publish/.htaccess
URL: 
http://svn.apache.org/viewvc/tika/site/publish/.htaccess?rev=1867119&r1=1867118&r2=1867119&view=diff
==
--- tika/site/publish/.htaccess (original)
+++ tika/site/publish/.htaccess Wed Sep 18 13:42:35 2019
@@ -2,4 +2,4 @@
 # See http://httpd.apache.org/docs/current/mod/mod_alias.html#redirect
 
 # Redirect old source code page to the new one
-Redirect source-repository.html contribute.html#Source_Code
+Redirect source-repository.html contribute.html




svn commit: r1867118 - in /tika/site/publish: .htaccess source-repository.html

2019-09-18 Thread nick
Author: nick
Date: Wed Sep 18 13:41:56 2019
New Revision: 1867118

URL: http://svn.apache.org/viewvc?rev=1867118&view=rev
Log:
Remove the old source code page, redirect to the new one

Added:
tika/site/publish/.htaccess
Removed:
tika/site/publish/source-repository.html

Added: tika/site/publish/.htaccess
URL: 
http://svn.apache.org/viewvc/tika/site/publish/.htaccess?rev=1867118&view=auto
==
--- tika/site/publish/.htaccess (added)
+++ tika/site/publish/.htaccess Wed Sep 18 13:41:56 2019
@@ -0,0 +1,5 @@
+# Apache Tika website redirects
+# See http://httpd.apache.org/docs/current/mod/mod_alias.html#redirect
+
+# Redirect old source code page to the new one
+Redirect source-repository.html contribute.html#Source_Code




svn commit: r1867117 - in /tika/site/src/site/apt: 0.10/ 0.5/ 0.6/ 0.7/ 0.8/ 0.9/ 1.0/ 1.1/ 1.10/ 1.11/ 1.12/ 1.13/ 1.14/ 1.15/ 1.16/ 1.17/ 1.18/ 1.19.1/ 1.19/ 1.2/ 1.20/ 1.21/ 1.22/ 1.3/ 1.4/ 1.5/ 1.

2019-09-18 Thread nick
Author: nick
Date: Wed Sep 18 13:38:59 2019
New Revision: 1867117

URL: http://svn.apache.org/viewvc?rev=1867117&view=rev
Log:
TIKA-2947 Fix source code documentation link

Modified:
tika/site/src/site/apt/0.10/gettingstarted.apt
tika/site/src/site/apt/0.5/gettingstarted.apt
tika/site/src/site/apt/0.6/gettingstarted.apt
tika/site/src/site/apt/0.7/gettingstarted.apt
tika/site/src/site/apt/0.8/gettingstarted.apt
tika/site/src/site/apt/0.9/gettingstarted.apt
tika/site/src/site/apt/1.0/gettingstarted.apt
tika/site/src/site/apt/1.1/gettingstarted.apt
tika/site/src/site/apt/1.10/gettingstarted.apt
tika/site/src/site/apt/1.11/gettingstarted.apt
tika/site/src/site/apt/1.12/gettingstarted.apt
tika/site/src/site/apt/1.13/gettingstarted.apt
tika/site/src/site/apt/1.14/gettingstarted.apt
tika/site/src/site/apt/1.15/gettingstarted.apt
tika/site/src/site/apt/1.16/gettingstarted.apt
tika/site/src/site/apt/1.17/gettingstarted.apt
tika/site/src/site/apt/1.18/gettingstarted.apt
tika/site/src/site/apt/1.19.1/gettingstarted.apt
tika/site/src/site/apt/1.19/gettingstarted.apt
tika/site/src/site/apt/1.2/gettingstarted.apt
tika/site/src/site/apt/1.20/gettingstarted.apt
tika/site/src/site/apt/1.21/gettingstarted.apt
tika/site/src/site/apt/1.22/gettingstarted.apt
tika/site/src/site/apt/1.3/gettingstarted.apt
tika/site/src/site/apt/1.4/gettingstarted.apt
tika/site/src/site/apt/1.5/gettingstarted.apt
tika/site/src/site/apt/1.6/gettingstarted.apt
tika/site/src/site/apt/1.7/gettingstarted.apt
tika/site/src/site/apt/1.8/gettingstarted.apt
tika/site/src/site/apt/1.9/gettingstarted.apt

Modified: tika/site/src/site/apt/0.10/gettingstarted.apt
URL: 
http://svn.apache.org/viewvc/tika/site/src/site/apt/0.10/gettingstarted.apt?rev=1867117&r1=1867116&r2=1867117&view=diff
==
--- tika/site/src/site/apt/0.10/gettingstarted.apt (original)
+++ tika/site/src/site/apt/0.10/gettingstarted.apt Wed Sep 18 13:38:59 2019
@@ -26,7 +26,7 @@ Getting and building the sources
 
  To build Tika from sources you first need to either
  {{{../download.html}download}} a source release or
- {{{../source-repository.html}checkout}} the latest sources from
+ {{{../contribute.html#Source_Code}checkout}} the latest sources from
  version control.
 
  Once you have the sources, you can build them using the

Modified: tika/site/src/site/apt/0.5/gettingstarted.apt
URL: 
http://svn.apache.org/viewvc/tika/site/src/site/apt/0.5/gettingstarted.apt?rev=1867117&r1=1867116&r2=1867117&view=diff
==
--- tika/site/src/site/apt/0.5/gettingstarted.apt (original)
+++ tika/site/src/site/apt/0.5/gettingstarted.apt Wed Sep 18 13:38:59 2019
@@ -26,7 +26,7 @@ Getting and building the sources
 
  To build Tika from sources you first need to either
  {{{../download.html}download}} a source release or
- {{{../source-repository.html}checkout}} the latest sources from
+ {{{../contribute.html#Source_Code}checkout}} the latest sources from
  version control.
 
  Once you have the sources, you can build them using the

Modified: tika/site/src/site/apt/0.6/gettingstarted.apt
URL: 
http://svn.apache.org/viewvc/tika/site/src/site/apt/0.6/gettingstarted.apt?rev=1867117&r1=1867116&r2=1867117&view=diff
==
--- tika/site/src/site/apt/0.6/gettingstarted.apt (original)
+++ tika/site/src/site/apt/0.6/gettingstarted.apt Wed Sep 18 13:38:59 2019
@@ -26,7 +26,7 @@ Getting and building the sources
 
  To build Tika from sources you first need to either
  {{{../download.html}download}} a source release or
- {{{../source-repository.html}checkout}} the latest sources from
+ {{{../contribute.html#Source_Code}checkout}} the latest sources from
  version control.
 
  Once you have the sources, you can build them using the

Modified: tika/site/src/site/apt/0.7/gettingstarted.apt
URL: 
http://svn.apache.org/viewvc/tika/site/src/site/apt/0.7/gettingstarted.apt?rev=1867117&r1=1867116&r2=1867117&view=diff
==
--- tika/site/src/site/apt/0.7/gettingstarted.apt (original)
+++ tika/site/src/site/apt/0.7/gettingstarted.apt Wed Sep 18 13:38:59 2019
@@ -26,7 +26,7 @@ Getting and building the sources
 
  To build Tika from sources you first need to either
  {{{../download.html}download}} a source release or
- {{{../source-repository.html}checkout}} the latest sources from
+ {{{../contribute.html#Source_Code}checkout}} the latest sources from
  version control.
 
  Once you have the sources, you can build them using the

Modified: tika/site/src/site/apt/0.8/gettingstarted.apt
URL: 
http://svn.apache.org/viewvc/tika/site/src/site/apt/0.8/gettingstarted.apt?rev=1867117&r1=1867116&r2=1867117&view=diff

[tika] 03/03: Use the new RSS 2.0 file in tests too, alongside the current 0.91 one

2018-10-17 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit a0546b6cb98c949bb747b2e0e8d5675f651f6a16
Author: Nick Burch 
AuthorDate: Wed Oct 17 17:43:12 2018 +0100

Use the new RSS 2.0 file in tests too, alongside the current 0.91 one
---
 .../java/org/apache/tika/mime/TestMimeTypes.java   |  3 ++
 .../apache/tika/parser/feed/FeedParserTest.java| 38 +-
 2 files changed, 25 insertions(+), 16 deletions(-)

diff --git a/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java 
b/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java
index bfb4c62..a527d4e 100644
--- a/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java
+++ b/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java
@@ -387,9 +387,12 @@ public class TestMimeTypes {
 @Test
 public void testFeedsDetection() throws Exception {
 assertType("application/rss+xml",  "rsstest_091.rss");
+assertType("application/rss+xml",  "rsstest_20.rss");
 assertType("application/atom+xml", "testATOM.atom");
 assertTypeByData("application/rss+xml",  "rsstest_091.rss");
 assertTypeByName("application/rss+xml",  "rsstest_091.rss");
+assertTypeByData("application/rss+xml",  "rsstest_20.rss");
+assertTypeByName("application/rss+xml",  "rsstest_20.rss");
 assertTypeByData("application/atom+xml", "testATOM.atom");
 assertTypeByName("application/atom+xml", "testATOM.atom");
 }
diff --git 
a/tika-parsers/src/test/java/org/apache/tika/parser/feed/FeedParserTest.java 
b/tika-parsers/src/test/java/org/apache/tika/parser/feed/FeedParserTest.java
index d7e7c76..1a5c293 100644
--- a/tika-parsers/src/test/java/org/apache/tika/parser/feed/FeedParserTest.java
+++ b/tika-parsers/src/test/java/org/apache/tika/parser/feed/FeedParserTest.java
@@ -31,22 +31,28 @@ import org.xml.sax.ContentHandler;
 public class FeedParserTest {
 @Test
 public void testRSSParser() throws Exception {
-try (InputStream input = FeedParserTest.class.getResourceAsStream(
-"/test-documents/rsstest_091.rss")) {
-Metadata metadata = new Metadata();
-ContentHandler handler = new BodyContentHandler();
-ParseContext context = new ParseContext();
-
-new FeedParser().parse(input, handler, metadata, context);
-
-String content = handler.toString();
-assertFalse(content == null);
-
-assertEquals("Sample RSS File for Junit test",
-metadata.get(TikaCoreProperties.DESCRIPTION));
-assertEquals("TestChannel", 
metadata.get(TikaCoreProperties.TITLE));
-
-// TODO find a way of testing the paragraphs and anchors
+// These RSS files should have basically the same contents,
+//  represented in the various RSS format versions
+for (String rssFile : new String[] {
+"/test-documents/rsstest_091.rss",
+"/test-documents/rsstest_20.rss"
+}) {
+try (InputStream input = 
FeedParserTest.class.getResourceAsStream(rssFile)) {
+Metadata metadata = new Metadata();
+ContentHandler handler = new BodyContentHandler();
+ParseContext context = new ParseContext();
+
+new FeedParser().parse(input, handler, metadata, context);
+
+String content = handler.toString();
+assertFalse(content == null);
+
+assertEquals("Sample RSS File for Junit test",
+metadata.get(TikaCoreProperties.DESCRIPTION));
+assertEquals("TestChannel", 
metadata.get(TikaCoreProperties.TITLE));
+
+// TODO find a way of testing the paragraphs and anchors
+}
 }
 }
 
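
Editorial note: the TestMimeTypes hunk in this commit checks the new RSS 2.0 file both by data and by name. As a rough illustration of what those two kinds of check exercise, the sketch below uses only the Java standard library. It is not Tika's detection logic (Tika drives detection from tika-mimetypes.xml); the class, the method names and the "look for an rss root element" heuristic are all assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class FeedDetectSketch {
    public static final String RSS = "application/rss+xml";
    public static final String UNKNOWN = "application/octet-stream";

    /** Name-based guess: looks only at the file extension. */
    public static String byName(String fileName) {
        return fileName.toLowerCase(Locale.ROOT).endsWith(".rss") ? RSS : UNKNOWN;
    }

    /** Content-based guess: looks for an rss root element in the leading bytes. */
    public static String byData(byte[] head) {
        String text = new String(head, StandardCharsets.UTF_8);
        return text.contains("<rss") ? RSS : UNKNOWN;
    }

    public static void main(String[] args) {
        byte[] rss20 = "<?xml version=\"1.0\"?><rss version=\"2.0\"><channel/></rss>"
                .getBytes(StandardCharsets.UTF_8);
        // The same file can be recognised from its name alone or from its bytes alone.
        System.out.println(byName("rsstest_20.rss"));
        System.out.println(byData(rss20));
    }
}
```

Testing both paths separately, as the commit does, guards against a file that is only recognised because of its extension.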



[tika] 01/03: RSS test file is RSS v0.91, so name appropriately

2018-10-17 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 429b22b2ac9ff96cfca714895d65dce311522616
Author: Nick Burch 
AuthorDate: Wed Oct 17 17:15:33 2018 +0100

RSS test file is RSS v0.91, so name appropriately
---
 tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java  | 6 +++---
 .../src/test/java/org/apache/tika/parser/AutoDetectParserTest.java  | 2 +-
 .../src/test/java/org/apache/tika/parser/feed/FeedParserTest.java   | 2 +-
 .../test/resources/test-documents/{rsstest.rss => rsstest_091.rss}  | 0
 4 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java 
b/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java
index 9205530..bfb4c62 100644
--- a/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java
+++ b/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java
@@ -386,10 +386,10 @@ public class TestMimeTypes {
 
 @Test
 public void testFeedsDetection() throws Exception {
-assertType("application/rss+xml",  "rsstest.rss");
+assertType("application/rss+xml",  "rsstest_091.rss");
 assertType("application/atom+xml", "testATOM.atom");
-assertTypeByData("application/rss+xml",  "rsstest.rss");
-assertTypeByName("application/rss+xml",  "rsstest.rss");
+assertTypeByData("application/rss+xml",  "rsstest_091.rss");
+assertTypeByName("application/rss+xml",  "rsstest_091.rss");
 assertTypeByData("application/atom+xml", "testATOM.atom");
 assertTypeByName("application/atom+xml", "testATOM.atom");
 }
diff --git 
a/tika-parsers/src/test/java/org/apache/tika/parser/AutoDetectParserTest.java 
b/tika-parsers/src/test/java/org/apache/tika/parser/AutoDetectParserTest.java
index 10d2a0f..ddbbd75 100644
--- 
a/tika-parsers/src/test/java/org/apache/tika/parser/AutoDetectParserTest.java
+++ 
b/tika-parsers/src/test/java/org/apache/tika/parser/AutoDetectParserTest.java
@@ -241,7 +241,7 @@ public class AutoDetectParserTest extends TikaTest {
 
 @Test
 public void testRss() throws Exception {
-assertAutoDetect("/test-documents/rsstest.rss", "feed", RSS, 
"application/rss+xml", "Sample RSS File for Junit test");
+assertAutoDetect("/test-documents/rsstest_091.rss", "feed", RSS, 
"application/rss+xml", "Sample RSS File for Junit test");
 }
 
 @Test
diff --git 
a/tika-parsers/src/test/java/org/apache/tika/parser/feed/FeedParserTest.java 
b/tika-parsers/src/test/java/org/apache/tika/parser/feed/FeedParserTest.java
index cc10dd2..d7e7c76 100644
--- a/tika-parsers/src/test/java/org/apache/tika/parser/feed/FeedParserTest.java
+++ b/tika-parsers/src/test/java/org/apache/tika/parser/feed/FeedParserTest.java
@@ -32,7 +32,7 @@ public class FeedParserTest {
 @Test
 public void testRSSParser() throws Exception {
 try (InputStream input = FeedParserTest.class.getResourceAsStream(
-"/test-documents/rsstest.rss")) {
+"/test-documents/rsstest_091.rss")) {
 Metadata metadata = new Metadata();
 ContentHandler handler = new BodyContentHandler();
 ParseContext context = new ParseContext();
diff --git a/tika-parsers/src/test/resources/test-documents/rsstest.rss 
b/tika-parsers/src/test/resources/test-documents/rsstest_091.rss
similarity index 100%
rename from tika-parsers/src/test/resources/test-documents/rsstest.rss
rename to tika-parsers/src/test/resources/test-documents/rsstest_091.rss



[tika] branch master updated (5310f17 -> a0546b6)

2018-10-17 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git.


from 5310f17  TIKA-2757 -- add versions plugin
 new 429b22b  RSS test file is RSS v0.91, so name appropriately
 new 1fca098  Add a test RSS 2.0 file
 new a0546b6  Use the new RSS 2.0 file in tests too, alongside the current 
0.91 one

The 3 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 .../java/org/apache/tika/mime/TestMimeTypes.java   |  9 +++--
 .../apache/tika/parser/AutoDetectParserTest.java   |  2 +-
 .../apache/tika/parser/feed/FeedParserTest.java| 38 +-
 .../{rsstest.rss => rsstest_091.rss}   |  0
 .../test-documents/{rsstest.rss => rsstest_20.rss} |  8 -
 5 files changed, 36 insertions(+), 21 deletions(-)
 copy tika-parsers/src/test/resources/test-documents/{rsstest.rss => 
rsstest_091.rss} (100%)
 rename tika-parsers/src/test/resources/test-documents/{rsstest.rss => 
rsstest_20.rss} (74%)



[tika] branch master updated (3d5d4d8 -> 705b79c)

2018-09-06 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git.


from 3d5d4d8  Merge pull request #239 from wowselim/master
 new b26a0cc  Merge branch 'master' of https://github.com/wowselim/tika
 new 53c8434  Merge branch 'master' of https://github.com/apache/tika
 new 9a2c7d8  Mime magic for "MIME Encapsulation of Aggregate HTML 
Documents" (MHTML), pulled out from rfc822 (may not be fully correct 
long-term...)
 new 705b79c  Changes update

The 4 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 CHANGES.txt|  3 +++
 .../org/apache/tika/mime/tika-mimetypes.xml| 22 --
 2 files changed, 23 insertions(+), 2 deletions(-)



[tika] 01/04: Merge branch 'master' of https://github.com/wowselim/tika

2018-09-06 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit b26a0ccdbd5620b870df0dc434d2f9265b2df082
Merge: e4f0fe5 eb33286
Author: Nick Burch 
AuthorDate: Wed Sep 5 20:46:56 2018 +0100

Merge branch 'master' of https://github.com/wowselim/tika

 tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml | 3 +++
 1 file changed, 3 insertions(+)




[tika] 04/04: Changes update

2018-09-06 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 705b79ccb6c0ad0f92a3a185bf7e66cacf899931
Author: Nick Burch 
AuthorDate: Thu Sep 6 09:28:24 2018 +0100

Changes update
---
 CHANGES.txt | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/CHANGES.txt b/CHANGES.txt
index ce647a9..81782ec 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -5,6 +5,9 @@ Release 2.0.0 - ???
 
Other changes
 
+   * Mime magic improvements for Olympus RAW (TIKA-2658), interpreted
+ server-side languages via HTTP (TIKA-2648), MHTML (TIKA-2723)
+
 Release 1.19 ???
 
* Add absolute timeout to ForkParser rather than testing



[tika] 03/04: Mime magic for "MIME Encapsulation of Aggregate HTML Documents" (MHTML), pulled out from rfc822 (may not be fully correct long-term...)

2018-09-06 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 9a2c7d89e03ca7c0e821b69c394165297edfb9d4
Author: Nick Burch 
AuthorDate: Thu Sep 6 09:28:14 2018 +0100

Mime magic for "MIME Encapsulation of Aggregate HTML Documents" (MHTML), 
pulled out from rfc822 (may not be fully correct long-term...)
---
 .../org/apache/tika/mime/tika-mimetypes.xml| 22 --
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git 
a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml 
b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
index 007ec53..bd1adfa 100644
--- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
+++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
@@ -5980,9 +5980,28 @@
 
 
 
+
+  
+
+  
+  
+  
+MHTML
+<_comment>MIME Encapsulation of Aggregate HTML Documents</_comment>
+http://tools.ietf.org/html/rfc2557
+
+
+
+
+  
+  
+  
+
+  
+
 
 
-
+
   
 
   
@@ -6084,7 +6103,6 @@
   
   
   
-  
   
   
   



[tika] 02/04: Merge branch 'master' of https://github.com/apache/tika

2018-09-06 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 53c8434f497795885ff129e17440881f059c1624
Merge: b26a0cc 3d5d4d8
Author: Nick Burch 
AuthorDate: Wed Sep 5 20:58:20 2018 +0100

Merge branch 'master' of https://github.com/apache/tika




[tika] branch master updated (e4f0fe5 -> 3d5d4d8)

2018-09-05 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git.


from e4f0fe5  Use DateUtils to format dates to strings, rather than relying 
on explicit/implicit toString calls
 add eb33286  TIKA-2658: add olympus raw file magic numbers
 new 3d5d4d8  Merge pull request #239 from wowselim/master

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml | 3 +++
 1 file changed, 3 insertions(+)



[tika] 01/01: Merge pull request #239 from wowselim/master

2018-09-05 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 3d5d4d8b9667a31e3cb30a9d02543347feefbcc7
Merge: e4f0fe5 eb33286
Author: Gagravarr 
AuthorDate: Wed Sep 5 20:58:07 2018 +0100

Merge pull request #239 from wowselim/master

TIKA-2658: add olympus raw file magic numbers

 tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml | 3 +++
 1 file changed, 3 insertions(+)




[tika] branch master updated: Use DateUtils to format dates to strings, rather than relying on explicit/implicit toString calls

2018-09-05 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git


The following commit(s) were added to refs/heads/master by this push:
 new e4f0fe5  Use DateUtils to format dates to strings, rather than relying 
on explicit/implicit toString calls
e4f0fe5 is described below

commit e4f0fe5184db47724c6bf366a12ea0868972a83f
Author: Nick Burch 
AuthorDate: Wed Sep 5 18:14:28 2018 +0100

Use DateUtils to format dates to strings, rather than relying on 
explicit/implicit toString calls
---
 .../geoinfo/GeographicInformationParser.java   | 31 ++
 1 file changed, 20 insertions(+), 11 deletions(-)
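
Editorial note: the motivation given in the log message is that string concatenation calls Date.toString implicitly, and that method renders the date in the JVM's default time zone. The sketch below shows the general idea with the standard library only. It does not reproduce Tika's DateUtils; the class name, the helper name and the ISO 8601 UTC pattern are assumptions chosen for illustration.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class DateFormatSketch {
    /**
     * Formats a Date as an ISO 8601 string in UTC, so the result does
     * not depend on the JVM's default locale or time zone.
     */
    public static String formatUtc(Date date) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'", Locale.ROOT);
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.format(date);
    }

    public static void main(String[] args) {
        Date epoch = new Date(0L);
        // Implicit toString: the text varies with the default time zone.
        System.out.println("CitationDate -->" + epoch);
        // Explicit formatting: the same text on every machine.
        System.out.println("CitationDate -->" + formatUtc(epoch));
    }
}
```

A new SimpleDateFormat is created on each call because instances of that class are not thread-safe.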

diff --git 
a/tika-parsers/src/main/java/org/apache/tika/parser/geoinfo/GeographicInformationParser.java
 
b/tika-parsers/src/main/java/org/apache/tika/parser/geoinfo/GeographicInformationParser.java
index 27b8040..268dd93 100644
--- 
a/tika-parsers/src/main/java/org/apache/tika/parser/geoinfo/GeographicInformationParser.java
+++ 
b/tika-parsers/src/main/java/org/apache/tika/parser/geoinfo/GeographicInformationParser.java
@@ -48,6 +48,7 @@ import org.apache.tika.mime.MediaType;
 import org.apache.tika.parser.AbstractParser;
 import org.apache.tika.parser.ParseContext;
 import org.apache.tika.sax.XHTMLContentHandler;
+import org.apache.tika.utils.DateUtils;
 import org.opengis.metadata.Identifier;
 import org.opengis.metadata.citation.Citation;
 import org.opengis.metadata.citation.CitationDate;
@@ -227,9 +228,11 @@ public class GeographicInformationParser extends 
AbstractParser{
 metadata.add("IdentificationInfoCitationTitle 
",i.getCitation().getTitle().toString());
 
 ArrayList<CitationDate> dateArrayList= (ArrayList<CitationDate>) 
i.getCitation().getDates();
-for (CitationDate d:dateArrayList){
-if(d.getDateType()!=null)
-metadata.add("CitationDate 
",d.getDateType().name()+"-->"+d.getDate());
+for (CitationDate d:dateArrayList) {
+if (d.getDateType()!=null) {
+String date = DateUtils.formatDate(d.getDate());
+metadata.add("CitationDate 
",d.getDateType().name()+"-->"+date);
+}
 }
 ArrayList responsiblePartyArrayList= 
(ArrayList) i.getCitation().getCitedResponsibleParties();
 for(ResponsibleParty r:responsiblePartyArrayList){
@@ -282,9 +285,11 @@ public class GeographicInformationParser extends AbstractParser{
 metadata.add("ThesaurusNameAlternativeTitle "+j,k.getThesaurusName().getAlternateTitles().toString());
 
 ArrayListcitationDates= (ArrayList) k.getThesaurusName().getDates();
-for(CitationDate cd:citationDates) {
-   if(cd.getDateType()!=null)
-metadata.add("ThesaurusNameDate ",cd.getDateType().name() +"-->" + cd.getDate());
+for (CitationDate cd:citationDates) {
+   if (cd.getDateType()!=null) {
+   String date = DateUtils.formatDate(cd.getDate());
+   metadata.add("ThesaurusNameDate ",cd.getDateType().name() +"-->" + date);
+   }
 }
 }
 ArrayList constraintList= (ArrayList) i.getResourceConstraints();
@@ -315,9 +320,11 @@ public class GeographicInformationParser extends AbstractParser{
 for(InternationalString s:((DefaultGeographicDescription) g).getGeographicIdentifier().getAuthority().getAlternateTitles()) {
 metadata.add("GeographicIdentifierAuthorityAlternativeTitle ",s.toString());
 }
-for(CitationDate cd:((DefaultGeographicDescription) g).getGeographicIdentifier().getAuthority().getDates()){
-if(cd.getDateType()!=null && cd.getDate()!=null)
-metadata.add("GeographicIdentifierAuthorityDate ",cd.getDateType().name()+" "+cd.getDate().toString());
+for (CitationDate cd:((DefaultGeographicDescription) g).getGeographicIdentifier().getAuthority().getDates()){
+if (cd.getDateType()!=null && cd.getDate()!=null) {
+String date = DateUtils.formatDate(cd.getDate());
+metadata.add("GeographicIdentifierAuthorityDate ",cd.getDateType().name()+" "+date);
+}
 }
 }
 }
@@ -363,8 +370,10 @@ public class GeographicInformationParser extends 
AbstractParser{
 private void getMetaDataDateInfo(Metadata metadata, DefaultMetadata 
defau

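The commit above swaps implicit `Date.toString()` calls for an explicit formatter. As a rough illustration of the motivation, the sketch below contrasts the two using only the JDK. The class `DateFormatSketch` and the method `formatUtc` are hypothetical names, and the ISO 8601 UTC pattern is an assumption made for this example, not a statement of what Tika's `DateUtils.formatDate` emits.

```java
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Date;

// Hypothetical sketch: explicit, zone-independent date formatting.
class DateFormatSketch {
    // Assumed output shape: ISO 8601 in UTC, e.g. 1970-01-01T00:00:00Z
    private static final DateTimeFormatter ISO_UTC =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss'Z'")
                    .withZone(ZoneOffset.UTC);

    // Formats the date the same way regardless of the JVM's default time zone.
    static String formatUtc(Date date) {
        return ISO_UTC.format(date.toInstant());
    }

    public static void main(String[] args) {
        Date epoch = new Date(0L);
        // Date.toString() renders in the JVM's default time zone,
        // so this line can differ from machine to machine.
        System.out.println("implicit: " + epoch);
        // The explicit formatter gives a stable string for metadata values.
        System.out.println("explicit: " + formatUtc(epoch));
    }
}
```

Metadata values built by string concatenation with a `Date` go through `toString()` silently, which is why an explicit formatting call makes the output reproducible across environments.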
[tika] 01/07: TIKA-2479 Option to request missing rows where possible in Excel-like formats

2018-05-18 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit a1e42a0659ba33e90cb1bba0a0a10eeb97d4fac7
Author: Nick Burch <n...@gagravarr.org>
AuthorDate: Thu May 17 22:15:34 2018 +0100

TIKA-2479 Option to request missing rows where possible in Excel-like formats
---
 .../apache/tika/parser/microsoft/OfficeParserConfig.java | 16 +++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OfficeParserConfig.java b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OfficeParserConfig.java
index 34b865e..5d34b2e 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OfficeParserConfig.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OfficeParserConfig.java
@@ -29,6 +29,7 @@ public class OfficeParserConfig implements Serializable {
 private boolean includeMoveFromContent = false;
 private boolean includeShapeBasedContent = true;
 private boolean includeHeadersAndFooters = true;
+private boolean includeMissingRows = false;
 private boolean concatenatePhoneticRuns = true;
 
 private boolean useSAXDocxExtractor = false;
@@ -188,10 +189,23 @@ public class OfficeParserConfig implements Serializable {
 this.extractAllAlternativesFromMSG = extractAllAlternativesFromMSG;
 }
 
-
 public boolean getExtractAllAlternativesFromMSG() {
 return extractAllAlternativesFromMSG;
 }
+
+/**
+ * For table-like formats, and tables within other formats, should
+ *  missing rows in sparse tables be output where detected?
+ * The default is to only output rows defined within the file, which
+ *  avoid lots of blank lines, but means layout isn't preserved.
+ */
+public void setIncludeMissingRows(boolean includeMissingRows) {
+this.includeMissingRows = includeMissingRows;
+}
+
+public boolean getIncludeMissingRows() {
+return includeMissingRows;
+}
 }
 
 

-- 
To stop receiving notification emails like this one, please contact
n...@apache.org.


[tika] 07/07: Add the other jackcess jar to the bundle

2018-05-18 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 12693ea18f1a05894272aa3a9293d41215f63c06
Author: Nick Burch <n...@gagravarr.org>
AuthorDate: Fri May 18 15:35:06 2018 +0100

Add the other jackcess jar to the bundle
---
 tika-bundle/pom.xml | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tika-bundle/pom.xml b/tika-bundle/pom.xml
index 2b500d7..fa13e21 100644
--- a/tika-bundle/pom.xml
+++ b/tika-bundle/pom.xml
@@ -170,6 +170,7 @@
   curvesapi|
   xmlbeans|
   jackcess|
+  jackcess-encrypt|
   commons-lang|
   tagsoup|
   asm|



[tika] 03/07: Updated Columnar output from SAS with better formats

2018-05-18 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit b01b059331f198d3829b111002cf03cbcaf1bab3
Author: Nick Burch <n...@gagravarr.org>
AuthorDate: Fri May 18 11:43:47 2018 +0100

Updated Columnar output from SAS with better formats
---
 .../apache/tika/parser/sas/SAS7BDATParserTest.java |   8 
 .../test-documents/test-columnar.sas7bdat  | Bin 17408 -> 131072 bytes
 .../resources/test-documents/test-columnar.xls | Bin 6656 -> 66048 bytes
 .../resources/test-documents/test-columnar.xlsx| Bin 4941 -> 6603 bytes
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/sas/SAS7BDATParserTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/sas/SAS7BDATParserTest.java
index 610ffc3..00a2aaa 100644
--- a/tika-parsers/src/test/java/org/apache/tika/parser/sas/SAS7BDATParserTest.java
+++ b/tika-parsers/src/test/java/org/apache/tika/parser/sas/SAS7BDATParserTest.java
@@ -89,15 +89,15 @@ public class SAS7BDATParserTest extends TikaTest {
 assertEquals("application/x-sas-data", metadata.get(Metadata.CONTENT_TYPE));
 assertEquals("TESTING", metadata.get(TikaCoreProperties.TITLE));
 
-assertEquals("2018-05-09T17:59:33Z", metadata.get(TikaCoreProperties.CREATED));
-assertEquals("2018-05-09T17:59:33Z", metadata.get(TikaCoreProperties.MODIFIED));
+assertEquals("2018-05-18T11:38:30Z", metadata.get(TikaCoreProperties.CREATED));
+assertEquals("2018-05-18T11:38:30Z", metadata.get(TikaCoreProperties.MODIFIED));
 
 assertEquals("1", metadata.get(PagedText.N_PAGES));
 assertEquals("8", metadata.get(Database.COLUMN_COUNT));
 assertEquals("11", metadata.get(Database.ROW_COUNT));
 assertEquals("windows-1252", metadata.get(HttpHeaders.CONTENT_ENCODING));
-assertEquals("W32_7PRO", metadata.get(OfficeOpenXMLExtended.APPLICATION));
-assertEquals("9.0301M2", metadata.get(OfficeOpenXMLExtended.APP_VERSION));
+assertEquals("X64_7PRO", metadata.get(OfficeOpenXMLExtended.APPLICATION));
+assertEquals("9.0401M5", metadata.get(OfficeOpenXMLExtended.APP_VERSION));
 assertEquals("32", metadata.get(MachineMetadata.ARCHITECTURE_BITS));
 assertEquals("Little", metadata.get(MachineMetadata.ENDIAN));
 assertEquals(Arrays.asList("Record Number","Square of the Record Number",
diff --git a/tika-parsers/src/test/resources/test-documents/test-columnar.sas7bdat b/tika-parsers/src/test/resources/test-documents/test-columnar.sas7bdat
index 33ee412..f6cab63 100644
Binary files a/tika-parsers/src/test/resources/test-documents/test-columnar.sas7bdat and b/tika-parsers/src/test/resources/test-documents/test-columnar.sas7bdat differ
diff --git a/tika-parsers/src/test/resources/test-documents/test-columnar.xls b/tika-parsers/src/test/resources/test-documents/test-columnar.xls
index 1d7b2cf..cc45372 100644
Binary files a/tika-parsers/src/test/resources/test-documents/test-columnar.xls and b/tika-parsers/src/test/resources/test-documents/test-columnar.xls differ
diff --git a/tika-parsers/src/test/resources/test-documents/test-columnar.xlsx b/tika-parsers/src/test/resources/test-documents/test-columnar.xlsx
index 58ffd47..22483f1 100644
Binary files a/tika-parsers/src/test/resources/test-documents/test-columnar.xlsx and b/tika-parsers/src/test/resources/test-documents/test-columnar.xlsx differ



[tika] 05/07: TIKA-2479 Update XLS missing cell/row handling to match XLSX and XLSB, add unit test for missing rows, and enable the Columnar tests for the Excel formats

2018-05-18 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 348b87e7f41b79ff115e17d9c91d2dad63a57c15
Author: Nick Burch <n...@gagravarr.org>
AuthorDate: Fri May 18 15:15:32 2018 +0100

TIKA-2479 Update XLS missing cell/row handling to match XLSX and XLSB, add unit test for missing rows, and enable the Columnar tests for the Excel formats
---
 .../tika/parser/microsoft/ExcelExtractor.java  | 26 ++--
 .../org/apache/tika/parser/TabularFormatsTest.java | 47 ++
 .../tika/parser/microsoft/ExcelParserTest.java | 25 +++-
 3 files changed, 60 insertions(+), 38 deletions(-)

diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ExcelExtractor.java b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ExcelExtractor.java
index 0dc33ee..ff5971a 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ExcelExtractor.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ExcelExtractor.java
@@ -16,7 +16,7 @@
  */
 package org.apache.tika.parser.microsoft;
 
-import java.awt.*;
+import java.awt.Point;
 import java.io.IOException;
 import java.text.NumberFormat;
 import java.util.ArrayList;
@@ -42,7 +42,6 @@ import org.apache.poi.hssf.record.CountryRecord;
 import org.apache.poi.hssf.record.DateWindow1904Record;
 import org.apache.poi.hssf.record.DrawingGroupRecord;
 import org.apache.poi.hssf.record.EOFRecord;
-import org.apache.poi.hssf.record.ExtSSTRecord;
 import org.apache.poi.hssf.record.ExtendedFormatRecord;
 import org.apache.poi.hssf.record.FooterRecord;
 import org.apache.poi.hssf.record.FormatRecord;
@@ -281,7 +280,6 @@ public class ExcelExtractor extends AbstractPOIFSExtractor {
 
 public void processFile(DirectoryNode root, boolean listenForAllRecords)
 throws IOException, SAXException, TikaException {
-
 // Set up listener and register the records we want to process
 HSSFRequest hssfRequest = new HSSFRequest();
 if (listenForAllRecords) {
@@ -494,15 +492,14 @@ public class ExcelExtractor extends AbstractPOIFSExtractor {
 HeaderRecord headerRecord = (HeaderRecord) record;
 addTextCell(record, headerRecord.getText());
 }
-   break;
+break;
 
 case FooterRecord.sid:
 if (extractor.officeParserConfig.getIncludeHeadersAndFooters()) {
 FooterRecord footerRecord = (FooterRecord) record;
 addTextCell(record, footerRecord.getText());
 }
-   break;
-
+break;
 }
 
 previousSid = record.getSid();
@@ -599,12 +596,17 @@ public class ExcelExtractor extends AbstractPOIFSExtractor {
 handler.startElement("tr");
 handler.startElement("td");
 for (Map.Entry<Point, Cell> entry : currentSheet.entrySet()) {
-while (currentRow < entry.getKey().y) {
-handler.endElement("td");
-handler.endElement("tr");
-handler.startElement("tr");
-handler.startElement("td");
-currentRow++;
+if (currentRow != entry.getKey().y) {
+// We've moved onto a new row, possibly skipping some
+do {
+handler.endElement("td");
+handler.endElement("tr");
+handler.startElement("tr");
+handler.startElement("td");
+currentRow++;
+} while (officeParserConfig.getIncludeMissingRows() &&
+ currentRow < entry.getKey().y);
+currentRow = entry.getKey().y;
 currentColumn = 0;
 }
 
diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java
index 41139e2..4a52118 100644
--- a/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java
+++ b/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java
@@ -64,8 +64,8 @@ public class TabularFormatsTest extends TikaTest {
 "87.5%","88.9%","90.0%"
 },
 new Pattern[] {
-Pattern.compile("01-(01|JAN|Jan)-(60|1960)"),
-Pattern.compile("02-01-1960"),
+Pattern.compile("0?1-01-1960"),
+Pattern.compile("

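The row handling changed in the ExcelExtractor hunk above can be hard to follow in diff form. Below is a simplified, standalone sketch of the idea: populated rows are always written, and the gaps between them are filled with empty rows only when the option is on. The class `MissingRowsSketch`, the method `emitRows`, and the string markers are hypothetical; this is not the ExcelExtractor code, which drives a SAX handler and tracks cells as well as rows.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of optional gap filling between populated rows.
class MissingRowsSketch {
    /**
     * @param populatedRows      sorted, zero-based indices of rows that hold data
     * @param includeMissingRows whether gaps should be written as empty rows
     * @return the rows that would be written, "<tr/>" marking an empty filler row
     */
    static List<String> emitRows(int[] populatedRows, boolean includeMissingRows) {
        List<String> out = new ArrayList<>();
        int nextRow = 0; // index of the next row we expect to write
        for (int row : populatedRows) {
            if (includeMissingRows) {
                // Fill the gap so that vertical layout is preserved
                while (nextRow < row) {
                    out.add("<tr/>");
                    nextRow++;
                }
            }
            out.add("<tr>row " + row + "</tr>");
            nextRow = row + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] sparse = {0, 3};
        System.out.println(emitRows(sparse, false)); // only the two populated rows
        System.out.println(emitRows(sparse, true));  // two filler rows in between
    }
}
```

Leaving the option off keeps the output compact for very sparse sheets, at the cost of losing the original row positions.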
[tika] 06/07: Changelog update

2018-05-18 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 9673fbdbba8feebb72fee569074e94b0868a89df
Author: Nick Burch <n...@gagravarr.org>
AuthorDate: Fri May 18 15:17:56 2018 +0100

Changelog update
---
 CHANGES.txt | 5 +
 1 file changed, 5 insertions(+)

diff --git a/CHANGES.txt b/CHANGES.txt
index 38f1973..0ffc5de 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -81,6 +81,11 @@ Release 2.0.0 - ???
 
* Mime magic for ACES Images (TIKA-2628) and DPX Images (TIKA-2629)
 
+   * For sparse XLSX and XLSB files, always output missing cells to
+ the left of filled ones (matching XLS), and optionally output
+ missing rows on all 3 formats if requested via the
+ OfficeParserContext (TIKA-2479)
+
 Release 1.17 - 12/8/2017
 
   ***NOTE: THIS IS THE LAST VERSION OF TIKA THAT WILL RUN



[tika] 04/07: Formatted columns in the columnar test Excel files

2018-05-18 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 6fa1105e0669ffeec5c3cf0d1db247a8c16f3bc5
Author: Nick Burch <n...@gagravarr.org>
AuthorDate: Fri May 18 15:13:43 2018 +0100

Formatted columns in the columnar test Excel files
---
 .../test/resources/test-documents/test-columnar.xls | Bin 66048 -> 32768 bytes
 .../resources/test-documents/test-columnar.xlsb | Bin 0 -> 9691 bytes
 .../resources/test-documents/test-columnar.xlsx | Bin 6603 -> 10556 bytes
 .../src/test/resources/test-documents/testSAS2.sas  |   3 +++
 4 files changed, 3 insertions(+)

diff --git a/tika-parsers/src/test/resources/test-documents/test-columnar.xls b/tika-parsers/src/test/resources/test-documents/test-columnar.xls
index cc45372..3f1009c 100644
Binary files a/tika-parsers/src/test/resources/test-documents/test-columnar.xls and b/tika-parsers/src/test/resources/test-documents/test-columnar.xls differ
diff --git a/tika-parsers/src/test/resources/test-documents/test-columnar.xlsb b/tika-parsers/src/test/resources/test-documents/test-columnar.xlsb
new file mode 100644
index 000..0ce5139
Binary files /dev/null and b/tika-parsers/src/test/resources/test-documents/test-columnar.xlsb differ
diff --git a/tika-parsers/src/test/resources/test-documents/test-columnar.xlsx b/tika-parsers/src/test/resources/test-documents/test-columnar.xlsx
index 22483f1..f1f4dc4 100644
Binary files a/tika-parsers/src/test/resources/test-documents/test-columnar.xlsx and b/tika-parsers/src/test/resources/test-documents/test-columnar.xlsx differ
diff --git a/tika-parsers/src/test/resources/test-documents/testSAS2.sas b/tika-parsers/src/test/resources/test-documents/testSAS2.sas
index 96a9121..df52b1a 100644
--- a/tika-parsers/src/test/resources/test-documents/testSAS2.sas
+++ b/tika-parsers/src/test/resources/test-documents/testSAS2.sas
@@ -57,6 +57,9 @@ proc export data=testing label
 putnames=yes;
 run;
 
+/* Due to SAS Limitations, you will need to manually */
+/* style the % and Date/Datetime columns in Excel */
+/* You will also need to save-as XLSB to generate that */
 proc export data=testing label 
   outfile="/testing.xls"
   dbms=XLS;



[tika] 02/07: TIKA-2479 Output missing left/mid cells in XLSX and XLSB, and optionally also missing rows

2018-05-18 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit b1b035e6bbcff0db24e133b682ac79916f92f599
Author: Nick Burch <n...@gagravarr.org>
AuthorDate: Thu May 17 23:07:04 2018 +0100

TIKA-2479 Output missing left/mid cells in XLSX and XLSB, and optionally also missing rows
---
 .../ooxml/XSSFBExcelExtractorDecorator.java|  2 +-
 .../ooxml/XSSFExcelExtractorDecorator.java | 35 ++
 .../org/apache/tika/parser/TabularFormatsTest.java | 11 ++-
 3 files changed, 41 insertions(+), 7 deletions(-)

diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFBExcelExtractorDecorator.java b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFBExcelExtractorDecorator.java
index dcde62b..33dbb7e 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFBExcelExtractorDecorator.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFBExcelExtractorDecorator.java
@@ -117,7 +117,7 @@ public class XSSFBExcelExtractorDecorator extends XSSFExcelExtractorDecorator {
 addDrawingHyperLinks(sheetPart);
 sheetParts.add(sheetPart);
 
-SheetTextAsHTML sheetExtractor = new SheetTextAsHTML(config.getIncludeHeadersAndFooters(), xhtml);
+SheetTextAsHTML sheetExtractor = new SheetTextAsHTML(config, xhtml);
 XSSFBCommentsTable comments = iter.getXSSFBSheetComments();
 
 // Start, and output the sheet name
diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java
index 9a2b017..7e1a7cd 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java
@@ -25,7 +25,6 @@ import java.util.List;
 import java.util.Locale;
 import java.util.Map;
 
-import org.apache.poi.POIXMLDocument;
 import org.apache.poi.POIXMLTextExtractor;
 import org.apache.poi.hssf.extractor.ExcelExtractor;
 import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
@@ -39,6 +38,7 @@ import org.apache.poi.openxml4j.opc.PackagingURIHelper;
 import org.apache.poi.openxml4j.opc.TargetMode;
 import org.apache.poi.ss.usermodel.DataFormatter;
 import org.apache.poi.ss.usermodel.HeaderFooter;
+import org.apache.poi.ss.util.CellReference;
 import org.apache.poi.xssf.eventusermodel.ReadOnlySharedStringsTable;
 import org.apache.poi.xssf.eventusermodel.XSSFReader;
 import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler;
@@ -56,6 +56,7 @@ import org.apache.tika.exception.TikaException;
 import org.apache.tika.metadata.Metadata;
 import org.apache.tika.metadata.TikaCoreProperties;
 import org.apache.tika.parser.ParseContext;
+import org.apache.tika.parser.microsoft.OfficeParserConfig;
 import org.apache.tika.parser.microsoft.TikaExcelDataFormatter;
 import org.apache.tika.sax.OfflineContentHandler;
 import org.apache.tika.sax.XHTMLContentHandler;
@@ -144,8 +145,7 @@ public class XSSFExcelExtractorDecorator extends AbstractOOXMLExtractor {
 }
 
 while (iter.hasNext()) {
-
-SheetTextAsHTML sheetExtractor = new SheetTextAsHTML(config.getIncludeHeadersAndFooters(), xhtml);
+SheetTextAsHTML sheetExtractor = new SheetTextAsHTML(config, xhtml);
 PackagePart sheetPart = null;
 try (InputStream stream = iter.next()) {
 sheetPart = iter.getSheetPart();
@@ -397,11 +397,15 @@ public class XSSFExcelExtractorDecorator extends AbstractOOXMLExtractor {
 protected static class SheetTextAsHTML implements SheetContentsHandler {
 private XHTMLContentHandler xhtml;
 private final boolean includeHeadersFooters;
+private final boolean includeMissingRows;
 protected List headers;
 protected List footers;
+private int lastSeenRow = -1;
+private int lastSeenCol = -1;
 
-protected SheetTextAsHTML(boolean includeHeaderFooters, XHTMLContentHandler xhtml) {
-this.includeHeadersFooters = includeHeaderFooters;
+protected SheetTextAsHTML(OfficeParserConfig config, XHTMLContentHandler xhtml) {
+this.includeHeadersFooters = config.getIncludeHeadersAndFooters();
+this.includeMissingRows = config.getIncludeMissingRows();
 this.xhtml = xhtml;
 headers = new ArrayList();
 footers = new ArrayList();
@@ -409,7 +413,19 @@ public class XSSFExcelExtractorDecorator extends AbstractOOXMLExtractor {
 
 public void startRow(int rowNum) {
 try {
+// Missing rows, if desired, with a single emp

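The diff above is cut off before the cell-handling code, so the actual implementation is not visible here. Going only by the commit message ("Output missing left/mid cells") and the new `lastSeenCol` field, one plausible shape for the padding logic is sketched below. The class `MissingCellsSketch` and both methods are hypothetical, and the hand-rolled column parsing stands in for whatever the real code does with POI's `CellReference`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: pad skipped columns so cells line up by position.
class MissingCellsSketch {
    // Converts the letter part of an A1-style reference to a zero-based
    // column index, e.g. "A1" -> 0, "C5" -> 2, "AA1" -> 26.
    static int columnIndex(String cellRef) {
        int col = 0;
        for (int i = 0; i < cellRef.length(); i++) {
            char c = cellRef.charAt(i);
            if (!Character.isLetter(c)) {
                break; // reached the row number
            }
            col = col * 26 + (Character.toUpperCase(c) - 'A' + 1);
        }
        return col - 1;
    }

    // Builds one output row, inserting "" for every column that was skipped
    // to the left of, or between, the cells that are present.
    static List<String> padRow(String[] refs, String[] values) {
        List<String> cells = new ArrayList<>();
        int lastSeenCol = -1;
        for (int i = 0; i < refs.length; i++) {
            int col = columnIndex(refs[i]);
            for (int c = lastSeenCol + 1; c < col; c++) {
                cells.add(""); // missing cell
            }
            cells.add(values[i]);
            lastSeenCol = col;
        }
        return cells;
    }

    public static void main(String[] args) {
        // Cells present only in columns B and D of the row
        System.out.println(padRow(new String[] {"B1", "D1"}, new String[] {"x", "y"}));
    }
}
```

Padding only to the left of and between filled cells, and not after the last one, matches the "left/mid" wording of the commit message; trailing empty cells carry no positional information for the cells that follow.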
[tika] branch master updated (5f05b51 -> 12693ea)

2018-05-18 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git.


from 5f05b51  TIKA-2644 - refactor recursiveparserwrapper api
 new a1e42a0  TIKA-2479 Option to request missing rows where possible in Excel-like formats
 new b1b035e  TIKA-2479 Output missing left/mid cells in XLSX and XLSB, and optionally also missing rows
 new b01b059  Updated Columnar output from SAS with better formats
 new 6fa1105  Formatted columns in the columnar test Excel files
 new 348b87e  TIKA-2479 Update XLS missing cell/row handling to match XLSX and XLSB, add unit test for missing rows, and enable the Columnar tests for the Excel formats
 new 9673fbd  Changelog update
 new 12693ea  Add the other jackcess jar to the bundle

The 7 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 CHANGES.txt|   5 +++
 tika-bundle/pom.xml|   1 +
 .../tika/parser/microsoft/ExcelExtractor.java  |  26 +++---
 .../tika/parser/microsoft/OfficeParserConfig.java  |  16 -
 .../ooxml/XSSFBExcelExtractorDecorator.java|   2 +-
 .../ooxml/XSSFExcelExtractorDecorator.java |  35 +++---
 .../org/apache/tika/parser/TabularFormatsTest.java |  40 -
 .../tika/parser/microsoft/ExcelParserTest.java |  25 -
 .../apache/tika/parser/sas/SAS7BDATParserTest.java |   8 ++---
 .../test-documents/test-columnar.sas7bdat  | Bin 17408 -> 131072 bytes
 .../resources/test-documents/test-columnar.xls | Bin 6656 -> 32768 bytes
 .../resources/test-documents/test-columnar.xlsb| Bin 0 -> 9691 bytes
 .../resources/test-documents/test-columnar.xlsx| Bin 4941 -> 10556 bytes
 .../src/test/resources/test-documents/testSAS2.sas |   3 ++
 14 files changed, 120 insertions(+), 41 deletions(-)
 create mode 100644 tika-parsers/src/test/resources/test-documents/test-columnar.xlsb



[tika] branch master updated: Mime magic for DPX and ACES, thanks to Andreas Meier (TIKA-2628 and TIKA-2629)

2018-05-10 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git


The following commit(s) were added to refs/heads/master by this push:
 new ca3207c  Mime magic for DPX and ACES, thanks to Andreas Meier (TIKA-2628 and TIKA-2629)
ca3207c is described below

commit ca3207c3b0dd408b32a07b70dcfef42aa4d0a9bd
Author: Nick Burch <n...@gagravarr.org>
AuthorDate: Thu May 10 22:18:36 2018 +0100

Mime magic for DPX and ACES, thanks to Andreas Meier (TIKA-2628 and TIKA-2629)
---
 CHANGES.txt   |  2 ++
 .../resources/org/apache/tika/mime/tika-mimetypes.xml | 19 +++
 2 files changed, 21 insertions(+)

diff --git a/CHANGES.txt b/CHANGES.txt
index c66e883..b24df29 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -76,6 +76,8 @@ Release 2.0.0 - ???
* Handle .epub files using .htm rather than .html extensions for the
  embedded contents (TIKA-1288)
 
+   * Mime magic for ACES Images (TIKA-2628) and DPX Images (TIKA-2629)
+
 Release 1.17 - 12/8/2017
 
   ***NOTE: THIS IS THE LAST VERSION OF TIKA THAT WILL RUN
diff --git a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
index 7c0cd91..104cd2c 100644
--- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
+++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
@@ -5074,6 +5074,15 @@
 
   
 
+  
+<_comment>ACES Image Container File
+
+  
+  
+
+
+   
+
   
 
 
@@ -5123,6 +5132,16 @@
 
   
 
+  
+DPX
+<_comment>Digital Picture Exchange from SMPTE
+
+  
+  
+
+
+  
+
   
 
 



[tika] 04/04: Add disabled, currently failing ODS test

2018-05-10 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 49833d88cb323928c3de7bd7a86ab38444530418
Author: Nick Burch <n...@gagravarr.org>
AuthorDate: Thu May 10 17:13:24 2018 +0100

Add disabled, currently failing ODS test
---
 .../java/org/apache/tika/parser/TabularFormatsTest.java |  14 +++---
 .../src/test/resources/test-documents/test-columnar.ods | Bin 0 -> 12854 bytes
 2 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java
index 119c9cd..ea326bd 100644
--- a/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java
+++ b/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java
@@ -226,7 +226,7 @@ public class TabularFormatsTest extends TikaTest {
 XMLResult result = getXML("test-columnar.xls");
 String xml = result.xml;
 assertHeaders(xml, false, true, false);
-// TODO Correctly handle empty cells then test
+// TODO Correctly handle empty cells then enable this test
 //assertContents(xml, true, false);
 }
 @Test
@@ -234,10 +234,18 @@ public class TabularFormatsTest extends TikaTest {
 XMLResult result = getXML("test-columnar.xlsx");
 String xml = result.xml;
 assertHeaders(xml, false, true, false);
-// TODO Correctly handle empty cells then test
+// TODO Correctly handle empty cells then enable this test
 //assertContents(xml, true, false);
 }
-// TODO Test OpenDocument ODS test
+// TODO Fix the ODS test - currently failing with
+// org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml not declared
+//@Test
+//public void testODS() throws Exception {
+//XMLResult result = getXML("test-columnar.ods");
+//String xml = result.xml;
+//assertHeaders(xml, false, true, false);
+//assertContents(xml, true, true);
+//}
 
 // TODO Test other formats, eg Database formats
 
diff --git a/tika-parsers/src/test/resources/test-documents/test-columnar.ods b/tika-parsers/src/test/resources/test-documents/test-columnar.ods
new file mode 100644
index 000..067ca18
Binary files /dev/null and b/tika-parsers/src/test/resources/test-documents/test-columnar.ods differ



[tika] branch master updated (cfd6256 -> 49833d8)

2018-05-10 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git.


from cfd6256  Remaining values to check
 new 6cff602  Ensure that empty cells are still output
 new d0fb697  Not all formats know about %s, dates not completely consistent either...
 new 72994c8  Use patterns to handle the date format variations
 new 49833d8  Add disabled, currently failing ODS test

The 4 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 .../org/apache/tika/parser/sas/SAS7BDATParser.java |   6 +-
 .../org/apache/tika/parser/TabularFormatsTest.java | 126 ++---
 .../resources/test-documents/test-columnar.ods | Bin 0 -> 12854 bytes
 3 files changed, 88 insertions(+), 44 deletions(-)
 create mode 100644 tika-parsers/src/test/resources/test-documents/test-columnar.ods



[tika] 03/04: Use patterns to handle the date format variations

2018-05-10 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 72994c8ac8f0c749f26f4f19b7992b8224fc2a12
Author: Nick Burch <n...@gagravarr.org>
AuthorDate: Thu May 10 16:59:09 2018 +0100

Use patterns to handle the date format variations
---
 .../org/apache/tika/parser/TabularFormatsTest.java | 101 -
 1 file changed, 56 insertions(+), 45 deletions(-)

diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java
index 80a7f56..119c9cd 100644
--- a/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java
+++ b/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java
@@ -18,10 +18,11 @@ package org.apache.tika.parser;
 
 
 import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
 
 import java.util.Arrays;
 import java.util.List;
-import java.util.Locale;
+import java.util.regex.Pattern;
 
 import org.apache.tika.TikaTest;
 import org.junit.Test;
@@ -45,14 +46,14 @@ public class TabularFormatsTest extends TikaTest {
 /**
  * Expected values, by column
  */
-protected static final String[][] table = new String[][] {
+protected static final Object[][] table = new Object[][] {
 new String[] {
  "0","1","2","3","4","5","6","7","8","9","10"
 },
 new String[] {
  "0","1","4","9","16","25","36","49","64","81","100"
 },
-new String[] {}, // Done later
+new String[] {}, // Generated later
 new String[] {
 "0%","10%","20%","30%","40%","50%",
 "60%","70%","80%","90%","100%"
@@ -62,37 +63,44 @@ public class TabularFormatsTest extends TikaTest {
 "75.0%","80.0%","83.3%","85.7%",
 "87.5%","88.9%","90.0%"
 },
-new String[] {
- "01-01-1960", "02-01-1960", "17-01-1960",
- "22-03-1960", "13-09-1960", "17-09-1961",
- "20-07-1963", "29-07-1966", "20-03-1971",
- "18-12-1977", "19-05-1987"
+new Pattern[] {
+Pattern.compile("01-(01|JAN|Jan)-(60|1960)"),
+Pattern.compile("02-01-1960"),
+Pattern.compile("17-01-1960"),
+Pattern.compile("22-03-1960"),
+Pattern.compile("13-09-1960"),
+Pattern.compile("17-09-1961"),
+Pattern.compile("20-07-1963"),
+Pattern.compile("29-07-1966"),
+Pattern.compile("20-03-1971"),
+Pattern.compile("18-12-1977"),
+Pattern.compile("19-05-1987"),
 },
-new String[] {
- "01JAN60:00:00:01",
- "01JAN60:00:00:10",
- "01JAN60:00:01:40",
- "01JAN60:00:16:40",
- "01JAN60:02:46:40",
- "02JAN60:03:46:40",
- "12JAN60:13:46:40",
- "25APR60:17:46:40",
- "03MAR63:09:46:40",
- "09SEP91:01:46:40",
- "19NOV76:17:46:40"
+new Pattern[] {
+ Pattern.compile("01(JAN|Jan)(60|1960):00:00:01(.00)?"),
+ Pattern.compile("01(JAN|Jan)(60|1960):00:00:10(.00)?"),
+ Pattern.compile("01(JAN|Jan)(60|1960):00:01:40(.00)?"),
+ Pattern.compile("01(JAN|Jan)(60|1960):00:16:40(.00)?"),
+ Pattern.compile("01(JAN|Jan)(60|1960):02:46:40(.00)?"),
+ Pattern.compile("02(JAN|Jan)(60|1960):03:46:40(.00)?"),
+ Pattern.compile("12(JAN|Jan)(60|1960):13:46:40(.00)?"),
+ Pattern.compile("25(APR|Apr)(60|1960):17:46:40(.00)?"),
+ Pattern.compile("03(MAR|Mar)(63|1963):09:46:40(.00)?"),
+ Pattern.compile("09(SEP|Sep)(91|1991):01:46:40(.00)?"),
+ Pattern.compile("19(NOV|Nov)(76|2276):17:46:40(.00)?")
 },
-new String[] {
- "0:00:01",
- "0:00:03",
- "0:00:09",
-  
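The diff above turns the expected-value table from `String[][]` into `Object[][]`, so that a cell can be either an exact string or a `Pattern`. A minimal sketch of how such a mixed table could be checked follows; the class and method names are hypothetical and are not taken from the commit.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Tika commit: an expected cell is
// either an exact String or a Pattern that the whole value must match.
class ExpectedCellSketch {
    static boolean matches(Object expected, String actual) {
        if (expected instanceof Pattern) {
            // matches() requires the entire cell text to fit the pattern
            return ((Pattern) expected).matcher(actual).matches();
        }
        // Plain strings are compared exactly
        return expected.equals(actual);
    }
}
```

With the first datetime pattern from the diff, both `01JAN60:00:00:01` and `01Jan1960:00:00:01.00` would be accepted, while an all-lower-case month would not.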

[tika] 02/04: Not all formats know about %s, dates not completely consistent either...

2018-05-10 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit d0fb69715e83a42db2ee5c2750eaa9d3b4f4d86c
Author: Nick Burch <n...@gagravarr.org>
AuthorDate: Thu May 10 16:33:45 2018 +0100

Not all formats know about %s, dates not completely consistent either...
---
 .../org/apache/tika/parser/TabularFormatsTest.java | 33 ++
 1 file changed, 27 insertions(+), 6 deletions(-)

diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java
index 7330f6a..80a7f56 100644
--- a/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java
+++ b/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java
@@ -20,6 +20,8 @@ package org.apache.tika.parser;
 import static org.junit.Assert.assertEquals;
 
 import java.util.Arrays;
+import java.util.List;
+import java.util.Locale;
 
 import org.apache.tika.TikaTest;
 import org.junit.Test;
@@ -56,7 +58,7 @@ public class TabularFormatsTest extends TikaTest {
 "60%","70%","80%","90%","100%"
 },
 new String[] {
-"M","0.0%","50.0%","66.7%",
+"","0.0%","50.0%","66.7%",
 "75.0%","80.0%","83.3%","85.7%",
 "87.5%","88.9%","90.0%"
 },
@@ -100,6 +102,15 @@ public class TabularFormatsTest extends TikaTest {
 table[2][i] = "This is row " + i + " of 10";
 }
 }
+// Which columns hold percentages? Not all parsers
+//  correctly format these...
+protected static final List<Integer> percentageColumns = 
+Arrays.asList(new Integer[] { 3, 4 });
+// Which columns hold dates? Some parsers output
+//  bits of the month in lower case, some all upper, eg JAN vs Jan
+protected static final List<Integer> dateColumns = 
+Arrays.asList(new Integer[] { 5, 6 });
+// TODO Handle 60 vs 1960
 
 protected static String[] toCells(String row, boolean isTH) {
 // Split into cells, ignoring stuff before first cell
@@ -152,7 +163,7 @@ public class TabularFormatsTest extends TikaTest {
 }
 }
 }
-protected void assertContents(String xml, boolean hasHeader) {
+protected void assertContents(String xml, boolean hasHeader, boolean doesPercents) {
 // Ignore anything before the first <tr>
 // Ignore the header row if there is one
 int ignores = 1;
@@ -178,8 +189,14 @@ public class TabularFormatsTest extends TikaTest {
  table.length, cells.length);
 
 for (int cn=0; cn<table.length; cn++) {
+String val = cells[cn];
+
+// If the parser doesn't know about % formats,
+//  skip the cell if the column is a % one
+if (!doesPercents && percentageColumns.contains(cn)) continue;
+if (dateColumns.contains(cn)) val = val.toUpperCase(Locale.ROOT);
+
 // Ignore cell attributes
-String val = cells.length > (cn-1) ? cells[cn] : "";
 if (! val.isEmpty()) val = val.split(">")[1];
 // Check
 assertEquals("Wrong text in row " + (rn+1) + " and column " + (cn+1),
@@ -193,21 +210,25 @@ public class TabularFormatsTest extends TikaTest {
 XMLResult result = getXML("test-columnar.sas7bdat");
 String xml = result.xml;
 assertHeaders(xml, true, true, true);
-//assertContents(xml, true);
+// TODO Wait for https://github.com/epam/parso/issues/28 to be fixed
+//  then check the % formats again
+//assertContents(xml, true, false);
 }
 @Test
 public void testXLS() throws Exception {
 XMLResult result = getXML("test-columnar.xls");
 String xml = result.xml;
 assertHeaders(xml, false, true, false);
-//assertContents(xml, true);
+// TODO Correctly handle empty cells then test
+//assertContents(xml, true, false);
 }
 @Test
 public void testXLSX() throws Exception {
 XMLResult result = getXML("test-columnar.xlsx");
 String xml = result.xml;
 assertHeaders(xml, false, true, false);
-//assertContents(xml, true);
+// TODO Correctly handle empty cells then test
+//assertContents(xml, true, false);
 }
 // TODO Test ODS
 

-- 
To stop receiving notification emails like this one, please contact
n...@apache.org.
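
The commit above skips percentage columns for parsers that do not format them, and upper-cases date columns before comparing. A minimal standalone sketch of those two rules, using the column indices from the diff, might look like this; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of the per-column rules described in the commit
class CellCheckSketch {
    // Column indices as listed in the diff (0-based)
    static final List<Integer> PERCENTAGE_COLUMNS = Arrays.asList(3, 4);
    static final List<Integer> DATE_COLUMNS = Arrays.asList(5, 6);

    // A percentage column is only checked if the parser formats percents
    static boolean shouldCheck(int column, boolean doesPercents) {
        return doesPercents || !PERCENTAGE_COLUMNS.contains(column);
    }

    // Date columns are upper-cased so that "Jan" and "JAN" compare equal;
    // Locale.ROOT keeps the result independent of the default locale
    static String normalise(String value, int column) {
        if (DATE_COLUMNS.contains(column)) {
            return value.toUpperCase(Locale.ROOT);
        }
        return value;
    }
}
```

This only covers case differences; as the TODO in the diff notes, the two-digit versus four-digit year (60 vs 1960) still needs separate handling.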


[tika] 01/04: Ensure that empty cells are still output

2018-05-10 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 6cff6029beb4316e541169d788fe1884b338
Author: Nick Burch <n...@gagravarr.org>
AuthorDate: Thu May 10 16:26:22 2018 +0100

Ensure that empty cells are still output
---
 .../src/main/java/org/apache/tika/parser/sas/SAS7BDATParser.java| 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/sas/SAS7BDATParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/sas/SAS7BDATParser.java
index 121d958..8b28644 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/sas/SAS7BDATParser.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/sas/SAS7BDATParser.java
@@ -134,7 +134,11 @@ public class SAS7BDATParser extends AbstractParser {
 while ((row = sas.readNext()) != null) {
 xhtml.startElement("tr");
 for (String val : DataWriterUtil.getRowValues(sas.getColumns(), row)) {
-xhtml.element("td", val);
+// Use explicit start/end, rather than element, to 
+//  ensure that empty cells still get output
+xhtml.startElement("td");
+xhtml.characters(val);
+xhtml.endElement("td");
 }
 xhtml.endElement("tr");
 xhtml.newline();
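
The change above replaces a one-call convenience method with explicit start/characters/end calls so that empty cells are still written. A minimal sketch of the difference follows, assuming the convenience method skips empty values (which is what the commit comment implies). It uses a plain `StringBuilder` rather than the real XHTML handler, and all names are hypothetical.

```java
// Hypothetical stand-in for an XHTML writer; not the Tika API
class EmptyCellSketch {
    // Convenience form: assumed to emit nothing for null or empty values
    static void element(StringBuilder out, String name, String value) {
        if (value != null && value.length() > 0) {
            out.append('<').append(name).append('>')
               .append(value)
               .append("</").append(name).append('>');
        }
    }

    // Explicit form: always emits the start and end tags
    static void explicitCell(StringBuilder out, String name, String value) {
        out.append('<').append(name).append('>');
        if (value != null) {
            out.append(value);
        }
        out.append("</").append(name).append('>');
    }

    // Writes one table row using either form
    static String row(String[] values, boolean explicit) {
        StringBuilder out = new StringBuilder("<tr>");
        for (String val : values) {
            if (explicit) {
                explicitCell(out, "td", val);
            } else {
                element(out, "td", val);
            }
        }
        return out.append("</tr>").toString();
    }
}
```

With the convenience form, a row with an empty middle value loses a cell, so later cells shift left and column-based checks no longer line up.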



[tika] 02/05: Add a time column to the test columnar files

2018-05-10 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit ca2f5bc63b7595730e53e95758dc9aaf6b567daa
Author: Nick Burch <n...@gagravarr.org>
AuthorDate: Thu May 10 11:35:04 2018 +0100

Add a time column to the test columnar files
---
 .../org/apache/tika/parser/TabularFormatsTest.java |  22 +++-
 .../apache/tika/parser/sas/SAS7BDATParserTest.java |   8 ++---
 .../resources/test-documents/test-columnar.csv |  37 +++--
 .../resources/test-documents/test-columnar.sas.xml |  11 ++
 .../test-documents/test-columnar.sas7bdat  | Bin 17408 -> 17408 bytes
 .../resources/test-documents/test-columnar.xls | Bin 0 -> 6656 bytes
 .../resources/test-documents/test-columnar.xlsx| Bin 0 -> 4941 bytes
 .../resources/test-documents/test-columnar.xpt | Bin 4560 -> 4720 bytes
 .../src/test/resources/test-documents/testSAS2.sas |  27 ---
 9 files changed, 64 insertions(+), 41 deletions(-)

diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java
index 61fcca2..4dc7336 100644
--- a/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java
+++ b/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java
@@ -26,25 +26,31 @@ import org.junit.Test;
  * This is mostly focused on the XHTML output
  */
 public class TabularFormatsTest extends TikaTest {
-protected static final String[] headers = new String[] {
-"String (Num=)","Number","Date","Datetime","Number"
+protected static final String[] columnNames = new String[] {
+ "recnum","square","desc","pctdone","pctinc",
+ "date","datetime","time"
 };
+protected static final String[] columnLabels = new String[] {
+"Record Number","Square of the Record Number",
+"Description of the Row","Percent Done",
+"Percent Increment","date","datetime","time"
+};
+
 /**
  * Expected values, by column
  */
 protected static final String[][] table = new String[][] {
 // TODO All values
 new String[] {
-"Num=0"
+ "0","1","2","3","4","5","6","7","8","9","10"
 },
 new String[] {
-"0.0"
+ "0","1","4" // etc
 },
-new String[] {
-"1899-12-30"
+new String[] {  // etc
+"01-01-1960"
 },
-new String[] {
-"1900-01-01 11:00:00"
+new String[] {  // etc
 },
 new String[] {
 ""
diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/sas/SAS7BDATParserTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/sas/SAS7BDATParserTest.java
index 3bb3e01..610ffc3 100644
--- a/tika-parsers/src/test/java/org/apache/tika/parser/sas/SAS7BDATParserTest.java
+++ b/tika-parsers/src/test/java/org/apache/tika/parser/sas/SAS7BDATParserTest.java
@@ -89,11 +89,11 @@ public class SAS7BDATParserTest extends TikaTest {
 assertEquals("application/x-sas-data", metadata.get(Metadata.CONTENT_TYPE));
 assertEquals("TESTING", metadata.get(TikaCoreProperties.TITLE));
 
-assertEquals("2018-05-09T16:42:04Z", metadata.get(TikaCoreProperties.CREATED));
-assertEquals("2018-05-09T16:42:04Z", metadata.get(TikaCoreProperties.MODIFIED));
+assertEquals("2018-05-09T17:59:33Z", metadata.get(TikaCoreProperties.CREATED));
+assertEquals("2018-05-09T17:59:33Z", metadata.get(TikaCoreProperties.MODIFIED));
 
 assertEquals("1", metadata.get(PagedText.N_PAGES));
-assertEquals("7", metadata.get(Database.COLUMN_COUNT));
+assertEquals("8", metadata.get(Database.COLUMN_COUNT));
 assertEquals("11", metadata.get(Database.ROW_COUNT));
 assertEquals("windows-1252", metadata.get(HttpHeaders.CONTENT_ENCODING));
 assertEquals("W32_7PRO", metadata.get(OfficeOpenXMLExtended.APPLICATION));
@@ -102,7 +102,7 @@ public class SAS7BDATParserTest extends TikaTest {
 assertEquals("Little", metadata.get(MachineMetadata.ENDIAN));
 assertEquals(Arrays.asList("Record Number","Square of the Record 
Number",
"Description of the

[tika] branch master updated (a0ffec1 -> cfd6256)

2018-05-10 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git.


from a0ffec1  Handle .epub files using .htm rather than .html extensions for the embedded contents (TIKA-1288)
 new d0324f8  Add a test .sas7bdat file with labels, and generate the columnar/tabular test file in a few more formats
 new ca2f5bc  Add a time column to the test columnar files
 new 1d7a113  CSV assert as best we can (no dedicated parser), start on XLS and SAS7BDAT consistency tests
 new 7f89db3  Check header contents, check data rows count, add XLSX test
 new cfd6256  Remaining values to check

The 5 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 .../org/apache/tika/parser/TabularFormatsTest.java | 196 +++--
 .../apache/tika/parser/sas/SAS7BDATParserTest.java |  51 --
 .../resources/test-documents/test-columnar.csv |  37 ++--
 .../resources/test-documents/test-columnar.sas.xml | 113 
 .../test-documents/test-columnar.sas7bdat  | Bin 9216 -> 17408 bytes
 .../resources/test-documents/test-columnar.xls | Bin 0 -> 6656 bytes
 .../resources/test-documents/test-columnar.xlsx| Bin 0 -> 4941 bytes
 .../resources/test-documents/test-columnar.xpt | Bin 0 -> 4720 bytes
 .../src/test/resources/test-documents/testSAS2.sas |  67 +++
 9 files changed, 405 insertions(+), 59 deletions(-)
 create mode 100644 tika-parsers/src/test/resources/test-documents/test-columnar.sas.xml
 create mode 100644 tika-parsers/src/test/resources/test-documents/test-columnar.xls
 create mode 100644 tika-parsers/src/test/resources/test-documents/test-columnar.xlsx
 create mode 100644 tika-parsers/src/test/resources/test-documents/test-columnar.xpt
 create mode 100644 tika-parsers/src/test/resources/test-documents/testSAS2.sas



[tika] 01/05: Add a test .sas7bdat file with labels, and generate the columnar/tabular test file in a few more formats

2018-05-10 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit d0324f8e4fa70fce67d56dc70f611f5535fe229b
Author: Nick Burch <n...@gagravarr.org>
AuthorDate: Wed May 9 18:19:34 2018 +0100

Add a test .sas7bdat file with labels, and generate the columnar/tabular test file in a few more formats
---
 .../apache/tika/parser/sas/SAS7BDATParserTest.java |  51 +++
 .../resources/test-documents/test-columnar.sas.xml | 102 +
 .../test-documents/test-columnar.sas7bdat  | Bin 9216 -> 17408 bytes
 .../resources/test-documents/test-columnar.xpt | Bin 0 -> 4560 bytes
 .../src/test/resources/test-documents/testSAS2.sas |  48 ++
 5 files changed, 182 insertions(+), 19 deletions(-)

diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/sas/SAS7BDATParserTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/sas/SAS7BDATParserTest.java
index 2657ac2..3bb3e01 100644
--- a/tika-parsers/src/test/java/org/apache/tika/parser/sas/SAS7BDATParserTest.java
+++ b/tika-parsers/src/test/java/org/apache/tika/parser/sas/SAS7BDATParserTest.java
@@ -82,36 +82,36 @@ public class SAS7BDATParserTest extends TikaTest {
 Metadata metadata = new Metadata();
 
 try (InputStream stream = SAS7BDATParserTest.class.getResourceAsStream(
-"/test-documents/test-columnar.sas7bdat")) {
+"/test-documents/test-columnar.sas7bdat")) {
 parser.parse(stream, handler, metadata, new ParseContext());
 }
 
 assertEquals("application/x-sas-data", metadata.get(Metadata.CONTENT_TYPE));
-assertEquals("SHEET1", metadata.get(TikaCoreProperties.TITLE));
+assertEquals("TESTING", metadata.get(TikaCoreProperties.TITLE));
 
-// Fri Mar 06 19:10:19 GMT 2015
-assertEquals("2015-03-06T19:10:19Z", metadata.get(TikaCoreProperties.CREATED));
-assertEquals("2015-03-06T19:10:19Z", metadata.get(TikaCoreProperties.MODIFIED));
+assertEquals("2018-05-09T16:42:04Z", metadata.get(TikaCoreProperties.CREATED));
+assertEquals("2018-05-09T16:42:04Z", metadata.get(TikaCoreProperties.MODIFIED));
 
 assertEquals("1", metadata.get(PagedText.N_PAGES));
-assertEquals("5", metadata.get(Database.COLUMN_COUNT));
-assertEquals("31", metadata.get(Database.ROW_COUNT));
+assertEquals("7", metadata.get(Database.COLUMN_COUNT));
+assertEquals("11", metadata.get(Database.ROW_COUNT));
 assertEquals("windows-1252", metadata.get(HttpHeaders.CONTENT_ENCODING));
-assertEquals("XP_PRO", metadata.get(OfficeOpenXMLExtended.APPLICATION));
-assertEquals("9.0101M3", metadata.get(OfficeOpenXMLExtended.APP_VERSION));
+assertEquals("W32_7PRO", metadata.get(OfficeOpenXMLExtended.APPLICATION));
+assertEquals("9.0301M2", metadata.get(OfficeOpenXMLExtended.APP_VERSION));
 assertEquals("32", metadata.get(MachineMetadata.ARCHITECTURE_BITS));
 assertEquals("Little", metadata.get(MachineMetadata.ENDIAN));
-assertEquals(Arrays.asList("A","B","C","D","E"),
+assertEquals(Arrays.asList("Record Number","Square of the Record Number",
+   "Description of the Row","Percent Done",
+   "Percent Increment","date","datetime"),
  Arrays.asList(metadata.getValues(Database.COLUMN_NAME)));
 
 String content = handler.toString();
-assertContains("SHEET1", content);
-assertContains("A\tB\tC", content);
-assertContains("Num=0\t", content);
-assertContains("Num=404242\t", content);
-assertContains("\t0\t", content);
-assertContains("\t404242\t", content);
-assertContains("\t08Feb1904\t", content);
+assertContains("TESTING", content);
+assertContains("0\t0\tThis", content);
+assertContains("2\t4\tThis", content);
+assertContains("4\t16\tThis", content);
+assertContains("\t01-01-1960\t", content);
+assertContains("\t01Jan1960:00:00", content);
 }
 
 @Test
@@ -129,7 +129,20 @@ public class SAS7BDATParserTest extends TikaTest {
 assertContains("This is row", xml);
 assertContains("10", xml);
 }
+
+@Test
+public void testHTML2() throws Exception {
+XMLResult result = getXML(&q

[tika] 05/05: Remaining values to check

2018-05-10 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit cfd62569a8f6bf79ba5d15bb3f4063d49347c7fd
Author: Nick Burch <n...@gagravarr.org>
AuthorDate: Thu May 10 15:41:16 2018 +0100

Remaining values to check
---
 .../org/apache/tika/parser/TabularFormatsTest.java | 84 +++---
 1 file changed, 73 insertions(+), 11 deletions(-)

diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java
index 023f49d..7330f6a 100644
--- a/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java
+++ b/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java
@@ -44,24 +44,62 @@ public class TabularFormatsTest extends TikaTest {
  * Expected values, by column
  */
 protected static final String[][] table = new String[][] {
-// TODO All values
 new String[] {
  "0","1","2","3","4","5","6","7","8","9","10"
 },
 new String[] {
  "0","1","4","9","16","25","36","49","64","81","100"
 },
-/*
-new String[] {  // etc
-"01-01-1960"
+new String[] {}, // Done later
+new String[] {
+"0%","10%","20%","30%","40%","50%",
+"60%","70%","80%","90%","100%"
+},
+new String[] {
+"M","0.0%","50.0%","66.7%",
+"75.0%","80.0%","83.3%","85.7%",
+"87.5%","88.9%","90.0%"
 },
-new String[] {  // etc
+new String[] {
+ "01-01-1960", "02-01-1960", "17-01-1960",
+ "22-03-1960", "13-09-1960", "17-09-1961",
+ "20-07-1963", "29-07-1966", "20-03-1971",
+ "18-12-1977", "19-05-1987"
 },
 new String[] {
-""
+ "01JAN60:00:00:01",
+ "01JAN60:00:00:10",
+ "01JAN60:00:01:40",
+ "01JAN60:00:16:40",
+ "01JAN60:02:46:40",
+ "02JAN60:03:46:40",
+ "12JAN60:13:46:40",
+ "25APR60:17:46:40",
+ "03MAR63:09:46:40",
+ "09SEP91:01:46:40",
+ "19NOV76:17:46:40"
+},
+new String[] {
+ "0:00:01",
+ "0:00:03",
+ "0:00:09",
+ "0:00:27",
+ "0:01:21",
+ "0:04:03",
+ "0:12:09",
+ "0:36:27",
+ "1:49:21",
+ "5:28:03",
+ "16:24:09"
 }
-*/
 };
+static {
+// Row text in 3rd column
+table[2] = new String[table[0].length];
+for (int i=0; i<table[0].length; i++) {
+table[2][i] = "This is row " + i + " of 10";
+}
+}
 
 protected static String[] toCells(String row, boolean isTH) {
 // Split into cells, ignoring stuff before first cell
@@ -72,9 +110,18 @@ public class TabularFormatsTest extends TikaTest {
 cells = row.split("<td");
 }
 cells = Arrays.copyOfRange(cells, 1, cells.length);
+
+// Ignore the closing tag onwards, and normalise whitespace
 for (int i=0; i<cells.length; i++) {
+cells[i] = cells[i].trim();
+if (cells[i].equals("/>")) {
+cells[i] = "";
+continue;
+}
+
 int splitAt = cells[i].lastIndexOf(" (cn-1) ? cells[cn] : "";
+if (! val.isEmpty()) val = val.split(">")[1];
+// Check
+assertEquals("Wrong text in row " + (rn+1) + " and column " + (cn+1),
+ table[cn][rn], val);
+}
+}
 }
 
 @Test
@@ -133,21 +193,21 @@ public class TabularFormatsTest extends TikaTest {
 XMLResult result = getXML("test-columnar.sas7bdat");
 String xml = result.xml;
 assertHeaders(xml, true, true, true);
-assertCont

[tika] 04/05: Check header contents, check data rows count, add XLSX test

2018-05-10 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 7f89db35d066e6c4ae35490c5bad67d376e5365e
Author: Nick Burch <n...@gagravarr.org>
AuthorDate: Thu May 10 15:13:43 2018 +0100

Check header contents, check data rows count, add XLSX test
---
 .../org/apache/tika/parser/TabularFormatsTest.java | 77 +-
 1 file changed, 61 insertions(+), 16 deletions(-)

diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java
index 8574d37..023f49d 100644
--- a/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java
+++ b/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java
@@ -31,7 +31,7 @@ import org.junit.Test;
  */
 public class TabularFormatsTest extends TikaTest {
 protected static final String[] columnNames = new String[] {
- "recnum","square","desc","pctdone","pctinc",
+ "recnum","square","desc","pctdone","pctincr",
  "date","datetime","time"
 };
 protected static final String[] columnLabels = new String[] {
@@ -49,8 +49,9 @@ public class TabularFormatsTest extends TikaTest {
  "0","1","2","3","4","5","6","7","8","9","10"
 },
 new String[] {
- "0","1","4" // etc
+ "0","1","4","9","16","25","36","49","64","81","100"
 },
+/*
 new String[] {  // etc
 "01-01-1960"
 },
@@ -59,37 +60,72 @@ public class TabularFormatsTest extends TikaTest {
 new String[] {
 ""
 }
+*/
 };
-
-protected void assertHeaders(String xml, boolean isTH, boolean hasLabel, boolean hasName) {
-// Find the first row
-int splitAt = xml.indexOf("</tr>");
-String hRow = xml.substring(0, splitAt);
-splitAt = xml.indexOf("<tr>");
-hRow = hRow.substring(splitAt+4);
-
+
+protected static String[] toCells(String row, boolean isTH) {
 // Split into cells, ignoring stuff before first cell
 String[] cells;
 if (isTH) {
-cells = hRow.split("<th");
+cells = row.split("<th");
 } else {
-cells = hRow.split("<td");
+cells = row.split("<td");
 }
 cells = Arrays.copyOfRange(cells, 1, cells.length);
 for (int i=0; i<cells.length; i++) {
-splitAt = cells[i].lastIndexOf("");
+String hRow = xml.substring(0, splitAt);
+splitAt = xml.indexOf("");
+hRow = hRow.substring(splitAt+4);
+
+// Split into cells, ignoring stuff before first cell
+String[] cells = toCells(hRow, isTH);
 
 // Check we got the right number
 assertEquals("Wrong number of cells in header row " + hRow,
  columnLabels.length, cells.length);
 
 // Check we got the right stuff
-// TODO
+for (int i=0; i<cells.length; i++) {
+if (hasLabel && hasName) {
+assertContains("title=\"" + columnNames[i] + "\"", cells[i]); 
+assertContains(">" + columnLabels[i], cells[i]); 
+} else if (hasName) {
+assertContains(">" + columnNames[i], cells[i]); 
+} else {
+assertContains(">" + columnLabels[i], cells[i]); 
+}
+}
 }
 protected void assertContents(String xml, boolean hasHeader) {
-// TODO Check the rows
+// Ignore anything before the first <tr>
+// Ignore the header row if there is one
+int ignores = 1;
+if (hasHeader) ignores++;
+
+// Split into rows, and discard the row closing (and anything after)
+String[] rows = xml.split("<tr>");
+rows = Arrays.copyOfRange(rows, ignores, rows.length);
+for (int i=0; i<rows.length; i++) {
+rows[i] = rows[i].split("</tr>")[0].trim();
+}
+
+// Check we got the right number of rows
+for (int cn=0; cn<table.length; cn++) {
+assertEquals("Wrong number of rows found compared to column " + (cn+1),
+ table[cn].length, rows.length);
+}
+
+// Check each row's values
+// TODO
 }
 
 @Te

[tika] branch master updated: Handle .epub files using .htm rather than .html extensions for the embedded contents (TIKA-1288)

2018-05-09 Thread nick
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git


The following commit(s) were added to refs/heads/master by this push:
 new a0ffec1  Handle .epub files using .htm rather than .html extensions for the embedded contents (TIKA-1288)
a0ffec1 is described below

commit a0ffec146e84fdcf4c747b4375f92ae283944f4c
Author: Nick Burch <n...@gagravarr.org>
AuthorDate: Wed May 9 10:23:09 2018 +0100

Handle .epub files using .htm rather than .html extensions for the embedded contents (TIKA-1288)
---
 CHANGES.txt| 3 +++
 tika-parsers/src/main/java/org/apache/tika/parser/epub/EpubParser.java | 3 ++-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/CHANGES.txt b/CHANGES.txt
index 194fef8..c66e883 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -73,6 +73,9 @@ Release 2.0.0 - ???
 
* Support for SAS7BDAT data files (TIKA-2462)
 
+   * Handle .epub files using .htm rather than .html extensions for the
+ embedded contents (TIKA-1288)
+
 Release 1.17 - 12/8/2017
 
   ***NOTE: THIS IS THE LAST VERSION OF TIKA THAT WILL RUN
diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/epub/EpubParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/epub/EpubParser.java
index c4f72de..775b319 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/epub/EpubParser.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/epub/EpubParser.java
@@ -105,7 +105,8 @@ public class EpubParser extends AbstractParser {
 meta.parse(zip, new DefaultHandler(), metadata, context);
 } else if (entry.getName().endsWith(".opf")) {
 meta.parse(zip, new DefaultHandler(), metadata, context);
-} else if (entry.getName().endsWith(".html") || 
+} else if (entry.getName().endsWith(".htm") || 
+   entry.getName().endsWith(".html") || 
   entry.getName().endsWith(".xhtml")) {
 content.parse(zip, childHandler, metadata, context);
 }
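
The change above adds `.htm` to the entry-name suffixes that are treated as embedded content. A minimal standalone sketch of that check is shown here; the class and method names are hypothetical, and the real parser does the test inline on `entry.getName()`.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper mirroring the suffix test in the diff
class EpubEntrySketch {
    // Suffixes treated as embedded content after this commit
    static final List<String> CONTENT_EXTENSIONS =
            Arrays.asList(".htm", ".html", ".xhtml");

    static boolean isContentEntry(String entryName) {
        for (String ext : CONTENT_EXTENSIONS) {
            if (entryName.endsWith(ext)) {
                return true;
            }
        }
        return false;
    }
}
```

Like the original code, this comparison is case-sensitive, so an entry named `CHAPTER.HTM` would not be picked up.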


