[tika] branch main updated (9488d076e -> 02f0d0441)
This is an automated email from the ASF dual-hosted git repository.

nick pushed a change to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git

    from 9488d076e bump OpenSearch version to latest
     add 500900d67 TIKA-4060 Test AAC files, based on testWAV.wav, one without ID3, one with dummy ID3 values
     add ae85b9e4e AAC magic, based on PRONOM patterns found by Gregory Lepore
     add 8f838c512 AAC detection tests, ID3 one currently failing...
     add 04021e427 Hex values in a match regex need escaping to be treated as hex
     add 02f0d0441 Merge branch 'main' into TIKA-4060

No new revisions were added by this update.

Summary of changes:
 .../resources/org/apache/tika/mime/tika-mimetypes.xml | 10 ++
 .../test/java/org/apache/tika/mime/TestMimeTypes.java | 9 +
 .../src/test/resources/test-documents/testAAC.aac | Bin 0 -> 779 bytes
 .../src/test/resources/test-documents/testAACid3.aac | Bin 0 -> 2176 bytes
 4 files changed, 19 insertions(+)
 create mode 100644 tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testAAC.aac
 create mode 100644 tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testAACid3.aac
[tika] 01/01: Merge branch 'main' into TIKA-4060
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch TIKA-4060 in repository https://gitbox.apache.org/repos/asf/tika.git commit 02f0d0441b8c380faebcf8bb14d6f91b0252f058 Merge: 04021e427 9488d076e Author: Nick Burch AuthorDate: Thu Jun 8 22:12:00 2023 +0100 Merge branch 'main' into TIKA-4060 CHANGES.txt| 2 + .../tika/exception/WriteLimitReachedException.java | 6 +- .../apache/tika/metadata/TikaCoreProperties.java | 1 + .../org/apache/tika/parser/AutoDetectParser.java | 4 ++ .../java/org/apache/tika/pipes/PipesClient.java| 63 -- .../java/org/apache/tika/pipes/PipesResult.java| 28 +--- .../java/org/apache/tika/pipes/PipesServer.java| 73 - .../org/apache/tika/pipes/async/AsyncConfig.java | 10 +++ .../apache/tika/pipes/async/AsyncProcessor.java| 14 +++- .../apache/tika/sax/ContentHandlerDecorator.java | 38 ++- .../org/apache/tika/pipes/PipesServerTest.java | 76 ++ .../tika/pipes/async/AsyncProcessorTest.java | 74 + .../tika/pipes/async/MockDigesterFactory.java | 56 .../org/apache/tika/pipes/async/MockReporter.java | 6 +- .../resources/org/apache/tika/pipes/TIKA-3941.xml | 30 + .../opensearch/tests/TikaPipesOpenSearchTest.java | 2 +- tika-parent/pom.xml| 2 +- .../reporters/jdbc/TestJDBCPipesReporter.java | 2 +- .../server/core/resource/UnpackerResource.java | 52 ++- .../tika/server/standard/UnpackerResourceTest.java | 9 +++ 20 files changed, 494 insertions(+), 54 deletions(-)
[tika] branch TIKA-4060 updated (04021e427 -> 02f0d0441)
This is an automated email from the ASF dual-hosted git repository. nick pushed a change to branch TIKA-4060 in repository https://gitbox.apache.org/repos/asf/tika.git from 04021e427 Hex values in a match regex need escaping to be treated as hex add d72077833 Bump aws.version from 1.12.483 to 1.12.484 add d149fc71b Merge pull request #1180 from apache/dependabot/maven/aws.version-1.12.484 add 2d9daef85 TIKA-4039 (#1181) add ceed7be8b TIKA-4062 (#1179) add 6cea7717c TIKA-3941 -- allow reporting of intermediate results from the pipes processor (#1167) add 9488d076e bump OpenSearch version to latest new 02f0d0441 Merge branch 'main' into TIKA-4060 The 1 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference. Summary of changes: CHANGES.txt| 2 + .../tika/exception/WriteLimitReachedException.java | 6 +- .../apache/tika/metadata/TikaCoreProperties.java | 1 + .../org/apache/tika/parser/AutoDetectParser.java | 4 ++ .../java/org/apache/tika/pipes/PipesClient.java| 63 -- .../java/org/apache/tika/pipes/PipesResult.java| 28 +--- .../java/org/apache/tika/pipes/PipesServer.java| 73 - .../org/apache/tika/pipes/async/AsyncConfig.java | 10 +++ .../apache/tika/pipes/async/AsyncProcessor.java| 14 +++- .../apache/tika/sax/ContentHandlerDecorator.java | 38 ++- .../org/apache/tika/pipes/PipesServerTest.java | 76 ++ .../tika/pipes/async/AsyncProcessorTest.java | 74 + .../tika/pipes/async/MockDigesterFactory.java | 38 +-- .../org/apache/tika/pipes/async/MockReporter.java | 6 +- .../TIKA-3941.xml} | 18 +++-- .../opensearch/tests/TikaPipesOpenSearchTest.java | 2 +- tika-parent/pom.xml| 2 +- .../reporters/jdbc/TestJDBCPipesReporter.java | 2 +- .../server/core/resource/UnpackerResource.java | 52 ++- .../tika/server/standard/UnpackerResourceTest.java | 9 +++ 20 files changed, 434 insertions(+), 84 deletions(-) create mode 100644 
tika-core/src/test/java/org/apache/tika/pipes/PipesServerTest.java copy tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-digest-commons/src/main/java/org/apache/tika/parser/digestutils/CommonsDigesterFactory.java => tika-core/src/test/java/org/apache/tika/pipes/async/MockDigesterFactory.java (61%) copy tika-core/src/test/resources/org/apache/tika/{config/fetchers-noname-config.xml => pipes/TIKA-3941.xml} (76%)
[tika] branch TIKA-4060 updated: Hex values in a match regex need escaping to be treated as hex
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch TIKA-4060 in repository https://gitbox.apache.org/repos/asf/tika.git The following commit(s) were added to refs/heads/TIKA-4060 by this push: new 04021e427 Hex values in a match regex need escaping to be treated as hex 04021e427 is described below commit 04021e4276606bb2ca8837444651da049f21c222 Author: Nick Burch AuthorDate: Thu Jun 8 21:55:49 2023 +0100 Hex values in a match regex need escaping to be treated as hex --- tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml index 461ad6128..5c8cbbcb1 100644 --- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml +++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml @@ -5627,12 +5627,12 @@ - + - +
[tika] 03/03: AAC detection tests, ID3 one currently failing...
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch TIKA-4060
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 8f838c512c6880ba21d2d6df36f592614710aba8
Author: Nick Burch
AuthorDate: Wed Jun 7 23:58:11 2023 +0100

    AAC detection tests, ID3 one currently failing...
---
 .../src/test/java/org/apache/tika/mime/TestMimeTypes.java | 9 +
 1 file changed, 9 insertions(+)

diff --git a/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/mime/TestMimeTypes.java b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/mime/TestMimeTypes.java
index f534060a1..73945f355 100644
--- a/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/mime/TestMimeTypes.java
+++ b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/mime/TestMimeTypes.java
@@ -1345,6 +1345,15 @@ public class TestMimeTypes {
         assertTypeByData("application/onix-message+xml", "testONIXMessageShort.xml");
     }
 
+    @Test
+    public void testAACDetection() throws Exception {
+        assertType("audio/x-aac", "testAAC.aac");
+        assertType("audio/x-aac", "testAACid3.aac");
+        assertTypeByData("audio/x-aac", "testAAC.aac");
+        assertTypeByData("audio/x-aac", "testAACid3.aac");
+        assertTypeByName("audio/x-aac", "x.aac");
+    }
+
     private void assertText(byte[] prefix) throws IOException {
         assertMagic("text/plain", prefix);
     }
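For readers unfamiliar with what a data-based check for the two test files involves, here is a minimal sketch. It assumes the AAC stream uses ADTS framing (a 12-bit `0xFFF` syncword followed by a layer field of `00`) and that an optional ID3v2 tag, whose size is stored as a 28-bit synchsafe integer, may precede it. It does not reproduce the PRONOM-based patterns from the commit, whose XML is not visible in this archive, and the class and method names (`AacSniffDemo`, `id3v2Length`, `looksLikeAdts`) are hypothetical. ID3v2.4 footers are ignored for brevity.

```java
/** Hypothetical standalone demo; not part of Tika. */
class AacSniffDemo {

    /** Returns the length of a leading ID3v2 tag (header plus body), or 0 if there is none. */
    static int id3v2Length(byte[] d) {
        if (d.length < 10 || d[0] != 'I' || d[1] != 'D' || d[2] != '3') {
            return 0;
        }
        // Bytes 6..9 hold a synchsafe size: only the low 7 bits of each byte count
        int size = ((d[6] & 0x7F) << 21) | ((d[7] & 0x7F) << 14)
                | ((d[8] & 0x7F) << 7) | (d[9] & 0x7F);
        return 10 + size; // 10-byte header, then the tag body
    }

    /** Checks for an ADTS frame header, skipping a leading ID3v2 tag if present. */
    static boolean looksLikeAdts(byte[] d) {
        int off = id3v2Length(d);
        if (off + 1 >= d.length) {
            return false; // not enough bytes left for the two-byte check
        }
        int b0 = d[off] & 0xFF;
        int b1 = d[off + 1] & 0xFF;
        // Mask 0xF6 keeps the low 4 sync bits and the 2 layer bits,
        // ignoring the MPEG version bit and the protection_absent bit
        return b0 == 0xFF && (b1 & 0xF6) == 0xF0;
    }

    public static void main(String[] args) {
        byte[] plain = {(byte) 0xFF, (byte) 0xF1, 0x50, (byte) 0x80};
        System.out.println("ADTS without ID3: " + looksLikeAdts(plain));
    }
}
```

The layer check is what separates this from MPEG audio layers I to III, which share the leading sync bits but carry a non-zero layer field.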
[tika] branch TIKA-4060 created (now 8f838c512)
This is an automated email from the ASF dual-hosted git repository.

nick pushed a change to branch TIKA-4060
in repository https://gitbox.apache.org/repos/asf/tika.git

      at 8f838c512 AAC detection tests, ID3 one currently failing...

This branch includes the following new commits:

     new 500900d67 TIKA-4060 Test AAC files, based on testWAV.wav, one without ID3, one with dummy ID3 values
     new ae85b9e4e AAC magic, based on PRONOM patterns found by Gregory Lepore
     new 8f838c512 AAC detection tests, ID3 one currently failing...

The 3 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails. The revisions
listed as "add" were already present in the repository and have only
been added to this reference.
[tika] 02/03: AAC magic, based on PRONOM patterns found by Gregory Lepore
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch TIKA-4060 in repository https://gitbox.apache.org/repos/asf/tika.git commit ae85b9e4e4fb897ec901779fa7301c9316fb9a79 Author: Nick Burch AuthorDate: Wed Jun 7 23:57:46 2023 +0100 AAC magic, based on PRONOM patterns found by Gregory Lepore --- .../src/main/resources/org/apache/tika/mime/tika-mimetypes.xml | 10 ++ 1 file changed, 10 insertions(+) diff --git a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml index 39c1c5891..461ad6128 100644 --- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml +++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml @@ -5625,6 +5625,16 @@ + + + + + + + + + +
[tika] 01/03: TIKA-4060 Test AAC files, based on testWAV.wav, one without ID3, one with dummy ID3 values
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch TIKA-4060 in repository https://gitbox.apache.org/repos/asf/tika.git commit 500900d67ede02e87440caa9f67501d3fe59b770 Author: Nick Burch AuthorDate: Wed Jun 7 23:56:55 2023 +0100 TIKA-4060 Test AAC files, based on testWAV.wav, one without ID3, one with dummy ID3 values --- .../src/test/resources/test-documents/testAAC.aac| Bin 0 -> 779 bytes .../src/test/resources/test-documents/testAACid3.aac | Bin 0 -> 2176 bytes 2 files changed, 0 insertions(+), 0 deletions(-) diff --git a/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testAAC.aac b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testAAC.aac new file mode 100644 index 0..514887020 Binary files /dev/null and b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testAAC.aac differ diff --git a/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testAACid3.aac b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testAACid3.aac new file mode 100644 index 0..82bad4f2c Binary files /dev/null and b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testAACid3.aac differ
[tika] branch main updated (0d7a42f34 -> fc887690a)
This is an automated email from the ASF dual-hosted git repository. nick pushed a change to branch main in repository https://gitbox.apache.org/repos/asf/tika.git from 0d7a42f34 TIKA-3795: update protobuf new 9d928bbf9 TIKA-3810 VTT with UTF-8 BOM new ec4cb612d WebVTT is text based, so check for both line endings on the BOM cases like we do for no-BOM new fc887690a Merge branch 'main' of https://github.com/apache/tika into main The 3 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference. Summary of changes: .../org/apache/tika/mime/tika-mimetypes.xml| 6 .../java/org/apache/tika/mime/TestMimeTypes.java | 4 +++ .../resources/test-documents/testWebVTT_utf8.vtt | 42 ++ 3 files changed, 52 insertions(+) create mode 100644 tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testWebVTT_utf8.vtt
[tika] 03/03: Merge branch 'main' of https://github.com/apache/tika into main
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/tika.git commit fc887690a91a4b689a40a0be11d68dcdeb45a66f Merge: ec4cb612d 0d7a42f34 Author: Nick Burch AuthorDate: Tue Jul 5 11:32:57 2022 +0100 Merge branch 'main' of https://github.com/apache/tika into main tika-parent/pom.xml | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
[tika] 01/03: TIKA-3810 VTT with UTF-8 BOM
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/tika.git commit 9d928bbf9e93131d5021d4e5afddb4ba18df6531 Author: Nick Burch AuthorDate: Tue Jul 5 11:21:17 2022 +0100 TIKA-3810 VTT with UTF-8 BOM --- .../org/apache/tika/mime/tika-mimetypes.xml| 3 ++ .../java/org/apache/tika/mime/TestMimeTypes.java | 4 +++ .../resources/test-documents/testWebVTT_utf8.vtt | 42 ++ 3 files changed, 49 insertions(+) diff --git a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml index 2329c0a3b..7b4ac0d7d 100644 --- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml +++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml @@ -7008,6 +7008,9 @@ + + + diff --git a/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/mime/TestMimeTypes.java b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/mime/TestMimeTypes.java index 2a2936bae..ea2ecbeff 100644 --- a/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/mime/TestMimeTypes.java +++ b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/mime/TestMimeTypes.java @@ -1140,6 +1140,10 @@ public class TestMimeTypes { // With a custom text header assertType("text/vtt", "testWebVTT_header.vtt"); assertTypeByData("text/vtt", "testWebVTT_header.vtt"); + +// With a UTF-8 BOM before the header +assertType("text/vtt", "testWebVTT_utf8.vtt"); +assertTypeByData("text/vtt", "testWebVTT_utf8.vtt"); } @Test diff --git a/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testWebVTT_utf8.vtt b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testWebVTT_utf8.vtt new 
file mode 100644 index 0..722a923fc --- /dev/null +++ b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testWebVTT_utf8.vtt @@ -0,0 +1,42 @@ +WEBVTT + +1 +00:00:00.350 --> 00:00:02.010 +Well, the feedback indicates + +2 +00:00:02.010 --> 00:00:03.880 +that many new hires aren't sure + +3 +00:00:03.880 --> 00:00:05.560 +where to find information related + +4 +00:00:05.560 --> 00:00:09.390 +to HR, benefits and other onboarding processes + +5 +00:00:09.390 --> 00:00:11.050 +or who to ask. + +6 +00:00:11.050 --> 00:00:13.850 +Also, they're not always sure where they belong + +7 +00:00:13.850 --> 00:00:15.740 +in the structure of the company. + +8 +00:00:15.740 --> 00:00:18.470 +Because the company is growing and changing, + +9 +00:00:18.470 --> 00:00:20.890 +even tenured employees are getting confused + +10 +00:00:20.890 --> 00:00:23.663 +about who does what and who reports to whom. +
[tika] 02/03: WebVTT is text based, so check for both line endings on the BOM cases like we do for no-BOM
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/tika.git commit ec4cb612d1cda09907c88f2c5a06cc3cb7a839ef Author: Nick Burch AuthorDate: Tue Jul 5 11:22:59 2022 +0100 WebVTT is text based, so check for both line endings on the BOM cases like we do for no-BOM --- tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml | 3 +++ 1 file changed, 3 insertions(+) diff --git a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml index 7b4ac0d7d..9b24ae3f4 100644 --- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml +++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml @@ -7004,11 +7004,14 @@ + + +
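The intent of the two WebVTT commits (accept an optional UTF-8 BOM, and accept either line ending after the signature) can be sketched in plain Java. This is only an illustration under stated assumptions: the actual magic entries in `tika-mimetypes.xml` are not visible in this archive, the class and method names (`WebVttSniffDemo`, `looksLikeWebVtt`) are hypothetical, and accepting a space or tab after the signature is an assumption made here to cover files with a custom text header.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical standalone demo; not part of Tika. */
class WebVttSniffDemo {

    private static final byte[] SIGNATURE = "WEBVTT".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the data starts with an optional UTF-8 BOM, then "WEBVTT" and a terminator. */
    static boolean looksLikeWebVtt(byte[] d) {
        int off = 0;
        // Skip the UTF-8 byte order mark EF BB BF if it is present
        if (d.length >= 3 && (d[0] & 0xFF) == 0xEF
                && (d[1] & 0xFF) == 0xBB && (d[2] & 0xFF) == 0xBF) {
            off = 3;
        }
        if (d.length < off + SIGNATURE.length + 1) {
            return false; // need the signature plus one following byte
        }
        for (int i = 0; i < SIGNATURE.length; i++) {
            if (d[off + i] != SIGNATURE[i]) {
                return false;
            }
        }
        // Accept both line endings (LF, or CR as the start of CRLF), plus space/tab
        byte next = d[off + SIGNATURE.length];
        return next == '\n' || next == '\r' || next == ' ' || next == '\t';
    }

    public static void main(String[] args) {
        byte[] withBom = "\uFEFFWEBVTT\r\n\r\n1\r\n".getBytes(StandardCharsets.UTF_8);
        System.out.println("BOM + CRLF: " + looksLikeWebVtt(withBom));
    }
}
```

Requiring a terminator after the signature keeps unrelated text that merely begins with the letters "WEBVTT" from matching, at the cost of rejecting a file that consists of the bare signature with no trailing byte.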
[tika] 01/02: Crypto test files - Encrypted version of testRSAKEY.pem, and a PKCS12 wrapped version
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 8ef8d636f87dd571a8dc844d1d7ac503522b13ed
Author: Nick Burch
AuthorDate: Sun Jun 5 15:34:54 2022 +0100

    Crypto test files - Encrypted version of testRSAKEY.pem, and a PKCS12 wrapped version
---
 .../resources/test-documents/testRSAKEYandCERT.p12 | Bin 0 -> 1717 bytes
 .../test/resources/test-documents/testRSAKEYenc.der | Bin 0 -> 610 bytes
 .../test/resources/test-documents/testRSAKEYenc.pem | 18 ++
 3 files changed, 18 insertions(+)

diff --git a/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testRSAKEYandCERT.p12 b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testRSAKEYandCERT.p12
new file mode 100644
index 0..1c536e8fb
Binary files /dev/null and b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testRSAKEYandCERT.p12 differ
diff --git a/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testRSAKEYenc.der b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testRSAKEYenc.der
new file mode 100644
index 0..22c4f86d4
Binary files /dev/null and b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testRSAKEYenc.der differ
diff --git a/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testRSAKEYenc.pem b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testRSAKEYenc.pem
new file mode 100644
index 0..5d8f9057e
--- /dev/null
+++ b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testRSAKEYenc.pem
@@ -0,0 +1,18 @@
+-----BEGIN RSA PRIVATE KEY-----
+Proc-Type: 4,ENCRYPTED
+DEK-Info: DES-EDE3-CBC,9CF7E869357B3BD6
+
+37j1uZrDJ6VIz2VPWJJObW0TtXvU1zCpNo0w/xO6y6Lanq7OutBtQ5+SBSXFRjxa
+v2YsZ9OAw0PkeMRnw2CUN5bn2L5gvSgtzIV14slNDf20FAjemHpPAxFqNg6yFdZx
+O7xbkKK9KgbFAc4lnBEVPEzfLmUZEI0d/vTnYzciyInUcfq2rkCrAmDq0e642whR
+276/ayCNSEXMDpE7N3d7CT43Df5Fk7YsPYvvvyVInV56MSoESmMA093PeMiHcXwG
+VKQCvJpdzxookQpwIYBqGnjahIOhWCRvm9ji17GN+tjaU3kqUzCoKldxSFS/9mAz
+tiF4dDJk2BVF5yMQ7jplnVOW0dYv7wT5yPlbv6vOWeIL1igrM6YjK8hbC87s6kDD
+DwnpPKYBP3MX8lvXJb/cMdeSDuWgjT9jhlDklmq00FzHJBwI+1neTfzSmsI++qi5
+sah3TCXmv/3uuZrTXwq73pjyi4W0VxdH0FPgyspeayn6dP2j3WrRgWQVdwLSEIqi
+wl2kCIyxsHr5LUqwVJn0zSGXQ+Zs/gkoFrz7sriGe9yecLGHzv+8UVCksVWWadDg
+RuknU0EUwuK2nkGg+mazfcxHf6RzqOMwT3oZDvNysE2vwDyC1ExxpFlZPSRP0Uwv
+rCe2aVMzEj2zPLLkPR2dyUyTHgyfWuI/0MYO1Dg/0LvSg1cwyN5cAyHlN6D9aP4B
+SjU+WgSq5VJGBCZcPnYLyz8n4z9cRPsaO6/11p99HytmNh2EOwROv/VlCh5VKW1q
+SSjRrI754tUvyf+pbTtEOI2yvUbmIcql/wskwE/BbPWtQxEPb/v3QQ==
+-----END RSA PRIVATE KEY-----
[tika] branch main updated (5e3dab7ae -> 6bf9ee120)
This is an automated email from the ASF dual-hosted git repository. nick pushed a change to branch main in repository https://gitbox.apache.org/repos/asf/tika.git from 5e3dab7ae TIKA-3751: update aws new 8ef8d636f Crypto test files - Encrypted version of testRSAKEY.pem, and a PKCS12 wrapped version new 6bf9ee120 Tests for encrypted RSA keys in PEM and DER, plus a disabled PKCS12 test pending TIKA-3784 The 2 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference. Summary of changes: .../test/java/org/apache/tika/mime/TestMimeTypes.java | 7 +++ .../resources/test-documents/testRSAKEYandCERT.p12| Bin 0 -> 1717 bytes .../test/resources/test-documents/testRSAKEYenc.der} | Bin .../test/resources/test-documents/testRSAKEYenc.pem | 18 ++ 4 files changed, 25 insertions(+) create mode 100644 tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testRSAKEYandCERT.p12 copy tika-parsers/tika-parsers-standard/{tika-parsers-standard-modules/tika-parser-crypto-module/src/test/resources/test-documents/testRSAKEY.der => tika-parsers-standard-package/src/test/resources/test-documents/testRSAKEYenc.der} (100%) create mode 100644 tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testRSAKEYenc.pem
[tika] 02/02: Tests for encrypted RSA keys in PEM and DER, plus a disabled PKCS12 test pending TIKA-3784
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 6bf9ee120c2845ccdf61207322dcea2373388e75
Author: Nick Burch
AuthorDate: Sun Jun 5 15:48:36 2022 +0100

    Tests for encrypted RSA keys in PEM and DER, plus a disabled PKCS12 test pending TIKA-3784
---
 .../src/test/java/org/apache/tika/mime/TestMimeTypes.java | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/mime/TestMimeTypes.java b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/mime/TestMimeTypes.java
index a90d27272..2a2936bae 100644
--- a/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/mime/TestMimeTypes.java
+++ b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/mime/TestMimeTypes.java
@@ -1163,6 +1163,7 @@ public class TestMimeTypes {
     @Test
     public void testCertificatesKeys() throws Exception {
+        // Certificates can be identified by name alone, or with data
         assertType("application/x-x509-cert; format=pem", "testCERT.pem");
         assertType("application/x-x509-cert; format=der", "testCERT.der");
         assertTypeByData("application/x-x509-cert; format=pem", "testCERT.pem");
@@ -1174,9 +1175,15 @@ public class TestMimeTypes {
         assertTypeByData("application/x-x509-key; format=der", "testRSAKEY.der");
         assertTypeByData("application/x-x509-key; format=pem", "testDSAKEY.pem");
         assertTypeByData("application/x-x509-key; format=der", "testDSAKEY.der");
+        assertTypeByData("application/x-x509-key; format=pem", "testRSAKEYenc.pem"); // pass=tika
+        assertTypeByData("application/x-x509-key; format=der", "testRSAKEYenc.der"); // pass=tika
         // Parameters only have PEM form, always need data
         assertTypeByData("application/x-x509-dsa-parameters", "testDSAPARAMS.pem");
         assertTypeByData("application/x-x509-ec-parameters", "testECPARAMS.pem");
+        // PKCS12 wrappers of Certs+Keys cannot currently be identified
+        // Once solved, see TIKA-3784, ought to work for name or data
+        //assertType("application/x-pkcs12", "testRSAKEYandCERT.p12");
+        //assertTypeByData("application/x-pkcs12", "testRSAKEYandCERT.p12"); // pass=tika
     }
     @Test
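To show why the encrypted PEM key can still be recognised from its data, the following sketch inspects the text that wraps the key rather than the key bytes themselves. It assumes the traditional OpenSSL PEM layout used by `testRSAKEYenc.pem`: a five-dash `BEGIN RSA PRIVATE KEY` boundary line, followed for encrypted keys by a `Proc-Type: 4,ENCRYPTED` header and a blank line before the Base64 body. It is not Tika's detection logic, the class and method names (`PemKeySniffDemo`, `isPemRsaPrivateKey`, `isEncryptedPem`) are hypothetical, and PKCS#8 keys with other boundary labels are out of scope.

```java
/** Hypothetical standalone demo; not part of Tika. */
class PemKeySniffDemo {

    static final String RSA_BEGIN = "-----BEGIN RSA PRIVATE KEY-----";

    /** True if the text opens with the traditional RSA private key boundary. */
    static boolean isPemRsaPrivateKey(String pem) {
        return pem.startsWith(RSA_BEGIN);
    }

    /** True if a Proc-Type header marks the PEM body as encrypted. */
    static boolean isEncryptedPem(String pem) {
        String[] lines = pem.split("\\R"); // split on any line ending
        for (int i = 1; i < lines.length; i++) {
            String line = lines[i].trim();
            if (line.isEmpty()) {
                break; // a blank line ends the header block
            }
            if (line.startsWith("Proc-Type:") && line.endsWith("ENCRYPTED")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String pem = RSA_BEGIN + "\n"
                + "Proc-Type: 4,ENCRYPTED\n"
                + "DEK-Info: DES-EDE3-CBC,9CF7E869357B3BD6\n"
                + "\n"
                + "AAAA\n" // placeholder body, not a real key
                + "-----END RSA PRIVATE KEY-----\n";
        System.out.println("RSA key: " + isPemRsaPrivateKey(pem)
                + ", encrypted: " + isEncryptedPem(pem));
    }
}
```

Because the boundary line is identical for encrypted and unencrypted keys, a check on the boundary alone gives the same answer for both; only the header block distinguishes them.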
[tika] branch main updated: PDP-11 style "Middle Endian" 32 bit read util, as used in the DGN file format
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git

The following commit(s) were added to refs/heads/main by this push:
     new f33d8930e PDP-11 style "Middle Endian" 32 bit read util, as used in the DGN file format
f33d8930e is described below

commit f33d8930e660e61fb04f9232cd7fb6a96cdacdf3
Author: Nick Burch
AuthorDate: Thu Apr 28 11:27:36 2022 +0100

    PDP-11 style "Middle Endian" 32 bit read util, as used in the DGN file format
---
 .../src/main/java/org/apache/tika/io/EndianUtils.java | 19 +++
 .../test/java/org/apache/tika/io/EndianUtilsTest.java | 18 ++
 2 files changed, 37 insertions(+)

diff --git a/tika-core/src/main/java/org/apache/tika/io/EndianUtils.java b/tika-core/src/main/java/org/apache/tika/io/EndianUtils.java
index c09eadceb..242dd8c74 100644
--- a/tika-core/src/main/java/org/apache/tika/io/EndianUtils.java
+++ b/tika-core/src/main/java/org/apache/tika/io/EndianUtils.java
@@ -152,6 +152,25 @@ public class EndianUtils {
         return (ch1 << 24) + (ch2 << 16) + (ch3 << 8) + (ch4);
     }
 
+    /**
+     * Get a PDP-11 style Middle Endian int value from an InputStream
+     *
+     * @param stream the InputStream from which the int is to be read
+     * @return the int (32-bit) value
+     * @throws IOException will be propagated back to the caller
+     * @throws BufferUnderrunException if the stream cannot provide enough bytes
+     */
+    public static int readIntME(InputStream stream) throws IOException, BufferUnderrunException {
+        int ch1 = stream.read();
+        int ch2 = stream.read();
+        int ch3 = stream.read();
+        int ch4 = stream.read();
+        if ((ch1 | ch2 | ch3 | ch4) < 0) {
+            throw new BufferUnderrunException();
+        }
+        return (ch2 << 24) + (ch1 << 16) + (ch4 << 8) + (ch3);
+    }
+
     /**
      * Get a LE long value from an InputStream
      *
diff --git a/tika-core/src/test/java/org/apache/tika/io/EndianUtilsTest.java b/tika-core/src/test/java/org/apache/tika/io/EndianUtilsTest.java
index 8ead23218..906870e73 100644
--- a/tika-core/src/test/java/org/apache/tika/io/EndianUtilsTest.java
+++ b/tika-core/src/test/java/org/apache/tika/io/EndianUtilsTest.java
@@ -72,4 +72,22 @@ public class EndianUtilsTest {
             //swallow
         }
     }
+
+    @Test
+    public void testReadIntME() throws Exception {
+        // Example from https://yamm.finance/wiki/Endianness.html#mwAiw
+        byte[] data = new byte[]{(byte) 0x0b, (byte) 0x0a, (byte) 0x0d, (byte) 0x0c};
+        assertEquals(0x0a0b0c0d, EndianUtils.readIntME(new ByteArrayInputStream(data)));
+
+        data = new byte[]{(byte) 0xFE, (byte) 0xFF, (byte) 0xFC, (byte) 0xFD};
+        assertEquals(0xfffefdfc, EndianUtils.readIntME(new ByteArrayInputStream(data)));
+
+        data = new byte[]{(byte) 0xFF, (byte) 0xFF, (byte) 0xFF};
+        try {
+            EndianUtils.readIntME(new ByteArrayInputStream(data));
+            fail("Should have thrown exception");
+        } catch (EndianUtils.BufferUnderrunException e) {
+            //swallow
+        }
+    }
 }
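The byte ordering in `readIntME` may be easier to follow outside the stream-handling code. The sketch below decodes the same layout from a byte array: two 16-bit words, each stored little-endian, with the most significant word first. It is a standalone illustration rather than the Tika method, works on arrays instead of an `InputStream`, and the names `MiddleEndianDemo` and `decodeIntME` are hypothetical.

```java
/** Hypothetical standalone demo; not part of Tika. */
class MiddleEndianDemo {

    /**
     * Decodes a PDP-11 style middle-endian 32-bit int starting at offset.
     * Stored byte order is B2 B1 B4 B3 for the value 0xB1B2B3B4.
     */
    static int decodeIntME(byte[] b, int offset) {
        // Mask each byte so that sign extension cannot corrupt the result
        int first = b[offset] & 0xFF;
        int second = b[offset + 1] & 0xFF;
        int third = b[offset + 2] & 0xFF;
        int fourth = b[offset + 3] & 0xFF;
        // High word is (second, first); low word is (fourth, third)
        return (second << 24) | (first << 16) | (fourth << 8) | third;
    }

    public static void main(String[] args) {
        byte[] data = {0x0b, 0x0a, 0x0d, 0x0c};
        // prints 0x0a0b0c0d
        System.out.println(String.format("0x%08x", decodeIntME(data, 0)));
    }
}
```

The inputs here match the first two cases in `testReadIntME`, so the results can be compared directly against the assertions in the diff above.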
[tika] 01/03: TIKA-3694 Additional details in HTML on mime type, and per-type json
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/tika.git commit 7768c87b467bb6cc9d01f6b92c45131af3d44fef Author: Nick Burch AuthorDate: Mon Mar 7 22:49:22 2022 + TIKA-3694 Additional details in HTML on mime type, and per-type json --- .../tika/server/core/resource/TikaMimeTypes.java | 59 -- 1 file changed, 56 insertions(+), 3 deletions(-) diff --git a/tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/resource/TikaMimeTypes.java b/tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/resource/TikaMimeTypes.java index 7b1887f..784660c 100644 --- a/tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/resource/TikaMimeTypes.java +++ b/tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/resource/TikaMimeTypes.java @@ -24,13 +24,19 @@ import java.util.Map; import java.util.SortedMap; import java.util.TreeMap; import javax.ws.rs.GET; +import javax.ws.rs.NotFoundException; import javax.ws.rs.Path; +import javax.ws.rs.PathParam; import javax.ws.rs.Produces; import com.fasterxml.jackson.databind.ObjectMapper; +import org.apache.tika.config.TikaConfig; import org.apache.tika.mime.MediaType; import org.apache.tika.mime.MediaTypeRegistry; +import org.apache.tika.mime.MimeType; +import org.apache.tika.mime.MimeTypes; +import org.apache.tika.mime.MimeTypeException; import org.apache.tika.parser.CompositeParser; import org.apache.tika.parser.Parser; import org.apache.tika.server.core.HTMLHelper; @@ -38,6 +44,7 @@ import org.apache.tika.server.core.HTMLHelper; /** * Provides details of all the mimetypes known to Apache Tika, * similar to --list-supported-types with the Tika CLI. + * Can also provide full details on a single known type. 
*/ @Path("/mime-types") public class TikaMimeTypes { @@ -91,6 +98,17 @@ public class TikaMimeTypes { h.append("Super Type: ") .append(type.supertype).append("\n"); } +if (type.mime != null) { + if (!type.mime.getDescription().isEmpty()) { + h.append("Description: ").append(type.mime.getDescription()).append("\n"); + } + if (!type.mime.getAcronym().isEmpty()) { + h.append("Acronym: ").append(type.mime.getAcronym()).append("\n"); + } + if (!type.mime.getExtension().isEmpty()) { + h.append("Default Extension: ").append(type.mime.getExtension()).append("\n"); + } +} if (type.parser != null) { h.append("Parser: ").append(type.parser).append("\n"); @@ -124,6 +142,27 @@ public class TikaMimeTypes { } @GET +@Path("/{type}/{subtype}") +@Produces(javax.ws.rs.core.MediaType.APPLICATION_JSON) +public String getMimeTypeDetailsJSON(@PathParam("type") String typePart, + @PathParam("subtype") String subtype) throws IOException { +MediaTypeDetails type = getMediaType(typePart, subtype); +Map details = new HashMap<>(); + +details.put("type", type.type.toString()); +details.put("alias", copyToStringArray(type.aliases)); +if (type.supertype != null) { + details.put("supertype", type.supertype.toString()); +} +if (type.parser != null) { + details.put("parser", type.parser); +} +// TODO Additional details from Mime + +return new ObjectMapper().writerWithDefaultPrettyPrinter().writeValueAsString(details); +} + +@GET @Produces("text/plain") public String getMimeTypesPlain() { StringBuffer text = new StringBuffer(); @@ -147,10 +186,19 @@ public class TikaMimeTypes { return text.toString(); } +protected MediaTypeDetails getMediaType(String type, String subtype) throws NotFoundException { + MediaType mt = MediaType.parse(type+"/"+subtype); + for (MediaTypeDetails mtd : getMediaTypes()) { + if (mtd.type.equals(mt)) return mtd; + } + throw new NotFoundException("No Media Type registered in Tika for " + mt); +} protected List getMediaTypes() { -MediaTypeRegistry registry = 
TikaResource.getConfig().getMediaTypeRegistry(); -Map parsers = -((CompositeParser) TikaResource.getConfig().getParser()).getParsers(); +TikaConfig config = TikaResource.getConfig(); +MimeTypes mimeTypes = config.getMimeRepository(); +MediaTypeRegistry registry = config.getMediaTypeRegistry(); +Map parsers = ((Comp
[tika] branch main updated (eda4427 -> d583973)
This is an automated email from the ASF dual-hosted git repository. nick pushed a change to branch main in repository https://gitbox.apache.org/repos/asf/tika.git. from eda4427 Merge branch 'TIKA-3689' into main new 7768c87 TIKA-3694 Additional details in HTML on mime type, and per-type json new c54dd20 TIKA-3694 Per-Type HTML page, and more info in the JSON new d583973 TIKA-3694 Unit test for type-specific page The 3 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference. Summary of changes: .../tika/server/core/resource/TikaMimeTypes.java | 135 - .../apache/tika/server/core/TikaMimeTypesTest.java | 20 ++- 2 files changed, 147 insertions(+), 8 deletions(-)
[tika] 03/03: TIKA-3694 Unit test for type-specific page
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/tika.git commit d583973f829aa2b48b8a08cb8c46927a3446ca7a Author: Nick Burch AuthorDate: Mon Mar 7 23:30:15 2022 + TIKA-3694 Unit test for type-specific page --- .../tika/server/core/resource/TikaMimeTypes.java | 4 +++- .../org/apache/tika/server/core/TikaMimeTypesTest.java | 18 -- 2 files changed, 19 insertions(+), 3 deletions(-) diff --git a/tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/resource/TikaMimeTypes.java b/tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/resource/TikaMimeTypes.java index cfb42b4..1dc0462 100644 --- a/tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/resource/TikaMimeTypes.java +++ b/tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/resource/TikaMimeTypes.java @@ -277,7 +277,9 @@ public class TikaMimeTypes { try { details.mime = mimeTypes.getRegisteredMimeType(type.toString()); -} catch (MimeTypeException e) {} +} catch (MimeTypeException e) { + // Ignore if invalid +} MediaType supertype = registry.getSupertype(type); if (supertype != null && !MediaType.OCTET_STREAM.equals(supertype)) { diff --git a/tika-server/tika-server-core/src/test/java/org/apache/tika/server/core/TikaMimeTypesTest.java b/tika-server/tika-server-core/src/test/java/org/apache/tika/server/core/TikaMimeTypesTest.java index f373515..dc8a0c1 100644 --- a/tika-server/tika-server-core/src/test/java/org/apache/tika/server/core/TikaMimeTypesTest.java +++ b/tika-server/tika-server-core/src/test/java/org/apache/tika/server/core/TikaMimeTypesTest.java @@ -70,7 +70,8 @@ public class TikaMimeTypesTest extends CXFTestBase { assertContains("application/xml", text); assertContains("video/x-ogm", text); -assertContains("text/plain", text); +assertContains("", text); +assertContains("/text/plain\">", text); assertContains("name=\"text/plain", 
text); assertContains("Super Type: video/ogg", text); @@ -80,5 +81,18 @@ public class TikaMimeTypesTest extends CXFTestBase { assertContains("Extension: .ogg", text); } -// TODO Type Specific +@Test +public void testGetHTMLDetails() throws Exception { + Response response = + WebClient.create(endPoint + MIMETYPES_PATH + "/application/cbor") + .type("text/html").accept("text/html").get(); + + String text = getStringFromInputStream((InputStream) response.getEntity()); + assertNotFound("text/plain", text); + assertContains("application/cbor", text); + + assertContains("Acronym: CBOR", text); + assertContains("Link: http://tools.ietf.org/html/rfc7049", text); + assertContains("Extension: .cbor", text); +} }
[tika] branch branch_1x updated: TIKA-3373 Add the *.yml extension for YAML, which is commonly used, along with aliases for popular alternate mimetypes for it
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch branch_1x in repository https://gitbox.apache.org/repos/asf/tika.git The following commit(s) were added to refs/heads/branch_1x by this push: new f7d5119 TIKA-3373 Add the *.yml extension for YAML, which is commonly used, along with aliases for popular alternate mimetypes for it f7d5119 is described below commit f7d5119f496578bfff8bebc470d9fe8f9fdc3860 Author: Nick Burch AuthorDate: Tue Apr 27 13:05:34 2021 +0100 TIKA-3373 Add the *.yml extension for YAML, which is commonly used, along with aliases for popular alternate mimetypes for it --- .../src/main/resources/org/apache/tika/mime/tika-mimetypes.xml | 7 +++ 1 file changed, 7 insertions(+) diff --git a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml index 0ea388b..87e50b7 100644 --- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml +++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml @@ -7166,7 +7166,14 @@ <_comment>YAML source code + + + + + + +
[tika] branch main updated: TIKA-3373 Add the *.yml extension for YAML, which is commonly used, along with aliases for popular alternate mimetypes for it
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/tika.git The following commit(s) were added to refs/heads/main by this push: new 60c0aae TIKA-3373 Add the *.yml extension for YAML, which is commonly used, along with aliases for popular alternate mimetypes for it 60c0aae is described below commit 60c0aaebf724f078811937c45bdca83a797901d8 Author: Nick Burch AuthorDate: Tue Apr 27 13:05:34 2021 +0100 TIKA-3373 Add the *.yml extension for YAML, which is commonly used, along with aliases for popular alternate mimetypes for it --- .../src/main/resources/org/apache/tika/mime/tika-mimetypes.xml | 7 +++ 1 file changed, 7 insertions(+) diff --git a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml index ef0ea14..6c2ea14 100644 --- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml +++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml @@ -7404,7 +7404,14 @@ <_comment>YAML source code + + + + + + +
[tika] branch branch_1x updated: Changelog update
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch branch_1x in repository https://gitbox.apache.org/repos/asf/tika.git The following commit(s) were added to refs/heads/branch_1x by this push: new a1ec3fd Changelog update a1ec3fd is described below commit a1ec3fd2c00605864b5c4543d4943abb151c7ef0 Author: Nick Burch AuthorDate: Sun Mar 14 20:55:28 2021 + Changelog update --- CHANGES.txt | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/CHANGES.txt b/CHANGES.txt index 1b2bd7f..6d56a7e 100644 --- a/CHANGES.txt +++ b/CHANGES.txt @@ -9,7 +9,10 @@ Release 1.26 - 03/09/2021 * Fix parsing of emails attached to other emails in PST files (TIKA-3004). * MP3 parser should output the xmpDM:duration metadata as seconds not - milliseconds, consistent with the other Audio and Video parsers (TIKA-3318) + milliseconds, consistent with the other Audio and Video parsers (TIKA-3318). + + * MP4 parser check if any of the Compatible Brands match when identifying + the subtype (TIKA-3310). Release 1.25 - 11/25/2020
[tika] branch branch_1x updated: Backport to 1.x - TIKA-3310 Check if MP4 file's compatible brands match any of the expected values, from Peter Kronenberg
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch branch_1x in repository https://gitbox.apache.org/repos/asf/tika.git The following commit(s) were added to refs/heads/branch_1x by this push: new b0242ee Backport to 1.x - TIKA-3310 Check if MP4 file's compatible brands match any of the expected values, from Peter Kronenberg b0242ee is described below commit b0242ee617857fe85db2ba5ce186f6c9965b67bd Author: Nick Burch AuthorDate: Sun Mar 14 20:53:38 2021 + Backport to 1.x - TIKA-3310 Check if MP4 file's compatible brands match any of the expected values, from Peter Kronenberg --- .../java/org/apache/tika/parser/mp4/MP4Parser.java | 26 -- 1 file changed, 19 insertions(+), 7 deletions(-) diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/mp4/MP4Parser.java b/tika-parsers/src/main/java/org/apache/tika/parser/mp4/MP4Parser.java index 933c53c..f06e556 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/mp4/MP4Parser.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/mp4/MP4Parser.java @@ -70,6 +70,7 @@ import java.util.HashMap; import java.util.List; import java.util.Locale; import java.util.Map; +import java.util.Optional; import java.util.Set; /** @@ -132,14 +133,25 @@ public class MP4Parser extends AbstractParser { // Grab the file type box FileTypeBox fileType = getOrNull(isoFile, FileTypeBox.class); if (fileType != null) { -// Identify the type -MediaType type = MediaType.application("mp4"); -for (Map.Entry<MediaType, List<String>> e : typesMap.entrySet()) { -if (e.getValue().contains(fileType.getMajorBrand())) { -type = e.getKey(); -break; -} +// Identify the type based on the major brand +Optional<MediaType> typeHolder = typesMap.entrySet() +.stream() +.filter(e -> e.getValue().contains(fileType.getMajorBrand())) +.findFirst() +.map(Map.Entry::getKey); + +if (!typeHolder.isPresent()) { +// If no match for major brand, see if any of the compatible brands match +typeHolder = typesMap.entrySet() +.stream() +.filter(e ->
e.getValue() +.stream() + .anyMatch(fileType.getCompatibleBrands()::contains)) +.findFirst() +.map(Map.Entry::getKey); } + +MediaType type = typeHolder.orElse(MediaType.application("mp4")); metadata.set(Metadata.CONTENT_TYPE, type.toString()); if (type.getType().equals("audio")) {
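The two-stage lookup in this commit (major brand first, compatible brands only as a fallback, then application/mp4 as the default) can be sketched outside of Tika roughly as below. This is only an illustration under stated assumptions: the class name Mp4BrandSketch, the identify method and the small brand table are hypothetical stand-ins, plain strings replace MediaType, and the real table is MP4Parser's typesMap.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for the lookup in MP4Parser; not Tika code.
final class Mp4BrandSketch {
    // Illustrative brand table only; the real mapping is MP4Parser's typesMap.
    static final Map<String, List<String>> TYPES = new LinkedHashMap<>();
    static {
        TYPES.put("audio/mp4", Arrays.asList("M4A ", "M4B "));
        TYPES.put("video/quicktime", Arrays.asList("qt  "));
        TYPES.put("video/3gpp", Arrays.asList("3gp4", "3gp5"));
    }

    static String identify(String majorBrand, List<String> compatibleBrands) {
        // Stage 1: look for a type whose brand list contains the major brand
        Optional<String> type = TYPES.entrySet().stream()
                .filter(e -> e.getValue().contains(majorBrand))
                .findFirst()
                .map(Map.Entry::getKey);

        if (!type.isPresent()) {
            // Stage 2: fall back to the first type matching any compatible brand
            type = TYPES.entrySet().stream()
                    .filter(e -> e.getValue().stream().anyMatch(compatibleBrands::contains))
                    .findFirst()
                    .map(Map.Entry::getKey);
        }
        // Default when neither stage matches
        return type.orElse("application/mp4");
    }

    public static void main(String[] args) {
        System.out.println(identify("M4A ", Arrays.asList("isom")));         // audio/mp4
        System.out.println(identify("isom", Arrays.asList("isom", "3gp4"))); // video/3gpp
        System.out.println(identify("isom", Arrays.asList("iso2")));         // application/mp4
    }
}
```

Checking the major brand in a separate first pass means a file whose major brand is recognised is never reclassified by one of its compatible brands.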
[tika] branch main updated (356cf44 -> 4bd931d)
This is an automated email from the ASF dual-hosted git repository. nick pushed a change to branch main in repository https://gitbox.apache.org/repos/asf/tika.git. from 356cf44 TIKA-3318 Document the units of xmpDM:duration as seconds by default new d80dc36 TIKA-3310 Check if MP4 file's compatible brands match any of the expected values new 187fd47 TIKA-3310 Check major brand before checking compatible brands new 4551f7d Separate search for major brand and compatible brands new 4bd931d Merge pull request #410 from peterkronenberg/main The 5072 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference. Summary of changes: .../java/org/apache/tika/parser/mp4/MP4Parser.java | 35 +++--- 1 file changed, 24 insertions(+), 11 deletions(-)
[tika] branch branch_1x updated: Changelog update
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch branch_1x in repository https://gitbox.apache.org/repos/asf/tika.git The following commit(s) were added to refs/heads/branch_1x by this push: new a4c9257 Changelog update a4c9257 is described below commit a4c92579d2a012e0296f057b70dd9fb2d0842445 Author: Nick Burch AuthorDate: Sun Mar 14 20:22:59 2021 + Changelog update --- CHANGES.txt | 3 +++ 1 file changed, 3 insertions(+) diff --git a/CHANGES.txt b/CHANGES.txt index 57ca53c..1b2bd7f 100644 --- a/CHANGES.txt +++ b/CHANGES.txt @@ -8,6 +8,9 @@ Release 1.26 - 03/09/2021 * Fix parsing of emails attached to other emails in PST files (TIKA-3004). + * MP3 parser should output the xmpDM:duration metadata as seconds not + milliseconds, consistent with the other Audio and Video parsers (TIKA-3318) + Release 1.25 - 11/25/2020 * Fix inconsistent license in xmpcore (TIKA-3204).
[tika] 02/02: TIKA-3318 MP3 parser should output the xmpDM:duration metadata as seconds not milliseconds
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch branch_1x in repository https://gitbox.apache.org/repos/asf/tika.git commit 21b3cf8b5a209ab6cf0176d8bc55e640fdc8c351 Author: Nick Burch AuthorDate: Sun Mar 14 20:20:14 2021 + TIKA-3318 MP3 parser should output the xmpDM:duration metadata as seconds not milliseconds --- .../src/main/java/org/apache/tika/parser/mp3/Mp3Parser.java | 13 - .../test/java/org/apache/tika/parser/mp3/Mp3ParserTest.java | 6 +++--- 2 files changed, 11 insertions(+), 8 deletions(-) diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/mp3/Mp3Parser.java b/tika-parsers/src/main/java/org/apache/tika/parser/mp3/Mp3Parser.java index 52dad7c..c14b300 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/mp3/Mp3Parser.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/mp3/Mp3Parser.java @@ -69,10 +69,10 @@ public class Mp3Parser extends AbstractParser { // Create handlers for the various kinds of ID3 tags ID3TagsAndAudio audioAndTags = getAllTagHandlers(stream, handler); -//process as much metadata as possible before -//writing to xhtml +// Before we start on the XHTML output, process and store +// as much metadata as possible if (audioAndTags.duration > 0) { -metadata.set(XMPDM.DURATION, audioAndTags.duration); + metadata.set(XMPDM.DURATION, audioAndTags.durationSeconds()); } if (audioAndTags.audio != null) { @@ -151,7 +151,7 @@ public class Mp3Parser extends AbstractParser { xhtml.element("p", tag.getYear()); xhtml.element("p", tag.getGenre()); } -xhtml.element("p", String.valueOf(audioAndTags.duration)); +xhtml.element("p", String.valueOf(audioAndTags.durationSeconds())); for (String comment : comments) { xhtml.element("p", comment); } @@ -250,7 +250,10 @@ public class Mp3Parser extends AbstractParser { private ID3Tags[] tags; private AudioFrame audio; private LyricsHandler lyrics; -private float duration; +private float duration; // Milliseconds +private float 
durationSeconds() { + return duration / 1000; +} } } diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/mp3/Mp3ParserTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/mp3/Mp3ParserTest.java index e670809..01fa4f7 100644 --- a/tika-parsers/src/test/java/org/apache/tika/parser/mp3/Mp3ParserTest.java +++ b/tika-parsers/src/test/java/org/apache/tika/parser/mp3/Mp3ParserTest.java @@ -39,7 +39,7 @@ public class Mp3ParserTest extends TikaTest { */ private static void checkDuration(Metadata metadata, int expected) { assertEquals("Wrong duration", expected, -Math.round(Float.valueOf(metadata.get(XMPDM.DURATION)) / 1000)); +Math.round(Float.valueOf(metadata.get(XMPDM.DURATION)))); } /** @@ -126,7 +126,7 @@ public class Mp3ParserTest extends TikaTest { String content = getXML("testMP3id3v1.mp3").xml; assertContains("
[tika] branch branch_1x updated (02ed830 -> 21b3cf8)
This is an automated email from the ASF dual-hosted git repository. nick pushed a change to branch branch_1x in repository https://gitbox.apache.org/repos/asf/tika.git. from 02ed830 TIKA-3244: update pax-url-aether new 8081e6d TIKA-3318 Document the units of xmpDM:duration as seconds by default new 21b3cf8 TIKA-3318 MP3 parser should output the xmpDM:duration metadata as seconds not milliseconds The 2 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference. Summary of changes: tika-core/src/main/java/org/apache/tika/metadata/XMPDM.java | 1 + .../src/main/java/org/apache/tika/parser/mp3/Mp3Parser.java | 13 - .../test/java/org/apache/tika/parser/mp3/Mp3ParserTest.java | 6 +++--- 3 files changed, 12 insertions(+), 8 deletions(-)
[tika] 01/02: TIKA-3318 Document the units of xmpDM:duration as seconds by default
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch branch_1x in repository https://gitbox.apache.org/repos/asf/tika.git commit 8081e6da8d34ef9675638699eb2ec6d6145c89d4 Author: Nick Burch AuthorDate: Sun Mar 14 19:24:43 2021 + TIKA-3318 Document the units of xmpDM:duration as seconds by default --- tika-core/src/main/java/org/apache/tika/metadata/XMPDM.java | 1 + 1 file changed, 1 insertion(+) diff --git a/tika-core/src/main/java/org/apache/tika/metadata/XMPDM.java b/tika-core/src/main/java/org/apache/tika/metadata/XMPDM.java index ce78145..60a3d1e 100644 --- a/tika-core/src/main/java/org/apache/tika/metadata/XMPDM.java +++ b/tika-core/src/main/java/org/apache/tika/metadata/XMPDM.java @@ -173,6 +173,7 @@ public interface XMPDM { /** * "The duration of the media file." + * Value is in Seconds, unless xmpDM:scale is also set. */ Property DURATION = Property.externalReal("xmpDM:duration");
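For readers consuming xmpDM:duration, the unit convention documented here can be illustrated with a small stand-alone sketch. Everything in it is an assumption made for illustration: DurationSketch, toSeconds and roundedSeconds are hypothetical names, the sample value is made up, and the stored value is modelled as a plain string rather than a Tika Metadata entry. It mirrors the pattern in TIKA-3318, where a duration tracked internally in milliseconds is converted once before being stored, and the reader rounds the stored value without dividing again.

```java
// Hypothetical illustration of the seconds convention; not Tika code.
final class DurationSketch {
    // Convert an internally tracked millisecond duration to seconds
    static float toSeconds(float durationMillis) {
        return durationMillis / 1000;
    }

    // Read back a stored duration, already in seconds, rounded to whole seconds
    static int roundedSeconds(String storedValue) {
        return Math.round(Float.valueOf(storedValue));
    }

    public static void main(String[] args) {
        float millis = 2455.51f;                            // made-up sample value
        String stored = String.valueOf(toSeconds(millis));  // what a parser would store
        System.out.println(stored);
        System.out.println(roundedSeconds(stored));
    }
}
```

The sketch ignores xmpDM:scale; per the Javadoc added in this commit, a value accompanied by that property is not necessarily in seconds.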
[tika] branch main updated: TIKA-3318 Document the units of xmpDM:duration as seconds by default
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/tika.git The following commit(s) were added to refs/heads/main by this push: new 356cf44 TIKA-3318 Document the units of xmpDM:duration as seconds by default 356cf44 is described below commit 356cf44e6c426ad4411bb2c8a945597dbac4543c Author: Nick Burch AuthorDate: Sun Mar 14 19:24:43 2021 + TIKA-3318 Document the units of xmpDM:duration as seconds by default --- tika-core/src/main/java/org/apache/tika/metadata/XMPDM.java | 1 + 1 file changed, 1 insertion(+) diff --git a/tika-core/src/main/java/org/apache/tika/metadata/XMPDM.java b/tika-core/src/main/java/org/apache/tika/metadata/XMPDM.java index ce78145..60a3d1e 100644 --- a/tika-core/src/main/java/org/apache/tika/metadata/XMPDM.java +++ b/tika-core/src/main/java/org/apache/tika/metadata/XMPDM.java @@ -173,6 +173,7 @@ public interface XMPDM { /** * "The duration of the media file." + * Value is in Seconds, unless xmpDM:scale is also set. */ Property DURATION = Property.externalReal("xmpDM:duration");
[tika] branch main updated: TIKA-3318 MP3 parser should output the xmpDM:duration metadata as seconds not milliseconds
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/tika.git The following commit(s) were added to refs/heads/main by this push: new 31da853 TIKA-3318 MP3 parser should output the xmpDM:duration metadata as seconds not milliseconds 31da853 is described below commit 31da853a5779806b1b83f4709e90ac2e3ac2688e Author: Nick Burch AuthorDate: Sun Mar 14 19:07:02 2021 + TIKA-3318 MP3 parser should output the xmpDM:duration metadata as seconds not milliseconds --- .../main/java/org/apache/tika/parser/mp3/Mp3Parser.java| 14 -- .../java/org/apache/tika/parser/mp3/Mp3ParserTest.java | 8 2 files changed, 12 insertions(+), 10 deletions(-) diff --git a/tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-audiovideo-module/src/main/java/org/apache/tika/parser/mp3/Mp3Parser.java b/tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-audiovideo-module/src/main/java/org/apache/tika/parser/mp3/Mp3Parser.java index 7a02473..11a7d4b 100644 --- a/tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-audiovideo-module/src/main/java/org/apache/tika/parser/mp3/Mp3Parser.java +++ b/tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-audiovideo-module/src/main/java/org/apache/tika/parser/mp3/Mp3Parser.java @@ -70,10 +70,10 @@ public class Mp3Parser extends AbstractParser { // Create handlers for the various kinds of ID3 tags ID3TagsAndAudio audioAndTags = getAllTagHandlers(stream, handler); -//process as much metadata as possible before -//writing to xhtml +// Before we start on the XHTML output, process and store +// as much metadata as possible if (audioAndTags.duration > 0) { -metadata.set(XMPDM.DURATION, audioAndTags.duration); + metadata.set(XMPDM.DURATION, audioAndTags.durationSeconds()); } if (audioAndTags.audio != null) { @@ -152,7 +152,7 @@ public class Mp3Parser extends AbstractParser { 
xhtml.element("p", tag.getYear()); xhtml.element("p", tag.getGenre()); } -xhtml.element("p", String.valueOf(audioAndTags.duration)); +xhtml.element("p", String.valueOf(audioAndTags.durationSeconds())); for (String comment : comments) { xhtml.element("p", comment); } @@ -261,7 +261,9 @@ public class Mp3Parser extends AbstractParser { private ID3Tags[] tags; private AudioFrame audio; private LyricsHandler lyrics; -private float duration; +private float duration; // Milliseconds +private float durationSeconds() { + return duration / 1000; +} } - } diff --git a/tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-audiovideo-module/src/test/java/org/apache/tika/parser/mp3/Mp3ParserTest.java b/tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-audiovideo-module/src/test/java/org/apache/tika/parser/mp3/Mp3ParserTest.java index f952c84..ed0b16c 100644 --- a/tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-audiovideo-module/src/test/java/org/apache/tika/parser/mp3/Mp3ParserTest.java +++ b/tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-audiovideo-module/src/test/java/org/apache/tika/parser/mp3/Mp3ParserTest.java @@ -38,8 +38,8 @@ public class Mp3ParserTest extends TikaTest { * @param expected the expected duration, rounded as seconds */ private static void checkDuration(Metadata metadata, int expected) { -assertEquals("Wrong duration", expected, -Math.round(Float.valueOf(metadata.get(XMPDM.DURATION)) / 1000)); +assertEquals("Wrong duration", expected, +Math.round(Float.valueOf(metadata.get(XMPDM.DURATION)))); } /** @@ -124,7 +124,7 @@ public class Mp3ParserTest extends TikaTest { String content = getXML("testMP3id3v1.mp3").xml; assertContains("
[tika] 04/05: Split the Certificate and Key mimetypes into DER and PEM subtypes, add test EC files. TIKA-3205
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch branch_1x in repository https://gitbox.apache.org/repos/asf/tika.git commit 51829d630360060d2fff84e8dc2b1346834ecfda Author: Nick Burch AuthorDate: Tue Sep 29 16:48:40 2020 +0100 Split the Certificate and Key mimetypes into DER and PEM subtypes, add test EC files. TIKA-3205 --- .../org/apache/tika/mime/tika-mimetypes.xml| 41 + .../java/org/apache/tika/mime/TestMimeTypes.java | 19 +++--- .../test/resources/test-documents/testDSAKEY.der | Bin 0 -> 834 bytes .../test/resources/test-documents/testDSAKEY.pem | 15 .../resources/test-documents/testDSAPARAMS.pem | 14 +++ .../test/resources/test-documents/testECKEY.der| Bin 0 -> 167 bytes .../test/resources/test-documents/testECKEY.pem| 6 +++ .../test/resources/test-documents/testECPARAMS.pem | 3 ++ 8 files changed, 84 insertions(+), 14 deletions(-) diff --git a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml index 96301aa..792448b 100644 --- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml +++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml @@ -4538,13 +4538,19 @@ - - + + + + - - + + + + + + @@ -4559,9 +4565,12 @@ + + + + - @@ -4569,16 +4578,32 @@ - - + + + + + + + + + + - + + + + + + + diff --git a/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java b/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java index 80c60a2..c765dae 100644 --- a/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java +++ b/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java @@ -1144,13 +1144,20 @@ public class TestMimeTypes { @Test public void testCertificatesKeys() throws Exception { -assertType("application/x-x509-cert", "testCERT.pem"); -assertType("application/x-x509-cert", "testCERT.der"); -assertTypeByData("application/x-x509-cert", "testCERT.pem"); 
-assertTypeByData("application/x-x509-cert", "testCERT.der"); +assertType("application/x-x509-cert; format=pem", "testCERT.pem"); +assertType("application/x-x509-cert; format=der", "testCERT.der"); +assertTypeByData("application/x-x509-cert; format=pem", "testCERT.pem"); +assertTypeByData("application/x-x509-cert; format=der", "testCERT.der"); // Keys need the data to identify, name isn't enough -assertTypeByData("application/x-x509-key", "testRSAKEY.pem"); -assertTypeByData("application/x-x509-key", "testRSAKEY.der"); +assertTypeByData("application/x-x509-key; format=pem", "testECKEY.pem"); +assertTypeByData("application/x-x509-key; format=der", "testECKEY.der"); +assertTypeByData("application/x-x509-key; format=pem", "testRSAKEY.pem"); +assertTypeByData("application/x-x509-key; format=der", "testRSAKEY.der"); +assertTypeByData("application/x-x509-key; format=pem", "testDSAKEY.pem"); +assertTypeByData("application/x-x509-key; format=der", "testDSAKEY.der"); +// Parameters only have PEM form, always need data +assertTypeByData("application/x-x509-dsa-parameters", "testDSAPARAMS.pem"); +assertTypeByData("application/x-x509-ec-parameters", "testECPARAMS.pem"); } @Test diff --git a/tika-parsers/src/test/resources/test-documents/testDSAKEY.der b/tika-parsers/src/test/resources/test-documents/testDSAKEY.der new file mode 100644 index 000..9ed2eb9 Binary files /dev/null and b/tika-parsers/src/test/resources/test-documents/testDSAKEY.der differ diff --git a/tika-parsers/src/test/resources/test-documents/testDSAKEY.pem b/tika-parsers/src/test/resources/test-documents/testDSAKEY.pem new file mode 100644 index 000..2b8781a --- /dev/null +++ b/tika-parsers/src/test/resources/test-documents/testDSAKEY.pem @@ -0,0 +1,15 @@ +-BEGIN PRIVATE KEY- +MIICTQIBADCCAi0GByqGSM44BAEwggIgAoIBAQDRXU0Be5k0MI3skB6K0PhyptBh +WSJ1l+NVSOX7wpXC37upcH7a0ZCfU9RyWqcX9dQFw+TWjlH2ANll/FO4osXkkJVY +oylJ+p0599v6WRPBS/yQpKuvfqEm5HA78J8ILhnyCCw8hqdlrADBOMGf7tGF5Agw +hEZJdtHjYRzPWzY0eogptg3wQPd/
[tika] branch branch_1x updated (9736af8 -> 1fce089)
This is an automated email from the ASF dual-hosted git repository. nick pushed a change to branch branch_1x in repository https://gitbox.apache.org/repos/asf/tika.git. from 9736af8 Fix TIKA-3196 (#364) new 5c2c4a2 Add test certificate and key for TIKA-3205 new 28ec71d TIKA-3205 Add magic for X509 PEM certificate, and tweak default type new e952877 Add some more DER magic for certificates, and add tests TIKA-3205 new 51829d6 Split the Certificate and Key mimetypes into DER and PEM subtypes, add test EC files. TIKA-3205 new 1fce089 Make the DER private key mostly-match a bit more specific The 5 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference. Summary of changes: .../org/apache/tika/mime/tika-mimetypes.xml| 72 - .../java/org/apache/tika/TikaDetectionTest.java| 5 +- .../java/org/apache/tika/mime/TestMimeTypes.java | 18 ++ .../src/test/resources/test-documents/testCERT.der | Bin 0 -> 702 bytes .../src/test/resources/test-documents/testCERT.pem | 17 + .../test/resources/test-documents/testDSAKEY.der | Bin 0 -> 834 bytes .../test/resources/test-documents/testDSAKEY.pem | 15 + .../resources/test-documents/testDSAPARAMS.pem | 14 .../test/resources/test-documents/testECKEY.der| Bin 0 -> 167 bytes .../test/resources/test-documents/testECKEY.pem| 6 ++ .../test/resources/test-documents/testECPARAMS.pem | 3 + .../test/resources/test-documents/testRSAKEY.der | Bin 0 -> 610 bytes .../test/resources/test-documents/testRSAKEY.pem | 16 + 13 files changed, 162 insertions(+), 4 deletions(-) create mode 100644 tika-parsers/src/test/resources/test-documents/testCERT.der create mode 100644 tika-parsers/src/test/resources/test-documents/testCERT.pem create mode 100644 tika-parsers/src/test/resources/test-documents/testDSAKEY.der create mode 100644 tika-parsers/src/test/resources/test-documents/testDSAKEY.pem create 
mode 100644 tika-parsers/src/test/resources/test-documents/testDSAPARAMS.pem create mode 100644 tika-parsers/src/test/resources/test-documents/testECKEY.der create mode 100644 tika-parsers/src/test/resources/test-documents/testECKEY.pem create mode 100644 tika-parsers/src/test/resources/test-documents/testECPARAMS.pem create mode 100644 tika-parsers/src/test/resources/test-documents/testRSAKEY.der create mode 100644 tika-parsers/src/test/resources/test-documents/testRSAKEY.pem
[tika] 02/05: TIKA-3205 Add magic for X509 PEM certificate, and tweak default type
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch branch_1x in repository https://gitbox.apache.org/repos/asf/tika.git commit 28ec71d2d3afa52e84fa16ee5df289dd696980ed Author: Nick Burch AuthorDate: Tue Sep 29 15:49:14 2020 +0100 TIKA-3205 Add magic for X509 PEM certificate, and tweak default type --- .../org/apache/tika/mime/tika-mimetypes.xml| 26 +- .../java/org/apache/tika/TikaDetectionTest.java| 5 +++-- 2 files changed, 28 insertions(+), 3 deletions(-) diff --git a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml index b4981a5..a0d172d 100644 --- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml +++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml @@ -4534,10 +4534,34 @@ - + + + + + + + + + + + + + + + + + + + + + + + diff --git a/tika-core/src/test/java/org/apache/tika/TikaDetectionTest.java b/tika-core/src/test/java/org/apache/tika/TikaDetectionTest.java index 8f14a2b..1908489 100644 --- a/tika-core/src/test/java/org/apache/tika/TikaDetectionTest.java +++ b/tika-core/src/test/java/org/apache/tika/TikaDetectionTest.java @@ -624,8 +624,9 @@ public class TikaDetectionTest { assertEquals("application/x-texinfo", tika.detect("x.texi")); assertEquals("application/x-ustar", tika.detect("x.ustar")); assertEquals("application/x-wais-source", tika.detect("x.src")); -assertEquals("application/x-x509-ca-cert", tika.detect("x.der")); -assertEquals("application/x-x509-ca-cert", tika.detect("x.crt")); +// Differ from httpd - use a common parent for CA and User certs +//assertEquals("application/x-x509-ca-cert", tika.detect("x.der")); +//assertEquals("application/x-x509-ca-cert", tika.detect("x.crt")); assertEquals("application/x-xfig", tika.detect("x.fig")); assertEquals("application/x-xpinstall", tika.detect("x.xpi")); assertEquals("application/xenc+xml", tika.detect("x.xenc"));
[tika] 03/05: Add some more DER magic for certificates, and add tests TIKA-3205
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch branch_1x in repository https://gitbox.apache.org/repos/asf/tika.git commit e95287761da40c72f45390d1b892d8cdef33c216 Author: Nick Burch AuthorDate: Tue Sep 29 16:23:08 2020 +0100 Add some more DER magic for certificates, and add tests TIKA-3205 --- .../org/apache/tika/mime/tika-mimetypes.xml| 28 ++ .../java/org/apache/tika/mime/TestMimeTypes.java | 11 + 2 files changed, 34 insertions(+), 5 deletions(-) diff --git a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml index a0d172d..96301aa 100644 --- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml +++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml @@ -4546,10 +4546,16 @@ - - + mask="0xFFF8" offset="0"> + + + + + + @@ -4557,8 +4563,20 @@ + + + + + + - + + + + + + + diff --git a/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java b/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java index d05d080..80c60a2 100644 --- a/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java +++ b/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java @@ -1143,6 +1143,17 @@ public class TestMimeTypes { } @Test +public void testCertificatesKeys() throws Exception { +assertType("application/x-x509-cert", "testCERT.pem"); +assertType("application/x-x509-cert", "testCERT.der"); +assertTypeByData("application/x-x509-cert", "testCERT.pem"); +assertTypeByData("application/x-x509-cert", "testCERT.der"); +// Keys need the data to identify, name isn't enough +assertTypeByData("application/x-x509-key", "testRSAKEY.pem"); +assertTypeByData("application/x-x509-key", "testRSAKEY.der"); +} + +@Test public void testVandICalendars() throws Exception { assertType("text/calendar", "testICalendar.ics"); assertType("text/x-vcalendar", "testVCalendar.vcs");
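The magic patterns added to tika-mimetypes.xml did not survive in this archive, so the following is not a reconstruction of them. It is a deliberately loose sketch of the general idea of telling PEM and DER certificates apart from their leading bytes, assuming only two well-known facts: a PEM certificate starts with the "-----BEGIN CERTIFICATE-----" armour line, and a DER-encoded certificate starts with an ASN.1 SEQUENCE tag (0x30), here followed by the two-byte long-form length marker 0x82. CertFormatSketch and sniff are hypothetical names, and the format-specific type strings are the ones used in the tests of the 04/05 commit of this series.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical leading-byte check; much looser than real detection magic.
final class CertFormatSketch {
    static String sniff(byte[] head) {
        String text = new String(head, StandardCharsets.US_ASCII);
        if (text.startsWith("-----BEGIN CERTIFICATE-----")) {
            return "application/x-x509-cert; format=pem";
        }
        // ASN.1 SEQUENCE (0x30) with a two-byte long-form length (0x82).
        // Many other DER structures, private keys included, start the same way,
        // which is why real magic needs to look further into the data.
        if (head.length >= 2 && (head[0] & 0xFF) == 0x30 && (head[1] & 0xFF) == 0x82) {
            return "application/x-x509-cert; format=der";
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] pem = "-----BEGIN CERTIFICATE-----\nMIIC".getBytes(StandardCharsets.US_ASCII);
        // First eight base64 characters of a certificate body, decoded to raw DER bytes
        byte[] der = Base64.getDecoder().decode("MIICujCC");
        System.out.println(sniff(pem));
        System.out.println(sniff(der));
        System.out.println(sniff(new byte[] {0x00, 0x01}));
    }
}
```

Because the DER branch only inspects two bytes, it cannot separate certificates from keys; that ambiguity is the kind of thing the later commit "Make the DER private key mostly-match a bit more specific" tightens up.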
[tika] 05/05: Make the DER private key mostly-match a bit more specific
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch branch_1x in repository https://gitbox.apache.org/repos/asf/tika.git commit 1fce08921cea11bc79c708d4f72b9e4bf70b8c2c Author: Nick Burch AuthorDate: Tue Sep 29 16:51:19 2020 +0100 Make the DER private key mostly-match a bit more specific --- .../main/resources/org/apache/tika/mime/tika-mimetypes.xml| 11 ++- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml index 792448b..d281751 100644 --- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml +++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml @@ -4583,12 +4583,13 @@ - + + - - + +
[tika] 01/05: Add test certificate and key for TIKA-3205
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch branch_1x in repository https://gitbox.apache.org/repos/asf/tika.git commit 5c2c4a2fb91cc160eaf007b71efcd854402e1624 Author: Nick Burch AuthorDate: Tue Sep 29 15:26:48 2020 +0100 Add test certificate and key for TIKA-3205 --- .../src/test/resources/test-documents/testCERT.der | Bin 0 -> 702 bytes .../src/test/resources/test-documents/testCERT.pem | 17 + .../src/test/resources/test-documents/testRSAKEY.der | Bin 0 -> 610 bytes .../src/test/resources/test-documents/testRSAKEY.pem | 16 4 files changed, 33 insertions(+) diff --git a/tika-parsers/src/test/resources/test-documents/testCERT.der b/tika-parsers/src/test/resources/test-documents/testCERT.der new file mode 100644 index 000..935f1f6 Binary files /dev/null and b/tika-parsers/src/test/resources/test-documents/testCERT.der differ diff --git a/tika-parsers/src/test/resources/test-documents/testCERT.pem b/tika-parsers/src/test/resources/test-documents/testCERT.pem new file mode 100644 index 000..dbfd849 --- /dev/null +++ b/tika-parsers/src/test/resources/test-documents/testCERT.pem @@ -0,0 +1,17 @@ +-BEGIN CERTIFICATE- +MIICujCCAiOgAwIBAgIUKOX/l1c68ya6jnfeRJ8uP9kvVx8wDQYJKoZIhvcNAQEL +BQAwbzELMAkGA1UEBhMCWloxFDASBgNVBAgMC0FwYWNoZSBUaWthMQ8wDQYDVQQH +DAZBcGFjaGUxDTALBgNVBAoMBFRpa2ExFDASBgNVBAsMC0FwYWNoZSBUaWthMRQw +EgYDVQQDDAtBcGFjaGUgVGlrYTAeFw0yMDA5MjkxNDIzNDRaFw0zMDA5MjcxNDIz +NDRaMG8xCzAJBgNVBAYTAlpaMRQwEgYDVQQIDAtBcGFjaGUgVGlrYTEPMA0GA1UE +BwwGQXBhY2hlMQ0wCwYDVQQKDARUaWthMRQwEgYDVQQLDAtBcGFjaGUgVGlrYTEU +MBIGA1UEAwwLQXBhY2hlIFRpa2EwgZ8wDQYJKoZIhvcNAQEBBQADgY0AMIGJAoGB +AMeVjMm2uyhe7HkNFFBU6nnI9niJnn+hv+3TTDw23/GH2/d1T3JSpuSFstHvwS0t +umrSb6Py7igD4fz3+aeZU1gDrby4f9KemnLBlAU63VuPyCDyWj2XqbsZDGdnbIy2 +FnBBy2LBn5p0gDwoDpnmhHPHIJZo9OMGH/5hWQUt6+rtAgMBAAGjUzBRMB0GA1Ud +DgQWBBTM8K2WIAuPiv0VgrRoMn2fAGua1jAfBgNVHSMEGDAWgBTM8K2WIAuPiv0V +grRoMn2fAGua1jAPBgNVHRMBAf8EBTADAQH/MA0GCSqGSIb3DQEBCwUAA4GBALqE 
++ja5Hx78Dpym/HxP50TfadwmEes+JXYptykWnuOWgLlqLuGAqJctLOKoR73r7d9d +zJBtdr3A5uTg9vWNMSA2lPdBr/NplNaI8bso+8dRWdkiMut+j7xqTFl8MVMriRSR +a2cA9BsUlpHjJdVjcFweAtdlINZDACoZubCTM7ng +-END CERTIFICATE- diff --git a/tika-parsers/src/test/resources/test-documents/testRSAKEY.der b/tika-parsers/src/test/resources/test-documents/testRSAKEY.der new file mode 100644 index 000..22c4f86 Binary files /dev/null and b/tika-parsers/src/test/resources/test-documents/testRSAKEY.der differ diff --git a/tika-parsers/src/test/resources/test-documents/testRSAKEY.pem b/tika-parsers/src/test/resources/test-documents/testRSAKEY.pem new file mode 100644 index 000..0971b76 --- /dev/null +++ b/tika-parsers/src/test/resources/test-documents/testRSAKEY.pem @@ -0,0 +1,16 @@ +-BEGIN PRIVATE KEY- +MIICeAIBADANBgkqhkiG9w0BAQEFAASCAmIwggJeAgEAAoGBAMeVjMm2uyhe7HkN +FFBU6nnI9niJnn+hv+3TTDw23/GH2/d1T3JSpuSFstHvwS0tumrSb6Py7igD4fz3 ++aeZU1gDrby4f9KemnLBlAU63VuPyCDyWj2XqbsZDGdnbIy2FnBBy2LBn5p0gDwo +DpnmhHPHIJZo9OMGH/5hWQUt6+rtAgMBAAECgYBdb1TGxiYeQzoffZEJ/ob61qsU +SRELnVS16RqigeobL8g5tBCqa6k4CKNrhvt/xA2mnrenID6AOzkb7ZdR8ATEtojF +JjLZ8zmXACU3WetoRUvh2uxlFpxFeK0yQlaEWcvE4Z9MQe3V8pBvMQUNEZxN4bHT +1eMla9O65TR49uxaPQJBAO/Spm9ln02CjnxCHiGmRQ77gUNz39AtrKRLQBv/uEB2 +fhHAvFoSPGXaIgd73GgZEnM/a+faLrMu9NvemMd5aYMCQQDVDAsjaa72+5ZS87zE +xLDrFT1cKM8U1G0ikdGl6rejDnSoiwfZ8DXpSBOOiSkf/PX9zDXDPQl9nHLjmDn9 +wN7PAkEAxsPTF66lGoujZk8yQ/dXczR2DR7Dl/nTBZQsvUfzQNI0aKhSM2C72Dqz +S3qX0Vs+VHBzEYVegTngzT4vZ9wz2wJBAMNXCZdsvUokIA7rALgCCJ1jmiE4Ibdd +lrtNrEZO0hWlmX04DPjc8PF2bsgQJy73R61vYhQjkOIlYoof93wdLa0CQQDTLHSB +8e8f81Jq+zbLReAQ6ch+fEulaMPlPY0OqgExBxdbwXnlPENw09+EiQkKSSo8qhY1 +guri/IWyq3LYm8nE +-END PRIVATE KEY-
[tika] branch main updated: Move new test files to 2.x folder, doh!
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/tika.git The following commit(s) were added to refs/heads/main by this push: new 0844dce Move new test files to 2.x folder, doh! 0844dce is described below commit 0844dce7d19b90de288c456e78018ca8729895a7 Author: Nick Burch AuthorDate: Wed Sep 30 16:54:44 2020 +0100 Move new test files to 2.x folder, doh! --- .../src/test/resources/test-documents/testCERT.der | Bin .../src/test/resources/test-documents/testCERT.pem | 0 .../src/test/resources/test-documents/testDSAKEY.der| Bin .../src/test/resources/test-documents/testDSAKEY.pem| 0 .../src/test/resources/test-documents/testDSAPARAMS.pem | 0 .../src/test/resources/test-documents/testECKEY.der | Bin .../src/test/resources/test-documents/testECKEY.pem | 0 .../src/test/resources/test-documents/testECPARAMS.pem | 0 .../src/test/resources/test-documents/testRSAKEY.der| Bin .../src/test/resources/test-documents/testRSAKEY.pem| 0 10 files changed, 0 insertions(+), 0 deletions(-) diff --git a/tika-parsers/src/test/resources/test-documents/testCERT.der b/tika-parser-modules/tika-parser-integration-tests/src/test/resources/test-documents/testCERT.der similarity index 100% rename from tika-parsers/src/test/resources/test-documents/testCERT.der rename to tika-parser-modules/tika-parser-integration-tests/src/test/resources/test-documents/testCERT.der diff --git a/tika-parsers/src/test/resources/test-documents/testCERT.pem b/tika-parser-modules/tika-parser-integration-tests/src/test/resources/test-documents/testCERT.pem similarity index 100% rename from tika-parsers/src/test/resources/test-documents/testCERT.pem rename to tika-parser-modules/tika-parser-integration-tests/src/test/resources/test-documents/testCERT.pem diff --git a/tika-parsers/src/test/resources/test-documents/testDSAKEY.der 
b/tika-parser-modules/tika-parser-integration-tests/src/test/resources/test-documents/testDSAKEY.der similarity index 100% rename from tika-parsers/src/test/resources/test-documents/testDSAKEY.der rename to tika-parser-modules/tika-parser-integration-tests/src/test/resources/test-documents/testDSAKEY.der diff --git a/tika-parsers/src/test/resources/test-documents/testDSAKEY.pem b/tika-parser-modules/tika-parser-integration-tests/src/test/resources/test-documents/testDSAKEY.pem similarity index 100% rename from tika-parsers/src/test/resources/test-documents/testDSAKEY.pem rename to tika-parser-modules/tika-parser-integration-tests/src/test/resources/test-documents/testDSAKEY.pem diff --git a/tika-parsers/src/test/resources/test-documents/testDSAPARAMS.pem b/tika-parser-modules/tika-parser-integration-tests/src/test/resources/test-documents/testDSAPARAMS.pem similarity index 100% rename from tika-parsers/src/test/resources/test-documents/testDSAPARAMS.pem rename to tika-parser-modules/tika-parser-integration-tests/src/test/resources/test-documents/testDSAPARAMS.pem diff --git a/tika-parsers/src/test/resources/test-documents/testECKEY.der b/tika-parser-modules/tika-parser-integration-tests/src/test/resources/test-documents/testECKEY.der similarity index 100% rename from tika-parsers/src/test/resources/test-documents/testECKEY.der rename to tika-parser-modules/tika-parser-integration-tests/src/test/resources/test-documents/testECKEY.der diff --git a/tika-parsers/src/test/resources/test-documents/testECKEY.pem b/tika-parser-modules/tika-parser-integration-tests/src/test/resources/test-documents/testECKEY.pem similarity index 100% rename from tika-parsers/src/test/resources/test-documents/testECKEY.pem rename to tika-parser-modules/tika-parser-integration-tests/src/test/resources/test-documents/testECKEY.pem diff --git a/tika-parsers/src/test/resources/test-documents/testECPARAMS.pem 
b/tika-parser-modules/tika-parser-integration-tests/src/test/resources/test-documents/testECPARAMS.pem similarity index 100% rename from tika-parsers/src/test/resources/test-documents/testECPARAMS.pem rename to tika-parser-modules/tika-parser-integration-tests/src/test/resources/test-documents/testECPARAMS.pem diff --git a/tika-parsers/src/test/resources/test-documents/testRSAKEY.der b/tika-parser-modules/tika-parser-integration-tests/src/test/resources/test-documents/testRSAKEY.der similarity index 100% rename from tika-parsers/src/test/resources/test-documents/testRSAKEY.der rename to tika-parser-modules/tika-parser-integration-tests/src/test/resources/test-documents/testRSAKEY.der diff --git a/tika-parsers/src/test/resources/test-documents/testRSAKEY.pem b/tika-parser-modules/tika-parser-integration-tests/src/test/resources/test-documents/testRSAKEY.pem similarity index 100% rename from tika-parsers/src/test/resources/test-documents/testRSAKEY.pem rename to tika-parser-modules/tika-parser-integration-tests/src/test/resources
[tika] 02/05: TIKA-3205 Add magic for X509 PEM certificate, and tweak default type
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/tika.git commit ecd1d62ad9e4d2ddd53abf204539e5d765e6c624 Author: Nick Burch AuthorDate: Tue Sep 29 15:49:14 2020 +0100 TIKA-3205 Add magic for X509 PEM certificate, and tweak default type --- .../org/apache/tika/mime/tika-mimetypes.xml| 26 +- .../java/org/apache/tika/TikaDetectionTest.java| 5 +++-- 2 files changed, 28 insertions(+), 3 deletions(-) diff --git a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml index 396fb9a..bdbeee5 100644 --- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml +++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml @@ -4678,10 +4678,34 @@ - + + + + + + + + + + + + + + + + + + + + + + + diff --git a/tika-core/src/test/java/org/apache/tika/TikaDetectionTest.java b/tika-core/src/test/java/org/apache/tika/TikaDetectionTest.java index eb3bb19..2364daa 100644 --- a/tika-core/src/test/java/org/apache/tika/TikaDetectionTest.java +++ b/tika-core/src/test/java/org/apache/tika/TikaDetectionTest.java @@ -624,8 +624,9 @@ public class TikaDetectionTest { assertEquals("application/x-texinfo", tika.detect("x.texi")); assertEquals("application/x-ustar", tika.detect("x.ustar")); assertEquals("application/x-wais-source", tika.detect("x.src")); -assertEquals("application/x-x509-ca-cert", tika.detect("x.der")); -assertEquals("application/x-x509-ca-cert", tika.detect("x.crt")); +// Differ from httpd - use a common parent for CA and User certs +//assertEquals("application/x-x509-ca-cert", tika.detect("x.der")); +//assertEquals("application/x-x509-ca-cert", tika.detect("x.crt")); assertEquals("application/x-xfig", tika.detect("x.fig")); assertEquals("application/x-xpinstall", tika.detect("x.xpi")); assertEquals("application/xenc+xml", tika.detect("x.xenc"));
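The PEM magic added in this commit is not visible in the diff above, since the XML lines did not survive. As a rough illustration of the idea only, the sketch below checks whether a byte stream starts with the standard PEM certificate header line. The class and method names are hypothetical and are not part of Tika; Tika itself expresses this kind of rule declaratively in tika-mimetypes.xml rather than in Java.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not Tika code.
final class PemCertificateSniffer {
    // Standard PEM header line for an X.509 certificate.
    private static final byte[] HEADER =
            "-----BEGIN CERTIFICATE-----".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the data begins with the PEM certificate header. */
    static boolean looksLikePemCertificate(byte[] data) {
        if (data == null || data.length < HEADER.length) {
            return false;
        }
        // Compare the leading bytes against the header, byte by byte.
        for (int i = 0; i < HEADER.length; i++) {
            if (data[i] != HEADER[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A check like this only works on the file contents; it says nothing about name-based detection, which is what the TikaDetectionTest change above adjusts.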
[tika] 04/05: Split the Certificate and Key mimetypes into DER and PEM subtypes, add test EC files. TIKA-3205
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/tika.git commit fa1b2ef87157f51797d0dcaed36ebc990e538910 Author: Nick Burch AuthorDate: Tue Sep 29 16:48:40 2020 +0100 Split the Certificate and Key mimetypes into DER and PEM subtypes, add test EC files. TIKA-3205 --- .../org/apache/tika/mime/tika-mimetypes.xml| 41 + .../java/org/apache/tika/mime/TestMimeTypes.java | 19 +++--- .../test/resources/test-documents/testDSAKEY.der | Bin 0 -> 834 bytes .../test/resources/test-documents/testDSAKEY.pem | 15 .../resources/test-documents/testDSAPARAMS.pem | 14 +++ .../test/resources/test-documents/testECKEY.der| Bin 0 -> 167 bytes .../test/resources/test-documents/testECKEY.pem| 6 +++ .../test/resources/test-documents/testECPARAMS.pem | 3 ++ 8 files changed, 84 insertions(+), 14 deletions(-) diff --git a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml index a995563..2c4a5e5 100644 --- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml +++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml @@ -4682,13 +4682,19 @@ - - + + + + - - + + + + + + @@ -4703,9 +4709,12 @@ + + + + - @@ -4713,16 +4722,32 @@ - - + + + + + + + + + + - + + + + + + + diff --git a/tika-parser-modules/tika-parser-integration-tests/src/test/java/org/apache/tika/mime/TestMimeTypes.java b/tika-parser-modules/tika-parser-integration-tests/src/test/java/org/apache/tika/mime/TestMimeTypes.java index dc3f303..2960b56 100644 --- a/tika-parser-modules/tika-parser-integration-tests/src/test/java/org/apache/tika/mime/TestMimeTypes.java +++ b/tika-parser-modules/tika-parser-integration-tests/src/test/java/org/apache/tika/mime/TestMimeTypes.java @@ -1142,13 +1142,20 @@ public class TestMimeTypes { @Test public void testCertificatesKeys() throws Exception { 
-assertType("application/x-x509-cert", "testCERT.pem"); -assertType("application/x-x509-cert", "testCERT.der"); -assertTypeByData("application/x-x509-cert", "testCERT.pem"); -assertTypeByData("application/x-x509-cert", "testCERT.der"); +assertType("application/x-x509-cert; format=pem", "testCERT.pem"); +assertType("application/x-x509-cert; format=der", "testCERT.der"); +assertTypeByData("application/x-x509-cert; format=pem", "testCERT.pem"); +assertTypeByData("application/x-x509-cert; format=der", "testCERT.der"); // Keys need the data to identify, name isn't enough -assertTypeByData("application/x-x509-key", "testRSAKEY.pem"); -assertTypeByData("application/x-x509-key", "testRSAKEY.der"); +assertTypeByData("application/x-x509-key; format=pem", "testECKEY.pem"); +assertTypeByData("application/x-x509-key; format=der", "testECKEY.der"); +assertTypeByData("application/x-x509-key; format=pem", "testRSAKEY.pem"); +assertTypeByData("application/x-x509-key; format=der", "testRSAKEY.der"); +assertTypeByData("application/x-x509-key; format=pem", "testDSAKEY.pem"); +assertTypeByData("application/x-x509-key; format=der", "testDSAKEY.der"); +// Parameters only have PEM form, always need data +assertTypeByData("application/x-x509-dsa-parameters", "testDSAPARAMS.pem"); +assertTypeByData("application/x-x509-ec-parameters", "testECPARAMS.pem"); } @Test diff --git a/tika-parsers/src/test/resources/test-documents/testDSAKEY.der b/tika-parsers/src/test/resources/test-documents/testDSAKEY.der new file mode 100644 index 000..9ed2eb9 Binary files /dev/null and b/tika-parsers/src/test/resources/test-documents/testDSAKEY.der differ diff --git a/tika-parsers/src/test/resources/test-documents/testDSAKEY.pem b/tika-parsers/src/test/resources/test-documents/testDSAKEY.pem new file mode 100644 index 000..2b8781a --- /dev/null +++ b/tika-parsers/src/test/resources/test-documents/testDSAKEY.pem @@ -0,0 +1,15 @@ +-BEGIN PRIVATE KEY- 
+MIICTQIBADCCAi0GByqGSM44BAEwggIgAoIBAQDRXU0Be5k0MI3skB6K0PhyptBh +WSJ1l+NV
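The updated assertions in this commit expect type strings that carry a `format` parameter, such as "application/x-x509-cert; format=pem". The sketch below is a plain-Java illustration of splitting such a string into its base type and parameters. The `TypeWithFormat` class is hypothetical and written only for this example; it is not how Tika represents media types internally.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical value class for illustration only; not Tika code.
final class TypeWithFormat {
    final String baseType;
    final Map<String, String> parameters;

    private TypeWithFormat(String baseType, Map<String, String> parameters) {
        this.baseType = baseType;
        this.parameters = parameters;
    }

    /** Splits "type/subtype; key=value; ..." into a base type and its parameters. */
    static TypeWithFormat parse(String type) {
        String[] parts = type.split(";");
        Map<String, String> params = new LinkedHashMap<>();
        for (int i = 1; i < parts.length; i++) {
            // Each parameter is expected to look like key=value.
            String[] kv = parts[i].split("=", 2);
            if (kv.length == 2) {
                params.put(kv[0].trim().toLowerCase(Locale.ROOT), kv[1].trim());
            }
        }
        return new TypeWithFormat(parts[0].trim(), params);
    }
}
```

With this split, the DER and PEM variants share the base type "application/x-x509-cert" and differ only in the value of `format`, which matches the intent stated in the commit subject.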
[tika] 01/05: Add test certificate and key for TIKA-3205
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/tika.git commit ad0d98b9a155e483b815eb01e36ebd02a101695a Author: Nick Burch AuthorDate: Tue Sep 29 15:26:48 2020 +0100 Add test certificate and key for TIKA-3205 --- .../src/test/resources/test-documents/testCERT.der | Bin 0 -> 702 bytes .../src/test/resources/test-documents/testCERT.pem | 17 + .../src/test/resources/test-documents/testRSAKEY.der | Bin 0 -> 610 bytes .../src/test/resources/test-documents/testRSAKEY.pem | 16 4 files changed, 33 insertions(+) diff --git a/tika-parsers/src/test/resources/test-documents/testCERT.der b/tika-parsers/src/test/resources/test-documents/testCERT.der new file mode 100644 index 000..935f1f6 Binary files /dev/null and b/tika-parsers/src/test/resources/test-documents/testCERT.der differ diff --git a/tika-parsers/src/test/resources/test-documents/testCERT.pem b/tika-parsers/src/test/resources/test-documents/testCERT.pem new file mode 100644 index 000..dbfd849 --- /dev/null +++ b/tika-parsers/src/test/resources/test-documents/testCERT.pem @@ -0,0 +1,17 @@ +-BEGIN CERTIFICATE- +MIICujCCAiOgAwIBAgIUKOX/l1c68ya6jnfeRJ8uP9kvVx8wDQYJKoZIhvcNAQEL +BQAwbzELMAkGA1UEBhMCWloxFDASBgNVBAgMC0FwYWNoZSBUaWthMQ8wDQYDVQQH +DAZBcGFjaGUxDTALBgNVBAoMBFRpa2ExFDASBgNVBAsMC0FwYWNoZSBUaWthMRQw +EgYDVQQDDAtBcGFjaGUgVGlrYTAeFw0yMDA5MjkxNDIzNDRaFw0zMDA5MjcxNDIz +NDRaMG8xCzAJBgNVBAYTAlpaMRQwEgYDVQQIDAtBcGFjaGUgVGlrYTEPMA0GA1UE +BwwGQXBhY2hlMQ0wCwYDVQQKDARUaWthMRQwEgYDVQQLDAtBcGFjaGUgVGlrYTEU +MBIGA1UEAwwLQXBhY2hlIFRpa2EwgZ8wDQYJKoZIhvcNAQEBBQADgY0AMIGJAoGB +AMeVjMm2uyhe7HkNFFBU6nnI9niJnn+hv+3TTDw23/GH2/d1T3JSpuSFstHvwS0t +umrSb6Py7igD4fz3+aeZU1gDrby4f9KemnLBlAU63VuPyCDyWj2XqbsZDGdnbIy2 +FnBBy2LBn5p0gDwoDpnmhHPHIJZo9OMGH/5hWQUt6+rtAgMBAAGjUzBRMB0GA1Ud +DgQWBBTM8K2WIAuPiv0VgrRoMn2fAGua1jAfBgNVHSMEGDAWgBTM8K2WIAuPiv0V +grRoMn2fAGua1jAPBgNVHRMBAf8EBTADAQH/MA0GCSqGSIb3DQEBCwUAA4GBALqE 
++ja5Hx78Dpym/HxP50TfadwmEes+JXYptykWnuOWgLlqLuGAqJctLOKoR73r7d9d +zJBtdr3A5uTg9vWNMSA2lPdBr/NplNaI8bso+8dRWdkiMut+j7xqTFl8MVMriRSR +a2cA9BsUlpHjJdVjcFweAtdlINZDACoZubCTM7ng +-END CERTIFICATE- diff --git a/tika-parsers/src/test/resources/test-documents/testRSAKEY.der b/tika-parsers/src/test/resources/test-documents/testRSAKEY.der new file mode 100644 index 000..22c4f86 Binary files /dev/null and b/tika-parsers/src/test/resources/test-documents/testRSAKEY.der differ diff --git a/tika-parsers/src/test/resources/test-documents/testRSAKEY.pem b/tika-parsers/src/test/resources/test-documents/testRSAKEY.pem new file mode 100644 index 000..0971b76 --- /dev/null +++ b/tika-parsers/src/test/resources/test-documents/testRSAKEY.pem @@ -0,0 +1,16 @@ +-BEGIN PRIVATE KEY- +MIICeAIBADANBgkqhkiG9w0BAQEFAASCAmIwggJeAgEAAoGBAMeVjMm2uyhe7HkN +FFBU6nnI9niJnn+hv+3TTDw23/GH2/d1T3JSpuSFstHvwS0tumrSb6Py7igD4fz3 ++aeZU1gDrby4f9KemnLBlAU63VuPyCDyWj2XqbsZDGdnbIy2FnBBy2LBn5p0gDwo +DpnmhHPHIJZo9OMGH/5hWQUt6+rtAgMBAAECgYBdb1TGxiYeQzoffZEJ/ob61qsU +SRELnVS16RqigeobL8g5tBCqa6k4CKNrhvt/xA2mnrenID6AOzkb7ZdR8ATEtojF +JjLZ8zmXACU3WetoRUvh2uxlFpxFeK0yQlaEWcvE4Z9MQe3V8pBvMQUNEZxN4bHT +1eMla9O65TR49uxaPQJBAO/Spm9ln02CjnxCHiGmRQ77gUNz39AtrKRLQBv/uEB2 +fhHAvFoSPGXaIgd73GgZEnM/a+faLrMu9NvemMd5aYMCQQDVDAsjaa72+5ZS87zE +xLDrFT1cKM8U1G0ikdGl6rejDnSoiwfZ8DXpSBOOiSkf/PX9zDXDPQl9nHLjmDn9 +wN7PAkEAxsPTF66lGoujZk8yQ/dXczR2DR7Dl/nTBZQsvUfzQNI0aKhSM2C72Dqz +S3qX0Vs+VHBzEYVegTngzT4vZ9wz2wJBAMNXCZdsvUokIA7rALgCCJ1jmiE4Ibdd +lrtNrEZO0hWlmX04DPjc8PF2bsgQJy73R61vYhQjkOIlYoof93wdLa0CQQDTLHSB +8e8f81Jq+zbLReAQ6ch+fEulaMPlPY0OqgExBxdbwXnlPENw09+EiQkKSSo8qhY1 +guri/IWyq3LYm8nE +-END PRIVATE KEY-
[tika] 05/05: Make the DER private key mostly-match a bit more specific
This is an automated email from the ASF dual-hosted git repository.
nick pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/tika.git
commit 75c2ff5686a70c0fb15c4b52534c1be09669af1b
Author: Nick Burch
AuthorDate: Tue Sep 29 16:51:19 2020 +0100
Make the DER private key mostly-match a bit more specific
---
.../main/resources/org/apache/tika/mime/tika-mimetypes.xml| 11 ++-
1 file changed, 6 insertions(+), 5 deletions(-)
diff --git a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
index 2c4a5e5..404e462 100644
--- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
+++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
@@ -4727,12 +4727,13 @@ - + + - - + +
[tika] 03/05: Add some more DER magic for certificates, and add tests TIKA-3205
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/tika.git commit b0ae63a1c59ef60ac6b134cadf2053f2e73152d4 Author: Nick Burch AuthorDate: Tue Sep 29 16:23:08 2020 +0100 Add some more DER magic for certificates, and add tests TIKA-3205 --- .../org/apache/tika/mime/tika-mimetypes.xml| 28 ++ .../java/org/apache/tika/mime/TestMimeTypes.java | 11 + 2 files changed, 34 insertions(+), 5 deletions(-) diff --git a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml index bdbeee5..a995563 100644 --- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml +++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml @@ -4690,10 +4690,16 @@ - - + mask="0xFFF8" offset="0"> + + + + + + @@ -4701,8 +4707,20 @@ + + + + + + - + + + + + + + diff --git a/tika-parser-modules/tika-parser-integration-tests/src/test/java/org/apache/tika/mime/TestMimeTypes.java b/tika-parser-modules/tika-parser-integration-tests/src/test/java/org/apache/tika/mime/TestMimeTypes.java index c507986..dc3f303 100644 --- a/tika-parser-modules/tika-parser-integration-tests/src/test/java/org/apache/tika/mime/TestMimeTypes.java +++ b/tika-parser-modules/tika-parser-integration-tests/src/test/java/org/apache/tika/mime/TestMimeTypes.java @@ -1141,6 +1141,17 @@ public class TestMimeTypes { } @Test +public void testCertificatesKeys() throws Exception { +assertType("application/x-x509-cert", "testCERT.pem"); +assertType("application/x-x509-cert", "testCERT.der"); +assertTypeByData("application/x-x509-cert", "testCERT.pem"); +assertTypeByData("application/x-x509-cert", "testCERT.der"); +// Keys need the data to identify, name isn't enough +assertTypeByData("application/x-x509-key", "testRSAKEY.pem"); +assertTypeByData("application/x-x509-key", "testRSAKEY.der"); +} + +@Test public void 
testVandICalendars() throws Exception { assertType("text/calendar", "testICalendar.ics"); assertType("text/x-vcalendar", "testVCalendar.vcs");
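The only fragment of the new DER magic that survives in the diff above is `mask="0xFFF8" offset="0"`. As a hedged sketch of what a masked magic comparison does in general, the code below compares data against a value after applying a bit mask to both. The `MaskedMagic` class is hypothetical, and the value 0x3080 used in the usage note is an assumption chosen for illustration (a DER SEQUENCE tag 0x30 followed by a long-form length byte); the actual value in tika-mimetypes.xml is not shown in this email.

```java
// Hypothetical helper for illustration only; not Tika code.
final class MaskedMagic {
    /**
     * Returns true if the bytes at the given offset equal the expected value
     * once both sides have been ANDed with the mask.
     */
    static boolean matches(byte[] data, int offset, byte[] value, byte[] mask) {
        if (data == null || offset < 0 || value.length != mask.length) {
            return false;
        }
        if (data.length - offset < value.length) {
            return false;
        }
        for (int i = 0; i < value.length; i++) {
            int actual = data[offset + i] & mask[i] & 0xFF;
            int expected = value[i] & mask[i] & 0xFF;
            if (actual != expected) {
                return false;
            }
        }
        return true;
    }
}
```

Under the assumed value 0x3080 with mask 0xFFF8, the first byte must be exactly 0x30 and the second byte may be anything from 0x80 to 0x87. A DER certificate of a few hundred bytes, such as the 702-byte testCERT.der, begins with 0x30 0x82 and would pass this comparison.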
[tika] branch main updated (6591b32 -> 75c2ff5)
This is an automated email from the ASF dual-hosted git repository. nick pushed a change to branch main in repository https://gitbox.apache.org/repos/asf/tika.git. from 6591b32 TIKA-3196 -- ensure that entryCnt is thread-safe across parses; add integration test; clean up existing unused imports. new ad0d98b Add test certificate and key for TIKA-3205 new ecd1d62 TIKA-3205 Add magic for X509 PEM certificate, and tweak default type new b0ae63a Add some more DER magic for certificates, and add tests TIKA-3205 new fa1b2ef Split the Certificate and Key mimetypes into DER and PEM subtypes, add test EC files. TIKA-3205 new 75c2ff5 Make the DER private key mostly-match a bit more specific The 5 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference. Summary of changes: .../org/apache/tika/mime/tika-mimetypes.xml| 72 - .../java/org/apache/tika/TikaDetectionTest.java| 5 +- .../java/org/apache/tika/mime/TestMimeTypes.java | 18 ++ .../src/test/resources/test-documents/testCERT.der | Bin 0 -> 702 bytes .../src/test/resources/test-documents/testCERT.pem | 17 + .../test/resources/test-documents/testDSAKEY.der | Bin 0 -> 834 bytes .../test/resources/test-documents/testDSAKEY.pem | 15 + .../resources/test-documents/testDSAPARAMS.pem | 14 .../test/resources/test-documents/testECKEY.der| Bin 0 -> 167 bytes .../test/resources/test-documents/testECKEY.pem| 6 ++ .../test/resources/test-documents/testECPARAMS.pem | 3 + .../test/resources/test-documents/testRSAKEY.der | Bin 0 -> 610 bytes .../test/resources/test-documents/testRSAKEY.pem | 16 + 13 files changed, 162 insertions(+), 4 deletions(-) create mode 100644 tika-parsers/src/test/resources/test-documents/testCERT.der create mode 100644 tika-parsers/src/test/resources/test-documents/testCERT.pem create mode 100644 
tika-parsers/src/test/resources/test-documents/testDSAKEY.der create mode 100644 tika-parsers/src/test/resources/test-documents/testDSAKEY.pem create mode 100644 tika-parsers/src/test/resources/test-documents/testDSAPARAMS.pem create mode 100644 tika-parsers/src/test/resources/test-documents/testECKEY.der create mode 100644 tika-parsers/src/test/resources/test-documents/testECKEY.pem create mode 100644 tika-parsers/src/test/resources/test-documents/testECPARAMS.pem create mode 100644 tika-parsers/src/test/resources/test-documents/testRSAKEY.der create mode 100644 tika-parsers/src/test/resources/test-documents/testRSAKEY.pem
[tika] 04/05: Split the Certificate and Key mimetypes into DER and PEM subtypes, add test EC files. TIKA-3205
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/tika.git commit c6b30c578e98373496f895cd7caa8317f4212d51 Author: Nick Burch AuthorDate: Tue Sep 29 16:48:40 2020 +0100 Split the Certificate and Key mimetypes into DER and PEM subtypes, add test EC files. TIKA-3205 --- .../org/apache/tika/mime/tika-mimetypes.xml| 41 + .../java/org/apache/tika/mime/TestMimeTypes.java | 19 +++--- .../test/resources/test-documents/testDSAKEY.der | Bin 0 -> 834 bytes .../test/resources/test-documents/testDSAKEY.pem | 15 .../resources/test-documents/testDSAPARAMS.pem | 14 +++ .../test/resources/test-documents/testECKEY.der| Bin 0 -> 167 bytes .../test/resources/test-documents/testECKEY.pem| 6 +++ .../test/resources/test-documents/testECPARAMS.pem | 3 ++ 8 files changed, 84 insertions(+), 14 deletions(-) diff --git a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml index 630b429..abcc5d5 100644 --- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml +++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml @@ -4631,13 +4631,19 @@ - - + + + + - - + + + + + + @@ -4652,9 +4658,12 @@ + + + + - @@ -4662,16 +4671,32 @@ - - + + + + + + + + + + - + + + + + + + diff --git a/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java b/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java index a80dc8e..de45faf 100644 --- a/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java +++ b/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java @@ -1137,13 +1137,20 @@ public class TestMimeTypes { @Test public void testCertificatesKeys() throws Exception { -assertType("application/x-x509-cert", "testCERT.pem"); -assertType("application/x-x509-cert", "testCERT.der"); -assertTypeByData("application/x-x509-cert", "testCERT.pem"); 
-assertTypeByData("application/x-x509-cert", "testCERT.der"); +assertType("application/x-x509-cert; format=pem", "testCERT.pem"); +assertType("application/x-x509-cert; format=der", "testCERT.der"); +assertTypeByData("application/x-x509-cert; format=pem", "testCERT.pem"); +assertTypeByData("application/x-x509-cert; format=der", "testCERT.der"); // Keys need the data to identify, name isn't enough -assertTypeByData("application/x-x509-key", "testRSAKEY.pem"); -assertTypeByData("application/x-x509-key", "testRSAKEY.der"); +assertTypeByData("application/x-x509-key; format=pem", "testECKEY.pem"); +assertTypeByData("application/x-x509-key; format=der", "testECKEY.der"); +assertTypeByData("application/x-x509-key; format=pem", "testRSAKEY.pem"); +assertTypeByData("application/x-x509-key; format=der", "testRSAKEY.der"); +assertTypeByData("application/x-x509-key; format=pem", "testDSAKEY.pem"); +assertTypeByData("application/x-x509-key; format=der", "testDSAKEY.der"); +// Parameters only have PEM form, always need data +assertTypeByData("application/x-x509-dsa-parameters", "testDSAPARAMS.pem"); +assertTypeByData("application/x-x509-ec-parameters", "testECPARAMS.pem"); } @Test diff --git a/tika-parsers/src/test/resources/test-documents/testDSAKEY.der b/tika-parsers/src/test/resources/test-documents/testDSAKEY.der new file mode 100644 index 000..9ed2eb9 Binary files /dev/null and b/tika-parsers/src/test/resources/test-documents/testDSAKEY.der differ diff --git a/tika-parsers/src/test/resources/test-documents/testDSAKEY.pem b/tika-parsers/src/test/resources/test-documents/testDSAKEY.pem new file mode 100644 index 000..2b8781a --- /dev/null +++ b/tika-parsers/src/test/resources/test-documents/testDSAKEY.pem @@ -0,0 +1,15 @@ +-BEGIN PRIVATE KEY- +MIICTQIBADCCAi0GByqGSM44BAEwggIgAoIBAQDRXU0Be5k0MI3skB6K0PhyptBh +WSJ1l+NVSOX7wpXC37upcH7a0ZCfU9RyWqcX9dQFw+TWjlH2ANll/FO4osXkkJVY +oylJ+p0599v6WRPBS/yQpKuvfqEm5HA78J8ILhnyCCw8hqdlrADBOMGf7tGF5Agw +hEZJdtHjYRzPWzY0eogptg3wQPd/
[tika] 03/05: Add some more DER magic for certificates, and add tests TIKA-3205
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/tika.git commit eaa712f89d5de9ad06647fa29d10ac1baa47a4c0 Author: Nick Burch AuthorDate: Tue Sep 29 16:23:08 2020 +0100 Add some more DER magic for certificates, and add tests TIKA-3205 --- .../org/apache/tika/mime/tika-mimetypes.xml| 28 ++ .../java/org/apache/tika/mime/TestMimeTypes.java | 11 + 2 files changed, 34 insertions(+), 5 deletions(-) diff --git a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml index 3cdea61..630b429 100644 --- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml +++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml @@ -4639,10 +4639,16 @@ - - + mask="0xFFF8" offset="0"> + + + + + + @@ -4650,8 +4656,20 @@ + + + + + + - + + + + + + + diff --git a/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java b/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java index 83f67eb..a80dc8e 100644 --- a/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java +++ b/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java @@ -1136,6 +1136,17 @@ public class TestMimeTypes { } @Test +public void testCertificatesKeys() throws Exception { +assertType("application/x-x509-cert", "testCERT.pem"); +assertType("application/x-x509-cert", "testCERT.der"); +assertTypeByData("application/x-x509-cert", "testCERT.pem"); +assertTypeByData("application/x-x509-cert", "testCERT.der"); +// Keys need the data to identify, name isn't enough +assertTypeByData("application/x-x509-key", "testRSAKEY.pem"); +assertTypeByData("application/x-x509-key", "testRSAKEY.der"); +} + +@Test public void testVandICalendars() throws Exception { assertType("text/calendar", "testICalendar.ics"); assertType("text/x-vcalendar", "testVCalendar.vcs");
[tika] branch master updated (62fe4ad -> 6183452)
This is an automated email from the ASF dual-hosted git repository. nick pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/tika.git. from 62fe4ad TIKA-3104 -- add detection and parsing for xml based plist files new 5fdb70a Add test certificate and key for TIKA-3205 new c3fff83 TIKA-3205 Add magic for X509 PEM certificate, and tweak default type new eaa712f Add some more DER magic for certificates, and add tests TIKA-3205 new c6b30c5 Split the Certificate and Key mimetypes into DER and PEM subtypes, add test EC files. TIKA-3205 new 6183452 Make the DER private key mostly-match a bit more specific The 5 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference. Summary of changes: .../org/apache/tika/mime/tika-mimetypes.xml| 72 - .../java/org/apache/tika/TikaDetectionTest.java| 5 +- .../java/org/apache/tika/mime/TestMimeTypes.java | 18 ++ .../src/test/resources/test-documents/testCERT.der | Bin 0 -> 702 bytes .../src/test/resources/test-documents/testCERT.pem | 17 + .../test/resources/test-documents/testDSAKEY.der | Bin 0 -> 834 bytes .../test/resources/test-documents/testDSAKEY.pem | 15 + .../resources/test-documents/testDSAPARAMS.pem | 14 .../test/resources/test-documents/testECKEY.der| Bin 0 -> 167 bytes .../test/resources/test-documents/testECKEY.pem| 6 ++ .../test/resources/test-documents/testECPARAMS.pem | 3 + .../test/resources/test-documents/testRSAKEY.der | Bin 0 -> 610 bytes .../test/resources/test-documents/testRSAKEY.pem | 16 + 13 files changed, 162 insertions(+), 4 deletions(-) create mode 100644 tika-parsers/src/test/resources/test-documents/testCERT.der create mode 100644 tika-parsers/src/test/resources/test-documents/testCERT.pem create mode 100644 tika-parsers/src/test/resources/test-documents/testDSAKEY.der create mode 100644 
tika-parsers/src/test/resources/test-documents/testDSAKEY.pem create mode 100644 tika-parsers/src/test/resources/test-documents/testDSAPARAMS.pem create mode 100644 tika-parsers/src/test/resources/test-documents/testECKEY.der create mode 100644 tika-parsers/src/test/resources/test-documents/testECKEY.pem create mode 100644 tika-parsers/src/test/resources/test-documents/testECPARAMS.pem create mode 100644 tika-parsers/src/test/resources/test-documents/testRSAKEY.der create mode 100644 tika-parsers/src/test/resources/test-documents/testRSAKEY.pem
[tika] 01/05: Add test certificate and key for TIKA-3205
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/tika.git commit 5fdb70ae4770301d6b101e9007a1058e15abac94 Author: Nick Burch AuthorDate: Tue Sep 29 15:26:48 2020 +0100 Add test certificate and key for TIKA-3205 --- .../src/test/resources/test-documents/testCERT.der | Bin 0 -> 702 bytes .../src/test/resources/test-documents/testCERT.pem | 17 + .../src/test/resources/test-documents/testRSAKEY.der | Bin 0 -> 610 bytes .../src/test/resources/test-documents/testRSAKEY.pem | 16 4 files changed, 33 insertions(+) diff --git a/tika-parsers/src/test/resources/test-documents/testCERT.der b/tika-parsers/src/test/resources/test-documents/testCERT.der new file mode 100644 index 000..935f1f6 Binary files /dev/null and b/tika-parsers/src/test/resources/test-documents/testCERT.der differ diff --git a/tika-parsers/src/test/resources/test-documents/testCERT.pem b/tika-parsers/src/test/resources/test-documents/testCERT.pem new file mode 100644 index 000..dbfd849 --- /dev/null +++ b/tika-parsers/src/test/resources/test-documents/testCERT.pem @@ -0,0 +1,17 @@ +-BEGIN CERTIFICATE- +MIICujCCAiOgAwIBAgIUKOX/l1c68ya6jnfeRJ8uP9kvVx8wDQYJKoZIhvcNAQEL +BQAwbzELMAkGA1UEBhMCWloxFDASBgNVBAgMC0FwYWNoZSBUaWthMQ8wDQYDVQQH +DAZBcGFjaGUxDTALBgNVBAoMBFRpa2ExFDASBgNVBAsMC0FwYWNoZSBUaWthMRQw +EgYDVQQDDAtBcGFjaGUgVGlrYTAeFw0yMDA5MjkxNDIzNDRaFw0zMDA5MjcxNDIz +NDRaMG8xCzAJBgNVBAYTAlpaMRQwEgYDVQQIDAtBcGFjaGUgVGlrYTEPMA0GA1UE +BwwGQXBhY2hlMQ0wCwYDVQQKDARUaWthMRQwEgYDVQQLDAtBcGFjaGUgVGlrYTEU +MBIGA1UEAwwLQXBhY2hlIFRpa2EwgZ8wDQYJKoZIhvcNAQEBBQADgY0AMIGJAoGB +AMeVjMm2uyhe7HkNFFBU6nnI9niJnn+hv+3TTDw23/GH2/d1T3JSpuSFstHvwS0t +umrSb6Py7igD4fz3+aeZU1gDrby4f9KemnLBlAU63VuPyCDyWj2XqbsZDGdnbIy2 +FnBBy2LBn5p0gDwoDpnmhHPHIJZo9OMGH/5hWQUt6+rtAgMBAAGjUzBRMB0GA1Ud +DgQWBBTM8K2WIAuPiv0VgrRoMn2fAGua1jAfBgNVHSMEGDAWgBTM8K2WIAuPiv0V +grRoMn2fAGua1jAPBgNVHRMBAf8EBTADAQH/MA0GCSqGSIb3DQEBCwUAA4GBALqE 
++ja5Hx78Dpym/HxP50TfadwmEes+JXYptykWnuOWgLlqLuGAqJctLOKoR73r7d9d +zJBtdr3A5uTg9vWNMSA2lPdBr/NplNaI8bso+8dRWdkiMut+j7xqTFl8MVMriRSR +a2cA9BsUlpHjJdVjcFweAtdlINZDACoZubCTM7ng +-END CERTIFICATE- diff --git a/tika-parsers/src/test/resources/test-documents/testRSAKEY.der b/tika-parsers/src/test/resources/test-documents/testRSAKEY.der new file mode 100644 index 000..22c4f86 Binary files /dev/null and b/tika-parsers/src/test/resources/test-documents/testRSAKEY.der differ diff --git a/tika-parsers/src/test/resources/test-documents/testRSAKEY.pem b/tika-parsers/src/test/resources/test-documents/testRSAKEY.pem new file mode 100644 index 000..0971b76 --- /dev/null +++ b/tika-parsers/src/test/resources/test-documents/testRSAKEY.pem @@ -0,0 +1,16 @@ +-BEGIN PRIVATE KEY- +MIICeAIBADANBgkqhkiG9w0BAQEFAASCAmIwggJeAgEAAoGBAMeVjMm2uyhe7HkN +FFBU6nnI9niJnn+hv+3TTDw23/GH2/d1T3JSpuSFstHvwS0tumrSb6Py7igD4fz3 ++aeZU1gDrby4f9KemnLBlAU63VuPyCDyWj2XqbsZDGdnbIy2FnBBy2LBn5p0gDwo +DpnmhHPHIJZo9OMGH/5hWQUt6+rtAgMBAAECgYBdb1TGxiYeQzoffZEJ/ob61qsU +SRELnVS16RqigeobL8g5tBCqa6k4CKNrhvt/xA2mnrenID6AOzkb7ZdR8ATEtojF +JjLZ8zmXACU3WetoRUvh2uxlFpxFeK0yQlaEWcvE4Z9MQe3V8pBvMQUNEZxN4bHT +1eMla9O65TR49uxaPQJBAO/Spm9ln02CjnxCHiGmRQ77gUNz39AtrKRLQBv/uEB2 +fhHAvFoSPGXaIgd73GgZEnM/a+faLrMu9NvemMd5aYMCQQDVDAsjaa72+5ZS87zE +xLDrFT1cKM8U1G0ikdGl6rejDnSoiwfZ8DXpSBOOiSkf/PX9zDXDPQl9nHLjmDn9 +wN7PAkEAxsPTF66lGoujZk8yQ/dXczR2DR7Dl/nTBZQsvUfzQNI0aKhSM2C72Dqz +S3qX0Vs+VHBzEYVegTngzT4vZ9wz2wJBAMNXCZdsvUokIA7rALgCCJ1jmiE4Ibdd +lrtNrEZO0hWlmX04DPjc8PF2bsgQJy73R61vYhQjkOIlYoof93wdLa0CQQDTLHSB +8e8f81Jq+zbLReAQ6ch+fEulaMPlPY0OqgExBxdbwXnlPENw09+EiQkKSSo8qhY1 +guri/IWyq3LYm8nE +-END PRIVATE KEY-
[tika] 05/05: Make the DER private key mostly-match a bit more specific
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/tika.git commit 618345263ee41108e1a225dbcdbb8db16b2aae28 Author: Nick Burch AuthorDate: Tue Sep 29 16:51:19 2020 +0100 Make the DER private key mostly-match a bit more specific --- .../main/resources/org/apache/tika/mime/tika-mimetypes.xml| 11 ++- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml index abcc5d5..92cbb21 100644 --- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml +++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml @@ -4676,12 +4676,13 @@ - + + - - + +
[tika] 02/05: TIKA-3205 Add magic for X509 PEM certificate, and tweak default type
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/tika.git commit c3fff83c7e955ff5de0e4cb9098b06a15ee2cf7e Author: Nick Burch AuthorDate: Tue Sep 29 15:49:14 2020 +0100 TIKA-3205 Add magic for X509 PEM certificate, and tweak default type --- .../org/apache/tika/mime/tika-mimetypes.xml| 26 +- .../java/org/apache/tika/TikaDetectionTest.java| 5 +++-- 2 files changed, 28 insertions(+), 3 deletions(-) diff --git a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml index 5dbcf99..3cdea61 100644 --- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml +++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml @@ -4627,10 +4627,34 @@ - + + + + + + + + + + + + + + + + + + + + + + + diff --git a/tika-core/src/test/java/org/apache/tika/TikaDetectionTest.java b/tika-core/src/test/java/org/apache/tika/TikaDetectionTest.java index eb3bb19..2364daa 100644 --- a/tika-core/src/test/java/org/apache/tika/TikaDetectionTest.java +++ b/tika-core/src/test/java/org/apache/tika/TikaDetectionTest.java @@ -624,8 +624,9 @@ public class TikaDetectionTest { assertEquals("application/x-texinfo", tika.detect("x.texi")); assertEquals("application/x-ustar", tika.detect("x.ustar")); assertEquals("application/x-wais-source", tika.detect("x.src")); -assertEquals("application/x-x509-ca-cert", tika.detect("x.der")); -assertEquals("application/x-x509-ca-cert", tika.detect("x.crt")); +// Differ from httpd - use a common parent for CA and User certs +//assertEquals("application/x-x509-ca-cert", tika.detect("x.der")); +//assertEquals("application/x-x509-ca-cert", tika.detect("x.crt")); assertEquals("application/x-xfig", tika.detect("x.fig")); assertEquals("application/x-xpinstall", tika.detect("x.xpi")); assertEquals("application/xenc+xml", tika.detect("x.xenc"));
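For readers following the TIKA-3205 change above, a magic match for a PEM certificate amounts to checking that the data starts with the standard PEM armour line. The following is a minimal, self-contained sketch in plain Java of that idea. It is not Tika's implementation: the class and method names are hypothetical, and the header string is the standard RFC 7468 form rather than anything copied from the commit.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only (not part of Tika).
final class PemCertificateSniffer {
    // Standard PEM armour line for an X.509 certificate (RFC 7468).
    private static final byte[] PEM_CERT_HEADER =
            "-----BEGIN CERTIFICATE-----".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the data starts with the PEM certificate header. */
    static boolean looksLikePemCertificate(byte[] data) {
        if (data == null || data.length < PEM_CERT_HEADER.length) {
            return false;
        }
        for (int i = 0; i < PEM_CERT_HEADER.length; i++) {
            if (data[i] != PEM_CERT_HEADER[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pem = "-----BEGIN CERTIFICATE-----\nMIIC".getBytes(StandardCharsets.US_ASCII);
        byte[] text = "Just some text".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikePemCertificate(pem));
        System.out.println(looksLikePemCertificate(text));
    }
}
```

A prefix check like this only identifies the textual wrapper; it says nothing about whether the enclosed certificate is a CA or a user certificate, which may be why a common parent type is used for both.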
[tika] branch master updated: Tweak whitespace to be consistent
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/tika.git The following commit(s) were added to refs/heads/master by this push: new f233f3b Tweak whitespace to be consistent f233f3b is described below commit f233f3bacb5ec62c948f46d51c2a1ab54744073f Author: Nick Burch AuthorDate: Thu May 28 07:15:16 2020 +0100 Tweak whitespace to be consistent --- tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml index feaef21..4ea9252 100644 --- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml +++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml @@ -3869,8 +3869,8 @@ <_comment>Apple Xcode Memgraph - - + +
[tika] branch master updated (0bf11ae -> 1140091)
This is an automated email from the ASF dual-hosted git repository. nick pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/tika.git. from 0bf11ae TIKA-2961 Make the CAF mime magic more specific to avoid false positives, by checking for a version number after the "caff" header text new e9d62d2 Make the bplist magic more specific where possible, keep version catch-all as now otherwise new 1140091 Add glob for Xcode Memgraph files, which are bplist-based The 2 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference. Summary of changes: .../resources/org/apache/tika/mime/tika-mimetypes.xml | 18 ++ 1 file changed, 18 insertions(+)
[tika] 01/02: Make the bplist magic more specific where possible, keep version catch-all as now otherwise
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/tika.git commit e9d62d24c19250053aee07a59c9e4de5197f2f42 Author: Nick Burch AuthorDate: Thu May 28 07:05:30 2020 +0100 Make the bplist magic more specific where possible, keep version catch-all as now otherwise --- .../main/resources/org/apache/tika/mime/tika-mimetypes.xml| 11 +++ 1 file changed, 11 insertions(+) diff --git a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml index 7210066..aad1c39 100644 --- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml +++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml @@ -3295,6 +3295,17 @@ + + + + + + + + + + +
[tika] 02/02: Add glob for Xcode Memgraph files, which are bplist-based
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/tika.git commit 114009165410c91b57b91fc4eaddb089a8559451 Author: Nick Burch AuthorDate: Thu May 28 07:06:14 2020 +0100 Add glob for Xcode Memgraph files, which are bplist-based --- .../src/main/resources/org/apache/tika/mime/tika-mimetypes.xml | 7 +++ 1 file changed, 7 insertions(+) diff --git a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml index aad1c39..feaef21 100644 --- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml +++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml @@ -3315,6 +3315,7 @@ + <_comment>GNU tar Compressed File Archive (GNU Tape Archive) @@ -3866,6 +3867,12 @@ + +<_comment>Apple Xcode Memgraph + + + + MOBI <_comment>Mobipocket Ebook
[tika] branch master updated: TIKA-2961 Make the CAF mime magic more specific to avoid false positives, by checking for a version number after the "caff" header text
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/tika.git The following commit(s) were added to refs/heads/master by this push: new 0bf11ae TIKA-2961 Make the CAF mime magic more specific to avoid false positives, by checking for a version number after the "caff" header text 0bf11ae is described below commit 0bf11aec86079b8f1ae2f1ea680910ba79665c4f Author: Nick Burch AuthorDate: Mon May 18 05:06:27 2020 +0100 TIKA-2961 Make the CAF mime magic more specific to avoid false positives, by checking for a version number after the "caff" header text --- .../src/main/resources/org/apache/tika/mime/tika-mimetypes.xml | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml index 551e55e..7210066 100644 --- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml +++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml @@ -5139,7 +5139,11 @@ <_comment>Core Audio Format <_comment>com.apple.coreaudio-format - + + + + +
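To illustrate why checking for a version number after the "caff" text reduces false positives, here is a small sketch in plain Java. It is an illustration under stated assumptions, not the magic definition from the commit: it assumes the version is a big-endian 16-bit field with value 1 directly after the four-byte tag, and the class and method names are hypothetical.

```java
// Hypothetical helper, for illustration only (not part of Tika).
final class CafSniffer {
    /**
     * Returns true if the data starts with "caff" followed by a
     * 16-bit big-endian version field equal to 1 (assumed layout).
     */
    static boolean looksLikeCaf(byte[] data) {
        if (data == null || data.length < 6) {
            return false;
        }
        boolean hasTag = data[0] == 'c' && data[1] == 'a'
                && data[2] == 'f' && data[3] == 'f';
        // Combine bytes 4 and 5 into an unsigned 16-bit value.
        int version = ((data[4] & 0xFF) << 8) | (data[5] & 0xFF);
        return hasTag && version == 1;
    }
}
```

With only the four-byte tag, a text file beginning with a word such as "caffeine" would match. With the extra version check it does not, because the bytes after the tag are letters rather than the expected version value.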
[tika] branch master updated: TIKA-3023 Make the SGI Movie mime magic more specific to avoid false positives on text files starting with MOVI
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/tika.git The following commit(s) were added to refs/heads/master by this push: new 0d259bc TIKA-3023 Make the SGI Movie mime magic more specific to avoid false positives on text files starting with MOVI 0d259bc is described below commit 0d259bc8b6beccaa9bac2e85212b57a48f171e83 Author: Nick Burch AuthorDate: Thu Feb 6 11:42:30 2020 + TIKA-3023 Make the SGI Movie mime magic more specific to avoid false positives on text files starting with MOVI --- .../src/main/resources/org/apache/tika/mime/tika-mimetypes.xml | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml index 174dad0..3211cfb 100644 --- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml +++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml @@ -7438,7 +7438,11 @@ - + + + + +
[tika] branch master updated: TIKA-3034 Mathematica files don't have a unique magic, but try to detect based on the file starting with a Mathematica-style comment as all we can do. Also add the newer
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/tika.git The following commit(s) were added to refs/heads/master by this push: new f5571fa TIKA-3034 Mathematica files don't have a unique magic, but try to detect based on the file starting with a Mathematica-style comment as all we can do. Also add the newer Wolfram Language mimetype, which extends mathematica, with a unix detection f5571fa is described below commit f5571fa99ef6f178a16bd1bd3a3cded83c7b0013 Author: Nick Burch AuthorDate: Tue Feb 4 10:31:31 2020 + TIKA-3034 Mathematica files don't have a unique magic, but try to detect based on the file starting with a Mathematica-style comment as all we can do. Also add the newer Wolfram Language mimetype, which extends mathematica, with a unix detection --- .../resources/org/apache/tika/mime/tika-mimetypes.xml | 18 ++ 1 file changed, 18 insertions(+) diff --git a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml index 34e8d98..174dad0 100644 --- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml +++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml @@ -409,11 +409,29 @@ + +<_comment>Wolfram Mathematica + + + + + + + + +<_comment>Wolfram Language + + + + + + +
[tika] 04/05: HEIF detection unit test. When tooling improves, should ideally create another HEIF test file with another codec too
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/tika.git commit 433a8c1625d302bf1a9d81f2ad1223df7bf83d31 Author: Nick Burch AuthorDate: Mon Nov 18 14:57:09 2019 + HEIF detection unit test. When tooling improves, should ideally create another HEIF test file with another codec too --- .../src/test/java/org/apache/tika/mime/TestMimeTypes.java| 12 1 file changed, 12 insertions(+) diff --git a/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java b/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java index bdf7da1..d45d116 100644 --- a/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java +++ b/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java @@ -408,6 +408,18 @@ public class TestMimeTypes { } @Test +public void testHeifDetection() throws Exception { +// HEIF image using the HEVC Codec == HEIC +// created using https://compare.rokka.io/_compare on testJPEG_GEO.jpg +assertType("image/heic", "testHEIF.heic"); +assertTypeByData("image/heic", "testHEIF.heic"); +assertTypeByName("image/heic", "testHEIF.heic"); + +// TODO Create a HEIF using another codec, to test .heif data +assertTypeByName("image/heif", "testHEIF.heif"); +} + +@Test public void testJpegDetection() throws Exception { assertType("image/jpeg", "testJPEG.jpg"); assertTypeByData("image/jpeg", "testJPEG.jpg");
[tika] 02/05: Test file uses the HEVC codec, so switch to the more specific extension
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/tika.git commit efd071aa595d76d094f549f25db856229baace5d Author: Nick Burch AuthorDate: Mon Nov 18 14:54:42 2019 + Test file uses the HEVC codec, so switch to the more specific extension --- .../test-documents/{testHEIF.heif => testHEIF.heic} | Bin 1 file changed, 0 insertions(+), 0 deletions(-) diff --git a/tika-parsers/src/test/resources/test-documents/testHEIF.heif b/tika-parsers/src/test/resources/test-documents/testHEIF.heic similarity index 100% rename from tika-parsers/src/test/resources/test-documents/testHEIF.heif rename to tika-parsers/src/test/resources/test-documents/testHEIF.heic
[tika] branch master updated (f6a5749 -> 1bb1895)
This is an automated email from the ASF dual-hosted git repository. nick pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/tika.git. from f6a5749 TIKA-2982 -- don't require 'DataSpaces' in ooxml-encrypted detection new 8cfacfe Test HEIF file, generated with https://compare.rokka.io/_compare on testJPEG_GEO.jpg new efd071a Test file uses the HEVC codec, so switch to the more specific extension new 0758598 Add mimetypes for the HEIF (High Efficiency Image File) format family - TIKA-2942 new 433a8c1 HEIF detection unit test. When tooling improves, should ideally create another HEIF test file with another codec too new 1bb1895 Changelog update The 5 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference. Summary of changes: CHANGES.txt| 3 +- .../org/apache/tika/mime/tika-mimetypes.xml| 40 + .../java/org/apache/tika/mime/TestMimeTypes.java | 12 +++ .../test/resources/test-documents/testHEIF.heic| Bin 0 -> 13706 bytes 4 files changed, 54 insertions(+), 1 deletion(-) create mode 100644 tika-parsers/src/test/resources/test-documents/testHEIF.heic
[tika] 05/05: Changelog update
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/tika.git commit 1bb1895a30b722a9780122a6447598dd29e75ca7 Author: Nick Burch AuthorDate: Mon Nov 18 15:00:33 2019 + Changelog update --- CHANGES.txt | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/CHANGES.txt b/CHANGES.txt index 3b66d3b..17b401c 100644 --- a/CHANGES.txt +++ b/CHANGES.txt @@ -16,7 +16,8 @@ Release 1.23 * Add parser for XLIFF v1.2 files (TIKA-2975). - * Add mime type detection support for WebAssembly (TIKA-2894). + * Add mime type detection support for WebAssembly (TIKA-2894) and + HEIF / HEIC images (TIKA-2942). * Add an XLZ Parser (TIKA-2976).
[tika] 03/05: Add mimetypes for the HEIF (High Efficiency Image File) format family - TIKA-2942
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/tika.git commit 0758598bece92f97418f88d0c443e8d9cff7a7ee Author: Nick Burch AuthorDate: Mon Nov 18 14:55:45 2019 + Add mimetypes for the HEIF (High Efficiency Image File) format family - TIKA-2942 --- .../org/apache/tika/mime/tika-mimetypes.xml| 40 ++ 1 file changed, 40 insertions(+) diff --git a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml index c5ad55d..6e967b6 100644 --- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml +++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml @@ -5314,6 +5314,46 @@ + +<_comment>HEIF - High Efficiency Image File +HEIF + https://en.wikipedia.org/wiki/High_Efficiency_Image_File_Format + + + + + + + +<_comment>HEIF Sequence - High Efficiency Image Sequence + + + + + + + + +<_comment>HEIF Image using HEVC Codec +HEIC + + + + + + + + + +<_comment>HEIF Sequence using HEVC Codec +HEVC + + + + + + + <_comment>Apple Icon Image Format
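As background for the HEIF family added above, these formats are ISO base media files, so one common way to tell them apart is to read the major brand in the leading "ftyp" box. The sketch below, in plain Java, shows that idea only. It is not the magic from tika-mimetypes.xml: the class and method names are hypothetical, and the mapping of brand "heic" to image/heic and brand "mif1" to image/heif is an assumption for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only (not part of Tika).
final class HeifBrandSniffer {
    /**
     * Returns the four-character major brand, or null if the data
     * does not begin with an "ftyp" box (type at offset 4, brand at offset 8).
     */
    static String majorBrand(byte[] data) {
        if (data == null || data.length < 12) {
            return null;
        }
        String boxType = new String(data, 4, 4, StandardCharsets.US_ASCII);
        if (!"ftyp".equals(boxType)) {
            return null;
        }
        return new String(data, 8, 4, StandardCharsets.US_ASCII);
    }

    /** Maps a small, assumed set of brands to media types; null if unknown. */
    static String guessType(byte[] data) {
        String brand = majorBrand(data);
        if (brand == null) {
            return null;
        }
        switch (brand) {
            case "heic":
                return "image/heic"; // assumed: HEIF image using the HEVC codec
            case "mif1":
                return "image/heif"; // assumed: generic HEIF image
            default:
                return null;
        }
    }
}
```

Under these assumptions, a file with the brand "heic" would be reported as image/heic from its data alone, which is the outcome the testHEIF.heic assertions in the unit test expect.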
svn commit: r1869088 - /tika/site/src/site/resources/doap.rdf
Author: nick Date: Mon Oct 28 21:35:45 2019 New Revision: 1869088 URL: http://svn.apache.org/viewvc?rev=1869088&view=rev Log: Correct the RDF link for the projects category, and add us to Content too Modified: tika/site/src/site/resources/doap.rdf Modified: tika/site/src/site/resources/doap.rdf URL: http://svn.apache.org/viewvc/tika/site/src/site/resources/doap.rdf?rev=1869088&r1=1869087&r2=1869088&view=diff == --- tika/site/src/site/resources/doap.rdf (original) +++ tika/site/src/site/resources/doap.rdf Mon Oct 28 21:35:45 2019 @@ -38,7 +38,8 @@ Java -https://projects.apache.org/projects.html?category#library; /> +http://projects.apache.org/category/content; /> +http://projects.apache.org/category/library; /> Apache Tika 1.22
svn commit: r1867123 - in /tika/site: pom.xml publish/.htaccess publish/source-repository.html
Author: nick Date: Wed Sep 18 13:56:36 2019 New Revision: 1867123 URL: http://svn.apache.org/viewvc?rev=1867123&view=rev Log: TIKA-2947 Update the source control details in the site pom, so the auto-generated source repo file is correct Added: tika/site/publish/source-repository.html Removed: tika/site/publish/.htaccess Modified: tika/site/pom.xml Modified: tika/site/pom.xml URL: http://svn.apache.org/viewvc/tika/site/pom.xml?rev=1867123&r1=1867122&r2=1867123&view=diff == --- tika/site/pom.xml (original) +++ tika/site/pom.xml Wed Sep 18 13:56:36 2019 @@ -39,12 +39,12 @@ - scm:svn:http://svn.apache.org/repos/asf/tika/trunk + scm:git:https://github.com/apache/tika/ - scm:svn:https://svn.apache.org/repos/asf/tika/trunk + scm:git:https://gitbox.apache.org/repos/asf/tika.git -http://svn.apache.org/repos/asf/tika/trunk +https://github.com/apache/tika/ Added: tika/site/publish/source-repository.html URL: http://svn.apache.org/viewvc/tika/site/publish/source-repository.html?rev=1867123&view=auto == --- tika/site/publish/source-repository.html (added) +++ tika/site/publish/source-repository.html Wed Sep 18 13:56:36 2019 @@ -0,0 +1,457 @@ +http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd;> + + + + + + + + + +http://www.w3.org/1999/xhtml;> + + +Apache Tika Source Repository + + @import url("./css/site.css"); + + + + function selectProvider(form) { +provider = form.elements['searchProvider'].value; +if (provider == "any") { + if (Math.random() > 0.5) { +provider = "lucid"; + } else { +provider = "sl"; + } +} +if (provider == "lucid") { + form.action = "http://find.searchhub.org/p:tika"; +} else if (provider == "sl") { + form.action = "http://search-lucene.com/tika"; +} +days = 90; +date = new Date(); +date.setTime(date.getTime() + (days * 24 * 60 * 60 * 1000)); +expires = "; expires=" + date.toGMTString(); +document.cookie = "searchProvider=" + provider + expires + "; 
path=/"; + } + function initProvider() { +if (document.cookie.length>0) { + cStart=document.cookie.indexOf("searchProvider="); + if (cStart!=-1) { +cStart=cStart + "searchProvider=".length; +cEnd=document.cookie.indexOf(";", cStart); +if (cEnd==-1) { + cEnd=document.cookie.length; +} +provider = unescape(document.cookie.substring(cStart,cEnd)); +document.forms['searchform'].elements['searchProvider'].value = provider; + } +} +document.forms['searchform'].elements['q'].focus(); + } + + + + + +https://tika.apache.org; id="bannerLeft" title="Apache Tika" + >https://tika.apache.org/tika.png; alt="Apache Tika" +width="292" height="100"/> +https://www.apache.org/; id="bannerRight" + title="The Apache Software Foundation" + >https://tika.apache.org/asf-logo.gif; alt="The Apache Software Foundation" +width="387" height="100"/> + + + +Overview +This project uses http://git-scm.com/;>GIT to manage its source code. Instructions on GIT use can be found at http://git-scm.com/documentation;>http://git-scm.com/documentation. + +Web Access +The following is a link to the online source repository. + +https://github.com/apache/tika/;>https://github.com/apache/tika/ + +Anonymous access +The source can be checked out anonymously from GIT with this command (See http://git-scm.com/docs/git-clone;>http://git-scm.com/docs/git-clone): + +$ git clone https://github.com/apache/tika/ + +Developer access +Only project developers can access the GIT tree via this method (See http://git-scm.com/docs/git-clone;>http://git-scm.com/docs/git-clone). + +$ git clone https://gitbox.apache.org/repos/asf/tika.git + +Access from behind a firewall +Refer to the documentation of the SCM used for more information about access behind a firewall. + + + +Apache Tika + + + +Introduction + + + +Download +
svn commit: r1867122 [2/2] - in /tika/site/publish: 0.10/ 0.5/ 0.6/ 0.7/ 0.8/ 0.9/ 1.0/ 1.1/ 1.10/ 1.11/ 1.12/ 1.13/ 1.14/ 1.15/ 1.16/ 1.17/ 1.18/ 1.19.1/ 1.19/ 1.2/ 1.20/ 1.21/ 1.22/ 1.3/ 1.4/ 1.5/ 1
Modified: tika/site/publish/1.18/gettingstarted.html URL: http://svn.apache.org/viewvc/tika/site/publish/1.18/gettingstarted.html?rev=1867122&r1=1867121&r2=1867122&view=diff == --- tika/site/publish/1.18/gettingstarted.html (original) +++ tika/site/publish/1.18/gettingstarted.html Wed Sep 18 13:50:40 2019 @@ -89,11 +89,10 @@ This document describes how to build Apache Tika from sources and how to start using Tika in an application. Getting and building the sources -To build Tika from sources you first need to either download a source release or checkout the latest sources from version control. +To build Tika from sources you first need to either download a source release or checkout the latest sources from version control. Once you have the sources, you can build them using the http://maven.apache.org/;>Maven 2 build system. Executing the following command in the base directory will build the sources and install the resulting artifacts in your local Maven repository. -mvn install - +mvn install See the Maven documentation for more information about the available build options. Note that you need Java 7 or higher to build Tika. @@ -120,36 +119,31 @@ groupIdorg.apache.tika/groupId artifactIdtika-core/artifactId version1.18/version - /dependency - + /dependency If you want to use Tika to parse documents (instead of simply detecting document types, etc.), you'll want to depend on tika-parsers instead: dependency groupIdorg.apache.tika/groupId artifactIdtika-parsers/artifactId version1.18/version - /dependency - + /dependency Note that adding this dependency will introduce a number of transitive dependencies to your project, including one on tika-core. You need to make sure that these dependencies won't conflict with your existing project dependencies. You can use the following command in the tika-parsers directory to get a full listing of all the dependencies. 
-$ mvn dependency:tree | grep :compile - +$ mvn dependency:tree | grep :compile Using Tika in a Gradle-built project To add a dependency on Apache Tika to your Gradle built project, including the full set of parsers, you should depend on the tika-parsers artifact: dependencies { runtime 'org.apache.tika:tika-parsers:1.18' -} - +} Using Tika in an Ant project If you are using http://ant.apache.org/ivy/;>Apache Ivy as your dependency manager tool with Ant, then to include Tika with the full set of parsers, you should depend on the tika-parsers artifact like this: dependencies dependency org=org.apache.tika name=tika-parsers rev=1.18/ -/dependencies - +/dependencies Otherwise, probably the easiest way to use Tika is to include the full tika-app jar on your classpath. For just core functionality, you can add the tika-core jar, but be aware that the full set of parsers have a large number of dependencies which must be included which is very fiddly to do by hand with Ant! To include Tika in your Ant project, you should do something like: classpath @@ -160,8 +154,7 @@ !-- or: Tika with all Parsers-- pathelement location=path/to/tika-app-${tika.version}.jar/ -/classpath - +/classpath Using Tika as a command line utility The Tika application jar (tika-app-*.jar) can be used as a command line utility for extracting text content and metadata from all sorts of files. This runnable jar contains all the dependencies it needs, so you don't need to worry about classpath settings to run it. @@ -277,15 +270,13 @@ Batch Options: To modify child process jvm args, prepend J as in: -JXmx4g or -JDlog4j.configuration=file:log4j.xml. - You can also use the jar as a component in a Unix pipeline or as an external tool in many scripting languages. 
# Check if an Internet resource contains a specific keyword curl http://.../document.doc \ | java -jar tika-app.jar --text \ - | grep -q keyword - + | grep -q keyword Wrappers Several wrappers are available to use Tika in another programming language, such as https://github.com/aviks/Taro.jl;>Julia or https://github.com/chrismattmann/tika-python;>Python. Modified: tika/site/publish/1.19.1/gettingstarted.html URL: http://svn.apache.org/viewvc/tika/site/publish/1.19.1/gettingstarted.html?rev=1867122&r1=1867121&r2=1867122&view=diff == --- tika/site/publish/1.19.1/gettingstarted.html (original) +++ tika/site/publish/1.19.1/gettingstarted.html Wed Sep 18 13:50:40 2019 @@ -89,11 +89,10 @@ This document describes how to build Apache Tika from sources and how to start using Tika in an application. Getting and building the sources -To build Tika from sources you first need to either download a source release or checkout the latest sources from version control. +To build Tika from sources you first need to either download a source
svn commit: r1867122 [1/2] - in /tika/site/publish: 0.10/ 0.5/ 0.6/ 0.7/ 0.8/ 0.9/ 1.0/ 1.1/ 1.10/ 1.11/ 1.12/ 1.13/ 1.14/ 1.15/ 1.16/ 1.17/ 1.18/ 1.19.1/ 1.19/ 1.2/ 1.20/ 1.21/ 1.22/ 1.3/ 1.4/ 1.5/ 1
Author: nick Date: Wed Sep 18 13:50:40 2019 New Revision: 1867122 URL: http://svn.apache.org/viewvc?rev=1867122&view=rev Log: TIKA-2947 Update source code link Modified: tika/site/publish/0.10/gettingstarted.html tika/site/publish/0.5/gettingstarted.html tika/site/publish/0.6/gettingstarted.html tika/site/publish/0.7/gettingstarted.html tika/site/publish/0.8/gettingstarted.html tika/site/publish/0.9/gettingstarted.html tika/site/publish/1.0/gettingstarted.html tika/site/publish/1.1/gettingstarted.html tika/site/publish/1.10/gettingstarted.html tika/site/publish/1.11/gettingstarted.html tika/site/publish/1.12/gettingstarted.html tika/site/publish/1.13/gettingstarted.html tika/site/publish/1.14/gettingstarted.html tika/site/publish/1.15/gettingstarted.html tika/site/publish/1.16/gettingstarted.html tika/site/publish/1.17/gettingstarted.html tika/site/publish/1.18/gettingstarted.html tika/site/publish/1.19.1/gettingstarted.html tika/site/publish/1.19/gettingstarted.html tika/site/publish/1.2/gettingstarted.html tika/site/publish/1.20/gettingstarted.html tika/site/publish/1.21/gettingstarted.html tika/site/publish/1.22/gettingstarted.html tika/site/publish/1.3/gettingstarted.html tika/site/publish/1.4/gettingstarted.html tika/site/publish/1.5/gettingstarted.html tika/site/publish/1.6/gettingstarted.html tika/site/publish/1.7/gettingstarted.html tika/site/publish/1.8/gettingstarted.html tika/site/publish/1.9/gettingstarted.html Modified: tika/site/publish/0.10/gettingstarted.html URL: http://svn.apache.org/viewvc/tika/site/publish/0.10/gettingstarted.html?rev=1867122&r1=1867121&r2=1867122&view=diff == --- tika/site/publish/0.10/gettingstarted.html (original) +++ tika/site/publish/0.10/gettingstarted.html Wed Sep 18 13:50:40 2019 @@ -89,11 +89,10 @@ This document describes how to build Apache Tika from sources and how to start using Tika in an application. 
Getting and building the sources -To build Tika from sources you first need to either download a source release or checkout the latest sources from version control. +To build Tika from sources you first need to either download a source release or checkout the latest sources from version control. Once you have the sources, you can build them using the http://maven.apache.org/;>Maven 2 build system. Executing the following command in the base directory will build the sources and install the resulting artifacts in your local Maven repository. -mvn install - +mvn install See the Maven documentation for more information about the available build options. Note that you need Java 5 or higher to build Tika. @@ -116,16 +115,14 @@ groupIdorg.apache.tika/groupId artifactIdtika-core/artifactId version0.10/version - /dependency - + /dependency If you want to use Tika to parse documents (instead of simply detecting document types, etc.), you'll want to depend on tika-parsers instead: dependency groupIdorg.apache.tika/groupId artifactIdtika-parsers/artifactId version0.10/version - /dependency - + /dependency Note that adding this dependency will introduce a number of transitive dependencies to your project, including one on tika-core. You need to make sure that these dependencies won't conflict with your existing project dependencies. The listing below shows all the compile-scope dependencies of tika-parsers in the Tika 0.10 release. org.apache.tika:tika-parsers:bundle:0.10 @@ -154,8 +151,7 @@ +- com.drewnoakes:metadata-extractor:jar:2.4.0-beta-1:compile +- de.l3s.boilerpipe:boilerpipe:jar:1.1.0:compile +- rome:rome:jar:0.9:compile -| \- jdom:jdom:jar:1.0:compile - +| \- jdom:jdom:jar:1.0:compile Using Tika in an Ant project Unless you use a dependency manager tool like http://ant.apache.org/ivy/;>Apache Ivy, to use Tika in you application you can include the Tika jar files and the dependencies individually. 
@@ -187,8 +183,7 @@ pathelement location=path/to/boilerpipe-1.1.0.jar/ pathelement location=path/to/rome-0.9.jar/ pathelement location=path/to/jdom-1.0.jar/ -/classpath - +/classpath An easy way to gather all these libraries is to run mvn dependency:copy-dependencies in the tika-parsers source directory. This will copy all Tika dependencies to the target/dependencies directory. Alternatively you can simply drop the entire tika-app jar to your classpath to get all of the above dependencies in a single archive. @@ -253,15 +248,13 @@ Description: Use the -server (or -s) option to start the Apache Tika server. The server will listen to the -ports you specify as one or more arguments. - +ports you specify as one or more arguments. You can also use the jar as a component in a Unix pipeline or as an external tool in many scripting languages. # Check if an Internet re
svn commit: r1867120 - /tika/site/publish/.htaccess
Author: nick
Date: Wed Sep 18 13:43:41 2019
New Revision: 1867120

URL: http://svn.apache.org/viewvc?rev=1867120&view=rev
Log:
Remove the old source code page, redirect to the new one

Modified:
    tika/site/publish/.htaccess

Modified: tika/site/publish/.htaccess
URL: http://svn.apache.org/viewvc/tika/site/publish/.htaccess?rev=1867120&r1=1867119&r2=1867120&view=diff
==============================================================================
--- tika/site/publish/.htaccess (original)
+++ tika/site/publish/.htaccess Wed Sep 18 13:43:41 2019
@@ -2,4 +2,4 @@
 # See http://httpd.apache.org/docs/current/mod/mod_alias.html#redirect
 
 # Redirect old source code page to the new one
-Redirect source-repository.html contribute.html
+Redirect "/source-repository.html" "/contribute.html"
svn commit: r1867119 - /tika/site/publish/.htaccess
Author: nick
Date: Wed Sep 18 13:42:35 2019
New Revision: 1867119

URL: http://svn.apache.org/viewvc?rev=1867119&view=rev
Log:
Remove the old source code page, redirect to the new one

Modified:
    tika/site/publish/.htaccess

Modified: tika/site/publish/.htaccess
URL: http://svn.apache.org/viewvc/tika/site/publish/.htaccess?rev=1867119&r1=1867118&r2=1867119&view=diff
==============================================================================
--- tika/site/publish/.htaccess (original)
+++ tika/site/publish/.htaccess Wed Sep 18 13:42:35 2019
@@ -2,4 +2,4 @@
 # See http://httpd.apache.org/docs/current/mod/mod_alias.html#redirect
 
 # Redirect old source code page to the new one
-Redirect source-repository.html contribute.html#Source_Code
+Redirect source-repository.html contribute.html
svn commit: r1867118 - in /tika/site/publish: .htaccess source-repository.html
Author: nick
Date: Wed Sep 18 13:41:56 2019
New Revision: 1867118

URL: http://svn.apache.org/viewvc?rev=1867118&view=rev
Log:
Remove the old source code page, redirect to the new one

Added:
    tika/site/publish/.htaccess
Removed:
    tika/site/publish/source-repository.html

Added: tika/site/publish/.htaccess
URL: http://svn.apache.org/viewvc/tika/site/publish/.htaccess?rev=1867118&view=auto
==============================================================================
--- tika/site/publish/.htaccess (added)
+++ tika/site/publish/.htaccess Wed Sep 18 13:41:56 2019
@@ -0,0 +1,5 @@
+# Apache Tika website redirects
+# See http://httpd.apache.org/docs/current/mod/mod_alias.html#redirect
+
+# Redirect old source code page to the new one
+Redirect source-repository.html contribute.html#Source_Code
svn commit: r1867117 - in /tika/site/src/site/apt: 0.10/ 0.5/ 0.6/ 0.7/ 0.8/ 0.9/ 1.0/ 1.1/ 1.10/ 1.11/ 1.12/ 1.13/ 1.14/ 1.15/ 1.16/ 1.17/ 1.18/ 1.19.1/ 1.19/ 1.2/ 1.20/ 1.21/ 1.22/ 1.3/ 1.4/ 1.5/ 1.
Author: nick Date: Wed Sep 18 13:38:59 2019 New Revision: 1867117 URL: http://svn.apache.org/viewvc?rev=1867117=rev Log: TIKA-2947 Fix source code documentation link Modified: tika/site/src/site/apt/0.10/gettingstarted.apt tika/site/src/site/apt/0.5/gettingstarted.apt tika/site/src/site/apt/0.6/gettingstarted.apt tika/site/src/site/apt/0.7/gettingstarted.apt tika/site/src/site/apt/0.8/gettingstarted.apt tika/site/src/site/apt/0.9/gettingstarted.apt tika/site/src/site/apt/1.0/gettingstarted.apt tika/site/src/site/apt/1.1/gettingstarted.apt tika/site/src/site/apt/1.10/gettingstarted.apt tika/site/src/site/apt/1.11/gettingstarted.apt tika/site/src/site/apt/1.12/gettingstarted.apt tika/site/src/site/apt/1.13/gettingstarted.apt tika/site/src/site/apt/1.14/gettingstarted.apt tika/site/src/site/apt/1.15/gettingstarted.apt tika/site/src/site/apt/1.16/gettingstarted.apt tika/site/src/site/apt/1.17/gettingstarted.apt tika/site/src/site/apt/1.18/gettingstarted.apt tika/site/src/site/apt/1.19.1/gettingstarted.apt tika/site/src/site/apt/1.19/gettingstarted.apt tika/site/src/site/apt/1.2/gettingstarted.apt tika/site/src/site/apt/1.20/gettingstarted.apt tika/site/src/site/apt/1.21/gettingstarted.apt tika/site/src/site/apt/1.22/gettingstarted.apt tika/site/src/site/apt/1.3/gettingstarted.apt tika/site/src/site/apt/1.4/gettingstarted.apt tika/site/src/site/apt/1.5/gettingstarted.apt tika/site/src/site/apt/1.6/gettingstarted.apt tika/site/src/site/apt/1.7/gettingstarted.apt tika/site/src/site/apt/1.8/gettingstarted.apt tika/site/src/site/apt/1.9/gettingstarted.apt Modified: tika/site/src/site/apt/0.10/gettingstarted.apt URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/0.10/gettingstarted.apt?rev=1867117=1867116=1867117=diff == --- tika/site/src/site/apt/0.10/gettingstarted.apt (original) +++ tika/site/src/site/apt/0.10/gettingstarted.apt Wed Sep 18 13:38:59 2019 @@ -26,7 +26,7 @@ Getting and building the sources To build Tika from sources you first need to either 
{{{../download.html}download}} a source release or - {{{../source-repository.html}checkout}} the latest sources from + {{{../contribute.html#Source_Code}checkout}} the latest sources from version control. Once you have the sources, you can build them using the Modified: tika/site/src/site/apt/0.5/gettingstarted.apt URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/0.5/gettingstarted.apt?rev=1867117=1867116=1867117=diff == --- tika/site/src/site/apt/0.5/gettingstarted.apt (original) +++ tika/site/src/site/apt/0.5/gettingstarted.apt Wed Sep 18 13:38:59 2019 @@ -26,7 +26,7 @@ Getting and building the sources To build Tika from sources you first need to either {{{../download.html}download}} a source release or - {{{../source-repository.html}checkout}} the latest sources from + {{{../contribute.html#Source_Code}checkout}} the latest sources from version control. Once you have the sources, you can build them using the Modified: tika/site/src/site/apt/0.6/gettingstarted.apt URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/0.6/gettingstarted.apt?rev=1867117=1867116=1867117=diff == --- tika/site/src/site/apt/0.6/gettingstarted.apt (original) +++ tika/site/src/site/apt/0.6/gettingstarted.apt Wed Sep 18 13:38:59 2019 @@ -26,7 +26,7 @@ Getting and building the sources To build Tika from sources you first need to either {{{../download.html}download}} a source release or - {{{../source-repository.html}checkout}} the latest sources from + {{{../contribute.html#Source_Code}checkout}} the latest sources from version control. 
Once you have the sources, you can build them using the Modified: tika/site/src/site/apt/0.7/gettingstarted.apt URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/0.7/gettingstarted.apt?rev=1867117=1867116=1867117=diff == --- tika/site/src/site/apt/0.7/gettingstarted.apt (original) +++ tika/site/src/site/apt/0.7/gettingstarted.apt Wed Sep 18 13:38:59 2019 @@ -26,7 +26,7 @@ Getting and building the sources To build Tika from sources you first need to either {{{../download.html}download}} a source release or - {{{../source-repository.html}checkout}} the latest sources from + {{{../contribute.html#Source_Code}checkout}} the latest sources from version control. Once you have the sources, you can build them using the Modified: tika/site/src/site/apt/0.8/gettingstarted.apt URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/0.8/gettingstarted.apt?rev=1867117=1867116=1867117=diff
[tika] 03/03: Use the new RSS 2.0 file in tests too, alongside the current 0.91 one
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/tika.git commit a0546b6cb98c949bb747b2e0e8d5675f651f6a16 Author: Nick Burch AuthorDate: Wed Oct 17 17:43:12 2018 +0100 Use the new RSS 2.0 file in tests too, alongside the current 0.91 one --- .../java/org/apache/tika/mime/TestMimeTypes.java | 3 ++ .../apache/tika/parser/feed/FeedParserTest.java| 38 +- 2 files changed, 25 insertions(+), 16 deletions(-) diff --git a/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java b/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java index bfb4c62..a527d4e 100644 --- a/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java +++ b/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java @@ -387,9 +387,12 @@ public class TestMimeTypes { @Test public void testFeedsDetection() throws Exception { assertType("application/rss+xml", "rsstest_091.rss"); +assertType("application/rss+xml", "rsstest_20.rss"); assertType("application/atom+xml", "testATOM.atom"); assertTypeByData("application/rss+xml", "rsstest_091.rss"); assertTypeByName("application/rss+xml", "rsstest_091.rss"); +assertTypeByData("application/rss+xml", "rsstest_20.rss"); +assertTypeByName("application/rss+xml", "rsstest_20.rss"); assertTypeByData("application/atom+xml", "testATOM.atom"); assertTypeByName("application/atom+xml", "testATOM.atom"); } diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/feed/FeedParserTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/feed/FeedParserTest.java index d7e7c76..1a5c293 100644 --- a/tika-parsers/src/test/java/org/apache/tika/parser/feed/FeedParserTest.java +++ b/tika-parsers/src/test/java/org/apache/tika/parser/feed/FeedParserTest.java @@ -31,22 +31,28 @@ import org.xml.sax.ContentHandler; public class FeedParserTest { @Test public void testRSSParser() throws Exception { -try (InputStream input = 
FeedParserTest.class.getResourceAsStream( -"/test-documents/rsstest_091.rss")) { -Metadata metadata = new Metadata(); -ContentHandler handler = new BodyContentHandler(); -ParseContext context = new ParseContext(); - -new FeedParser().parse(input, handler, metadata, context); - -String content = handler.toString(); -assertFalse(content == null); - -assertEquals("Sample RSS File for Junit test", -metadata.get(TikaCoreProperties.DESCRIPTION)); -assertEquals("TestChannel", metadata.get(TikaCoreProperties.TITLE)); - -// TODO find a way of testing the paragraphs and anchors +// These RSS files should have basically the same contents, +// represented in the various RSS format versions +for (String rssFile : new String[] { +"/test-documents/rsstest_091.rss", +"/test-documents/rsstest_20.rss" +}) { +try (InputStream input = FeedParserTest.class.getResourceAsStream(rssFile)) { +Metadata metadata = new Metadata(); +ContentHandler handler = new BodyContentHandler(); +ParseContext context = new ParseContext(); + +new FeedParser().parse(input, handler, metadata, context); + +String content = handler.toString(); +assertFalse(content == null); + +assertEquals("Sample RSS File for Junit test", +metadata.get(TikaCoreProperties.DESCRIPTION)); +assertEquals("TestChannel", metadata.get(TikaCoreProperties.TITLE)); + +// TODO find a way of testing the paragraphs and anchors +} } }
[tika] 01/03: RSS test file is RSS v0.91, so name appropriately
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/tika.git commit 429b22b2ac9ff96cfca714895d65dce311522616 Author: Nick Burch AuthorDate: Wed Oct 17 17:15:33 2018 +0100 RSS test file is RSS v0.91, so name appropriately --- tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java | 6 +++--- .../src/test/java/org/apache/tika/parser/AutoDetectParserTest.java | 2 +- .../src/test/java/org/apache/tika/parser/feed/FeedParserTest.java | 2 +- .../test/resources/test-documents/{rsstest.rss => rsstest_091.rss} | 0 4 files changed, 5 insertions(+), 5 deletions(-) diff --git a/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java b/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java index 9205530..bfb4c62 100644 --- a/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java +++ b/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java @@ -386,10 +386,10 @@ public class TestMimeTypes { @Test public void testFeedsDetection() throws Exception { -assertType("application/rss+xml", "rsstest.rss"); +assertType("application/rss+xml", "rsstest_091.rss"); assertType("application/atom+xml", "testATOM.atom"); -assertTypeByData("application/rss+xml", "rsstest.rss"); -assertTypeByName("application/rss+xml", "rsstest.rss"); +assertTypeByData("application/rss+xml", "rsstest_091.rss"); +assertTypeByName("application/rss+xml", "rsstest_091.rss"); assertTypeByData("application/atom+xml", "testATOM.atom"); assertTypeByName("application/atom+xml", "testATOM.atom"); } diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/AutoDetectParserTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/AutoDetectParserTest.java index 10d2a0f..ddbbd75 100644 --- a/tika-parsers/src/test/java/org/apache/tika/parser/AutoDetectParserTest.java +++ b/tika-parsers/src/test/java/org/apache/tika/parser/AutoDetectParserTest.java @@ -241,7 
+241,7 @@ public class AutoDetectParserTest extends TikaTest { @Test public void testRss() throws Exception { -assertAutoDetect("/test-documents/rsstest.rss", "feed", RSS, "application/rss+xml", "Sample RSS File for Junit test"); +assertAutoDetect("/test-documents/rsstest_091.rss", "feed", RSS, "application/rss+xml", "Sample RSS File for Junit test"); } @Test diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/feed/FeedParserTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/feed/FeedParserTest.java index cc10dd2..d7e7c76 100644 --- a/tika-parsers/src/test/java/org/apache/tika/parser/feed/FeedParserTest.java +++ b/tika-parsers/src/test/java/org/apache/tika/parser/feed/FeedParserTest.java @@ -32,7 +32,7 @@ public class FeedParserTest { @Test public void testRSSParser() throws Exception { try (InputStream input = FeedParserTest.class.getResourceAsStream( -"/test-documents/rsstest.rss")) { +"/test-documents/rsstest_091.rss")) { Metadata metadata = new Metadata(); ContentHandler handler = new BodyContentHandler(); ParseContext context = new ParseContext(); diff --git a/tika-parsers/src/test/resources/test-documents/rsstest.rss b/tika-parsers/src/test/resources/test-documents/rsstest_091.rss similarity index 100% rename from tika-parsers/src/test/resources/test-documents/rsstest.rss rename to tika-parsers/src/test/resources/test-documents/rsstest_091.rss
[tika] branch master updated (5310f17 -> a0546b6)
This is an automated email from the ASF dual-hosted git repository.

nick pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git.

    from 5310f17  TIKA-2757 -- add versions plugin
     new 429b22b  RSS test file is RSS v0.91, so name appropriately
     new 1fca098  Add a test RSS 2.0 file
     new a0546b6  Use the new RSS 2.0 file in tests too, alongside the current 0.91 one

The 3 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails. The revisions
listed as "add" were already present in the repository and have only
been added to this reference.

Summary of changes:
 .../java/org/apache/tika/mime/TestMimeTypes.java | 9 +++--
 .../apache/tika/parser/AutoDetectParserTest.java | 2 +-
 .../apache/tika/parser/feed/FeedParserTest.java | 38 +-
 .../{rsstest.rss => rsstest_091.rss} | 0
 .../test-documents/{rsstest.rss => rsstest_20.rss} | 8 -
 5 files changed, 36 insertions(+), 21 deletions(-)
 copy tika-parsers/src/test/resources/test-documents/{rsstest.rss => rsstest_091.rss} (100%)
 rename tika-parsers/src/test/resources/test-documents/{rsstest.rss => rsstest_20.rss} (74%)
[tika] branch master updated (3d5d4d8 -> 705b79c)
This is an automated email from the ASF dual-hosted git repository.

nick pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git.

    from 3d5d4d8  Merge pull request #239 from wowselim/master
     new b26a0cc  Merge branch 'master' of https://github.com/wowselim/tika
     new 53c8434  Merge branch 'master' of https://github.com/apache/tika
     new 9a2c7d8  Mime magic for "MIME Encapsulation of Aggregate HTML Documents" (MHTML), pulled out from rfc822 (may not be fully correct long-term...)
     new 705b79c  Changes update

The 4 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails. The revisions
listed as "add" were already present in the repository and have only
been added to this reference.

Summary of changes:
 CHANGES.txt | 3 +++
 .../org/apache/tika/mime/tika-mimetypes.xml | 22 --
 2 files changed, 23 insertions(+), 2 deletions(-)
[tika] 01/04: Merge branch 'master' of https://github.com/wowselim/tika
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit b26a0ccdbd5620b870df0dc434d2f9265b2df082
Merge: e4f0fe5 eb33286
Author: Nick Burch
AuthorDate: Wed Sep 5 20:46:56 2018 +0100

    Merge branch 'master' of https://github.com/wowselim/tika

 tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml | 3 +++
 1 file changed, 3 insertions(+)
[tika] 04/04: Changes update
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 705b79ccb6c0ad0f92a3a185bf7e66cacf899931
Author: Nick Burch
AuthorDate: Thu Sep 6 09:28:24 2018 +0100

    Changes update
---
 CHANGES.txt | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/CHANGES.txt b/CHANGES.txt
index ce647a9..81782ec 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -5,6 +5,9 @@ Release 2.0.0 - ???
    Other changes
+   * Mime magic improvements for Olympus RAW (TIKA-2658), interpreted
+     server-side languages via HTTP (TIKA-2648), MHTML (TIKA-2723)
+
 Release 1.19 ???
    * Add absolute timeout to ForkParser rather than testing
[tika] 03/04: Mime magic for "MIME Encapsulation of Aggregate HTML Documents" (MHTML), pulled out from rfc822 (may not be fully correct long-term...)
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/tika.git commit 9a2c7d89e03ca7c0e821b69c394165297edfb9d4 Author: Nick Burch AuthorDate: Thu Sep 6 09:28:14 2018 +0100 Mime magic for "MIME Encapsulation of Aggregate HTML Documents" (MHTML), pulled out from rfc822 (may not be fully correct long-term...) --- .../org/apache/tika/mime/tika-mimetypes.xml| 22 -- 1 file changed, 20 insertions(+), 2 deletions(-) diff --git a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml index 007ec53..bd1adfa 100644 --- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml +++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml @@ -5980,9 +5980,28 @@ + + + + + + +MHTML +<_comment>MIME Encapsulation of Aggregate HTML Documents +http://tools.ietf.org/html/rfc2557 + + + + + + + + + + - + @@ -6084,7 +6103,6 @@ -
[tika] 02/04: Merge branch 'master' of https://github.com/apache/tika
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 53c8434f497795885ff129e17440881f059c1624
Merge: b26a0cc 3d5d4d8
Author: Nick Burch
AuthorDate: Wed Sep 5 20:58:20 2018 +0100

    Merge branch 'master' of https://github.com/apache/tika
[tika] branch master updated (e4f0fe5 -> 3d5d4d8)
This is an automated email from the ASF dual-hosted git repository.

nick pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git.

    from e4f0fe5  Use DateUtils to format dates to strings, rather than relying on explicit/implicit toString calls
     add eb33286  TIKA-2658: add olympus raw file magic numbers
     new 3d5d4d8  Merge pull request #239 from wowselim/master

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails. The revisions
listed as "add" were already present in the repository and have only
been added to this reference.

Summary of changes:
 tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml | 3 +++
 1 file changed, 3 insertions(+)
[tika] 01/01: Merge pull request #239 from wowselim/master
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 3d5d4d8b9667a31e3cb30a9d02543347feefbcc7
Merge: e4f0fe5 eb33286
Author: Gagravarr
AuthorDate: Wed Sep 5 20:58:07 2018 +0100

    Merge pull request #239 from wowselim/master

    TIKA-2658: add olympus raw file magic numbers

 tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml | 3 +++
 1 file changed, 3 insertions(+)
[tika] branch master updated: Use DateUtils to format dates to strings, rather than relying on explicit/implicit toString calls
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/tika.git The following commit(s) were added to refs/heads/master by this push: new e4f0fe5 Use DateUtils to format dates to strings, rather than relying on explicit/implicit toString calls e4f0fe5 is described below commit e4f0fe5184db47724c6bf366a12ea0868972a83f Author: Nick Burch AuthorDate: Wed Sep 5 18:14:28 2018 +0100 Use DateUtils to format dates to strings, rather than relying on explicit/implicit toString calls --- .../geoinfo/GeographicInformationParser.java | 31 ++ 1 file changed, 20 insertions(+), 11 deletions(-) diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/geoinfo/GeographicInformationParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/geoinfo/GeographicInformationParser.java index 27b8040..268dd93 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/geoinfo/GeographicInformationParser.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/geoinfo/GeographicInformationParser.java @@ -48,6 +48,7 @@ import org.apache.tika.mime.MediaType; import org.apache.tika.parser.AbstractParser; import org.apache.tika.parser.ParseContext; import org.apache.tika.sax.XHTMLContentHandler; +import org.apache.tika.utils.DateUtils; import org.opengis.metadata.Identifier; import org.opengis.metadata.citation.Citation; import org.opengis.metadata.citation.CitationDate; @@ -227,9 +228,11 @@ public class GeographicInformationParser extends AbstractParser{ metadata.add("IdentificationInfoCitationTitle ",i.getCitation().getTitle().toString()); ArrayList dateArrayList= (ArrayList) i.getCitation().getDates(); -for (CitationDate d:dateArrayList){ -if(d.getDateType()!=null) -metadata.add("CitationDate ",d.getDateType().name()+"-->"+d.getDate()); +for (CitationDate d:dateArrayList) { +if (d.getDateType()!=null) { +String date = DateUtils.formatDate(d.getDate()); 
+metadata.add("CitationDate ",d.getDateType().name()+"-->"+date); +} } ArrayList responsiblePartyArrayList= (ArrayList) i.getCitation().getCitedResponsibleParties(); for(ResponsibleParty r:responsiblePartyArrayList){ @@ -282,9 +285,11 @@ public class GeographicInformationParser extends AbstractParser{ metadata.add("ThesaurusNameAlternativeTitle "+j,k.getThesaurusName().getAlternateTitles().toString()); ArrayListcitationDates= (ArrayList) k.getThesaurusName().getDates(); -for(CitationDate cd:citationDates) { - if(cd.getDateType()!=null) -metadata.add("ThesaurusNameDate ",cd.getDateType().name() +"-->" + cd.getDate()); +for (CitationDate cd:citationDates) { + if (cd.getDateType()!=null) { + String date = DateUtils.formatDate(cd.getDate()); + metadata.add("ThesaurusNameDate ",cd.getDateType().name() +"-->" + date); + } } } ArrayList constraintList= (ArrayList) i.getResourceConstraints(); @@ -315,9 +320,11 @@ public class GeographicInformationParser extends AbstractParser{ for(InternationalString s:((DefaultGeographicDescription) g).getGeographicIdentifier().getAuthority().getAlternateTitles()) { metadata.add("GeographicIdentifierAuthorityAlternativeTitle ",s.toString()); } -for(CitationDate cd:((DefaultGeographicDescription) g).getGeographicIdentifier().getAuthority().getDates()){ -if(cd.getDateType()!=null && cd.getDate()!=null) - metadata.add("GeographicIdentifierAuthorityDate ",cd.getDateType().name()+" "+cd.getDate().toString()); +for (CitationDate cd:((DefaultGeographicDescription) g).getGeographicIdentifier().getAuthority().getDates()){ +if (cd.getDateType()!=null && cd.getDate()!=null) { +String date = DateUtils.formatDate(cd.getDate()); + metadata.add("GeographicIdentifierAuthorityDate ",cd.getDateType().name()+" "+date); +} } } } @@ -363,8 +370,10 @@ public class GeographicInformationParser extends AbstractParser{ private void getMetaDataDateInfo(Metadata metadata, DefaultMetadata defau
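The commit above replaces implicit `Date.toString()` calls with `DateUtils.formatDate(...)`. As a rough illustration of why that matters for extracted metadata, the sketch below contrasts the two approaches using only the JDK. It does not call Tika: the `DateFormatSketch` class, its `formatUtc` helper, and the ISO 8601 UTC pattern are assumptions chosen for the example (the pattern mirrors the timestamp style seen in Tika test assertions), not a copy of what `DateUtils` does internally.

```java
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Date;

// Hypothetical helper for illustration only; not part of Tika.
final class DateFormatSketch {
    // Assumed target format: ISO 8601 in UTC, e.g. 1970-01-01T00:00:00Z
    private static final DateTimeFormatter ISO_UTC =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss'Z'").withZone(ZoneOffset.UTC);

    // Formats a java.util.Date independently of the JVM's default time zone.
    static String formatUtc(Date date) {
        return ISO_UTC.format(date.toInstant());
    }

    public static void main(String[] args) {
        Date epoch = new Date(0L);
        // Date.toString() renders in the JVM's default time zone,
        // so the text differs from machine to machine.
        System.out.println("toString():  " + epoch);
        // The explicit formatter gives the same text everywhere.
        System.out.println("formatUtc(): " + formatUtc(epoch));
    }
}
```

With an explicit formatter, metadata values such as `CitationDate` stay stable across hosts, which is what makes them practical to assert on in tests.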
[tika] 01/07: TIKA-2479 Option to request missing rows where possible in Excel-like formats
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/tika.git commit a1e42a0659ba33e90cb1bba0a0a10eeb97d4fac7 Author: Nick Burch <n...@gagravarr.org> AuthorDate: Thu May 17 22:15:34 2018 +0100 TIKA-2479 Option to request missing rows where possible in Excel-like formats --- .../apache/tika/parser/microsoft/OfficeParserConfig.java | 16 +++- 1 file changed, 15 insertions(+), 1 deletion(-) diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OfficeParserConfig.java b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OfficeParserConfig.java index 34b865e..5d34b2e 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OfficeParserConfig.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OfficeParserConfig.java @@ -29,6 +29,7 @@ public class OfficeParserConfig implements Serializable { private boolean includeMoveFromContent = false; private boolean includeShapeBasedContent = true; private boolean includeHeadersAndFooters = true; +private boolean includeMissingRows = false; private boolean concatenatePhoneticRuns = true; private boolean useSAXDocxExtractor = false; @@ -188,10 +189,23 @@ public class OfficeParserConfig implements Serializable { this.extractAllAlternativesFromMSG = extractAllAlternativesFromMSG; } - public boolean getExtractAllAlternativesFromMSG() { return extractAllAlternativesFromMSG; } + +/** + * For table-like formats, and tables within other formats, should + * missing rows in sparse tables be output where detected? + * The default is to only output rows defined within the file, which + * avoid lots of blank lines, but means layout isn't preserved. 
+ */ +public void setIncludeMissingRows(boolean includeMissingRows) { +this.includeMissingRows = includeMissingRows; +} + +public boolean getIncludeMissingRows() { +return includeMissingRows; +} } -- To stop receiving notification emails like this one, please contact n...@apache.org.
[tika] 07/07: Add the other jackcess jar to the bundle
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 12693ea18f1a05894272aa3a9293d41215f63c06
Author: Nick Burch <n...@gagravarr.org>
AuthorDate: Fri May 18 15:35:06 2018 +0100

    Add the other jackcess jar to the bundle
---
 tika-bundle/pom.xml | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tika-bundle/pom.xml b/tika-bundle/pom.xml
index 2b500d7..fa13e21 100644
--- a/tika-bundle/pom.xml
+++ b/tika-bundle/pom.xml
@@ -170,6 +170,7 @@
 curvesapi|
 xmlbeans|
 jackcess|
+ jackcess-encrypt|
 commons-lang|
 tagsoup|
 asm|
[tika] 03/07: Updated Columnar output from SAS with better formats
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit b01b059331f198d3829b111002cf03cbcaf1bab3
Author: Nick Burch <n...@gagravarr.org>
AuthorDate: Fri May 18 11:43:47 2018 +0100

    Updated Columnar output from SAS with better formats
---
 .../apache/tika/parser/sas/SAS7BDATParserTest.java | 8
 .../test-documents/test-columnar.sas7bdat | Bin 17408 -> 131072 bytes
 .../resources/test-documents/test-columnar.xls | Bin 6656 -> 66048 bytes
 .../resources/test-documents/test-columnar.xlsx | Bin 4941 -> 6603 bytes
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/sas/SAS7BDATParserTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/sas/SAS7BDATParserTest.java
index 610ffc3..00a2aaa 100644
--- a/tika-parsers/src/test/java/org/apache/tika/parser/sas/SAS7BDATParserTest.java
+++ b/tika-parsers/src/test/java/org/apache/tika/parser/sas/SAS7BDATParserTest.java
@@ -89,15 +89,15 @@ public class SAS7BDATParserTest extends TikaTest {
         assertEquals("application/x-sas-data", metadata.get(Metadata.CONTENT_TYPE));
         assertEquals("TESTING", metadata.get(TikaCoreProperties.TITLE));
-        assertEquals("2018-05-09T17:59:33Z", metadata.get(TikaCoreProperties.CREATED));
-        assertEquals("2018-05-09T17:59:33Z", metadata.get(TikaCoreProperties.MODIFIED));
+        assertEquals("2018-05-18T11:38:30Z", metadata.get(TikaCoreProperties.CREATED));
+        assertEquals("2018-05-18T11:38:30Z", metadata.get(TikaCoreProperties.MODIFIED));
         assertEquals("1", metadata.get(PagedText.N_PAGES));
         assertEquals("8", metadata.get(Database.COLUMN_COUNT));
         assertEquals("11", metadata.get(Database.ROW_COUNT));
         assertEquals("windows-1252", metadata.get(HttpHeaders.CONTENT_ENCODING));
-        assertEquals("W32_7PRO", metadata.get(OfficeOpenXMLExtended.APPLICATION));
-        assertEquals("9.0301M2", metadata.get(OfficeOpenXMLExtended.APP_VERSION));
+        assertEquals("X64_7PRO", metadata.get(OfficeOpenXMLExtended.APPLICATION));
+        assertEquals("9.0401M5", metadata.get(OfficeOpenXMLExtended.APP_VERSION));
         assertEquals("32", metadata.get(MachineMetadata.ARCHITECTURE_BITS));
         assertEquals("Little", metadata.get(MachineMetadata.ENDIAN));
         assertEquals(Arrays.asList("Record Number","Square of the Record Number",
diff --git a/tika-parsers/src/test/resources/test-documents/test-columnar.sas7bdat b/tika-parsers/src/test/resources/test-documents/test-columnar.sas7bdat
index 33ee412..f6cab63 100644
Binary files a/tika-parsers/src/test/resources/test-documents/test-columnar.sas7bdat and b/tika-parsers/src/test/resources/test-documents/test-columnar.sas7bdat differ
diff --git a/tika-parsers/src/test/resources/test-documents/test-columnar.xls b/tika-parsers/src/test/resources/test-documents/test-columnar.xls
index 1d7b2cf..cc45372 100644
Binary files a/tika-parsers/src/test/resources/test-documents/test-columnar.xls and b/tika-parsers/src/test/resources/test-documents/test-columnar.xls differ
diff --git a/tika-parsers/src/test/resources/test-documents/test-columnar.xlsx b/tika-parsers/src/test/resources/test-documents/test-columnar.xlsx
index 58ffd47..22483f1 100644
Binary files a/tika-parsers/src/test/resources/test-documents/test-columnar.xlsx and b/tika-parsers/src/test/resources/test-documents/test-columnar.xlsx differ
[tika] 05/07: TIKA-2479 Update XLS missing cell/row handling to match XLSX and XLSB, add unit test for missing rows, and enable the Columnar tests for the Excel formats
This is an automated email from the ASF dual-hosted git repository.

nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 348b87e7f41b79ff115e17d9c91d2dad63a57c15
Author: Nick Burch <n...@gagravarr.org>
AuthorDate: Fri May 18 15:15:32 2018 +0100

    TIKA-2479 Update XLS missing cell/row handling to match XLSX and XLSB, add unit test for missing rows, and enable the Columnar tests for the Excel formats
---
 .../tika/parser/microsoft/ExcelExtractor.java | 26 ++--
 .../org/apache/tika/parser/TabularFormatsTest.java | 47 ++
 .../tika/parser/microsoft/ExcelParserTest.java | 25 +++-
 3 files changed, 60 insertions(+), 38 deletions(-)

diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ExcelExtractor.java b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ExcelExtractor.java
index 0dc33ee..ff5971a 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ExcelExtractor.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ExcelExtractor.java
@@ -16,7 +16,7 @@
  */
 package org.apache.tika.parser.microsoft;
 
-import java.awt.*;
+import java.awt.Point;
 import java.io.IOException;
 import java.text.NumberFormat;
 import java.util.ArrayList;
@@ -42,7 +42,6 @@ import org.apache.poi.hssf.record.CountryRecord;
 import org.apache.poi.hssf.record.DateWindow1904Record;
 import org.apache.poi.hssf.record.DrawingGroupRecord;
 import org.apache.poi.hssf.record.EOFRecord;
-import org.apache.poi.hssf.record.ExtSSTRecord;
 import org.apache.poi.hssf.record.ExtendedFormatRecord;
 import org.apache.poi.hssf.record.FooterRecord;
 import org.apache.poi.hssf.record.FormatRecord;
@@ -281,7 +280,6 @@ public class ExcelExtractor extends AbstractPOIFSExtractor {
     public void processFile(DirectoryNode root, boolean listenForAllRecords)
             throws IOException, SAXException, TikaException {
-
         // Set up listener and register the records we want to process
         HSSFRequest hssfRequest = new HSSFRequest();
         if (listenForAllRecords) {
@@ -494,15 +492,14 @@ public class ExcelExtractor extends AbstractPOIFSExtractor {
                     HeaderRecord headerRecord = (HeaderRecord) record;
                     addTextCell(record, headerRecord.getText());
                 }
-                break;
+                break;
             case FooterRecord.sid:
                 if (extractor.officeParserConfig.getIncludeHeadersAndFooters()) {
                     FooterRecord footerRecord = (FooterRecord) record;
                     addTextCell(record, footerRecord.getText());
                 }
-                break;
-
+                break;
         }
 
         previousSid = record.getSid();
@@ -599,12 +596,17 @@ public class ExcelExtractor extends AbstractPOIFSExtractor {
         handler.startElement("tr");
         handler.startElement("td");
         for (Map.Entry<Point, Cell> entry : currentSheet.entrySet()) {
-            while (currentRow < entry.getKey().y) {
-                handler.endElement("td");
-                handler.endElement("tr");
-                handler.startElement("tr");
-                handler.startElement("td");
-                currentRow++;
+            if (currentRow != entry.getKey().y) {
+                // We've moved onto a new row, possibly skipping some
+                do {
+                    handler.endElement("td");
+                    handler.endElement("tr");
+                    handler.startElement("tr");
+                    handler.startElement("td");
+                    currentRow++;
+                } while (officeParserConfig.getIncludeMissingRows() &&
+                         currentRow < entry.getKey().y);
+                currentRow = entry.getKey().y;
                 currentColumn = 0;
             }
 
diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java
index 41139e2..4a52118 100644
--- a/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java
+++ b/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java
@@ -64,8 +64,8 @@ public class TabularFormatsTest extends TikaTest {
             "87.5%","88.9%","90.0%"
         },
         new Pattern[] {
-            Pattern.compile("01-(01|JAN|Jan)-(60|1960)"),
-            Pattern.compile("02-01-1960"),
+            Pattern.compile("0?1-01-1960"),
+            Pattern.compile("
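For anyone following the ExcelExtractor hunk above, the new row-advance logic can be sketched on its own. This is only an illustration and not Tika code: the class name MissingRowSketch, the method rowsOpened and the starting row of 0 are assumptions, and the SAX startElement/endElement calls are replaced by recording which row each newly opened <tr> stands for.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the row loop in the diff; not part of Tika.
class MissingRowSketch {
    /**
     * Returns the row number represented by each opened table row, given
     * the (sorted) row indexes that actually contain cells.
     */
    static List<Integer> rowsOpened(int[] filledRows, boolean includeMissingRows) {
        List<Integer> opened = new ArrayList<>();
        int currentRow = 0;          // assumption: numbering starts at row 0
        opened.add(currentRow);      // the first row is opened before the loop
        for (int y : filledRows) {
            if (currentRow != y) {
                // Moved onto a new row, possibly skipping some
                do {
                    currentRow++;
                    // Without missing rows, the single new row stands for y
                    opened.add(includeMissingRows ? currentRow : y);
                } while (includeMissingRows && currentRow < y);
                currentRow = y;
            }
        }
        return opened;
    }

    public static void main(String[] args) {
        System.out.println(rowsOpened(new int[] {0, 3}, true));
        System.out.println(rowsOpened(new int[] {0, 3}, false));
    }
}
```

With cells only in rows 0 and 3, the sketch opens rows 0, 1, 2 and 3 when missing rows are requested, and only rows 0 and 3 otherwise. The do/while form guarantees one row change even when the option is off, which the old while loop expressed by always stepping through every intermediate row.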
[tika] 06/07: Changelog update
nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 9673fbdbba8feebb72fee569074e94b0868a89df
Author: Nick Burch <n...@gagravarr.org>
AuthorDate: Fri May 18 15:17:56 2018 +0100

    Changelog update
---
 CHANGES.txt | 5 +
 1 file changed, 5 insertions(+)

diff --git a/CHANGES.txt b/CHANGES.txt
index 38f1973..0ffc5de 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -81,6 +81,11 @@ Release 2.0.0 - ???
 
    * Mime magic for ACES Images (TIKA-2628) and DPX Images (TIKA-2629)
 
+   * For sparse XLSX and XLSB files, always output missing cells to
+     the left of filled ones (matching XLS), and optionally output
+     missing rows on all 3 formats if requested via the
+     OfficeParserContext (TIKA-2479)
+
 Release 1.17 - 12/8/2017
 
    ***NOTE: THIS IS THE LAST VERSION OF TIKA THAT WILL RUN

-- 
To stop receiving notification emails like this one, please contact
n...@apache.org.
[tika] 04/07: Formatted columns in the columnar test Excel files
nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 6fa1105e0669ffeec5c3cf0d1db247a8c16f3bc5
Author: Nick Burch <n...@gagravarr.org>
AuthorDate: Fri May 18 15:13:43 2018 +0100

    Formatted columns in the columnar test Excel files
---
 .../test/resources/test-documents/test-columnar.xls | Bin 66048 -> 32768 bytes
 .../resources/test-documents/test-columnar.xlsb | Bin 0 -> 9691 bytes
 .../resources/test-documents/test-columnar.xlsx | Bin 6603 -> 10556 bytes
 .../src/test/resources/test-documents/testSAS2.sas | 3 +++
 4 files changed, 3 insertions(+)

diff --git a/tika-parsers/src/test/resources/test-documents/test-columnar.xls b/tika-parsers/src/test/resources/test-documents/test-columnar.xls
index cc45372..3f1009c 100644
Binary files a/tika-parsers/src/test/resources/test-documents/test-columnar.xls and b/tika-parsers/src/test/resources/test-documents/test-columnar.xls differ
diff --git a/tika-parsers/src/test/resources/test-documents/test-columnar.xlsb b/tika-parsers/src/test/resources/test-documents/test-columnar.xlsb
new file mode 100644
index 000..0ce5139
Binary files /dev/null and b/tika-parsers/src/test/resources/test-documents/test-columnar.xlsb differ
diff --git a/tika-parsers/src/test/resources/test-documents/test-columnar.xlsx b/tika-parsers/src/test/resources/test-documents/test-columnar.xlsx
index 22483f1..f1f4dc4 100644
Binary files a/tika-parsers/src/test/resources/test-documents/test-columnar.xlsx and b/tika-parsers/src/test/resources/test-documents/test-columnar.xlsx differ
diff --git a/tika-parsers/src/test/resources/test-documents/testSAS2.sas b/tika-parsers/src/test/resources/test-documents/testSAS2.sas
index 96a9121..df52b1a 100644
--- a/tika-parsers/src/test/resources/test-documents/testSAS2.sas
+++ b/tika-parsers/src/test/resources/test-documents/testSAS2.sas
@@ -57,6 +57,9 @@ proc export data=testing label
  putnames=yes;
 run;
 
+/* Due to SAS Limitations, you will need to manually */
+/* style the % and Date/Datetime columns in Excel */
+/* You will also need to save-as XLSB to generate that */
 proc export data=testing label
  outfile="/testing.xls"
  dbms=XLS;
[tika] 02/07: TIKA-2479 Output missing left/mid cells in XLSX and XLSB, and optionally also missing rows
nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit b1b035e6bbcff0db24e133b682ac79916f92f599
Author: Nick Burch <n...@gagravarr.org>
AuthorDate: Thu May 17 23:07:04 2018 +0100

    TIKA-2479 Output missing left/mid cells in XLSX and XLSB, and optionally also missing rows
---
 .../ooxml/XSSFBExcelExtractorDecorator.java | 2 +-
 .../ooxml/XSSFExcelExtractorDecorator.java | 35 ++
 .../org/apache/tika/parser/TabularFormatsTest.java | 11 ++-
 3 files changed, 41 insertions(+), 7 deletions(-)

diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFBExcelExtractorDecorator.java b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFBExcelExtractorDecorator.java
index dcde62b..33dbb7e 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFBExcelExtractorDecorator.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFBExcelExtractorDecorator.java
@@ -117,7 +117,7 @@ public class XSSFBExcelExtractorDecorator extends XSSFExcelExtractorDecorator {
             addDrawingHyperLinks(sheetPart);
             sheetParts.add(sheetPart);
 
-            SheetTextAsHTML sheetExtractor = new SheetTextAsHTML(config.getIncludeHeadersAndFooters(), xhtml);
+            SheetTextAsHTML sheetExtractor = new SheetTextAsHTML(config, xhtml);
             XSSFBCommentsTable comments = iter.getXSSFBSheetComments();
 
             // Start, and output the sheet name
diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java
index 9a2b017..7e1a7cd 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java
@@ -25,7 +25,6 @@ import java.util.List;
 import java.util.Locale;
 import java.util.Map;
 
-import org.apache.poi.POIXMLDocument;
 import org.apache.poi.POIXMLTextExtractor;
 import org.apache.poi.hssf.extractor.ExcelExtractor;
 import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
@@ -39,6 +38,7 @@ import org.apache.poi.openxml4j.opc.PackagingURIHelper;
 import org.apache.poi.openxml4j.opc.TargetMode;
 import org.apache.poi.ss.usermodel.DataFormatter;
 import org.apache.poi.ss.usermodel.HeaderFooter;
+import org.apache.poi.ss.util.CellReference;
 import org.apache.poi.xssf.eventusermodel.ReadOnlySharedStringsTable;
 import org.apache.poi.xssf.eventusermodel.XSSFReader;
 import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler;
@@ -56,6 +56,7 @@ import org.apache.tika.exception.TikaException;
 import org.apache.tika.metadata.Metadata;
 import org.apache.tika.metadata.TikaCoreProperties;
 import org.apache.tika.parser.ParseContext;
+import org.apache.tika.parser.microsoft.OfficeParserConfig;
 import org.apache.tika.parser.microsoft.TikaExcelDataFormatter;
 import org.apache.tika.sax.OfflineContentHandler;
 import org.apache.tika.sax.XHTMLContentHandler;
@@ -144,8 +145,7 @@ public class XSSFExcelExtractorDecorator extends AbstractOOXMLExtractor {
         }
 
         while (iter.hasNext()) {
-
-            SheetTextAsHTML sheetExtractor = new SheetTextAsHTML(config.getIncludeHeadersAndFooters(), xhtml);
+            SheetTextAsHTML sheetExtractor = new SheetTextAsHTML(config, xhtml);
             PackagePart sheetPart = null;
             try (InputStream stream = iter.next()) {
                 sheetPart = iter.getSheetPart();
@@ -397,11 +397,15 @@ public class XSSFExcelExtractorDecorator extends AbstractOOXMLExtractor {
     protected static class SheetTextAsHTML implements SheetContentsHandler {
         private XHTMLContentHandler xhtml;
         private final boolean includeHeadersFooters;
+        private final boolean includeMissingRows;
         protected List headers;
         protected List footers;
+        private int lastSeenRow = -1;
+        private int lastSeenCol = -1;
 
-        protected SheetTextAsHTML(boolean includeHeaderFooters, XHTMLContentHandler xhtml) {
-            this.includeHeadersFooters = includeHeaderFooters;
+        protected SheetTextAsHTML(OfficeParserConfig config, XHTMLContentHandler xhtml) {
+            this.includeHeadersFooters = config.getIncludeHeadersAndFooters();
+            this.includeMissingRows = config.getIncludeMissingRows();
             this.xhtml = xhtml;
             headers = new ArrayList();
             footers = new ArrayList();
@@ -409,7 +413,19 @@ public class XSSFExcelExtractorDecorator extends AbstractOOXMLExtractor {
 
         public void startRow(int rowNum) {
             try {
+                // Missing rows, if desired, with a single emp
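The diff above is cut off before the cell handling, so the following is only a guess at the general idea named in the commit subject (padding out cells missing to the left of, or between, filled ones), not the committed code. The real change imports POI's CellReference; this sketch deliberately avoids POI so that it is self-contained, and the names MissingCellSketch, columnIndex and emptyCellsBefore are hypothetical.

```java
// Hypothetical illustration of padding missing cells; not the Tika code.
class MissingCellSketch {
    /** Converts the column letters of an A1-style reference to a 0-based index. */
    static int columnIndex(String cellRef) {
        int col = 0;
        for (int i = 0; i < cellRef.length(); i++) {
            char c = cellRef.charAt(i);
            if (c < 'A' || c > 'Z') {
                break; // reached the row digits
            }
            col = col * 26 + (c - 'A' + 1);
        }
        return col - 1;
    }

    /**
     * Number of empty cells to emit before the given cell, where lastSeenCol
     * is the 0-based column of the previous cell in the row (-1 if none yet).
     */
    static int emptyCellsBefore(int lastSeenCol, String cellRef) {
        return Math.max(0, columnIndex(cellRef) - lastSeenCol - 1);
    }

    public static void main(String[] args) {
        System.out.println(emptyCellsBefore(-1, "C5"));
    }
}
```

Starting lastSeenCol at -1, as the new field in the diff does, means a row whose first filled cell is in column C would get two empty cells in front of it under this scheme.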
[tika] branch master updated (5f05b51 -> 12693ea)
nick pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git.

    from 5f05b51  TIKA-2644 - refactor recursiveparserwrapper api
     new a1e42a0  TIKA-2479 Option to request missing rows where possible in Excel-like formats
     new b1b035e  TIKA-2479 Output missing left/mid cells in XLSX and XLSB, and optionally also missing rows
     new b01b059  Updated Columnar output from SAS with better formats
     new 6fa1105  Formatted columns in the columnar test Excel files
     new 348b87e  TIKA-2479 Update XLS missing cell/row handling to match XLSX and XLSB, add unit test for missing rows, and enable the Columnar tests for the Excel formats
     new 9673fbd  Changelog update
     new 12693ea  Add the other jackcess jar to the bundle

The 7 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.

Summary of changes:
 CHANGES.txt | 5 +++
 tika-bundle/pom.xml | 1 +
 .../tika/parser/microsoft/ExcelExtractor.java | 26 +++---
 .../tika/parser/microsoft/OfficeParserConfig.java | 16 -
 .../ooxml/XSSFBExcelExtractorDecorator.java | 2 +-
 .../ooxml/XSSFExcelExtractorDecorator.java | 35 +++---
 .../org/apache/tika/parser/TabularFormatsTest.java | 40 -
 .../tika/parser/microsoft/ExcelParserTest.java | 25 -
 .../apache/tika/parser/sas/SAS7BDATParserTest.java | 8 ++---
 .../test-documents/test-columnar.sas7bdat | Bin 17408 -> 131072 bytes
 .../resources/test-documents/test-columnar.xls | Bin 6656 -> 32768 bytes
 .../resources/test-documents/test-columnar.xlsb | Bin 0 -> 9691 bytes
 .../resources/test-documents/test-columnar.xlsx | Bin 4941 -> 10556 bytes
 .../src/test/resources/test-documents/testSAS2.sas | 3 ++
 14 files changed, 120 insertions(+), 41 deletions(-)
 create mode 100644 tika-parsers/src/test/resources/test-documents/test-columnar.xlsb
[tika] branch master updated: Mime magic for DPX and ACES, thanks to Andreas Meier (TIKA-2628 and TIKA-2629)
nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

The following commit(s) were added to refs/heads/master by this push:
     new ca3207c  Mime magic for DPX and ACES, thanks to Andreas Meier (TIKA-2628 and TIKA-2629)
ca3207c is described below

commit ca3207c3b0dd408b32a07b70dcfef42aa4d0a9bd
Author: Nick Burch <n...@gagravarr.org>
AuthorDate: Thu May 10 22:18:36 2018 +0100

    Mime magic for DPX and ACES, thanks to Andreas Meier (TIKA-2628 and TIKA-2629)
---
 CHANGES.txt | 2 ++
 .../resources/org/apache/tika/mime/tika-mimetypes.xml | 19 +++
 2 files changed, 21 insertions(+)

diff --git a/CHANGES.txt b/CHANGES.txt
index c66e883..b24df29 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -76,6 +76,8 @@ Release 2.0.0 - ???
    * Handle .epub files using .htm rather than .html extensions for the
      embedded contents (TIKA-1288)
 
+   * Mime magic for ACES Images (TIKA-2628) and DPX Images (TIKA-2629)
+
 Release 1.17 - 12/8/2017
 
    ***NOTE: THIS IS THE LAST VERSION OF TIKA THAT WILL RUN
diff --git a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
index 7c0cd91..104cd2c 100644
--- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
+++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
@@ -5074,6 +5074,15 @@
+
+<_comment>ACES Image Container File
+
+
+
+
+
+
+
@@ -5123,6 +5132,16 @@
+
+DPX
+<_comment>Digital Picture Exchange from SMPTE
+
+
+
+
+
+
+
[tika] 04/04: Add disabled, currently failing ODS test
nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 49833d88cb323928c3de7bd7a86ab38444530418
Author: Nick Burch <n...@gagravarr.org>
AuthorDate: Thu May 10 17:13:24 2018 +0100

    Add disabled, currently failing ODS test
---
 .../java/org/apache/tika/parser/TabularFormatsTest.java | 14 +++---
 .../src/test/resources/test-documents/test-columnar.ods | Bin 0 -> 12854 bytes
 2 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java
index 119c9cd..ea326bd 100644
--- a/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java
+++ b/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java
@@ -226,7 +226,7 @@ public class TabularFormatsTest extends TikaTest {
         XMLResult result = getXML("test-columnar.xls");
         String xml = result.xml;
         assertHeaders(xml, false, true, false);
-        // TODO Correctly handle empty cells then test
+        // TODO Correctly handle empty cells then enable this test
         //assertContents(xml, true, false);
     }
     @Test
@@ -234,10 +234,18 @@ public class TabularFormatsTest extends TikaTest {
         XMLResult result = getXML("test-columnar.xlsx");
         String xml = result.xml;
         assertHeaders(xml, false, true, false);
-        // TODO Correctly handle empty cells then test
+        // TODO Correctly handle empty cells then enable this test
         //assertContents(xml, true, false);
     }
 
-    // TODO Test OpenDocument ODS test
+    // TODO Fix the ODS test - currently failing with
+    // org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml not declared
+    //@Test
+    //public void testODS() throws Exception {
+    //    XMLResult result = getXML("test-columnar.ods");
+    //    String xml = result.xml;
+    //    assertHeaders(xml, false, true, false);
+    //    assertContents(xml, true, true);
+    //}
 
     // TODO Test other formats, eg Database formats
diff --git a/tika-parsers/src/test/resources/test-documents/test-columnar.ods b/tika-parsers/src/test/resources/test-documents/test-columnar.ods
new file mode 100644
index 000..067ca18
Binary files /dev/null and b/tika-parsers/src/test/resources/test-documents/test-columnar.ods differ
[tika] branch master updated (cfd6256 -> 49833d8)
nick pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git.

    from cfd6256  Remaining values to check
     new 6cff602  Ensure that empty cells are still output
     new d0fb697  Not all formats know about %s, dates not completely consistent either...
     new 72994c8  Use patterns to handle the date format variations
     new 49833d8  Add disabled, currently failing ODS test

The 4 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.

Summary of changes:
 .../org/apache/tika/parser/sas/SAS7BDATParser.java | 6 +-
 .../org/apache/tika/parser/TabularFormatsTest.java | 126 ++---
 .../resources/test-documents/test-columnar.ods | Bin 0 -> 12854 bytes
 3 files changed, 88 insertions(+), 44 deletions(-)
 create mode 100644 tika-parsers/src/test/resources/test-documents/test-columnar.ods
[tika] 03/04: Use patterns to handle the date format variations
nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 72994c8ac8f0c749f26f4f19b7992b8224fc2a12
Author: Nick Burch <n...@gagravarr.org>
AuthorDate: Thu May 10 16:59:09 2018 +0100

    Use patterns to handle the date format variations
---
 .../org/apache/tika/parser/TabularFormatsTest.java | 101 -
 1 file changed, 56 insertions(+), 45 deletions(-)

diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java
index 80a7f56..119c9cd 100644
--- a/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java
+++ b/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java
@@ -18,10 +18,11 @@ package org.apache.tika.parser;
 
 import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
 
 import java.util.Arrays;
 import java.util.List;
-import java.util.Locale;
+import java.util.regex.Pattern;
 
 import org.apache.tika.TikaTest;
 import org.junit.Test;
@@ -45,14 +46,14 @@ public class TabularFormatsTest extends TikaTest {
     /**
      * Expected values, by column
      */
-    protected static final String[][] table = new String[][] {
+    protected static final Object[][] table = new Object[][] {
         new String[] {
             "0","1","2","3","4","5","6","7","8","9","10"
         },
         new String[] {
             "0","1","4","9","16","25","36","49","64","81","100"
         },
-        new String[] {}, // Done later
+        new String[] {}, // Generated later
         new String[] {
             "0%","10%","20%","30%","40%","50%",
             "60%","70%","80%","90%","100%"
@@ -62,37 +63,44 @@ public class TabularFormatsTest extends TikaTest {
             "75.0%","80.0%","83.3%","85.7%",
             "87.5%","88.9%","90.0%"
         },
-        new String[] {
-            "01-01-1960", "02-01-1960", "17-01-1960",
-            "22-03-1960", "13-09-1960", "17-09-1961",
-            "20-07-1963", "29-07-1966", "20-03-1971",
-            "18-12-1977", "19-05-1987"
+        new Pattern[] {
+            Pattern.compile("01-(01|JAN|Jan)-(60|1960)"),
+            Pattern.compile("02-01-1960"),
+            Pattern.compile("17-01-1960"),
+            Pattern.compile("22-03-1960"),
+            Pattern.compile("13-09-1960"),
+            Pattern.compile("17-09-1961"),
+            Pattern.compile("20-07-1963"),
+            Pattern.compile("29-07-1966"),
+            Pattern.compile("20-03-1971"),
+            Pattern.compile("18-12-1977"),
+            Pattern.compile("19-05-1987"),
         },
-        new String[] {
-            "01JAN60:00:00:01",
-            "01JAN60:00:00:10",
-            "01JAN60:00:01:40",
-            "01JAN60:00:16:40",
-            "01JAN60:02:46:40",
-            "02JAN60:03:46:40",
-            "12JAN60:13:46:40",
-            "25APR60:17:46:40",
-            "03MAR63:09:46:40",
-            "09SEP91:01:46:40",
-            "19NOV76:17:46:40"
+        new Pattern[] {
+            Pattern.compile("01(JAN|Jan)(60|1960):00:00:01(.00)?"),
+            Pattern.compile("01(JAN|Jan)(60|1960):00:00:10(.00)?"),
+            Pattern.compile("01(JAN|Jan)(60|1960):00:01:40(.00)?"),
+            Pattern.compile("01(JAN|Jan)(60|1960):00:16:40(.00)?"),
+            Pattern.compile("01(JAN|Jan)(60|1960):02:46:40(.00)?"),
+            Pattern.compile("02(JAN|Jan)(60|1960):03:46:40(.00)?"),
+            Pattern.compile("12(JAN|Jan)(60|1960):13:46:40(.00)?"),
+            Pattern.compile("25(APR|Apr)(60|1960):17:46:40(.00)?"),
+            Pattern.compile("03(MAR|Mar)(63|1963):09:46:40(.00)?"),
+            Pattern.compile("09(SEP|Sep)(91|1991):01:46:40(.00)?"),
+            Pattern.compile("19(NOV|Nov)(76|2276):17:46:40(.00)?")
         },
-        new String[] {
-            "0:00:01",
-            "0:00:03",
-            "0:00:09",
-
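As a quick illustration of what the Pattern-based expectations in the diff above buy, here is a minimal standalone sketch using the first datetime pattern from the commit. The class name DatePatternSketch and the STRICT variant are additions for illustration only; the commit itself uses the unescaped form, where the "." in "(.00)?" matches any character rather than only a literal dot.

```java
import java.util.regex.Pattern;

// Standalone illustration; only the LOOSE pattern text comes from the commit.
class DatePatternSketch {
    // Pattern as written in the diff: '.' is unescaped, so it matches any char
    static final Pattern LOOSE =
            Pattern.compile("01(JAN|Jan)(60|1960):00:00:01(.00)?");

    // Hypothetical stricter variant with the dot escaped
    static final Pattern STRICT =
            Pattern.compile("01(JAN|Jan)(60|1960):00:00:01(\\.00)?");

    /** True if the whole cell text matches the given pattern. */
    static boolean accepts(Pattern pattern, String cellText) {
        return pattern.matcher(cellText).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {
                "01JAN60:00:00:01", "01Jan1960:00:00:01.00", "01jan60:00:00:01" }) {
            System.out.println(s + " -> " + accepts(LOOSE, s));
        }
    }
}
```

Both the two-digit upper-case form and the four-digit mixed-case form with fractional seconds are accepted, while an all-lower-case month is not. Whether the looser dot matters in practice depends on what the parsers emit; for test expectations it is probably harmless.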
[tika] 02/04: Not all formats know about %s, dates not completely consistent either...
nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit d0fb69715e83a42db2ee5c2750eaa9d3b4f4d86c
Author: Nick Burch <n...@gagravarr.org>
AuthorDate: Thu May 10 16:33:45 2018 +0100

    Not all formats know about %s, dates not completely consistent either...
---
 .../org/apache/tika/parser/TabularFormatsTest.java | 33 ++
 1 file changed, 27 insertions(+), 6 deletions(-)

diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java
index 7330f6a..80a7f56 100644
--- a/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java
+++ b/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java
@@ -20,6 +20,8 @@ package org.apache.tika.parser;
 import static org.junit.Assert.assertEquals;
 
 import java.util.Arrays;
+import java.util.List;
+import java.util.Locale;
 
 import org.apache.tika.TikaTest;
 import org.junit.Test;
@@ -56,7 +58,7 @@ public class TabularFormatsTest extends TikaTest {
             "60%","70%","80%","90%","100%"
         },
         new String[] {
-            "M","0.0%","50.0%","66.7%",
+            "","0.0%","50.0%","66.7%",
             "75.0%","80.0%","83.3%","85.7%",
             "87.5%","88.9%","90.0%"
         },
@@ -100,6 +102,15 @@ public class TabularFormatsTest extends TikaTest {
             table[2][i] = "This is row " + i + " of 10";
         }
     }
+    // Which columns hold percentages? Not all parsers
+    // correctly format these...
+    protected static final List percentageColumns =
+            Arrays.asList(new Integer[] { 3, 4 });
+    // Which columns hold dates? Some parsers output
+    // bits of the month in lower case, some all upper, eg JAN vs Jan
+    protected static final List dateColumns =
+            Arrays.asList(new Integer[] { 5, 6 });
+    // TODO Handle 60 vs 1960
 
     protected static String[] toCells(String row, boolean isTH) {
         // Split into cells, ignoring stuff before first cell
@@ -152,7 +163,7 @@ public class TabularFormatsTest extends TikaTest {
             }
         }
     }
-    protected void assertContents(String xml, boolean hasHeader) {
+    protected void assertContents(String xml, boolean hasHeader, boolean doesPercents) {
         // Ignore anything before the first
         // Ignore the header row if there is one
         int ignores = 1;
@@ -178,8 +189,14 @@ public class TabularFormatsTest extends TikaTest {
                     table.length, cells.length);
 
             for (int cn=0; cn<table.length; cn++) {
+                String val = cells[cn];
+
+                // If the parser doesn't know about % formats,
+                // skip the cell if the column in a % one
+                if (!doesPercents && percentageColumns.contains(cn)) continue;
+                if (dateColumns.contains(cn)) val = val.toUpperCase(Locale.ROOT);
+
                 // Ignore cell attributes
-                String val = cells.length > (cn-1) ? cells[cn] : "";
                 if (! val.isEmpty()) val = val.split(">")[1];
                 // Check
                 assertEquals("Wrong text in row " + (rn+1) + " and column " + (cn+1),
@@ -193,21 +210,25 @@ public class TabularFormatsTest extends TikaTest {
         XMLResult result = getXML("test-columnar.sas7bdat");
         String xml = result.xml;
         assertHeaders(xml, true, true, true);
-        //assertContents(xml, true);
+        // TODO Wait for https://github.com/epam/parso/issues/28 to be fixed
+        // then check the % formats again
+        //assertContents(xml, true, false);
     }
     @Test
     public void testXLS() throws Exception {
         XMLResult result = getXML("test-columnar.xls");
         String xml = result.xml;
         assertHeaders(xml, false, true, false);
-        //assertContents(xml, true);
+        // TODO Correctly handle empty cells then test
+        //assertContents(xml, true, false);
     }
     @Test
     public void testXLSX() throws Exception {
         XMLResult result = getXML("test-columnar.xlsx");
         String xml = result.xml;
         assertHeaders(xml, false, true, false);
-        //assertContents(xml, true);
+        // TODO Correctly handle empty cells then test
+        //assertContents(xml, true, false);
     }
 
     // TODO Test ODS
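The per-column rules that the assertContents change above introduces can be pulled out into a small standalone sketch. The column indexes 3, 4 (percentages) and 5, 6 (dates) come from the diff; the class CellCompareSketch and the method cellMatches are hypothetical names, and the JUnit assertion is replaced by a boolean so the example runs without any test framework.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical extraction of the comparison rules; not the Tika test itself.
class CellCompareSketch {
    // Column indexes taken from the diff
    static final List<Integer> PERCENTAGE_COLUMNS = Arrays.asList(3, 4);
    static final List<Integer> DATE_COLUMNS = Arrays.asList(5, 6);

    /**
     * Returns true if the cell is acceptable: percentage columns are skipped
     * for parsers that do not format percentages, and date columns are
     * compared after upper-casing (eg "Jan" vs "JAN").
     */
    static boolean cellMatches(int column, String expected, String actual,
                               boolean doesPercents) {
        if (!doesPercents && PERCENTAGE_COLUMNS.contains(column)) {
            return true; // skipped, nothing to compare
        }
        if (DATE_COLUMNS.contains(column)) {
            actual = actual.toUpperCase(Locale.ROOT);
        }
        return expected.equals(actual);
    }

    public static void main(String[] args) {
        System.out.println(cellMatches(6, "01JAN60:00:00:01", "01Jan60:00:00:01", true));
    }
}
```

Using Locale.ROOT for the upper-casing, as the commit does, keeps the comparison independent of the JVM's default locale.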
[tika] 01/04: Ensure that empty cells are still output
nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 6cff6029beb4316e541169d788fe1884b338
Author: Nick Burch <n...@gagravarr.org>
AuthorDate: Thu May 10 16:26:22 2018 +0100

    Ensure that empty cells are still output
---
 .../src/main/java/org/apache/tika/parser/sas/SAS7BDATParser.java | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/sas/SAS7BDATParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/sas/SAS7BDATParser.java
index 121d958..8b28644 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/sas/SAS7BDATParser.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/sas/SAS7BDATParser.java
@@ -134,7 +134,11 @@ public class SAS7BDATParser extends AbstractParser {
         while ((row = sas.readNext()) != null) {
             xhtml.startElement("tr");
             for (String val : DataWriterUtil.getRowValues(sas.getColumns(), row)) {
-                xhtml.element("td", val);
+                // Use explicit start/end, rather than element, to
+                // ensure that empty cells still get output
+                xhtml.startElement("td");
+                xhtml.characters(val);
+                xhtml.endElement("td");
             }
             xhtml.endElement("tr");
             xhtml.newline();
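The comment in the SAS7BDATParser change above implies that the element() convenience method emits nothing for an empty value, which would shift later cells to the left. The sketch below mimics that with plain strings; it assumes that behaviour rather than calling Tika, and EmptyCellSketch, element, explicit and row are hypothetical names.

```java
// Plain-string imitation of the two styles; assumes element() skips
// empty values, as the commit comment suggests. Not Tika code.
class EmptyCellSketch {
    /** Convenience style: writes nothing at all when the value is empty. */
    static void element(StringBuilder out, String name, String value) {
        if (value != null && !value.isEmpty()) {
            out.append('<').append(name).append('>')
               .append(value)
               .append("</").append(name).append('>');
        }
    }

    /** Explicit style: always writes the start and end tags. */
    static void explicit(StringBuilder out, String name, String value) {
        out.append('<').append(name).append('>')
           .append(value == null ? "" : value)
           .append("</").append(name).append('>');
    }

    /** Renders one table row using either style. */
    static String row(String[] values, boolean explicitCells) {
        StringBuilder out = new StringBuilder("<tr>");
        for (String val : values) {
            if (explicitCells) {
                explicit(out, "td", val);
            } else {
                element(out, "td", val);
            }
        }
        return out.append("</tr>").toString();
    }

    public static void main(String[] args) {
        String[] values = { "a", "", "c" };
        System.out.println(row(values, false));
        System.out.println(row(values, true));
    }
}
```

For the values "a", "", "c" the convenience style yields two cells and the explicit style yields three, so only the latter keeps "c" in the third column.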
[tika] 02/05: Add a time column to the test columnar files
nick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit ca2f5bc63b7595730e53e95758dc9aaf6b567daa
Author: Nick Burch <n...@gagravarr.org>
AuthorDate: Thu May 10 11:35:04 2018 +0100

    Add a time column to the test columnar files
---
 .../org/apache/tika/parser/TabularFormatsTest.java | 22 +++-
 .../apache/tika/parser/sas/SAS7BDATParserTest.java | 8 ++---
 .../resources/test-documents/test-columnar.csv | 37 +++--
 .../resources/test-documents/test-columnar.sas.xml | 11 ++
 .../test-documents/test-columnar.sas7bdat | Bin 17408 -> 17408 bytes
 .../resources/test-documents/test-columnar.xls | Bin 0 -> 6656 bytes
 .../resources/test-documents/test-columnar.xlsx | Bin 0 -> 4941 bytes
 .../resources/test-documents/test-columnar.xpt | Bin 4560 -> 4720 bytes
 .../src/test/resources/test-documents/testSAS2.sas | 27 ---
 9 files changed, 64 insertions(+), 41 deletions(-)

diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java
index 61fcca2..4dc7336 100644
--- a/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java
+++ b/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java
@@ -26,25 +26,31 @@ import org.junit.Test;
  * This is mostly focused on the XHTML output
  */
 public class TabularFormatsTest extends TikaTest {
-    protected static final String[] headers = new String[] {
-        "String (Num=)","Number","Date","Datetime","Number"
+    protected static final String[] columnNames = new String[] {
+         "recnum","square","desc","pctdone","pctinc",
+         "date","datetime","time"
     };
+    protected static final String[] columnLabels = new String[] {
+        "Record Number","Square of the Record Number",
+        "Description of the Row","Percent Done",
+        "Percent Increment","date","datetime","time"
+    };
+
     /**
      * Expected values, by column
      */
     protected static final String[][] table = new String[][] {
         // TODO All values
         new String[] {
-            "Num=0"
+             "0","1","2","3","4","5","6","7","8","9","10"
         },
         new String[] {
-            "0.0"
+             "0","1","4" // etc
         },
-        new String[] {
-            "1899-12-30"
+        new String[] { // etc
+            "01-01-1960"
         },
-        new String[] {
-            "1900-01-01 11:00:00"
+        new String[] { // etc
         },
         new String[] {
             ""
diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/sas/SAS7BDATParserTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/sas/SAS7BDATParserTest.java
index 3bb3e01..610ffc3 100644
--- a/tika-parsers/src/test/java/org/apache/tika/parser/sas/SAS7BDATParserTest.java
+++ b/tika-parsers/src/test/java/org/apache/tika/parser/sas/SAS7BDATParserTest.java
@@ -89,11 +89,11 @@ public class SAS7BDATParserTest extends TikaTest {
         assertEquals("application/x-sas-data", metadata.get(Metadata.CONTENT_TYPE));
         assertEquals("TESTING", metadata.get(TikaCoreProperties.TITLE));
 
-        assertEquals("2018-05-09T16:42:04Z", metadata.get(TikaCoreProperties.CREATED));
-        assertEquals("2018-05-09T16:42:04Z", metadata.get(TikaCoreProperties.MODIFIED));
+        assertEquals("2018-05-09T17:59:33Z", metadata.get(TikaCoreProperties.CREATED));
+        assertEquals("2018-05-09T17:59:33Z", metadata.get(TikaCoreProperties.MODIFIED));
 
         assertEquals("1", metadata.get(PagedText.N_PAGES));
-        assertEquals("7", metadata.get(Database.COLUMN_COUNT));
+        assertEquals("8", metadata.get(Database.COLUMN_COUNT));
         assertEquals("11", metadata.get(Database.ROW_COUNT));
         assertEquals("windows-1252", metadata.get(HttpHeaders.CONTENT_ENCODING));
         assertEquals("W32_7PRO", metadata.get(OfficeOpenXMLExtended.APPLICATION));
@@ -102,7 +102,7 @@ public class SAS7BDATParserTest extends TikaTest {
         assertEquals("Little", metadata.get(MachineMetadata.ENDIAN));
         assertEquals(Arrays.asList("Record Number","Square of the Record Number",
                                    "Description of the
[tika] branch master updated (a0ffec1 -> cfd6256)
nick pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git.

    from a0ffec1  Handle .epub files using .htm rather than .html extensions for the embedded contents (TIKA-1288)
     new d0324f8  Add a test .sas7bdat file with labels, and generate the columnar/tabular test file in a few more formats
     new ca2f5bc  Add a time column to the test columnar files
     new 1d7a113  CSV assert as best we can (no dedicated parser), start on XLS and SAS7BDAT consistency tests
     new 7f89db3  Check header contents, check data rows count, add XLSX test
     new cfd6256  Remaining values to check

The 5 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.

Summary of changes:
 .../org/apache/tika/parser/TabularFormatsTest.java | 196 +++--
 .../apache/tika/parser/sas/SAS7BDATParserTest.java | 51 --
 .../resources/test-documents/test-columnar.csv | 37 ++--
 .../resources/test-documents/test-columnar.sas.xml | 113 
 .../test-documents/test-columnar.sas7bdat | Bin 9216 -> 17408 bytes
 .../resources/test-documents/test-columnar.xls | Bin 0 -> 6656 bytes
 .../resources/test-documents/test-columnar.xlsx | Bin 0 -> 4941 bytes
 .../resources/test-documents/test-columnar.xpt | Bin 0 -> 4720 bytes
 .../src/test/resources/test-documents/testSAS2.sas | 67 +++
 9 files changed, 405 insertions(+), 59 deletions(-)
 create mode 100644 tika-parsers/src/test/resources/test-documents/test-columnar.sas.xml
 create mode 100644 tika-parsers/src/test/resources/test-documents/test-columnar.xls
 create mode 100644 tika-parsers/src/test/resources/test-documents/test-columnar.xlsx
 create mode 100644 tika-parsers/src/test/resources/test-documents/test-columnar.xpt
 create mode 100644 tika-parsers/src/test/resources/test-documents/testSAS2.sas
[tika] 01/05: Add a test .sas7bdat file with labels, and generate the columnar/tabular test file in a few more formats
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/tika.git commit d0324f8e4fa70fce67d56dc70f611f5535fe229b Author: Nick Burch <n...@gagravarr.org> AuthorDate: Wed May 9 18:19:34 2018 +0100 Add a test .sas7bdat file with labels, and generate the columnar/tabular test file in a few more formats --- .../apache/tika/parser/sas/SAS7BDATParserTest.java | 51 +++ .../resources/test-documents/test-columnar.sas.xml | 102 + .../test-documents/test-columnar.sas7bdat | Bin 9216 -> 17408 bytes .../resources/test-documents/test-columnar.xpt | Bin 0 -> 4560 bytes .../src/test/resources/test-documents/testSAS2.sas | 48 ++ 5 files changed, 182 insertions(+), 19 deletions(-) diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/sas/SAS7BDATParserTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/sas/SAS7BDATParserTest.java index 2657ac2..3bb3e01 100644 --- a/tika-parsers/src/test/java/org/apache/tika/parser/sas/SAS7BDATParserTest.java +++ b/tika-parsers/src/test/java/org/apache/tika/parser/sas/SAS7BDATParserTest.java @@ -82,36 +82,36 @@ public class SAS7BDATParserTest extends TikaTest { Metadata metadata = new Metadata(); try (InputStream stream = SAS7BDATParserTest.class.getResourceAsStream( -"/test-documents/test-columnar.sas7bdat")) { +"/test-documents/test-columnar.sas7bdat")) { parser.parse(stream, handler, metadata, new ParseContext()); } assertEquals("application/x-sas-data", metadata.get(Metadata.CONTENT_TYPE)); -assertEquals("SHEET1", metadata.get(TikaCoreProperties.TITLE)); +assertEquals("TESTING", metadata.get(TikaCoreProperties.TITLE)); -// Fri Mar 06 19:10:19 GMT 2015 -assertEquals("2015-03-06T19:10:19Z", metadata.get(TikaCoreProperties.CREATED)); -assertEquals("2015-03-06T19:10:19Z", metadata.get(TikaCoreProperties.MODIFIED)); +assertEquals("2018-05-09T16:42:04Z", metadata.get(TikaCoreProperties.CREATED)); +assertEquals("2018-05-09T16:42:04Z", 
metadata.get(TikaCoreProperties.MODIFIED)); assertEquals("1", metadata.get(PagedText.N_PAGES)); -assertEquals("5", metadata.get(Database.COLUMN_COUNT)); -assertEquals("31", metadata.get(Database.ROW_COUNT)); +assertEquals("7", metadata.get(Database.COLUMN_COUNT)); +assertEquals("11", metadata.get(Database.ROW_COUNT)); assertEquals("windows-1252", metadata.get(HttpHeaders.CONTENT_ENCODING)); -assertEquals("XP_PRO", metadata.get(OfficeOpenXMLExtended.APPLICATION)); -assertEquals("9.0101M3", metadata.get(OfficeOpenXMLExtended.APP_VERSION)); +assertEquals("W32_7PRO", metadata.get(OfficeOpenXMLExtended.APPLICATION)); +assertEquals("9.0301M2", metadata.get(OfficeOpenXMLExtended.APP_VERSION)); assertEquals("32", metadata.get(MachineMetadata.ARCHITECTURE_BITS)); assertEquals("Little", metadata.get(MachineMetadata.ENDIAN)); -assertEquals(Arrays.asList("A","B","C","D","E"), +assertEquals(Arrays.asList("Record Number","Square of the Record Number", + "Description of the Row","Percent Done", + "Percent Increment","date","datetime"), Arrays.asList(metadata.getValues(Database.COLUMN_NAME))); String content = handler.toString(); -assertContains("SHEET1", content); -assertContains("A\tB\tC", content); -assertContains("Num=0\t", content); -assertContains("Num=404242\t", content); -assertContains("\t0\t", content); -assertContains("\t404242\t", content); -assertContains("\t08Feb1904\t", content); +assertContains("TESTING", content); +assertContains("0\t0\tThis", content); +assertContains("2\t4\tThis", content); +assertContains("4\t16\tThis", content); +assertContains("\t01-01-1960\t", content); +assertContains("\t01Jan1960:00:00", content); } @Test @@ -129,7 +129,20 @@ public class SAS7BDATParserTest extends TikaTest { assertContains("This is row", xml); assertContains("10", xml); } + +@Test +public void testHTML2() throws Exception { +XMLResult result = getXML(&q
[tika] 05/05: Remaining values to check
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/tika.git commit cfd62569a8f6bf79ba5d15bb3f4063d49347c7fd Author: Nick Burch <n...@gagravarr.org> AuthorDate: Thu May 10 15:41:16 2018 +0100 Remaining values to check --- .../org/apache/tika/parser/TabularFormatsTest.java | 84 +++--- 1 file changed, 73 insertions(+), 11 deletions(-) diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java index 023f49d..7330f6a 100644 --- a/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java +++ b/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java @@ -44,24 +44,62 @@ public class TabularFormatsTest extends TikaTest { * Expected values, by column */ protected static final String[][] table = new String[][] { -// TODO All values new String[] { "0","1","2","3","4","5","6","7","8","9","10" }, new String[] { "0","1","4","9","16","25","36","49","64","81","100" }, -/* -new String[] { // etc -"01-01-1960" +new String[] {}, // Done later +new String[] { +"0%","10%","20%","30%","40%","50%", +"60%","70%","80%","90%","100%" +}, +new String[] { +"M","0.0%","50.0%","66.7%", +"75.0%","80.0%","83.3%","85.7%", +"87.5%","88.9%","90.0%" }, -new String[] { // etc +new String[] { + "01-01-1960", "02-01-1960", "17-01-1960", + "22-03-1960", "13-09-1960", "17-09-1961", + "20-07-1963", "29-07-1966", "20-03-1971", + "18-12-1977", "19-05-1987" }, new String[] { -"" + "01JAN60:00:00:01", + "01JAN60:00:00:10", + "01JAN60:00:01:40", + "01JAN60:00:16:40", + "01JAN60:02:46:40", + "02JAN60:03:46:40", + "12JAN60:13:46:40", + "25APR60:17:46:40", + "03MAR63:09:46:40", + "09SEP91:01:46:40", + "19NOV76:17:46:40" +}, +new String[] { + "0:00:01", + "0:00:03", + "0:00:09", + "0:00:27", + "0:01:21", + "0:04:03", + "0:12:09", + "0:36:27", + "1:49:21", + "5:28:03", + 
"16:24:09" } -*/ }; +static { +// Row text in 3rd column +table[2] = new String[table[0].length]; +for (int i=0; i<table[0].length; i++) { +table[2][i] = "This is row " + i + " of 10"; +} +} protected static String[] toCells(String row, boolean isTH) { // Split into cells, ignoring stuff before first cell @@ -72,9 +110,18 @@ public class TabularFormatsTest extends TikaTest { cells = row.split("<td"); } cells = Arrays.copyOfRange(cells, 1, cells.length); + +// Ignore the closing tag onwards, and normalise whitespace for (int i=0; i<cells.length; i++) { +cells[i] = cells[i].trim(); +if (cells[i].equals("/>")) { +cells[i] = ""; +continue; +} + int splitAt = cells[i].lastIndexOf(" (cn-1) ? cells[cn] : ""; +if (! val.isEmpty()) val = val.split(">")[1]; +// Check +assertEquals("Wrong text in row " + (rn+1) + " and column " + (cn+1), + table[cn][rn], val); +} +} } @Test @@ -133,21 +193,21 @@ public class TabularFormatsTest extends TikaTest { XMLResult result = getXML("test-columnar.sas7bdat"); String xml = result.xml; assertHeaders(xml, true, true, true); -assertCont
[tika] 04/05: Check header contents, check data rows count, add XLSX test
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/tika.git commit 7f89db35d066e6c4ae35490c5bad67d376e5365e Author: Nick Burch <n...@gagravarr.org> AuthorDate: Thu May 10 15:13:43 2018 +0100 Check header contents, check data rows count, add XLSX test --- .../org/apache/tika/parser/TabularFormatsTest.java | 77 +- 1 file changed, 61 insertions(+), 16 deletions(-) diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java index 8574d37..023f49d 100644 --- a/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java +++ b/tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java @@ -31,7 +31,7 @@ import org.junit.Test; */ public class TabularFormatsTest extends TikaTest { protected static final String[] columnNames = new String[] { - "recnum","square","desc","pctdone","pctinc", + "recnum","square","desc","pctdone","pctincr", "date","datetime","time" }; protected static final String[] columnLabels = new String[] { @@ -49,8 +49,9 @@ public class TabularFormatsTest extends TikaTest { "0","1","2","3","4","5","6","7","8","9","10" }, new String[] { - "0","1","4" // etc + "0","1","4","9","16","25","36","49","64","81","100" }, +/* new String[] { // etc "01-01-1960" }, @@ -59,37 +60,72 @@ public class TabularFormatsTest extends TikaTest { new String[] { "" } +*/ }; - -protected void assertHeaders(String xml, boolean isTH, boolean hasLabel, boolean hasName) { -// Find the first row -int splitAt = xml.indexOf(""); -String hRow = xml.substring(0, splitAt); -splitAt = xml.indexOf(""); -hRow = hRow.substring(splitAt+4); - + +protected static String[] toCells(String row, boolean isTH) { // Split into cells, ignoring stuff before first cell String[] cells; if (isTH) { -cells = hRow.split("<th"); +cells = row.split("<th"); } else { -cells = 
hRow.split("<td"); +cells = row.split("<td"); } cells = Arrays.copyOfRange(cells, 1, cells.length); for (int i=0; i<cells.length; i++) { -splitAt = cells[i].lastIndexOf(""); +String hRow = xml.substring(0, splitAt); +splitAt = xml.indexOf(""); +hRow = hRow.substring(splitAt+4); + +// Split into cells, ignoring stuff before first cell +String[] cells = toCells(hRow, isTH); // Check we got the right number assertEquals("Wrong number of cells in header row " + hRow, columnLabels.length, cells.length); // Check we got the right stuff -// TODO +for (int i=0; i<cells.length; i++) { +if (hasLabel && hasName) { +assertContains("title=\"" + columnNames[i] + "\"", cells[i]); +assertContains(">" + columnLabels[i], cells[i]); +} else if (hasName) { +assertContains(">" + columnNames[i], cells[i]); +} else { +assertContains(">" + columnLabels[i], cells[i]); +} +} } protected void assertContents(String xml, boolean hasHeader) { -// TODO Check the rows +// Ignore anything before the first +// Ignore the header row if there is one +int ignores = 1; +if (hasHeader) ignores++; + +// Split into rows, and discard the row closing (and anything after) +String[] rows = xml.split(""); +rows = Arrays.copyOfRange(rows, ignores, rows.length); +for (int i=0; i<rows.length; i++) { +rows[i] = rows[i].split("")[0].trim(); +} + +// Check we got the right number of rows +for (int cn=0; cn<table.length; cn++) { +assertEquals("Wrong number of rows found compared to column " + (cn+1), + table[cn].length, rows.length); +} + +// Check each row's values +// TODO } @Te
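The header handling introduced in this commit (split the header row into cells, then check each cell for the column name and/or label) can be sketched on its own. This is a minimal approximation, not the code from TabularFormatsTest: the class name `HeaderRowSketch` and the method `headerCellMatches` are hypothetical, and the closing-tag strings `</th>` and `</td>` are an assumption, because the archived diff lost them. The empty-cell (`/>`) branch follows the follow-up commit 05/05.

```java
import java.util.Arrays;

public class HeaderRowSketch {
    // Split one row of XHTML into cell fragments, dropping anything before the first cell.
    static String[] toCells(String row, boolean isTH) {
        String open = isTH ? "<th" : "<td";
        String close = isTH ? "</th>" : "</td>"; // assumption: lost from the archived diff
        String[] cells = row.split(open);
        cells = Arrays.copyOfRange(cells, 1, cells.length);
        for (int i = 0; i < cells.length; i++) {
            cells[i] = cells[i].trim();
            if (cells[i].equals("/>")) {
                // Self-closed cell, i.e. no content
                cells[i] = "";
                continue;
            }
            // Ignore the closing tag onwards
            int splitAt = cells[i].lastIndexOf(close);
            if (splitAt >= 0) {
                cells[i] = cells[i].substring(0, splitAt);
            }
        }
        return cells;
    }

    // Hypothetical helper covering the three cases the test checks per header cell.
    static boolean headerCellMatches(String cell, String name, String label,
                                     boolean hasLabel, boolean hasName) {
        if (hasLabel && hasName) {
            // Name is expected in the title attribute, label as the cell text
            return cell.contains("title=\"" + name + "\"") && cell.contains(">" + label);
        } else if (hasName) {
            return cell.contains(">" + name);
        } else {
            return cell.contains(">" + label);
        }
    }
}
```

Each returned fragment still carries the attributes and the `>` that ended the opening tag, which is why the checks look for `">" + text` rather than the bare text.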
[tika] branch master updated: Handle .epub files using .htm rather than .html extensions for the embedded contents (TIKA-1288)
This is an automated email from the ASF dual-hosted git repository. nick pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/tika.git The following commit(s) were added to refs/heads/master by this push: new a0ffec1 Handle .epub files using .htm rather than .html extensions for the embedded contents (TIKA-1288) a0ffec1 is described below commit a0ffec146e84fdcf4c747b4375f92ae283944f4c Author: Nick Burch <n...@gagravarr.org> AuthorDate: Wed May 9 10:23:09 2018 +0100 Handle .epub files using .htm rather than .html extensions for the embedded contents (TIKA-1288) --- CHANGES.txt| 3 +++ tika-parsers/src/main/java/org/apache/tika/parser/epub/EpubParser.java | 3 ++- 2 files changed, 5 insertions(+), 1 deletion(-) diff --git a/CHANGES.txt b/CHANGES.txt index 194fef8..c66e883 100644 --- a/CHANGES.txt +++ b/CHANGES.txt @@ -73,6 +73,9 @@ Release 2.0.0 - ??? * Support for SAS7BDAT data files (TIKA-2462) + * Handle .epub files using .htm rather than .html extensions for the + embedded contents (TIKA-1288) + Release 1.17 - 12/8/2017 ***NOTE: THIS IS THE LAST VERSION OF TIKA THAT WILL RUN diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/epub/EpubParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/epub/EpubParser.java index c4f72de..775b319 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/epub/EpubParser.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/epub/EpubParser.java @@ -105,7 +105,8 @@ public class EpubParser extends AbstractParser { meta.parse(zip, new DefaultHandler(), metadata, context); } else if (entry.getName().endsWith(".opf")) { meta.parse(zip, new DefaultHandler(), metadata, context); -} else if (entry.getName().endsWith(".html") || +} else if (entry.getName().endsWith(".htm") || + entry.getName().endsWith(".html") || entry.getName().endsWith(".xhtml")) { content.parse(zip, childHandler, metadata, context); }
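The effect of the EpubParser change is easiest to see as a standalone predicate. The following is a sketch only: the class `EpubEntrySketch` and the method `isHtmlContent` are hypothetical names, not part of Tika, and the check is case-sensitive because the diff uses plain `endsWith` calls.

```java
public class EpubEntrySketch {
    // Mirrors the extension test in the diff: .htm is now accepted alongside .html and .xhtml.
    static boolean isHtmlContent(String entryName) {
        return entryName.endsWith(".htm")
                || entryName.endsWith(".html")
                || entryName.endsWith(".xhtml");
    }
}
```

Before the change, an entry such as `chapter1.htm` would have fallen through every branch and its text would not have reached the content handler.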