[jira] [Commented] (TIKA-1241) Tika does not recognise empty nor spanning ZIP files magic
[ https://issues.apache.org/jira/browse/TIKA-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13910315#comment-13910315 ]

ASF GitHub Bot commented on TIKA-1241:

Github user cstamas closed the pull request at:

    https://github.com/apache/tika/pull/4

Tika does not recognise empty nor spanning ZIP files magic

Key: TIKA-1241
URL: https://issues.apache.org/jira/browse/TIKA-1241
Project: Tika
Issue Type: Improvement
Reporter: Cservenak, Tamas
Priority: Minor
Fix For: 1.6

As it turns out, the magic differs for non-empty, empty and spanning ZIP files. Tika recognizes only non-empty ZIP files.
The magic for an empty ZIP file was validated with hexdump: https://gist.github.com/cstamas/6e90ae73f83c8e4a3f42
It is also described on Wikipedia: http://en.wikipedia.org/wiki/Zip_(file_format) (see the sidebar with Magic Numbers).
Proposed change: add two more match entries to the ZIP MIME definition: https://github.com/apache/tika/pull/4

--
This message was sent by Atlassian JIRA (v6.1.5#6160)
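To make the three signatures concrete, here is a minimal Java sketch that classifies the first four bytes of a file. It is not the change from the pull request (which adds match entries to the MIME definition); the class name `ZipMagic` and the returned labels are made up for illustration. The byte values are the ones listed in the Wikipedia sidebar referenced above.

```java
import java.util.Arrays;

/** Illustrative only: classifies a header against the three ZIP signatures. */
final class ZipMagic {
    // "PK" followed by two marker bytes.
    private static final byte[] LOCAL_FILE_HEADER = {'P', 'K', 0x03, 0x04};  // non-empty archive
    private static final byte[] END_OF_CENTRAL_DIR = {'P', 'K', 0x05, 0x06}; // empty archive
    private static final byte[] SPANNED_MARKER = {'P', 'K', 0x07, 0x08};     // spanned archive

    /** Returns "non-empty", "empty", "spanned" or "not-zip" (hypothetical labels). */
    static String classify(byte[] header) {
        if (header == null || header.length < 4) {
            return "not-zip";
        }
        byte[] first4 = Arrays.copyOf(header, 4);
        if (Arrays.equals(first4, LOCAL_FILE_HEADER)) {
            return "non-empty";
        }
        if (Arrays.equals(first4, END_OF_CENTRAL_DIR)) {
            return "empty";
        }
        if (Arrays.equals(first4, SPANNED_MARKER)) {
            return "spanned";
        }
        return "not-zip";
    }

    public static void main(String[] args) {
        // An empty archive starts directly with the end-of-central-directory record.
        System.out.println(classify(new byte[] {'P', 'K', 0x05, 0x06}));
    }
}
```

A detector that only checks for the first signature would report "not-zip" for the other two cases, which is the behaviour the issue describes.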
[jira] [Commented] (TIKA-1169) Fails to parse jnilib file
[ https://issues.apache.org/jira/browse/TIKA-1169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13999825#comment-13999825 ]

ASF GitHub Bot commented on TIKA-1169:

Github user asfgit closed the pull request at:

    https://github.com/apache/tika/pull/8

Fails to parse jnilib file

Key: TIKA-1169
URL: https://issues.apache.org/jira/browse/TIKA-1169
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.4
Environment: Windows 7 x64, Java 1.6.45
Reporter: brat
Priority: Critical
Fix For: 1.5
Attachments: libwrapper.jnilib

Hi, I'm trying to parse a folder with a jnilib file inside, but Tika 1.4 throws an exception:

java.io.IOException:
    at org.apache.tika.parser.ParsingReader.read(ParsingReader.java:260)
    at java.io.Reader.read(Unknown Source)
    at ca.cloudscraper.core.impl.Engine.process(Engine.java:63)
    at ca.cloudscraper.core.impl.Engine.process(Engine.java:34)
    at ca.cloudscraper.core.impl.Engine.process(Engine.java:34)
    at ca.cloudscraper.core.impl.Engine.process(Engine.java:34)
    at ca.cloudscraper.core.impl.Engine.execute(Engine.java:117)
    at ca.cloudscraper.core.tests.LuceneServiceImplTest.test5(LuceneServiceImplTest.java:140)
    at ca.cloudscraper.core.tests.LuceneServiceImplTest.main(LuceneServiceImplTest.java:176)
Caused by: org.apache.tika.exception.TikaException: Failed to parse a Java class
    at org.apache.tika.parser.asm.XHTMLClassVisitor.parse(XHTMLClassVisitor.java:66)
    at org.apache.tika.parser.asm.ClassParser.parse(ClassParser.java:51)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.parser.ParsingReader$ParsingTask.run(ParsingReader.java:221)
    at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
    at org.objectweb.asm.ClassReader.readClass(ClassReader.java:2157)
    at org.objectweb.asm.ClassReader.accept(ClassReader.java:542)
    at org.objectweb.asm.ClassReader.accept(ClassReader.java:506)
    at org.apache.tika.parser.asm.XHTMLClassVisitor.parse(XHTMLClassVisitor.java:61)
    ... 6 more

It seems that Tika tries to parse this file as a Java class file, but that obviously doesn't work.
I've tried to create a custom-mimetypes.xml file like this:

    <?xml version="1.0" encoding="UTF-8"?>
    <mime-info>
      <mime-type type="application/octet-stream">
        <_comment>Mac OSX jnilib</_comment>
        <glob pattern="*.jnilib"/>
      </mime-type>
    </mime-info>

but after I repack tika-app-1.4.jar with this file in the org.apache.tika.mime folder, the problem still exists.
The jnilib file is actually from the ActiveMQ 5.8.0 binary, found in bin/macosx/libwrapper.jnilib
[jira] [Commented] (TIKA-1169) Fails to parse jnilib file
[ https://issues.apache.org/jira/browse/TIKA-1169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13999705#comment-13999705 ]

ASF GitHub Bot commented on TIKA-1169:

GitHub user mkr opened a pull request:

    https://github.com/apache/tika/pull/8

    TIKA-1169: Adding other Mach-O magic bytes for jnilib files.

    Adding remaining Mach-O binary signatures to fix TIKA-1169

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mkr/tika tika-1169-2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tika/pull/8.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #8

commit 8b033abfa19a3cb8939ac67bbabcd4d53e89ec42
Author: Matthias Krueger m...@mkr.io
Date: 2014-05-16T08:58:40Z

    TIKA-1169: Adding other Mach-O magic bytes for jnilib files.

    MH_MAGIC = 0xfeedface
    MH_CIGAM = 0xcefaedfe
    MH_MAGIC_64 = 0xfeedfacf
    MH_CIGAM_64 = 0xcffaedfe

    See https://developer.apple.com/library/mac/documentation/DeveloperTools/Conceptual/MachORuntime/Reference/reference.html
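As a rough illustration of what the four signatures in the commit message mean for a detector, the following Java sketch reads the first four bytes of a file as a big-endian integer and compares it with each value. The pull request itself adds magic entries to Tika's MIME definitions rather than Java code; the class and method names here are hypothetical.

```java
import java.nio.ByteBuffer;

/** Illustrative only: checks a header against the four Mach-O magic values. */
final class MachOMagic {
    static final int MH_MAGIC = 0xfeedface;    // 32-bit
    static final int MH_CIGAM = 0xcefaedfe;    // 32-bit, byte-swapped
    static final int MH_MAGIC_64 = 0xfeedfacf; // 64-bit
    static final int MH_CIGAM_64 = 0xcffaedfe; // 64-bit, byte-swapped

    static boolean looksLikeMachO(byte[] header) {
        if (header == null || header.length < 4) {
            return false;
        }
        // ByteBuffer is big-endian by default, so the bytes are compared in file order.
        int magic = ByteBuffer.wrap(header, 0, 4).getInt();
        return magic == MH_MAGIC || magic == MH_CIGAM
                || magic == MH_MAGIC_64 || magic == MH_CIGAM_64;
    }

    public static void main(String[] args) {
        byte[] header = {(byte) 0xcf, (byte) 0xfa, (byte) 0xed, (byte) 0xfe};
        System.out.println(looksLikeMachO(header));
    }
}
```

The swapped (CIGAM) values matter because the same header appears byte-reversed depending on the endianness of the machine that produced the binary.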
[jira] [Commented] (TIKA-1292) Inconsistent priorities in bundled tika-mimetypes.xml
[ https://issues.apache.org/jira/browse/TIKA-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14004730#comment-14004730 ]

ASF GitHub Bot commented on TIKA-1292:

Github user asfgit closed the pull request at:

    https://github.com/apache/tika/pull/6

Inconsistent priorities in bundled tika-mimetypes.xml

Key: TIKA-1292
URL: https://issues.apache.org/jira/browse/TIKA-1292
Project: Tika
Issue Type: Bug
Components: mime
Affects Versions: 1.5
Reporter: Cservenak, Tamas

It seems that MIME type priorities are a bit inconsistent in the tika-mimetypes.xml bundled with tika-core. A few examples:
* [application/zip|https://github.com/apache/tika/blob/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L3497] vs [application/x-7z-compressed|https://github.com/apache/tika/blob/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L3510]: both are similar container archive formats (structured, having entries) with distinct file extensions (zip vs 7z globs), yet their priorities are 40 and 50 respectively.
* [application/zip|https://github.com/apache/tika/blob/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L3497] vs [text/html|https://github.com/apache/tika/blob/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L4713]: unrelated MIME types with the same priority of 40. But ZIP files can be uncompressed (meaning entries are mostly concatenated, and their content, if plain text, is readable). Hence an uncompressed ZIP (or any subclass such as JAR) that contains HTML files might be detected as HTML, which is wrong.

And this is what happens in Nexus, which uses Tika under the hood for content validation, basically using the MIME magic detection provided by the Tika Detector: the Java JAR {{com.intellij:annotations:7.0.3}} ([link|http://repo1.maven.org/maven2/com/intellij/annotations/7.0.3/]) is detected as {{text/html}} instead of the expected {{application/java-archive}}. The reason is the following: the JAR file is zipped up in uncompressed ZIP format, and among a few annotations it also contains one HTML file entry (the license, I guess). Since both MIME types have the same priority (40), I guess Tika randomly chooses {{text/html}}.

Original Nexus issue: https://issues.sonatype.org/browse/NEXUS-6560
On the Nexus issue there is a GitHub pull request that solves the problem for us (by raising the {{application/zip}} priority to 41). But by inspecting the bundled tika-mimetypes.xml we spotted other probable priority inconsistencies, like that of zip vs 7z mentioned above.

Note: this happens when only tika-core is on the classpath and is used for MIME magic detection. Interestingly, when tika-parsers (with all its dependencies) is added to the classpath, Tika properly figures out that the artifact is {{application/java-archive}}. Still, our use case in Nexus requires MIME magic detection only, so we do not use tika-parsers, nor would we like to.

Sample project to reproduce: https://github.com/cstamas/tika-1292
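The tie described above can be reproduced with a toy model. The sketch below is not Tika's detection code: the `Rule` class, the offset window of 64 bytes for the HTML pattern, and the "first rule listed wins a tie" behaviour are all assumptions made for illustration. It only shows why two matching rules with equal priority give an order-dependent answer, and why raising the ZIP priority to 41 removes the ambiguity.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

/** Toy model of priority-based magic matching; not Tika's implementation. */
final class PriorityTieDemo {
    /** Hypothetical magic rule: a byte pattern that may start anywhere in [0, maxOffset]. */
    static final class Rule {
        final String type;
        final byte[] pattern;
        final int maxOffset;
        final int priority;

        Rule(String type, String pattern, int maxOffset, int priority) {
            this.type = type;
            this.pattern = pattern.getBytes(StandardCharsets.ISO_8859_1);
            this.maxOffset = maxOffset;
            this.priority = priority;
        }

        boolean matches(byte[] data) {
            for (int off = 0; off <= maxOffset && off + pattern.length <= data.length; off++) {
                if (Arrays.equals(pattern, Arrays.copyOfRange(data, off, off + pattern.length))) {
                    return true;
                }
            }
            return false;
        }
    }

    /** Highest priority wins; on a tie the rule listed first is kept (an assumption of this toy). */
    static String detect(byte[] data, List<Rule> rules) {
        Rule best = null;
        for (Rule rule : rules) {
            if (rule.matches(data) && (best == null || rule.priority > best.priority)) {
                best = rule;
            }
        }
        return best == null ? "application/octet-stream" : best.type;
    }

    /** text/html is fixed at priority 40; the application/zip priority is configurable. */
    static List<Rule> rules(int zipPriority) {
        return Arrays.asList(
                new Rule("text/html", "<html", 64, 40),
                new Rule("application/zip", "PK\u0003\u0004", 0, zipPriority));
    }

    /** Fake stored (uncompressed) archive: ZIP signature followed by readable HTML. */
    static byte[] sample() {
        return "PK\u0003\u0004stored entry <html><body>license</body></html>"
                .getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(detect(sample(), rules(40))); // tie between the two rules
        System.out.println(detect(sample(), rules(41))); // zip outranks html
    }
}
```

In this model the result for equal priorities depends purely on rule order, which is one plausible reading of the "randomly chooses" observation in the report.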
[jira] [Commented] (TIKA-1292) Inconsistent priorities in bundled tika-mimetypes.xml
[ https://issues.apache.org/jira/browse/TIKA-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005933#comment-14005933 ]

ASF GitHub Bot commented on TIKA-1292:

Github user cstamas closed the pull request at:

    https://github.com/apache/tika/pull/7
[jira] [Commented] (TIKA-1322) XML file parse errors within archives trigger Zip bomb detection
[ https://issues.apache.org/jira/browse/TIKA-1322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018247#comment-14018247 ]

ASF GitHub Bot commented on TIKA-1322:

GitHub user mkr opened a pull request:

    https://github.com/apache/tika/pull/9

    TIKA-1322: Properly close XMLParser's output in case of SAXException.

    Fix and test for https://issues.apache.org/jira/browse/TIKA-1322.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mkr/tika TIKA-1322

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tika/pull/9.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #9

commit 63d979538a72e5c044b2074219268da57fcf48cd
Author: Matthias Krueger m...@mkr.io
Date: 2014-06-04T21:45:15Z

    TIKA-1322: Properly close XMLParser's output in case of SAXException.

XML file parse errors within archives trigger Zip bomb detection

Key: TIKA-1322
URL: https://issues.apache.org/jira/browse/TIKA-1322
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.5
Reporter: Matthias Krueger
Priority: Minor

Tika parses XML input using org.apache.tika.parser.xml.XMLParser. XMLParser opens a p tag before the SAXParser's output for the input XML is appended. A SAXException thrown during parsing is rethrown, but the opened p tag is not closed. The Zip bomb detection in SecureContentHandler relies on elements being started and closed consistently. With the current behaviour of XMLParser, it will be triggered, for example, if an archive contains 10 (SecureContentHandler#maxPackageEntryDepth) invalid XML files.
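The general shape of such a fix can be sketched with plain SAX classes from the JDK. This is not the patch from the pull request: `DepthCounter` is a hypothetical stand-in for a handler that tracks nesting depth, and `Body` stands for the part of a parse that may fail. The point is only that closing the element in a `finally` block keeps start and end events balanced even when a SAXException propagates.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Illustrative only: keep SAX start/end events balanced when parsing fails. */
final class BalancedElementDemo {
    /** Counts currently open elements (hypothetical stand-in for a depth check). */
    static final class DepthCounter extends DefaultHandler {
        int depth;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            depth++;
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            depth--;
        }
    }

    /** The part of the parse that may throw part-way through. */
    interface Body {
        void run(ContentHandler handler) throws SAXException;
    }

    /** Opens a p element, runs the body, and always closes the element again. */
    static void wrapInParagraph(ContentHandler handler, Body body) throws SAXException {
        handler.startElement("", "p", "p", new AttributesImpl());
        try {
            body.run(handler);
        } finally {
            handler.endElement("", "p", "p");
        }
    }

    /** Simulates a failing parse and reports how many elements are left open. */
    static int depthAfterFailure() {
        DepthCounter counter = new DepthCounter();
        try {
            wrapInParagraph(counter, h -> {
                throw new SAXException("invalid XML");
            });
        } catch (SAXException expected) {
            // The exception still reaches the caller; only the element balance is repaired.
        }
        return counter.depth;
    }

    public static void main(String[] args) {
        System.out.println(depthAfterFailure());
    }
}
```

Without the `finally`, each failing entry would leave one element open, so a handler that limits nesting depth would eventually trip after enough invalid files.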
[jira] [Commented] (TIKA-1322) XML file parse errors within archives trigger Zip bomb detection
[ https://issues.apache.org/jira/browse/TIKA-1322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14019737#comment-14019737 ]

ASF GitHub Bot commented on TIKA-1322:

Github user asfgit closed the pull request at:

    https://github.com/apache/tika/pull/9

XML file parse errors within archives trigger Zip bomb detection

Key: TIKA-1322
URL: https://issues.apache.org/jira/browse/TIKA-1322
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.5
Reporter: Matthias Krueger
Priority: Minor
Fix For: 1.6
[jira] [Commented] (TIKA-1336) Provide a Detector JAXRS endpoint
[ https://issues.apache.org/jira/browse/TIKA-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031661#comment-14031661 ]

ASF GitHub Bot commented on TIKA-1336:

Github user asfgit closed the pull request at:

    https://github.com/apache/tika/pull/10

Provide a Detector JAXRS endpoint

Key: TIKA-1336
URL: https://issues.apache.org/jira/browse/TIKA-1336
Project: Tika
Issue Type: Improvement
Components: detector, server
Affects Versions: 1.5
Reporter: Nick Burch
Assignee: Chris A. Mattmann
Fix For: 1.6

As identified in TIKA-1335, the Tika Server now has an endpoint which will tell you what Detectors are available to it, but not one that will trigger detection. That means your only way to do detection is to request the metadata and check the content type, but that isn't always as accurate as an explicit detection call (e.g. if a general parser picks up the file).
We should therefore add a new endpoint that just does the detection.
[jira] [Commented] (TIKA-1350) OutlookPSTParser: Unknown message type: IPM.Note
[ https://issues.apache.org/jira/browse/TIKA-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038862#comment-14038862 ]

ASF GitHub Bot commented on TIKA-1350:

GitHub user jrhe opened a pull request:

    https://github.com/apache/tika/pull/12

    Bumps libpst version to fix TIKA-1350

    When parsing some emails in a PST file I get the error "Unknown message type: IPM.Note", preventing them from being parsed. This is because of an extra null byte at the end of the message class string. This has been fixed in version 0.8.1 of java-libpst, so a version bump is all that is required.
    https://github.com/rjohnsondev/java-libpst/issues/14

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jrhe/tika TIKA-1350

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tika/pull/12.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #12

commit 29272a0f4abb01407b795426e563cba8ed134107
Author: Jonathan Richard Henry Evans (JRHE) cont...@jrhe.co.uk
Date: 2014-06-20T14:53:21Z

    Bumps libpst version to fix TIKA-1350

OutlookPSTParser: Unknown message type: IPM.Note

Key: TIKA-1350
URL: https://issues.apache.org/jira/browse/TIKA-1350
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.7
Reporter: Jonathan Evans
Labels: libpst, parser, pst
Fix For: 1.7
Original Estimate: 0.2h
Remaining Estimate: 0.2h

When parsing some emails in a PST file I get the error "Unknown message type: IPM.Note", preventing them from being parsed. This is because of an extra null byte at the end of the message class string. This has been fixed in version 0.8.1 of java-libpst, so a version bump is all that is required.
https://github.com/rjohnsondev/java-libpst/issues/14
I would attempt to do this myself but I am unsure how to open a pull request with SVN.
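The failure mode described in this report is easy to demonstrate in isolation. The sketch below does not use java-libpst and is not how version 0.8.1 fixes the problem; the class and method names are invented. It only shows that a string with a trailing NUL character prints like "IPM.Note" yet fails an exact comparison, and that stripping trailing NULs before comparing avoids the mismatch.

```java
/** Illustrative only: a trailing NUL makes an otherwise valid message class unrecognisable. */
final class MessageClassDemo {
    /** Removes any NUL characters from the end of the string. */
    static String stripTrailingNuls(String value) {
        int end = value.length();
        while (end > 0 && value.charAt(end - 1) == '\0') {
            end--;
        }
        return value.substring(0, end);
    }

    /** Compares against "IPM.Note" after normalising the input. */
    static boolean isNote(String messageClass) {
        return "IPM.Note".equals(stripTrailingNuls(messageClass));
    }

    public static void main(String[] args) {
        String raw = "IPM.Note\0";
        System.out.println("IPM.Note".equals(raw)); // exact comparison on the raw value
        System.out.println(isNote(raw));            // comparison after stripping NULs
    }
}
```

Because the NUL is invisible in most log output, the resulting error message shows a message type that looks perfectly valid, which matches the confusing "Unknown message type: IPM.Note" text in the report.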
[jira] [Commented] (TIKA-1354) ForkParser doesn't work in OSGI container
[ https://issues.apache.org/jira/browse/TIKA-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14042129#comment-14042129 ]

ASF GitHub Bot commented on TIKA-1354:

GitHub user hlavki opened a pull request:

    https://github.com/apache/tika/pull/13

    [TIKA-1354] Register ForkParser service in Activator and add simple test

    There may be another way, but I didn't find it. It would be good if somebody with more OSGi knowledge also looked at it.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/hlavki/tika trunk

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tika/pull/13.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #13

commit cc2f328365e649e1290ce23e48f43376c5d42687
Author: Michal Hlavac hla...@hlavki.eu
Date: 2014-06-24T13:26:42Z

    [TIKA-1354] Register ForkParser service in Activator and add simple test

ForkParser doesn't work in OSGI container

Key: TIKA-1354
URL: https://issues.apache.org/jira/browse/TIKA-1354
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.6
Reporter: Michal Hlavac

I can't find a way to run ForkParser in an OSGi container.
[jira] [Commented] (TIKA-1350) OutlookPSTParser: Unknown message type: IPM.Note
[ https://issues.apache.org/jira/browse/TIKA-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14042339#comment-14042339 ]

ASF GitHub Bot commented on TIKA-1350:

Github user asfgit closed the pull request at:

    https://github.com/apache/tika/pull/12
[jira] [Commented] (TIKA-1361) Update MP4Parser to 1.0.2
[ https://issues.apache.org/jira/browse/TIKA-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073543#comment-14073543 ]

ASF GitHub Bot commented on TIKA-1361:

Github user asfgit closed the pull request at:

    https://github.com/apache/tika/pull/14

Update MP4Parser to 1.0.2

Key: TIKA-1361
URL: https://issues.apache.org/jira/browse/TIKA-1361
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 1.5
Reporter: Matthias Krueger

The currently used com.googlecode.mp4parser:isoparser version is 1.0-RC-1. According to https://code.google.com/p/mp4parser/#Changes/Releases and https://code.google.com/p/mp4parser/source/list there have been quite a few improvements since then. Before tackling more metadata (such as in TIKA-852) we should update to 1.0.2.
[jira] [Commented] (TIKA-1369) Date parsing and thread safety in ImageMetadataExtractor
[ https://issues.apache.org/jira/browse/TIKA-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077584#comment-14077584 ]

ASF GitHub Bot commented on TIKA-1369:

GitHub user vilmospapp opened a pull request:

    https://github.com/apache/tika/pull/15

    TIKA-1369 Resolve thread safety issue in ImageMetadataExtractor

    Hi,
    This fix tries to resolve TIKA-1369 by handling thread safety with a ThreadLocal, which avoids additional library dependencies. I have run the test cases, so it seems correct to me, though I haven't found any other occurrence of ThreadLocal in Tika's source, so perhaps it's against your general patterns.
    Regards,
    Vilmos

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/vilmospapp/tika TIKA-1369

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tika/pull/15.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #15

commit 3a9575fc56a6463b4378b14820e9079352bb1848
Author: Vilmos Papp papp.gyorgy.vil...@gmail.com
Date: 2014-07-23T09:18:50Z

    TIKA-1369 Make SimpleDateFormat usage thread safe

Date parsing and thread safety in ImageMetadataExtractor

Key: TIKA-1369
URL: https://issues.apache.org/jira/browse/TIKA-1369
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.5
Environment: OS X 10.9.4 Java 7_60
Reporter: John Gibson
Priority: Critical

The {{ImageMetadataExtractor}} uses a static instance of {{SimpleDateFormat}}. This is not thread safe.
{code:title=ImageMetadataExtractor.java}
static class ExifHandler implements DirectoryHandler {
    private static final SimpleDateFormat DATE_UNSPECIFIED_TZ = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
    ...
    public void handleDateTags(Directory directory, Metadata metadata)
            throws MetadataException {
        // Date/Time Original overrides value from ExifDirectory.TAG_DATETIME
        Date original = null;
        if (directory.containsTag(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL)) {
            original = directory.getDate(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL);
            // Unless we have GPS time we don't know the time zone so date must be set
            // as ISO 8601 datetime without timezone suffix (no Z or +/-)
            if (original != null) {
                String datetimeNoTimeZone = DATE_UNSPECIFIED_TZ.format(original); // Same time zone as Metadata Extractor uses
                metadata.set(TikaCoreProperties.CREATED, datetimeNoTimeZone);
                metadata.set(Metadata.ORIGINAL_DATE, datetimeNoTimeZone);
            }
        }
        ...
{code}
This is not the first time that SDF has caused problems: TIKA-495, TIKA-864. In the discussion there, the idea of using alternative thread-safe (and faster) formatters from either Joda Time or Commons Lang was dismissed because they would add too many dependencies. Given that Tika already has a fairly large laundry list of dependencies to parse content, adding one more JAR to make sure things don't break is probably a good idea.
In addition, because no timezone or locale is specified by either Tika's formatter or the call to com.drew.metadata.Directory, it can wreak havoc during randomized testing. Given that the timezone is unknown, why not just default it to UTC and let the caller guess the timezone? As it stands, I have to reparse all of the dates into UTC to get stable behavior across timezones.
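For readers unfamiliar with the ThreadLocal approach mentioned in the pull request, here is a minimal standalone sketch. It is not the code from the pull request: the class name is invented, and pinning the formatter to UTC and Locale.ROOT follows the reporter's suggestion rather than anything Tika is known to do. Each thread gets its own SimpleDateFormat instance, so no instance is ever shared, and `cleanup()` removes the per-thread instance so that long-lived pooled threads do not retain it.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

/** Illustrative only: one SimpleDateFormat per thread instead of a shared static instance. */
final class ThreadSafeDateFormat {
    private static final ThreadLocal<SimpleDateFormat> FORMAT = ThreadLocal.withInitial(() -> {
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss", Locale.ROOT);
        // Assumption for this sketch: pin the zone so output does not depend on the JVM default.
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        return format;
    });

    /** Formats with the calling thread's own formatter instance. */
    static String format(Date date) {
        return FORMAT.get().format(date);
    }

    /** Drops the calling thread's formatter, e.g. before a pooled thread is reused. */
    static void cleanup() {
        FORMAT.remove();
    }

    public static void main(String[] args) {
        try {
            System.out.println(format(new Date(0L)));
        } finally {
            cleanup();
        }
    }
}
```

The trade-off is one formatter allocation per thread in exchange for avoiding both synchronization and an extra library dependency.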
[jira] [Commented] (TIKA-1435) Update rome dependency to 1.5
[ https://issues.apache.org/jira/browse/TIKA-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14157837#comment-14157837 ]

ASF GitHub Bot commented on TIKA-1435:

GitHub user jotomo opened a pull request:

    https://github.com/apache/tika/pull/16

    TIKA-1435: Upgrade Rome to 1.5

    Adopt new namespace and enjoy generics.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jotomo/tika rome-1.5

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tika/pull/16.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #16

commit b6e3a51be79efc04fdd643378f67b2f7d3bc5af4
Author: Johannes Mockenhaupt g...@jotomo.de
Date: 2014-10-02T22:17:55Z

    TIKA-1435: Upgrade Rome to 1.5

    Adopt new namespace and enjoy generics.

Update rome dependency to 1.5

Key: TIKA-1435
URL: https://issues.apache.org/jira/browse/TIKA-1435
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 1.6
Reporter: Johannes Mockenhaupt
Priority: Minor
Fix For: 1.7

Rome 1.5 has been released to Sonatype (https://github.com/rometools/rome/issues/183), though the website (http://rometools.github.io/rome/) is blissfully ignorant of that. The update is mostly maintenance: adopting slf4j and generics, as well as moving the namespace from _com.sun.syndication_ to _com.rometools_. PR upcoming.
[jira] [Commented] (TIKA-1435) Update rome dependency to 1.5
[ https://issues.apache.org/jira/browse/TIKA-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158724#comment-14158724 ]

ASF GitHub Bot commented on TIKA-1435:

Github user asfgit closed the pull request at:

    https://github.com/apache/tika/pull/16
[jira] [Commented] (TIKA-1354) ForkParser doesn't work in OSGI container
[ https://issues.apache.org/jira/browse/TIKA-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158734#comment-14158734 ]

ASF GitHub Bot commented on TIKA-1354:

Github user asfgit closed the pull request at:

    https://github.com/apache/tika/pull/13
[jira] [Commented] (TIKA-1126) text/html producer for tika-server
[ https://issues.apache.org/jira/browse/TIKA-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158743#comment-14158743 ]

ASF GitHub Bot commented on TIKA-1126:

Github user asfgit closed the pull request at:

    https://github.com/apache/tika/pull/3

text/html producer for tika-server

Key: TIKA-1126
URL: https://issues.apache.org/jira/browse/TIKA-1126
Project: Tika
Issue Type: Improvement
Components: server
Affects Versions: 1.4
Reporter: Ali Mosavian
Priority: Trivial
Fix For: 1.4
Attachments: tika_server_html_output.patch

The /tika resource handler of tika-server can only produce text/plain. This patch adds support for producing text/html.
[jira] [Commented] (TIKA-1369) Date parsing and thread safety in ImageMetadataExtractor
[ https://issues.apache.org/jira/browse/TIKA-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158893#comment-14158893 ] ASF GitHub Bot commented on TIKA-1369: -- Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/15 Date parsing and thread safety in ImageMetadataExtractor Key: TIKA-1369 URL: https://issues.apache.org/jira/browse/TIKA-1369 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.5 Environment: OS X 10.9.4 Java 7_60 Reporter: John Gibson Priority: Critical
The {{ImageMetadataExtractor}} uses a static instance of {{SimpleDateFormat}}. This is not thread safe.
{code:title=ImageMetadataExtractor.java}
static class ExifHandler implements DirectoryHandler {
    private static final SimpleDateFormat DATE_UNSPECIFIED_TZ = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
    ...
    public void handleDateTags(Directory directory, Metadata metadata) throws MetadataException {
        // Date/Time Original overrides value from ExifDirectory.TAG_DATETIME
        Date original = null;
        if (directory.containsTag(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL)) {
            original = directory.getDate(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL);
            // Unless we have GPS time we don't know the time zone so date must be set
            // as ISO 8601 datetime without timezone suffix (no Z or +/-)
            if (original != null) {
                String datetimeNoTimeZone = DATE_UNSPECIFIED_TZ.format(original); // Same time zone as Metadata Extractor uses
                metadata.set(TikaCoreProperties.CREATED, datetimeNoTimeZone);
                metadata.set(Metadata.ORIGINAL_DATE, datetimeNoTimeZone);
            }
        }
        ...
{code}
This is not the first time that SDF has caused problems: TIKA-495, TIKA-864. In the discussion there, the idea of using alternative thread-safe (and faster) formatters from either Joda-Time or Commons Lang was dismissed because they would add too many dependencies. Given that Tika already has a fairly large laundry list of dependencies to parse content, adding one more JAR to make sure things don't break is probably a good idea.
In addition, because no timezone or locale is specified by either Tika's formatter or the call to com.drew.metadata.Directory, it can wreak havoc during randomized testing. Given that the timezone is unknown, why not just default it to UTC and let the caller guess the timezone? As it stands I have to reparse all of the dates into UTC to get stable behavior across timezones. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
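One way to address both points in the report (a shared mutable formatter and an implicit default time zone) is sketched below. This is only an illustration, not the change that was made in Tika: it assumes Java 8 or later for `java.time`, and the class and method names are hypothetical.

```java
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Date;

// Hypothetical helper for illustration only; not part of Tika.
final class ExifDateFormatSketch {
    // DateTimeFormatter is immutable and thread-safe, so one shared instance is fine.
    // The zone is pinned to UTC so the output does not depend on the JVM default time zone.
    private static final DateTimeFormatter DATE_UNSPECIFIED_TZ =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss").withZone(ZoneOffset.UTC);

    /** Formats the date as an ISO 8601 date-time without a zone suffix, interpreted in UTC. */
    static String format(Date date) {
        return DATE_UNSPECIFIED_TZ.format(date.toInstant());
    }

    public static void main(String[] args) {
        System.out.println(format(new Date(0L)));
    }
}
```

Because the zone is fixed, the same `Date` produces the same string on every machine, which is the stable behavior the reporter asks for.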
[jira] [Commented] (TIKA-1369) Date parsing and thread safety in ImageMetadataExtractor
[ https://issues.apache.org/jira/browse/TIKA-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14160234#comment-14160234 ] ASF GitHub Bot commented on TIKA-1369: -- GitHub user vilmospapp opened a pull request: https://github.com/apache/tika/pull/17 TIKA-1369 Avoid ThreadLocal usage from Memory Leak Hi @chrismattmann , Based on our discussion from https://github.com/apache/tika/pull/15 I've added the ThreadLocal clean up part, so theoretically it won't suffer from the scenario that @grossws mentioned. Cheers, Vilmos You can merge this pull request into a Git repository by running: $ git pull https://github.com/vilmospapp/tika TIKA-1369-2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/17.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17 commit f95fad94619946ef1d4fe7cf407deab6317ad2fd Author: Vilmos Papp papp.gyorgy.vil...@gmail.com Date: 2014-10-06T12:10:14Z TIKA-1369 Avoid ThreadLocal usage from Memory Leak Date parsing and thread safety in ImageMetadataExtractor Key: TIKA-1369 URL: https://issues.apache.org/jira/browse/TIKA-1369 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.5 Environment: OS X 10.9.4 Java 7_60 Reporter: John Gibson Assignee: Chris A. Mattmann Priority: Critical Fix For: 1.7 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
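The ThreadLocal-with-cleanup idea mentioned in this pull request can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from pull request #17: the class name, the `format` helper, and the choice of UTC and `Locale.ROOT` are all hypothetical. The `finally` block removes the per-thread formatter so that a pooled thread does not keep holding a reference to it.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper for illustration only; not the code from the pull request.
final class ThreadLocalDateFormatSketch {
    // Each thread lazily gets its own SimpleDateFormat, so no instance is ever shared.
    private static final ThreadLocal<SimpleDateFormat> FORMAT = new ThreadLocal<SimpleDateFormat>() {
        @Override
        protected SimpleDateFormat initialValue() {
            SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss", Locale.ROOT);
            format.setTimeZone(TimeZone.getTimeZone("UTC")); // assumption: pin the zone to UTC
            return format;
        }
    };

    static String format(Date date) {
        try {
            return FORMAT.get().format(date);
        } finally {
            // Drop the thread's formatter so long-lived pool threads do not retain it.
            FORMAT.remove();
        }
    }

    public static void main(String[] args) {
        System.out.println(format(new Date(0L)));
    }
}
```

Removing the value after every call gives up the reuse benefit of a ThreadLocal; real code would more likely clean up once at the end of a parse.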
[jira] [Commented] (TIKA-1354) ForkParser doesn't work in OSGI container
[ https://issues.apache.org/jira/browse/TIKA-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14160340#comment-14160340 ] ASF GitHub Bot commented on TIKA-1354: -- GitHub user hlavki opened a pull request: https://github.com/apache/tika/pull/18 TIKA-1354 Add test method with nonfunctional fork parser There is something wrong with pax commons logging, so ForkParser doesn't work in general. Test method: testForkParserPdf() I suppose that this pull request will never be merged to trunk. You can merge this pull request into a Git repository by running: $ git pull https://github.com/hlavki/tika trunk Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/18.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18 commit ab9d4d432b4ace9877cd1bbc178594ac230d4edd Author: Michal Hlavac hla...@hlavki.eu Date: 2014-10-06T14:21:48Z TIKA-1354 Add test method with nonfunctional fork parser (There is something wrong with pax commons logging) ForkParser doesn't work in OSGI container - Key: TIKA-1354 URL: https://issues.apache.org/jira/browse/TIKA-1354 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.6 Reporter: Michal Hlavac I can't find a way to run ForkParser in an OSGi container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1369) Date parsing and thread safety in ImageMetadataExtractor
[ https://issues.apache.org/jira/browse/TIKA-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14161801#comment-14161801 ] ASF GitHub Bot commented on TIKA-1369: -- Github user vilmospapp closed the pull request at: https://github.com/apache/tika/pull/17 Date parsing and thread safety in ImageMetadataExtractor Key: TIKA-1369 URL: https://issues.apache.org/jira/browse/TIKA-1369 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.5 Environment: OS X 10.9.4 Java 7_60 Reporter: John Gibson Assignee: Chris A. Mattmann Priority: Critical Fix For: 1.7 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1446) CHM parser : wrong decompression of aligned blocks
[ https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181483#comment-14181483 ] ASF GitHub Bot commented on TIKA-1446: -- GitHub user thaichat04 opened a pull request: https://github.com/apache/tika/pull/20 TIKA-1446 TIKA- 1430, TIKA-1446, TIKA-1447, TIKA-1448: CHM Parser improvement You can merge this pull request into a Git repository by running: $ git pull https://github.com/apache/tika 1.6 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/20.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20 commit 58a465391d128c2aa9b11c9f5a986f6bcd28abca Author: Chris Mattmann mattm...@apache.org Date: 2014-07-28T00:45:03Z [maven-release-plugin] copy for tag 1.6 git-svn-id: https://svn.apache.org/repos/asf/tika/tags/1.6@1613865 13f79535-47bb-0310-9956-ffa450edef68 commit c98da37a4b83bdad6aa86ccc6aaec6b0d647c59a Author: David Meikle dmei...@apache.org Date: 2014-07-31T18:29:32Z TIKA-1381 - Added Lingo24Translator implementation git-svn-id: https://svn.apache.org/repos/asf/tika/tags/1.6@1614950 13f79535-47bb-0310-9956-ffa450edef68 commit d831ac12be2fc3303f5dab45b00b53b53b6a67e9 Author: Nick Burch n...@apache.org Date: 2014-08-04T15:41:54Z Create a branch for 1.6, to backport the POI upgrade to git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1615619 13f79535-47bb-0310-9956-ffa450edef68 commit e2d10e633d38c52b0f490a09043fb43176d26fbe Author: Nick Burch n...@apache.org Date: 2014-08-04T15:54:55Z Merge the POI 3.11 beta 1 upgrade from Trunk to the 1.6 branch (TIKA-1380), ready for inclusion in rc2 git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1615636 13f79535-47bb-0310-9956-ffa450edef68 commit a5942c11cd6a3e75304ce0267c1fc4b5e979c66c Author: Tim Allison talli...@apache.org Date: 2014-08-04T16:51:40Z TIKA-1317 extract 
contents from SDTs within cells in tables in XWPF (docx) files git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1615675 13f79535-47bb-0310-9956-ffa450edef68 commit 68f9a11926946bdea29ab757a8275149d8d057e9 Author: Nick Burch n...@apache.org Date: 2014-08-04T21:27:41Z Merge r1615631 from Trunk to 1.6 - Upgrade the Commons Codec version to match that in Apache POI, upgraded in TIKA-1380 git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1615800 13f79535-47bb-0310-9956-ffa450edef68 commit ee988d4daa5b451a51b799b0ec790b88ca7fc111 Author: Tim Allison talli...@apache.org Date: 2014-08-05T13:03:05Z TIKA-1275 upgrade Commons Compress to 1.8.1; updated CHANGES.txt, too git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1615923 13f79535-47bb-0310-9956-ffa450edef68 commit 9d27e1379fba530def45b470a92ce5052078021c Author: Tim Allison talli...@apache.org Date: 2014-08-05T18:17:39Z TIKA-1380; fix for null ole.getLabel() git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1615970 13f79535-47bb-0310-9956-ffa450edef68 commit 2ee02d85aa703e65607a707ee171c166017916ab Author: Nick Burch n...@apache.org Date: 2014-08-20T14:16:06Z Merge r1619108 from Trunk to the 1.6 branch ready for release - Bump the POI dependency to 3.11-beta2, and remove the Geronimo stax one which is no longer required by anything now we are on Java 1.6 TIKA-1380 git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1619109 13f79535-47bb-0310-9956-ffa450edef68 commit a3eac367cd560c20da4231f45eb18d638d4f91a1 Author: Chris Mattmann mattm...@apache.org Date: 2014-08-31T19:36:36Z Bring 1.6 branch up to date with trunk in prep for 1.6 RC #2. 
git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1621623 13f79535-47bb-0310-9956-ffa450edef68 commit dd2a2b5bad7e363c5ab74db69b89b6083f6fc8ff Author: Chris Mattmann mattm...@apache.org Date: 2014-08-31T19:44:11Z [maven-release-plugin] prepare release 1.6-rc2 git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1621627 13f79535-47bb-0310-9956-ffa450edef68 commit 5f9845759fb7839298ac5ee3abb11667035faac3 Author: Chris Mattmann mattm...@apache.org Date: 2014-08-31T19:44:17Z [maven-release-plugin] prepare for next development iteration git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1621629 13f79535-47bb-0310-9956-ffa450edef68 CHM parser : wrong decompression of aligned blocks -- Key: TIKA-1446 URL: https://issues.apache.org/jira/browse/TIKA-1446 Project: Tika Issue Type: Bug Affects Versions: 1.7 Reporter: Bin Hawking Priority:
[jira] [Commented] (TIKA-1446) CHM parser : wrong decompression of aligned blocks
[ https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181518#comment-14181518 ] ASF GitHub Bot commented on TIKA-1446: -- Github user thaichat04 closed the pull request at: https://github.com/apache/tika/pull/20 CHM parser : wrong decompression of aligned blocks -- Key: TIKA-1446 URL: https://issues.apache.org/jira/browse/TIKA-1446 Project: Tika Issue Type: Bug Affects Versions: 1.7 Reporter: Bin Hawking Priority: Critical Attachments: chm.zip If an embedded file contains aligned blocks, the parser outputs garbled or empty text for that file. I have fixed it myself by correcting decompressAlignedBlock() and its preparation methods. The bug is mostly due to misuse of the main tree, align tree and length tree, and some of the trees are built incorrectly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1472) Warning on Tika Server startup - Failed to load class org.slf4j.impl.StaticLoggerBinder
[ https://issues.apache.org/jira/browse/TIKA-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206488#comment-14206488 ] ASF GitHub Bot commented on TIKA-1472: -- GitHub user grossws opened a pull request: https://github.com/apache/tika/pull/22 Added slf4j-jcl impl to tika-server deps. Fixes TIKA-1472. You can merge this pull request into a Git repository by running: $ git pull https://github.com/grossws/tika fix-tika-1472 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/22.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22 commit 892d28051fbd745c33566e1c86d054aa2c11c4cc Author: Konstantin Gribov gros...@gmail.com Date: 2014-11-11T14:33:29Z Added slf4j-jcl impl to tika-server deps. Fixes TIKA-1472. Warning on Tika Server startup - Failed to load class org.slf4j.impl.StaticLoggerBinder - Key: TIKA-1472 URL: https://issues.apache.org/jira/browse/TIKA-1472 Project: Tika Issue Type: Bug Components: server Affects Versions: 1.6 Environment: Windows 8, JDK 1.8, Maven 3.2.3 Reporter: Darya Arbuzova Priority: Minor Attachments: 0001-Added-slf4j-jcl-impl-to-tika-server-deps.patch
Hello! I want to use Apache Tika in server mode. I downloaded {{tika-server-1.6.jar}} from http://mirror.vorboss.net/apache/tika/ When I try to start the server, I get {{SLF4J: Failed to load class org.slf4j.impl.StaticLoggerBinder.}} So I went to the link you direct me to (http://www.slf4j.org/codes.html#StaticLoggerBinder) and downloaded the other slf4j {{jar}} files, but what next? I can't put them on the classpath, since I don't have a project. I can't change the dependencies in {{pom.xml}} for the same reason. What should I do? I tried downloading the whole source code, but couldn't build it using Maven, and I still haven't figured out why.
For the previous discussion, see: https://issues.apache.org/jira/browse/TIKA-1470 Thank you! Best regards, Darya Arbuzova -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1472) Warning on Tika Server startup - Failed to load class org.slf4j.impl.StaticLoggerBinder
[ https://issues.apache.org/jira/browse/TIKA-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207940#comment-14207940 ] ASF GitHub Bot commented on TIKA-1472: -- Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/22 Warning on Tika Server startup - Failed to load class org.slf4j.impl.StaticLoggerBinder - Key: TIKA-1472 URL: https://issues.apache.org/jira/browse/TIKA-1472 Project: Tika Issue Type: Bug Components: server Affects Versions: 1.6 Environment: Windows 8, JDK 1.8, Maven 3.2.3 Reporter: Darya Arbuzova Assignee: Chris A. Mattmann Priority: Minor Fix For: 1.7 Attachments: 0001-Added-slf4j-jcl-impl-to-tika-server-deps.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-936) encoding of ZipArchiveInputStream
[ https://issues.apache.org/jira/browse/TIKA-936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14306931#comment-14306931 ] ASF GitHub Bot commented on TIKA-936: - GitHub user kongxianghe1234 opened a pull request: https://github.com/apache/tika/pull/27 Update RarParser.java Detecting a file whose name contains Chinese characters fails with the current approach. See TIKA-936 for details. You can merge this pull request into a Git repository by running: $ git pull https://github.com/kongxianghe1234/tika patch-1 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/27.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #27 commit 6d3913047980521df07ae10922a6c4233dadbb52 Author: 孔祥和 JavaMiner kong1011437...@gmail.com Date: 2015-02-05T09:52:41Z Update RarParser.java Detecting a file whose name contains Chinese characters fails with the current approach. See TIKA-936 for details. encoding of ZipArchiveInputStream - Key: TIKA-936 URL: https://issues.apache.org/jira/browse/TIKA-936 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.1 Reporter: Shinichiro Abe Assignee: Jukka Zitting Attachments: x-日本語メモ.zip
When extracting from ZIP files that were created on Japanese Windows, the extracted file names are garbled. ZipArchiveInputStream has three constructors. With the modification below the file name was not garbled; I specified the encoding, SJIS.
{code:title=PackageExtractor|borderStyle=solid}
public void parse(InputStream stream)
    :
    //unpack(new ZipArchiveInputStream(stream), xhtml);
    unpack(new ZipArchiveInputStream(stream, "SJIS", true), xhtml);
    :
{code}
The first constructor uses UTF-8. In my case the encoding of my computer is UTF-8 and the encoding of the ZIP file is SJIS, so the file name was garbled. We will get garbled file names whenever the encoding used by this constructor differs from that of the ZIP file.
I want Tika to parse ZIP files with some kind of encoding parameter given per file. Where should I give the encoding: somewhere in Metadata or in ParseContext? Please support this. I am using Tika via Solr (SolrCell), so when posting a ZIP file to Solr I want to add an encoding parameter to the request. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
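The effect described in the report can be reproduced with the JDK alone. The sketch below is an illustration only: it uses `java.util.zip` rather than the Commons Compress `ZipArchiveInputStream` that Tika uses, the class and method names are hypothetical, and it assumes the `Shift_JIS` charset is available in the running JVM. The entry name is written with Unicode escapes and spells the Japanese name from the attachment followed by `.txt`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class; not part of Tika or Commons Compress.
final class ZipNameEncodingSketch {

    /** Builds a one-entry ZIP in memory whose entry name is encoded with the given charset. */
    static byte[] zipWithName(String entryName, Charset nameCharset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes, nameCharset)) {
            out.putNextEntry(new ZipEntry(entryName));
            out.write("memo".getBytes(StandardCharsets.US_ASCII));
            out.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Lists entry names, decoding them with the charset supplied by the caller. */
    static List<String> entryNames(byte[] zipBytes, Charset nameCharset) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zipBytes), nameCharset)) {
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        Charset sjis = Charset.forName("Shift_JIS"); // assumption: charset is installed
        String name = "\u65E5\u672C\u8A9E\u30E1\u30E2.txt";
        byte[] zip = zipWithName(name, sjis);
        // Reading with the charset the archive was written in recovers the name.
        System.out.println(entryNames(zip, sjis).equals(java.util.Collections.singletonList(name)));
    }
}
```

Reading the same bytes with a different charset is what produces the garbled (or rejected) names, which is why the reporter asks for a per-file encoding parameter.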
[jira] [Commented] (TIKA-1354) ForkParser doesn't work in OSGI container
[ https://issues.apache.org/jira/browse/TIKA-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14326497#comment-14326497 ] ASF GitHub Bot commented on TIKA-1354: -- Github user hlavki closed the pull request at: https://github.com/apache/tika/pull/18 ForkParser doesn't work in OSGI container - Key: TIKA-1354 URL: https://issues.apache.org/jira/browse/TIKA-1354 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.6 Reporter: Michal Hlavac I can't find a way to run ForkParser in an OSGi container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1354) ForkParser doesn't work in OSGI container
[ https://issues.apache.org/jira/browse/TIKA-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324823#comment-14324823 ] ASF GitHub Bot commented on TIKA-1354: -- GitHub user hlavki opened a pull request: https://github.com/apache/tika/pull/30 Rollback (TIKA-1354) and update pax-exam to version 4.4.0. All bundle test cases pass. You can merge this pull request into a Git repository by running: $ git pull https://github.com/hlavki/tika trunk Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/30.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #30 commit fe7ee28aa344744b5534cb707db0baeaf843f79e Author: Michal Hlavac hla...@hlavki.eu Date: 2015-02-17T20:20:08Z The ForkParser service removed from Activator commit 01dcbc913638c641434001cb60f3bb3035f996c5 Author: Michal Hlavac hla...@hlavki.eu Date: 2015-02-17T20:20:51Z update pax-exam to 4.4.0 and fix osgi bundle tests ForkParser doesn't work in OSGI container - Key: TIKA-1354 URL: https://issues.apache.org/jira/browse/TIKA-1354 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.6 Reporter: Michal Hlavac I can't find a way to run ForkParser in an OSGi container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1530) MP4Parser parses duration but does not set it
[ https://issues.apache.org/jira/browse/TIKA-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14293120#comment-14293120 ] ASF GitHub Bot commented on TIKA-1530: -- GitHub user owickstrom opened a pull request: https://github.com/apache/tika/pull/25 TIKA-1530: Include parsed mp4 duration in metadata Note that I couldn't get all tests working in the project (https://issues.apache.org/jira/browse/TIKA-1521?focusedCommentId=14288719&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14288719), so I have only run `org.apache.tika.parser.mp4.MP4ParserTest`. Perhaps someone else with a working build could try this? You can merge this pull request into a Git repository by running: $ git pull https://github.com/owickstrom/tika trunk Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/25.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #25 commit 5300d22e6c7f71c353ad84a7cf4534a8efff85da Author: Oskar Wickström oskar.wickst...@live.com Date: 2015-01-27T07:28:00Z TIKA-1530: Include parsed mp4 duration in metadata MP4Parser parses duration but does not set it -- Key: TIKA-1530 URL: https://issues.apache.org/jira/browse/TIKA-1530 Project: Tika Issue Type: Improvement Components: parser Reporter: Oskar Wickström Priority: Minor See the TODO comment at https://github.com/apache/tika/blob/trunk/tika-parsers/src/main/java/org/apache/tika/parser/mp4/MP4Parser.java#L167 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
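For context, an MP4 movie header stores the duration as an integer count of time units together with a timescale (units per second), so converting it to seconds is a single division. The sketch below shows only that arithmetic. The class and method names are hypothetical, and it says nothing about how the pull request writes the value into Tika's Metadata.

```java
// Hypothetical helper for illustration only; not part of Tika's MP4Parser.
final class Mp4DurationSketch {

    /** Converts a movie-header duration, expressed in timescale units, to seconds. */
    static double durationSeconds(long durationUnits, long timescale) {
        if (timescale <= 0) {
            throw new IllegalArgumentException("timescale must be positive");
        }
        return (double) durationUnits / timescale;
    }

    public static void main(String[] args) {
        // 90000 units at 600 units per second
        System.out.println(durationSeconds(90000L, 600L));
    }
}
```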
[jira] [Commented] (TIKA-1530) MP4Parser parses duration but does not set it
[ https://issues.apache.org/jira/browse/TIKA-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14293664#comment-14293664 ] ASF GitHub Bot commented on TIKA-1530: -- Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/25 MP4Parser parses duration but does not set it -- Key: TIKA-1530 URL: https://issues.apache.org/jira/browse/TIKA-1530 Project: Tika Issue Type: Improvement Components: parser Reporter: Oskar Wickström Priority: Minor See the TODO comment at https://github.com/apache/tika/blob/trunk/tika-parsers/src/main/java/org/apache/tika/parser/mp4/MP4Parser.java#L167 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1537) Installation on OSX 10.10.2 generates OutOfMemory Error during parser tests
[ https://issues.apache.org/jira/browse/TIKA-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14300953#comment-14300953 ] ASF GitHub Bot commented on TIKA-1537: -- GitHub user archerrbgh opened a pull request: https://github.com/apache/tika/pull/26 Added argLine value for maven-surefire-plugin Setting a higher maximum amount of memory prevents TestChmExtraction from generating an OutOfMemory error from running out of heap space when running parser tests and trying to install Tika 1.7 on OS X 10.10.2. This is to address issue TIKA-1537. You can merge this pull request into a Git repository by running: $ git pull https://github.com/archerrbgh/tika trunk Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/26.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #26 commit 05340a8ef401a29206b7dff30b97d193347cd8c2 Author: Andrew Hwang archerr...@gmail.com Date: 2015-02-02T07:08:19Z Added argLine value for maven-surefire-plugin Setting a higher maximum amount of memory prevents TestChmExtraction from generating an OutOfMemory error from running out of heap space when running parser tests and trying to install Tika 1.7 on OS X 10.10.2. Installation on OSX 10.10.2 generates OutOfMemory Error during parser tests --- Key: TIKA-1537 URL: https://issues.apache.org/jira/browse/TIKA-1537 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.7 Environment: Mac OSX 10.10.2 Reporter: Andrew Hwang Priority: Minor Labels: easyfix Original Estimate: 0h Remaining Estimate: 0h I was having issues during installation of Tika 1.7 where the build failed when running parser tests (specifically on TestChmExtraction). I had set the MAVEN_OPTS variable to have enough memory (-Xmx2048m), but the build still failed. 
It turned out that the maven-surefire-plugin was creating a new JVM that did not have enough memory specified, causing TestChmExtraction to fail. A fix I found online led me to change the POM for tika-parent (adding an argLine to maven-surefire-plugin, where -Xmx2048m worked). After adding this, the installation was able to finish. I will submit a pull request with the addition. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1365) Incorrectly MimeType detection for Apache Lucene web site
[ https://issues.apache.org/jira/browse/TIKA-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14366176#comment-14366176 ] ASF GitHub Bot commented on TIKA-1365: -- GitHub user mkr opened a pull request: https://github.com/apache/tika/pull/35 TIKA-1365: Lower priority for XML starting with comment TIKA-1365: Lower priority for XML starting with comment, allow HTML starting with comment to be detected as text/html You can merge this pull request into a Git repository by running: $ git pull https://github.com/mkr/tika TIKA-1365 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/35.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #35 commit f9655d44978af188018bee81b2d554770ddcd7f9 Author: Matthias Krueger m...@mkr.io Date: 2015-03-17T21:45:36Z TIKA-1365: Lower priority for XML starting with comment, allow HTML starting with comment to be detected as text/html Incorrectly MimeType detection for Apache Lucene web site - Key: TIKA-1365 URL: https://issues.apache.org/jira/browse/TIKA-1365 Project: Tika Issue Type: Bug Components: detector Affects Versions: 1.5 Reporter: Tien Nguyen Manh Attachments: discussion.html
Tika 1.5 detects many pages from the Apache Lucene web site as XML, for example this page: http://lucene.apache.org/core/discussion.html Here is the error log; parsing failed because the XML parser was used:
Apache Tika was unable to parse the document at http://lucene.apache.org/core/discussion.html. The full exception stack trace is included below:
org.apache.tika.exception.TikaException: XML parse error
 at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:78)
 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
 at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
 at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:320)
 at org.apache.tika.gui.TikaGUI.openURL(TikaGUI.java:293)
 at org.apache.tika.gui.TikaGUI.actionPerformed(TikaGUI.java:247)
 at javax.swing.AbstractButton.fireActionPerformed(AbstractButton.java:2018)
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
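The ambiguity behind this report is that a document beginning with a comment (`<!--`) can be either XML or HTML, so a match on the first bytes alone cannot decide between them. The sketch below is a deliberately simplified illustration of looking past leading comments before deciding. It is not Tika's detector, and all names in it are hypothetical.

```java
import java.util.Locale;

// Hypothetical, simplified sniffer for illustration only; not Tika's detection logic.
final class MarkupSniffSketch {

    /** Skips leading whitespace and comments, then classifies by the first real markup. */
    static String sniff(String head) {
        String rest = skipWhitespace(head);
        while (rest.startsWith("<!--")) {
            int end = rest.indexOf("-->", 4);
            if (end < 0) {
                return "unknown"; // comment not closed within the sampled prefix
            }
            rest = skipWhitespace(rest.substring(end + 3));
        }
        String lower = rest.toLowerCase(Locale.ROOT);
        if (lower.startsWith("<!doctype html") || lower.startsWith("<html")) {
            return "text/html";
        }
        if (lower.startsWith("<?xml")) {
            return "application/xml";
        }
        return "unknown";
    }

    private static String skipWhitespace(String s) {
        int i = 0;
        while (i < s.length() && Character.isWhitespace(s.charAt(i))) {
            i++;
        }
        return s.substring(i);
    }

    public static void main(String[] args) {
        System.out.println(sniff("<!-- licence header -->\n<!DOCTYPE html><html></html>"));
    }
}
```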
[jira] [Commented] (TIKA-1554) Improve EMF file detection
[ https://issues.apache.org/jira/browse/TIKA-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14366038#comment-14366038 ] ASF GitHub Bot commented on TIKA-1554: -- GitHub user mkr opened a pull request: https://github.com/apache/tika/pull/34 TIKA-1554: Adding EMF magic as per Microsoft's EMF specification, thanks to Luis Filipe Nassif TIKA-1554: Adding EMF magic as per Microsoft's EMF specification, thanks to Luis Filipe Nassif You can merge this pull request into a Git repository by running: $ git pull https://github.com/mkr/tika TIKA-1554 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/34.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #34 commit 4608ff50c28b9ba8c2d1caf6fe4530eeb2a088be Author: Matthias Krueger m...@mkr.io Date: 2015-03-17T12:45:06Z TIKA-1554: Adding EMF magic as per Microsoft's EMF specification, thanks to Luis Filipe Nassif Improve EMF file detection -- Key: TIKA-1554 URL: https://issues.apache.org/jira/browse/TIKA-1554 Project: Tika Issue Type: Bug Components: detector Affects Versions: 1.7 Reporter: Luis Filipe Nassif Attachments: nonEmf.dat
I am getting many files being incorrectly detected as application/x-emf. I think the current magic is too common. According to MS documentation (https://msdn.microsoft.com/en-us/library/cc230635.aspx and https://msdn.microsoft.com/en-us/library/dd240211.aspx), it can be improved to:
{code}
<mime-type type="application/x-emf">
  <acronym>EMF</acronym>
  <_comment>Extended Metafile</_comment>
  <glob pattern="*.emf"/>
  <magic priority="50">
    <match value="0x0100" type="string" offset="0">
      <match value=" EMF" type="string" offset="40"/>
    </match>
  </magic>
</mime-type>
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
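The nested match proposed above amounts to two checks: the file begins with an EMF header record, and the ASCII signature ` EMF` sits at byte offset 40. A minimal Java sketch of that logic follows. It is not Tika's detector and the names are hypothetical. The four leading bytes `01 00 00 00` (record type 1, little-endian) are an assumption taken from the Microsoft EMF specification, not from the snippet, where the value appears as `0x0100`.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not Tika's magic-matching code.
final class EmfMagicSketch {
    // Assumption from the EMF specification: the header record type is 1, stored little-endian.
    private static final byte[] HEADER_TYPE = {0x01, 0x00, 0x00, 0x00};
    private static final byte[] SIGNATURE = " EMF".getBytes(StandardCharsets.US_ASCII);
    private static final int SIGNATURE_OFFSET = 40;

    /** Returns true only if both the leading record type and the signature at offset 40 match. */
    static boolean looksLikeEmf(byte[] head) {
        return matchesAt(head, 0, HEADER_TYPE) && matchesAt(head, SIGNATURE_OFFSET, SIGNATURE);
    }

    private static boolean matchesAt(byte[] data, int offset, byte[] expected) {
        if (data.length < offset + expected.length) {
            return false; // not enough bytes sampled to decide
        }
        for (int i = 0; i < expected.length; i++) {
            if (data[offset + i] != expected[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] head = new byte[44];
        head[0] = 1;
        System.arraycopy(SIGNATURE, 0, head, SIGNATURE_OFFSET, SIGNATURE.length);
        System.out.println(looksLikeEmf(head));
    }
}
```

Requiring both conditions is what makes the magic less "common": a file that merely starts with the header bytes no longer matches.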
[jira] [Commented] (TIKA-1554) Improve EMF file detection
[ https://issues.apache.org/jira/browse/TIKA-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14368568#comment-14368568 ] ASF GitHub Bot commented on TIKA-1554: -- Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/34 Improve EMF file detection -- Key: TIKA-1554 URL: https://issues.apache.org/jira/browse/TIKA-1554 Project: Tika Issue Type: Bug Components: detector Affects Versions: 1.7 Reporter: Luis Filipe Nassif Assignee: Chris A. Mattmann Attachments: nonEmf.dat -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1365) Incorrectly MimeType detection for Apache Lucene web site
[ https://issues.apache.org/jira/browse/TIKA-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14368559#comment-14368559 ] ASF GitHub Bot commented on TIKA-1365: -- Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/35 Incorrectly MimeType detection for Apache Lucene web site - Key: TIKA-1365 URL: https://issues.apache.org/jira/browse/TIKA-1365 Project: Tika Issue Type: Bug Components: detector Affects Versions: 1.5 Reporter: Tien Nguyen Manh Assignee: Chris A. Mattmann Attachments: discussion.html Tika 1.5 detects many pages from the Apache Lucene web site as XML, for example this page: http://lucene.apache.org/core/discussion.html Here is the error log; parsing failed because the XML parser was used: Apache Tika was unable to parse the document at http://lucene.apache.org/core/discussion.html. The full exception stack trace is included below: org.apache.tika.exception.TikaException: XML parse error at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:78) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:320) at org.apache.tika.gui.TikaGUI.openURL(TikaGUI.java:293) at org.apache.tika.gui.TikaGUI.actionPerformed(TikaGUI.java:247) at javax.swing.AbstractButton.fireActionPerformed(AbstractButton.java:2018) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1567) WelcomeResource in TikaServer doesn't print PathParam prefix
[ https://issues.apache.org/jira/browse/TIKA-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14351930#comment-14351930 ] ASF GitHub Bot commented on TIKA-1567: -- Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/33 WelcomeResource in TikaServer doesn't print PathParam prefix Key: TIKA-1567 URL: https://issues.apache.org/jira/browse/TIKA-1567 Project: Tika Issue Type: Bug Components: server Affects Versions: 1.7 Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.8 Ben Zaiten found out that while looking at the WelcomeResource for Tika server, things like meta/form are shown as metaform not properly delineating the PathParams. I tracked this down to WelcomeResource easy fix coming. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1567) WelcomeResource in TikaServer doesn't print PathParam prefix
[ https://issues.apache.org/jira/browse/TIKA-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14351928#comment-14351928 ] ASF GitHub Bot commented on TIKA-1567: -- GitHub user chrismattmann opened a pull request: https://github.com/apache/tika/pull/33 Fix for TIKA-1567 WelcomeResource in TikaServer doesn't print PathParam prefix You can merge this pull request into a Git repository by running: $ git pull https://github.com/chrismattmann/tika TIKA-1567 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/33.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #33 commit c3c97ce987ad1da55a0b4d1d016d67d06341467a Author: Chris Mattmann mattm...@apache.org Date: 2015-03-08T06:08:48Z fix for TIKA-1567 WelcomeResource in TikaServer doesn't print PathParam prefix. WelcomeResource in TikaServer doesn't print PathParam prefix Key: TIKA-1567 URL: https://issues.apache.org/jira/browse/TIKA-1567 Project: Tika Issue Type: Bug Components: server Affects Versions: 1.7 Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.8 Ben Zaiten found out that while looking at the WelcomeResource for Tika server, things like meta/form are shown as metaform not properly delineating the PathParams. I tracked this down to WelcomeResource easy fix coming. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1589) Mp3 parser does not add duration to metadata if there are no ID3 tags
[ https://issues.apache.org/jira/browse/TIKA-1589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14388478#comment-14388478 ] ASF GitHub Bot commented on TIKA-1589: -- GitHub user mdaniline opened a pull request: https://github.com/apache/tika/pull/38 fix for TIKA-1589 contributed by mdaniline https://issues.apache.org/jira/browse/TIKA-1589 You can merge this pull request into a Git repository by running: $ git pull https://github.com/mdaniline/tika TIKA-1589 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/38.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #38 commit fb29412710ea058f89d3c6df5078587768dcac74 Author: Max Daniline maxim.danil...@softwire.com Date: 2015-03-31T12:49:43Z fix for TIKA-1589 contributed by mdaniline Mp3 parser does not add duration to metadata if there are no ID3 tags - Key: TIKA-1589 URL: https://issues.apache.org/jira/browse/TIKA-1589 Project: Tika Issue Type: Bug Reporter: Max Daniline Steps to reproduce: * Have a file without any ID3 tags (v1 or v2) * Parse the file * Attempt to retrieve the duration by calling 'metadata.get(XMPDM.DURATION)'. Expected result: The duration should be set even for a file without ID3 tags, since it is independent information. Actual result: The duration is null -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1589) Mp3 parser does not add duration to metadata if there are no ID3 tags
[ https://issues.apache.org/jira/browse/TIKA-1589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14388504#comment-14388504 ] ASF GitHub Bot commented on TIKA-1589: -- Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/38 Mp3 parser does not add duration to metadata if there are no ID3 tags - Key: TIKA-1589 URL: https://issues.apache.org/jira/browse/TIKA-1589 Project: Tika Issue Type: Bug Reporter: Max Daniline Steps to reproduce: * Have a file without any ID3 tags (v1 or v2) * Parse the file * Attempt to retrieve the duration by calling 'metadata.get(XMPDM.DURATION)'. Expected result: The duration should be set even for a file without ID3 tags, since it is independent information. Actual result: The duration is null -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1558) Create a Parser Blacklist
[ https://issues.apache.org/jira/browse/TIKA-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14389464#comment-14389464 ] ASF GitHub Bot commented on TIKA-1558: -- Github user tpalsulich closed the pull request at: https://github.com/apache/tika/pull/39 Create a Parser Blacklist - Key: TIKA-1558 URL: https://issues.apache.org/jira/browse/TIKA-1558 Project: Tika Issue Type: New Feature Reporter: Tyler Palsulich Assignee: Tyler Palsulich Fix For: 1.8 As talked about in TIKA-1555 and TIKA-1557, it would be nice to be able to disable Parsers without pulling their dependencies out. In some cases (e.g. disable all ExternalParsers), there may not be an easy way to exclude the dependencies via Maven. So, an initial design would be to include another file like {{META-INF/services/org.apache.tika.parser.Parser.blacklist}}. We create a new method {{ServiceLoader#loadServiceProviderBlacklist}}. Then, in {{ServiceLoader#loadServiceProviders}}, we remove all elements of the list that are assignable to an element in {{ServiceLoader#loadServiceProviderBlacklist}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
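The filtering step described in TIKA-1558 (drop every loaded provider that is assignable to a blacklisted class) can be sketched with plain collections. This is only an illustration of the idea, not the ServiceLoader change itself; the Parser interface and the three parser classes below are hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BlacklistSketch {
    // Hypothetical stand-ins for real parser types.
    interface Parser {}
    static class ExternalParser implements Parser {}
    static class SubExternalParser extends ExternalParser {}
    static class TextParser implements Parser {}

    // Returns a copy of 'providers' without any element whose class is
    // the same as, or a subclass of, a blacklisted class.
    static <T> List<T> withoutBlacklisted(List<T> providers, List<Class<?>> blacklist) {
        List<T> kept = new ArrayList<>(providers);
        kept.removeIf(p -> blacklist.stream()
                .anyMatch(b -> b.isAssignableFrom(p.getClass())));
        return kept;
    }

    public static void main(String[] args) {
        List<Parser> loaded = Arrays.asList(
                new ExternalParser(), new SubExternalParser(), new TextParser());
        List<Class<?>> blacklist = Arrays.<Class<?>>asList(ExternalParser.class);

        List<Parser> kept = withoutBlacklisted(loaded, blacklist);
        // Only the TextParser instance remains.
        System.out.println(kept.size());
    }
}
```

Using isAssignableFrom rather than an exact class comparison is what lets a single blacklist entry disable a whole family of parsers, such as all ExternalParsers.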
[jira] [Commented] (TIKA-1586) Enable CORS on Tika Server
[ https://issues.apache.org/jira/browse/TIKA-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385398#comment-14385398 ] ASF GitHub Bot commented on TIKA-1586: -- Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/37 Enable CORS on Tika Server -- Key: TIKA-1586 URL: https://issues.apache.org/jira/browse/TIKA-1586 Project: Tika Issue Type: New Feature Components: server Reporter: Tyler Palsulich Assignee: Tyler Palsulich Tika Server should allow configuration of CORS requests (for uses like TIKA-1585). See [this example|http://cxf.apache.org/docs/jax-rs-cors.html] from CXF for how to add it. The only change from that site is that we will need to add a {{CrossOriginResourceSharingFilter}} as a provider. Ideally, this is configurable (limit which resources have CORS, and which origins are allowed). But, I'm not thinking of any general methods of how to do that... -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1586) Enable CORS on Tika Server
[ https://issues.apache.org/jira/browse/TIKA-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385369#comment-14385369 ] ASF GitHub Bot commented on TIKA-1586: -- GitHub user tpalsulich opened a pull request: https://github.com/apache/tika/pull/37 TIKA-1586. Enable CORS requests on Tika server You can merge this pull request into a Git repository by running: $ git pull https://github.com/tpalsulich/tika TIKA-1586 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/37.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #37 commit c296a0459477f1ba088d96fa6ba3895e3a6b3ac5 Author: Tyler Palsulich tpalsul...@gmail.com Date: 2015-03-28T15:45:45Z TIKA-1586. Enable CORS requests on Tika server. Enable CORS on Tika Server -- Key: TIKA-1586 URL: https://issues.apache.org/jira/browse/TIKA-1586 Project: Tika Issue Type: New Feature Components: server Reporter: Tyler Palsulich Assignee: Tyler Palsulich Tika Server should allow configuration of CORS requests (for uses like TIKA-1585). See [this example|http://cxf.apache.org/docs/jax-rs-cors.html] from CXF for how to add it. The only change from that site is that we will need to add a {{CrossOriginResourceSharingFilter}} as a provider. Ideally, this is configurable (limit which resources have CORS, and which origins are allowed). But, I'm not thinking of any general methods of how to do that... -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1561) GCMD Directory Interchange Format (.dif) identification
[ https://issues.apache.org/jira/browse/TIKA-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14341673#comment-14341673 ] ASF GitHub Bot commented on TIKA-1561: -- Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/32 GCMD Directory Interchange Format (.dif) identification --- Key: TIKA-1561 URL: https://issues.apache.org/jira/browse/TIKA-1561 Project: Tika Issue Type: Improvement Components: mime Affects Versions: 1.7 Reporter: Luke sh Assignee: Chris A. Mattmann Priority: Trivial Fix For: 1.8 Attachments: carbon_isotopic_values_of_alkanes_extracted_from_paleosols.dif cited from the http://gcmd.nasa.gov/add/difguide/WRITEADIF.pdf The Directory Interchange Format (DIF) is metadata format used to create directory entries that describe scientific data sets. A DIF holds a collection of fields, which detail specific information about the data. The .dif file respect proper xml format that describe the scientific data set, the schema xsd files can be found inside the .dif xml file. i,e, http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/dif_v9.8.4.xsd The reason opening this ticket is tika parser for this dif file is being under consideration with development, the support to identify the type of xml file is needed. Although dif file in this case seems to be an proper xml file which can be parsed by xmlparser, still it might need a specific process on some of the fields to be extracted and injected into the Solr System for analysis. Then it is proposed that the following type 'text/dif+xml' is appended and used in the tika-mimetypes.xml to be able to support the specific xml type detection which extends the application/xml, so that some special process can be applied to this particular xml file. 
{code}
<mime-type type="text/dif+xml">
  <root-XML localName="DIF"/>
  <root-XML localName="DIF" namespaceURI="http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/"/>
  <glob pattern="*.dif"/>
  <sub-class-of type="application/xml"/>
</mime-type>
{code}
Expected MIME type: text/dif+xml The following is the link to the dif format guide http://gcmd.nasa.gov/add/difguide/ example dif files: 1) https://www.aoncadis.org/dataset/id/005f3222-7548-11e2-851e-00c0f03d5b7c.dif 2) https://www.aoncadis.org/dataset/id/0091cf0c-7ad3-11e2-851e-00c0f03d5b7c.dif 3) https://www.aoncadis.org/dataset/id/02a6301c-3ab3-11e4-8ee7-00c0f03d5b7c.dif an example dif file has also been attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
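The root-XML rules in TIKA-1561 say, in effect, "this is DIF if the document's root element is named DIF". A minimal sketch of that check using the JDK's StAX reader follows. It is not Tika's own root-element detection; the class and method names are hypothetical, and treating "no namespace" and the GCMD namespace as the only accepted cases is a simplification assumed for this example.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class DifRootSketch {
    static final String DIF_NS = "http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/";

    // True if the first element of the document is <DIF>, either without a
    // namespace or in the GCMD DIF namespace. Malformed XML yields false.
    static boolean hasDifRoot(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // DTDs are not needed to read the root element; keep them off.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        String ns = reader.getNamespaceURI();
                        boolean nsOk = ns == null || ns.isEmpty() || DIF_NS.equals(ns);
                        return nsOk && "DIF".equals(reader.getLocalName());
                    }
                }
                return false;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String dif = "<DIF xmlns=\"" + DIF_NS + "\"><Entry_ID>x</Entry_ID></DIF>";
        System.out.println(hasDifRoot(dif));
        System.out.println(hasDifRoot("<html><DIF/></html>"));
    }
}
```

Only the first start element is inspected, so a DIF element nested inside some other root does not count.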
[jira] [Commented] (TIKA-1354) ForkParser doesn't work in OSGI container
[ https://issues.apache.org/jira/browse/TIKA-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332464#comment-14332464 ] ASF GitHub Bot commented on TIKA-1354: -- Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/30 ForkParser doesn't work in OSGI container - Key: TIKA-1354 URL: https://issues.apache.org/jira/browse/TIKA-1354 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.6 Reporter: Michal Hlavac I can't find way to run ForkParser in OSGI container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1561) GCMD Directory Interchange Format (.dif) identification
[ https://issues.apache.org/jira/browse/TIKA-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14338105#comment-14338105 ] ASF GitHub Bot commented on TIKA-1561: -- GitHub user LukeLiush opened a pull request: https://github.com/apache/tika/pull/32 add mime detection with dif(TIKA-1561) support You can merge this pull request into a Git repository by running: $ git pull https://github.com/LukeLiush/tika difmime Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/32.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #32 commit eb920b55f4debe0f1152dcb17bd17ff4f09893f9 Author: LukeLiush hanson311...@gmail.com Date: 2015-02-26T08:53:50Z add mime detection with dif(TIKA-1561) support GCMD Directory Interchange Format (.dif) identification --- Key: TIKA-1561 URL: https://issues.apache.org/jira/browse/TIKA-1561 Project: Tika Issue Type: Improvement Components: mime Affects Versions: 1.7 Reporter: Luke sh Assignee: Chris A. Mattmann Priority: Trivial Attachments: carbon_isotopic_values_of_alkanes_extracted_from_paleosols.dif cited from the http://gcmd.nasa.gov/add/difguide/WRITEADIF.pdf The Directory Interchange Format (DIF) is metadata format used to create directory entries that describe scientific data sets. A DIF holds a collection of fields, which detail specific information about the data. The .dif file respect proper xml format that describe the scientific data set, the schema xsd files can be found inside the .dif xml file. i,e, http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/dif_v9.8.4.xsd The reason opening this ticket is tika parser for this dif file is being under consideration with development, the support to identify the type of xml file is needed. 
Although the DIF file in this case appears to be a proper XML file that can be parsed by the XML parser, some of its fields may still need specific processing before they are extracted and injected into Solr for analysis. It is therefore proposed that the type 'text/dif+xml' be added to tika-mimetypes.xml as a subtype of application/xml, so that this specific XML type can be detected and special processing applied to it.
{code}
<mime-type type="text/dif+xml">
  <root-XML localName="DIF"/>
  <root-XML localName="DIF" namespaceURI="http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/"/>
  <glob pattern="*.dif"/>
  <sub-class-of type="application/xml"/>
</mime-type>
{code}
Expected MIME type: text/dif+xml The DIF format guide is at http://gcmd.nasa.gov/add/difguide/ Example DIF files: 1) https://www.aoncadis.org/dataset/id/005f3222-7548-11e2-851e-00c0f03d5b7c.dif 2) https://www.aoncadis.org/dataset/id/0091cf0c-7ad3-11e2-851e-00c0f03d5b7c.dif 3) https://www.aoncadis.org/dataset/id/02a6301c-3ab3-11e4-8ee7-00c0f03d5b7c.dif An example DIF file has also been attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1582) Mime Detection based on neural networks with Byte-frequency-histogram
[ https://issues.apache.org/jira/browse/TIKA-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385177#comment-14385177 ] ASF GitHub Bot commented on TIKA-1582: -- GitHub user LukeLiush opened a pull request: https://github.com/apache/tika/pull/36 Nn branch https://issues.apache.org/jira/browse/TIKA-1582 You can merge this pull request into a Git repository by running: $ git pull https://github.com/LukeLiush/tika nnBranch Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/36.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #36 commit eb04f13260bfb5e4f4b0bf7fd54ecd085995cb92 Author: LukeLiush hanson311...@gmail.com Date: 2015-03-28T07:12:06Z https://issues.apache.org/jira/browse/TIKA-1582 commit acaf27bb666fdef05bdb18d7edcaafe7ccfd9bf5 Author: LukeLiush hanson311...@gmail.com Date: 2015-03-28T07:16:07Z move the comments of apache licence to the top commit 701fcc394ed2110e4c771fbb84999dca77932392 Author: LukeLiush hanson311...@gmail.com Date: 2015-03-28T07:19:43Z add some comments commit 12f290826a88cd99bbf2e1a0385b315e73e3 Author: LukeLiush hanson311...@gmail.com Date: 2015-03-28T07:25:55Z move the example model file to the test resource directory commit 6c8d2e523c427380438f24d90985e28bfdbce050 Author: LukeLiush hanson311...@gmail.com Date: 2015-03-28T07:28:25Z remove empty comment block Mime Detection based on neural networks with Byte-frequency-histogram -- Key: TIKA-1582 URL: https://issues.apache.org/jira/browse/TIKA-1582 Project: Tika Issue Type: Improvement Components: detector, mime Affects Versions: 1.7 Reporter: Luke sh Priority: Trivial Content-based mime type detection is one of the popular approaches to detect mime type, there are others based on file extension and magic numbers ; And currently Tika has implemented 3 approaches in detecting mime types; They are : 1) 
file extensions 2) magic numbers (the most trustworthy in Tika) 3) content-type (the header in the HTTP response, if present and available). Content-based mime type detection, however, analyses the distribution of the entire stream of bytes, finds a similar pattern for files of the same type, and builds a function that groups them into one or several classes so as to classify and predict. It is believed this feature might broaden the usage of Tika with a bit more security enforcement for mime type detection. Because we want to build a model that is etched with the patterns it has seen, in some situations we may not trust those types which have not been trained/learned by the model. In some situations, the magic numbers embedded in a file can be copied while the actual content is a potentially harmful Trojan program. By enforcing trust in byte frequency patterns, we are able to enhance the security of the detection. The proposed content-based mime detection to be integrated into Tika is based on a machine learning algorithm, i.e. a neural network with back-propagation. The input: 256 bins (0-255), each of which represents a byte value and stores the count of occurrences of that byte. The byte frequency histograms are normalized to fall in the range between 0 and 1, and are then passed to a companding function to enhance the infrequent bytes. The output of the neural network is a binary decision, 1 or 0. Note that the proposed feature will be implemented with the GRB file type as one example. In this example, we build a model that is able to distinguish the GRB file type from non-GRB file types. Note that the set of non-GRB files is huge and cannot be easily defined, so there need to be as many negative training examples as possible to form the decision boundary for non-GRB types. The neural network is used in two stages: training and classification. 
The training can be done in any programming language. In this feature/research, the training of the neural network is implemented in R, and the source can be found in my GitHub repository, i.e. https://github.com/LukeLiush/filetypeDetection; I am also going to post a document that describes the use of the program and the syntax/format of the input and output. After training, we need to export the model and import it into Tika. In Tika, we create a TrainedModelDetector that reads this model file with one or more model parameters, or several model files, so it can detect the mime types covered by those models. Details of the research and usage of this proposed feature will be posted on my GitHub shortly.
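The input features described in TIKA-1582 (256 byte-count bins, normalized to the range 0 to 1, then companded) can be sketched as follows. The issue does not say how the histogram is normalized or which companding function is used, so both choices here are assumptions made for illustration: counts are divided by the largest count, and a square root is used as the companding function. The class and method names are hypothetical.

```java
class ByteHistogramSketch {
    // Builds a 256-element feature vector from raw bytes.
    static double[] features(byte[] data) {
        // One bin per possible byte value (0-255).
        long[] counts = new long[256];
        for (byte b : data) {
            counts[b & 0xFF]++; // mask to treat the byte as unsigned
        }

        long max = 0;
        for (long c : counts) {
            max = Math.max(max, c);
        }

        double[] out = new double[256];
        if (max == 0) {
            return out; // empty input: all-zero features
        }
        for (int i = 0; i < 256; i++) {
            // Assumption: normalize by the most frequent byte, giving [0, 1].
            double normalized = (double) counts[i] / max;
            // Assumption: square-root companding, which lifts small values
            // more than large ones so infrequent bytes stay visible.
            out[i] = Math.sqrt(normalized);
        }
        return out;
    }

    public static void main(String[] args) {
        double[] f = features(new byte[] {'a', 'a', 'a', 'b'});
        System.out.println(f['a']);
        System.out.println(f['b']);
    }
}
```

A vector like this would be the input to the trained network; the network itself and its weights are outside the scope of this sketch.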
[jira] [Commented] (TIKA-1517) MIME type selection with probability
[ https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14490879#comment-14490879 ] ASF GitHub Bot commented on TIKA-1517: -- GitHub user LukeLiush opened a pull request: https://github.com/apache/tika/pull/41 add probabilistic mime selection The probabilistic mime selection detector has been re-implemented. the Bayesian probabilistic selection has been improved by adding more freedom and flexibilities that allow users to specify their own prior and conditional probabilities; the concrete details and basic idea are illustrated in the Tika-1571 and the implementation still follow the main description posted in the https://issues.apache.org/jira/browse/TIKA-1517 but with a minor change on the prior probability which was not modifiable in the original design. The https://issues.apache.org/jira/browse/TIKA-1535 is resolved according to Prof Mattmann's suggestion. You can merge this pull request into a Git repository by running: $ git pull https://github.com/LukeLiush/tika mimeDetection Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/41.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #41 commit fd41a855f794876b43f97327bd77190099c1e554 Author: LukeLiush hanson311...@gmail.com Date: 2015-04-11T08:57:12Z add probabilistic mime selection MIME type selection with probability Key: TIKA-1517 URL: https://issues.apache.org/jira/browse/TIKA-1517 Project: Tika Issue Type: Improvement Components: mime Affects Versions: 0.1-incubating, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.10, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6 Reporter: Luke sh Priority: Trivial Attachments: BaysianTest.java Improvement and intuition The original implementation for MIME type selection/detection is a bit less flexible by initial design, as it heavily relies on the outcome produced by 
magic-bytes MIME type identification; e.g. if magic bytes are applicable to a file, Tika will follow the file type detected by magic bytes. It may be better to provide more control over the method of choice. The proposed approach incorporates Bayes' theorem: users assign a weight, expressed as a probability, to each MIME type identification method available in Tika, and so control which methods are preferred. There are currently 3 methods for identifying the MIME type in Tika (i.e. magic bytes, file extension and metadata content-type hint). By weighting the methods, users can choose which one they trust most; the magic-bytes method is often trustworthy, though. The benefit is that in some situations file type identification must be strict: some users might want all of the MIME type identification methods to agree on the same file type before they start processing a file, because incorrect file type identification is less tolerable there. The current implementation is less flexible for this purpose and relies heavily on the magic-bytes method (although magic bytes are the most reliable of the 3). Proposed design: The idea of selection is to incorporate probabilities as weights on each MIME type identification method currently implemented in Tika (the magic-bytes approach, file extension match and metadata content-type hint). For example, as a user, I would probably like to assign a preference to each method based on how much I trust it, and order the results if they don't coincide. Bayes' rule seems appropriate here to match the intuition. The following is needed for an implementation of Bayes' rule. Prior probability P(file_type) e.g.
P(pdf). In theory this is computed from samples, and it depends on the domain or use case. Intuitively, we care more about the ordering of the weights or probabilities of the results than about the actual numbers. The prior also depends on the samples for a particular use case or domain: e.g. if we happen to crawl a website that contains mostly PDF files, we can collect some samples and compute the prior; if the samples show that 90% of the docs are PDF, the prior is P(pdf) = 0.9. Here we propose to make the prior a configurable parameter for users, and by default we leave the prior not applicable. Alternatively, we can define the prior for each file type to be 1/[number of supported file types in Tika] I think the
[jira] [Commented] (TIKA-1614) Geo Topic Parser
[ https://issues.apache.org/jira/browse/TIKA-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14508325#comment-14508325 ] ASF GitHub Bot commented on TIKA-1614: -- GitHub user AranyaLi opened a pull request: https://github.com/apache/tika/pull/43 TIKA-1614 Geo Topic Parser You can merge this pull request into a Git repository by running: $ git pull https://github.com/AranyaLi/tika trunk Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/43.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #43 commit c6c402451c7133769aa94cdb6bbe075687e7c519 Author: aranyali aranyaelen...@gmail.com Date: 2015-04-23T02:23:18Z add geo topic parser commit 51ef0cde472f3b7bd1e38318a1036af596135689 Author: aranyali aranyaelen...@gmail.com Date: 2015-04-23T02:23:52Z delete ~ Geo Topic Parser Key: TIKA-1614 URL: https://issues.apache.org/jira/browse/TIKA-1614 Project: Tika Issue Type: New Feature Components: parser Reporter: Anya Yun Li ##Description This program aims to identify geonames in any unstructured text data for the NSF polar research project. https://github.com/NSF-Polar-Cyberinfrastructure/datavis-hackathon/issues/1 This project is a content-based geotagging solution, made of a variety of NLP tools, and could be used for any geotagging purpose. ##Workflow 1. Plain text input is passed to geoparser 2. Location names are extracted from the text using OpenNLP NER 3. The extracted names are given two roles: * The most frequent location name is chosen as the best match for the input text * Other extracted locations are treated as alternatives (equal) 4.
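For each location extracted above, search for the best GeoName object and return the resolved objects with fields (name in gazetteer, longitude, latitude) ##How to Use *Cautions*: This program requires at least 1.2 GB disk space for building the Lucene index
```Java
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
// GeoParserConfig comes from this pull request; its import is left out
// because the package is not given in this message.

public class GeoParserExample {
    // gazetteerPath and nerPath are assumed here to be plain String paths.
    static void printGeoMetadata(Parser geoparser, InputStream stream,
            String gazetteerPath, String nerPath) throws Exception {
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();
        // Point the parser at the gazetteer index and the NER model.
        GeoParserConfig config = new GeoParserConfig();
        config.setGazetterPath(gazetteerPath);
        config.setNERModelPath(nerPath);
        context.set(GeoParserConfig.class, config);
        geoparser.parse(stream, new BodyContentHandler(), metadata, context);
        // Print every metadata field the parser produced.
        for (String name : metadata.names()) {
            String value = metadata.get(name);
            System.out.println(name + " " + value);
        }
    }
}
```
This parser adds geographical information to Tika's Metadata object. Fields for the best matched location:
```
Geographic_NAME
Geographic_LONGTITUDE
Geographic_LATITUDE
```
Fields for alternatives:
```
Geographic_NAME1
Geographic_LONGTITUDE1
Geographic_LATITUDE1
Geographic_NAME2
Geographic_LONGTITUDE2
Geographic_LATITUDE2
...
```
If you have any questions, contact me: anyayu...@gmail.com -- This message was sent by Atlassian JIRA (v6.3.4#6332)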
[jira] [Commented] (TIKA-1532) DIF Parser
[ https://issues.apache.org/jira/browse/TIKA-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518805#comment-14518805 ] ASF GitHub Bot commented on TIKA-1532: -- GitHub user HyperDunk opened a pull request: https://github.com/apache/tika/pull/46 TIKA-1532: DIF Parser and change mime-type from text/dif+xml to application/dif+xml Implementation of DIFParser (GCDM Directory Interchange Format) with unit test and change to mime-type as per discussion in TIKA-1532 You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyperDunk/tika trunk Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/46.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #46 commit 5f751a67266ea3c54f9e1a3e18c594e4c215cadc Author: HyperDunk aakarsh@gmail.com Date: 2015-04-29T06:10:41Z TIKA-1532: DIF Parser and change mime-type from text/dif+xml to application/dif+xml DIF Parser -- Key: TIKA-1532 URL: https://issues.apache.org/jira/browse/TIKA-1532 Project: Tika Issue Type: New Feature Components: parser Reporter: Aakarsh Medleri Hire Math Labels: memex MIME Type detection content parser for .dif format -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1532) DIF Parser
[ https://issues.apache.org/jira/browse/TIKA-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520969#comment-14520969 ] ASF GitHub Bot commented on TIKA-1532: -- Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/46 DIF Parser -- Key: TIKA-1532 URL: https://issues.apache.org/jira/browse/TIKA-1532 Project: Tika Issue Type: New Feature Components: parser Reporter: Aakarsh Medleri Hire Math Assignee: Chris A. Mattmann Labels: memex Fix For: 1.9 MIME Type detection content parser for .dif format -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1535) Inheritance modification for the class MIMETypes
[ https://issues.apache.org/jira/browse/TIKA-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14514946#comment-14514946 ] ASF GitHub Bot commented on TIKA-1535: -- GitHub user LukeLiush opened a pull request: https://github.com/apache/tika/pull/45 https://issues.apache.org/jira/browse/TIKA-1535 Inheritance modification... ... for the class MIMETypes You can merge this pull request into a Git repository by running: $ git pull https://github.com/LukeLiush/tika TIKA-1535 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/45.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #45 commit 66cdb6e30dd1654702e08723bb0f607fb4fc9def Author: LukeLiush hanson311...@gmail.com Date: 2015-04-27T20:41:43Z https://issues.apache.org/jira/browse/TIKA-1535 Inheritance modification for the class MIMETypes Inheritance modification for the class MIMETypes Key: TIKA-1535 URL: https://issues.apache.org/jira/browse/TIKA-1535 Project: Tika Issue Type: Improvement Components: mime Reporter: Luke sh Assignee: Chris A. Mattmann Priority: Trivial The Class MIMETypes does not currently allow for inheritance. There are a couple of methods in this class which looks independent, and some of which needs to be exposed or overwritten for special needs or use cases, this will enable tika users with more flexibility for new mime detection algorithm. Perhaps it may be a good idea to extract out the detector logic from the MimeTypes class, and create an independent detector for Tika. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1535) Inheritance modification for the class MIMETypes
[ https://issues.apache.org/jira/browse/TIKA-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14522082#comment-14522082 ] ASF GitHub Bot commented on TIKA-1535: -- Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/45 Inheritance modification for the class MIMETypes Key: TIKA-1535 URL: https://issues.apache.org/jira/browse/TIKA-1535 Project: Tika Issue Type: Improvement Components: mime Reporter: Luke sh Assignee: Chris A. Mattmann Priority: Trivial Labels: memex Fix For: 1.9 The MimeTypes class does not currently allow for inheritance. Several methods in this class look independent, and some of them need to be exposed or overridden for special needs or use cases; this would give Tika users more flexibility to plug in new MIME detection algorithms. It may be a good idea to extract the detector logic from the MimeTypes class and create an independent detector for Tika. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1582) Mime Detection based on neural networks with Byte-frequency-histogram
[ https://issues.apache.org/jira/browse/TIKA-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14525083#comment-14525083 ] ASF GitHub Bot commented on TIKA-1582: -- Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/36 Mime Detection based on neural networks with Byte-frequency-histogram -- Key: TIKA-1582 URL: https://issues.apache.org/jira/browse/TIKA-1582 Project: Tika Issue Type: Improvement Components: detector, mime Affects Versions: 1.7 Reporter: Luke sh Assignee: Chris A. Mattmann Priority: Trivial Labels: memex Fix For: 1.9 Attachments: nnmodel.docx, week2-report-histogram comparison.docx, week6 report.docx Content-based MIME type detection is one of the popular approaches to detecting MIME types; others are based on file extensions and magic numbers. Tika currently implements 3 approaches to detecting MIME types: 1) file extensions 2) magic numbers (the most trustworthy in Tika) 3) content-type (the header in the HTTP response, if present and available). Content-based MIME type detection, however, analyses the distribution of the entire stream of bytes, finds a similar pattern for files of the same type, and builds a function that is able to group them into one or several classes so as to classify and predict. It is believed this feature might broaden the usage of Tika with a bit more security enforcement for MIME type detection. Because we want to build a model that is etched with the patterns it has seen, in some situations we may not trust those types which the model has not been trained on. In some situations, the magic numbers embedded in a file can be copied while the actual content is a potentially harmful Trojan program. By placing trust in byte frequency patterns, we are able to enhance the security of the detection. The proposed content-based MIME detection to be integrated into Tika is based on a machine learning algorithm, i.e. 
neural network with back-propagation. The input: 256 bins (byte values 0-255), each of which represents a byte value and stores the count of occurrences of that byte; the byte frequency histograms are normalized to fall in the range between 0 and 1 and are then passed to a companding function to enhance the infrequent bytes. The output of the neural network is a binary decision, 1 or 0. Note that the proposed feature will be implemented with the GRB file type as one example. In this example, we build a model that is able to distinguish the GRB file type from non-GRB file types; note that the set of non-GRB files is huge and cannot be easily defined, so there need to be as many negative training examples as possible to form the decision boundary for non-GRB types. The neural network approach consists of two stages: training and classification. The training can be done in any programming language; in this feature/research, the training of the neural network is implemented in R, and the source can be found in my GitHub repository, i.e. https://github.com/LukeLiush/filetypeDetection; I am also going to post a document that describes the use of the program and the syntax/format of the input and output. After training, we need to export the model and import it into Tika; in Tika, we create a TrainedModelDetector that reads this model file with one or more model parameters, or several model files, so it can detect the MIME types covered by those models. Details of the research and usage of this proposed feature will be posted on my GitHub shortly. It is worth noting again that in this research we only worked out one model, GRB, as an example to demonstrate the use of this content-based MIME detection. 
One of the challenges again is that the non-GRB file types cannot be clearly defined unless we feed our model with some example data for all of the existing file types in the world, but this seems to be too utopian and a bit less likely, so it is better that the set of class/types is given and defined in advance to minimize the problem domain. Another challenge is the size of the training data; even if we know the types we want to classify, getting enough training data to form a model can be also one of the main factors of success. In our example model, grb data are collected from ftp://hydro1.sci.gsfc.nasa.gov/data/; and we find out that the grb data from that source all exhibit a similar pattern, a simple neural network structure is able to predict well, even a linear logistic regression is able to do a good job; However, if we pass the GRB files collected from other source to the model for
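The input preprocessing described in TIKA-1582 (256 byte-count bins, normalization to the range 0 to 1, then companding to boost infrequent bytes) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the pull request: the class name `ByteHistogramSketch`, normalization by the largest bin, and the companding function x^(1/BETA) with BETA = 1.5 are all choices made here, since the issue text does not say how either step is done.

```java
// Hypothetical sketch; not taken from Tika or from the TIKA-1582 patch.
final class ByteHistogramSketch {
    // Assumed companding exponent; the issue does not give a value.
    static final double BETA = 1.5;

    /** Returns 256 features in [0, 1], one per byte value. */
    static double[] features(byte[] data) {
        long[] counts = new long[256];
        for (byte b : data) {
            counts[b & 0xFF]++;               // map signed byte to 0..255
        }
        long max = 0;
        for (long c : counts) {
            max = Math.max(max, c);
        }
        double[] out = new double[256];
        if (max == 0) {
            return out;                       // empty input: all-zero histogram
        }
        for (int i = 0; i < 256; i++) {
            // Assumption: normalize by the most frequent byte's count.
            double normalized = (double) counts[i] / max;
            // Companding: an exponent below 1 lifts small values toward 1.
            out[i] = Math.pow(normalized, 1.0 / BETA);
        }
        return out;
    }
}
```

The resulting 256-element vector is what a classifier such as the neural network described above would take as input; the training itself is outside the scope of this sketch.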
[jira] [Commented] (TIKA-1517) MIME type selection with probability
[ https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14524934#comment-14524934 ] ASF GitHub Bot commented on TIKA-1517: -- Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/41 MIME type selection with probability Key: TIKA-1517 URL: https://issues.apache.org/jira/browse/TIKA-1517 Project: Tika Issue Type: Improvement Components: mime Affects Versions: 0.1-incubating, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.10, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6 Reporter: Luke sh Priority: Trivial Labels: memex Fix For: 1.9 Attachments: BaysianTest.java Improvement and intuition: The original implementation of MIME type selection/detection is somewhat inflexible by design, as it relies heavily on the outcome of magic-bytes MIME type identification; e.g. if magic bytes are applicable to a file, Tika will follow the file type detected by magic bytes. It may be better to provide more control over the method of choice. The proposed approach makes light use of Bayes' theorem: users are able to assign weights, expressed as probabilities, to each approach, so they control which of the MIME type identification methods available in Tika is preferred. There are currently 3 methods for identifying the MIME type in Tika (i.e. magic bytes, file extension and metadata content-type hint). By introducing weights on the methods, users are able to choose which method they trust most, although the magic-bytes method is often trustworthy. The benefit is that in some situations file type identification must be strict: some users might want all of the MIME type identification methods to agree on the same file type before they start processing those files, because incorrect file type identification is not tolerable. 
The current implementation seems to be less flexible for this purpose and relies heavily on the magic-bytes file identification method (although magic bytes is the most reliable of the 3). Proposed design: The idea of selection is to incorporate probabilities as weights on each MIME type identification method currently implemented in Tika (the magic-bytes approach, file extension match and metadata content-type hint). For example, as a user, I would probably like to assign a preference to each method based on how much I trust it, and order the results if they don't coincide. Bayes' rule seems appropriate here to capture this intuition. The following is what is needed for an implementation of Bayes' rule. Prior probability P(file_type), e.g. P(pdf): theoretically this is computed from samples, and it depends on the domain or use case. Intuitively, we care more about the order of the weights or probabilities of the results than about the actual numbers. The prior also depends on the samples for a particular use case or domain; e.g. if we happen to crawl a website that contains mostly PDF files, we can collect some samples and compute the prior; if the samples show that 90% of the docs are PDF, our prior is P(pdf) = 0.9. Here we propose to make the prior a configurable parameter for users, and by default we leave the prior as not applicable. 
Alternatively, we can define the prior for each file type to be 1/[number of supported file types in Tika]; I think the number would be approximately 1/1157, and using this number seems fairer. The reason to avoid it is that this prior is the same for every type: since we ultimately care about the order of the results, a constant prior cannot change that order, so bringing 1/1157 into the Bayesian equation would only burden our implementation with extra computation. Thus we leave it as not applicable, which means we assign 1 to it, as if it did not exist. Note again that we care more about the order than the actual number; this parameter is configurable, and we believe it provides much flexibility in some use cases. Conditional probability of a positive test given a file type, P(test | file_type), e.g. P(test1 = pdf | pdf): this probability is also based on a collection of samples and on the domain or use case, so we leave it configurable; but based on our intuition we think test1 (i.e. the magic-bytes method) is the most trustworthy, thus the default value is 0.75 for P(test1 = a_file_type | a_file_type); that is to say, given a file whose type is a_file_type, the probability of test1 predicting the file
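To make the Bayes' rule discussion in TIKA-1517 concrete, here is a minimal sketch of a posterior computation for a single detection method. The prior 0.9 and the true-positive rate 0.75 are the example values from the issue text; the false-positive rate 0.1, the class name `BayesMimeSketch`, and the two-outcome simplification (the file either is or is not of the given type) are assumptions made purely for illustration and are not taken from the attached BaysianTest.java.

```java
// Hypothetical sketch; not the implementation from the TIKA-1517 patch.
final class BayesMimeSketch {
    /**
     * P(type | test says type), by Bayes' rule:
     *   P(test|type) * P(type)
     *   ---------------------------------------------------
     *   P(test|type) * P(type) + P(test|not type) * P(not type)
     */
    static double posterior(double prior, double truePositive, double falsePositive) {
        double hit = truePositive * prior;
        double falseAlarm = falsePositive * (1.0 - prior);
        double evidence = hit + falseAlarm;
        return evidence == 0.0 ? 0.0 : hit / evidence;
    }

    public static void main(String[] args) {
        // 0.9 and 0.75 come from the issue text; 0.1 is an assumed value.
        System.out.println(posterior(0.9, 0.75, 0.1));
    }
}
```

With several methods, one such posterior per method and candidate type could be computed and the candidates ranked by it, which matches the issue's point that the ordering matters more than the absolute numbers.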
[jira] [Commented] (TIKA-443) Geographic Information Parser
[ https://issues.apache.org/jira/browse/TIKA-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14522848#comment-14522848 ] ASF GitHub Bot commented on TIKA-443: - GitHub user gautham4 opened a pull request: https://github.com/apache/tika/pull/47 PULL REQUEST for TIKA-443 You can merge this pull request into a Git repository by running: $ git pull https://github.com/gautham4/tika TIKA-443 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/47.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #47 commit 6bfdbcd869455bbae7a4547b738e5a0b249053e8 Author: unknown gautham@gmail.com Date: 2015-04-22T04:35:03Z fix for TIKA-443 contributed by gautham4 commit 66ba03ee85946d7babf9815b9734f0ee83b4767f Author: unknown gautham@gmail.com Date: 2015-05-01T05:34:38Z fix for TIKA-443 contributed by gautham@gmail.com Geographic Information Parser - Key: TIKA-443 URL: https://issues.apache.org/jira/browse/TIKA-443 Project: Tika Issue Type: New Feature Components: parser Reporter: Arturo Beltran Assignee: Chris A. Mattmann Labels: new-parser Attachments: getFDOMetadata.xml I'm working in the automatic description of geospatial resources, and I think that might be interesting to incorporate new parser/s to Tika in order to manage and describe some geo-formats. These geo-formats include files, services and databases. If anyone is interested in this issue or want to collaborate do not hesitate to contact me. Any help is welcome. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-443) Geographic Information Parser
[ https://issues.apache.org/jira/browse/TIKA-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14522907#comment-14522907 ] ASF GitHub Bot commented on TIKA-443: - Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/47 Geographic Information Parser - Key: TIKA-443 URL: https://issues.apache.org/jira/browse/TIKA-443 Project: Tika Issue Type: New Feature Components: parser Reporter: Arturo Beltran Assignee: Chris A. Mattmann Labels: new-parser Attachments: getFDOMetadata.xml I'm working in the automatic description of geospatial resources, and I think that might be interesting to incorporate new parser/s to Tika in order to manage and describe some geo-formats. These geo-formats include files, services and databases. If anyone is interested in this issue or want to collaborate do not hesitate to contact me. Any help is welcome. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1634) Detecting problem with Matlab source code
[ https://issues.apache.org/jira/browse/TIKA-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14571681#comment-14571681 ] ASF GitHub Bot commented on TIKA-1634: -- GitHub user jihyunoh opened a pull request: https://github.com/apache/tika/pull/49 fix for TIKA-1634 contributed by Ji-Hyun Oh You can merge this pull request into a Git repository by running: $ git pull https://github.com/jihyunoh/tika TIKA-1634 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/49.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #49 commit f0cc53e79afa3feb8330a53eed6c73bc80a4f3c0 Author: jihyunoh mail2j...@gmail.com Date: 2015-06-03T21:26:56Z fix for TIKA-1634 contributed by Ji-Hyun Oh Detecting problem with Matlab source code - Key: TIKA-1634 URL: https://issues.apache.org/jira/browse/TIKA-1634 Project: Tika Issue Type: Improvement Components: mime Affects Versions: 1.8 Reporter: Ji-Hyun Oh Priority: Trivial Labels: earthcube Attachments: BARCAST_MainCode.m, Initial_Vals_Maker.m, custom-mimetypes.xml, tika-mimetypes.xml, wtsgaus.m Both Matlab source code and Objective-C source code have the same suffix, which is .m. Therefore, Matlab has an additional match value in its MIME type definition. In tika-mimetypes.xml Matlab is defined as:
    <mime-type type="text/x-matlab">
      <_comment>Matlab source code</_comment>
      <magic priority="50">
        <match value="function [" type="string" offset="0"/>
      </magic>
      <!-- <glob pattern="*.m"/> - conflicts with text/x-objcsrc -->
      <sub-class-of type="text/plain"/>
    </mime-type>
However, Matlab code does not always start with "function [". Therefore, some Matlab code is detected as text/x-objcsrc. Based on the source code collected from NOAA Paleoclimatology Software Resources, many Matlab files would be matched by values like these (problematic files are attached as examples):
    <mime-type type="text/x-matlab">
      <_comment>Matlab source code</_comment>
      <magic priority="50">
        <match value="function" type="string" offset="0"/>
        <match value="%" type="string" offset="0"/>
      </magic>
      <!-- <glob pattern="*.m"/> - conflicts with text/x-objcsrc -->
      <sub-class-of type="text/plain"/>
    </mime-type>
Several detection tests were conducted using different Matlab packages obtained from NOAA Paleoclimatology Software Resources, with/without custom-mimetypes.xml. Results are attached. As a result, a total of 103 Matlab files are detected correctly with custom-mimetypes.xml, while 42 Matlab files are detected as Matlab files without custom-mimetypes.xml (= only with the current match value). However, this match value for Matlab source code may be common only in the Paleoclimatology community. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
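The effect of the two proposed match values in TIKA-1634 can be illustrated with a small prefix check at offset 0. This is a simplified stand-in and not Tika's magic matcher: the class `MagicPrefixSketch` and its methods are hypothetical, and real detection in Tika also weighs priorities and other types' rules, which this sketch ignores.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; Tika's real magic matching is more involved.
final class MagicPrefixSketch {
    // The two match values proposed in the issue, both at offset 0.
    static final String[] MATLAB_PREFIXES = {"function", "%"};

    /** True if the file head starts with any of the proposed prefixes. */
    static boolean looksLikeMatlab(byte[] head) {
        for (String prefix : MATLAB_PREFIXES) {
            if (startsWith(head, prefix.getBytes(StandardCharsets.US_ASCII))) {
                return true;
            }
        }
        return false;
    }

    /** Byte-wise comparison of the prefix against the start of the data. */
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A leading "%" is a permissive rule: other text formats, PDF and PostScript among them, also begin with that character, so in practice such a rule depends on the other types' more specific matches taking precedence. That is in line with the reporter's own caveat that these values may suit only one community's files.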
[jira] [Commented] (TIKA-1634) Detecting problem with Matlab source code
[ https://issues.apache.org/jira/browse/TIKA-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14572192#comment-14572192 ] ASF GitHub Bot commented on TIKA-1634: -- Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/49 Detecting problem with Matlab source code - Key: TIKA-1634 URL: https://issues.apache.org/jira/browse/TIKA-1634 Project: Tika Issue Type: Improvement Components: mime Affects Versions: 1.8 Reporter: Ji-Hyun Oh Assignee: Chris A. Mattmann Priority: Trivial Labels: earthcube Attachments: BARCAST_MainCode.m, Initial_Vals_Maker.m, custom-mimetypes.xml, tika-mimetypes.xml, wtsgaus.m Both Matlab source code and Objective-C source code have the same suffix, which is .m. Therefore, Matlab has an additional match value in its MIME type definition. In tika-mimetypes.xml Matlab is defined as:
    <mime-type type="text/x-matlab">
      <_comment>Matlab source code</_comment>
      <magic priority="50">
        <match value="function [" type="string" offset="0"/>
      </magic>
      <!-- <glob pattern="*.m"/> - conflicts with text/x-objcsrc -->
      <sub-class-of type="text/plain"/>
    </mime-type>
However, Matlab code does not always start with "function [". Therefore, some Matlab code is detected as text/x-objcsrc. Based on the source code collected from NOAA Paleoclimatology Software Resources, many Matlab files would be matched by values like these (problematic files are attached as examples):
    <mime-type type="text/x-matlab">
      <_comment>Matlab source code</_comment>
      <magic priority="50">
        <match value="function" type="string" offset="0"/>
        <match value="%" type="string" offset="0"/>
      </magic>
      <!-- <glob pattern="*.m"/> - conflicts with text/x-objcsrc -->
      <sub-class-of type="text/plain"/>
    </mime-type>
Several detection tests were conducted using different Matlab packages obtained from NOAA Paleoclimatology Software Resources, with/without custom-mimetypes.xml. Results are attached. As a result, a total of 103 Matlab files are detected correctly with custom-mimetypes.xml, while 42 Matlab files are detected as Matlab files without custom-mimetypes.xml (= only with the current match value). However, this match value for Matlab source code may be common only in the Paleoclimatology community. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1659) ZipContainerDetector does not detect all IPA files
[ https://issues.apache.org/jira/browse/TIKA-1659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14591754#comment-14591754 ] ASF GitHub Bot commented on TIKA-1659: -- GitHub user Rshomali opened a pull request: https://github.com/apache/tika/pull/51 fix for TIKA-1659 contributed by rami You can merge this pull request into a Git repository by running: $ git pull https://github.com/Rshomali/tika TIKA-1659 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/51.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #51 commit b333284c8d77ecc26431c765d1fba495ea168aef Author: Rami Shomali rami.shom...@lookout.com Date: 2015-06-18T13:02:16Z fix for TIKA-1659 contributed by rami ZipContainerDetector does not detect all IPA files --- Key: TIKA-1659 URL: https://issues.apache.org/jira/browse/TIKA-1659 Project: Tika Issue Type: Bug Components: mime Reporter: Rami Shomali ZipContainerDetector expects two files to identify the IPA file as application/x-itunes-ipa: 1) app/CodeResources 2) app/ResourceRules.plist https://github.com/apache/tika/blob/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pkg/ZipContainerDetector.java#L386 Recent IPA files downloaded from the App Store do not include those files. Need to update ZipContainerDetector and remove the patterns for those two files. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1664) GDALParser does not correctly set nitf as a supported MediaType
[ https://issues.apache.org/jira/browse/TIKA-1664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601810#comment-14601810 ] ASF GitHub Bot commented on TIKA-1664: -- GitHub user jrnorth opened a pull request: https://github.com/apache/tika/pull/53 Fix for TIKA-1664 You can merge this pull request into a Git repository by running: $ git pull https://github.com/jrnorth/tika trunk Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/53.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #53 commit 5bfddec49cde9f6ebe59e76fd3de44f8f49e07c0 Author: Joseph North joerno...@gmail.com Date: 2015-06-25T19:43:12Z Fix for TIKA-1664 GDALParser does not correctly set nitf as a supported MediaType - Key: TIKA-1664 URL: https://issues.apache.org/jira/browse/TIKA-1664 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.7, 1.8, 1.9 Reporter: Joseph North Labels: easyfix, patch GDALParser incorrectly adds ntif as a supported MediaType. It should be nitf. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1659) ZipContainerDetector does not detect all IPA files
[ https://issues.apache.org/jira/browse/TIKA-1659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601632#comment-14601632 ] ASF GitHub Bot commented on TIKA-1659: -- Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/51 ZipContainerDetector does not detect all IPA files --- Key: TIKA-1659 URL: https://issues.apache.org/jira/browse/TIKA-1659 Project: Tika Issue Type: Bug Components: mime Reporter: Rami Shomali Assignee: Chris A. Mattmann Fix For: 1.10 Attachments: CamLingual.ipa, hacker_news.ipa ZipContainerDetector expects two files to identify the IPA file as application/x-itunes-ipa: 1) app/CodeResources 2) app/ResourceRules.plist https://github.com/apache/tika/blob/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pkg/ZipContainerDetector.java#L386 Recent IPA files downloaded from the App Store do not include those files. Need to update ZipContainerDetector and remove the patterns for those two files. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1655) Inconsistent formatting in parsers pom.xml file
[ https://issues.apache.org/jira/browse/TIKA-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14582017#comment-14582017 ] ASF GitHub Bot commented on TIKA-1655: -- GitHub user Purg opened a pull request: https://github.com/apache/tika/pull/50 Adjusted indentation in pom.xml file to match rest of file Fix for TIKA-1655 contributed by paul.tunison You can merge this pull request into a Git repository by running: $ git pull https://github.com/Purg/tika TIKA-1655 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/50.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #50 commit 0684258016fbebbc4efbbf33a542f23e444aff02 Author: Paul Tunison paul.tuni...@kitware.com Date: 2015-06-11T14:50:32Z Adjusted indentation in pom.xml file to match rest of file Fix for TIKA-1655 contributed by paul.tunison Inconsistent formatting in parsers pom.xml file --- Key: TIKA-1655 URL: https://issues.apache.org/jira/browse/TIKA-1655 Project: Tika Issue Type: Bug Environment: None Reporter: Paul Tunison Priority: Trivial Labels: easyfix, maven Fix For: 1.10 Indentation inconsistency in tika-parsers/pom.xml under Provided Dependencies comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1655) Inconsistent formatting in parsers pom.xml file
[ https://issues.apache.org/jira/browse/TIKA-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14582113#comment-14582113 ] ASF GitHub Bot commented on TIKA-1655: -- Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/50 Inconsistent formatting in parsers pom.xml file --- Key: TIKA-1655 URL: https://issues.apache.org/jira/browse/TIKA-1655 Project: Tika Issue Type: Bug Environment: None Reporter: Paul Tunison Priority: Trivial Labels: easyfix, maven Fix For: 1.10 Indentation inconsistency in tika-parsers/pom.xml under Provided Dependencies comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1614) Geo Topic Parser
[ https://issues.apache.org/jira/browse/TIKA-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557903#comment-14557903 ] ASF GitHub Bot commented on TIKA-1614: -- Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/43 Geo Topic Parser Key: TIKA-1614 URL: https://issues.apache.org/jira/browse/TIKA-1614 Project: Tika Issue Type: New Feature Components: parser Reporter: Anya Yun Li Assignee: Chris A. Mattmann Labels: memex Attachments: TIKA-1614.Mattmann.Li.052405.patch.txt
##Description
This program aims to provide support for identifying geonames in any unstructured text data for the NSF polar research project. https://github.com/NSF-Polar-Cyberinfrastructure/datavis-hackathon/issues/1 This project is a content-based geotagging solution, made of a variety of NLP tools, and could be used for any geotagging purpose.
##Workflow
1. Plain text input is passed to the geoparser
2. Location names are extracted from the text using OpenNLP NER
3. The extracted names are given one of two roles:
* The most frequent location name is chosen as the best match for the input text
* Other extracted locations are treated as alternatives (equal)
4. For the locations extracted above, search for the best GeoName object and return the resolved objects with fields (name in gazetteer, longitude, latitude)
##How to Use
*Caution*: This program requires at least 1.2 GB of disk space for building the Lucene index
```Java
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.geo.topic.GeoParserConfig;
import org.apache.tika.sax.BodyContentHandler;

// geoparser, gazetteerPath and nerPath are expected to be supplied by the caller.
void printGeoMetadata(InputStream stream) throws Exception {
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    GeoParserConfig config = new GeoParserConfig();
    config.setGazetterPath(gazetteerPath);
    config.setNERModelPath(nerPath);
    context.set(GeoParserConfig.class, config);
    geoparser.parse(stream, new BodyContentHandler(), metadata, context);
    // Print every metadata field the parser produced.
    for (String name : metadata.names()) {
        String value = metadata.get(name);
        System.out.println(name + " " + value);
    }
}
```
This parser adds useful geographical information to Tika's Metadata object. Fields for best matched location:
```
Geographic_NAME
Geographic_LONGTITUDE
Geographic_LATITUDE
```
Fields for alternatives:
```
Geographic_NAME1
Geographic_LONGTITUDE1
Geographic_LATITUDE1
Geographic_NAME2
Geographic_LONGTITUDE2
Geographic_LATITUDE2
...
```
If you have any questions, contact me: anyayu...@gmail.com -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika
[ https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646321#comment-14646321 ] ASF GitHub Bot commented on TIKA-1699: -- GitHub user sujen1412 opened a pull request: https://github.com/apache/tika/pull/55 Fix for TIKA-1699 contributed by Sujen Shah Waiting for GROBID to get published to maven central. Sonatype issue - https://issues.sonatype.org/browse/OSSRH-16837 You can merge this pull request into a Git repository by running: $ git pull https://github.com/sujen1412/tika TIKA-1699 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/55.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #55 commit 4f067107d01e99bd81a66c78163f2a4baf3f817f Author: Sujen Shah sujen1...@gmail.com Date: 2015-07-29T13:49:00Z Added grobid dependencies commit 323ba33816a9beabe22d351c8eac4350fa010be0 Author: Sujen Shah sujen1...@gmail.com Date: 2015-07-29T13:49:36Z Registering journal parser commit 71cdd0970fb17aeec85469d07dc1ee6460d2f4da Author: Sujen Shah sujen1...@gmail.com Date: 2015-07-29T13:54:07Z Code for integrating GROBID Parser in to Tika commit b6e9f8724b308e0c830f73702994cbe1c5932cd2 Author: Sujen Shah sujen1...@gmail.com Date: 2015-07-29T13:58:08Z Grobid properties files commit 57b70ce38a77cc349588d2f513938bc4f18d4ad4 Author: Sujen Shah sujen1...@gmail.com Date: 2015-07-29T13:58:58Z Added unit test for journal parser Corrected formatting Corrected formatting Corrected formatting Integrate the GROBID PDF extractor in Tika -- Key: TIKA-1699 URL: https://issues.apache.org/jira/browse/TIKA-1699 Project: Tika Issue Type: New Feature Components: parser Reporter: Sujen Shah Labels: memex GROBID (http://grobid.readthedocs.org/en/latest/) is a machine learning library for extracting, parsing and re-structuring raw documents such as PDF into structured 
TEI-encoded documents with a particular focus on technical and scientific publications. It has a Java API that can be used to augment PDF parsing for journals and help extract extra metadata about the paper, such as authors, publication, citations, etc. It would be nice to have this integrated into Tika; I have tried it locally and will issue a pull request soon. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1703) Can't Specify Tesseract Data Folder Distinct from Tesseract Executable Path
[ https://issues.apache.org/jira/browse/TIKA-1703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14652940#comment-14652940 ] ASF GitHub Bot commented on TIKA-1703: -- GitHub user taidan19 opened a pull request: https://github.com/apache/tika/pull/56 TIKA-1703 Add ability to specify Tesseract config path. Link to Jira ticket - https://issues.apache.org/jira/browse/TIKA-1703 You can merge this pull request into a Git repository by running: $ git pull https://github.com/taidan19/tika TIKA-1703 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/56.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #56 commit 86e8fdf187af5051812e1164c4cc3fef737a0644 Author: Christian Wolfe taida...@gmail.com Date: 2015-08-04T00:54:23Z TIKA-1703 Add ability to specify Tesseract config path. Can't Specify Tesseract Data Folder Distinct from Tesseract Executable Path --- Key: TIKA-1703 URL: https://issues.apache.org/jira/browse/TIKA-1703 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.9 Reporter: Christian Wolfe Priority: Minor Fix For: 1.9 If a user specifies the path to the Tesseract executable using {{TesseractOCRConfig.setTesseractPath}}, then Tika will assume that the Tesseract config folder (usually referred to as the 'tessdata' folder) is in the same location. This is usually true in a Windows environment, where everything is installed into a central location. However, this is not necessarily the case in a Linux environment. If one were to build Tesseract from source, for example, the config folder will be installed in a different location than the Tesseract executable. One way to fix this would be to add a way to specify the location of the Tesseract config folder separate from the path to the executable. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika
[ https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696515#comment-14696515 ] ASF GitHub Bot commented on TIKA-1699: -- Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/55 Integrate the GROBID PDF extractor in Tika -- Key: TIKA-1699 URL: https://issues.apache.org/jira/browse/TIKA-1699 Project: Tika Issue Type: New Feature Components: parser Reporter: Sujen Shah Assignee: Chris A. Mattmann Labels: memex Fix For: 1.11 GROBID (http://grobid.readthedocs.org/en/latest/) is a machine learning library for extracting, parsing and re-structuring raw documents such as PDF into structured TEI-encoded documents with a particular focus on technical and scientific publications. It has a Java API that can be used to augment PDF parsing for journals and help extract extra metadata about the paper, such as authors, publication, citations, etc. It would be nice to have this integrated into Tika; I have tried it locally and will issue a pull request soon. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1703) Can't Specify Tesseract Data Folder Distinct from Tesseract Executable Path
[ https://issues.apache.org/jira/browse/TIKA-1703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654652#comment-14654652 ] ASF GitHub Bot commented on TIKA-1703: -- Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/56 Can't Specify Tesseract Data Folder Distinct from Tesseract Executable Path --- Key: TIKA-1703 URL: https://issues.apache.org/jira/browse/TIKA-1703 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.9 Reporter: Christian Wolfe Assignee: Chris A. Mattmann Priority: Minor Fix For: 1.11 If a user specifies the path to the Tesseract executable using {{TesseractOCRConfig.setTesseractPath}}, then Tika will assume that the Tesseract config folder (usually referred to as the 'tessdata' folder) is in the same location. This is usually true in a Windows environment, where everything is installed into a central location. However, this is not necessarily the case in a Linux environment. If one were to build Tesseract from source, for example, the config folder will be installed in a different location than the Tesseract executable. One way to fix this would be to add a way to specify the location of the Tesseract config folder separate from the path to the executable. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
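The separation the ticket asks for can be sketched in plain Java. This is a minimal illustration, not the change made in the pull request: the class `TesseractCommandSketch`, its `build` method and all paths are hypothetical, and it assumes the Tesseract binary honours the `TESSDATA_PREFIX` environment variable (whose exact meaning differs between Tesseract releases).

```java
import java.io.File;

// Illustrative only: TesseractCommandSketch and its parameters are hypothetical names,
// not the TesseractOCRConfig API discussed in the ticket.
final class TesseractCommandSketch {

    /**
     * Builds a command whose executable directory and data directory are
     * configured independently. Empty strings mean "use the default".
     */
    static ProcessBuilder build(String tesseractDir, String tessdataDir,
                                String imagePath, String outputBase) {
        // Fall back to whatever "tesseract" resolves to on the PATH.
        String exe = tesseractDir.isEmpty()
                ? "tesseract"
                : new File(tesseractDir, "tesseract").getPath();
        ProcessBuilder pb = new ProcessBuilder(exe, imagePath, outputBase);
        if (!tessdataDir.isEmpty()) {
            // Assumption: the binary reads its data location from this variable.
            pb.environment().put("TESSDATA_PREFIX", tessdataDir);
        }
        return pb;
    }

    public static void main(String[] args) {
        // Hypothetical locations for a source build on Linux.
        ProcessBuilder pb = build("/opt/tesseract/bin", "/usr/local/share", "page.png", "page");
        System.out.println(pb.command());
        System.out.println(pb.environment().get("TESSDATA_PREFIX"));
    }
}
```

Keeping the two directories as separate inputs means a Windows install can pass the same value twice, while a source build on Linux can point each at its own location.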
[jira] [Commented] (TIKA-1694) Missing <scope>test</scope> for junit dependency
[ https://issues.apache.org/jira/browse/TIKA-1694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14638903#comment-14638903 ] ASF GitHub Bot commented on TIKA-1694: -- GitHub user tmortagne opened a pull request: https://github.com/apache/tika/pull/54 TIKA-1694: Missing <scope>test</scope> for junit dependency You can merge this pull request into a Git repository by running: $ git pull https://github.com/tmortagne/tika trunk Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/54.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #54 commit 70d4c9d4ade2e4df5fd7ddef90592ea4da0ce9b1 Author: Thomas Mortagne thomas.morta...@gmail.com Date: 2015-07-23T14:23:10Z TIKA-1694: Missing <scope>test</scope> for junit dependency Missing <scope>test</scope> for junit dependency Key: TIKA-1694 URL: https://issues.apache.org/jira/browse/TIKA-1694 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.9 Reporter: Thomas Mortagne Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1694) Missing <scope>test</scope> for junit dependency
[ https://issues.apache.org/jira/browse/TIKA-1694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14638969#comment-14638969 ] ASF GitHub Bot commented on TIKA-1694: -- Github user tmortagne closed the pull request at: https://github.com/apache/tika/pull/54 Missing <scope>test</scope> for junit dependency Key: TIKA-1694 URL: https://issues.apache.org/jira/browse/TIKA-1694 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.9 Reporter: Thomas Mortagne Assignee: Konstantin Gribov Priority: Minor Pull request on https://github.com/apache/tika/pull/54. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1791) URI is not hierarchical exception when location model resource is inside a jar in classpath
[ https://issues.apache.org/jira/browse/TIKA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997841#comment-14997841 ] ASF GitHub Bot commented on TIKA-1791: -- GitHub user thammegowda opened a pull request: https://github.com/apache/tika/pull/63 TIKA-1791 fix : non hierarchical URI exception when NER model is inside jar file Improvement : Model is loaded once and NameFinder is reused You can merge this pull request into a Git repository by running: $ git pull https://github.com/thammegowda/tika fix-1791 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/63.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #63 commit 4c02c9e94adfde2163a2e2b4fac9425e5485a583 Author: Thamme Gowda Date: 2015-11-10T01:39:44Z TIKA-1791 fix : non hierarchical URI for NER model > URI is not hierarchical exception when location model resource is inside a > jar in classpath > --- > > Key: TIKA-1791 > URL: https://issues.apache.org/jira/browse/TIKA-1791 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.11 > Environment: location model file is placed inside a fat Jar (with > all the dependencies) >Reporter: Thamme Gowda N > > {code:title=Stacktrace|borderStyle=solid} > The following error happens when location NER model resource is packaged > inside a jar and GeoTopicParser is enabled. 
> Caused by: java.lang.IllegalArgumentException: URI is not hierarchical > at java.io.File.<init>(File.java:418) > at > org.apache.tika.parser.geo.topic.GeoParserConfig.<init>(GeoParserConfig.java:33) > at org.apache.tika.parser.geo.topic.GeoParser.<init>(GeoParser.java:54) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > at java.lang.Class.newInstance(Class.java:442) > at > org.apache.tika.config.TikaConfig$XmlLoader.loadOne(TikaConfig.java:559) > at > org.apache.tika.config.TikaConfig$XmlLoader.loadOverall(TikaConfig.java:492) > at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:166) > at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:149) > at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:142) > at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:138) > at edu.usc.cs.ir.cwork.tika.Parser.<init>(Parser.java:45) > {code} > References : > http://stackoverflow.com/questions/18055189/why-my-uri-is-not-hierarchical -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1791) URI is not hierarchical exception when location model resource is inside a jar in classpath
[ https://issues.apache.org/jira/browse/TIKA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15006009#comment-15006009 ] ASF GitHub Bot commented on TIKA-1791: -- Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/63 > URI is not hierarchical exception when location model resource is inside a > jar in classpath > --- > > Key: TIKA-1791 > URL: https://issues.apache.org/jira/browse/TIKA-1791 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.11 > Environment: location model file is placed inside a fat Jar (with > all the dependencies) >Reporter: Thamme Gowda N > > {code:title=Stacktrace|borderStyle=solid} > The following error happens when location NER model resource is packaged > inside a jar and GeoTopicParser is enabled. > Caused by: java.lang.IllegalArgumentException: URI is not hierarchical > at java.io.File.<init>(File.java:418) > at > org.apache.tika.parser.geo.topic.GeoParserConfig.<init>(GeoParserConfig.java:33) > at org.apache.tika.parser.geo.topic.GeoParser.<init>(GeoParser.java:54) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > at java.lang.Class.newInstance(Class.java:442) > at > org.apache.tika.config.TikaConfig$XmlLoader.loadOne(TikaConfig.java:559) > at > org.apache.tika.config.TikaConfig$XmlLoader.loadOverall(TikaConfig.java:492) > at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:166) > at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:149) > at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:142) > at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:138) > at edu.usc.cs.ir.cwork.tika.Parser.<init>(Parser.java:45) > {code} > References : > 
http://stackoverflow.com/questions/18055189/why-my-uri-is-not-hierarchical -- This message was sent by Atlassian JIRA (v6.3.4#6332)
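The exception in the stack trace above can be reproduced without Tika. The sketch below is an illustration under stated assumptions, not the code from pull request #63: `ResourceLoadSketch`, `fileRejects`, `readAll` and the jar path are hypothetical. It shows that `new File(URI)` rejects an opaque `jar:` URI, while reading through `URL.openStream()` does not depend on the resource being a plain file.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URL;
import java.nio.file.Files;

// Illustrative only: class and method names are hypothetical, not Tika code.
final class ResourceLoadSketch {

    /** Returns true when new File(uri) refuses the URI, as it does for opaque jar: URIs. */
    static boolean fileRejects(URI uri) {
        try {
            new File(uri);
            return false;
        } catch (IllegalArgumentException e) {
            // For jar: URIs the message is "URI is not hierarchical".
            return true;
        }
    }

    /** Reads a resource through its URL stream instead of converting it to a File. */
    static byte[] readAll(URL resource) throws IOException {
        try (InputStream in = resource.openStream();
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical location of a model packaged inside a fat jar.
        URI inJar = URI.create("jar:file:/tmp/app.jar!/model.bin");
        System.out.println("File(URI) rejects jar URI: " + fileRejects(inJar));

        // The stream-based path is exercised here with a temporary file URL.
        File tmp = File.createTempFile("model", ".bin");
        tmp.deleteOnExit();
        Files.write(tmp.toPath(), new byte[] {1, 2, 3});
        System.out.println("bytes read: " + readAll(tmp.toURI().toURL()).length);
    }
}
```

In application code the `URL` would typically come from `Class.getResource(...)`, which returns a `jar:` URL when the resource is packaged and a `file:` URL when it is not, so the stream-based read covers both cases.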
[jira] [Commented] (TIKA-1787) Include Stanford Name Entity Recognition in Tika
[ https://issues.apache.org/jira/browse/TIKA-1787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15009110#comment-15009110 ] ASF GitHub Bot commented on TIKA-1787: -- Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/61 > Include Stanford Name Entity Recognition in Tika > > > Key: TIKA-1787 > URL: https://issues.apache.org/jira/browse/TIKA-1787 > Project: Tika > Issue Type: Improvement > Components: mime, parser >Affects Versions: 1.12 > Environment: Java 1.8, Mac OSX 10.11 >Reporter: Yueheng He >Assignee: Chris A. Mattmann > Labels: features, newbie, test > Fix For: 1.12 > > Original Estimate: 168h > Remaining Estimate: 168h > > Using the Stanford Name Entity Recognition, Tika will be able to extract name > entities like PERSON, ORGANIZATION, LOCATION, etc from the given text. The > extracted name entities will be added to the metadata -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1763) StringIndexOutOfBoundsException in ImageMetadataExtractor
[ https://issues.apache.org/jira/browse/TIKA-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948070#comment-14948070 ] ASF GitHub Bot commented on TIKA-1763: -- Github user jrnorth closed the pull request at: https://github.com/apache/tika/pull/57 > StringIndexOutOfBoundsException in ImageMetadataExtractor > - > > Key: TIKA-1763 > URL: https://issues.apache.org/jira/browse/TIKA-1763 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.10 >Reporter: Joseph North >Priority: Minor > > The {{trimPixels(String s)}} method in {{ImageMetadataExtractor}} will throw > a {{StringIndexOutOfBoundsException}} if the string passed to it doesn't > contain the substring " pixels". The method does a {{lastIndexOf(" pixels")}} > call on the string passed to it and uses the result directly in a > {{substring()}} call without checking whether the result is greater than -1. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1772) Mimetype of VTT files
[ https://issues.apache.org/jira/browse/TIKA-1772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960290#comment-14960290 ] ASF GitHub Bot commented on TIKA-1772: -- GitHub user wiedsche opened a pull request: https://github.com/apache/tika/pull/59 fix for TIKA-1772 contributed by wiedsche You can merge this pull request into a Git repository by running: $ git pull https://github.com/wiedsche/tika TIKA-1772 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/59.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #59 commit 08a4df4e2b6a0d2cd14dc411906ed4a4a45814a3 Author: Alexander Widera Date: 2015-10-16T07:15:56Z fix for TIKA-1772 contributed by wiedsche > Mimetype of VTT files > - > > Key: TIKA-1772 > URL: https://issues.apache.org/jira/browse/TIKA-1772 > Project: Tika > Issue Type: Improvement >Reporter: Alexander Widera >Priority: Minor > > Files with extension "vtt" are "WebVTT: The Web Video Text Tracks Format" > files. > The mimetype resolved by tika is currently text/plain. > The correct mimetype should be text/vtt. > see: https://w3c.github.io/webvtt/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1772) Mimetype of VTT files
[ https://issues.apache.org/jira/browse/TIKA-1772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962569#comment-14962569 ] ASF GitHub Bot commented on TIKA-1772: -- Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/59 > Mimetype of VTT files > - > > Key: TIKA-1772 > URL: https://issues.apache.org/jira/browse/TIKA-1772 > Project: Tika > Issue Type: Improvement >Reporter: Alexander Widera >Priority: Minor > Fix For: 1.11 > > Attachments: upc-video-subtitles-en.vtt > > > Files with extension "vtt" are "WebVTT: The Web Video Text Tracks Format" > files. > The mimetype resolved by tika is currently text/plain. > The correct mimetype should be text/vtt. > see: https://w3c.github.io/webvtt/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1763) StringIndexOutOfBoundsException in ImageMetadataExtractor
[ https://issues.apache.org/jira/browse/TIKA-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943692#comment-14943692 ] ASF GitHub Bot commented on TIKA-1763: -- GitHub user jrnorth opened a pull request: https://github.com/apache/tika/pull/57 Fix for TIKA-1763 -Prevents a StringIndexOutOfBoundsException by checking the result of a call to lastIndexOf() before using it in a call to substring(). -Adds unit tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/jrnorth/tika trunk Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/57.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #57 commit 807302f2775d9ee6a6a9f3820665330edc163d88 Author: Joseph North Date: 2015-10-05T17:12:54Z Fix for TIKA-1763 -Prevents a StringIndexOutOfBoundsException by checking the result of a call to lastIndexOf() before using it in a call to substring(). -Adds unit tests. > StringIndexOutOfBoundsException in ImageMetadataExtractor > - > > Key: TIKA-1763 > URL: https://issues.apache.org/jira/browse/TIKA-1763 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.10 >Reporter: Joseph North >Priority: Minor > > The {{trimPixels(String s)}} method in {{ImageMetadataExtractor}} will throw > a {{StringIndexOutOfBoundsException}} if the string passed to it doesn't > contain the substring " pixels". The method does a {{lastIndexOf(" pixels")}} > call on the string passed to it and uses the result directly in a > {{substring()}} call without checking whether the result is greater than -1. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
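The guard described in the pull request can be sketched as follows. This is a stand-alone illustration of the technique (check the result of `lastIndexOf()` before passing it to `substring()`), not the actual `ImageMetadataExtractor` source; the class name `PixelTrimSketch` is hypothetical.

```java
// Illustrative only: a hypothetical stand-in for the helper discussed in TIKA-1763.
final class PixelTrimSketch {

    private static final String SUFFIX = " pixels";

    /** Strips a trailing " pixels" marker if present; otherwise returns the input unchanged. */
    static String trimPixels(String s) {
        if (s == null) {
            return null;
        }
        int pos = s.lastIndexOf(SUFFIX);
        // lastIndexOf returns -1 when the marker is absent; substring(0, -1) would throw
        // StringIndexOutOfBoundsException, so the index is checked first.
        return pos > -1 ? s.substring(0, pos) : s;
    }

    public static void main(String[] args) {
        System.out.println(trimPixels("640 pixels"));
        System.out.println(trimPixels("640"));
    }
}
```

Returning the input unchanged when the marker is missing is one possible policy; a caller that needs to distinguish the two cases could return an `Optional<String>` instead.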
[jira] [Commented] (TIKA-1803) Use lucene-geo-gazetteer REST API in GeoTopicParser
[ https://issues.apache.org/jira/browse/TIKA-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060182#comment-15060182 ] ASF GitHub Bot commented on TIKA-1803: -- GitHub user smadha opened a pull request: https://github.com/apache/tika/pull/65 fix for TIKA-1803 contributed by msha...@usc.edu You can merge this pull request into a Git repository by running: $ git pull https://github.com/smadha/tika TIKA-1803 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/65.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #65 commit a55990aa5d6a0c521358123f8d7bbd6947255174 Author: smadha Date: 2015-12-16T15:26:23Z fix for TIKA-1803 contributed by msha...@usc.edu > Use lucene-geo-gazetteer REST API in GeoTopicParser > --- > > Key: TIKA-1803 > URL: https://issues.apache.org/jira/browse/TIKA-1803 > Project: Tika > Issue Type: Sub-task > Components: parser >Reporter: Madhav Sharan > > As of now tika uses lucene-geo-gazetteer CLI to extract co-ordinates of a > location. CLI requires jvm and lucene to instantiate for every request. With > all new REST api it will be possible to gain improvement in this space. > Idea is to create a client of lucene-geo-gazetteer in tika and use it in > GeoTopicParser -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1816) Lenient testing for NamedEntityParser
[ https://issues.apache.org/jira/browse/TIKA-1816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15065966#comment-15065966 ] ASF GitHub Bot commented on TIKA-1816: -- GitHub user thammegowda opened a pull request: https://github.com/apache/tika/pull/68 FIX for TIKA-1816 by Thamme Gowda Lenient testing for `NamedEntityParser` You can merge this pull request into a Git repository by running: $ git pull https://github.com/thammegowda/tika TIKA-1816 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/68.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #68 commit 865de584be7cda0ed34c677f5bff5bb87b7a6996 Author: Thamme Gowda Date: 2015-12-21T01:14:56Z Lenient testing for NamedEntityParser > Lenient testing for NamedEntityParser > - > > Key: TIKA-1816 > URL: https://issues.apache.org/jira/browse/TIKA-1816 > Project: Tika > Issue Type: Bug > Components: parser >Reporter: Thamme Gowda N > > NamedEntityParser has a hard setup requirement like downloading of NER models > from remote servers and adding them to classpath. > These model files are huge and hence are not added to source control. > So, the tests are most likely to fail in various environments. > Make the best effort to set up the tests, but in the worst case skip tests > instead of failing the whole build process. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1815) Text content from parser is empty when NamedEntityParser is enabled
[ https://issues.apache.org/jira/browse/TIKA-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15065535#comment-15065535 ] ASF GitHub Bot commented on TIKA-1815: -- Github user thammegowda closed the pull request at: https://github.com/apache/tika/pull/66 > Text content from parser is empty when NamedEntityParser is enabled > --- > > Key: TIKA-1815 > URL: https://issues.apache.org/jira/browse/TIKA-1815 > Project: Tika > Issue Type: Bug > Components: parser >Reporter: Thamme Gowda N > Fix For: 1.12 > > Original Estimate: 0.5h > Remaining Estimate: 0.5h > > When the NamedEntityParser is enabled, the Tika#parseToString() and other > parse() methods produces an empty string. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1815) Text content from parser is empty when NamedEntityParser is enabled
[ https://issues.apache.org/jira/browse/TIKA-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15065485#comment-15065485 ] ASF GitHub Bot commented on TIKA-1815: -- GitHub user thammegowda opened a pull request: https://github.com/apache/tika/pull/66 Fix for TIKA-1815 contributed by Thamme Gowda + Outputting the text content to XMLDocumentHandler You can merge this pull request into a Git repository by running: $ git pull https://github.com/thammegowda/tika fix-TIKA-1815 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/66.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #66 commit e96da2bc28d5eef81d034e39eb05099ed5d38ac1 Author: Thamme Gowda Date: 2015-10-30T21:47:45Z Add NamedEntityParser Add OpenNLPNERecogniser as default commit a720507a1c1906a501470a7d5c5cec335412fcd3 Author: Thamme Gowda Date: 2015-10-30T22:16:11Z Set charset for converting text to stream commit 6b1a20e681a5d319886464ec147967c876b7e60d Author: Thamme Gowda Date: 2015-10-31T04:23:43Z Automated OpenNLP NER model downloader commit e381ea88ebd2bb8f5adfe36d710acfce673e30aa Author: Thamme Gowda Date: 2015-11-04T00:31:40Z using a secondary parser to convert non-text streams commit ea7871bd4afae7d18e500ffc285e58afd08f5e86 Author: Thamme Gowda Date: 2015-11-08T07:36:48Z Add regex based NER commit 084985b3612438e9ca7107fecdffd67757d04d10 Author: Thamme Gowda Date: 2015-11-08T07:38:17Z Add CoreNLP NER with runtime binding commit e4d74218ece77143d1e5245a3ef64ddf5578c310 Author: Thamme Gowda Date: 2015-11-08T23:41:15Z Added support for chaining NER implementations commit 7e6b43c83ec6cdd35ea258f52c0110ba986c82b3 Author: Thamme Gowda Date: 2015-11-09T05:58:58Z charset specified commit caba68773a287752dea43f3366e6d4309fde861c Author: Thamme Gowda Date: 2015-11-10T01:34:04Z Merge branch 'trunk' of github.com:apache/tika into trunk commit 
08b916790b279cda0201f2529ca58646dea4b2f9 Author: Thamme Gowda Date: 2015-11-10T19:06:29Z Resolved Code formatting issues + Removed star imports + Removed dead code / commented code + Added License header to missing files commit e07ac630d54cc79d9a7bfc9ac82332474d07434b Author: Thamme Gowda Date: 2015-11-16T09:05:07Z Add missing doc strings, fix code formatting issues commit 96d4d7cc29d4bcd8ac0cf7a595c39b6ed64d4d19 Author: Thamme Gowda Date: 2015-11-18T03:03:41Z Fix: build phase for model downloader commit 6d0b121b8b321e8a31257fc608bb001d3fe7afb5 Author: Thamme Gowda Date: 2015-12-11T14:33:36Z Merge branch 'trunk' of github.com:apache/tika into trunk commit 66d3a10ffabf1f54cff384ce1c7325c2a3c16279 Author: Thamme Gowda Date: 2015-12-19T18:59:26Z Fix : TIKA-1815 by Thamme Gowda N. 1. Writing text content to XMLContentHandler 2. Added RegexNERParser to Default parser chain > Text content from parser is empty when NamedEntityParser is enabled > --- > > Key: TIKA-1815 > URL: https://issues.apache.org/jira/browse/TIKA-1815 > Project: Tika > Issue Type: Bug > Components: parser >Reporter: Thamme Gowda N > Fix For: 1.12 > > Original Estimate: 0.5h > Remaining Estimate: 0.5h > > When the NamedEntityParser is enabled, the Tika#parseToString() and other > parse() methods produces an empty string. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1815) Text content from parser is empty when NamedEntityParser is enabled
[ https://issues.apache.org/jira/browse/TIKA-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15065533#comment-15065533 ] ASF GitHub Bot commented on TIKA-1815: -- GitHub user thammegowda opened a pull request: https://github.com/apache/tika/pull/67 FIX for TIKA-1815 contributed by Thamme Gowda + Writing the text content to XML Document + Added Regex recogniser to default NER chain Closes #66 (this is a simpler version of the same). Fixes #TIKA-1815 You can merge this pull request into a Git repository by running: $ git pull https://github.com/thammegowda/tika TIKA-1815 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/67.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #67 commit a40a18e2f61f2152fa065bda193ceb74e7e60c97 Author: Thamme Gowda Date: 2015-12-19T20:56:21Z FIX for TIKA-1815 contributed by Thamme Gowda + Writing the text content to XML Document + Added Regex recogniser to default NER chain > Text content from parser is empty when NamedEntityParser is enabled > --- > > Key: TIKA-1815 > URL: https://issues.apache.org/jira/browse/TIKA-1815 > Project: Tika > Issue Type: Bug > Components: parser >Reporter: Thamme Gowda N > Fix For: 1.12 > > Original Estimate: 0.5h > Remaining Estimate: 0.5h > > When the NamedEntityParser is enabled, the Tika#parseToString() and other > parse() methods produces an empty string. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1816) Lenient testing for NamedEntityParser
[ https://issues.apache.org/jira/browse/TIKA-1816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15066111#comment-15066111 ] ASF GitHub Bot commented on TIKA-1816: -- Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/68 > Lenient testing for NamedEntityParser > - > > Key: TIKA-1816 > URL: https://issues.apache.org/jira/browse/TIKA-1816 > Project: Tika > Issue Type: Bug > Components: parser >Reporter: Thamme Gowda N >Assignee: Chris A. Mattmann > Labels: memex > Fix For: 1.12 > > > NamedEntityParser has a hard setup requirement like downloading of NER models > from remote servers and adding them to classpath. > These model files are huge and hence are not added to source control. > So, the tests are most likely to fail in various environments. > Make the best effort to set up the tests, but in the worst case skip tests > instead of failing the whole build process. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1803) Use lucene-geo-gazetteer REST API in GeoTopicParser
[ https://issues.apache.org/jira/browse/TIKA-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15065848#comment-15065848 ] ASF GitHub Bot commented on TIKA-1803: -- Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/65 > Use lucene-geo-gazetteer REST API in GeoTopicParser > --- > > Key: TIKA-1803 > URL: https://issues.apache.org/jira/browse/TIKA-1803 > Project: Tika > Issue Type: Sub-task > Components: parser >Reporter: Madhav Sharan >Assignee: Chris A. Mattmann > Labels: memex > Fix For: 1.12 > > > As of now tika uses lucene-geo-gazetteer CLI to extract co-ordinates of a > location. CLI requires jvm and lucene to instantiate for every request. With > all new REST api it will be possible to gain improvement in this space. > Idea is to create a client of lucene-geo-gazetteer in tika and use it in > GeoTopicParser -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1815) Text content from parser is empty when NamedEntityParser is enabled
[ https://issues.apache.org/jira/browse/TIKA-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15065887#comment-15065887 ] ASF GitHub Bot commented on TIKA-1815: -- Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/67 > Text content from parser is empty when NamedEntityParser is enabled > --- > > Key: TIKA-1815 > URL: https://issues.apache.org/jira/browse/TIKA-1815 > Project: Tika > Issue Type: Bug > Components: parser >Reporter: Thamme Gowda N >Assignee: Chris A. Mattmann > Labels: memex > Fix For: 1.12 > > Original Estimate: 0.5h > Remaining Estimate: 0.5h > > When the NamedEntityParser is enabled, the Tika#parseToString() and other > parse() methods produces an empty string. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1798) Parser for Video Similarity using PooledTimeSeries metric
[ https://issues.apache.org/jira/browse/TIKA-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15030293#comment-15030293 ] ASF GitHub Bot commented on TIKA-1798: -- Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/64 > Parser for Video Similarity using PooledTimeSeries metric > - > > Key: TIKA-1798 > URL: https://issues.apache.org/jira/browse/TIKA-1798 > Project: Tika > Issue Type: Bug > Components: parser >Reporter: Chris A. Mattmann >Assignee: Chris A. Mattmann > Labels: memex > Fix For: 1.12 > > > My student [~1ceb00da] and I are working on a parser that's an implementation > of the PooledTimeSeries metric for video similarity: > http://github.com/chrismattmann/pooled_time_series > The original author of the algorithm approach was Michael Ryoo in his paper > here: > https://github.com/chrismattmann/pooled_time_series#research-background-and-detail > The code is Apache License, version 2. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1798) Parser for Video Similarity using PooledTimeSeries metric
[ https://issues.apache.org/jira/browse/TIKA-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021220#comment-15021220 ] ASF GitHub Bot commented on TIKA-1798: -- GitHub user chrismattmann opened a pull request: https://github.com/apache/tika/pull/64 Pull request for TIKA-1798 Parser for Video Similarity using PooledTimeSeries metric contributed by Aditya Dhulipala and Chris Mattmann. You can merge this pull request into a Git repository by running: $ git pull https://github.com/chrismattmann/tika trunk Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/64.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #64 commit d79f57c8fe9a13f8bb549777d82ce5be0034acb1 Author: Aditya Dhulipala Date: 2015-10-02T12:45:22Z PoT definitions commit 78daf1a0cd30b4810f64582582a873cd1c5f81ce Author: Aditya Dhulipala Date: 2015-10-12T02:09:56Z PoT Config defns commit 9923a7c0a8dc7338b845e314c9caa20f0b8142af Author: Aditya Dhulipala Date: 2015-10-14T14:43:04Z Registered PoTParser with ParserServices commit a76b6c643a55a84c310d5bfea0ab799d13a1d8a9 Author: Aditya Dhulipala Date: 2015-10-14T15:30:40Z defined PoT properties config file commit ca8c1811fea9f45ead9c1dc60ba566befafd374d Author: Aditya Dhulipala Date: 2015-10-14T15:32:31Z defined PoT config values commit 1012ba1e68939d4ab3784043694ef4a26b27e3af Author: Aditya Dhulipala Date: 2015-10-14T15:33:33Z Modified config file to better reflect PoT config values commit d3f78b567d22e1626dcd7ceea624e789fe3057ef Author: Aditya Dhulipala Date: 2015-10-14T15:34:13Z Fixed build errors commit 43969b154a83cfb3dad29eb49f597426c88bcfb4 Author: Aditya Dhulipala Date: 2015-10-29T16:17:25Z Built basic version of tika-pot-parser commit 30f240abfe5b0be9fd572e0d55be764d43827412 Author: Aditya Dhulipala Date: 2015-11-14T02:06:33Z Added a timeout value commit 
3c9719370a62c12e31fa744e4ec9547c83ac2ca6 Author: Aditya Dhulipala Date: 2015-11-14T02:07:54Z Parser now runs PoT and extracts OF metadata Uses JavaProcessBuilder to invoke the PoT jar. Runs PoT to generate of.txt and hog.txt. Extracts of.txt metadata as a list of lists, i.e. a 2D matrix in HTML commit 90e42d145d806cc2b122baf8f7b12a4b465c5125 Author: Aditya Dhulipala Date: 2015-11-14T02:22:46Z Added Apache License header commit 235156a122e164276d164ef78229a8f01784022a Author: Chris Mattmann Date: 2015-11-22T22:20:42Z Merge branch 'pooled-time-series' of https://github.com/cafed00d4j/tika-pooled-time-series into trunk > Parser for Video Similarity using PooledTimeSeries metric > - > > Key: TIKA-1798 > URL: https://issues.apache.org/jira/browse/TIKA-1798 > Project: Tika > Issue Type: Bug > Components: parser >Reporter: Chris A. Mattmann >Assignee: Chris A. Mattmann > Fix For: 1.12 > > > My student [~1ceb00da] and I are working on a parser that's an implementation > of the PooledTimeSeries metric for video similarity: > http://github.com/chrismattmann/pooled_time_series > The original author of the algorithm approach was Michael Ryoo in his paper > here: > https://github.com/chrismattmann/pooled_time_series#research-background-and-detail > The code is Apache License, version 2. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1978) Invocation of java.net.URL.equals(Object), which blocks to do domain name resolution, in org.apache.tika.parser.geo.topic.GeoParser.initialize(URL)
[ https://issues.apache.org/jira/browse/TIKA-1978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15300248#comment-15300248 ] ASF GitHub Bot commented on TIKA-1978: -- Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/122 > Invocation of java.net.URL.equals(Object), which blocks to do domain name > resolution, in org.apache.tika.parser.geo.topic.GeoParser.initialize(URL) > --- > > Key: TIKA-1978 > URL: https://issues.apache.org/jira/browse/TIKA-1978 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.13 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Critical > Fix For: 1.14 > > > Performance - The equals and hashCode methods of URL are blocking > The equals and hashCode method of URL perform domain name resolution, this > can result in a big performance hit. See > http://michaelscharf.blogspot.com/2006/11/javaneturlequals-and-hashcode-make.html > for more information. Consider using java.net.URI instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
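The suggestion to prefer `java.net.URI` can be sketched as below. This is a minimal illustration, not the change merged in pull request #122: `UrlCompareSketch`, `sameLocation` and the example address are hypothetical. `URI.equals` compares the parsed components of the two values and performs no name resolution, whereas `URL.equals` may resolve host names.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

// Illustrative only: class and method names are hypothetical, not Tika code.
final class UrlCompareSketch {

    /** Compares two locations given as strings, using URI equality (no DNS lookup). */
    static boolean sameLocation(String a, String b) {
        return URI.create(a).equals(URI.create(b));
    }

    /** Compares two URL objects by converting them to URIs first. */
    static boolean sameLocation(URL a, URL b) {
        try {
            return a.toURI().equals(b.toURI());
        } catch (URISyntaxException e) {
            // Fall back to a plain string comparison for URLs that are not valid URIs.
            return a.toExternalForm().equals(b.toExternalForm());
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical model location; constructing a URL does not resolve the host.
        URL first = URI.create("http://example.org/models/model.bin").toURL();
        URL second = URI.create("http://example.org/models/model.bin").toURL();
        System.out.println(sameLocation(first, second));
    }
}
```

The same reasoning applies to `hashCode`, so collections keyed by location (for example a `Map` of already-initialised models) are better keyed by `URI` than by `URL`.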
[jira] [Commented] (TIKA-1993) Image Recognition with Tika
[ https://issues.apache.org/jira/browse/TIKA-1993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15326163#comment-15326163 ] ASF GitHub Bot commented on TIKA-1993: -- GitHub user thammegowda opened a pull request: https://github.com/apache/tika/pull/125 TIKA-1993: ObjectRecognitionParser + Tensorflow image recognition with Inception-V3 model as default implementation Summary of changes: - Fixed TIKA-2002 : ExternalParser.check() empties stdout and stderr buffers so no more hanging is expected - Added ObjectRecognitionParser, ObjectRecogniser, RecognisedObject - A parser, interface and a model class respectively - Implemented TensorFlowImageRecParser - an `ExternalParser` which (if missing) downloads and calls the tensorflow `image_classify.py` script (the script then downloads the Inception-v3 model) --- ## Quick Setup and Test - Install TensorFlow using pip - https://www.tensorflow.org/versions/r0.9/get_started/os_setup.html#pip-installation - Check out the test case `tika-parsers/src/test/java/org/apache/tika/parser/recognition/ObjectRecognitionParserTest.java` ## Demos Compile package : `mvn clean install` # `-DskipTests` if you don't like to wait for tests Let's check - (for animal lovers) On a cat's image at https://raw.githubusercontent.com/apache/tika/master/tika-parsers/src/test/resources/test-documents/testJPEG.jpg ``` java -jar tika-app/target/tika-app-1.14-SNAPSHOT.jar \ --config=tika-parsers/src/test/resources/org/apache/tika/parser/recognition/tika-config-tflow.xml \ tika-parsers/src/test/resources/test-documents/testJPEG.jpg ``` ```xml ``` - (for law-keepers) On a rifle at https://upload.wikimedia.org/wikipedia/commons/thumb/3/38/US_Navy_100714-N-4965F-174_Chief_Mass_Communication_Specialist_Paula_Ludwick%2C_assigned_to_Fleet_Combat_Camera_Group_Pacific%2C_shoots_at_a_target_during_a_Navy_Rifle_Qualification_Course.jpg/220px-thumbnail.jpg ```xml ``` - (for law-keepers) On a revolver at
https://upload.wikimedia.org/wikipedia/commons/8/8d/Glock17.jpg ```xml ``` - (for car enthusiasts) On a car at http://www.trbimg.com/img-57226a08/turbine/ct-tesla-model-3-unveiling-20160404/650/650x366 ```xml ``` NOTE: 1. The most efficient way to make use of tensorflow would be to use C++ api via JNI. I didn't have a chance to learn that stuff so far so help needed to make this efficient. Or else we may wait for tensorflow folks to offer Java bindings! Right now, the image recognition model is loaded and unloaded every time by the script (200MB of disk-read per parse call, very inefficient!). 2. The very first call takes plenty of time as the model is downloaded lazily 3. Only `image/jpeg` is supported. PNG coming later You can merge this pull request into a Git repository by running: $ git pull https://github.com/thammegowda/tika TIKA-1993 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/125.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #125 commit 9b5dc7fae4456b12b75ec21d050b9439e6527c47 Author: Thamme Gowda Date: 2016-06-12T02:04:08Z External Parser now have consumer for ignored lines, Fix TIKA-2002 commit eccc15387f0d4a5c62d8d12e6579878dba2f52a8 Author: Thamme Gowda Date: 2016-06-12T02:04:28Z Added an utility to load and insatiate classes commit 2184e2c2c2a0e507be6be4f9692e0fab5b38a476 Author: Thamme Gowda Date: 2016-06-12T02:04:49Z Object recognition parser, tensorflow based implementation, and test cases for these commit 0305cfb402f5d5e289533411d5737e1e832888ac Author: Thamme Gowda Date: 2016-06-12T02:43:07Z Explicit Locale > Image Recognition with Tika > > > Key: TIKA-1993 > URL: https://issues.apache.org/jira/browse/TIKA-1993 > Project: Tika > Issue Type: New Feature > Components: parser >Reporter: Thamme Gowda > > Create "ImageRecognitionParser" which can have pluggable implementation for > core
recognition logic. > As the name suggests, this parser should detect objects in the images, and > support many implementations + models (similar to what NamedEntityParser did > for text). > Supply a default implementation based on Tensorflow with the current > state-of-the-art model \[1\]. > Links: > \[1\] > https://www.tensorflow.org/versions/r0.8/tutorials/image_recognition/index.html#usage-with-python-api -- This
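On the TIKA-2002 fix mentioned in the pull request above (ExternalParser.check() emptying stdout and stderr so the check no longer hangs): a child process can block once its output pipe fills up and nobody reads from it. The following is a rough sketch of the general technique of draining a stream on a background thread, under the assumption that this is the kind of consumer the commit refers to; it is not Tika's actual code, and StreamDrainSketch and its methods are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not Tika's ExternalParser: keeps reading a stream so that
// whoever writes to the other end never blocks on a full pipe buffer.
final class StreamDrainSketch {
    private StreamDrainSketch() {}

    // Starts a daemon thread that copies everything from 'in' into 'sink' until EOF.
    static Thread drainInBackground(InputStream in, ByteArrayOutputStream sink) {
        Thread worker = new Thread(() -> {
            byte[] buffer = new byte[4096];
            int read;
            try {
                while ((read = in.read(buffer)) != -1) {
                    sink.write(buffer, 0, read);
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }, "stream-drain");
        worker.setDaemon(true);
        worker.start();
        return worker;
    }

    // Convenience wrapper: drains the stream, waits for EOF, returns the text as UTF-8.
    static String drainAndWait(InputStream in) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        Thread worker = drainInBackground(in, sink);
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new String(sink.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

With a real java.lang.Process, one would start one such drainer for process.getInputStream() and another for process.getErrorStream() before calling process.waitFor(), so that neither pipe can fill up while the parent is waiting.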
[jira] [Commented] (TIKA-1986) support parser parameters with type (int, double, etc) in configuration XML file
[ https://issues.apache.org/jira/browse/TIKA-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15301236#comment-15301236 ] ASF GitHub Bot commented on TIKA-1986: -- GitHub user thammegowda opened a pull request: https://github.com/apache/tika/pull/123 TIKA-1986 : support for typed parameters from XML configuration This is a sub-task of TIKA-1508 (please merge #91 first). This extends the configuration file with a place for specifying parameters and their types. + Added a class called `Param` + Relying on JAXB annotations to convert between XML and Java objects + Test case updated to check for types You can merge this pull request into a Git repository by running: $ git pull https://github.com/thammegowda/tika TIKA-1986 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/123.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #123 commit b2cf23178ede925b0ef23f88ebf1aff95c8c157c Author: Thamme Gowda Date: 2016-03-09T02:23:19Z Add uniformity to parser parameter configuration. 1. Added Configurable interface. This can be used for all services like Parser, Detector which can take configurable parameters. 2. Added ConfigurableParser interface which extends Parser interface. I didn't add new method to existing Parser because that will break the compatibility. 3. AbstractParser extends ConfigurableParser and has default implementation for configure() contract. I think it is safe to do so and it doesnt break anything. In addition all parsers which extend AbstractParser will can easily access config from TikaConfig if they want to 3. Added a TODO to TikaConfig, after this should allow multiple instances of same parser with different runtime configurations. 4.
TikaConfig is modified to detect if instance can be configured, if so, then checks if params are available in XML file, parses the params and invokes configure(ctx) method with these params 5. Added DummyConfigurableParser that simply copies parameters to metadata for the sake of testing 6. Added a sample XML config file for testing. Added ConfigurableParserTest that performs an end to end test of all the above. commit ae51417d8881dd90b921f02c2677a7d5bfd69a30 Author: Thamme Gowda Date: 2016-03-09T03:23:47Z remove unwanted TODO: commit 64db9614cfaa3e873a9dc9efc6d201d887f6a4c5 Author: Thamme Gowda Date: 2016-03-12T14:43:44Z Added a TikaConfigException, params getter commit 0d69ca7540b4350e043c5b9ed34d14a46bd70cf7 Author: Thamme Gowda Date: 2016-03-12T14:51:14Z Test Case updated with newer exception and getter commit e780d56652d48dd0f50b4e62a58153e95f055022 Author: Thamme Gowda Date: 2016-05-23T18:30:13Z merged upstream changes and resolved conflicts commit b64612dcdb021fbb8b3fbf31d70a02f1bb7736cb Author: Thamme Gowda Date: 2016-05-23T18:52:55Z Update javadoc with @since commit 01869923533b330ec7728995e3ee5feceee1b90e Author: Thamme Gowda Date: 2016-05-26T00:18:25Z Added support for type for runtime parameters commit 9e08a6bc0a2b2ffad12e4b6f90725b2201d0a69b Author: Thamme Gowda Date: 2016-05-26T00:50:49Z Updated test case with type checking > support parser parameters with type (int, double, etc) in configuration XML > file > > > Key: TIKA-1986 > URL: https://issues.apache.org/jira/browse/TIKA-1986 > Project: Tika > Issue Type: Sub-task > Components: config >Reporter: Thamme Gowda > Fix For: 1.14 > > > Tika Configuration should be enhanced to support for basic types like int, > double, boolean, url, file. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
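To illustrate what typed parameters mean in practice for TIKA-1986: a value read from the XML file arrives as a string and has to be converted according to a declared type (the issue lists int, double, boolean, url, file). The sketch below is a simplified illustration only. It does not use JAXB and is not the `Param` class from the pull request; the class name TypedParamSketch, the method convert, and the exact type strings are assumptions.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URI;

// Hypothetical converter, not Tika's Param class: turns a (type, value) pair of
// strings, as they might appear in a config file, into a typed Java object.
final class TypedParamSketch {
    private TypedParamSketch() {}

    static Object convert(String type, String value) {
        String trimmed = value.trim();
        switch (type) {
            case "int":
                return Integer.valueOf(trimmed);
            case "double":
                return Double.valueOf(trimmed);
            case "boolean":
                return Boolean.valueOf(trimmed);
            case "file":
                return new File(trimmed);
            case "url":
                try {
                    // URI.create validates the syntax; toURL requires an absolute URI.
                    return URI.create(trimmed).toURL();
                } catch (MalformedURLException e) {
                    throw new IllegalArgumentException("Bad url value: " + trimmed, e);
                }
            default:
                // Unknown types fail loudly instead of silently staying strings.
                throw new IllegalArgumentException("Unsupported parameter type: " + type);
        }
    }
}
```

Failing fast on an unknown type or an unparseable value is a design choice of this sketch; it surfaces configuration mistakes at load time rather than during parsing.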
[jira] [Commented] (TIKA-1978) Invocation of java.net.URL.equals(Object), which blocks to do domain name resolution, in org.apache.tika.parser.geo.topic.GeoParser.initialize(URL)
[ https://issues.apache.org/jira/browse/TIKA-1978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15302610#comment-15302610 ] ASF GitHub Bot commented on TIKA-1978: -- GitHub user lewismc opened a pull request: https://github.com/apache/tika/pull/124 TIKA-1978 Invocation of java.net.URL.equals(Object), which blocks to do domain name resolution, in org.apache.tika.parser.geo.topic.GeoParser.initialize(URL) 2.x branch This pull request addresses https://issues.apache.org/jira/browse/TIKA-1978 for the 2.x branch. It also makes a number of additional improvements such as using the diamond operator where possible, throwing and logging the correct exceptions and using the correct syntax for constants. You can merge this pull request into a Git repository by running: $ git pull https://github.com/lewismc/tika TIKA-1978v2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/124.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #124 commit bd3ecfcddeaf13262e477ba29c5256ebd44e32db Author: Lewis John McGibbney Date: 2016-05-26T18:15:02Z TIKA-1978 Invocation of java.net.URL.equals(Object), which blocks to do domain name resolution, in org.apache.tika.parser.geo.topic.GeoParser.initialize(URL) 2.x branch > Invocation of java.net.URL.equals(Object), which blocks to do domain name > resolution, in org.apache.tika.parser.geo.topic.GeoParser.initialize(URL) > --- > > Key: TIKA-1978 > URL: https://issues.apache.org/jira/browse/TIKA-1978 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.13 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Critical > Fix For: 2.0, 1.14 > > > Performance - The equals and hashCode methods of URL are blocking > The equals and hashCode method of URL perform domain name resolution, this > can result in a big performance hit.
See > http://michaelscharf.blogspot.com/2006/11/javaneturlequals-and-hashcode-make.html > for more information. Consider using java.net.URI instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1834) Fix for GeoTopic parser holding state while running Tika server
[ https://issues.apache.org/jira/browse/TIKA-1834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15105609#comment-15105609 ] ASF GitHub Bot commented on TIKA-1834: -- GitHub user smadha opened a pull request: https://github.com/apache/tika/pull/71 fix for TIKA-1834 contributed by msha...@usc.edu You can merge this pull request into a Git repository by running: $ git pull https://github.com/smadha/tika TIKA-1834 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/71.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #71 commit 0154e3067c8a63ab176e4e2161515d2d7d45b8e7 Author: smadha Date: 2016-01-18T18:08:18Z fix for TIKA-1834 contributed by msha...@usc.edu > Fix for GeoTopic parser holding state while running Tika server > --- > > Key: TIKA-1834 > URL: https://issues.apache.org/jira/browse/TIKA-1834 > Project: Tika > Issue Type: Sub-task > Components: parser >Affects Versions: 1.12 > Environment: All >Reporter: Madhav Sharan > Fix For: 1.12 > > > While using TIKA-server we observed that GeoTopic parser started holding > state and returned all the location retrieved for any previous request. > This was happening as mutable object > org.apache.tika.parser.geo.topic.NameEntityExtractor was initialised once and > then was reused by all request. > As part of this fix org.apache.tika.parser.geo.topic.NameEntityExtractor is > recreated for every request. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
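The TIKA-1834 description above is a classic case of a mutable helper being shared across requests in a long-running server. The following toy sketch shows the symptom and the per-request fix in isolation. It is not the GeoTopic parser or NameEntityExtractor; all names (PerRequestStateSketch, LocationCollector, SharedHandler, PerRequestHandler) are hypothetical, and the capitalised-word rule merely stands in for real entity extraction.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the bug class described in TIKA-1834, not Tika code.
final class PerRequestStateSketch {
    private PerRequestStateSketch() {}

    // Mutable helper that accumulates results across calls to collect().
    static final class LocationCollector {
        private final List<String> locations = new ArrayList<>();

        void collect(String text) {
            for (String token : text.split("\\s+")) {
                // Stand-in heuristic: treat capitalised tokens as locations.
                if (!token.isEmpty() && Character.isUpperCase(token.charAt(0))) {
                    locations.add(token);
                }
            }
        }

        List<String> locations() {
            return new ArrayList<>(locations);
        }
    }

    // Symptom: one collector created once and reused, so old results leak into new requests.
    static final class SharedHandler {
        private final LocationCollector shared = new LocationCollector();

        List<String> handle(String text) {
            shared.collect(text);
            return shared.locations();
        }
    }

    // Fix in the spirit of the issue: create a fresh collector for every request.
    static final class PerRequestHandler {
        List<String> handle(String text) {
            LocationCollector fresh = new LocationCollector();
            fresh.collect(text);
            return fresh.locations();
        }
    }
}
```

Calling SharedHandler.handle("Paris") and then handle("Tokyo") on the same instance returns both locations for the second request, while PerRequestHandler returns only the locations of the request at hand. Creating the helper per request also sidesteps thread-safety concerns when a server handles requests concurrently.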
[jira] [Commented] (TIKA-1834) Fix for GeoTopic parser holding state while running Tika server
[ https://issues.apache.org/jira/browse/TIKA-1834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15105626#comment-15105626 ] ASF GitHub Bot commented on TIKA-1834: -- Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/71 > Fix for GeoTopic parser holding state while running Tika server > --- > > Key: TIKA-1834 > URL: https://issues.apache.org/jira/browse/TIKA-1834 > Project: Tika > Issue Type: Sub-task > Components: parser >Affects Versions: 1.12 > Environment: All >Reporter: Madhav Sharan >Assignee: Chris A. Mattmann > Fix For: 1.12 > > > While using TIKA-server we observed that GeoTopic parser started holding > state and returned all the location retrieved for any previous request. > This was happening as mutable object > org.apache.tika.parser.geo.topic.NameEntityExtractor was initialised once and > then was reused by all request. > As part of this fix org.apache.tika.parser.geo.topic.NameEntityExtractor is > recreated for every request. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-2016) A parser that combines Apache OpenNLP and Apache Tika and provides facilities for automatically deriving sentiment from text.
[ https://issues.apache.org/jira/browse/TIKA-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15351373#comment-15351373 ] ASF GitHub Bot commented on TIKA-2016: -- GitHub user amensiko opened a pull request: https://github.com/apache/tika/pull/127 creation of TIKA-2016 contributed by amensiko Sentiment Analysis Parser You can merge this pull request into a Git repository by running: $ git pull https://github.com/amensiko/tika TIKA-2016 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/127.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #127 commit be8b433d2994af4e0d95de3bc5ab4fea99ee6bcd Author: amensiko Date: 2016-06-25T19:29:20Z creation of TIKA-2016 contributed by amensiko > A parser that combines Apache OpenNLP and Apache Tika and provides facilities > for automatically deriving sentiment from text. > - > > Key: TIKA-2016 > URL: https://issues.apache.org/jira/browse/TIKA-2016 > Project: Tika > Issue Type: New Feature > Components: parser >Reporter: Anastasija Mensikova > Labels: analysis, parser, sentiment > > A new project that implements a parser that uses Apache OpenNLP and Apache > Tika to perform Sentiment Analysis. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-2021) Improving accuracy of Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15349029#comment-15349029 ] ASF GitHub Bot commented on TIKA-2021: -- GitHub user Zarana-Parekh opened a pull request: https://github.com/apache/tika/pull/126 fix for TIKA-2021 contributed by Zarana Parekh Improving accuracy of Tesseract for better extraction of numeric and alphanumeric text from images. You can merge this pull request into a Git repository by running: $ git pull https://github.com/Zarana-Parekh/tika TIKA-2021 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/126.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #126 commit 48b27d219f791ee14f1e0ffa18e4e80583f3df54 Author: Zarana Parekh Date: 2016-06-25T01:53:00Z fix for TIKA-2021 contributed by Zarana Parekh > Improving accuracy of Tesseract parser > -- > > Key: TIKA-2021 > URL: https://issues.apache.org/jira/browse/TIKA-2021 > Project: Tika > Issue Type: Improvement >Reporter: Zarana Parekh > > Tesseract OCR parser works well with images containing English text. However, > there is possibility of improvement in case of alphanumeric and numeric > content which require training Tesseract with the relevant cases in order to > better extract content from images. Such a customization can be helpful in > extraction of serial numbers from images of counterfeit electronics and other > applications focussing on atypical textual content. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1869) Jackson update to latest version
[ https://issues.apache.org/jira/browse/TIKA-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15162969#comment-15162969 ] ASF GitHub Bot commented on TIKA-1869: -- GitHub user nhojpatrick opened a pull request: https://github.com/apache/tika/pull/75 TIKA-1869 update Jackson to latest version 2.7.1 You can merge this pull request into a Git repository by running: $ git pull https://github.com/nhojpatrick/tika bugfix/TIKA-1869 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/75.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #75 commit 13d772ab6317c6151b4b4e6111b80af2bd30cd7b Author: John Patrick Date: 2016-02-24T13:19:53Z TIKA-1869 update Jackson to latest version 2.7.1 > Jackson update to latest version > > > Key: TIKA-1869 > URL: https://issues.apache.org/jira/browse/TIKA-1869 > Project: Tika > Issue Type: Bug > Components: translation >Affects Versions: 1.11, 1.12 >Reporter: John Patrick > Labels: github-import, newbie, patch > Fix For: 1.13 > > > Linked to TIKA-1868 this is to update the version of Jackson used from 2.4.0 > to 2.7.1. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1868) create clean tika-server jar and shaded classifier jar
[ https://issues.apache.org/jira/browse/TIKA-1868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15162977#comment-15162977 ] ASF GitHub Bot commented on TIKA-1868: -- GitHub user nhojpatrick opened a pull request: https://github.com/apache/tika/pull/76 TIKA-1868 tika-server split into clean and standalone jar I understand, based on the mailing list thread and the JIRA issue, that this might be rejected. But this is the change I was intending to make; my original email was to understand whether tika-server is meant to be a shaded jar, which it appears it was intended to be. But if you need to use classes that live only within tika-server, it does make it harder to write custom code. If the guts of tika-server were put into another module, maybe tika-server-internals, then those who really need to use classes that live only in tika-server could use tika-server-internals, and tika-server could be simply a shaded jar. Just a thought. You can merge this pull request into a Git repository by running: $ git pull https://github.com/nhojpatrick/tika bugfix/TIKA-1868 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/76.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #76 commit 9b106210fa8be284b47ab5b904dcf83b0f175308 Author: John Patrick Date: 2016-02-24T12:33:56Z TIKA-1868 tika-server split into clean and standalone jar > create clean tika-server jar and shaded classifier jar > -- > > Key: TIKA-1868 > URL: https://issues.apache.org/jira/browse/TIKA-1868 > Project: Tika > Issue Type: Bug > Components: server >Affects Versions: 1.11, 1.12 > Environment: n/a >Reporter: John Patrick > Labels: github-import, maven, newbie, patch > Fix For: 1.13 > > > If using tika-server-VERSION.jar as a standalone component it works. 
But if > you use it as a dependency so is included with other jars then it causes > classpath issues specifically around jackson.
> The project I'm working on is using Jackson 2.6.1, we have just added tika > but when adding tika-server-VERSION.jar we have discovered it contains > Jackson 2.4.0 classes. > I've update the maven build so two jar's are now created. > 1) tika-server-VERSION.jar correct clean jar > 2) tika-server-VERSION-standalone.jar what was previously created > This in my view is more inline with how maven should be being used to create > jars as the previous way restricted the consumers ability to override maven > dependencies. > I've also updated the documentation in source control that refs to > tika-server to include the new tika-server standalone jar. I realize other > documentation might also need to change. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1868) create clean tika-server jar and shaded classifier jar
[ https://issues.apache.org/jira/browse/TIKA-1868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15163113#comment-15163113 ] ASF GitHub Bot commented on TIKA-1868: -- Github user nhojpatrick closed the pull request at: https://github.com/apache/tika/pull/76 > create clean tika-server jar and shaded classifier jar > -- > > Key: TIKA-1868 > URL: https://issues.apache.org/jira/browse/TIKA-1868 > Project: Tika > Issue Type: Bug > Components: server >Affects Versions: 1.11, 1.12 > Environment: n/a >Reporter: John Patrick > Labels: github-import, maven, newbie, patch > Fix For: 1.13 > > > If using tika-server-VERSION.jar as a standalone component it works. But if > you use it as a dependency so is included with other jars then it causes > classpath issues specifically around jackson. > The project I'm working on is using Jackson 2.6.1, we have just added tika > but when adding tika-server-VERSION.jar we have discovered it contains > Jackson 2.4.0 classes. > I've update the maven build so two jar's are now created. > 1) tika-server-VERSION.jar correct clean jar > 2) tika-server-VERSION-standalone.jar what was previously created > This in my view is more inline with how maven should be being used to create > jars as the previous way restricted the consumers ability to override maven > dependencies. > I've also updated the documentation in source control that refs to > tika-server to include the new tika-server standalone jar. I realize other > documentation might also need to change. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1869) Jackson update to latest version
[ https://issues.apache.org/jira/browse/TIKA-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15163133#comment-15163133 ] ASF GitHub Bot commented on TIKA-1869: -- Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/75 > Jackson update to latest version > > > Key: TIKA-1869 > URL: https://issues.apache.org/jira/browse/TIKA-1869 > Project: Tika > Issue Type: Bug > Components: translation >Affects Versions: 1.11, 1.12 >Reporter: John Patrick > Labels: github-import, newbie, patch > Fix For: 1.13 > > > Linked to TIKA-1868 this is to update the version of Jackson used from 2.4.0 > to 2.7.1. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1870) Relocating RichTextContentHandler into tika-core from tika-server
[ https://issues.apache.org/jira/browse/TIKA-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15163127#comment-15163127 ] ASF GitHub Bot commented on TIKA-1870: -- GitHub user nhojpatrick opened a pull request: https://github.com/apache/tika/pull/77 TIKA-1870 refactor RichTextContentHandler into tika-core from tika-se… …rver so users if needing it don't need to depend upon tika-server You can merge this pull request into a Git repository by running: $ git pull https://github.com/nhojpatrick/tika bugfix/TIKA-1870 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/77.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #77 commit 0bd05cec54c581c971d90380304aaa23c9543296 Author: John Patrick Date: 2016-02-24T14:50:38Z TIKA-1870 refactor RichTextContentHandler into tika-core from tika-server so users if needing it don't need to depend upon tika-server > Relocating RichTextContentHandler into tika-core from tika-server > - > > Key: TIKA-1870 > URL: https://issues.apache.org/jira/browse/TIKA-1870 > Project: Tika > Issue Type: Bug > Components: core, server >Reporter: John Patrick > Labels: newbie, patch > Fix For: 1.13 > > > linked to TIKA-1868, different solution by refactoring class into tika-core > so don't need to depend upon tika-server and changing other classes used to > custom ones or other alternatives. -- This message was sent by Atlassian JIRA (v6.3.4#6332)