[jira] [Commented] (TIKA-1241) Tika does not recognise empty nor spanning ZIP files magic

2014-02-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13910315#comment-13910315
 ] 

ASF GitHub Bot commented on TIKA-1241:
--

Github user cstamas closed the pull request at:

https://github.com/apache/tika/pull/4


 Tika does not recognise empty nor spanning ZIP files magic
 --

 Key: TIKA-1241
 URL: https://issues.apache.org/jira/browse/TIKA-1241
 Project: Tika
  Issue Type: Improvement
Reporter: Cservenak, Tamas
Priority: Minor
 Fix For: 1.6


 As it turns out, the magic bytes differ for non-empty, empty and
 spanning ZIP files. Tika recognizes only the non-empty ZIP files.
 The magic for an empty ZIP file was verified with hexdump:
 https://gist.github.com/cstamas/6e90ae73f83c8e4a3f42
 It is also described on Wikipedia:
 http://en.wikipedia.org/wiki/Zip_(file_format)
 (see the sidebar entry with the magic numbers)
 Proposed change:
 add two more match entries to the ZIP MIME definition:
 https://github.com/apache/tika/pull/4
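
 To make the three cases concrete, here is a minimal sketch that classifies a
 byte prefix by its ZIP signature. It is not the change from the pull request:
 the class and method names are hypothetical, and the byte values are the
 standard ZIP record signatures (local file header, end of central directory,
 spanning marker).

```java
/** Illustrative only: classifies a byte prefix by the three ZIP signatures. */
final class ZipMagicSketch {
    // "PK\003\004": local file header, the first record of a non-empty archive.
    private static final byte[] NON_EMPTY = {0x50, 0x4B, 0x03, 0x04};
    // "PK\005\006": end of central directory, the only record of an empty archive.
    private static final byte[] EMPTY = {0x50, 0x4B, 0x05, 0x06};
    // "PK\007\010": marker at the start of a spanned archive.
    private static final byte[] SPANNED = {0x50, 0x4B, 0x07, 0x08};

    /** Returns "non-empty", "empty", "spanned", or "not a zip". */
    static String classify(byte[] prefix) {
        if (startsWith(prefix, NON_EMPTY)) return "non-empty";
        if (startsWith(prefix, EMPTY)) return "empty";
        if (startsWith(prefix, SPANNED)) return "spanned";
        return "not a zip";
    }

    // True if data begins with every byte of magic, in order.
    private static boolean startsWith(byte[] data, byte[] magic) {
        if (data == null || data.length < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) return false;
        }
        return true;
    }
}
```

 A detector that only checks the first signature would report the other two
 kinds of file as unknown, which matches the behaviour described above.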



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (TIKA-1169) Fails to parse jnilib file

2014-05-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999825#comment-13999825
 ] 

ASF GitHub Bot commented on TIKA-1169:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/8


 Fails to parse jnilib file
 --

 Key: TIKA-1169
 URL: https://issues.apache.org/jira/browse/TIKA-1169
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
 Environment: Windows 7 x64, Java 1.6.45
Reporter: brat
Priority: Critical
 Fix For: 1.5

 Attachments: libwrapper.jnilib


 Hi,
 I'm trying to parse a folder that contains a jnilib file, but Tika 1.4 throws 
 an exception:
 java.io.IOException: 
   at org.apache.tika.parser.ParsingReader.read(ParsingReader.java:260)
   at java.io.Reader.read(Unknown Source)
   at ca.cloudscraper.core.impl.Engine.process(Engine.java:63)
   at ca.cloudscraper.core.impl.Engine.process(Engine.java:34)
   at ca.cloudscraper.core.impl.Engine.process(Engine.java:34)
   at ca.cloudscraper.core.impl.Engine.process(Engine.java:34)
   at ca.cloudscraper.core.impl.Engine.execute(Engine.java:117)
   at ca.cloudscraper.core.tests.LuceneServiceImplTest.test5(LuceneServiceImplTest.java:140)
   at ca.cloudscraper.core.tests.LuceneServiceImplTest.main(LuceneServiceImplTest.java:176)
 Caused by: org.apache.tika.exception.TikaException: Failed to parse a Java class
   at org.apache.tika.parser.asm.XHTMLClassVisitor.parse(XHTMLClassVisitor.java:66)
   at org.apache.tika.parser.asm.ClassParser.parse(ClassParser.java:51)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
   at org.apache.tika.parser.ParsingReader$ParsingTask.run(ParsingReader.java:221)
   at java.lang.Thread.run(Unknown Source)
 Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
   at org.objectweb.asm.ClassReader.readClass(ClassReader.java:2157)
   at org.objectweb.asm.ClassReader.accept(ClassReader.java:542)
   at org.objectweb.asm.ClassReader.accept(ClassReader.java:506)
   at org.apache.tika.parser.asm.XHTMLClassVisitor.parse(XHTMLClassVisitor.java:61)
   ... 6 more
 It seems that Tika tries to parse this file as a Java class file, which 
 obviously doesn't work.
 I've tried to create a custom-mimetypes.xml file like this:
 <?xml version="1.0" encoding="UTF-8"?>
 <mime-info>
   <mime-type type="application/octet-stream">
     <_comment>Mac OSX jnilib</_comment>
     <glob pattern="*.jnilib"/>
   </mime-type>
 </mime-info>
 but after I repack tika-app-1.4.jar with this file in the org.apache.tika.mime 
 folder, the problem still exists.
 The jnilib file is from the ActiveMQ 5.8.0 binary distribution, found in 
 bin/macosx/libwrapper.jnilib





[jira] [Commented] (TIKA-1169) Fails to parse jnilib file

2014-05-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999705#comment-13999705
 ] 

ASF GitHub Bot commented on TIKA-1169:
--

GitHub user mkr opened a pull request:

https://github.com/apache/tika/pull/8

TIKA-1169: Adding other Mach-O magic bytes for jnilib files.

Adding the remaining Mach-O binary signatures to fix TIKA-1169

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mkr/tika tika-1169-2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/8.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #8


commit 8b033abfa19a3cb8939ac67bbabcd4d53e89ec42
Author: Matthias Krueger m...@mkr.io
Date:   2014-05-16T08:58:40Z

TIKA-1169: Adding other Mach-O magic bytes for jnilib files.
MH_MAGIC = 0xfeedface
MH_CIGAM = 0xcefaedfe
MH_MAGIC_64 = 0xfeedfacf
MH_CIGAM_64 = 0xcffaedfe
 See 
https://developer.apple.com/library/mac/documentation/DeveloperTools/Conceptual/MachORuntime/Reference/reference.html
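
For illustration, the four signatures listed in the commit message can be
checked against the first four bytes of a file as follows. This is a sketch
under stated assumptions, not the code from the pull request (which changes
the MIME definitions): the class and method names are hypothetical, and the
bytes are read as a big-endian 32-bit value so that the byte-swapped variants
appear as the CIGAM constants.

```java
/** Illustrative only: tests a 4-byte header against the Mach-O signatures. */
final class MachOMagicSketch {
    static final int MH_MAGIC = 0xfeedface;     // 32-bit
    static final int MH_CIGAM = 0xcefaedfe;     // 32-bit, byte-swapped
    static final int MH_MAGIC_64 = 0xfeedfacf;  // 64-bit
    static final int MH_CIGAM_64 = 0xcffaedfe;  // 64-bit, byte-swapped

    /** Combines the first four bytes into a big-endian int. */
    static int readMagic(byte[] header) {
        if (header == null || header.length < 4) {
            throw new IllegalArgumentException("need at least 4 bytes");
        }
        return ((header[0] & 0xFF) << 24)
                | ((header[1] & 0xFF) << 16)
                | ((header[2] & 0xFF) << 8)
                | (header[3] & 0xFF);
    }

    /** True if the header starts with one of the four signatures above. */
    static boolean isMachO(byte[] header) {
        if (header == null || header.length < 4) return false;
        int magic = readMagic(header);
        return magic == MH_MAGIC || magic == MH_CIGAM
                || magic == MH_MAGIC_64 || magic == MH_CIGAM_64;
    }
}
```

The Java class-file magic 0xCAFEBABE is not among these four values, so a
header check like this keeps the two formats apart.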






[jira] [Commented] (TIKA-1292) Inconsistent priorities in bundled tika-mimetypes.xml

2014-05-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14004730#comment-14004730
 ] 

ASF GitHub Bot commented on TIKA-1292:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/6


 Inconsistent priorities in bundled tika-mimetypes.xml
 -

 Key: TIKA-1292
 URL: https://issues.apache.org/jira/browse/TIKA-1292
 Project: Tika
  Issue Type: Bug
  Components: mime
Affects Versions: 1.5
Reporter: Cservenak, Tamas

 It seems that the MIME type priorities are a bit inconsistent in the 
 tika-mimetypes.xml bundled with tika-core.
 A few examples:
 * 
 [application/zip|https://github.com/apache/tika/blob/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L3497]
  vs 
 [application/x-7z-compressed|https://github.com/apache/tika/blob/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L3510]:
  both are similar container archive formats (structured, having entries) 
 with distinct file extensions (zip vs 7z globs), yet their priorities are 
 40 and 50 respectively.
 * 
 [application/zip|https://github.com/apache/tika/blob/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L3497]
  vs 
 [text/html|https://github.com/apache/tika/blob/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L4713]:
  these are unrelated MIME types with the same priority of 40. But ZIP files can 
 be uncompressed (meaning entries are mostly concatenated, and their 
 content, if plaintext, is readable). Hence an uncompressed ZIP file (or 
 any subclass like JAR) that contains HTML files might be 
 detected as HTML, which is wrong. 
 This is what happens in Nexus, which uses Tika under the hood for content 
 validation, basically using the MIME magic detection provided by the Tika Detector: 
 the Java JAR {{com.intellij:annotations:7.0.3}} 
 ([link|http://repo1.maven.org/maven2/com/intellij/annotations/7.0.3/]) is 
 detected as {{text/html}} instead of the expected 
 {{application/java-archive}}.
 The reason is the following: the JAR file is stored in uncompressed ZIP format, 
 and among a few annotations it also contains one HTML file entry (the license, I 
 guess). Since both MIME types have the same priority (40), I guess Tika 
 randomly chooses {{text/html}}.
 Original Nexus issue:
 https://issues.sonatype.org/browse/NEXUS-6560
 The Nexus issue links a GitHub pull request that solves the problem for us (by 
 raising the {{application/zip}} priority to 41).
 But while inspecting the bundled tika-mimetypes.xml we spotted other probable 
 priority inconsistencies, like that of zip vs 7z mentioned above.
 Note: this happens when only tika-core is on the classpath and is used for 
 MIME magic detection. Interestingly, when tika-parsers (with all its 
 dependencies) is added to the classpath, Tika properly figures out that the 
 artifact is {{application/java-archive}}. Still, our use case in Nexus 
 requires only MIME magic detection, so we do not use tika-parsers, nor would we 
 like to.
 Sample project to reproduce:
 https://github.com/cstamas/tika-1292
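
 The effect of equal priorities can be shown with a small sketch. It does not
 reproduce Tika's actual selection logic, which this report only guesses at;
 the class, the "ambiguous" result and the fallback type are assumptions made
 for illustration. Only the priorities 40 and 41 come from the report.

```java
/** Illustrative only: picks the highest-priority match and flags ties. */
final class PrioritySketch {
    /** Hypothetical pairing of a MIME type with its magic priority. */
    static final class Candidate {
        final String type;
        final int priority;

        Candidate(String type, int priority) {
            this.type = type;
            this.priority = priority;
        }
    }

    /**
     * Returns the type with the highest priority, or "ambiguous" when two or
     * more matching types share that priority.
     */
    static String resolve(Candidate... matches) {
        Candidate best = null;
        boolean tie = false;
        for (Candidate c : matches) {
            if (best == null || c.priority > best.priority) {
                best = c;
                tie = false;
            } else if (c.priority == best.priority) {
                tie = true;
            }
        }
        if (best == null) return "application/octet-stream"; // nothing matched
        return tie ? "ambiguous" : best.type;
    }
}
```

 With both types at 40 the sketch reports a tie; with application/zip raised to
 41, as in the Nexus workaround, it resolves to application/zip.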





[jira] [Commented] (TIKA-1292) Inconsistent priorities in bundled tika-mimetypes.xml

2014-05-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14005933#comment-14005933
 ] 

ASF GitHub Bot commented on TIKA-1292:
--

Github user cstamas closed the pull request at:

https://github.com/apache/tika/pull/7




[jira] [Commented] (TIKA-1322) XML file parse errors within archives trigger Zip bomb detection

2014-06-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14018247#comment-14018247
 ] 

ASF GitHub Bot commented on TIKA-1322:
--

GitHub user mkr opened a pull request:

https://github.com/apache/tika/pull/9

TIKA-1322: Properly close XMLParser's output in case of SAXException.

Fix and test for https://issues.apache.org/jira/browse/TIKA-1322.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mkr/tika TIKA-1322

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/9.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #9


commit 63d979538a72e5c044b2074219268da57fcf48cd
Author: Matthias Krueger m...@mkr.io
Date:   2014-06-04T21:45:15Z

TIKA-1322: Properly close XMLParser's output in case of SAXException.




 XML file parse errors within archives trigger Zip bomb detection
 

 Key: TIKA-1322
 URL: https://issues.apache.org/jira/browse/TIKA-1322
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.5
Reporter: Matthias Krueger
Priority: Minor

 Tika parses XML input using org.apache.tika.parser.xml.XMLParser. XMLParser 
 opens a <p> tag before the SAXParser's output for the input XML is appended. If a 
 SAXException occurs during parsing, it is rethrown but the opened <p> tag is not 
 closed. The zip bomb detection in SecureContentHandler relies on consistent 
 opening and closing of elements. With the current behaviour of XMLParser it 
 is triggered, for example, if an archive contains 10 
 (SecureContentHandler#maxPackageEntryDepth) invalid XML files.
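
 The mechanism can be reproduced in isolation. The following sketch is not the
 XMLParser or SecureContentHandler code: DepthTracker is a hypothetical
 stand-in for a handler that counts open elements, and a try/finally block is
 one possible way (an assumption here) to keep start and end calls balanced
 when parsing fails.

```java
/** Illustrative only: shows how an unclosed element leaks nesting depth. */
final class BalancedElementSketch {
    /** Hypothetical stand-in for a handler that tracks element nesting. */
    static final class DepthTracker {
        int depth = 0;

        void startElement(String name) { depth++; }

        void endElement(String name) { depth--; }
    }

    /** Opens an element, runs the body, and closes it even if the body throws. */
    static void writeBalanced(DepthTracker out, Runnable body) {
        out.startElement("p");
        try {
            body.run();
        } finally {
            out.endElement("p");
        }
    }

    /** Same sequence without finally: a failing body leaves the element open. */
    static void writeUnbalanced(DepthTracker out, Runnable body) {
        out.startElement("p");
        body.run();
        out.endElement("p");
    }

    /** Processes the given number of failing entries and returns the remaining depth. */
    static int depthAfterFailures(int entries, boolean balanced) {
        DepthTracker out = new DepthTracker();
        Runnable failing = () -> { throw new IllegalStateException("invalid XML"); };
        for (int i = 0; i < entries; i++) {
            try {
                if (balanced) {
                    writeBalanced(out, failing);
                } else {
                    writeUnbalanced(out, failing);
                }
            } catch (IllegalStateException e) {
                // The caller skips the broken entry and continues with the next one.
            }
        }
        return out.depth;
    }
}
```

 In the unbalanced variant every failed entry adds one level of apparent
 nesting, so ten invalid files look like ten nested elements to a
 depth-based check.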





[jira] [Commented] (TIKA-1322) XML file parse errors within archives trigger Zip bomb detection

2014-06-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14019737#comment-14019737
 ] 

ASF GitHub Bot commented on TIKA-1322:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/9


 XML file parse errors within archives trigger Zip bomb detection
 

 Key: TIKA-1322
 URL: https://issues.apache.org/jira/browse/TIKA-1322
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.5
Reporter: Matthias Krueger
Priority: Minor
 Fix For: 1.6




[jira] [Commented] (TIKA-1336) Provide a Detector JAXRS endpoint

2014-06-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14031661#comment-14031661
 ] 

ASF GitHub Bot commented on TIKA-1336:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/10


 Provide a Detector JAXRS endpoint
 -

 Key: TIKA-1336
 URL: https://issues.apache.org/jira/browse/TIKA-1336
 Project: Tika
  Issue Type: Improvement
  Components: detector, server
Affects Versions: 1.5
Reporter: Nick Burch
Assignee: Chris A. Mattmann
 Fix For: 1.6


 As identified in TIKA-1335, the Tika Server now has an endpoint which will 
 tell you what Detectors are available to it, but not one that will trigger 
 detection. That means your only way to do detection is to request the 
 metadata and check the content type, but that isn't always as accurate as an 
 explicit detection call (e.g. if a general parser picks up the file).
 We should therefore add a new endpoint that just does the detection.





[jira] [Commented] (TIKA-1350) OutlookPSTParser: Unknown message type: IPM.Note

2014-06-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14038862#comment-14038862
 ] 

ASF GitHub Bot commented on TIKA-1350:
--

GitHub user jrhe opened a pull request:

https://github.com/apache/tika/pull/12

Bumps libpst version to fix TIKA-1350

When parsing some emails in a PST file I get the error "Unknown message 
type: IPM.Note", which prevents them from being parsed. This is caused by an extra 
null byte at the end of the message class string.
This has been fixed in version 0.8.1 of java-libpst, so a version bump is 
all that is required. 
https://github.com/rjohnsondev/java-libpst/issues/14
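
The failure mode is easy to reproduce outside the library. The sketch below is
not the java-libpst fix, whose details are not given here; the class and
method names are hypothetical. It only shows why a trailing null character
defeats an exact string comparison and how stripping it restores the match.

```java
/** Illustrative only: a trailing NUL makes "IPM.Note" compare as unequal. */
final class MessageClassSketch {
    /** Removes any trailing NUL characters (U+0000) from the string. */
    static String stripTrailingNuls(String s) {
        int end = s.length();
        while (end > 0 && s.charAt(end - 1) == '\0') {
            end--;
        }
        return s.substring(0, end);
    }

    /** True if the message class is IPM.Note once trailing NULs are ignored. */
    static boolean isNote(String messageClass) {
        return "IPM.Note".equals(stripTrailingNuls(messageClass));
    }
}
```

A plain "IPM.Note".equals(value) returns false for a value that ends in a null
character, even though both strings look identical when printed.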

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jrhe/tika TIKA-1350

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/12.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #12


commit 29272a0f4abb01407b795426e563cba8ed134107
Author: Jonathan Richard Henry Evans (JRHE) cont...@jrhe.co.uk
Date:   2014-06-20T14:53:21Z

Bumps libpst version to fix TIKA-1350




 OutlookPSTParser: Unknown message type: IPM.Note
 

 Key: TIKA-1350
 URL: https://issues.apache.org/jira/browse/TIKA-1350
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.7
Reporter: Jonathan Evans
  Labels: libpst, parser, pst
 Fix For: 1.7

   Original Estimate: 0.2h
  Remaining Estimate: 0.2h

 When parsing some emails in a PST file I get the error "Unknown message type: 
 IPM.Note", which prevents them from being parsed. This is caused by an extra null 
 byte at the end of the message class string.
 This has been fixed in version 0.8.1 of java-libpst, so a version bump is all 
 that is required. 
 https://github.com/rjohnsondev/java-libpst/issues/14
 I would attempt to do this myself, but I am unsure how to open a pull request 
 with SVN.





[jira] [Commented] (TIKA-1354) ForkParser doesn't work in OSGI container

2014-06-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14042129#comment-14042129
 ] 

ASF GitHub Bot commented on TIKA-1354:
--

GitHub user hlavki opened a pull request:

https://github.com/apache/tika/pull/13

[TIKA-1354] Register ForkParser service in Activator and add simple test

There may be another way, but I didn't find it. It would be good if 
somebody with more OSGI knowledge also took a look at it.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/hlavki/tika trunk

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/13.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13


commit cc2f328365e649e1290ce23e48f43376c5d42687
Author: Michal Hlavac hla...@hlavki.eu
Date:   2014-06-24T13:26:42Z

[TIKA-1354] Register ForkParser service in Activator and add simple test




 ForkParser doesn't work in OSGI container
 -

 Key: TIKA-1354
 URL: https://issues.apache.org/jira/browse/TIKA-1354
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.6
Reporter: Michal Hlavac

 I can't find a way to run ForkParser in an OSGI container.





[jira] [Commented] (TIKA-1350) OutlookPSTParser: Unknown message type: IPM.Note

2014-06-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14042339#comment-14042339
 ] 

ASF GitHub Bot commented on TIKA-1350:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/12




[jira] [Commented] (TIKA-1361) Update MP4Parser to 1.0.2

2014-07-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14073543#comment-14073543
 ] 

ASF GitHub Bot commented on TIKA-1361:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/14


 Update MP4Parser to 1.0.2
 -

 Key: TIKA-1361
 URL: https://issues.apache.org/jira/browse/TIKA-1361
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.5
Reporter: Matthias Krueger

 The currently used com.googlecode.mp4parser:isoparser version is 1.0-RC-1. 
 According to https://code.google.com/p/mp4parser/#Changes/Releases and 
 https://code.google.com/p/mp4parser/source/list there have been quite a few 
 improvements since then. Before tackling more metadata (such as in TIKA-852) 
 we should update to 1.0.2.





[jira] [Commented] (TIKA-1369) Date parsing and thread safety in ImageMetadataExtractor

2014-07-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077584#comment-14077584
 ] 

ASF GitHub Bot commented on TIKA-1369:
--

GitHub user vilmospapp opened a pull request:

https://github.com/apache/tika/pull/15

TIKA-1369 Resolve thread safety issue in ImageMetadataExtractor 

Hi,

This fix tries to resolve TIKA-1369 by handling thread safety with a 
ThreadLocal, which avoids additional library dependencies.

I have run the test cases and it seems correct to me, though I haven't 
found any other occurrence of ThreadLocal in Tika's source, so perhaps it's 
against your general patterns.

Regards,
Vilmos

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/vilmospapp/tika TIKA-1369

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/15.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15


commit 3a9575fc56a6463b4378b14820e9079352bb1848
Author: Vilmos Papp papp.gyorgy.vil...@gmail.com
Date:   2014-07-23T09:18:50Z

TIKA-1369 Make SimpleDateFormat usage thread safe




 Date parsing and thread safety in ImageMetadataExtractor
 

 Key: TIKA-1369
 URL: https://issues.apache.org/jira/browse/TIKA-1369
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.5
 Environment: OS X 10.9.4 Java 7_60
Reporter: John Gibson
Priority: Critical

 The {{ImageMetadataExtractor}} uses a static instance of 
 {{SimpleDateFormat}}.  This is not thread safe.
 {code:title=ImageMetadataExtractor.java}
 static class ExifHandler implements DirectoryHandler {
     private static final SimpleDateFormat DATE_UNSPECIFIED_TZ =
             new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
     ...
     public void handleDateTags(Directory directory, Metadata metadata)
             throws MetadataException {
         // Date/Time Original overrides value from ExifDirectory.TAG_DATETIME
         Date original = null;
         if (directory.containsTag(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL)) {
             original = directory.getDate(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL);
             // Unless we have GPS time we don't know the time zone so date must be set
             // as ISO 8601 datetime without timezone suffix (no Z or +/-)
             if (original != null) {
                 // Same time zone as Metadata Extractor uses
                 String datetimeNoTimeZone = DATE_UNSPECIFIED_TZ.format(original);
                 metadata.set(TikaCoreProperties.CREATED, datetimeNoTimeZone);
                 metadata.set(Metadata.ORIGINAL_DATE, datetimeNoTimeZone);
             }
         }
 ...
 {code}
 This is not the first time that SDF has caused problems: see TIKA-495 and TIKA-864. 
 In the discussion there, the idea of using alternative thread-safe (and 
 faster) formatters from either Joda time or Commons Lang was dismissed 
 because they would add too many dependencies. Given that Tika already has a 
 fairly large laundry list of dependencies to parse content, adding one more 
 JAR to make sure things don't break is probably a good idea.
 In addition, because no timezone or locale is specified by either Tika's 
 formatter or the call to com.drew.metadata.Directory, it can wreak havoc 
 during randomized testing. Given that the timezone is unknown, why not just 
 default it to UTC and let the caller guess the timezone? As it stands I have 
 to reparse all of the dates into UTC to get stable behavior across timezones.
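
 One dependency-free way to address both points is sketched below. It is an
 illustration, not the patch from the pull request: the class and method names
 are hypothetical, and pinning the formatter to UTC and Locale.ROOT follows
 the suggestion above rather than any confirmed Tika behaviour.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

/** Illustrative only: per-thread SimpleDateFormat with a fixed zone and locale. */
final class ThreadSafeDateSketch {
    // Each thread lazily gets its own formatter, so no instance is ever shared.
    private static final ThreadLocal<SimpleDateFormat> ISO_NO_TZ =
            ThreadLocal.withInitial(() -> {
                SimpleDateFormat format =
                        new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss", Locale.ROOT);
                // Fixed zone: output no longer depends on the JVM default.
                format.setTimeZone(TimeZone.getTimeZone("UTC"));
                return format;
            });

    /** Formats the date as an ISO 8601 datetime without a timezone suffix. */
    static String format(Date date) {
        return ISO_NO_TZ.get().format(date);
    }
}
```

 Because the zone and locale are fixed, the same Date yields the same string on
 every machine, which keeps randomized tests stable.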





[jira] [Commented] (TIKA-1435) Update rome dependency to 1.5

2014-10-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157837#comment-14157837
 ] 

ASF GitHub Bot commented on TIKA-1435:
--

GitHub user jotomo opened a pull request:

https://github.com/apache/tika/pull/16

TIKA-1435: Upgrade Rome to 1.5

Adopt new namespace and enjoy generics.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jotomo/tika rome-1.5

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/16.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16


commit b6e3a51be79efc04fdd643378f67b2f7d3bc5af4
Author: Johannes Mockenhaupt g...@jotomo.de
Date:   2014-10-02T22:17:55Z

TIKA-1435: Upgrade Rome to 1.5

Adopt new namespace and enjoy generics.




 Update rome dependency to 1.5
 -

 Key: TIKA-1435
 URL: https://issues.apache.org/jira/browse/TIKA-1435
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.6
Reporter: Johannes Mockenhaupt
Priority: Minor
 Fix For: 1.7


 Rome 1.5 has been released to Sonatype 
 (https://github.com/rometools/rome/issues/183), though the website 
 (http://rometools.github.io/rome/) is blissfully ignorant of that. The update 
 is mostly maintenance: adopting slf4j and generics, and moving the 
 namespace from _com.sun.syndication_ to _com.rometools_. PR upcoming.





[jira] [Commented] (TIKA-1435) Update rome dependency to 1.5

2014-10-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14158724#comment-14158724
 ] 

ASF GitHub Bot commented on TIKA-1435:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/16




[jira] [Commented] (TIKA-1354) ForkParser doesn't work in OSGI container

2014-10-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14158734#comment-14158734
 ] 

ASF GitHub Bot commented on TIKA-1354:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/13


 ForkParser doesn't work in OSGI container
 -

 Key: TIKA-1354
 URL: https://issues.apache.org/jira/browse/TIKA-1354
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.6
Reporter: Michal Hlavac

 I can't find a way to run ForkParser in an OSGI container.





[jira] [Commented] (TIKA-1126) text/html producer for tika-server

2014-10-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158743#comment-14158743
 ] 

ASF GitHub Bot commented on TIKA-1126:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/3


 text/html producer for tika-server
 --

 Key: TIKA-1126
 URL: https://issues.apache.org/jira/browse/TIKA-1126
 Project: Tika
  Issue Type: Improvement
  Components: server
Affects Versions: 1.4
Reporter: Ali Mosavian
Priority: Trivial
 Fix For: 1.4

 Attachments: tika_server_html_output.patch


 The /tika resource handler of tika-server can only produce text/plain. This
 patch adds support for producing text/html.





[jira] [Commented] (TIKA-1369) Date parsing and thread safety in ImageMetadataExtractor

2014-10-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158893#comment-14158893
 ] 

ASF GitHub Bot commented on TIKA-1369:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/15


 Date parsing and thread safety in ImageMetadataExtractor
 

 Key: TIKA-1369
 URL: https://issues.apache.org/jira/browse/TIKA-1369
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.5
 Environment: OS X 10.9.4 Java 7_60
Reporter: John Gibson
Priority: Critical

 The {{ImageMetadataExtractor}} uses a static instance of 
 {{SimpleDateFormat}}.  This is not thread safe.
 {code:title=ImageMetadataExtractor.java}
 static class ExifHandler implements DirectoryHandler {
     private static final SimpleDateFormat DATE_UNSPECIFIED_TZ =
             new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
     ...
     public void handleDateTags(Directory directory, Metadata metadata)
             throws MetadataException {
         // Date/Time Original overrides value from ExifDirectory.TAG_DATETIME
         Date original = null;
         if (directory.containsTag(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL)) {
             original = directory.getDate(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL);
             // Unless we have GPS time we don't know the time zone so date must be set
             // as ISO 8601 datetime without timezone suffix (no Z or +/-)
             if (original != null) {
                 // Same time zone as Metadata Extractor uses
                 String datetimeNoTimeZone = DATE_UNSPECIFIED_TZ.format(original);
                 metadata.set(TikaCoreProperties.CREATED, datetimeNoTimeZone);
                 metadata.set(Metadata.ORIGINAL_DATE, datetimeNoTimeZone);
             }
         }
 ...
 {code}
 This is not the first time that SDF has caused problems: see TIKA-495 and TIKA-864.
 In the discussion there, the idea of using alternative thread-safe (and
 faster) formatters from either Joda-Time or Commons Lang was dismissed
 because they would add too many dependencies. Given that Tika already has a
 fairly large laundry list of dependencies to parse content, adding one more
 JAR to make sure things don't break is probably a good idea.
 In addition, because no timezone or locale is specified by either Tika's
 formatter or the call to com.drew.metadata.Directory, it can wreak havoc
 during randomized testing. Given that the timezone is unknown, why not just
 default it to UTC and let the caller guess the timezone? As it stands I have
 to reparse all of the dates into UTC to get stable behavior across timezones.
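
One dependency-free way to address both points raised above (shared mutable formatter, unspecified time zone and locale) might look like the sketch below. It is not Tika's actual fix: the class name `ExifDateSketch` and the method `formatNoTimeZone` are hypothetical, and pinning the formatter to UTC and `Locale.ROOT` is simply the reporter's suggestion taken as an assumption.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper, not part of Tika.
final class ExifDateSketch {

    // A fresh SimpleDateFormat per call means no mutable state is shared
    // between threads. Locale.ROOT and UTC are assumptions that make the
    // output independent of the machine's defaults.
    static String formatNoTimeZone(Date date) {
        SimpleDateFormat format =
                new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss", Locale.ROOT);
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        return format.format(date);
    }

    public static void main(String[] args) {
        // The epoch formatted in UTC.
        System.out.println(formatNoTimeZone(new Date(0L)));
    }
}
```

Creating a formatter per call costs an allocation each time; whether that matters compared with parsing an image is something that would need measuring.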





[jira] [Commented] (TIKA-1369) Date parsing and thread safety in ImageMetadataExtractor

2014-10-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14160234#comment-14160234
 ] 

ASF GitHub Bot commented on TIKA-1369:
--

GitHub user vilmospapp opened a pull request:

https://github.com/apache/tika/pull/17

TIKA-1369 Avoid ThreadLocal usage from Memory Leak

Hi @chrismattmann ,

Based on our discussion from https://github.com/apache/tika/pull/15 I've 
added the ThreadLocal clean up part, so theoretically it won't suffer from the 
scenario that @grossws mentioned.

Cheers,
Vilmos
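
The pull request's code is not reproduced in this message, so the following is only a sketch of the general pattern the description refers to: cache a `SimpleDateFormat` per thread and remove it once the unit of work is done, so the value does not linger on pooled threads. The class name `ThreadLocalFormatSketch`, the method `formatAll`, and the UTC/`Locale.ROOT` settings are assumptions made for illustration.

```java
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical illustration, not the code from the pull request.
final class ThreadLocalFormatSketch {

    // Each thread lazily gets its own formatter instance.
    private static final ThreadLocal<SimpleDateFormat> FORMAT =
            new ThreadLocal<SimpleDateFormat>() {
                @Override
                protected SimpleDateFormat initialValue() {
                    SimpleDateFormat f =
                            new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss", Locale.ROOT);
                    f.setTimeZone(TimeZone.getTimeZone("UTC")); // assumption
                    return f;
                }
            };

    // Formats a batch of dates, then clears this thread's cached formatter
    // in a finally block so it is released even if formatting throws.
    static List<String> formatAll(List<Date> dates) {
        try {
            SimpleDateFormat f = FORMAT.get();
            List<String> out = new ArrayList<String>();
            for (Date d : dates) {
                out.add(f.format(d));
            }
            return out;
        } finally {
            FORMAT.remove();
        }
    }

    public static void main(String[] args) {
        List<Date> dates = new ArrayList<Date>();
        dates.add(new Date(0L));
        System.out.println(formatAll(dates));
    }
}
```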

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/vilmospapp/tika TIKA-1369-2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/17.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17


commit f95fad94619946ef1d4fe7cf407deab6317ad2fd
Author: Vilmos Papp papp.gyorgy.vil...@gmail.com
Date:   2014-10-06T12:10:14Z

TIKA-1369 Avoid ThreadLocal usage from Memory Leak




 Date parsing and thread safety in ImageMetadataExtractor
 

 Key: TIKA-1369
 URL: https://issues.apache.org/jira/browse/TIKA-1369
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.5
 Environment: OS X 10.9.4 Java 7_60
Reporter: John Gibson
Assignee: Chris A. Mattmann
Priority: Critical
 Fix For: 1.7


 The {{ImageMetadataExtractor}} uses a static instance of 
 {{SimpleDateFormat}}.  This is not thread safe.
 {code:title=ImageMetadataExtractor.java}
 static class ExifHandler implements DirectoryHandler {
     private static final SimpleDateFormat DATE_UNSPECIFIED_TZ =
             new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
     ...
     public void handleDateTags(Directory directory, Metadata metadata)
             throws MetadataException {
         // Date/Time Original overrides value from ExifDirectory.TAG_DATETIME
         Date original = null;
         if (directory.containsTag(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL)) {
             original = directory.getDate(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL);
             // Unless we have GPS time we don't know the time zone so date must be set
             // as ISO 8601 datetime without timezone suffix (no Z or +/-)
             if (original != null) {
                 // Same time zone as Metadata Extractor uses
                 String datetimeNoTimeZone = DATE_UNSPECIFIED_TZ.format(original);
                 metadata.set(TikaCoreProperties.CREATED, datetimeNoTimeZone);
                 metadata.set(Metadata.ORIGINAL_DATE, datetimeNoTimeZone);
             }
         }
 ...
 {code}
 This is not the first time that SDF has caused problems: see TIKA-495 and TIKA-864.
 In the discussion there, the idea of using alternative thread-safe (and
 faster) formatters from either Joda-Time or Commons Lang was dismissed
 because they would add too many dependencies. Given that Tika already has a
 fairly large laundry list of dependencies to parse content, adding one more
 JAR to make sure things don't break is probably a good idea.
 In addition, because no timezone or locale is specified by either Tika's
 formatter or the call to com.drew.metadata.Directory, it can wreak havoc
 during randomized testing. Given that the timezone is unknown, why not just
 default it to UTC and let the caller guess the timezone? As it stands I have
 to reparse all of the dates into UTC to get stable behavior across timezones.





[jira] [Commented] (TIKA-1354) ForkParser doesn't work in OSGI container

2014-10-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14160340#comment-14160340
 ] 

ASF GitHub Bot commented on TIKA-1354:
--

GitHub user hlavki opened a pull request:

https://github.com/apache/tika/pull/18

TIKA-1354 Add test method with nonfunctional fork parser

There is something wrong with pax commons logging so ForkParser doesn't 
work in general.
Test method: testForkParserPdf()

I suppose that this pull request will never be merged to trunk.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/hlavki/tika trunk

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/18.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18


commit ab9d4d432b4ace9877cd1bbc178594ac230d4edd
Author: Michal Hlavac hla...@hlavki.eu
Date:   2014-10-06T14:21:48Z

TIKA-1354 Add test method with nonfunctional fork parser (There is 
something wrong with pax commons logging)




 ForkParser doesn't work in OSGI container
 -

 Key: TIKA-1354
 URL: https://issues.apache.org/jira/browse/TIKA-1354
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.6
Reporter: Michal Hlavac

 I can't find a way to run ForkParser in an OSGI container.





[jira] [Commented] (TIKA-1369) Date parsing and thread safety in ImageMetadataExtractor

2014-10-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14161801#comment-14161801
 ] 

ASF GitHub Bot commented on TIKA-1369:
--

Github user vilmospapp closed the pull request at:

https://github.com/apache/tika/pull/17


 Date parsing and thread safety in ImageMetadataExtractor
 

 Key: TIKA-1369
 URL: https://issues.apache.org/jira/browse/TIKA-1369
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.5
 Environment: OS X 10.9.4 Java 7_60
Reporter: John Gibson
Assignee: Chris A. Mattmann
Priority: Critical
 Fix For: 1.7


 The {{ImageMetadataExtractor}} uses a static instance of 
 {{SimpleDateFormat}}.  This is not thread safe.
 {code:title=ImageMetadataExtractor.java}
 static class ExifHandler implements DirectoryHandler {
     private static final SimpleDateFormat DATE_UNSPECIFIED_TZ =
             new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
     ...
     public void handleDateTags(Directory directory, Metadata metadata)
             throws MetadataException {
         // Date/Time Original overrides value from ExifDirectory.TAG_DATETIME
         Date original = null;
         if (directory.containsTag(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL)) {
             original = directory.getDate(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL);
             // Unless we have GPS time we don't know the time zone so date must be set
             // as ISO 8601 datetime without timezone suffix (no Z or +/-)
             if (original != null) {
                 // Same time zone as Metadata Extractor uses
                 String datetimeNoTimeZone = DATE_UNSPECIFIED_TZ.format(original);
                 metadata.set(TikaCoreProperties.CREATED, datetimeNoTimeZone);
                 metadata.set(Metadata.ORIGINAL_DATE, datetimeNoTimeZone);
             }
         }
 ...
 {code}
 This is not the first time that SDF has caused problems: see TIKA-495 and TIKA-864.
 In the discussion there, the idea of using alternative thread-safe (and
 faster) formatters from either Joda-Time or Commons Lang was dismissed
 because they would add too many dependencies. Given that Tika already has a
 fairly large laundry list of dependencies to parse content, adding one more
 JAR to make sure things don't break is probably a good idea.
 In addition, because no timezone or locale is specified by either Tika's
 formatter or the call to com.drew.metadata.Directory, it can wreak havoc
 during randomized testing. Given that the timezone is unknown, why not just
 default it to UTC and let the caller guess the timezone? As it stands I have
 to reparse all of the dates into UTC to get stable behavior across timezones.





[jira] [Commented] (TIKA-1446) CHM parser : wrong decompression of aligned blocks

2014-10-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181483#comment-14181483
 ] 

ASF GitHub Bot commented on TIKA-1446:
--

GitHub user thaichat04 opened a pull request:

https://github.com/apache/tika/pull/20

TIKA-1446

TIKA-1430, TIKA-1446, TIKA-1447, TIKA-1448: CHM Parser improvement

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/apache/tika 1.6

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/20.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20


commit 58a465391d128c2aa9b11c9f5a986f6bcd28abca
Author: Chris Mattmann mattm...@apache.org
Date:   2014-07-28T00:45:03Z

[maven-release-plugin]  copy for tag 1.6

git-svn-id: https://svn.apache.org/repos/asf/tika/tags/1.6@1613865 
13f79535-47bb-0310-9956-ffa450edef68

commit c98da37a4b83bdad6aa86ccc6aaec6b0d647c59a
Author: David Meikle dmei...@apache.org
Date:   2014-07-31T18:29:32Z

TIKA-1381 - Added Lingo24Translator implementation

git-svn-id: https://svn.apache.org/repos/asf/tika/tags/1.6@1614950 
13f79535-47bb-0310-9956-ffa450edef68

commit d831ac12be2fc3303f5dab45b00b53b53b6a67e9
Author: Nick Burch n...@apache.org
Date:   2014-08-04T15:41:54Z

Create a branch for 1.6, to backport the POI upgrade to

git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1615619 
13f79535-47bb-0310-9956-ffa450edef68

commit e2d10e633d38c52b0f490a09043fb43176d26fbe
Author: Nick Burch n...@apache.org
Date:   2014-08-04T15:54:55Z

Merge the POI 3.11 beta 1 upgrade from Trunk to the 1.6 branch (TIKA-1380), 
ready for inclusion in rc2

git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1615636 
13f79535-47bb-0310-9956-ffa450edef68

commit a5942c11cd6a3e75304ce0267c1fc4b5e979c66c
Author: Tim Allison talli...@apache.org
Date:   2014-08-04T16:51:40Z

TIKA-1317 extract contents from SDTs within cells in tables in XWPF (docx) 
files

git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1615675 
13f79535-47bb-0310-9956-ffa450edef68

commit 68f9a11926946bdea29ab757a8275149d8d057e9
Author: Nick Burch n...@apache.org
Date:   2014-08-04T21:27:41Z

Merge r1615631 from Trunk to 1.6 - Upgrade the Commons Codec version to 
match that in Apache POI, upgraded in TIKA-1380

git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1615800 
13f79535-47bb-0310-9956-ffa450edef68

commit ee988d4daa5b451a51b799b0ec790b88ca7fc111
Author: Tim Allison talli...@apache.org
Date:   2014-08-05T13:03:05Z

TIKA-1275 upgrade Commons Compress to 1.8.1; updated CHANGES.txt, too

git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1615923 
13f79535-47bb-0310-9956-ffa450edef68

commit 9d27e1379fba530def45b470a92ce5052078021c
Author: Tim Allison talli...@apache.org
Date:   2014-08-05T18:17:39Z

TIKA-1380; fix for null ole.getLabel()

git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1615970 
13f79535-47bb-0310-9956-ffa450edef68

commit 2ee02d85aa703e65607a707ee171c166017916ab
Author: Nick Burch n...@apache.org
Date:   2014-08-20T14:16:06Z

Merge r1619108 from Trunk to the 1.6 branch ready for release - Bump the 
POI dependency to 3.11-beta2, and remove the Geronimo stax one which is no 
longer required by anything now we are on Java 1.6 TIKA-1380

git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1619109 
13f79535-47bb-0310-9956-ffa450edef68

commit a3eac367cd560c20da4231f45eb18d638d4f91a1
Author: Chris Mattmann mattm...@apache.org
Date:   2014-08-31T19:36:36Z

Bring 1.6 branch up to date with trunk in prep for 1.6 RC #2.

git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1621623 
13f79535-47bb-0310-9956-ffa450edef68

commit dd2a2b5bad7e363c5ab74db69b89b6083f6fc8ff
Author: Chris Mattmann mattm...@apache.org
Date:   2014-08-31T19:44:11Z

[maven-release-plugin] prepare release 1.6-rc2

git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1621627 
13f79535-47bb-0310-9956-ffa450edef68

commit 5f9845759fb7839298ac5ee3abb11667035faac3
Author: Chris Mattmann mattm...@apache.org
Date:   2014-08-31T19:44:17Z

[maven-release-plugin] prepare for next development iteration

git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1621629 
13f79535-47bb-0310-9956-ffa450edef68




 CHM parser : wrong decompression of aligned blocks
 --

 Key: TIKA-1446
 URL: https://issues.apache.org/jira/browse/TIKA-1446
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.7
Reporter: Bin Hawking
Priority: 

[jira] [Commented] (TIKA-1446) CHM parser : wrong decompression of aligned blocks

2014-10-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181518#comment-14181518
 ] 

ASF GitHub Bot commented on TIKA-1446:
--

Github user thaichat04 closed the pull request at:

https://github.com/apache/tika/pull/20


 CHM parser : wrong decompression of aligned blocks
 --

 Key: TIKA-1446
 URL: https://issues.apache.org/jira/browse/TIKA-1446
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.7
Reporter: Bin Hawking
Priority: Critical
 Attachments: chm.zip


 If an embedded file contains aligned blocks, the parser outputs garbled text
 or empty text for that file.
 I have fixed it myself by correcting decompressAlignedBlock() and its
 preparation methods. The bug is mostly due to misuse of the main tree, align
 tree and length tree; in addition, at least one tree is built incorrectly.





[jira] [Commented] (TIKA-1472) Warning on Tika Server startup - Failed to load class org.slf4j.impl.StaticLoggerBinder

2014-11-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206488#comment-14206488
 ] 

ASF GitHub Bot commented on TIKA-1472:
--

GitHub user grossws opened a pull request:

https://github.com/apache/tika/pull/22

Added slf4j-jcl impl to tika-server deps.

Fixes TIKA-1472.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/grossws/tika fix-tika-1472

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/22.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22


commit 892d28051fbd745c33566e1c86d054aa2c11c4cc
Author: Konstantin Gribov gros...@gmail.com
Date:   2014-11-11T14:33:29Z

Added slf4j-jcl impl to tika-server deps.

Fixes TIKA-1472.




 Warning on Tika Server startup - Failed to load class 
 org.slf4j.impl.StaticLoggerBinder
 -

 Key: TIKA-1472
 URL: https://issues.apache.org/jira/browse/TIKA-1472
 Project: Tika
  Issue Type: Bug
  Components: server
Affects Versions: 1.6
 Environment: Windows 8, JDK 1.8, Maven 3.2.3
Reporter: Darya Arbuzova
Priority: Minor
 Attachments: 0001-Added-slf4j-jcl-impl-to-tika-server-deps.patch


 Hello!
 I want to use Apache Tika in server mode.
 I downloaded {{tika-server-1.6.jar}} from 
 http://mirror.vorboss.net/apache/tika/
 When I try to start the server, I get
 {{SLF4J: Failed to load class org.slf4j.impl.StaticLoggerBinder.}}
 So I go to the link you direct me to
 (http://www.slf4j.org/codes.html#StaticLoggerBinder), download other slf4j
 {{jar}}-files, but what next? I can't put them on the class path, since I
 don't have a project. I can't change dependencies in {{pom.xml}} for the same
 reason. What should I do?
 I tried downloading the whole source code, but couldn't build it using Maven,
 and still haven't figured out why. For the previous discussion, see:
 https://issues.apache.org/jira/browse/TIKA-1470
 Thank you!
 Best regards,
 Darya Arbuzova





[jira] [Commented] (TIKA-1472) Warning on Tika Server startup - Failed to load class org.slf4j.impl.StaticLoggerBinder

2014-11-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207940#comment-14207940
 ] 

ASF GitHub Bot commented on TIKA-1472:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/22


 Warning on Tika Server startup - Failed to load class 
 org.slf4j.impl.StaticLoggerBinder
 -

 Key: TIKA-1472
 URL: https://issues.apache.org/jira/browse/TIKA-1472
 Project: Tika
  Issue Type: Bug
  Components: server
Affects Versions: 1.6
 Environment: Windows 8, JDK 1.8, Maven 3.2.3
Reporter: Darya Arbuzova
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 1.7

 Attachments: 0001-Added-slf4j-jcl-impl-to-tika-server-deps.patch


 Hello!
 I want to use Apache Tika in server mode.
 I downloaded {{tika-server-1.6.jar}} from 
 http://mirror.vorboss.net/apache/tika/
 When I try to start the server, I get
 {{SLF4J: Failed to load class org.slf4j.impl.StaticLoggerBinder.}}
 So I go to the link you direct me to
 (http://www.slf4j.org/codes.html#StaticLoggerBinder), download other slf4j
 {{jar}}-files, but what next? I can't put them on the class path, since I
 don't have a project. I can't change dependencies in {{pom.xml}} for the same
 reason. What should I do?
 I tried downloading the whole source code, but couldn't build it using Maven,
 and still haven't figured out why. For the previous discussion, see:
 https://issues.apache.org/jira/browse/TIKA-1470
 Thank you!
 Best regards,
 Darya Arbuzova





[jira] [Commented] (TIKA-936) encoding of ZipArchiveInputStream

2015-02-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14306931#comment-14306931
 ] 

ASF GitHub Bot commented on TIKA-936:
-

GitHub user kongxianghe1234 opened a pull request:

https://github.com/apache/tika/pull/27

Update RarParser.java

Detecting a file whose name contains Chinese-like characters fails with the
current approach.
See TIKA-936 for details.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kongxianghe1234/tika patch-1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/27.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #27


commit 6d3913047980521df07ae10922a6c4233dadbb52
Author: 孔祥和  JavaMiner kong1011437...@gmail.com
Date:   2015-02-05T09:52:41Z

Update RarParser.java

Detecting a file whose name contains Chinese-like characters fails with the
current approach.
See TIKA-936 for details.




 encoding of ZipArchiveInputStream
 -

 Key: TIKA-936
 URL: https://issues.apache.org/jira/browse/TIKA-936
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.1
Reporter: Shinichiro Abe
Assignee: Jukka Zitting
 Attachments: x-日本語メモ.zip


 When extracting zip files that were created on a Japanese Windows OS,
 the file names extracted from the zip are garbled.
 ZipArchiveInputStream has three constructors. After the modification below, the
 file name was no longer garbled. I specified the encoding, SJIS.
 {code:title=PackageExtractor|borderStyle=solid}
 public void parse(InputStream stream)
  :
  //unpack(new ZipArchiveInputStream(stream), xhtml);
  unpack(new ZipArchiveInputStream(stream, "SJIS", true), xhtml);
  :
 {code}
 The first constructor uses -the platform's default encoding- UTF-8. In my
 case the encoding of my computer is UTF-8 and the encoding of the zip file is SJIS,
 so the file name was garbled. File names will be garbled whenever the encoding
 used by -platform- this constructor differs from that of the zip file.
 I want Tika to parse zip files with an encoding parameter given per file.
 Where should I give the encoding: somewhere in Metadata or ParseContext?
 Please support this. I am using Tika via Solr (SolrCell), so when posting a zip
 file to Solr I want to add an encoding parameter to the request.
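
The effect described above can be reproduced without Tika. The sketch below deliberately uses the JDK's own `java.util.zip` classes instead of the Commons Compress `ZipArchiveInputStream` named in the report, so it only illustrates the principle: entry names decode correctly when the reader is given the charset the archive was written with. The class and method names are hypothetical, and it assumes the JDK ships the `Shift_JIS` charset.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration using java.util.zip, not Commons Compress.
final class ZipNameEncodingSketch {

    // Builds an in-memory zip with one entry whose name is encoded in cs.
    static byte[] zipWithName(String name, Charset cs) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(buffer, cs)) {
            out.putNextEntry(new ZipEntry(name));
            out.write(new byte[] {1});
            out.closeEntry();
        }
        return buffer.toByteArray();
    }

    // Reads the first entry name, decoding it with the given charset.
    static String firstEntryName(byte[] zip, Charset cs) throws IOException {
        try (ZipInputStream in =
                new ZipInputStream(new ByteArrayInputStream(zip), cs)) {
            ZipEntry entry = in.getNextEntry();
            return entry == null ? null : entry.getName();
        }
    }

    public static void main(String[] args) throws IOException {
        Charset sjis = Charset.forName("Shift_JIS");
        // A Japanese file name, written with Unicode escapes so the source
        // file itself stays ASCII.
        String name = "\u65e5\u672c\u8a9e\u30e1\u30e2.txt";
        byte[] zip = zipWithName(name, sjis);
        System.out.println(name.equals(firstEntryName(zip, sjis)));
    }
}
```

Reading the same bytes with a different charset is what produces the garbled names in the report. How a per-file encoding should reach the parser (Metadata or ParseContext) is the open question of this issue and is not answered by the sketch.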





[jira] [Commented] (TIKA-1354) ForkParser doesn't work in OSGI container

2015-02-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14326497#comment-14326497
 ] 

ASF GitHub Bot commented on TIKA-1354:
--

Github user hlavki closed the pull request at:

https://github.com/apache/tika/pull/18


 ForkParser doesn't work in OSGI container
 -

 Key: TIKA-1354
 URL: https://issues.apache.org/jira/browse/TIKA-1354
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.6
Reporter: Michal Hlavac

 I can't find a way to run ForkParser in an OSGI container.





[jira] [Commented] (TIKA-1354) ForkParser doesn't work in OSGI container

2015-02-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324823#comment-14324823
 ] 

ASF GitHub Bot commented on TIKA-1354:
--

GitHub user hlavki opened a pull request:

https://github.com/apache/tika/pull/30

Rollback (TIKA-1354) and update pax-exam to version 4.4.0

All bundle test cases pass.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/hlavki/tika trunk

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/30.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #30


commit fe7ee28aa344744b5534cb707db0baeaf843f79e
Author: Michal Hlavac hla...@hlavki.eu
Date:   2015-02-17T20:20:08Z

The ForkParser service removed from Activator

commit 01dcbc913638c641434001cb60f3bb3035f996c5
Author: Michal Hlavac hla...@hlavki.eu
Date:   2015-02-17T20:20:51Z

update pax-exam to 4.4.0 and fix osgi bundle tests




 ForkParser doesn't work in OSGI container
 -

 Key: TIKA-1354
 URL: https://issues.apache.org/jira/browse/TIKA-1354
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.6
Reporter: Michal Hlavac

 I can't find a way to run ForkParser in an OSGI container.





[jira] [Commented] (TIKA-1530) MP4Parser parses duration but does not set it

2015-01-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14293120#comment-14293120
 ] 

ASF GitHub Bot commented on TIKA-1530:
--

GitHub user owickstrom opened a pull request:

https://github.com/apache/tika/pull/25

TIKA-1530: Include parsed mp4 duration in metadata

Note that I couldn't get all tests working in the project
(https://issues.apache.org/jira/browse/TIKA-1521?focusedCommentId=14288719&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14288719),
so I have only run `org.apache.tika.parser.mp4.MP4ParserTest`. Perhaps someone else
with a working build could try this?

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/owickstrom/tika trunk

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/25.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #25


commit 5300d22e6c7f71c353ad84a7cf4534a8efff85da
Author: Oskar Wickström oskar.wickst...@live.com
Date:   2015-01-27T07:28:00Z

TIKA-1530: Include parsed mp4 duration in metadata




 MP4Parser parses duration but does not set it 
 --

 Key: TIKA-1530
 URL: https://issues.apache.org/jira/browse/TIKA-1530
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Oskar Wickström
Priority: Minor

 See the TODO comment at 
 https://github.com/apache/tika/blob/trunk/tika-parsers/src/main/java/org/apache/tika/parser/mp4/MP4Parser.java#L167





[jira] [Commented] (TIKA-1530) MP4Parser parses duration but does not set it

2015-01-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14293664#comment-14293664
 ] 

ASF GitHub Bot commented on TIKA-1530:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/25


 MP4Parser parses duration but does not set it 
 --

 Key: TIKA-1530
 URL: https://issues.apache.org/jira/browse/TIKA-1530
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Oskar Wickström
Priority: Minor

 See the TODO comment at 
 https://github.com/apache/tika/blob/trunk/tika-parsers/src/main/java/org/apache/tika/parser/mp4/MP4Parser.java#L167





[jira] [Commented] (TIKA-1537) Installation on OSX 10.10.2 generates OutOfMemory Error during parser tests

2015-02-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14300953#comment-14300953
 ] 

ASF GitHub Bot commented on TIKA-1537:
--

GitHub user archerrbgh opened a pull request:

https://github.com/apache/tika/pull/26

Added argLine value for maven-surefire-plugin

Setting a higher maximum amount of memory prevents TestChmExtraction
from generating an OutOfMemory error from running out of heap space
when running parser tests and trying to install Tika 1.7 on OS X
10.10.2. This is to address issue TIKA-1537.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/archerrbgh/tika trunk

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/26.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #26


commit 05340a8ef401a29206b7dff30b97d193347cd8c2
Author: Andrew Hwang archerr...@gmail.com
Date:   2015-02-02T07:08:19Z

Added argLine value for maven-surefire-plugin

Setting a higher maximum amount of memory prevents TestChmExtraction
from generating an OutOfMemory error from running out of heap space
when running parser tests and trying to install Tika 1.7 on OS X
10.10.2.




 Installation on OSX 10.10.2 generates OutOfMemory Error during parser tests
 ---

 Key: TIKA-1537
 URL: https://issues.apache.org/jira/browse/TIKA-1537
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.7
 Environment: Mac OSX 10.10.2
Reporter: Andrew Hwang
Priority: Minor
  Labels: easyfix
   Original Estimate: 0h
  Remaining Estimate: 0h

 I was having issues during installation of Tika 1.7 where the build failed
 when running parser tests (specifically on TestChmExtraction). I had set the
 MAVEN_OPTS variable to have enough memory (-Xmx2048m), but the build still
 failed. It turned out that the maven-surefire-plugin was creating a new JVM
 that did not have enough memory specified, causing TestChmExtraction to fail.
 A fix I found online led me to change the POM for tika-parent (adding an
 argLine to maven-surefire-plugin, where -Xmx2048m worked). After adding this,
 the installation was able to finish. I will submit a pull request with the
 addition.
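
For readers unfamiliar with the setting, a change of the kind described above would typically look like the following POM fragment. It is a sketch based on the description, not the contents of the pull request: where the plugin entry sits in tika-parent's POM and whether other configuration elements are present are not shown in this message.

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <!-- JVM arguments for the forked test JVM; MAVEN_OPTS only affects
         the Maven process itself, not the JVM Surefire forks for tests. -->
    <argLine>-Xmx2048m</argLine>
  </configuration>
</plugin>
```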



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1365) Incorrect MimeType detection for Apache Lucene web site

2015-03-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14366176#comment-14366176
 ] 

ASF GitHub Bot commented on TIKA-1365:
--

GitHub user mkr opened a pull request:

https://github.com/apache/tika/pull/35

TIKA-1365: Lower priority for XML starting with comment

TIKA-1365: Lower priority for XML starting with comment, allow HTML 
starting with comment to be detected as text/html

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mkr/tika TIKA-1365

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/35.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #35


commit f9655d44978af188018bee81b2d554770ddcd7f9
Author: Matthias Krueger m...@mkr.io
Date:   2015-03-17T21:45:36Z

TIKA-1365: Lower priority for XML starting with comment, allow HTML 
starting with comment to be detected as text/html




 Incorrectly MimeType detection for Apache Lucene web site
 -

 Key: TIKA-1365
 URL: https://issues.apache.org/jira/browse/TIKA-1365
 Project: Tika
  Issue Type: Bug
  Components: detector
Affects Versions: 1.5
Reporter: Tien Nguyen Manh
 Attachments: discussion.html


 Tika 1.5 detects many pages from the Apache Lucene web site as XML, for example
 this page:
 http://lucene.apache.org/core/discussion.html
 Here is the error log; parsing failed because the XML parser was used:
 Apache Tika was unable to parse the document
 at http://lucene.apache.org/core/discussion.html.
 The full exception stack trace is included below:
 org.apache.tika.exception.TikaException: XML parse error
   at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:78)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
   at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:320)
   at org.apache.tika.gui.TikaGUI.openURL(TikaGUI.java:293)
   at org.apache.tika.gui.TikaGUI.actionPerformed(TikaGUI.java:247)
   at javax.swing.AbstractButton.fireActionPerformed(AbstractButton.java:2018)
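
The ambiguity behind this report can be illustrated without Tika: a document that opens with a comment could be either XML or HTML until the first real markup is seen. The sketch below, using only the JDK, skips leading whitespace and comments and then looks for an HTML doctype or root element. The class `LeadingCommentSniffer` and its logic are assumptions made for illustration; they are not how Tika's magic-based detection or the pull request is implemented.

```java
import java.util.Locale;

// Hypothetical illustration (not Tika code): look past leading comments
// before deciding whether a document looks like HTML.
final class LeadingCommentSniffer {
    private LeadingCommentSniffer() {}

    /** Returns the text that follows leading whitespace and <!-- ... --> comments. */
    static String skipLeadingComments(String text) {
        String rest = text.trim();
        while (rest.startsWith("<!--")) {
            // Search for the terminator after the 4-character comment opener.
            int end = rest.indexOf("-->", 4);
            if (end < 0) {
                return ""; // unterminated comment: nothing usable follows
            }
            rest = rest.substring(end + 3).trim();
        }
        return rest;
    }

    /** True if the first markup after any leading comments is an HTML doctype or <html>. */
    static boolean looksLikeHtml(String text) {
        String head = skipLeadingComments(text).toLowerCase(Locale.ROOT);
        return head.startsWith("<!doctype html") || head.startsWith("<html");
    }
}
```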





[jira] [Commented] (TIKA-1554) Improve EMF file detection

2015-03-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366038#comment-14366038
 ] 

ASF GitHub Bot commented on TIKA-1554:
--

GitHub user mkr opened a pull request:

https://github.com/apache/tika/pull/34

TIKA-1554: Adding EMF magic as per Microsoft's EMF specification, thanks to 
Luis Filipe Nassif

TIKA-1554: Adding EMF magic as per Microsoft's EMF specification, thanks to 
Luis Filipe Nassif

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mkr/tika TIKA-1554

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/34.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #34


commit 4608ff50c28b9ba8c2d1caf6fe4530eeb2a088be
Author: Matthias Krueger m...@mkr.io
Date:   2015-03-17T12:45:06Z

TIKA-1554: Adding EMF magic as per Microsoft's EMF specification, thanks to 
Luis Filipe Nassif




 Improve EMF file detection
 --

 Key: TIKA-1554
 URL: https://issues.apache.org/jira/browse/TIKA-1554
 Project: Tika
  Issue Type: Bug
  Components: detector
Affects Versions: 1.7
Reporter: Luis Filipe Nassif
 Attachments: nonEmf.dat


 I am getting many files being incorrectly detected as application/x-emf. I 
 think the current magic is too common. According to MS documentation 
 (https://msdn.microsoft.com/en-us/library/cc230635.aspx and 
 https://msdn.microsoft.com/en-us/library/dd240211.aspx), it can be improved 
 to:
 {code}
 <mime-type type="application/x-emf">
   <acronym>EMF</acronym>
   <_comment>Extended Metafile</_comment>
   <glob pattern="*.emf"/>
   <magic priority="50">
     <match value="0x0100" type="string" offset="0">
       <match value=" EMF" type="string" offset="40"/>
     </match>
   </magic>
 </mime-type>
 {code}
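
The nested match above amounts to two byte comparisons: a prefix at offset 0 and the " EMF" signature at offset 40. A minimal JDK-only sketch of that check follows. The class `EmfMagicCheck` is hypothetical, and the four-byte prefix 01 00 00 00 is an assumption (the header record type 1 stored as a little-endian 32-bit integer); the value as quoted in this message is shown only as 0x0100.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration (not Tika code) of a two-part magic check.
final class EmfMagicCheck {
    // Assumption: header record type 1 as a little-endian 32-bit integer.
    private static final byte[] HEADER_RECORD_TYPE = {0x01, 0x00, 0x00, 0x00};
    // The " EMF" signature named in the nested match, expected at offset 40.
    private static final byte[] SIGNATURE = " EMF".getBytes(StandardCharsets.US_ASCII);
    private static final int SIGNATURE_OFFSET = 40;

    private EmfMagicCheck() {}

    /** True only if both the prefix and the signature at offset 40 match. */
    static boolean looksLikeEmf(byte[] data) {
        if (data.length < SIGNATURE_OFFSET + SIGNATURE.length) {
            return false; // too short to contain the signature
        }
        return matchesAt(data, 0, HEADER_RECORD_TYPE)
                && matchesAt(data, SIGNATURE_OFFSET, SIGNATURE);
    }

    private static boolean matchesAt(byte[] data, int offset, byte[] expected) {
        byte[] window = Arrays.copyOfRange(data, offset, offset + expected.length);
        return Arrays.equals(window, expected);
    }
}
```

Requiring both conditions is what makes the check stricter than a prefix-only magic: a file that merely starts with the same few bytes no longer matches.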





[jira] [Commented] (TIKA-1554) Improve EMF file detection

2015-03-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14368568#comment-14368568
 ] 

ASF GitHub Bot commented on TIKA-1554:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/34


 Improve EMF file detection
 --

 Key: TIKA-1554
 URL: https://issues.apache.org/jira/browse/TIKA-1554
 Project: Tika
  Issue Type: Bug
  Components: detector
Affects Versions: 1.7
Reporter: Luis Filipe Nassif
Assignee: Chris A. Mattmann
 Attachments: nonEmf.dat


 I am getting many files being incorrectly detected as application/x-emf. I 
 think the current magic is too common. According to MS documentation 
 (https://msdn.microsoft.com/en-us/library/cc230635.aspx and 
 https://msdn.microsoft.com/en-us/library/dd240211.aspx), it can be improved 
 to:
 {code}
 <mime-type type="application/x-emf">
   <acronym>EMF</acronym>
   <_comment>Extended Metafile</_comment>
   <glob pattern="*.emf"/>
   <magic priority="50">
     <match value="0x0100" type="string" offset="0">
       <match value=" EMF" type="string" offset="40"/>
     </match>
   </magic>
 </mime-type>
 {code}





[jira] [Commented] (TIKA-1365) Incorrectly MimeType detection for Apache Lucene web site

2015-03-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14368559#comment-14368559
 ] 

ASF GitHub Bot commented on TIKA-1365:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/35


 Incorrectly MimeType detection for Apache Lucene web site
 -

 Key: TIKA-1365
 URL: https://issues.apache.org/jira/browse/TIKA-1365
 Project: Tika
  Issue Type: Bug
  Components: detector
Affects Versions: 1.5
Reporter: Tien Nguyen Manh
Assignee: Chris A. Mattmann
 Attachments: discussion.html


 Tika 1.5 detects many pages from the Apache Lucene web site as XML, for example
 this page:
 http://lucene.apache.org/core/discussion.html
 Here is the error log; parsing failed because the XML parser was used:
 Apache Tika was unable to parse the document
 at http://lucene.apache.org/core/discussion.html.
 The full exception stack trace is included below:
 org.apache.tika.exception.TikaException: XML parse error
   at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:78)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
   at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:320)
   at org.apache.tika.gui.TikaGUI.openURL(TikaGUI.java:293)
   at org.apache.tika.gui.TikaGUI.actionPerformed(TikaGUI.java:247)
   at javax.swing.AbstractButton.fireActionPerformed(AbstractButton.java:2018)





[jira] [Commented] (TIKA-1567) WelcomeResource in TikaServer doesn't print PathParam prefix

2015-03-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14351930#comment-14351930
 ] 

ASF GitHub Bot commented on TIKA-1567:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/33


 WelcomeResource in TikaServer doesn't print PathParam prefix
 

 Key: TIKA-1567
 URL: https://issues.apache.org/jira/browse/TIKA-1567
 Project: Tika
  Issue Type: Bug
  Components: server
Affects Versions: 1.7
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.8


 Ben Zaiten found out that while looking at the WelcomeResource for Tika
 server, things like meta/form are shown as metaform, not properly
 delineating the PathParams. I tracked this down to WelcomeResource; an easy
 fix is coming.





[jira] [Commented] (TIKA-1567) WelcomeResource in TikaServer doesn't print PathParam prefix

2015-03-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14351928#comment-14351928
 ] 

ASF GitHub Bot commented on TIKA-1567:
--

GitHub user chrismattmann opened a pull request:

https://github.com/apache/tika/pull/33

Fix for TIKA-1567 WelcomeResource in TikaServer doesn't print PathParam 
prefix



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/chrismattmann/tika TIKA-1567

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/33.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #33


commit c3c97ce987ad1da55a0b4d1d016d67d06341467a
Author: Chris Mattmann mattm...@apache.org
Date:   2015-03-08T06:08:48Z

fix for TIKA-1567 WelcomeResource in TikaServer doesn't print PathParam 
prefix.




 WelcomeResource in TikaServer doesn't print PathParam prefix
 

 Key: TIKA-1567
 URL: https://issues.apache.org/jira/browse/TIKA-1567
 Project: Tika
  Issue Type: Bug
  Components: server
Affects Versions: 1.7
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.8


 Ben Zaiten found out that while looking at the WelcomeResource for Tika
 server, things like meta/form are shown as metaform, not properly
 delineating the PathParams. I tracked this down to WelcomeResource; an easy
 fix is coming.





[jira] [Commented] (TIKA-1589) Mp3 parser does not add duration to metadata if there are no ID3 tags

2015-03-31 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14388478#comment-14388478
 ] 

ASF GitHub Bot commented on TIKA-1589:
--

GitHub user mdaniline opened a pull request:

https://github.com/apache/tika/pull/38

fix for TIKA-1589 contributed by mdaniline

https://issues.apache.org/jira/browse/TIKA-1589

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mdaniline/tika TIKA-1589

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/38.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #38


commit fb29412710ea058f89d3c6df5078587768dcac74
Author: Max Daniline maxim.danil...@softwire.com
Date:   2015-03-31T12:49:43Z

fix for TIKA-1589 contributed by mdaniline




 Mp3 parser does not add duration to metadata if there are no ID3 tags
 -

 Key: TIKA-1589
 URL: https://issues.apache.org/jira/browse/TIKA-1589
 Project: Tika
  Issue Type: Bug
Reporter: Max Daniline

 Steps to reproduce:
 * Have a file without any ID3 tags (v1 or v2)
 * Parse the file
 * Attempt to retrieve the duration by calling 'metadata.get(XMPDM.DURATION)'.
 Expected result:
 The duration should be set even for a file without ID3 tags, since it is 
 independent information.
 Actual result:
 The duration is null
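
As a rough illustration of why the duration is independent of ID3 tags, it can be derived from the audio frames alone. The sketch below assumes MPEG-1 Layer III audio, where each frame carries 1152 samples; the class `Mp3DurationSketch` is hypothetical and is not the logic used by Tika's MP3 parser or by the pull request.

```java
// Hypothetical illustration (not Tika code): duration from frame count only.
final class Mp3DurationSketch {
    /** Samples per frame, assuming MPEG-1 Layer III. */
    static final int SAMPLES_PER_FRAME = 1152;

    private Mp3DurationSketch() {}

    /** Duration in seconds for the given number of audio frames and sample rate. */
    static double durationSeconds(long frameCount, int sampleRateHz) {
        if (frameCount < 0 || sampleRateHz <= 0) {
            throw new IllegalArgumentException("frameCount must be >= 0 and sampleRateHz > 0");
        }
        return (double) frameCount * SAMPLES_PER_FRAME / sampleRateHz;
    }
}
```

No tag data enters the calculation, which is the point made in the expected result above.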





[jira] [Commented] (TIKA-1589) Mp3 parser does not add duration to metadata if there are no ID3 tags

2015-03-31 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14388504#comment-14388504
 ] 

ASF GitHub Bot commented on TIKA-1589:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/38


 Mp3 parser does not add duration to metadata if there are no ID3 tags
 -

 Key: TIKA-1589
 URL: https://issues.apache.org/jira/browse/TIKA-1589
 Project: Tika
  Issue Type: Bug
Reporter: Max Daniline

 Steps to reproduce:
 * Have a file without any ID3 tags (v1 or v2)
 * Parse the file
 * Attempt to retrieve the duration by calling 'metadata.get(XMPDM.DURATION)'.
 Expected result:
 The duration should be set even for a file without ID3 tags, since it is 
 independent information.
 Actual result:
 The duration is null





[jira] [Commented] (TIKA-1558) Create a Parser Blacklist

2015-03-31 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14389464#comment-14389464
 ] 

ASF GitHub Bot commented on TIKA-1558:
--

Github user tpalsulich closed the pull request at:

https://github.com/apache/tika/pull/39


 Create a Parser Blacklist
 -

 Key: TIKA-1558
 URL: https://issues.apache.org/jira/browse/TIKA-1558
 Project: Tika
  Issue Type: New Feature
Reporter: Tyler Palsulich
Assignee: Tyler Palsulich
 Fix For: 1.8


 As talked about in TIKA-1555 and TIKA-1557, it would be nice to be able to 
 disable Parsers without pulling their dependencies out. In some cases (e.g. 
 disable all ExternalParsers), there may not be an easy way to exclude the 
 dependencies via Maven.
 So, an initial design would be to include another file like 
 {{META-INF/services/org.apache.tika.parser.Parser.blacklist}}. We create a 
 new method {{ServiceLoader#loadServiceProviderBlacklist}}. Then, in 
 {{ServiceLoader#loadServiceProviders}}, we remove all elements of the list 
 that are assignable to an element in 
 {{ServiceLoader#loadServiceProviderBlacklist}}.
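
The filtering step described above (drop every loaded provider that is assignable to a blacklisted class) can be sketched with the JDK alone. The class `ProviderBlacklistSketch` and its method are hypothetical; this is not Tika's `ServiceLoader` code, only an illustration of the assignability test.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration (not Tika code) of blacklist filtering.
final class ProviderBlacklistSketch {
    private ProviderBlacklistSketch() {}

    /** Returns the providers that are not instances of any blacklisted class. */
    static <T> List<T> removeBlacklisted(List<T> providers, List<Class<?>> blacklist) {
        List<T> kept = new ArrayList<>();
        for (T provider : providers) {
            boolean blocked = false;
            for (Class<?> banned : blacklist) {
                // isAssignableFrom also matches subclasses and implementations,
                // so blacklisting a base type disables every provider derived from it.
                if (banned.isAssignableFrom(provider.getClass())) {
                    blocked = true;
                    break;
                }
            }
            if (!blocked) {
                kept.add(provider);
            }
        }
        return kept;
    }
}
```

Because the test is on assignability rather than exact class equality, a single entry for a base class would cover a whole family of parsers, which matches the "disable all ExternalParsers" case mentioned in the description.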





[jira] [Commented] (TIKA-1586) Enable CORS on Tika Server

2015-03-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385398#comment-14385398
 ] 

ASF GitHub Bot commented on TIKA-1586:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/37


 Enable CORS on Tika Server
 --

 Key: TIKA-1586
 URL: https://issues.apache.org/jira/browse/TIKA-1586
 Project: Tika
  Issue Type: New Feature
  Components: server
Reporter: Tyler Palsulich
Assignee: Tyler Palsulich

 Tika Server should allow configuration of CORS requests (for uses like 
 TIKA-1585). See [this example|http://cxf.apache.org/docs/jax-rs-cors.html] 
 from CXF for how to add it.
 The only change from that site is that we will need to add a 
 {{CrossOriginResourceSharingFilter}} as a provider.
 Ideally, this is configurable (limit which resources have CORS, and which 
 origins are allowed). But, I'm not thinking of any general methods of how to 
 do that...





[jira] [Commented] (TIKA-1586) Enable CORS on Tika Server

2015-03-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385369#comment-14385369
 ] 

ASF GitHub Bot commented on TIKA-1586:
--

GitHub user tpalsulich opened a pull request:

https://github.com/apache/tika/pull/37

TIKA-1586. Enable CORS requests on Tika server



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tpalsulich/tika TIKA-1586

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/37.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #37


commit c296a0459477f1ba088d96fa6ba3895e3a6b3ac5
Author: Tyler Palsulich tpalsul...@gmail.com
Date:   2015-03-28T15:45:45Z

TIKA-1586. Enable CORS requests on Tika server.




 Enable CORS on Tika Server
 --

 Key: TIKA-1586
 URL: https://issues.apache.org/jira/browse/TIKA-1586
 Project: Tika
  Issue Type: New Feature
  Components: server
Reporter: Tyler Palsulich
Assignee: Tyler Palsulich

 Tika Server should allow configuration of CORS requests (for uses like 
 TIKA-1585). See [this example|http://cxf.apache.org/docs/jax-rs-cors.html] 
 from CXF for how to add it.
 The only change from that site is that we will need to add a 
 {{CrossOriginResourceSharingFilter}} as a provider.
 Ideally, this is configurable (limit which resources have CORS, and which 
 origins are allowed). But, I'm not thinking of any general methods of how to 
 do that...





[jira] [Commented] (TIKA-1561) GCMD Directory Interchange Format (.dif) identification

2015-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14341673#comment-14341673
 ] 

ASF GitHub Bot commented on TIKA-1561:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/32


 GCMD Directory Interchange Format (.dif) identification
 ---

 Key: TIKA-1561
 URL: https://issues.apache.org/jira/browse/TIKA-1561
 Project: Tika
  Issue Type: Improvement
  Components: mime
Affects Versions: 1.7
Reporter: Luke sh
Assignee: Chris A. Mattmann
Priority: Trivial
 Fix For: 1.8

 Attachments: 
 carbon_isotopic_values_of_alkanes_extracted_from_paleosols.dif


 Cited from http://gcmd.nasa.gov/add/difguide/WRITEADIF.pdf:
 The Directory Interchange Format (DIF) is a metadata format used to create
 directory entries that describe scientific data sets. A DIF holds a
 collection of fields, which detail specific information about the data.
 A .dif file is proper XML that describes the scientific data set; the XSD
 schema files are referenced inside the .dif XML file,
 e.g. http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/dif_v9.8.4.xsd
 The reason for opening this ticket is that a Tika parser for DIF files is
 under consideration for development, so support for identifying this type of
 XML file is needed.
 Although a DIF file is proper XML that can be parsed by the XML parser, some
 of its fields might need specific processing before being extracted and
 injected into Solr for analysis.
 It is therefore proposed that the following type, 'text/dif+xml', be added to
 tika-mimetypes.xml to support detection of this specific XML type, which
 extends application/xml, so that special processing can be applied to this
 particular kind of XML file.
 <mime-type type="text/dif+xml">
   <root-XML localName="DIF"/>
   <root-XML localName="DIF"
     namespaceURI="http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/"/>
   <glob pattern="*.dif"/>
   <sub-class-of type="application/xml"/>
 </mime-type>
 Expected MIME type: text/dif+xml
 The DIF format guide is available at:
 http://gcmd.nasa.gov/add/difguide/
 Example DIF files:
 1) https://www.aoncadis.org/dataset/id/005f3222-7548-11e2-851e-00c0f03d5b7c.dif
 2) https://www.aoncadis.org/dataset/id/0091cf0c-7ad3-11e2-851e-00c0f03d5b7c.dif
 3) https://www.aoncadis.org/dataset/id/02a6301c-3ab3-11e4-8ee7-00c0f03d5b7c.dif
 An example DIF file has also been attached.
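
Root-element detection of the kind proposed above comes down to reading the local name and namespace of the first element in the document. The sketch below does that with the JDK's StAX parser. The class `RootElementSniffer` is hypothetical and is not Tika's detector; it only illustrates what a root-XML rule inspects.

```java
import java.io.StringReader;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration (not Tika code) of root-element sniffing.
final class RootElementSniffer {
    private RootElementSniffer() {}

    /** Returns the qualified name of the root element, or null if the input is not parseable XML. */
    static QName rootElement(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Defensive settings: do not process DTDs or external entities.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    return reader.getName(); // stop at the first element: the root
                }
            }
        } catch (XMLStreamException e) {
            // Not well-formed XML: treated the same as "no root element found".
        }
        return null;
    }

    /** True if the root element's local name is DIF, with or without a namespace. */
    static boolean isDif(String xml) {
        QName root = rootElement(xml);
        return root != null && "DIF".equals(root.getLocalPart());
    }
}
```

Only the start of the document is read, so the check stays cheap even for large files.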





[jira] [Commented] (TIKA-1354) ForkParser doesn't work in OSGI container

2015-02-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332464#comment-14332464
 ] 

ASF GitHub Bot commented on TIKA-1354:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/30


 ForkParser doesn't work in OSGI container
 -

 Key: TIKA-1354
 URL: https://issues.apache.org/jira/browse/TIKA-1354
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.6
Reporter: Michal Hlavac

 I can't find a way to run ForkParser in an OSGi container.





[jira] [Commented] (TIKA-1561) GCMD Directory Interchange Format (.dif) identification

2015-02-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14338105#comment-14338105
 ] 

ASF GitHub Bot commented on TIKA-1561:
--

GitHub user LukeLiush opened a pull request:

https://github.com/apache/tika/pull/32

add mime detection with dif(TIKA-1561) support



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/LukeLiush/tika difmime

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/32.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #32


commit eb920b55f4debe0f1152dcb17bd17ff4f09893f9
Author: LukeLiush hanson311...@gmail.com
Date:   2015-02-26T08:53:50Z

add mime detection with dif(TIKA-1561) support




 GCMD Directory Interchange Format (.dif) identification
 ---

 Key: TIKA-1561
 URL: https://issues.apache.org/jira/browse/TIKA-1561
 Project: Tika
  Issue Type: Improvement
  Components: mime
Affects Versions: 1.7
Reporter: Luke sh
Assignee: Chris A. Mattmann
Priority: Trivial
 Attachments: 
 carbon_isotopic_values_of_alkanes_extracted_from_paleosols.dif


 Cited from http://gcmd.nasa.gov/add/difguide/WRITEADIF.pdf:
 The Directory Interchange Format (DIF) is a metadata format used to create
 directory entries that describe scientific data sets. A DIF holds a
 collection of fields, which detail specific information about the data.
 A .dif file is proper XML that describes the scientific data set; the XSD
 schema files are referenced inside the .dif XML file,
 e.g. http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/dif_v9.8.4.xsd
 The reason for opening this ticket is that a Tika parser for DIF files is
 under consideration for development, so support for identifying this type of
 XML file is needed.
 Although a DIF file is proper XML that can be parsed by the XML parser, some
 of its fields might need specific processing before being extracted and
 injected into Solr for analysis.
 It is therefore proposed that the following type, 'text/dif+xml', be added to
 tika-mimetypes.xml to support detection of this specific XML type, which
 extends application/xml, so that special processing can be applied to this
 particular kind of XML file.
 <mime-type type="text/dif+xml">
   <root-XML localName="DIF"/>
   <root-XML localName="DIF"
     namespaceURI="http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/"/>
   <glob pattern="*.dif"/>
   <sub-class-of type="application/xml"/>
 </mime-type>
 Expected MIME type: text/dif+xml
 The DIF format guide is available at:
 http://gcmd.nasa.gov/add/difguide/
 Example DIF files:
 1) https://www.aoncadis.org/dataset/id/005f3222-7548-11e2-851e-00c0f03d5b7c.dif
 2) https://www.aoncadis.org/dataset/id/0091cf0c-7ad3-11e2-851e-00c0f03d5b7c.dif
 3) https://www.aoncadis.org/dataset/id/02a6301c-3ab3-11e4-8ee7-00c0f03d5b7c.dif
 An example DIF file has also been attached.





[jira] [Commented] (TIKA-1582) Mime Detection based on neural networks with Byte-frequency-histogram

2015-03-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385177#comment-14385177
 ] 

ASF GitHub Bot commented on TIKA-1582:
--

GitHub user LukeLiush opened a pull request:

https://github.com/apache/tika/pull/36

Nn branch

https://issues.apache.org/jira/browse/TIKA-1582


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/LukeLiush/tika nnBranch

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/36.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #36


commit eb04f13260bfb5e4f4b0bf7fd54ecd085995cb92
Author: LukeLiush hanson311...@gmail.com
Date:   2015-03-28T07:12:06Z

https://issues.apache.org/jira/browse/TIKA-1582

commit acaf27bb666fdef05bdb18d7edcaafe7ccfd9bf5
Author: LukeLiush hanson311...@gmail.com
Date:   2015-03-28T07:16:07Z

move the comments of apache licence to the top

commit 701fcc394ed2110e4c771fbb84999dca77932392
Author: LukeLiush hanson311...@gmail.com
Date:   2015-03-28T07:19:43Z

add some comments

commit 12f290826a88cd99bbf2e1a0385b315e73e3
Author: LukeLiush hanson311...@gmail.com
Date:   2015-03-28T07:25:55Z

move the example model file to the test resource directory

commit 6c8d2e523c427380438f24d90985e28bfdbce050
Author: LukeLiush hanson311...@gmail.com
Date:   2015-03-28T07:28:25Z

remove empty comment block




 Mime Detection based on neural networks with Byte-frequency-histogram 
 --

 Key: TIKA-1582
 URL: https://issues.apache.org/jira/browse/TIKA-1582
 Project: Tika
  Issue Type: Improvement
  Components: detector, mime
Affects Versions: 1.7
Reporter: Luke sh
Priority: Trivial

 Content-based MIME type detection is one of the popular approaches to
 detecting MIME types; others are based on file extensions and magic numbers.
 Tika currently implements 3 approaches to detecting MIME types:
 1) file extensions
 2) magic numbers (the most trustworthy in Tika)
 3) content-type (the header in the HTTP response, if present and available)
 Content-based MIME type detection, however, analyses the distribution of the
 entire stream of bytes, finds a similar pattern for files of the same type,
 and builds a function that is able to group them into one or several classes
 so as to classify and predict. It is believed this feature might broaden the
 usage of Tika with a bit more security enforcement for MIME type detection.
 Because we want to build a model that is etched with the patterns it has
 seen, in some situations we may not trust types that have not been
 trained/learned by the model. In some situations, the magic numbers embedded
 in a file can be copied while the actual content is a potentially harmful
 Trojan program. By placing trust in byte frequency patterns, we are able to
 enhance the security of the detection.
 The proposed content-based MIME detection to be integrated into Tika is based
 on a machine learning algorithm, i.e. a neural network with back-propagation.
 The input: 256 bins (0-255), each of which represents a byte value and
 stores the count of occurrences of that byte. The byte frequency histograms
 are normalized to fall in the range between 0 and 1, and are then passed to a
 companding function to enhance the infrequent bytes.
 The output of the neural network is a binary decision, 1 or 0.
 Note that the proposed feature will be implemented with the GRB file type as
 one example.
 In this example, we build a model that is able to distinguish the GRB file
 type from non-GRB file types. Note that the set of non-GRB files is huge and
 cannot be easily defined, so there need to be as many negative training
 examples as possible to form the decision boundary for non-GRB types.
 The neural network is used in two stages: training and classification.
 The training can be done in any programming language; in this
 feature/research, the training of the neural network is implemented in R and
 the source can be found in my GitHub repository, i.e.
 https://github.com/LukeLiush/filetypeDetection. I am also going to post a
 document that describes the use of the program and the syntax/format of the
 input and output.
 After training, we need to export the model and import it into Tika. In Tika,
 we create a TrainedModelDetector that reads this model file with one or more
 model parameters, or several model files, so that it can detect the MIME
 types covered by those models. Details of the research and usage of this
 proposed feature will be posted on my GitHub shortly.
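
The input features described above (a 256-bin byte histogram, normalized to the range 0 to 1 and then companded) can be sketched with the JDK alone. The class `ByteHistogramSketch` is hypothetical. Dividing by the largest bin count and using a square root as the companding function are both assumptions chosen for illustration; the description does not say which normalization or companding function the proposed detector uses.

```java
// Hypothetical illustration (not Tika code) of byte-frequency features.
final class ByteHistogramSketch {
    private ByteHistogramSketch() {}

    /** Returns 256 features in [0, 1], one per byte value. */
    static double[] features(byte[] data) {
        long[] counts = new long[256];
        for (byte b : data) {
            counts[b & 0xFF]++; // mask to treat the byte as unsigned
        }
        long max = 0;
        for (long c : counts) {
            max = Math.max(max, c);
        }
        double[] out = new double[256];
        if (max == 0) {
            return out; // empty input: all-zero feature vector
        }
        for (int i = 0; i < 256; i++) {
            double normalized = (double) counts[i] / max; // assumption: scale by the largest bin
            out[i] = Math.sqrt(normalized); // assumption: square root as the companding function
        }
        return out;
    }
}
```

The square root raises small normalized values more than large ones, which is the effect the description attributes to companding: infrequent bytes become more visible to the classifier.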
 

[jira] [Commented] (TIKA-1517) MIME type selection with probability

2015-04-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14490879#comment-14490879
 ] 

ASF GitHub Bot commented on TIKA-1517:
--

GitHub user LukeLiush opened a pull request:

https://github.com/apache/tika/pull/41

add probabilistic mime selection

The probabilistic mime selection detector has been re-implemented.
The Bayesian probabilistic selection has been improved by adding more
freedom and flexibility, allowing users to specify their own prior and
conditional probabilities. The concrete details and basic idea are illustrated
in Tika-1571, and the implementation still follows the main description
posted in https://issues.apache.org/jira/browse/TIKA-1517, but with a minor
change to the prior probability, which was not modifiable in the original design.
https://issues.apache.org/jira/browse/TIKA-1535 is resolved according
to Prof Mattmann's suggestion.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/LukeLiush/tika mimeDetection

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/41.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #41


commit fd41a855f794876b43f97327bd77190099c1e554
Author: LukeLiush hanson311...@gmail.com
Date:   2015-04-11T08:57:12Z

add probabilistic mime selection




 MIME type selection with probability
 

 Key: TIKA-1517
 URL: https://issues.apache.org/jira/browse/TIKA-1517
 Project: Tika
  Issue Type: Improvement
  Components: mime
Affects Versions: 0.1-incubating, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 
 0.10, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6
Reporter: Luke sh
Priority: Trivial
 Attachments: BaysianTest.java


 Improvement and intuition
 The original implementation for MIME type selection/detection is somewhat
 inflexible by design, as it relies heavily on the outcome produced by
 magic-bytes MIME type identification. Thus, if magic-bytes detection is
 applicable to a file, Tika will follow the file type detected by magic bytes.
 It may be better to provide more control over the method of choice.
 The proposed approach loosely incorporates Bayes' theorem: users are able to
 assign weights to each approach in terms of probability, so they have control
 over which of the MIME type identification methods available in Tika they
 prefer. Currently there are 3 methods for identifying MIME types in Tika
 (i.e. magic bytes, file extension and metadata content-type hint). By
 introducing weights on the methods, users are able to choose which method
 they trust most, although the magic-bytes method is often trustworthy. The
 virtue is that in some situations file type identification must be
 sensitive: some users might want all of the MIME type identification methods
 to agree on the same file type before they start processing those files,
 because incorrect file type identification is less tolerable. The current
 implementation seems to be less flexible for this purpose and relies heavily
 on the magic-bytes identification method (although magic bytes is the most
 reliable of the three).
 Proposed design:
 The idea of selection is to incorporate probability as weights on each MIME
 type identification method currently implemented in Tika (the magic-bytes
 approach, file extension match and metadata content-type hint).
 For example, as a user, I would probably like to assign a preference to each
 method based on my degree of trust in it, and order the results if they
 don't coincide.
 Bayes' rule seems appropriate here to meet that intuition.
 The following is needed for an implementation of Bayes' rule:
  Prior probability P(file_type), e.g. P(pdf). Theoretically this is computed
  from samples, and it depends on the domain or use case. Intuitively, we
  care more about the order of the weights or probabilities of the results
  than about the actual numbers. The prior also depends on the samples for a
  particular use case or domain: e.g. if we happen to crawl a website that
  contains mostly PDF files, we can collect some samples and compute the
  prior; if, based on the samples, 90% of the docs are PDF, the prior is
  defined to be P(pdf) = 0.9. Here we propose to make the prior a
  configurable parameter for users, and by default we leave the prior
  unapplied. Alternatively, we can define the prior for each file type to be
  1/[number of supported file types in
  Tika] I think the 

[jira] [Commented] (TIKA-1614) Geo Topic Parser

2015-04-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14508325#comment-14508325
 ] 

ASF GitHub Bot commented on TIKA-1614:
--

GitHub user AranyaLi opened a pull request:

https://github.com/apache/tika/pull/43

TIKA-1614  Geo Topic Parser



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/AranyaLi/tika trunk

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/43.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #43


commit c6c402451c7133769aa94cdb6bbe075687e7c519
Author: aranyali aranyaelen...@gmail.com
Date:   2015-04-23T02:23:18Z

add geo topic parser

commit 51ef0cde472f3b7bd1e38318a1036af596135689
Author: aranyali aranyaelen...@gmail.com
Date:   2015-04-23T02:23:52Z

delete ~




 Geo Topic Parser
 

 Key: TIKA-1614
 URL: https://issues.apache.org/jira/browse/TIKA-1614
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Anya Yun Li

 ##Description
 This program aims to identify geonames in any unstructured text data for 
 the NSF polar research project. 
 https://github.com/NSF-Polar-Cyberinfrastructure/datavis-hackathon/issues/1
 It is a content-based geotagging solution built from a variety of NLP 
 tools, and it can be used for any geotagging purpose. 
 ##Workflow
 1. Plain text input is passed to the geoparser
 2. Location names are extracted from the text using OpenNLP NER
 3. The extracted names are given one of two roles: 
   * The most frequent location name is chosen as the best match for the 
 input text
   * Other extracted locations are treated as alternatives (all equal)
 4. For each location extracted above, the best GeoName object is looked up 
 and the resolved objects are returned with fields (name in gazetteer, 
 longitude, latitude)
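
 Step 3 of the workflow above can be sketched as a simple frequency count. 
 This is only an illustration under assumptions: the class LocationRanker 
 and its method are hypothetical names that do not come from the pull 
 request, and ties are broken by first appearance, which the description 
 does not specify.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the pull request: ranks extracted location names.
class LocationRanker {
    /** Returns the distinct names with the most frequent one first; the rest keep first-seen order. */
    static List<String> rank(List<String> extractedNames) {
        // LinkedHashMap keeps first-seen order, which also decides ties.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String name : extractedNames) {
            counts.merge(name, 1, Integer::sum);
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        List<String> ranked = new ArrayList<>();
        if (best != null) {
            ranked.add(best);                 // best match
            for (String name : counts.keySet()) {
                if (!name.equals(best)) {
                    ranked.add(name);         // alternatives, all treated as equal
                }
            }
        }
        return ranked;
    }
}
```

 The first element plays the role of the best match and the remaining 
 elements are the alternatives.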
 ##How to Use
 *Caution*: This program requires at least 1.2 GB of disk space to build 
 the Lucene index
 ```Java
 // geoparser, gazetteerPath and nerPath are assumed to be defined elsewhere
 void A(InputStream stream) throws Exception {
   Metadata metadata = new Metadata();
   ParseContext context = new ParseContext();
   GeoParserConfig config = new GeoParserConfig();
   config.setGazetterPath(gazetteerPath);
   config.setNERModelPath(nerPath);
   context.set(GeoParserConfig.class, config);

   geoparser.parse(
       stream,
       new BodyContentHandler(),
       metadata,
       context);

   // print every metadata field the parser produced
   for (String name : metadata.names()) {
     String value = metadata.get(name);
     System.out.println(name + " " + value);
   }
 }
 ```
 This parser adds geographical information to Tika's Metadata object. 
 Fields for best matched location:
 ```
 Geographic_NAME
 Geographic_LONGTITUDE
 Geographic_LATITUDE
 ```
 Fields for alternatives:
 ```
 Geographic_NAME1
 Geographic_LONGTITUDE1
 Geographic_LATITUDE1
 Geographic_NAME2
 Geographic_LONGTITUDE2
 Geographic_LATITUDE2
 ...
 ```
 If you have any questions, contact me: anyayu...@gmail.com



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1532) DIF Parser

2015-04-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518805#comment-14518805
 ] 

ASF GitHub Bot commented on TIKA-1532:
--

GitHub user HyperDunk opened a pull request:

https://github.com/apache/tika/pull/46

TIKA-1532: DIF Parser and change mime-type from text/dif+xml to 
application/dif+xml

Implementation of DIFParser (GCMD Directory Interchange Format) with unit 
test and change to mime-type as per discussion in TIKA-1532

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyperDunk/tika trunk

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/46.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #46


commit 5f751a67266ea3c54f9e1a3e18c594e4c215cadc
Author: HyperDunk aakarsh@gmail.com
Date:   2015-04-29T06:10:41Z

TIKA-1532: DIF Parser and change mime-type from text/dif+xml to 
application/dif+xml




 DIF Parser
 --

 Key: TIKA-1532
 URL: https://issues.apache.org/jira/browse/TIKA-1532
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Aakarsh Medleri Hire Math
  Labels: memex

 MIME Type detection and content parser for .dif format





[jira] [Commented] (TIKA-1532) DIF Parser

2015-04-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520969#comment-14520969
 ] 

ASF GitHub Bot commented on TIKA-1532:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/46


 DIF Parser
 --

 Key: TIKA-1532
 URL: https://issues.apache.org/jira/browse/TIKA-1532
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Aakarsh Medleri Hire Math
Assignee: Chris A. Mattmann
  Labels: memex
 Fix For: 1.9


 MIME Type detection and content parser for .dif format





[jira] [Commented] (TIKA-1535) Inheritance modification for the class MIMETypes

2015-04-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14514946#comment-14514946
 ] 

ASF GitHub Bot commented on TIKA-1535:
--

GitHub user LukeLiush opened a pull request:

https://github.com/apache/tika/pull/45

https://issues.apache.org/jira/browse/TIKA-1535 Inheritance modification...

... for the class MIMETypes

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/LukeLiush/tika TIKA-1535

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/45.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #45


commit 66cdb6e30dd1654702e08723bb0f607fb4fc9def
Author: LukeLiush hanson311...@gmail.com
Date:   2015-04-27T20:41:43Z

https://issues.apache.org/jira/browse/TIKA-1535 Inheritance modification 
for the class MIMETypes




 Inheritance modification for the class MIMETypes
 

 Key: TIKA-1535
 URL: https://issues.apache.org/jira/browse/TIKA-1535
 Project: Tika
  Issue Type: Improvement
  Components: mime
Reporter: Luke sh
Assignee: Chris A. Mattmann
Priority: Trivial

 The class MimeTypes does not currently allow for inheritance.
 Several methods in this class look independent of the rest, and some of 
 them need to be exposed or overridden for special needs or use cases. 
 This would give Tika users more flexibility to plug in new MIME detection 
 algorithms.
 
 It may be a good idea to extract the detector logic from the MimeTypes 
 class and create an independent detector for Tika.
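
 The idea of pulling detection logic out into independent, replaceable 
 pieces can be sketched as follows. Everything here is hypothetical and 
 written only for illustration: TypeDetector, MagicDetector, 
 ExtensionDetector and FirstMatchDetector are made-up names, and the sketch 
 does not reproduce the MimeTypes class or any Tika API.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical types for illustration; none of these are Tika classes.
interface TypeDetector {
    /** Returns a MIME type string, or "application/octet-stream" when unknown. */
    String detect(byte[] prefix, String fileName);
}

class MagicDetector implements TypeDetector {
    public String detect(byte[] prefix, String fileName) {
        // "%PDF" magic bytes at offset 0
        if (prefix.length >= 4 && prefix[0] == '%' && prefix[1] == 'P'
                && prefix[2] == 'D' && prefix[3] == 'F') {
            return "application/pdf";
        }
        return "application/octet-stream";
    }
}

class ExtensionDetector implements TypeDetector {
    public String detect(byte[] prefix, String fileName) {
        if (fileName != null && fileName.endsWith(".pdf")) return "application/pdf";
        if (fileName != null && fileName.endsWith(".txt")) return "text/plain";
        return "application/octet-stream";
    }
}

// Combines independent detectors; a new algorithm only needs to implement TypeDetector.
class FirstMatchDetector implements TypeDetector {
    private final List<TypeDetector> delegates;

    FirstMatchDetector(TypeDetector... delegates) {
        this.delegates = Arrays.asList(delegates);
    }

    public String detect(byte[] prefix, String fileName) {
        for (TypeDetector d : delegates) {
            String type = d.detect(prefix, fileName);
            if (!"application/octet-stream".equals(type)) {
                return type;   // first detector with an opinion wins
            }
        }
        return "application/octet-stream";
    }
}
```

 With this shape, a different selection strategy is a new class rather than 
 a change to an existing one.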





[jira] [Commented] (TIKA-1535) Inheritance modification for the class MIMETypes

2015-04-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14522082#comment-14522082
 ] 

ASF GitHub Bot commented on TIKA-1535:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/45


 Inheritance modification for the class MIMETypes
 

 Key: TIKA-1535
 URL: https://issues.apache.org/jira/browse/TIKA-1535
 Project: Tika
  Issue Type: Improvement
  Components: mime
Reporter: Luke sh
Assignee: Chris A. Mattmann
Priority: Trivial
  Labels: memex
 Fix For: 1.9


 The class MimeTypes does not currently allow for inheritance.
 Several methods in this class look independent of the rest, and some of 
 them need to be exposed or overridden for special needs or use cases. 
 This would give Tika users more flexibility to plug in new MIME detection 
 algorithms.
 
 It may be a good idea to extract the detector logic from the MimeTypes 
 class and create an independent detector for Tika.





[jira] [Commented] (TIKA-1582) Mime Detection based on neural networks with Byte-frequency-histogram

2015-05-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14525083#comment-14525083
 ] 

ASF GitHub Bot commented on TIKA-1582:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/36


 Mime Detection based on neural networks with Byte-frequency-histogram 
 --

 Key: TIKA-1582
 URL: https://issues.apache.org/jira/browse/TIKA-1582
 Project: Tika
  Issue Type: Improvement
  Components: detector, mime
Affects Versions: 1.7
Reporter: Luke sh
Assignee: Chris A. Mattmann
Priority: Trivial
  Labels: memex
 Fix For: 1.9

 Attachments: nnmodel.docx, week2-report-histogram comparison.docx, 
 week6 report.docx


 Content-based MIME type detection is one of the popular approaches to 
 detecting MIME types; others are based on file extensions and magic numbers. 
 Tika currently implements 3 approaches to detecting MIME types:
 1) file extensions
 2) magic numbers (the most trustworthy in Tika)
 3) content-type (the header in the HTTP response, if present and available) 
 Content-based MIME type detection, however, analyses the distribution of the 
 entire stream of bytes, finds a similar pattern across files of the same 
 type, and builds a function that groups them into one or several classes so 
 that new files can be classified. It is believed this feature might broaden 
 the usage of Tika by adding some security enforcement to MIME type 
 detection. Because the model is shaped only by the patterns it has seen, in 
 some situations we may not trust types that the model has not been trained 
 on. Magic numbers embedded in a file can also be copied while the actual 
 content is a potentially harmful Trojan program. By placing trust in byte 
 frequency patterns, we are able to enhance the security of the detection.
 The proposed content-based MIME detection to be integrated into Tika is 
 based on a machine learning algorithm, i.e. a neural network with 
 back-propagation. 
 The input: 256 bins (byte values 0-255), each of which stores the count of 
 occurrences of one byte value. The byte frequency histograms are normalized 
 to fall in the range between 0 and 1, and are then passed to a companding 
 function to enhance the infrequent bytes.
 The output of the neural network is a binary decision, 1 or 0.
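
 The input features described above can be sketched as follows. Two details 
 are assumptions made for this illustration, since the description does not 
 give them: the histogram is normalized by its largest bin, and the 
 companding function is a square root. ByteHistogram is a hypothetical name.

```java
// Sketch only: normalization by the largest bin and square-root companding are assumptions.
class ByteHistogram {
    /** Returns 256 features in [0, 1], one per byte value. */
    static double[] features(byte[] data) {
        long[] counts = new long[256];
        for (byte b : data) {
            counts[b & 0xFF]++;            // map the signed byte to 0..255
        }
        long max = 0;
        for (long c : counts) {
            max = Math.max(max, c);
        }
        double[] out = new double[256];
        if (max == 0) {
            return out;                    // empty input: all features stay 0
        }
        for (int i = 0; i < 256; i++) {
            double normalized = (double) counts[i] / max;   // now in 0..1
            out[i] = Math.sqrt(normalized);                 // companding lifts small values
        }
        return out;
    }
}
```

 A byte that occurs a quarter as often as the most frequent one gets the 
 feature value 0.5 rather than 0.25, which is the sense in which infrequent 
 bytes are enhanced.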
 Note that the proposed feature will be implemented with the GRB file type 
 as one example.
 In this example, we build a model that is able to distinguish the GRB file 
 type from non-GRB file types. Note that the set of non-GRB files is huge and 
 cannot be easily defined, so there need to be as many negative training 
 examples as possible to form the decision boundary for non-GRB types.
 The neural network is used in two stages: training and classification.
 The training can be done in any programming language. In this feature 
 /research, the training of the neural network is implemented in R, and the 
 source can be found in my GitHub repository, i.e. 
 https://github.com/LukeLiush/filetypeDetection; I am also going to post a 
 document that describes the use of the program and the syntax/format of the 
 input and output.
 After training, we need to export the model and import it into Tika. In 
 Tika, we create a TrainedModelDetector that reads this model file with one 
 or more model parameters, or several model files, so that it can detect 
 MIME types with the models of those MIME types. Details of the research and 
 usage of this proposed feature will be posted on my GitHub shortly.
 It is worth noting again that in this research we only worked out one model, 
 GRB, as an example to demonstrate the use of this content-based MIME 
 detection. One of the challenges, again, is that the non-GRB file types 
 cannot be clearly defined unless we feed our model example data for all of 
 the existing file types in the world, which is unrealistic, so it is better 
 that the set of classes/types is given and defined in advance to narrow the 
 problem domain. 
 Another challenge is the size of the training data; even if we know the 
 types we want to classify, getting enough training data to form a model can 
 also be one of the main factors of success. In our example model, GRB data 
 are collected from ftp://hydro1.sci.gsfc.nasa.gov/data/, and we find that 
 the GRB data from that source all exhibit a similar pattern: a simple neural 
 network structure is able to predict well, and even a linear logistic 
 regression is able to do a good job. However, if we pass the GRB files 
 collected from other sources to the model for 

[jira] [Commented] (TIKA-1517) MIME type selection with probability

2015-05-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14524934#comment-14524934
 ] 

ASF GitHub Bot commented on TIKA-1517:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/41


 MIME type selection with probability
 

 Key: TIKA-1517
 URL: https://issues.apache.org/jira/browse/TIKA-1517
 Project: Tika
  Issue Type: Improvement
  Components: mime
Affects Versions: 0.1-incubating, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 
 0.10, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6
Reporter: Luke sh
Priority: Trivial
  Labels: memex
 Fix For: 1.9

 Attachments: BaysianTest.java


 Improvement and intuition
 The original implementation of MIME type selection/detection is somewhat 
 inflexible by design, as it relies heavily on the outcome of magic-bytes 
 MIME type identification: if magic bytes are applicable to a file, Tika 
 follows the file type they indicate. It may be better to provide more 
 control over the choice of method.
 The proposed approach loosely applies Bayes' theorem: users assign a weight, 
 expressed as a probability, to each MIME type identification method 
 available in Tika, and so control which method is preferred. Tika currently 
 has 3 such methods (magic bytes, file extension and the metadata 
 content-type hint). With weights, users can choose which method they trust 
 most, although the magic-bytes method is usually trustworthy. The benefit 
 shows in situations where file type identification is sensitive: some users 
 might want all of the identification methods to agree on the same file type 
 before they start processing a file, because incorrect identification is 
 less tolerable for them. The current implementation is not flexible enough 
 for this purpose and relies heavily on the magic-bytes method (although 
 magic bytes are the most reliable of the 3). 
 Proposed design:
 The idea is to attach a probability as a weight to each MIME type 
 identification method currently implemented in Tika (magic bytes, file 
 extension match and metadata content-type hint).
 For example, as a user, I would like to assign a preference to each method 
 based on how much I trust it, and to order the results when the methods 
 disagree.
 Bayes' rule fits this intuition reasonably well.
 The following is needed to apply Bayes' rule.
  Prior probability P(file_type), e.g. P(pdf). In theory this is computed 
  from samples and depends on the domain or use case; in practice we care 
  more about the order of the weights or probabilities of the results than 
  about the actual numbers. For example, if we crawl a website that contains 
  mostly PDF files, we can collect some samples and compute the prior; if 
  90% of the sampled documents are PDF, the prior is P(pdf) = 0.9. Here we 
  propose to make the prior a configurable parameter for users and, by 
  default, to leave it not applicable. Alternatively, we can define the 
  prior for each file type to be 1/[number of supported file types in 
  Tika]. I think the number would be approximately 1/1157, and using it 
  seems fairer. The reason to avoid it is that this prior is the same for 
  every type: since we care mostly about the order of the results, and a 
  constant factor cannot change that order, bringing 1/1157 into the 
  Bayesian equation would only burden the implementation with extra 
  computation. We therefore leave the prior as not applicable, which means 
  we assign 1 to it as if it did not exist. Note that this parameter 
  remains configurable, and we believe it provides a lot of flexibility in 
  some use cases.
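
 The argument that a prior fixed for every type cannot change the order of 
 the results can be checked with a small sketch. It ranks types by the 
 unnormalized posterior, likelihood times prior. PriorOrdering is a 
 hypothetical name, and the likelihood values a caller passes in are 
 illustrative inputs rather than values taken from Tika.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative only: shows that one prior shared by all types leaves the ranking unchanged.
class PriorOrdering {
    /** Ranks types by unnormalized posterior = likelihood * prior, highest first. */
    static List<String> rank(Map<String, Double> likelihood, double prior) {
        List<String> types = new ArrayList<>(likelihood.keySet());
        types.sort((a, b) -> Double.compare(
                likelihood.get(b) * prior,
                likelihood.get(a) * prior));
        return types;
    }
}
```

 Calling rank with prior 1.0 and with prior 1.0 / 1157 on the same 
 likelihoods returns the same list, because multiplying every score by the 
 same positive constant preserves their order.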
  Conditional probability of a positive test given a file type, P(test | 
  file_type), e.g. P(test1 = pdf | pdf). This probability is also based on 
  a collection of samples and on the domain or use case, so we leave it 
  configurable. Based on our intuition, test1 (i.e. the magic-bytes method) 
  is the most trustworthy, so the default value is 0.75 for 
  P(test1 = a_file_type | a_file_type); that is to say, given a file whose 
  type is a_file_type, the probability of test1 predicting the file 

[jira] [Commented] (TIKA-443) Geographic Information Parser

2015-05-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14522848#comment-14522848
 ] 

ASF GitHub Bot commented on TIKA-443:
-

GitHub user gautham4 opened a pull request:

https://github.com/apache/tika/pull/47

PULL REQUEST for TIKA-443



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gautham4/tika TIKA-443

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/47.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #47


commit 6bfdbcd869455bbae7a4547b738e5a0b249053e8
Author: unknown gautham@gmail.com
Date:   2015-04-22T04:35:03Z

fix for TIKA-443 contributed by gautham4

commit 66ba03ee85946d7babf9815b9734f0ee83b4767f
Author: unknown gautham@gmail.com
Date:   2015-05-01T05:34:38Z

fix for TIKA-443 contributed by gautham@gmail.com




 Geographic Information Parser
 -

 Key: TIKA-443
 URL: https://issues.apache.org/jira/browse/TIKA-443
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Arturo Beltran
Assignee: Chris A. Mattmann
  Labels: new-parser
 Attachments: getFDOMetadata.xml


 I'm working on the automatic description of geospatial resources, and I think 
 it might be interesting to incorporate new parsers into Tika in order to 
 manage and describe some geo-formats. These geo-formats include files, 
 services and databases.
 If anyone is interested in this issue or wants to collaborate, do not 
 hesitate to contact me. Any help is welcome.





[jira] [Commented] (TIKA-443) Geographic Information Parser

2015-05-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14522907#comment-14522907
 ] 

ASF GitHub Bot commented on TIKA-443:
-

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/47


 Geographic Information Parser
 -

 Key: TIKA-443
 URL: https://issues.apache.org/jira/browse/TIKA-443
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Arturo Beltran
Assignee: Chris A. Mattmann
  Labels: new-parser
 Attachments: getFDOMetadata.xml


 I'm working on the automatic description of geospatial resources, and I think 
 it might be interesting to incorporate new parsers into Tika in order to 
 manage and describe some geo-formats. These geo-formats include files, 
 services and databases.
 If anyone is interested in this issue or wants to collaborate, do not 
 hesitate to contact me. Any help is welcome.





[jira] [Commented] (TIKA-1634) Detecting problem with Matlab source code

2015-06-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14571681#comment-14571681
 ] 

ASF GitHub Bot commented on TIKA-1634:
--

GitHub user jihyunoh opened a pull request:

https://github.com/apache/tika/pull/49

fix for TIKA-1634 contributed by Ji-Hyun Oh



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jihyunoh/tika TIKA-1634

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/49.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #49


commit f0cc53e79afa3feb8330a53eed6c73bc80a4f3c0
Author: jihyunoh mail2j...@gmail.com
Date:   2015-06-03T21:26:56Z

fix for TIKA-1634 contributed by Ji-Hyun Oh




 Detecting problem with Matlab source code
 -

 Key: TIKA-1634
 URL: https://issues.apache.org/jira/browse/TIKA-1634
 Project: Tika
  Issue Type: Improvement
  Components: mime
Affects Versions: 1.8
Reporter: Ji-Hyun Oh
Priority: Trivial
  Labels: earthcube
 Attachments: BARCAST_MainCode.m, Initial_Vals_Maker.m, 
 custom-mimetypes.xml, tika-mimetypes.xml, wtsgaus.m


 Both Matlab source code and Objective-C source code have the same suffix, 
 which is .m. Therefore, Matlab has an additional match value in the MIME 
 types XML. In tika-mimetypes.xml Matlab is defined as:
   <mime-type type="text/x-matlab">
     <_comment>Matlab source code</_comment>
     <magic priority="50">
       <match value="function [" type="string" offset="0"/>
     </magic>
     <!-- <glob pattern="*.m"/> - conflicts with text/x-objcsrc -->
     <sub-class-of type="text/plain"/>
   </mime-type>
 However, Matlab code does not always start with "function [". Therefore, 
 some Matlab files are detected as text/x-objcsrc. Based on the source code 
 collected from NOAA Paleoclimatology Software Resources, many Matlab files 
 would be matched by values like these (problematic files are attached as 
 examples):
   <mime-type type="text/x-matlab">
     <_comment>Matlab source code</_comment>
     <magic priority="50">
       <match value="function" type="string" offset="0"/>
       <match value="%" type="string" offset="0"/>
     </magic>
     <!-- <glob pattern="*.m"/> - conflicts with text/x-objcsrc -->
     <sub-class-of type="text/plain"/>
   </mime-type>
 I conducted several detection tests using different Matlab packages 
 obtained from NOAA Paleoclimatology Software Resources, with and without 
 custom-mimetypes.xml. Results are attached. In total, 103 Matlab files are 
 detected correctly with custom-mimetypes.xml, while only 42 are detected as 
 Matlab files without it (i.e. with only the current match value). However, 
 these match values may be common only in the Paleoclimatology community. 
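
 What the two proposed match values do can be sketched as a plain prefix 
 check at offset 0. This is only an illustration of the rule: MatlabMagic is 
 a hypothetical name and the code is not Tika's magic-matching 
 implementation.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the proposed match rules; not Tika's magic-matching code.
class MatlabMagic {
    private static final String[] PREFIXES = {"function", "%"};

    /** True when the content starts, at offset 0, with one of the proposed match values. */
    static boolean looksLikeMatlab(byte[] content) {
        for (String prefix : PREFIXES) {
            if (startsWith(content, prefix.getBytes(StandardCharsets.US_ASCII))) {
                return true;
            }
        }
        return false;
    }

    private static boolean startsWith(byte[] content, byte[] prefix) {
        if (content.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (content[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

 A file that begins with a "%" comment line, or with "function" followed by 
 something other than " [", passes this check even though it would not start 
 with the current "function [" value.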





[jira] [Commented] (TIKA-1634) Detecting problem with Matlab source code

2015-06-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14572192#comment-14572192
 ] 

ASF GitHub Bot commented on TIKA-1634:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/49


 Detecting problem with Matlab source code
 -

 Key: TIKA-1634
 URL: https://issues.apache.org/jira/browse/TIKA-1634
 Project: Tika
  Issue Type: Improvement
  Components: mime
Affects Versions: 1.8
Reporter: Ji-Hyun Oh
Assignee: Chris A. Mattmann
Priority: Trivial
  Labels: earthcube
 Attachments: BARCAST_MainCode.m, Initial_Vals_Maker.m, 
 custom-mimetypes.xml, tika-mimetypes.xml, wtsgaus.m


 Both Matlab source code and Objective-C source code have the same suffix, 
 which is .m. Therefore, Matlab has an additional match value in the MIME 
 types XML. In tika-mimetypes.xml Matlab is defined as:
   <mime-type type="text/x-matlab">
     <_comment>Matlab source code</_comment>
     <magic priority="50">
       <match value="function [" type="string" offset="0"/>
     </magic>
     <!-- <glob pattern="*.m"/> - conflicts with text/x-objcsrc -->
     <sub-class-of type="text/plain"/>
   </mime-type>
 However, Matlab code does not always start with "function [". Therefore, 
 some Matlab files are detected as text/x-objcsrc. Based on the source code 
 collected from NOAA Paleoclimatology Software Resources, many Matlab files 
 would be matched by values like these (problematic files are attached as 
 examples):
   <mime-type type="text/x-matlab">
     <_comment>Matlab source code</_comment>
     <magic priority="50">
       <match value="function" type="string" offset="0"/>
       <match value="%" type="string" offset="0"/>
     </magic>
     <!-- <glob pattern="*.m"/> - conflicts with text/x-objcsrc -->
     <sub-class-of type="text/plain"/>
   </mime-type>
 I conducted several detection tests using different Matlab packages 
 obtained from NOAA Paleoclimatology Software Resources, with and without 
 custom-mimetypes.xml. Results are attached. In total, 103 Matlab files are 
 detected correctly with custom-mimetypes.xml, while only 42 are detected as 
 Matlab files without it (i.e. with only the current match value). However, 
 these match values may be common only in the Paleoclimatology community. 





[jira] [Commented] (TIKA-1659) ZipContainerDetector does not detect all IPA files

2015-06-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14591754#comment-14591754
 ] 

ASF GitHub Bot commented on TIKA-1659:
--

GitHub user Rshomali opened a pull request:

https://github.com/apache/tika/pull/51

fix for TIKA-1659 contributed by rami



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/Rshomali/tika TIKA-1659

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/51.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #51


commit b333284c8d77ecc26431c765d1fba495ea168aef
Author: Rami Shomali rami.shom...@lookout.com
Date:   2015-06-18T13:02:16Z

fix for TIKA-1659 contributed by rami




 ZipContainerDetector does not detect all IPA files 
 ---

 Key: TIKA-1659
 URL: https://issues.apache.org/jira/browse/TIKA-1659
 Project: Tika
  Issue Type: Bug
  Components: mime
Reporter: Rami Shomali

 ZipContainerDetector expects two files to identify the IPA file as 
 application/x-itunes-ipa:
 1) app/CodeResources
 2) app/ResourceRules.plist
 https://github.com/apache/tika/blob/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pkg/ZipContainerDetector.java#L386
 Recent IPA files downloaded from the App Store do not include those files. 
 Need to update ZipContainerDetector and remove the patterns for those two 
 files.
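
 Why a required-file rule misses newer IPA files can be sketched with a 
 small check that succeeds only when every required pattern matches some 
 entry name in the archive. This is a simplified illustration and not the 
 ZipContainerDetector code: RequiredEntries is a hypothetical name, and the 
 patterns a caller passes in are the caller's own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Simplified illustration; not the ZipContainerDetector implementation.
class RequiredEntries {
    /** True only if every required pattern matches at least one entry name. */
    static boolean allPresent(List<Pattern> required, List<String> entryNames) {
        List<Pattern> remaining = new ArrayList<>(required);
        for (String name : entryNames) {
            // drop each pattern as soon as some entry satisfies it
            remaining.removeIf(p -> p.matcher(name).matches());
        }
        return remaining.isEmpty();
    }
}
```

 Under such a rule, an archive that lacks even one required entry is 
 rejected, so dropping the patterns for entries that recent archives no 
 longer contain lets those archives pass.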





[jira] [Commented] (TIKA-1664) GDALParser does not correctly set nitf as a supported MediaType

2015-06-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601810#comment-14601810
 ] 

ASF GitHub Bot commented on TIKA-1664:
--

GitHub user jrnorth opened a pull request:

https://github.com/apache/tika/pull/53

Fix for TIKA-1664



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jrnorth/tika trunk

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/53.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #53


commit 5bfddec49cde9f6ebe59e76fd3de44f8f49e07c0
Author: Joseph North joerno...@gmail.com
Date:   2015-06-25T19:43:12Z

Fix for TIKA-1664




 GDALParser does not correctly set nitf as a supported MediaType
 -

 Key: TIKA-1664
 URL: https://issues.apache.org/jira/browse/TIKA-1664
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.7, 1.8, 1.9
Reporter: Joseph North
  Labels: easyfix, patch

 GDALParser incorrectly adds ntif as a supported MediaType. It should be 
 nitf.





[jira] [Commented] (TIKA-1659) ZipContainerDetector does not detect all IPA files

2015-06-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601632#comment-14601632
 ] 

ASF GitHub Bot commented on TIKA-1659:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/51


 ZipContainerDetector does not detect all IPA files 
 ---

 Key: TIKA-1659
 URL: https://issues.apache.org/jira/browse/TIKA-1659
 Project: Tika
  Issue Type: Bug
  Components: mime
Reporter: Rami Shomali
Assignee: Chris A. Mattmann
 Fix For: 1.10

 Attachments: CamLingual.ipa, hacker_news.ipa


 ZipContainerDetector expects two files to identify the IPA file as 
 application/x-itunes-ipa:
 1) app/CodeResources
 2) app/ResourceRules.plist
 https://github.com/apache/tika/blob/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pkg/ZipContainerDetector.java#L386
 Recent IPA files downloaded from the App Store do not include those files. 
 Need to update ZipContainerDetector and remove the patterns for those two 
 files.





[jira] [Commented] (TIKA-1655) Inconsistent formatting in parsers pom.xml file

2015-06-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14582017#comment-14582017
 ] 

ASF GitHub Bot commented on TIKA-1655:
--

GitHub user Purg opened a pull request:

https://github.com/apache/tika/pull/50

Adjusted indentation in pom.xml file to match rest of file

Fix for TIKA-1655 contributed by paul.tunison

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/Purg/tika TIKA-1655

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/50.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #50


commit 0684258016fbebbc4efbbf33a542f23e444aff02
Author: Paul Tunison paul.tuni...@kitware.com
Date:   2015-06-11T14:50:32Z

Adjusted indentation in pom.xml file to match rest of file

Fix for TIKA-1655 contributed by paul.tunison




 Inconsistent formatting in parsers pom.xml file
 ---

 Key: TIKA-1655
 URL: https://issues.apache.org/jira/browse/TIKA-1655
 Project: Tika
  Issue Type: Bug
 Environment: None
Reporter: Paul Tunison
Priority: Trivial
  Labels: easyfix, maven
 Fix For: 1.10


 Indentation inconsistency in tika-parsers/pom.xml under Provided 
 Dependencies comment.





[jira] [Commented] (TIKA-1655) Inconsistent formatting in parsers pom.xml file

2015-06-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14582113#comment-14582113
 ] 

ASF GitHub Bot commented on TIKA-1655:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/50


 Inconsistent formatting in parsers pom.xml file
 ---

 Key: TIKA-1655
 URL: https://issues.apache.org/jira/browse/TIKA-1655
 Project: Tika
  Issue Type: Bug
 Environment: None
Reporter: Paul Tunison
Priority: Trivial
  Labels: easyfix, maven
 Fix For: 1.10


 Indentation inconsistency in tika-parsers/pom.xml under Provided 
 Dependencies comment.





[jira] [Commented] (TIKA-1614) Geo Topic Parser

2015-05-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557903#comment-14557903
 ] 

ASF GitHub Bot commented on TIKA-1614:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/43


 Geo Topic Parser
 

 Key: TIKA-1614
 URL: https://issues.apache.org/jira/browse/TIKA-1614
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Anya Yun Li
Assignee: Chris A. Mattmann
  Labels: memex
 Attachments: TIKA-1614.Mattmann.Li.052405.patch.txt


 ##Description
 This program aims to identify geonames in any unstructured text data for 
 the NSF polar research project. 
 https://github.com/NSF-Polar-Cyberinfrastructure/datavis-hackathon/issues/1
 This project is a content-based geotagging solution, built from a variety of NLP 
 tools, and could be used for any geotagging purpose. 
 ##Workflow
 1. Plain text input is passed to the geoparser
 2. Location names are extracted from the text using OpenNLP NER
 3. The extracted names are given one of two roles: 
   * The most frequent location name is chosen as the best match for the 
 input text
   * Other extracted locations are treated as alternatives (equal)
 4. For each location extracted above, search for the best GeoName object and 
 return the resolved objects with fields (name in gazetteer, longitude, latitude)
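The selection rule in step 3 can be sketched in plain Java. This is a minimal illustration and not the parser's actual code: the class and method names are hypothetical, and it assumes the NER step has already produced a list of location strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Tika: orders extracted location names so
// the most frequent one comes first and the rest follow as alternatives.
public class BestLocationSketch {

    static List<String> rankByFrequency(List<String> names) {
        // LinkedHashMap remembers first-seen order, which makes ties predictable.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String name : names) {
            counts.merge(name, 1, Integer::sum);
        }
        List<String> ranked = new ArrayList<>(counts.keySet());
        // List.sort is stable, so equally frequent names keep first-seen order.
        ranked.sort((a, b) -> Integer.compare(counts.get(b), counts.get(a)));
        return ranked;
    }

    public static void main(String[] args) {
        List<String> ranked = rankByFrequency(
                Arrays.asList("Barrow", "Nome", "Barrow", "Fairbanks"));
        System.out.println("best match:   " + ranked.get(0));
        System.out.println("alternatives: " + ranked.subList(1, ranked.size()));
    }
}
```

The first element would map to the best-match fields and the remaining elements to the numbered alternative fields.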
 ##How to Use
 *Caution*: This program requires at least 1.2 GB of disk space to build the 
 Lucene index
 ```Java
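 import java.io.InputStream;
 import org.apache.tika.metadata.Metadata;
 import org.apache.tika.parser.ParseContext;
 import org.apache.tika.parser.geo.topic.GeoParserConfig;
 import org.apache.tika.sax.BodyContentHandler;

 // geoparser, gazetteerPath and nerPath are assumed to be defined by the caller.
 void printGeoMetadata(InputStream stream) throws Exception {
     Metadata metadata = new Metadata();
     ParseContext context = new ParseContext();

     // Point the parser at the gazetteer index and the NER model.
     GeoParserConfig config = new GeoParserConfig();
     config.setGazetterPath(gazetteerPath);
     config.setNERModelPath(nerPath);
     context.set(GeoParserConfig.class, config);

     geoparser.parse(stream, new BodyContentHandler(), metadata, context);

     // Print every metadata field the parser produced.
     for (String name : metadata.names()) {
         String value = metadata.get(name);
         System.out.println(name + ": " + value);
     }
 }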
 ```
 This parser adds geographical information to Tika's Metadata 
 object. 
 Fields for best matched location:
 ```
 Geographic_NAME
 Geographic_LONGTITUDE
 Geographic_LATITUDE
 ```
 Fields for alternatives:
 ```
 Geographic_NAME1
 Geographic_LONGTITUDE1
 Geographic_LATITUDE1
 Geographic_NAME2
 Geographic_LONGTITUDE2
 Geographic_LATITUDE2
 ...
 ```
 If you have any questions, contact me: anyayu...@gmail.com





[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika

2015-07-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646321#comment-14646321
 ] 

ASF GitHub Bot commented on TIKA-1699:
--

GitHub user sujen1412 opened a pull request:

https://github.com/apache/tika/pull/55

Fix for TIKA-1699 contributed by Sujen Shah

Waiting for GROBID to get published to maven central. 
Sonatype issue - https://issues.sonatype.org/browse/OSSRH-16837

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sujen1412/tika TIKA-1699

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/55.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #55


commit 4f067107d01e99bd81a66c78163f2a4baf3f817f
Author: Sujen Shah sujen1...@gmail.com
Date:   2015-07-29T13:49:00Z

Added grobid dependencies

commit 323ba33816a9beabe22d351c8eac4350fa010be0
Author: Sujen Shah sujen1...@gmail.com
Date:   2015-07-29T13:49:36Z

Registering journal parser

commit 71cdd0970fb17aeec85469d07dc1ee6460d2f4da
Author: Sujen Shah sujen1...@gmail.com
Date:   2015-07-29T13:54:07Z

Code for integrating GROBID Parser in to Tika

commit b6e9f8724b308e0c830f73702994cbe1c5932cd2
Author: Sujen Shah sujen1...@gmail.com
Date:   2015-07-29T13:58:08Z

Grobid properties files

commit 57b70ce38a77cc349588d2f513938bc4f18d4ad4
Author: Sujen Shah sujen1...@gmail.com
Date:   2015-07-29T13:58:58Z

Added unit test for journal parser

Corrected formatting

Corrected formatting

Corrected formatting




 Integrate the GROBID PDF extractor in Tika
 --

 Key: TIKA-1699
 URL: https://issues.apache.org/jira/browse/TIKA-1699
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Sujen Shah
  Labels: memex

 GROBID (http://grobid.readthedocs.org/en/latest/) is a machine learning 
 library for extracting, parsing and re-structuring raw documents such as PDF 
 into structured TEI-encoded documents with a particular focus on technical 
 and scientific publications.
 It has a Java API that can be used to augment PDF parsing for journals and 
 help extract extra metadata about the paper, such as authors, publication, 
 and citations. 
 It would be nice to have this integrated into Tika. I have tried it locally 
 and will issue a pull request soon.





[jira] [Commented] (TIKA-1703) Can't Specify Tesseract Data Folder Distinct from Tesseract Executable Path

2015-08-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14652940#comment-14652940
 ] 

ASF GitHub Bot commented on TIKA-1703:
--

GitHub user taidan19 opened a pull request:

https://github.com/apache/tika/pull/56

TIKA-1703 Add ability to specify Tesseract config path.

Link to Jira ticket - https://issues.apache.org/jira/browse/TIKA-1703

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/taidan19/tika TIKA-1703

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/56.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #56


commit 86e8fdf187af5051812e1164c4cc3fef737a0644
Author: Christian Wolfe taida...@gmail.com
Date:   2015-08-04T00:54:23Z

TIKA-1703 Add ability to specify Tesseract config path.




 Can't Specify Tesseract Data Folder Distinct from Tesseract Executable Path
 ---

 Key: TIKA-1703
 URL: https://issues.apache.org/jira/browse/TIKA-1703
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.9
Reporter: Christian Wolfe
Priority: Minor
 Fix For: 1.9


 If a user specifies the path to the Tesseract executable using 
 {{TesseractOCRConfig.setTesseractPath}}, then Tika will assume that the 
 Tesseract config folder (usually referred to as the 'tessdata' folder) is in 
 the same location. This is usually true in a Windows environment, where 
 everything is installed into a central location.
 However, this is not necessarily the case in a Linux environment. If one were 
 to build Tesseract from source, for example, the config folder will be 
 installed in a different location than the Tesseract executable.
 One way to fix this would be to add a way to specify the location of the 
 Tesseract config folder separate from the path to the executable.
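One way to picture the proposed behavior is a lookup that prefers an explicitly configured data folder and falls back to the executable's folder only when none is given. The sketch below is plain Java with hypothetical names (`TessPathsSketch`, `resolveTessdataDir`) and example paths; it is not Tika's `TesseractOCRConfig` API and does not claim to match the pull request.

```java
// Hypothetical illustration of "separate data folder with fallback".
public class TessPathsSketch {

    // Returns the folder to search for 'tessdata': the explicit setting when
    // present, otherwise the folder of the Tesseract executable (old behavior).
    static String resolveTessdataDir(String tesseractPath, String tessdataPath) {
        if (tessdataPath != null && !tessdataPath.trim().isEmpty()) {
            return tessdataPath;
        }
        return tesseractPath;
    }

    public static void main(String[] args) {
        // Typical Windows-style install: everything lives in one folder.
        System.out.println(resolveTessdataDir("C:/Tesseract-OCR", null));
        // Source build on Linux: executable and data folder differ.
        System.out.println(resolveTessdataDir("/usr/local/bin", "/usr/local/share"));
    }
}
```

Keeping the fallback means existing configurations that set only the executable path would continue to behave as before.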





[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika

2015-08-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696515#comment-14696515
 ] 

ASF GitHub Bot commented on TIKA-1699:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/55


 Integrate the GROBID PDF extractor in Tika
 --

 Key: TIKA-1699
 URL: https://issues.apache.org/jira/browse/TIKA-1699
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Sujen Shah
Assignee: Chris A. Mattmann
  Labels: memex
 Fix For: 1.11


 GROBID (http://grobid.readthedocs.org/en/latest/) is a machine learning 
 library for extracting, parsing and re-structuring raw documents such as PDF 
 into structured TEI-encoded documents with a particular focus on technical 
 and scientific publications.
 It has a Java API that can be used to augment PDF parsing for journals and 
 help extract extra metadata about the paper, such as authors, publication, 
 and citations. 
 It would be nice to have this integrated into Tika. I have tried it locally 
 and will issue a pull request soon.





[jira] [Commented] (TIKA-1703) Can't Specify Tesseract Data Folder Distinct from Tesseract Executable Path

2015-08-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654652#comment-14654652
 ] 

ASF GitHub Bot commented on TIKA-1703:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/56


 Can't Specify Tesseract Data Folder Distinct from Tesseract Executable Path
 ---

 Key: TIKA-1703
 URL: https://issues.apache.org/jira/browse/TIKA-1703
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.9
Reporter: Christian Wolfe
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 1.11


 If a user specifies the path to the Tesseract executable using 
 {{TesseractOCRConfig.setTesseractPath}}, then Tika will assume that the 
 Tesseract config folder (usually referred to as the 'tessdata' folder) is in 
 the same location. This is usually true in a Windows environment, where 
 everything is installed into a central location.
 However, this is not necessarily the case in a Linux environment. If one were 
 to build Tesseract from source, for example, the config folder will be 
 installed in a different location than the Tesseract executable.
 One way to fix this would be to add a way to specify the location of the 
 Tesseract config folder separate from the path to the executable.





[jira] [Commented] (TIKA-1694) Missing <scope>test</scope> for junit dependency

2015-07-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14638903#comment-14638903
 ] 

ASF GitHub Bot commented on TIKA-1694:
--

GitHub user tmortagne opened a pull request:

https://github.com/apache/tika/pull/54

TIKA-1694: Missing <scope>test</scope> for junit dependency



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tmortagne/tika trunk

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/54.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #54


commit 70d4c9d4ade2e4df5fd7ddef90592ea4da0ce9b1
Author: Thomas Mortagne thomas.morta...@gmail.com
Date:   2015-07-23T14:23:10Z

TIKA-1694: Missing <scope>test</scope> for junit dependency




 Missing <scope>test</scope> for junit dependency
 

 Key: TIKA-1694
 URL: https://issues.apache.org/jira/browse/TIKA-1694
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.9
Reporter: Thomas Mortagne
Priority: Minor







[jira] [Commented] (TIKA-1694) Missing <scope>test</scope> for junit dependency

2015-07-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14638969#comment-14638969
 ] 

ASF GitHub Bot commented on TIKA-1694:
--

Github user tmortagne closed the pull request at:

https://github.com/apache/tika/pull/54


 Missing <scope>test</scope> for junit dependency
 

 Key: TIKA-1694
 URL: https://issues.apache.org/jira/browse/TIKA-1694
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.9
Reporter: Thomas Mortagne
Assignee: Konstantin Gribov
Priority: Minor

 Pull request on https://github.com/apache/tika/pull/54.





[jira] [Commented] (TIKA-1791) URI is not hierarchical exception when location model resource is inside a jar in classpath

2015-11-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997841#comment-14997841
 ] 

ASF GitHub Bot commented on TIKA-1791:
--

GitHub user thammegowda opened a pull request:

https://github.com/apache/tika/pull/63

TIKA-1791 fix : non hierarchical URI exception when NER model is inside jar 
file

Improvement : Model is loaded once and NameFinder is reused 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/thammegowda/tika fix-1791

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/63.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #63


commit 4c02c9e94adfde2163a2e2b4fac9425e5485a583
Author: Thamme Gowda 
Date:   2015-11-10T01:39:44Z

TIKA-1791 fix : non hierarchical URI for NER model




> URI is not hierarchical exception when location model resource is inside a 
> jar in classpath
> ---
>
> Key: TIKA-1791
> URL: https://issues.apache.org/jira/browse/TIKA-1791
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.11
> Environment: location model  file is placed inside a fat Jar (with 
> all the dependencies)
>Reporter: Thamme Gowda N
>
> {code:title=Stacktrace|borderStyle=solid}
> The following error happens when location NER model resource is packaged 
> inside a jar and GeoTopicParser is enabled.
> Caused by: java.lang.IllegalArgumentException: URI is not hierarchical
>   at java.io.File.<init>(File.java:418)
>   at 
> org.apache.tika.parser.geo.topic.GeoParserConfig.<init>(GeoParserConfig.java:33)
>   at org.apache.tika.parser.geo.topic.GeoParser.<init>(GeoParser.java:54)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>   at java.lang.Class.newInstance(Class.java:442)
>   at 
> org.apache.tika.config.TikaConfig$XmlLoader.loadOne(TikaConfig.java:559)
>   at 
> org.apache.tika.config.TikaConfig$XmlLoader.loadOverall(TikaConfig.java:492)
>   at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:166)
>   at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:149)
>   at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:142)
>   at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:138)
>   at edu.usc.cs.ir.cwork.tika.Parser.<init>(Parser.java:45)
> {code}
> References:
> http://stackoverflow.com/questions/18055189/why-my-uri-is-not-hierarchical
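The failure mode in the stack trace can be reproduced with the JDK alone. The sketch below is an illustration under assumptions: the class name and the example paths are hypothetical, and it does not show how the pull request changes `GeoParserConfig`. It shows that `java.io.File` accepts a plain `file:` URI but rejects a `jar:` URI because that URI is opaque, and that reading the resource as a stream sidesteps the problem.

```java
import java.io.File;
import java.io.InputStream;
import java.net.URI;

// Hypothetical demo class; the paths below are examples, not Tika resources.
public class HierarchicalUriSketch {

    // True when java.io.File accepts the URI, false when it rejects it.
    static boolean canBeFile(URI uri) {
        try {
            new File(uri);
            return true;
        } catch (IllegalArgumentException e) {
            // For jar: URIs the message is "URI is not hierarchical".
            return false;
        }
    }

    // Works whether the resource sits in a directory or inside a jar,
    // because no File object is ever created. Returns null if not found.
    static InputStream openResource(Class<?> owner, String name) {
        return owner.getResourceAsStream(name);
    }

    public static void main(String[] args) {
        URI onDisk = URI.create("file:/tmp/models/model.bin");
        URI inJar = URI.create("jar:file:/tmp/app.jar!/models/model.bin");
        System.out.println("file URI usable as File: " + canBeFile(onDisk));
        System.out.println("jar URI usable as File:  " + canBeFile(inJar));
        System.out.println("jar URI is opaque:       " + inJar.isOpaque());
    }
}
```

A resource packaged in a fat jar has a `jar:` URL, so code that converts `getResource(...)` to a `File` only works while the resource is unpacked on disk.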





[jira] [Commented] (TIKA-1791) URI is not hierarchical exception when location model resource is inside a jar in classpath

2015-11-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15006009#comment-15006009
 ] 

ASF GitHub Bot commented on TIKA-1791:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/63


> URI is not hierarchical exception when location model resource is inside a 
> jar in classpath
> ---
>
> Key: TIKA-1791
> URL: https://issues.apache.org/jira/browse/TIKA-1791
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.11
> Environment: location model  file is placed inside a fat Jar (with 
> all the dependencies)
>Reporter: Thamme Gowda N
>
> {code:title=Stacktrace|borderStyle=solid}
> The following error happens when location NER model resource is packaged 
> inside a jar and GeoTopicParser is enabled.
> Caused by: java.lang.IllegalArgumentException: URI is not hierarchical
>   at java.io.File.<init>(File.java:418)
>   at 
> org.apache.tika.parser.geo.topic.GeoParserConfig.<init>(GeoParserConfig.java:33)
>   at org.apache.tika.parser.geo.topic.GeoParser.<init>(GeoParser.java:54)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>   at java.lang.Class.newInstance(Class.java:442)
>   at 
> org.apache.tika.config.TikaConfig$XmlLoader.loadOne(TikaConfig.java:559)
>   at 
> org.apache.tika.config.TikaConfig$XmlLoader.loadOverall(TikaConfig.java:492)
>   at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:166)
>   at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:149)
>   at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:142)
>   at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:138)
>   at edu.usc.cs.ir.cwork.tika.Parser.<init>(Parser.java:45)
> {code}
> References:
> http://stackoverflow.com/questions/18055189/why-my-uri-is-not-hierarchical





[jira] [Commented] (TIKA-1787) Include Stanford Name Entity Recognition in Tika

2015-11-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15009110#comment-15009110
 ] 

ASF GitHub Bot commented on TIKA-1787:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/61


> Include Stanford Name Entity Recognition in Tika
> 
>
> Key: TIKA-1787
> URL: https://issues.apache.org/jira/browse/TIKA-1787
> Project: Tika
>  Issue Type: Improvement
>  Components: mime, parser
>Affects Versions: 1.12
> Environment: Java 1.8, Mac OSX 10.11
>Reporter: Yueheng He
>Assignee: Chris A. Mattmann
>  Labels: features, newbie, test
> Fix For: 1.12
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Using the Stanford Name Entity Recognition, Tika will be able to extract named 
> entities like PERSON, ORGANIZATION, LOCATION, etc. from the given text. The 
> extracted named entities will be added to the metadata.





[jira] [Commented] (TIKA-1763) StringIndexOutOfBoundsException in ImageMetadataExtractor

2015-10-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948070#comment-14948070
 ] 

ASF GitHub Bot commented on TIKA-1763:
--

Github user jrnorth closed the pull request at:

https://github.com/apache/tika/pull/57


> StringIndexOutOfBoundsException in ImageMetadataExtractor
> -
>
> Key: TIKA-1763
> URL: https://issues.apache.org/jira/browse/TIKA-1763
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.10
>Reporter: Joseph North
>Priority: Minor
>
> The {{trimPixels(String s)}} method in {{ImageMetadataExtractor}} will throw 
> a {{StringIndexOutOfBoundsException}} if the string passed to it doesn't 
> contain the substring " pixels". The method does a {{lastIndexOf(" pixels")}} 
> call on the string passed to it and uses the result directly in a 
> {{substring()}} call without checking whether the result is greater than -1.
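The missing guard can be sketched as follows. This is a standalone illustration, not the code from `ImageMetadataExtractor` or from the pull request: the class name is hypothetical, and returning the input unchanged when " pixels" is absent is an assumption about the desired behavior.

```java
// Hypothetical standalone version of the check described in the issue.
public class TrimPixelsSketch {

    // Strips a trailing " pixels" unit, e.g. "100 pixels" -> "100".
    static String trimPixels(String s) {
        if (s == null) {
            return null;
        }
        int pos = s.lastIndexOf(" pixels");
        // lastIndexOf returns -1 when the substring is absent; passing that
        // to substring(0, -1) is what throws StringIndexOutOfBoundsException.
        return pos > -1 ? s.substring(0, pos) : s;
    }

    public static void main(String[] args) {
        System.out.println(trimPixels("100 pixels"));
        System.out.println(trimPixels("100"));
    }
}
```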





[jira] [Commented] (TIKA-1772) Mimetype of VTT files

2015-10-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960290#comment-14960290
 ] 

ASF GitHub Bot commented on TIKA-1772:
--

GitHub user wiedsche opened a pull request:

https://github.com/apache/tika/pull/59

fix for TIKA-1772 contributed by wiedsche



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wiedsche/tika TIKA-1772

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/59.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #59


commit 08a4df4e2b6a0d2cd14dc411906ed4a4a45814a3
Author: Alexander Widera 
Date:   2015-10-16T07:15:56Z

fix for TIKA-1772 contributed by wiedsche




> Mimetype of VTT files
> -
>
> Key: TIKA-1772
> URL: https://issues.apache.org/jira/browse/TIKA-1772
> Project: Tika
>  Issue Type: Improvement
>Reporter: Alexander Widera
>Priority: Minor
>
> Files with extension "vtt" are "WebVTT: The Web Video Text Tracks Format" 
> files.
> The mimetype resolved by tika is currently text/plain.
> The correct mimetype should be text/vtt.
> see: https://w3c.github.io/webvtt/
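The two signals that can distinguish a WebVTT file from generic plain text are its ".vtt" extension and the "WEBVTT" signature at the start of the file, optionally preceded by a byte order mark. The sketch below is a simplified illustration in plain Java with hypothetical names; it is not Tika's detector and not necessarily how the pull request implements the change (the full signature rule in the WebVTT specification also constrains the character after "WEBVTT").

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical sketch of extension plus signature sniffing for WebVTT.
public class VttSniffSketch {

    static String guessType(String fileName, byte[] head) {
        String text = new String(head, StandardCharsets.UTF_8);
        // Skip an optional UTF-8 byte order mark, decoded as U+FEFF.
        if (text.startsWith("\uFEFF")) {
            text = text.substring(1);
        }
        boolean hasSignature = text.startsWith("WEBVTT");
        boolean hasExtension = fileName.toLowerCase(Locale.ROOT).endsWith(".vtt");
        return (hasSignature || hasExtension) ? "text/vtt" : "text/plain";
    }

    public static void main(String[] args) {
        byte[] vtt = "WEBVTT\n\n00:01.000 --> 00:04.000\nHello".getBytes(StandardCharsets.UTF_8);
        byte[] plain = "Hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(guessType("subtitles.vtt", plain));
        System.out.println(guessType("subtitles.txt", vtt));
        System.out.println(guessType("notes.txt", plain));
    }
}
```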





[jira] [Commented] (TIKA-1772) Mimetype of VTT files

2015-10-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962569#comment-14962569
 ] 

ASF GitHub Bot commented on TIKA-1772:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/59


> Mimetype of VTT files
> -
>
> Key: TIKA-1772
> URL: https://issues.apache.org/jira/browse/TIKA-1772
> Project: Tika
>  Issue Type: Improvement
>Reporter: Alexander Widera
>Priority: Minor
> Fix For: 1.11
>
> Attachments: upc-video-subtitles-en.vtt
>
>
> Files with extension "vtt" are "WebVTT: The Web Video Text Tracks Format" 
> files.
> The mimetype resolved by tika is currently text/plain.
> The correct mimetype should be text/vtt.
> see: https://w3c.github.io/webvtt/





[jira] [Commented] (TIKA-1763) StringIndexOutOfBoundsException in ImageMetadataExtractor

2015-10-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943692#comment-14943692
 ] 

ASF GitHub Bot commented on TIKA-1763:
--

GitHub user jrnorth opened a pull request:

https://github.com/apache/tika/pull/57

Fix for TIKA-1763

-Prevents a StringIndexOutOfBoundsException by checking the result of a
call to lastIndexOf() before using it in a call to substring().
-Adds unit tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jrnorth/tika trunk

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/57.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #57


commit 807302f2775d9ee6a6a9f3820665330edc163d88
Author: Joseph North 
Date:   2015-10-05T17:12:54Z

Fix for TIKA-1763

-Prevents a StringIndexOutOfBoundsException by checking the result of a
call to lastIndexOf() before using it in a call to substring().
-Adds unit tests.




> StringIndexOutOfBoundsException in ImageMetadataExtractor
> -
>
> Key: TIKA-1763
> URL: https://issues.apache.org/jira/browse/TIKA-1763
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.10
>Reporter: Joseph North
>Priority: Minor
>
> The {{trimPixels(String s)}} method in {{ImageMetadataExtractor}} will throw 
> a {{StringIndexOutOfBoundsException}} if the string passed to it doesn't 
> contain the substring " pixels". The method does a {{lastIndexOf(" pixels")}} 
> call on the string passed to it and uses the result directly in a 
> {{substring()}} call without checking whether the result is greater than -1.





[jira] [Commented] (TIKA-1803) Use lucene-geo-gazetteer REST API in GeoTopicParser

2015-12-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060182#comment-15060182
 ] 

ASF GitHub Bot commented on TIKA-1803:
--

GitHub user smadha opened a pull request:

https://github.com/apache/tika/pull/65

fix for TIKA-1803 contributed by msha...@usc.edu



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/smadha/tika TIKA-1803

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/65.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #65


commit a55990aa5d6a0c521358123f8d7bbd6947255174
Author: smadha 
Date:   2015-12-16T15:26:23Z

fix for TIKA-1803 contributed by msha...@usc.edu




> Use lucene-geo-gazetteer REST API in GeoTopicParser
> ---
>
> Key: TIKA-1803
> URL: https://issues.apache.org/jira/browse/TIKA-1803
> Project: Tika
>  Issue Type: Sub-task
>  Components: parser
>Reporter: Madhav Sharan
>
> As of now, Tika uses the lucene-geo-gazetteer CLI to extract the coordinates of a 
> location. The CLI requires a JVM and Lucene to be instantiated for every request. 
> With the new REST API, it should be possible to improve on this.
> The idea is to create a lucene-geo-gazetteer client in Tika and use it in 
> GeoTopicParser.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1816) Lenient testing for NamedEntityParser

2015-12-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15065966#comment-15065966
 ] 

ASF GitHub Bot commented on TIKA-1816:
--

GitHub user thammegowda opened a pull request:

https://github.com/apache/tika/pull/68

FIX for TIKA-1816 by Thamme Gowda

Lenient testing for `NamedEntityParser`

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/thammegowda/tika TIKA-1816

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/68.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #68


commit 865de584be7cda0ed34c677f5bff5bb87b7a6996
Author: Thamme Gowda 
Date:   2015-12-21T01:14:56Z

Lenient testing for NamedEntityParser




> Lenient testing for NamedEntityParser
> -
>
> Key: TIKA-1816
> URL: https://issues.apache.org/jira/browse/TIKA-1816
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Thamme Gowda N
>
> NamedEntityParser has a hard setup requirement like downloading of NER models 
> from remote servers and adding them to classpath.
> These model files are huge and hence are not added to source control.
> So, the tests are most likely to fail in various environments.
> Make the best effort to set up the tests, but in the worst case skip tests 
> instead of failing the whole build process.





[jira] [Commented] (TIKA-1815) Text content from parser is empty when NamedEntityParser is enabled

2015-12-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15065535#comment-15065535
 ] 

ASF GitHub Bot commented on TIKA-1815:
--

Github user thammegowda closed the pull request at:

https://github.com/apache/tika/pull/66


> Text content from parser is empty when NamedEntityParser is enabled
> ---
>
> Key: TIKA-1815
> URL: https://issues.apache.org/jira/browse/TIKA-1815
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Thamme Gowda N
> Fix For: 1.12
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> When the NamedEntityParser is enabled, Tika#parseToString() and the other 
> parse() methods produce an empty string.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1815) Text content from parser is empty when NamedEntityParser is enabled

2015-12-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15065485#comment-15065485
 ] 

ASF GitHub Bot commented on TIKA-1815:
--

GitHub user thammegowda opened a pull request:

https://github.com/apache/tika/pull/66

Fix for TIKA-1815 contributed by Thamme Gowda

+ Outputting the text content to XMLDocumentHandler

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/thammegowda/tika fix-TIKA-1815

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/66.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #66


commit e96da2bc28d5eef81d034e39eb05099ed5d38ac1
Author: Thamme Gowda 
Date:   2015-10-30T21:47:45Z

Add NamedEntityParser

Add OpenNLPNERecogniser as default

commit a720507a1c1906a501470a7d5c5cec335412fcd3
Author: Thamme Gowda 
Date:   2015-10-30T22:16:11Z

Set charset for converting text to stream

commit 6b1a20e681a5d319886464ec147967c876b7e60d
Author: Thamme Gowda 
Date:   2015-10-31T04:23:43Z

Automated OpenNLP NER model downloader

commit e381ea88ebd2bb8f5adfe36d710acfce673e30aa
Author: Thamme Gowda 
Date:   2015-11-04T00:31:40Z

using a secondary parser to convert non-text streams

commit ea7871bd4afae7d18e500ffc285e58afd08f5e86
Author: Thamme Gowda 
Date:   2015-11-08T07:36:48Z

Add regex based NER

commit 084985b3612438e9ca7107fecdffd67757d04d10
Author: Thamme Gowda 
Date:   2015-11-08T07:38:17Z

Add CoreNLP NER with runtime binding

commit e4d74218ece77143d1e5245a3ef64ddf5578c310
Author: Thamme Gowda 
Date:   2015-11-08T23:41:15Z

Added support for chaining NER implementations

commit 7e6b43c83ec6cdd35ea258f52c0110ba986c82b3
Author: Thamme Gowda 
Date:   2015-11-09T05:58:58Z

charset specified

commit caba68773a287752dea43f3366e6d4309fde861c
Author: Thamme Gowda 
Date:   2015-11-10T01:34:04Z

Merge branch 'trunk' of github.com:apache/tika into trunk

commit 08b916790b279cda0201f2529ca58646dea4b2f9
Author: Thamme Gowda 
Date:   2015-11-10T19:06:29Z

Resolved Code formatting issues

+ Removed star imports
+ Removed dead code / commented code
+ Added License header to missing files

commit e07ac630d54cc79d9a7bfc9ac82332474d07434b
Author: Thamme Gowda 
Date:   2015-11-16T09:05:07Z

Add missing doc strings, fix code formatting issues

commit 96d4d7cc29d4bcd8ac0cf7a595c39b6ed64d4d19
Author: Thamme Gowda 
Date:   2015-11-18T03:03:41Z

Fix: build phase for model downloader

commit 6d0b121b8b321e8a31257fc608bb001d3fe7afb5
Author: Thamme Gowda 
Date:   2015-12-11T14:33:36Z

Merge branch 'trunk' of github.com:apache/tika into trunk

commit 66d3a10ffabf1f54cff384ce1c7325c2a3c16279
Author: Thamme Gowda 
Date:   2015-12-19T18:59:26Z

Fix : TIKA-1815 by Thamme Gowda N.

1. Writing text content to XMLContentHandler
2. Added RegexNERParser to Default parser chain




> Text content from parser is empty when NamedEntityParser is enabled
> ---
>
> Key: TIKA-1815
> URL: https://issues.apache.org/jira/browse/TIKA-1815
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Thamme Gowda N
> Fix For: 1.12
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> When the NamedEntityParser is enabled, Tika#parseToString() and the other 
> parse() methods produce an empty string.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1815) Text content from parser is empty when NamedEntityParser is enabled

2015-12-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15065533#comment-15065533
 ] 

ASF GitHub Bot commented on TIKA-1815:
--

GitHub user thammegowda opened a pull request:

https://github.com/apache/tika/pull/67

FIX for TIKA-1815 contributed by Thamme Gowda

+ Writing the text content to XML Document
+ Added Regex recogniser to default NER chain

Closes #66  (this is a simpler version of the same). Fixes #TIKA-1815

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/thammegowda/tika TIKA-1815

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/67.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #67


commit a40a18e2f61f2152fa065bda193ceb74e7e60c97
Author: Thamme Gowda 
Date:   2015-12-19T20:56:21Z

FIX for TIKA-1815 contributed by Thamme Gowda

+ Writing the text content to XML Document
+ Added Regex recogniser to default NER chain




> Text content from parser is empty when NamedEntityParser is enabled
> ---
>
> Key: TIKA-1815
> URL: https://issues.apache.org/jira/browse/TIKA-1815
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Thamme Gowda N
> Fix For: 1.12
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> When the NamedEntityParser is enabled, Tika#parseToString() and the other 
> parse() methods produce an empty string.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1816) Lenient testing for NamedEntityParser

2015-12-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15066111#comment-15066111
 ] 

ASF GitHub Bot commented on TIKA-1816:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/68


> Lenient testing for NamedEntityParser
> -
>
> Key: TIKA-1816
> URL: https://issues.apache.org/jira/browse/TIKA-1816
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Thamme Gowda N
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.12
>
>
> NamedEntityParser has a hard setup requirement like downloading of NER models 
> from remote servers and adding them to classpath.
> These model files are huge and hence are not added to source control.
> So, the tests are most likely to fail in various environments.
> Make the best effort to set up the tests, but in the worst case skip tests 
> instead of failing the whole build process.
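
The "best effort, otherwise skip" idea above can be sketched with a small availability check. This is only an illustration, not code from the Tika test suite: the class name `ModelAvailability` and the resource path in the usage note are hypothetical.

```java
// Hypothetical helper, not taken from the Tika test suite.
final class ModelAvailability {

    private ModelAvailability() {
    }

    /** Returns true if the named resource (e.g. an NER model file) is visible on the classpath. */
    static boolean onClasspath(String resourcePath) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader != null) {
            return loader.getResource(resourcePath) != null;
        }
        // Fall back to the system class loader when no context loader is set.
        return ClassLoader.getSystemResource(resourcePath) != null;
    }
}
```

In a JUnit 4 test, a check like this would typically be passed to `Assume.assumeTrue(...)` at the start of the test method, for example `Assume.assumeTrue(ModelAvailability.onClasspath("path/to/model.bin"))` with a placeholder path, so that a missing model marks the test as skipped instead of failing the build.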



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1803) Use lucene-geo-gazetteer REST API in GeoTopicParser

2015-12-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15065848#comment-15065848
 ] 

ASF GitHub Bot commented on TIKA-1803:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/65


> Use lucene-geo-gazetteer REST API in GeoTopicParser
> ---
>
> Key: TIKA-1803
> URL: https://issues.apache.org/jira/browse/TIKA-1803
> Project: Tika
>  Issue Type: Sub-task
>  Components: parser
>Reporter: Madhav Sharan
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.12
>
>
> Tika currently uses the lucene-geo-gazetteer CLI to extract the coordinates of a 
> location. The CLI has to start a JVM and instantiate Lucene for every request. 
> The new REST API of lucene-geo-gazetteer makes it possible to avoid this overhead.
> The idea is to create a lucene-geo-gazetteer client in Tika and use it in 
> GeoTopicParser.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1815) Text content from parser is empty when NamedEntityParser is enabled

2015-12-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15065887#comment-15065887
 ] 

ASF GitHub Bot commented on TIKA-1815:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/67


> Text content from parser is empty when NamedEntityParser is enabled
> ---
>
> Key: TIKA-1815
> URL: https://issues.apache.org/jira/browse/TIKA-1815
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Thamme Gowda N
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.12
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> When the NamedEntityParser is enabled, Tika#parseToString() and the other 
> parse() methods produce an empty string.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1798) Parser for Video Similarity using PooledTimeSeries metric

2015-11-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15030293#comment-15030293
 ] 

ASF GitHub Bot commented on TIKA-1798:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/64


> Parser for Video Similarity using PooledTimeSeries metric
> -
>
> Key: TIKA-1798
> URL: https://issues.apache.org/jira/browse/TIKA-1798
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.12
>
>
> My student [~1ceb00da] and I are working on a parser that's an implementation 
> of the PooledTimeSeries metric for video similarity:
> http://github.com/chrismattmann/pooled_time_series
> The original author of the algorithm approach was Michael Ryoo in his paper 
> here:
> https://github.com/chrismattmann/pooled_time_series#research-background-and-detail
> The code is Apache License, version 2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1798) Parser for Video Similarity using PooledTimeSeries metric

2015-11-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021220#comment-15021220
 ] 

ASF GitHub Bot commented on TIKA-1798:
--

GitHub user chrismattmann opened a pull request:

https://github.com/apache/tika/pull/64

Pull request for TIKA-1798 Parser for Video Similarity using 
PooledTimeSeries metric contributed by Aditya Dhulipala and Chris Mattmann.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/chrismattmann/tika trunk

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/64.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #64


commit d79f57c8fe9a13f8bb549777d82ce5be0034acb1
Author: Aditya Dhulipala 
Date:   2015-10-02T12:45:22Z

PoT definitions

commit 78daf1a0cd30b4810f64582582a873cd1c5f81ce
Author: Aditya Dhulipala 
Date:   2015-10-12T02:09:56Z

PoT Config defns

commit 9923a7c0a8dc7338b845e314c9caa20f0b8142af
Author: Aditya Dhulipala 
Date:   2015-10-14T14:43:04Z

Registered PoTParser with ParserServices

commit a76b6c643a55a84c310d5bfea0ab799d13a1d8a9
Author: Aditya Dhulipala 
Date:   2015-10-14T15:30:40Z

defined PoT properties config file

commit ca8c1811fea9f45ead9c1dc60ba566befafd374d
Author: Aditya Dhulipala 
Date:   2015-10-14T15:32:31Z

defined PoT config values

commit 1012ba1e68939d4ab3784043694ef4a26b27e3af
Author: Aditya Dhulipala 
Date:   2015-10-14T15:33:33Z

Modified config file to better reflect PoT config values

commit d3f78b567d22e1626dcd7ceea624e789fe3057ef
Author: Aditya Dhulipala 
Date:   2015-10-14T15:34:13Z

Fixed build errors

commit 43969b154a83cfb3dad29eb49f597426c88bcfb4
Author: Aditya Dhulipala 
Date:   2015-10-29T16:17:25Z

Built basic version of tika-pot-parser

commit 30f240abfe5b0be9fd572e0d55be764d43827412
Author: Aditya Dhulipala 
Date:   2015-11-14T02:06:33Z

Added a timeout value

commit 3c9719370a62c12e31fa744e4ec9547c83ac2ca6
Author: Aditya Dhulipala 
Date:   2015-11-14T02:07:54Z

Parser now runs PoT and extracts OF metadata

Uses JavaProcessBuilder to invoke the PoT jar.
Runs PoT to generate of.txt and hog.txt.
Extracts of.txt metadata as a list of lists,
i.e. a 2D matrix in HTML

commit 90e42d145d806cc2b122baf8f7b12a4b465c5125
Author: Aditya Dhulipala 
Date:   2015-11-14T02:22:46Z

Added Apache License header

commit 235156a122e164276d164ef78229a8f01784022a
Author: Chris Mattmann 
Date:   2015-11-22T22:20:42Z

Merge branch 'pooled-time-series' of 
https://github.com/cafed00d4j/tika-pooled-time-series into trunk




> Parser for Video Similarity using PooledTimeSeries metric
> -
>
> Key: TIKA-1798
> URL: https://issues.apache.org/jira/browse/TIKA-1798
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 1.12
>
>
> My student [~1ceb00da] and I are working on a parser that's an implementation 
> of the PooledTimeSeries metric for video similarity:
> http://github.com/chrismattmann/pooled_time_series
> The original author of the algorithm approach was Michael Ryoo in his paper 
> here:
> https://github.com/chrismattmann/pooled_time_series#research-background-and-detail
> The code is Apache License, version 2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1978) Invocation of java.net.URL.equals(Object), which blocks to do domain name resolution, in org.apache.tika.parser.geo.topic.GeoParser.initialize(URL)

2016-05-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15300248#comment-15300248
 ] 

ASF GitHub Bot commented on TIKA-1978:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/122


> Invocation of java.net.URL.equals(Object), which blocks to do domain name 
> resolution, in org.apache.tika.parser.geo.topic.GeoParser.initialize(URL)
> ---
>
> Key: TIKA-1978
> URL: https://issues.apache.org/jira/browse/TIKA-1978
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Critical
> Fix For: 1.14
>
>
> Performance - The equals and hashCode methods of URL are blocking
> The equals and hashCode methods of URL perform domain name resolution, which 
> can result in a big performance hit. See 
> http://michaelscharf.blogspot.com/2006/11/javaneturlequals-and-hashcode-make.html
>  for more information. Consider using java.net.URI instead. 
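
The suggestion to use java.net.URI can be illustrated with a short sketch. It is not the code from the pull request: the class `EndpointRegistry` and the example host are hypothetical. The point is that `URI.equals` and `URI.hashCode` compare the parsed components of the identifier and never touch the network, so a URI is safe to use in hash-based collections.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Tika: remembers which endpoints were already initialised.
final class EndpointRegistry {

    // URI.equals()/hashCode() work on the parsed components only,
    // so adding to this set never triggers a DNS lookup.
    private final Set<URI> seen = new HashSet<>();

    /** Returns true the first time an endpoint is registered, false on repeats. */
    boolean register(String endpoint) {
        // normalize() removes "." and ".." path segments before comparison.
        return seen.add(URI.create(endpoint).normalize());
    }
}
```

If a `java.net.URL` is needed later, for example to open a connection, it can be derived from the URI with `toURL()` at that point, which keeps the blocking behaviour out of comparisons.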



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1993) Image Recognition with Tika

2016-06-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15326163#comment-15326163
 ] 

ASF GitHub Bot commented on TIKA-1993:
--

GitHub user thammegowda opened a pull request:

https://github.com/apache/tika/pull/125

TIKA-1993: ObjectRecognitionParser + Tensorflow image recognition with 
Inception-V3 model as default implementation

Summary of changes:

- Fixed TIKA-2002 : ExternalParser.check() empties stdout and stderr 
buffers so no more hanging is expected
- Added ObjectRecognitionParser, ObjectRecogniser, RecognisedObject - A 
parser, interface and a model class respectively
- implemented TensorFlowImageRecParser - an `ExternalParser` which (if 
missing) downloads and calls tensorflow `image_classify.py` script (the script 
then downloads Inception-v3 model) 
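
As background for the first item, the hang happens when a child process fills its stdout or stderr pipe and nothing reads it. The following is a minimal sketch of draining a stream, assuming nothing about the actual `ExternalParser` change; the class name `StreamDrainer` is hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical helper, not the code from this pull request.
final class StreamDrainer {

    private StreamDrainer() {
    }

    /** Reads the stream to its end, discards the data and returns the number of bytes read. */
    static long drain(InputStream in) {
        byte[] buffer = new byte[4096];
        long total = 0;
        try {
            int n;
            while ((n = in.read(buffer)) != -1) {
                total += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }
}
```

For a `Process`, both `getInputStream()` and `getErrorStream()` would need to be drained (on separate threads, or merged first with `ProcessBuilder.redirectErrorStream(true)`) before calling `waitFor()`.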


---
## Quick Setup and Test
- Install TensorFlow using pip - 
https://www.tensorflow.org/versions/r0.9/get_started/os_setup.html#pip-installation
- Check out the test case  
`tika-parsers/src/test/java/org/apache/tika/parser/recognition/ObjectRecognitionParserTest.java`

## Demos
Compile the package: `mvn clean install` (add `-DskipTests` if you don't want to 
wait for the tests)

Let's check:
- (for animal lovers,) on a cat's image at 
https://raw.githubusercontent.com/apache/tika/master/tika-parsers/src/test/resources/test-documents/testJPEG.jpg
```
java -jar tika-app/target/tika-app-1.14-SNAPSHOT.jar \
  --config=tika-parsers/src/test/resources/org/apache/tika/parser/recognition/tika-config-tflow.xml \
  tika-parsers/src/test/resources/test-documents/testJPEG.jpg
```
```xml





```
- (For law-keepers) On a rifle at 
https://upload.wikimedia.org/wikipedia/commons/thumb/3/38/US_Navy_100714-N-4965F-174_Chief_Mass_Communication_Specialist_Paula_Ludwick%2C_assigned_to_Fleet_Combat_Camera_Group_Pacific%2C_shoots_at_a_target_during_a_Navy_Rifle_Qualification_Course.jpg/220px-thumbnail.jpg
```xml





```
- (for law-keepers) On a revolver  at 
https://upload.wikimedia.org/wikipedia/commons/8/8d/Glock17.jpg
```xml





```
- (for car enthusiasts) On a car at 
http://www.trbimg.com/img-57226a08/turbine/ct-tesla-model-3-unveiling-20160404/650/650x366
```xml





```


NOTE:
1. The most efficient way to make use of tensorflow would be to use C++ api 
via JNI. I didn't have a chance to learn that stuff so far so help needed to 
make this efficient. Or else we may wait for tensorflow folks to offer Java 
bindings! Right now, the image recognition model is loaded and unloaded every 
time by the script (200MB of disk-read per parse call, very inefficient!).
2. The very first call takes plenty of time as the model is downloaded 
lazily
3.  Only `image/jpeg` is supported. PNG coming later



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/thammegowda/tika TIKA-1993

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/125.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #125


commit 9b5dc7fae4456b12b75ec21d050b9439e6527c47
Author: Thamme Gowda 
Date:   2016-06-12T02:04:08Z

External Parser now have consumer for ignored lines,  Fix TIKA-2002

commit eccc15387f0d4a5c62d8d12e6579878dba2f52a8
Author: Thamme Gowda 
Date:   2016-06-12T02:04:28Z

Added a utility to load and instantiate classes

commit 2184e2c2c2a0e507be6be4f9692e0fab5b38a476
Author: Thamme Gowda 
Date:   2016-06-12T02:04:49Z

Object recognition parser, tensorflow based implementation, and test cases 
for these

commit 0305cfb402f5d5e289533411d5737e1e832888ac
Author: Thamme Gowda 
Date:   2016-06-12T02:43:07Z

Explicit Locale




> Image Recognition with Tika 
> 
>
> Key: TIKA-1993
> URL: https://issues.apache.org/jira/browse/TIKA-1993
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Thamme Gowda
>
> Create "ImageRecognitionParser" which can have pluggable implementation for 
> core recognition logic.
> As the name suggests, this parser should detect objects in the images, and 
> support many implementations + models (similar to what NamedEntityParser did 
> for text).
> Supply a default implementation based on Tensorflow with the current 
> state-of-the-art model \[1\].
> Links:
> \[1\] 
> https://www.tensorflow.org/versions/r0.8/tutorials/image_recognition/index.html#usage-with-python-api



--
This 

[jira] [Commented] (TIKA-1986) support parser parameters with type (int, double, etc) in configuration XML file

2016-05-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15301236#comment-15301236
 ] 

ASF GitHub Bot commented on TIKA-1986:
--

GitHub user thammegowda opened a pull request:

https://github.com/apache/tika/pull/123

TIKA-1986 : support for typed parameters from XML configuration

 This is a sub-task of TIKA-1508 (please merge #91 first).

This extends configuration file with a place for specifying parameters and 
their types.

+ Added a class called `Param`
+ Relying on JAXB annotations to convert between XML and Java objects
+ Test case updated to check for types
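
To illustrate what typed parameters mean in practice, here is a minimal sketch of turning a textual value into an object of a declared type. It is not the `Param` class from this pull request, which relies on JAXB annotations; the class `TypedParamSketch` and the type names it accepts are assumptions based on the types listed in the issue (int, double, boolean, url, file).

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URI;

// Hypothetical converter for illustration only; not the Param class added by the patch.
final class TypedParamSketch {

    private TypedParamSketch() {
    }

    /** Converts the textual value of a parameter into an object of the declared type. */
    static Object convert(String type, String value) {
        String text = value.trim();
        switch (type) {
            case "int":
                return Integer.valueOf(text);
            case "double":
                return Double.valueOf(text);
            case "boolean":
                return Boolean.valueOf(text);
            case "url":
                try {
                    return URI.create(text).toURL();
                } catch (MalformedURLException e) {
                    throw new IllegalArgumentException("Not a valid URL: " + value, e);
                }
            case "file":
                return new File(text);
            default:
                throw new IllegalArgumentException("Unsupported parameter type: " + type);
        }
    }
}
```

A parser that receives its parameters this way can cast the result once at configuration time instead of parsing strings on every call.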


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/thammegowda/tika TIKA-1986

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/123.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #123


commit b2cf23178ede925b0ef23f88ebf1aff95c8c157c
Author: Thamme Gowda 
Date:   2016-03-09T02:23:19Z

Add uniformity to parser parameter configuration.

1. Added Configurable interface.
 This can be used for all services like Parser, Detector which can take
  configurable parameters.

2. Added ConfigurableParser interface which extends Parser interface.
   I didn't add new method to existing Parser because
that will break the compatibility.

3. AbstractParser extends ConfigurableParser and has
  default implementation for configure() contract.
  I think it is safe to do so and it doesn't break anything.
  In addition, all parsers which extend AbstractParser can easily
  access config from TikaConfig if they want to

3. Added a TODO to TikaConfig,
 after this should allow multiple instances of same parser with
 different runtime configurations.

4. TikaConfig is modified to detect if instance can be configured,
  if so, then checks if params are available in XML file, parses the
  params and invokes configure(ctx) method with these params

5. Added DummyConfigurableParser that simply copies parameters to
 metadata for the sake of testing

6. Added a sample XML config file for testing.
Added ConfigurableParserTest that performs an end to end test of all
the above.

commit ae51417d8881dd90b921f02c2677a7d5bfd69a30
Author: Thamme Gowda 
Date:   2016-03-09T03:23:47Z

remove unwanted TODO:

commit 64db9614cfaa3e873a9dc9efc6d201d887f6a4c5
Author: Thamme Gowda 
Date:   2016-03-12T14:43:44Z

Added a TikaConfigException, params getter

commit 0d69ca7540b4350e043c5b9ed34d14a46bd70cf7
Author: Thamme Gowda 
Date:   2016-03-12T14:51:14Z

Test Case updated with newer exception and getter

commit e780d56652d48dd0f50b4e62a58153e95f055022
Author: Thamme Gowda 
Date:   2016-05-23T18:30:13Z

merged upstream changes and resolved conflicts

commit b64612dcdb021fbb8b3fbf31d70a02f1bb7736cb
Author: Thamme Gowda 
Date:   2016-05-23T18:52:55Z

Update javadoc with @since

commit 01869923533b330ec7728995e3ee5feceee1b90e
Author: Thamme Gowda 
Date:   2016-05-26T00:18:25Z

Added support for type for runtime parameters

commit 9e08a6bc0a2b2ffad12e4b6f90725b2201d0a69b
Author: Thamme Gowda 
Date:   2016-05-26T00:50:49Z

Updated test case with type checking




> support parser parameters with type (int, double, etc) in configuration XML 
> file
> 
>
> Key: TIKA-1986
> URL: https://issues.apache.org/jira/browse/TIKA-1986
> Project: Tika
>  Issue Type: Sub-task
>  Components: config
>Reporter: Thamme Gowda
> Fix For: 1.14
>
>
> Tika Configuration should be enhanced to support basic types like int, 
> double, boolean, url, file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1978) Invocation of java.net.URL.equals(Object), which blocks to do domain name resolution, in org.apache.tika.parser.geo.topic.GeoParser.initialize(URL)

2016-05-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15302610#comment-15302610
 ] 

ASF GitHub Bot commented on TIKA-1978:
--

GitHub user lewismc opened a pull request:

https://github.com/apache/tika/pull/124

TIKA-1978 Invocation of java.net.URL.equals(Object), which blocks to do 
domain name resolution, in 
org.apache.tika.parser.geo.topic.GeoParser.initialize(URL) 2.x branch

This issue addresses https://issues.apache.org/jira/browse/TIKA-1978 for 
the 2.x branch.
It also makes a number of additional improvements such as using the diamond 
operator where possible, throwing and logging the correct exceptions and using 
the correct syntax for constants.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/lewismc/tika TIKA-1978v2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/124.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #124


commit bd3ecfcddeaf13262e477ba29c5256ebd44e32db
Author: Lewis John McGibbney 
Date:   2016-05-26T18:15:02Z

TIKA-1978 Invocation of java.net.URL.equals(Object), which blocks to do 
domain name resolution, in 
org.apache.tika.parser.geo.topic.GeoParser.initialize(URL) 2.x branch




> Invocation of java.net.URL.equals(Object), which blocks to do domain name 
> resolution, in org.apache.tika.parser.geo.topic.GeoParser.initialize(URL)
> ---
>
> Key: TIKA-1978
> URL: https://issues.apache.org/jira/browse/TIKA-1978
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Critical
> Fix For: 2.0, 1.14
>
>
> Performance - The equals and hashCode methods of URL are blocking
> The equals and hashCode methods of URL perform domain name resolution, which 
> can result in a big performance hit. See 
> http://michaelscharf.blogspot.com/2006/11/javaneturlequals-and-hashcode-make.html
>  for more information. Consider using java.net.URI instead. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1834) Fix for GeoTopic parser holding state while running Tika server

2016-01-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15105609#comment-15105609
 ] 

ASF GitHub Bot commented on TIKA-1834:
--

GitHub user smadha opened a pull request:

https://github.com/apache/tika/pull/71

fix for TIKA-1834 contributed by msha...@usc.edu



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/smadha/tika TIKA-1834

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/71.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #71


commit 0154e3067c8a63ab176e4e2161515d2d7d45b8e7
Author: smadha 
Date:   2016-01-18T18:08:18Z

fix for TIKA-1834 contributed by msha...@usc.edu




> Fix for GeoTopic parser holding state while running Tika server
> ---
>
> Key: TIKA-1834
> URL: https://issues.apache.org/jira/browse/TIKA-1834
> Project: Tika
>  Issue Type: Sub-task
>  Components: parser
>Affects Versions: 1.12
> Environment: All
>Reporter: Madhav Sharan
> Fix For: 1.12
>
>
> While using tika-server we observed that the GeoTopic parser held state 
> across requests and returned all the locations retrieved for previous requests. 
> This happened because the mutable object 
> org.apache.tika.parser.geo.topic.NameEntityExtractor was initialised once and 
> then reused by all requests. 
> As part of this fix, org.apache.tika.parser.geo.topic.NameEntityExtractor is 
> recreated for every request.
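
The failure mode described above can be reproduced in miniature. The sketch below does not use the real NameEntityExtractor API: `LocationCollector` and `GeoRequestHandler` are hypothetical stand-ins, and the "capitalised word" rule only exists to produce some output.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a mutable extractor; not the real NameEntityExtractor.
final class LocationCollector {
    private final List<String> locations = new ArrayList<>();

    void collect(String text) {
        // Toy rule for the sketch: every capitalised word counts as a location.
        for (String word : text.split("\\s+")) {
            if (!word.isEmpty() && Character.isUpperCase(word.charAt(0))) {
                locations.add(word);
            }
        }
    }

    List<String> locations() {
        return new ArrayList<>(locations);
    }
}

final class GeoRequestHandler {
    // Problematic variant: one collector shared by every request keeps accumulating results.
    private final LocationCollector shared = new LocationCollector();

    List<String> handleShared(String text) {
        shared.collect(text);
        return shared.locations();
    }

    // Variant matching the described fix: a fresh collector per request, so nothing leaks between calls.
    List<String> handlePerRequest(String text) {
        LocationCollector fresh = new LocationCollector();
        fresh.collect(text);
        return fresh.locations();
    }
}
```

With the shared instance, a second request also returns the locations of the first one; with a per-request instance each response contains only its own locations. Creating the object per request also avoids data races when a server handles requests concurrently.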



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1834) Fix for GeoTopic parser holding state while running Tika server

2016-01-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15105626#comment-15105626
 ] 

ASF GitHub Bot commented on TIKA-1834:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/71


> Fix for GeoTopic parser holding state while running Tika server
> ---
>
> Key: TIKA-1834
> URL: https://issues.apache.org/jira/browse/TIKA-1834
> Project: Tika
>  Issue Type: Sub-task
>  Components: parser
>Affects Versions: 1.12
> Environment: All
>Reporter: Madhav Sharan
>Assignee: Chris A. Mattmann
> Fix For: 1.12
>
>
> While using tika-server we observed that the GeoTopic parser held state 
> across requests and returned all the locations retrieved for previous requests. 
> This happened because the mutable object 
> org.apache.tika.parser.geo.topic.NameEntityExtractor was initialised once and 
> then reused by all requests. 
> As part of this fix, org.apache.tika.parser.geo.topic.NameEntityExtractor is 
> recreated for every request.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-2016) A parser that combines Apache OpenNLP and Apache Tika and provides facilities for automatically deriving sentiment from text.

2016-06-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15351373#comment-15351373
 ] 

ASF GitHub Bot commented on TIKA-2016:
--

GitHub user amensiko opened a pull request:

https://github.com/apache/tika/pull/127

creation of TIKA-2016 contributed by amensiko

Sentiment Analysis Parser

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/amensiko/tika TIKA-2016

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/127.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #127


commit be8b433d2994af4e0d95de3bc5ab4fea99ee6bcd
Author: amensiko 
Date:   2016-06-25T19:29:20Z

creation of TIKA-2016 contributed by amensiko




> A parser that combines Apache OpenNLP and Apache Tika and provides facilities 
> for automatically deriving sentiment from text.
> -
>
> Key: TIKA-2016
> URL: https://issues.apache.org/jira/browse/TIKA-2016
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Anastasija Mensikova
>  Labels: analysis, parser, sentiment
>
> A new project that implements a parser that uses Apache OpenNLP and Apache 
> Tika to perform Sentiment Analysis.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-2021) Improving accuracy of Tesseract parser

2016-06-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15349029#comment-15349029
 ] 

ASF GitHub Bot commented on TIKA-2021:
--

GitHub user Zarana-Parekh opened a pull request:

https://github.com/apache/tika/pull/126

fix for TIKA-2021 contributed by Zarana Parekh

Improving accuracy of Tesseract for better extraction of numeric and 
alphanumeric text from images.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/Zarana-Parekh/tika TIKA-2021

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/126.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #126


commit 48b27d219f791ee14f1e0ffa18e4e80583f3df54
Author: Zarana Parekh 
Date:   2016-06-25T01:53:00Z

fix for TIKA-2021 contributed by Zarana Parekh




> Improving accuracy of Tesseract parser
> --
>
> Key: TIKA-2021
> URL: https://issues.apache.org/jira/browse/TIKA-2021
> Project: Tika
>  Issue Type: Improvement
>Reporter: Zarana Parekh
>
> The Tesseract OCR parser works well with images containing English text. However, 
> there is room for improvement with alphanumeric and numeric content, which 
> requires training Tesseract on the relevant cases in order to 
> better extract content from images. Such a customization can be helpful in 
> the extraction of serial numbers from images of counterfeit electronics and in other 
> applications focusing on atypical textual content.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1869) Jackson update to latest version

2016-02-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15162969#comment-15162969
 ] 

ASF GitHub Bot commented on TIKA-1869:
--

GitHub user nhojpatrick opened a pull request:

https://github.com/apache/tika/pull/75

TIKA-1869 update Jackson to latest version 2.7.1



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/nhojpatrick/tika bugfix/TIKA-1869

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/75.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #75


commit 13d772ab6317c6151b4b4e6111b80af2bd30cd7b
Author: John Patrick 
Date:   2016-02-24T13:19:53Z

TIKA-1869 update Jackson to latest version 2.7.1




> Jackson update to latest version
> 
>
> Key: TIKA-1869
> URL: https://issues.apache.org/jira/browse/TIKA-1869
> Project: Tika
>  Issue Type: Bug
>  Components: translation
>Affects Versions: 1.11, 1.12
>Reporter: John Patrick
>  Labels: github-import, newbie, patch
> Fix For: 1.13
>
>
> Linked to TIKA-1868 this is to update the version of Jackson used from 2.4.0 
> to 2.7.1.





[jira] [Commented] (TIKA-1868) create clean tika-server jar and shaded classifier jar

2016-02-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15162977#comment-15162977
 ] 

ASF GitHub Bot commented on TIKA-1868:
--

GitHub user nhojpatrick opened a pull request:

https://github.com/apache/tika/pull/76

TIKA-1868 tika-server split into clean and standalone jar

I understand that, based on the mailing list thread and the JIRA defect, this 
might be rejected.

But this is the change I was intending to make. My original email was to 
understand whether tika-server is meant to be a shaded jar, which it appears 
it was intended to be.

But if you need to use classes that only live within tika-server, it does 
make it harder to write custom code. If the guts of tika-server were put into 
another module, maybe tika-server-internals, then those who really need to use 
classes that live only in tika-server could use tika-server-internals, and 
tika-server could be simply a shaded jar. Just a thought.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/nhojpatrick/tika bugfix/TIKA-1868

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/76.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #76


commit 9b106210fa8be284b47ab5b904dcf83b0f175308
Author: John Patrick 
Date:   2016-02-24T12:33:56Z

TIKA-1868 tika-server split into clean and standalone jar




> create clean tika-server jar and shaded classifier jar
> --
>
> Key: TIKA-1868
> URL: https://issues.apache.org/jira/browse/TIKA-1868
> Project: Tika
>  Issue Type: Bug
>  Components: server
>Affects Versions: 1.11, 1.12
> Environment: n/a
>Reporter: John Patrick
>  Labels: github-import, maven, newbie, patch
> Fix For: 1.13
>
>
> If using tika-server-VERSION.jar as a standalone component, it works. But if 
> you use it as a dependency, so that it is included alongside other jars, it 
> causes classpath issues, specifically around Jackson.
> The project I'm working on uses Jackson 2.6.1. We have just added Tika, but 
> when adding tika-server-VERSION.jar we discovered it contains Jackson 2.4.0 
> classes.
> I've updated the Maven build so that two jars are now created:
> 1) tika-server-VERSION.jar, the correct clean jar
> 2) tika-server-VERSION-standalone.jar, what was previously created
> This, in my view, is more in line with how Maven should be used to create 
> jars, as the previous way restricted the consumer's ability to override Maven 
> dependencies.
> I've also updated the documentation in source control that refers to 
> tika-server to include the new tika-server standalone jar. I realize other 
> documentation might also need to change.
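
The two-jar layout described above can be sketched with the maven-shade-plugin. This is only an illustration under assumptions: the thread does not show the build change in pull request 76, so the placement in the tika-server POM and the choice of this plugin are hypothetical. The fragment keeps the plain jar as the main artifact and attaches the shaded jar under a "standalone" classifier.

```xml
<!-- Hypothetical fragment for the tika-server pom.xml; a sketch, not the actual change from pull request 76. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <!-- Keep the plain jar as the main artifact and attach the shaded jar next to it. -->
        <shadedArtifactAttached>true</shadedArtifactAttached>
        <!-- Names the attached jar tika-server-VERSION-standalone.jar. -->
        <shadedClassifierName>standalone</shadedClassifierName>
      </configuration>
    </execution>
  </executions>
</plugin>
```

With a layout like this, a consumer depending on the unclassified artifact resolves Jackson through normal Maven dependency management, so a project already on Jackson 2.6.1 can override the version instead of inheriting bundled 2.4.0 classes.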





[jira] [Commented] (TIKA-1868) create clean tika-server jar and shaded classifier jar

2016-02-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15163113#comment-15163113
 ] 

ASF GitHub Bot commented on TIKA-1868:
--

Github user nhojpatrick closed the pull request at:

https://github.com/apache/tika/pull/76


> create clean tika-server jar and shaded classifier jar
> --
>
> Key: TIKA-1868
> URL: https://issues.apache.org/jira/browse/TIKA-1868
> Project: Tika
>  Issue Type: Bug
>  Components: server
>Affects Versions: 1.11, 1.12
> Environment: n/a
>Reporter: John Patrick
>  Labels: github-import, maven, newbie, patch
> Fix For: 1.13
>
>
> If using tika-server-VERSION.jar as a standalone component, it works. But if 
> you use it as a dependency, so that it is included alongside other jars, it 
> causes classpath issues, specifically around Jackson.
> The project I'm working on uses Jackson 2.6.1. We have just added Tika, but 
> when adding tika-server-VERSION.jar we discovered it contains Jackson 2.4.0 
> classes.
> I've updated the Maven build so that two jars are now created:
> 1) tika-server-VERSION.jar, the correct clean jar
> 2) tika-server-VERSION-standalone.jar, what was previously created
> This, in my view, is more in line with how Maven should be used to create 
> jars, as the previous way restricted the consumer's ability to override Maven 
> dependencies.
> I've also updated the documentation in source control that refers to 
> tika-server to include the new tika-server standalone jar. I realize other 
> documentation might also need to change.





[jira] [Commented] (TIKA-1869) Jackson update to latest version

2016-02-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15163133#comment-15163133
 ] 

ASF GitHub Bot commented on TIKA-1869:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/75


> Jackson update to latest version
> 
>
> Key: TIKA-1869
> URL: https://issues.apache.org/jira/browse/TIKA-1869
> Project: Tika
>  Issue Type: Bug
>  Components: translation
>Affects Versions: 1.11, 1.12
>Reporter: John Patrick
>  Labels: github-import, newbie, patch
> Fix For: 1.13
>
>
> Linked to TIKA-1868 this is to update the version of Jackson used from 2.4.0 
> to 2.7.1.





[jira] [Commented] (TIKA-1870) Relocating RichTextContentHandler into tika-core from tika-server

2016-02-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15163127#comment-15163127
 ] 

ASF GitHub Bot commented on TIKA-1870:
--

GitHub user nhojpatrick opened a pull request:

https://github.com/apache/tika/pull/77

TIKA-1870 refactor RichTextContentHandler into tika-core from tika-se…

…rver so users if needing it don't need to depend upon tika-server

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/nhojpatrick/tika bugfix/TIKA-1870

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/77.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #77


commit 0bd05cec54c581c971d90380304aaa23c9543296
Author: John Patrick 
Date:   2016-02-24T14:50:38Z

TIKA-1870 refactor RichTextContentHandler into tika-core from tika-server 
so users if needing it don't need to depend upon tika-server




> Relocating RichTextContentHandler into tika-core from tika-server
> -
>
> Key: TIKA-1870
> URL: https://issues.apache.org/jira/browse/TIKA-1870
> Project: Tika
>  Issue Type: Bug
>  Components: core, server
>Reporter: John Patrick
>  Labels: newbie, patch
> Fix For: 1.13
>
>
> Linked to TIKA-1868. This is a different solution: refactor the class into 
> tika-core so that users don't need to depend upon tika-server, and change the 
> other classes used to custom ones or other alternatives.




