[jira] [Commented] (TIKA-1600) Unable to parse ODT files because of failed to close temporary resources

2015-04-13 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492084#comment-14492084
 ] 

Hong-Thai Nguyen commented on TIKA-1600:


The root exception is an NPE thrown when parsing ODT files that have elements inside a footnote:
{code}
java.lang.NullPointerException
    at org.apache.tika.parser.odf.OpenDocumentContentParser$OpenDocumentElementMappingContentHandler.startSpan(OpenDocumentContentParser.java:174)
    at org.apache.tika.parser.odf.OpenDocumentContentParser$OpenDocumentElementMappingContentHandler.startElement(OpenDocumentContentParser.java:287)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.parser.odf.NSNormalizerContentHandler.startElement(NSNormalizerContentHandler.java:69)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.startElement(AbstractSAXParser.java:501)
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.scanStartElement(XMLNSDocumentScannerImpl.java:400)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2756)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:647)
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
    at org.apache.tika.parser.odf.OpenDocumentContentParser.parseInternal(OpenDocumentContentParser.java:503)
    at org.apache.tika.parser.odf.OpenDocumentParser.handleZipEntry(OpenDocumentParser.java:187)
    at org.apache.tika.parser.odf.OpenDocumentParser.parse(OpenDocumentParser.java:164)
    at org.apache.tika.parser.odf.OpenDocumentParserTest.can_parse_odt_file(OpenDocumentParserTest.java:41)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
    at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
    at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
{code}
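
The trace above only shows that startSpan dereferenced a null value; it does not say which one. One common way this happens in a SAX handler is a lookup of a style name that was never registered. The sketch below illustrates that pattern and a defensive fallback. It is not Tika's code: the class StyleLookupSketch, its TextStyle type, and the PLAIN default are all hypothetical names chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a handler's style table; NOT Tika's actual implementation.
final class StyleLookupSketch {
    static final class TextStyle {
        final boolean bold;
        final boolean italic;

        TextStyle(boolean bold, boolean italic) {
            this.bold = bold;
            this.italic = italic;
        }
    }

    // Assumed default used when a span refers to a style that was never declared.
    private static final TextStyle PLAIN = new TextStyle(false, false);

    private final Map<String, TextStyle> styles = new HashMap<>();

    void register(String name, TextStyle style) {
        styles.put(name, style);
    }

    // Returns the opening tags for a span. Missing or unknown style names
    // fall back to PLAIN instead of dereferencing null.
    String startSpan(String styleName) {
        TextStyle style = (styleName == null) ? null : styles.get(styleName);
        if (style == null) {
            style = PLAIN;
        }
        StringBuilder tags = new StringBuilder();
        if (style.bold) {
            tags.append("<b>");
        }
        if (style.italic) {
            tags.append("<i>");
        }
        return tags.toString();
    }
}
```

With a guard of this kind, a span that names an unregistered style produces plain output rather than an exception. Whether that matches the actual cause at OpenDocumentContentParser.java:174 would need to be checked against the source.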

It seems that style support for ODF was added only recently, in 1.8:
{noformat}
Revision: 107
Author: tpalsulich
Date: Saturday, 14 March 2015 00:25:53

[jira] [Commented] (TIKA-1581) jhighlight license concerns

2015-03-30 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386900#comment-14386900
 ] 

Hong-Thai Nguyen commented on TIKA-1581:


And many thanks to [~kkrugler] for all the investigation and effort that went 
into pushing the release of jhighlight 1.0.2.

> jhighlight license concerns
> ---
>
> Key: TIKA-1581
> URL: https://issues.apache.org/jira/browse/TIKA-1581
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.7
>Reporter: Karl Wright
> Fix For: 1.8
>
>
> jhighlight jar is a Tika dependency.  The Lucene team discovered that, while 
> it claims to be a CDDL/LGPL dual-license, some of its functionality is LGPL 
> only:
> {code}
> Solr's contrib/extraction contains jhighlight-1.0.jar which declares itself 
> as dual CDDL or LGPL license. However, some of its classes are distributed 
> only under LGPL, e.g.
> com.uwyn.jhighlight.highlighter.
>   CppHighlighter.java
>   GroovyHighlighter.java
>   JavaHighlighter.java
>   XmlHighlighter.java
> I downloaded the sources from Maven 
> (http://search.maven.org/remotecontent?filepath=com/uwyn/jhighlight/1.0/jhighlight-1.0-sources.jar)
>  to confirm that, and also found this SVN repo: 
> http://svn.rifers.org/jhighlight/tags/release-1.0, though the project's 
> website seems to not exist anymore (https://jhighlight.dev.java.net/).
> I didn't find any direct usage of it in our code, so I guess it's probably 
> needed by a 3rd party dependency, such as Tika. Therefore if we e.g. omit it, 
> things will compile, but may fail at runtime.
> {code}
> Is it possible to remove this dependency for future releases, or allow only 
> optional inclusion of this package?  It is of concern to the ManifoldCF 
> project because we distribute a binary package that includes Tika and its 
> required dependencies, which currently includes jHighlight.
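
The description above notes that omitting the jar lets things compile but may fail at runtime. A minimal sketch of how a caller could probe for an optional dependency and degrade gracefully is shown here. The class name OptionalDependencySketch and its methods are hypothetical. The probed class name is taken from the file list quoted above, and the sketch makes no claim about how Tika itself handles a missing jhighlight jar.

```java
// Hypothetical helper; illustrates probing for an optional library at runtime.
final class OptionalDependencySketch {

    // Returns true if the named class can be located on the current classpath.
    // The class is looked up without being initialized.
    static boolean isAvailable(String className) {
        try {
            Class.forName(className, false, OptionalDependencySketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Describes whether source highlighting could be offered, based on the probe.
    static String describeHighlighting() {
        return isAvailable("com.uwyn.jhighlight.highlighter.JavaHighlighter")
                ? "jhighlight found; source highlighting enabled"
                : "jhighlight not on classpath; source highlighting disabled";
    }
}
```

A distributor that ships without the jar would then see the "disabled" branch rather than a NoClassDefFoundError at parse time.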



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (TIKA-1581) jhighlight license concerns

2015-03-27 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-1581.

Resolution: Fixed

> jhighlight license concerns
> ---
>
> Key: TIKA-1581
> URL: https://issues.apache.org/jira/browse/TIKA-1581
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.7
>Reporter: Karl Wright
> Fix For: 1.8


[jira] [Updated] (TIKA-1581) jhighlight license concerns

2015-03-27 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1581:
---
Fix Version/s: 1.8

> jhighlight license concerns
> ---
>
> Key: TIKA-1581
> URL: https://issues.apache.org/jira/browse/TIKA-1581
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.7
>Reporter: Karl Wright
> Fix For: 1.8


[jira] [Commented] (TIKA-1581) jhighlight license concerns

2015-03-27 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383827#comment-14383827
 ] 

Hong-Thai Nguyen commented on TIKA-1581:


In r1669583, I switched to the latest jhighlight, 1.0.2, and updated Notices.txt 
and SourceCodeParser.java to state that this library is used under its 
CDDL/LGPL dual license.

> jhighlight license concerns
> ---
>
> Key: TIKA-1581
> URL: https://issues.apache.org/jira/browse/TIKA-1581
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.7
>Reporter: Karl Wright


[jira] [Comment Edited] (TIKA-1581) jhighlight license concerns

2015-03-20 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371432#comment-14371432
 ] 

Hong-Thai Nguyen edited comment on TIKA-1581 at 3/20/15 3:36 PM:
-

I've also contacted 'gbe...@uwyn.com', which seems to be his email. Let's wait a 
few days for his feedback.
Otherwise, we can create an 'unshipped' module to group all parsers and their 
dependencies that do not have an Apache license.


was (Author: thaichat04):
I've also contacted 'gbe...@uwyn.com', which seems to be his email. Let's wait a 
few days for his feedback.
Otherwise, we can create an 'unshipped' module to group all parsers and their 
dependencies that do not have an Apache license.

[~steve_rowe], the forked version you mentioned doesn't change anything about 
the original license terms of JHighlight.

> jhighlight license concerns
> ---
>
> Key: TIKA-1581
> URL: https://issues.apache.org/jira/browse/TIKA-1581
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.7
>Reporter: Karl Wright


[jira] [Comment Edited] (TIKA-1581) jhighlight license concerns

2015-03-20 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371432#comment-14371432
 ] 

Hong-Thai Nguyen edited comment on TIKA-1581 at 3/20/15 3:10 PM:
-

I've also contacted 'gbe...@uwyn.com', which seems to be his email. Let's wait a 
few days for his feedback.
Otherwise, we can create an 'unshipped' module to group all parsers and their 
dependencies that do not have an Apache license.

[~steve_rowe], the forked version you mentioned doesn't change anything about 
the original license terms of JHighlight.


was (Author: thaichat04):
I've also contacted 'gbe...@uwyn.com', which seems to be his email. Let's wait a 
few days for his feedback.
Otherwise, we can create an 'unshipped' module to group all parsers and their 
dependencies that do not have an Apache license.

> jhighlight license concerns
> ---
>
> Key: TIKA-1581
> URL: https://issues.apache.org/jira/browse/TIKA-1581
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.7
>Reporter: Karl Wright


[jira] [Commented] (TIKA-1581) jhighlight license concerns

2015-03-20 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371432#comment-14371432
 ] 

Hong-Thai Nguyen commented on TIKA-1581:


I've also contacted 'gbe...@uwyn.com', which seems to be his email. Let's wait a 
few days for his feedback.
Otherwise, we can create an 'unshipped' module to group all parsers and their 
dependencies that do not have an Apache license.

> jhighlight license concerns
> ---
>
> Key: TIKA-1581
> URL: https://issues.apache.org/jira/browse/TIKA-1581
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.7
>Reporter: Karl Wright


[jira] [Commented] (TIKA-1505) chmparser breaks down when extracting from file of CHM format v3

2015-01-05 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14264786#comment-14264786
 ] 

Hong-Thai Nguyen commented on TIKA-1505:


Can you also provide the problem files and tests?
Also, 1.7 is being released right now; this issue is not really blocking, so we 
can postpone it to 1.8.

> chmparser breaks down when extracting from file of CHM format v3
> 
>
> Key: TIKA-1505
> URL: https://issues.apache.org/jira/browse/TIKA-1505
> Project: Tika
>  Issue Type: Bug
>Reporter: Bin Hawking
> Fix For: 1.7
>
>
> chmparser throws exception or returns faulty text when:
> 1. extracting from file of CHM format version 3
> 2. chm file with lzx reset interval > 2
> 3. chm file with >5000 objects
> I am making the fix now.





[jira] [Resolved] (TIKA-672) Proper error handling in the CHM parser

2014-11-24 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-672.
---
Resolution: Fixed

Checked: there are no more System.err/System.out calls inside the CHM parser.

> Proper error handling in the CHM parser
> ---
>
> Key: TIKA-672
> URL: https://issues.apache.org/jira/browse/TIKA-672
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Jukka Zitting
>Priority: Minor
> Fix For: 1.7
>
>
> The new CHM parser (TIKA-245) swallows exceptions and uses System.err and 
> System.out prints to report problems in many places. We should change that to 
> properly throw exceptions as follows:
> - IOExceptions when the document stream can not be read
> - TikaExceptions when the stream can not be parsed
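
The two rules in the description above can be sketched as follows. This is only an illustration. ChmErrorHandlingSketch and readSignature are hypothetical names, and the nested TikaException is a local stand-in for org.apache.tika.exception.TikaException so that the example compiles without Tika on the classpath. The "ITSF" signature check is used here simply as an example of a parse-level failure.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class ChmErrorHandlingSketch {

    // Local stand-in for Tika's TikaException (assumption, for self-containment).
    static final class TikaException extends Exception {
        TikaException(String message) {
            super(message);
        }
    }

    // Reads the 4-byte file signature.
    // - Read failures surface as IOException (not caught, not printed).
    // - Content that cannot be parsed as CHM surfaces as TikaException.
    static String readSignature(InputStream in) throws IOException, TikaException {
        byte[] magic = new byte[4];
        int filled = 0;
        while (filled < magic.length) {
            int n = in.read(magic, filled, magic.length - filled);
            if (n < 0) {
                break; // end of stream before the header was complete
            }
            filled += n;
        }
        if (filled < magic.length) {
            throw new TikaException("CHM header is truncated");
        }
        String signature = new String(magic, StandardCharsets.US_ASCII);
        if (!"ITSF".equals(signature)) {
            throw new TikaException("Not a CHM file, unexpected signature: " + signature);
        }
        return signature;
    }
}
```

The point of the split is that callers can tell an unreadable stream from a readable but malformed document, which printing to System.err never allowed.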





[jira] [Updated] (TIKA-672) Proper error handling in the CHM parser

2014-11-24 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-672:
--
Fix Version/s: 1.7

> Proper error handling in the CHM parser
> ---
>
> Key: TIKA-672
> URL: https://issues.apache.org/jira/browse/TIKA-672
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Jukka Zitting
>Priority: Minor
> Fix For: 1.7


[jira] [Updated] (TIKA-1448) CHM parser : defect in file extraction

2014-11-24 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1448:
---
Fix Version/s: 1.7

> CHM parser : defect in file extraction
> --
>
> Key: TIKA-1448
> URL: https://issues.apache.org/jira/browse/TIKA-1448
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.7
>Reporter: Bin Hawking
> Fix For: 1.7
>
>
> In the ChmBlockInfo class:
>     chmBlockInfo.setIniBlock((chmBlockInfo.startBlock - chmBlockInfo.startBlock)
>             % (int) clcd.getResetInterval());
> always sets 0.
> According to the LZX algorithm, it should be:
>     chmBlockInfo.setIniBlock(chmBlockInfo.startBlock
>             - chmBlockInfo.startBlock % (int) clcd.getResetInterval());
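
The difference between the two expressions in the report comes down to Java operator precedence: % binds more tightly than binary -, so removing the parentheses changes the result. The sketch below isolates just that arithmetic. IniBlockSketch and its method names are hypothetical, and plain int parameters stand in for chmBlockInfo.startBlock and clcd.getResetInterval().

```java
// Isolates the arithmetic from the report; not the ChmBlockInfo class itself.
final class IniBlockSketch {

    // Parenthesized form: (startBlock - startBlock) is always 0,
    // so the whole expression is 0 for any positive resetInterval.
    static int parenthesized(int startBlock, int resetInterval) {
        return (startBlock - startBlock) % resetInterval;
    }

    // Unparenthesized form: evaluated as startBlock - (startBlock % resetInterval),
    // which rounds startBlock down to the nearest multiple of resetInterval.
    static int roundedDown(int startBlock, int resetInterval) {
        return startBlock - startBlock % resetInterval;
    }
}
```

For example, with startBlock = 7 and resetInterval = 2, the parenthesized form gives 0 while the corrected form gives 6, the block at which the most recent reset occurred.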





[jira] [Updated] (TIKA-1446) CHM parser : wrong decompression of aligned blocks

2014-11-24 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1446:
---
Fix Version/s: 1.7

> CHM parser : wrong decompression of aligned blocks
> --
>
> Key: TIKA-1446
> URL: https://issues.apache.org/jira/browse/TIKA-1446
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.7
>Reporter: Bin Hawking
>Priority: Critical
> Fix For: 1.7
>
> Attachments: chm.zip
>
>
> If an embedded file contains aligned blocks, the parser outputs chaotic text 
> or empty text as to this file.
> I have fixed it myself, corrected decompressAlignedBlock() and its 
> preparation methods. Mostly this bug is due to misusing main tree/align 
> tree/length tree. And some tree is built wrong.





[jira] [Updated] (TIKA-1430) CHM parser gets faulty text (fix found)

2014-11-24 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1430:
---
Fix Version/s: 1.7

> CHM parser gets faulty text (fix found)
> ---
>
> Key: TIKA-1430
> URL: https://issues.apache.org/jira/browse/TIKA-1430
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.5, 1.6
> Environment: Windows 7; JDK 7 or 8
>Reporter: Bin Hawking
>Priority: Critical
> Fix For: 1.7
>
>
> Get partially wrong text out of a CHM file, including the chm files in 
> tika-parsers/src/test/resources/test-documents/testChm*.chm
> I tried 1.6 and 1.5. Same bad. I wonder why no one complained before? 
> I checked the source code. The cause is obvious:
> When tika decompresses the LZX, the first block is done well, but as to the 
> 2nd block and later on, Tika uses previous content as the compressed data. 
> see in org.apache.tika.parser.chm.lzx.ChmLzxBlock
> """
> if (prevBlock != null
>         && prevBlock.getState().getBlockLength() > prevBlock.getState().getBlockRemaining())
>     setChmSection(new ChmSection(prevBlock.getContent()));
>     // NOTE: the dataSegment to be decompressed is not kept
> else
>     setChmSection(new ChmSection(dataSegment));
> """
> My fix:
> 1. Add a prevcontent member variable in ChmSection class, so that 
>    dataSegment and prevBlock.getContent() are both kept in it.
> 2. In ChmLzxBlock.extractContent() when invoking decompressBlock(), 
>    pass ChmSection.prevcontent if exists, instead of ChmSection.data.
> Now, I tried some chm files, and got the correct looking texts. 
> BTW. The unit test should be tougher, as in this case some small text (the 
> first block) is decompressed correctly.
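
To illustrate why a block needs both its own compressed data and the previous block's decoded content, here is a toy LZ77-style decoder. It is deliberately not LZX and not Tika's ChmSection or ChmLzxBlock: the names BlockHistorySketch, Section, data and prevContent are hypothetical, and the token format (a literal byte, or a copy of `length` bytes from `distance` bytes back) is invented for the example.

```java
import java.util.Arrays;

// Toy decoder: shows a section that keeps this block's tokens AND the previous block's output.
final class BlockHistorySketch {

    static final class Section {
        final int[][] data;       // this block's tokens: {0, byte} literal, {1, distance, length} copy
        final byte[] prevContent; // decoded bytes of the previous block (empty for the first block)

        Section(int[][] data, byte[] prevContent) {
            this.data = data;
            this.prevContent = prevContent;
        }
    }

    // Decodes the tokens in s.data, letting back-references reach into s.prevContent.
    static byte[] decode(Section s) {
        int outputLength = 0;
        for (int[] token : s.data) {
            outputLength += (token[0] == 0) ? 1 : token[2];
        }
        // The window is the previous content followed by this block's output.
        byte[] window = Arrays.copyOf(s.prevContent, s.prevContent.length + outputLength);
        int pos = s.prevContent.length;
        for (int[] token : s.data) {
            if (token[0] == 0) {
                window[pos++] = (byte) token[1];
            } else {
                for (int i = 0; i < token[2]; i++) {
                    window[pos] = window[pos - token[1]];
                    pos++;
                }
            }
        }
        return Arrays.copyOfRange(window, s.prevContent.length, pos);
    }
}
```

If the previous content were substituted for the tokens, as the quoted snippet does with prevBlock.getContent(), the decoder would have nothing of the current block left to decode. Keeping both fields avoids that.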





[jira] [Resolved] (TIKA-1430) CHM parser gets faulty text (fix found)

2014-11-24 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-1430.

Resolution: Fixed

> CHM parser gets faulty text (fix found)
> ---
>
> Key: TIKA-1430
> URL: https://issues.apache.org/jira/browse/TIKA-1430
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.5, 1.6
> Environment: Windows 7; JDK 7 or 8
>Reporter: Bin Hawking
>Priority: Critical
> Fix For: 1.7


[jira] [Updated] (TIKA-1447) CHM parser: wrong directory list

2014-11-24 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1447:
---
Fix Version/s: 1.7

> CHM parser: wrong directory list
> 
>
> Key: TIKA-1447
> URL: https://issues.apache.org/jira/browse/TIKA-1447
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.7
>Reporter: Bin Hawking
>Priority: Critical
> Fix For: 1.7
>
>
> CHM parser gets wrong directory list of a chm file (eg. testChm2.chm in 
> tika-parser's test-resources):
> 1. Duplicate entries (mostly from PMGI chunks, which should have been 
> ignored.)
> 2. Invalid entry (usually with unreadable entry name).
> 3. Missed entries (some times it is like TIKA-1176)
> I have fixed it (to some degree), by using the PMGL header to find dir chunks 
> and their respective meaningful parts.
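
The approach described above, keeping only PMGL chunks and ignoring PMGI ones, can be sketched as a filter on the four-byte chunk signature. This is an assumption-laden illustration rather than the actual fix: DirectoryChunkSketch, hasSignature and listingChunks are hypothetical names, and the chunks are assumed to have already been split into separate byte arrays.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Filters directory chunks by their leading signature; not Tika's implementation.
final class DirectoryChunkSketch {

    // True if the chunk starts with the given ASCII signature.
    static boolean hasSignature(byte[] chunk, String signature) {
        byte[] expected = signature.getBytes(StandardCharsets.US_ASCII);
        if (chunk.length < expected.length) {
            return false;
        }
        for (int i = 0; i < expected.length; i++) {
            if (chunk[i] != expected[i]) {
                return false;
            }
        }
        return true;
    }

    // Keeps listing chunks ("PMGL") and drops everything else, including index chunks ("PMGI").
    static List<byte[]> listingChunks(List<byte[]> chunks) {
        List<byte[]> result = new ArrayList<>();
        for (byte[] chunk : chunks) {
            if (hasSignature(chunk, "PMGL")) {
                result.add(chunk);
            }
        }
        return result;
    }
}
```

Because PMGI chunks only index the listing chunks, skipping them is one way to avoid the duplicate entries mentioned in item 1. Items 2 and 3 would additionally depend on how each PMGL chunk's header is read, which this sketch does not cover.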





[jira] [Resolved] (TIKA-1448) CHM parser : defect in file extraction

2014-11-24 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-1448.

Resolution: Fixed

> CHM parser : defect in file extraction
> --
>
> Key: TIKA-1448
> URL: https://issues.apache.org/jira/browse/TIKA-1448
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.7
>Reporter: Bin Hawking
> Fix For: 1.7


[jira] [Resolved] (TIKA-1447) CHM parser: wrong directory list

2014-11-24 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-1447.

Resolution: Fixed

> CHM parser: wrong directory list
> 
>
> Key: TIKA-1447
> URL: https://issues.apache.org/jira/browse/TIKA-1447
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.7
>Reporter: Bin Hawking
>Priority: Critical


[jira] [Resolved] (TIKA-1446) CHM parser : wrong decompression of aligned blocks

2014-11-24 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-1446.

Resolution: Fixed

> CHM parser : wrong decompression of aligned blocks
> --
>
> Key: TIKA-1446
> URL: https://issues.apache.org/jira/browse/TIKA-1446
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.7
>Reporter: Bin Hawking
>Priority: Critical
> Attachments: chm.zip
>
>
> If an embedded file contains aligned blocks, the parser outputs chaotic text 
> or empty text as to this file.
> I have fixed it myself, corrected decompressAlignedBlock() and its 
> preparation methods. Mostly this bug is due to misusing main tree/align 
> tree/length tree. And some tree is built wrong.





[jira] [Commented] (TIKA-1447) CHM parser: wrong directory list

2014-11-17 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214535#comment-14214535
 ] 

Hong-Thai Nguyen commented on TIKA-1447:


[~binhawking], did the work on TIKA-1446 fix this issue? Any chance you could 
double-check?

Thanks,

> CHM parser: wrong directory list
> 
>
> Key: TIKA-1447
> URL: https://issues.apache.org/jira/browse/TIKA-1447
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.7
>Reporter: Bin Hawking
>Priority: Critical
>
> CHM parser gets wrong directory list of a chm file (eg. testChm2.chm in 
> tika-parser's test-resources):
> 1. Duplicate entries (mostly from PMGI chunks, which should have been 
> ignored.)
> 2. Invalid entry (usually with unreadable entry name).
> 3. Missed entries (some times it is like TIKA-1176)
> I have fixed it (to some degree), by using the PMGL header to find dir chunks 
> and their respective meaningful parts.





[jira] [Comment Edited] (TIKA-1446) CHM parser : wrong decompression of aligned blocks

2014-11-12 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14208079#comment-14208079
 ] 

Hong-Thai Nguyen edited comment on TIKA-1446 at 11/12/14 2:38 PM:
--

Hi [~binhawking], I've merged your contribution and run a before/after title 
comparison on a local corpus of CHM files.
Before the merge, only one file failed; after the merge, 10 files fail. I've 
pushed the failing CHM files under _test-documents/chm_, along with a test case 
that checks them, to: https://github.com/thaichat04/tika
I also did some clean-up.

Any chance you could have another look?


was (Author: thaichat04):
Hi [~binhawking], I've merge your pull request and make title comparison 
before/after on a local corpus of CHM files.
Before merge, we have only one failed file, after merge we have 10 failed 
files. I've pushed failed CHM files under _test-documents/chm_ & a checking 
test case into: https://github.com/thaichat04/tika
I made also some clean-up.

Any chance you have a look again ?

> CHM parser : wrong decompression of aligned blocks
> --
>
> Key: TIKA-1446
> URL: https://issues.apache.org/jira/browse/TIKA-1446
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.7
>Reporter: Bin Hawking
>Priority: Critical
> Attachments: chm.zip
>
>
> If an embedded file contains aligned blocks, the parser outputs chaotic text 
> or empty text as to this file.
> I have fixed it myself, corrected decompressAlignedBlock() and its 
> preparation methods. Mostly this bug is due to misusing main tree/align 
> tree/length tree. And some tree is built wrong.





[jira] [Commented] (TIKA-1446) CHM parser : wrong decompression of aligned blocks

2014-11-12 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14208079#comment-14208079
 ] 

Hong-Thai Nguyen commented on TIKA-1446:


Hi [~binhawking], I've merged your pull request and run a before/after title 
comparison on a local corpus of CHM files.
Before the merge, only one file failed; after the merge, 10 files fail. I've 
pushed the failing CHM files under _test-documents/chm_, along with a test case 
that checks them, to: https://github.com/thaichat04/tika
I also did some clean-up.

Any chance you could have another look?

> CHM parser : wrong decompression of aligned blocks
> --
>
> Key: TIKA-1446
> URL: https://issues.apache.org/jira/browse/TIKA-1446
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.7
>Reporter: Bin Hawking
>Priority: Critical
> Attachments: chm.zip
>
>
> If an embedded file contains aligned blocks, the parser outputs chaotic text 
> or empty text as to this file.
> I have fixed it myself, corrected decompressAlignedBlock() and its 
> preparation methods. Mostly this bug is due to misusing main tree/align 
> tree/length tree. And some tree is built wrong.





[jira] [Commented] (TIKA-1463) TesseractOCRParser does not work in Windows

2014-11-04 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14196343#comment-14196343
 ] 

Hong-Thai Nguyen commented on TIKA-1463:


Thanks [~lfcnassif], it does indeed also work without .exe. BTW, a path 
containing a space is buggy.
I'm leaving this fix in because adding ".exe" only on Windows doesn't hurt anything.
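
For illustration, here is a minimal sketch of building the Tesseract command so that ".exe" is appended only on Windows and a path containing spaces stays a single argument. It is not the code committed for this issue. The class and method names are hypothetical, and the OS check via the `os.name` value is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class TesseractCommandSketch {

    /** True when the os.name value looks like a Windows system. */
    static boolean isWindows(String osName) {
        return osName.toLowerCase(Locale.ROOT).startsWith("windows");
    }

    /** Joins the install directory and the executable name, adding ".exe" only on Windows. */
    static String executable(String tesseractPath, String osName) {
        boolean windows = isWindows(osName);
        String dir = tesseractPath;
        if (!dir.isEmpty() && !dir.endsWith("/") && !dir.endsWith("\\")) {
            dir += windows ? "\\" : "/";
        }
        return dir + "tesseract" + (windows ? ".exe" : "");
    }

    /** Builds the command as a list so each element is passed as one argument. */
    static List<String> command(String tesseractPath, String osName, String... args) {
        List<String> cmd = new ArrayList<>();
        cmd.add(executable(tesseractPath, osName));
        cmd.addAll(Arrays.asList(args));
        return cmd;
    }

    public static void main(String[] args) {
        String os = System.getProperty("os.name");
        // The file names are placeholders for illustration only.
        System.out.println(command("C:\\Program Files (x86)\\Tesseract-OCR", os, "input.png", "output"));
    }
}
```

Passing the resulting list to `new ProcessBuilder(cmd)` keeps the executable path as a single element, which is one way to avoid splitting `C:\Program Files (x86)\Tesseract-OCR` at its spaces.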

> TesseractOCRParser does not work in Windows
> ---
>
> Key: TIKA-1463
> URL: https://issues.apache.org/jira/browse/TIKA-1463
> Project: Tika
>  Issue Type: Bug
>Reporter: Hong-Thai Nguyen
>
> STR:
> * Case 1:
> ** Setting tesseractPath to a common installation path of Tesseract:  
> C:\Program Files (x86)\Tesseract-OCR
> ** the checking available Tesseract command returns always false
> * Case 2:
> ** Even setting to no space value in tesseractPath, says C:\Tesseract-OCR
> ** the checking & running command of tesseract on Windows is not correct: 
> C:\Tesseract-OCR\tesseract, it must be C:\Tesseract-OCR\tesseract.exe





[jira] [Closed] (TIKA-1463) TesseractOCRParser does not work in Windows

2014-11-03 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen closed TIKA-1463.
--
Resolution: Fixed

> TesseractOCRParser does not work in Windows
> ---
>
> Key: TIKA-1463
> URL: https://issues.apache.org/jira/browse/TIKA-1463
> Project: Tika
>  Issue Type: Bug
>Reporter: Hong-Thai Nguyen
>
> STR:
> * Case 1:
> ** Setting tesseractPath to a common installation path of Tesseract:  
> C:\Program Files (x86)\Tesseract-OCR
> ** the checking available Tesseract command returns always false
> * Case 2:
> ** Even setting to no space value in tesseractPath, says C:\Tesseract-OCR
> ** the checking & running command of tesseract on Windows is not correct: 
> C:\Tesseract-OCR\tesseract, it must be C:\Tesseract-OCR\tesseract.exe





[jira] [Updated] (TIKA-1463) TesseractOCRParser does not work in Windows

2014-11-03 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1463:
---
Description: 
STR:
* Case 1:
** Set tesseractPath to a common installation path of Tesseract:  
C:\Program Files (x86)\Tesseract-OCR
** The check for an available Tesseract command always returns false

* Case 2:
** Even with a tesseractPath value containing no space, say C:\Tesseract-OCR
** the command used to check for and run tesseract on Windows is not correct: 
C:\Tesseract-OCR\tesseract; it must be C:\Tesseract-OCR\tesseract.exe

  was:
STR:
* Case 1:
** Setting tesseractPath to C:\Program Files (x86)\Tesseract-OCR
** the checking available Tesseract command returns always false

* Case 2:
** Even setting to no wildcard in tesseractPath, say C:\Tesseract-OCR
** the checking & running command of tesseract on Windows is not correct: 
C:\Tesseract-OCR\tesseract, it must be C:\Tesseract-OCR\tesseract.exe


> TesseractOCRParser does not work in Windows
> ---
>
> Key: TIKA-1463
> URL: https://issues.apache.org/jira/browse/TIKA-1463
> Project: Tika
>  Issue Type: Bug
>Reporter: Hong-Thai Nguyen
>
> STR:
> * Case 1:
> ** Setting tesseractPath to a common installation path of Tesseract:  
> C:\Program Files (x86)\Tesseract-OCR
> ** the checking available Tesseract command returns always false
> * Case 2:
> ** Even setting to no space value in tesseractPath, says C:\Tesseract-OCR
> ** the checking & running command of tesseract on Windows is not correct: 
> C:\Tesseract-OCR\tesseract, it must be C:\Tesseract-OCR\tesseract.exe





[jira] [Updated] (TIKA-1463) TesseractOCRParser does not work in Windows

2014-11-03 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1463:
---
Summary: TesseractOCRParser does not work in Windows  (was: 
TesseractOCRParser does work in Windows)

> TesseractOCRParser does not work in Windows
> ---
>
> Key: TIKA-1463
> URL: https://issues.apache.org/jira/browse/TIKA-1463
> Project: Tika
>  Issue Type: Bug
>Reporter: Hong-Thai Nguyen
>
> STR:
> * Case 1:
> ** Setting tesseractPath to C:\Program Files (x86)\Tesseract-OCR
> ** the checking available Tesseract command returns always false
> * Case 2:
> ** Even setting to no wildcard in tesseractPath, say C:\Tesseract-OCR
> ** the checking & running command of tesseract on Windows is not correct: 
> C:\Tesseract-OCR\tesseract, it must be C:\Tesseract-OCR\tesseract.exe





[jira] [Commented] (TIKA-1463) TesseractOCRParser does work in Windows

2014-11-03 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194694#comment-14194694
 ] 

Hong-Thai Nguyen commented on TIKA-1463:


Fixed in r1636382

> TesseractOCRParser does work in Windows
> ---
>
> Key: TIKA-1463
> URL: https://issues.apache.org/jira/browse/TIKA-1463
> Project: Tika
>  Issue Type: Bug
>Reporter: Hong-Thai Nguyen
>
> STR:
> * Case 1:
> ** Setting tesseractPath to C:\Program Files (x86)\Tesseract-OCR
> ** the checking available Tesseract command returns always false
> * Case 2:
> ** Even setting to no wildcard in tesseractPath, say C:\Tesseract-OCR
> ** the checking & running command of tesseract on Windows is not correct: 
> C:\Tesseract-OCR\tesseract, it must be C:\Tesseract-OCR\tesseract.exe





[jira] [Created] (TIKA-1463) TesseractOCRParser does work in Windows

2014-11-03 Thread Hong-Thai Nguyen (JIRA)
Hong-Thai Nguyen created TIKA-1463:
--

 Summary: TesseractOCRParser does work in Windows
 Key: TIKA-1463
 URL: https://issues.apache.org/jira/browse/TIKA-1463
 Project: Tika
  Issue Type: Bug
Reporter: Hong-Thai Nguyen


STR:
* Case 1:
** Setting tesseractPath to C:\Program Files (x86)\Tesseract-OCR
** the checking available Tesseract command returns always false

* Case 2:
** Even setting to no wildcard in tesseractPath, say C:\Tesseract-OCR
** the checking & running command of tesseract on Windows is not correct: 
C:\Tesseract-OCR\tesseract, it must be C:\Tesseract-OCR\tesseract.exe





[jira] [Commented] (TIKA-1446) CHM parser : wrong decompression of aligned blocks

2014-10-23 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181530#comment-14181530
 ] 

Hong-Thai Nguyen commented on TIKA-1446:


Thanks a lot [~binhawking], I've had a quick look at your fix. Indeed, there 
are quite a lot of changes. After some cleanup and a few minor fixes, I broke the CHM tests.

We really appreciate your contribution, and we should continue and finalize it. I've 
created a new pull request based on a branch with your fix + my cleanup:
https://github.com/apache/tika/pull/21
https://github.com/thaichat04/tika.git, branch TIKA-1446

> CHM parser : wrong decompression of aligned blocks
> --
>
> Key: TIKA-1446
> URL: https://issues.apache.org/jira/browse/TIKA-1446
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.7
>Reporter: Bin Hawking
>Priority: Critical
> Attachments: chm.zip
>
>
> If an embedded file contains aligned blocks, the parser outputs chaotic text 
> or empty text as to this file.
> I have fixed it myself, corrected decompressAlignedBlock() and its 
> preparation methods. Mostly this bug is due to misusing main tree/align 
> tree/length tree. And some tree is built wrong.





[jira] [Comment Edited] (TIKA-1422) org.apache.tika.parser.mail.RFC822ParserTest fails

2014-10-21 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178186#comment-14178186
 ] 

Hong-Thai Nguyen edited comment on TIKA-1422 at 10/21/14 9:48 AM:
--

Applied the latest fix in r1633325 & r161 with some formatting. Thanks


was (Author: thaichat04):
Applied latest fix on r1633325 with some formatting. Thank

> org.apache.tika.parser.mail.RFC822ParserTest fails
> --
>
> Key: TIKA-1422
> URL: https://issues.apache.org/jira/browse/TIKA-1422
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 1.7
>
> Attachments: TIKA-1422.Mattmann.100114.patch.txt, 
> TIKA-1422.Mattmann.100414.patch.txt, TIKA-1422.oleg.20141021.patch, 
> TIKA-1422.palsulich.100414.patch, TIKA-1422.palsulich.100714.patch
>
>
> I'm seeing test failures from:
> {noformat}
> Results :
> Failed tests:   testMultipart(org.apache.tika.parser.mail.RFC822ParserTest): 
> (..)
> Tests run: 538, Failures: 1, Errors: 0, Skipped: 1
> {noformat}
> CentOS6 VM image, running:
> {noformat}
> [mattmann@memex tika]$ java -version
> java version "1.7.0_67"
> Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
> Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
> [mattmann@memex tika]$ mvn -version
> Apache Maven 3.2.1 (ea8b2b07643dbb1b84b6d16e1f08391b666bc1e9; 
> 2014-02-14T09:37:52-08:00)
> Maven home: /usr/share/apache-maven
> Java version: 1.7.0_65, vendor: Oracle Corporation
> Java home: /data/home/mattmann/dist/jdk1.7.0_65/jre
> Default locale: en_US, platform encoding: UTF-8
> OS name: "linux", version: "2.6.32-431.23.3.el6.centos.plus.x86_64", arch: 
> "amd64", family: "unix"
> [mattmann@memex tika]$ 
> {noformat}
> Here are the surefire reports - no clue what's up here:
> {noformat}
> [mattmann@memex tika]$ more 
> tika-parsers/target/surefire-reports/org.apache.tika.parser.mail.RFC822ParserTest.txt
>  
> ---
> Test set: org.apache.tika.parser.mail.RFC822ParserTest
> ---
> Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.699 sec <<< 
> FAILURE!
> testMultipart(org.apache.tika.parser.mail.RFC822ParserTest)  Time elapsed: 
> 0.152 sec  <<< FAILURE!
> org.mockito.exceptions.verification.TooManyActualInvocations: 
> xHTMLContentHandler.startElement(
> "http://www.w3.org/1999/xhtml",
> "div",
> "div",
> isA(org.xml.sax.Attributes)
> );
> Wanted 4 times but was 5
>   at 
> org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:87)
> Caused by: org.mockito.exceptions.cause.UndesiredInvocation: 
> Undesired invocation:
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
>   at 
> org.apache.tika.parser.ocr.TesseractOCRParser.extractOutput(TesseractOCRParser.java:243)
>   at 
> org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:155)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
>   at 
> org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:102)
>   at 
> org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133)
>   at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:76)
>   at 
> org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:84)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodA

[jira] [Commented] (TIKA-1422) org.apache.tika.parser.mail.RFC822ParserTest fails

2014-10-21 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178186#comment-14178186
 ] 

Hong-Thai Nguyen commented on TIKA-1422:


Applied the latest fix in r1633325 with some formatting. Thanks

> org.apache.tika.parser.mail.RFC822ParserTest fails
> --
>
> Key: TIKA-1422
> URL: https://issues.apache.org/jira/browse/TIKA-1422
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 1.7
>
> Attachments: TIKA-1422.Mattmann.100114.patch.txt, 
> TIKA-1422.Mattmann.100414.patch.txt, TIKA-1422.oleg.20141021.patch, 
> TIKA-1422.palsulich.100414.patch, TIKA-1422.palsulich.100714.patch
>
>
> I'm seeing test failures from:
> {noformat}
> Results :
> Failed tests:   testMultipart(org.apache.tika.parser.mail.RFC822ParserTest): 
> (..)
> Tests run: 538, Failures: 1, Errors: 0, Skipped: 1
> {noformat}
> CentOS6 VM image, running:
> {noformat}
> [mattmann@memex tika]$ java -version
> java version "1.7.0_67"
> Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
> Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
> [mattmann@memex tika]$ mvn -version
> Apache Maven 3.2.1 (ea8b2b07643dbb1b84b6d16e1f08391b666bc1e9; 
> 2014-02-14T09:37:52-08:00)
> Maven home: /usr/share/apache-maven
> Java version: 1.7.0_65, vendor: Oracle Corporation
> Java home: /data/home/mattmann/dist/jdk1.7.0_65/jre
> Default locale: en_US, platform encoding: UTF-8
> OS name: "linux", version: "2.6.32-431.23.3.el6.centos.plus.x86_64", arch: 
> "amd64", family: "unix"
> [mattmann@memex tika]$ 
> {noformat}
> Here are the surefire reports - no clue what's up here:
> {noformat}
> [mattmann@memex tika]$ more 
> tika-parsers/target/surefire-reports/org.apache.tika.parser.mail.RFC822ParserTest.txt
>  
> ---
> Test set: org.apache.tika.parser.mail.RFC822ParserTest
> ---
> Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.699 sec <<< 
> FAILURE!
> testMultipart(org.apache.tika.parser.mail.RFC822ParserTest)  Time elapsed: 
> 0.152 sec  <<< FAILURE!
> org.mockito.exceptions.verification.TooManyActualInvocations: 
> xHTMLContentHandler.startElement(
> "http://www.w3.org/1999/xhtml",
> "div",
> "div",
> isA(org.xml.sax.Attributes)
> );
> Wanted 4 times but was 5
>   at 
> org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:87)
> Caused by: org.mockito.exceptions.cause.UndesiredInvocation: 
> Undesired invocation:
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
>   at 
> org.apache.tika.parser.ocr.TesseractOCRParser.extractOutput(TesseractOCRParser.java:243)
>   at 
> org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:155)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
>   at 
> org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:102)
>   at 
> org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133)
>   at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:76)
>   at 
> org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:84)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:

[jira] [Commented] (TIKA-1422) org.apache.tika.parser.mail.RFC822ParserTest fails

2014-10-16 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173537#comment-14173537
 ] 

Hong-Thai Nguyen commented on TIKA-1422:


I'm not using Tesseract

> org.apache.tika.parser.mail.RFC822ParserTest fails
> --
>
> Key: TIKA-1422
> URL: https://issues.apache.org/jira/browse/TIKA-1422
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 1.7
>
> Attachments: TIKA-1422.Mattmann.100114.patch.txt, 
> TIKA-1422.Mattmann.100414.patch.txt, TIKA-1422.palsulich.100414.patch, 
> TIKA-1422.palsulich.100714.patch
>
>
> I'm seeing test failures from:
> {noformat}
> Results :
> Failed tests:   testMultipart(org.apache.tika.parser.mail.RFC822ParserTest): 
> (..)
> Tests run: 538, Failures: 1, Errors: 0, Skipped: 1
> {noformat}
> CentOS6 VM image, running:
> {noformat}
> [mattmann@memex tika]$ java -version
> java version "1.7.0_67"
> Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
> Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
> [mattmann@memex tika]$ mvn -version
> Apache Maven 3.2.1 (ea8b2b07643dbb1b84b6d16e1f08391b666bc1e9; 
> 2014-02-14T09:37:52-08:00)
> Maven home: /usr/share/apache-maven
> Java version: 1.7.0_65, vendor: Oracle Corporation
> Java home: /data/home/mattmann/dist/jdk1.7.0_65/jre
> Default locale: en_US, platform encoding: UTF-8
> OS name: "linux", version: "2.6.32-431.23.3.el6.centos.plus.x86_64", arch: 
> "amd64", family: "unix"
> [mattmann@memex tika]$ 
> {noformat}
> Here are the surefire reports - no clue what's up here:
> {noformat}
> [mattmann@memex tika]$ more 
> tika-parsers/target/surefire-reports/org.apache.tika.parser.mail.RFC822ParserTest.txt
>  
> ---
> Test set: org.apache.tika.parser.mail.RFC822ParserTest
> ---
> Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.699 sec <<< 
> FAILURE!
> testMultipart(org.apache.tika.parser.mail.RFC822ParserTest)  Time elapsed: 
> 0.152 sec  <<< FAILURE!
> org.mockito.exceptions.verification.TooManyActualInvocations: 
> xHTMLContentHandler.startElement(
> "http://www.w3.org/1999/xhtml",
> "div",
> "div",
> isA(org.xml.sax.Attributes)
> );
> Wanted 4 times but was 5
>   at 
> org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:87)
> Caused by: org.mockito.exceptions.cause.UndesiredInvocation: 
> Undesired invocation:
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
>   at 
> org.apache.tika.parser.ocr.TesseractOCRParser.extractOutput(TesseractOCRParser.java:243)
>   at 
> org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:155)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
>   at 
> org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:102)
>   at 
> org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133)
>   at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:76)
>   at 
> org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:84)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
> 

[jira] [Commented] (TIKA-1176) ChmDirectoryListingSet does not correctly enumerate directory entries

2014-10-13 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169146#comment-14169146
 ] 

Hong-Thai Nguyen commented on TIKA-1176:


Hi [~mdgeek], thanks for offering the code and the test file. Unfortunately, this 
check raised another exception on this file:
{code}
The full exception stack trace is included below:

org.apache.tika.exception.TikaException
at 
org.apache.tika.parser.chm.core.ChmExtractor.extractChmEntry(ChmExtractor.java:355)
at org.apache.tika.parser.chm.ChmParser.parse(ChmParser.java:70)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:326)
at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:285)
at 
org.apache.tika.gui.ParsingTransferHandler.importFiles(ParsingTransferHandler.java:94)
at 
org.apache.tika.gui.ParsingTransferHandler.importData(ParsingTransferHandler.java:77)
at javax.swing.TransferHandler.importData(TransferHandler.java:755)
at 
javax.swing.TransferHandler$DropHandler.drop(TransferHandler.java:1478)
at java.awt.dnd.DropTarget.drop(DropTarget.java:434)
at 
javax.swing.TransferHandler$SwingDropTarget.drop(TransferHandler.java:1203)
at 
sun.awt.dnd.SunDropTargetContextPeer.processDropMessage(SunDropTargetContextPeer.java:519)
at 
sun.awt.dnd.SunDropTargetContextPeer$EventDispatcher.dispatchDropEvent(SunDropTargetContextPeer.java:832)
at 
sun.awt.dnd.SunDropTargetContextPeer$EventDispatcher.dispatchEvent(SunDropTargetContextPeer.java:756)
at sun.awt.dnd.SunDropTargetEvent.dispatch(SunDropTargetEvent.java:30)
at java.awt.Component.dispatchEventImpl(Component.java:4517)
at java.awt.Container.dispatchEventImpl(Container.java:2097)
at java.awt.Component.dispatchEvent(Component.java:4488)
at 
java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4575)
at 
java.awt.LightweightDispatcher.processDropTargetEvent(Container.java:4310)
at java.awt.LightweightDispatcher.dispatchEvent(Container.java:4161)
at java.awt.Container.dispatchEventImpl(Container.java:2083)
at java.awt.Window.dispatchEventImpl(Window.java:2489)
at java.awt.Component.dispatchEvent(Component.java:4488)
at java.awt.EventQueue.dispatchEventImpl(EventQueue.java:674)
at java.awt.EventQueue.access$400(EventQueue.java:81)
at java.awt.EventQueue$2.run(EventQueue.java:633)
at java.awt.EventQueue$2.run(EventQueue.java:631)
at java.security.AccessController.doPrivileged(Native Method)
at 
java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:87)
at 
java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:98)
at java.awt.EventQueue$3.run(EventQueue.java:647)
at java.awt.EventQueue$3.run(EventQueue.java:645)
at java.security.AccessController.doPrivileged(Native Method)
at 
java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:87)
at java.awt.EventQueue.dispatchEvent(EventQueue.java:644)
at 
java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:269)
at 
java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:184)
at 
java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:174)
at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:169)
at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:161)
at java.awt.EventDispatchThread.run(EventDispatchThread.java:122)
Caused by: java.lang.ArrayIndexOutOfBoundsException
at java.lang.System.arraycopy(Native Method)
at 
org.apache.tika.parser.chm.core.ChmCommons.copyOfRange(ChmCommons.java:342)
at 
org.apache.tika.parser.chm.core.ChmCommons.getChmBlockSegment(ChmCommons.java:108)
at 
org.apache.tika.parser.chm.core.ChmExtractor.extractChmEntry(ChmExtractor.java:337)
... 43 more
{code} 

Our CHM parser is quite complex. Could you provide a full fix and a test with 
the expected output content for your file?

Thanks,

> ChmDirectoryListingSet does not correctly enumerate directory entries
> -
>
> Key: TIKA-1176
> URL: https://issues.apache.org/jira/browse/TIKA-1176
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.4
>Reporter: Doug Martin
> Attachments: HelpStudioSample.chm
>
>
> ChmDirectoryLis

[jira] [Commented] (TIKA-1422) org.apache.tika.parser.mail.RFC822ParserTest fails

2014-10-13 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169130#comment-14169130
 ] 

Hong-Thai Nguyen commented on TIKA-1422:


Strange, I'm unable to build on the latest code because of this failing test. I'm on Windows 
+ Oracle JDK7:
{code}

org.mockito.exceptions.verification.TooManyActualInvocations: 
xHTMLContentHandler.startElement(
"http://www.w3.org/1999/xhtml",
"div",
"div",
isA(org.xml.sax.Attributes)
);
Wanted 4 times but was 5
at 
org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:87)
Caused by: org.mockito.exceptions.cause.UndesiredInvocation: 
Undesired invocation:
at 
org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at 
org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
at 
org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
at 
org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at 
org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
at 
org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at 
org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at 
org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at 
org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at 
org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
at 
org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
at 
org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
at 
org.apache.tika.parser.ocr.TesseractOCRParser.extractOutput(TesseractOCRParser.java:250)
at 
org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:163)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
at 
org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:102)
at 
org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133)
at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:76)
at 
org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:84)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:236)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:134)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:113)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:189)
at 
org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.jav

[jira] [Commented] (TIKA-1446) CHM parser : wrong decompression of aligned blocks

2014-10-13 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169098#comment-14169098
 ] 

Hong-Thai Nguyen commented on TIKA-1446:


Thanks [~binhawking]. Any chance you can attach your fix, and possibly a test 
case, to this issue?

> CHM parser : wrong decompression of aligned blocks
> --
>
> Key: TIKA-1446
> URL: https://issues.apache.org/jira/browse/TIKA-1446
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.7
>Reporter: Bin Hawking
>Priority: Critical
>
> If an embedded file contains aligned blocks, the parser outputs garbled or 
> empty text for this file.
> I have fixed it myself by correcting decompressAlignedBlock() and its 
> preparation methods. The bug is mostly due to misuse of the main tree, align 
> tree, and length tree. Some of the trees are also built incorrectly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2014-10-13 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169090#comment-14169090
 ] 

Hong-Thai Nguyen commented on TIKA-1445:


Interesting question!
For me, parser selection and parser priority should be decided at runtime by 
configuration, not inside a parser.
Image parsing is an interesting case of competing parsers (Tesseract vs. the 
classical image parsers). We have two problems here:
1. When several parsers can handle the same MIME type, which one is selected?
2. When we have several parsers, can we apply all of them and merge the results 
(metadata & handler)?

* For case 1, if we use an override config of parsers at runtime, we can declare 
several parsers matching the same MIME type, and the later one in the list will 
be selected. We may extend the CLI/web service to inject this kind of 
configuration.
* For case 2, we don't have a solution for now. We may extend CompositeParser 
to accept a 'many' parsers mode and call the matching parsers in a chain. 
Merging the results is another problem: we can accept that a metadata name set 
by one parser is overridden by another parser. The ideal solution is (again) a 
nested structure in our metadata that stores each parser's result separately.
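The two cases above can be sketched in plain Java. This is only an 
illustration under stated assumptions, not Tika code: SimpleParser, select, 
parseAll and flatten are hypothetical names, and a "parser" here is reduced to 
a name, one MIME type and the metadata it would emit.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ParserSelectionSketch {

    /** Hypothetical stand-in for a parser: a name, one MIME type, and the metadata it emits. */
    static final class SimpleParser {
        final String name;
        final String mimeType;
        final Map<String, String> metadata = new LinkedHashMap<>();

        SimpleParser(String name, String mimeType, String... keyValues) {
            this.name = name;
            this.mimeType = mimeType;
            // keyValues is read as alternating key, value pairs
            for (int i = 0; i + 1 < keyValues.length; i += 2) {
                metadata.put(keyValues[i], keyValues[i + 1]);
            }
        }
    }

    /** Case 1: of all parsers matching the MIME type, the one declared last wins. */
    static SimpleParser select(List<SimpleParser> parsers, String mimeType) {
        SimpleParser selected = null;
        for (SimpleParser p : parsers) {
            if (p.mimeType.equals(mimeType)) {
                selected = p; // a later declaration overrides an earlier one
            }
        }
        return selected;
    }

    /** Case 2: call every matching parser in order and keep each result under the parser's name. */
    static Map<String, Map<String, String>> parseAll(List<SimpleParser> parsers, String mimeType) {
        Map<String, Map<String, String>> nested = new LinkedHashMap<>();
        for (SimpleParser p : parsers) {
            if (p.mimeType.equals(mimeType)) {
                nested.put(p.name, p.metadata);
            }
        }
        return nested;
    }

    /** Flat merge: a metadata name set by an earlier parser is overridden by a later one. */
    static Map<String, String> flatten(Map<String, Map<String, String>> nested) {
        Map<String, String> flat = new LinkedHashMap<>();
        for (Map<String, String> perParser : nested.values()) {
            flat.putAll(perParser);
        }
        return flat;
    }

    public static void main(String[] args) {
        List<SimpleParser> parsers = new ArrayList<>();
        parsers.add(new SimpleParser("image", "image/png", "width", "640", "source", "image"));
        parsers.add(new SimpleParser("ocr", "image/png", "text", "hello", "source", "ocr"));
        System.out.println(select(parsers, "image/png").name);
        System.out.println(flatten(parseAll(parsers, "image/png")));
    }
}
```

The nested map keeps every parser's values, while the flat merge loses the 
earlier value of any name that two parsers both set. That loss is the reason 
a nested metadata structure looks attractive for case 2.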

> Figure out how to add Image metadata extraction to Tesseract parser
> ---
>
> Key: TIKA-1445
> URL: https://issues.apache.org/jira/browse/TIKA-1445
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 1.7
>
> Attachments: TIKA-1445.Mattmann.101214.patch.txt
>
>
> Now that Tesseract is the default image parser in Tika for many image types, 
> consider how to add back in the metadata extraction capabilities by the other 
> Image parsers.





[jira] [Commented] (TIKA-1428) Microsoft Word 97 - 2003 (.doc) footnote references are Unicode Replacement Character

2014-09-25 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147880#comment-14147880
 ] 

Hong-Thai Nguyen commented on TIKA-1428:


Thanks [~theoettheo]. Any chance of a patch with a test case for this 
problem?

> Microsoft Word 97 - 2003 (.doc) footnote references are Unicode Replacement 
> Character
> -
>
> Key: TIKA-1428
> URL: https://issues.apache.org/jira/browse/TIKA-1428
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.4, 1.6
>Reporter: Theodor Sjöstedt
>Priority: Minor
> Attachments: TIKA-doc-footnotes-issue.png
>
>
> Footnotes from {{.doc}} documents are extracted, but the references to the 
> footnotes are replaced by the Unicode Replacement Character (�).
> I have tried this in 1.4 and 1.6.
> In 1.4, both reference in text and reference at footnote have been replaced.
> In 1.6, reference in text has disappeared completely.
> See attached image for original document, 1.4 Formatted text, and 1.6 
> Formatted text.





[jira] [Commented] (TIKA-1412) NPE in OpenDocumentParser

2014-09-22 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143043#comment-14143043
 ] 

Hong-Thai Nguyen commented on TIKA-1412:


Added a test in r1626706.

> NPE in OpenDocumentParser
> -
>
> Key: TIKA-1412
> URL: https://issues.apache.org/jira/browse/TIKA-1412
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.6
>Reporter: Andrzej Bialecki 
> Fix For: 1.7
>
> Attachments: TIKA-1412.diff
>
>
> There's a missing "else" in OpenDocumentParser when it constructs a 
> ZipInputStream from the InputStream, which results in NPE when the 
> InputStream is an instance of TikaInputStream but has neither openContainer 
> nor file:
> {code}
> ...
> Caused by: java.lang.NullPointerException
> at 
> org.apache.tika.parser.odf.OpenDocumentParser.parse(OpenDocumentParser.java:161)
>  ~[tika-parsers-1.6.jar:1.6]
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244) 
> ~[tika-core-1.6.jar:1.6]
> {code}
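The control-flow shape described in the issue can be sketched as follows. This 
is a minimal illustration and not the actual OpenDocumentParser source: 
WrappedStream is a hypothetical stand-in for TikaInputStream, and the methods 
return a label instead of building a ZipInputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.InputStream;

class MissingElseSketch {

    /** Hypothetical stand-in for TikaInputStream: may carry an open container, a file, or neither. */
    static class WrappedStream extends ByteArrayInputStream {
        final Object openContainer;
        final File file;

        WrappedStream(byte[] data, Object openContainer, File file) {
            super(data);
            this.openContainer = openContainer;
            this.file = file;
        }
    }

    /** Buggy shape: inside the instanceof branch there is no final else, so the result can stay null. */
    static String pickSourceBuggy(InputStream in) {
        String source = null;
        if (in instanceof WrappedStream) {
            WrappedStream w = (WrappedStream) in;
            if (w.openContainer != null) {
                source = "container";
            } else if (w.file != null) {
                source = "file";
            }
            // neither container nor file: source is still null here
        } else {
            source = "stream";
        }
        return source;
    }

    /** Fixed shape: every path returns a non-null value; the plain stream is the fallback. */
    static String pickSourceFixed(InputStream in) {
        if (in instanceof WrappedStream) {
            WrappedStream w = (WrappedStream) in;
            if (w.openContainer != null) {
                return "container";
            } else if (w.file != null) {
                return "file";
            }
        }
        return "stream";
    }
}
```

With the buggy shape, a wrapped stream that has neither a container nor a file 
yields null, and the first use of that value would throw the 
NullPointerException shown in the trace.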





[jira] [Updated] (TIKA-1421) Tika-Parsers tests fail on CentOS6 if tesseract isn't installed

2014-09-22 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1421:
---
Priority: Blocker  (was: Major)

> Tika-Parsers tests fail on CentOS6 if tesseract isn't installed
> ---
>
> Key: TIKA-1421
> URL: https://issues.apache.org/jira/browse/TIKA-1421
> Project: Tika
>  Issue Type: Bug
>  Components: parser
> Environment: CentOS6 AWS VM for DARPA Memex
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
>Priority: Blocker
> Fix For: 1.7
>
>
> While testing TIKA-93 on CentOS6, I ran into some test failing issues on a 
> 1.7-trunk fresh install of tika in tika-parsers:
> {noformat}
> Running org.apache.tika.parser.chm.TestChmLzxcControlData
> Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.008 sec
> Running org.apache.tika.parser.chm.TestChmBlockInfo
> Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.003 sec
> Running org.apache.tika.parser.chm.TestChmItsfHeader
> Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.005 sec
> Running org.apache.tika.parser.txt.TXTParserTest
> Tests run: 11, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.016 sec
> Running org.apache.tika.parser.txt.CharsetDetectorTest
> Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.02 sec
> Running org.apache.tika.parser.image.xmp.JempboxExtractorTest
> Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.014 sec
> Running org.apache.tika.parser.image.PSDParserTest
> Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.002 sec
> Running org.apache.tika.parser.image.ImageParserTest
> Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.034 sec
> Running org.apache.tika.parser.image.ImageMetadataExtractorTest
> Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.241 sec
> Running org.apache.tika.parser.image.MetadataFieldsTest
> Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0 sec
> Running org.apache.tika.parser.image.TiffParserTest
> Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.02 sec
> Running org.apache.tika.parser.font.FontParsersTest
> Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.192 sec
> Running org.apache.tika.parser.mp4.MP4ParserTest
> Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.07 sec
> Running org.apache.tika.parser.mp3.Mp3ParserTest
> Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.046 sec
> Running org.apache.tika.parser.mp3.MpegStreamTest
> Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.003 sec
> Running org.apache.tika.parser.dwg.DWGParserTest
> Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.02 sec
> Running org.apache.tika.parser.pkg.GzipParserTest
> Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.252 sec
> Running org.apache.tika.parser.pkg.Seven7ParserTest
> Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.37 sec
> Running org.apache.tika.parser.pkg.TarParserTest
> Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.118 sec
> Running org.apache.tika.parser.pkg.Bzip2ParserTest
> Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.233 sec
> Running org.apache.tika.parser.pkg.ArParserTest
> Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.017 sec
> Running org.apache.tika.parser.pkg.ZipParserTest
> Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.302 sec
> Running org.apache.tika.parser.video.FLVParserTest
> Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.026 sec
> Running org.apache.tika.parser.solidworks.SolidworksParserTest
> Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.019 sec
> Running org.apache.tika.parser.ibooks.iBooksParserTest
> Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.019 sec
> Running org.apache.tika.parser.ParsingReaderTest
> Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.018 sec
> Running org.apache.tika.parser.mail.RFC822ParserTest
> Tests run: 8, Failures: 1, Errors: 1, Skipped: 0, Time elapsed: 0.31 sec <<< 
> FAILURE!
> Running org.apache.tika.parser.mbox.MboxParserTest
> Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.026 sec
> Running org.apache.tika.parser.mbox.OutlookPSTParserTest
> Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.094 sec
> Running org.apache.tika.parser.jpeg.JpegParserTest
> Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.153 sec
> Running org.apache.tika.parser.executable.ExecutableParserTest
> Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.003 sec
> Running org.apache.tika.

[jira] [Commented] (TIKA-1421) Tika-Parsers tests fail on CentOS6 if tesseract isn't installed

2014-09-22 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143041#comment-14143041
 ] 

Hong-Thai Nguyen commented on TIKA-1421:


It's not only CentOS: this test also fails on my Windows machine without 
Tesseract installed.
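One way a test could guard against a missing external tool is to look for the 
executable on the PATH before asserting on OCR output. The sketch below is an 
assumption for illustration only. ExternalToolCheck and isOnPath are 
hypothetical names, and this is not how Tika itself detects Tesseract.

```java
import java.io.File;

class ExternalToolCheck {

    /** Returns true if an executable file with the given name exists in one of the PATH-style directories. */
    static boolean isOnPath(String executable, String pathValue) {
        if (pathValue == null || pathValue.isEmpty()) {
            return false;
        }
        boolean windows = System.getProperty("os.name", "").toLowerCase().contains("win");
        // On Windows the tool is usually installed with an .exe suffix
        String[] names = windows
                ? new String[] {executable + ".exe", executable}
                : new String[] {executable};
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue; // skip empty PATH entries
            }
            for (String name : names) {
                File candidate = new File(dir, name);
                if (candidate.isFile() && candidate.canExecute()) {
                    return true;
                }
            }
        }
        return false;
    }

    /** Convenience overload that reads the PATH environment variable of the current process. */
    static boolean isOnPath(String executable) {
        return isOnPath(executable, System.getenv("PATH"));
    }
}
```

A JUnit 4 test could then call Assume.assumeTrue(ExternalToolCheck.isOnPath("tesseract")) 
so that OCR-dependent assertions are skipped rather than failed on machines 
without Tesseract.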


[jira] [Resolved] (TIKA-1413) OOXML thumbnail name added to body

2014-09-09 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-1413.

Resolution: Fixed

> OOXML thumbnail name added to body
> --
>
> Key: TIKA-1413
> URL: https://issues.apache.org/jira/browse/TIKA-1413
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.6
>Reporter: Andrzej Bialecki 
>
> AbstractOOXMLExtractor.handleThumbnail processes thumbnails using 
> EmbeddedDocumentExtractor, but with the outputHtml flag set to true (unlike 
> other embedded parts in handleEmbeddedParts(...)).
> This results in adding the thumbnail name to the main body of the document 
> (as a package-entry), which in my opinion is wrong.
> Example:
> {code}
>  xmlns="http://www.w3.org/1999/xhtml">
> 
>  content="org.apache.tika.parser.microsoft.ooxml.OOXMLParser"/>
> 
>  content="application/vnd.openxmlformats-officedocument.presentationml.presentation"/>
> 
> The quick brown fox jumps over the lazy dog
> 
> The quick brown fox jumps over the lazy dog
> 
>  class="package-entry">thumbnail_0.jpeg
> {code}
> The extracted plain text looks like this (using tika-app):
> {code}
> The quick brown fox jumps over the lazy dog
> thumbnail_0.jpeg
> {code}
> The fix is trivial - change the flag in AbstractOOXMLExtractor:158 to false.
> I think also that the id attribute should be set to the real thumbnail path 
> within the package (i.e. tPart.getPartName().getName()) instead of the 
> artificially created sequential name.
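The effect of the flag described above can be sketched as follows. This is a 
simplified illustration under assumptions, not the AbstractOOXMLExtractor or 
EmbeddedDocumentExtractor code: handleEmbedded is a hypothetical helper that 
only models the decision to write the part's name into the parent body.

```java
class OutputHtmlFlagSketch {

    /**
     * Hypothetical, simplified stand-in for handing an embedded part to an extractor.
     * When outputHtml is true, the part's name is also written into the parent's body.
     */
    static void handleEmbedded(String partName, StringBuilder parentBody, boolean outputHtml) {
        if (outputHtml) {
            parentBody.append("<div class=\"package-entry\">")
                      .append(partName)
                      .append("</div>");
        }
        // Parsing of the embedded part itself is omitted in this sketch.
    }

    public static void main(String[] args) {
        StringBuilder withFlag = new StringBuilder("<p>The quick brown fox jumps over the lazy dog</p>");
        handleEmbedded("thumbnail_0.jpeg", withFlag, true);
        StringBuilder withoutFlag = new StringBuilder("<p>The quick brown fox jumps over the lazy dog</p>");
        handleEmbedded("thumbnail_0.jpeg", withoutFlag, false);
        System.out.println(withFlag);
        System.out.println(withoutFlag);
    }
}
```

With the flag set to false the parent body is left untouched, which matches 
the fix proposed in the description.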





[jira] [Commented] (TIKA-1413) OOXML thumbnail name added to body

2014-09-09 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126949#comment-14126949
 ] 

Hong-Thai Nguyen commented on TIKA-1413:


I agree. Fixed in r1623819; the _id_ is now taken from partName().






[jira] [Commented] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected

2014-07-29 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077885#comment-14077885
 ] 

Hong-Thai Nguyen commented on TIKA-1373:


The fix should be in the next official 1.6 release, but you can already try it 
with this release candidate: http://people.apache.org/~mattmann/apache-tika-1.6/rc1/

> AutoDetectParser extracts no text when SourceCodeParser is selected
> ---
>
> Key: TIKA-1373
> URL: https://issues.apache.org/jira/browse/TIKA-1373
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.5
>Reporter: Andrés Aguilar-Umaña
>
> When using the AutoDetectParser in Java code, and the SourceCodeParser is 
> selected (e.g. for Java files), the handler gets no text:
> I have this test program:
> {code}
> String data = "public class HelloWorld {}";
> ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
> Parser autoDetectParser = new AutoDetectParser();
> BodyContentHandler bch = new BodyContentHandler(50);
> ParseContext parseContext = new ParseContext();
> Metadata metadata = new Metadata();
> metadata.set(Metadata.CONTENT_TYPE, "text/x-java-source");
> try {
>     autoDetectParser.parse(bais, bch, metadata, parseContext);
> } catch (Exception e) {
>     e.printStackTrace();
> }
> System.out.println("Text extracted: " + bch.toString());
> {code}
> It returns (using the SourceCodeParser): 
> {code} > Text extracted: {code}
> But when I use this code:
> {code}
> String data = "public class HelloWorld {}";
> ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
> Parser autoDetectParser = new AutoDetectParser();
> BodyContentHandler bch = new BodyContentHandler(50);
> ParseContext parseContext = new ParseContext();
> Metadata metadata = new Metadata();
> metadata.set(Metadata.CONTENT_TYPE, "text/plain");
> try {  autoDetectParser.parse(bais, bch, metadata, parseContext);  } 
> catch (Exception e) {  e.printStackTrace();  }
> System.out.println("Text extracted: " + bch.toString());
> {code}
> The Text Parser is used and I get:
> {code} > Text extracted: public class HelloWorld {} {code}
> I have also tested this command: 
> {code}
> > java -jar tika-app-1.5.jar -t D:\text.java
>   (no text)
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected

2014-07-24 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-1373.


Resolution: Fixed






[jira] [Commented] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected

2014-07-24 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073042#comment-14073042
 ] 

Hong-Thai Nguyen commented on TIKA-1373:


HtmlParser skips the tags generated by JHighlight. I found a solution by using 
the TagSoup parser directly. Committed in r1613051.
As I mentioned in TIKA-1224, this parser is a quick & dirty approach to parsing 
source code files. Again, the _right_ approach would be a dedicated parser per 
language that parses the elements in depth and builds events on the fly.






[jira] [Comment Edited] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected

2014-07-23 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071643#comment-14071643
 ] 

Hong-Thai Nguyen edited comment on TIKA-1373 at 7/23/14 1:42 PM:
-

Can you format your description with the {noformat}{code}{noformat} annotation? 
Also, if I understand correctly, the output of the first section is empty?


was (Author: thaichat04):
Can you format your description with {code} annotation and if I understand well 
the output of 1st section is empty ?

> AutoDetectParser extracts no text when SourceCodeParser is selected
> ---
>
> Key: TIKA-1373
> URL: https://issues.apache.org/jira/browse/TIKA-1373
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.5
>Reporter: Andrés Aguilar-Umaña
>
> When using the AutoDetectParser in Java code, and the SourceCodeParser is 
> selected (i.e. java files), the handler gets no text:
> I have this test program:
> String data = "public class HelloWorld {}";
> ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
> Parser autoDetectParser = new AutoDetectParser();
> autoDetectParser = new SourceCodeParser();
> BodyContentHandler bch = new BodyContentHandler(50);
> ParseContext parseContext = new ParseContext();
> Metadata metadata = new Metadata();
> metadata.set(Metadata.CONTENT_TYPE, "text/x-java-source");
> try {
>     autoDetectParser.parse(bais, bch, metadata, parseContext);
> } catch (Exception e) {
>     e.printStackTrace();
> }
> System.out.println("Text extracted: " + bch.toString());
> It returns (using the SourceCodeParser): 
> > Text extracted: 
> But when I use this code:
> String data = "public class HelloWorld {}";
> ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
> Parser autoDetectParser = new AutoDetectParser();
> autoDetectParser = new SourceCodeParser();
> BodyContentHandler bch = new BodyContentHandler(50);
> ParseContext parseContext = new ParseContext();
> Metadata metadata = new Metadata();
> metadata.set(Metadata.CONTENT_TYPE, "text/plain");
> try {
>     autoDetectParser.parse(bais, bch, metadata, parseContext);
> } catch (Exception e) {
>     e.printStackTrace();
> }
> System.out.println("Text extracted: " + bch.toString());
> The Text Parser is used and I get:
> > Text extracted: public class HelloWorld {}
> I have also tested this command: 
> > java -jar tika-app-1.5.jar -t D:\text.java
>   (no text)
> > 





[jira] [Commented] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected

2014-07-23 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071713#comment-14071713
 ] 

Hong-Thai Nguyen commented on TIKA-1373:


Yes, I ran into this problem when implementing this parser. How can we tell that 
the caller is asking for text instead of HTML? Would checking whether the handler 
is an instance of BodyContentHandler be enough?
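
One caveat with an instanceof check can be sketched with JDK SAX classes only. 
The sketch below is an illustration, not Tika code: PlainTextHandler and 
WrappingHandler are hypothetical stand-ins for a text handler and for a 
decorator (Tika has decorators such as ContentHandlerDecorator). If the handler 
reaches the parser wrapped in a decorator, a check on the outermost object no 
longer sees the text handler.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for a handler that collects plain text.
class PlainTextHandler extends DefaultHandler {
    final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }
}

// Hypothetical stand-in for a decorator that forwards events to a delegate.
class WrappingHandler extends DefaultHandler {
    final ContentHandler delegate;

    WrappingHandler(ContentHandler delegate) {
        this.delegate = delegate;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        delegate.characters(ch, start, length);
    }
}

class HandlerTypeCheck {
    // Naive check: only looks at the type of the outermost handler.
    static boolean wantsPlainText(ContentHandler handler) {
        return handler instanceof PlainTextHandler;
    }

    public static void main(String[] args) {
        ContentHandler direct = new PlainTextHandler();
        ContentHandler wrapped = new WrappingHandler(new PlainTextHandler());
        System.out.println(wantsPlainText(direct));  // true
        System.out.println(wantsPlainText(wrapped)); // false
    }
}
```

Under that assumption, an instanceof test alone would probably not be enough; 
an explicit hint (for example in the parse context) may be more robust.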






[jira] [Commented] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected

2014-07-23 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071643#comment-14071643
 ] 

Hong-Thai Nguyen commented on TIKA-1373:


Can you format your description with the {code} annotation? Also, if I understand 
correctly, the output of the 1st section is empty?






[jira] [Updated] (TIKA-1095) Only gibberish extracted from this PDF

2014-07-15 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1095:
---

Labels: pdfbox  (was: patch)

> Only gibberish extracted from this PDF
> --
>
> Key: TIKA-1095
> URL: https://issues.apache.org/jira/browse/TIKA-1095
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.3
> Environment: Probably any
>Reporter: Bas van Meurs
>  Labels: pdfbox
> Attachments: ALG 2010-05-19 03 bijlage 1 -  besluitenlijst dagelijks 
> bestuur d d  10 februari 2010.pdf, test.txt
>
>
> java -jar /usr/share/tika/tika-app-1.3.jar -t 
> "/home/adrupal/www/sites/stadsregio.nl/files/files/Agendastukken/ALG 
> 2010-05-19 03 bijlage 1 -  besluitenlijst dagelijks bestuur d d  10 februari 
> 2010.pdf" > /tmp/test.txt
> This produces all gibberish.





[jira] [Updated] (TIKA-1095) Only gibberish extracted from this PDF

2014-07-15 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1095:
---

Component/s: (was: general)
 parser

> Only gibberish extracted from this PDF
> --
>
> Key: TIKA-1095
> URL: https://issues.apache.org/jira/browse/TIKA-1095
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.3
> Environment: Probably any
>Reporter: Bas van Meurs
>  Labels: pdfbox
> Attachments: ALG 2010-05-19 03 bijlage 1 -  besluitenlijst dagelijks 
> bestuur d d  10 februari 2010.pdf, test.txt
>
>
> java -jar /usr/share/tika/tika-app-1.3.jar -t 
> "/home/adrupal/www/sites/stadsregio.nl/files/files/Agendastukken/ALG 
> 2010-05-19 03 bijlage 1 -  besluitenlijst dagelijks bestuur d d  10 februari 
> 2010.pdf" > /tmp/test.txt
> This produces all gibberish.





[jira] [Commented] (TIKA-1095) Only gibberish extracted from this PDF

2014-07-15 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14061867#comment-14061867
 ] 

Hong-Thai Nguyen commented on TIKA-1095:


Even the latest Tika can't convert this file. It seems to be a font problem in 
this PDF file. Can you report this to the PDFBox tracker: 
https://issues.apache.org/jira/browse/PDFBOX/ ?

> Only gibberish extracted from this PDF
> --
>
> Key: TIKA-1095
> URL: https://issues.apache.org/jira/browse/TIKA-1095
> Project: Tika
>  Issue Type: Bug
>  Components: general
>Affects Versions: 1.3
> Environment: Probably any
>Reporter: Bas van Meurs
>  Labels: patch
> Attachments: ALG 2010-05-19 03 bijlage 1 -  besluitenlijst dagelijks 
> bestuur d d  10 februari 2010.pdf, test.txt
>
>
> java -jar /usr/share/tika/tika-app-1.3.jar -t 
> "/home/adrupal/www/sites/stadsregio.nl/files/files/Agendastukken/ALG 
> 2010-05-19 03 bijlage 1 -  besluitenlijst dagelijks bestuur d d  10 februari 
> 2010.pdf" > /tmp/test.txt
> This produces all gibberish.





[jira] [Commented] (TIKA-1332) Create "eval" code

2014-06-26 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14044706#comment-14044706
 ] 

Hong-Thai Nguyen commented on TIKA-1332:


What you are describing is something like _functional_ tests for Tika. Tools 
such as Cucumber or FitNesse may help make the tests more readable, and we 
could obtain a report as output?

> Create "eval" code
> --
>
> Key: TIKA-1332
> URL: https://issues.apache.org/jira/browse/TIKA-1332
> Project: Tika
>  Issue Type: Sub-task
>  Components: cli, general, server
>Reporter: Tim Allison
>
> For this issue, we can start with code to gather statistics on each run (# of 
> exceptions per file type, most common exceptions per file type, number of 
> metadata items, total text extracted, etc).  We should also be able to 
> compare one run against another.  Going forward, there's plenty of room to 
> improve.





[jira] [Commented] (TIKA-1350) OutlookPSTParser: Unknown message type: IPM.Note

2014-06-23 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14040519#comment-14040519
 ] 

Hong-Thai Nguyen commented on TIKA-1350:


Richard Johnson (author of java-libpst) is trying to deploy the new version 0.8.1 
to Maven Central (ref. 
https://issues.sonatype.org/browse/OSSRH-8965?focusedCommentId=260254&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-260254).

Once this is done, we can upgrade the Tika dependency to 0.8.1 to get the fix.

> OutlookPSTParser: Unknown message type: IPM.Note
> 
>
> Key: TIKA-1350
> URL: https://issues.apache.org/jira/browse/TIKA-1350
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.7
>Reporter: Jonathan Evans
>  Labels: libpst, parser, pst
> Fix For: 1.7
>
>   Original Estimate: 0.2h
>  Remaining Estimate: 0.2h
>
> When parsing some emails in a PST file I get the error "Unknown message type: 
> IPM.Note" preventing them from being parsed. This is because of an extra null 
> byte at the end of the message class string.
> This has been fixed in version 0.8.1 of java-libpst so a version bump is all 
> that is required. 
> https://github.com/rjohnsondev/java-libpst/issues/14
> I would attempt to do this myself but I am unsure how to open a pull request 
> with SVN.
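
The failure mode described above (a message class string carrying an extra 
trailing null byte) can be illustrated with plain Java strings. This is only a 
sketch of the symptom and of one possible normalization; it is not the actual 
java-libpst fix, and MessageClassCheck and stripTrailingNulls are hypothetical 
names.

```java
class MessageClassCheck {
    // Removes trailing NUL characters (U+0000) from a string.
    static String stripTrailingNulls(String s) {
        int end = s.length();
        while (end > 0 && s.charAt(end - 1) == '\0') {
            end--;
        }
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        String raw = "IPM.Note\0"; // message class with an extra null byte
        System.out.println(raw.equals("IPM.Note"));                     // false
        System.out.println(stripTrailingNulls(raw).equals("IPM.Note")); // true
    }
}
```

The two strings look identical when printed, which is why the error message 
"Unknown message type: IPM.Note" appears to name a known type.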





[jira] [Commented] (TIKA-1320) extract text from jpeg in solr tika

2014-06-04 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14017473#comment-14017473
 ] 

Hong-Thai Nguyen commented on TIKA-1320:


OCR is a solution: TIKA-93. Unfortunately, this issue is still in progress. 

> extract text from jpeg in solr tika
> ---
>
> Key: TIKA-1320
> URL: https://issues.apache.org/jira/browse/TIKA-1320
> Project: Tika
>  Issue Type: New Feature
>Reporter: muruganv
>  Labels: features
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> How to extract text from jpeg or image format or tiff in solr tika





[jira] [Commented] (TIKA-1308) Support in memory parse mode(don't create temp file): to support run Tika in GAE

2014-05-26 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14008704#comment-14008704
 ] 

Hong-Thai Nguyen commented on TIKA-1308:


A virtual file system may be a solution if you're on Java 7. The NIO API's 
FileSystemProvider [1] allows you to define or inject a virtual file system 
(e.g. Commons VFS [2]).

[1] 
http://docs.oracle.com/javase/7/docs/api/java/nio/file/spi/FileSystemProvider.html
[2] http://commons.apache.org/proper/commons-vfs/filesystems.html






> Support in memory parse mode(don't create temp file): to support run Tika in 
> GAE
> 
>
> Key: TIKA-1308
> URL: https://issues.apache.org/jira/browse/TIKA-1308
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.5
>Reporter: yuanyun.cn
>  Labels: gae
> Fix For: 1.6
>
>
> I am trying to use Tika in GAE and write a simple servlet to extract meta 
> data info from jpeg:
> String urlStr = req.getParameter("imageUrl");
> byte[] oldImageData = IOUtils.toByteArray(new URL(urlStr));
> ByteArrayInputStream bais = new ByteArrayInputStream(oldImageData);
> Metadata metadata = new Metadata();
> BodyContentHandler ch = new BodyContentHandler();
> AutoDetectParser parser = new AutoDetectParser();
> parser.parse(bais, ch, metadata, new ParseContext());
> bais.close();
> This fails with exception:
> Caused by: java.lang.SecurityException: Unable to create temporary file
>   at java.io.File.createTempFile(File.java:1986)
>   at 
> org.apache.tika.io.TemporaryResources.createTemporaryFile(TemporaryResources.java:66)
>   at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:533)
>   at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242
> Checked the code, in 
> org.apache.tika.parser.jpeg.JpegParser.parse(InputStream, ContentHandler, 
> Metadata, ParseContext), it creates a temp file from the input stream.
> I can understand why Tika creates a temp file from the stream: so Tika can 
> parse it multiple times.
> But as GAE and other cloud servers are getting more popular, is it possible 
> to avoid creating the temp file? Instead we could copy the original stream to 
> a byte array stream, so Tika can still parse it multiple times.
> -- This will put a limit on the file size, as Tika keeps the whole file in 
> memory, but it would make Tika work in GAE and maybe other cloud servers.
> We could add a parameter to parser.parse to indicate whether to parse in 
> memory only.
>  
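
The in-memory idea from the description (copy the original stream to a byte 
array so it can be read several times) can be sketched with JDK classes only. 
This is a minimal illustration, not Tika code; InMemoryRereadSketch and 
readFully are hypothetical names, and the whole input is held in memory, so 
the file size limit mentioned above applies.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

class InMemoryRereadSketch {
    // Reads the whole stream into a byte array (plain loop, works on Java 7).
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        InputStream source = new ByteArrayInputStream(new byte[] {1, 2, 3});
        byte[] data = readFully(source);
        // Each pass gets a fresh stream over the same bytes; no temp file is used.
        InputStream firstPass = new ByteArrayInputStream(data);
        InputStream secondPass = new ByteArrayInputStream(data);
        System.out.println(firstPass.read() == secondPass.read()); // true
    }
}
```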





[jira] [Resolved] (TIKA-1290) Upgrade to PDFBOX 1.8.5

2014-05-06 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-1290.


Resolution: Fixed

r1592780

> Upgrade to PDFBOX 1.8.5
> ---
>
> Key: TIKA-1290
> URL: https://issues.apache.org/jira/browse/TIKA-1290
> Project: Tika
>  Issue Type: Improvement
>Reporter: Hong-Thai Nguyen
>  Labels: trivial
>
> PDFBOX 1.8.5 has been released: http://pdfbox.apache.org/downloads.html#recent
> We can update to this version, and possibly also test & fix TIKA-1231





[jira] [Updated] (TIKA-1290) Upgrade to PDFBOX 1.8.5

2014-05-06 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1290:
---

Labels: trivial  (was: )






[jira] [Created] (TIKA-1290) Upgrade to PDFBOX 1.8.5

2014-05-02 Thread Hong-Thai Nguyen (JIRA)
Hong-Thai Nguyen created TIKA-1290:
--

 Summary: Upgrade to PDFBOX 1.8.5
 Key: TIKA-1290
 URL: https://issues.apache.org/jira/browse/TIKA-1290
 Project: Tika
  Issue Type: Improvement
Reporter: Hong-Thai Nguyen


PDFBOX 1.8.5 has been released: http://pdfbox.apache.org/downloads.html#recent

We can update to this version, and possibly also test & fix TIKA-1231





[jira] [Commented] (TIKA-1287) Update NetCDF .jar file on Maven Central

2014-05-02 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987521#comment-13987521
 ] 

Hong-Thai Nguyen commented on TIKA-1287:


Technically, it is not difficult to upload a new jar to Maven Central: just 
follow the steps mentioned by [~gagravarr]. I did this recently for java-libpst.
BTW, we must take care with the license of the lib if you are not its author. 
See 
http://www.unidata.ucar.edu/software/thredds/current/netcdf-java/documentation.htm,
 netCDF's license is not the Apache license. You should contact them first to ask 
for authorization if you want to upload this lib yourself.

> Update NetCDF .jar file on Maven Central
> 
>
> Key: TIKA-1287
> URL: https://issues.apache.org/jira/browse/TIKA-1287
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.5
>Reporter: Ann Burgess
>  Labels: jar, maven, netcdf, tika, unit-test, update
>
> I am working to update the NetCDFParser file.  When using the most-recent 
> .jar file available from http://www.unidata.ucar.edu/ at the command line I 
> receive a note about a deprecated API: 
> javac -classpath 
> ../../../../tika-core/target/tika-core-1.6-SNAPSHOT.jar:../../../../toolsUI-4.3.jar
>  org/apache/tika/parser/netcdf/NetCDFParser.java
> Note: org/apache/tika/parser/netcdf/NetCDFParser.java uses or overrides a 
> deprecated API.
> Note: Recompile with -Xlint:deprecation for details.
> After updating the NetCDFParser file with non-deprecated methods (e.g. 
> changing "dimension.getName()" to "dimension.getFullName()") however, I get 
> failed unit tests in maven, which I assume is because the Maven Central Repo 
> has the lapsed version of the .jar file needed for NetCDF files (
> http://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22edu.ucar%22%20AND%20a%3A%22netcdf%22)
>  .
> Can anyone provide insight into how I get the updated .jar file into the 
> Maven Central Repository? Is there an alternative method to update Tika so I 
> can run my unit tests in Maven?





[jira] [Commented] (TIKA-1283) Add "thumbnail" as possible metadata item to TikaCoreProperties

2014-04-28 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983434#comment-13983434
 ] 

Hong-Thai Nguyen commented on TIKA-1283:


+1 from me for creating a thumbnail field in the metadata set.
- For OOXML, the thumbnail is an item inside the archive (see TIKA-1223). 
PowerPoint always embeds a JPEG thumbnail, but it is optional for docx & xlsx 
(available only when the user checks the 'save preview' option when saving the 
document).
- For OLE documents, see: http://poi.apache.org/hpsf/thumbnails.html. You can 
get the thumbnail content from the POI API:
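{code}
import java.io.File;
import org.apache.poi.hpsf.SummaryInformation;
import org.apache.poi.hpsf.Thumbnail;
import org.apache.poi.hwpf.HWPFDocumentCore;
import org.apache.poi.hwpf.converter.AbstractWordUtils;

static byte[] process(File docFile) throws Exception {
    // Load the Word document and read its summary information.
    final HWPFDocumentCore wordDocument = AbstractWordUtils.loadDoc(docFile);
    SummaryInformation summaryInformation = wordDocument.getSummaryInformation();
    System.out.println(summaryInformation.getAuthor());
    System.out.println(summaryInformation.getApplicationName() + ":"
            + summaryInformation.getTitle());
    // Wrap the raw thumbnail bytes to inspect the clipboard format.
    Thumbnail thumbnail = new Thumbnail(summaryInformation.getThumbnail());
    System.out.println(thumbnail.getClipboardFormat());
    System.out.println(thumbnail.getClipboardFormatTag());
    // Return the thumbnail data as WMF.
    return thumbnail.getThumbnailAsWMF();
}
{code}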
Unfortunately, there's an open bug in POI that prevents getting the thumbnail 
content properly: 
https://issues.apache.org/bugzilla/show_bug.cgi?id=56194
For docx, xlsx & OLE formats, the thumbnails are in WMF & EMF formats, which are 
quite difficult to handle. But this is out of our scope.


> Add "thumbnail" as possible metadata item to TikaCoreProperties
> ---
>
> Key: TIKA-1283
> URL: https://issues.apache.org/jira/browse/TIKA-1283
> Project: Tika
>  Issue Type: Improvement
>  Components: metadata
>Reporter: Tim Allison
>Priority: Minor
>
> TIKA-90 originally requested to add thumbnails to a document's metadata.
> I'd like to have a unified way of determining whether an embedded 
> document/resource is a thumbnail or a regular attachment.
> With the changes in TIKA-1223 (ooxml) and TIKA-1010 (rtf), we are now pulling 
> out more thumbnails than before.
> I propose adding "tika:thumbnail" to the metadata of each thumbnail image.  
> The consumer can then determine what to do with the embedded resource based 
> on the metadata.





[jira] [Resolved] (TIKA-1279) Missing return lines at output of SourceCodeParser

2014-04-24 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-1279.


Resolution: Fixed

Thanks [~rgauss] for this good catch. I fixed it, with more tests, in r1589742.
Hoping that we can move away from Java 6 soon :)

> Missing return lines at output of SourceCodeParser
> --
>
> Key: TIKA-1279
> URL: https://issues.apache.org/jira/browse/TIKA-1279
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.5
>Reporter: Hong-Thai Nguyen
>Assignee: Hong-Thai Nguyen
>Priority: Trivial
> Fix For: 1.6
>
>
> xhtml output is on a single line.





[jira] [Resolved] (TIKA-1276) Missing embedded dependencies in tika-bundle

2014-04-24 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-1276.


Resolution: Fixed

Thanks [~rwesten], I applied your patch in r1589717.

> Missing embedded dependencies in tika-bundle
> 
>
> Key: TIKA-1276
> URL: https://issues.apache.org/jira/browse/TIKA-1276
> Project: Tika
>  Issue Type: Bug
>  Components: packaging
>Affects Versions: 1.5
> Environment: OSGI, Apache Felix via Apache Sling Launcher
>Reporter: Rupert Westenthaler
> Fix For: 1.6
>
> Attachments: TIKA-1276_20140423_rwesten.diff
>
>
> While updating from tika 1.2 to 1.5 I noticed that the 
> `org.apache.tika:tika-bundle:1.5` module has some missing dependencies.
> 1. `com.uwyn:jhighlight:1.0` is not embedded
> Because of that installing the bundle results in the following exception
> {code}
> org.osgi.framework.BundleException: Unresolved constraint in bundle 
> org.apache.tika.bundle [103]: Unable to resolve 103.0: missing requirement 
> [103.0] osgi.wiring.package; 
> (osgi.wiring.package=com.uwyn.jhighlight.renderer))
> org.osgi.framework.BundleException: Unresolved constraint in bundle 
> org.apache.tika.bundle [103]: Unable to resolve 103.0: missing requirement 
> [103.0] osgi.wiring.package; 
> (osgi.wiring.package=com.uwyn.jhighlight.renderer)
>   at 
> org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962)
>   at org.apache.felix.framework.Felix.startBundle(Felix.java:2025)
>   at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279)
>   at 
> org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304)
>   at java.lang.Thread.run(Thread.java:744)
> {code}
> 2. `org.ow2.asm:asm:4.1` is not embedded because 
> `org.apache.tika:tika-core:1.5` uses `org.ow2.asm:asm-debug-all:4.1` and 
> therefore the `Embed-Dependency` directive `asm` does not match any 
> dependency. 
> Because of that one gets the following exception (after fixing (1))
> {code}
> org.osgi.framework.BundleException: Unresolved constraint in bundle 
> org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement 
> [96.0] osgi.wiring.package; 
> (&(osgi.wiring.package=org.objectweb.asm)(version>=4.1.0)(!(version>=5.0.0
> org.osgi.framework.BundleException: Unresolved constraint in bundle 
> org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement 
> [96.0] osgi.wiring.package; 
> (&(osgi.wiring.package=org.objectweb.asm)(version>=4.1.0)(!(version>=5.0.0)))
>   at 
> org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962)
>   at org.apache.felix.framework.Felix.startBundle(Felix.java:2025)
>   at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279)
>   at 
> org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304)
>   at java.lang.Thread.run(Thread.java:744)
> {code}
> There are two possibilities to fix this: (a) change the `Embed-Dependency` to 
> `asm-debug-all`, or (b) add a dependency on `org.ow2.asm:asm:4.1` to the 
> tika-bundle pom file.
> 3. `edu.ucar:netcdf:4.2-min` is not embedded
> Because of that one does get the following exception (after fixing (1) and 
> (2))
> {code}
> org.osgi.framework.BundleException: Unresolved constraint in bundle 
> org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement 
> [96.0] osgi.wiring.package; (osgi.wiring.package=ucar.ma2))
> org.osgi.framework.BundleException: Unresolved constraint in bundle 
> org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement 
> [96.0] osgi.wiring.package; (osgi.wiring.package=ucar.ma2)
>   at 
> org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962)
>   at org.apache.felix.framework.Felix.startBundle(Felix.java:2025)
>   at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279)
>   at 
> org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304)
>   at java.lang.Thread.run(Thread.java:744)
> {code}
> 4. The `com.adobe.xmp:xmpcore:5.1.2` dependency is required at runtime
> After fixing the above issues the tika-bundle was started successfully. 
> However, when extracting EXIF metadata from a jpeg image I got the following 
> exception.
> {code}
> java.lang.NoClassDefFoundError: com/adobe/xmp/XMPException
>   at 
> com.drew.imaging.jpeg.JpegMetadataReader.extractMetadataFromJpegSegmentReader(JpegMetadataReader.java:112)
>   at 
> com.drew.imaging.jpeg.JpegMetadataReader.readMetadata(JpegMetadataReader.java:71)
>   at 
> org.apache.tika.parser.image.ImageMetadataExtractor.parseJpeg(ImageMetadataExtractor.java:91)
>   at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
>   [..]
> {code}
> Emb

[jira] [Updated] (TIKA-1276) Missing embedded dependencies in tika-bundle

2014-04-24 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1276:
---

Fix Version/s: 1.6

> Missing embedded dependencies in tika-bundle
> 
>
> Key: TIKA-1276
> URL: https://issues.apache.org/jira/browse/TIKA-1276
> Project: Tika
>  Issue Type: Bug
>  Components: packaging
>Affects Versions: 1.5
> Environment: OSGI, Apache Felix via Apache Sling Launcher
>Reporter: Rupert Westenthaler
> Fix For: 1.6
>
> Attachments: TIKA-1276_20140423_rwesten.diff
>
>
> While updating from tika 1.2 to 1.5 I noticed that the 
> `org.apache.tika:tika-bundle:1.5` module has some missing dependencies.
> 1. `com.uwyn:jhighlight:1.0` is not embedded
> Because of that installing the bundle results in the following exception
> {code}
> org.osgi.framework.BundleException: Unresolved constraint in bundle 
> org.apache.tika.bundle [103]: Unable to resolve 103.0: missing requirement 
> [103.0] osgi.wiring.package; 
> (osgi.wiring.package=com.uwyn.jhighlight.renderer))
> org.osgi.framework.BundleException: Unresolved constraint in bundle 
> org.apache.tika.bundle [103]: Unable to resolve 103.0: missing requirement 
> [103.0] osgi.wiring.package; 
> (osgi.wiring.package=com.uwyn.jhighlight.renderer)
>   at 
> org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962)
>   at org.apache.felix.framework.Felix.startBundle(Felix.java:2025)
>   at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279)
>   at 
> org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304)
>   at java.lang.Thread.run(Thread.java:744)
> {code}
> 2. `org.ow2.asm:asm:4.1` is not embedded because 
> `org.apache.tika:tika-core:1.5` uses `org.ow2.asm:asm-debug-all:4.1` and 
> therefore the `Embed-Dependency` directive `asm` does not match any 
> dependency. 
> Because of that one gets the following exception (after fixing (1))
> {code}
> org.osgi.framework.BundleException: Unresolved constraint in bundle 
> org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement 
> [96.0] osgi.wiring.package; 
> (&(osgi.wiring.package=org.objectweb.asm)(version>=4.1.0)(!(version>=5.0.0
> org.osgi.framework.BundleException: Unresolved constraint in bundle 
> org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement 
> [96.0] osgi.wiring.package; 
> (&(osgi.wiring.package=org.objectweb.asm)(version>=4.1.0)(!(version>=5.0.0)))
>   at 
> org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962)
>   at org.apache.felix.framework.Felix.startBundle(Felix.java:2025)
>   at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279)
>   at 
> org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304)
>   at java.lang.Thread.run(Thread.java:744)
> {code}
> There are two possibilities to fix this: (a) change the `Embed-Dependency` to 
> `asm-debug-all`, or (b) add a dependency on `org.ow2.asm:asm:4.1` to the 
> tika-bundle pom file.
> 3. `edu.ucar:netcdf:4.2-min` is not embedded
> Because of that one does get the following exception (after fixing (1) and 
> (2))
> {code}
> org.osgi.framework.BundleException: Unresolved constraint in bundle 
> org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement 
> [96.0] osgi.wiring.package; (osgi.wiring.package=ucar.ma2))
> org.osgi.framework.BundleException: Unresolved constraint in bundle 
> org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement 
> [96.0] osgi.wiring.package; (osgi.wiring.package=ucar.ma2)
>   at 
> org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962)
>   at org.apache.felix.framework.Felix.startBundle(Felix.java:2025)
>   at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279)
>   at 
> org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304)
>   at java.lang.Thread.run(Thread.java:744)
> {code}
> 4. The `com.adobe.xmp:xmpcore:5.1.2` dependency is required at runtime
> After fixing the above issues the tika-bundle was started successfully. 
> However, when extracting EXIF metadata from a jpeg image I got the following 
> exception.
> {code}
> java.lang.NoClassDefFoundError: com/adobe/xmp/XMPException
>   at 
> com.drew.imaging.jpeg.JpegMetadataReader.extractMetadataFromJpegSegmentReader(JpegMetadataReader.java:112)
>   at 
> com.drew.imaging.jpeg.JpegMetadataReader.readMetadata(JpegMetadataReader.java:71)
>   at 
> org.apache.tika.parser.image.ImageMetadataExtractor.parseJpeg(ImageMetadataExtractor.java:91)
>   at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
>   [..]
> {code}
> Embedding xmpcore in the tika-bundle solved this iss

[jira] [Resolved] (TIKA-1279) Missing return lines at output of SourceCodeParser

2014-04-24 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-1279.


Resolution: Fixed

Fixed at r1589687

> Missing return lines at output of SourceCodeParser
> --
>
> Key: TIKA-1279
> URL: https://issues.apache.org/jira/browse/TIKA-1279
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.5
>Reporter: Hong-Thai Nguyen
>Priority: Trivial
> Fix For: 1.6
>
>
> xhtml output is on a single line.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (TIKA-1224) Adding Source code (Java, Groovy, C) parser

2014-04-24 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13979614#comment-13979614
 ] 

Hong-Thai Nguyen commented on TIKA-1224:


Thanks [~ben.12] for the feedback.
For the line-return problem in the output, I created a new issue: TIKA-1279.
For the -t option in TikaCLI, the MIME type of a Java file is ambiguous. It could 
be text/plain (in which case TxtParser is used and returns the original text as 
is) or x-java-source (in which case SourceCodeParser is used).

For the -h option, the output normally looks something like:
{code}
Author: Hong-Thai.Nguyen
Content-Encoding: windows-1252
Content-Length: 4899
Content-Type: text/x-java-source
LoC: 133
creator: Hong-Thai.Nguyen
dc:creator: Hong-Thai.Nguyen
meta:author: Hong-Thai.Nguyen
resourceName: SourceCodeParser.java
{code}
The creator comes from the 'author' annotation in the javadoc.
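
As a rough illustration of where the `creator` and `LoC` values above could come from, the sketch below pulls the first `@author` javadoc tag out of a source string and counts its lines. It uses only the JDK (Java 11 or later for `String.lines()`) and is not the SourceCodeParser implementation; the class name `SourceMetadataSketch`, both method names, and the choice to count blank lines as LoC are assumptions made for the example.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: illustrates the idea only.
final class SourceMetadataSketch {

    // Matches "@author <name>" up to the end of the line.
    private static final Pattern AUTHOR =
            Pattern.compile("@author\\s+([^\\r\\n]+)");

    // Returns the first @author value found in the source text, if any.
    static Optional<String> firstAuthor(String source) {
        Matcher m = AUTHOR.matcher(source);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    // Counts lines, blank ones included (an assumption about what "LoC" means).
    static int countLines(String source) {
        return (int) source.lines().count();
    }

    public static void main(String[] args) {
        String src = "/**\n * Demo.\n * @author Hong-Thai.Nguyen\n */\nclass Demo {}\n";
        System.out.println("creator: " + firstAuthor(src).orElse("(none)"));
        System.out.println("LoC: " + countLines(src));
    }
}
```

A dedicated per-language parser would replace the regular expression with real syntax analysis; the regex is only enough for the single `@author` case discussed here.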

This parser is quite generic (quick and dirty, as mentioned by [~kkrugler]) and 
simplistic. We could write a more dedicated Java source parser that extracts more 
metadata (members, attributes...). If you are interested in this kind of parser, please 
create a new issue; an investigation of this work would be warmly welcome.

Regards,

> Adding Source code (Java, Groovy, C) parser
> ---
>
> Key: TIKA-1224
> URL: https://issues.apache.org/jira/browse/TIKA-1224
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.5
>Reporter: Hong-Thai Nguyen
>Priority: Minor
>
> We can parse some source code file formats:
> text/x-java-source
> text/x-groovy
> text/x-c
> For HTML rendering of the code, we can use jhighlight: 
> http://www.ohloh.net/p/jhighlight





[jira] [Created] (TIKA-1279) Missing return lines at output of SourceCodeParser

2014-04-24 Thread Hong-Thai Nguyen (JIRA)
Hong-Thai Nguyen created TIKA-1279:
--

 Summary: Missing return lines at output of SourceCodeParser
 Key: TIKA-1279
 URL: https://issues.apache.org/jira/browse/TIKA-1279
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.5
Reporter: Hong-Thai Nguyen
Priority: Trivial
 Fix For: 1.6


xhtml output is on a single line.





[jira] [Updated] (TIKA-623) Add support for Outlook PST

2014-04-04 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-623:
--

Assignee: (was: Hong-Thai Nguyen)

> Add support for Outlook PST
> ---
>
> Key: TIKA-623
> URL: https://issues.apache.org/jira/browse/TIKA-623
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Tran Nam Quang
> Fix For: 1.6
>
> Attachments: OutlookPSTParser.java
>
>
> Hello everyone,
> As you might know, Outlook stores its mails and other stuff in a single PST 
> file. There's a relatively new Java library called java-libpst for reading 
> Outlook PST files. It is licensed under the LGPL and available over here: 
> http://code.google.com/p/java-libpst/
> I have tested the library on Outlook 2000 and Outlook 2003, with good 
> results. It would be great if the library could be integrated into Tika.
> Best regards
> Tran Nam Quang





[jira] [Resolved] (TIKA-623) Add support for Outlook PST

2014-04-04 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-623.
---

Resolution: Fixed

Improvement: extract each mail as attachment document. Recursion down to 
folders, subfolders and also attachments inside mail.
Committed at r1584574
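
The recursion described above (folders, subfolders, mails, and the attachments inside each mail) can be sketched roughly as follows. This is a minimal illustration over a hypothetical in-memory model; `Folder`, `Mail`, and `walk` are invented for the example and are neither java-libpst classes nor the code committed to Tika.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model; java-libpst's real classes are not used here.
final class PstWalkSketch {

    static final class Mail {
        final String subject;
        final List<String> attachments;

        Mail(String subject, List<String> attachments) {
            this.subject = subject;
            this.attachments = attachments;
        }
    }

    static final class Folder {
        final String name;
        final List<Folder> subFolders = new ArrayList<>();
        final List<Mail> mails = new ArrayList<>();

        Folder(String name) {
            this.name = name;
        }
    }

    // Depth-first walk: every mail is emitted as its own embedded document,
    // followed by the attachments it carries, then subfolders are visited.
    static void walk(Folder folder, String path, List<String> out) {
        String here = path + "/" + folder.name;
        for (Mail mail : folder.mails) {
            out.add(here + "/" + mail.subject);
            for (String attachment : mail.attachments) {
                out.add(here + "/" + mail.subject + "/" + attachment);
            }
        }
        for (Folder sub : folder.subFolders) {
            walk(sub, here, out);
        }
    }
}
```

In a real parser the `out.add(...)` calls would be replaced by handing each mail or attachment to an embedded-document extractor; the path strings here only make the traversal order visible.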

> Add support for Outlook PST
> ---
>
> Key: TIKA-623
> URL: https://issues.apache.org/jira/browse/TIKA-623
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Tran Nam Quang
>Assignee: Hong-Thai Nguyen
> Fix For: 1.6
>
> Attachments: OutlookPSTParser.java
>
>
> Hello everyone,
> As you might know, Outlook stores its mails and other stuff in a single PST 
> file. There's a relatively new Java library called java-libpst for reading 
> Outlook PST files. It is licensed under the LGPL and available over here: 
> http://code.google.com/p/java-libpst/
> I have tested the library on Outlook 2000 and Outlook 2003, with good 
> results. It would be great if the library could be integrated into Tika.
> Best regards
> Tran Nam Quang





[jira] [Resolved] (TIKA-1244) Better parsing of Mbox files

2014-03-31 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-1244.


   Resolution: Fixed
Fix Version/s: 1.6

Committed in r1583305, thanks [~lfcnassif].
I kept the metadata extraction from the current MboxParser because the message/rfc822 
parser does not seem to extract all header fields.

> Better parsing of Mbox files
> 
>
> Key: TIKA-1244
> URL: https://issues.apache.org/jira/browse/TIKA-1244
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.5
>Reporter: Luis Filipe Nassif
>Assignee: Hong-Thai Nguyen
> Fix For: 1.6
>
> Attachments: MboxParser.java.patch
>
>
> MboxParser currently loses the metadata of all emails except the first. It does not 
> extract/parse emails, nor decode parts. It should handle embedded emails like 
> other container parsers do, so emails will be automatically parsed by 
> RFC822Parser. I will try to add a patch for this.
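
To make the idea of treating each email as an embedded document concrete, here is a minimal sketch that splits mbox text into individual messages on the conventional `From ` separator lines, so that each piece could then be handed to an RFC 822 parser. It is not the MboxParser code or the attached patch; the class name `MboxSplitSketch` is hypothetical, and the sketch deliberately ignores details such as `>From ` escaping inside message bodies.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Tika's MboxParser.
final class MboxSplitSketch {

    // Splits mbox text into one string per message.
    // The "From " separator line itself is dropped from the result.
    static List<String> split(String mbox) {
        List<String> messages = new ArrayList<>();
        StringBuilder current = null;
        for (String line : mbox.split("\\r?\\n")) {
            if (line.startsWith("From ")) {
                // A new separator closes the message collected so far.
                if (current != null) {
                    messages.add(current.toString());
                }
                current = new StringBuilder();
            } else if (current != null) {
                current.append(line).append('\n');
            }
        }
        if (current != null) {
            messages.add(current.toString());
        }
        return messages;
    }
}
```

Keeping every message as a separate unit is what allows the metadata of each email, not just the first, to be extracted.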





[jira] [Assigned] (TIKA-1244) Better parsing of Mbox files

2014-03-28 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen reassigned TIKA-1244:
--

Assignee: Hong-Thai Nguyen

> Better parsing of Mbox files
> 
>
> Key: TIKA-1244
> URL: https://issues.apache.org/jira/browse/TIKA-1244
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.5
>Reporter: Luis Filipe Nassif
>Assignee: Hong-Thai Nguyen
> Attachments: MboxParser.java.patch
>
>
> MboxParser currently loses the metadata of all emails except the first. It does not 
> extract/parse emails, nor decode parts. It should handle embedded emails like 
> other container parsers do, so emails will be automatically parsed by 
> RFC822Parser. I will try to add a patch for this.





[jira] [Commented] (TIKA-1244) Better parsing of Mbox files

2014-03-21 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13942965#comment-13942965
 ] 

Hong-Thai Nguyen commented on TIKA-1244:


+1 from me too; I had the same intention of redoing this parser while working on PST. 
I'll have some time next week and hope to have a look at your patch. Thanks

> Better parsing of Mbox files
> 
>
> Key: TIKA-1244
> URL: https://issues.apache.org/jira/browse/TIKA-1244
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.5
>Reporter: Luis Filipe Nassif
> Attachments: MboxParser.java.patch
>
>
> MboxParser currently loses the metadata of all emails except the first. It does not 
> extract/parse emails, nor decode parts. It should handle embedded emails like 
> other container parsers do, so emails will be automatically parsed by 
> RFC822Parser. I will try to add a patch for this.





[jira] [Reopened] (TIKA-623) Add support for Outlook PST

2014-03-07 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen reopened TIKA-623:
---

  Assignee: Tim Allison  (was: Hong-Thai Nguyen)

> Add support for Outlook PST
> ---
>
> Key: TIKA-623
> URL: https://issues.apache.org/jira/browse/TIKA-623
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Tran Nam Quang
>Assignee: Tim Allison
> Fix For: 1.6
>
> Attachments: OutlookPSTParser.java
>
>
> Hello everyone,
> As you might know, Outlook stores its mails and other stuff in a single PST 
> file. There's a relatively new Java library called java-libpst for reading 
> Outlook PST files. It is licensed under the LGPL and available over here: 
> http://code.google.com/p/java-libpst/
> I have tested the library on Outlook 2000 and Outlook 2003, with good 
> results. It would be great if the library could be integrated into Tika.
> Best regards
> Tran Nam Quang





[jira] [Commented] (TIKA-623) Add support for Outlook PST

2014-03-07 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13923703#comment-13923703
 ] 

Hong-Thai Nguyen commented on TIKA-623:
---

[~lfcnassif], binary attachments are handled with embeddedExtractor. BTW, I agree 
that we can split each mail into a separate unit.
[~talli...@apache.org], we couldn't fix .pst and .msg (msg is already handled 
as part of OfficeParser); feel free to finish this issue properly as you 
can :)

> Add support for Outlook PST
> ---
>
> Key: TIKA-623
> URL: https://issues.apache.org/jira/browse/TIKA-623
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Tran Nam Quang
>Assignee: Hong-Thai Nguyen
> Fix For: 1.6
>
> Attachments: OutlookPSTParser.java
>
>
> Hello everyone,
> As you might know, Outlook stores its mails and other stuff in a single PST 
> file. There's a relatively new Java library called java-libpst for reading 
> Outlook PST files. It is licensed under the LGPL and available over here: 
> http://code.google.com/p/java-libpst/
> I have tested the library on Outlook 2000 and Outlook 2003, with good 
> results. It would be great if the library could be integrated into Tika.
> Best regards
> Tran Nam Quang





[jira] [Updated] (TIKA-1257) MS Word Filter out control characters on ouput

2014-03-06 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1257:
---

Attachment: testControlCharacters.doc

> MS Word Filter out control characters on ouput
> --
>
> Key: TIKA-1257
> URL: https://issues.apache.org/jira/browse/TIKA-1257
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.4
>Reporter: Hong-Thai Nguyen
> Fix For: 1.6
>
> Attachments: testControlCharacters.doc, tika-doc-control-char.png
>
>
> Control characters occur mostly in the table of index and cannot be displayed. We 
> should filter them out.





[jira] [Updated] (TIKA-1257) MS Word Filter out control characters on ouput

2014-03-06 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1257:
---

Attachment: (was: 5f01ae23-9e6e-4faa-808a-f78dbb20cc71.doc)

> MS Word Filter out control characters on ouput
> --
>
> Key: TIKA-1257
> URL: https://issues.apache.org/jira/browse/TIKA-1257
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.4
>Reporter: Hong-Thai Nguyen
> Fix For: 1.6
>
> Attachments: tika-doc-control-char.png
>
>
> Control characters occur mostly in the table of index and cannot be displayed. We 
> should filter them out.





[jira] [Comment Edited] (TIKA-1257) MS Word Filter out control characters on ouput

2014-03-06 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922490#comment-13922490
 ] 

Hong-Thai Nguyen edited comment on TIKA-1257 at 3/6/14 1:50 PM:


Fixed on r1574874 & r1574877


was (Author: thaichat04):
Fixed on r1574874

> MS Word Filter out control characters on ouput
> --
>
> Key: TIKA-1257
> URL: https://issues.apache.org/jira/browse/TIKA-1257
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.4
>Reporter: Hong-Thai Nguyen
> Fix For: 1.6
>
> Attachments: 5f01ae23-9e6e-4faa-808a-f78dbb20cc71.doc, 
> tika-doc-control-char.png
>
>
> Control characters occur mostly in the table of index and cannot be displayed. We 
> should filter them out.





[jira] [Resolved] (TIKA-1257) MS Word Filter out control characters on ouput

2014-03-06 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-1257.


Resolution: Fixed

Fixed on r1574874

> MS Word Filter out control characters on ouput
> --
>
> Key: TIKA-1257
> URL: https://issues.apache.org/jira/browse/TIKA-1257
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.4
>Reporter: Hong-Thai Nguyen
> Fix For: 1.6
>
> Attachments: 5f01ae23-9e6e-4faa-808a-f78dbb20cc71.doc, 
> tika-doc-control-char.png
>
>
> Control characters occur mostly in the table of index and cannot be displayed. We 
> should filter them out.





[jira] [Updated] (TIKA-1257) MS Word Filter out control characters on ouput

2014-03-06 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1257:
---

Attachment: tika-doc-control-char.png
5f01ae23-9e6e-4faa-808a-f78dbb20cc71.doc

> MS Word Filter out control characters on ouput
> --
>
> Key: TIKA-1257
> URL: https://issues.apache.org/jira/browse/TIKA-1257
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.4
>Reporter: Hong-Thai Nguyen
> Fix For: 1.6
>
> Attachments: 5f01ae23-9e6e-4faa-808a-f78dbb20cc71.doc, 
> tika-doc-control-char.png
>
>
> Control characters occur mostly in the table of index and cannot be displayed. We 
> should filter them out.





[jira] [Created] (TIKA-1257) MS Word Filter out control characters on ouput

2014-03-06 Thread Hong-Thai Nguyen (JIRA)
Hong-Thai Nguyen created TIKA-1257:
--

 Summary: MS Word Filter out control characters on ouput
 Key: TIKA-1257
 URL: https://issues.apache.org/jira/browse/TIKA-1257
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Hong-Thai Nguyen
 Fix For: 1.6


Control characters occur mostly in the table of index and cannot be displayed. We 
should filter them out.
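
One way such filtering could look is sketched below: it drops ISO control characters from extracted text while keeping tab, line feed, and carriage return. This is an assumption-laden illustration using only JDK calls (`Character.isISOControl`, `String.codePoints`); the class name `ControlCharFilterSketch` is hypothetical, and the fix actually committed to Tika may select characters differently.

```java
// Hypothetical sketch, not the change committed to Tika.
final class ControlCharFilterSketch {

    // Removes ISO control characters but keeps tab, line feed and carriage return.
    static String strip(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints()
            .filter(cp -> !Character.isISOControl(cp)
                    || cp == '\t' || cp == '\n' || cp == '\r')
            .forEach(out::appendCodePoint);
        return out.toString();
    }
}
```

Whitespace controls are kept on purpose so that paragraph and column structure in the extracted text survives the filter.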





[jira] [Resolved] (TIKA-623) Add support for Outlook PST

2014-03-05 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-623.
---

Resolution: Fixed

Committed in r1574411

> Add support for Outlook PST
> ---
>
> Key: TIKA-623
> URL: https://issues.apache.org/jira/browse/TIKA-623
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Tran Nam Quang
>Assignee: Hong-Thai Nguyen
> Fix For: 1.6
>
> Attachments: OutlookPSTParser.java
>
>
> Hello everyone,
> As you might know, Outlook stores its mails and other stuff in a single PST 
> file. There's a relatively new Java library called java-libpst for reading 
> Outlook PST files. It is licensed under the LGPL and available over here: 
> http://code.google.com/p/java-libpst/
> I have tested the library on Outlook 2000 and Outlook 2003, with good 
> results. It would be great if the library could be integrated into Tika.
> Best regards
> Tran Nam Quang





[jira] [Comment Edited] (TIKA-623) Add support for Outlook PST

2014-03-05 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13920692#comment-13920692
 ] 

Hong-Thai Nguyen edited comment on TIKA-623 at 3/5/14 9:30 AM:
---

java-libpst-0.7 has been uploaded to oss sonatype nexus: 
https://issues.sonatype.org/browse/OSSRH-8965
If there's no objection, I'll refactor the attached parser and provide output like this:
{code}
http://www.w3.org/1999/xhtml";>








Début du fichier de données Outlook

<530d9cac.5080...@gmail.com>







mail content


Éléments supprimés



Racine (pour la recherche)


SPAM Search Folder 2



{code}


was (Author: thaichat04):
java-libpst-0.7 has been uploaded to oss sonatype nexus. If there's no 
objection, I'll refactor the attached parser and provide output like this:
{code}
http://www.w3.org/1999/xhtml";>








Début du fichier de données Outlook

<530d9cac.5080...@gmail.com>







mail content


Éléments supprimés



Racine (pour la recherche)


SPAM Search Folder 2



{code}

> Add support for Outlook PST
> ---
>
> Key: TIKA-623
> URL: https://issues.apache.org/jira/browse/TIKA-623
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Tran Nam Quang
>Assignee: Hong-Thai Nguyen
> Fix For: 1.6
>
> Attachments: OutlookPSTParser.java
>
>
> Hello everyone,
> As you might know, Outlook stores its mails and other stuff in a single PST 
> file. There's a relatively new Java library called java-libpst for reading 
> Outlook PST files. It is licensed under the LGPL and available over here: 
> http://code.google.com/p/java-libpst/
> I have tested the library on Outlook 2000 and Outlook 2003, with good 
> results. It would be great if the library could be integrated into Tika.
> Best regards
> Tran Nam Quang





[jira] [Assigned] (TIKA-623) Add support for Outlook PST

2014-03-05 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen reassigned TIKA-623:
-

Assignee: Hong-Thai Nguyen

> Add support for Outlook PST
> ---
>
> Key: TIKA-623
> URL: https://issues.apache.org/jira/browse/TIKA-623
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Tran Nam Quang
>Assignee: Hong-Thai Nguyen
> Fix For: 1.6
>
> Attachments: OutlookPSTParser.java
>
>
> Hello everyone,
> As you might know, Outlook stores its mails and other stuff in a single PST 
> file. There's a relatively new Java library called java-libpst for reading 
> Outlook PST files. It is licensed under the LGPL and available over here: 
> http://code.google.com/p/java-libpst/
> I have tested the library on Outlook 2000 and Outlook 2003, with good 
> results. It would be great if the library could be integrated into Tika.
> Best regards
> Tran Nam Quang





[jira] [Commented] (TIKA-623) Add support for Outlook PST

2014-03-05 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13920692#comment-13920692
 ] 

Hong-Thai Nguyen commented on TIKA-623:
---

java-libpst-0.7 has been uploaded to oss sonatype nexus. If there's no 
objection, I'll refactor the attached parser and provide output like this:
{code}
http://www.w3.org/1999/xhtml";>








Début du fichier de données Outlook

<530d9cac.5080...@gmail.com>







mail content


Éléments supprimés



Racine (pour la recherche)


SPAM Search Folder 2



{code}

> Add support for Outlook PST
> ---
>
> Key: TIKA-623
> URL: https://issues.apache.org/jira/browse/TIKA-623
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Tran Nam Quang
> Fix For: 1.6
>
> Attachments: OutlookPSTParser.java
>
>
> Hello everyone,
> As you might know, Outlook stores its mails and other stuff in a single PST 
> file. There's a relatively new Java library called java-libpst for reading 
> Outlook PST files. It is licensed under the LGPL and available over here: 
> http://code.google.com/p/java-libpst/
> I have tested the library on Outlook 2000 and Outlook 2003, with good 
> results. It would be great if the library could be integrated into Tika.
> Best regards
> Tran Nam Quang



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (TIKA-623) Add support for Outlook PST

2014-03-05 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-623:
--

Fix Version/s: 1.6

> Add support for Outlook PST
> ---
>
> Key: TIKA-623
> URL: https://issues.apache.org/jira/browse/TIKA-623
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Tran Nam Quang
> Fix For: 1.6
>
> Attachments: OutlookPSTParser.java
>
>
> Hello everyone,
> As you might know, Outlook stores its mails and other stuff in a single PST 
> file. There's a relatively new Java library called java-libpst for reading 
> Outlook PST files. It is licensed under the LGPL and available over here: 
> http://code.google.com/p/java-libpst/
> I have tested the library on Outlook 2000 and Outlook 2003, with good 
> results. It would be great if the library could be integrated into Tika.
> Best regards
> Tran Nam Quang





[jira] [Assigned] (TIKA-1223) Extract thumbnail of OOXML Office files

2014-02-17 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen reassigned TIKA-1223:
--

Assignee: (was: Hong-Thai Nguyen)

> Extract thumbnail of OOXML Office files
> ---
>
> Key: TIKA-1223
> URL: https://issues.apache.org/jira/browse/TIKA-1223
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.4
>Reporter: Hong-Thai Nguyen
>Priority: Minor
> Fix For: 1.6
>
> Attachments: TIKA-1223.patch
>
>
> Starting with the Microsoft Office 2007 file formats, a thumbnail can be included in 
> the package. We can extract this embedded thumbnail for OOXML files.
> As discussed on the mailing list, we should extract the thumbnail as an attachment, 
> not as metadata (TIKA-90).
> {noformat}
> embeddedRelationId format is thumbnail_{i}.{extension}.
> {noformat}
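
The naming pattern quoted above can be spelled out as a tiny helper. Only the `thumbnail_{i}.{extension}` pattern comes from the issue text; the class and method names are hypothetical, and whether the index starts at 0 or 1 is not stated in the issue, so the caller chooses it here.

```java
// Hypothetical helper; only the "thumbnail_{i}.{extension}" pattern comes from the issue text.
final class ThumbnailNameSketch {

    // Builds a name such as thumbnail_0.jpeg from an index and a file extension.
    static String relationId(int index, String extension) {
        return "thumbnail_" + index + "." + extension;
    }
}
```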



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Resolved] (TIKA-1223) Extract thumbnail of OOXML Office files

2014-02-17 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-1223.


Resolution: Fixed

r1568954

> Extract thumbnail of OOXML Office files
> ---
>
> Key: TIKA-1223
> URL: https://issues.apache.org/jira/browse/TIKA-1223
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.4
>Reporter: Hong-Thai Nguyen
>Assignee: Hong-Thai Nguyen
>Priority: Minor
> Fix For: 1.6
>
> Attachments: TIKA-1223.patch
>
>
> Starting with the Microsoft Office 2007 file formats, a thumbnail can be included in 
> the package. We can extract this embedded thumbnail for OOXML files.
> As discussed on the mailing list, we should extract the thumbnail as an attachment, 
> not as metadata (TIKA-90).
> {noformat}
> embeddedRelationId format is thumbnail_{i}.{extension}.
> {noformat}





[jira] [Assigned] (TIKA-1223) Extract thumbnail of OOXML Office files

2014-02-17 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen reassigned TIKA-1223:
--

Assignee: Hong-Thai Nguyen

> Extract thumbnail of OOXML Office files
> ---
>
> Key: TIKA-1223
> URL: https://issues.apache.org/jira/browse/TIKA-1223
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.4
>Reporter: Hong-Thai Nguyen
>Assignee: Hong-Thai Nguyen
>Priority: Minor
> Fix For: 1.6
>
> Attachments: TIKA-1223.patch
>
>
> Starting with the Microsoft Office 2007 file formats, a thumbnail can be included in 
> the package. We can extract this embedded thumbnail for OOXML files.
> As discussed on the mailing list, we should extract the thumbnail as an attachment, 
> not as metadata (TIKA-90).
> {noformat}
> embeddedRelationId format is thumbnail_{i}.{extension}.
> {noformat}





[jira] [Resolved] (TIKA-1089) Tika conversion failed on following documents

2014-02-17 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-1089.


   Resolution: Invalid
Fix Version/s: 1.5
 Assignee: Hong-Thai Nguyen

A separate issue should be created for each file, so that they can be investigated and resolved one by one.

> Tika conversion failed on following documents
> -
>
> Key: TIKA-1089
> URL: https://issues.apache.org/jira/browse/TIKA-1089
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.3
> Environment: windows, api
>Reporter: Hong-Thai Nguyen
>Assignee: Hong-Thai Nguyen
>  Labels: test
> Fix For: 1.5
>
> Attachments: crawler.log
>
>
> We are using Tika as our main converter of diverse file formats to text and HTML 
> in a search engine.
> We've collected some documents (46) which Tika cannot convert: 
> http://www.mediafire.com/?60clr812lerx3gy





[jira] [Resolved] (TIKA-1224) Adding Source code (Java, Groovy, C) parser

2014-02-03 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-1224.


Resolution: Fixed

> Adding Source code (Java, Groovy, C) parser
> ---
>
> Key: TIKA-1224
> URL: https://issues.apache.org/jira/browse/TIKA-1224
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.5
>Reporter: Hong-Thai Nguyen
>Priority: Minor
>
> We can parse some source code file formats:
> text/x-java-source
> text/x-groovy
> text/x-c
> For HTML rendering of the code, we can use jhighlight: 
> http://www.ohloh.net/p/jhighlight





[jira] [Commented] (TIKA-1224) Adding Source code (Java, Groovy, C) parser

2014-02-03 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889491#comment-13889491
 ] 

Hong-Thai Nguyen commented on TIKA-1224:


Committed in 1563902

> Adding Source code (Java, Groovy, C) parser
> ---
>
> Key: TIKA-1224
> URL: https://issues.apache.org/jira/browse/TIKA-1224
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.5
>Reporter: Hong-Thai Nguyen
>Priority: Minor
>
> We can parse some source code file formats:
> text/x-java-source
> text/x-groovy
> text/x-c
> For HTML rendering of the code, we can use jhighlight: 
> http://www.ohloh.net/p/jhighlight





[jira] [Commented] (TIKA-1224) Adding Source code (Java, Groovy, C) parser

2014-01-21 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13877343#comment-13877343
 ] 

Hong-Thai Nguyen commented on TIKA-1224:


I agree that parsing each language in depth is not simple. This work (already 
done) just provides an HTML rendering of the source and extracts some 
metadata (such as author and version) from javadoc comments, plus other 
potentially interesting values such as LoC. When we need more detailed results 
for a language, we must implement a dedicated parser.
This parser is useful in search applications.

> Adding Source code (Java, Groovy, C) parser
> ---
>
> Key: TIKA-1224
> URL: https://issues.apache.org/jira/browse/TIKA-1224
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.5
>Reporter: Hong-Thai Nguyen
>Priority: Minor
>
> We can parse some source code file formats:
> text/x-java-source
> text/x-groovy
> text/x-c
> For HTML rendering of the code, we can use jhighlight: 
> http://www.ohloh.net/p/jhighlight





[jira] [Created] (TIKA-1224) Adding Source code (Java, Groovy, C) parser

2014-01-20 Thread Hong-Thai Nguyen (JIRA)
Hong-Thai Nguyen created TIKA-1224:
--

 Summary: Adding Source code (Java, Groovy, C) parser
 Key: TIKA-1224
 URL: https://issues.apache.org/jira/browse/TIKA-1224
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.5
Reporter: Hong-Thai Nguyen
Priority: Minor


We can parse some source code file formats:
text/x-java-source
text/x-groovy
text/x-c


For HTML rendering of the code, we can use jhighlight: 
http://www.ohloh.net/p/jhighlight





[jira] [Updated] (TIKA-1223) Extract thumbnail of OOXML Office files

2014-01-17 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1223:
---

Description: 
Starting with the Microsoft Office 2007 file formats, a thumbnail can be included in 
the package. We can extract this embedded thumbnail for OOXML files.

As discussed on the mailing list, we should extract the thumbnail as an attachment, not 
as metadata (TIKA-90).
{noformat}
embeddedRelationId format is thumbnail_{i}.{extension}.
{noformat}

  was:
Starting with the Microsoft Office 2007 file formats, a thumbnail can be included in 
the package. We can extract this embedded thumbnail for OOXML files.

As discussed on the mailing list, we should extract the thumbnail as an attachment, not 
as metadata (TIKA-90).

embeddedRelationId format is thumbnail_{i}.{extension}.


> Extract thumbnail of OOXML Office files
> ---
>
> Key: TIKA-1223
> URL: https://issues.apache.org/jira/browse/TIKA-1223
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.4
>Reporter: Hong-Thai Nguyen
>Priority: Minor
> Fix For: 1.5
>
> Attachments: TIKA-1223.patch
>
>
> Starting with the Microsoft Office 2007 file formats, a thumbnail can be included in 
> the package. We can extract this embedded thumbnail for OOXML files.
> As discussed on the mailing list, we should extract the thumbnail as an attachment, 
> not as metadata (TIKA-90).
> {noformat}
> embeddedRelationId format is thumbnail_{i}.{extension}.
> {noformat}





[jira] [Updated] (TIKA-1223) Extract thumbnail of OOXML Office files

2014-01-17 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1223:
---

Attachment: TIKA-1223.patch






[jira] [Created] (TIKA-1223) Extract thumbnail of OOXML Office files

2014-01-17 Thread Hong-Thai Nguyen (JIRA)
Hong-Thai Nguyen created TIKA-1223:
--

 Summary: Extract thumbnail of OOXML Office files
 Key: TIKA-1223
 URL: https://issues.apache.org/jira/browse/TIKA-1223
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.4
Reporter: Hong-Thai Nguyen
Priority: Minor
 Fix For: 1.5


Starting with the Microsoft Office 2007 file formats, a thumbnail can be 
included in the package. We can extract this embedded thumbnail for OOXML files.

As discussed on the mailing list, we should extract the thumbnail as an 
attachment, not as metadata (TIKA-90).

embeddedRelationId format is thumbnail_{i}.{extension}.
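
To make the naming scheme concrete, here is a minimal sketch that does not use Tika at all. It treats the OOXML file as a plain ZIP package and assumes the thumbnail part is stored as docProps/thumbnail.<extension>, which is common but not guaranteed (a robust implementation would follow the package relationships instead). The class name, the method names, and the choice to start the index {i} at 0 are all hypothetical and are not taken from the attached patch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class OoxmlThumbnailSketch {

    // Assumption: thumbnail parts are stored as docProps/thumbnail.<extension>.
    private static final String THUMBNAIL_PREFIX = "docProps/thumbnail.";

    /** Returns thumbnail bytes keyed by a name of the form thumbnail_{i}.{extension}. */
    static Map<String, byte[]> extractThumbnails(InputStream ooxml) throws IOException {
        Map<String, byte[]> thumbnails = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(ooxml)) {
            ZipEntry entry;
            int index = 0; // assumption: numbering starts at 0
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (!entry.isDirectory() && name.startsWith(THUMBNAIL_PREFIX)) {
                    String extension = name.substring(THUMBNAIL_PREFIX.length());
                    thumbnails.put("thumbnail_" + index++ + "." + extension, readAll(zip));
                }
            }
        }
        return thumbnails;
    }

    /** Reads the current ZIP entry to its end. */
    private static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    /** Builds a tiny in-memory ZIP that mimics an OOXML package, for demonstration only. */
    static byte[] samplePackage() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write("<w:document/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("docProps/thumbnail.jpeg"));
            zip.write(new byte[] {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF});
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        Map<String, byte[]> found =
                extractThumbnails(new ByteArrayInputStream(samplePackage()));
        for (Map.Entry<String, byte[]> e : found.entrySet()) {
            System.out.println(e.getKey() + " (" + e.getValue().length + " bytes)");
        }
    }
}
```

Inside Tika itself, the thumbnail would presumably be handed to the embedded document extractor like any other attachment rather than returned in a map; the map here only keeps the sketch self-contained.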





[jira] [Commented] (TIKA-1215) Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4

2014-01-14 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870573#comment-13870573
 ] 

Hong-Thai Nguyen commented on TIKA-1215:


Great catch. Thanks, [~jukkaz].

> Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4
> --
>
> Key: TIKA-1215
> URL: https://issues.apache.org/jira/browse/TIKA-1215
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.5
> Reporter: Hong-Thai Nguyen
> Priority: Critical
> Attachments: Centres 080805@0650 RTBF Matin Première - A propos des 
> rues de Dublin et Dubreucq.mp3, TIKA-1215-fix-prefix-namespaces.patch, 
> tika-1215-without-wildcard.patch
>
>
> With the attached file, 1.5 raises this exception on parsing. The file parses 
> without problems on 1.4.
> {code}
> ...
> Caused by: org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml not declared
>   at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:62)
>   at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getQName(ToXMLContentHandler.java:68)
>   at org.apache.tika.sax.ToXMLContentHandler.startElement(ToXMLContentHandler.java:148)
>   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
>   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
>   at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
>   at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
>   at org.apache.tika.sax.XHTMLContentHandler.element(XHTMLContentHandler.java:323)
>   at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:107)
>   at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at com.polyspot.document.converter.DocumentConverter.realizeTikaConversion(DocumentConverter.java:221)
>   ... 15 more
> {code}





[jira] [Commented] (TIKA-1215) Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4

2014-01-13 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13869590#comment-13869590
 ] 

Hong-Thai Nguyen commented on TIKA-1215:


[~talli...@apache.org], here's the XML of the input to parse:
{noformat}
http://www.w3.org/1999/xhtml";>Matin Première - Tour des régions 
080806
RTBF - La Première
Speech
101698.914
XXX - 
A propos du contrat de quartier rues Dublin/Dubreucq
{noformat}

I think this regression came from TIKA-1070
{code}
currentElement = currentElement.parent;
{code}

The parentElement of  is null, so getPrefix() raises an exception, which 
differs from the behavior in 1.4.
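
To illustrate the failure mode, here is a simplified model of a prefix lookup that walks up a chain of parent scopes. It is only a sketch: the class and field names are hypothetical and this is not the actual ToXMLContentHandler code. Only the exception message is copied from the stack trace in this issue.

```java
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.SAXException;

class PrefixLookupSketch {

    /** Hypothetical stand-in for a per-element namespace scope; not Tika's class. */
    static final class Scope {
        final Scope parent; // null for the outermost element
        final Map<String, String> prefixesByUri = new HashMap<>();

        Scope(Scope parent) {
            this.parent = parent;
        }

        /** Walks up the parent chain until some scope declares the namespace URI. */
        String getPrefix(String uri) throws SAXException {
            for (Scope scope = this; scope != null; scope = scope.parent) {
                String prefix = scope.prefixesByUri.get(uri);
                if (prefix != null) {
                    return prefix;
                }
            }
            // No scope in the chain declared the URI.
            throw new SAXException("Namespace " + uri + " not declared");
        }
    }

    public static void main(String[] args) throws SAXException {
        String xhtml = "http://www.w3.org/1999/xhtml";

        // The root scope declares XHTML as the default namespace (empty prefix).
        Scope root = new Scope(null);
        root.prefixesByUri.put(xhtml, "");
        Scope child = new Scope(root);
        System.out.println("child resolves to prefix '" + child.getPrefix(xhtml) + "'");

        // A scope with no parent and no local declaration cannot resolve the URI.
        Scope orphan = new Scope(null);
        try {
            orphan.getPrefix(xhtml);
        } catch (SAXException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this model, a lookup succeeds only while the scope that declared the namespace is still reachable through the parent chain. Once the current element has been reset to a null parent, the same lookup fails with the message seen in the report.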






[jira] [Updated] (TIKA-1215) Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4

2014-01-13 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1215:
---

Attachment: tika-1215-without-wildcard.patch

[~gagravarr], my code style differs from the Apache convention. 
Apologies for that.
I have attached a new patch file containing only the changes.

Thanks







[jira] [Commented] (TIKA-90) Allow thumbnails as document metadata

2014-01-09 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13866498#comment-13866498
 ] 

Hong-Thai Nguyen commented on TIKA-90:
--

This would be useful for Office Open XML and OpenOffice files, and for some 
other formats with embedded thumbnails.

> Allow thumbnails as document metadata
> -
>
> Key: TIKA-90
> URL: https://issues.apache.org/jira/browse/TIKA-90
> Project: Tika
> Issue Type: New Feature
> Components: general
> Reporter: Jukka Zitting
>
> It would be nice if parser components could produce thumbnail images and 
> other non-string metadata when parsing documents.
> To do this, we could either generalize the current Metadata methods, or 
> introduce new methods for handling such non-string metadata.





[jira] [Comment Edited] (TIKA-1216) parse method of Mp3Parser doesn't work for few mp3 files

2014-01-07 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13864202#comment-13864202
 ] 

Hong-Thai Nguyen edited comment on TIKA-1216 at 1/7/14 3:57 PM:


I've tested this file with a simple test case. It seems that this problem 
is identical to TIKA-1215. A patch has been submitted on that issue and is 
waiting for review and commit.

Thanks


was (Author: thaichat04):
I've test with a simple test case with this file. It seems that, this problem 
is identical with TIKA-1215.

> parse method of Mp3Parser doesn't work for few mp3 files
> 
>
> Key: TIKA-1216
> URL: https://issues.apache.org/jira/browse/TIKA-1216
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.4
> Environment: Windows 7 ultimate 32-bit OS, Java 1.7
> Reporter: Sumeet Gorab
> Priority: Blocker
> Labels: patch
> Attachments: 05 - Dharti - Sarkaaran [www.DJMaza.Com].mp3
>
>
> I tried to parse an MP3 file, but the parse method of the Mp3Parser class is 
> not able to parse it. The parse method cannot complete its execution; there 
> is some issue in that method.




