[jira] [Updated] (TIKA-2153) TaggedIOException on a valid Powerpoint file

2016-11-22 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2153:
-
Attachment: Jinwoo_032910.pptx

> TaggedIOException on a valid Powerpoint file
> 
>
> Key: TIKA-2153
> URL: https://issues.apache.org/jira/browse/TIKA-2153
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: Jinwoo_032910.pptx, tika_2153_unzipping.png
>
>
> On the following Powerpoint file, which opens fine with Powerpoint:
> https://dl.dropboxusercontent.com/u/92341073/Data%20Club%202%20March%2028.pptx
> the Tika parser throws the following error:
> org.apache.tika.io.TaggedIOException: invalid stored block lengths
>   at 
> org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:82)
>   at org.apache.tika.mime.MimeTypes.readMagicHeader(MimeTypes.java:258)
>   at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:471)
>   at 
> org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
>   at 
> org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
>   at 
> org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:298)
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
> Caused by: org.apache.tika.io.TaggedIOException: invalid stored block lengths
>   at 
> org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
>   at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>   at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>   at java.io.FilterInputStream.read(FilterInputStream.java:107)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:78)
>   ... 13 more
> Caused by: java.util.zip.ZipException: invalid stored block lengths
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>   at 
> org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream.read(ZipSecureFile.java:213)
>   at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
>   ... 19 more
> Could be similar to #2130.
> EDIT: similar exception on the attached Jinwoo_032910.pptx
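
The innermost cause in the trace above is a java.util.zip.ZipException raised by InflaterInputStream. As a minimal sketch of where that kind of error comes from (this is not the reporter's file and says nothing about what is actually wrong with it), the hand-made raw deflate stream below declares a "stored" block whose length field and its one's-complement check do not agree. The class name, method name, and byte arrays are illustrative assumptions; only the java.util.zip classes are real. On a typical JDK the bad stream is expected to fail with the same message as in the trace.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.Inflater;
import java.util.zip.InflaterInputStream;

class StoredBlockDemo {
    // Raw deflate data. First byte 0x01: final block, block type "stored".
    // A stored block then carries LEN and NLEN (the one's complement of LEN),
    // both little-endian. Here LEN = 0x0005 and NLEN = 0xFFFA, which agree.
    static final byte[] GOOD_STORED_BLOCK =
            {0x01, 0x05, 0x00, (byte) 0xFA, (byte) 0xFF, 'h', 'e', 'l', 'l', 'o'};

    // Same block, but NLEN = 0x0000, so the LEN/NLEN pair is inconsistent.
    static final byte[] BAD_STORED_BLOCK =
            {0x01, 0x05, 0x00, 0x00, 0x00, 'h', 'e', 'l', 'l', 'o'};

    /** Inflates the data fully; returns "ExceptionClass: message" on failure, else null. */
    static String inflateError(byte[] rawDeflate) {
        Inflater inflater = new Inflater(true); // true = raw deflate, no zlib header
        try (InputStream in = new InflaterInputStream(
                new ByteArrayInputStream(rawDeflate), inflater)) {
            byte[] buf = new byte[64];
            while (in.read(buf) != -1) {
                // discard the output; only the error matters here
            }
            return null;
        } catch (IOException e) {
            return e.getClass().getName() + ": " + e.getMessage();
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        System.out.println("good block: " + inflateError(GOOD_STORED_BLOCK));
        System.out.println("bad block:  " + inflateError(BAD_STORED_BLOCK));
    }
}
```

In the Tika trace the exception is then wrapped twice in TaggedIOException on its way up through the embedded-document detection path.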



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-2153) TaggedIOException on a valid Powerpoint file

2016-11-22 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2153:
-
Description: 
On the following Powerpoint file, which opens fine with Powerpoint:

https://dl.dropboxusercontent.com/u/92341073/Data%20Club%202%20March%2028.pptx

the Tika parser throws the following error:

org.apache.tika.io.TaggedIOException: invalid stored block lengths
at 
org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:82)
at org.apache.tika.mime.MimeTypes.readMagicHeader(MimeTypes.java:258)
at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:471)
at 
org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
at 
org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
at 
org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:298)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
Caused by: org.apache.tika.io.TaggedIOException: invalid stored block lengths
at 
org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:78)
... 13 more
Caused by: java.util.zip.ZipException: invalid stored block lengths
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
at 
org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream.read(ZipSecureFile.java:213)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
... 19 more




Could be similar to #2130.

EDIT: similar exception on the attached Jinwoo_032910.pptx

  was:
On the following Powerpoint file, which opens fine with Powerpoint:

https://dl.dropboxusercontent.com/u/92341073/Data%20Club%202%20March%2028.pptx

the Tika parser throws the following error:

org.apache.tika.io.TaggedIOException: invalid stored block lengths
at 
org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:82)
at org.apache.tika.mime.MimeTypes.readMagicHeader(MimeTypes.java:258)
at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:471)
at 
org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
at 
org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
at 
org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:298)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
Caused by: org.apache.tika.io.TaggedIOException: invalid stored block lengths
at 
org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
at 

[jira] [Commented] (TIKA-2180) Multiple requests on Tika to extract text slows down

2016-11-22 Thread Ashish Basran (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15687795#comment-15687795
 ] 

Ashish Basran commented on TIKA-2180:
-

I am calling Tika (http://localhost:8080/tika) using HttpClient (.NET) from the 
same box that runs tika-server. I used Task.Run to create the requests for all 
22 documents. The box has 4 CPUs.
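
For comparison runs, one option is to cap the number of requests in flight at the CPU count instead of starting all 22 at once. The sketch below only shows the client-side mechanics with a fixed thread pool; the class and method names are hypothetical, the sleeping tasks stand in for the real HTTP calls, and nothing in this thread establishes that a cap changes the timings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class BoundedRequests {
    /** Runs all requests with at most maxInFlight running at once; results keep input order. */
    static <T> List<T> runBounded(List<Callable<T>> requests, int maxInFlight) {
        ExecutorService pool = Executors.newFixedThreadPool(maxInFlight);
        try {
            List<T> results = new ArrayList<>();
            // invokeAll blocks until every task has completed
            for (Future<T> f : pool.invokeAll(requests)) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for requests", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a request failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Callable<String>> requests = new ArrayList<>();
        for (int i = 1; i <= 22; i++) {
            final int doc = i;
            requests.add(() -> {
                Thread.sleep(50); // placeholder for the HTTP round trip to tika-server
                return "document " + doc;
            });
        }
        long start = System.nanoTime();
        List<String> done = runBounded(requests, 4); // 4 = CPU count reported above
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(done.size() + " requests finished in " + elapsedMs + " ms");
    }
}
```

Timing the whole batch this way gives a wall-clock figure that can be compared directly with the sequential total.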

> Multiple requests on Tika to extract text slows down
> 
>
> Key: TIKA-2180
> URL: https://issues.apache.org/jira/browse/TIKA-2180
> Project: Tika
> Issue Type: Bug
> Components: server
> Affects Versions: 1.13, 1.14
> Environment: Windows OS, Open JDK, 4 core 32 GB RAM
> Reporter: Ashish Basran
>
> I observed that if I send multiple requests to Tika (e.g., 
> http://localhost:8080/tika) with files of around 5 MB, Tika is very slow to 
> complete the action. I tried with ~20 random files: it took 170 seconds to 
> process all the files in sequence. When I passed all files in parallel, it 
> took around 780 seconds to process the same set of files.





[jira] [Commented] (TIKA-2180) Multiple requests on Tika to extract text slows down

2016-11-22 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15687776#comment-15687776
 ] 

Tim Allison commented on TIKA-2180:
---

Thank you for this.  That isn't by design... that I'm aware of.  How many 
threads are you running and how many cpus are on the tika-server box?

> Multiple requests on Tika to extract text slows down
> 
>
> Key: TIKA-2180
> URL: https://issues.apache.org/jira/browse/TIKA-2180
> Project: Tika
> Issue Type: Bug
> Components: server
> Affects Versions: 1.13, 1.14
> Environment: Windows OS, Open JDK, 4 core 32 GB RAM
> Reporter: Ashish Basran
>
> I observed that if I send multiple requests to Tika (e.g., 
> http://localhost:8080/tika) with files of around 5 MB, Tika is very slow to 
> complete the action. I tried with ~20 random files: it took 170 seconds to 
> process all the files in sequence. When I passed all files in parallel, it 
> took around 780 seconds to process the same set of files.





[jira] [Commented] (TIKA-2180) Multiple requests on Tika to extract text slows down

2016-11-22 Thread Ashish Basran (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15687660#comment-15687660
 ] 

Ashish Basran commented on TIKA-2180:
-

I tested with Word and Excel documents, and I observed this in 1.13 too. 

I passed 22 documents to the Tika server for processing: two 5 MB documents, 
and the rest smaller than 1 MB each. Below are the processing times in seconds 
(totals at the end) when the documents are processed in parallel and when they 
are processed one after another. I am not sure whether this behavior is by 
design, but the difference in processing time is huge. 

Sequence    Parallel
77.4790976  22.6876726
0.9335904   17.9678267
0.8854624   26.0525849
5.0577852   15.5999804
0.8060567   26.6077107
0.7831427   17.7433509
0.8196296   26.7486071
0.7667276   26.7675274
0.7648827   26.8234494
0.7632169   22.8773994
0.8247712   16.9681799
0.9260035   26.9742814
79.6387803  21.0023846
0.7795755   14.0186599
0.7646085   27.0261048
0.8339278   26.0542291
0.8345049   15.0697296
0.8402716   24.0850932
0.7785933   20.1221993
0.9135003   13.1501129
0.9229104   170.2784636
0.8859913   178.3212539

178.0030304 782.9468017
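
The totals above can be rechecked with a few lines of arithmetic. The sketch below copies the two columns verbatim and prints each sum, their ratio, and the largest single Parallel entry. One caveat, stated as an assumption because the comment does not say how the figures were measured: if each Parallel figure is the elapsed time of one request, then their sum counts overlapping time more than once, and the largest entry is the nearer guide to how long the parallel batch took end to end. The class name is illustrative.

```java
import java.util.Arrays;

class TimingTotals {
    // Per-document times in seconds, copied from the table above.
    static final double[] SEQUENCE = {
        77.4790976, 0.9335904, 0.8854624, 5.0577852, 0.8060567, 0.7831427,
        0.8196296, 0.7667276, 0.7648827, 0.7632169, 0.8247712, 0.9260035,
        79.6387803, 0.7795755, 0.7646085, 0.8339278, 0.8345049, 0.8402716,
        0.7785933, 0.9135003, 0.9229104, 0.8859913
    };
    static final double[] PARALLEL = {
        22.6876726, 17.9678267, 26.0525849, 15.5999804, 26.6077107, 17.7433509,
        26.7486071, 26.7675274, 26.8234494, 22.8773994, 16.9681799, 26.9742814,
        21.0023846, 14.0186599, 27.0261048, 26.0542291, 15.0697296, 24.0850932,
        20.1221993, 13.1501129, 170.2784636, 178.3212539
    };

    static double sum(double[] xs) {
        return Arrays.stream(xs).sum();
    }

    static double max(double[] xs) {
        return Arrays.stream(xs).max().orElse(Double.NaN);
    }

    public static void main(String[] args) {
        double seq = sum(SEQUENCE);
        double par = sum(PARALLEL);
        System.out.printf("sequence total:        %.4f s%n", seq);
        System.out.printf("parallel total:        %.4f s%n", par);
        System.out.printf("ratio of the totals:   %.2f%n", par / seq);
        System.out.printf("largest parallel time: %.4f s%n", max(PARALLEL));
    }
}
```

The two sums reproduce the totals row, and the ratio between them is roughly 4.4.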


> Multiple requests on Tika to extract text slows down
> 
>
> Key: TIKA-2180
> URL: https://issues.apache.org/jira/browse/TIKA-2180
> Project: Tika
> Issue Type: Bug
> Components: server
> Affects Versions: 1.13, 1.14
> Environment: Windows OS, Open JDK, 4 core 32 GB RAM
> Reporter: Ashish Basran
>
> I observed that if I send multiple requests to Tika (e.g., 
> http://localhost:8080/tika) with files of around 5 MB, Tika is very slow to 
> complete the action. I tried with ~20 random files: it took 170 seconds to 
> process all the files in sequence. When I passed all files in parallel, it 
> took around 780 seconds to process the same set of files.





[jira] [Updated] (TIKA-2161) EOFException on a valid Powerpoint file

2016-11-22 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2161:
-
Description: 
On the attached Powerpoint file, which opens fine with Powerpoint, the Tika 
parser throws the following error:

org.apache.tika.io.TaggedIOException: Unexpected end of ZLIB input stream
at 
org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at java.nio.file.Files.copy(Files.java:2908)
at java.nio.file.Files.copy(Files.java:3027)
at org.apache.tika.io.TikaInputStream.getPath(TikaInputStream.java:587)
at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:615)
at 
org.apache.tika.parser.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:377)
at 
org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:443)
at 
org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
at 
org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
at 
org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
at 
org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:140)
at 
org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:116)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedResources(HSLFExtractor.java:368)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:138)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.io.EOFException: Unexpected end of ZLIB input stream
at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:240)
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
at 
org.apache.poi.util.BoundedInputStream.read(BoundedInputStream.java:121)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
... 22 more

EDIT: Tika 1.14 throws EOFException

  was:
On the attached Powerpoint file, which opens fine with Powerpoint, the Tika 
parser throws the following error:

org.apache.tika.io.TaggedIOException: Unexpected end of ZLIB input stream
at 
org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at java.nio.file.Files.copy(Files.java:2908)
at java.nio.file.Files.copy(Files.java:3027)
at org.apache.tika.io.TikaInputStream.getPath(TikaInputStream.java:587)
at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:615)
at 
org.apache.tika.parser.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:377)
at 
org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:443)
at 
org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
at 
org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
at 
org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
at 
org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:140)
at 
org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:116)
at 
org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedResources(HSLFExtractor.java:368)
at 

[jira] [Updated] (TIKA-2161) EOFException on a valid Powerpoint file

2016-11-22 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2161:
-
Summary: EOFException on a valid Powerpoint file  (was: TaggedIOException 
from EOFException on a valid Powerpoint file)

> EOFException on a valid Powerpoint file
> ---
>
> Key: TIKA-2161
> URL: https://issues.apache.org/jira/browse/TIKA-2161
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: Erik-LymeChipBranchSeminar.ppt
>
>
> On the attached Powerpoint file, which opens fine with Powerpoint, the Tika 
> parser throws the following error:
> org.apache.tika.io.TaggedIOException: Unexpected end of ZLIB input stream
>   at 
> org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
>   at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>   at java.io.FilterInputStream.read(FilterInputStream.java:107)
>   at java.nio.file.Files.copy(Files.java:2908)
>   at java.nio.file.Files.copy(Files.java:3027)
>   at org.apache.tika.io.TikaInputStream.getPath(TikaInputStream.java:587)
>   at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:615)
>   at 
> org.apache.tika.parser.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:377)
>   at 
> org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:443)
>   at 
> org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
>   at 
> org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
>   at 
> org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
>   at 
> org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:140)
>   at 
> org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:116)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedResources(HSLFExtractor.java:368)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:138)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> Caused by: java.io.EOFException: Unexpected end of ZLIB input stream
>   at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:240)
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
>   at 
> org.apache.poi.util.BoundedInputStream.read(BoundedInputStream.java:121)
>   at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>   at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
>   ... 22 more
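
The root cause in the trace above is the EOFException that InflaterInputStream raises when its underlying stream ends before the compressed data is complete. A minimal sketch of that behaviour, using only java.util.zip and a synthetic payload (not the attached file, and without any claim about why the embedded stream in that file appears short), is shown below. The class and method names are illustrative assumptions.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

class TruncatedZlibDemo {
    /** Compresses data into a complete zlib stream. */
    static byte[] compress(byte[] data) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(bos)) {
            out.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    /** Returns true if reading the stream to its end fails with EOFException. */
    static boolean failsWithEof(byte[] zlibData) {
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(zlibData))) {
            byte[] buf = new byte[256];
            while (in.read(buf) != -1) {
                // discard the output; only the end-of-stream behaviour matters
            }
            return false;
        } catch (EOFException e) {
            return true; // input ran out before the compressed stream was finished
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] payload = new byte[4096];
        for (int i = 0; i < payload.length; i++) {
            payload[i] = (byte) (i * 31 + i / 7); // arbitrary synthetic content
        }
        byte[] full = compress(payload);
        byte[] cut = Arrays.copyOf(full, full.length / 2); // drop the second half

        System.out.println("complete stream fails with EOF:  " + failsWithEof(full));
        System.out.println("truncated stream fails with EOF: " + failsWithEof(cut));
    }
}
```

In the Tika trace the same exception surfaces while TikaInputStream copies the embedded resource to a temporary file for container detection.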





[jira] [Updated] (TIKA-2182) Investigate rare IllegalArgumentException in macro extraction

2016-11-22 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2182:
--
Description: poi bug 
[60279|https://bz.apache.org/bugzilla/show_bug.cgi?id=60279]  (was: poi bug 
60279)

> Investigate rare IllegalArgumentException in macro extraction
> -
>
> Key: TIKA-2182
> URL: https://issues.apache.org/jira/browse/TIKA-2182
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Trivial
>
> poi bug [60279|https://bz.apache.org/bugzilla/show_bug.cgi?id=60279]





[jira] [Resolved] (TIKA-2118) Misleading exception on a password protected XLS

2016-11-22 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2118.
---
   Resolution: Fixed
Fix Version/s: 1.15, 2.0

> Misleading exception on a password protected XLS
> 
>
> Key: TIKA-2118
> URL: https://issues.apache.org/jira/browse/TIKA-2118
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Fix For: 2.0, 1.15
>
> Attachments: BUSJDRVGZF7FKDA6L4PNTNATHQCLRW4O.xls, Copy of I-LHD 
> 3E.xls
>
>
> When parsing the attached password-protected Excel file "Copy of I-LHD 
> 3E.xls", Tika emits an IllegalArgumentException with the message "Unsupported 
> codepage requested". The inability to parse has nothing to do with the 
> codepage, so that error is misleading.





[jira] [Commented] (TIKA-2104) Upgrade to a version of POI that fixes common bugs in macro extraction, when available

2016-11-22 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15687415#comment-15687415
 ] 

Tim Allison commented on TIKA-2104:
---

moved poi bug 60279 to separate issue: TIKA-2182

> Upgrade to a version of POI that fixes common bugs in macro extraction, when 
> available
> --
>
> Key: TIKA-2104
> URL: https://issues.apache.org/jira/browse/TIKA-2104
> Project: Tika
> Issue Type: Bug
> Reporter: Tim Allison
> Fix For: 2.0, 1.15
>
> Attachments: newExceptionsInBByMimeTypeByStackTrace.xlsx, 
> newExceptionsInBDetails.xlsx
>
>
> On TIKA-2069, we found two bugs in POI that prevented the extraction of 
> macros from MSOffice files.  Let's use this issue to track fixes in POI.
> Current known bugs are POI:
> -60162- duplicate of -59302-
> -60158-
> -59830-
> -59858-
> -60273-
> After we release Tika 1.14, let's remove the catch blocks in Tika and rerun 
> against our regression corpus to help identify the most common bugs and find 
> new ones.
> As always, patches are welcome on POI!





[jira] [Updated] (TIKA-2104) Upgrade to a version of POI that fixes common bugs in macro extraction, when available

2016-11-22 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2104:
--
Description: 
On TIKA-2069, we found two bugs in POI that prevented the extraction of macros 
from MSOffice files.  Let's use this issue to track fixes in POI.

Current known bugs are POI:
-60162- duplicate of -59302-
-60158-
-59830-
-59858-
-60273-


After we release Tika 1.14, let's remove the catch blocks in Tika and rerun 
against our regression corpus to help identify the most common bugs and find 
new ones.

As always, patches are welcome on POI!

  was:
On TIKA-2069, we found two bugs in POI that prevented the extraction of macros 
from MSOffice files.  Let's use this issue to track fixes in POI.

Current known bugs are POI:
-60162- duplicate of -59302-
-60158-
-59830-
-59858-
-60273-
60279

After we release Tika 1.14, let's remove the catch blocks in Tika and rerun 
against our regression corpus to help identify the most common bugs and find 
new ones.

As always, patches are welcome on POI!


> Upgrade to a version of POI that fixes common bugs in macro extraction, when 
> available
> --
>
> Key: TIKA-2104
> URL: https://issues.apache.org/jira/browse/TIKA-2104
> Project: Tika
> Issue Type: Bug
> Reporter: Tim Allison
> Attachments: newExceptionsInBByMimeTypeByStackTrace.xlsx, 
> newExceptionsInBDetails.xlsx
>
>
> On TIKA-2069, we found two bugs in POI that prevented the extraction of 
> macros from MSOffice files.  Let's use this issue to track fixes in POI.
> Current known bugs are POI:
> -60162- duplicate of -59302-
> -60158-
> -59830-
> -59858-
> -60273-
> After we release Tika 1.14, let's remove the catch blocks in Tika and rerun 
> against our regression corpus to help identify the most common bugs and find 
> new ones.
> As always, patches are welcome on POI!





[jira] [Created] (TIKA-2182) Investigate rare IllegalArgumentException in macro extraction

2016-11-22 Thread Tim Allison (JIRA)
Tim Allison created TIKA-2182:
-

 Summary: Investigate rare IllegalArgumentException in macro extraction
 Key: TIKA-2182
 URL: https://issues.apache.org/jira/browse/TIKA-2182
 Project: Tika
 Issue Type: Improvement
 Reporter: Tim Allison
 Priority: Trivial


poi bug 60279





[jira] [Resolved] (TIKA-2104) Upgrade to a version of POI that fixes common bugs in macro extraction, when available

2016-11-22 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2104.
---
   Resolution: Fixed
Fix Version/s: 1.15, 2.0

> Upgrade to a version of POI that fixes common bugs in macro extraction, when 
> available
> --
>
> Key: TIKA-2104
> URL: https://issues.apache.org/jira/browse/TIKA-2104
> Project: Tika
> Issue Type: Bug
> Reporter: Tim Allison
> Fix For: 2.0, 1.15
>
> Attachments: newExceptionsInBByMimeTypeByStackTrace.xlsx, 
> newExceptionsInBDetails.xlsx
>
>
> On TIKA-2069, we found two bugs in POI that prevented the extraction of 
> macros from MSOffice files.  Let's use this issue to track fixes in POI.
> Current known bugs are POI:
> -60162- duplicate of -59302-
> -60158-
> -59830-
> -59858-
> -60273-
> After we release Tika 1.14, let's remove the catch blocks in Tika and rerun 
> against our regression corpus to help identify the most common bugs and find 
> new ones.
> As always, patches are welcome on POI!





[jira] [Resolved] (TIKA-2158) NullPointerException on a valid Word file

2016-11-22 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2158.
---
   Resolution: Fixed
Fix Version/s: 1.15, 2.0

> NullPointerException on a valid Word file
> -
>
> Key: TIKA-2158
> URL: https://issues.apache.org/jira/browse/TIKA-2158
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Fix For: 2.0, 1.15
>
> Attachments: RTOP_Template01112015063856.docx
>
>
> On the attached Word file, which opens fine with Word, the Tika parser throws 
> the following error:
> java.lang.NullPointerException
>   at 
> org.apache.poi.xwpf.usermodel.XWPFSDTContentCell.<init>(XWPFSDTContentCell.java:49)
>   at org.apache.poi.xwpf.usermodel.XWPFSDTCell.<init>(XWPFSDTCell.java:35)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFTableRow.getTableICells(XWPFTableRow.java:147)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractTable(XWPFWordExtractorDecorator.java:359)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractIBodyText(XWPFWordExtractorDecorator.java:111)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.buildXHTML(XWPFWordExtractorDecorator.java:93)
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:109)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)





[jira] [Resolved] (TIKA-2160) POIXMLException from NullPointerException on a valid Word file

2016-11-22 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2160.
---
   Resolution: Fixed
Fix Version/s: 1.15, 2.0

> POIXMLException from NullPointerException on a valid Word file
> --
>
> Key: TIKA-2160
> URL: https://issues.apache.org/jira/browse/TIKA-2160
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Fix For: 2.0, 1.15
>
> Attachments: test_16022016081053.docx
>
>
> On the attached word file, which opens fine with Word (albeit with no text), 
> the Tika parser throws the following error:
> org.apache.poi.POIXMLException: java.lang.NullPointerException
>   at 
> org.apache.poi.xwpf.usermodel.XWPFFooter.onDocumentRead(XWPFFooter.java:130)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:208)
>   at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:160)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:124)
>   at 
> org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
>   at 
> org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:237)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.poi.xwpf.usermodel.AbstractXWPFSDT.<init>(AbstractXWPFSDT.java:37)
>   at org.apache.poi.xwpf.usermodel.XWPFSDT.<init>(XWPFSDT.java:38)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFFooter.onDocumentRead(XWPFFooter.java:124)
>   ... 9 more





[jira] [Resolved] (TIKA-2142) ArrayIndexOutOfBoundsException

2016-11-22 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2142.
---
   Resolution: Fixed
Fix Version/s: 1.15, 2.0

> ArrayIndexOutOfBoundsException
> --
>
> Key: TIKA-2142
> URL: https://issues.apache.org/jira/browse/TIKA-2142
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Fix For: 2.0, 1.15
>
> Attachments: HPV8dHinge Confocal Results.ppt
>
>
> On the attached PowerPoint presentation, which opens fine with PowerPoint, 
> the Tika parser throws the following error:
> java.lang.ArrayIndexOutOfBoundsException
>   at java.lang.System.arraycopy(Native Method)
>   at 
> org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.readPictures(HSLFSlideShowImpl.java:438)
>   at 
> org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.getPictureData(HSLFSlideShowImpl.java:772)
>   at 
> org.apache.poi.hslf.usermodel.HSLFSlideShow.getPictureData(HSLFSlideShow.java:547)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:305)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)





[jira] [Resolved] (TIKA-2145) InvalidFormatException on a valid Word file

2016-11-22 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2145.
---
   Resolution: Fixed
Fix Version/s: 1.15
   2.0

> InvalidFormatException on a valid Word file
> ---
>
> Key: TIKA-2145
> URL: https://issues.apache.org/jira/browse/TIKA-2145
> Project: Tika
>  Issue Type: Bug
>  Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Fix For: 2.0, 1.15
>
> Attachments: safety_analysis_report_FINAL2.docx
>
>
> On the attached Word file, which opens fine with Word, the Tika parser throws 
> the following exception:
> org.apache.tika.exception.TikaException: Error creating OOXML extractor
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:120)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
> Caused by: java.lang.IllegalArgumentException: Date for created could not be 
> parsed: 2015-07-27
>   at 
> org.apache.poi.openxml4j.opc.internal.PackagePropertiesPart.setCreatedProperty(PackagePropertiesPart.java:408)
>   at 
> org.apache.poi.openxml4j.opc.internal.unmarshallers.PackagePropertiesUnmarshaller.unmarshall(PackagePropertiesUnmarshaller.java:124)
>   at org.apache.poi.openxml4j.opc.OPCPackage.getParts(OPCPackage.java:743)
>   at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:230)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:69)
>   ... 3 more
> Caused by: org.apache.poi.openxml4j.exceptions.InvalidFormatException: Date 
> 2015-07-27 not well formatted, expected format in: yyyy-MM-dd'T'HH:mm:ssz, 
> yyyy-MM-dd'T'HH:mm:ss.SSSz, yyyy-MM-dd'T'HH:mm:ss'Z', 
> yyyy-MM-dd'T'HH:mm:ss.SS'Z'
>   at 
> org.apache.poi.openxml4j.opc.internal.PackagePropertiesPart.setDateValue(PackagePropertiesPart.java:615)
>   at 
> org.apache.poi.openxml4j.opc.internal.PackagePropertiesPart.setCreatedProperty(PackagePropertiesPart.java:406)
>   ... 7 more
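
The trace above shows a date-only value (2015-07-27) being rejected because it matches none of the expected date-time patterns. As a minimal sketch of how a caller might tolerate such values, the following tries a full ISO-8601 date-time first and falls back to a bare date at midnight UTC. The class and method names are hypothetical, and this is not the change that was made in POI or Tika.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeParseException;
import java.util.Optional;

// Hypothetical helper for illustration; not part of POI or Tika.
final class LenientCoreDate {

    // Tries an ISO-8601 date-time with offset first, then a bare ISO date.
    static Optional<Instant> parse(String value) {
        if (value == null) {
            return Optional.empty();
        }
        String trimmed = value.trim();
        try {
            return Optional.of(OffsetDateTime.parse(trimmed).toInstant());
        } catch (DateTimeParseException ignored) {
            // Not a full date-time; try the date-only form next.
        }
        try {
            // Assumption: a date without a time is taken as midnight UTC.
            return Optional.of(LocalDate.parse(trimmed).atStartOfDay(ZoneOffset.UTC).toInstant());
        } catch (DateTimeParseException ignored) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("2015-07-27"));
        System.out.println(parse("2015-07-27T10:15:30Z"));
        System.out.println(parse("not a date"));
    }
}
```

Returning an empty Optional instead of throwing lets the caller decide whether a malformed creation date should abort the whole parse or only drop that one property.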





[jira] [Resolved] (TIKA-2132) NullPointerException on a valid Excel file

2016-11-22 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2132.
---
   Resolution: Fixed
Fix Version/s: 1.15
   2.0

> NullPointerException on a valid Excel file
> --
>
> Key: TIKA-2132
> URL: https://issues.apache.org/jira/browse/TIKA-2132
> Project: Tika
>  Issue Type: Bug
>  Components: parser
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Fix For: 2.0, 1.15
>
> Attachments: 2a-Executive_Summary_50_Work.xlsm
>
>
> The attached XLSM file, which opens fine in Excel, causes the following error 
> in the Tika parser:
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser@a5bd950
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at gov.nih.niaid.fscanner.Extract.ExtractContents(Extract.java:62)
>   at gov.nih.niaid.temp.Main.main(Main.java:60)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.poi.xssf.eventusermodel.XSSFReader$SheetIterator.next(XSSFReader.java:254)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML(XSSFExcelExtractorDecorator.java:124)
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:109)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML(XSSFExcelExtractorDecorator.java:97)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)





[jira] [Resolved] (TIKA-2125) XmlValueOutOfRangeException on a good Word document

2016-11-22 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2125.
---
   Resolution: Fixed
Fix Version/s: 1.15
   2.0

> XmlValueOutOfRangeException on a good Word document
> ---
>
> Key: TIKA-2125
> URL: https://issues.apache.org/jira/browse/TIKA-2125
> Project: Tika
>  Issue Type: Bug
>  Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Fix For: 2.0, 1.15
>
> Attachments: LMVR Mentoring Activities brm.docx
>
>
> On the attached Word document, which opens fine with Word, the Tika parser 
> throws a TikaException caused by 
> org.apache.xmlbeans.impl.values.XmlValueOutOfRangeException with message 
> "string value 'odd' is not a valid enumeration value for ST_HdrFtr in 
> namespace http://schemas.openxmlformats.org/wordprocessingml/2006/main".





[jira] [Resolved] (TIKA-2129) IllegalArgumentException/"Unknown shape type" on a valid Powerpoint file

2016-11-22 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2129.
---
   Resolution: Fixed
Fix Version/s: 1.15
   2.0

Thank you, [~kiwiwings]!

> IllegalArgumentException/"Unknown shape type" on a valid Powerpoint file
> 
>
> Key: TIKA-2129
> URL: https://issues.apache.org/jira/browse/TIKA-2129
> Project: Tika
>  Issue Type: Bug
>  Components: parser
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Fix For: 2.0, 1.15
>
> Attachments: 10.1056-NEJMra020100Figure01.ppt
>
>
> The attached valid Powerpoint file, when parsed with Tika, throws the 
> following error:
> java.lang.IllegalArgumentException: Unknown shape type: 4095
>   at org.apache.poi.sl.usermodel.ShapeType.forId(ShapeType.java:314)
>   at 
> org.apache.poi.hslf.usermodel.HSLFShapeFactory.createSimpleShape(HSLFShapeFactory.java:98)
>   at 
> org.apache.poi.hslf.usermodel.HSLFShapeFactory.createShape(HSLFShapeFactory.java:62)
>   at org.apache.poi.hslf.usermodel.HSLFSheet.getShapes(HSLFSheet.java:173)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:93)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
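
The trace above fails inside an id-to-enum lookup that throws on an unrecognised id (4095). A minimal sketch of the alternative, a lookup that reports "unknown" to the caller instead of throwing, is shown below. The enum, its constants and their ids are illustrative assumptions, not POI's `ShapeType`, and this is not necessarily how the issue was fixed.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical enum for illustration; constants and ids are made up.
enum DemoShape {
    NOT_PRIMITIVE(0),
    RECT(1),
    ELLIPSE(3);

    private static final Map<Integer, DemoShape> BY_ID = new HashMap<>();

    static {
        for (DemoShape shape : values()) {
            BY_ID.put(shape.id, shape);
        }
    }

    private final int id;

    DemoShape(int id) {
        this.id = id;
    }

    // Returns an empty Optional for ids that have no constant, rather than throwing.
    static Optional<DemoShape> lookup(int id) {
        return Optional.ofNullable(BY_ID.get(id));
    }

    public static void main(String[] args) {
        // An unknown id degrades to a fallback chosen by the caller.
        DemoShape shape = lookup(4095).orElse(NOT_PRIMITIVE);
        System.out.println(shape);
    }
}
```

With this shape of API, a text extractor can skip or approximate a shape it does not recognise and still process the rest of the slide.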





[jira] [Resolved] (TIKA-2115) OOM caused by corrupt embedded OLE object

2016-11-22 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2115.
---
   Resolution: Fixed
Fix Version/s: 1.15
   2.0

> OOM caused by corrupt embedded OLE object
> -
>
> Key: TIKA-2115
> URL: https://issues.apache.org/jira/browse/TIKA-2115
> Project: Tika
>  Issue Type: Bug
>  Components: parser
> Affects Versions: 1.13
> Environment: Generic test with tika-app-1.13.jar on test document
> Reporter: Thomas Galla
> Fix For: 2.0, 1.15
>
> Attachments: TikaTestcase.pptx
>
>
> When parsing an embedded OLE object in a PowerPoint presentation, a size 
> field claims that 2GB of data needs to be read, and the code simply tries to 
> allocate a buffer of that size, which results in an OOM.
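
The failure mode described above (trusting a declared length and allocating it up front) can be avoided by capping the declared size and reading in chunks. A minimal sketch follows, assuming a hypothetical helper `BoundedRead.readDeclared` and a cap chosen by the caller. It is not the change that was made in POI or Tika.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration; not part of POI or Tika.
final class BoundedRead {

    // Reads exactly declaredLength bytes, refusing lengths above maxLength
    // and never allocating more than the data that actually arrives.
    static byte[] readDeclared(InputStream in, long declaredLength, int maxLength) throws IOException {
        if (declaredLength < 0 || declaredLength > maxLength) {
            throw new IOException("Declared length " + declaredLength
                    + " is outside the allowed range 0.." + maxLength);
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        long remaining = declaredLength;
        while (remaining > 0) {
            int read = in.read(chunk, 0, (int) Math.min(chunk.length, remaining));
            if (read < 0) {
                throw new EOFException("Stream ended with " + remaining + " bytes still expected");
            }
            out.write(chunk, 0, read);
            remaining -= read;
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {1, 2, 3, 4, 5};
        System.out.println(readDeclared(new ByteArrayInputStream(data), 3, 1024).length);
        try {
            // A corrupt size field claiming about 2GB is rejected before any allocation.
            readDeclared(new ByteArrayInputStream(data), Integer.MAX_VALUE, 1024);
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Because the buffer grows only as bytes arrive, a truncated or lying stream costs at most the bytes it really contains.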



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-2116) Upgrade to POI 3.16-beta1 when available

2016-11-22 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2116:
--
Fix Version/s: 1.15
   2.0

> Upgrade to POI 3.16-beta1 when available
> 
>
> Key: TIKA-2116
> URL: https://issues.apache.org/jira/browse/TIKA-2116
> Project: Tika
>  Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Minor
> Fix For: 2.0, 1.15
>
>






[jira] [Resolved] (TIKA-2116) Upgrade to POI 3.16-beta1 when available

2016-11-22 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2116.
---
Resolution: Fixed

> Upgrade to POI 3.16-beta1 when available
> 
>
> Key: TIKA-2116
> URL: https://issues.apache.org/jira/browse/TIKA-2116
> Project: Tika
>  Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Minor
>






[jira] [Resolved] (TIKA-1658) unable to parse microsoft visio files with tika

2016-11-22 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-1658.
---
   Resolution: Fixed
Fix Version/s: 1.15
   2.0

> unable to parse microsoft visio files with tika
> ---
>
> Key: TIKA-1658
> URL: https://issues.apache.org/jira/browse/TIKA-1658
> Project: Tika
>  Issue Type: Bug
>  Components: metadata
> Affects Versions: 0.9, 1.1, 1.3, 1.4, 1.5, 1.8
> Environment: ubuntu 14.04 and windows 7
> Reporter: senthil
> Fix For: 2.0, 1.15
>
> Attachments: Connection Types.vsd
>
>
> Hi,
> Parsing a Microsoft Visio file throws an exception:
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
> org.apache.tika.parser.microsoft.OfficeParser@13d28e3
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>
> Caused by: java.lang.RuntimeException: TODO
>   at 
> org.apache.poi.hdgf.pointers.PointerFactory.createPointer(PointerFactory.java:45)
>   at org.apache.poi.hdgf.HDGFDiagram.<init>(HDGFDiagram.java:99)
> application/vnd.visio
>   at 
> org.apache.poi.hdgf.extractor.VisioTextExtractor.<init>(VisioTextExtractor.java:55)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:200)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   ... 4 more
> Please help with a resolution
> regards
> sentil





[jira] [Commented] (TIKA-2116) Upgrade to POI 3.16-beta1 when available

2016-11-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15687264#comment-15687264
 ] 

Hudson commented on TIKA-2116:
--

SUCCESS: Integrated in Jenkins build tika-2.x #175 (See 
[https://builds.apache.org/job/tika-2.x/175/])
TIKA-2116 upgrade to POI 3.16-beta1 (tallison: rev 
8c01e4d8e7b37bdcb1a1aa1bf99675dfb01d49e4)
* (edit) CHANGES.txt
* (edit) tika-parser-modules/pom.xml


> Upgrade to POI 3.16-beta1 when available
> 
>
> Key: TIKA-2116
> URL: https://issues.apache.org/jira/browse/TIKA-2116
> Project: Tika
>  Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Minor
>






[jira] [Commented] (TIKA-2143) POI deprecated method used in TIKA 1.13

2016-11-22 Thread sbathrutheen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15687243#comment-15687243
 ] 

sbathrutheen commented on TIKA-2143:


We have asked our client for the opts details and will update you as soon as 
we have them.

> POI deprecated method used in TIKA 1.13 
> 
>
> Key: TIKA-2143
> URL: https://issues.apache.org/jira/browse/TIKA-2143
> Project: Tika
>  Issue Type: Bug
>  Components: parser
> Affects Versions: 1.9, 1.13
> Environment: Windows java application
> Reporter: sbathrutheen
> Fix For: 1.13
>
>
> We see that Tika throws a long list of errors when extracting PPT files. When 
> we tested with the standalone Tika application (1.13), we could not reproduce 
> the issue. We took a look at the POI source code and observed that the class 
> "HSLFSlideShow" defines the deprecated method below:
> *
> /**
> -  * Get the lookup from slide numbers to their offsets inside
> -  *  _ptrData, used when adding or moving slides.
> -  * 
> -  * @deprecated since POI 3.11, not supported anymore
> -  */
> - @Deprecated
> - public Hashtable getSlideOffsetDataLocationsLookup() {
> - throw new 
> UnsupportedOperationException("PersistPtrHolder.getSlideOffsetDataLocationsLookup()
>  is not supported since 3.12-Beta1");
> - }
> *
> We suspect the Tika library is still calling this deprecated method, causing 
> this runtime exception:
> Caused by: org.apache.tika.exception.TikaException: Unexpected 
> RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@204c3b78
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> at 
> com.searchtechnologies.aspire.docprocessing.extracttext.ExtractTextStage.process(ExtractTextStage.java:140)
> ... 14 more
> Caused by: java.lang.UnsupportedOperationException
> at java.util.AbstractMap$SimpleImmutableEntry.setValue(Unknown Source)
> at org.apache.poi.hslf.HSLFSlideShow.read(HSLFSlideShow.java:293)
> at org.apache.poi.hslf.HSLFSlideShow.buildRecords(HSLFSlideShow.java:273)
> at org.apache.poi.hslf.HSLFSlideShow.<init>(HSLFSlideShow.java:188)
> at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:61)
> at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
> at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
> ... 17 more
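
The innermost frame of the trace above is `java.util.AbstractMap$SimpleImmutableEntry.setValue`, and the quoted `getSlideOffsetDataLocationsLookup()` method does not appear in it. As a small illustration of the JDK behaviour behind that frame, calling `setValue` on a `SimpleImmutableEntry` always throws UnsupportedOperationException, whereas `SimpleEntry` accepts it. The class name is hypothetical and the code is not taken from POI.

```java
import java.util.AbstractMap;
import java.util.Map;

// Hypothetical demo class; shows JDK behaviour only.
final class ImmutableEntryDemo {

    // Returns true if the entry refuses setValue with UnsupportedOperationException.
    static boolean setValueIsRejected(Map.Entry<Integer, Integer> entry) {
        try {
            entry.setValue(42);
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Map.Entry<Integer, Integer> immutable = new AbstractMap.SimpleImmutableEntry<>(1, 100);
        Map.Entry<Integer, Integer> mutable = new AbstractMap.SimpleEntry<>(1, 100);
        System.out.println(setValueIsRejected(immutable)); // prints true
        System.out.println(setValueIsRejected(mutable));   // prints false
    }
}
```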





[jira] [Commented] (TIKA-2116) Upgrade to POI 3.16-beta1 when available

2016-11-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15687159#comment-15687159
 ] 

Hudson commented on TIKA-2116:
--

FAILURE: Integrated in Jenkins build tika-2.x-windows #76 (See 
[https://builds.apache.org/job/tika-2.x-windows/76/])
TIKA-2116 upgrade to POI 3.16-beta1 (tallison: rev 
8c01e4d8e7b37bdcb1a1aa1bf99675dfb01d49e4)
* (edit) tika-parser-modules/pom.xml
* (edit) CHANGES.txt


> Upgrade to POI 3.16-beta1 when available
> 
>
> Key: TIKA-2116
> URL: https://issues.apache.org/jira/browse/TIKA-2116
> Project: Tika
>  Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Minor
>






tika-2.x-windows - Build # 76 - Still Failing

2016-11-22 Thread Apache Jenkins Server
The Apache Jenkins build system has built tika-2.x-windows (build #76)

Status: Still Failing

Check console output at https://builds.apache.org/job/tika-2.x-windows/76/ to 
view the results.

[jira] [Created] (TIKA-2181) Upgrade to POI 3.16-beta2 when available

2016-11-22 Thread Tim Allison (JIRA)
Tim Allison created TIKA-2181:
-

 Summary: Upgrade to POI 3.16-beta2 when available
 Key: TIKA-2181
 URL: https://issues.apache.org/jira/browse/TIKA-2181
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor








[jira] [Commented] (TIKA-2143) POI deprecated method used in TIKA 1.13

2016-11-22 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15686939#comment-15686939
 ] 

Tim Allison commented on TIKA-2143:
---

Any further info on this issue, [~sbathrutheen]?

> POI deprecated method used in TIKA 1.13 
> 
>
> Key: TIKA-2143
> URL: https://issues.apache.org/jira/browse/TIKA-2143
> Project: Tika
>  Issue Type: Bug
>  Components: parser
> Affects Versions: 1.9, 1.13
> Environment: Windows java application
> Reporter: sbathrutheen
> Fix For: 1.13
>
>
> We see that Tika throws a long list of errors when extracting PPT files. When 
> we tested with the standalone Tika application (1.13), we could not reproduce 
> the issue. We took a look at the POI source code and observed that the class 
> "HSLFSlideShow" defines the deprecated method below:
> *
> /**
> -  * Get the lookup from slide numbers to their offsets inside
> -  *  _ptrData, used when adding or moving slides.
> -  * 
> -  * @deprecated since POI 3.11, not supported anymore
> -  */
> - @Deprecated
> - public Hashtable getSlideOffsetDataLocationsLookup() {
> - throw new 
> UnsupportedOperationException("PersistPtrHolder.getSlideOffsetDataLocationsLookup()
>  is not supported since 3.12-Beta1");
> - }
> *
> We suspect the Tika library is still calling this deprecated method, causing 
> this runtime exception:
> Caused by: org.apache.tika.exception.TikaException: Unexpected 
> RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@204c3b78
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> at 
> com.searchtechnologies.aspire.docprocessing.extracttext.ExtractTextStage.process(ExtractTextStage.java:140)
> ... 14 more
> Caused by: java.lang.UnsupportedOperationException
> at java.util.AbstractMap$SimpleImmutableEntry.setValue(Unknown Source)
> at org.apache.poi.hslf.HSLFSlideShow.read(HSLFSlideShow.java:293)
> at org.apache.poi.hslf.HSLFSlideShow.buildRecords(HSLFSlideShow.java:273)
> at org.apache.poi.hslf.HSLFSlideShow.<init>(HSLFSlideShow.java:188)
> at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:61)
> at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
> at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
> ... 17 more





[jira] [Commented] (TIKA-2180) Multiple requests on Tika to extract text slows down

2016-11-22 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15686698#comment-15686698
 ] 

Tim Allison commented on TIKA-2180:
---

Is this new in 1.14? Are the files of a particular format? I regret that I 
haven't done any performance tests on tika-server.

> Multiple requests on Tika to extract text slows down
> 
>
> Key: TIKA-2180
> URL: https://issues.apache.org/jira/browse/TIKA-2180
> Project: Tika
>  Issue Type: Bug
>  Components: server
>Affects Versions: 1.13, 1.14
> Environment: Windows OS, Open JDK, 4 core 32 GB RAM
>Reporter: Ashish Basran
>
> I observed that if I send multiple requests to Tika (e.g. 
> http://localhost:8080/tika) with files of around 5MB, Tika is very slow to 
> complete them. With ~20 random files, it took 170 seconds to process all the 
> files in sequence. When I sent all the files in parallel, it took around 780 
> seconds to process the same set of files. 
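
One way to narrow down a report like this is to time the same batch of work sequentially and then through a fixed-size thread pool. The sketch below does that with a placeholder task. In a real test each task would be an HTTP request to tika-server, which is left out so the example stays self-contained. All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical timing harness; not part of Tika or tika-server.
final class BatchTimer {

    // Runs the tasks one after another and returns the elapsed milliseconds.
    static long timeSequentialMillis(List<Callable<Void>> tasks) throws Exception {
        long start = System.nanoTime();
        for (Callable<Void> task : tasks) {
            task.call();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    // Runs the tasks on a fixed-size pool, waits for all of them,
    // and returns the elapsed milliseconds.
    static long timeParallelMillis(List<Callable<Void>> tasks, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            long start = System.nanoTime();
            for (Future<Void> result : pool.invokeAll(tasks)) {
                result.get(); // surfaces a task failure as an ExecutionException
            }
            return (System.nanoTime() - start) / 1_000_000;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<Callable<Void>> tasks = new ArrayList<>();
        for (int i = 0; i < 8; i++) {
            // Placeholder for one request; replace with a real client call when testing.
            tasks.add(() -> {
                Thread.sleep(50);
                return null;
            });
        }
        System.out.println("sequential ms: " + timeSequentialMillis(tasks));
        System.out.println("parallel ms:   " + timeParallelMillis(tasks, 4));
    }
}
```

Varying the pool size (1, 2, 4, ...) against the same file set would show at what level of concurrency the total time starts to grow rather than shrink.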


