tika-2.x-windows - Build # 172 - Still Failing

2017-01-11 Thread Apache Jenkins Server
The Apache Jenkins build system has built tika-2.x-windows (build #172)

Status: Still Failing

Check console output at https://builds.apache.org/job/tika-2.x-windows/172/ to 
view the results.

tika-2.x-windows - Build # 171 - Still Failing

2017-01-11 Thread Apache Jenkins Server
The Apache Jenkins build system has built tika-2.x-windows (build #171)

Status: Still Failing

Check console output at https://builds.apache.org/job/tika-2.x-windows/171/ to 
view the results.

tika-2.x-windows - Build # 170 - Still Failing

2017-01-11 Thread Apache Jenkins Server
The Apache Jenkins build system has built tika-2.x-windows (build #170)

Status: Still Failing

Check console output at https://builds.apache.org/job/tika-2.x-windows/170/ to 
view the results.

tika-2.x-windows - Build # 169 - Still Failing

2017-01-11 Thread Apache Jenkins Server
The Apache Jenkins build system has built tika-2.x-windows (build #169)

Status: Still Failing

Check console output at https://builds.apache.org/job/tika-2.x-windows/169/ to 
view the results.

tika-2.x-windows - Build # 168 - Still Failing

2017-01-11 Thread Apache Jenkins Server
The Apache Jenkins build system has built tika-2.x-windows (build #168)

Status: Still Failing

Check console output at https://builds.apache.org/job/tika-2.x-windows/168/ to 
view the results.

tika-2.x-windows - Build # 167 - Still Failing

2017-01-11 Thread Apache Jenkins Server
The Apache Jenkins build system has built tika-2.x-windows (build #167)

Status: Still Failing

Check console output at https://builds.apache.org/job/tika-2.x-windows/167/ to 
view the results.

tika-2.x-windows - Build # 166 - Still Failing

2017-01-11 Thread Apache Jenkins Server
The Apache Jenkins build system has built tika-2.x-windows (build #166)

Status: Still Failing

Check console output at https://builds.apache.org/job/tika-2.x-windows/166/ to 
view the results.

[jira] [Commented] (TIKA-2210) Add experimental SAX/Streaming XSLF/pptx extractor

2017-01-11 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818815#comment-15818815
 ] 

Hudson commented on TIKA-2210:
--

SUCCESS: Integrated in Jenkins build tika-2.x #194 (See 
[https://builds.apache.org/job/tika-2.x/194/])
TIKA-2210 -- add experimental SAX parser for pptx and update (also (tallison: 
rev 68161573140cb584f8af136c57045fbca833fec5)
* (edit) 
tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XWPFListManager.java
* (add) 
tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLWordAndPowerPointTextHandler.java
* (edit) 
tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/xwpf/ml2006/Word2006MLParserTest.java
* (edit) 
tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/OfficeParserConfig.java
* (edit) 
tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/AbstractOOXMLExtractor.java
* (edit) 
tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLExtractor.java
* (add) 
tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/SXSLFExtractorTest.java
* (add) 
tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/ParagraphProperties.java
* (edit) 
tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ExcelParserTest.java
* (edit) 
tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/ml2006/Word2006MLDocHandler.java
* (edit) 
tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLExtractorFactory.java
* (add) 
tika-test-resources/src/test/resources/test-documents/testWORD_template.dotx
* (edit) tika-app/src/test/java/org/apache/tika/parser/TestParsers.java
* (edit) 
tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/MetadataExtractor.java
* (add) 
tika-test-resources/src/test/resources/test-documents/testWORD_template.docx
* (edit) 
tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
* (edit) 
tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFEventBasedWordExtractor.java
* (add) 
tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/SXWPFExtractorTest.java
* (edit) 
tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSLFPowerPointExtractorDecorator.java
* (add) 
tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/RunProperties.java
* (add) 
tika-test-resources/src/test/resources/test-documents/testPPTX_overlappingRelations.pptx
* (add) 
tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/xslf/XSLFEventBasedPowerPointExtractor.java
* (edit) CHANGES.txt
* (add) 
tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLTikaBodyPartHandler.java
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/xwpf/SXWPFExtractorTest.java
* (add) 
tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/SXSLFPowerPointExtractorDecorator.java
* (add) 
tika-test-resources/src/test/resources/test-documents/testWORD_embedded_pics.docx
* (add) 
tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFNumberingShim.java
* (edit) tika-core/src/test/java/org/apache/tika/TikaTest.java
* (add) 
tika-test-resources/src/test/resources/test-documents/testPPT_various2.pptx
* (add) 
tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFStylesShim.java
* (edit) 
tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/SXWPFWordExtractorDecorator.java
* (edit) 
tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java
* (add) 
tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/ml2006/WordAndPowerPointTextPartHandler.java


> Add experimental SAX/Streaming XSLF/pptx extractor
> --
>
> Key: TIKA-2210
> URL: https://issues.apache.org/jira/browse/TIKA-2210
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.14
>Reporter: Tim Allison
>

[jira] [Commented] (TIKA-2192) Extract embedded files from headers, footers, footnotes, etc from docx/m

2017-01-11 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818813#comment-15818813
 ] 

Hudson commented on TIKA-2192:
--

SUCCESS: Integrated in Jenkins build tika-2.x #194 (See 
[https://builds.apache.org/job/tika-2.x/194/])
TIKA-2192 (tallison: rev e02084cc64c5a825dae6e16853c5dac3cbb55f46)
* (edit) 
tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XWPFWordExtractorDecorator.java


> Extract embedded files from headers, footers, footnotes, etc from docx/m
> 
>
> Key: TIKA-2192
> URL: https://issues.apache.org/jira/browse/TIKA-2192
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
> Fix For: 2.0, 1.15
>
>
> While working on an alternate SAX parser for docx/docm, I found that we're 
> not currently extracting embedded documents from headers, footers, footnotes, 
> endnotes or comments.  We should fix this in our classic DOM parser.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-2237) UnsupportedOperationException due to SingletonList.set in ProbabilisticMimeDetectionSelector

2017-01-11 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818816#comment-15818816
 ] 

Hudson commented on TIKA-2237:
--

SUCCESS: Integrated in Jenkins build tika-2.x #194 (See 
[https://builds.apache.org/job/tika-2.x/194/])
TIKA-2237 (tallison: rev 2d908d59b022ab2c12800336c648ae9763faf107)
* (edit) 
tika-core/src/test/java/org/apache/tika/mime/ProbabilisticMimeDetectionTestWithTika.java
* (edit) 
tika-core/src/main/java/org/apache/tika/mime/ProbabilisticMimeDetectionSelector.java
* (edit) tika-app/src/test/java/org/apache/tika/parser/TestParsers.java


> UnsupportedOperationException due to SingletonList.set in 
> ProbabilisticMimeDetectionSelector
> 
>
> Key: TIKA-2237
> URL: https://issues.apache.org/jira/browse/TIKA-2237
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 1.14
>Reporter: Jasper Hafkenscheid
> Fix For: 2.0, 1.15
>
>
> {noformat}java.lang.UnsupportedOperationException
>   at java.util.AbstractList.set(AbstractList.java:132)
>   at 
> org.apache.tika.mime.ProbabilisticMimeDetectionSelector.applyProbilities(ProbabilisticMimeDetectionSelector.java:241)
>   at 
> org.apache.tika.mime.ProbabilisticMimeDetectionSelector.detect(ProbabilisticMimeDetectionSelector.java:190)
> {noformat}





tika-2.x-windows - Build # 165 - Still Failing

2017-01-11 Thread Apache Jenkins Server
The Apache Jenkins build system has built tika-2.x-windows (build #165)

Status: Still Failing

Check console output at https://builds.apache.org/job/tika-2.x-windows/165/ to 
view the results.

[jira] [Commented] (TIKA-2232) Add JBIG2 image parsing support

2017-01-11 Thread Pascal Essiembre (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818741#comment-15818741
 ] 

Pascal Essiembre commented on TIKA-2232:


Either way. I think the most important thing is not to have JBIG2 images 
"silently" ignored when the library is not on the classpath.  So having some 
sort of indication when encountering such files without the library would be 
nice (either a log message or an exception).

> Add JBIG2 image parsing support
> ---
>
> Key: TIKA-2232
> URL: https://issues.apache.org/jira/browse/TIKA-2232
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.14
> Environment: Any
>Reporter: Pascal Essiembre
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 2.0, 1.15
>
>
> If you are interested, I would like to add support for JBIG2 image files 
> (.jb2, or .jbig2).  I have encountered them in PDFs.
> I will make a pull-request shortly.





[jira] [Resolved] (TIKA-2167) Image processing causes OCR to fail

2017-01-11 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2167.
---
   Resolution: Fixed
Fix Version/s: 1.15
   2.0

No problems now with tika-app's gui on this file both with Tesseract enabled 
and w/out Tesseract.

> Image processing causes OCR to fail
> ---
>
> Key: TIKA-2167
> URL: https://issues.apache.org/jira/browse/TIKA-2167
> Project: Tika
>  Issue Type: Bug
>  Components: ocr
>Affects Versions: 1.14
> Environment: Mac OS X 10.11.6; Java 1.8.0_45; tesseract 3.04.01; 
> ImageMagick 6.9.6-2
>Reporter: Matthew Caruana Galizia
>Priority: Critical
>  Labels: convert, image, ocr, tiff
> Fix For: 2.0, 1.15
>
> Attachments: simple.tiff
>
>
> Image processing before OCR is enabled by default in the OCR configuration 
> properties file. Unless this is disabled, running Tika on a simple TIFF image 
> (attached) with two clear words fails. When image processing is disabled, it 
> succeeds.





[jira] [Commented] (TIKA-2198) NullPointerException on a valid Word file

2017-01-11 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818709#comment-15818709
 ] 

Tim Allison commented on TIKA-2198:
---

https://bz.apache.org/bugzilla/show_bug.cgi?id=60574

opened and fixed.

> NullPointerException on a valid Word file
> -
>
> Key: TIKA-2198
> URL: https://issues.apache.org/jira/browse/TIKA-2198
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: CIPRA SA concept project 2 rev JM.doc
>
>
> On the attached file, which opens fine in Word, the Tika parser throws the 
> following error:
> java.lang.NullPointerException: 
>   at org.apache.poi.hwpf.model.ListTables.getLevel:141
>   at org.apache.poi.hwpf.usermodel.Paragraph.newParagraph:125
>   at org.apache.poi.hwpf.usermodel.Range.getParagraph:766
>   at org.apache.tika.parser.microsoft.WordExtractor.parse:178
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:169
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:130





[jira] [Commented] (TIKA-2200) XML schema mismatch error on a valid Word document

2017-01-11 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818682#comment-15818682
 ] 

Tim Allison commented on TIKA-2200:
---

I confirmed the new experimental SAX docx parser handles this file without 
problem.
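
The XmlBeans error quoted below is a complaint about the local name of the 
document's root element ("expected document got wordDocument").  As a rough 
illustration of how a SAX-based reader can look at the root element without 
binding the whole document to a schema, here is a minimal sketch using only 
the JDK's SAX API.  The class name RootElementSniffer and the sample XML with 
its urn:example:w namespace are made up for this example; this is not the code 
of Tika's experimental parser or of POI.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration only; not part of Tika or POI.
public class RootElementSniffer {

    /** Returns the local name of the first (root) element in the given XML. */
    static String rootLocalName(String xml) {
        final String[] root = new String[1];
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // Namespace awareness is needed so that localName is populated.
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(
                    new InputSource(new StringReader(xml)),
                    new DefaultHandler() {
                        @Override
                        public void startElement(String uri, String localName,
                                                 String qName, Attributes atts) {
                            // Only the first start tag is the root element.
                            if (root[0] == null) {
                                root[0] = localName;
                            }
                        }
                    });
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
        return root[0];
    }

    public static void main(String[] args) {
        // Made-up sample document; the namespace URI is a placeholder.
        String xml = "<w:wordDocument xmlns:w=\"urn:example:w\">"
                + "<w:body/></w:wordDocument>";
        System.out.println("root element: " + rootLocalName(xml));
    }
}
```

A handler like this sees whatever root element the file actually contains, so 
the caller decides how to react to an unexpected name rather than failing 
during schema binding.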

> XML schema mismatch error on a valid Word document
> --
>
> Key: TIKA-2200
> URL: https://issues.apache.org/jira/browse/TIKA-2200
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: MK2048_FROM_ISENTRIS.docx
>
>
> The attached document, which opens in Word, errors out in Tika:
> org.apache.poi.POIXMLException: org.apache.xmlbeans.XmlException: error: The 
> document is not a 
> document@http://schemas.openxmlformats.org/wordprocessingml/2006/main: 
> document element local name mismatch expected document got wordDocument
>   at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead:241
>   at org.apache.poi.POIXMLDocument.load:190
>   at org.apache.poi.xwpf.usermodel.XWPFDocument.:124
>   at org.apache.poi.xwpf.extractor.XWPFWordExtractor.:58
>   at org.apache.poi.extractor.ExtractorFactory.createExtractor:232
>   at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse:86
>   at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse:87
> Caused by: org.apache.xmlbeans.XmlException: error: The document is not a 
> document@http://schemas.openxmlformats.org/wordprocessingml/2006/main: 
> document element local name mismatch expected document got wordDocument
>   at org.apache.xmlbeans.impl.store.Locale.verifyDocumentType:459
>   at org.apache.xmlbeans.impl.store.Locale.autoTypeDocument:364
>   at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject:1391
>   at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject:1370
>   at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse:370
>   at org.apache.poi.POIXMLTypeLoader.parse:116
>   at 
> org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument$Factory.parse:-1
>   at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead:164
>   at org.apache.poi.POIXMLDocument.load:190
>   at org.apache.poi.xwpf.usermodel.XWPFDocument.:124
>   at org.apache.poi.xwpf.extractor.XWPFWordExtractor.:58
>   at org.apache.poi.extractor.ExtractorFactory.createExtractor:232
>   at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse:86
>   at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse:87





[jira] [Commented] (TIKA-2237) UnsupportedOperationException due to SingletonList.set in ProbabilisticMimeDetectionSelector

2017-01-11 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818672#comment-15818672
 ] 

Hudson commented on TIKA-2237:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1175 (See 
[https://builds.apache.org/job/Tika-trunk/1175/])
TIKA-2237 (tallison: rev a38a2b093c7f4f1128dd987663c4406fcd019cd8)
* (edit) 
tika-core/src/main/java/org/apache/tika/mime/ProbabilisticMimeDetectionSelector.java
* (edit) 
tika-core/src/test/java/org/apache/tika/mime/ProbabilisticMimeDetectionTest.java


> UnsupportedOperationException due to SingletonList.set in 
> ProbabilisticMimeDetectionSelector
> 
>
> Key: TIKA-2237
> URL: https://issues.apache.org/jira/browse/TIKA-2237
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 1.14
>Reporter: Jasper Hafkenscheid
> Fix For: 2.0, 1.15
>
>
> {noformat}java.lang.UnsupportedOperationException
>   at java.util.AbstractList.set(AbstractList.java:132)
>   at 
> org.apache.tika.mime.ProbabilisticMimeDetectionSelector.applyProbilities(ProbabilisticMimeDetectionSelector.java:241)
>   at 
> org.apache.tika.mime.ProbabilisticMimeDetectionSelector.detect(ProbabilisticMimeDetectionSelector.java:190)
> {noformat}





[jira] [Resolved] (TIKA-2192) Extract embedded files from headers, footers, footnotes, etc from docx/m

2017-01-11 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2192.
---
   Resolution: Fixed
Fix Version/s: 1.15
   2.0

> Extract embedded files from headers, footers, footnotes, etc from docx/m
> 
>
> Key: TIKA-2192
> URL: https://issues.apache.org/jira/browse/TIKA-2192
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
> Fix For: 2.0, 1.15
>
>
> While working on an alternate SAX parser for docx/docm, I found that we're 
> not currently extracting embedded documents from headers, footers, footnotes, 
> endnotes or comments.  We should fix this in our classic DOM parser.





[jira] [Resolved] (TIKA-2191) Apply current .docx unit tests to experimental SAX parser and fix or document as necessary

2017-01-11 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2191.
---
   Resolution: Fixed
Fix Version/s: 1.15
   2.0

Finally got around to updating 2.x

> Apply current .docx unit tests to experimental SAX parser and fix or document 
> as necessary
> --
>
> Key: TIKA-2191
> URL: https://issues.apache.org/jira/browse/TIKA-2191
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
> Fix For: 2.0, 1.15
>
> Attachments: element_counts_ooxml-docx.xlsx
>
>
> There are many areas for clean up to ensure that the new SAX .docx parser 
> yields similar results to the legacy DOM .docx parser.  Let's use this issue 
> to track work on improvements.





[jira] [Resolved] (TIKA-2210) Add experimental SAX/Streaming XSLF/pptx extractor

2017-01-11 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2210.
---
   Resolution: Fixed
Fix Version/s: 1.15
   2.0

Waited until this was in 2.0 before resolving.

> Add experimental SAX/Streaming XSLF/pptx extractor
> --
>
> Key: TIKA-2210
> URL: https://issues.apache.org/jira/browse/TIKA-2210
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.14
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 2.0, 1.15
>
>
> On TIKA-2201, [~sevaa] shared a reasonably sized pptx that caused an OOM.  
> While the SAX docx parser is still fresh in my mind, let's add one for pptx.





[jira] [Resolved] (TIKA-2237) UnsupportedOperationException due to SingletonList.set in ProbabilisticMimeDetectionSelector

2017-01-11 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2237.
---
   Resolution: Fixed
Fix Version/s: 2.0

Thank you for opening this, diagnosing the problem and submitting a unit test.

> UnsupportedOperationException due to SingletonList.set in 
> ProbabilisticMimeDetectionSelector
> 
>
> Key: TIKA-2237
> URL: https://issues.apache.org/jira/browse/TIKA-2237
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 1.14
>Reporter: Jasper Hafkenscheid
> Fix For: 2.0, 1.15
>
>
> {noformat}java.lang.UnsupportedOperationException
>   at java.util.AbstractList.set(AbstractList.java:132)
>   at 
> org.apache.tika.mime.ProbabilisticMimeDetectionSelector.applyProbilities(ProbabilisticMimeDetectionSelector.java:241)
>   at 
> org.apache.tika.mime.ProbabilisticMimeDetectionSelector.detect(ProbabilisticMimeDetectionSelector.java:190)
> {noformat}





tika-2.x-windows - Build # 164 - Still Failing

2017-01-11 Thread Apache Jenkins Server
The Apache Jenkins build system has built tika-2.x-windows (build #164)

Status: Still Failing

Check console output at https://builds.apache.org/job/tika-2.x-windows/164/ to 
view the results.

[jira] [Updated] (TIKA-2237) UnsupportedOperationException due to SingletonList.set in ProbabilisticMimeDetectionSelector

2017-01-11 Thread Jasper Hafkenscheid (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jasper Hafkenscheid updated TIKA-2237:
--
Description: 

{noformat}java.lang.UnsupportedOperationException
at java.util.AbstractList.set(AbstractList.java:132)
at 
org.apache.tika.mime.ProbabilisticMimeDetectionSelector.applyProbilities(ProbabilisticMimeDetectionSelector.java:241)
at 
org.apache.tika.mime.ProbabilisticMimeDetectionSelector.detect(ProbabilisticMimeDetectionSelector.java:190)
{noformat}

  was:
java.lang.UnsupportedOperationException
at java.util.AbstractList.set(AbstractList.java:132)
at 
org.apache.tika.mime.ProbabilisticMimeDetectionSelector.applyProbilities(ProbabilisticMimeDetectionSelector.java:241)
at 
org.apache.tika.mime.ProbabilisticMimeDetectionSelector.detect(ProbabilisticMimeDetectionSelector.java:190)


> UnsupportedOperationException due to SingletonList.set in 
> ProbabilisticMimeDetectionSelector
> 
>
> Key: TIKA-2237
> URL: https://issues.apache.org/jira/browse/TIKA-2237
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 1.14
>Reporter: Jasper Hafkenscheid
> Fix For: 1.15
>
>
> {noformat}java.lang.UnsupportedOperationException
>   at java.util.AbstractList.set(AbstractList.java:132)
>   at 
> org.apache.tika.mime.ProbabilisticMimeDetectionSelector.applyProbilities(ProbabilisticMimeDetectionSelector.java:241)
>   at 
> org.apache.tika.mime.ProbabilisticMimeDetectionSelector.detect(ProbabilisticMimeDetectionSelector.java:190)
> {noformat}





[jira] [Comment Edited] (TIKA-2237) UnsupportedOperationException due to SingletonList.set in ProbabilisticMimeDetectionSelector

2017-01-11 Thread Jasper Hafkenscheid (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818551#comment-15818551
 ] 

Jasper Hafkenscheid edited comment on TIKA-2237 at 1/11/17 2:57 PM:


Unit test that causes the exception to occur.
{code:java}
@Test
public void tikaTest() throws IOException {
    Metadata metadata = new Metadata();
    metadata.add(Metadata.CONTENT_TYPE, MediaType.text("javascript").toString());
    InputStream input = new ByteArrayInputStream(("function() {};\n" +
            "try {\n" +
            "window.location = 'index.html';\n" +
            "} catch (e) {\n" +
            "console.log(e);\n" +
            "}").getBytes(StandardCharsets.UTF_8));
    MediaType detect = new ProbabilisticMimeDetectionSelector().detect(input, metadata);
    assertEquals(MediaType.text("javascript"), detect);
}
{code}


was (Author: hafkensite):
Unit test that causes the exception to occur.
{code:java}
@Test
public void tikaTest() throws IOException {
    Metadata metadata = new Metadata();
    metadata.add(Metadata.CONTENT_TYPE, MediaType.text("javascript").toString());
    InputStream input = new ByteArrayInputStream(("function() {};\n" +
            "try {\n" +
            "window.location = \"index.html\";\n" +
            "} catch (e) {\n" +
            "console.log(e);\n" +
            "}").getBytes(StandardCharsets.UTF_8));
    MediaType detect = new ProbabilisticMimeDetectionSelector().detect(input, metadata);
    assertEquals(MediaType.text("javascript"), detect);
}
{code}

> UnsupportedOperationException due to SingletonList.set in 
> ProbabilisticMimeDetectionSelector
> 
>
> Key: TIKA-2237
> URL: https://issues.apache.org/jira/browse/TIKA-2237
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 1.14
>Reporter: Jasper Hafkenscheid
> Fix For: 1.15
>
>
> java.lang.UnsupportedOperationException
>   at java.util.AbstractList.set(AbstractList.java:132)
>   at 
> org.apache.tika.mime.ProbabilisticMimeDetectionSelector.applyProbilities(ProbabilisticMimeDetectionSelector.java:241)
>   at 
> org.apache.tika.mime.ProbabilisticMimeDetectionSelector.detect(ProbabilisticMimeDetectionSelector.java:190)





[jira] [Commented] (TIKA-2237) UnsupportedOperationException due to SingletonList.set in ProbabilisticMimeDetectionSelector

2017-01-11 Thread Jasper Hafkenscheid (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818551#comment-15818551
 ] 

Jasper Hafkenscheid commented on TIKA-2237:
---

Unit test that causes the exception to occur.
{code:java}
@Test
public void tikaTest() throws IOException {
    Metadata metadata = new Metadata();
    metadata.add(Metadata.CONTENT_TYPE, MediaType.text("javascript").toString());
    InputStream input = new ByteArrayInputStream(("function() {};\n" +
            "try {\n" +
            "window.location = \"index.html\";\n" +
            "} catch (e) {\n" +
            "console.log(e);\n" +
            "}").getBytes(StandardCharsets.UTF_8));
    MediaType detect = new ProbabilisticMimeDetectionSelector().detect(input, metadata);
    assertEquals(MediaType.text("javascript"), detect);
}
{code}
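
The stack trace in the issue ends in java.util.AbstractList.set, which is what 
happens when set() is called on the immutable list returned by 
Collections.singletonList.  The sketch below reproduces that JDK behaviour in 
isolation, without Tika on the classpath.  The class name 
SingletonListSetDemo is made up for this example, and copying into an 
ArrayList is shown only as one generic way to obtain a mutable list; it is not 
necessarily what the committed fix does.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical demo class; it only exercises JDK collections.
public class SingletonListSetDemo {

    /** Returns true if calling set(0, ...) on the list throws UnsupportedOperationException. */
    static boolean setThrows(List<String> list) {
        try {
            list.set(0, "replacement");
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Immutable single-element list: set() is inherited from AbstractList and throws.
        List<String> singleton = Collections.singletonList("text/javascript");
        System.out.println("singletonList set throws: " + setThrows(singleton));

        // A mutable copy accepts set() without complaint.
        List<String> copy = new ArrayList<>(singleton);
        System.out.println("ArrayList copy set throws: " + setThrows(copy));
    }
}
```

Run as written, the first call reports that set() throws and the second 
reports that it does not.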

> UnsupportedOperationException due to SingletonList.set in 
> ProbabilisticMimeDetectionSelector
> 
>
> Key: TIKA-2237
> URL: https://issues.apache.org/jira/browse/TIKA-2237
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 1.14
>Reporter: Jasper Hafkenscheid
> Fix For: 1.15
>
>
> java.lang.UnsupportedOperationException
>   at java.util.AbstractList.set(AbstractList.java:132)
>   at 
> org.apache.tika.mime.ProbabilisticMimeDetectionSelector.applyProbilities(ProbabilisticMimeDetectionSelector.java:241)
>   at 
> org.apache.tika.mime.ProbabilisticMimeDetectionSelector.detect(ProbabilisticMimeDetectionSelector.java:190)





[jira] [Commented] (TIKA-2232) Add JBIG2 image parsing support

2017-01-11 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818541#comment-15818541
 ] 

Hudson commented on TIKA-2232:
--

SUCCESS: Integrated in Jenkins build tika-2.x #193 (See 
[https://builds.apache.org/job/tika-2.x/193/])
TIKA-2232 -- add processing of jbig2 (with necessary non ASL 2.0 libs) 
(tallison: rev 0bc9bd89675d866b6ccd9e8b9e04ecfed8988544)
* (add) tika-test-resources/src/test/resources/test-documents/testPDF_JBIG2.pdf
* (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
* (edit) tika-parser-modules/tika-parser-multimedia-module/pom.xml
* (edit) 
tika-parser-modules/tika-parser-multimedia-module/src/main/java/org/apache/tika/parser/image/ImageParser.java
* (edit) CHANGES.txt
* (edit) 
tika-parser-modules/tika-parser-multimedia-module/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
* (edit) 
tika-parser-modules/tika-parser-multimedia-module/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
* (edit) 
tika-parser-modules/tika-parser-multimedia-module/src/test/java/org/apache/tika/parser/image/ImageParserTest.java
* (add) tika-test-resources/src/test/resources/test-documents/testJBIG2.jb2


> Add JBIG2 image parsing support
> ---
>
> Key: TIKA-2232
> URL: https://issues.apache.org/jira/browse/TIKA-2232
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.14
> Environment: Any
>Reporter: Pascal Essiembre
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 2.0, 1.15
>
>
> If you are interested, I would like to add support for JBIG2 image files 
> (.jb2, or .jbig2).  I have encountered them in PDFs.
> I will make a pull-request shortly.





[jira] [Commented] (TIKA-2232) Add JBIG2 image parsing support

2017-01-11 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818530#comment-15818530
 ] 

Hudson commented on TIKA-2232:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1174 (See 
[https://builds.apache.org/job/Tika-trunk/1174/])
TIKA-2232 add unit test for OCR of jbig2 embedded in PDF. (tallison: rev 
ba26f6ee01574702f5eaa56bf45aaf06e043d6df)
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/image/ImageParserTest.java


> Add JBIG2 image parsing support
> ---
>
> Key: TIKA-2232
> URL: https://issues.apache.org/jira/browse/TIKA-2232
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.14
> Environment: Any
>Reporter: Pascal Essiembre
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 2.0, 1.15
>
>
> If you are interested, I would like to add support for JBIG2 image files 
> (.jb2, or .jbig2).  I have encountered them in PDFs.
> I will make a pull-request shortly.





[jira] [Commented] (TIKA-2235) Use Tesseract's recommended DPI for PDF images

2017-01-11 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818522#comment-15818522
 ] 

Tim Allison commented on TIKA-2235:
---

Duh...you're the one who got jbig2 fixed in PDFBox...y, I guess you're using 
it. :)


> Use Tesseract's recommended DPI for PDF images
> --
>
> Key: TIKA-2235
> URL: https://issues.apache.org/jira/browse/TIKA-2235
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.14
>Reporter: Matthew Caruana Galizia
>Priority: Minor
>  Labels: ocr, pdf
> Fix For: 2.0, 1.15
>
>
> From the [Tesseract 
> wiki|https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality]:
> {quote}
> Tesseract works best on images which have a DPI of at least 300 dpi
> {quote}
> PDFParserConfig is currently initialised with a value of 200 for ocrDPI.
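
To give a feel for what moving from 200 to 300 dpi means for the images 
handed to Tesseract, here is a small arithmetic sketch.  It assumes a US 
Letter page (8.5 x 11 inches), which is not stated in the issue, and the class 
name OcrDpiSketch is made up; it does not call any Tika or PDFBox API.

```java
// Hypothetical arithmetic sketch; page size is an assumption, not from the issue.
public class OcrDpiSketch {

    /** Pixels along one page edge of the given length (in inches) at the given DPI. */
    static int pixels(double inches, int dpi) {
        return (int) Math.round(inches * dpi);
    }

    public static void main(String[] args) {
        double widthIn = 8.5;   // assumed US Letter width
        double heightIn = 11.0; // assumed US Letter height
        for (int dpi : new int[] {200, 300}) {
            int w = pixels(widthIn, dpi);
            int h = pixels(heightIn, dpi);
            System.out.println(dpi + " dpi -> " + w + " x " + h + " px ("
                    + ((long) w * h) + " pixels)");
        }
    }
}
```

Under that assumption a page goes from 1700 x 2200 to 2550 x 3300 pixels, 
i.e. 2.25 times as many pixels per rendered page, which is worth keeping in 
mind for memory use and OCR time.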





tika-2.x-windows - Build # 163 - Still Failing

2017-01-11 Thread Apache Jenkins Server
The Apache Jenkins build system has built tika-2.x-windows (build #163)

Status: Still Failing

Check console output at https://builds.apache.org/job/tika-2.x-windows/163/ to 
view the results.

[jira] [Resolved] (TIKA-2232) Add JBIG2 image parsing support

2017-01-11 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2232.
---
   Resolution: Fixed
Fix Version/s: 1.15
   2.0

Added PDF OCR test.  Tesseract can't process jbig2 directly.  For jbig2 images 
embedded in PDFs, though, if users go with option 2 for PDF OCR (using PDFBox 
to generate an image of the full page), the embedded jbig2 images are OCR'd.

Thank you, [~pascal.essiembre]!

Let's reopen if we want to check for jbig2 on classpath before ImageParser 
claims that it can handle jbig2.

> Add JBIG2 image parsing support
> ---
>
> Key: TIKA-2232
> URL: https://issues.apache.org/jira/browse/TIKA-2232
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.14
> Environment: Any
>Reporter: Pascal Essiembre
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 2.0, 1.15
>
>
> If you are interested, I would like to add support for JBIG2 image files 
> (.jb2 or .jbig2).  I have encountered them in PDFs.
> I will make a pull request shortly.





[jira] [Commented] (TIKA-2232) Add JBIG2 image parsing support

2017-01-11 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818403#comment-15818403
 ] 

Tim Allison commented on TIKA-2232:
---

Do we want to check for JBIG2 on classpath in ImageParser before ImageParser 
includes jbig2 in supported types?

Old behavior was that users could see that the EmptyParser was applied (I 
think?).  New behavior, if the jbig2 libs are not on the classpath, is that the 
image is processed by the ImageParser, but no metadata is added.

For PDFs, users without jbig2 on the classpath will receive a stacktrace for 
the missing library in their metadata...which makes sense.
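
A minimal sketch of what such a classpath check could look like, using only 
the JDK.  This is not Tika's ImageParser code: the class and method names are 
hypothetical, and the format name "jbig2" is an assumption about how a JBIG2 
ImageIO plugin would register itself.

```java
import javax.imageio.ImageIO;

// Hypothetical helper, not part of Tika: probes for optional image support.
final class ImageReaderCheckSketch {

    // True if some ImageIO plugin on the classpath registers a reader
    // under the given informal format name (for example "png").
    static boolean hasReaderFor(String formatName) {
        return ImageIO.getImageReadersByFormatName(formatName).hasNext();
    }

    // True if the named class can be located without initializing it.
    static boolean isClassPresent(String className) {
        try {
            Class.forName(className, false,
                    ImageReaderCheckSketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "jbig2" is an assumed format name; the result depends on the classpath.
        System.out.println("jbig2 reader available: " + hasReaderFor("jbig2"));
        System.out.println("png reader available: " + hasReaderFor("png"));
    }
}
```

A parser could consult a check like this once, when building its set of 
supported types, and leave jbig2 out if no reader is found.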

> Add JBIG2 image parsing support
> ---
>
> Key: TIKA-2232
> URL: https://issues.apache.org/jira/browse/TIKA-2232
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.14
> Environment: Any
>Reporter: Pascal Essiembre
>Assignee: Tim Allison
>Priority: Minor
>
> If you are interested, I would like to add support for JBIG2 image files 
> (.jb2 or .jbig2).  I have encountered them in PDFs.
> I will make a pull request shortly.





tika-2.x-windows - Build # 162 - Still Failing

2017-01-11 Thread Apache Jenkins Server
The Apache Jenkins build system has built tika-2.x-windows (build #162)

Status: Still Failing

Check console output at https://builds.apache.org/job/tika-2.x-windows/162/ to 
view the results.

[jira] [Commented] (TIKA-2235) Use Tesseract's recommended DPI for PDF images

2017-01-11 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818243#comment-15818243
 ] 

Hudson commented on TIKA-2235:
--

SUCCESS: Integrated in Jenkins build tika-2.x #192 (See 
[https://builds.apache.org/job/tika-2.x/192/])
TIKA-2235 -- bump default dpi for images created via PDF for OCR to 300 
(tallison: rev c14e75070f3b691fc54292a89395017014d12572)
* (edit) 
tika-parser-modules/tika-parser-multimedia-module/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java
* (edit) 
tika-parser-modules/tika-parser-multimedia-module/src/main/resources/org/apache/tika/parser/pdf/PDFParser.properties


> Use Tesseract's recommended DPI for PDF images
> --
>
> Key: TIKA-2235
> URL: https://issues.apache.org/jira/browse/TIKA-2235
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.14
>Reporter: Matthew Caruana Galizia
>Priority: Minor
>  Labels: ocr, pdf
> Fix For: 2.0, 1.15
>
>
> From the [Tesseract 
> wiki|https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality]:
> {quote}
> Tesseract works best on images which have a DPI of at least 300 dpi
> {quote}
> PDFParserConfig is currently initialised with a value of 200 for ocrDPI.





[jira] [Commented] (TIKA-2235) Use Tesseract's recommended DPI for PDF images

2017-01-11 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818225#comment-15818225
 ] 

Hudson commented on TIKA-2235:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1173 (See 
[https://builds.apache.org/job/Tika-trunk/1173/])
TIKA-2235 - set default dpi for OCR to 300 via Matthew Caruana Galizia 
(tallison: rev d1b1ad3d916a413e8fec20b5f68d20fa9b2c4ab6)
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java
* (edit) 
tika-parsers/src/main/resources/org/apache/tika/parser/pdf/PDFParser.properties


> Use Tesseract's recommended DPI for PDF images
> --
>
> Key: TIKA-2235
> URL: https://issues.apache.org/jira/browse/TIKA-2235
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.14
>Reporter: Matthew Caruana Galizia
>Priority: Minor
>  Labels: ocr, pdf
> Fix For: 2.0, 1.15
>
>
> From the [Tesseract 
> wiki|https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality]:
> {quote}
> Tesseract works best on images which have a DPI of at least 300 dpi
> {quote}
> PDFParserConfig is currently initialised with a value of 200 for ocrDPI.





[jira] [Created] (TIKA-2236) Upgrade to PDFBox 2.0.5 when available

2017-01-11 Thread Tim Allison (JIRA)
Tim Allison created TIKA-2236:
-

 Summary: Upgrade to PDFBox 2.0.5 when available
 Key: TIKA-2236
 URL: https://issues.apache.org/jira/browse/TIKA-2236
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor


Upgrade when available.





[jira] [Commented] (TIKA-2236) Upgrade to PDFBox 2.0.5 when available

2017-01-11 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818196#comment-15818196
 ] 

Tim Allison commented on TIKA-2236:
---

Remember to clean up jbig2 metadata/suffix stuff from TIKA-2232.

> Upgrade to PDFBox 2.0.5 when available
> --
>
> Key: TIKA-2236
> URL: https://issues.apache.org/jira/browse/TIKA-2236
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
>
> Upgrade when available.





[jira] [Commented] (TIKA-2235) Use Tesseract's recommended DPI for PDF images

2017-01-11 Thread Matthew Caruana Galizia (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818176#comment-15818176
 ] 

Matthew Caruana Galizia commented on TIKA-2235:
---

Yes, I am already! Thanks for linking me to that. It's good that that pull 
request adds metadata support for JBIG2, but would it not be better to wait for 
the PDFBox 2.0.5 release (which I'm assuming is soon) instead of adding todos?

> Use Tesseract's recommended DPI for PDF images
> --
>
> Key: TIKA-2235
> URL: https://issues.apache.org/jira/browse/TIKA-2235
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.14
>Reporter: Matthew Caruana Galizia
>Priority: Minor
>  Labels: ocr, pdf
> Fix For: 2.0, 1.15
>
>
> From the [Tesseract 
> wiki|https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality]:
> {quote}
> Tesseract works best on images which have a DPI of at least 300 dpi
> {quote}
> PDFParserConfig is currently initialised with a value of 200 for ocrDPI.





tika-2.x-windows - Build # 161 - Still Failing

2017-01-11 Thread Apache Jenkins Server
The Apache Jenkins build system has built tika-2.x-windows (build #161)

Status: Still Failing

Check console output at https://builds.apache.org/job/tika-2.x-windows/161/ to 
view the results.

[jira] [Comment Edited] (TIKA-2235) Use Tesseract's recommended DPI for PDF images

2017-01-11 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818136#comment-15818136
 ] 

Tim Allison edited comment on TIKA-2235 at 1/11/17 12:13 PM:
-

Thank you!

Btw...did you notice TIKA-2232 via [~pascal.essiembre]?  Make sure to add jbig2 
dependencies to classpath...if you aren't already. :)


was (Author: talli...@mitre.org):
Thank you!

Btw...did you notice TIKA-2232 via [~pascal.essiembre]?  Make sure to add jpx 
dependencies to classpath...if you aren't already. :)

> Use Tesseract's recommended DPI for PDF images
> --
>
> Key: TIKA-2235
> URL: https://issues.apache.org/jira/browse/TIKA-2235
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.14
>Reporter: Matthew Caruana Galizia
>Priority: Minor
>  Labels: ocr, pdf
> Fix For: 2.0, 1.15
>
>
> From the [Tesseract 
> wiki|https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality]:
> {quote}
> Tesseract works best on images which have a DPI of at least 300 dpi
> {quote}
> PDFParserConfig is currently initialised with a value of 200 for ocrDPI.





[jira] [Resolved] (TIKA-2235) Use Tesseract's recommended DPI for PDF images

2017-01-11 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2235.
---
   Resolution: Fixed
Fix Version/s: 1.15
   2.0

Thank you!

Btw...did you notice TIKA-2232 via [~pascal.essiembre]?  Make sure to add jpx 
dependencies to classpath...if you aren't already. :)

> Use Tesseract's recommended DPI for PDF images
> --
>
> Key: TIKA-2235
> URL: https://issues.apache.org/jira/browse/TIKA-2235
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.14
>Reporter: Matthew Caruana Galizia
>Priority: Minor
>  Labels: ocr, pdf
> Fix For: 2.0, 1.15
>
>
> From the [Tesseract 
> wiki|https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality]:
> {quote}
> Tesseract works best on images which have a DPI of at least 300 dpi
> {quote}
> PDFParserConfig is currently initialised with a value of 200 for ocrDPI.





[jira] [Created] (TIKA-2235) Use Tesseract's recommended DPI for PDF images

2017-01-11 Thread Matthew Caruana Galizia (JIRA)
Matthew Caruana Galizia created TIKA-2235:
-

 Summary: Use Tesseract's recommended DPI for PDF images
 Key: TIKA-2235
 URL: https://issues.apache.org/jira/browse/TIKA-2235
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.14
Reporter: Matthew Caruana Galizia
Priority: Minor


From the [Tesseract 
wiki|https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality]:

{quote}
Tesseract works best on images which have a DPI of at least 300 dpi
{quote}

PDFParserConfig is currently initialised with a value of 200 for ocrDPI.
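
To give a rough sense of what moving from 200 to 300 DPI means for the 
rendered page image, here is a small arithmetic sketch.  It assumes the usual 
PDF convention of 72 points per inch and a US Letter page (612 x 792 points); 
the class and method names are hypothetical and this is not Tika or PDFBox 
code.

```java
// Hypothetical helper, not part of Tika: converts page sizes in PDF points
// to pixel sizes at a given rendering resolution.
final class OcrDpiSketch {

    // PDF user space is conventionally 72 points per inch.
    static final double POINTS_PER_INCH = 72.0;

    // Pixel length of a span given in points, rendered at the given DPI.
    static int pixelsAt(double lengthInPoints, int dpi) {
        return (int) Math.round(lengthInPoints / POINTS_PER_INCH * dpi);
    }

    public static void main(String[] args) {
        double widthPt = 612;   // US Letter width (assumed page size)
        double heightPt = 792;  // US Letter height (assumed page size)
        for (int dpi : new int[] {200, 300}) {
            System.out.printf("%d dpi -> %d x %d px%n",
                    dpi, pixelsAt(widthPt, dpi), pixelsAt(heightPt, dpi));
        }
        // prints:
        // 200 dpi -> 1700 x 2200 px
        // 300 dpi -> 2550 x 3300 px
    }
}
```

Under these assumptions the 300 DPI image has 2.25 times as many pixels as 
the 200 DPI one, so the higher default trades memory and processing time for 
OCR quality.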





tika-2.x-windows - Build # 160 - Still Failing

2017-01-11 Thread Apache Jenkins Server
The Apache Jenkins build system has built tika-2.x-windows (build #160)

Status: Still Failing

Check console output at https://builds.apache.org/job/tika-2.x-windows/160/ to 
view the results.

tika-2.x-windows - Build # 159 - Still Failing

2017-01-11 Thread Apache Jenkins Server
The Apache Jenkins build system has built tika-2.x-windows (build #159)

Status: Still Failing

Check console output at https://builds.apache.org/job/tika-2.x-windows/159/ to 
view the results.

tika-2.x-windows - Build # 158 - Still Failing

2017-01-11 Thread Apache Jenkins Server
The Apache Jenkins build system has built tika-2.x-windows (build #158)

Status: Still Failing

Check console output at https://builds.apache.org/job/tika-2.x-windows/158/ to 
view the results.

tika-2.x-windows - Build # 157 - Still Failing

2017-01-11 Thread Apache Jenkins Server
The Apache Jenkins build system has built tika-2.x-windows (build #157)

Status: Still Failing

Check console output at https://builds.apache.org/job/tika-2.x-windows/157/ to 
view the results.