[jira] [Commented] (TIKA-2620) Set sys property to get better rendering speed by default

2018-04-02 Thread Ewan Mellor (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16422984#comment-16422984
 ] 

Ewan Mellor commented on TIKA-2620:
---

See TIKA-2624.  I think the statement about 300 DPI from [~lfcnassif] is not 
quite correct, and the behavior is more complicated than it's meant to be.


> Set sys property to get better rendering speed by default
> -
>
> Key: TIKA-2620
> URL: https://issues.apache.org/jira/browse/TIKA-2620
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Trivial
> Fix For: 1.18, 2.0.0
>
>
> After upgrading to PDFBox 2.0.9, we now get a logged warning:
> {noformat}
> INFO  To get higher rendering speed on JDK8 or later,
> INFO    use the option -Dsun.java2d.cmm=sun.java2d.cmm.kcms.KcmsServiceProvider
> INFO    or call System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider")
> {noformat}
> Unless there are objections, I'll add a static call to the PDFParser to 
> {{System.setProperty...}}.  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-2620) Set sys property to get better rendering speed by default

2018-04-02 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16422239#comment-16422239
 ] 

Tilman Hausherr commented on TIKA-2620:
---

The subsampling happens when decoding, but it obviously influences rendering 
as well. The worst case would be a fine horizontal or vertical line that goes 
missing.
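
To illustrate the worst case described above, here is a minimal sketch. It assumes the simplest possible subsampling (keep every n-th row and column), which is not necessarily what PDFBox does internally; the class and method names are hypothetical.

```java
final class SubsampleLineDemo {

    /** Naive subsampling (assumption): keep every factor-th row and column. */
    public static boolean[][] subsample(boolean[][] src, int factor) {
        int h = (src.length + factor - 1) / factor;
        int w = (src[0].length + factor - 1) / factor;
        boolean[][] dst = new boolean[h][w];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                dst[y][x] = src[y * factor][x * factor];
            }
        }
        return dst;
    }

    /** Returns true if any pixel in the image is set. */
    public static boolean hasInk(boolean[][] img) {
        for (boolean[] row : img) {
            for (boolean px : row) {
                if (px) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        boolean[][] page = new boolean[8][8];
        for (int x = 0; x < 8; x++) {
            page[3][x] = true; // one-pixel horizontal line on row 3
        }
        System.out.println(hasInk(page));               // true
        System.out.println(hasInk(subsample(page, 2))); // false: only rows 0, 2, 4, 6 survive
    }
}
```

A subsampler that averages neighbouring pixels would fade the line rather than drop it, so how bad this gets depends on the decoder.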



[jira] [Commented] (TIKA-2620) Set sys property to get better rendering speed by default

2018-04-02 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16422231#comment-16422231
 ] 

Luis Filipe Nassif commented on TIKA-2620:
--

Hi [~tilman]. When rendering PDFs to images before OCR, our default is to use 
300 dpi. If the image is bigger than that, it will be scaled down at the end. 
Reading PDFBOX-4137, I understood that images will be subsampled before being 
decoded rather than when rendering, possibly saving a lot of memory. Or am I 
wrong?

Thanks
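
For a rough sense of the memory involved, here is a back-of-the-envelope sketch. The page size (US Letter), the 600 dpi embedded scan, the 3 bytes per pixel and the subsampling factor of 2 are all assumptions chosen for illustration, not figures from PDFBOX-4137; the class name is hypothetical.

```java
final class RenderSizeEstimate {

    /** Pixels along one edge for a length in PDF points (1/72 inch) at the given dpi. */
    public static int pixels(double points, int dpi) {
        return (int) Math.round(points / 72.0 * dpi);
    }

    /** Bytes needed for an uncompressed raster. */
    public static long rasterBytes(int width, int height, int bytesPerPixel) {
        return (long) width * height * bytesPerPixel;
    }

    public static void main(String[] args) {
        // Assumption: US Letter page, 612 x 792 points, rendered at 300 dpi for OCR.
        int w300 = pixels(612, 300); // 2550
        int h300 = pixels(792, 300); // 3300

        // Assumption: the page embeds a full-page RGB scan at 600 dpi.
        int w600 = pixels(612, 600); // 5100
        int h600 = pixels(792, 600); // 6600

        long full = rasterBytes(w600, h600, 3);               // decoded at full size
        long subsampled = rasterBytes(w600 / 2, h600 / 2, 3); // decoded with factor 2

        System.out.println(w300 + " x " + h300);  // 2550 x 3300
        System.out.println(full);                 // 100980000
        System.out.println(subsampled);           // 25245000
    }
}
```

Under these assumptions, decoding with a factor of 2 already matches the 300 dpi target and needs about a quarter of the memory.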



[jira] [Commented] (TIKA-2620) Set sys property to get better rendering speed by default

2018-03-30 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420503#comment-16420503
 ] 

Tilman Hausherr commented on TIKA-2620:
---

In most cases subsampling shouldn't be used. It might degrade OCR. Image 
extraction wouldn't get the best image.



[jira] [Commented] (TIKA-2620) Set sys property to get better rendering speed by default

2018-03-30 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420454#comment-16420454
 ] 

Tim Allison commented on TIKA-2620:
---

+1



[jira] [Commented] (TIKA-2620) Set sys property to get better rendering speed by default

2018-03-30 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420423#comment-16420423
 ] 

Luis Filipe Nassif commented on TIKA-2620:
--

Maybe we should add another option to allow configuring image subsampling? See 
PDFBOX-4137.



[jira] [Commented] (TIKA-2620) Set sys property to get better rendering speed by default

2018-03-30 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420375#comment-16420375
 ] 

Tim Allison commented on TIKA-2620:
---

Thank you [~ewanmellor-2] and [~tilman]!  I've split the difference for now.  
I've added configurability via tika-config.xml in case anyone doesn't have 
control over the system properties in their framework.  The default is 
{{false}}.



[jira] [Commented] (TIKA-2620) Set sys property to get better rendering speed by default

2018-03-29 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16419784#comment-16419784
 ] 

Hudson commented on TIKA-2620:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1463 (See 
[https://builds.apache.org/job/Tika-trunk/1463/])
TIKA-2620 allow configuration of setting KCMS (tallison: 
[https://github.com/apache/tika/commit/4cdb330526e7001997f939271b5185d2efb6451c])
* (edit) tika-parsers/src/main/resources/org/apache/tika/parser/pdf/PDFParser.properties
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java




[jira] [Commented] (TIKA-2620) Set sys property to get better rendering speed by default

2018-03-29 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16419770#comment-16419770
 ] 

Hudson commented on TIKA-2620:
--

UNSTABLE: Integrated in Jenkins build tika-2.x-windows #227 (See 
[https://builds.apache.org/job/tika-2.x-windows/227/])
TIKA-2620 allow configuration of setting KCMS (tallison: rev 
4cdb330526e7001997f939271b5185d2efb6451c)
* (edit) tika-parsers/src/main/resources/org/apache/tika/parser/pdf/PDFParser.properties
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java




[jira] [Commented] (TIKA-2620) Set sys property to get better rendering speed by default

2018-03-29 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16419687#comment-16419687
 ] 

Hudson commented on TIKA-2620:
--

FAILURE: Integrated in Jenkins build tika-branch-1x #14 (See 
[https://builds.apache.org/job/tika-branch-1x/14/])
TIKA-2620 allow configuration of setting KCMS (tallison: 
[https://github.com/apache/tika/commit/fc718f436b2d9676321a65c4f2d6009e23303ef6])
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
* (edit) tika-parsers/src/main/resources/org/apache/tika/parser/pdf/PDFParser.properties




[jira] [Commented] (TIKA-2620) Set sys property to get better rendering speed by default

2018-03-29 Thread Ewan Mellor (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16419501#comment-16419501
 ] 

Ewan Mellor commented on TIKA-2620:
---

[https://bugs.openjdk.java.net/browse/JDK-8041125]

This showed a slowdown of 3x in the ColorConvertOp.filter method from JDK 7 to 
8, but was closed as "Won't Fix" with the explanation: "This is a consequence 
of switching to LittleCMS. It may be difficult or even impossible to get back 
to the previous performance but the trade-off is we have a more modern and 
maintained library."

So if I understand correctly, KCMS is faster but unmaintained and deprecated.

There are security and stability risks in using an unmaintained codepath, so I 
would vote against setting it by default inside Tika.  It should definitely go 
in the documentation though.

 



[jira] [Commented] (TIKA-2620) Set sys property to get better rendering speed by default

2018-03-29 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16419487#comment-16419487
 ] 

Tilman Hausherr commented on TIKA-2620:
---

[~gagravarr] KCMS is the legacy setting. It is much faster.

It is the only option up to JDK 7; in JDK 8 and 9 it is optional; in JDK 10 it 
no longer exists. So anybody using JDK 10 on PDFs with many images will have 
to wait a lot.

[~talli...@mitre.org] the only reason not to set it is if somebody wants the 
new CMM (LittleCMS) for their own application that uses Tika.

My suggestion: make a setting in tika config like "setKCMS" that is true by 
default. Read that setting and if it is set, then do the call that is in the 
INFO message.
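
A minimal sketch of that suggestion follows. The class and method names are hypothetical and this is not the actual Tika code; the JDK version check is an extra assumption based on the version notes above, and the property presumably has to be set before Java 2D first initialises its colour management.

```java
final class KcmsSetting {

    static final String CMM_PROPERTY = "sun.java2d.cmm";
    static final String KCMS_PROVIDER = "sun.java2d.cmm.kcms.KcmsServiceProvider";

    /** True for Java 8 and 9 only, where KCMS is described above as optional. */
    public static boolean kcmsIsOptional(String specVersion) {
        return "1.8".equals(specVersion) || "9".equals(specVersion);
    }

    /**
     * Makes the call from the INFO message if the (hypothetical) setKCMS flag is on
     * and the JDK is one where the switch applies. Returns whether the property was set.
     */
    public static boolean applyIfEnabled(boolean setKCMS, String specVersion) {
        if (!setKCMS || !kcmsIsOptional(specVersion)) {
            return false;
        }
        System.setProperty(CMM_PROPERTY, KCMS_PROVIDER);
        return true;
    }

    public static void main(String[] args) {
        boolean setKCMS = true; // stand-in for a value read from the Tika config
        String version = System.getProperty("java.specification.version");
        System.out.println("KCMS property set: " + applyIfEnabled(setKCMS, version));
    }
}
```

Leaving the property untouched when the flag is off keeps the choice of CMM with an embedding application that wants LittleCMS.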



[jira] [Commented] (TIKA-2620) Set sys property to get better rendering speed by default

2018-03-29 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16418968#comment-16418968
 ] 

Tim Allison commented on TIKA-2620:
---

[~tilman], any recommendations?  We do have an option to render pages and then 
run OCR on those rendered pages...is there any reason we shouldn't set this 
statically in our PDFParser?  Better to put it in app+server?  Thank you!



[jira] [Commented] (TIKA-2620) Set sys property to get better rendering speed by default

2018-03-29 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16418961#comment-16418961
 ] 

Nick Burch commented on TIKA-2620:
--

Do you know why Oracle haven't set that by default?

If there's no good reason for it to not be set by default, I'm happy with us 
putting it in PDFParser. If there is a good reason, maybe just put it in the 
Tika App and Server, but leave it off for normal Java users to decide 
themselves?
