[jira] [Created] (TIKA-1149) 12% performance improvement by caching in CompositeParser

2013-07-22 Thread Luca Della Toffola (JIRA)
Luca Della Toffola created TIKA-1149:


 Summary: 12% performance improvement by caching in CompositeParser
 Key: TIKA-1149
 URL: https://issues.apache.org/jira/browse/TIKA-1149
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.4, 1.3
Reporter: Luca Della Toffola
Priority: Minor


We found an easy way to improve Tika's performance. The idea is to avoid 
recomputing the parsers map on every call to 
CompositeParser.getParsers(...) when the context is empty, and to cache the 
returned value instead. 
This can be done safely even if the media-registry and the list of component 
parsers change while Tika is executing, by invalidating the cache in that 
case.
Our attached patch computes the parsers map once per instance of 
CompositeParser.
The patch checks for the case where the context is empty, and the 
corresponding setters invalidate the cache when the media-registry or the 
list of component parsers changes.
For example, when running Tika 1.3 on a set of large (~50k classes) JAR files 
(i.e., Java class library + Tika app + other apps), the patch reduces the 
running time
from 32 seconds to 29 seconds -- i.e., a speedup of ~12%. Speedups of the same 
order of magnitude are also found for smaller workloads.
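
The caching idea described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the attached patch: CachingComposite, Component, contextIsEmpty and rebuilds are hypothetical names, and media types are modelled as plain strings. A media-registry setter would presumably invalidate the cache in the same way as the component setter shown here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a composite parser; not Tika's actual class.
class CachingComposite {
    // Hypothetical component: a name plus the media types it supports.
    static final class Component {
        final String name;
        final List<String> types;

        Component(String name, List<String> types) {
            this.name = name;
            this.types = types;
        }
    }

    private List<Component> components = new ArrayList<>();
    private Map<String, Component> cache; // null means "not computed yet"
    int rebuilds = 0; // counts map constructions, for demonstration only

    // The setter drops the cached map so the next lookup recomputes it.
    synchronized void setComponents(List<Component> components) {
        this.components = new ArrayList<>(components);
        this.cache = null;
    }

    // contextIsEmpty models "the context is empty"; otherwise the cache is bypassed.
    synchronized Map<String, Component> getParsers(boolean contextIsEmpty) {
        if (!contextIsEmpty) {
            return build();
        }
        if (cache == null) {
            cache = build();
        }
        return cache;
    }

    // Builds the type-to-component map from scratch.
    private Map<String, Component> build() {
        rebuilds++;
        Map<String, Component> map = new HashMap<>();
        for (Component c : components) {
            for (String type : c.types) {
                map.put(type, c); // later components override earlier ones
            }
        }
        return map;
    }
}
```

Note that a cache like this only sees changes that go through the setters; a subclass whose set of parsers changes by other means would not be covered by this sketch.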


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (TIKA-1149) 12% performance improvement by caching in CompositeParser

2013-07-22 Thread Luca Della Toffola (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Della Toffola updated TIKA-1149:
-

Attachment: CompositeParser.patch
ParseContext.patch

 12% performance improvement by caching in CompositeParser
 -

 Key: TIKA-1149
 URL: https://issues.apache.org/jira/browse/TIKA-1149
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.3, 1.4
Reporter: Luca Della Toffola
Priority: Minor
  Labels: performance
 Attachments: CompositeParser.patch, ParseContext.patch




[jira] [Commented] (TIKA-1149) 12% performance improvement by caching in CompositeParser

2013-07-22 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13715180#comment-13715180
 ] 

Jukka Zitting commented on TIKA-1149:
-

Note that for example {{DefaultParser.getParsers(ParseContext)}} can return a 
different set of parsers on each invocation, thanks to the dynamic service 
lookup mechanism in {{ServiceLoader}}. Thus caching the return value can lead 
to incorrect behavior.

An alternative optimization would be to refactor the 
{{CompositeParser.getParser(Metadata, ParseContext)}} method so that it doesn't 
need to always instantiate the full type-parser map. Instead it could for 
example restrict the search to only the specified type and its supertypes.
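
A rough sketch of the alternative suggested above, restricting the search to the requested type and its supertypes instead of building the full type-parser map. Everything here is hypothetical and only illustrates the lookup order: SupertypeLookup, Component, addSupertype and findParser are made-up names, media types are plain strings, and the "newest registration wins" rule is an assumption rather than Tika's documented behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; the type hierarchy and parser names are made up.
class SupertypeLookup {
    // Hypothetical component: a name plus the set of media types it supports.
    static final class Component {
        final String name;
        final Set<String> types;

        Component(String name, String... types) {
            this.name = name;
            this.types = new HashSet<>(Arrays.asList(types));
        }
    }

    private final List<Component> components = new ArrayList<>();
    private final Map<String, String> supertypeOf = new HashMap<>(); // child -> parent

    void addComponent(Component c) {
        components.add(c);
    }

    void addSupertype(String type, String supertype) {
        supertypeOf.put(type, supertype);
    }

    // Walks from the given type up through its supertypes and returns the first
    // component that supports one of them, or null if none does.
    Component findParser(String type) {
        int hops = 0; // guards against cycles in the supertype map
        for (String t = type; t != null && hops <= supertypeOf.size();
                t = supertypeOf.get(t), hops++) {
            // Assumption: scan newest-first so later registrations win.
            for (int i = components.size() - 1; i >= 0; i--) {
                if (components.get(i).types.contains(t)) {
                    return components.get(i);
                }
            }
        }
        return null;
    }
}
```

Each lookup touches only the types on the supertype chain, so no complete map is instantiated per call; the trade-off is a scan over the components for each type on that chain.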



[jira] [Created] (TIKA-1150) Extract text from textbox in XLSX

2013-07-22 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1150:
-

 Summary: Extract text from textbox in XLSX
 Key: TIKA-1150
 URL: https://issues.apache.org/jira/browse/TIKA-1150
 Project: Tika
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.4
Reporter: Tim Allison
Priority: Minor


The underlying POI library doesn't appear to support easy extraction of text 
from text boxes in XLSX files. My personal preference would be to wait for 
modifications in POI and then make a few small changes to Tika to run 
XSSFTextBox code.



[jira] [Updated] (TIKA-1150) Extract text from textbox in XLSX

2013-07-22 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1150:
--

Attachment: testEXCEL_textbox.xlsx

Simple file that shows issue.



Re: Tika Core and Parsers Test Artifacts

2013-07-22 Thread Ray Gauss II
Hi Ken, 

Yes, by other tika projects I meant tika-app, tika-bundle, tika-xmp, etc., and 
yes each sub-project would end up with its own test-jar.

It probably makes more sense to just add the plugin to each project 
individually.

Since there's been no opposition to the concept in general I'll create a JIRA 
issue where we can discuss the details.

Regards,

Ray


On Jul 21, 2013, at 3:25 PM, Ken Krugler kkrugler_li...@transpac.com wrote:

 Hi Ray,
 
 On Jul 18, 2013, at 6:37am, Ray Gauss II wrote:
 
 Hi Ken,
 
 They recommend <type>test-jar</type> instead of classifier now [1], but yes.
 
 Thanks for the reference.
 
 Perhaps the other tika projects could benefit from this as well and it could 
 just go into tika-parent's build plugins.
 
 By other tika projects do you mean things like tika-app?
 
 And if it's in the tika-parent's build plugins, does that mean each 
 sub-project would wind up with its own corresponding test-jar?
 
 Thanks,
 
 -- Ken
 
 [1] http://maven.apache.org/guides/mini/guide-attached-tests.html
 
 
 On Jul 18, 2013, at 9:19 AM, Ken Krugler kkrugler_li...@transpac.com wrote:
 
 Hi Ray,
 
 On Jul 18, 2013, at 5:14am, Ray Gauss II wrote:
 
 I don't recall if we've discussed this already (I did do a brief search 
 and didn't see anything).
 
 Is there any opposition to adding test-jar Maven artifacts for tika-core 
 and tika-parsers?
 
 Seems like it would be good to allow others to extend from tests there if 
 need be.
 
 +1
 
 I assume you're talking about adding a 
 tika-(core|parsers)-version-tests.jar, so that we'd pull it in via:
 
 <dependency>
   <groupId>org.apache.tika</groupId>
   <artifactId>tika-parsers</artifactId>
   <version>1.4</version>
   <classifier>tests</classifier>
   <scope>test</scope>
 </dependency>
 
 -- Ken
 
 --
 Ken Krugler
 +1 530-210-6378
 http://www.scaleunlimited.com
 custom big data solutions & training
 Hadoop, Cascading, Cassandra & Solr
 
 
 
 
 
 
 



[jira] [Created] (TIKA-1151) Maven Build Should Automatically Produce test-jar Artifacts

2013-07-22 Thread Ray Gauss II (JIRA)
Ray Gauss II created TIKA-1151:
--

 Summary: Maven Build Should Automatically Produce test-jar 
Artifacts
 Key: TIKA-1151
 URL: https://issues.apache.org/jira/browse/TIKA-1151
 Project: Tika
  Issue Type: Improvement
  Components: packaging
Reporter: Ray Gauss II
Assignee: Ray Gauss II


The Maven build should be updated to produce test jar artifacts for appropriate 
sub-projects (see below) such that developers can extend test classes by adding 
the {{test-jar}} artifact as a dependency, i.e.:
{code}
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.5-SNAPSHOT</version>
  <type>test-jar</type>
  <scope>test</scope>
</dependency>
{code}

The following sub-projects contain tests that developers might want to extend 
and their corresponding {{pom.xml}} should have the [attached 
tests|http://maven.apache.org/guides/mini/guide-attached-tests.html] added:
- tika-app
- tika-bundle
- tika-core
- tika-parsers
- tika-server
- tika-xmp


