[ https://issues.apache.org/jira/browse/TIKA-3429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17355185#comment-17355185 ]
Nick Burch commented on TIKA-3429:
----------------------------------
Most parts of Tika need the mime entries loaded. Even if you are doing parsing
rather than detection, we still need the mime hierarchy that comes from the
mime file for certain parsing-related operations.

However, Tika should only load the mime xml files if you create a TikaConfig,
DefaultDetector or Tika object. As long as you don't create one of those during
startup of your app, the mime xml file won't be loaded or processed. If Tika
isn't part of your normal use-case and is only needed in special cases, you
should be fine to lazy load the Tika or TikaConfig object on demand.
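
As a rough illustration of lazy loading on demand, here is a minimal sketch of a generic holder that only builds its value on first use. The `Lazy` class and the `LazyTikaSketch` demo are hypothetical names made up for this example, not part of Tika, and the demo uses a plain string as a stand-in so that it runs without Tika on the classpath.

```java
import java.util.function.Supplier;

/** Hypothetical helper: builds its value on the first get() call only. */
final class Lazy<T> {
    private final Supplier<T> factory;
    private volatile T value; // stays null until first use

    Lazy(Supplier<T> factory) {
        this.factory = factory;
    }

    T get() {
        T result = value;
        if (result == null) {
            // Double-checked locking so concurrent callers build the value once.
            // Note: a factory that returns null would be invoked again next time.
            synchronized (this) {
                result = value;
                if (result == null) {
                    result = factory.get();
                    value = result;
                }
            }
        }
        return result;
    }
}

final class LazyTikaSketch {
    public static void main(String[] args) {
        // Stand-in for an expensive object; nothing is built at this point.
        Lazy<String> expensive = new Lazy<>(() -> {
            System.out.println("loading (happens once, on first use)");
            return "ready";
        });

        System.out.println("startup finished, nothing loaded yet");
        System.out.println(expensive.get()); // triggers the load
        System.out.println(expensive.get()); // reuses the cached value
    }
}
```

With tika-core on the classpath, the same holder could presumably wrap the facade as `new Lazy<>(Tika::new)`, so that code paths which never call `get()` (for example, erroring out on missing options) never construct the Tika object.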

Trying on my laptop, running the Tika CLI App in detect mode to detect a small
text file consistently comes in at under 1 second, with 800ms the average.
That includes loading the mime xml file, loading all the parsers and detectors,
loading all their dependencies (and the Tika CLI App comes with pretty much all
the dependencies baked in!), and doing detection. Oh, and starting + stopping
the JVM. I'm not sure where the other 8 or so seconds are coming from in your 9
second startup, but I do rather suspect it isn't Tika.

> Performance problems partially caused by tika eagerly loading configuration
> ---------------------------------------------------------------------------
>
> Key: TIKA-3429
> URL: https://issues.apache.org/jira/browse/TIKA-3429
> Project: Tika
> Issue Type: New Feature
> Reporter: Caleb Cushing
> Priority: Major
>
> referencing
> https://github.com/spring-projects/spring-boot/issues/26709#issuecomment-851953515
> {quote}
> the tika configuration (eagerly loading a 7K lines XML file)
> {quote}
> Here's the text of that issue
> I'm not sure the problem is spring boot, but I'm having problems finding it.
> The jar currently takes 3 seconds (9 if I leave out tiered) to run on my
> system, just to error out due to missing options and do nothing.
> https://github.com/xenoterracide/brix/tree/8e3d86bcf773e564cc24b51572b0bbd8bb60b73f
> {code}
> time java -Xverify:none -XX:TieredStopAtLevel=1 -jar modules/app/build/libs/app-0.1.0.jar
> # brix -> ccushing/copy-5-1
> Missing required parameters: '<language>', '<moduleType>', '<project>'
> Usage: <main class> [--repo=<repo>] [--workdir=<workdir>] <language>
> <moduleType> <project> [COMMAND]
> <language> The programming language you're generating code
> for. Directory under --dir
> <moduleType> The type of code you're generating e.g controller,
> also the name of the config file without the
> extension.
> <project> The name of the project you're generating code
> for.
> The name of the module to be created within the
> project.
> --repo=<repo> Repository path from the current working
> directory.
> Templates and configs are looked up relative to
> here. If the config isn't found here, then we
> will search ~/.config/brix
> --workdir=<workdir> The working directory you want your destination
> paths to be relative to. Defaults to current
> working directory
> Default:
> Commands:
> run
> java -Xverify:none -XX:TieredStopAtLevel=1 -jar 3.15s user 0.26s system 142% cpu 2.386 total
> {code}
> Since it's a CLI app, lazy init isn't helpful. This is worded like a question
> (one that really would not be suitable for Stack Overflow; I hate that SO is
> the support forum for things now. It's terrible because of the attitude of
> people that the objective is not to help people, and it's bad at getting
> answers for harder problems. Spring should get a Discourse or something
> again), but I also know I had a Tika CLI app in the past that loaded in less
> than 1s without tiered, so I'm also concerned it's a Spring Boot bug. I'm
> going to connect a profiler later to see what I can find, but I'm not sure
> that will do it.
> {code}
> Fedora 33
> 5.11.16-200.fc33.x86_64
> 14:08:34 up 3 days, 2:04, 1 user, load average: 0.79, 1.10, 1.66
>               total        used        free      shared  buff/cache   available
> Mem:            15G         11G        1.0G        1.4G        3.0G        2.3G
> Swap:           12G        1.5G         10G
> {code}