[
https://issues.apache.org/jira/browse/ANY23-336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16416856#comment-16416856
]
Hans Brende commented on ANY23-336:
-----------------------------------
A little more info on the two branches:
I'd really recommend using the first branch, because it will be faster. The
reality is, all the key-value pairs in your delegating cache are placed there
*by* the {{CachingHttpClient}} via the [putEntry(key,
entry)|https://github.com/jsonld-java/jsonld-java/blob/fd4b95f6451586f10705682a88e68b571ecee610/core/src/main/java/com/github/jsonldjava/utils/JarCacheStorage.java#L87]
method, and it wants to be able to access them quickly again. Those key-value
pairs will _not_ correspond to paths on your classpath: the {{CachingHttpClient}}
only caches something there after it had to fetch it via HTTP, which happens
only when the key previously returned null, and you'll never return null for
paths on the classpath since they are always present.
Keep in mind: retrieving a value from the delegate cache is by no means a
"live" operation. You're not "delegating" any queries to an HTTP request! The
delegate cache is nothing more than a glorified HashMap, which the
{{CachingHttpClient}} occasionally asks you to dump stuff into, and asks you to
keep it cached until it asks for it again (and if you don't give it back, THEN
it requests the same resource *again* via HTTP, and *again* dumps the result
into your glorified HashMap and asks you to keep it cached.)
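To make that cycle concrete, here is a minimal sketch of it in plain Java. It is
not the real {{CachingHttpClient}} API: the class {{CachingClientSketch}}, its
{{get}} method, the {{httpFetch}} function, and the example URL are all
hypothetical stand-ins, and a plain {{Map}} plays the role of the cache storage.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical stand-in for a caching client; NOT the real CachingHttpClient API. */
final class CachingClientSketch {
    // The "glorified HashMap": storage the client reads from and writes to.
    private final Map<String, String> storage;
    // Stand-in for a real HTTP request (assumption: returns the response body).
    private final Function<String, String> httpFetch;
    // Counts how often the client had to fall back to "HTTP".
    int httpFetches = 0;

    CachingClientSketch(Map<String, String> storage, Function<String, String> httpFetch) {
        this.storage = storage;
        this.httpFetch = httpFetch;
    }

    String get(String url) {
        // Asking the storage is a plain in-memory lookup; no network is involved.
        String cached = storage.get(url);
        if (cached != null) {
            return cached;
        }
        // The storage returned null, so the client fetches the resource itself...
        httpFetches++;
        String fetched = httpFetch.apply(url);
        // ...and dumps the result into the storage for next time.
        storage.put(url, fetched);
        return fetched;
    }

    public static void main(String[] args) {
        CachingClientSketch client =
                new CachingClientSketch(new HashMap<>(), url -> "body of " + url);
        client.get("http://example.org/context.jsonld");
        client.get("http://example.org/context.jsonld");
        System.out.println("HTTP fetches: " + client.httpFetches);
    }
}
```

With a storage that keeps what it is given, the second call is served from the
map and the fetch count stays at 1. With a storage that never gives entries
back, every call turns into another fetch.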
And then, of course, my second branch accomplishes the same thing as the first,
but it preserves the (slower) order of *first* querying the classpath, *then*
the delegate cache.
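The difference between the two lookup orders can be sketched roughly as follows.
This is not the code from either branch: {{LookupOrderSketch}}, its method
names, the probe counter, and the example URLs are hypothetical, the delegate
cache is reduced to a {{HashMap}}, and the classpath lookup is reduced to a
function that returns null for anything not bundled. It assumes the first branch
checks the delegate cache before the classpath, as the comparison above implies.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of the two lookup orders; not the code from either branch. */
final class LookupOrderSketch {
    // Stand-in for the delegate cache that the caching client fills via putEntry.
    private final Map<String, String> delegate = new HashMap<>();
    // Stand-in for the classpath lookup; returns null when the key is not bundled.
    private final Function<String, String> classpathLookup;
    // Counts classpath probes, the (assumed) more expensive step.
    int classpathProbes = 0;

    LookupOrderSketch(Function<String, String> classpathLookup) {
        this.classpathLookup = classpathLookup;
    }

    void putEntry(String key, String entry) {
        delegate.put(key, entry);
    }

    /** First order: delegate cache first, classpath only on a miss. */
    String getEntryDelegateFirst(String key) {
        String cached = delegate.get(key);
        return cached != null ? cached : probeClasspath(key);
    }

    /** Second order: classpath first, delegate cache only on a miss. */
    String getEntryClasspathFirst(String key) {
        String bundled = probeClasspath(key);
        return bundled != null ? bundled : delegate.get(key);
    }

    private String probeClasspath(String key) {
        classpathProbes++;
        return classpathLookup.apply(key);
    }

    public static void main(String[] args) {
        Map<String, String> bundled = new HashMap<>();
        bundled.put("http://example.org/bundled.jsonld", "bundled context");
        LookupOrderSketch storage = new LookupOrderSketch(bundled::get);
        storage.putEntry("http://example.org/remote.jsonld", "remote context");

        System.out.println(storage.getEntryDelegateFirst("http://example.org/remote.jsonld"));
        System.out.println("classpath probes so far: " + storage.classpathProbes);
        System.out.println(storage.getEntryClasspathFirst("http://example.org/remote.jsonld"));
        System.out.println("classpath probes so far: " + storage.classpathProbes);
    }
}
```

Both methods return the same entry for a key the client cached earlier, but
only the classpath-first order pays for a classpath probe before reaching it.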
> Parsing json-ld content takes prohibitively long time
> -----------------------------------------------------
>
> Key: ANY23-336
> URL: https://issues.apache.org/jira/browse/ANY23-336
> Project: Apache Any23
> Issue Type: Bug
> Components: core, extractors
> Affects Versions: 2.2
> Reporter: Hans Brende
> Assignee: Peter Ansell
> Priority: Critical
> Fix For: 2.3
>
> Attachments: Screen Shot 2018-03-27 at 2.52.15 PM.png, Screen Shot
> 2018-03-27 at 2.54.43 PM.png
>
>
> Using the page [https://www.guthriegreen.com|https://www.guthriegreen.com/]
> as a benchmark, a page fetch took about 100 ms, while simply *parsing* the
> json-ld content on that page took a *staggering 27400 ms*. For reference, I'm
> using Java 8, build 162, on a MacBook Pro (early 2015).
> The bad news is that this is not our fault.
> I've profiled this behavior down to the
> {{com.github.jsonldjava.utils.JsonUtils.fromURL(URL, CloseableHttpClient)}}
> function. 94% of the parsing time is spent there. This function is called
> when trying to load remote json-ld contexts.
> In order to avoid loading remote contexts repeatedly, this function tries to
> *cache* them by using a {{CachingHttpClient}} from the httpclient-osgi
> library.
> Unfortunately, that strategy is *not* working, as I have recorded exactly
> *zero* cache hits, meaning that *every* retrieval is a cache miss and a
> remote context is re-fetched via HTTP every single time it's accessed.
>