[GitHub] [incubator-annotator] vrish88 opened a new pull request, #130: Improve performance for large documents with many annotations

GitBox Wed, 31 Aug 2022 15:37:23 -0700


vrish88 opened a new pull request, #130:
URL: https://github.com/apache/incubator-annotator/pull/130

Hello and thank you for this wonderful project. It's provided some excellent
shoulders to stand on.

### Context
I'm extracting footnotes embedded in markdown and converting them to
annotations. Some of these markdown files have over 500k characters in them and
have over 100 footnotes. After a quite circuitous route, I'm using
mdast/hast/remark to convert the markdown into html and then loading the html
into a jsdom Document.

### The Problem
I found that extracting footnotes for some of the larger files was taking 7 to
10 minutes to process. A profiler run suggested that about 70% of the time was
spent determining whether the node intersected the document/scope.

![image](https://user-images.githubusercontent.com/36475/187796176-82638def-398a-405a-b78c-6a0177f2f04b.png)

That [call is
happening](https://github.com/apache/incubator-annotator/blob/main/packages/dom/src/text-node-chunker.ts#L64)
when the node is being converted to a chunk, which happens many times per
annotation. It also appears to be used only to ensure that the node is a part of
the document (as far as I can tell).

### The Solution
This PR removes that check. It improved the performance on my machine by 75%
for the large files.

Behaviorally I _think_ it is the same. The two things which invoke
`nodeToChunk` appear to be already checking if those nodes are a part of the
scope.
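
To illustrate the reasoning, here is a minimal, DOM-free sketch of the pattern, not the library's actual code. Every name in it (`ScopeLike`, `CountingScope`, `SketchChunker`, `chunksInScope`, `compare`) is hypothetical, nodes are modelled as array indices, and it assumes that callers really do filter by scope before converting, which I have only verified by reading the code. It counts how often the scope check runs with and without the re-check inside `nodeToChunk`.

```typescript
// Hypothetical stand-ins for illustration; these are NOT the library's types.
interface ScopeLike {
  intersectsNode(node: number): boolean;
}

// A scope over the index range [start, end) that counts how often it is queried.
class CountingScope implements ScopeLike {
  calls = 0;
  private readonly start: number;
  private readonly end: number;

  constructor(start: number, end: number) {
    this.start = start;
    this.end = end;
  }

  intersectsNode(node: number): boolean {
    this.calls += 1;
    return node >= this.start && node < this.end;
  }
}

interface Chunk {
  node: number;
  data: string;
}

class SketchChunker {
  private readonly scope: ScopeLike;
  private readonly texts: readonly string[];
  private readonly recheckScope: boolean;

  constructor(scope: ScopeLike, texts: readonly string[], recheckScope: boolean) {
    this.scope = scope;
    this.texts = texts;
    this.recheckScope = recheckScope;
  }

  // Conversion step: optionally re-validates a node the caller already validated.
  nodeToChunk(node: number): Chunk {
    if (this.recheckScope && !this.scope.intersectsNode(node)) {
      throw new Error('Node falls outside of the chunker scope');
    }
    return { node, data: this.texts[node] ?? '' };
  }

  // Assumed caller behaviour: filter by scope first, then convert.
  chunksInScope(): Chunk[] {
    const chunks: Chunk[] = [];
    for (let node = 0; node < this.texts.length; node++) {
      if (this.scope.intersectsNode(node)) {
        chunks.push(this.nodeToChunk(node));
      }
    }
    return chunks;
  }
}

// Runs the same traversal with and without the re-check and reports both.
function compare(texts: readonly string[], start: number, end: number) {
  const scopeWith = new CountingScope(start, end);
  const scopeWithout = new CountingScope(start, end);
  const withCheck = new SketchChunker(scopeWith, texts, true).chunksInScope();
  const withoutCheck = new SketchChunker(scopeWithout, texts, false).chunksInScope();
  return {
    withCheck,
    withoutCheck,
    callsWith: scopeWith.calls,
    callsWithout: scopeWithout.calls,
  };
}
```

In this toy model, `compare(['a', 'b', 'c', 'd'], 1, 3)` produces identical chunks either way, with 6 scope checks when the re-check is enabled and 4 when it is not. The real cost difference depends on how expensive `intersectsNode` is on a large jsdom document, which this sketch does not measure.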

