Some thoughts:

If you just convert the HTML to a plain string, you’ve lost the knowledge of 
how the characters in that string map back to the HTML, and I don’t think you 
can feasibly put it back together after modifying the string.

There are two approaches I can see.

(1) Use an NSMutableAttributedString. Don’t use the regular styled-text 
attributes, but instead a custom attribute that stores the HTML element 
metadata for that span of text. Then you can modify the string and it will 
still keep track of which ranges are part of which tag. Unfortunately I doubt 
you can restore the HTML exactly: I foresee issues with elements that contain 
no text (like <br/>), since they have no characters to attach the attribute to.
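
A minimal sketch of approach (1), under some assumptions: the attribute key 
name "HTMLElementInfo" is made up, the metadata is just the tag name as a 
string, tags are not nested, and no HTML escaping is done when the markup is 
rebuilt. It only shows that a custom attribute keeps following its text when 
the string is edited.

```swift
import Foundation

// Hypothetical custom attribute key (the name is an assumption for illustration).
let htmlTagKey = NSAttributedString.Key("HTMLElementInfo")

// Build "Hello brave world" with "brave" marked as coming from an <em> element.
let text = NSMutableAttributedString(string: "Hello ")
text.append(NSAttributedString(string: "brave", attributes: [htmlTagKey: "em"]))
text.append(NSAttributedString(string: " world"))

// Edit the text in front of the tagged span; the attribute's range shifts with it.
text.replaceCharacters(in: NSRange(location: 0, length: 5), with: "Goodbye")

// Rebuild simple markup by walking the runs of the custom attribute.
var html = ""
let fullRange = NSRange(location: 0, length: text.length)
text.enumerateAttribute(htmlTagKey, in: fullRange, options: []) { value, range, _ in
    let piece = text.attributedSubstring(from: range).string
    if let tag = value as? String {
        html += "<\(tag)>\(piece)</\(tag)>"
    } else {
        html += piece
    }
}
print(html)
```

A real version would need a richer attribute value (tag name, attributes, 
nesting depth), and it still has nowhere to record empty elements.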

(2) Parse the HTML into an NSXMLDocument, i.e. a DOM tree, and walk through the 
tree looking at the text nodes. At that point it’s easy to insert new nodes or 
text at the point you want. The difficulty here is that the text will be broken 
up across lots of nodes, for instance when one word in a phrase is italicized. 
Some HTML generators even split text across redundant <span> elements. So it 
depends on how well your NLP engine copes with disconnected chunks of text. If 
you can stream the text into 
it a piece at a time, that would be ideal; you just do a depth-first traversal 
of the DOM tree feeding all the text nodes into it as you find them.
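
A rough sketch of that traversal, using the Swift spelling of NSXMLDocument 
(XMLDocument). The helper name forEachTextNode and the sample markup are 
invented for illustration, and the input here is already well-formed XML; for 
real-world HTML you would probably want the .documentTidyHTML option as well. 
The closure stands in for whatever call feeds your NLP engine.

```swift
import Foundation
#if canImport(FoundationXML)
import FoundationXML  // XMLDocument lives here on non-Apple platforms
#endif

/// Depth-first walk that hands every text node to `visit` in document order.
/// (Hypothetical helper, not part of Foundation.)
func forEachTextNode(under node: XMLNode, _ visit: (XMLNode) -> Void) {
    if node.kind == .text {
        visit(node)
        return
    }
    for child in node.children ?? [] {
        forEachTextNode(under: child, visit)
    }
}

// Sample input: one italicized word splits the sentence into three text nodes.
let source = "<p>One <i>italic</i> word.<br/></p>"
let doc = try XMLDocument(xmlString: source, options: [.nodePreserveWhitespace])

// Collect the chunks in order; a real client would stream them to the NLP engine.
var chunks: [String] = []
if let root = doc.rootElement() {
    forEachTextNode(under: root) { textNode in
        chunks.append(textNode.stringValue ?? "")
    }
}
print(chunks)
```

Because the closure receives the XMLNode itself rather than a copy of its 
string, it can also keep a reference to the node and later change its text or 
insert siblings next to it.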

—Jens