[ 
https://issues.apache.org/jira/browse/TEXT-216?focusedWorklogId=748574&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-748574
 ]

ASF GitHub Bot logged work on TEXT-216:
---------------------------------------

                Author: ASF GitHub Bot
            Created on: 28/Mar/22 12:46
            Start Date: 28/Mar/22 12:46
    Worklog Time Spent: 10m 
      Work Description: rbunel35 opened a new pull request #312:
URL: https://github.com/apache/commons-text/pull/312


   Hello!
   
   This time a much larger pull request ^^
   I added the HTML 5.0 entities to the EntityArrays class.
   
   Here is how I produced the feature:
   
   - I got the list of all HTML 5.0 entities in JSON format from WHATWG: 
https://html.spec.whatwg.org/multipage/named-characters.html
   - I ordered them by their Unicode value.
   - I removed all Unicode characters already found in the BASIC, ISO8859_1 and 
HTML40 maps (the HTML50 map is an extension of those).
   - I separated entities with a semicolon from those without one (which are 
not part of the HTML Standard).
     - In HTML 5.0, several Unicode characters map to more than one 
character entity.
     - For example, the left square bracket can be written as &lbrack; or &lsqb;
   - I populated the HTML50_ESCAPE map with the entities that end with a semicolon. For 
characters that have several character entities,
   I used the first one (e.g. I associated \u005B with &lbrack; but not &lsqb;).
   - I provided the HTML50_UNESCAPE map, which is the inverse of 
HTML50_ESCAPE plus the character entities left out of the
   HTML50_ESCAPE map (e.g. I added an entry for &lsqb;).
   - I provided the NO_SEMICOLON_UNESCAPE map (for unescaping only), which 
maps character entities without a semicolon to
   their corresponding Unicode characters.
   - I added the escapeHtml5 and unescapeHtml5 methods in the StringEscapeUtils 
class using the aforementioned maps.
   - I provided unit tests for all these features.
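
The map-building steps above can be sketched roughly as follows. This is only an illustration using the JDK, not the code from the pull request: the class `Html5MapSketch`, its method names, and the sample pairs are hypothetical, and the real maps live in EntityArrays. It assumes each input pair is (character, entity) and that "the first one" means the first entity in source order for a given character.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch; names are illustrative and not part of Commons Text.
class Html5MapSketch {

    /** Splits (character, entity) pairs: key true = entity ends with ';', key false = it does not. */
    static Map<Boolean, List<String[]>> splitBySemicolon(List<String[]> pairs) {
        return pairs.stream().collect(Collectors.partitioningBy(p -> p[1].endsWith(";")));
    }

    /** Sorts pairs by code point; List.sort is stable, so source order is kept among equal characters. */
    static List<String[]> sortByCodePoint(List<String[]> pairs) {
        List<String[]> sorted = new ArrayList<>(pairs);
        sorted.sort(Comparator.comparingInt((String[] p) -> p[0].codePointAt(0)));
        return sorted;
    }

    /** Escape map: character -> first entity listed for that character. */
    static Map<String, String> buildEscapeMap(List<String[]> pairs) {
        Map<String, String> escape = new LinkedHashMap<>();
        for (String[] p : sortByCodePoint(pairs)) {
            escape.putIfAbsent(p[0], p[1]); // later aliases are ignored here
        }
        return escape;
    }

    /** Unescape map: every entity -> character, including aliases skipped by the escape map. */
    static Map<String, String> buildUnescapeMap(List<String[]> pairs) {
        Map<String, String> unescape = new LinkedHashMap<>();
        for (String[] p : sortByCodePoint(pairs)) {
            unescape.put(p[1], p[0]);
        }
        return unescape;
    }

    public static void main(String[] args) {
        // Sample data only; the real input would be the full WHATWG entity list.
        List<String[]> pairs = List.of(
            new String[] {"]", "&rbrack;"},
            new String[] {"[", "&lbrack;"},
            new String[] {"[", "&lsqb;"},
            new String[] {"\u00AC", "&not"});

        Map<Boolean, List<String[]>> split = splitBySemicolon(pairs);
        System.out.println(buildEscapeMap(split.get(true)));
        System.out.println(buildUnescapeMap(split.get(true)));
        System.out.println(buildUnescapeMap(split.get(false)));
    }
}
```

With this layout, escaping is deterministic (one entity per character) while unescaping still accepts every alias, which is the asymmetry described in the list above.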
   
   I am very open to review of this feature and to any questions regarding 
the choices I made.
   The JIRA ticket: https://issues.apache.org/jira/browse/TEXT-216




Issue Time Tracking
-------------------

            Worklog Id:     (was: 748574)
    Remaining Estimate: 0h
            Time Spent: 10m

> HTML 5.0 Entities are not supported
> -----------------------------------
>
>                 Key: TEXT-216
>                 URL: https://issues.apache.org/jira/browse/TEXT-216
>             Project: Commons Text
>          Issue Type: Improvement
>    Affects Versions: 1.0
>            Reporter: Richard Bunel
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> As noted in 
> [TEXT-193|https://issues.apache.org/jira/projects/TEXT/issues/TEXT-193] and 
> probably other tickets, HTML 5.0 entities are not supported.
> A nice evolution would be to include them all.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
