Re: Canonize emoji for an XML file

jj Sat, 16 Apr 2022 07:24:22 -0700

Hi Community,

Let's celebrate BBEdit's 30 years of existence. 👏  🎉  🎂  🍾


🥂 👉 👨🏼‍💻 ＆ 🍀️  & 🦜 & 👥

Here is a Swift text filter that could help you prepare your InDesign 
birthday cards.

It is based on Unicode's emoji regular expression and on the ICU regular 
expression engine that Swift reaches through Foundation's NSRegularExpression.

Save it as:
~/Library/Application Support/BBEdit/Text Filters/encode_emojis.swift

    #!/usr/bin/env swift

    // Based on: https://unicode.org/reports/tr51/#EBNF_and_Regex
    //
    // Changed \p{Emoji} to \p{Basic_Emoji} to avoid matching '#', numbers, etc.
    // Tweaked to match uncovered cases revealed by test files.
    //
    // Tested against the contents of those test files:
    // ------------------------------------------------
    // https://unicode.org/emoji/charts/full-emoji-list.html
    // https://raw.githubusercontent.com/unicode-org/icu/main/icu4c/source/data/unidata/emoji-sequences.txt
    // https://raw.githubusercontent.com/unicode-org/icu/main/icu4c/source/data/unidata/emoji-zwj-sequences.txt
    // https://unicode.org/Public/emoji/14.0/emoji-test.txt

    // example:     France 🇫🇷, Snail 🐌, Family👨‍👩‍👧‍👦, man technologist with skin tone 👨🏼‍💻
    // decimal:     France (ef)&#127467;&#127479;(\ef), Snail (ef)&#128012;(\ef), Family(ef)&#128104;&#8205;&#128105;&#8205;&#128103;&#8205;&#128102;(\ef), man technologist with skin tone (ef)&#128104;&#127996;&#8205;&#128187;(\ef)
    // hex:         France (ef)&#x1F1EB;&#x1F1F7;(\ef), Snail (ef)&#x1F40C;(\ef), Family(ef)&#x1F468;&#x200D;&#x1F469;&#x200D;&#x1F467;&#x200D;&#x1F466;(\ef), man technologist with skin tone (ef)&#x1F468;&#x1F3FC;&#x200D;&#x1F4BB;(\ef)
    import Foundation

    let useDecimalEntities = true           // Change to false to encode as hexadecimal entities.
    let openingWrapperTag = "(ef)"          // Set to "" if no wrapper tag needed.
    let closingWrapperTag = "(\\ef)"        // Set to "" if no wrapper tag needed.

    let pattern = #"""
    (?x-i)
    (?:
        \p{RI} \p{RI}
    |
        [
            \x{00A9}
            \x{00AE}
            \x{203C}
            \x{2049}
            \x{2122}
            \x{2139}
            \x{2194}
            \x{2195}
            \x{2196}
            \x{2197}
            \x{2198}
            \x{2199}
            \x{21A9}
            \x{21AA}
            \x{2328}
            \x{23CF}
            \x{23ED}
            \x{23EE}
            \x{23EF}
            \x{23F1}
            \x{23F2}
            \x{23F8}
            \x{23F9}
            \x{23FA}
            \x{24C2}
            \x{25AA}
            \x{25AB}
            \x{25B6}
            \x{25C0}
            \x{25FB}
            \x{25FC}
            \x{2702}
            \x{2708}
            \x{2709}
            \x{270F}
            \x{2712}
            \x{2714}
            \x{2716}
            \x{271D}
            \x{2721}
            \x{2733}
            \x{2734}
            \x{2744}
            \x{2747}
            \x{2763}
            \x{27A1}
            \x{2934}
            \x{2935}
            \x{2B05}
            \x{2B06}
            \x{2B07}
            \x{3030}
            \x{303D}
            \x{3297}
            \x{3299}
            \x{1F170}
            \x{1F171}
            \x{1F17E}
            \x{1F17F}
            \x{1F202}
            \x{1F237}
        ]
        \x{FE0F}
    |
        [
            \x{0023}
            \x{002A}
            \x{0030}
            \x{0031}
            \x{0032}
            \x{0033}
            \x{0034}
            \x{0035}
            \x{0036}
            \x{0037}
            \x{0038}
            \x{0039}
        ]
        \x{FE0F} \x{20E3}

    |
        [
            \p{Basic_Emoji}
            \x{1F300}-\x{1F5FF}
            \x{1F3CA}-\x{1F3CC}
            \x{1F3F3}
            \x{1F3F4}
            \x{1F441}
            \x{1F574}
            \x{1F575}
            \x{1F590}
            \x{1F680}-\x{1F6FF}
            \x{2600}-\x{26FF}
            \x{261D}
            \x{26F9}
            \x{270C}
            \x{270D}
            \x{2764}
        ]
        (?:
            \p{EMod}
        |
            \x{FE0F} \x{20E3}?
        |
            [\x{E0020}-\x{E007E}]+ 
            \x{E007F}
        )?
        (?:
            \x{200D} 
            [
                \p{Basic_Emoji}
                \x{1F32B}
                \x{1F5E8}
                \x{2620}
                \x{2640}
                \x{2642}
                \x{2695}
                \x{2696}
                \x{26A7}
                \x{2708}
                \x{2744}
                \x{2764}
            ]
            (?:
                \p{EMod}
            |
                \x{FE0F} \x{20E3}?
            | 
                [\x{E0020}-\x{E007E}]+ 
                \x{E007F}
            )?
        )*
    )
    """#

    let regex = try NSRegularExpression(pattern: pattern, options: [])
    var output: [String] = []

    while var line = readLine() {
        let range = NSRange(line.startIndex..<line.endIndex, in: line)
        let matches = regex.matches(in: line, options: [], range: range)
        for match in matches.reversed() {
            if let range = Range(match.range, in: line) {
                let emoji = line[range]
                // Encode each Unicode scalar of the match as a numeric character reference.
                let entities = emoji.unicodeScalars.map {
                    useDecimalEntities
                        ? "&#\(String($0.value, radix: 10));"
                        : "&#x\(String($0.value, radix: 16, uppercase: true));"
                }
                let replacement = entities.joined(separator: "")
                // One wrapper pair per matched emoji sequence, however many scalars it has.
                line.replaceSubrange(range, with: "\(openingWrapperTag)\(replacement)\(closingWrapperTag)")
            }
        }
        output.append(line)
    }

    print(output.joined(separator: "\n"), terminator:"")
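
If you want to try the conversion step on its own, here is a minimal sketch 
of it. The helper name numericEntities(for:decimal:) is only an illustration 
and is not part of the filter above; it assumes nothing beyond the Swift 
standard library. The emoji is written with escapes so the four scalars of 
the sequence are visible.

```swift
// Hypothetical helper (not part of the filter above): encode every Unicode
// scalar of `text` as an XML numeric character reference.
func numericEntities(for text: String, decimal: Bool = true) -> String {
    return text.unicodeScalars.map { scalar in
        decimal
            ? "&#\(String(scalar.value, radix: 10));"
            : "&#x\(String(scalar.value, radix: 16, uppercase: true));"
    }.joined()
}

// Man technologist with skin tone: U+1F468 U+1F3FC U+200D U+1F4BB.
let technologist = "\u{1F468}\u{1F3FC}\u{200D}\u{1F4BB}"

print(numericEntities(for: technologist))
// &#128104;&#127996;&#8205;&#128187;
print(numericEntities(for: technologist, decimal: false))
// &#x1F468;&#x1F3FC;&#x200D;&#x1F4BB;
```

Because the whole sequence is encoded in one go, the wrapper tags only need 
to be added once around the result.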

--

BBEdit rocks!

Kind regards,

Jean Jourdain

On Tuesday, April 12, 2022 at 5:38:38 PM UTC+2 Justin Ross wrote:

> Sorted it.
>
> The only way for it to work with extra text on either end is to move all 
> the single entity codes (e.g. &#128102;) below the multiple entity ones 
> (e.g. &#128102;&#127995;) in the canonize file.
>
> So it would look something like this...
>
> &#128102;&#127995;
> &#1255;&#127967;
> &#210;&#129;
> &#128102;
> &#128103;
> &#128104;
>
> On Sunday, 10 April 2022 at 20:48:51 UTC+1 Justin Ross wrote:
>
>> Hi all, I'm looking for a way to catch any emoji that's used amongst 
>> regular text. This is so that I can create an XML file to import into 
>> InDesign. Then I simply find/replace any emoji found and convert the 
>> character to an emoji font so it can be printed.
>>
>> I've made a canonize file with over 1,000 emoji, separated by a tab, then 
>> the decimal equivalent.
>>
>> This works great.  
>>
>> *For example:*
>> This emoji is found in the text somewhere:
>> 👦
>>
>> and is changed to:
>> &#128102;
>>
>> It also works for skintone emoji where the decimal code can be repeated.
>>
>> This emoji is found in the text somewhere:
>> 👦🏻
>>
>> and is changed to:
>> &#128102;&#127995;
>>
>>
>> However, as soon as I wrap the code (so it's easier to find/change in 
>> InDesign), the duplicate codes cause a problem.
>>
>> For example:
>>
>> This emoji:
>> 👦
>>
>> Is changed to this:
>> (ef)&#128102;(\ef)
>>
>> *BUT...*
>>
>> This:
>> 👦🏻
>>
>> Is changed to this:
>> (ef)&#128102;(\ef)(ef)&#127995;(\ef)
>>
>> Note the extra (\ef)(ef) in the middle.
>>
>> Now I could use a find/replace to remove that bit. But what if there are 
>> two different emoji next to each other? I'm replacing one problem with 
>> another.
>>
>> Is there a way round this?
>>
>> Many thanks if anyone can help.
>>
>>
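
For comparison, the ordering fix described in the quoted thread (multi-entity 
codes listed before single-entity ones) amounts to trying the longest key 
first. The sketch below illustrates that idea only; it is not how BBEdit's 
Canonize command is implemented, and the function name replaceLongestFirst 
and the two-entry table are made up for the example. Matching is done on 
Unicode scalars so that a base emoji and its skin-tone modifier count as two 
separate code points.

```swift
// Hypothetical replacement table in the spirit of a canonize file.
let table: [String: String] = [
    "\u{1F466}": "(ef)&#128102;(\\ef)",                    // boy
    "\u{1F466}\u{1F3FB}": "(ef)&#128102;&#127995;(\\ef)",  // boy with skin tone
]

// Replace every key found in `text`, trying longer keys before shorter ones.
func replaceLongestFirst(in text: String, using table: [String: String]) -> String {
    // Keys as scalar arrays, sorted by scalar count, longest first.
    let keys: [(key: String, scalars: [Unicode.Scalar])] = table.keys
        .map { (key: $0, scalars: Array($0.unicodeScalars)) }
        .sorted { $0.scalars.count > $1.scalars.count }

    let scalars = Array(text.unicodeScalars)
    var result = ""
    var index = 0
    while index < scalars.count {
        var matched = false
        for entry in keys where !entry.scalars.isEmpty {
            let end = index + entry.scalars.count
            if end <= scalars.count && scalars[index..<end].elementsEqual(entry.scalars) {
                result += table[entry.key]!   // key comes from the table, so it exists
                index = end
                matched = true
                break
            }
        }
        if !matched {
            // No key starts here: copy the scalar through unchanged.
            result.unicodeScalars.append(scalars[index])
            index += 1
        }
    }
    return result
}

print(replaceLongestFirst(in: "a\u{1F466}\u{1F3FB}b\u{1F466}", using: table))
// a(ef)&#128102;&#127995;(\ef)b(ef)&#128102;(\ef)
```

With the shorter key tried first, the skin-tone modifier would be left behind 
as a stray character, which is the problem the reordering avoids.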

To view this discussion on the web visit 
https://groups.google.com/d/msgid/bbedit/8488b600-6f0b-40e0-8b2b-a74c114cb71an%40googlegroups.com.
