Re: [go-nuts] x/text: Interest in Unicode text segmentation?

2020-04-17 Thread Matt Sherman
Nice. Well, happy to discuss how I might be helpful — implementation, API
design, etc.

For the work I’m doing on UAX 29, the key API is unicode.Is. I am satisfied
with the perf so far. unicode.Is dominates the profiling, but that’s to be
expected, as my scanner is basically a tight loop evaluating rune
categories. Certainly open to using a different trie-driven API.
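
A minimal sketch of the kind of loop described above, assuming a hand-written table in place of a generated one. The name asciiLetters and its two ranges are hypothetical stand-ins for a real word-break table such as ALetter, which would cover far more code points; only unicode.Is and unicode.RangeTable come from the standard library.

```go
package main

import (
	"fmt"
	"unicode"
)

// asciiLetters is a hypothetical, hand-written stand-in for a generated
// word-break table. A real table would be produced from the Unicode data.
var asciiLetters = &unicode.RangeTable{
	R16: []unicode.Range16{
		{Lo: 'A', Hi: 'Z', Stride: 1},
		{Lo: 'a', Hi: 'z', Stride: 1},
	},
	LatinOffset: 2, // both ranges end at or below U+00FF
}

// countLetters walks the input rune by rune and tests each rune against
// the table, which is the hot path a category-driven scanner spends time in.
func countLetters(s string) int {
	n := 0
	for _, r := range s {
		if unicode.Is(asciiLetters, r) {
			n++
		}
	}
	return n
}

func main() {
	fmt.Println(countLetters("Hello, World 42")) // prints 10
}
```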

On Fri, Apr 17, 2020 at 1:47 AM  wrote:

> Most of the x/text packages use tries and not rangetables. These allow
> arbitrary data (as long as it fits in an int) to be associated with runes
> and allow operating on UTF-8 without having to convert to runes.
> https://godoc.org/golang.org/x/text/internal/triegen. But that’s not a
> requirement.
>
> The package
> https://godoc.org/golang.org/x/text/internal/gen/bitfield converts Go
> structs to ints and can be used to pack the rune data in a convenient way.
>
> Furthermore Package
> https://godoc.org/golang.org/x/text/internal/ucd
> can be used for reading UCD files
>
> And Package
> https://godoc.org/golang.org/x/text/internal/gen
> can be used to generate Go tables other than the trie and includes
> utilities to generate canonical x/text files, such as including the Unicode
> and CLDR versions.
>
> The top-level file gen.go is used to orchestrate building x/text and
> captures dependencies between packages.
>
> I may have some designs lying around for the API.
>
> On Thu, 16 Apr 2020 at 21:46 Matt Sherman  wrote:
>
>> Great. Yes, the data files are here:
>> https://unicode.org/reports/tr41/tr41-26.html#Props0
>>
>> I’ve done a proof of concept here: https://github.com/clipperhouse/uax29
>>
>> To do it properly, I assume we’d want to use the house style here?
>> https://github.com/golang/text/blob/master/unicode/rangetable/gen.go
>>
>> On Thu, Apr 16, 2020 at 1:52 PM  wrote:
>>
>>> Yes that would be interesting. Especially if it can be generated from
>>> the Unicode raw data upon updates.
>>>
>>> On Wed, 15 Apr 2020 at 23:56 Ian Lance Taylor  wrote:
>>>
>>>> [ +mpvl ]
>>>>
>>>> On Wed, Apr 15, 2020 at 2:30 PM Matt Sherman wrote:
>>>> >
>>>> > Hi, I am working on a tokenizer based on Unicode text segmentation
>>>> > (UAX 29). I am wondering if there would be an interest in adding range
>>>> > tables for word break categories to the x/text or unicode packages. It
>>>> > appears they could be code-gen’d alongside the rest of the range tables.
>>>> >
>>>> > Pardon if this is already being done and I have missed it. I see some
>>>> > mention of those categories (e.g. ALetter) in other places.
>>>> >
>>>> > My code is here. Thanks.
>>>> >
>>>> > --
>>>> > You received this message because you are subscribed to the Google
>>>> > Groups "golang-nuts" group.
>>>> > To unsubscribe from this group and stop receiving emails from it,
>>>> > send an email to golang-nuts+unsubscr...@googlegroups.com.
>>>> > To view this discussion on the web visit
>>>> > https://groups.google.com/d/msgid/golang-nuts/2a058556-da51-46d0-a41b-28e323541332%40googlegroups.com.
>>>>
>>>

-- 
You received this message because you are subscribed to the Google Groups 
"golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to golang-nuts+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/golang-nuts/CAMPnbukOfdaV_D9P1cChmWrN%2BT1kf2OSOAgyXmRf-3PBakbOSw%40mail.gmail.com.


Re: [go-nuts] x/text: Interest in Unicode text segmentation?

2020-04-16 Thread mpvl
Most of the x/text packages use tries and not rangetables. These allow
arbitrary data (as long as it fits in an int) to be associated with runes
and allow operating on UTF-8 without having to convert to runes.
https://godoc.org/golang.org/x/text/internal/triegen. But that’s not a
requirement.
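
To illustrate the idea of looking up rune data directly from UTF-8 bytes, here is a deliberately naive byte-keyed trie. It is only a sketch: the node type, its methods, and the property values are hypothetical, and the map-based layout is not the multi-stage table that triegen generates. It assumes stored values are nonzero and returns a size of 0 for anything it does not know.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// node is one level of a naive trie keyed on UTF-8 bytes.
type node struct {
	children map[byte]*node
	value    uint8 // hypothetical property value; 0 means "none"
}

// insert stores v (which must be nonzero) under the UTF-8 encoding of r.
func (n *node) insert(r rune, v uint8) {
	var buf [utf8.UTFMax]byte
	size := utf8.EncodeRune(buf[:], r)
	cur := n
	for _, b := range buf[:size] {
		if cur.children == nil {
			cur.children = map[byte]*node{}
		}
		next := cur.children[b]
		if next == nil {
			next = &node{}
			cur.children[b] = next
		}
		cur = next
	}
	cur.value = v
}

// lookup follows the bytes of the first character in s and returns its
// value and encoded length. It never decodes a rune. Unknown input
// yields (0, 0).
func (n *node) lookup(s []byte) (v uint8, size int) {
	cur := n
	for i, b := range s {
		cur = cur.children[b]
		if cur == nil {
			return 0, 0
		}
		if cur.value != 0 {
			return cur.value, i + 1
		}
	}
	return 0, 0
}

func main() {
	root := &node{}
	root.insert('a', 1)
	root.insert('é', 1)
	root.insert('1', 2)
	v, size := root.lookup([]byte("é1"))
	fmt.Println(v, size) // prints "1 2"
}
```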

The package
https://godoc.org/golang.org/x/text/internal/gen/bitfield converts Go
structs to ints and can be used to pack the rune data in a convenient way.
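
As a rough picture of what packing rune data into an int means, the sketch below does it by hand with shifts and masks. It does not use the bitfield package; the props type, the 5-bit category field, and the single flag bit are all hypothetical choices made for illustration.

```go
package main

import "fmt"

// props is a hypothetical packed value: the low 5 bits hold a category
// number and the next bit holds one boolean flag.
type props uint16

const (
	categoryBits = 5
	categoryMask = 1<<categoryBits - 1
	flagBit      = 1 << categoryBits
)

// pack combines a category (0..31) and a flag into a single value.
func pack(category uint8, flag bool) props {
	p := props(category) & categoryMask
	if flag {
		p |= flagBit
	}
	return p
}

// category extracts the low 5 bits.
func (p props) category() uint8 { return uint8(p & categoryMask) }

// flag reports whether the flag bit is set.
func (p props) flag() bool { return p&flagBit != 0 }

func main() {
	p := pack(7, true)
	fmt.Println(p.category(), p.flag()) // prints "7 true"
}
```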

Furthermore Package
https://godoc.org/golang.org/x/text/internal/ucd
can be used for reading UCD files
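
For readers who have not seen the UCD property-file layout, the sketch below parses it with the standard library only. It does not use the ucd package mentioned above; the entry type, the parse and parseRange helpers, and the three sample lines are assumptions made for illustration, written in the "range ; property # comment" layout those files use.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// sample is a short illustrative excerpt in UCD property-file layout.
const sample = `# comment lines start with a hash
000D          ; CR # trailing comments are ignored
0027          ; Single_Quote
0041..005A    ; ALetter
`

// entry is one parsed line: an inclusive code point range and its property.
type entry struct {
	lo, hi   rune
	property string
}

// parseRange accepts either "XXXX" or "XXXX..YYYY" in hexadecimal.
func parseRange(s string) (lo, hi rune, err error) {
	first, last, isRange := strings.Cut(s, "..")
	if !isRange {
		last = first
	}
	l, err := strconv.ParseUint(first, 16, 32)
	if err != nil {
		return 0, 0, err
	}
	h, err := strconv.ParseUint(last, 16, 32)
	if err != nil {
		return 0, 0, err
	}
	return rune(l), rune(h), nil
}

// parse strips comments and blank lines and returns one entry per data line.
func parse(data string) ([]entry, error) {
	var out []entry
	sc := bufio.NewScanner(strings.NewReader(data))
	for sc.Scan() {
		line := sc.Text()
		if i := strings.IndexByte(line, '#'); i >= 0 {
			line = line[:i]
		}
		line = strings.TrimSpace(line)
		if line == "" {
			continue
		}
		fields := strings.Split(line, ";")
		if len(fields) < 2 {
			return nil, fmt.Errorf("malformed line %q", line)
		}
		lo, hi, err := parseRange(strings.TrimSpace(fields[0]))
		if err != nil {
			return nil, err
		}
		out = append(out, entry{lo, hi, strings.TrimSpace(fields[1])})
	}
	return out, sc.Err()
}

func main() {
	entries, err := parse(sample)
	if err != nil {
		panic(err)
	}
	for _, e := range entries {
		fmt.Printf("%04X..%04X %s\n", e.lo, e.hi, e.property)
	}
}
```

A generator could turn entries like these into range tables or trie values whenever the Unicode data is updated.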

And Package
https://godoc.org/golang.org/x/text/internal/gen
can be used to generate Go tables other than the trie and includes utilities
to generate canonical x/text files, such as including the Unicode and CLDR
versions.

The top-level file gen.go is used to orchestrate building x/text and
captures dependencies between packages.

I may have some designs lying around for the API.

On Thu, 16 Apr 2020 at 21:46 Matt Sherman  wrote:

> Great. Yes, the data files are here:
> https://unicode.org/reports/tr41/tr41-26.html#Props0
>
> I’ve done a proof of concept here: https://github.com/clipperhouse/uax29
>
> To do it properly, I assume we’d want to use the house style here?
> https://github.com/golang/text/blob/master/unicode/rangetable/gen.go
>
> On Thu, Apr 16, 2020 at 1:52 PM  wrote:
>
>> Yes that would be interesting. Especially if it can be generated from the
>> Unicode raw data upon updates.
>>
>> On Wed, 15 Apr 2020 at 23:56 Ian Lance Taylor  wrote:
>>
>>> [ +mpvl ]
>>>
>>> On Wed, Apr 15, 2020 at 2:30 PM Matt Sherman 
>>> wrote:
>>> >
>>> > Hi, I am working on a tokenizer based on Unicode text segmentation
>>> (UAX 29). I am wondering if there would be an interest in adding range
>>> tables for word break categories to the x/text or unicode packages. It
>>> appears they could be code-gen’d alongside the rest of the range tables.
>>> >
>>> > Pardon if this is already being done and I have missed it. I see some
>>> mention of those categories (e.g. ALetter) in other places.
>>> >
>>> > My code is here. Thanks.
>>> >
>>>
>>



Re: [go-nuts] x/text: Interest in Unicode text segmentation?

2020-04-16 Thread Matt Sherman
Great. Yes, the data files are here:
https://unicode.org/reports/tr41/tr41-26.html#Props0

I’ve done a proof of concept here: https://github.com/clipperhouse/uax29

To do it properly, I assume we’d want to use the house style here?
https://github.com/golang/text/blob/master/unicode/rangetable/gen.go

On Thu, Apr 16, 2020 at 1:52 PM  wrote:

> Yes that would be interesting. Especially if it can be generated from the
> Unicode raw data upon updates.
>
> On Wed, 15 Apr 2020 at 23:56 Ian Lance Taylor  wrote:
>
>> [ +mpvl ]
>>
>> On Wed, Apr 15, 2020 at 2:30 PM Matt Sherman  wrote:
>> >
>> > Hi, I am working on a tokenizer based on Unicode text segmentation (UAX
>> 29). I am wondering if there would be an interest in adding range tables
>> for word break categories to the x/text or unicode packages. It appears
>> they could be code-gen’d alongside the rest of the range tables.
>> >
>> > Pardon if this is already being done and I have missed it. I see some
>> mention of those categories (e.g. ALetter) in other places.
>> >
>> > My code is here. Thanks.
>> >
>>
>



Re: [go-nuts] x/text: Interest in Unicode text segmentation?

2020-04-16 Thread mpvl
Yes that would be interesting. Especially if it can be generated from the
Unicode raw data upon updates.

On Wed, 15 Apr 2020 at 23:56 Ian Lance Taylor  wrote:

> [ +mpvl ]
>
> On Wed, Apr 15, 2020 at 2:30 PM Matt Sherman  wrote:
> >
> > Hi, I am working on a tokenizer based on Unicode text segmentation (UAX
> 29). I am wondering if there would be an interest in adding range tables
> for word break categories to the x/text or unicode packages. It appears
> they could be code-gen’d alongside the rest of the range tables.
> >
> > Pardon if this is already being done and I have missed it. I see some
> mention of those categories (e.g. ALetter) in other places.
> >
> > My code is here. Thanks.
> >
>



Re: [go-nuts] x/text: Interest in Unicode text segmentation?

2020-04-15 Thread Ian Lance Taylor
[ +mpvl ]

On Wed, Apr 15, 2020 at 2:30 PM Matt Sherman  wrote:
>
> Hi, I am working on a tokenizer based on Unicode text segmentation (UAX 29). 
> I am wondering if there would be an interest in adding range tables for word 
> break categories to the x/text or unicode packages. It appears they could be 
> code-gen’d alongside the rest of the range tables.
>
> Pardon if this is already being done and I have missed it. I see some mention 
> of those categories (e.g. ALetter) in other places.
>
> My code is here. Thanks.
>
