+1

Bruno

From: Benedikt Ritter <brit...@apache.org>
To: Commons Developers List <dev@commons.apache.org>
Sent: Wednesday, 14 September 2016 2:06 AM
Subject: Re: [LANG] Add alphabet conversion API

Does this really belong in [LANG]? We also have Commons Text [1] in the sandbox, which seems to be a better home for this functionality.

Benedikt

[1] http://commons.apache.org/sandbox/commons-text/

Rob Tompkins <chtom...@gmail.com> wrote on Tue, 13 Sep 2016 at 15:48:

> On Sep 13, 2016, at 4:39 AM, Eyal Allweil <eyal_allw...@yahoo.com.INVALID> wrote:
> >
> > I've created a JIRA issue, https://issues.apache.org/jira/browse/LANG-1266, and a pull request for this: https://github.com/apache/commons-lang/pull/188
> >
> > Regards,
> > Eyal
> >
> > On Wednesday, September 7, 2016 5:27 PM, Eyal Allweil <eyal_allw...@yahoo.com> wrote:
> >
> > Hi Simo,
> > I'm not sure I understood how BitSets would be used in this case. For example, a version with chars might look like this:
> >
> > AlphabetConverter ac = new AlphabetConverter(['a','b','c','d'], ['a','e','f','g'], ['a']) // 'a' is not encoded
>
> Hello Eyal,
>
> The first thing that springs to mind here is: are we naming this class appropriately? I’ll preface my naming argument by saying that I come from a mathematical background (combinatorics on words). Traditionally in the literature such a “mapping”
>
> f: {Kleene Closure A} -> {Kleene Closure B}
>
> with the property f(StringConcatenate(x,y)) = StringConcatenate(f(x),f(y)) for x,y strings from {Kleene Closure A}, is called a “Morphism” [1, pg. 8][2]. Clearly that name is quite terse when one comes from an application development mindset, so I’m not sure that going with the theoretical name is appropriate here. That said, I at least wanted to bring it up so that we can have an open discussion about naming.
>
> After looking at the code a bit, the following came to mind (note: I’m not tied to any of these ideas, just stating thoughts that ran through my head):
> - There are some stylistic differences that stand out (e.g. "methodName (signature)" as opposed to "methodName(signature)").
> - More javadoc?
> - Do we need the “doNotEncodeMap”?
> - The “.equals” method could use a null check.
> - Do we want to accommodate non-invertible or non-decodable encodings (e.g.
> new AlphabetConverter(['a','b','c','d'],['a','e','f','e'],['a']))?
> - Do we want to accommodate alphabets over concatenated chars (e.g. new AlphabetConverter(['ab','c','d','e'],['a','k','hi','z'],[]))?
>
> Personally, I like the idea of generalizing the input/output alphabets, but that would seem to require a superclass holding the general implementation and a subclass for an invertible AlphabetConverter.
>
> All that said, I’m not particularly tied to any of the ideas, and aside from the stylistic changes and the .equals bit, the changes seem quite reasonable. I would love to hear other folks’ thoughts on the proposed functionality.
>
> Cheers,
> -Rob
>
> Biblio.
> [1] Jean-Paul Allouche and Jeffrey Shallit. Automatic sequences. Cambridge University Press, Cambridge, 2003. Theory, applications, and generalizations.
>
> [2] https://en.wikipedia.org/wiki/Free_monoid#Morphisms
>
> > and the mapping would become a -> a, b -> e, c -> f, d -> g
> > so encode("abc") would become "aef".
> > Ints can be used instead of chars to support Unicode code points that don't fit in a single char (which was our case, but if that seems overkill, the chars implementation is much more direct).
> > How did you mean the BitSet to be used?
> >
> > Regards,
> > Eyal
> >
> > On Thursday, September 1, 2016 12:26 PM, Simone Tripodi <simonetrip...@apache.org> wrote:
> >
> > Hi, I personally think it would be a very "nice to have" feature. I had to face similar issues in the past and, if that feature had been available, it would have saved me development time.
> > I just have a small request/suggestion: since int/char can be cast to each other, I would use BitSets rather than Sets.
> > Good luck!
> > -Simo
> >
> > http://people.apache.org/~simonetripodi/
> > http://twitter.com/simonetripodi
> >
> > On Thu, Sep 1, 2016 at 10:53 AM, Eyal Allweil <eyal_allw...@yahoo.com.invalid> wrote:
> >
> > Hi guys,
> > Would you be interested in adding a utility class that creates alphabet converters, perhaps using a helper method available from StringUtils? It doesn't have to stay the way it is now, but the API for the class - AlphabetConverter - is currently:
> >
> > /**
> >  * The input is integers representing code points, but we can make it accept chars as well
> >  *
> >  * doNotEncode represents chars we want to leave in the original state (not to encode them using the chars in encoding)
> >  */
> > public AlphabetConverter(Set<Integer> original, Set<Integer> encoding, Set<Integer> doNotEncode);
> >
> > public String encode (String original);
> >
> > public String decode (String encoded);
> >
> > In StringUtils, we could add
> >
> > public AlphabetConverter getAlphabetConverter (Set<Integer> original, Set<Integer> encoding, Set<Integer> doNotEncode);
> >
> > I used it to convert from Unicode to Latin letters, without using any chars I wanted as delimiters, and preserving the English alphabet as is for readability. If you'd like to add it, I'll clean up the code and prepare it for a pull request so you can review it.
> >
> > It makes sense to me to add a method that returns the HashMaps used internally for the mappings so they can be serialized (and deserialized) for preserving the mapping.
> >
> > Regards,
> > Eyal Allweil (PayPal)
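
For readers following the thread, the behaviour Eyal describes (a -> a, b -> e, c -> f, d -> g, with 'a' left unencoded) can be sketched roughly as below. This is only an illustration and not the code from the pull request: the class name SimpleAlphabetConverter, the codePoints helper, and the restriction to equal-sized alphabets paired in iteration order are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical one-to-one converter over code points; NOT the class from the pull request. */
final class SimpleAlphabetConverter {
    private final Map<Integer, Integer> encodeMap = new HashMap<>();
    private final Map<Integer, Integer> decodeMap = new HashMap<>();

    SimpleAlphabetConverter(Set<Integer> original, Set<Integer> encoding, Set<Integer> doNotEncode) {
        // Assumption: each original code point gets exactly one encoding code point.
        if (original.size() != encoding.size()) {
            throw new IllegalArgumentException("alphabets must have the same size");
        }
        // Code points in doNotEncode map to themselves and stop being available as targets.
        Set<Integer> targets = new LinkedHashSet<>(encoding);
        for (Integer cp : doNotEncode) {
            if (!original.contains(cp) || !targets.remove(cp)) {
                throw new IllegalArgumentException("doNotEncode must appear in both alphabets");
            }
            put(cp, cp);
        }
        // Pair the remaining code points in iteration order (pass ordered sets for stable results).
        Iterator<Integer> next = targets.iterator();
        for (Integer cp : original) {
            if (!encodeMap.containsKey(cp)) {
                put(cp, next.next());
            }
        }
    }

    private void put(Integer from, Integer to) {
        encodeMap.put(from, to);
        decodeMap.put(to, from); // the inverse map is what makes decode() possible
    }

    String encode(String original) {
        return translate(original, encodeMap);
    }

    String decode(String encoded) {
        return translate(encoded, decodeMap);
    }

    private static String translate(String s, Map<Integer, Integer> map) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> {
            Integer mapped = map.get(cp);
            if (mapped == null) {
                throw new IllegalArgumentException("not in alphabet: " + new String(Character.toChars(cp)));
            }
            out.appendCodePoint(mapped);
        });
        return out.toString();
    }

    /** Helper for the example: the distinct code points of s, in order of first appearance. */
    static Set<Integer> codePoints(String s) {
        return s.codePoints().boxed().collect(Collectors.toCollection(LinkedHashSet::new));
    }

    public static void main(String[] args) {
        SimpleAlphabetConverter ac = new SimpleAlphabetConverter(
                codePoints("abcd"), codePoints("aefg"), codePoints("a"));
        System.out.println(ac.encode("abc")); // prints "aef"
        System.out.println(ac.decode("aef")); // prints "abc"
    }
}
```

Because the constructor takes Sets, the encoding alphabet cannot contain duplicates, so the non-invertible case Rob raises cannot arise in this sketch. Supporting it, or multi-character alphabet symbols, would need a different input type.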