Thank you Sven! Peter
On Sun, Jul 23, 2017 at 01:59:41PM +0200, Sven Van Caekenberghe wrote: > > > On 23 Jul 2017, at 09:55, Peter Uhnak <[email protected]> wrote: > > > > Ah, ByteArrayMap, I was trying ByteArray. > > ByteArrayMap is not a class, it is still a ByteArray, but of size 256 used as > an inclusion map for characters that fit a byte. > > BTW, using CharacterSet as an abstraction makes the trick easier. > > I committed: > > === > Name: Neo-CSV-Core-SvenVanCaekenberghe.26 > Author: SvenVanCaekenberghe > Time: 23 July 2017, 1:54:22.095185 pm > UUID: e429f87e-5c11-0d00-a72d-ae5d00db2a8c > Ancestors: Neo-CSV-Core-PeterUhnak.25 > > make the core test in #writeOptionalQuotedField faster by using a primitive > when possible > > add #testWideOptionalQuoted > === > Name: Neo-CSV-Tests-SvenVanCaekenberghe.22 > Author: SvenVanCaekenberghe > Time: 23 July 2017, 1:54:52.133003 pm > UUID: 5f81c280-5c11-0d00-a72e-7cd500db2a8c > Ancestors: Neo-CSV-Tests-PeterUhnak.21 > > make the core test in #writeOptionalQuotedField faster by using a primitive > when possible > > add #testWideOptionalQuoted > === > > Extra caution was needed for WideStrings. > > Sven > > > Not sure what the next step is here: should I add it and send you mczs, or > > will you do it yourself? (it is a simple change). > > > > Peter > > > > On Sat, Jul 22, 2017 at 10:51:42PM +0200, Sven Van Caekenberghe wrote: > >> Peter, > >> > >>> On 22 Jul 2017, at 22:27, Peter Uhnak <[email protected]> wrote: > >>> > >>> Hi Sven, > >>> > >>> my use case was hand-edited CSVs (therefore the quotes are extra > >>> clutter), which imples that I would be hand-viewing/editing only small > >>> CSVs (no performance issues). > >>> > >>> I agree that we should err on the safe side with CR & LF (e.g. tools may > >>> sometimes autoconvert line endings). > >>> > >>> Regarding performance: > >>> > >>> #findFirstInString:inSet:startingAt: didn't work for me (not sure if bug, > >>> or I don't know how to use), so I've trried with inCharacterSet: > >> > >> Yes, it is a bitch to use, the the version you used does not use the > >> primitive. > >> > >> Try playing with this: > >> > >> s10 := 'a' repeat: 10. > >> s100 := 'a' repeat: 100. > >> > >> searchSet := ByteArray new: 256 withAll: 0. > >> ',"', String crlf do: [ :each | searchSet at: each asInteger + 1 put: 1 ]. > >> > >> searchSet := (CharacterSet newFrom: ',"', String crlf) byteArrayMap. > >> > >> [ ByteString findFirstInString: s10 inSet: searchSet startingAt: 1 ] bench. > >> "15,160,161 per second" > >> [ ByteString findFirstInString: s100 inSet: searchSet startingAt: 1 ] > >> bench. > >> "8,219,081 per second" > >> > >> ByteString findFirstInString: ',"', String crlf inSet: searchSet > >> startingAt: 1. > >> > >> Sven > >> > >>> Tested on worst-case scenario - strings don't contain tested symbols. > >>> > >>> s10 := 'a' repeat: 10. > >>> s100 := 'a' repeat: 100. > >>> > >>> [ {'"'. ','. String cr. String lf } anySatisfy: [ :each | s10 > >>> includesSubstring: each ] ] bench. "'1,200,046 per second'" > >>> [ {'"'. ','. String cr. String lf } anySatisfy: [ :each | s100 > >>> includesSubstring: each ] ] bench. "'495,482 per second'" > >>> > >>> [ s10 includesAny: { $,. $". Character cr. Character lf } ] bench. > >>> "'2,819,416 per second'" > >>> [ s100 includesAny: { $,. $". Character cr. Character lf } ] bench. > >>> "'2,200,668 per second'" > >>> > >>> [ ByteString findFirstInString: s10 inCharacterSet: ',"', String crlf > >>> startingAt: 1 ] bench. "'1,187,324 per second'" > >>> [ ByteString findFirstInString: s100 inCharacterSet: ',"', String crlf > >>> startingAt: 1 ] bench. "'165,526 per second'" > >>> > >>> > >>> #includesAny: seems to be the best by far. > >>> > >>> Storing the tested characters didn't improve it by much. > >>> > >>> Peter > >>> > >>> On Sat, Jul 22, 2017 at 06:51:31PM +0200, Sven Van Caekenberghe wrote: > >>>> Hi Peter, > >>>> > >>>>> On 22 Jul 2017, at 14:12, Peter Uhnak <[email protected]> wrote: > >>>>> > >>>>> Hi, > >>>>> > >>>>> this is a continuation of an older thread about quoting fields only > >>>>> when necessary. ( > >>>>> http://forum.world.st/NeoCSVWriter-automatic-quotes-td4924781.html ) > >>>>> > >>>>> I've attached fileouts of NeoCSV packages with the addition (I don't > >>>>> know if Metacello can file-out only changesets). > >>>>> > >>>>> The change doesn't affect the default behavior, only when explicitly > >>>>> requested. > >>>>> > >>>>> Peter > >>>> > >>>> I accepted your changes as such, the .mcz's were copied over to the main > >>>> repositories. This is a pure & clean extension, so that is good. Thank > >>>> you. > >>>> > >>>> This option is always going to be slower, but the current implementation > >>>> might be improved, I think. > >>>> > >>>> The key test in #writeOptionalQuotedField: > >>>> > >>>> { > >>>> lineEnd asString. > >>>> separator asString. > >>>> '"' } anySatisfy: [ :each | string includesSubstring: each ] > >>>> > >>>> will be quite slow. > >>>> > >>>> If we accept a little bit of (over safe) error on EOL and use any > >>>> occurrence of CR or LF as needing a quote, we could switch to characters > >>>> to test for. There exists a fast (primitive) test, > >>>> #findFirstInString:inSet:startingAt: that can do all the testing in one > >>>> go. If your version turns out to be slow, we could try that, if > >>>> measurements show a difference. > >>>> > >>>> Regards, > >>>> > >>>> Sven > >>>> > >>>> > >>> > >> > >> > > > >
