Thank you Sven!

Peter

On Sun, Jul 23, 2017 at 01:59:41PM +0200, Sven Van Caekenberghe wrote:
> 
> > On 23 Jul 2017, at 09:55, Peter Uhnak <[email protected]> wrote:
> > 
> > Ah, ByteArrayMap, I was trying ByteArray.
> 
> ByteArrayMap is not a class, it is still a ByteArray, but of size 256 used as 
> an inclusion map for characters that fit a byte.
> 
> BTW, using CharacterSet as an abstraction makes the trick easier.
> 
> I committed:
> 
> ===
> Name: Neo-CSV-Core-SvenVanCaekenberghe.26
> Author: SvenVanCaekenberghe
> Time: 23 July 2017, 1:54:22.095185 pm
> UUID: e429f87e-5c11-0d00-a72d-ae5d00db2a8c
> Ancestors: Neo-CSV-Core-PeterUhnak.25
> 
> make the core test in #writeOptionalQuotedField faster by using a primitive 
> when possible
> 
> add #testWideOptionalQuoted
> ===
> Name: Neo-CSV-Tests-SvenVanCaekenberghe.22
> Author: SvenVanCaekenberghe
> Time: 23 July 2017, 1:54:52.133003 pm
> UUID: 5f81c280-5c11-0d00-a72e-7cd500db2a8c
> Ancestors: Neo-CSV-Tests-PeterUhnak.21
> 
> make the core test in #writeOptionalQuotedField faster by using a primitive 
> when possible
> 
> add #testWideOptionalQuoted
> ===
> 
> Extra caution was needed for WideStrings.
> 
> Sven
> 
> > Not sure what the next step is here: should I add it and send you mczs, or 
> > will you do it yourself? (it is a simple change).
> > 
> > Peter
> > 
> > On Sat, Jul 22, 2017 at 10:51:42PM +0200, Sven Van Caekenberghe wrote:
> >> Peter,
> >> 
> >>> On 22 Jul 2017, at 22:27, Peter Uhnak <[email protected]> wrote:
> >>> 
> >>> Hi Sven,
> >>> 
> >>> my use case was hand-edited CSVs (therefore the quotes are extra 
> >>> clutter), which imples that I would be hand-viewing/editing only small 
> >>> CSVs (no performance issues).
> >>> 
> >>> I agree that we should err on the safe side with CR & LF (e.g. tools may 
> >>> sometimes autoconvert line endings).
> >>> 
> >>> Regarding performance:
> >>> 
> >>> #findFirstInString:inSet:startingAt: didn't work for me (not sure if bug, 
> >>> or I don't know how to use), so I've trried with inCharacterSet:
> >> 
> >> Yes, it is a bitch to use, the the version you used does not use the 
> >> primitive.
> >> 
> >> Try playing with this:
> >> 
> >> s10 := 'a' repeat: 10.
> >> s100 := 'a' repeat: 100.
> >> 
> >> searchSet := ByteArray new: 256 withAll: 0.
> >> ',"', String crlf do: [ :each | searchSet at: each asInteger + 1 put: 1 ].
> >> 
> >> searchSet := (CharacterSet newFrom: ',"', String crlf) byteArrayMap.
> >> 
> >> [ ByteString findFirstInString: s10 inSet: searchSet startingAt: 1 ] bench.
> >> "15,160,161 per second"
> >> [ ByteString findFirstInString: s100 inSet: searchSet startingAt: 1 ] 
> >> bench.
> >> "8,219,081 per second"
> >> 
> >> ByteString findFirstInString: ',"', String crlf inSet: searchSet 
> >> startingAt: 1.
> >> 
> >> Sven
> >> 
> >>> Tested on worst-case scenario - strings don't contain tested symbols.
> >>> 
> >>> s10 := 'a' repeat: 10.
> >>> s100 := 'a' repeat: 100.
> >>> 
> >>> [ {'"'. ','. String cr. String lf } anySatisfy: [ :each | s10 
> >>> includesSubstring: each ] ] bench. "'1,200,046 per second'"
> >>> [ {'"'. ','. String cr. String lf } anySatisfy: [ :each | s100 
> >>> includesSubstring: each ] ] bench. "'495,482 per second'"
> >>> 
> >>> [ s10 includesAny: { $,. $". Character cr. Character lf } ] bench. 
> >>> "'2,819,416 per second'"
> >>> [ s100 includesAny: { $,. $". Character cr. Character lf } ] bench. 
> >>> "'2,200,668 per second'"
> >>> 
> >>> [ ByteString findFirstInString: s10 inCharacterSet: ',"', String crlf 
> >>> startingAt: 1 ] bench. "'1,187,324 per second'"
> >>> [ ByteString findFirstInString: s100 inCharacterSet: ',"', String crlf 
> >>> startingAt: 1 ] bench. "'165,526 per second'"
> >>> 
> >>> 
> >>> #includesAny: seems to be the best by far.
> >>> 
> >>> Storing the tested characters didn't improve it by much.
> >>> 
> >>> Peter
> >>> 
> >>> On Sat, Jul 22, 2017 at 06:51:31PM +0200, Sven Van Caekenberghe wrote:
> >>>> Hi Peter,
> >>>> 
> >>>>> On 22 Jul 2017, at 14:12, Peter Uhnak <[email protected]> wrote:
> >>>>> 
> >>>>> Hi,
> >>>>> 
> >>>>> this is a continuation of an older thread about quoting fields only 
> >>>>> when necessary. ( 
> >>>>> http://forum.world.st/NeoCSVWriter-automatic-quotes-td4924781.html )
> >>>>> 
> >>>>> I've attached fileouts of NeoCSV packages with the addition (I don't 
> >>>>> know if Metacello can file-out only changesets).
> >>>>> 
> >>>>> The change doesn't affect the default behavior, only when explicitly 
> >>>>> requested.
> >>>>> 
> >>>>> Peter
> >>>> 
> >>>> I accepted your changes as such, the .mcz's were copied over to the main 
> >>>> repositories. This is a pure & clean extension, so that is good. Thank 
> >>>> you.
> >>>> 
> >>>> This option is always going to be slower, but the current implementation 
> >>>> might be improved, I think.
> >>>> 
> >>>> The key test in #writeOptionalQuotedField:
> >>>> 
> >>>> {
> >>>> lineEnd asString.
> >>>> separator asString.
> >>>> '"' } anySatisfy: [ :each | string includesSubstring: each ]
> >>>> 
> >>>> will be quite slow. 
> >>>> 
> >>>> If we accept a little bit of (over safe) error on EOL and use any 
> >>>> occurrence of CR or LF as needing a quote, we could switch to characters 
> >>>> to test for. There exists a fast (primitive) test, 
> >>>> #findFirstInString:inSet:startingAt: that can do all the testing in one 
> >>>> go. If your version turns out to be slow, we could try that, if 
> >>>> measurements show a difference.
> >>>> 
> >>>> Regards,
> >>>> 
> >>>> Sven 
> >>>> 
> >>>> 
> >>> 
> >> 
> >> 
> > 
> 
> 

Reply via email to