Wow, this will take some time to digest. Regrettably I have required
training today and won't be able to even play with this until tomorrow.
In my (parent,child) data, there are 304K unique parent values and 1.05M unique
child values. Of course many child values are also parent values, so the
total number of distinct names is only just over 1.14M.
This data came from and will go back to a database and it must be correct
CSV. And I know there is data that needs to be escaped per RFC 4180. Also,
I had always thought that the CSV package was buffered, since it has a
Flush() method (and if it isn't called, you will lose data!). At any rate, I'm
reluctant to roll my own here, since I've been bitten too many times by the
Thanks for taking the time to review this!
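For what it's worth, a minimal sketch of what encoding/csv already handles: it quotes fields per RFC 4180 (doubling embedded quotes) and buffers output until Flush is called. The part numbers below are made up for illustration.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// writePairs renders (parent, child) rows as RFC 4180 CSV.
func writePairs(pairs [][2]string) (string, error) {
	var sb strings.Builder
	w := csv.NewWriter(&sb) // csv.Writer buffers internally via bufio
	for _, p := range pairs {
		if err := w.Write(p[:]); err != nil {
			return "", err
		}
	}
	w.Flush() // without Flush, buffered rows never reach sb
	return sb.String(), w.Error()
}

func main() {
	out, err := writePairs([][2]string{
		{"P-100", `child "A", rev 2`}, // needs quoting and quote-doubling
		{"P-100", "P-200"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

Fields containing commas, quotes, or newlines come out correctly quoted, which is exactly the RFC 4180 escaping the database round-trip needs.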
On Thursday, December 1, 2016 at 4:21:42 AM UTC-5, Egon wrote:
> See whether this works better:
> *There is a lot of room for improvements, (e.g. for your actual dataset
> increase defaultNameCount).*
> *PS: Avoid the csv package for writing files; use bufio and Write directly...
> csv does some extra stuff that you probably won't need.*
> btw. how many different names do you have?
> + Egon
> On Thursday, 1 December 2016 02:51:49 UTC+2, Mandolyte wrote:
>> Thanks for the discussion! Package with tester is at:
>> While I can't share the data, I could take a sample set of paths for a
>> root node, reverse engineer the pairs, and obfuscate... I've done this sort
>> of thing before but it is a bit of work. So I'll treat that as a last
>> resort. :-)
>> On Wednesday, November 30, 2016 at 5:36:41 PM UTC-5, Mandolyte wrote:
>>> The finite set idea might work, but the set is well over 300K. The
>>> strings (part numbers) are not regular.
>>> I could make a single pass over the "parent" column and record in a
>>> map[string]int the index of the first occurrence. Then I would avoid
>>> sort.Search() having to find it each time. Or use sort.Search() on the
>>> smaller 300K slice if map performance was a problem...
>>> There is a lot of repetition in the "branches" - perhaps some caching
>>> would be appropriate...
>>> thanks for the ideas!
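The single-pass index idea above could look like this: one scan over the parent column of a slice already sorted by parent records where each parent's run begins, replacing repeated sort.Search calls. The sample pairs are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// firstIndex records, in one pass, the index of the first occurrence
// of each parent in a slice sorted by parent. Lookups then become a
// single map access instead of a sort.Search per query.
func firstIndex(pairs [][2]string) map[string]int {
	idx := make(map[string]int, len(pairs))
	for i, p := range pairs {
		if _, seen := idx[p[0]]; !seen {
			idx[p[0]] = i
		}
	}
	return idx
}

func main() {
	pairs := [][2]string{ // hypothetical (parent, child) rows
		{"A", "B"}, {"A", "C"}, {"B", "D"},
	}
	sort.Slice(pairs, func(i, j int) bool { return pairs[i][0] < pairs[j][0] })
	idx := firstIndex(pairs)
	fmt.Println(idx["B"]) // children of "B" start at this index
}
```

Whether the map beats sort.Search on a 300K-entry slice would need measuring; the map trades memory for O(1) lookups.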
>>> On Wednesday, November 30, 2016 at 9:31:56 AM UTC-5, adon...@google.com wrote:
>>>> On Wednesday, 30 November 2016 03:37:55 UTC+2, Mandolyte wrote:
>>>>> I have a fairly large 2D slice of strings, about 115M rows. These are
>>>>> (parent, child) pairs that, processed recursively, form a tree. I am
>>>>> "materializing" all possible trees as well as determining for each root
>>>>> node all child nodes for all levels for it.
>>>> In addition to what Egon said:
>>>> If the elements of the outer string slice are all string slices
>>>> of length 2, it would be much more efficient to use [2]string instead,
>>>> that is, fixed-size arrays of two strings, since this would avoid a lot of
>>>> memory allocation and pointer indirection. The type of the outer slice
>>>> would become [][2]string.
>>>> Also, if all your strings are drawn from a finite set, you'll find many
>>>> operations (such as comparison or set membership tests) much more
>>>> efficient if you represent each string by its index in some side table.
>>>> Then instead of string you can use uint32, or perhaps even a narrower
>>>> integer type, which will make your main array more compact by at least a
>>>> factor of
You received this message because you are subscribed to the Google Groups
To unsubscribe from this group and stop receiving emails from it, send an email
For more options, visit https://groups.google.com/d/optout.