Wow, this will take some time to digest. Regrettably I have required 
training today and won't be able to even play with this until tomorrow.

In my (parent,child), there are 304K unique parent values and 1.05M unique 
child values. Of course many child values are also parent values. Thus, 
total names are only just over 1.14M.

This data came from and will go back to a database and it must be correct 
CSV. And I know there is data that needs to be escaped per RFC 4180. Also, 
I had always thought that the CSV package was buffered since it has a 
Flush() method (and if not used, you will lose data!). At any rate, I'm 
reluctant to roll my own here since I've been bitten too many times by the 
data content.

Thanks for taking the time to review this!

On Thursday, December 1, 2016 at 4:21:42 AM UTC-5, Egon wrote:
> See whether this works better: 
> *There is a lot of room for improvements, (e.g. for your actual dataset 
> increase defaultNameCount).*
> *PS: Avoid csv package for writing files, use bufio and Write directly... 
> csv does some extra stuff, that you probably won't need.*
> btw. how many different names do you have?
> + Egon
> On Thursday, 1 December 2016 02:51:49 UTC+2, Mandolyte wrote:
>> Thanks for the discussion! Package with tester is at:
>> While I can't share the data, I could take a sample set of paths for a 
>> root node, reverse engineer the pairs, and obfuscate... I've done this sort 
>> of thing before but it is a bit of work. So I'll treat that as a last 
>> resort. :-)
>> On Wednesday, November 30, 2016 at 5:36:41 PM UTC-5, Mandolyte wrote:
>>> The finite set idea might work, but the set is well over 300K. The 
>>> strings (part numbers) are not regular.
>>> I could make a single pass over the "parent" column and record in a 
>>> map[string]int the index of the first occurrence. Then I would avoid 
>>> sort.Search() having to find it each time. Or use sort.Search() on the 
>>> smaller 300K slice if map performance was a problem...
>>> There is a lot of repetition in the "branches" - perhaps some caching 
>>> would be appropriate...
>>> thanks for the ideas!
>>> On Wednesday, November 30, 2016 at 9:31:56 AM UTC-5, 
>>> wrote:
>>>> On Wednesday, 30 November 2016 03:37:55 UTC+2, Mandolyte wrote:
>>>>>> I have a fairly large 2D slice of strings, about 115M rows. These are 
>>>>>> (parent, child) pairs that, processed recursively, form a tree. I am 
>>>>>> "materializing" all possible trees as well as determining for each root 
>>>>>> node all child nodes for all levels for it.
>>>> In addition to what Egon said:
>>>> If the elements of the outer [][]string slice are all []string slices 
>>>> of length 2, it would be much more efficient to use [2]string instead, 
>>>> that 
>>>> is, fixed-sized arrays of two strings, since this would avoid a lot of 
>>>> memory allocation and pointer indirection.  The type of the outer slice 
>>>> would become [][2]string.
>>>> Also, if all your strings are drawn from a finite set, you'll find many 
>>>> operations (such as comparison or set membership tests) much more 
>>>> efficient 
>>>> if you represent each string by its index in some side table.  Then 
>>>> instead 
>>>> of [2]string you can use [2]uint32, or perhaps even a narrower integer 
>>>> type, which will make your main array more compact by at least a factor of 
>>>> 4.

You received this message because you are subscribed to the Google Groups 
"golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
For more options, visit

Reply via email to