On Wed, 27 Jan 2016 11:17:04 -0800 (PST), Scotty C <costo...@hotmail.com> wrote:
>On Wednesday, January 27, 2016 at 2:57:42 AM UTC-6, gneuner2 wrote:
>
>> What is this other field on which the file is sorted?
>
>this field is the cost in operators to arrive at the key value

Is it important to retain that sorting? Or is it just informational?

>> In the worst case of every key being a bignum
>
>no, every key is contained within a bignum which can contain many
>many keys.

Then you're not using the hash in a conventional manner ... else the
filter entries would be unique ... and we really have no clue what
you're actually doing. So any suggestions we give you are shots in
the dark.

>> Since you are only comparing the hash entries for equality, you
>> could save a lot of space [at the expense of complexity] by
>> defining a {bucket x chain_size x 16} byte array and storing the
>> 16-byte keys directly.
>
>i must be able to grow the chains. i can't make it fixed size like
>that.

You said previously:

>>> ... i use separate chaining with a vector of bignums. i am
>>> willing to let my chains run up to 5 keys per chain so i need a
>>> vector of 6 million pointers.

That reads as needing a maximum of 6M x 5 entries. Extensible
structures don't have to be a list - a small array works very well
when the number of entries is bounded [which you said it was]. (See
the first sketch at the end of this message.) Again, this is a case
where we really have no clue what you are doing.

>> > have another rather large bignum in memory that i use to reduce
>> > but not eliminate record duplication of about .5 gb.
>>
>> ???
>
>ha, ok. this is what this bignum is for. cycle elimination. a
>sequence of operators (2 bit per) when strung together is a number
>like 4126740 which represents the operator sequence
>(0 0 3 0 0 3 1 1 2 2 1). i change that bit in the bignum from 0 to
>1. during data generation i look up my about to be applied operator
>sequence in the bignum. if i see a one, i skip data generation. i'm
>not really happy with the volume of memory this takes but it is an
>insanely fast lookup and keeps a ton of data off the hard drive.

So the bignum is acting as one giant bitset indexed by the encoded
operator sequence. (See the second sketch at the end of this
message.)

>> In the meantime, I would suggest you look up "merge sort" and it's
>
>logarithmic? not happening

By splitting the too-large data file and trying to process it in
sections, you're already O(M*N) where M is the number of sections.
Logarithmic might improve things.

Rejecting an approach due solely to its big-O is naive. The order
tells how it scales in the limit ... not how it behaves in any
particular situation. In most cases there are constants and other
linear multipliers that dominate the running time. It matters not
just how many times you touch the data, but also the cost of each
touch.

E.g., people read that Quicksort is O(N*LogN) and ignore that it is
O(N*N) under certain circumstances. Mergesort has a higher constant
multiplier, but it is O(N*LogN) in _all_ cases ... including those
that throw Quicksort into fits. Plus Quicksort can't be used
effectively when the data set exceeds RAM. Mergesort and some others
can be used regardless of the data set size. (See the third sketch
at the end of this message.)

Also remember that it's the pattern of data access that's important -
not what is actually being done with it. Even when data fits in
memory, random access algorithms often are not the best. Prediction,
caching and paging concerns often make the most effective algorithm
require complex rearrangement and staging of data that is unnatural
wrt the overall approach to the problem.

Disk file systems are optimized for sequential access. When data is
file based, sequential access is to be preferred unless there's no
other choice.
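Since examples are easier to argue about than prose, here are three
rough sketches in Racket. They are illustrations only - all the
names, sizes and encodings are my guesses, not your code.

Sketch 1: the {bucket x chain_size x 16} byte array, using the
numbers you gave (6M buckets, at most 5 keys per chain, 16-byte
keys).

#lang racket

;; one flat byte array: NBUCKETS * CHAIN slots, KEYLEN bytes per slot
(define NBUCKETS 6000000)
(define CHAIN 5)
(define KEYLEN 16)

(define table  (make-bytes (* NBUCKETS CHAIN KEYLEN) 0))
(define counts (make-bytes NBUCKETS 0))   ; slots used per bucket

(define (slot-offset b i) (* (+ (* b CHAIN) i) KEYLEN))

;; #t if key (a 16-byte bytestring) is already in bucket b
(define (bucket-member? b key)
  (for/or ([i (in-range (bytes-ref counts b))])
    (define off (slot-offset b i))
    (bytes=? key (subbytes table off (+ off KEYLEN)))))

;; add key to bucket b unless present; #f when the chain is full
(define (bucket-insert! b key)
  (define n (bytes-ref counts b))
  (cond [(bucket-member? b key) #t]
        [(= n CHAIN) #f]
        [else (bytes-copy! table (slot-offset b n) key)
              (bytes-set! counts b (add1 n))
              #t]))

That is 6M x 5 x 16 = 480MB for the keys plus 6MB of counts - no
per-key boxing, no pointer chasing, and "growing" a chain up to the
bound is just bumping a count.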
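Sketch 2: the bignum-as-bitset. This only illustrates the technique
you described - I'm guessing at the exact 2-bit packing, so the
encoding here may not match yours.

#lang racket

(define seen 0)   ; bit k = 1 means encoded sequence k was generated

;; pack a list of operators (each 0-3) into an integer, 2 bits each
(define (encode ops)
  (for/fold ([n 0]) ([op (in-list ops)])
    (bitwise-ior (arithmetic-shift n 2) op)))

(define (seen? ops)
  (bitwise-bit-set? seen (encode ops)))

(define (mark! ops)
  (set! seen (bitwise-ior seen (arithmetic-shift 1 (encode ops)))))

The lookup via bitwise-bit-set? is indeed cheap, but note that
bignums are immutable: mark! builds a whole new bignum every time,
so on a .5 gb bignum each insert is a .5 gb copy. A mutable byte
string used as a bit vector gives you the same O(1) lookup without
the copying.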
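Sketch 3: a file-based ("external") merge sort, to show why the data
set size doesn't matter. This toy version treats records as text
lines compared with string<? - substitute your real record reader
and comparison. CHUNK is however many records fit comfortably in
RAM, and a non-empty input file is assumed.

#lang racket

(define CHUNK 1000000)

;; pass 1: cut the input into CHUNK-sized pieces, sort each piece in
;; RAM, write each out as a sorted "run" file
(define (make-runs in-path)
  (call-with-input-file in-path
    (lambda (in)
      (let loop ([runs '()])
        (define chunk
          (for/list ([ln (in-lines in)] [_ (in-range CHUNK)]) ln))
        (cond [(null? chunk) (reverse runs)]
              [else
               (define run (make-temporary-file))
               (call-with-output-file run #:exists 'truncate
                 (lambda (out)
                   (for ([ln (in-list (sort chunk string<?))])
                     (displayln ln out))))
               (loop (cons run runs))])))))

;; merge two sorted runs into one - purely sequential reads/writes
(define (merge-runs a b)
  (define out-path (make-temporary-file))
  (define ia (open-input-file a))
  (define ib (open-input-file b))
  (call-with-output-file out-path #:exists 'truncate
    (lambda (out)
      (let loop ([x (read-line ia)] [y (read-line ib)])
        (cond [(eof-object? x)
               (unless (eof-object? y)
                 (displayln y out)
                 (loop x (read-line ib)))]
              [(or (eof-object? y) (string<? x y))
               (displayln x out)
               (loop (read-line ia) y)]
              [else
               (displayln y out)
               (loop x (read-line ib))]))))
  (close-input-port ia)
  (close-input-port ib)
  out-path)

;; keep merging pairs of runs until one sorted file remains;
;; each record is touched O(log M) times, M = number of runs
(define (external-sort in-path)
  (let loop ([runs (make-runs in-path)])
    (if (null? (cdr runs))
        (car runs)
        (loop (for/list ([pair (in-slice 2 runs)])
                (if (null? (cdr pair))
                    (car pair)
                    (merge-runs (car pair) (cadr pair))))))))

Note that every pass reads and writes its files strictly front to
back - exactly the sequential access pattern the file system is
optimized for, no matter how far the data set exceeds RAM.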
George