On Wed, 27 Jan 2016 11:17:04 -0800 (PST), Scotty C <costo...@hotmail.com> wrote:
>On Wednesday, January 27, 2016 at 2:57:42 AM UTC-6, gneuner2 wrote:
>
>> What is this other field on which the file is sorted?
>
>this field is the cost in operators to arrive at the key value

Is it important to retain that sorting? Or is it just informational?

>> In the worst case of every key being a bignum
>
>no, every key is contained within a bignum which can contain many
>many keys.

Then you're not using the hash in a conventional manner ... else the
filter entries would be unique ... and we really have no clue what
you're actually doing. So any suggestions we give you are shots in
the dark.

>> Since you are only comparing the hash entries for equality, you
>> could save a lot of space [at the expense of complexity] by
>> defining a {bucket x chain_size x 16} byte array and storing the
>> 16-byte keys directly.
>
>i must be able to grow the chains. i can't make it fixed size like
>that.

You said previously:

>>> ... i use separate chaining with a vector of bignums. i am
>>> willing to let my chains run up to 5 keys per chain so i need a
>>> vector of 6 million pointers.

That reads as needing a maximum of 6M x 5 entries. Extensible
structures don't have to be a list - a small array works very well
when the number of entries is bounded [which you said it was]. (See
the first sketch at the end of this message.) Again, this is a case
where we really have no clue what you are doing.

>> > have another rather large bignum in memory that i use to reduce
>> > but not eliminate record duplication of about .5 gb.
>>
>> ???
>
>ha, ok. this is what this bignum is for. cycle elimination. a
>sequence of operators (2 bit per) when strung together is a number
>like 4126740 which represents the operator sequence
>(0 0 3 0 0 3 1 1 2 2 1). i change that bit in the bignum from 0 to
>1. during data generation i look up my about to be applied operator
>sequence in the bignum. if i see a one, i skip data generation. i'm
>not really happy with the volume of memory this takes but it is an
>insanely fast lookup and keeps a ton of data off the hard drive.

So the bignum is acting as one giant bitset indexed by the encoded
operator sequence. (See the second sketch at the end of this
message.)

>> In the meantime, I would suggest you look up "merge sort" and it's
>
>logarithmic? not happening

By splitting the too-large data file and trying to process it in
sections, you're already O(M*N) where M is the number of sections.
Logarithmic might improve things.

Rejecting an approach due solely to its big-O is naive. The order
tells how it scales in the limit ... not how it behaves in any
particular situation. In most cases there are constants and other
linear multipliers that dominate the running time. It matters not
just how many times you touch the data, but also the cost of each
touch.

E.g., people read that Quicksort is O(N*LogN) and ignore that it is
O(N*N) under certain circumstances. Mergesort has a higher constant
multiplier, but it is O(N*LogN) in _all_ cases ... including those
that throw Quicksort into fits. Plus Quicksort can't be used
effectively when the data set exceeds RAM. Mergesort and some others
can be used regardless of the data set size. (See the third sketch
at the end of this message.)

Also remember that it's the pattern of data access that's important -
not what is actually being done with it. Even when data fits in
memory, random access algorithms often are not the best. Prediction,
caching and paging concerns often make the most effective algorithm
require complex rearrangement and staging of data that is unnatural
wrt the overall approach to the problem.

Disk file systems are optimized for sequential access. When data is
file based, sequential access is to be preferred unless there's no
other choice.
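Since examples are easier to argue about than prose, here are three
rough sketches in Racket. They are illustrations only - all the
names, sizes and encodings are my guesses, not your code.

Sketch 1: the {bucket x chain_size x 16} byte array, using the
numbers you gave (6M buckets, at most 5 keys per chain, 16-byte
keys).

#lang racket

;; one flat byte array: NBUCKETS * CHAIN slots, KEYLEN bytes per slot
(define NBUCKETS 6000000)
(define CHAIN 5)
(define KEYLEN 16)

(define table  (make-bytes (* NBUCKETS CHAIN KEYLEN) 0))
(define counts (make-bytes NBUCKETS 0))   ; slots used per bucket

(define (slot-offset b i) (* (+ (* b CHAIN) i) KEYLEN))

;; #t if key (a 16-byte bytestring) is already in bucket b
(define (bucket-member? b key)
  (for/or ([i (in-range (bytes-ref counts b))])
    (define off (slot-offset b i))
    (bytes=? key (subbytes table off (+ off KEYLEN)))))

;; add key to bucket b unless present; #f when the chain is full
(define (bucket-insert! b key)
  (define n (bytes-ref counts b))
  (cond [(bucket-member? b key) #t]
        [(= n CHAIN) #f]
        [else (bytes-copy! table (slot-offset b n) key)
              (bytes-set! counts b (add1 n))
              #t]))

That is 6M x 5 x 16 = 480MB for the keys plus 6MB of counts - no
per-key boxing, no pointer chasing, and "growing" a chain up to the
bound is just bumping a count.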
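Sketch 2: the bignum-as-bitset. This only illustrates the technique
you described - I'm guessing at the exact 2-bit packing, so the
encoding here may not match yours.

#lang racket

(define seen 0)   ; bit k = 1 means encoded sequence k was generated

;; pack a list of operators (each 0-3) into an integer, 2 bits each
(define (encode ops)
  (for/fold ([n 0]) ([op (in-list ops)])
    (bitwise-ior (arithmetic-shift n 2) op)))

(define (seen? ops)
  (bitwise-bit-set? seen (encode ops)))

(define (mark! ops)
  (set! seen (bitwise-ior seen (arithmetic-shift 1 (encode ops)))))

The lookup via bitwise-bit-set? is indeed cheap, but note that
bignums are immutable: mark! builds a whole new bignum every time,
so on a .5 gb bignum each insert is a .5 gb copy. A mutable byte
string used as a bit vector gives you the same O(1) lookup without
the copying.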
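Sketch 3: a file-based ("external") merge sort, to show why the data
set size doesn't matter. This toy version treats records as text
lines compared with string<? - substitute your real record reader
and comparison. CHUNK is however many records fit comfortably in
RAM, and a non-empty input file is assumed.

#lang racket

(define CHUNK 1000000)

;; pass 1: cut the input into CHUNK-sized pieces, sort each piece in
;; RAM, write each out as a sorted "run" file
(define (make-runs in-path)
  (call-with-input-file in-path
    (lambda (in)
      (let loop ([runs '()])
        (define chunk
          (for/list ([ln (in-lines in)] [_ (in-range CHUNK)]) ln))
        (cond [(null? chunk) (reverse runs)]
              [else
               (define run (make-temporary-file))
               (call-with-output-file run #:exists 'truncate
                 (lambda (out)
                   (for ([ln (in-list (sort chunk string<?))])
                     (displayln ln out))))
               (loop (cons run runs))])))))

;; merge two sorted runs into one - purely sequential reads/writes
(define (merge-runs a b)
  (define out-path (make-temporary-file))
  (define ia (open-input-file a))
  (define ib (open-input-file b))
  (call-with-output-file out-path #:exists 'truncate
    (lambda (out)
      (let loop ([x (read-line ia)] [y (read-line ib)])
        (cond [(eof-object? x)
               (unless (eof-object? y)
                 (displayln y out)
                 (loop x (read-line ib)))]
              [(or (eof-object? y) (string<? x y))
               (displayln x out)
               (loop (read-line ia) y)]
              [else
               (displayln y out)
               (loop x (read-line ib))]))))
  (close-input-port ia)
  (close-input-port ib)
  out-path)

;; keep merging pairs of runs until one sorted file remains;
;; each record is touched O(log M) times, M = number of runs
(define (external-sort in-path)
  (let loop ([runs (make-runs in-path)])
    (if (null? (cdr runs))
        (car runs)
        (loop (for/list ([pair (in-slice 2 runs)])
                (if (null? (cdr pair))
                    (car pair)
                    (merge-runs (car pair) (cadr pair))))))))

Note that every pass reads and writes its files strictly front to
back - exactly the sequential access pattern the file system is
optimized for, no matter how far the data set exceeds RAM.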
George