welcome to 8:24pm on sunday evening

it's been hard to move near the code i started. tomorrow morning i have an 
appointment again.

maybe i can paste in a snippet and comment on a concern.

this is part of where i left off. last time i elided the in-progress 
commenting; here it's included:
            idx = 0
            while idx < len(offset_lengths):
            #for idx in pbar:#range(len(offset_lengths)):
                #offset, length = offset_lengths[idx]
                #tail = min(offset + length, len(self.mmap))
                #aligned_offset = (offset // self.blksize) * self.blksize
                #aligned_tail = min(self.size(), (((tail - 1) // self.blksize) + 1) * self.blksize)

                aligned_offset = aligned_offsets[idx].item()
                if aligned_offset > min_hole:
                    next_hole = self._next_sparse(aligned_offset, os.SEEK_HOLE)
                else:
                    next_hole = min_hole
                missing_idcs = (next_hole < tails[idx:]).nonzero()[:,0]
                # here we are handling all undownloaded indices before the next cached ones
                # there could be many pages between them that don't need to be fetched
                num_missing_idcs = missing_idcs.shape[0]
                if num_missing_idcs > 0:

i expect the logic in the above is not quite correct. i've been trying to 
re-understand `missing_idcs = (next_hole < tails[idx:]).nonzero()[:,0]`. i 
think this line was ported from the non-batched loop in a naive manner.
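to re-anchor myself, here's a toy run of that expression with made-up numbers 
(none of these values are from the real code), just to confirm what it 
computes:

```python
import torch

# made-up request tails (absolute end offsets), sorted ascending
tails = torch.tensor([4096, 8192, 12288, 16384, 20480])
idx = 1
next_hole = 10000  # pretend the next hole starts here

# indices relative to idx whose tail extends past the hole start
missing_idcs = (next_hole < tails[idx:]).nonzero()[:, 0]
print(missing_idcs.tolist())  # [1, 2, 3]
```

so the comparison flags every later request whose tail crosses the hole start 
-- including requests that could lie entirely inside data regions past the 
hole's end -- which i suspect is the naive part of the port.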

roughly, i need to figure out how to correctly process chunks of idcs that are 
cached and that are not cached. when they are cached, they are represented by a 
data region in the sparse mmap file. when they are not cached, they are 
represented by a hole region.

`self._next_sparse(off, region)` is a wrapper around `os.lseek` -- given an 
offset and a region type, it returns the offset of the next byte with the 
given region type, such that it returns the passed offset if that offset is 
already of the given region type. this is presently the only interface for 
determining what is cached and what is not; it works fine for procedural scans 
and should work fine here for now. (the caller of read_many needs to provide 
the offsets and lengths of the requested data already.) [it's also reasonable 
to add an index to the cache, although it would add a concern of syncing 
sparsity to manage diskspace. a sparse file is a special kind of file where 
regions of the file do not occupy disk space and cannot contain nonzero data; 
these regions are called holes, and they can be written to, which uses more 
disk space when done.]
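to double-check my own description, here's roughly how i understand such a 
wrapper to look (`next_sparse` here is a standalone stand-in, not the real 
method), demonstrated on a throwaway sparse file. SEEK_DATA/SEEK_HOLE need a 
filesystem that reports holes:

```python
import os
import tempfile

def next_sparse(fd, off, region):
    # region is os.SEEK_DATA or os.SEEK_HOLE; lseek returns the offset of
    # the next byte in a region of that type, and returns off unchanged
    # when off is already inside such a region
    try:
        return os.lseek(fd, off, region)
    except OSError:
        # SEEK_DATA at or past the last data region raises ENXIO
        return os.fstat(fd).st_size

# demo: a 64 KiB sparse file with one 4 KiB data region at offset 16384
with tempfile.NamedTemporaryFile() as f:
    fd = f.fileno()
    os.truncate(fd, 65536)
    os.pwrite(fd, b"x" * 4096, 16384)
    print(next_sparse(fd, 16384, os.SEEK_DATA))  # 16384: already in data
    print(next_sparse(fd, 16384, os.SEEK_HOLE))  # start of the next hole
```

note that on a filesystem without hole tracking, lseek reports the whole file 
as one data region, so the hole query just returns end-of-file.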

here's the current head of read_many:
        def read_many(self, offset_lengths, progress, validate_sorted=True):
            if validate_sorted:
                # offset_lengths is a tensor; compare via lists so sort() and == behave
                sorted_offset_lengths = offset_lengths.tolist()
                sorted_offset_lengths.sort()
                assert sorted_offset_lengths == offset_lengths.tolist()
            OP_FETCH = 1
            OP_PLACE = 2
            OP_OUTPUT = 4
            offset_length_tail_idx_ops = torch.zeros([offset_lengths.shape[0]*2, 5], dtype=torch.long)
            OFFSET, LENGTH, TAIL, IDX, OP = range(offset_length_tail_idx_ops.shape[-1])
            op_ct = 0

my idea to prototype things was to have the structure 
offset_length_tail_idx_ops contain, in one parallel block, all the data needed 
to operate. so first this structure is filled with data in batches using 
vector operations, and then as much of it as possible is operated on at once. 
the abstraction bounds can change; it's just what i have atm, and obviously 
everything's confusing and slow for me, so i'm working off it part by part.

the name `offset_length_tail_idx_ops` directly represents the content of the 
rows, in order. this is a convention i use to code using less of my working 
memory. i'm happy to add or remove data in this structure, and i do a 
find/replace to update the name when i do so.

so the next step i was planning to implement involved preparing to fill this 
structure with `OP_OUTPUT` entries, one for each of the `offset_lengths`, 
specifying what values to output back to the user. this is maybe not 
necessary, but it focuses me on the next logical concern -- ensuring uncached 
data is properly fetched, whereas cached data is not. so maybe not quite 
needed yet.
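a standalone toy of what that filling step might look like, assuming 
`offset_lengths` is an [N,2] long tensor (column names per my convention; the 
clamp of tails against the file size is omitted here):

```python
import torch

OP_FETCH, OP_PLACE, OP_OUTPUT = 1, 2, 4
OFFSET, LENGTH, TAIL, IDX, OP = range(5)

# made-up requests: [offset, length] per row, sorted by offset
offset_lengths = torch.tensor([[0, 100], [4096, 50], [8192, 10]])
n = offset_lengths.shape[0]

offset_length_tail_idx_ops = torch.zeros([n * 2, 5], dtype=torch.long)

# fill the first n rows with OP_OUTPUT entries in one vector operation;
# rows is a view, so the assignments write into the big structure
rows = offset_length_tail_idx_ops[:n]
rows[:, OFFSET] = offset_lengths[:, 0]
rows[:, LENGTH] = offset_lengths[:, 1]
rows[:, TAIL] = offset_lengths.sum(dim=1)  # tail = offset + length
rows[:, IDX] = torch.arange(n)
rows[:, OP] = OP_OUTPUT
op_ct = n

print(offset_length_tail_idx_ops[:op_ct])
```

the remaining n rows stay zeroed, leaving room for fetch/place entries to be 
appended as `op_ct` advances.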

been confused all day around `missing_idcs = (next_hole < 
tails[idx:]).nonzero()[:,0]` despite writing this line earlier myself. [after 
success on finding region bounds]. given the loop is at `idx`, i gotta figure 
out which `idcs` to consider cached, which to fetch, and how to organize that 
into a larger loop that repeats. i'm just getting really confused around that 
simple concern.
