welcome to 8:24pm on sunday evening.
it's been hard to move near the code i started. tomorrow morning i have an
appointment again. maybe i can paste in a snippet and comment on a concern.
this is part of where i left off. last time i elided the in-progress
comments; here they're included:
idx = 0
while idx < len(offset_lengths):
    #for idx in pbar:#range(len(offset_lengths)):
        #offset, length = offset_lengths[idx]
        #tail = min(offset + length, len(self.mmap))
        #aligned_offset = (offset // self.blksize) * self.blksize
        #aligned_tail = min(self.size(), (((tail - 1) // self.blksize) + 1) * self.blksize)
    aligned_offset = aligned_offsets[idx].item()
    if aligned_offset > min_hole:
        next_hole = self._next_sparse(aligned_offset, os.SEEK_HOLE)
    else:
        next_hole = min_hole
    missing_idcs = (next_hole < tails[idx:]).nonzero()[:,0]
    # here we are handling all undownloaded indices before the next cached ones
    # there could be many pages between them that don't need to be fetched
    num_missing_idcs = missing_idcs.shape[0]
    if num_missing_idcs > 0:
i expect the logic in the above is not quite correct. i've been trying to
re-understand `missing_idcs = (next_hole < tails[idx:]).nonzero()[:,0]`. i
think this line was ported from the non-batched loop in a naive manner.
roughly, i need to figure out how to correctly process the chunks of idcs that
are cached and those that are not. when they are cached, they are represented
by a data region in the sparse mmap file. when they are not cached, they are
represented by a hole region.
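the cached/uncached distinction above can be sketched with a toy region map in
place of a real sparse file (the names here are hypothetical, not from the
codebase):

```python
def is_cached(offset, data_regions):
    """True when `offset` falls inside a data region (i.e. the page is
    cached); anything outside the data regions is a hole."""
    return any(start <= offset < end for start, end in data_regions)

# toy cache state: bytes [0, 4096) and [12288, 16384) are cached
data_regions = [(0, 4096), (12288, 16384)]

assert is_cached(0, data_regions)          # inside a data region
assert not is_cached(8192, data_regions)   # inside a hole: needs fetching
```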
`self._next_sparse(off, region)` is a wrapper around `os.lseek` -- given an
offset and a region type, it returns the offset of the next byte with the
given region type, such that it returns the passed offset if that offset is
already of the given region type. this is presently the only interface for
determining what is cached and what is not; it works fine for procedural scans
and should work fine here for now. (the caller of read_many needs to provide
the offsets and lengths of the requested data already.) [it's also reasonable
to add an index to the cache, although it would add the concern of syncing
sparsity to manage disk space. a sparse file is a special kind of file where
regions of the file do not occupy disk space and cannot contain nonzero data;
these regions are called holes, and they can be written to, which allocates
disk space when that is done.]
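a minimal sketch of what a wrapper like `_next_sparse` might look like on top
of `os.lseek`, plus a tiny demo of a sparse file made with `truncate` (the
exact hole boundaries depend on the filesystem, so this is illustrative, not
authoritative):

```python
import os
import tempfile

def next_sparse(fd, off, region):
    """Return the offset of the next byte at or after `off` whose region
    type matches `region` (os.SEEK_DATA or os.SEEK_HOLE); returns `off`
    itself when `off` already sits in a region of that type."""
    try:
        return os.lseek(fd, off, region)
    except OSError:
        # e.g. ENXIO when seeking for data past the last data region
        return None

with tempfile.TemporaryFile() as f:
    f.write(b"x" * 4096)   # one data region at the front
    f.truncate(16384)      # extend with a hole: no disk space, reads as zeros
    f.flush()
    # the first hole starts where the written data ends (filesystem permitting)
    hole = next_sparse(f.fileno(), 0, os.SEEK_HOLE)
```

on filesystems without SEEK_HOLE support the kernel falls back to reporting
the whole file as data, so `hole` can come back as the file size rather than
4096.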
here's the current head of read_many:
def read_many(self, offset_lengths, progress, validate_sorted=True):
    if validate_sorted:
        sorted_offset_lengths = list(offset_lengths)
        sorted_offset_lengths.sort()
        assert sorted_offset_lengths == offset_lengths
    OP_FETCH = 1
    OP_PLACE = 2
    OP_OUTPUT = 4
    offset_length_tail_idx_ops = torch.zeros([offset_lengths.shape[0]*2, 5])
    OFFSET, LENGTH, TAIL, IDX, OP = range(offset_length_tail_idx_ops.shape[-1])
    op_ct = 0
my idea to prototype things was to have the structure
offset_length_tail_idx_ops contain, in one parallel block, all the data needed
to operate. so first this structure is filled with data in batches using
vector operations, and then as much of it as possible is operated on at once.
the abstraction bounds can change; it's just what i have at the moment, and
obviously everything's confusing and slow for me, so i'm working through it
part by part.
the name `offset_length_tail_idx_ops` is chosen to directly represent the
content of the rows, in order. this is a convention i use to code using less
of my working memory. i'm happy to add or remove data from this structure, and
i do a find/replace to update the name when i do so.
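the convention can be sketched like so (numpy standing in for torch here so
the snippet is self-contained; the values are made up):

```python
import numpy as np

# column constants unpacked in the same order the name spells out
OFFSET, LENGTH, TAIL, IDX, OP = range(5)

offset_length_tail_idx_ops = np.zeros((2, 5), dtype=np.int64)
offset_length_tail_idx_ops[0, OFFSET] = 4096
offset_length_tail_idx_ops[0, LENGTH] = 512
offset_length_tail_idx_ops[0, TAIL] = 4096 + 512

assert offset_length_tail_idx_ops[0, TAIL] == 4608
```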
so the next step i was planning to implement involved preparing to fill this
structure with `OP_OUTPUT` entries, one for each of the `offset_lengths`,
specifying what values to output back to the user. this is maybe not
necessary, but it focuses me on the next logical concern -- ensuring uncached
data is properly fetched while cached data is not. so maybe not quite needed
yet.
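one way that `OP_OUTPUT` fill could look, vectorized the way the structure
intends (numpy in place of torch; the offsets and lengths are made up):

```python
import numpy as np

OP_FETCH, OP_PLACE, OP_OUTPUT = 1, 2, 4
OFFSET, LENGTH, TAIL, IDX, OP = range(5)

# made-up requests, sorted by offset
offset_lengths = np.array([[0, 10], [4096, 20], [8192, 30]])

n = offset_lengths.shape[0]
rows = np.zeros((n, 5), dtype=np.int64)
rows[:, OFFSET] = offset_lengths[:, 0]
rows[:, LENGTH] = offset_lengths[:, 1]
rows[:, TAIL] = rows[:, OFFSET] + rows[:, LENGTH]
rows[:, IDX] = np.arange(n)       # which user request each row answers
rows[:, OP] = OP_OUTPUT           # every request gets an output entry

assert (rows[:, TAIL] == [10, 4116, 8222]).all()
```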
i've been confused all day around `missing_idcs = (next_hole <
tails[idx:]).nonzero()[:,0]` despite writing this line earlier myself [after
success finding region bounds]. given the loop is at `idx`, i've gotta figure
out which `idcs` to consider cached, which to fetch, and how to organize that
into a larger loop that repeats. i'm just getting really confused around that
simple concern
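a self-contained sketch of one way to untangle that line: instead of comparing
a single next_hole against every remaining tail, alternate between SEEK_HOLE
and SEEK_DATA to emit runs of cached and missing indices. `make_next_sparse`
here is a toy stand-in for `self._next_sparse`, backed by a list of data
regions instead of lseek on a real sparse file, and the run logic assumes
tails are nondecreasing -- all names and numbers below are made up for the
sketch.

```python
SEEK_DATA, SEEK_HOLE = "data", "hole"

def make_next_sparse(data_regions, size):
    """Toy stand-in for self._next_sparse over a sorted region list."""
    def next_sparse(off, region):
        if region == SEEK_HOLE:
            for start, end in data_regions:
                if start <= off < end:
                    return end      # next hole begins where this data region ends
            return off              # already inside a hole
        for start, end in data_regions:
            if start <= off < end:
                return off          # already inside data
            if start > off:
                return start        # next data region ahead
        return size                 # nothing left but the implicit hole at EOF
    return next_sparse

def partition_runs(offsets, tails, next_sparse):
    """offsets/tails: sorted aligned chunk bounds; assumes tails are
    nondecreasing (chunks don't nest). Returns ("cached"|"fetch", idcs) runs;
    a chunk overlapping any hole is fetched whole."""
    runs = []
    idx, n = 0, len(offsets)
    while idx < n:
        next_hole = next_sparse(offsets[idx], SEEK_HOLE)
        j = idx
        while j < n and tails[j] <= next_hole:
            j += 1                  # fully inside data: cached
        if j > idx:
            runs.append(("cached", list(range(idx, j))))
            idx = j
            continue
        # chunk idx overlaps a hole; everything starting before data
        # resumes is grouped into one fetch run
        next_data = next_sparse(next_hole, SEEK_DATA)
        j = idx
        while j < n and offsets[j] < next_data:
            j += 1
        runs.append(("fetch", list(range(idx, j))))
        idx = j
    return runs

# toy cache: data at [0, 4096) and [12288, 16384) in a 16 KiB file
ns = make_next_sparse([(0, 4096), (12288, 16384)], 16384)
runs = partition_runs([0, 4096, 12288], [4096, 8192, 16384], ns)
assert runs == [("cached", [0]), ("fetch", [1]), ("cached", [2])]
```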