Btw, this is the code getting the data each iteration:

# @profile
def train(self, n_epoch):
    import gc
    import sys
    import time
    self.tactic_predictor.train()
    avg_loss = AverageMeter('train loss')
    avg_acc = AverageMeter('train accuracy')
    # iterations = len(self.dataloaders['train'])
    # bar = ProgressBar(max_value=iterations)
    self.dataloaders['train'] = iter(self.dataloaders['train'])
    # for i, data_batch in enumerate(self.dataloaders['train']):
    for i in range(len(self.dataloaders['train'])):
        data_batch = next(self.dataloaders['train'])
        data_batch = process_batch_ddp(self.opts, data_batch)
        # loss, logits = self.tactic_predictor(data_batch)
        # acc = accuracy(output=logits, target=data_batch[1])
        # avg_loss.update(loss, self.opts.batch_size)
        # avg_acc.update(acc, self.opts.batch_size)
        self.log(f'{i=}')
        # self.log(f"{i=}: {loss=}")
        # self.optimizer.zero_grad()
        # loss.backward()  # each process synchronizes its gradients in the backward pass
        # self.optimizer.step()  # the right update is done since all procs have the synced grads
        # del loss
        # del logits
        # del data_batch
        gc.collect()
        # bar.update(i)
        if i >= 10:
            time.sleep(2)
            sys.exit()
    return avg_loss.item(), avg_acc.item()
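For reference, here is a minimal standard-library sketch (not my actual training code; the growing list is a stand-in for whatever leaks per iteration) of how diffing tracemalloc snapshots can point at the exact line that accumulates allocations across iterations:

```python
import tracemalloc

tracemalloc.start()

leaky = []  # stand-in for state that keeps growing across iterations
baseline = tracemalloc.take_snapshot()
for i in range(5):
    leaky.append(bytearray(1_000_000))  # simulated per-iteration leak (~1 MB each)

current = tracemalloc.take_snapshot()
# lines whose net allocations grew the most since the baseline come first
for stat in current.compare_to(baseline, 'lineno')[:3]:
    print(stat)
```

Taking one snapshot per training iteration and comparing consecutive ones narrows the leak down to file and line number without guessing.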
On Wed, Mar 31, 2021 at 11:18 AM Brando Miranda <[email protected]>
wrote:
> Hi Kenton,
>
> Thanks for the reply. I didn't want to overwhelm you guys; it's been hard
> to decide what to share.
>
> Perhaps this will be a good peek into the main function giving me problems:
>
> class DagDataset(Dataset):
>
>     def __init__(self, path2dataprep, path2hash2idx, split):
>         self.split = split
>         self.path2dataprep = path2dataprep
>         db = torch.load(self.path2dataprep)
>         self.data_prep = db['data_prep']
>         self.list_files_current_split = self.data_prep.flatten_lst_files_split[self.split]
>         self.list_counts_current_split = self.data_prep.counts[self.split]
>         self.list_cummulative_sum_current_split = self.data_prep.cummulative_sum[self.split]
>         self.list_cummulative_end_index_current_split = self.data_prep.cummulative_end_index[self.split]
>         self.length = sum(self.list_counts_current_split)
>         #
>         self.path2hash2idx = path2hash2idx
>         db = torch.load(self.path2hash2idx)
>         self.hash2idx = db['hash2idx']
>
>     def __len__(self):
>         return self.length
>
>     def __getitem__(self, idx: int) -> DagNode:
>         # gets the file idx for the value we want
>         file_idx = bisect.bisect_left(self.list_cummulative_end_index_current_split, idx)
>         # now get the actual file name
>         file_name = self.list_files_current_split[file_idx]
>         # get the file with proof steps
>         file_name = self.convert_to_local_home_path(file_name)
>         f = open(file_name, 'rb')
>         current_dag_file = dag_api_capnp.Dag.read_packed(f, traversal_limit_in_words=2 ** 64 - 1)
>         # current_dag_file = dag_api_capnp.Dag.read_packed(f)
>         # - global idx 2 idx relative to this file
>         prev_cummulative_sum = self.get_previous_cummulative_sum(file_idx)
>         idx_rel_this_file = idx - prev_cummulative_sum
>         # - data point
>         node_idx = current_dag_file.proofSteps[idx_rel_this_file].node
>         tactic_hash = current_dag_file.proofSteps[idx_rel_this_file].tactic
>         tactic_label = self.hash2idx[tactic_hash]
>         # - get Node obj
>         node_ref = NodeRef(node_idx, 0)  # indicates it's in the current file, in this case named current_dag_file
>         node = DagNode(current_dag_file, node_ref)
>         # node = current_dag_file
>         f.close()
>         return node, tactic_label
>
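(For anyone following along, the global-to-local index mapping in __getitem__ boils down to the pattern below; the counts here are made up. One caveat I had to double-check: bisect_left is only correct if the stored end indices are inclusive; with exclusive, running-sum end indices, bisect_right is the right call:)

```python
import bisect
from itertools import accumulate

counts = [3, 5, 2]                  # hypothetical proof-step counts per file
cum_end = list(accumulate(counts))  # exclusive end indices: [3, 8, 10]

def locate(idx: int) -> tuple:
    """Map a global sample index to (file index, index within that file)."""
    file_idx = bisect.bisect_right(cum_end, idx)  # first file whose end is > idx
    prev = cum_end[file_idx - 1] if file_idx > 0 else 0
    return file_idx, idx - prev

assert locate(0) == (0, 0)  # first step of the first file
assert locate(3) == (1, 0)  # first step of the second file
assert locate(9) == (2, 1)  # last step of the last file
```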
>
> My suspicion is that it's this line of code that is giving me issues:
>
> current_dag_file = dag_api_capnp.Dag.read_packed(f, traversal_limit_in_words=2 ** 64 - 1)
>
> I think that if I return the current_dag_file reference directly, and do a del
> plus gc.collect(), instead of returning the DagNode wrapper I defined around
> it, the memory problem goes away (I'm not 100% sure, but fairly confident).
>
> I will confirm if this is true in a sec.
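The pattern I'm testing is to copy the few plain Python values I need out of the message before returning, so nothing in the returned item keeps the reader (and the whole loaded message) alive. A capnp-free sketch of the idea (FakeReader and its field names are stand-ins, not real pycapnp API):

```python
import gc
import weakref

class FakeReader:
    """Stand-in for a message reader that owns a large in-memory buffer."""
    def __init__(self):
        self.buffer = bytearray(10 * 1024 * 1024)  # pretend 10 MB loaded message
        self.node = 7
        self.tactic = 0xABCD

def getitem_copying(reader):
    # copy plain Python values out; keep no reference to the reader itself
    return int(reader.node), int(reader.tactic)

reader = FakeReader()
alive = weakref.ref(reader)       # lets us observe whether the reader was freed
item = getitem_copying(reader)
del reader
gc.collect()
assert alive() is None            # the 10 MB buffer was actually released
assert item == (7, 0xABCD)
```

A wrapper like DagNode that stores the reader as an attribute would keep alive() non-None here, which is exactly the kind of retention that adds up across DataLoader iterations.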
>
>
> On Wed, Mar 31, 2021 at 10:56 AM Kenton Varda <[email protected]>
> wrote:
>
>> Hi Brando,
>>
>> It's hard for us to guess what might be the problem without seeing more
>> code.
>>
>> -Kenton
>>
>> On Tue, Mar 30, 2021 at 12:56 PM Brando Miranda <[email protected]>
>> wrote:
>>
>>> Hi,
>>>
>>> I am doing machine learning with Cap'n Proto (because Cap'n Proto is
>>> good at communicating between Python and other languages).
>>>
>>> The challenge I have is that my data is represented as Cap'n Proto
>>> structs. I load a batch of these structs every so often to process
>>> them with a neural network. However, after a certain number of
>>> iterations all of the system's memory ends up allocated and I get a
>>> SIGKILL from the OOM killer.
>>>
>>> I am unsure why this happens or where (I've been memory profiling my
>>> code all day, but it is difficult to figure out which part is breaking). I
>>> am fairly sure it has to do with Cap'n Proto, because I used to have a
>>> version of the dataset stored as JSON files and didn't have this error, but
>>> now I do. I could use the Cap'n Proto dataset to create a new JSON
>>> dataset to really confirm whether that is the case, but it seems redundant.
>>>
>>> I thought it could be that I open a Cap'n Proto file and never properly
>>> close it:
>>>
>>>
>>> file_name = self.convert_to_local_home_path(file_name)
>>> f = open(file_name)
>>> ...
>>> # bunch of processing to make it into my Python class
>>> ...
>>> f.close()
>>> return x, y
>>>
>>> I am explicitly closing the file from Cap'n Proto, so I'd assume that
>>> isn't the case. But anyway, is there a way to really check whether the memory
>>> errors I am getting are due to Cap'n Proto or not?
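One standard-library way to check is to log resident memory around the suspect calls; if RSS only climbs across the Cap'n Proto reads and nowhere else, that points the finger. A minimal sketch (the bytearray list is a stand-in for the suspect reads; note ru_maxrss is in KiB on Linux but bytes on macOS):

```python
import resource

def peak_rss() -> int:
    """Peak resident set size of this process (KiB on Linux, bytes on macOS)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

before = peak_rss()
blobs = [bytearray(1_000_000) for _ in range(20)]  # stand-in for suspect reads
after = peak_rss()
print(f"peak RSS grew by ~{after - before} (platform-dependent units)")
```

Logging this before and after each read_packed call in __getitem__ (and after gc.collect()) shows whether the growth tracks the Cap'n Proto code path or something else in the pipeline.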
>>>
>>> Thanks, Brando
>>>
>>>
>>>
>>>
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "Cap'n Proto" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/capnproto/ac8e4eba-11e9-44cc-9095-4313e4b7e544n%40googlegroups.com
>>> .
>>>
>>