anirudhacharya commented on a change in pull request #14503: API to create
RecordIO files
URL: https://github.com/apache/incubator-mxnet/pull/14503#discussion_r269837202
##########
File path: python/mxnet/io/io.py
##########
@@ -966,6 +988,164 @@ def creator(*args, **kwargs):
creator.__doc__ = doc_str
return creator
+
+def _read_list(list_file, batch_size):
+    """
+    Helper function that reads the .lst file, binds it in
+    a generator and returns a batched version of the generator.
+    Parameters
+    ----------
+    list_file: input list file.
+    batch_size: batch size of the generator
+    Returns
+    -------
+    item iterator that contains information in .lst file
+    """
+    def get_generator():
+        """
+        wrap a generator around the list file
+        """
+        with open(list_file) as fin:
+            while True:
+                line = fin.readline()
+                if not line:
+                    break
+                line = [i.strip() for i in line.strip().split('\t')]
+                line_len = len(line)
+                # check the data format of .lst file
+                if line_len < 3:
+                    logging.info("lst should have at least has three parts, \
+                                 but only has {} parts for {}".format(line_len, line))
+                    continue
+                try:
+                    item = [int(line[0])] + [line[-1]] + [float(i) for i in line[1:-1]]
+                except Exception:
+                    logging.info('Parsing lst met error for {}, detail: {}'.format(line))
+                    continue
+                yield item
+    data_iter = iter(get_generator())
+    data_batch = list(itertools.islice(data_iter, batch_size))
+    while data_batch:
Review comment:
@nswamy also made a similar comment.
I wanted to create a batched iterator over the list file, and I thought
using a generator in this fashion was the best choice. What would you suggest?
How do I read the list file in batches without loading the whole file?
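
For reference, here is a minimal standalone sketch of the batching pattern being discussed: a lazy line reader wrapped by a generator that slices off `batch_size` items at a time with `itertools.islice`. The names `iter_batches` and `read_lst_lines` are hypothetical and are not part of this PR or of MXNet; the sketch only assumes a tab-separated .lst file and skips the per-field type conversion done in `_read_list`.

```python
import itertools


def iter_batches(iterable, batch_size):
    """Yield lists of up to `batch_size` items, pulling lazily from `iterable`."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    iterator = iter(iterable)
    while True:
        # islice consumes at most batch_size items from the shared iterator.
        batch = list(itertools.islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def read_lst_lines(list_file):
    """Yield the tab-separated fields of each non-empty line, one line at a time."""
    with open(list_file) as fin:
        for line in fin:
            stripped = line.strip()
            if stripped:
                yield [field.strip() for field in stripped.split('\t')]
```

With this shape, `for batch in iter_batches(read_lst_lines(path), batch_size): ...` holds only one batch of parsed lines at a time, and the file is closed once the reader generator is exhausted.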