ShootingSpace opened a new issue #10068: rnn.encode_sentences deals with unknown token
URL: https://github.com/apache/incubator-mxnet/issues/10068
 
 
   ## Description
   For `rnn.encode_sentences()`, could MXNet let users define how unknown tokens are handled when a `vocab` dictionary is given, instead of just raising an `AssertionError`?
   
   For example, when a pre-built `vocab` dictionary is already supplied, I would like out-of-vocabulary words to be encoded as an unknown token such as "UNK" in the returned `res` list.
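
   For reference, one workaround today is to map out-of-vocabulary words to "UNK" before calling `mx.rnn.encode_sentences`; a minimal sketch, assuming "UNK" already has an index in the given vocabulary:
   ```python
   import mxnet as mx

   # Assumed toy vocabulary; 'UNK' must already be present so the assertion
   # inside encode_sentences never fires.
   word2idx = {'\n': -1, 'UNK': 0, 'the': 1, 'cat': 2}
   sentences = [['the', 'cat', 'sri']]

   # Replace out-of-vocabulary words before encoding.
   cleaned = [[w if w in word2idx else 'UNK' for w in s] for s in sentences]
   coded, _ = mx.rnn.encode_sentences(cleaned, vocab=word2idx)
   print(coded)  # [[1, 2, 0]]
   ```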
   
   ## Environment info (Required)
   
   ```
   Name
   Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz
   
   D:\Anaconda3\envs\mlp\lib\site-packages\urllib3\contrib\pyopenssl.py:46: DeprecationWarning: OpenSSL.rand is deprecated - you should use os.urandom instead
     import OpenSSL.SSL
   ----------Python Info----------
   Version      : 3.6.3
   Compiler     : MSC v.1900 64 bit (AMD64)
   Build        : ('default', 'Oct  6 2017 10:25:46')
   Arch         : ('64bit', 'WindowsPE')
   ------------Pip Info-----------
   Version      : 9.0.1
   Directory    : D:\Anaconda3\envs\mlp\lib\site-packages\pip
   ----------MXNet Info-----------
   Version      : 1.0.0
   Directory    : D:\Anaconda3\envs\mlp\lib\site-packages\mxnet
   Commit Hash   : 9ef196909ec7bf9cdda66d5b97c92793109798e1
   ----------System Info----------
   Platform     : Windows-7-6.1.7601-SP1
   system       : Windows
   node         : CHC
   release      : 7
   version      : 6.1.7601
   ----------Hardware Info----------
   machine      : AMD64
   processor    : Intel64 Family 6 Model 42 Stepping 7, GenuineIntel
   ----------Network Test----------
   Setting timeout: 10
   Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0150 sec, LOAD: 1.8871 sec.
   Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0200 sec, LOAD: 0.4030 sec.
   Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.1960 sec, LOAD: 0.7660 sec.
   Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0410 sec, LOAD: 1.0991 sec.
   Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0170 sec, LOAD: 0.1960 sec.
   Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0120 sec, LOAD: 0.9331 sec.
   ```
   
   Package used (Python/R/Scala/Julia):
   (I'm using Python)
   
   ## Error Message:
   (Paste the complete error message, including stack trace.)
   ```python
   D:\Anaconda3\envs\mlp\lib\site-packages\urllib3\contrib\pyopenssl.py:46: DeprecationWarning: OpenSSL.rand is deprecated - you should use os.urandom instead
     import OpenSSL.SSL
   Traceback (most recent call last):
     File "code/train.py", line 56, in <module>
       train_corpus_indexing = corpus.tokenize(args.data_folder + 'wiki-train.txt', args.train_size)
     File "E:\BD Cloud\Study\nlu-coursework\code\lstm.py", line 50, in tokenize
       invalid_key='UNK')
     File "D:\Anaconda3\envs\mlp\lib\site-packages\mxnet\rnn\io.py", line 68, in encode_sentences
       assert new_vocab, "Unknown token %s"%word
   AssertionError: Unknown token sri
   ```
   
   ## Minimum reproducible example
   (If you are using your own code, please provide a short script that 
reproduces the error. Otherwise, please provide link to the existing example.)
   ```python
   import os

   import pandas as pd
   import mxnet as mx

   # invert_dict and load_lm_dataset are helpers defined elsewhere in my code
   # (see the sketch below); args comes from argparse in train.py.

   class Corpus(object):
       def __init__(self, vocab_size, vocab_file):
           self.vocab_size = vocab_size
           self.word2idx, self.idx2word = self.build_vocab(vocab_file)

       def build_vocab(self, vocab_file):
           '''Build the vocabulary from the given file.'''
           vocab = pd.read_table(vocab_file, header=None, sep=r"\s+",
                                 index_col=0, names=['count', 'freq'])
           idx2word = dict(enumerate(vocab.index[:self.vocab_size]))
           word2idx = invert_dict(idx2word)
           return word2idx, idx2word

       def tokenize(self, path, size=None):
           """Tokenizes a text file."""
           assert os.path.exists(path)
           # Load the corpus sentence by sentence
           corpus = load_lm_dataset(path)
           corpus_indexing, _ = mx.rnn.encode_sentences(corpus, vocab=self.word2idx,
                                                        invalid_key='UNK')
           if size:
               return corpus_indexing[:size]
           else:
               return corpus_indexing

   corpus = Corpus(vocab_size=args.vocab_size,
                   vocab_file=args.data_folder + 'vocab.wiki.txt')

   train_corpus_indexing = corpus.tokenize(args.data_folder + 'wiki-train.txt',
                                           args.train_size)
   ```
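
   The helpers `invert_dict` and `load_lm_dataset` above are defined elsewhere in my code; minimal stand-ins would look roughly like this (a sketch only, the real helpers may differ):
   ```python
   def invert_dict(d):
       """Invert an index->word mapping into word->index (illustrative stand-in)."""
       return {v: k for k, v in d.items()}

   def load_lm_dataset(path):
       """Return one list of whitespace-split tokens per non-empty line
       (illustrative stand-in)."""
       with open(path, encoding='utf-8') as f:
           return [line.split() for line in f if line.strip()]
   ```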
   ## Steps to reproduce
   (Paste the commands you ran that produced the error.)
   `python code/train.py`
   
   ## What have you tried to solve it?
   Could we add a parameter, say `unknown_token='UNK'`, to enable this feature? A possible modification:
   ```python
   def encode_sentences(sentences, vocab=None, invalid_label=-1,
                        invalid_key='\n', start_label=0, unknown_token=None):
       idx = start_label
       if vocab is None:
           vocab = {invalid_key: invalid_label}
           new_vocab = True
       else:
           new_vocab = False
           if unknown_token is not None and unknown_token not in vocab:
               # Register the unknown token once so every OOV word shares its index.
               if idx == invalid_label:
                   idx += 1
               vocab[unknown_token] = idx
               idx += 1
       res = []
       for sent in sentences:
           coded = []
           for word in sent:
               if word not in vocab:
                   if not new_vocab and unknown_token is not None:
                       # Encode out-of-vocabulary words as the unknown token.
                       word = unknown_token
                   else:
                       assert new_vocab, "Unknown token %s" % word
                       if idx == invalid_label:
                           idx += 1
                       vocab[word] = idx
                       idx += 1
               coded.append(vocab[word])
           res.append(coded)
       return res, vocab
   ```
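
   For illustration, with the proposed parameter and a fixed vocabulary (a hypothetical toy example), out-of-vocabulary words would be encoded as the index of `unknown_token` instead of triggering the assertion:
   ```python
   # Hypothetical usage of the proposed signature.
   vocab = {'\n': -1, 'UNK': 0, 'the': 1, 'cat': 2}
   coded, vocab = encode_sentences([['the', 'cat', 'sri']],
                                   vocab=vocab, unknown_token='UNK')
   print(coded)  # [[1, 2, 0]] -- 'sri' is coded as vocab['UNK']
   ```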
   
   
