On 5/19/2018 12:47 PM, Peter Otten wrote:
subhabangal...@gmail.com wrote:

I wrote the following small piece of code:

import nltk
from nltk.corpus.reader import TaggedCorpusReader
from nltk.tag import CRFTagger

To implement Peter's suggestion:

def NE_TAGGER():

def tagger(stop):

     reader = TaggedCorpusReader('/python27/', r'.*\.pos')
     f1=reader.fileids()
     print "The Files of Corpus are:",f1
     sents=reader.tagged_sents()
     ls=len(sents)
     print "Length of Corpus Is:",ls
     train_data=sents[:300]
     test_data=sents[301:350]

Offtopic: note that sents[300] is in neither the training nor the test
data; Python uses half-open intervals.
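
To make the half-open point concrete, here is a minimal sketch. It uses a plain list of integers as a stand-in for the corpus (an assumption for illustration; the real sents holds tagged sentences), so each value is its own index.

```python
# Stand-in corpus: each integer plays the role of the sentence at that index.
sents = list(range(400))

train_data = sents[:300]     # indices 0..299; the stop index 300 is excluded
test_data = sents[301:350]   # indices 301..349; index 300 is skipped entirely

# Reusing one boundary for both slices leaves no gap and no overlap.
stop = 300
fixed_train = sents[:stop]            # indices 0..299
fixed_test = sents[stop:stop + 50]    # indices 300..349
```

Because the stop of one slice is the start of the next, the two fixed slices together cover indices 0..349 exactly once.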

      train_data=sents[:stop]
      test_data=sents[stop:stop+50]

     ct = CRFTagger()
     crf_tagger=ct.train(train_data,'model.crf.tagger')

This code is working fine.
Now if I change the size of train_data to, say, 500 or 3000 by writing
train_data=sents[:500] or
  train_data=sents[:3000], it gives me the following error.

What about sents[:499], sents[:498], ...?

Do a rough binary search for the first stop value that raises.

tagger(400)
tagger(350 or 450, depending)
...

You could automate this with the bisect module, but bisection by eye should be faster.
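
If you do want to automate it, a hand-written bisection is probably simpler than adapting the bisect module. The following is only a sketch: first_failing_stop and the fails predicate are hypothetical names, and it assumes that once some stop value fails, every larger one fails too (the bad sentence stays inside the slice). In the real script, fails(stop) would call tagger(stop) inside try/except TypeError; here a stand-in predicate is used so the sketch runs without nltk.

```python
def first_failing_stop(fails, lo, hi):
    """Return the smallest stop in (lo, hi] for which fails(stop) is true.

    Assumes fails(lo) is false, fails(hi) is true, and that failure is
    monotonic: if one stop value fails, all larger ones fail as well.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(mid):
            hi = mid   # the first failure is at mid or below
        else:
            lo = mid   # the first failure is above mid
    return hi


# Stand-in predicate: pretend the sentence at index 332 is the broken one,
# so any slice sents[:stop] with stop > 332 would raise.
def fails(stop):
    return stop > 332


result = first_failing_stop(fails, 300, 500)
```

With this stand-in, result is 333, which matches the example further down: sents[:333] is the first slice that fails, so sents[332] is the sentence to inspect.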

I'm not an nltk user, but to debug the problem I suggest that you identify
the exact index that triggers the exception, and then print it

print sents[minimal_index_that_causes_typeerror]

Perhaps you can spot a problem with the input data.
(In the spirit of the "offtopic" remark: if sents[:333] triggers the failure
you have to print sents[332])

Or mentally subtract 1 from the minimal failing stop value.
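
Since the TypeError complains about a NoneType, another option is to scan the data for None directly instead of bisecting. This is a sketch under an assumption not confirmed in this thread: that each sentence is a list of (word, tag) pairs and that the failure comes from a pair whose word or tag is None (for example, a token in the .pos file with no tag attached). find_bad_sentences is a hypothetical helper, and the sample data below is invented for illustration.

```python
def find_bad_sentences(sents):
    """Return (index, sentence) pairs where some word or tag is None."""
    bad = []
    for i, sent in enumerate(sents):
        if any(word is None or tag is None for word, tag in sent):
            bad.append((i, sent))
    return bad


# Invented sample: the second sentence contains a token without a tag.
sample = [
    [("Ram", "NNP"), ("goes", "VB")],
    [("Delhi", None), ("is", "VB")],
    [("big", "JJ")],
]

bad = find_bad_sentences(sample)
```

On the real corpus you would call find_bad_sentences(sents) and print the result; if it comes back empty, the assumption above is wrong and the bisection approach is the way to go.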


Traceback (most recent call last):
   File "<pyshell#2>", line 1, in <module>
     NE_TAGGER()
   File "C:\Python27\HindiCRFNERTagger1.py", line 20, in NE_TAGGER
     crf_tagger=ct.train(train_data,'model.crf.tagger')
   File "C:\Python27\lib\site-packages\nltk\tag\crf.py", line 185, in train
     trainer.append(features,labels)
   File "pycrfsuite\_pycrfsuite.pyx", line 312, in pycrfsuite._pycrfsuite.BaseTrainer.append (pycrfsuite/_pycrfsuite.cpp:3800)
   File "stringsource", line 53, in vector.from_py.__pyx_convert_vector_from_py_std_3a__3a_string (pycrfsuite/_pycrfsuite.cpp:10738)
   File "stringsource", line 15, in string.from_py.__pyx_convert_string_from_py_std__in_string (pycrfsuite/_pycrfsuite.cpp:10633)
TypeError: expected string or Unicode object, NoneType found


I searched the web for solutions and found the following links:
https://stackoverflow.com/questions/14219038/python-multiprocessing-typeerror-expected-string-or-unicode-object-nonetype-f
or
https://github.com/kamakazikamikaze/easysnmp/issues/50

I also reloaded Python, but none of this helped much.

I am using Python 2.7.15 (v2.7.15:ca079a3ea3, Apr 30 2018, 16:22:17) [MSC
v.1500 32 bit (Intel)] on win32

My OS is MS Windows 7.

I would be grateful if anybody could suggest a resolution.




--
Terry Jan Reedy
