On Jul 4, 6:21 pm, subhabangal...@gmail.com wrote:
> [...]
> To detect the document boundaries, I am splitting them into a bag
> of words and using a simple for loop as,
>
> for i in range(len(bag_words)):
>     if bag_words[i]=="$":
>         print (bag_words[i],i)

Ignoring that you are attacking the problem incorrectly: that is a very poor method of splitting a string, especially since the Python gods have given you *power* over string objects.

But you are going to have an even greater problem if the string contains a "$" char that you DID NOT insert :-O. You'd be wise to use a separator that is not likely to appear in the file data, for example "<SEP>" or "<SPLIT-HERE>".

But even that approach is naive! Why not streamline the entire process and pass a list of file paths to a custom parser object instead?
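
Here is a rough sketch of both ideas, for what it's worth. The names SEP, split_documents and DocumentParser are made up for illustration, and I am assuming one document per file and whitespace-separated words; adjust to your real data:

```python
SEP = "<SEP>"  # hypothetical separator, assumed never to occur in the data


def split_documents(text, sep=SEP):
    """Split one joined string into documents, dropping empty pieces."""
    return [chunk.strip() for chunk in text.split(sep) if chunk.strip()]


class DocumentParser(object):
    """Hypothetical parser: one document per file, so no separator is needed."""

    def __init__(self, paths):
        self.paths = list(paths)

    def __iter__(self):
        # Yield (path, bag_of_words) for each file, in the order given.
        for path in self.paths:
            with open(path) as f:
                yield path, f.read().split()


# A "$" inside the data no longer confuses the document boundaries.
joined = "first doc costs $5 <SEP> second doc"
print(split_documents(joined))
# ['first doc costs $5', 'second doc']
```

With the parser, something like "for path, words in DocumentParser(list_of_paths): ..." gives you each document's bag of words directly, and the file boundary is the document boundary.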