On Thu, May 5, 2011 at 2:26 PM, Robert Pazur <pazurrob...@gmail.com> wrote:
> Dear all,
> I would like to access some text and count word occurrences, as follows.
> I have a lot of PDFs of scientific articles, and I want to preview
> which words are usually related to, for example, "determinants".
> As an example, an article contains the sentence
> "...elevation is the most important determinant..."
> How can I acquire the "elevation" string?
> Of course, I don't know where the sentence is located in the article,
> or which particular word might be there.
> Any suggestions?
Extract the text using PDFMiner[1], pyPdf[2], or PageCatcher[3]. Then use
something similar to n-grams on the extracted text, filtering out those
n-grams that don't contain "determinant(s)". Then just keep a word
frequency table for the words in the remaining n-grams.

Not-quite-pseudo-code:

from collections import defaultdict, deque

N = 7  # length of n-grams to consider; tune as needed
buf = deque(maxlen=N)  # sliding window of the last N words
targets = frozenset(("determinant", "determinants"))
steps_until_gone = 0   # how many more words remain "near" a target
word2freq = defaultdict(int)

for word in words_from_pdf:  # words_from_pdf comes from the PDF extraction
    if word in targets:
        steps_until_gone = N
    buf.append(word)
    if steps_until_gone:
        for related_word in buf:
            if related_word not in targets:
                word2freq[related_word] += 1
        steps_until_gone -= 1

for count, word in sorted((v, k) for k, v in word2freq.items()):
    print(word, ':', count)

Making this more efficient and less naive is left as an exercise for the
reader. There may very well already be something similar but more
sophisticated in NLTK[4]; I've never used it, so I dunno.

[1]: http://www.unixuser.org/~euske/python/pdfminer/index.html
[2]: http://pybrary.net/pyPdf/
[3]: http://www.reportlab.com/software/#pagecatcher
[4]: http://www.nltk.org/

Cheers,
Chris
--
http://rebertia.com
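P.S. To make the sketch above concrete, here is a self-contained version of the same sliding-window counter run on a toy word list. The `words` list is just hardcoded sample text here; in practice it would come from whichever PDF extractor you pick, and `cooccurrence_counts` is a name I made up for this example:

```python
import re
from collections import defaultdict, deque

def cooccurrence_counts(words, targets, window=7):
    """Count how often each word appears within `window` words
    of any target word (naively, as in the sketch above: a word
    can be counted once per step while the window is active)."""
    buf = deque(maxlen=window)      # sliding window of recent words
    steps_until_gone = 0            # remaining steps "near" a target
    word2freq = defaultdict(int)
    for word in words:
        if word in targets:
            steps_until_gone = window
        buf.append(word)
        if steps_until_gone:
            for related in buf:
                if related not in targets:
                    word2freq[related] += 1
            steps_until_gone -= 1
    return word2freq

# Toy input standing in for text extracted from a PDF:
text = "elevation is the most important determinant of climate"
words = re.findall(r"[a-z]+", text.lower())
freqs = cooccurrence_counts(words, frozenset({"determinant", "determinants"}))
for count, word in sorted((v, k) for k, v in freqs.items()):
    print(word, ':', count)
```

Note that "elevation" shows up in the counts even though it occurs five words before "determinant", since the window covers words on both sides of the target; that's the point of buffering with a deque rather than only looking forward.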