Ayushi Dalmia wrote:

> On Wednesday, February 5, 2014 12:51:31 AM UTC+5:30, Dave Angel wrote:
>> Ayushi Dalmia <ayushidalmia2...@gmail.com> Wrote in message:
>>
>> > Where am I going wrong? What are the alternatives I can try?
>>
>> You've rejected all the alternatives so far without showing your
>> code, or even properly specifying your problem.
>>
>> To get the "total" size of a list of strings, try (untested):
>>
>>     a = sys.getsizeof(mylist)
>>     for item in mylist:
>>         a += sys.getsizeof(item)
>>
>> This can be high if some of the strings are interned and get
>> counted twice. But you're not likely to get closer without some
>> knowledge of the data objects and where they come from.
>>
>> --
>> DaveA
>
> Hello Dave,
>
> I just thought that saving others' time was better, and hence I explained
> only a subset of my problem. Here is what I am trying to do:
>
> I am trying to index the current Wikipedia dump without using databases
> and create a search engine for Wikipedia documents. Note, I CANNOT USE
> DATABASES. My approach:
>
> I am parsing the Wikipedia pages using a SAX parser and dumping the words
> along with their posting lists (a list of doc ids in which the word is
> present) into different files after reading 'X' number of pages. These
> files may contain the same word, so I need to merge them and write the
> final index again. The final index files must be of limited size, and
> this is where I am stuck: I need to know how to determine the size of
> the content in a variable before I write it into the file.
>
> Here is the code for my merging:
>
> def mergeFiles(pathOfFolder, countFile):
>     listOfWords={}
>     indexFile={}
>     topOfFile={}
>     flag=[0]*countFile
>     data=defaultdict(list)
>     heap=[]
>     countFinalFile=0
>     for i in xrange(countFile):
>         fileName = pathOfFolder+'\index'+str(i)+'.txt.bz2'
>         indexFile[i]= bz2.BZ2File(fileName, 'rb')
>         flag[i]=1
>         topOfFile[i]=indexFile[i].readline().strip()
>         listOfWords[i] = topOfFile[i].split(' ')
>         if listOfWords[i][0] not in heap:
>             heapq.heappush(heap, listOfWords[i][0])

At this point you have already done it wrong: your heap ends up holding all
the words, and every `not in heap` check is an O(N) scan of the heap. This
is both slow and consumes a lot of memory. See
http://code.activestate.com/recipes/491285-iterator-merge/ for a sane way
to merge sorted data from multiple files. Your code becomes (untested):

    with open("outfile.txt", "wb") as outfile:
        infiles = []
        for i in xrange(countFile):
            filename = os.path.join(pathOfFolder, 'index'+str(i)+'.txt.bz2')
            infiles.append(bz2.BZ2File(filename, "rb"))
        outfile.writelines(imerge(*infiles))
        for infile in infiles:
            infile.close()

Once you have your data in a single file you can read from that file and do
the postprocessing you mention below.
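By the way, if you are on Python 2.6 or later you may not even need the
recipe: heapq.merge() in the standard library does the same lazy merging of
sorted inputs, so (again untested) the merge line could simply be

    import heapq
    outfile.writelines(heapq.merge(*infiles))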
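As for the size limit: rather than waiting for a MemoryError you can keep a
rough running total with sys.getsizeof(), as Dave suggested, and flush
whenever it passes a threshold you choose. A rough, untested sketch of that
postprocessing step; the MAX_BYTES value and the postprocess() name are
placeholders, and writeFinalIndex is your own function passed in unchanged:

    import sys
    from collections import defaultdict

    MAX_BYTES = 50 * 1024 * 1024  # whatever per-file limit you need

    def postprocess(mergedlines, pathOfFolder, writeFinalIndex):
        data = defaultdict(list)
        size = 0
        countFinalFile = 0
        for line in mergedlines:
            parts = line.strip().split(' ')
            word, docids = parts[0], parts[1:]
            data[word].extend(docids)
            # rough estimate only: this counts the strings themselves,
            # not the dict/list bookkeeping around them
            size += sys.getsizeof(word) + sum(sys.getsizeof(d) for d in docids)
            if size > MAX_BYTES:
                writeFinalIndex(data, countFinalFile, pathOfFolder)
                countFinalFile += 1
                data = defaultdict(list)
                size = 0
        if data:
            writeFinalIndex(data, countFinalFile, pathOfFolder)

mergedlines can be the merged file you just wrote, opened for reading, or
the imerge(*infiles) iterator directly if you decide to skip the
intermediate file.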
>     while any(flag)==1:
>         temp = heapq.heappop(heap)
>         for i in xrange(countFile):
>             if flag[i]==1:
>                 if listOfWords[i][0]==temp:
>                     # This is where I am stuck. I cannot wait until memory
>                     # error, as I need to do some postprocessing too.
>                     try:
>                         data[temp].extend(listOfWords[i][1:])
>                     except MemoryError:
>                         writeFinalIndex(data, countFinalFile, pathOfFolder)
>                         data=defaultdict(list)
>                         countFinalFile+=1
>                     topOfFile[i]=indexFile[i].readline().strip()
>                     if topOfFile[i]=='':
>                         flag[i]=0
>                         indexFile[i].close()
>                         os.remove(pathOfFolder+'\index'+str(i)+'.txt.bz2')
>                     else:
>                         listOfWords[i] = topOfFile[i].split(' ')
>                         if listOfWords[i][0] not in heap:
>                             heapq.heappush(heap, listOfWords[i][0])
>     writeFinalIndex(data, countFinalFile, pathOfFolder)
>
> countFile is the number of files and the writeFinalIndex method writes
> into the file.

-- 
https://mail.python.org/mailman/listinfo/python-list