That's great! Going through your changes to see how much I can move upstream.
On Friday, April 28, 2017 at 5:26:03 PM UTC-4, MacRobb Simpson wrote:
>
> I'm currently in the process of implementing Mayan to replace our current
> Document Management System (FileBound).
> Our setup currently consists of 14 document types, 64 metadata types, 16
> indexes and over 66,000 files currently loaded.
>
> Reindexing this system is... somewhat slow to say the least.
> I let it crunch away for a good 16 hours, and got about halfway through.
>
> Obviously, this isn't good enough - indexing might be slow, but it
> shouldn't be /this/ slow.
>
> With a few mods, I've sped this up by at least 8x (figure around 4 hours
> for a full rebuild... Acceptable).
> What I did was:
> 1. Instead of indexing by document, then index, I'm indexing by index,
> then document. This allows for a single index to be rebuilt at a time, vs
> multiple being 'filled in' at once.
> 2. Modified the delete section to only delete the current index as it's
> being worked on. This allows you to keep using the other indexes during the
> rebuild process.
> 3. Removed the 'with transaction.atomic():' line in the indexer. I'm sure
> this makes it 'less safe' if something were to fail, but I figure that if
> something fails a reindex is needed anyway.
> (By splitting the index rebuild from the single-file-indexer, I can leave
> that atomic transaction line for a single file, where it makes sense). This
> change easily doubled the speed, if not quadrupled it.
>
> My final code:
> mayan/apps/document_indexing/managers.py:
>
>> def rebuild_all_indexes(self):
>>     from .models import Index
>>
>>     for index in Index.objects.filter(enabled=True):
>>         print 'indexing', index
>>         # Delete nodes applicable to index
>>         print 'deleting nodes'
>>         for instance_node in self.filter(id=index.id):
>>             instance_node.delete()
>>         # Delete empty nodes
>>         self.delete_empty_index_nodes()
>>         print 'adding index node'
>>         # Add index node
>>         root_instance, created = self.get_or_create(
>>             index_template_node=index.template_root, parent=None
>>         )
>>         print 'indexing documents...'
>>         docsIndexed = 0
>>         # Reindex each document
>>         for document in Document.objects.filter(document_type=index.document_types.all()):
>>             # Add index nodes?
>>             for template_node in index.template_root.get_children():
>>                 self.cascade_eval(document, template_node, root_instance)
>>             docsIndexed += 1
>>             if docsIndexed % 10 == 0:
>>                 print 'indexing document', document, docsIndexed, 'completed'
>>
> All of the 'print' lines could be removed, but are very handy when
> watching it run from run-server/devel mode.
>
>
> Anyone got any other improvement ideas or potential pitfalls that this
> could cause?
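For anyone following the thread, the index-first loop order and the per-index delete described in the quoted message can be sketched without Django. This is only an illustration under assumptions, not Mayan's implementation: the dictionaries, the `store` mapping, and the `key` callables are hypothetical stand-ins for the index, document, and index-node models.

```python
from collections import defaultdict


def rebuild_all_indexes(indexes, documents, store):
    """Rebuild one index at a time (index first, then document).

    All three arguments are hypothetical stand-ins for the real models:
      indexes:   list of dicts with "name", "enabled", "document_types", "key"
      documents: list of dicts with "label", "document_type", ...
      store:     dict mapping index name -> {node label: [document labels]}
    """
    for index in indexes:
        if not index["enabled"]:
            continue

        # Drop only the nodes of the index being rebuilt, so every other
        # index in `store` stays usable while this one is refilled.
        nodes = defaultdict(list)
        store[index["name"]] = nodes

        for document in documents:
            # Membership test against the index's document types. In Django
            # this would likely be a `document_type__in=...` lookup rather
            # than an exact match (an assumption about the real query).
            if document["document_type"] not in index["document_types"]:
                continue
            nodes[index["key"](document)].append(document["label"])

    return store


if __name__ == "__main__":
    indexes = [
        {"name": "by_type", "enabled": True,
         "document_types": {"invoice", "memo"},
         "key": lambda d: d["document_type"]},
        {"name": "by_year", "enabled": True,
         "document_types": {"invoice"},
         "key": lambda d: d["year"]},
    ]
    documents = [
        {"label": "a.pdf", "document_type": "invoice", "year": 2016},
        {"label": "b.pdf", "document_type": "memo", "year": 2017},
    ]
    for name, nodes in rebuild_all_indexes(indexes, documents, {}).items():
        print(name, dict(nodes))
```

Because each pass touches a single index, a failure partway through leaves the already rebuilt indexes intact and only the current one needs to be redone.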
