I'm currently in the process of implementing Mayan to replace our current Document Management System (FileBound). Our setup consists of 14 document types, 64 metadata types, 16 indexes, and over 66,000 files loaded so far.
Reindexing this system is... somewhat slow, to say the least. I let it crunch away for a good 16 hours, and it got about halfway through. Obviously, this isn't good enough: indexing might be slow, but it shouldn't be /this/ slow. With a few mods, I've sped this up by at least 8x (figure around 4 hours for a full rebuild... acceptable).

What I did was:

1. Instead of indexing by document, then index, I'm indexing by index, then document. This allows a single index to be rebuilt at a time, vs. multiple being 'filled in' at once.
2. Modified the delete section to only delete the current index as it's being worked on. This allows you to keep using the other indexes during the rebuild process.
3. Removed the 'with transaction.atomic():' line in the indexer. I'm sure this makes it 'less safe' if something were to fail, but I figure that if something fails, a reindex is needed anyway. (By splitting the index rebuild from the single-file indexer, I can leave that atomic transaction line for a single file, where it makes sense.) This change easily doubled the speed, if not quadrupled it.

My final code:

mayan/apps/document_indexing/managers.py:

> def rebuild_all_indexes(self):
>     from .models import Index
>
>     for index in Index.objects.filter(enabled=True):
>         print 'indexing', index
>         # Delete only the nodes that belong to this index.
>         # Selecting through the template node assumes IndexTemplateNode
>         # has an 'index' foreign key; filtering on id=index.id would
>         # compare a node's own pk against the index's pk.
>         print 'deleting nodes'
>         for instance_node in self.filter(index_template_node__index=index):
>             instance_node.delete()
>         # Delete empty nodes
>         self.delete_empty_index_nodes()
>         print 'adding index node'
>         # Add the root node for this index
>         root_instance, created = self.get_or_create(
>             index_template_node=index.template_root, parent=None
>         )
>         print 'indexing documents...'
>         docsIndexed = 0
>         # Reindex each document whose type is covered by this index
>         # (__in is needed because document_types.all() is a queryset)
>         for document in Document.objects.filter(
>                 document_type__in=index.document_types.all()):
>             # Add index nodes for this document
>             for template_node in index.template_root.get_children():
>                 self.cascade_eval(document, template_node, root_instance)
>             docsIndexed += 1
>             if docsIndexed % 10 == 0:
>                 print 'indexing document', document, docsIndexed, 'completed'

All of the 'print' lines could be removed, but they are very handy when watching it run under runserver/devel mode.

Anyone got any other improvement ideas, or see potential pitfalls that this could cause?
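For anyone who wants to poke at the idea behind steps 1 and 2 without a Mayan install, here is a minimal stand-alone sketch in plain Python. It is only a toy model: `rebuild_index`, `store`, and the dict layout for indexes and documents are made up for illustration and are not Mayan's API. It shows the outer loop running over indexes, and each rebuild clearing only the entries of the index being worked on.

```python
def rebuild_index(index, documents, store):
    """Rebuild one index; entries of every other index in `store` are untouched."""
    # Clear only this index's old entries (hypothetical stand-in for the
    # per-index node delete).
    entries = {}
    store[index["name"]] = entries
    indexed = 0
    # Inner loop: only documents whose type this index covers.
    for doc in documents:
        if doc["type"] not in index["document_types"]:
            continue
        key = index["key"](doc)
        entries.setdefault(key, []).append(doc["id"])
        indexed += 1
    return indexed


def rebuild_all_indexes(indexes, documents, store):
    """Outer loop over enabled indexes, one full rebuild at a time."""
    counts = {}
    for index in indexes:
        if not index["enabled"]:
            continue
        counts[index["name"]] = rebuild_index(index, documents, store)
    return counts
```

Because each index is cleared and refilled on its own, a disabled index, or one that has not been reached yet, keeps whatever entries it already had while the others are rebuilt.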
