[Mayan EDMS: 1672] Indexing speed improvements

MacRobb Simpson Fri, 28 Apr 2017 14:26:45 -0700

I'm currently in the process of implementing Mayan to replace our current 
Document Management System (FileBound).
Our setup consists of 14 document types, 64 metadata types, 16 
indexes and over 66,000 files currently loaded.


Reindexing this system is... somewhat slow to say the least.
I let it crunch away for a good 16 hours, and got about halfway through.

Obviously, this isn't good enough - Indexing might be slow, but it 
shouldn't be /this/ slow.

With a few mods, I've sped this up by at least 8x (figure around 4 hours for 
a full rebuild, which is acceptable).
What I did was:
1. Instead of indexing by document, then index, I'm indexing by index, then 
document. This allows for a single index to be rebuilt at a time, vs 
multiple being 'filled in' at once.
2. Modified the delete section to delete only the index currently being 
worked on. This allows you to keep using the other indexes during the 
rebuild process.
3. Removed the 'with transaction.atomic():' line in the indexer. I'm sure 
this makes it 'less safe' if something were to fail, but I figure that if 
something fails a reindex is needed anyway.
(By splitting the index rebuild from the single-file indexer, I can leave 
that atomic transaction line for a single file, where it makes sense.) This 
change easily doubled the speed, if not quadrupled it.
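
As a rough illustration of points 1 and 2, here is a minimal standalone sketch in plain Python (no Django). The data and the names DOCUMENTS, INDEXES, rebuild_index and rebuild_all are hypothetical stand-ins, not Mayan's models or API. It only shows the loop order: one index is cleared and refilled at a time while the others keep their existing entries.

```python
# Hypothetical stand-in data, not Mayan's models:
# (document name, document type) pairs, and index name -> covered types.
DOCUMENTS = [
    ("invoice-001", "invoice"),
    ("invoice-002", "invoice"),
    ("contract-001", "contract"),
]
INDEXES = {
    "by-invoice": {"invoice"},
    "everything": {"invoice", "contract"},
}


def rebuild_index(store, index_name, doc_types, documents):
    """Clear and refill one index, leaving every other index untouched."""
    store[index_name] = []  # delete only this index's entries
    for name, doc_type in documents:
        if doc_type in doc_types:
            store[index_name].append(name)


def rebuild_all(store, indexes, documents):
    """Outer loop over indexes, inner loop over documents."""
    for index_name in sorted(indexes):
        rebuild_index(store, index_name, indexes[index_name], documents)
    return store


# Both indexes start out with stale content.
store = {"by-invoice": ["stale"], "everything": ["stale"]}

# Rebuild a single index; "everything" still holds its old entries.
rebuild_index(store, "by-invoice", INDEXES["by-invoice"], DOCUMENTS)
partial = {name: list(entries) for name, entries in store.items()}

# Then rebuild the rest, one index at a time.
rebuild_all(store, INDEXES, DOCUMENTS)
```

In this toy version the snapshot taken after the first call still contains the old "everything" entries, which is the property point 2 relies on.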

My final code:
mayan/apps/document_indexing/managers.py:

>     def rebuild_all_indexes(self):
>         from .models import Index
>         
>         for index in Index.objects.filter(enabled=True):
>             print 'indexing',index
>             #Delete nodes applicable to index
>             print 'deleting nodes'
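>             # Match nodes by their template's index, not by primary key
>             # (assumes IndexTemplateNode has an 'index' foreign key)
>             for instance_node in self.filter(
>                 index_template_node__index=index
>             ):
>                 instance_node.delete()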
>             #Delete empty nodes
>             self.delete_empty_index_nodes()   
>             print 'adding index node'
>             #Add index node
>             root_instance, created = self.get_or_create(
>                 index_template_node=index.template_root, parent=None
>             )
>             print 'indexing documents...'
>             docsIndexed = 0
>             #Reindex each document
>             # __in: an index can cover several document types
>             for document in Document.objects.filter(
>                 document_type__in=index.document_types.all()
>             ):
>                 
>                 #Add index nodes?
>                 for template_node in index.template_root.get_children():
>                     self.cascade_eval(
>                         document, template_node, root_instance
>                     )
>                 docsIndexed += 1
>                 if docsIndexed % 10 == 0:
>                     print 'indexing document',document,docsIndexed,'completed'
>
All of the 'print' lines could be removed, but they are very handy when 
watching it run under runserver in development mode.
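
If the progress output is worth keeping, one option is to route it through the standard logging module instead of print. The sketch below is only an illustration: the helper iter_with_progress and the logger name "reindex_progress" are made up for this example and are not part of Mayan.

```python
import logging

logger = logging.getLogger("reindex_progress")  # hypothetical logger name


def iter_with_progress(items, every=10, log=logger):
    """Yield items unchanged, logging a running count every `every` items.

    The count is incremented after the caller has finished with the item,
    mirroring the 'docsIndexed += 1' placement in the loop above.
    """
    count = 0
    for item in items:
        yield item
        count += 1
        if count % every == 0:
            log.info("indexed %d documents (last: %s)", count, item)
```

The document loop would then wrap its iterable, for example "for document in iter_with_progress(documents):", and the messages can be silenced or redirected through the logging configuration rather than by editing the code.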


Anyone got any other improvement ideas or potential pitfalls that this 
could cause?
