[Mayan EDMS: 1742] Re: Indexing speed improvements

Roberto Rosario Sat, 27 May 2017 08:02:34 -0700

That's great! Going through your changes to see how much I can move 
upstream.


On Friday, April 28, 2017 at 5:26:03 PM UTC-4, MacRobb Simpson wrote:
>
> I'm currently in the process of implementing Mayan to replace our current 
> Document Management System (FileBound).
> Our setup currently consists of 14 document types, 64 metadata types, 16 
> indexes and over 66,000 files currently loaded.
>
> Reindexing this system is... somewhat slow to say the least.
> I let it crunch away for a good 16 hours, and got about halfway through.
>
> Obviously, this isn't good enough - indexing might be slow, but it 
> shouldn't be /this/ slow.
>
> With a few mods, I've sped this up by at least 8x (figure around 4 hours 
> for a full rebuild... Acceptable).
> What I did was:
> 1. Instead of indexing by document, then index, I'm indexing by index, 
> then document. This allows for a single index to be rebuilt at a time, vs 
> multiple being 'filled in' at once.
> 2. Modify the delete section to only delete the current index as it's 
> being worked on. This allows you to keep using the other indexes during the 
> rebuild process.
> 3. Removed the 'with transaction.atomic():' line in the indexer. I'm sure 
> this makes it 'less safe' if something were to fail, but I figure that if 
> something fails a reindex is needed anyway.
> (By splitting the index rebuild from the single-file-indexer, I can leave 
> that atomic transaction line for a single file, where it makes sense). This 
> change easily doubled the speed, if not quadrupled it.
>
> My final code:
> mayan/apps/document_indexing/managers.py:
>
>>     def rebuild_all_indexes(self):
>>         from .models import Index
>>
>>         for index in Index.objects.filter(enabled=True):
>>             print 'indexing', index
>>             # Delete the instance nodes that belong to this index.
>>             # NOTE: this filters instance nodes by primary key; check that
>>             # it actually selects the nodes of the current index.
>>             print 'deleting nodes'
>>             for instance_node in self.filter(id=index.id):
>>                 instance_node.delete()
>>             # Delete empty nodes
>>             self.delete_empty_index_nodes()
>>             print 'adding index node'
>>             # Add the root instance node for this index
>>             root_instance, created = self.get_or_create(
>>                 index_template_node=index.template_root, parent=None
>>             )
>>             print 'indexing documents...'
>>             docs_indexed = 0
>>             # Reindex each document whose type is linked to this index
>>             # (use __in, since document_types.all() can return many rows)
>>             documents = Document.objects.filter(
>>                 document_type__in=index.document_types.all()
>>             )
>>             for document in documents:
>>                 # Evaluate every top-level template node for the document
>>                 for template_node in index.template_root.get_children():
>>                     self.cascade_eval(
>>                         document, template_node, root_instance
>>                     )
>>                 docs_indexed += 1
>>                 if docs_indexed % 10 == 0:
>>                     print 'indexing document', document, docs_indexed, 'completed'
>>
> All of the 'print' lines could be removed, but are very handy when 
> watching it run from run-server/devel mode.
>
>
> Anyone got any other improvement ideas or potential pitfalls that this 
> could cause?
>
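
To illustrate points 1 and 2 of the quoted message (index first, then document), here is a toy sketch. It is not Mayan code: the function names, the document list and the index names are all hypothetical, and the "work" is just appending to a log. It only shows how the loop order changes the moment at which each index is fully rebuilt.

```python
def rebuild_document_first(documents, indexes, log):
    """Original order: each document is pushed into every index in turn."""
    for document in documents:
        for index in indexes:
            log.append((index, document))


def rebuild_index_first(documents, indexes, log):
    """Proposed order: one index is rebuilt completely before the next."""
    for index in indexes:
        for document in documents:
            log.append((index, document))


def completion_points(log, indexes):
    """Return, per index, the 1-based step at which its last entry was written."""
    last_step = {}
    for step, (index, _document) in enumerate(log, start=1):
        last_step[index] = step
    return [last_step[index] for index in indexes]


# Hypothetical data, for illustration only.
documents = ['doc-a', 'doc-b', 'doc-c']
indexes = ['by-year', 'by-type']

doc_first_log = []
rebuild_document_first(documents, indexes, doc_first_log)

index_first_log = []
rebuild_index_first(documents, indexes, index_first_log)

doc_first_done = completion_points(doc_first_log, indexes)
index_first_done = completion_points(index_first_log, indexes)

print(doc_first_done)    # [5, 6]: both indexes stay incomplete until the very end
print(index_first_done)  # [3, 6]: the first index is complete halfway through
```

Both orders do the same amount of work in this toy model, so it says nothing about the reported speedup. It only shows why the index-first order lets one index be deleted and rebuilt while the others remain usable.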
