[Mayan EDMS: 1744] Re: Indexing speed improvements

Roberto Rosario Sun, 28 May 2017 09:29:41 -0700

I'm rewriting most of the indexing code and managed to include reindexing 
of individual indexes instead of all of them at once. The commit is 
here: 
https://gitlab.com/mayan-edms/mayan-edms/commit/ac6f748113932d91f23f15dffd9a2ba95b2a1b66
The rewrite needs fewer locks (just 2 now), so it is already much 
faster. It also opens up the possibility of indexing by workflow 
states and tags. The code is in a separate branch off the master branch 
(2.2) so that it can go into the next stable release (2.2.1 or 2.3) instead 
of waiting for the next major version (3.0). If you have a development 
install of Mayan, please help test this branch to speed up its inclusion.
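
To illustrate the locking idea, here is a minimal sketch. It assumes a plain 
in-memory dict in place of Mayan's index models; every name in it 
(index_nodes, index_locks, rebuild_index, prune_empty_nodes) is hypothetical 
and not taken from the branch. Each index gets its own lock, and the cleanup 
of empty nodes takes that same lock, so it cannot run in the middle of a 
rebuild of that index.

```python
import threading

# Hypothetical in-memory model: index name -> {node key -> set of document ids}.
# None of these names come from Mayan; they only illustrate the locking idea.
index_nodes = {"by_year": {"2016": {1}, "2017": set()}}
index_locks = {name: threading.Lock() for name in index_nodes}


def rebuild_index(name, documents, key_func):
    """Rebuild a single index while holding only that index's lock."""
    with index_locks[name]:
        fresh = {}
        for doc_id, doc in documents.items():
            fresh.setdefault(key_func(doc), set()).add(doc_id)
        index_nodes[name] = fresh


def prune_empty_nodes(name):
    """Drop empty nodes; takes the same lock, so it waits for a rebuild."""
    with index_locks[name]:
        nodes = index_nodes[name]
        for key in [k for k, ids in nodes.items() if not ids]:
            del nodes[key]


documents = {1: {"year": "2016"}, 2: {"year": "2017"}}
rebuild_index("by_year", documents, lambda doc: doc["year"])
prune_empty_nodes("by_year")
```

Because the lock is per index, rebuilding one index in this sketch does not 
block readers or cleanup of any other index.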


On Saturday, May 27, 2017 at 2:07:31 PM UTC-4, Roberto Rosario wrote:
>
> Doing some tests I've hit several regressions and a few race conditions 
> (without the 'document_indexing_task_do_rebuild_all_indexes' lock, deleting 
> a document would delete its index instance if it is empty even while an 
> index is being rebuilt).
> The entire indexing locking workflow will need to be remade too. This 
> refactor is bigger than initially expected.  
>
> On Saturday, May 27, 2017 at 11:01:56 AM UTC-4, Roberto Rosario wrote:
>>
>> That's great! Going through your changes to see how much I can move 
>> upstream.
>>
>> On Friday, April 28, 2017 at 5:26:03 PM UTC-4, MacRobb Simpson wrote:
>>>
>>> I'm currently in the process of implementing Mayan to replace our 
>>> current Document Management System (FileBound).
>>> Our setup currently consists of 14 document types, 64 metadata types, 16 
>>> indexes and over 66,000 files currently loaded.
>>>
>>> Reindexing this system is... somewhat slow to say the least.
>>> I let it crunch away for a good 16 hours, and got about halfway through.
>>>
>>> Obviously, this isn't good enough - Indexing might be slow, but it 
>>> shouldn't be /this/ slow.
>>>
>>> With a few mods, I've sped this up by at least 8x (figure around 4 hours 
>>> for a full rebuild... Acceptable).
>>> What I did was:
>>> 1. Instead of indexing by document, then index, I'm indexing by index, 
>>> then document. This allows for a single index to be rebuilt at a time, vs 
>>> multiple being 'filled in' at once.
>>> 2. Modify the delete section to only delete the current index as it's 
>>> being worked on. This allows you to keep using the other indexes during the 
>>> rebuild process.
>>> 3. Removed the 'with transaction.atomic():' line in the indexer. I'm 
>>> sure this makes it 'less safe' if something were to fail, but I figure that 
>>> if something fails a reindex is needed anyway.
>>> (By splitting the index rebuild from the single-file-indexer, I can 
>>> leave that atomic transaction line for a single file, where it makes 
>>> sense). This change easily doubled the speed, if not quadrupled it.
>>>
>>> My final code:
>>> mayan/apps/document_indexing/managers.py:
>>>
>>>>     def rebuild_all_indexes(self):
>>>>         from .models import Index
>>>>         
>>>>         for index in Index.objects.filter(enabled=True):
>>>>             print 'indexing',index
>>>>             #Delete nodes applicable to index
>>>>             print 'deleting nodes'
>>>>             for instance_node in self.filter(id=index.id):
>>>>                 instance_node.delete()
>>>>             #Delete empty nodes
>>>>             self.delete_empty_index_nodes()   
>>>>             print 'adding index node'
>>>>             #Add index node
>>>>             root_instance, created = self.get_or_create(
>>>>                 index_template_node=index.template_root, parent=None
>>>>             )
>>>>             print 'indexing documents...'
>>>>             docsIndexed = 0
>>>>             #Reindex each document
>>>>             #document_types yields several types, so match with __in
>>>>             for document in Document.objects.filter(document_type__in=index.document_types.all()):
>>>>                 
>>>>                 #Add index nodes?
>>>>                 for template_node in index.template_root.get_children():
>>>>                     self.cascade_eval(document, template_node, root_instance)
>>>>                 docsIndexed += 1
>>>>                 if docsIndexed % 10 == 0:
>>>>                     print 'indexing document',document,docsIndexed,'completed'
>>>>
>>> All of the 'print' lines could be removed, but are very handy when 
>>> watching it run from run-server/devel mode.
>>>
>>>
>>> Anyone got any other improvement ideas or potential pitfalls that this 
>>> could cause?
>>>
>>
