Yet another storage cleanup and a bit of history

abdullah alamoudi Mon, 08 May 2017 23:00:21 -0700

Devs,
For some time, out storage code has been suffering from incremental design 
changes through additional use cases that come along research projects.
We have done some cleanup but the code base still suffered from lots of 
duplicate code and unneeded work (for both developers and CPUs).


One thing we used to do is whenever we need to access an index, we have to 
create its "IDataflowHelperFactory". This object will contain most things that 
defines the index. Interestingly, it didn't contain all that is needed.
For some obscure reason, typeTraits, comparatorFactories, and 
bloomFilterKeyFields were places in TreeIndexCreateOperatorDescriptor.
Different indexes had different dataflow helpers and so sometimes, multiple 
class definitions are needed for operators per index type. and this leads to 
further bloat of the system's code.

If one thinks about it, the index related objects that are needed during index 
construction are only needed when the index gets created. Further needs to 
access the index for any reason should only gets the index key (the path in 
this case).
In fact, if one follows the execution of the code, they will see that that is 
exactly what is needed and that whenever we try to 
search/insert/delete/upsert/bulkload, we recompute many artifacts that end up 
being useless.

So, I proposed a change that fixes this part. Index related objects are only 
provided at index creation time and for index access, only the index path is 
required. This is done by removing the create method from the 
IIndexDataflowHelper interface and moving it to IIndexBuilder.
With this, all implementations of IIndexDataflowHelper are now in a single 
class that basically uses the path to fetch the index on a Node Controller.

This change gets rid of more than 3000 lines of code and makes things much 
cleaner. Classes that should go to hyracks are moved to hyracks. Asterix 
related information such as dataset id and partition number are kept in asterix 
through the introduction of DatasetLocalResource. To show the effect of this 
change, you can look at an example in 
https://asterix-gerrit.ics.uci.edu/#/c/1728/3/asterixdb/asterix-algebra/src/main/java/org/apache/asterix/algebra/operators/physical/InvertedIndexPOperator.java
 
<https://asterix-gerrit.ics.uci.edu/#/c/1728/3/asterixdb/asterix-algebra/src/main/java/org/apache/asterix/algebra/operators/physical/InvertedIndexPOperator.java>

Look at the amount of work that was unneeded to access an inverted index. The 
only thing that was actually needed is the index FileSplitProvider. Take 
another look at 
https://asterix-gerrit.ics.uci.edu/#/c/1728/3/asterixdb/asterix-metadata/src/main/java/org/apache/asterix/metadata/declared/MetadataProvider.java
 
<https://asterix-gerrit.ics.uci.edu/#/c/1728/3/asterixdb/asterix-metadata/src/main/java/org/apache/asterix/metadata/declared/MetadataProvider.java>
 and see how much unneeded code gets removed. This is not enough on the 
asterixdb side but it creates a good foundation that would allow existing and 
new code to get cleaner.

Please, take a look at the change https://asterix-gerrit.ics.uci.edu/#/c/1728/ 
<https://asterix-gerrit.ics.uci.edu/#/c/1728/> if interested and let me know if 
you have any comments. Note that it fails storage check test and that is 
expected because it changes the persisted resource info classes.

Cheers,
Abdullah.

Yet another storage cleanup and a bit of history

Reply via email to