[ 
https://issues.apache.org/jira/browse/CASSANDRA-9459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705288#comment-14705288
 ] 

Sylvain Lebresne commented on CASSANDRA-9459:
---------------------------------------------

Haven't finished to review all of the patch, but I'm going to give a first 
batch of remarks/suggestions/review points. I'll note that while there is quite 
a few, they are mostly relatively small stuff: the bulk of the patch looks 
pretty good, so good job [~beobal].

* In {{SecondaryIndexManager}}:
** {{flushIndexesBlocking}} holds a lock on {{baseCfs.getTracker()}} for the 
whole duration of the flush. I don't think that's what we want. I think what we 
want is to hold the lock while we _submit_ the flush, but not for the whole 
time of the flushing.
** {{getBestIndexFor}} seems to now favor indexes that handle more of the 
expressions. I'm not convinced by that heuristic. It's totally possible for an 
index to handle less expression but be a lot more selective.  So I think we 
should stick to only considering {{estimateResultRows}} (as we're currently 
doing unless I'm missing something). If anything else, I'd rather not do that 
kind of change in this refactoring ticket.
** In {{WriteTimeTransaction.onUpdated}}, I don't think we should ignore the 
{{onPrimaryKeyLivenessInfo}} call: we could be updating a TTL on only the 
clustering columns and that should be carried out to the index.
** {{IndexGCTransaction.onRowMerge}} seems to do more work than it should. I 
believe all we want to do during compaction is remove cells that have been 
shadowed by some deletion (since we don't handle those at write time). But the 
code seems to also add any update (I'm saying imo the condition should be {{if 
(original != null && merged == null)}}). 
** In {{indexPartition}}, the static case should be inside the {{try()}}: no 
reason to filter normal rows but not the static one.
** Why do we need IndexAccessor, since that's created from a {{ReadCommand}} in 
the first place. Can't we just return an {{Index}}, and have the rest of the 
methods of IndexAccessor be methods of {{Index}} taking a {{ReadCommand}} 
(which they mostly already are anyway)? (would make {{ReadCommand.getIndex()}} 
method actually return an {{Index}}, which is a little bit more consistent).
** Should probably add a {{if (!hasIndexes())}} test on top of 
{{newUpdateTransaction}}: that's a very common case and a very hot path and 
currently even with no index I think we'll still do a bunch of work (including 
allocating an empty array).
** {{CleanupTransaction}} should be split up in 2 since Cleanup and Compaction 
use of it don't overlap in what they use and that's a bit confusing. I'd create 
3 interfaces: {{UpdateIndexTransaction}}, {{CleanupIndexTransaction}} and 
{{CompactionIndexTransaction}}. I'd also make those top-level interface to 
avoid the long {{SecondaryIndexManager}} everywhere (the concrete 
implementations can stay where they are). We could also have a 
{{IndexTransaction}} (that they all extend and have just start() and commit()) 
to put inside the {{TransactionType}} (just because {{IndexTransaction.Type}} 
looks better than {{SecondaryIndexManager.TransactionType}} :)).
** Is there a reason for using the whole {{IndexMetadata}} as map key in the 
{{indexes}} map? It feels that using the index name should be enough (since we 
guarantee it's unique and fixed) and would make looking a tad faster since 
there is less to hash/compare and might avoid building fake {{IndexMetadata}} 
just for lookup. Certainly feels cleaner to me in principle.
** I'd rename {{getAllIndexStorageTables}} to {{getAllIndexColumnFamilyStore}}: 
not sure it's worth adding the new verbiage "StorageTables" (note that I hope 
we'll soon rename {{ColumnFamilyteStore}} to {{TableStore}} and rename that 
method accordingly, but it's better to rename consistently for now and deal 
with that later imo).
* In {{Index}}:
** The initialization of an {{Index}} bothers me a bit: the fact there is 
basically 3 calls ({{init()}}, {{setIndexMetadata()}} and then {{register()}}) 
make it hard to understand what initialization actually does. It also means 
nothing can be final in the implementations even if it kind of should (at least 
for {{baseCfs}} in {{CassandraIndex}}). I haven't tested it so I might miss 
some detail, but what I could suggest would be to pass the base table CFS and 
the initial {{IndexMetadata}} to the ctor (so for custom index, we'd specify 
they should have a ctor expecting those) and we'd then just have a 
{{initializationTask()}} that return what needs to be done initially. As for 
registration, index can do it directly in the ctor by calling 
{{baseCfs.indexManager.register()}} (we can also specify they have to do it). 
Now, it's true that {{setIndexMetadata}} is also called during CFS reload, but 
that leads me to another point: it's a bit misleading imo that the index can't 
distinguish calls to this method when creating the index and when reloading it. 
It's also weird that some part of reloading an index is done by 
{{setIndexMetadata}} while other parts are done by {{getMetadataReloadTask()}}. 
 So I'd split {{SecondaryIndexManager.addIndex}} into 2 functions 
({{createIndex}} and {{reloadIndex}}), each being called appropriately in 
{{reload}}. Then, for the reload case, we can just have a {{Callable<?> 
reloadIndex(IndexMetadata)}} that would concretely do what {{setIndexMetadata}} 
and {{getMetadataReloadTask}} does.
** Not a fan of using {{Optional}} as return type of {{Index.getReducedFilter}} 
as I would have expected intuitively that an empty optional would means the 
whole filter is reduced (while it means the exact opposite). I think it would 
be clearer to handle the case where the filter is not handled by the index by 
testing if the filter returned is the same than the input (we could specify 
that the method should return the exact same object and use == if we're 
concerned with performance) (or alternatively with a separate method to test if 
the filter has any expression usable by the index).
** I also don't find the {{Index.getReducedFilter}} naming too intuitive. I'd 
have prefer something like {{Index.getUnhandledExpressions}} or something like 
that (possibly debatable though).
* In {{CassandraIndex.indexFor}}, the implementations of {{insertRow}} and 
{{removeRow}} seems dangerous to me. They both seem to ignore if the cells they 
found are tombstone or not, which sounds pretty dangerous to
me (if insertRow() is called with some tombstone, it will insert the cells 
instead of removing them for instance). The same is true of {{updateRow}} in 
fact. So I don't think we need {{removeRow}}, but rather than
we need to handle tombstones in {{insertRow}} and {{updateRow}}.
* In {{AtomicBtreePartition.addAllWithSizeDelta}}, the transaction 
{{onPartitionDeletion}} and {{onRangeTombstone}} methods are called with the 
post-update deletion infos. That means in particular that they will be called 
even if the update has deleted nothing provided the partition had prior 
deletions. I wonder if that's the best/most useful semantic (and the javadoc 
I've found around this isn't terribly clear that this is what happens). I see 2 
alternatives which both seems more intuitive to me: either pass only the update 
deletions, or pass the diff of what we had and what we add. The later is 
certainly more useful and more consistent with how we deal with rows, but it's 
also more annoying to do and a tad more costly. Maybe I can suggest going with 
the former option (which sounds to me like the one implied by the current 
javadoc) and revisit the latter option if it proves useful for someone. I'll 
note that the internal indexer doesn't do anything with those methods so this 
has not impact on internal indexes, but since it's there, we should still be 
careful of the semantic we provide.
* In {{Indexes.java}}, for {{get(ColumnDefinition)}}, not a fan of using an 
{{Optional<Collection>}} as return type since we can just use an empty 
collection. In fact, {{ImmutableMultimap.get}} never returns {{null}}, so the 
use of {{Optional}} is concretely currently useless. Also, the javadoc on that 
method is broken (it makes it sound like only one index can be returned and has 
a bad {{@return}}).
* Why do we care about {{IndexMetadata.TargetType}}? That is, I get we'll want 
index that are on more than one column, or even one some function of more than 
one column, but I'm not sure why the splitting between the ones on only one 
column and others that way make sense (nor am I convinced that {{COLUMN}} 
versus {{ROW}} is a great description of that dichotomy). Am I missing some of 
the intention?
* The fact that we have both {{IndexMetadata.TargetType}} and 
{{IndexTarget.TargetType}} is terribly confusing, especially some javadoc 
refers to just {{TargetType}}. We should at least rename one of them if we keep 
both (though as said above, I'm not sure about {{IndexMetadata.TargetType}}).
* In {{CompactionIterator}}, would be nice to avoid merging the {{Columns}} 
unless we actually need it (nits: a little bit below, the line with 
'newCleanupTransaction' needs indentation and the comment is not up-to-date 
(refers to taking the OpOrder which doesn't happen here anymore, to 
{{Indexer::finish/removeRow}} which are now 
{{IndexTransaction:commit/onRowMerge}})).
* I'd make the {{BiFunction}} used for post-processing takes a {{ReadCommand}} 
directly (rather than only the {{RowFilter}}): if a custom implementation needs 
more information on the read, no reason not to provide it.
* In {{AlterTableStatement.java}}, I'd rather make the {{if 
(index.columns.size() == 1)}} an assertion: we _will_ have to change this code 
once we support index on multiple columns (either by erroring out, or by 
actually dropping the index anyway), so I'd prefer having the code crash loudly 
if we forgot to update it rather than silently doing the wrong thing.
* In {{Keyspace.getValidColumnFamilies}}, it would make more sense to me to 
have a {{SecondaryIndexManager.getIndexByName(String indexName)}} method and 
use that (as a side note, it's weird that the method doesn't actually use the 
'idxName' variable. Do you know what gives?).
* Is there a reason for {{IndexRegistry}}? It doesn't hurt, but it doesn't seem 
of any concrete use to me (maybe it's for custom indexes, but even then I'm not 
really sure what it allows concretely).

Also, a few small nits:
* There is a few places where you call {{indexers}} what should be {{indexes}} 
(I suspect due to a change in naming). I've noted {{Keyspace.indexPartition}} 
(both parameter name and in the javadoc; There is also a commented line that 
should be removed in that method) and in {{SecondaryIndexBuilder}}. There may 
be other places.
* {{Index.searcherFor()}} javadoc is incorrect on the return type of the method.
* The last paragraph of the javadoc of {{Index.Indexer.removeRow()}} is a bit 
misleading: row deletion don't use RT in general, they use the {{deletion()}} 
parameter of {{Row}}. It's obviously true that {{removeRow}} is not called in 
the described case, but it's more because {{insertRow}} or {{updateRow}} is 
called instead.
* There's a 'todo' in {{CassandraIndex.setIndexMetadata}} that sounds like it 
should be done now (or removed if its outdated).
* In {{Index}} class javadoc, I wouldn't really start by discussing 
{{validateOptions}} as that's a details. We can mention it briefly at the end 
of the javadoc, and the details of what it should return can just be left to 
the method javadoc imo (you also want a s/duing/during/ and 
s/IndexSearcher/Searcher/ in that javadoc).
* The {{todo}} comment in {{AtomicBTreePartition}} is probably unnecessary: we 
do handle static row in the RowUpdater like any row, and they do go into the 
index transaction.
* I'd move the comment about CASSANDRA-3712 in {{CompactionManager.java}} to 
where the lock is actually taken.
* The {{Index.Indexer}} interface has _all_ his methods as default. Not a huge 
fan of the concept for what it's worth.
* Your IDE seems to expand {{.*}} imports. Not a huge deal, but it's a bit 
distracting and I'd personally prefer if it wasn't doing that.


> SecondaryIndex API redesign
> ---------------------------
>
>                 Key: CASSANDRA-9459
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9459
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Sam Tunnicliffe
>            Assignee: Sam Tunnicliffe
>             Fix For: 3.0 beta 1
>
>
> For some time now the index subsystem has been a pain point and in large part 
> this is due to the way that the APIs and principal classes have grown 
> organically over the years. It would be a good idea to conduct a wholesale 
> review of the area and see if we can come up with something a bit more 
> coherent.
> A few starting points:
> * There's a lot in AbstractPerColumnSecondaryIndex & its subclasses which 
> could be pulled up into SecondaryIndexSearcher (note that to an extent, this 
> is done in CASSANDRA-8099).
> * SecondayIndexManager is overly complex and several of its functions should 
> be simplified/re-examined. The handling of which columns are indexed and 
> index selection on both the read and write paths are somewhat dense and 
> unintuitive.
> * The SecondaryIndex class hierarchy is rather convoluted and could use some 
> serious rework.
> There are a number of outstanding tickets which we should be able to roll 
> into this higher level one as subtasks (but I'll defer doing that until 
> getting into the details of the redesign):
> * CASSANDRA-7771
> * CASSANDRA-8103
> * CASSANDRA-9041
> * CASSANDRA-4458
> * CASSANDRA-8505
> Whilst they're not hard dependencies, I propose that this be done on top of 
> both CASSANDRA-8099 and CASSANDRA-6717. The former largely because the 
> storage engine changes may facilitate a friendlier index API, but also 
> because of the changes to SIS mentioned above. As for 6717, the changes to 
> schema tables there will help facilitate CASSANDRA-7771.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to