[
https://issues.apache.org/jira/browse/CASSANDRA-9459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706745#comment-14706745
]
Sam Tunnicliffe commented on CASSANDRA-9459:
--------------------------------------------
Thanks [~slebresne], I appreciate it's a bit of a slog. I've pushed fixes that
address most of your first set of points (too many to enumerate so I'll only
mention the ones which I didn't just fix, but there's basically a new commit
per-point in the branch if you want to verify any in particular).
bq. {{getBestIndexFor}} seems to now favor indexes that handle more of the
expressions. I'm not convinced by that heuristic
The metric isn't so much how many of the original expressions can an Index
handle, but the amount of filtering left to do on the results of the index
lookup. It just so happens that all our current (built in) index
implementations only reduce any filter to the degree of removing 1 expression
(at most). In it's current incarnation {{RowFilter}} is limited to describing
only a very simple expression tree, one with depth of 1 and containing only AND
relations. As an effort to future-proof the Index interface somewhat I thought
transforming the original filter into one representing what the index *can't*
do would be a decent heuristic. The naive comparison in {{SIM}} which evaluates
the reduced filters by number of expressions could safely be changed later
without breaking custom index implementations (at least, that was the
intention). The merits of attempting to future proof vs YAGNI are of course
debatable, so if we agree this needs more consideration then I'll happily
revert the selection criteria to simply consider {{estimateResultRows}}.
bq. The initialization of an {{Index}} bothers me a bit
bq. As for registration, index can do it directly in the ctor by calling
{{baseCfs.indexManager.register()}} (we can also specify they have to do it).
I'm not a fan of enforcing that kind of thing just by specification. Also, IoC
is more explicit, cleaner and reduces dependencies. I *would* also argue that
it improves testability but as I'm not actually testing this anywhere, I won't.
Either way, I'm fine with requiring a constructor with a specific signature &
removing the {{init}} method, but I'd prefer to keep {{register}} separate if
you don't object too strongly.
Aside from that, I agree with all your points regarding
construction/init/reload etc. On caveat is that {{SIM#reload}} can't just
reload the indexes it knows about already, it has to check that every index
defined in its base CFS's metadata is present. In the schema update from
executing {{CreateIndexStatement}}, only the {{CFMetaData.indexes}} is updated,
not the {{ColumnFamilyStore}} itself, which is just reloaded. When that reload
happens then, the {{SIM}} needs to add the new index which is present in
{{CFM.indexes}}, but not already registered. With that exception, I've made
those changes in {{454bba7708018f872df825103aa52a99c8f653bd}}.
bq. Not a fan of using {{Optional}} as return type of
{{Index.getReducedFilter}} as I would have expected intuitively that an empty
optional would means the whole filter is reduced
As noted above, this whole approach to selection is up for debate so it may be
that this simply becomes not a problem. That said, I would argue against a
couple of your points here; I don't agree that an empty optional intuitively
implies the total reduction of the filter, in that case an empty filter, not an
empty optional, would seem most semantically correct to me. Secondly, it feels
fragile to make assumptions about object equality in this way, especially in an
extension point like this. I would rather not depend on documentation to
enforce this sort of thing. Prior art for doing this *kind* of thing in C* is
to return null when there's caller cannot satisfy a request and so using
{{Optional}} instead seems pretty reasonable to me.
bq. I also don't find the {{Index.getReducedFilter}} naming too intuitive. I'd
have prefer something like {{Index.getUnhandledExpressions}}
Again, the point of the method isn't necessarily to identify which of
*original* expressions are unhandled, isn't it conceivable that a custom index
could radically transform a filter into one containing an entirely disjoint set
of expressions from the original?
bq. In {{CassandraIndex.indexFor}}, the implementations of {{insertRow}} and
{{removeRow}} seems dangerous to me..if insertRow() is called with some
tombstone, it will insert the cells instead of removing them for
instance{{insertRow}} and {{updateRow}}.
Ok that's a good point, but can you clarify do you mean regular the row may
contain regular tombstones (i.e. the result of "DELETE col FROM table WHERE
..."?) If that is the case, then yes there is a bug currently in that we will
try to insert an index entry for the (empty) value of the tombstone. I've fixed
that (in {{4d7fa1b83a92f4f5067af5ac1693ebef48135ea0}}, but I don't quite get
what you mean by "it will insert the cells instead of removing them for
instance", as the tombstone has no value there's nothing we can remove - in
{{updateRow}} we can remove the value from the existing row ofc. I also think
we {{removeRow}} is still needed for cleanup & compaction
bq. Why do we care about {{IndexMetadata.TargetType}}?
The main driver for {{IndexMetadata.TargetType}} was to make explicit the
difference between indexes with dynamic and static sets of target columns. We
could infer that an index greedily processes all partition updates if its
target columns is an empty set, but we discussed this in CASSANDRA-6717 and
decided that explicitly indicating that was better (at least from a
{{system_schema}} perspective).
bq. The fact that we have both {{IndexMetadata.TargetType}} and
{{IndexTarget.TargetType}} is terribly confusing
It's pretty lame, but the best I could come up with was to rename
{{IndexTarget.TargetType}} to {{IndexTarget.Type}}, it's only used within
{{IndexTarget}} itself an {{CreateIndexStatement}} so maybe the extra info in
the name is redundant.
bq. In {{AlterTableStatement.java}}, I'd rather make the {{if
(index.columns.size() == 1)}} an assertion: we _will_ have to change this code
once we support index on multiple columns (either by erroring out, or by
actually dropping the index anyway), so I'd prefer having the code crash loudly
if we forgot to update it rather than silently doing the wrong thing.
Good call, I almost did this in the first place, fixed.
bq. In {{Keyspace.getValidColumnFamilies}}, it would make more sense to me to
have a {{SecondaryIndexManager.getIndexByName(String indexName)}} method and
use that (as a side note, it's weird that the method doesn't actually use the
'idxName' variable. Do you know what gives?).
Yes, that's something of a legacy from earlier versions before CASSANDRA-6717
added {{IndexMetadata}} (at that point I'd completely removed the
{{SIM.indexes}} map and missed here when adding it back. I'm not sure what you
mean about 'idxName', looks like it's used to me (though reading this method
was making my head hurt and there was a bug in it, so I've pushed a cleaned up
version).
bq. Is there a reason for {{IndexRegistry}}? It doesn't hurt, but it doesn't
seem of any concrete use to me (maybe it's for custom indexes, but even then
I'm not really sure what it allows concretely).
Mentioned above re:IoC, if we agree that this is a reasonable approach,
abstracting the registry-ness from {{SIM}} makes it much easier to use a
lightweight implementation for tests etc (argument slightly undermined by not
actually doing this anywhere).
bq. Your IDE seems to expand {{.*}} imports.
arrrgh, my settings had been lost and out of force of habit I'd been "fixing"
the imports (or so I thought) every time I made a change. I'll make sure I do a
final pass through all the modified files before this gets committed.
I'm nearly done with the points from your second comment, so I'll update again
shortly
> SecondaryIndex API redesign
> ---------------------------
>
> Key: CASSANDRA-9459
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9459
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Sam Tunnicliffe
> Assignee: Sam Tunnicliffe
> Fix For: 3.0 beta 1
>
>
> For some time now the index subsystem has been a pain point and in large part
> this is due to the way that the APIs and principal classes have grown
> organically over the years. It would be a good idea to conduct a wholesale
> review of the area and see if we can come up with something a bit more
> coherent.
> A few starting points:
> * There's a lot in AbstractPerColumnSecondaryIndex & its subclasses which
> could be pulled up into SecondaryIndexSearcher (note that to an extent, this
> is done in CASSANDRA-8099).
> * SecondayIndexManager is overly complex and several of its functions should
> be simplified/re-examined. The handling of which columns are indexed and
> index selection on both the read and write paths are somewhat dense and
> unintuitive.
> * The SecondaryIndex class hierarchy is rather convoluted and could use some
> serious rework.
> There are a number of outstanding tickets which we should be able to roll
> into this higher level one as subtasks (but I'll defer doing that until
> getting into the details of the redesign):
> * CASSANDRA-7771
> * CASSANDRA-8103
> * CASSANDRA-9041
> * CASSANDRA-4458
> * CASSANDRA-8505
> Whilst they're not hard dependencies, I propose that this be done on top of
> both CASSANDRA-8099 and CASSANDRA-6717. The former largely because the
> storage engine changes may facilitate a friendlier index API, but also
> because of the changes to SIS mentioned above. As for 6717, the changes to
> schema tables there will help facilitate CASSANDRA-7771.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)