[
https://issues.apache.org/jira/browse/BLUR-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13693558#comment-13693558
]
Aaron McCurry commented on BLUR-112:
------------------------------------
Rahul,
These are all good questions. To be honest, this is a feature that I know we
need, but I am unsure what the correct implementation should be. I like your
line of thinking, though. After giving this some thought, I have tried to
document some of my thoughts below. Like I said, I am not sold on any
implementation, so I am open to other ideas on how we should proceed.
I believe there are 2 fundamental operating modes when it comes to column
definitions/types/analyzers.
* The first mode is when you know all the columns and definitions up front and
you define a column with a type and an optional analyzer if that type is a
TextField. This mode is required for MapReduce indexing because it needs to
know up front how to index all the data.
* The second mode is when no columns or types are known up front but as they
are discovered they need to be added to the table descriptor/analyzer def.
I believe we need to support both modes at the same time. So back to some of
your questions.
- Once defined, we do not allow updates for existing column definitions. We can
safely check whether any Inbound Column types have a conflict with any defined
types in AnalyzerDefinition as we do not have to worry about AnalyzerDefinition
object on the shard server being out of sync with that stored on Zookeeper.
This is true.
- However for dynamic columns as you explained, the same column might have been
added by a different shard server. This need not happen at the exact same
time (correct me). Even if ShardServer-B added a dynamic column 5 seconds
before, the AnalyzerDefinition in ShardServer-A does not reflect that change.
So this is where the ZkCachedMap comes in. Basically it is an in-memory cache of
fields (or any other values, really) in which each entry can only be set once,
and it is persistent. It also serves as a consistent store for all the types
across all the shard servers.
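To make the "set once" idea concrete, here is a minimal in-memory sketch. It is not the actual ZkCachedMap: the class name SetOnceTypeMap and its methods are hypothetical, and the ZooKeeper persistence is left out entirely. It only shows how a second writer with a conflicting type would be rejected.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the set-once behaviour described above.
// A real version would also persist each entry to ZooKeeper; this one is memory-only.
class SetOnceTypeMap {
    private final ConcurrentHashMap<String, String> types = new ConcurrentHashMap<>();

    /**
     * Records the type for a column if none is recorded yet.
     * Returns true if the type was stored or already matches,
     * false if a different type was set earlier (a conflict).
     */
    boolean register(String column, String type) {
        String existing = types.putIfAbsent(column, type);
        return existing == null || existing.equals(type);
    }

    /** Returns the recorded type, or null if the column is unknown. */
    String get(String column) {
        return types.get(column);
    }
}
```

With this shape, ShardServer-A and ShardServer-B can both try to register the same dynamic column; whichever call lands second either agrees with the stored type or is told about the conflict.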
- Also a little confused about how to use ZkCachedMap. What values will we be
caching/storing using this? Are they only the ones which can be overwritten on
Zookeeper?
I think that we should store all the types (maybe more information, like the
entire column definition) in ZkCachedMap for all cases. I think that this
will make things more consistent. Perhaps the ZkCachedMap needs to become the
storage mechanism for the AnalyzerDefinition. Also, ZkCachedMap is probably
a bad name for that class, so we might need to come up with something else. As I
look at your pseudo code, you are right that we will need to reload the analyzer
somehow. It will likely need to be driven from a ZK watcher on the update of
the ZkCachedMap.
Operations at a high level
* From an external api (Thrift), I would think we would need a method that
looks like "addColumnDefinition(family,name,type,analyzer,fulltext)" or
something like that
* Next it would call a method to update the ZkCachedMap and store to ZK
* Then a ZK watcher would fire in the ZooKeeperClusterStatus class (the
watcher would also fire on all the shard servers) that would update the
TableDescriptor, then perhaps we would clear the table context cache
* And the clearing of the cache would force the recreation of the
BlurAnalyzer (if it doesn't, we should make it)
* We would also need to figure out how to get the new analyzer into the index
writer (maybe with an atomic reference inside an analyzer decorator?)
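As a rough illustration of the last step, an atomic reference inside a decorator could look something like the sketch below. FieldAnalyzer and SwappableAnalyzer are hypothetical names made up for this example; a real version would extend Lucene's Analyzer and be swapped from the ZK watcher callback, neither of which is shown here.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical minimal analyzer interface, used instead of Lucene's Analyzer
// so the sketch stays self-contained.
interface FieldAnalyzer {
    String typeOf(String field);
}

// Decorator that the index writer would hold for its whole lifetime.
// The delegate can be replaced atomically when column definitions change.
class SwappableAnalyzer implements FieldAnalyzer {
    private final AtomicReference<FieldAnalyzer> delegate;

    SwappableAnalyzer(FieldAnalyzer initial) {
        this.delegate = new AtomicReference<>(initial);
    }

    /** Called (for example) after a watcher fires and a new analyzer is built. */
    void swap(FieldAnalyzer next) {
        delegate.set(next);
    }

    @Override
    public String typeOf(String field) {
        // Each call reads the current delegate, so later calls see the swap.
        return delegate.get().typeOf(field);
    }
}
```

The point of the decorator is that the index writer never needs to be rebuilt: it keeps the same SwappableAnalyzer instance, and only the delegate behind it changes.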
As we talk about this feature I think we need to break this one up into sub
tasks. Let me know what you think.
Thanks,
Aaron
> Allow for types to be set on blur tables
> ----------------------------------------
>
> Key: BLUR-112
> URL: https://issues.apache.org/jira/browse/BLUR-112
> Project: Apache Blur
> Issue Type: Improvement
> Affects Versions: 0.2.0, 0.3.0
> Reporter: Aaron McCurry
> Fix For: 0.3.0
>
>
> Create the ability for Blur to handle the default Lucene field types. This
> should not be tied to the table descriptor because types should be allowed to
> be added at runtime. Also 2 new fields should be added to the
> TableDescriptor:
> 1. A strict types attribute. If set to true and a new column is added to the
> table with no type mapping for it, throw an exception. Set to false
> by default.
> 2. Default type if strict is set to false. The default type should be text.
> Also, dynamic columns could be allowed if their name included the type. Such
> as:
> The column name could be "col1" with a type of "int", in the Column struct in
> thrift the name would be "col1/int" and if the type did not exist before the
> call it would be added.
> Thoughts?
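The "col1/int" naming idea and the strict flag from the description above could be sketched roughly as follows. DynamicColumnResolver is a hypothetical class, not part of Blur. Rejecting a conflicting redefinition and not recording the default type are assumptions made for this example, not something the issue specifies.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical resolver for column names of the form "name" or "name/type".
class DynamicColumnResolver {
    private static final String DEFAULT_TYPE = "text";
    private final Map<String, String> types = new HashMap<>();
    private final boolean strict;

    DynamicColumnResolver(boolean strict) {
        this.strict = strict;
    }

    /** Returns the type to use for the given raw column name. */
    String resolve(String rawName) {
        int slash = rawName.lastIndexOf('/');
        String name = slash < 0 ? rawName : rawName.substring(0, slash);
        String declared = slash < 0 ? null : rawName.substring(slash + 1);

        String known = types.get(name);
        if (known != null) {
            // Assumption: an existing definition can never be changed.
            if (declared != null && !declared.equals(known)) {
                throw new IllegalArgumentException(
                    "Column " + name + " is already typed as " + known);
            }
            return known;
        }
        if (declared != null) {
            // The type did not exist before the call, so it is added.
            types.put(name, declared);
            return declared;
        }
        if (strict) {
            throw new IllegalStateException("No type mapping for column " + name);
        }
        // Assumption: the default is returned but not recorded.
        return DEFAULT_TYPE;
    }
}
```

For example, with strict set to false, resolving "col1/int" registers col1 as int, a later "col1" resolves to int, and an unknown "col2" falls back to text.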