[ 
https://issues.apache.org/jira/browse/BLUR-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13693558#comment-13693558
 ] 

Aaron McCurry commented on BLUR-112:
------------------------------------

Rahul,

These are all good questions. To be honest, this is a feature that I know we 
need, but I am unsure what the correct implementation should be.  I like your 
line of thinking though.  After giving this some thought, I have tried to 
document my thinking below.  Like I said, I am not sold on any 
implementation, so I am open to other ideas on how we should proceed.

I believe there are 2 fundamental operating modes when it comes to column 
definitions/types/analyzers.

* The first mode is when you know all the columns and definitions up front and 
you define a column with a type and an optional analyzer if that type is a 
TextField.  This mode is required for MapReduce indexing because it needs to 
know up front how to index all the data.

* The second mode is when no columns or types are known up front but as they 
are discovered they need to be added to the table descriptor/analyzer def.

I believe we need to support both modes at the same time.  So back to some of 
your questions.

- Once defined, we do not allow updates for existing column definitions. We can 
safely check whether any Inbound Column types have a conflict with any defined 
types in AnalyzerDefinition as we do not have to worry about AnalyzerDefinition 
object on the shard server being out of sync with that stored on Zookeeper. 

This is true.

- However for dynamic columns as you explained, the same column might have been 
added by a different shard server. This need not happen at the exact same 
time(correct me) . Even if ShardServer-B added a dynamic column 5 seconds 
before, the AnalyzerDefinition in ShardServer-A does not reflect that change

So this is where the ZkCachedMap comes in.  Basically it is an in-memory cache 
of fields (or any other values really) whose entries can only be set once and 
are persistent.  It also serves as a consistent store for all the types across 
all the shard servers.
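
To make the "set once" behavior concrete, here is a minimal sketch.  It is not 
the real ZkCachedMap: the class name SetOnceTypeMap, the Store interface and 
the InMemoryStore stand-in for ZooKeeper are all hypothetical, and the real 
thing would persist to ZK instead.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of set-once semantics; not the real ZkCachedMap. */
class SetOnceTypeMap {

    /** Stand-in for the persistent store (ZooKeeper in the real design). */
    interface Store {
        /** Returns the value that ends up stored: the new one, or the one already there. */
        String createIfAbsent(String key, String value);
    }

    /** In-memory Store so the sketch runs without ZooKeeper. */
    static class InMemoryStore implements Store {
        private final ConcurrentHashMap<String, String> data = new ConcurrentHashMap<>();

        @Override
        public String createIfAbsent(String key, String value) {
            String existing = data.putIfAbsent(key, value);
            return existing == null ? value : existing;
        }
    }

    // Local cache of entries this server has already seen.
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Store store;

    SetOnceTypeMap(Store store) {
        this.store = store;
    }

    /**
     * Registers a type for a column unless one is already set.
     * Returns the type that is in effect after the call.
     */
    String putIfAbsent(String column, String type) {
        String cached = cache.get(column);
        if (cached != null) {
            return cached;
        }
        // The shared store decides the winner; every server caches that answer.
        String winner = store.createIfAbsent(column, type);
        cache.put(column, winner);
        return winner;
    }
}
```

With two SetOnceTypeMap instances sharing one Store (think ShardServer-A and 
ShardServer-B), whichever registers "fam.col1" first wins, and the other gets 
the already-stored type back instead of overwriting it.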

- Also a little confused about how to use ZkCachedMap. What values will we be 
caching/storing using this? Are they only the one's which can be overwritten on 
Zookeeper?

I think that we should store all the types (maybe more information, like the 
entire column definition) in ZkCachedMap for all cases.  I think that this 
will make things more consistent.  Perhaps the ZkCachedMap needs to become the 
storage mechanism for the AnalyzerDefinition.  Also, ZkCachedMap is probably 
a bad name for that class; we might need to come up with something else.  As I 
look at your pseudo code, you are right, we will need to reload the analyzer 
somehow.  It will likely need to be driven from a ZK watcher on the update of 
the ZkCachedMap.

Operations at a high level

* From an external API (Thrift), I would think we would need a method that 
looks like "addColumnDefinition(family,name,type,analyzer,fulltext)" or 
something like that.
* Next it would call a method to update the ZkCachedMap and store to ZK.
* Then a ZK watcher would fire in the ZooKeeperClusterStatus class (the watcher 
would fire on all the shard servers as well) that would update the 
TableDescriptor; then perhaps we would clear the table context cache.
* Clearing the cache would force the recreation of the BlurAnalyzer (if it 
doesn't, we should make it do so).
* We would also need to figure out how to get the new analyzer into the index 
writer (maybe with an atomic reference inside an analyzer decorator?).
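
To illustrate the last bullet, here is a rough sketch of the atomic reference 
idea.  The FieldAnalyzer interface is a made-up stand-in (it is not Lucene's 
Analyzer) and SwappableAnalyzer is a hypothetical name; the point is only 
that the index writer would keep one decorator instance while the delegate 
behind it is replaced.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Made-up stand-in for an analyzer; Lucene's Analyzer is deliberately not used here. */
interface FieldAnalyzer {
    String typeOf(String field);
}

/** Decorator that delegates every call to whichever analyzer is current. */
class SwappableAnalyzer implements FieldAnalyzer {
    private final AtomicReference<FieldAnalyzer> current;

    SwappableAnalyzer(FieldAnalyzer initial) {
        this.current = new AtomicReference<>(initial);
    }

    /** Would be called from the watcher callback once a rebuilt analyzer is ready. */
    void swap(FieldAnalyzer rebuilt) {
        current.set(rebuilt);
    }

    @Override
    public String typeOf(String field) {
        // Each call reads the reference, so a swap is visible on the next call.
        return current.get().typeOf(field);
    }
}
```

The writer would be handed the SwappableAnalyzer once and never need to be 
reopened; calls already in flight finish against the old delegate, and later 
calls pick up the new one.  Whether Lucene's IndexWriter is happy with the 
analyzer changing underneath it is something we would still need to verify.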

As we talk about this feature I think we need to break this one up into sub 
tasks.  Let me know what you think.

Thanks,
Aaron





                
> Allow for types to be set on blur tables
> ----------------------------------------
>
>                 Key: BLUR-112
>                 URL: https://issues.apache.org/jira/browse/BLUR-112
>             Project: Apache Blur
>          Issue Type: Improvement
>    Affects Versions: 0.2.0, 0.3.0
>            Reporter: Aaron McCurry
>             Fix For: 0.3.0
>
>
> Create the ability for Blur to handle the default Lucene field types.  This 
> should not be tied to the table descriptor because types should be allowed to 
> be added at runtime.  Also 2 new fields should be added to the 
> TableDescriptor:
> 1. A strict types attribute.  If set to true and a new column is added to the 
> table with no type mapping for it, throw an exception.  Set to false 
> by default.
> 2. A default type, used if strict is set to false.  The default type should be text.
> Also, dynamic columns could be allowed if their name included the type.  Such 
> as:
> The column name could be "col1" with a type of "int", in the Column struct in 
> thrift the name would be "col1/int" and if the type did not exist before the 
> call it would be added.
> Thoughts?

