[jira] [Commented] (LUCENE-6216) Make it easier to modify Japanese token attributes downstream

Robert Muir (JIRA) Tue, 03 Feb 2015 17:49:50 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-6216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304479#comment-14304479 ]


Robert Muir commented on LUCENE-6216:
-------------------------------------

Yeah, for the second idea, I guess my main concern is that it surfaces what 
should be an implementation detail up to the user. It has some practical 
challenges too: today, if you have just a simple JapaneseTokenizer-only 
chain, you can see e.g. POS and so on when debugging. But with the alternate 
approach, you'd have to modify your analysis chain to see "everything".

That doesn't mean the current approach is the way it should be, though: the whole 
chain could work differently rather than exposing all the attributes. But if 
we stay with what we have, we should definitely try to clean this up more. For 
example, I hate that every Japanese*Attribute has a setToken() method at all. 
They should be more POJO-like, with get/set. The current 
lazy-loaded/backed-by-token behavior is an "optimization" that should somehow only be 
known by the Tokenizer and the *Impl. And as an optimization, setToken() should 
still work.
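
To make the get/set idea concrete, here is a rough sketch of the shape I mean. 
It is not the actual Lucene API: PartOfSpeechSketch, PartOfSpeechSketchImpl and 
setLazySource() are made-up names for illustration, and a plain Supplier stands 
in for the Token-backed dictionary lookup. The interface only has get/set, and 
the lazy path lives on the impl, where only the tokenizer would touch it.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for a Japanese*Attribute: plain get/set, no setToken(). */
interface PartOfSpeechSketch {
    String getPartOfSpeech();
    void setPartOfSpeech(String partOfSpeech);
}

/** Hypothetical *Impl: the lazy, token-backed path is visible only at this level. */
final class PartOfSpeechSketchImpl implements PartOfSpeechSketch {
    private String value;           // explicit or already-resolved value
    private Supplier<String> lazy;  // deferred lookup (assumed stand-in for Token), or null

    /** Tokenizer-side hook, playing the role setToken() plays today. */
    void setLazySource(Supplier<String> source) {
        this.lazy = source;
        this.value = null;
    }

    @Override
    public String getPartOfSpeech() {
        if (lazy != null) {         // resolve on first read, then cache the result
            value = lazy.get();
            lazy = null;
        }
        return value;
    }

    @Override
    public void setPartOfSpeech(String partOfSpeech) {
        this.lazy = null;           // an explicit value overrides any pending lazy source
        this.value = partOfSpeech;
    }
}
```

A downstream filter would only ever call setPartOfSpeech(), while the tokenizer 
keeps the cheap deferred lookup for the common case where nobody reads the value.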

> Make it easier to modify Japanese token attributes downstream
> -------------------------------------------------------------
>
>                 Key: LUCENE-6216
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6216
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Christian Moen
>            Priority: Minor
>
> Japanese-specific token attributes such as {{PartOfSpeechAttribute}}, 
> {{BaseFormAttribute}}, etc. get their values from a 
> {{org.apache.lucene.analysis.ja.Token}} through a {{setToken()}} method.  
> This makes it cumbersome to change these token attributes later on in the 
> analysis chain since the {{Token}} instances are difficult to instantiate 
> (sort of read-only objects).
> I've run into this issue in LUCENE-3922 (JapaneseNumberFilter), where it would 
> be appropriate to update token attributes to also reflect Japanese number 
> normalization.
> I think it might be more practical to allow setting a specific value for 
> these token attributes directly rather than through a {{Token}}, since it 
> makes the APIs simpler, makes it easier to change attributes downstream, and 
> also makes it easier to support additional dictionaries.
> The drawback I can think of with this approach is a performance hit, as we 
> will miss out on the inherent lazy retrieval of these token attributes from 
> the {{Token}} object (and the underlying dictionary/buffer).
> I'd like to do some testing to better understand the performance impact of 
> this change. Happy to hear your thoughts on this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

