[jira] [Commented] (LUCENE-6216) Make it easier to modify Japanese token attributes downstream

Robert Muir (JIRA) Tue, 03 Feb 2015 06:15:02 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-6216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303308#comment-14303308
 ]


Robert Muir commented on LUCENE-6216:
-------------------------------------

I think we can do it without a performance hit. The last time I benchmarked, 
the current on-demand loading mattered quite a bit for e.g. a "simple" 
analyzer, because some attributes, such as reading, take a fair number of 
bytes per character to process. 

It's not really lazy loading, though: it decodes from the dictionary on every 
single request. So maybe we should just make it truly lazily loaded?

Instead of:
{code}
String getPartOfSpeech() {
  return token == null ? null : token.getPartOfSpeech();
}
{code}

Add a setPartOfSpeech() and have the code work something like this, so it just 
caches the value but can still be changed:
{code}
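String getPartOfSpeech() {
  // decode from the dictionary only on first access, then reuse the cached value
  if (pos == null) {
    if (token != null) {
      pos = token.getPartOfSpeech();
    }
  }
  return pos;
}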
{code}
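
To make the idea concrete, here is a minimal, self-contained sketch of the caching getter plus setter. It is not the actual Lucene code: DictToken is a hypothetical stand-in for org.apache.lucene.analysis.ja.Token, decodeCount exists only to make the number of decodes observable, and resetting the cache in setToken() is my assumption about how stale values would be avoided.

```java
/** Hypothetical stand-in for org.apache.lucene.analysis.ja.Token (one accessor only). */
final class DictToken {
  private final String partOfSpeech;
  int decodeCount = 0; // illustration only: counts simulated dictionary decodes

  DictToken(String partOfSpeech) {
    this.partOfSpeech = partOfSpeech;
  }

  String getPartOfSpeech() {
    decodeCount++; // the real Token would decode from the dictionary here
    return partOfSpeech;
  }
}

/** Sketch of an attribute that caches the decoded value and lets filters override it. */
final class CachingPartOfSpeech {
  private DictToken token;
  private String pos; // null means "not loaded yet" (or no token at all)

  void setToken(DictToken token) {
    this.token = token;
    this.pos = null; // assumption: drop whatever was cached for the previous token
  }

  void setPartOfSpeech(String pos) {
    this.pos = pos; // downstream filters can overwrite the cached value
  }

  String getPartOfSpeech() {
    if (pos == null && token != null) {
      pos = token.getPartOfSpeech(); // decoded at most once per token
    }
    return pos;
  }
}
```

With this shape, repeated getPartOfSpeech() calls hit the dictionary once per token, and a value set by a downstream filter is returned without touching the token at all.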

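The disadvantage would be any semantics around 'null' (e.g. a cached null 
looks the same as "not loaded yet", so explicitly setting null would fall back 
to the token's value), but there are other ways to implement the same idea.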
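
One such alternative, sketched here purely as an illustration, is to track "resolved" separately from the value, so that an explicit null set by a filter sticks. StubToken is again a hypothetical stand-in for the real Token class, and the field and class names are made up for this example.

```java
/** Hypothetical stand-in for org.apache.lucene.analysis.ja.Token (one accessor only). */
final class StubToken {
  private final String partOfSpeech;
  int decodeCount = 0; // illustration only: counts simulated dictionary decodes

  StubToken(String partOfSpeech) {
    this.partOfSpeech = partOfSpeech;
  }

  String getPartOfSpeech() {
    decodeCount++;
    return partOfSpeech;
  }
}

/** Sketch: a boolean flag distinguishes "not loaded yet" from "explicitly null". */
final class FlaggedPartOfSpeech {
  private StubToken token;
  private String pos;
  private boolean resolved; // true once pos was loaded or explicitly set

  void setToken(StubToken token) {
    this.token = token;
    this.pos = null;
    this.resolved = false; // force a fresh load for the new token
  }

  void setPartOfSpeech(String pos) {
    this.pos = pos;
    this.resolved = true; // an explicit value, even null, wins over the token
  }

  String getPartOfSpeech() {
    if (!resolved) {
      pos = (token == null) ? null : token.getPartOfSpeech();
      resolved = true;
    }
    return pos;
  }
}
```

The cost is one extra boolean per attribute; in exchange, setPartOfSpeech(null) means "no part of speech" rather than "reload from the token".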

> Make it easier to modify Japanese token attributes downstream
> -------------------------------------------------------------
>
>                 Key: LUCENE-6216
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6216
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Christian Moen
>            Priority: Minor
>
> Japanese-specific token attributes such as {{PartOfSpeechAttribute}}, 
> {{BaseFormAttribute}}, etc. get their values from a 
> {{org.apache.lucene.analysis.ja.Token}} through a {{setToken()}} method.  
> This makes it cumbersome to change these token attributes later on in the 
> analysis chain since the {{Token}} instances are difficult to instantiate 
> (sort of read-only objects).
> I ran into this issue in LUCENE-3922 (JapaneseNumberFilter) where it would 
> be appropriate to update token attributes to also reflect Japanese number 
> normalization.
> I think it might be more practical to allow setting a specific value for 
> these token attributes directly rather than through a {{Token}} since it 
> makes the APIs simpler, makes it easier to change attributes downstream, and 
> also makes supporting additional dictionaries easier.
> The drawback with the approach that I can think of is a performance hit as we 
> will miss out on the inherent lazy retrieval of these token attributes from 
> the {{Token}} object (and the underlying dictionary/buffer).
> I'd like to do some testing to better understand the performance impact of 
> this change. Happy to hear your thoughts on this.



