[ https://issues.apache.org/jira/browse/LUCENE-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13051491#comment-13051491 ]
Michael McCandless commented on LUCENE-3206:
--------------------------------------------

{quote}
bq. this could be a non-negligible increase in FST size for the non-ascii case I think?

I don't know. If the non-ASCII is encoded as UTF8 for the BytesRef, then storing full unicode points on transitions shouldn't really account for much more (in fact it may create fewer states/transitions because multibyte UTF8 sequences will require multiple transitions)? This we would need to check, of course. And I assume input sequences ARE text, which in general may not be the case... I think I'll leave BYTE1/BYTE4 an option for now and see if I can improve on it once I have a working test suite.
{quote}

Ahh, yes I agree it'd be a more interesting comparison if you use UTF32 instead of UTF8.

The case I was worried about is if you must use UTF8 (ie because TermsEnum speaks only BytesRef), then writing those bytes as a vInt instead of a fixed byte is a penalty to non-ascii.

{quote}
bq. I think SimpleText codec is a good example? Also VariableGapTermsIndexReader, and MemoryCodec? Each of these use the BytesRefFSTEnum, I believe.

I wasn't clear -- I can find the places where they're used, but I wanted to clarify the nature of stored keys and values (are they UTF8 text, utf16, unicode, random bytes)? I can go through the code, but you're probably a faster source of information on this one. Robert, if you're reading this -- anything you envision could be stored as transition labels?
{quote}

Ahh... I think all uses have BytesRef (UTF8 encoded term) as the key, and various things as the values.

I don't think we've used FST during analysis yet but we should try; then I suspect we'd use UTF16 labels?
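To make the vInt concern above concrete, here is a rough sketch that counts only the bytes needed to store the labels of a single term under three schemes: UTF8 bytes as fixed single bytes, UTF8 bytes as vInts, and Unicode code points as vInts. The class `LabelCost` and its method names are hypothetical and are not part of the FST package. The sketch assumes a vInt is the usual 7-bits-per-byte variable-length encoding, and it ignores everything else that determines real FST size (prefix/suffix sharing, arc flags, targets, outputs), so it says nothing about the final automaton.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not a Lucene class: estimates label storage only.
final class LabelCost {

    /** Bytes needed for a non-negative int written with 7 payload bits per byte. */
    static int vIntSize(int value) {
        int size = 1;
        while ((value & ~0x7F) != 0) {
            value >>>= 7;
            size++;
        }
        return size;
    }

    /** One label per UTF8 byte, each label stored as a single fixed byte. */
    static int utf8FixedByteCost(String term) {
        return term.getBytes(StandardCharsets.UTF_8).length;
    }

    /** One label per UTF8 byte, each label stored as a vInt. */
    static int utf8VIntCost(String term) {
        int total = 0;
        for (byte b : term.getBytes(StandardCharsets.UTF_8)) {
            // Any byte >= 0x80 (every byte of a multibyte sequence) needs 2 vInt bytes.
            total += vIntSize(b & 0xFF);
        }
        return total;
    }

    /** One label per Unicode code point, each label stored as a vInt. */
    static int codePointVIntCost(String term) {
        return term.codePoints().map(LabelCost::vIntSize).sum();
    }

    /** Number of labels (transitions along the term's path) when labels are code points. */
    static int codePointLabelCount(String term) {
        return term.codePointCount(0, term.length());
    }

    public static void main(String[] args) {
        String[] terms = {"lucene", "na\u00EFve", "\u65E5\u672C"};
        for (String term : terms) {
            System.out.println(term
                + ": utf8 fixed=" + utf8FixedByteCost(term)
                + ", utf8 vInt=" + utf8VIntCost(term)
                + ", code point vInt=" + codePointVIntCost(term)
                + ", code point labels=" + codePointLabelCount(term));
        }
    }
}
```

Under these assumptions an ASCII term costs the same in all three schemes, while a term with multibyte characters pays extra only in the "UTF8 bytes as vInt" case, and the code-point scheme uses fewer labels along the path. Whether that translates into a smaller or larger FST overall is exactly what would still need to be measured.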
> FST package API refactoring
> ---------------------------
>
> Key: LUCENE-3206
> URL: https://issues.apache.org/jira/browse/LUCENE-3206
> Project: Lucene - Java
> Issue Type: Improvement
> Components: core/FSTs
> Affects Versions: 3.2
> Reporter: Dawid Weiss
> Assignee: Dawid Weiss
> Priority: Minor
> Fix For: 3.3, 4.0
>
> Attachments: LUCENE-3206.patch
>
>
> The current API is still marked @experimental, so I think there's still time
> to fiddle with it. I've been using the current API for some time and I do
> have some ideas for improvement. This is a placeholder for these -- I'll post
> a patch once I have a working proof of concept.