[jira] Updated: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]

Michael McCandless (JIRA) Fri, 28 May 2010 13:06:59 -0700

     [ 
https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Michael McCandless updated LUCENE-2380:
---------------------------------------

    Attachment: LUCENE-2380.patch

OK, I fixed up the patch.  I think it's ready to commit, though it'd be
great if someone could double-check my Solr changes:

  * Updated to trunk

  * Fixed bug in Solr's ByteUtils.java (it was not respecting the
    offset in the incoming BytesRef)

  * Added an optional boolean "fasterButMoreRAM" option when loading
    the field cache; it defaults to true

  * For DocTermsIndex, I defined ord=0 to mean "unset", and made it
    the caller's responsibility to do something with the ord=0 case if
    an empty (length=0) BytesRef isn't acceptable.  Likewise, for
    DocTerms, I now directly return an empty BytesRef if the doc didn't
    have this field, but I also added an exists method so you can
    check explicitly if you need to.

  * Added a getTerm convenience method (calls getOrd then lookup, by
    default) to the terms index; renamed DocTerms.get -> getTerm for
    consistency

  * Fixed the nocommits and/or changed to TODOs

  * Small cleanups
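
To make the ord=0 convention and the offset fix concrete, here is a
minimal, self-contained sketch.  It does not use the classes from the
patch: Slice and TermsIndexSketch are hypothetical stand-ins for
BytesRef and DocTermsIndex, and the storage layout (one shared byte
pool plus per-ord start offsets) is only an assumption made for
illustration.  The method names getOrd, lookup and getTerm follow the
description above.

```java
import java.nio.charset.StandardCharsets;

public class OrdZeroSketch {

    /** Hypothetical stand-in for BytesRef: a slice of a shared byte[]. */
    static final class Slice {
        final byte[] bytes;
        final int offset;
        final int length;

        Slice(byte[] bytes, int offset, int length) {
            this.bytes = bytes;
            this.offset = offset;
            this.length = length;
        }

        /** Decodes only this slice; starts at offset, never assumes 0. */
        String utf8ToString() {
            return new String(bytes, offset, length, StandardCharsets.UTF_8);
        }
    }

    /** Hypothetical terms index in which ord 0 is reserved for "unset". */
    static final class TermsIndexSketch {
        private final byte[] pool;      // all term bytes, concatenated
        private final int[] termStart;  // term `ord` spans termStart[ord]..termStart[ord + 1]
        private final int[] docToOrd;   // one ord per doc; 0 means the doc has no term

        TermsIndexSketch(byte[] pool, int[] termStart, int[] docToOrd) {
            this.pool = pool;
            this.termStart = termStart;
            this.docToOrd = docToOrd;
        }

        int getOrd(int docID) {
            return docToOrd[docID];
        }

        Slice lookup(int ord) {
            int start = termStart[ord];
            return new Slice(pool, start, termStart[ord + 1] - start);
        }

        /** Convenience method: getOrd followed by lookup. */
        Slice getTerm(int docID) {
            return lookup(getOrd(docID));
        }
    }

    /** Three docs: doc 0 has "banana", doc 1 has no term, doc 2 has "apple". */
    static TermsIndexSketch example() {
        byte[] pool = "applebanana".getBytes(StandardCharsets.UTF_8);
        int[] termStart = {0, 0, 5, 11}; // ord 0 is empty, ord 1 = apple, ord 2 = banana
        int[] docToOrd = {2, 0, 1};
        return new TermsIndexSketch(pool, termStart, docToOrd);
    }

    public static void main(String[] args) {
        TermsIndexSketch idx = example();
        for (int doc = 0; doc < 3; doc++) {
            int ord = idx.getOrd(doc);
            // The caller decides what to do with ord 0; here we print a marker.
            String shown = ord == 0 ? "<unset>" : idx.getTerm(doc).utf8ToString();
            System.out.println("doc " + doc + " ord=" + ord + " term=" + shown);
        }
    }
}
```

A caller that is happy with an empty term can skip the ord check and
call getTerm directly; the check is only needed when "unset" must be
told apart from a real value.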

I've also added a MIGRATE.txt that spells out more details on how an
app can cut over to the new APIs.

I think there are some other good things to do here, but as a future
issue (this one's big enough!) -- I'll open it:

  * For DocTermsIndex, make it optional whether the bytes data is
    loaded, e.g. for a single-segment index (LUCENE-2335), or for sort
    comparators and apps that do not need the bytes data (e.g. because
    they use the terms dict to resolve ord -> term, and vice versa).

  * Possibly merge DocTerms & DocTermsIndex.  EG it's dangerous today
    if you load terms and then termsIndex because you're wasting tons
    of RAM; it'd be nicer if we could have a single cache entry that'd
    "upgrade" itself to be an index (have the ords).

> Add FieldCache.getTermBytes, to load term data as byte[]
> --------------------------------------------------------
>
>                 Key: LUCENE-2380
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2380
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 4.0
>
>         Attachments: LUCENE-2380.patch, LUCENE-2380.patch, LUCENE-2380.patch, 
> LUCENE-2380.patch
>
>
> With flex, a term is now an opaque byte[] (typically a UTF-8 encoded Unicode
> string, but not necessarily), so we need to push this up the search stack.
> FieldCache now has getStrings and getStringIndex; we need corresponding
> methods to load terms as native byte[], since in general they may not be
> representable as String.  This should be quite a bit more RAM efficient too
> for US-ASCII content, since each character would then use 1 byte, not 2.
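
The "1 byte not 2" point in the description can be checked with a
short sketch.  It compares the UTF-8 encoded size of a term with the
size of the same term held as 16-bit Java chars.  The class and method
names are hypothetical, and the numbers cover only the character data,
not object headers or any other per-entry overhead.

```java
import java.nio.charset.StandardCharsets;

public class TermRamSketch {

    /** Bytes needed to hold the term as UTF-8 encoded byte[] data. */
    static int utf8Bytes(String term) {
        return term.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Bytes needed to hold the term as 16-bit chars (2 bytes per char). */
    static int utf16Bytes(String term) {
        return term.length() * 2;
    }

    public static void main(String[] args) {
        String ascii = "lucene";
        String accented = "caf\u00e9"; // 'e' with acute accent takes 2 bytes in UTF-8
        System.out.println(ascii + ": utf8=" + utf8Bytes(ascii)
            + " utf16=" + utf16Bytes(ascii));
        System.out.println("cafe-acute: utf8=" + utf8Bytes(accented)
            + " utf16=" + utf16Bytes(accented));
    }
}
```

For pure ASCII terms the byte[] form is half the size; the saving
shrinks as the share of non-ASCII characters grows.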

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


