[ 
https://issues.apache.org/jira/browse/LUCENE-5748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-5748:
--------------------------------

    Attachment: LUCENE-5748.patch

Here's a prototype patch: just the default codec impl and simple support for 
sorting on multiple numeric values.

I didn't implement the simpletext, direct, or memory codecs, or any of the other 
pieces needed yet.

Here's what I think is a minimal API:
{code}
/**
 * A list of per-document numeric values, sorted 
 * according to {@link Long#compare(long, long)}.
 */
public abstract class SortedNumericDocValues {
  /** 
   * Positions the iterator at the specified document.
   */
  public abstract void setDocument(int doc);
  
  /** 
   * Retrieve the value for the current document at the specified index. 
   * An index ranges from {@code 0} to {@code count()-1}. 
   */
  public abstract long valueAt(int index);
  
  /** 
   * Retrieves the count of values for the current document. 
   * This may be zero if a document has no values.
   */
  public abstract int count();
}
{code}
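For illustration, here's a toy array-backed implementation and consumer of that API (everything beyond the abstract class itself is hypothetical, not part of the patch; the real codec impl is of course more involved). Because values are sorted per document, min is {{valueAt(0)}} and max is {{valueAt(count()-1)}}, both constant time:

```java
public class SortedNumericExample {
  // mirrors the proposed abstract class from the patch
  public static abstract class SortedNumericDocValues {
    public abstract void setDocument(int doc);
    public abstract long valueAt(int index);
    public abstract int count();
  }

  // toy impl: values[doc] holds that document's values, already sorted
  static SortedNumericDocValues fromArrays(final long[][] values) {
    return new SortedNumericDocValues() {
      private long[] current = new long[0];
      @Override public void setDocument(int doc) { current = values[doc]; }
      @Override public long valueAt(int index) { return current[index]; }
      @Override public int count() { return current.length; }
    };
  }

  public static void main(String[] args) {
    SortedNumericDocValues dv = fromArrays(new long[][] {
      {3, 7, 42},   // doc 0
      {},           // doc 1: no values
      {-1, 5}       // doc 2
    });
    dv.setDocument(0);
    System.out.println(dv.valueAt(0));              // min of doc 0: 3
    System.out.println(dv.valueAt(dv.count() - 1)); // max of doc 0: 42
    dv.setDocument(1);
    System.out.println(dv.count());                 // 0: doc 1 has no values
  }
}
```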



> SORTED_NUMERIC dv type
> ----------------------
>
>                 Key: LUCENE-5748
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5748
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Robert Muir
>         Attachments: LUCENE-5748.patch
>
>
> Currently for Strings you have SORTED and SORTED_SET, capable of single and 
> multiple values per document respectively.
> For multi-numerics, there are only a few choices:
> * encode with NumericUtils into byte[]'s and store with SORTED_SET.
> * encode yourself per-document into BINARY.
> Both of these techniques have problems: 
> SORTED_SET isn't bad if you just want to do basic sorting (e.g. min/max) or 
> faceting counts: most of the bloat in the "terms dict" is compressed away, 
> and it optimizes the case where the data is actually single-valued. But it 
> falls apart performance-wise if you want to do more complex stuff like solr's 
> analytics component or elasticsearch's aggregations: the ordinals just get in 
> your way and cause additional work, dereferencing each one to a byte[] and then 
> decoding that back to a number. Worst of all, any mathematical calculations 
> are off because it discards frequency (it deduplicates values).
> Using your own custom encoding in BINARY removes the unnecessary ordinal 
> dereferencing, but you trade off bad compression and access: you have no real 
> choice but to do something like vInt within each byte[] for the doc, which 
> means even basic sorting (e.g. max) is slow as it's not constant time. There 
> is no chance for the codec to optimize things like dates with GCD compression 
> or to optimize the single-valued case, because it's just an opaque byte[].
> So I think it would be good to explore a simple long[] type that solves these 
> problems.
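To make the BINARY drawback concrete, here's a minimal sketch of the kind of per-document vInt decode loop that workaround forces (the encoding details and class/method names are illustrative, not from any patch). Finding the max means decoding every value in the byte[]; with a sorted long[] it would be a single array read:

```java
import java.nio.ByteBuffer;

public class VIntBinaryExample {
  // write a non-negative long as a variable-length int (7 bits per byte)
  static void writeVLong(ByteBuffer out, long v) {
    while ((v & ~0x7FL) != 0) {
      out.put((byte) ((v & 0x7F) | 0x80));
      v >>>= 7;
    }
    out.put((byte) v);
  }

  // decode one variable-length long, consuming 1..N bytes
  static long readVLong(ByteBuffer in) {
    long value = 0;
    int shift = 0;
    byte b;
    do {
      b = in.get();
      value |= (long) (b & 0x7F) << shift;
      shift += 7;
    } while ((b & 0x80) != 0);
    return value;
  }

  // max over the doc's byte[]: must decode all values, so it's O(n), not O(1)
  static long maxOf(byte[] docBytes, int count) {
    ByteBuffer in = ByteBuffer.wrap(docBytes);
    long max = Long.MIN_VALUE;
    for (int i = 0; i < count; i++) {
      max = Math.max(max, readVLong(in));
    }
    return max;
  }

  public static void main(String[] args) {
    ByteBuffer out = ByteBuffer.allocate(64);
    long[] values = {300, 5, 1234567};
    for (long v : values) writeVLong(out, v);
    byte[] docBytes = new byte[out.position()];
    out.flip();
    out.get(docBytes);
    System.out.println(maxOf(docBytes, values.length)); // 1234567
  }
}
```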



--
This message was sent by Atlassian JIRA
(v6.2#6252)
