[ 
https://issues.apache.org/jira/browse/LUCENE-4936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13635310#comment-13635310
 ] 

Robert Muir commented on LUCENE-4936:
-------------------------------------

Looks great! I'm glad you were able to make this fast.

A few ideas:
* I like the switch with corruption-check on DiskDV. Can we easily integrate 
this into Lucene42?
* Can we update the file format docs? (We attempt to describe the numerics 
strategies succinctly there.)

I can do a more thorough review and some additional testing later, but this 
looks awesome.

Later we should think about a place (maybe in codec file format docs, maybe 
even NumericDocValuesField?) to add some practical general guidelines for users 
that might not otherwise be intuitive. For example: if you are putting dates in 
NumericDV, zero out the portions you don't care about (e.g. milliseconds, time) 
to save space; indexing as UTC will be a little more efficient than with a 
local offset; and so on.
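
As a rough illustration of the "zero out portions you don't care about" 
guideline, here is a minimal sketch. It assumes timestamps are UTC epoch 
milliseconds; the class and method names are hypothetical and not part of 
Lucene.

```java
public class DateTruncation {
    static final long MILLIS_PER_SECOND = 1000L;
    static final long MILLIS_PER_DAY = 86_400_000L;

    /**
     * Rounds a UTC epoch-millisecond timestamp down to a multiple of
     * unitMillis. floorDiv keeps the rounding direction consistent for
     * timestamps before 1970 (negative values).
     */
    static long truncate(long epochMillis, long unitMillis) {
        return Math.floorDiv(epochMillis, unitMillis) * unitMillis;
    }

    public static void main(String[] args) {
        long ts = 1366243199123L;
        // Drop the milliseconds, then drop the whole time-of-day component.
        System.out.println(truncate(ts, MILLIS_PER_SECOND));
        System.out.println(truncate(ts, MILLIS_PER_DAY));
    }
}
```

The truncated value is what you would then hand to 
NumericDocValuesField(String name, long value). Once every value in a field is 
a multiple of the same unit, a GCD-based encoding has a large common divisor 
to exploit.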

{quote}
Improves BaseDocValuesFormatTest which almost only tested "TABLE_COMPRESSED" 
with Lucene42DVF
{quote}

Yeah, this is a good catch! We should also maybe open an issue to review DiskDV 
and try to make it more efficient. Optimizations like TABLE_COMPRESSED don't 
exist there, I think: that could be handy if someone wants e.g. a smallfloat 
scoring factor. It's nice that this patch provides back compat for DiskDV, but 
it's not totally necessary in the future if we want to review and rewrite it. 
In general that codec was just done very quickly and hasn't seen much 
benchmarking or anything: it could use some work.
                
> docvalues date compression
> --------------------------
>
>                 Key: LUCENE-4936
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4936
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Robert Muir
>            Assignee: Adrien Grand
>         Attachments: LUCENE-4936.patch, LUCENE-4936.patch
>
>
> DocValues fields can be very wasteful if you are storing dates (like Solr's 
> TrieDateField does if you enable docvalues) and don't actually need all the 
> precision: e.g. "date-only" fields like date of birth with no time component, 
> time fields without millisecond precision, and so on.
> Ideally we'd compute the GCD of all the values to save space 
> (numberOfTrailingZeros is not really enough here), but I think we should at 
> least look for values like 86400000, 3600000, and 1000 to be practical.
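
The GCD idea in the issue description can be sketched roughly as follows. 
This is an illustration only, not the code from the attached patch: the class 
and method names are hypothetical, and it assumes the value range is small 
enough that v - min does not overflow a long.

```java
public class GcdSketch {
    /** Euclid's algorithm on non-negative longs; gcd(0, x) == x. */
    static long gcd(long a, long b) {
        while (b != 0) {
            long t = a % b;
            a = b;
            b = t;
        }
        return a;
    }

    /** Returns {min, divisor}; every value equals min + k * divisor, k >= 0. */
    static long[] minAndGcd(long[] values) {
        long min = Long.MAX_VALUE;
        for (long v : values) {
            min = Math.min(min, v);
        }
        long g = 0;
        for (long v : values) {
            g = gcd(g, v - min); // assumes v - min does not overflow
        }
        // If all values are equal the GCD is 0; fall back to 1 so that
        // callers can always divide by it.
        return new long[] {min, g == 0 ? 1 : g};
    }

    /** Bits needed to store the largest (v - min) / divisor. */
    static int bitsPerValue(long[] values, long min, long divisor) {
        long max = 0;
        for (long v : values) {
            max = Math.max(max, (v - min) / divisor);
        }
        return 64 - Long.numberOfLeadingZeros(max);
    }
}
```

For three date-only values spaced one and three days from the smallest, this 
sketch finds a divisor of 86400000 and needs 2 bits per value instead of 28. 
Since 86400000 = 2^10 * 84375, shifting by numberOfTrailingZeros alone would 
remove only 10 of those bits, which is why it is not really enough here.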

