[
https://issues.apache.org/jira/browse/LUCENE-830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12980279#action_12980279
]
Robert Muir commented on LUCENE-830:
------------------------------------
bq. The other thing to consider is we may want to allow people to separate out
boosting from length normalization and allow each to be on or off.
I think the first step is to move the norm encode/decode (float->byte) out of
Similarity (it does not belong there!)
In my opinion we should index individual statistics (boost, # of terms, etc).
Ideally how these are encoded/decoded is part of the codec.
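A minimal sketch of that separation, assuming a hypothetical stats "codec" (the
names InMemoryStatsCodec, DocStats, writeStats, and readStats are illustrative,
not the real Lucene API): the raw per-document stats are indexed as-is, and any
lossy compression would be the codec's decision, not Similarity's.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the actual Lucene API: the raw stats (boost,
// number of terms) are stored losslessly; a real codec could choose any
// on-disk encoding behind this interface.
final class DocStats {
    final float boost;
    final int numTerms;
    DocStats(float boost, int numTerms) {
        this.boost = boost;
        this.numTerms = numTerms;
    }
}

final class InMemoryStatsCodec {
    private final Map<Integer, DocStats> stats = new HashMap<>();

    void writeStats(int docID, float boost, int numTerms) {
        // No float->byte quantization here: that tradeoff belongs to
        // whoever consumes the stats (e.g. a Similarity), not the index.
        stats.put(docID, new DocStats(boost, numTerms));
    }

    DocStats readStats(int docID) {
        return stats.get(docID);
    }
}
```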
As far as how the raw stats are treated in scoring (such as whether you want to
combine #terms and boost into a single byte and put it in a huge array, or do
something else entirely), I think this belongs in Similarity. A lot of
Similarities can't use one single byte array for this, and others might not
even want to use bytes at all (this should be your choice, as you are
intentionally trading away precision for speed/RAM).
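To make that precision/RAM tradeoff concrete, here is a hypothetical 8-bit
quantizer with log-spaced levels (illustrative only; this is not Lucene's
actual float->byte norm encoding): 256 representable values, so adjacent
levels differ by a factor of 2^(1/32), giving a relative error of a few
percent in exchange for one byte per document.

```java
// Hypothetical 8-bit norm quantizer, illustrative only. Values are mapped
// to 256 log-spaced levels covering roughly [2^-4, 2^4); everything outside
// that range is clamped. Assumes norm > 0.
class NormQuantizer {
    private static final float[] DECODE = new float[256];
    static {
        for (int i = 0; i < 256; i++) {
            // Log-spaced decode table: level i represents 2^(i/32 - 4).
            DECODE[i] = (float) Math.pow(2.0, i / 32.0 - 4.0);
        }
    }

    static byte encode(float norm) {
        // Invert the decode formula, round to the nearest level, and clamp.
        int idx = (int) Math.round((Math.log(norm) / Math.log(2.0) + 4.0) * 32.0);
        return (byte) Math.max(0, Math.min(255, idx));
    }

    static float decode(byte b) {
        return DECODE[b & 0xFF];
    }
}
```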
If we shuffled things around like this, then, for example, you could have a
Similarity that uses your sparse vectors instead of a huge byte[] for
"per-document normalization", and maybe it only cares about storing, say, the
document boost there. It's too limiting that the only choice is "huge byte[] or
nothing", and if people have RAM issues (e.g. tons of fields) they are forced
to disable both boosting and length normalization entirely.
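As an illustration of the sparse alternative (names hypothetical; a real
implementation would want something more compact than a HashMap), documents
that don't contain the field cost no memory at all, instead of one byte each
in a dense byte[maxDoc]:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sparse norms holder for a field that only a few documents
// contain. Documents without an entry fall back to a default norm, so only
// documents that actually have the field consume memory.
final class SparseNorms {
    private final Map<Integer, Byte> norms = new HashMap<>();
    private final byte defaultNorm;

    SparseNorms(byte defaultNorm) {
        this.defaultNorm = defaultNorm;
    }

    void set(int docID, byte norm) {
        norms.put(docID, norm);
    }

    byte get(int docID) {
        Byte b = norms.get(docID);
        return b != null ? b : defaultNorm; // absent docs cost nothing
    }
}
```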
I'm not really able to keep track of the realtime search issues, but it seems
these things (static byte[]) are limiting there too.
By the way, what I described here is what Mike prototyped on the flexible
scoring issue. I think it was good to prototype, but I think it would be much
better to break that up into smaller, digestible issues (e.g. adding the
necessary stats to be indexed, making Similarity per-field, ...) so we
actually make progress.
> norms file can become unexpectedly enormous
> -------------------------------------------
>
> Key: LUCENE-830
> URL: https://issues.apache.org/jira/browse/LUCENE-830
> Project: Lucene - Java
> Issue Type: Bug
> Components: Index
> Affects Versions: 2.1
> Reporter: Michael McCandless
> Priority: Minor
>
> Spinoff from this user thread:
> http://www.gossamer-threads.com/lists/lucene/java-user/46754
> Norms are not stored sparsely, so even if a doc doesn't have field X
> we still use up 1 byte in the norms file (and in memory when that
> field is searched) for that segment. I think this is done for
> performance at search time?
> For indexes that have a large # of documents where each document can have
> wildly varying fields, each segment's norms use (# documents) x (# fields
> seen in that segment) bytes. When optimize merges all segments, that
> product is taken over the combined totals, so the norms file for the single
> merged segment can require far more storage than the sum of all previous
> segments' norms files.
> I think it's uncommon to have a huge number of distinct fields (?) so
> we would need a solution that doesn't hurt the more common case where
> most documents have the same fields. Maybe something analogous to how
> bitvectors are now optionally stored sparsely?
> One simple workaround is to disable norms.
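To put rough numbers on the blow-up described in the quoted issue (the
10M-document / 1,000-field figures are illustrative, not taken from the
thread): with dense norms, a single optimized segment needs one byte per
document per distinct field, whether or not the document has the field.

```java
class NormsSize {
    // Dense norms: one byte per (document, field) pair, whether or not
    // the document actually contains the field.
    static long normsBytes(long numDocs, long numDistinctFields) {
        return numDocs * numDistinctFields;
    }

    public static void main(String[] args) {
        // 10M docs x 1,000 distinct fields => 10 GB of norms after optimize,
        // even if each document only has a handful of those fields.
        System.out.println(normsBytes(10_000_000L, 1_000L)); // prints 10000000000
    }
}
```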
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]