[jira] [Commented] (LUCENE-4547) DocValues field broken on large indexes

Michael McCandless (JIRA) Tue, 20 Nov 2012 10:44:59 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-4547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13501370#comment-13501370 ]


Michael McCandless commented on LUCENE-4547:
--------------------------------------------

{quote}
bq. I think letting the codec control in-RAM vs on-disk is a great idea!
actually that is not what I was saying and I strongly discourage that we 
require people to make ram vs. on disk decisions ahead of time.
{quote}

I think this is actually a clean way to do it, and it matches what we
do with other codec parts.  Eg with postings you pick MemoryPF if you
have the free RAM and want fast lookups for that field, else you pick
an on-disk postings format.

bq. Most of those decisions need to be made dynamically based on ram 
availability and growth.

I think making dynamic decisions based on ram availability and growth
is a more expert use case; eg in Lucene today we don't give you that:
Deleted docs, norms, field cache entries, doc values (if you sort by
them), terms index are all loaded into RAM.  So the only control users
have now is which fields they index/sort on...

If we give control to the codec over whether the DV format is in RAM
or on disk or something in between (like the terms index), and we make
a PerFieldDVFormat so you can easily switch impls by field, then users
can make the decisions themselves, field by field.

If a given field will be used for sorting or faceting, they can use
the fast RAM-based format, but if they are tight on RAM and have lots
of scoring factors, maybe they use the disk-based impl for those fields.
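
To make the field-by-field choice concrete, here is a minimal standalone
sketch. It does not use the Lucene API: Storage, PerFieldChooser and the
field names are all hypothetical, and a real PerFieldDVFormat would
presumably return a DV format impl rather than an enum.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only: none of these types are Lucene classes.
public class PerFieldChoiceSketch {

    /** Where a field's per-document values would live at search time. */
    enum Storage { IN_RAM, ON_DISK }

    /** Maps field names to a storage choice, with a default for unlisted fields. */
    static final class PerFieldChooser {
        private final Map<String, Storage> byField = new HashMap<>();
        private final Storage fallback;

        PerFieldChooser(Storage fallback) {
            this.fallback = fallback;
        }

        /** Records the choice for one field; returns this for chaining. */
        PerFieldChooser set(String field, Storage storage) {
            byField.put(field, storage);
            return this;
        }

        /** Returns the configured choice, or the fallback if none was set. */
        Storage storageFor(String field) {
            return byField.getOrDefault(field, fallback);
        }
    }

    public static void main(String[] args) {
        // Tight on RAM: default to disk, keep only sort/facet fields in RAM.
        PerFieldChooser chooser = new PerFieldChooser(Storage.ON_DISK)
            .set("price", Storage.IN_RAM)      // used for sorting
            .set("category", Storage.IN_RAM);  // used for faceting

        for (String field : new String[] {"price", "category", "boost1"}) {
            System.out.println(field + " -> " + chooser.storageFor(field));
        }
    }
}
```

The point of the design is that the decision is made once, per field, by
whoever configures the index, and everything else just asks the chooser.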

If an expert app really needs to pick & choose RAM vs disk dynamically,
depending on how many other indices are open and how much RAM they are
using, etc., they can always make a custom DV format ...

                
> DocValues field broken on large indexes
> ---------------------------------------
>
>                 Key: LUCENE-4547
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4547
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Robert Muir
>            Priority: Blocker
>             Fix For: 4.1
>
>         Attachments: test.patch
>
>
> I tried to write a test to sanity check LUCENE-4536 (first running against 
> svn revision 1406416, before the change).
> But i found docvalues is already broken here for large indexes that have a 
> PackedLongDocValues field:
> {code}
> final int numDocs = 500000000;
> for (int i = 0; i < numDocs; ++i) {
>   if (i == 0) {
>     field.setLongValue(0L); // force > 32bit deltas
>   } else {
>     field.setLongValue(1<<33L); 
>   }
>   w.addDocument(doc);
> }
> w.forceMerge(1);
> w.close();
> dir.close(); // checkindex
> {code}
> {noformat}
> [junit4:junit4]   2> WARNING: Uncaught exception in thread: Thread[Lucene 
> Merge Thread #0,6,TGRP-Test2GBDocValues]
> [junit4:junit4]   2> org.apache.lucene.index.MergePolicy$MergeException: 
> java.lang.ArrayIndexOutOfBoundsException: -65536
> [junit4:junit4]   2>  at 
> __randomizedtesting.SeedInfo.seed([5DC54DB14FA5979]:0)
> [junit4:junit4]   2>  at 
> org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:535)
> [junit4:junit4]   2>  at 
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:508)
> [junit4:junit4]   2> Caused by: java.lang.ArrayIndexOutOfBoundsException: 
> -65536
> [junit4:junit4]   2>  at 
> org.apache.lucene.util.ByteBlockPool.deref(ByteBlockPool.java:305)
> [junit4:junit4]   2>  at 
> org.apache.lucene.codecs.lucene40.values.FixedStraightBytesImpl$FixedBytesWriterBase.set(FixedStraightBytesImpl.java:115)
> [junit4:junit4]   2>  at 
> org.apache.lucene.codecs.lucene40.values.PackedIntValues$PackedIntsWriter.writePackedInts(PackedIntValues.java:109)
> [junit4:junit4]   2>  at 
> org.apache.lucene.codecs.lucene40.values.PackedIntValues$PackedIntsWriter.finish(PackedIntValues.java:80)
> [junit4:junit4]   2>  at 
> org.apache.lucene.codecs.DocValuesConsumer.merge(DocValuesConsumer.java:130)
> [junit4:junit4]   2>  at 
> org.apache.lucene.codecs.PerDocConsumer.merge(PerDocConsumer.java:65)
> {noformat}
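
A side note on the quoted reproduction, offered as an observation about
Java rather than about the bug itself: in 1<<33L the left operand is an
int, so Java performs an int shift and uses only the low five bits of
the distance. The value stored is therefore 2, not 2^33. The small
standalone class below (class and method names are mine, purely
illustrative) shows the difference. Whether the reported failure depends
on the value width is not something this sketch tests.

```java
// Illustrative only: demonstrates Java shift semantics, not Lucene behavior.
public class ShiftWidthDemo {

    /** The expression as written in the quoted test: an int shift. */
    static long intShift() {
        // Left operand is int, so the distance is 33 & 31 = 1.
        return 1 << 33L;
    }

    /** The same shift with a long left operand: the distance is 33. */
    static long longShift() {
        return 1L << 33;
    }

    public static void main(String[] args) {
        System.out.println(intShift());   // prints 2
        System.out.println(longShift());  // prints 8589934592
    }
}
```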

