[jira] [Commented] (LUCENE-4547) DocValues field broken on large indexes

Robert Muir (JIRA) Thu, 15 Nov 2012 13:35:14 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-4547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13498359#comment-13498359
 ]


Robert Muir commented on LUCENE-4547:
-------------------------------------

These are hard questions. My personal goals for this prototype (currently 
SimpleText only!) were to:

1. Make merging use (significantly) less RAM, to fix this bug.
2. Make it easier to write docvalues codecs, to encourage innovation (e.g. FST 
impls, etc.).
3. Simplify the types to make it easier on the user.

I think the consumer API is simpler (part of #2), but I would like to (in the 
future) simplify the producer API too.
I'm not sure we should do that here, though. Anyway, we can think about the 
issues you raised one by one and handle them separately in their own issues.

{quote}
fix other issues such as LUCENE-3862?
{quote}

It's my opinion that we should do this sooner rather than later.

{quote}
merge the FieldCache / FunctionValues / DocValues.Source APIs?
{quote}

This really needs to be addressed, but I think not here. It's horrific that 
algorithms like grouping, sorting, and maybe faceting have to be duplicated for 
two different things (FieldCache and docvalues).

{quote}
are you going to remove DocValues.Type.FLOAT_*?
{quote}

I think the 3 types we have here are enough. Someone can build a float or 
double type "on top of" the "number" type we have.
Lucene is already doing this today: look at norms. I think Lucene should just 
have a number type that stores bits.
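
A minimal sketch of the "float or double on top of a number type" idea, 
assuming a field that simply stores an opaque 64-bit long per document. The 
class and method names below are hypothetical and not part of Lucene; only the 
java.lang.Double and java.lang.Float bit-conversion calls are real.

```java
public class NumberBitsSketch {

    // Encode a double as the raw 64 bits that a long-valued field would store.
    public static long encodeDouble(double value) {
        return Double.doubleToRawLongBits(value);
    }

    // Decode the stored bits back into the original double.
    public static double decodeDouble(long bits) {
        return Double.longBitsToDouble(bits);
    }

    // Encode a float into the low 32 bits of a long. The mask avoids sign
    // extension, so the stored number is always in [0, 2^32).
    public static long encodeFloat(float value) {
        return Float.floatToRawIntBits(value) & 0xFFFFFFFFL;
    }

    // Decode the low 32 bits back into the original float.
    public static float decodeFloat(long bits) {
        return Float.intBitsToFloat((int) bits);
    }

    public static void main(String[] args) {
        double d = 3.25;
        float f = -1.5f;
        System.out.println("double round trip ok: " + (decodeDouble(encodeDouble(d)) == d));
        System.out.println("float round trip ok:  " + (decodeFloat(encodeFloat(f)) == f));
    }
}
```

With this kind of layering the storage layer only ever sees numbers; the 
interpretation as float or double stays in whatever code reads the field.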

{quote}
are SimpleDVConsumer and SimpleDocValuesFormat going to replace PerDocConsumer 
and DocValuesFormat?
{quote}

That is the idea: once we are happy with the APIs, we would implement the 4.0 
ones with these APIs.

{quote}
are you going to remove hasArray/getArray?
{quote}

I don't care much about this. I am not sure similarity impls should be calling 
it, though; at the very least it would be better for them to fall back. I just 
can't bring myself to fix it until LUCENE-3862 is fixed :)

{quote}
will there still be a direct=true|false option at load-time or will it depend 
on the format impl (potentially with a PerFieldPerDocProducer similarly to the 
postings formats)?
{quote}

I don't want to change this in the branch. Personally, I feel that a 
codec/SegmentReader/etc. should generally only manage the direct source, with 
the producer exposing the same "stats" (minimum, maximum, fixed, whatever) 
that the consumer APIs get (which will also make merging more efficient!). 
The default source impl can be something nice: read the direct impl into 
packed ints, and so on. A codec could override this to, e.g., just slurp in 
its on-disk packed ints directly. So the codec still has control of the 
in-memory representation; I think this is important. But I think the codec 
and SegmentReader should somehow not be in control of caching: this should 
live elsewhere (FieldCache.DOCVALUES.xxx????)...
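
As a rough illustration of why exposing min/max "stats" helps, here is a 
sketch of a default in-memory source that packs each value as a delta from 
the minimum. Everything in it is hypothetical (it is not Lucene's PackedInts 
or any real codec API), and it assumes max - min does not overflow a long.

```java
public class PackedFromStatsSketch {

    // Bits needed to represent any delta in [0, max - min].
    public static int bitsRequired(long min, long max) {
        long range = max - min; // assumption: no overflow
        return range == 0 ? 0 : 64 - Long.numberOfLeadingZeros(range);
    }

    // Pack (value - min) for every value into consecutive bit slots.
    public static long[] pack(long[] values, long min, int bitsPerValue) {
        if (bitsPerValue == 0) {
            return new long[0]; // all values equal min, nothing to store
        }
        long totalBits = (long) values.length * bitsPerValue;
        long[] blocks = new long[(int) ((totalBits + 63) >>> 6)];
        for (int i = 0; i < values.length; i++) {
            long delta = values[i] - min;
            long bitPos = (long) i * bitsPerValue; // long math avoids int overflow
            int block = (int) (bitPos >>> 6);
            int offset = (int) (bitPos & 63);
            blocks[block] |= delta << offset;
            if (offset + bitsPerValue > 64) {
                // the high bits of this value spill into the next block
                blocks[block + 1] |= delta >>> (64 - offset);
            }
        }
        return blocks;
    }

    // Read the value at the given index back out of the packed blocks.
    public static long get(long[] blocks, long min, int bitsPerValue, int index) {
        if (bitsPerValue == 0) {
            return min;
        }
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int offset = (int) (bitPos & 63);
        long delta = blocks[block] >>> offset;
        if (offset + bitsPerValue > 64) {
            delta |= blocks[block + 1] << (64 - offset);
        }
        long mask = bitsPerValue == 64 ? -1L : (1L << bitsPerValue) - 1;
        return min + (delta & mask);
    }

    public static void main(String[] args) {
        long[] values = {0L, 1L << 33, 1L << 33, 5L};
        int bits = bitsRequired(0L, 1L << 33);
        long[] packed = pack(values, 0L, bits);
        for (int i = 0; i < values.length; i++) {
            System.out.println(values[i] + " -> " + get(packed, 0L, bits, i));
        }
    }
}
```

Because the stats are known up front, the reader can size the packed array 
once instead of buffering raw values first, which is the same property that 
would let a merge avoid holding everything in RAM.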

                
> DocValues field broken on large indexes
> ---------------------------------------
>
>                 Key: LUCENE-4547
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4547
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Robert Muir
>            Priority: Blocker
>             Fix For: 4.1
>
>         Attachments: test.patch
>
>
> I tried to write a test to sanity check LUCENE-4536 (first running against 
> svn revision 1406416, before the change).
> But I found that docvalues is already broken here for large indexes that 
> have a PackedLongDocValues field:
> {code}
> final int numDocs = 500000000;
> for (int i = 0; i < numDocs; ++i) {
>   if (i == 0) {
>     field.setLongValue(0L); // force > 32bit deltas
>   } else {
>     field.setLongValue(1L << 33); // long shift; an int 1 shifted by 33 yields 2
>   }
>   w.addDocument(doc);
> }
> w.forceMerge(1);
> w.close();
> dir.close(); // checkindex
> {code}
> {noformat}
> [junit4:junit4]   2> WARNING: Uncaught exception in thread: Thread[Lucene 
> Merge Thread #0,6,TGRP-Test2GBDocValues]
> [junit4:junit4]   2> org.apache.lucene.index.MergePolicy$MergeException: 
> java.lang.ArrayIndexOutOfBoundsException: -65536
> [junit4:junit4]   2>  at 
> __randomizedtesting.SeedInfo.seed([5DC54DB14FA5979]:0)
> [junit4:junit4]   2>  at 
> org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:535)
> [junit4:junit4]   2>  at 
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:508)
> [junit4:junit4]   2> Caused by: java.lang.ArrayIndexOutOfBoundsException: 
> -65536
> [junit4:junit4]   2>  at 
> org.apache.lucene.util.ByteBlockPool.deref(ByteBlockPool.java:305)
> [junit4:junit4]   2>  at 
> org.apache.lucene.codecs.lucene40.values.FixedStraightBytesImpl$FixedBytesWriterBase.set(FixedStraightBytesImpl.java:115)
> [junit4:junit4]   2>  at 
> org.apache.lucene.codecs.lucene40.values.PackedIntValues$PackedIntsWriter.writePackedInts(PackedIntValues.java:109)
> [junit4:junit4]   2>  at 
> org.apache.lucene.codecs.lucene40.values.PackedIntValues$PackedIntsWriter.finish(PackedIntValues.java:80)
> [junit4:junit4]   2>  at 
> org.apache.lucene.codecs.DocValuesConsumer.merge(DocValuesConsumer.java:130)
> [junit4:junit4]   2>  at 
> org.apache.lucene.codecs.PerDocConsumer.merge(PerDocConsumer.java:65)
> {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

