Re: Parallel incremental indexing

Michael Busch Mon, 31 Aug 2009 00:23:15 -0700

On Sun, Aug 30, 2009 at 6:08 AM, Yonik Seeley <[email protected]>wrote:


> Cool stuff!
>
>
Thanks. It's actually really fun to work on! Once I had the parallel
indexing working and no longer had to worry about how to manage
parallel indexes, the fun of implementing cool features on top of it
started. I hope you'll have that fun in Solr too! :)


> We should also think about how to do single document field updates or
> field adds since that is the most common use case - not that it needs
>

I completely agree that we should solve that problem too.


> to be implemented in the first version, but kept in mind so we don't
> box ourselves in.
>

This code is currently non-intrusive from Lucene's point of view (it can't
be intrusive, because I use it on top of vanilla 2.4.1). But I agree: when we
integrate it more tightly into Lucene to make it more efficient, we should
keep the end goal in mind (e.g. the use cases you mentioned).


>
> Doug mentioned some ideas he had in passing almost a year ago about
> how to add a field to a single document, and it is similar in that it
> used parallel reader.  IndexWriter would be modified to maintain the
> same structure across parallel indexes, as you note.  If one wanted to
> add a new field value to document 1000, one would have to index dummy
> documents for docs 0-999... instead of this, the index format should
> support gaps.  On a segment merge, the IndexWriter could simply merge
> in this new segment.
>
>
Yeah, currently it's kind of inefficient that we have to call addDocument()
999 times with an empty document to achieve this. The .frq and .prx files,
however, cope well with gaps because they use delta encoding. Also, .del files
support DGaps now. On the other hand, the stored fields index (.fdx) in
particular doesn't support gaps, because it has to allow random access. Norm
files and term vectors (though both can be turned off) don't support gaps either.
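
To make the difference concrete, here is a rough sketch in plain Java (no
Lucene classes involved). It compares the size of a delta-encoded posting
list with that of a random-access table that needs one entry per document.
The class and method names are made up for illustration, the 8-byte pointer
per document is an assumption about the .fdx layout, and the real .frq
encoding also folds the term frequency into the delta, which is left out here.

```java
import java.io.ByteArrayOutputStream;

public class GapEncodingSketch {

    // Variable-length int: 7 payload bits per byte, low-order bits first,
    // high bit set while more bytes follow.
    public static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Bytes needed to store a sorted doc ID list as VInt deltas.
    // A gap between doc IDs only makes one delta larger; it adds no entries.
    public static int deltaEncodedSize(int[] sortedDocIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int docId : sortedDocIds) {
            writeVInt(out, docId - previous);
            previous = docId;
        }
        return out.size();
    }

    // Bytes needed for a random-access table with one fixed-width pointer
    // per document (assumed to be 8 bytes here). Every doc ID below maxDoc
    // needs a slot, even if the document has no data in this index.
    public static long fixedWidthIndexSize(int maxDoc) {
        return 8L * maxDoc;
    }

    public static void main(String[] args) {
        // A single posting for doc 1000 in an otherwise empty parallel index.
        int[] postings = {1000};
        System.out.println("delta-encoded postings: "
                + deltaEncodedSize(postings) + " bytes");
        // The table has to cover docs 0..1000, i.e. 1001 slots.
        System.out.println("fixed-width index: "
                + fixedWidthIndexSize(1001) + " bytes");
    }
}
```

Under these assumptions the posting for doc 1000 costs the same few bytes
whether or not docs 0-999 exist, while the fixed-width table grows with
maxDoc. That is why the dummy documents hurt mostly in the random-access
files.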



> Anyway, updateable documents is fundamental enough, we should also
> consider changes to the index format if it makes it easier.
>
>
Yes, I agree. We should make changes to the default index format if that
makes updating documents more efficient. Note that I said "default index
format" :) - I'm already excited about having parallel indexing and flexible
indexing in Lucene. It will be awesome what you can do with Lucene then!

So I think we should start with the work necessary to keep parallel indexes
in sync. When that's done, we should continue with the use cases we discussed,
including the work of changing the index format to support gaps.




> -Yonik
> http://www.lucidimagination.com
>
>
> On Sun, Aug 30, 2009 at 2:23 AM, Michael Busch<[email protected]> wrote:
> > Hi all,
> >
> > I just added a wiki page for a new feature I'd like to add to
> > Lucene. Please take a look at the link. I will add more details and
> > diagrams to the page, but for now it should give a rough idea about
> > how to implement it:
> >
> > http://wiki.apache.org/lucene-java/ParallelIncrementalIndexing
> >
> > Basically the idea is to allow updating documents partially, e.g. only
> > a subset of the fields without having to reindex the entire
> > document. This is a feature that is very often asked for.
> >
> > We have implemented the solution in IBM and it's working
> > great. It is a technology that allowed us already to add really exciting
> > new features to products that weren't easily possible before.
> >
> > The implementation I can currently contribute has some limitations:
> > e.g. multi-threaded indexing is not supported. But let me make clear
> > that this is not a limitation of the design described in the wiki - we
> > have these limitations because we implemented this on top of Lucene's 2.4
> > APIs. If we decide to add this to Lucene's core we should
> > reimplement some parts to overcome those limitations.
> >
> > In my opinion this will be a great addition to Lucene that many
> > people will find very useful. In Solr this is also something users often
> > ask for.
> >
> > In the last weeks I worked on getting internal approval for the
> > contribution to Lucene and the good news is that I already have a signed
> > software grant ready - so if the community likes this feature and
> > decides to add this to Lucene there won't be any delay for legal work
> > from IBM's side.
> >
> > Btw: I will be on vacation from 09/03-09/20 and won't have internet
> > access most of the time, so if I stop responding end of next week you'll
> > know why...
> >
> > Please let me know what you think!
> >
> >  Michael
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [email protected]
> > For additional commands, e-mail: [email protected]
> >
> >
