On Monday 08 November 2010 20:52:53 mark harwood wrote: > I came to the conclusion that the transient meaning of document ids is too > deeply ingrained in Lucene's design to use them to underpin any reliable > linking. > While it might work for relatively static indexes, any index with a > reasonable > number of updates or deletes will invalidate any stored document references > in > ways which are very hard to track. Lucene's compaction shuffles IDs without > taking care to preserve identity, unlike graph DBs like Neo4j (see "recycling > IDs" here: http://goo.gl/5UbJi )
Did you try to keep the docId's as segmentId-inSegmentDocId in a tree? Somehow I think this could work, because this would not really complicate adding/deleting relations between docId's and the segment merges would become massive but straightforward deletes and inserts, and with some luck the amount of work for that would be a small proportion of the work for a normal segment merge. Regards, Paul Elschot > > > Cheers, > Mark > > > ----- Original Message ---- > From: Ryan McKinley <[email protected]> > To: [email protected] > Sent: Mon, 8 November, 2010 19:03:59 > Subject: Re: Document links > > Any updates/progress with this? > > I'm looking at ways to implement an RTree with lucene -- and this > discussion seems relevant > > thanks > ryan > > > On Sat, Sep 25, 2010 at 5:42 PM, mark harwood <[email protected]> wrote: > >>>Both these on disk data structures and the ones in a B+ tree have seek > offsets > >>>into files > >>>that require disk seeks. And both could use document ids as key values. > > > > Yep. However my approach doesn't use a doc id as a key that is searched in > > any > > B+ tree index (which involves disk seeks) - it is used as direct offset > > into a > > file to get the pointer into a "links" data structure. > > > > > > > >>>But do these disk data structures support dynamic addition and deletion of > >>>(larger > >>>numbers of) document links? > > > > Yes, the slide deck I linked to shows how links (like documents) spend the > >early > > stages of life being merged frequently in the smaller, newer segments and > > over > > time migrate into larger, more stable segments as part of Lucene > > transactions. > > > > That's the theory - I'm currently benchmarking an early prototype. > > > > > > > > ----- Original Message ---- > > From: Paul Elschot <[email protected]> > > To: [email protected] > > Sent: Sat, 25 September, 2010 22:03:28 > > Subject: Re: Document links > > > > Op zaterdag 25 september 2010 15:23:39 schreef Mark Harwood: > >> My starting point in the solution I propose was to eliminate linking via > >> any > >>type of key. Key lookups mean indexes and indexes mean disk seeks. Graph > >>traversals have exponential numbers of links and so all these index disk > >>seeks > >>start to stack up. The solution I propose uses doc ids as more-or-less > >>direct > >>pointers into file structures avoiding any index lookup. > >> I've started coding up some tests using the file structures I outlined and > >will > >>compare that with a traditional key-based approach. > > > > Both these on disk data structures and the ones in a B+ tree have seek > > offsets > > into files > > that require disk seeks. And both could use document ids as key values. > > > > But do these disk data structures support dynamic addition and deletion of > > (larger > > numbers of) document links? > > > > B+ trees are a standard solution for problems like this one, and it would > > probably > > not be easy to outperform them. > > It may be possible to improve performance of B+ trees somewhat by > > specializing > > for the fairly simple keys that would be needed, and by encoding very short > > lists of links > > for a single document directly into a seek offset to avoid the actual seek, > but > > that's > > about it. > > > > Regards, > > Paul Elschot > > > >> > >> For reference - playing the "Kevin Bacon game" on a traditional Lucene > >> index > >of > >>IMDB data took 18 seconds to find a short path that Neo4j finds in 200 > >>milliseconds on the same data (and this was a disk based graph of 3m nodes, > 10m > >>edges). > >> Going from actor->movies->actors->movies produces a lot of key lookups and > the > >>difference between key indexes and direct node pointers becomes clear. > >> I know path finding analysis is perhaps not a typical Lucene application > >> but > >>other forms of link analysis e.g. recommendation engines require similar > >>performance. > >> > >> Cheers > >> Mark > >> > >> > >> > >> On 25 Sep 2010, at 11:41, Paul Elschot wrote: > >> > >> > Op vrijdag 24 september 2010 17:57:45 schreef mark harwood: > >> >>>> While not exactly equivalent, it reminds me of our earlier discussion > >>around > >> > >> >>>> "layered segments" for dealing with field updates > >> >> > >> >> Right. Fast discovery of document relations is a foundation on which > >> >> lots > >of > >> > >> >> things like this can build. Relations can be given types to support a > >number > >>of > >> > >> >> different use cases. > >> > > >> > How about using this (bsd licenced) tree as a starting point: > >> > http://bplusdotnet.sourceforge.net/ > >> > It has various keys: ao. byte array, String and long. > >> > > >> > A fixed size byte array as key seems to be just fine: two bytes for a > >> > field > >>number, > >> > four for the segment number and four for the in-segment document id. > >> > The separate segment number would allow to minimize the updates > >> > in the tree during merges. One could also use the normal doc id directly. > >> > > >> > The value could then be a similar to the key, but without > >> > the field number, and with an indication of the direction of the link. > >> > Or perhaps the direction of the link should be added to the key. > >> > A link would be present twice, once for each direction. > >> > Also both directions could have their own payloads. > >> > > >> > It could be put in its own file as a separate 'segment', or maybe > >> > each segment could allow for allocation of a part of this tree. > >> > > >> > I like this somehow, in case it is done right one might never > >> > need a relational database again. Well, almost... > >> > > >> > Regards, > >> > Paul Elschot > >> > > >> > > >> >> > >> >> > >> >> > >> >> ----- Original Message ---- > >> >> From: Grant Ingersoll <[email protected]> > >> >> To: [email protected] > >> >> Sent: Fri, 24 September, 2010 16:26:27 > >> >> Subject: Re: Document links > >> >> > >> >> While not exactly equivalent, it reminds me of our earlier discussion > >around > >> > >> >> "layered segments" for dealing with field updates [1], [2], albeit this > >> >> is > >a > >>bit > >> > >> >> more generic since one could not only use the links for relating > documents, > >>but > >> > >> >> one could use "special" links underneath the covers in Lucene to > >>maintain/mark > >> > >> >> which fields have been updated and then traverse to them. > >> >> > >> >> [1] > >> >> > >>http://www.lucidimagination.com/search/document/c871ea4672dda844/aw_incremental_field_updates#7ef11a70cdc95384 > >> > >> > >> >> > >> >> [2] > >> >> > >>http://www.lucidimagination.com/search/document/ee102692c8023548/incremental_field_updates#13ffdd50440cce6e > >> > >> > >> >> > >> >> > >> >> On Sep 24, 2010, at 10:36 AM, mark harwood wrote: > >> >> > >> >>> This slideshow has a first-cut on the Lucene file format extensions > >>required to > >> > >> >>> > >> >>> support fast linking between documents: > >> >>> > >> >>> http://www.slideshare.net/MarkHarwood/linking-lucene-documents > >> >>> > >> >>> > >> >>> Interested in any of your thoughts. > >> >>> > >> >>> Cheers, > >> >>> Mark > >> >>> > >> >>> > >> >>> > >> >>> > >> >>> > >> >>> --------------------------------------------------------------------- > >> >>> To unsubscribe, e-mail: [email protected] > >> >>> For additional commands, e-mail: [email protected] > >> >>> > >> >> > >> >> -------------------------- > >> >> Grant Ingersoll > >> >> http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct > >> >> 7-8 > >> >> > >> >> > >> >> --------------------------------------------------------------------- > >> >> To unsubscribe, e-mail: [email protected] > >> >> For additional commands, e-mail: [email protected] > >> >> > >> >> > >> >> > >> >> > >> >> --------------------------------------------------------------------- > >> >> To unsubscribe, e-mail: [email protected] > >> >> For additional commands, e-mail: [email protected] > >> >> > >> >> > >> >> > >> > > >> > --------------------------------------------------------------------- > >> > To unsubscribe, e-mail: [email protected] > >> > For additional commands, e-mail: [email protected] > >> > > >> > >> > >> --------------------------------------------------------------------- > >> To unsubscribe, e-mail: [email protected] > >> For additional commands, e-mail: [email protected] > >> > >> > >> > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: [email protected] > > For additional commands, e-mail: [email protected] > > > > > > > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: [email protected] > > For additional commands, e-mail: [email protected] > > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [email protected] > For additional commands, e-mail: [email protected] > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [email protected] > For additional commands, e-mail: [email protected] > > > --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
