Re: Document links

Paul Elschot Tue, 09 Nov 2010 14:24:45 -0800

On Monday 08 November 2010 20:52:53 mark harwood wrote:
> I came to the conclusion that the transient meaning of document ids is too 
> deeply ingrained in Lucene's design to use them to underpin any reliable 
> linking.
> While it might work for relatively static indexes, any index with a 
> reasonable 
> number of updates or deletes will invalidate any stored document references 
> in 
> ways which are very hard to track. Lucene's compaction shuffles IDs without 
> taking care to preserve identity, unlike graph DBs like Neo4j (see "recycling 
> IDs" here: http://goo.gl/5UbJi )


Did you try to keep the docId's as segmentId-inSegmentDocId in a tree?

Somehow I think this could work, because this would not really complicate
adding/deleting relations between docId's and the segment merges would
become massive but straightforward deletes and inserts, and with some
luck the amount of work for that would be a small proportion of the work
for a normal segment merge.

Regards,
Paul Elschot

> 
> 
> Cheers,
> Mark
> 
> 
> ----- Original Message ----
> From: Ryan McKinley <[email protected]>
> To: [email protected]
> Sent: Mon, 8 November, 2010 19:03:59
> Subject: Re: Document links
> 
> Any updates/progress with this?
> 
> I'm looking at ways to implement an RTree with lucene -- and this
> discussion seems relevant
> 
> thanks
> ryan
> 
> 
> On Sat, Sep 25, 2010 at 5:42 PM, mark harwood <[email protected]> wrote:
> >>>Both these on disk data structures and the ones in a B+ tree have seek 
> offsets
> >>>into files
> >>>that require disk seeks. And both could use document ids as key values.
> >
> > Yep. However my approach doesn't use a doc id as a key that is searched in 
> > any
> > B+ tree index (which involves disk seeks) - it is used as direct offset 
> > into a
> > file to get the pointer into a "links" data structure.
> >
> >
> >
> >>>But do these disk data structures support dynamic addition and deletion of
> >>>(larger
> >>>numbers of) document links?
> >
> > Yes, the slide deck I linked to shows how links (like documents) spend the 
> >early
> > stages of life being merged frequently in the smaller, newer segments and 
> > over
> > time migrate into larger, more stable segments as part of Lucene 
> > transactions.
> >
> > That's the theory - I'm currently benchmarking an early prototype.
> >
> >
> >
> > ----- Original Message ----
> > From: Paul Elschot <[email protected]>
> > To: [email protected]
> > Sent: Sat, 25 September, 2010 22:03:28
> > Subject: Re: Document links
> >
> > Op zaterdag 25 september 2010 15:23:39 schreef Mark Harwood:
> >> My starting point in the solution I propose was to eliminate linking via 
> >> any
> >>type of key. Key lookups mean indexes and indexes mean disk seeks. Graph
> >>traversals have exponential numbers of links and so all these index disk 
> >>seeks
> >>start to stack up. The solution I propose uses doc ids as more-or-less 
> >>direct
> >>pointers into file structures avoiding any index lookup.
> >> I've started coding up some tests using the file structures I outlined and 
> >will
> >>compare that with a traditional key-based approach.
> >
> > Both these on disk data structures and the ones in a B+ tree have seek 
> > offsets
> > into files
> > that require disk seeks. And both could use document ids as key values.
> >
> > But do these disk data structures support dynamic addition and deletion of
> > (larger
> > numbers of) document links?
> >
> > B+ trees are a standard solution for problems like this one, and it would
> > probably
> > not be easy to outperform them.
> > It may be possible to improve performance of B+ trees somewhat by 
> > specializing
> > for the fairly simple keys that would be needed, and by encoding very short
> > lists of links
> > for a single document directly into a seek offset to avoid the actual seek, 
> but
> > that's
> > about it.
> >
> > Regards,
> > Paul Elschot
> >
> >>
> >> For reference - playing the "Kevin Bacon game" on a traditional Lucene 
> >> index 
> >of
> >>IMDB data took 18 seconds to find a short path that Neo4j finds in 200
> >>milliseconds on the same data (and this was a disk based graph of 3m nodes, 
> 10m
> >>edges).
> >> Going from actor->movies->actors->movies produces a lot of key lookups and 
> the
> >>difference between key indexes and direct node pointers becomes clear.
> >> I know path finding analysis is perhaps not a typical Lucene application 
> >> but
> >>other forms of link analysis e.g. recommendation engines require similar
> >>performance.
> >>
> >> Cheers
> >> Mark
> >>
> >>
> >>
> >> On 25 Sep 2010, at 11:41, Paul Elschot wrote:
> >>
> >> > Op vrijdag 24 september 2010 17:57:45 schreef mark harwood:
> >> >>>> While not exactly equivalent, it reminds me of our earlier discussion
> >>around
> >>
> >> >>>> "layered segments" for dealing with field updates
> >> >>
> >> >> Right. Fast discovery of document relations is a foundation on which 
> >> >> lots 
> >of
> >>
> >> >> things like this can build. Relations can be given types to support a 
> >number
> >>of
> >>
> >> >> different use cases.
> >> >
> >> > How about using this (bsd licenced) tree as a starting point:
> >> > http://bplusdotnet.sourceforge.net/
> >> > It has various keys: ao. byte array, String and long.
> >> >
> >> > A fixed size byte array as key seems to be just fine: two bytes for a 
> >> > field
> >>number,
> >> > four for the segment number and four for the in-segment document id.
> >> > The separate segment number would allow to minimize the updates
> >> > in the tree during merges. One could also use the normal doc id directly.
> >> >
> >> > The value could then be a similar to the key, but without
> >> > the field number, and with an indication of the direction of the link.
> >> > Or perhaps the direction of the link should be added to the key.
> >> > A link would be present twice, once for each direction.
> >> > Also both directions could have their own payloads.
> >> >
> >> > It could be put in its own file as a separate 'segment', or maybe
> >> > each segment could allow for allocation of a part of this tree.
> >> >
> >> > I like this somehow, in case it is done right one might never
> >> > need a relational database again. Well, almost...
> >> >
> >> > Regards,
> >> > Paul Elschot
> >> >
> >> >
> >> >>
> >> >>
> >> >>
> >> >> ----- Original Message ----
> >> >> From: Grant Ingersoll <[email protected]>
> >> >> To: [email protected]
> >> >> Sent: Fri, 24 September, 2010 16:26:27
> >> >> Subject: Re: Document links
> >> >>
> >> >> While not exactly equivalent, it reminds me of our earlier discussion 
> >around
> >>
> >> >> "layered segments" for dealing with field updates [1], [2], albeit this 
> >> >> is 
> >a
> >>bit
> >>
> >> >> more generic since one could not only use the links for relating 
> documents,
> >>but
> >>
> >> >> one could use "special" links underneath the covers in Lucene to
> >>maintain/mark
> >>
> >> >> which fields have been updated and then traverse to them.
> >> >>
> >> >> [1]
> >> >>
> >>http://www.lucidimagination.com/search/document/c871ea4672dda844/aw_incremental_field_updates#7ef11a70cdc95384
> >>
> >>
> >> >>
> >> >> [2]
> >> >>
> >>http://www.lucidimagination.com/search/document/ee102692c8023548/incremental_field_updates#13ffdd50440cce6e
> >>
> >>
> >> >>
> >> >>
> >> >> On Sep 24, 2010, at 10:36 AM, mark harwood wrote:
> >> >>
> >> >>> This slideshow has a first-cut on the Lucene file format extensions
> >>required to
> >>
> >> >>>
> >> >>> support fast linking between documents:
> >> >>>
> >> >>> http://www.slideshare.net/MarkHarwood/linking-lucene-documents
> >> >>>
> >> >>>
> >> >>> Interested in any of your thoughts.
> >> >>>
> >> >>> Cheers,
> >> >>> Mark
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>> ---------------------------------------------------------------------
> >> >>> To unsubscribe, e-mail: [email protected]
> >> >>> For additional commands, e-mail: [email protected]
> >> >>>
> >> >>
> >> >> --------------------------
> >> >> Grant Ingersoll
> >> >> http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 
> >> >> 7-8
> >> >>
> >> >>
> >> >> ---------------------------------------------------------------------
> >> >> To unsubscribe, e-mail: [email protected]
> >> >> For additional commands, e-mail: [email protected]
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> ---------------------------------------------------------------------
> >> >> To unsubscribe, e-mail: [email protected]
> >> >> For additional commands, e-mail: [email protected]
> >> >>
> >> >>
> >> >>
> >> >
> >> > ---------------------------------------------------------------------
> >> > To unsubscribe, e-mail: [email protected]
> >> > For additional commands, e-mail: [email protected]
> >> >
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [email protected]
> >> For additional commands, e-mail: [email protected]
> >>
> >>
> >>
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [email protected]
> > For additional commands, e-mail: [email protected]
> >
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [email protected]
> > For additional commands, e-mail: [email protected]
> >
> >
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
> 
> 
>       
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
> 
> 
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: Document links

Reply via email to