On Mon, Sep 16, 2013 at 5:25 PM, Emmanuel Lécharny <[email protected]>wrote:
> Let's get back to what we need, from the BTree POV... > > We have a BTree storing Values associated with Keys. A given Key can be > assoicated to one or many values, but most of the time, we will have > only one value, and if we have many values, we will not have that many. > But occasionally, we may have thousands of values associated with a > single key. > > What happens if we have thousands of values for a given key ? It's > become very costly to manipulate the values, and as we have to order the > values to be able to find out if a value is already present in a decent > delay (ie, O(log(n) if the values are ordered, compared to O(n) if they > are not). For this very reason, we would like to use a decent way to > store the values, and this is why we do use a sub-btree (BTree<V, V>). > > In ApacheDS and JDBM, we have a threshold number which is used to know > when to create a sub-btree in this case : for instance, if we have more > than 500 values, we will use a sub-BTree, otherwise we use an array of > values. The reason is that up to a point, it's faster to manipulate an > array in memory than a BTree (the tiping poing though is not obvious : > depending on the value's size, this threshold number will vary). > > So we want to store : > - a single value most of the case > - an array of values when we have more than 1 value, but below N values > (N to be evaulated) > - a sub-BTree when we have more than N values. > > Now, we can simplify this scheme in two ways : > - we can ditch the single value, and use an array instead, and switch to > a sub-BTree at some point > - we can decide not to use an array at all, and use a sub-BTree when we > ave more than one value > - we can also always use a sub-btree when we have an index allowing > muliple values. > > So far, we decided to not go with option 3, as it puts a huge strain on > the BTree performances (the very first version was doing exactly taht, > and using something different improved the performances by 15%). > > ATM, we are experimenting option 2, but we may later go for option 1. > > That beng said, the BTree operations which manipulate the values are the > following : > > - getValue(K) > - getValues(K) > - insert(K, V) > - contains(K, V) > - delete(K, V) > > As we can see, we have two methods which return values, one whichs > return one single value, and the other which returns all the values. The > question here is why do we have two methods? That forces the users to > know when to use one or the other, ie the users must know if the BTree > allows multiple values. > > The suggestion would be to make the getValues() method to return an > iterator. That's fine, but what about the semantic of the getValue() > method, when we have more than one value stored ? Should it returns the > very first value ? > yeap, this is what it does now, the first value will be returned if there are more than one value > > IMO, I think we should have the getValues() to return an iterator, and > the getValue() method to returns either the single value or the first > value if there are more than one. > > I still prefer to get a BTree(that gives us better control if we want to search for a value in BTree) otherwise why not return a cursor? (personally, the last preference) > > wdyt ? > > > Le 9/16/13 12:10 PM, Pierre-Arnaud Marcelot a écrit : > > I would go with a simple Iterator. > > > > Regards, > > Pierre-Arnaud > > > > On 16 sept. 2013, at 11:51, Emmanuel Lécharny <[email protected]> > wrote: > > > >> Le 9/16/13 11:41 AM, Kiran Ayyagari a écrit : > >>> On Mon, Sep 16, 2013 at 3:00 PM, Emmanuel Lécharny < > [email protected]>wrote: > >>> > >>>> Le 9/14/13 9:45 AM, Kiran Ayyagari a écrit : > >>>>> On Sat, Sep 14, 2013 at 1:00 PM, Emmanuel Lécharny < > [email protected] > >>>>> wrote: > >>>>> > >>>>>> Hi guys, > >>>>>> > >>>>>> > >>>>>> /** > >>>>>> * @return The array of stored values. > >>>>>> */ > >>>>>> V[] getValues(); > >>>>>> > >>>>>> shouldn't this be returning a BTree<V,V>? cause we don't support an > >>>> array > >>>>> as a holder of > >>>>> multiple values > >>>> The fact that it's stored as a BTree is implementation dependent, I'm > >>>> not sure the user of the API should know about it. What the user wants > >>>> is to get back the stored values, and this is convenient to get back > an > >>>> array. > >>>> > >>>> IMO, it is not convenient, we cannot copy all the values into an array > >>> and return, and returning an iterator is not going to help either, > both will > >>> severely affect the search performance > >>> If needed, we can wrap it an immutable structure and return the BTree > >>> to prevent direct updates, but I wish to differ this until we see the > impact > >>> on partition implementation. > >> Except that from the BTree interface, the one the users see, you have : > >> > >> public V get( K key ) throws IOException, KeyNotFoundException > >> public DuplicateKeyVal<V> getValues( K key ) throws IOException, > >> KeyNotFoundException > >> > >> so the users has no clue about the underlying data structure. > >> > >> Now, I'm not sure if it makes any sense to expose a getValues() method > >> returning an array. May be an iterator is enough.. > >> > >> -- > >> Regards, > >> Cordialement, > >> Emmanuel Lécharny > >> www.iktek.com > >> > > > -- > Regards, > Cordialement, > Emmanuel Lécharny > www.iktek.com > > -- Kiran Ayyagari http://keydap.com
