You can use MetaModelHelper for the static method, but if you feel a separate class is better (for example if you want to keep a bit of temporary state), then I am fine with that as well. As far as I know it is not guaranteed that there will only be two datasets in the Cartesian join.
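If the method does end up handling more than two datasets, one way to stay general is to fold the inputs pairwise. Purely as a sketch (JoinSketch and foldProduct are made-up names; only MetaModelHelper.getCarthesianProduct is existing API; the sketch assumes at least two datasets and only applies the WHERE items on the last step):

    import java.util.Collections;

    import org.apache.metamodel.MetaModelHelper;
    import org.apache.metamodel.data.DataSet;
    import org.apache.metamodel.query.FilterItem;

    public final class JoinSketch {

        // Illustration only: fold an arbitrary number of datasets into one by
        // joining two at a time, so an optimized two-dataset join never has to
        // assume exactly two inputs. WHERE items are applied on the last step
        // here; a smarter planner would apply each filter as soon as all the
        // columns it references are available.
        public static DataSet foldProduct(DataSet[] fromDataSets,
                                          Iterable<FilterItem> whereItems) {
            DataSet result = fromDataSets[0];
            for (int i = 1; i < fromDataSets.length; i++) {
                Iterable<FilterItem> filters = (i == fromDataSets.length - 1)
                        ? whereItems
                        : Collections.<FilterItem>emptyList();
                // Existing MetaModel method, used here as the two-way join.
                result = MetaModelHelper.getCarthesianProduct(
                        new DataSet[] { result, fromDataSets[i] }, filters);
            }
            return result;
        }
    }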
2016-05-04 4:11 GMT-07:00 Jörg Unbehauen <[email protected]>:

> Hi Kasper,
>
> thanks for your feedback. I'll try to add your input as soon as possible.
>
> Sure, the code needs a lot more polish before it is fit for a review.
>
> As this is actually a small query planner, it might be better to implement
> that in a separate class, or is it guaranteed that getCarthesianProduct is
> always called on two DataSets? But for now it should work like this.
>
> And, as a deadline is approaching fast for me, I am unable to commit time
> until next week. But as I want to do a lot of joins with MetaModel, I have
> a keen interest in improving it.
>
> Reg. 4: I was referring to MetaModelHelper, line 220 for example.
>
> About the BaseObject: I do not think that having a BaseObject is a bad
> idea (but I am also not sure it is a good one). However, the implementation
> of hashCode/equals and imposing it on the subclasses as final perhaps
> hinders the VM from optimizing here. I just wanted to mention it, as it
> popped up during profiling. Anyway, you can work your way around it.
>
> Best,
>
> Jörg
>
>
> On 04.05.16 at 06:18, Kasper Sørensen wrote:
>
>> Hi Jörg,
>>
>> Thank you for the great work so far! Discussing here is great, JIRA also
>> works, but I think it usually gets a bit more direct and personal/friendly
>> via the mailing list, which I like ;-)
>>
>> Hmm yes, the whole BaseObject thing is kinda nasty. It's one of those
>> things I'd like to get rid of, but it would break a couple of dependent
>> projects IMO, so it should happen as part of a major version bump. Not
>> that we don't want to do it.
>>
>> Reg. 4 - which arraycopy invocation are you referring to? Something in
>> FilterItem presumably? I couldn't find it.
>>
>> I see that some of the unittests are breaking on your branch, so that
>> would be the minimal requirement from my side at least: that we bring
>> them back to green.
>>
>> I saw something that could be optimized, which is your loading of rows
>> into a List<Row>. For that there is already a DataSet.toRows() method
>> which will in some cases be much faster (and less memory-consuming) than
>> running through the dataset. And it would also have the benefit of using
>> the same method call for both in-memory and other datasets.
>>
>> A note on your "instanceof InMemoryDataSet" checks: actually we quite
>> often wrap datasets on top of other datasets. So probably you will want
>> to check for that instead. You can add a method like this maybe:
>>
>>     public static boolean isInMemory(final DataSet ds) {
>>         DataSet childDataSet = ds;
>>         while (childDataSet instanceof WrappingDataSet) {
>>             childDataSet = ((WrappingDataSet) childDataSet).getWrappedDataSet();
>>         }
>>         return childDataSet instanceof InMemoryDataSet;
>>     }
>>
>> That will make your checks catch more scenarios, e.g. where an
>> InMemoryDataSet is decorated by a FilteredDataSet or a MaxRowsDataSet
>> or so.
>>
>> Other than that, keep up the great work. Let me know if I can help. Looks
>> like you simply need to work a bit more on the corner cases found by the
>> unittests etc.
>>
>> BR,
>> Kasper
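To make the toRows() point concrete, the change amounts to replacing a manual collection loop; a minimal before/after sketch (ToRowsSketch is a made-up name; DataSet.next(), DataSet.getRow() and DataSet.toRows() are the existing calls):

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.metamodel.data.DataSet;
    import org.apache.metamodel.data.Row;

    public class ToRowsSketch {

        // Manual materialization: iterate the dataset and copy it row by row.
        public static List<Row> collectManually(DataSet ds) {
            List<Row> rows = new ArrayList<>();
            while (ds.next()) {
                rows.add(ds.getRow());
            }
            return rows;
        }

        // Same result via the existing DataSet.toRows() method, which can be
        // much cheaper for in-memory datasets and is the same call for
        // streaming ones.
        public static List<Row> collectViaToRows(DataSet ds) {
            return ds.toRows();
        }
    }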
>> 2016-05-03 6:09 GMT-07:00 Jörg Unbehauen <[email protected]>:
>>
>>> Hi all,
>>>
>>> I just pushed into https://github.com/tomatophantastico/metamodel, into
>>> the accelerateCarthesianProduct branch.
>>>
>>> The code is clearly not production ready, but you might want to look at
>>> it.
>>>
>>> Let me explain:
>>>
>>> 1. I added a test that calculates the join between an employee and a
>>> department table, each with 10,000 tuples. This test terminates with a
>>> heap overflow on my machine.
>>>
>>> 2. I implemented all the points below, which pushed the execution time
>>> to 100 seconds.
>>>
>>> Profiling the app, I discovered two things:
>>>
>>> 3. The BaseObject.hashCode implementation was now causing a lot of load,
>>> as I was constantly putting/getting them into/from hashmaps. I overrode
>>> this just out of curiosity, which improved performance a lot. After
>>> doing point 2 this was however no longer necessary.
>>>
>>> 4. The System.arraycopy was also causing a lot of load, so I changed the
>>> visibility of FilterItem.compare() in order to only copy tuples matching
>>> the filter items.
>>>
>>> BTW, where do you usually have these discussions? JIRA, GitHub or this
>>> mailing list?
>>>
>>> Best,
>>>
>>> Jörg
>>>
>>>
>>> On 03.05.16 at 12:06, Jörg Unbehauen wrote:
>>>
>>>> Hi Kasper,
>>>>
>>>> I'd happily contribute; actually I already started working on it, but
>>>> soon discovered that there might be a lot of side effects.
>>>>
>>>> The basic ideas I had were:
>>>>
>>>> 1. Start with in-memory datasets.
>>>>
>>>> 2. Stream non-rewindable datasets.
>>>>
>>>> 3. Directly apply filters on every row created.
>>>>
>>>> 4. Join first between datasets with filters, in order to prevent
>>>> Cartesian products.
>>>>
>>>> And further:
>>>>
>>>> 5. When joining, avoid loops, but build indexes for the in-memory
>>>> datasets. Not sure about this one, though.
>>>>
>>>> Feedback is appreciated.
>>>>
>>>> I already started a fork (https://github.com/tomatophantastico/metamodel);
>>>> as soon as it works, I'll write an email again, before any pull
>>>> requests.
>>>>
>>>> Best,
>>>>
>>>> Jörg
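Idea 5 above, building an index for the in-memory side instead of looping, is essentially a hash join. A rough sketch under that reading (HashJoinSketch and its parameters are invented for illustration and are not the actual MetaModel implementation; Row.getValue(int) and Row.getValues() are existing API):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.metamodel.data.Row;

    public class HashJoinSketch {

        // Index ("build") the rows of one side by the join key, then stream
        // ("probe") the other side and look up matches, so the full Cartesian
        // product is never materialized.
        public static List<Object[]> hashJoin(List<Row> buildRows, int buildKeyIndex,
                                              List<Row> probeRows, int probeKeyIndex) {
            Map<Object, List<Row>> index = new HashMap<>();
            for (Row row : buildRows) {
                Object key = row.getValue(buildKeyIndex);
                List<Row> bucket = index.get(key);
                if (bucket == null) {
                    bucket = new ArrayList<>();
                    index.put(key, bucket);
                }
                bucket.add(row);
            }

            List<Object[]> joined = new ArrayList<>();
            for (Row probe : probeRows) {
                List<Row> matches = index.get(probe.getValue(probeKeyIndex));
                if (matches == null) {
                    continue;
                }
                for (Row match : matches) {
                    // Combine the two value arrays into one joined row.
                    Object[] left = match.getValues();
                    Object[] right = probe.getValues();
                    Object[] combined = new Object[left.length + right.length];
                    System.arraycopy(left, 0, combined, 0, left.length);
                    System.arraycopy(right, 0, combined, left.length, right.length);
                    joined.add(combined);
                }
            }
            return joined;
        }
    }

Choosing the smaller dataset as the build side keeps the index small, and combined with idea 4 (joining filtered datasets first) the probe side could stay streaming.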
>>>> On 02.05.16 at 06:56, Kasper Sørensen wrote:
>>>>
>>>>> Hi Jörg,
>>>>>
>>>>> You're right about the very naive behaviour of that method. It could
>>>>> _certainly_ use an optimization or two. I can only speak for myself,
>>>>> but I just never used MetaModel much for joins and thus never gave it
>>>>> much thought. Looking at the code I'm thinking that we can do much
>>>>> better.
>>>>>
>>>>> Would you be interested in working on improving this? If so, I will
>>>>> happily share insights and ideas on how we can pull it off.
>>>>>
>>>>> Cheers,
>>>>> Kasper
>>>>>
>>>>> 2016-05-01 4:12 GMT-07:00 Jörg Unbehauen <[email protected]>:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> we just tried out MetaModel with MongoDB and tried a simple join (as
>>>>>> in: select * from t1 join t2 on (t1.id = t2.oid)) between two
>>>>>> collections, each containing roughly 10,000 documents. Using a
>>>>>> developer setup on a Mac, we did not get a result, as the system was
>>>>>> more or less stuck.
>>>>>>
>>>>>> A quick examination revealed that
>>>>>> MetaModelHelper.getCarthesianProduct(DataSet[] fromDataSets,
>>>>>> Iterable<FilterItem> whereItems) consumes most of the resources.
>>>>>> This implementation first computes the Cartesian product in memory
>>>>>> and then applies filters on it.
>>>>>>
>>>>>> I wonder what the rationale behind this implementation is, as it will
>>>>>> not scale well, even for selective joins. Or am I using MetaModel
>>>>>> wrong here, as in: the join should never be computed by
>>>>>> getCarthesianProduct()?
>>>>>>
>>>>>> The problem appears to me as a general one, so I did not supply a
>>>>>> code example.
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Jörg
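The scaling problem described in the original mail quoted above comes from materializing the whole product before any filter runs; idea 3 in the list further up targets exactly that. A schematic contrast in plain Java, with no MetaModel types and purely for illustration:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Predicate;

    public class ProductSketch {

        // Roughly what the original mail describes: materialize the full
        // product, then filter it. Peak memory grows with |left| * |right|.
        public static List<Object[]> productThenFilter(List<Object[]> left, List<Object[]> right,
                                                       Predicate<Object[]> where) {
            List<Object[]> product = new ArrayList<>();
            for (Object[] l : left) {
                for (Object[] r : right) {
                    product.add(concat(l, r));
                }
            }
            List<Object[]> result = new ArrayList<>();
            for (Object[] row : product) {
                if (where.test(row)) {
                    result.add(row);
                }
            }
            return result;
        }

        // Idea 3: apply the filter to every combined row as it is created, so
        // non-matching rows are never kept in memory.
        public static List<Object[]> filterWhileJoining(List<Object[]> left, List<Object[]> right,
                                                        Predicate<Object[]> where) {
            List<Object[]> result = new ArrayList<>();
            for (Object[] l : left) {
                for (Object[] r : right) {
                    Object[] row = concat(l, r);
                    if (where.test(row)) {
                        result.add(row);
                    }
                }
            }
            return result;
        }

        private static Object[] concat(Object[] a, Object[] b) {
            Object[] combined = new Object[a.length + b.length];
            System.arraycopy(a, 0, combined, 0, a.length);
            System.arraycopy(b, 0, combined, a.length, b.length);
            return combined;
        }
    }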
