You can use MetaModelHelper for the static method, but if you feel a separate class is better (for example if you want to keep a bit of temporary state), then I am fine with that as well. As far as I know it is not guaranteed that there will only be two datasets in the Cartesian join.
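If the method does end up handling more than two datasets, one way to stay general is to fold the inputs pairwise. Purely as a sketch (JoinSketch and foldProduct are made-up names; only MetaModelHelper.getCarthesianProduct is existing API; the sketch assumes at least two datasets and only applies the WHERE items on the last step):

    import java.util.Collections;

    import org.apache.metamodel.MetaModelHelper;
    import org.apache.metamodel.data.DataSet;
    import org.apache.metamodel.query.FilterItem;

    public final class JoinSketch {

        // Illustration only: fold an arbitrary number of datasets into one by
        // joining two at a time, so an optimized two-dataset join never has to
        // assume exactly two inputs. WHERE items are applied on the last step
        // here; a smarter planner would apply each filter as soon as all the
        // columns it references are available.
        public static DataSet foldProduct(DataSet[] fromDataSets,
                                          Iterable<FilterItem> whereItems) {
            DataSet result = fromDataSets[0];
            for (int i = 1; i < fromDataSets.length; i++) {
                Iterable<FilterItem> filters = (i == fromDataSets.length - 1)
                        ? whereItems
                        : Collections.<FilterItem>emptyList();
                // Existing MetaModel method, used here as the two-way join.
                result = MetaModelHelper.getCarthesianProduct(
                        new DataSet[] { result, fromDataSets[i] }, filters);
            }
            return result;
        }
    }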
2016-05-04 4:11 GMT-07:00 Jörg Unbehauen <[email protected]>:

> Hi Kasper,
>
> thanks for your feedback. I'll try to add your input as soon as possible.
>
> Sure, the code needs a lot more polish before it is fit for a review.
>
> As this is actually a small query planner, it might be better to implement
> that in a separate class, or is it guaranteed that getCarthesianProduct is
> always called on two DataSets? But for now it should work like this.
>
> And, as a deadline is approaching fast for me, I am unable to commit time
> until next week. But as I want to do a lot of joins with MetaModel, I have
> a keen interest in improving it.
>
> Reg. 4: I was referring to MetaModelHelper, line 220 for example.
>
> About the BaseObject: I do not think that having a BaseObject is a bad
> idea (but I am also not sure it is a good one). However, the implementation
> of hashCode/equals and imposing it on the subclasses as final perhaps
> hinders the VM from optimizing here. I just wanted to mention it, as it
> popped up during profiling. Anyway, you can work your way around it.
>
> Best,
>
> Jörg
>
>
> On 04.05.16 at 06:18, Kasper Sørensen wrote:
>
>> Hi Jörg,
>>
>> Thank you for the great work so far! Discussing here is great, JIRA also
>> works, but I think it usually gets a bit more direct and personal/friendly
>> via the mailing list, which I like ;-)
>>
>> Hmm yes, the whole BaseObject thing is kinda nasty. It's one of those
>> things I'd like to get rid of, but it would break a couple of dependent
>> projects IMO, so it should happen as part of a major version bump. Not
>> that we don't want to do it.
>>
>> Reg. 4 - which arraycopy invocation are you referring to? Something in
>> FilterItem presumably? I couldn't find it.
>>
>> I see that some of the unittests are breaking on your branch, so that
>> would be the minimal requirement from my side at least: that we bring
>> them back to green.
>>
>> I saw something that could be optimized, which is your loading of rows
>> into a List<Row>. For that there is already a DataSet.toRows() method
>> which will in some cases be much faster (and less memory-consuming) than
>> running through the dataset. And it would also have the benefit of using
>> the same method call for both in-memory and other datasets.
>>
>> A note on your "instanceof InMemoryDataSet" checks: actually we quite
>> often wrap datasets on top of other datasets. So probably you will want
>> to check for that instead. You can add a method like this maybe:
>>
>>     public static boolean isInMemory(final DataSet ds) {
>>         DataSet childDataSet = ds;
>>         while (childDataSet instanceof WrappingDataSet) {
>>             childDataSet = ((WrappingDataSet) childDataSet).getWrappedDataSet();
>>         }
>>         return childDataSet instanceof InMemoryDataSet;
>>     }
>>
>> That will make your checks catch more scenarios, e.g. where an
>> InMemoryDataSet is decorated by a FilteredDataSet or a MaxRowsDataSet
>> or so.
>>
>> Other than that, keep up the great work. Let me know if I can help. Looks
>> like you simply need to work a bit more on the corner cases found by the
>> unittests etc.
>>
>> BR,
>> Kasper
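To make the toRows() point concrete, the change amounts to replacing a manual collection loop; a minimal before/after sketch (ToRowsSketch is a made-up name; DataSet.next(), DataSet.getRow() and DataSet.toRows() are the existing calls):

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.metamodel.data.DataSet;
    import org.apache.metamodel.data.Row;

    public class ToRowsSketch {

        // Manual materialization: iterate the dataset and copy it row by row.
        public static List<Row> collectManually(DataSet ds) {
            List<Row> rows = new ArrayList<>();
            while (ds.next()) {
                rows.add(ds.getRow());
            }
            return rows;
        }

        // Same result via the existing DataSet.toRows() method, which can be
        // much cheaper for in-memory datasets and is the same call for
        // streaming ones.
        public static List<Row> collectViaToRows(DataSet ds) {
            return ds.toRows();
        }
    }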
>> 2016-05-03 6:09 GMT-07:00 Jörg Unbehauen <[email protected]>:
>>
>>> Hi all,
>>>
>>> I just pushed into https://github.com/tomatophantastico/metamodel, into
>>> the accelerateCarthesianProduct branch.
>>>
>>> The code is clearly not production ready, but you might want to look at
>>> it.
>>>
>>> Let me explain:
>>>
>>> 1. I added a test that calculates the join between an employee and a
>>> department table, each with 10,000 tuples. This test terminates with a
>>> heap overflow on my machine.
>>>
>>> 2. I implemented all the points below, which pushed the execution time
>>> to 100 seconds.
>>>
>>> Profiling the app, I discovered two things:
>>>
>>> 3. The BaseObject.hashCode implementation was now causing a lot of load,
>>> as I was constantly putting/getting them into/from hashmaps. I overrode
>>> this just out of curiosity, which improved performance a lot. After
>>> doing point 2 this was however no longer necessary.
>>>
>>> 4. The System.arraycopy was also causing a lot of load, so I changed the
>>> visibility of FilterItem.compare() in order to only copy tuples matching
>>> the filter items.
>>>
>>> BTW, where do you usually have these discussions? JIRA, GitHub or this
>>> mailing list?
>>>
>>> Best,
>>>
>>> Jörg
>>>
>>>
>>> On 03.05.16 at 12:06, Jörg Unbehauen wrote:
>>>
>>>> Hi Kasper,
>>>>
>>>> I'd happily contribute; actually I already started working on it, but
>>>> soon discovered that there might be a lot of side effects.
>>>>
>>>> The basic ideas I had were:
>>>>
>>>> 1. Start with in-memory datasets.
>>>>
>>>> 2. Stream non-rewindable datasets.
>>>>
>>>> 3. Directly apply filters on every row created.
>>>>
>>>> 4. Join first between datasets with filters, in order to prevent
>>>> Cartesian products.
>>>>
>>>> And further:
>>>>
>>>> 5. When joining, avoid loops, but build indexes for the in-memory
>>>> datasets. Not sure about this one, though.
>>>>
>>>> Feedback is appreciated.
>>>>
>>>> I already started a fork (https://github.com/tomatophantastico/metamodel);
>>>> as soon as it works, I'll write an email again, before any pull
>>>> requests.
>>>>
>>>> Best,
>>>>
>>>> Jörg
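Idea 5 above, building an index for the in-memory side instead of looping, is essentially a hash join. A rough sketch under that reading (HashJoinSketch and its parameters are invented for illustration and are not the actual MetaModel implementation; Row.getValue(int) and Row.getValues() are existing API):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.metamodel.data.Row;

    public class HashJoinSketch {

        // Index ("build") the rows of one side by the join key, then stream
        // ("probe") the other side and look up matches, so the full Cartesian
        // product is never materialized.
        public static List<Object[]> hashJoin(List<Row> buildRows, int buildKeyIndex,
                                              List<Row> probeRows, int probeKeyIndex) {
            Map<Object, List<Row>> index = new HashMap<>();
            for (Row row : buildRows) {
                Object key = row.getValue(buildKeyIndex);
                List<Row> bucket = index.get(key);
                if (bucket == null) {
                    bucket = new ArrayList<>();
                    index.put(key, bucket);
                }
                bucket.add(row);
            }

            List<Object[]> joined = new ArrayList<>();
            for (Row probe : probeRows) {
                List<Row> matches = index.get(probe.getValue(probeKeyIndex));
                if (matches == null) {
                    continue;
                }
                for (Row match : matches) {
                    // Combine the two value arrays into one joined row.
                    Object[] left = match.getValues();
                    Object[] right = probe.getValues();
                    Object[] combined = new Object[left.length + right.length];
                    System.arraycopy(left, 0, combined, 0, left.length);
                    System.arraycopy(right, 0, combined, left.length, right.length);
                    joined.add(combined);
                }
            }
            return joined;
        }
    }

Choosing the smaller dataset as the build side keeps the index small, and combined with idea 4 (joining filtered datasets first) the probe side could stay streaming.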
>>>> On 02.05.16 at 06:56, Kasper Sørensen wrote:
>>>>
>>>>> Hi Jörg,
>>>>>
>>>>> You're right about the very naive behaviour of that method. It could
>>>>> _certainly_ use an optimization or two. I can only speak for myself,
>>>>> but I just never used MetaModel much for joins and thus never gave it
>>>>> much thought. Looking at the code I'm thinking that we can do much
>>>>> better.
>>>>>
>>>>> Would you be interested in working on improving this? If so, I will
>>>>> happily share insights and ideas on how we can pull it off.
>>>>>
>>>>> Cheers,
>>>>> Kasper
>>>>>
>>>>> 2016-05-01 4:12 GMT-07:00 Jörg Unbehauen <[email protected]>:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> we just tried out MetaModel with MongoDB and tried a simple join (as
>>>>>> in: select * from t1 join t2 on (t1.id = t2.oid)) between two
>>>>>> collections, each containing roughly 10,000 documents. Using a
>>>>>> developer setup on a Mac, we did not get a result, as the system was
>>>>>> more or less stuck.
>>>>>>
>>>>>> A quick examination revealed that
>>>>>> MetaModelHelper.getCarthesianProduct(DataSet[] fromDataSets,
>>>>>> Iterable<FilterItem> whereItems) consumes most of the resources.
>>>>>> This implementation first computes the Cartesian product in memory
>>>>>> and then applies filters on it.
>>>>>>
>>>>>> I wonder what the rationale behind this implementation is, as it will
>>>>>> not scale well, even for selective joins. Or am I using MetaModel
>>>>>> wrong here, as in: the join should never be computed by
>>>>>> getCarthesianProduct()?
>>>>>>
>>>>>> The problem appears to me as a general one, so I did not supply a
>>>>>> code example.
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Jörg
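The scaling problem described in the original mail quoted above comes from materializing the whole product before any filter runs; idea 3 in the list further up targets exactly that. A schematic contrast in plain Java, with no MetaModel types and purely for illustration:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Predicate;

    public class ProductSketch {

        // Roughly what the original mail describes: materialize the full
        // product, then filter it. Peak memory grows with |left| * |right|.
        public static List<Object[]> productThenFilter(List<Object[]> left, List<Object[]> right,
                                                       Predicate<Object[]> where) {
            List<Object[]> product = new ArrayList<>();
            for (Object[] l : left) {
                for (Object[] r : right) {
                    product.add(concat(l, r));
                }
            }
            List<Object[]> result = new ArrayList<>();
            for (Object[] row : product) {
                if (where.test(row)) {
                    result.add(row);
                }
            }
            return result;
        }

        // Idea 3: apply the filter to every combined row as it is created, so
        // non-matching rows are never kept in memory.
        public static List<Object[]> filterWhileJoining(List<Object[]> left, List<Object[]> right,
                                                        Predicate<Object[]> where) {
            List<Object[]> result = new ArrayList<>();
            for (Object[] l : left) {
                for (Object[] r : right) {
                    Object[] row = concat(l, r);
                    if (where.test(row)) {
                        result.add(row);
                    }
                }
            }
            return result;
        }

        private static Object[] concat(Object[] a, Object[] b) {
            Object[] combined = new Object[a.length + b.length];
            System.arraycopy(a, 0, combined, 0, a.length);
            System.arraycopy(b, 0, combined, a.length, b.length);
            return combined;
        }
    }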
