I tested (1000 disjunctions) and it was extremely slow but no OOM. The issue seems to be the fact that we serialize the plan by writing to HDFS file directly. We probably should cache it locally and then write it to HDFS.
On Aug 4, 2010, at 10:23 AM, Edward Capriolo wrote: > On Wed, Aug 4, 2010 at 1:15 PM, Ning Zhang <[email protected]> wrote: >> Currently an expression tree (series of ORs in this case) is not collapsed >> to one operator or any other optimizations. It would be great to have this >> optimization rule to convert an OR operator tree to one IN operator. Would >> you be able to file a JIRA and contribute a patch? >> >> On Aug 4, 2010, at 7:46 AM, Mark Tozzi wrote: >> >>> I haven't looked at the code, but I assume the query parser would sort >>> the 'in' terms and then do a binary search lookup into them for each >>> row, while the 'or' terms don't have that kind of obvious relationship >>> and are probably tested in sequence. This would give the in O(log N) >>> performance compared to a chain of or's having O(N) performance, per >>> row queried. For large N, that could add up. That being said, I'm >>> just speculating here. The query parser may be smart enough to >>> optimize the related or's in the same way, or it may not optimize that >>> at all. If I get a chance, I'll try to dig around and see what it's >>> doing, as I have also had a lot of large 'in' queries and could use >>> every drop of performance I can get. >>> >>> --Mark >>> >>> On Wed, Aug 4, 2010 at 9:47 AM, Edward Capriolo <[email protected]> >>> wrote: >>>> On Wed, Aug 4, 2010 at 6:10 AM, lei liu <[email protected]> wrote: >>>>> Because my company reuire we use 0.4.1 version, the version don't support >>>>> IN >>>>> clause. I want to use the OR clause(example:where id=1 or id=2 or id=3) >>>>> to >>>>> implement the IN clause(example: id in(1,2,3) ). I know it will be slower >>>>> especially when the list after "in" is very long. Could anybody can tell >>>>> me >>>>> why is slow when use OR clause to implement In clause? >>>>> >>>>> >>>>> Thanks, >>>>> >>>>> >>>>> LiuLei >>>>> >>>> >>>> I can not imagine the performance difference between 'or' or 'in' >>>> would be that great but I never benchmarked it. The big looming >>>> problems is that if you string enough 'or' together (say 8000) the >>>> query parser which uses java beans serialization will OOM. >>>> >>>> Edward >>>> >> >> > > For reference I did this as a test case.... > SELECT * FROM src where > key=0 OR key=0 OR key=0 OR key=0 OR key=0 OR key=0 OR key=0 OR key=0 > OR key=0 OR key=0 OR key=0 OR > key=0 OR key=0 OR key=0 OR key=0 OR key=0 OR key=0 OR key=0 OR key=0 > OR key=0 OR key=0 OR key=0 OR > ...(100 more of these) > > No OOM but I gave up after the test case did not go anywhere for about > 2 minutes. > > Edward
