On Thu, March 31, 2011 12:33, mos wrote: > At 11:20 AM 3/31/2011, you wrote: >>At 07:29 AM 3/31/2011, you wrote: >>>Thanks for your insight! But I'm still worried about the performance of >>>IN ( big list of values ). Can you tell me how it is implemented? >>> >>>Suppose I have SELECT a FROM b WHERE c IN (1, 4, 5, 117, 118, 119, ..., >>>387945) >>> >>>1) If I put 200 values there, does it do 200 individual SELECTs >>>internally, and union them? >> >>No. It uses one Select statement. >> >>>Or does it notice that c has a "UNIQUE" index and thus at most one row >>>can be returned per SELECT, and does them all at once? >> >>The IN() clause is very inefficient because MySQL will NOT use the index. >>It will have to traverse the entire table looking for these values. That >>is why a table join will be much faster than using IN(). > > Oops. Sorry, the In() clause in MySQL 5.5 does use the index (not sure > when > they implemented that). Even so, I still find it slow when the In() > clause > has a lot of elements. > I should have used an Explain in front of my Select statement to see if > the > index is used before I posted. Its in 5.0 series. When you made your original statement I went back and checked since I uses it in place of multiple "OR"s > > Mike > >>>2) If I want to get just the primary key, or join with another table >>>based on just the primary key, does this query ever touch the disk >>>(assuming the index is in memory, which I think it always is -- correct >>>me if I'm wrong about that). >> >>It will get the information from the index and not have to access the >>record data from disk. If the index is stored in memory, then it won't >>have to go to disk (unless you also have a sort). That is why the query >>cache is so important. >> >> >>>The way I would recommend doing it (for BTREE indexes, anyway) is to >>> sort >>>the values in ascending order, and do the search in one pass through the >>>index. The index is already in memory, and it would be straightforward >>> to >>>modify a binary search algorithm to find the rows corresponding to >>>monotonically ascending values of the primary key, all in one pass. >>> >>>Even if the binary search algorithm is run 200 or 2000 times for a list, >>>it would still be faster than hitting the disk. (Even though the CPU >>>cache performance would be worse.) >>> >>>Can you let me know the specifics of it, and especially how I can avoid >>>hitting the I/O bottlenecks? >> >>Use a table join and make sure you have the indexes loaded into memory. >>See http://dev.mysql.com/doc/refman/5.0/en/load-index.html. If using >>InnoDb then its index cache scheme is quite good. >> >> >>Mike >> >> >>>Thank you, >>>Greg >>> >>>On 3/29/11 4:17 PM, Peter Brawley wrote: >>>> > Why not optimize the IN ( ... ) to do the same type of thing? >>>> >>>>If the argument to IN() is a list of values, it'll be OK. If it's a >>>>SELECT, in 5.0 it will be slower than molasses (see "The unbearable >>>>slowness of IN()" at http://www.artfulsoftware.com/queries.php. >>>> >>>> > I always tried to avoid joins because I am planning to horizontally >>>> partition my data. >>>> >>>>A severe & unfortunate constraint. Can't help you there. >>>> >>>>PB >>>> >>>>----- >>>> >>>>On 3/29/2011 1:27 PM, Gregory Magarshak wrote: >>>>>Yes, this would be fine. But often, the list of friends is obtained >>>>>from a social network like facebook, and is not stored internally. >>>>>Basically, I obtain the friend list in a request to facebook, and then >>>>>see which of those users have created things. So would I have to >>>>> create >>>>>a temporary table and insert all those uids just to make a join? Why >>>>>not optimize the IN ( ... ) to do the same type of thing? >>>>> >>>>>There is also a second problem: I want to use MySQL Cluster, because I >>>>>expect to have many users. Would it be efficient to use JOIN between >>>>>the friends table and the articles table? Both tables are partitioned >>>>>by user_id as the primary key, so the join would have to hit many >>>>>different nodes. I always tried to avoid joins because I am planning >>>>> to >>>>>horizontally partition my data. But if MySQL cluster can handle this >>>>>join transparently and split it up based on the partition, then that's >>>>>fine. Do you have any info on this? >>>>> >>>>>Greg >>>>> >>>>>On 3/29/11 2:10 PM, Peter Brawley wrote: >>>>>> > How can I quickly find all the articles written by this user's >>>>>> friends, and not just random articles? >>>>>> >>>>>>Taking the simplest possible case, with table >>>>>> friends(userID,friendID) >>>>>>where each friendID refers to a userID in another row, the friends of >>>>>>userID u are ... >>>>>> >>>>>>select friendID from user where userID=u; >>>>>> >>>>>>so articles by those friends of u are ... >>>>>> >>>>>>select a.* from article a join ( select friendID from user where >>>>>>userID=u ) f on a.userID=f.friendID; >>>>>> >>>>>>PB >>>>>> >>>>>>----- >>>>>> >>>>>>On 3/29/2011 12:50 PM, Gregory Magarshak wrote: >>>>>>>Hey there. My company writes a lot of social applications, and there >>>>>>>is one operation that is very common, but I don't know if MySQL >>>>>>>supports it in a good way. I thought I'd write to this list for two >>>>>>> reasons: >>>>>>> >>>>>>> 1) Maybe MySQL has a good way to do this, and I just don't know >>>>>>> about it >>>>>>> >>>>>>> 2) Propose to MySQL developers a simple algorithm which would >>>>>>> greatly improve MySQL support for social networking apps. >>>>>>> >>>>>>> Here is the situation. Let's say I have built a social >>>>>>> networking application where people create and edit some item >>>>>>> (article, photo, music mix, whatever). Now, a typical user logs in, >>>>>>> and this user has 3000 friends. How can I quickly find all the >>>>>>> articles written by this user's friends, and not just random >>>>>>> articles? >>>>>>> >>>>>>> Ideally, I would want to write something like this: >>>>>>> >>>>>>> SELECT * FROM article WHERE user_id IN (345789, 324875, 398, >>>>>>> ..., 349580) >>>>>>> >>>>>>> basically, execute a query with a huge IN ( ... ). Maybe if >>>>>>> this >>>>>>> would exceed the buffer size for the MySQL wire protocol, I would >>>>>>> break up the list into several lists, and execute several queries, >>>>>>> and union the results together myself. >>>>>>> >>>>>>> But my point is, this is very common for social networking >>>>>>> apps. >>>>>>> Every app wants to show "the X created by your friends", or >>>>>>> "friends >>>>>>> of yours (given some list from a social network) who have taken >>>>>>> action X". >>>>>>> >>>>>>> Here is how I would do it if I had raw access to the MySQL >>>>>>> index >>>>>>> in memory: >>>>>>> >>>>>>> a) Sort the list of entries in the IN, in ascending order. >>>>>>> >>>>>>> b) Do *ONE* binary search through the index (assuming it's a >>>>>>> BTREE index) and get them all in one pass. If it's a HASH index or >>>>>>> something, I would have to look up each one individually. >>>>>>> >>>>>>> The benefits of this approach would be that this common >>>>>>> operation would be done extremely quickly. If the index fits >>>>>>> entirely in memory, and I just want to get the primary keys (i.e. >>>>>>> get the list of friends who did X), the disk isn't even touched. In >>>>>>> addition, for BTREE indexes, I would just need ONE binary search, >>>>>>> because the entries have been sorted in ascending order. >>>>>>> >>>>>>> Does MySQL have something like this? And if not, perhaps you >>>>>>> can >>>>>>> add it in the next version? It would really boost MySQL's support >>>>>>> for social networking apps tremendously. Alternative, how can I add >>>>>>> this to my MySQL? Any advice would be appreciated. >>>>>>> >>>>>>>Sincerely, >>>>>>>Gregory Magarshak >>>>>>>Qbix ------ William R. Mussatto Systems Engineer http://www.csz.com 909-920-9154
-- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe: http://lists.mysql.com/mysql?unsub=arch...@jab.org