Re: A common request

Wm Mussatto Thu, 31 Mar 2011 13:16:55 -0700

On Thu, March 31, 2011 12:33, mos wrote:
> At 11:20 AM 3/31/2011, you wrote:
>>At 07:29 AM 3/31/2011, you wrote:
>>>Thanks for your insight! But I'm still worried about the performance of
>>>IN ( big list of values ). Can you tell me how it is implemented?
>>>
>>>Suppose I have SELECT a FROM b WHERE c IN (1, 4, 5, 117, 118, 119, ...,
>>>387945)
>>>
>>>1) If I put 200 values there, does it do 200 individual SELECTs
>>>internally, and union them?
>>
>>No. It uses one Select statement.
>>
>>>Or does it notice that c has a "UNIQUE" index and thus at most one row
>>>can be returned per SELECT, and does them all at once?
>>
>>The IN() clause is very inefficient because MySQL will NOT use the index.
>>It will have to traverse the entire table looking for these values. That
>>is why a table join will be much faster than using IN().
>
> Oops. Sorry, the In() clause in MySQL 5.5 does use the index (not sure
> when
> they implemented that).  Even so, I still find it slow when the In()
> clause
> has a lot of elements.
> I should have used an Explain in front of my Select statement to see if
> the
> index is used before I posted.
Its in 5.0 series.  When you made your original statement I went back and
checked since I uses it in place of multiple "OR"s
>
> Mike
>
>>>2) If I want to get just the primary key, or join with another table
>>>based on just the primary key, does this query ever touch the disk
>>>(assuming the index is in memory, which I think it always is -- correct
>>>me if I'm wrong about that).
>>
>>It will get the information from the index and not have to access the
>>record data from disk. If the index is stored in memory, then it won't
>>have to go to disk (unless you also have a sort). That is why the query
>>cache is so important.
>>
>>
>>>The way I would recommend doing it (for BTREE indexes, anyway) is to
>>> sort
>>>the values in ascending order, and do the search in one pass through the
>>>index. The index is already in memory, and it would be straightforward
>>> to
>>>modify a binary search algorithm to find the rows corresponding to
>>>monotonically ascending values of the primary key, all in one pass.
>>>
>>>Even if the binary search algorithm is run 200 or 2000 times for a list,
>>>it would still be faster than hitting the disk. (Even though the CPU
>>>cache performance would be worse.)
>>>
>>>Can you let me know the specifics of it, and especially how I can avoid
>>>hitting the I/O bottlenecks?
>>
>>Use a table join and make sure you have the indexes loaded into memory.
>>See http://dev.mysql.com/doc/refman/5.0/en/load-index.html. If using
>>InnoDb then its index cache scheme is quite good.
>>
>>
>>Mike
>>
>>
>>>Thank you,
>>>Greg
>>>
>>>On 3/29/11 4:17 PM, Peter Brawley wrote:
>>>> > Why not optimize the IN ( ... ) to do the same type of thing?
>>>>
>>>>If the argument to IN() is a list of values, it'll be OK. If it's a
>>>>SELECT, in 5.0 it will be slower than molasses (see "The unbearable
>>>>slowness of IN()" at http://www.artfulsoftware.com/queries.php.
>>>>
>>>> > I always tried to avoid joins because I am planning to horizontally
>>>> partition my data.
>>>>
>>>>A severe & unfortunate constraint. Can't help you there.
>>>>
>>>>PB
>>>>
>>>>-----
>>>>
>>>>On 3/29/2011 1:27 PM, Gregory Magarshak wrote:
>>>>>Yes, this would be fine. But often, the list of friends is obtained
>>>>>from a social network like facebook, and is not stored internally.
>>>>>Basically, I obtain the friend list in a request to facebook, and then
>>>>>see which of those users have created things. So would I have to
>>>>> create
>>>>>a temporary table and insert all those uids just to make a join? Why
>>>>>not optimize the IN ( ... ) to do the same type of thing?
>>>>>
>>>>>There is also a second problem: I want to use MySQL Cluster, because I
>>>>>expect to have many users. Would it be efficient to use JOIN between
>>>>>the friends table and the articles table? Both tables are partitioned
>>>>>by user_id as the primary key, so the join would have to hit many
>>>>>different nodes. I always tried to avoid joins because I am planning
>>>>> to
>>>>>horizontally partition my data. But if MySQL cluster can handle this
>>>>>join transparently and split it up based on the partition, then that's
>>>>>fine. Do you have any info on this?
>>>>>
>>>>>Greg
>>>>>
>>>>>On 3/29/11 2:10 PM, Peter Brawley wrote:
>>>>>> > How can I quickly find all the articles written by this user's
>>>>>> friends, and not just random articles?
>>>>>>
>>>>>>Taking the simplest possible case, with table
>>>>>> friends(userID,friendID)
>>>>>>where each friendID refers to a userID in another row, the friends of
>>>>>>userID u are ...
>>>>>>
>>>>>>select friendID from user where userID=u;
>>>>>>
>>>>>>so articles by those friends of u are ...
>>>>>>
>>>>>>select a.* from article a join (  select friendID from user where
>>>>>>userID=u ) f on a.userID=f.friendID;
>>>>>>
>>>>>>PB
>>>>>>
>>>>>>-----
>>>>>>
>>>>>>On 3/29/2011 12:50 PM, Gregory Magarshak wrote:
>>>>>>>Hey there. My company writes a lot of social applications, and there
>>>>>>>is one operation that is very common, but I don't know if MySQL
>>>>>>>supports it in a good way. I thought I'd write to this list for two
>>>>>>> reasons:
>>>>>>>
>>>>>>>     1) Maybe MySQL has a good way to do this, and I just don't know
>>>>>>> about it
>>>>>>>
>>>>>>>     2) Propose to MySQL developers a simple algorithm which would
>>>>>>> greatly improve MySQL support for social networking apps.
>>>>>>>
>>>>>>>     Here is the situation. Let's say I have built a social
>>>>>>> networking application where people create and edit some item
>>>>>>> (article, photo, music mix, whatever). Now, a typical user logs in,
>>>>>>> and this user has 3000 friends. How can I quickly find all the
>>>>>>> articles written by this user's friends, and not just random
>>>>>>> articles?
>>>>>>>
>>>>>>>     Ideally, I would want to write something like this:
>>>>>>>
>>>>>>>     SELECT * FROM article WHERE user_id IN (345789, 324875, 398,
>>>>>>> ..., 349580)
>>>>>>>
>>>>>>>     basically, execute a query with a huge IN ( ... ). Maybe if
>>>>>>> this
>>>>>>> would exceed the buffer size for the MySQL wire protocol, I would
>>>>>>> break up the list into several lists, and execute several queries,
>>>>>>> and union the results together myself.
>>>>>>>
>>>>>>>     But my point is, this is very common for social networking
>>>>>>> apps.
>>>>>>> Every app wants to show "the X created by your friends", or
>>>>>>> "friends
>>>>>>> of yours (given some list from a social network) who have taken
>>>>>>> action X".
>>>>>>>
>>>>>>>     Here is how I would do it if I had raw access to the MySQL
>>>>>>> index
>>>>>>> in memory:
>>>>>>>
>>>>>>>     a) Sort the list of entries in the IN, in ascending order.
>>>>>>>
>>>>>>>     b) Do *ONE* binary search through the index (assuming it's a
>>>>>>> BTREE index) and get them all in one pass. If it's a HASH index or
>>>>>>> something, I would have to look up each one individually.
>>>>>>>
>>>>>>>     The benefits of this approach would be that this common
>>>>>>> operation would be done extremely quickly. If the index fits
>>>>>>> entirely in memory, and I just want to get the primary keys (i.e.
>>>>>>> get the list of friends who did X), the disk isn't even touched. In
>>>>>>> addition, for BTREE indexes, I would just need ONE binary search,
>>>>>>> because the entries have been sorted in ascending order.
>>>>>>>
>>>>>>>     Does MySQL have something like this? And if not, perhaps you
>>>>>>> can
>>>>>>> add it in the next version? It would really boost MySQL's support
>>>>>>> for social networking apps tremendously. Alternative, how can I add
>>>>>>> this to my MySQL? Any advice would be appreciated.
>>>>>>>
>>>>>>>Sincerely,
>>>>>>>Gregory Magarshak
>>>>>>>Qbix
------
William R. Mussatto
Systems Engineer
http://www.csz.com
909-920-9154



-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/mysql?unsub=arch...@jab.org

Re: A common request

Reply via email to