On Tue, Dec 11, 2018 at 12:34:12AM +0200, Vicențiu Ciorbaru wrote:
> Hi!
>
> Here is my proposal on extending the storage engine API to provide a
> functionality for retrieving random rows from tables (those that have
> indexes). The storage engines for which I plan to implement this are:
> MyISAM, Aria, Innodb. Possibly RocksDB, TokuDB.

Observations:

- as far as I understand, random skip scan is not possible with this API?
  (which is probably fine as we expect that sampling will only retrieve a
  small fraction of table rows, which means the difference between a
  forward-walking skip scan and genuinely random probing is negligible).
- Can the scan return the same row twice?
- Do we want/need a concept of random "seed" which will cause the same rows
  to be returned on the same table?

>
> --- a/sql/handler.h
> +++ b/sql/handler.h
> @@ -2927,7 +2927,7 @@ class handler :public Sql_alloc
>    /** Length of ref (1-8 or the clustered key length) */
>    uint ref_length;
>    FT_INFO *ft_handler;
> -  enum init_stat { NONE=0, INDEX, RND };
> +  enum init_stat { NONE=0, INDEX, RND, RANDOM };
>    init_stat inited, pre_inited;
> ........
> +  virtual int ha_random_sample_init() __attribute__((warn_unused_result))
> +  {
> +    DBUG_ENTER("ha_random_sample_init");
> +    inited= RANDOM;
> +    DBUG_RETURN(random_sample_init());
> +  }
> +  virtual int ha_random_sample(uint inx,
> +                               key_range *min_key,
> +                               key_range *max_key)
> +    __attribute__((warn_unused_result))
> +  {
> +    DBUG_ENTER("ha_random_sample");
> +    DBUG_ASSERT(inited == RANDOM);
> +    DBUG_RETURN(random_sample(inx, min_key, max_key));
> +  }
> +  virtual int ha_random_sample_end() __attribute__((warn_unused_result))
> +  {
> +    DBUG_ENTER("ha_random_sample_end");
> +    inited= NONE;
> +    DBUG_RETURN(random_sample_end());
> +  }
> +
>
> This is the default implementation for a storage engine which does not
> support it:
>
> +  virtual int random_sample_init() { return 0; } ;
> +  virtual int random_sample(uint idx, key_range *min_key, key_range
> *max_key)
> +  {
> +    return HA_ERR_WRONG_COMMAND;
> +  }
> +  virtual int random_sample_end() { return 0; };
>
> Alternative ideas: random_sample_init() takes the idx as a parameter and
> random_sample just fetches a row from the range using the index previously
> specified.
> The range can be left unspecified with nulls to provide a fetch
> from the full table range.
> I don't know enough about storage engine internals to know if an index
> declaration within the init function instead of within the "sample"
> function is better. Maybe I am complicating it too much and a simple
> random_sample() function is sufficient, kind of how ha_records_in_range
> does it.
>
> Thoughts?
> Vicențiu

BR
 Sergei
-- 
Sergei Petrunia, Software Developer
MariaDB Corporation | Skype: sergefp | Blog: http://s.petrunia.net/blog
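
To make the "alternative idea" and the seed question above a bit more concrete,
here is a rough standalone sketch. It is not handler code: ToySampler, the
TOY_* error codes and the seed argument are all made up for illustration, and
a sorted std::vector stands in for an index. It only shows the call shape where
the index is chosen at init time, null bounds mean "whole table", a fixed seed
reproduces the same rows, and the same row may come back twice.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <random>
#include <utility>
#include <vector>

// Hypothetical error codes for this toy; not MariaDB's HA_ERR_* values.
enum ToyError {
  TOY_OK = 0,
  TOY_ERR_NOT_INITED = 1,
  TOY_ERR_WRONG_INDEX = 2,
  TOY_ERR_EMPTY_RANGE = 3
};

// Toy "engine" with a single index (number 0) over sorted integer keys.
class ToySampler {
public:
  explicit ToySampler(std::vector<int> sorted_keys)
      : keys_(std::move(sorted_keys)), inited_(false) {}

  // Alternative shape from the proposal: the index is chosen at init time.
  // The seed parameter is an assumption, added to make runs repeatable.
  int sample_init(unsigned idx, std::uint64_t seed) {
    if (idx != 0) return TOY_ERR_WRONG_INDEX;
    rng_.seed(seed);
    inited_ = true;
    return TOY_OK;
  }

  // Fetch one key from [*min_key, *max_key]; a null bound is open-ended.
  // Draws are independent, so the same row can be returned more than once.
  int sample(const int *min_key, const int *max_key, int *out) {
    if (!inited_) return TOY_ERR_NOT_INITED;
    auto lo = min_key
                  ? std::lower_bound(keys_.begin(), keys_.end(), *min_key)
                  : keys_.begin();
    auto hi = max_key
                  ? std::upper_bound(keys_.begin(), keys_.end(), *max_key)
                  : keys_.end();
    if (lo >= hi) return TOY_ERR_EMPTY_RANGE;
    // Pick a uniformly distributed offset inside the matching key range.
    std::uniform_int_distribution<std::ptrdiff_t> pick(0, (hi - lo) - 1);
    *out = *(lo + pick(rng_));
    return TOY_OK;
  }

  int sample_end() {
    inited_ = false;
    return TOY_OK;
  }

private:
  std::vector<int> keys_;
  std::mt19937_64 rng_;
  bool inited_;
};
```

A caller would do sample_init(0, seed), then call sample() as many times as it
wants rows, then sample_end(). Whether the real API should sample with
replacement like this, or guarantee distinct rows, is exactly the open question
above; the toy just picks the simpler behaviour.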