Good questions, all! To start, I am just trying to take a percentage of the rows in a table, similar to a percentile, to construct training, cross-validation and testing sets. I am a machine learning person and want to be able to take, say, a 25% random sample of the rows in a table (I may not know the table's size, and the percentage should be settable). Starting with the easiest assumption, that all rows are the same "type", will get things started. I can then move on to more exotic scenarios. Accumulo is a new nut for me to crack, and I would very much like your thoughts. Thanks mate!
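A minimal sketch of the kind of per-row sampling described above, written in plain Python rather than against the Accumulo API. It assumes each row has a string row ID and decides row by row whether to keep it, so the table size never needs to be known. The names `row_fraction`, `in_sample`, `assign_split` and the `seed` parameter are hypothetical, and hashing the row ID is just one possible technique, not something taken from this thread. The result is an approximate fraction (about 25% of rows in expectation), not an exact count.

```python
import hashlib


def row_fraction(row_id: str, seed: str = "seed-0") -> float:
    """Map a row ID to a repeatable pseudo-random value in [0, 1)."""
    digest = hashlib.sha256(f"{seed}:{row_id}".encode("utf-8")).digest()
    # Keep the top 53 bits so the quotient is exactly representable and < 1.0.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def in_sample(row_id: str, fraction: float, seed: str = "seed-0") -> bool:
    """Keep a row with probability `fraction`, e.g. 0.25 for a 25% sample."""
    return row_fraction(row_id, seed) < fraction


def assign_split(row_id: str, train: float = 0.6, cv: float = 0.2,
                 seed: str = "seed-0") -> str:
    """Assign a row to 'train', 'cv' or 'test'; test gets the remainder."""
    u = row_fraction(row_id, seed)
    if u < train:
        return "train"
    if u < train + cv:
        return "cv"
    return "test"


# Hypothetical row IDs standing in for the rows of a table.
rows = [f"row{i:06d}" for i in range(10000)]
sample = [r for r in rows if in_sample(r, 0.25)]
splits = [assign_split(r) for r in rows]
```

Because the decision depends only on the row ID and the seed, the same rows are selected on every scan, and the three sets never overlap. In Accumulo the same per-key test could presumably live in a server-side filter iterator, with the fraction and seed passed in as iterator options; that part is an assumption and is not shown here.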

On Tue, Feb 4, 2014 at 7:27 PM, Chris Bennight [via Apache Accumulo] <[email protected]> wrote:

> I'm assuming you want a random selection of entries in Accumulo - so say a
> random selection of keys/values?
>
> How are your keys formatted (conceptually is fine); is there some sort of
> regularity to them? (I.e. can you calculate ahead of time a random
> distribution of keys without validating which keys are present)?
>
> If you can't calculate the key distribution ahead of time, are you keeping
> any statistics (or could you) on ingest (cardinality, distribution, etc.) -
> and finally, how rigorous and performant do you need this random sampling
> to be? Do you just want representative data, or are you trying to do
> something like BlinkDB[1] (allow people to specify confidence intervals on
> queries, and only sample enough data to meet the requisite uncertainty
> requirements)?
>
> [1] http://blinkdb.org/
>
> Chris
>
> On Sat, Feb 1, 2014 at 3:58 PM, cprigano <[hidden email]> wrote:
>
> > I am looking at writing an Accumulo iterator to return a random sample of a
> > percentile of a table.
> >
> > I would appreciate any suggestions.
> >
> > Thanks,
> >
> > Chris
> >
> > --
> > View this message in context:
> > http://apache-accumulo.1065345.n5.nabble.com/Accumulo-iterator-to-return-a-random-sample-of-a-percentile-of-a-table-tp7354.html
> > Sent from the Developers mailing list archive at Nabble.com.

--
View this message in context: http://apache-accumulo.1065345.n5.nabble.com/Accumulo-iterator-to-return-a-random-sample-of-a-percentile-of-a-table-tp7354p7400.html
Sent from the Developers mailing list archive at Nabble.com.
