On 12/04/13 19:49, Christian Dahlqvist wrote:
Hi Toby,

Inserting lots of small records into Riak and querying the full data set
via MapReduce is definitely not the best way to go about things. As
Alexander points out, each object is stored with metadata, which adds
some overhead, and Riak MapReduce tends to work best when run over
smaller data sets.

In order to help come up with suggestions on how to model your data
efficiently, I would need to better understand the nature of the data you
want to store in Riak, the access patterns, and how you need to be able
to query it.

- What does the data represent? Is there any natural way to group records?
- How frequently are records inserted? How often are they updated? Do
you delete records or keep them forever?
- How do you need to be able to query this data? What is the logical
task of the MapReduce job you described?
- In what different ways is the data aggregated?


Hi Christian,
thanks for your offer of help.

The dataset I'm looking at is roughly described as follows:

Records consist of a small number of organisations and their metadata; a moderate number of people associated with those organisations, along with their metadata; and then a large number of events, each of which refers to a single person and a single organisation, together with data about the event.

The data goes back in time a fair way, and the intent is to keep it indefinitely. New data arrives in batches throughout the year. The only time data is deleted is when it is discovered to be faulty and is replaced by a new batch of corrected records.

There are some ways the data could be logically grouped, such as by organisation, or by person, but we can't just roll it all up into aggregates at the start.
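
To make that concrete, here is roughly the sort of bucket and key layout I had been imagining; the bucket names, key format and field names below are just placeholders, nothing is settled:

    # Rough sketch only - bucket names, key formats and fields are placeholders.
    import riak

    client = riak.RiakClient()  # defaults; the real cluster settings would differ

    orgs = client.bucket('organisations')
    people = client.bucket('people')
    events = client.bucket('events')

    # One object per organisation and per person, keyed by id.
    orgs.new('org:42', data={'name': 'Acme', 'region': 'AU'}).store()
    people.new('person:1001', data={'org': 'org:42', 'name': 'J. Smith'}).store()

    # Events keyed by period plus the parent ids, so a time range corresponds
    # to a key prefix.
    events.new('2013-04:org:42:person:1001:event:900001',
               data={'value1': 12.5, 'value2': 3}).store()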

Actual queries can easily be segmented so that each one works over only a subset of the whole dataset. We do that already on the existing SQL system.

The actual queries we run usually aggregate the data from the raw events up to either person or organisation level by time period. Some are quite simple queries, such as "number of events per org where this single value exceeded a fixed threshold". Some others are more complex, such as finding events where certain values have exceeded the typical range for that value as seen by the parent person or organisation.
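
For example, I'd pictured that simple threshold query turning into a MapReduce job along these lines, posted to Riak's /mapred HTTP endpoint (the field names, org field and threshold are invented for illustration):

    # Sketch of "number of events per org where value1 exceeded a threshold"
    # as a Riak MapReduce job over the whole events bucket (which is exactly
    # the part that has been slow in my trials). All names are invented.
    import json
    import requests

    job = {
        'inputs': 'events',
        'query': [
            {'map': {'language': 'javascript',
                     'source': """function(v) {
                         var d = JSON.parse(v.values[0].data);
                         return d.value1 > 100 ? [[d.org, 1]] : [];
                     }"""}},
            {'reduce': {'language': 'javascript',
                        'source': """function(pairs) {
                            var acc = {};
                            pairs.forEach(function(p) { acc[p[0]] = (acc[p[0]] || 0) + p[1]; });
                            var out = [];
                            for (var org in acc) out.push([org, acc[org]]);
                            return out;
                        }"""}}
        ]
    }

    resp = requests.post('http://localhost:8098/mapred',
                         data=json.dumps(job),
                         headers={'Content-Type': 'application/json'})
    print(resp.json())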

The big queries don't need to run instantly; we run them all and store the results. The queries tend to get tweaked or updated regularly, requiring them to be re-run over the entire dataset, or at least the recent periods, and it's desirable that this process completes in a reasonable timeframe.



Currently this all runs in a big PostgreSQL database, reasonably successfully, but as the amount of data and the number of queries grow, I'm investigating options to horizontally scale the system, and on the face of it I think the data and queries would suit a map-reduce system?


Cheers,
Toby


On 12 Apr 2013, at 09:11, Alexander Sicular <[email protected]> wrote:

Inline -Alexander.

On Apr 12, 2013 3:11 AM, "Toby Corkindale" <[email protected]> wrote:
>
> Hi,
> I wondered if I could get a little advice on good practices for
> storing my records in Riak, such that they perform reasonably well in
> map-reduce queries?
>
> I have a little over 200 million records, currently stored in a
> regular SQL database. I'm expecting this dataset to continue to grow,
> of course.
> Each record is reasonably small - some get up to a couple of hundred
> bytes, but most are smaller, and consist of around a dozen numeric
> fields and some small alphanumeric identifier fields.
>
> My initial trial of importing these into Riak was to take each
> database row and convert it into a small JSON object of key=>value pairs.
>
> I'm finding two issues with this, though.
> 1) It takes a really long time to import everything into Riak, at
> least compared to ingesting into PostgreSQL. (I'm using Riak's HTTP API.)

The protocol buffers interface is faster - less overhead.
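
For example, with the Python client (a sketch only - constructor arguments vary between client versions, 8087 is the default pb port, and the loader function here is made up):

    # Sketch: bulk loading over the protocol buffers port instead of HTTP.
    import riak

    client = riak.RiakClient(protocol='pbc', pb_port=8087)
    events = client.bucket('events')

    for key, record in rows_from_sql():   # made-up generator over the SQL rows
        events.new(key, data=record).store()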

> 2) An initial trial of some map-reduce queries was significantly
> slower than I was hoping; I suspect this is because of my data
> structure, though.
> My initial map phase was iterating over a high percentage of the
> keys, decoding the JSON, and then returning just one or two of the
> fields from the JSON structure, which is maybe an inefficient way to
> go about things?

Not the most efficient. Everything is translated from Erlang to
JavaScript and shipped over to the coordinating node. Run MapReduce over
smaller sets, and accumulate over some range like time, or something
native to your app.
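
For instance, a key filter on a time prefix (assuming the keys embed the period) keeps each run down to a single period - a rough sketch, with made-up names:

    # Sketch: count events for one period only, using a key filter so the job
    # never touches the rest of the bucket. Bucket and key naming are made up.
    import json
    import requests

    job = {
        'inputs': {'bucket': 'events',
                   'key_filters': [['starts_with', '2013-04']]},
        'query': [
            {'map': {'language': 'javascript',
                     'source': 'function(v) { return [1]; }'}},
            {'reduce': {'language': 'javascript',
                        'name': 'Riak.reduceSum'}}
        ]
    }

    resp = requests.post('http://localhost:8098/mapred',
                         data=json.dumps(job),
                         headers={'Content-Type': 'application/json'})
    print(resp.json())

Passing an explicit list of [bucket, key] pairs as the inputs works too, if you already know the keys for a given period.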

>
>
> So I was wondering if there's a better way to be approaching the
> problem. I wondered about breaking up the records further, and
> storing individual fields against keys, rather than the whole record
> as a JSON object.
>
> Eg. This was my initial method:
> Key: {id}:{recordtype}:{recordid}
> Value: { field1: "foo", field2: "bar", field3: "baz" }
>
> I wondered about this, creating one key for each field:
> {id}:{recordtype}:{recordid}:field1 ==> "foo"
> {id}:{recordtype}:{recordid}:field2 ==> "bar"
> {id}:{recordtype}:{recordid}:field3 ==> "baz"
>

Don't do that. There is ~400 bytes of per-key overhead in Riak.
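
Back-of-envelope with the numbers mentioned above (200 million records, around a dozen fields, ~400 bytes per key as a rough figure):

    # Rough arithmetic only; 400 bytes/key is an approximation.
    records = 200 * 10**6
    overhead = 400                               # bytes per key, roughly
    one_key_per_record = records * overhead      # ~80 GB of key overhead
    one_key_per_field = records * 12 * overhead  # ~960 GB with a dozen fields

That's an order of magnitude more overhead before any actual field data is stored.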

> That would avoid the need for one of the map phases; but on the
> other hand, now I'd be creating an order of magnitude more overall
> keys in the db.
>
>
> On the other hand, I wondered about going the other way, and
> grouping records under one key. So instead of having keys 100, 101,
> 102 .. 109, I would have one key "10x" that contained a JSON structure
> with an array of records. (I don't know whether I'd store 10, 50 or 100
> records per key.)
>

I might batch depending on your access pattern and update pattern. If
you update values with any frequency it may not be worth it. Riak has
no in-place updates.
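
If you do batch, an update turns into something like this read-modify-write append (a sketch, names made up) - the whole batch object gets rewritten each time, and concurrent writers need thought:

    # Sketch: ~50 events per object, appended by fetch/modify/store.
    # No in-place updates means every append rewrites the whole object.
    import riak

    client = riak.RiakClient()
    batches = client.bucket('event_batches')

    def append_event(batch_key, event):
        obj = batches.get(batch_key)
        records = obj.data or []   # data is None if the key doesn't exist yet
        records.append(event)
        obj.data = records
        obj.store()

    append_event('org:42:2013-04:batch:0',
                 {'person': 'person:1001', 'value1': 12.5})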

> This would speed up the time taken to ingest data into Riak, and
> reduce the total number of queries made by the map phase, but would
> increase the work DONE in the map phase and add inefficiencies, as
> sometimes only a few rows of the set would actually be required for a
> given query.
>
>
>
> And the third consideration is that maybe I just need to scale up
> the cluster size to have more machines. Currently it's running on a
> small cluster of four nodes while trialling Riak. (And I'm comparing
> performance with a single, but significantly more powerful, PostgreSQL
> node.)
>
>
> There's nothing to stop me trying out all these methods, but I
> thought I'd poll the community for advice, since people have no doubt
> implemented similar things before and know roughly what may or may not
> work well.
>
>
> Thanks,
> Toby


