OK, I've changed my two VMs so that each has:
3 CPUs, 1GB RAM, 120GB disk
I'm ingesting the Twitter spritzer stream (about 10-20 tweets per second,
approx. 2KB of data per tweet). One bucket stores the full tweets and is
not indexed. Another bucket is indexed and stores the tweet string, id,
date and username. A maximum of 20 clients can be hitting the 'cluster'
at any one time.
I'm using n_val=2 so there is replication going on behind the scenes.
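For a rough sense of scale, the figures above can be turned into a daily volume. This is only back-of-the-envelope arithmetic for the full-tweet bucket: it assumes 2KB means 2000 bytes, ignores backend and index overhead, and assumes that n_val=2 on two nodes leaves one full copy on each node.

```python
# Back-of-the-envelope ingest volume for the stream described above.
# Assumptions (not measured): 2000 bytes per tweet, no storage overhead,
# and n_val=2 on a 2-node cluster puts one full copy on each node.

TWEET_BYTES = 2000
SECONDS_PER_DAY = 86_400
N_VAL = 2
NODES = 2
DISK_BYTES = 120 * 10**9  # 120GB disk per VM


def per_node_bytes_per_day(tweets_per_second):
    """Raw bytes written per node per day, replicas included."""
    cluster_bytes = tweets_per_second * TWEET_BYTES * SECONDS_PER_DAY * N_VAL
    return cluster_bytes / NODES


for rate in (10, 20):
    daily = per_node_bytes_per_day(rate)
    days_to_fill = DISK_BYTES / daily
    print(f"{rate} tweets/s: {daily / 1e9:.2f} GB/day per node, "
          f"disk full in ~{days_to_fill:.0f} days")
```

Under those assumptions that is roughly 1.7 to 3.5 GB per node per day, so the write volume itself is small, which fits with disk and network I/O being low.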
I'm using a hardware load-balancer to distribute the work between the two
nodes, and now I'm seeing about 75% CPU usage on each node, as opposed to
the earlier 100% on one node and 50% on the replicating-only node.
I've monitored the VMs over the last few days and they seem to be
mostly CPU-bound. Disk I/O and network I/O are both low.
Q: Can I change the pre-commit hook to a post-commit trigger or something
similar, or would that make no difference at all? I'm OK if the tweets
don't get indexed immediately and there's a slight lag in indexing,
if it saves on CPU.
Here's my search schema (the default, I think):
root@ha2:/var/log/riaksearch# search-cmd show_schema Index
Attempting to restart script through sudo -u riak
%% Schema for 'Index'
{
schema,
[
{version, "1.1"},
{n_val, 3},
{default_field, "value"},
{analyzer_factory, {erlang, text_analyzers,
whitespace_analyzer_factory}}
],
[
%% Field names ending in "_num" are indexed as integers
{dynamic_field, [
{name, "*_num"},
{type, integer},
{analyzer_factory, {erlang, text_analyzers,
integer_analyzer_factory}}
]},
%% Field names ending in "_int" are indexed as integers
{dynamic_field, [
{name, "*_int"},
{type, integer},
{analyzer_factory, {erlang, text_analyzers,
integer_analyzer_factory}}
]},
%% Field names ending in "_dt" are indexed as dates
{dynamic_field, [
{name, "*_dt"},
{type, date},
{analyzer_factory, {erlang, text_analyzers,
noop_analyzer_factory}}
]},
%% Field names ending in "_date" are indexed as dates
{dynamic_field, [
{name, "*_date"},
{type, date},
{analyzer_factory, {erlang, text_analyzers,
noop_analyzer_factory}}
]},
%% Field names ending in "_txt" are indexed as full text"
{dynamic_field, [
{name, "*_txt"},
{type, string},
{analyzer_factory, {erlang, text_analyzers,
standard_analyzer_factory}}
]},
%% Field names ending in "_text" are indexed as full text"
{dynamic_field, [
{name, "*_text"},
{type, string},
{analyzer_factory, {erlang, text_analyzers,
standard_analyzer_factory}}
]},
%% Everything else is a string
{dynamic_field, [
{name, "*"},
{type, string},
{analyzer_factory, {erlang, text_analyzers,
whitespace_analyzer_factory}}
]}
]
}.
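A rough sketch of how the dynamic field rules in this schema would apply to the field names in the indexed record below. It assumes the rules are tried in order and the first match wins, and it uses Python's glob matching as a stand-in for whatever Riak Search does internally, so treat the result as an illustration rather than a statement about the implementation.

```python
from fnmatch import fnmatchcase

# Dynamic field rules copied from the schema above, in the same order.
DYNAMIC_FIELDS = [
    ("*_num", "integer", "integer_analyzer_factory"),
    ("*_int", "integer", "integer_analyzer_factory"),
    ("*_dt", "date", "noop_analyzer_factory"),
    ("*_date", "date", "noop_analyzer_factory"),
    ("*_txt", "string", "standard_analyzer_factory"),
    ("*_text", "string", "standard_analyzer_factory"),
    ("*", "string", "whitespace_analyzer_factory"),
]


def resolve_field(name):
    """Return (type, analyzer) of the first rule whose pattern matches name.

    First-match-wins is an assumption about how the schema is evaluated.
    """
    for pattern, field_type, analyzer in DYNAMIC_FIELDS:
        if fnmatchcase(name, pattern):
            return field_type, analyzer
    raise LookupError(name)


for field in ("created_at", "tweet", "screen_name"):
    print(field, resolve_field(field))
```

If that assumption holds, all three fields fall through to the catch-all rule, so created_at is tokenized on whitespace as a string rather than indexed as a date.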
Here's an indexed record:
root@ha1:~# curl -s http://ha:8098/riak/gnip/80329247314550784 | json_xs
{
"created_at" : "Mon Jun 13 17:42:39 +0000 2011",
"tweet" : "@NielJDBSimpson yeaah",
"screen_name" : "SophieBieber69"
}
A non-indexed record:
root@ha1:~# curl -s http://ha:8098/riak/tweets/80329247314550784
"{\"entities\":{\"urls\":[],\"hashtags\":[],\"user_mentions\":[{\"indices\":[0,15],\"screen_name\":\"NielJDBSimpson\",\"name\":\"\\u2665Enielle Anne\\u2665 \\u25d5\\u203f\\u25d5\",\"id_str\":\"197405933\",\"id\":197405933}]},\"retweet_count\":0,\"truncated\":false,\"text\":\"@NielJDBSimpson yeaah\",\"created_at\":\"Mon Jun 13 17:42:39 +0000 2011\",\"place\":null,\"in_reply_to_status_id\":77609368182472704,\"coordinates\":null,\"source\":\"web\",\"geo\":null,\"favorited\":false,\"in_reply_to_status_id_str\":\"77609368182472704\",\"id_str\":\"80329247314550784\",\"in_reply_to_screen_name\":\"NielJDBSimpson\",\"in_reply_to_user_id_str\":\"197405933\",\"user\":{\"lang\":\"en\",\"created_at\":\"Wed May 04 17:02:14 +0000 2011\",\"profile_text_color\":\"3D1957\",\"profile_image_url\":\"http:\\\/\\\/a3.twimg.com\\\/profile_images\\\/1372500926\\\/IMG01256-20110418-1827_normal.jpg\",\"is_translator\":false,\"statuses_count\":124,\"profile_sidebar_fill_color\":\"7AC3EE\",\"listed_count\":0,\"following\":null,\"profile_background_tile\":true,\"friends_count\":425,\"description\":\"I love Justin Bieber i saw him 23\\\/03\\\/2011 in concert best nite ever. Never say Never imma beliber.Follow me I will follow back xx :P\",\"screen_name\":\"SophieBieber69\",\"contributors_enabled\":false,\"verified\":false,\"profile_link_color\":\"FF0000\",\"url\":null,\"profile_sidebar_border_color\":\"65B0DA\",\"default_profile_image\":false,\"time_zone\":null,\"protected\":false,\"id_str\":\"293033762\",\"notifications\":null,\"profile_use_background_image\":true,\"favourites_count\":6,\"location\":\"Sheffield\",\"name\":\"Sophie Bieber \",\"profile_background_color\":\"642D8B\",\"id\":293033762,\"default_profile\":false,\"show_all_inline_media\":false,\"follow_request_sent\":null,\"geo_enabled\":false,\"profile_background_image_url\":\"http:\\\/\\\/a1.twimg.com\\\/images\\\/themes\\\/theme10\\\/bg.gif\",\"utc_offset\":null,\"followers_count\":184},\"id\":80329247314550784,\"contributors\":null,\"retweeted\":false,\"in_reply_to_user_id\":197405933}\r"
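The stored value above is a JSON string that itself contains the tweet JSON (plus a trailing \r), so it has to be decoded twice before fields can be pulled out. Here is a minimal sketch of that decoding and of building the small indexed record from it. The field mapping (text to tweet, user.screen_name to screen_name) is inferred from the two records shown, and the sample input is a shortened, hypothetical stand-in rather than the real stored object.

```python
import json


def decode_stored_tweet(body):
    """Decode a value stored as a JSON string that itself contains JSON."""
    inner = json.loads(body)          # outer layer: a JSON string
    return json.loads(inner.strip())  # inner layer: the tweet object


def make_surrogate(tweet):
    """Keep only the fields that appear in the indexed record above."""
    return {
        "created_at": tweet["created_at"],
        "tweet": tweet["text"],
        "screen_name": tweet["user"]["screen_name"],
    }


# Shortened, hypothetical stand-in for the stored value shown above.
full = {
    "text": "@NielJDBSimpson yeaah",
    "created_at": "Mon Jun 13 17:42:39 +0000 2011",
    "user": {"screen_name": "SophieBieber69"},
    "id": 80329247314550784,
}
stored = json.dumps(json.dumps(full) + "\r")

surrogate = make_surrogate(decode_stored_tweet(stored))
print(json.dumps(surrogate, indent=2))
```

The tweet id is not copied into the surrogate because, in the records above, it is already the key of both objects.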
- Steve Webb
--
Steve Webb - Senior System Administrator for gnip.com
http://twitter.com/GnipWebb
On Thu, 9 Jun 2011, Rusty Klophaus wrote:
Hi Steve,
Riak does best with a lot of memory and a fast disk. Depending on how much
data you have in the system, putting two nodes into 1GB of memory on a
single VM may be causing the system to overrun available resources and page
out to disk, and depending on how you've set up your virtualized
environment, you could be paying extra costs with each disk access,
compounding the problem. My first recommendation would be to either run the
test again while monitoring disk operations using iostat to see if disk is
the problem, or to just go ahead and test on bigger hardware. I suspect you
will see much less of a performance difference between the tests once there
are ample resources.
That said, some slowdown is expected when you turn on indexing, as Riak
Search adds quite a bit of overhead in parsing and tokenizing the document,
and then storing the results.
There are two ways to speed up indexing:
1. Reduce the size of your documents. If your documents are large, but
you only need one or two fields indexed, you can create smaller "surrogate"
documents with just the fields you need indexed, plus a link back to your
original document.
2. Batch your writes using the Solr interface. Riak Search uses
"term-based partitioning". Term-based partitioning reduces complexity during
queries, at the cost of increased complexity during writes. You can gain
back some of the lost performance by batching your writes. This allows the
system to plan which messages it sends more intelligently, thus sending
fewer messages and reducing overhead. The downside here is that you can't
use the Riak KV interface, you need to switch to the Solr interface.
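As a rough illustration of the batching idea in point 2, the sketch below groups documents into fixed-size batches and builds one Solr-style <add> XML payload per batch. The batch size, the sample documents and the use of an "id" field as the document key are assumptions for illustration; check the Riak Search documentation for the exact payload format and update URL your version expects.

```python
from itertools import islice
from xml.etree import ElementTree as ET


def batches(docs, size):
    """Yield lists of at most `size` documents from any iterable."""
    it = iter(docs)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def solr_add_xml(batch):
    """Build one Solr-style <add> payload holding every doc in the batch."""
    add = ET.Element("add")
    for doc_id, fields in batch:
        doc = ET.SubElement(add, "doc")
        # Using "id" as the key field is an assumption, see the note above.
        ET.SubElement(doc, "field", name="id").text = str(doc_id)
        for name, value in fields.items():
            ET.SubElement(doc, "field", name=name).text = str(value)
    return ET.tostring(add, encoding="unicode")


# Hypothetical sample data: (key, fields) pairs.
docs = [(str(i), {"tweet": f"tweet {i}", "screen_name": "user"})
        for i in range(5)]
payloads = [solr_add_xml(b) for b in batches(docs, 2)]

print(len(payloads), "payloads")
print(payloads[0])
```

Each payload would then be sent in a single HTTP POST to the Solr update endpoint, so five documents cost three requests here instead of five separate writes.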
Would you mind describing the size and shape of your data in a bit more
detail (how many objects, average object size, object format, throughput,
etc.) and ideally attaching your Riak Search schema?
Thanks,
Rusty
On Tue, Jun 7, 2011 at 4:35 PM, Steve Webb <[email protected]> wrote:
Hey there.
I'm inserting twitter spritzer tweets into a bucket that doesn't have a
precommit index hook, and a few fields from the tweet into a second bucket
that does have the precommit hook.
Inserts into the indexed bucket are an order of magnitude
slower than inserts into the non-indexed bucket.
I'm using a 2-node cluster of VMware VMs (1GB RAM, 20GB disk each),
Ubuntu 10.04, riaksearch 0.14.0 with n_val = 2.
Is there a way to do lazier indexing, so that it doesn't slow down
inserts so much?
- Steve
--
Steve Webb - Senior System Administrator for gnip.com
http://twitter.com/GnipWebb
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com