Hi Rusty,
On Thu, Apr 14, 2011 at 8:00 PM, Rusty Klophaus <[email protected]> wrote:
> Hi Morten,
> Thanks for sending the log files. I was able to figure out, at least
> partially, what's going on here.
Fantastic - thanks!
> The "Failed to compact" message is a result of trying to index a token
> that's greater than 32kb in size. (The index storage engine, called
> merge_index, assumes token sizes smaller than 32kb.) I was able to decode
> part of the term in question by pulling data from the log file, and it looks
> like you may be indexing HTML with base64-encoded inline images, i.e.: <img
> src="data:image/jpeg;base64,iVBORw0KG..."> The inline image is being treated
> as a single token, and it's greater than 32kb.
That's odd - in the search schema, I asked it to ignore everything
besides a few specific fields:
{
    schema,
    [
        {version, "0.1"},
        {default_field, "_owner"},
        {n_val, 1}
    ],
    [
        %% Don't parse _id and _owner, just treat it as single token
        {field, [
            {name, "id"},
            {required, true},
            {analyzer_factory, {erlang, text_analyzers,
                                noop_analyzer_factory}}
        ]},
        {field, [
            {name, "_owner"},
            {required, true},
            {analyzer_factory, {erlang, text_analyzers,
                                noop_analyzer_factory}}
        ]},
        %% Parse Name fields for full-text indexing
        {field, [
            {name, "displayName"},
            {aliases, ["nickname", "preferredUsername",
                       "name_formatted", "name_displayName"]},
            {analyzer_factory, {erlang, text_analyzers,
                                standard_analyzer_factory}}
        ]},
        {field, [
            {name, "emails_value"},
            {analyzer_factory, {erlang, text_analyzers,
                                standard_analyzer_factory}}
        ]},
        %% Add modification dates
        {field, [
            {name, "published"},
            {aliases, ["updated"]},
            {type, date}
        ]},
        %% Skip all else...
        {dynamic_field, [
            {name, "*"},
            {skip, true}
        ]}
    ]
}.
(We're indexing Portable Contacts, where the user images reside in an
'image' field.)
> The short term workaround is to either:
> 1) Preprocess your data to avoid this situation.
> 2) Or, create a custom analyzer that limits the size of terms
> (See http://wiki.basho.com/Riak-Search---Schema.html for more information
> about analyzers and custom analyzers.)
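For option 1, here is a minimal Python sketch of what the preprocessing could look like on the client side before a record is sent to Riak. It is only an illustration under assumptions: the function names are hypothetical, the record is assumed to be a JSON-like dict, and the 32 * 1024 byte boundary is my reading of "smaller than 32kb". It drops any whitespace-delimited token at or above the limit, which is conservative as long as the analyzer also splits on whitespace.

```python
# Assumed interpretation of the "32kb" token limit mentioned above.
MAX_TOKEN_BYTES = 32 * 1024


def drop_oversized_tokens(text, max_bytes=MAX_TOKEN_BYTES):
    """Remove whitespace-delimited tokens whose UTF-8 size is >= max_bytes.

    Note: runs of whitespace are collapsed to single spaces.
    """
    kept = [tok for tok in text.split() if len(tok.encode("utf-8")) < max_bytes]
    return " ".join(kept)


def preprocess(value, max_bytes=MAX_TOKEN_BYTES):
    """Recursively clean every string in a JSON-like record (hypothetical helper)."""
    if isinstance(value, str):
        return drop_oversized_tokens(value, max_bytes)
    if isinstance(value, list):
        return [preprocess(v, max_bytes) for v in value]
    if isinstance(value, dict):
        return {k: preprocess(v, max_bytes) for k, v in value.items()}
    return value  # numbers, booleans, None pass through unchanged
```

Something like `preprocess(contact)` would then be called on each Portable Contacts record before storing it. This also strips the inline image data from the stored document, so it only fits if the image can live elsewhere.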
> The long term solution is for us to increase the maximum token size in
> merge_index. I've filed a bugzilla issue for this, trackable
> here: https://issues.basho.com/show_bug.cgi?id=1069
> Still investigating the "Too many db tables" error. This is being caused by
> the system opening too many ETS tables. It *may* be related to the
> compaction error described above, but I'm not sure.
> Search (specifically merge_index) uses ETS tables heavily, and the number of
> tables is affected by a few different factors. Can you send me some more
> information to help debug, specifically:
>
> How many partitions (vnodes) are in your cluster? (If you haven't changed
> any settings, then the default is 64.)
It's 64 (no defaults changed at all).
> How many machines are in your cluster?
Four.
> How many segments are on the node where you are seeing these errors?
> (Run: "find DATAPATH/merge_index/*/*.data | wc -l", replacing DATAPATH with
> the path to your Riak data directory for that node.)
foreach srv ( nosql1 nosql2 nosql4 nosql5 )
    echo -n "$srv "; ssh $srv sh -c 'find /var/lib/riaksearch/merge_index/*/*.data | wc -l'
end
nosql1 32434
nosql2 14170
nosql4 15480
nosql5 13501
(nosql1 is the one the error log is lifted from - but the errors
seemed to come from all of the servers.)
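To put those counts in per-vnode terms, here is a rough bit of arithmetic in Python. It assumes the 64 partitions are spread evenly across the four machines (16 vnodes each), which I have not verified, and the function name is just for illustration.

```python
def segments_per_vnode(segment_counts, partitions, machines):
    """Average segment files per vnode on each server.

    Assumes partitions are distributed evenly across machines.
    """
    vnodes_per_node = partitions / machines
    return {srv: count / vnodes_per_node for srv, count in segment_counts.items()}


# Segment counts as reported by the find/wc loop above.
counts = {"nosql1": 32434, "nosql2": 14170, "nosql4": 15480, "nosql5": 13501}
averages = segments_per_vnode(counts, partitions=64, machines=4)
for srv, avg in averages.items():
    print(srv, round(avg, 1))
```

Under that assumption nosql1 averages roughly two thousand segments per vnode, a bit more than twice the other three.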
> Approximately how much data are you loading (# Docs and # MB), and how
> quickly are you trying to load it?
~17m records, weighing in just shy of four GB.
While I didn't do the loading, I believe we did it with 25 concurrent
threads, using the four machines in round-robin fashion.
/Siebuhr
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com