RE: [MarkLogic Dev General] building a dictionary from a word lexicon

Danny Sokolsky Tue, 01 May 2007 11:17:25 -0700

Hi Alan,

The dictionary you are loading is larger than your in-memory list size,
which caps the largest fragment you can load into that database.  You need
to either increase the in-memory-list-size parameter on the database into
which the dictionary is being loaded, or break the dictionary up into two
or more smaller dictionaries.  The in-memory-list-size setting is in the
Admin Interface, on the configuration page for the database you are using.
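
For the second option, a minimal sketch of writing the lexicon out as
several smaller dictionary files follows.  The chunk size, the output
path, and the 1.0-ml version declaration are assumptions made for
illustration (older releases use a different prolog syntax), and the
sketch holds the whole word list in memory rather than streaming it.
The right chunk size depends on your in-memory list size setting, so
treat 100000 as a placeholder.

```xquery
xquery version "1.0-ml";

(: Sketch only: $chunk-size and the /tmp output path are hypothetical. :)
(: cts:words() requires the word lexicon to be enabled on the database. :)
let $chunk-size := 100000
let $words := cts:words()
let $chunks := xs:integer(fn:ceiling(fn:count($words) div $chunk-size))
for $i in 1 to $chunks
let $start := ($i - 1) * $chunk-size + 1
return
  (: Write chunk $i to its own file, in the spelling dictionary namespace. :)
  xdmp:save(
    fn:concat("/tmp/dictionary-", $i, ".xml"),
    <dictionary xmlns="http://marklogic.com/xdmp/spell">{
      for $w in fn:subsequence($words, $start, $chunk-size)
      return <word>{$w}</word>
    }</dictionary>)
```

Each resulting file is then loaded as its own dictionary document.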


Also, if you are going to use this with the spell APIs, you should use
the spell:load function to load the dictionary, which puts the document
in the proper collections for the spell API.
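
As a rough illustration, a spell:load call might look like the sketch
below.  The filesystem path and the dictionary URI are hypothetical,
and the module location and version declaration assume a release that
ships the spelling module at /MarkLogic/spell.xqy, so check both
against your installation.

```xquery
xquery version "1.0-ml";

import module namespace spell = "http://marklogic.com/xdmp/spell"
  at "/MarkLogic/spell.xqy";

(: First argument: path of the dictionary file on the server's filesystem.
   Second argument: URI the dictionary document gets in the database.
   Both values here are placeholders. :)
spell:load("/tmp/dictionary-1.xml", "/sp-dictionary-1.xml")
```

Once that transaction has committed, a separate query can pass the
same URI to spell:is-correct or spell:suggest.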

-Danny

-----Original Message-----
From: Alan Darnell [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, May 01, 2007 10:32 AM
To: Danny Sokolsky
Cc: General Mark Logic Developer Discussion
Subject: Re: [MarkLogic Dev General] building a dictionary from a word
lexicon


Thanks Danny,

This worked great, but when I tried to load the resulting file (about
400K words -- lots of specialized medical terms) I got this error:

ERROR: eval-in sp-nih at file:/opt/MarkLogic/Modules/
XDMP-FRAGTOOLARGE: Fragment of /sp-dictionary.xml too large for
in-memory storage: XDMP-INMMLISTFULL: In-memory list storage full;
list: table=89%, wordsused=67%, wordsfree=0%, overhead=33%; tree:
table=0%, wordsused=12%, wordsfree=88%, overhead=0%

Are there some admin settings I can adjust to get past this or should I
break the dictionary file up into smaller chunks or load the thing via
XCC one word at a time?

Alan


On 4/30/07, Danny Sokolsky <[EMAIL PROTECTED]> wrote:
> Hi Alan,
>
> I think your approach would work.
>
> If you really want a dictionary of all of the words in the database, 
> however, this might be easier:
>
> xdmp:save("c:/tmp/tmp.xml",
> <dictionary>{"
> ",
> for $x in cts:words()
> return (
> <word>{$x}</word>, "
> ")
> }</dictionary>)
>
> The quoted newline strings are in there so line breaks will appear between the terms. 
> This includes everything in the db, not just things starting with a-z 
> (not sure if that is what you want or not).  I didn't try this on a 
> large data set, but I think it will work because it will just stream 
> everything out to the disk (assuming you don't run out of disk 
> space...).
>
> Of course using the lexicon to create a dictionary means that all of 
> the words (including the misspelled ones) are put in the dictionary.  
> So maybe I am not reading the intent of your question correctly.
>
> -Danny
>
> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Alan 
> Darnell
> Sent: Monday, April 30, 2007 3:33 PM
> To: [email protected]
> Subject: [MarkLogic Dev General] building a dictionary from a word 
> lexicon
>
>
> I'd like to build a dictionary file for use with the spelling module 
> and base that dictionary on words that appear in my word lexicon.  So 
> I want to dump the contents of the lexicon to a file formatted 
> according to the spelling dictionary schema.
>
> To do this, I'm thinking of running through the lexicon letter by 
> letter and constructing the spelling dictionary from the output.
>
> for $i in cts:word-match("a*") [1 to 2000]
> return
> <word>{$i}</word>
>
> Is this the best way to do this?  I'm thinking that creating a 
> dictionary out of lexicons is probably a pretty common task and that 
> my approach seems cumbersome.  I'm thinking also it would be great if 
> you could have the dictionary automatically update itself based on the
> content of one or more word lexicons as new documents were added, 
> updated, and deleted in a database or databases.
>
> Alan
>
> Alan Darnell
> University of Toronto
> _______________________________________________
> General mailing list
> [email protected] 
> http://xqzone.com/mailman/listinfo/general
>
