Re: [MarkLogic Dev General] Maximum size for a map?

Will Thompson Tue, 07 May 2013 10:18:17 -0700

Damon, that's a good idea. The only problem is that each value would
actually be multiple values in some XML, so maybe they could be serialized
as JSON as a way of shoehorning them into a string.


Ultimately I think serializing a giant map to the database won't be
workable due to limitations of the in-memory list and expanded tree cache,
especially if we want to grow the map. We can test 1) breaking up the map,
and 2) performance of doing it with a range index instead.
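
A minimal sketch of what option 2 might look like, following Damon's
cts:element-values suggestion. The document shape is an assumption, not
something we have built: each pair would be its own fragment shaped like
<entry><key>k</key><value>v</value></entry>, with a string element range
index on <value>. The element names and the external $key variable are
hypothetical.

```xquery
xquery version "1.0-ml";

(: Sketch only. Assumes one <entry> per fragment and a string element
   range index on <value>; "key", "value", and $key are made-up names. :)
declare variable $key as xs:string external;

cts:element-values(
  xs:QName("value"),  (: read values straight from the range index :)
  (),                 (: no start value :)
  (),                 (: no options :)
  (: restrict to fragments whose <key> matches exactly :)
  cts:element-value-query(xs:QName("key"), $key, "exact"))
```

Since lexicon calls resolve per fragment, this only "keys" the lookup
correctly if each entry really is its own fragment. A key with several
values would return all of them as a sequence, which may remove the need
to pack them into one string.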

-Will


On 5/6/13 7:01 PM, "Damon Feldman" <[email protected]> wrote:

>Will,
>
>You may be able to use range indexes either by using cts:element-values
>with an element-value-query to "key" the lookup and have the value in the
>index, or by range-indexing a value that has the key and value separated
>by a token. This may not be quite as fast as a map lookup but can simplify
>your code.
>
>If you describe the nature of the lookup we can brainstorm other ideas.
>
>Yours,
>Damon
>
>--
>Damon Feldman
>Sr. Principal Consultant, MarkLogic
>
>
>-----Original Message-----
>From: [email protected]
>[mailto:[email protected]] On Behalf Of Will
>Thompson
>Sent: Monday, May 06, 2013 9:01 PM
>To: MarkLogic Developer Discussion
>Subject: Re: [MarkLogic Dev General] Maximum size for a map?
>
>The map won't need to be updated frequently, so the idea is to serialize
>it to the database and filesystem for portability. Then on first use, it
>gets loaded into a server field.  My tests are showing you're pretty spot
>on for the deserializing time. But after that it's loaded in the field
>and always available. My worry is about that initial doc() call on boxes
>that may have a smaller expanded-tree cache. In this case, is my only
>option to ensure each box has a cache setting large enough to hold the
>400MB deserialized map or face XDMP-EXPNTREECACHEFULL? I could try/catch
>and throw a friendlier error for the small systems.
>
>I chose map for speed, but if that's too much trouble then I suppose
>the key/value pairs could also be stored in a non-map document with a
>range index on the keys and fragment root set to its children. Then there
>would be no need for doc(), although I'm not sure how much speed that
>would give up.
>
>-Will
> 
>
>On 5/6/13 4:48 PM, "Michael Blakeley" <[email protected]> wrote:
>
>>Yes, any doc() call will use space in the expanded-tree cache. So you
>>might end up with X in the cache, plus Y for the deserialized map.
>>
>>I would also worry about how long it might take to deserialize a 400-MB
>>map, even if the XML is already in cache. My guess is around 30-sec to
>>construct the map. If the cache is cold that might double because the
>>fragment has to be read from disk and decoded. But those are just
>>guesses.
>>
>>There are a couple of approaches that might avoid that cost. One is to
>>break up the map into multiple small documents. You could query a
>>special directory or collection for documents that have the key(s) you
>>need, and let the expanded-tree cache handle the memory management.
>>Each map would be relatively small, so deserialization wouldn't be as
>>expensive.
>>
>>Another approach is to keep the map in a server field. That would be
>>both powerful and dangerous, because the memory for a server field is
>>persistent. We are used to working with query allocations, which
>>disappear when the query ends. So a single query is limited in its
>>scope for damage. But a 400-MB server field allocates 400-MB per eval
>>host, for the lifetime of the host process.
>>
>>So you'd want to be very careful to ensure that each host has exactly
>>one of these huge server fields. You'd also have to be very careful
>>about updating the map, partly because of the size and also because
>>server fields do not offer much in the way of memory protection.
>>Depending on your needs you might be able to do some sort of A-B
>>switching when you need to update the map, or develop a locking
>>strategy, or both.
>>
>>-- Mike
>>
>>On 6 May 2013, at 16:29 , Will Thompson <[email protected]>
>>wrote:
>>
>>> Mike - I should have been a little more specific about the use case.
>>> What if that map is serialized to the db; would calling doc() on that
>>> potentially overload the expanded tree cache?
>>> 
>>> let $m := map:map(doc('/path/to/map.xml')/map:map)
>>> return xdmp:set-server-field('my-map', $m)
>>> 
>>> Best guess on the QA server is that ML was installed when its VM was
>>> allocated fewer resources. But that's a good point about catching bad
>>> queries.
>>> 
>>> -Will
>>> 
>>> 
>>> On 5/6/13 4:05 PM, "Michael Blakeley" <[email protected]> wrote:
>>> 
>>>> No, maps don't use expanded tree cache space. A really large map
>>>> might hit some per-eval limits, but I didn't find them when I
>>>> created a map of around 800-MiB on my laptop, with 6.0-3. I used
>>>> xdmp:quote to try to make sure the map would really allocate more
>>>> space for each entry. This was fine at 80-MiB and took about 5-sec.
>>>> For 800-MiB it took a little longer, and the OS swapped some pages
>>>> out. So I conclude that it was working hard to allocate all the
>>>> memory.
>>>> 
>>>> let $m := map:map()
>>>> let $n := doc()[1]
>>>> let $_ := (1 to 1000000) ! (
>>>>   map:put($m, xdmp:integer-to-hex(xdmp:random()), xdmp:quote($n)))
>>>> return (
>>>>   map:count($m) * string-length(xdmp:quote($n)) div (1024 * 1024),
>>>>   xdmp:elapsed-time())
>>>> =>
>>>> 802.04010009765625
>>>> PT1M6.429219S
>>>> 
>>>> On that QA system, you might have set the expanded tree cache size
>>>> to a smaller value on purpose. That can be a good way to catch
>>>> poorly-optimized queries.
>>>> 
>>>> -- Mike
>>>> 
>>>> On 6 May 2013, at 14:44 , Will Thompson <[email protected]>
>>>> wrote:
>>>> 
>>>>> Here's another one related to the Expanded Tree Cache: Say I want
>>>>> to load a giant map: 400MB or more. Will this always be dependent
>>>>> on the size of the Expanded Tree Cache? Most of our dev machines
>>>>> have an Expanded Tree Cache big enough to handle a map like this,
>>>>> but some don't, and for some reason our QA server is set to an
>>>>> inexplicably small value. Is it advisable to just manually increase
>>>>> that value so everything fits? Are there any other general rules
>>>>> when adjusting server spec values? I have mostly heard "look don't
>>>>> touch" with regard to these settings.
>>>>> 
>>>>> -Will
>>>>> 
>>>>> _______________________________________________
>>>>> General mailing list
>>>>> [email protected]
>>>>> http://developer.marklogic.com/mailman/listinfo/general
>>>>> 
>>>> 
