Re: [MarkLogic Dev General] Maximum size for a map?

Will Thompson Tue, 07 May 2013 16:08:02 -0700

Okay, so the consensus is the map-based solution is inhumane :) ...and
less manageable. Back to testing.


-Will

On 5/7/13 3:17 PM, "David Lee" <[email protected]> wrote:

>IMHO, eventually putting too much stuff in a map will kill the horse.
>When? Only the horse will know :)
>I encourage you to consider the alternatives, such as range indexes and
>lexicons ... the system will manage the memory for you, and it may not be
>*quite* as fast as a map, but then fast isn't always better if your horse
>dies.
>
>
>-----------------------------------------------------------------------------
>David Lee
>Lead Engineer
>MarkLogic Corporation
>[email protected]
>Phone: +1 812-482-5224
>Cell:  +1 812-630-7622
>www.marklogic.com
>
>
>-----Original Message-----
>From: [email protected]
>[mailto:[email protected]] On Behalf Of Will
>Thompson
>Sent: Tuesday, May 07, 2013 6:15 PM
>To: MarkLogic Developer Discussion
>Subject: Re: [MarkLogic Dev General] Maximum size for a map?
>
>Not to beat a dead horse, but I thought I had come up with a way to get
>around the expanded-tree cache error when populating a large map from disk
>by using spawned batches: create a map, pass that as an external variable
>to multiple spawn tasks (using the "result" option to force it to wait
>until they're done) that will each read from the database and add to the
>map before finally inserting. My assumption was that the expanded-tree
>cache limits were per transaction and that by batching the map loading
>into spawned tasks, it wouldn't fill up as long as each batch never exceeded
>the limit. However, that doesn't seem to be the case, and this still gives
>an expanded-tree cache error. Did I misunderstand something?
>
>-Will
>
>On 5/7/13 11:41 AM, "Geert Josten" <[email protected]> wrote:
>
>>Hi Will,
>>
>>Yes, if you revert to lexicons and range indexes, you can only use atomic
>>values, but if you are actually wrapping a fixed number of atomic values
>>in XML, then you can easily look them up one by one (using separate
>>indexes).
>>The benefit is that using plain docs to look values up fits more
>>naturally into MarkLogic, saving you from the fuss around initialization
>>and updating.
>>
>>Cheers,
>>Geert
>>
>>> -----Original Message-----
>>> From: [email protected] [mailto:general-
>>> [email protected]] On Behalf Of Will Thompson
>>> Sent: Tuesday, May 7, 2013 19:18
>>> To: MarkLogic Developer Discussion
>>> Subject: Re: [MarkLogic Dev General] Maximum size for a map?
>>>
>>> Damon, that's a good idea. Only problem is that the value would
>>> actually be multiple values in some XML, so maybe it could be stored as
>>> JSON as a means of shoehorning that into a string.
>>>
>>> Ultimately I think serializing a giant map to the database won't be
>>> workable due to limitations of the in-memory list and expanded tree
>>> cache, especially if we want to grow the map. We can test 1) breaking
>>> up the map, and 2) performance of doing it with a range index instead.
>>>
>>> -Will
>>>
>>>
>>> On 5/6/13 7:01 PM, "Damon Feldman" <[email protected]>
>>> wrote:
>>>
>>> >Will,
>>> >
>>> >You may be able to use range indexes either by using
>>> >cts:element-values with an element-value-query to "key" the lookup and
>>> >have the value in the index, or by range-indexing a value that has the
>>> >key and value separated by a token. This may not be quite as fast as a
>>> >map lookup but can simplify your code.
>>> >
>>> >If you describe the nature of the lookup we can brainstorm other
>>> >ideas.
>>> >
>>> >Yours,
>>> >Damon
>>> >
>>> >--
>>> >Damon Feldman
>>> >Sr. Principal Consultant, MarkLogic
>>> >
>>> >
>>> >-----Original Message-----
>>> >From: [email protected]
>>> >[mailto:[email protected]] On Behalf Of Will
>>> >Thompson
>>> >Sent: Monday, May 06, 2013 9:01 PM
>>> >To: MarkLogic Developer Discussion
>>> >Subject: Re: [MarkLogic Dev General] Maximum size for a map?
>>> >
>>> >The map won't need to be updated frequently, so the idea is to
>>> >serialize it to the database and filesystem for portability. Then on
>>> >first use, it gets loaded into a server field. My tests are showing
>>> >you're pretty spot on for the deserializing time. But after that it's
>>> >loaded in the field and always available. My worry is about that
>>> >initial doc() call on boxes that may have a smaller expanded-tree
>>> >cache. In this case, is my only option to ensure each box has
>>> >sufficient values to hold the 400MB deserialized map or face
>>> >XDMP-EXPNTREECACHEFULL? I could try/catch, and throw a friendlier error
>>> >for the small systems.
>>> >
>>> >I chose map for speed, but if that's too much trouble then I suppose
>>> >the key/value pairs could also be stored in a non-map document with a
>>> >range index on the keys and fragment root set to its children. Then
>>> >there would be no need for doc(), although I'm not sure how much speed
>>> >that would give up.
>>> >
>>> >-Will
>>> >
>>> >
>>> >On 5/6/13 4:48 PM, "Michael Blakeley" <[email protected]> wrote:
>>> >
>>> >>Yes, any doc() call will use space in the expanded-tree cache. So you
>>> >>might end up with X in the cache, plus Y for the deserialized map.
>>> >>
>>> >>I would also worry about how long it might take to deserialize a
>>> >>400-MB map, even if the XML is already in cache. My guess is around
>>> >>30-sec to construct the map. If the cache is cold that might double
>>> >>because the fragment has to be read from disk and decoded. But those
>>> >>are just guesses.
>>> >>
>>> >>There are a couple of approaches that might avoid that cost. One is to
>>> >>break up the map into multiple small documents. You could query a
>>> >>special directory or collection for documents that have the key(s) you
>>> >>need, and let the expanded-tree cache handle the memory management.
>>> >>Each map would be relatively small, so deserialization wouldn't be as
>>> >>expensive.
>>> >>
>>> >>Another approach is to keep the map in a server field. That would be
>>> >>both powerful and dangerous, because the memory for a server field is
>>> >>persistent. We are used to working with query allocations, which
>>> >>disappear when the query ends. So a single query is limited in its
>>> >>scope for damage. But a 400-MB server field allocates 400-MB per eval
>>> >>host, for the lifetime of the host process.
>>> >>
>>> >>So you'd want to be very careful to ensure that each host has exactly
>>> >>one of these huge server fields. You'd also have to be very careful
>>> >>about updating the map, partly because of the size and also because
>>> >>server fields do not offer much in the way of memory protection.
>>> >>Depending on your needs you might be able to do some sort of A-B
>>> >>switching when you need to update the map, or develop a locking
>>> >>strategy, or both.
>>> >>
>>> >>-- Mike
>>> >>
>>> >>On 6 May 2013, at 16:29 , Will Thompson
>>> <[email protected]>
>>> >>wrote:
>>> >>
>>> >>> Mike - I should have been a little more specific about the use
>>> >>> case. What if that map is serialized to the db; would calling doc()
>>> >>> on that potentially overload the expanded tree cache?
>>> >>>
>>> >>> let $m := map:map(doc('/path/to/map.xml')/map:map)
>>> >>> return xdmp:set-server-field('my-map', $m)
>>> >>>
>>> >>> Best guess on the QA server is that ML was installed when its VM
>>> >>> was allocated fewer resources. But that's a good point about
>>> >>> catching bad queries.
>>> >>>
>>> >>> -Will
>>> >>>
>>> >>>
>>> >>> On 5/6/13 4:05 PM, "Michael Blakeley" <[email protected]> wrote:
>>> >>>
>>> >>>> No, maps don't use expanded tree cache space. A really large map
>>> >>>> might hit some per-eval limits, but I didn't find them when I
>>> >>>> created a map of around 800-MiB on my laptop, with 6.0-3. I used an
>>> >>>> xdmp:quote to try to make sure the map would really allocate more
>>> >>>> space for each entry. This was fine at 80-MiB and took about 5-sec.
>>> >>>> For 800-MiB it took a little longer, and the OS swapped some pages
>>> >>>> out. So I conclude that it was working hard to allocate all the
>>> >>>> memory.
>>> >>>>
>>> >>>> let $m := map:map()
>>> >>>> let $n := doc()[1]
>>> >>>> let $_ := (1 to 1000000) ! (
>>> >>>>   map:put($m, xdmp:integer-to-hex(xdmp:random()), xdmp:quote($n)))
>>> >>>> return (
>>> >>>>   map:count($m) * string-length(xdmp:quote($n)) div (1024 * 1024),
>>> >>>>   xdmp:elapsed-time())
>>> >>>> =>
>>> >>>> 802.04010009765625
>>> >>>> PT1M6.429219S
>>> >>>>
>>> >>>> On that QA system, you might have set the expanded tree cache size
>>> >>>> to a smaller value on purpose. That can be a good way to catch
>>> >>>> poorly-optimized queries.
>>> >>>>
>>> >>>> -- Mike
>>> >>>>
>>> >>>> On 6 May 2013, at 14:44 , Will Thompson
>>> <[email protected]>
>>> >>>> wrote:
>>> >>>>
>>> >>>>> Here's another one related to the Expanded Tree Cache: Say I want
>>> >>>>> to load a giant map: 400MB or more. Will this always be dependent
>>> >>>>> on the size of the Expanded Tree Cache? Most of our dev machines
>>> >>>>> have an Expanded Tree Cache big enough to handle a map like this,
>>> >>>>> but some don't, and for some reason our QA server is set to an
>>> >>>>> inexplicably small value. Is it advisable to just manually
>>> >>>>> increase that value so everything fits? Are there any other
>>> >>>>> general rules when adjusting server spec values? I have mostly
>>> >>>>> heard "look don't touch" with regard to these settings.
>>> >>>>>
>>> >>>>> -Will
>>> >>>>>
>>> >>>>> _______________________________________________
>>> >>>>> General mailing list
>>> >>>>> [email protected]
>>> >>>>> http://developer.marklogic.com/mailman/listinfo/general
>>> >>>>>

