Damon, that's a good idea. Only problem is that the value would actually be multiple values in some XML, so maybe it could be stored as JSON as a means of shoehorning that into a string.
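A rough sketch of the "JSON in a string" idea, so several values can sit behind one key. It assumes MarkLogic 8 or later, where `xdmp:to-json-string` and `xdmp:from-json-string` are available; on the 6.0-era servers discussed in this thread the JSON built-ins have different signatures, so treat the exact calls as an assumption to check against your version. The sample values are made up.

```xquery
xquery version "1.0-ml";

(: Hypothetical sample data: three values that belong to a single key. :)
let $values := ("alpha", "beta", "gamma")

(: Pack the sequence into a JSON array and serialize it to one string,
   which could then be stored as a single map or index value. :)
let $encoded := xdmp:to-json-string(json:to-array($values))

(: Reverse the process: parse the string back into a json:array and
   unpack it into an XQuery sequence. :)
let $decoded := json:array-values(xdmp:from-json-string($encoded))

return ($encoded, $decoded)
```

The cost is an extra parse on every lookup, so this only makes sense if most keys carry a small number of values.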
Ultimately I think serializing a giant map to the database won't be workable due to limitations of the in-memory list and expanded tree cache, especially if we want to grow the map. We can test 1) breaking up the map, and 2) performance of doing it with a range index instead.

-Will

On 5/6/13 7:01 PM, "Damon Feldman" <[email protected]> wrote:

>Will,
>
>You may be able to use range indexes either by using cts:element-values
>with an element-value-query to "key" the lookup and have the value in the
>index, or by range-indexing a value that has the key and value separated
>by a token. This may not be quite as fast as a map lookup but can simplify
>your code.
>
>If you describe the nature of the lookup we can brainstorm other ideas.
>
>Yours,
>Damon
>
>--
>Damon Feldman
>Sr. Principal Consultant, MarkLogic
>
>
>-----Original Message-----
>From: [email protected]
>[mailto:[email protected]] On Behalf Of Will
>Thompson
>Sent: Monday, May 06, 2013 9:01 PM
>To: MarkLogic Developer Discussion
>Subject: Re: [MarkLogic Dev General] Maximum size for a map?
>
>The map won't need to be updated frequently, so the idea is to serialize
>it to the database and filesystem for portability. Then on first use, it
>gets loaded into a server field. My tests are showing you're pretty spot
>on for the deserializing time. But after that it's loaded in the field
>and always available. My worry is about that initial doc() call on boxes
>that may have a smaller expanded-tree cache. In this case, is my only
>option to ensure each box has sufficient values to hold the 400MB
>deserialized map or face XDMP-EXPNTREECACHEFULL? I could try/catch, and
>throw a friendlier error for the small systems.
>
>I chose map for speed, but if that's too much trouble then I suppose
>the key/value pairs could also be stored in a non-map document with a
>range index on the keys and fragment root set to its children. Then there
>would be no need for doc(), although I'm not sure how much speed that
>would give up.
>
>-Will
>
>
>On 5/6/13 4:48 PM, "Michael Blakeley" <[email protected]> wrote:
>
>>Yes, any doc() call will use space in the expanded-tree cache. So you
>>might end up with X in the cache, plus Y for the deserialized map.
>>
>>I would also worry about how long it might take to deserialize a 400-MB
>>map, even if the XML is already in cache. My guess is around 30-sec to
>>construct the map. If the cache is cold that might double because the
>>fragment has to be read from disk and decoded. But those are just
>>guesses.
>>
>>There are a couple of approaches that might avoid that cost. One is to
>>break up the map into multiple small documents. You could query a
>>special directory or collection for documents that have the key(s) you
>>need, and let the expanded-tree cache handle the memory management.
>>Each map would be relatively small, so deserialization wouldn't be as
>>expensive.
>>
>>Another approach is to keep the map in a server field. That would be
>>both powerful and dangerous, because the memory for a server field is
>>persistent. We are used to working with query allocations, which
>>disappear when the query ends. So a single query is limited in its
>>scope for damage. But a 400-MB server field allocates 400-MB per eval
>>host, for the lifetime of the host process.
>>
>>So you'd want to be very careful to ensure that each host has exactly
>>one of these huge server fields. You'd also have to be very careful
>>about updating the map, partly because of the size and also because
>>server fields do not offer much in the way of memory protection.
>>Depending on your needs you might be able to do some sort of A-B
>>switching when you need to update the map, or develop a locking
>>strategy, or both.
>>
>>-- Mike
>>
>>On 6 May 2013, at 16:29 , Will Thompson <[email protected]>
>>wrote:
>>
>>> Mike - I should have been a little more specific about the use case.
>>> What if that map is serialized to the db; would calling doc() on that
>>> potentially overload the expanded tree cache?
>>>
>>> let $m := map:map(doc('/path/to/map.xml')/map:map)
>>> return xdmp:set-server-field('my-map', $m)
>>>
>>> Best guess on the QA server is that ML was installed when its VM was
>>> allocated fewer resources. But that's a good point about catching bad
>>> queries.
>>>
>>> -Will
>>>
>>>
>>> On 5/6/13 4:05 PM, "Michael Blakeley" <[email protected]> wrote:
>>>
>>>> No, maps don't use expanded tree cache space. A really large map
>>>>might hit some per-eval limits, but I didn't find them when I
>>>>created a map around 800-MiB on my laptop, with 6.0-3. I used an
>>>>xdmp:quote to try to make sure the map would really allocate more
>>>>space for each entry.
>>>> This was fine at 80-MiB and took about 5-sec. For 800-MiB it took a
>>>>little longer, and the OS swapped some pages out. So I conclude that
>>>>it was working hard to allocate all the memory.
>>>>
>>>> let $m := map:map()
>>>> let $n := doc()[1]
>>>> let $_ := (1 to 1000000) ! (
>>>>   map:put($m, xdmp:integer-to-hex(xdmp:random()), xdmp:quote($n)))
>>>> return map:count($m) * string-length(xdmp:quote($n)) div (1024 *
>>>> 1024) , xdmp:elapsed-time()
>>>> =>
>>>> 802.04010009765625
>>>> PT1M6.429219S
>>>>
>>>> On that QA system, you might have set the expanded tree cache size
>>>> to a smaller value on purpose. That can be a good way to catch
>>>> poorly-optimized queries.
>>>>
>>>> -- Mike
>>>>
>>>> On 6 May 2013, at 14:44 , Will Thompson <[email protected]>
>>>> wrote:
>>>>
>>>>> Here's another one related to the Expanded Tree Cache: Say I want
>>>>>to load a giant map: 400MB or more. Will this always be dependent
>>>>>on the size of the Expanded Tree Cache? Most of our dev machines
>>>>>have an Expanded Tree Cache big enough to handle a map like this,
>>>>>but some don't, and for some reason our QA server is set to an
>>>>>inexplicably small value.
>>>>>Is it advisable to just manually increase
>>>>>that value so everything fits? Are there any other general rules
>>>>>when adjusting server spec values? I have mostly heard "look don't
>>>>>touch" with regard to these settings.
>>>>>
>>>>> -Will
>>>>>
>>>>> _______________________________________________
>>>>> General mailing list
>>>>> [email protected]
>>>>> http://developer.marklogic.com/mailman/listinfo/general
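
For readers following the thread, here is a minimal sketch of the "load on first use into a server field, with a friendlier error" idea discussed above. It is not code from any of the posters. The field name and document URI reuse the placeholders from Will's snippet, the `local:` function name and the error name are hypothetical, and it assumes that XDMP-EXPNTREECACHEFULL can be caught with try/catch, as suggested in the thread.

```xquery
xquery version "1.0-ml";

(: Placeholders taken from the thread; substitute your own values. :)
declare variable $FIELD := "my-map";
declare variable $URI := "/path/to/map.xml";

(: Hypothetical helper: return the map from the server field, loading it
   from the database only when the field is still empty on this host. :)
declare function local:get-map() as map:map
{
  let $cached := xdmp:get-server-field($FIELD)
  return
    if (fn:exists($cached)) then $cached
    else
      try {
        (: doc() pulls the serialized map through the expanded tree cache,
           which is the step that may fail on small systems. :)
        let $m := map:map(fn:doc($URI)/map:map)
        return xdmp:set-server-field($FIELD, $m)
      } catch ($e) {
        if ($e/error:code eq "XDMP-EXPNTREECACHEFULL") then
          fn:error(xs:QName("local:MAP-TOO-LARGE"),
            "Expanded tree cache is too small to load " || $URI)
        else
          (: Any other error is passed along unchanged. :)
          xdmp:rethrow()
      }
};

(: Usage: look up a hypothetical key. :)
map:get(local:get-map(), "some-key")
```

This sketch does nothing about two requests loading the map at the same time on one host, and it never refreshes the field, so Mike's cautions about having exactly one copy per host and about A-B switching or locking on update still apply.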
