Okay, so the consensus is the map-based solution is inhumane :) ...and less manageable. Back to testing.
-Will

On 5/7/13 3:17 PM, "David Lee" <[email protected]> wrote:

>IMHO eventually putting too much stuff in a map will kill the horse.
>When? Only the horse will know :)
>I encourage you to consider the alternatives such as range indexes and
>lexicons ...
>the system will manage the memory for you and it may not be *quite* as
>fast as a map but then fast isn't always better if your horse dies.
>
>
>-----------------------------------------------------------------------------
>David Lee
>Lead Engineer
>MarkLogic Corporation
>[email protected]
>Phone: +1 812-482-5224
>Cell: +1 812-630-7622
>www.marklogic.com
>
>
>-----Original Message-----
>From: [email protected]
>[mailto:[email protected]] On Behalf Of Will Thompson
>Sent: Tuesday, May 07, 2013 6:15 PM
>To: MarkLogic Developer Discussion
>Subject: Re: [MarkLogic Dev General] Maximum size for a map?
>
>Not to beat a dead horse, but I thought I had come up with a way to get
>around the expanded-tree cache error when populating a large map from disk
>by using spawned batches: create a map, pass that as an external variable
>to multiple spawn tasks (using the "result" option to force it to wait
>until they're done) that will each read from the database and add to the
>map before finally inserting. My assumption was that the expanded-tree
>cache limits were per transaction and that by batching the map loading
>into spawned tasks, it wouldn't fill up as long as each batch never
>exceeded the limit. However, that doesn't seem to be the case, and this
>still gives an expanded-tree cache error. Did I misunderstand something?
>
>-Will
>
>On 5/7/13 11:41 AM, "Geert Josten" <[email protected]> wrote:
>
>>Hi Will,
>>
>>Yes, if you revert to lexicons and range indexes, you can only use atomic
>>values, but if you are actually wrapping a fixed number of atomic values
>>in XML, then you can easily look them up one by one (using separate
>>indexes).
>>The benefit is that using plain docs to look values up fits more
>>naturally into MarkLogic, saving you from fuss around initialization
>>and updating.
>>
>>Cheers,
>>Geert
>>
>>> -----Original Message-----
>>> From: [email protected]
>>> [mailto:general-[email protected]] On Behalf Of Will Thompson
>>> Sent: Tuesday, May 7, 2013 19:18
>>> To: MarkLogic Developer Discussion
>>> Subject: Re: [MarkLogic Dev General] Maximum size for a map?
>>>
>>> Damon, that's a good idea. Only problem is that the value would
>>> actually be multiple values in some XML, so maybe it could be stored
>>> as JSON as a means of shoehorning that into a string.
>>>
>>> Ultimately I think serializing a giant map to the database won't be
>>> workable due to limitations of the in-memory list and expanded tree
>>> cache, especially if we want to grow the map. We can test 1) breaking
>>> up the map, and 2) performance of doing it with a range index instead.
>>>
>>> -Will
>>>
>>>
>>> On 5/6/13 7:01 PM, "Damon Feldman" <[email protected]>
>>> wrote:
>>>
>>> >Will,
>>> >
>>> >You may be able to use range indexes either by using
>>> >cts:element-values with an element-value-query to "key" the lookup
>>> >and have the value in the index, or by range-indexing a value that
>>> >has the key and value separated by a token. This may not be quite as
>>> >fast as a map lookup but can simplify your code.
>>> >
>>> >If you describe the nature of the lookup we can brainstorm other
>>> >ideas.
>>> >
>>> >Yours,
>>> >Damon
>>> >
>>> >--
>>> >Damon Feldman
>>> >Sr. Principal Consultant, MarkLogic
>>> >
>>> >
>>> >-----Original Message-----
>>> >From: [email protected]
>>> >[mailto:[email protected]] On Behalf Of Will Thompson
>>> >Sent: Monday, May 06, 2013 9:01 PM
>>> >To: MarkLogic Developer Discussion
>>> >Subject: Re: [MarkLogic Dev General] Maximum size for a map?
>>> >
>>> >The map won't need to be updated frequently, so the idea is to
>>> >serialize it to the database and filesystem for portability. Then on
>>> >first use, it gets loaded into a server field. My tests are showing
>>> >you're pretty spot on for the deserializing time. But after that it's
>>> >loaded in the field and always available. My worry is about that
>>> >initial doc() call on boxes that may have a smaller expanded-tree
>>> >cache. In this case, is my only option to ensure each box has
>>> >sufficient values to hold the 400MB deserialized map or face
>>> >XDMP-EXPNTREECACHEFULL? I could try/catch, and throw a friendlier
>>> >error for the small systems.
>>> >
>>> >I chose a map for speed, but if that's too much trouble then I
>>> >suppose the key/value pairs could also be stored in a non-map
>>> >document with a range index on the keys and fragment root set to its
>>> >children. Then there would be no need for doc(), although I'm not
>>> >sure how much speed that would give up.
>>> >
>>> >-Will
>>> >
>>> >
>>> >On 5/6/13 4:48 PM, "Michael Blakeley" <[email protected]> wrote:
>>> >
>>> >>Yes, any doc() call will use space in the expanded-tree cache. So you
>>> >>might end up with X in the cache, plus Y for the deserialized map.
>>> >>
>>> >>I would also worry about how long it might take to deserialize a
>>> >>400-MB map, even if the XML is already in cache. My guess is around
>>> >>30-sec to construct the map. If the cache is cold that might double
>>> >>because the fragment has to be read from disk and decoded. But those
>>> >>are just guesses.
>>> >>
>>> >>There are a couple of approaches that might avoid that cost. One is
>>> >>to break up the map into multiple small documents. You could query a
>>> >>special directory or collection for documents that have the key(s)
>>> >>you need, and let the expanded-tree cache handle the memory
>>> >>management.
>>> >>Each map would be relatively small, so deserialization wouldn't be as
>>> >>expensive.
>>> >>
>>> >>Another approach is to keep the map in a server field. That would be
>>> >>both powerful and dangerous, because the memory for a server field is
>>> >>persistent. We are used to working with query allocations, which
>>> >>disappear when the query ends. So a single query is limited in its
>>> >>scope for damage. But a 400-MB server field allocates 400-MB per eval
>>> >>host, for the lifetime of the host process.
>>> >>
>>> >>So you'd want to be very careful to ensure that each host has exactly
>>> >>one of these huge server fields. You'd also have to be very careful
>>> >>about updating the map, partly because of the size and also because
>>> >>server fields do not offer much in the way of memory protection.
>>> >>Depending on your needs you might be able to do some sort of A-B
>>> >>switching when you need to update the map, or develop a locking
>>> >>strategy, or both.
>>> >>
>>> >>-- Mike
>>> >>
>>> >>On 6 May 2013, at 16:29, Will Thompson <[email protected]>
>>> >>wrote:
>>> >>
>>> >>> Mike - I should have been a little more specific about the use
>>> >>> case. What if that map is serialized to the db; would calling doc()
>>> >>> on that potentially overload the expanded tree cache?
>>> >>>
>>> >>> let $m := map:map(doc('/path/to/map.xml')/map:map)
>>> >>> return xdmp:set-server-field('my-map', $m)
>>> >>>
>>> >>> Best guess on the QA server is that ML was installed when its VM
>>> >>> was allocated fewer resources. But that's a good point about
>>> >>> catching bad queries.
>>> >>>
>>> >>> -Will
>>> >>>
>>> >>>
>>> >>> On 5/6/13 4:05 PM, "Michael Blakeley" <[email protected]> wrote:
>>> >>>
>>> >>>> No, maps don't use expanded tree cache space. A really large map
>>> >>>> might hit some per-eval limits, but I didn't find them when I
>>> >>>> created a map around 800-MiB on my laptop, with 6.0-3.
>>> >>>> I used an xdmp:quote to try to make sure the map would really
>>> >>>> allocate more space for each entry.
>>> >>>> This was fine at 80-MiB and took about 5-sec. For 800-MiB it took a
>>> >>>> little longer, and the OS swapped some pages out. So I conclude
>>> >>>> that it was working hard to allocate all the memory.
>>> >>>>
>>> >>>> let $m := map:map()
>>> >>>> let $n := doc()[1]
>>> >>>> let $_ := (1 to 1000000) ! (
>>> >>>>   map:put($m, xdmp:integer-to-hex(xdmp:random()), xdmp:quote($n)))
>>> >>>> return map:count($m) * string-length(xdmp:quote($n)) div (1024 * 1024),
>>> >>>>   xdmp:elapsed-time()
>>> >>>> =>
>>> >>>> 802.04010009765625
>>> >>>> PT1M6.429219S
>>> >>>>
>>> >>>> On that QA system, you might have set the expanded tree cache size
>>> >>>> to a smaller value on purpose. That can be a good way to catch
>>> >>>> poorly-optimized queries.
>>> >>>>
>>> >>>> -- Mike
>>> >>>>
>>> >>>> On 6 May 2013, at 14:44, Will Thompson <[email protected]>
>>> >>>> wrote:
>>> >>>>
>>> >>>>> Here's another one related to the Expanded Tree Cache: Say I want
>>> >>>>> to load a giant map: 400MB or more. Will this always be dependent
>>> >>>>> on the size of the Expanded Tree Cache? Most of our dev machines
>>> >>>>> have an Expanded Tree Cache big enough to handle a map like this,
>>> >>>>> but some don't, and for some reason our QA server is set to an
>>> >>>>> inexplicably small value. Is it advisable to just manually
>>> >>>>> increase that value so everything fits? Are there any other
>>> >>>>> general rules when adjusting server spec values? I have mostly
>>> >>>>> heard "look don't touch" with regard to these settings.
>>> >>>>>
>>> >>>>> -Will
>>> >>>>>
>>> >>>>> _______________________________________________
>>> >>>>> General mailing list
>>> >>>>> [email protected]
>>> >>>>> http://developer.marklogic.com/mailman/listinfo/general
