Probably someone from MarkLogic will give you a more thorough answer, but our experience with sub-document fragments has ultimately led us to chunk content into separate documents in most cases, rather than relying on the fragmentation configuration. Here are some reasons:
1) Overlapping fragments lead to trouble. It has often happened that we *thought* we were fragmenting documents into non-overlapping chunks, but in fact the rules we set up led to nested fragments, and this can negatively impact performance. If you explicitly chunk your content into documents, it forces these issues out into the open.

2) Documents have URIs. These are supported by various useful features in the language (like the doc() function) and in MarkLogic's extensions (see, e.g., the URI lexicon). Sub-document fragments have no first-class existence. This also means that documents can be accessed using directories, which makes them accessible to standard browse interfaces like WebDAV. No similar tools exist for fragments. For example, if you have a random node and you want to know what document it's part of, you can call base-uri($node). There's no analogous capability for an arbitrary fragment: you'd have to use special knowledge of your own data structure to do something like that (e.g., $node/ancestor-or-self::my-fragment-element[1]/@id). See the sketch after this list.

3) There is the relevance consideration Kelly raised, which could be critical if you are relying on relevance ranking of search results.

4) I think there may be some locking/transactional isolation behaviors that differ between fragments and documents, but I'm not sure; perhaps they actually behave the same with respect to updates.
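To make point 2 concrete, here is a minimal sketch (the search term and element name are borrowed from David's data below; my-fragment-element and its @id are hypothetical stand-ins for whatever convention your own fragment roots use):

  (: with documents, any hit node can report its containing document's URI :)
  let $hit := cts:search(//DescriptorRecord, "Codeine")[1]
  return fn:base-uri($hit)

  (: with sub-document fragments there is no built-in analogue; you would
     need your own convention, something like:
     $hit/ancestor-or-self::my-fragment-element[1]/@id :)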
-Mike

> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Lee, David
> Sent: Sunday, November 22, 2009 11:56 AM
> To: General Mark Logic Developer Discussion
> Subject: RE: [MarkLogic Dev General] RE: RE: Creating Collections
>
> Good suggestion about separate documents. In fact these particular
> documents are just lists of identical smaller things, just like you
> surmise.
>
> You say:
>
> "Now, there is a concept in MarkLogic called fragmentation which allows
> you to store very large documents, and to perform minimal disk IO when
> retrieving or updating the individual fragments. This is a very useful
> feature. However, for search applications, the best practice is to load
> the individual nodes as documents. If there is metadata that applies to
> all your individual nodes, then we can talk about how you might deal
> with that."
>
> Is this really fundamentally true? I hear conflicting statements.
> How have you determined this "best practice"?
>
> I've been using fragment parents so that this "big document" is
> fragmented into individual fragments, without having to create separate
> "documents". I have no need to apply metadata to these mini-docs at all.
>
> Is it really fundamentally true that, given the same data set, splitting
> it into documents instead of fragments improves performance?
> The performance I'm getting is phenomenal, and I have read, implied in
> many places in the ML documentation, that fragmenting documents is a
> great way of doing things.
>
> Besides the metadata associated with each mini-doc, do I really truly
> gain an advantage by splitting the big doc into littler docs? That
> seems contrary to what I'm reading in the ML documentation. One of the
> huge advantages I see with simply storing this mega-document (in
> MarkLogic, as opposed to my old 'file-based' way of thinking about XML)
> is that it seems to work perfectly as-is, and it seems to me an
> unnecessary complexity to split it up unless there are hard gains to it.
>
> I can certainly do some tests, but I'd love someone who knows the
> authoritative answer, or even hard anecdotal evidence, to comment.
>
> -David
>
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Kelly Stirman
> Sent: Sunday, November 22, 2009 11:47 AM
> To: [email protected]
> Subject: [MarkLogic Dev General] RE: RE: Creating Collections
>
> David,
>
> I'm glad I asked why you're searching two documents. :)
>
> If you have 20,000 top-level elements in a large document, and you also
> wish to return those elements as results, then to me it seems like you
> probably want to load them as individual documents. (To me your
> "document" sounds more like an archive than a document.) There are
> several reasons for this:
>
> 1) the performance of queries will be better
> 2) relevance is determined at the document level, so all the matching
> nodes in your large document will have the same score
> 3) you may be able to use the Search API to do most of what you're
> trying to do
>
> If you follow this approach, you could group all these documents in a
> common collection or directory. Or perhaps they all have a common root
> node. Keep in mind that if you do:
>
> cts:search(/my-root-node, $query)
>
> you will only search documents with that root node. No need to specify
> a specific collection or document name. You can also do:
>
> cts:search(collection("my-collection")/my-root-node, $query)
>
> or
>
> cts:search(xdmp:directory("/my-directory/", "infinity")/my-root-node, $query)
>
> In both of these approaches, the collection and directory constraints
> will be combined with the root node constraint.
>
> As Geert says, you can use cts:collection-query() and
> cts:directory-query() instead, if you want to group all of your
> constraints in a cts:query, which I think is generally a best practice
> (see the sketch after this message). However, it is a little easier to
> explain the XPath part of the constraints with the syntax above. :)
>
> To load these individual nodes as documents, you should take a look at
> RecordLoader - that's probably the easiest and most efficient way.
>
> Now, there is a concept in MarkLogic called fragmentation which allows
> you to store very large documents, and to perform minimal disk IO when
> retrieving or updating the individual fragments. This is a very useful
> feature. However, for search applications, the best practice is to load
> the individual nodes as documents. If there is metadata that applies to
> all your individual nodes, then we can talk about how you might deal
> with that.
>
> I think one lesson here is that it is a good idea to step back and talk
> about the bigger picture - what are you trying to do in your
> application? This will help us to recommend the best approach. Right
> now it still seems mysterious. :)
>
> Kelly
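A sketch of the grouped form Kelly describes, with every constraint expressed inside a single cts:query (the names are the same illustrative ones from the examples above, and a word search for "Codeine" stands in for the real query):

  cts:search(
    /my-root-node,
    cts:and-query((
      cts:collection-query("my-collection"),
      cts:word-query("Codeine")
    ))
  )

  (: the directory variant is analogous :)
  cts:search(
    /my-root-node,
    cts:and-query((
      cts:directory-query("/my-directory/", "infinity"),
      cts:word-query("Codeine")
    ))
  )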
> Thanks for the suggestion.
> Collections are probably what I want to use, but I'm just experimenting
> to try to learn.
>
> In my case I have 2 documents (both large, about 500MB each, with about
> 20,000 top-level elements each) where I want to do a single search and
> produce a single set of results ordered by score. This will lead to a
> page which shows "Top 10 results", where if you click on the link it
> will go to the detail for that record. These documents are from the
> same category of data, but are totally different schemas. They happen
> right now to reside in a directory with 3 other documents I do NOT want
> to search.
>
> So my choices are:
>
> 1) Move these 2 into their own private directory
> 2) Create a collection with these 2 docs (learning how to do that now,
> thanks to the group!)
> 3) Search the 2 docs separately and try to combine the results (very
> ugly, all sorts of extra work needed to combine the results)
> 4) Search the 2 documents in ONE search. Thanks to the suggestion of
> passing multiple documents to fn:doc(), I found this search works
> *exactly how I want*.
>
> The trick here is that I want a particular sub-record returned as the
> result, not the entire document, hence the funky XPath expression after
> doc() to pick the 2 kinds of sub-record sets I want searched:
>
> cts:search(
>   fn:doc(("/Mesh/desc2010.xml","/Mesh/supp2010.xml"))/
>     (DescriptorRecordSet/DescriptorRecord|SupplementalRecordSet/SupplementalRecord),
>   "Codeine")
>
> This works!!!!!
>
> And it beats the hell out of my #3 solution, which is
>
> cts:search(doc("doc1") ...) | cts:search(doc("doc2") ...)
>
> because it seems like now the results are combined, which means
> remainder() should work on the combined set and I don't have to
> re-order the results!
>
> I will definitely look into making a collection (sketched below) and/or
> separating these 2 docs into their own directory. But good to know I
> can do it this way too.
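For the collection route, a minimal sketch of what that might look like (the collection name "mesh-2010" is invented; the xdmp:document-add-collections calls must run as an update, separately from the search):

  xdmp:document-add-collections("/Mesh/desc2010.xml", "mesh-2010"),
  xdmp:document-add-collections("/Mesh/supp2010.xml", "mesh-2010")

  (: then, in a later query, the combined search becomes :)
  cts:search(
    fn:collection("mesh-2010")/
      (DescriptorRecordSet/DescriptorRecord|SupplementalRecordSet/SupplementalRecord),
    "Codeine")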
