Good suggestion about separate documents. In fact, these particular documents are just lists of identical smaller things, just as you surmised.
You say: "Now, there is a concept in MarkLogic called fragmentation which allows you to store very large documents, and to perform minimal disk IO when retrieving or updating the individual fragments. This is a very useful feature. However, for search applications, the best practice is to load the individual nodes as documents. If there is metadata that applies to all your individual nodes, then we can talk about how you might deal with that."

Is this really fundamentally true? I hear conflicting statements. How have you determined this "best practice"? I've been using fragment parents so that this "big document" is fragmented into individual fragments, without having to create separate documents. I have no need to apply metadata to these mini-docs at all. Given the same data set, is it really true that splitting it into documents, instead of fragments, improves performance? The performance I'm getting is phenomenal, and I have read, implied in many places in the MarkLogic documentation, that fragmenting documents is a great way of doing things. Aside from metadata associated with each mini-doc, do I really gain an advantage by splitting the big doc into littler docs? That seems contrary to what I'm reading in the MarkLogic documentation.

One of the huge advantages I see in simply storing this mega-document in MarkLogic (as opposed to my old, file-based way of thinking about XML) is that it seems to work perfectly as-is, and it seems to me an unnecessary complexity to split it up unless there are hard gains to it. I can certainly do some tests, but I'd love someone who knows the authoritative answer, or even has hard anecdotal evidence, to comment.

-David

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Kelly Stirman
Sent: Sunday, November 22, 2009 11:47 AM
To: [email protected]
Subject: [MarkLogic Dev General] RE: RE: Creating Collections

David,

I'm glad I asked why you're searching two documents.
:)

If you have 20,000 top-level elements in a large document, and you also wish to return those elements as results, then to me it seems like you probably want to load them as individual documents. (To me your "document" sounds more like an archive than a document.) There are several reasons for this:

1) The performance of queries will be better.
2) Relevance is determined at the document level, so all the matching nodes in your large document will have the same score.
3) You may be able to use the Search API to do most of what you're trying to do.

If you follow this approach, you could group all these documents in a common collection or directory. Or perhaps they all have a common root node. Keep in mind that if you do:

    cts:search(/my-root-node, $query)

you will only search documents with that root node. There is no need to specify a specific collection or document name. You can also do:

    cts:search(collection("my-collection")/my-root-node, $query)

or:

    cts:search(xdmp:directory("/my-directory/", "infinity")/my-root-node, $query)

In both of these approaches, the collection and directory constraints will be combined with the root-node constraint. As Geert says, you can use cts:collection-query() and cts:directory-query() instead, if you want to group all of your constraints in a cts:query, which I think is generally a best practice. However, it is a little easier to explain the XPath part of the constraints with the syntax above. :)

To load these individual nodes as documents, you should take a look at RecordLoader - that's probably the easiest and most efficient way.

Now, there is a concept in MarkLogic called fragmentation which allows you to store very large documents, and to perform minimal disk IO when retrieving or updating the individual fragments. This is a very useful feature. However, for search applications, the best practice is to load the individual nodes as documents.
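[Editor's note: the grouped-constraint style Kelly mentions, with everything inside one cts:query via cts:and-query, might look roughly like this. This is a sketch, not code from the thread - the collection name, directory, and search term are illustrative, and it assumes a MarkLogic environment.]

```xquery
(: Scope constraints and the search term combined into a single cts:query.
   "my-collection" and "/my-directory/" are illustrative names. :)
cts:search(
  /my-root-node,
  cts:and-query((
    cts:collection-query("my-collection"),
    cts:directory-query("/my-directory/", "infinity"),
    cts:word-query("Codeine")
  ))
)
```

One reason to prefer this form is that the whole set of constraints lives in one composable cts:query value, which can be passed around, combined with other queries, or handed to the Search API, rather than being split between XPath and the query argument.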
If there is metadata that applies to all your individual nodes, then we can talk about how you might deal with that.

I think one lesson here is that it is a good idea to step back and talk about the bigger picture - what are you trying to do in your application? This will help us recommend the best approach. Right now it still seems mysterious. :)

Kelly

Thanks for the suggestion. Collections are probably what I want to use, but I'm just experimenting to try to learn.

In my case I have 2 documents (both large, about 500MB each, with about 20,000 top-level elements each) where I want to do a single search and produce a single set of results ordered by score. This will lead to a page which shows "Top 10 results," where clicking a link goes to the detail for that record. These documents are from the same category of data, but are totally different schemas. They happen right now to reside in a directory with 3 other documents I do NOT want to search. So my choices are:

1) Move these 2 into their own private directory.
2) Create a collection with these 2 docs (learning how to do that now, thanks to the group!).
3) Search the 2 docs separately and try to combine the results (very ugly - all sorts of extra work needed to combine the results).
4) Search the 2 documents in ONE search.

Thanks to the suggestion of passing multiple document URIs to fn:doc(), I found this search works *exactly how I want*. The trick here is that I want a particular sub-record returned as the result, not the entire document - hence the funky XPath expression after doc() to pick the 2 kinds of sub-record sets I want searched:

    cts:search(
      fn:doc(("/Mesh/desc2010.xml", "/Mesh/supp2010.xml"))/
        (DescriptorRecordSet/DescriptorRecord | SupplementalRecordSet/SupplementalRecord),
      "Codeine")

This works!!!!! And it beats the hell out of my option-3 solution, which is

    cts:search( doc("doc1") ) | cts:search(doc("doc2")) ...
because it seems like now the results are combined, which means remainder() should work on the combined set and I don't have to re-order the results!

I will definitely look into making a collection and/or separating these 2 docs into their own directory. But it's good to know I can do it this way too.

_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general
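[Editor's note: the "Top 10 results" page David describes falls out naturally from the combined search, since cts:search returns one relevance-ordered sequence. A sketch, reusing the MeSH URIs and element names from the thread; $start is a hypothetical request parameter, not something from the original messages.]

```xquery
(: Page through the combined, relevance-ordered result sequence.
   $start is an illustrative offset: 1 for the first page, 11 for the next. :)
let $start := 1
let $results :=
  cts:search(
    fn:doc(("/Mesh/desc2010.xml", "/Mesh/supp2010.xml"))/
      (DescriptorRecordSet/DescriptorRecord | SupplementalRecordSet/SupplementalRecord),
    "Codeine")
return fn:subsequence($results, $start, 10)
```

Because the single cts:search call already orders everything by score, fn:subsequence() yields the top 10 with no manual merging or re-sorting, which is exactly the advantage over the union-of-two-searches approach in option 3.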
